Another service from Omega

*****
Other available formats: [.PDF] [.TEX]

Support Vector Machines

Carlos C. Rodríguez
http://omega.albany.edu:8008/

Finding the Hyperplane with Largest Margin

Let us assume that we have n labeled examples (x1,y1),,(xn,yn) with labels yi {1,-1}. We want to find the hyperplane < w,x > +b = 0 (i.e. with parameters (w,b) ) satisfying the followin three conditions:
  1. The scale of (w,b) is fixed so that the plane is in canonical position w.r.t. {x1,,xn}. i.e.,

    min
    i n 
    | < w,xi > + b | = 1
  2. The plane with parameters (w,b) separates the +1's from the the -1's. i.e.,
    yi ( < w,xi > + b ) 0  for all i n
  3. The plane has maximum margin r = 1/|w|. i.e., minimum |w|2.
Of course there may not be a separating plane for the observed data. Let us assume, for the time being, that the data is in fact linearly separable and we'll take care of the general (more realistic) case later.
Clearly 1 and 2 combine into just one condition:
yi( < w,xi > + b ) 1  for all i n.
Thus, we want to solve the following optimization problem,

minimize  1

2
|w|2
over all w Rd and b R subject to,

yi( < w,xi > + b ) -1 0  for all i n.
This is a very simple quadratic programming problem. There are readily available algorithms of complexity O(n3) that can be used for solving this problem. For example the so called interior point algorithms that are variations of the Karmarkar algorithm for linear programming will do. But, when n and d are large (tens of thousands) even the best QP methods will fail. A very desirable characteristics of SVMs is that most of the data ends up being irrelevant. The relevant data are only the points that end up exactly on the margin of the optimal classifier and these are often a very small fraction of n.

KKT-Theory

The problem that we need to solve is a special case of the general problem of minimizing a convex function f(x) subject to n inequality constraints gj(x) 0 for j=1,2,,n where the functions gj are also convex. Let's call this problem (CO). Notice that in our case x=(w,b) Rd+1 and the constraints are linear in the unknowns x. Don't get confused with our previous xi's.
The characterization of the solution to the convex optimization (CO) problem is given by the so called Karush-Kuhn-Tucker conditions.
Theorem (KKT-Conditions)

-
x
 
 solves the (CO) problem
if and only if there exists
-
l
 
=(
-
l
 

1 
,,
-
l
 

n 
) 0
a vector of non-negative Lagrange multipliers so that
(
-
x
 
,
-
l
 
)  is a saddle point of the Lagrangean,

L(x,l) = f(x) - n

j=1 
lj gj(x).
i.e., for all x and for all l 0 we have,
L(
-
x
 
,l) L(
-
x
 
,
-
l
 
) L(x,
-
l
 
).
Before proving (half of) this theorem notice that there is an easy to understand intuitive reason behind this result. Think of the term added (subtracted actually) to f(x) to form the Lagrangean L, as a penalty for an x that violates the constraints. In fact, if gj(x) < 0, the term -ljgj(x) > 0 can be made arbitrarily large by increasing lj. Thus, the minimizer of L(x,l) over x will have to make gj(x) 0. On the other hand if gj(x) > 0 then it is best to take lj=0 to maximize L(x,l) as a function of l. It is possible to show that, the saddle point condition is equivalent to,


max
x 

min
l 0 
L(x,l) = L(
-
x
 
,
-
l
 
) =
min
l 0 

max
x 
L(x,l).
Proof: Let us show that the saddle point condition is in fact sufficient for solving the (CO) problem. That it is also necessary depends on Farkas's Lemma and it is much more difficult to prove. We need to show that the saddle point condition implies,
  1. for all j n,
    gj(
    -
    x
     
    ) 0
    and,
  2. for all x that satisfies the constraints,
    f(
    -
    x
     
    ) f(x)
To show 1, suppose that the ith constraint is violated. Then by taking
li >
-
l
 

i 
and
lj =
-
l
 

j 
 for all j i
we get,
L(
-
x
 
,l) > L(
-
x
 
,
-
l
 
)
which contradicts the saddle point condition.
To show 2, take l = 0 on the left hand side of the saddle point condition and take x satisfying the constraints on the right. Then,
f(
-
x
 
) = L(
-
x
 
,0) L(x,
-
l
 
) f(x).
Which proves 2.
When all, the objective function f(x), and the constrainning functions gj(x) are differentiable (they are infinitely differentiable in the case of SVMs) the condition for a saddle point is simply that at that point the tangent plane to the surface z = L(x,l) is parallel to the (x,l) plane. The saddle point of L can be obtained by solving the system of equations,
x L(x,l)
=
0, i.e.,  f(x) =

j 
ljgj(x)
and
 
ljgj(x) = 0  for all j n from where,  l L(x,l) = 0.
The second set of equations are known as complementarity conditions and are a consequence of the constraint that l 0.

The Dual

The minmax = maxmin characterization of the saddle point of the Lagrangean L provides an alternative way to find the solution of the (CO) problem. Instead of minimizing f(x) subject to the gj(x) 0 constraints one can equivalently maximize W(l), where
W(l) =
min
x 
L(x,l)
subject to the constraint that l 0. This provides an alternative route to the same saddle point of L.

The Support Vectors of SVMs

Let us apply the KKT-conditions to our original problem of finding the separating hyperplane with maximum margin. The Lagrangean in this case is,

L(w,b,l) = 1

2
d

i=1 
wi2 - n

j=1 
lj{yj( < w,xj > + b ) - 1 },
and the KKT-conditions for optimality are,
wL
=
0, i.e.,  w = n

j=1 
ljyjxj
bL
=
0, i.e.,  n

j=1 
ljyj = 0
lj{yj( < w,xj > + b ) - 1 }
=
0,  for all j n.
These provide a complete characterization of the optimal plane. The normal w must be a linear combination of the observed vectors xj, that's the first set of equations. The coefficients of this linear combination must add up to 0, that's the second equation. Finally the complementarity conditions tell us that the only non-zero Lagrange multipliers lj are those associated to the vectors xj right on the margin, i.e., such that,
yj( < w,xj > + b ) = 1.
These are called support vectors and they are the only ones needed since
w =

j J0 
ljyjxj
where J0 = {j : xj  is a s.v. }. The support vectors are the observations xj at the exact distance r = 1/|w| from the separating plane. The number of such vectors is usually much smaller than n and that makes it possible to consider very large numbers of examples with xj having many coordinates.

The Dual Problem for SVMs

The dual problem for SVMs turns out to be even simpler than the primal and its formulation shows the way to a magnificent non-linear generalization. For a given vector l of Lagrange multipliers, the minimizer of L(w,b,l) w.r.t. (w,b) must satisfy the optimality conditions obtained above, i.e., w is a l.c. of the xj's with coefficients ljyj that must add up to zero. Hence, replacing these conditions into L(w,b,l) we obtain the dual formulation,

maximize  W(l)
where,
W(l) = n

j=1 
lj - 1

2


i,j 
liljyiyj < xi,xj >
and the l 0 satisfying,
n

j=1 
ljyj = 0.
Maximizing W(l) over l 0 s.t. the above simple linear constraint is satisfied, is the preferred form to feed a QP algorithm. Once an optimal l is obtained we find w as the l.c. of the xj as above and we find b by recalling that the plane must be in canonical position so,

min
i n 
yi( < w,xi > + b) = 1 = yj( < w,xj > + b)  for all j J0
and we get,
b = yj - < w,xj > .
Multiplying through by lj and adding over j we find,

b =
-

i,j 
liljyiyj < xi,xj >



j 
lj
and it can be readily checked that this value coincides with the value of the Lagrange multiplier b associated to the constraint iliyi = 0 (just find lL = 0 for the Lagrangean associated to the dual, i.e. L = W(l) - biliyi). The optimal values of b and of l are often returned by the modern QP solvers based on interior point algorithms.



File translated from TEX by TTH, version 3.63.
On 17 Oct 2004, 15:11.

http://omega.albany.edu:8008/