
Support Vector Machines
Carlos C. Rodríguez
http://omega.albany.edu:8008/
Finding the Hyperplane with Largest Margin
Let us assume that we have $n$ labeled examples $(x_1,y_1),\ldots,(x_n,y_n)$ with labels $y_i \in \{-1,+1\}$. We want to find the
hyperplane $\langle w,x \rangle + b = 0$ (i.e. with parameters $(w,b)$) satisfying
the following three conditions:

1. The scale of $(w,b)$ is fixed so that the plane is in canonical
position w.r.t. $\{x_1,\ldots,x_n\}$, i.e.,

$$\min_{i \le n} |\langle w,x_i \rangle + b| = 1.$$

2. The plane with parameters $(w,b)$ separates the $+1$'s from the
$-1$'s, i.e.,

$$y_i(\langle w,x_i \rangle + b) \ge 0 \quad \text{for all } i \le n.$$

3. The plane has maximum margin $\rho = 1/\|w\|$, i.e., minimum
$\|w\|^2$.
Of course there may not be a separating plane for the observed
data. Let us assume, for the time being, that the data is in fact
linearly separable and we'll take care of the general (more realistic)
case later.
Clearly 1 and 2 combine into just one condition:

$$y_i(\langle w,x_i \rangle + b) \ge 1 \quad \text{for all } i \le n.$$

Thus, we want to solve the following optimization problem: minimize

$$\frac{1}{2}\|w\|^2$$

over all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, subject to

$$y_i(\langle w,x_i \rangle + b) - 1 \ge 0 \quad \text{for all } i \le n.$$

This is a very simple quadratic programming problem. There are readily
available algorithms of complexity $O(n^3)$ that can be used to solve it.
For example, the so-called interior point algorithms, which are variations
of the Karmarkar algorithm for linear programming, will do. But when $n$
and $d$ are large (tens of thousands) even the best QP methods will fail.
A very desirable characteristic of SVMs is that most of the data ends up
being irrelevant. The relevant data are only the points that end up exactly
on the margin of the optimal classifier, and these are often a very small
fraction of $n$.
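To make the primal problem concrete, here is a minimal sketch (the toy data, variable names, and choice of solver are illustrative assumptions, not part of these notes) that sets up exactly this QP and hands it to scipy's general purpose SLSQP routine in place of a dedicated interior point QP solver.

    # Primal max-margin problem: minimize (1/2)|w|^2 over z = (w, b)
    # subject to y_i(<w, x_i> + b) - 1 >= 0 for every example.
    import numpy as np
    from scipy.optimize import minimize

    # Assumed toy data: two linearly separable clusters in R^2.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
                  [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    d = X.shape[1]

    def objective(z):                 # z[:d] = w, z[d] = b
        w = z[:d]
        return 0.5 * np.dot(w, w)

    constraints = [{'type': 'ineq',   # SLSQP convention: fun(z) >= 0
                    'fun': lambda z, xi=xi, yi=yi: yi * (np.dot(z[:d], xi) + z[d]) - 1.0}
                   for xi, yi in zip(X, y)]

    res = minimize(objective, x0=np.zeros(d + 1), method='SLSQP', constraints=constraints)
    w, b = res.x[:d], res.x[d]
    print("w =", w, "b =", b, "margin rho =", 1.0 / np.linalg.norm(w))

The examples whose constraints hold with equality at the optimum are precisely the points on the margin, i.e. the support vectors discussed below.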
KKT Theory
The problem that we need to solve is a special case of the general
problem of minimizing a convex function $f(x)$ subject to $n$ inequality
constraints $g_j(x) \ge 0$ for $j=1,2,\ldots,n$, where the functions
$g_j$ are concave (in our case they are affine, hence both convex and
concave). Let's call this problem (CO).
Notice that in our case $x=(w,b) \in \mathbb{R}^{d+1}$ and the
constraints are linear in the unknowns $x$. Don't get confused with our
previous $x_i$'s.
The characterization of the solution to the convex
optimization (CO) problem is given by the so-called Karush-Kuhn-Tucker
conditions.
Theorem (KKT Conditions)

$\bar{x}$ solves the (CO) problem if and only if there exists a vector
$\bar{\lambda} = (\bar{\lambda}_1,\ldots,\bar{\lambda}_n) \ge 0$
of nonnegative Lagrange multipliers so that
$(\bar{x},\bar{\lambda})$ is a saddle point of the Lagrangean,

$$L(x,\lambda) = f(x) - \sum_{j=1}^{n} \lambda_j g_j(x),$$

i.e., for all $x$ and for all $\lambda \ge 0$ we have,

$$L(\bar{x},\lambda) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}).$$

Before proving (half of) this theorem, notice that there is an easy to
understand intuitive reason behind this result. Think of the term added
(subtracted, actually) to $f(x)$ to form the Lagrangean $L$ as a penalty
for an $x$ that violates the constraints. In fact, if $g_j(x) < 0$,
the penalty $-\lambda_j g_j(x) > 0$ can be made arbitrarily large by
increasing $\lambda_j$. Thus, the minimizer of $L(x,\lambda)$ over $x$
will have to make $g_j(x) \ge 0$. On the other hand, if $g_j(x) > 0$
then it is best to take $\lambda_j = 0$ to maximize $L(x,\lambda)$ as
a function of $\lambda$. It is possible to show that the saddle point
condition is equivalent to,

$$\min_x \, \max_{\lambda \ge 0} L(x,\lambda) = L(\bar{x},\bar{\lambda}) = \max_{\lambda \ge 0} \, \min_x L(x,\lambda).$$

Proof: Let us show that the saddle point condition is in fact
sufficient for solving the (CO) problem. That it is also necessary
depends on Farkas's Lemma and it is much more difficult to prove.
We need to show that the saddle point condition implies,

1. $g_j(\bar{x}) \ge 0$ for all $j \le n$,

and,

2. $f(\bar{x}) \le f(x)$ for all $x$ that satisfy the constraints.
To show 1, suppose that the $i$th constraint is violated, i.e.,
$g_i(\bar{x}) < 0$. Then by taking any $\lambda_i > \bar{\lambda}_i$ and

$$\lambda_j = \bar{\lambda}_j \quad \text{for all } j \ne i,$$

we get,

$$L(\bar{x},\lambda) > L(\bar{x},\bar{\lambda}),$$

which contradicts the saddle point condition.
To show 2, take $\lambda = 0$ on the left hand side of the saddle point
condition and take $x$ satisfying the constraints on the right. Then,

$$f(\bar{x}) = L(\bar{x},0) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}) \le f(x),$$

where the last inequality holds because $\bar{\lambda} \ge 0$ and
$g_j(x) \ge 0$ for a feasible $x$. This proves 2. $\Box$
When the objective function $f(x)$ and the constraining functions
$g_j(x)$ are all differentiable (they are infinitely differentiable in the
case of SVMs), the condition for a saddle point is simply that at that
point the tangent plane to the surface $z = L(x,\lambda)$ is parallel to
the $(x,\lambda)$ plane. The saddle point of $L$ can be obtained by
solving the system of equations,

$$\nabla_x L(x,\lambda) = 0, \quad \text{i.e.,} \quad \nabla f(x) = \sum_j \lambda_j \nabla g_j(x),$$

$$\lambda_j g_j(x) = 0 \quad \text{for all } j \le n.$$

The second set of equations is known as the complementarity conditions
and is a consequence of the constraint that $\lambda \ge 0$: at the saddle
point, either a multiplier vanishes or its constraint is active.
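As a small worked example (not in the original notes, added only to illustrate the system above), take $f(x) = x^2$ and the single constraint $g(x) = x - 1 \ge 0$. The Lagrangean is $L(x,\lambda) = x^2 - \lambda(x-1)$. Stationarity $\nabla f(x) = \lambda \nabla g(x)$ gives $2x = \lambda$, and complementarity $\lambda(x-1) = 0$ rules out $\lambda = 0$ (which would force $x = 0$, violating the constraint), so $\bar{x} = 1$ and $\bar{\lambda} = 2$.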
The Dual
The minmax = maxmin characterization of the saddle point of the
Lagrangean $L$ provides an alternative way to find the solution of the
(CO) problem. Instead of minimizing $f(x)$ subject to the $g_j(x) \ge 0$
constraints, one can equivalently maximize $W(\lambda)$, where

$$W(\lambda) = \min_x L(x,\lambda),$$

subject to the constraint that $\lambda \ge 0$. This provides an alternative
route to the same saddle point of $L$.
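Continuing the worked example above, $W(\lambda) = \min_x \{x^2 - \lambda(x-1)\} = \lambda - \lambda^2/4$ (the inner minimum is attained at $x = \lambda/2$), and maximizing over $\lambda \ge 0$ gives $\bar{\lambda} = 2$ with $W(2) = 1 = f(\bar{x})$, the same saddle point found from the KKT system.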
The Support Vectors of SVMs
Let us apply the KKT conditions to our original problem of finding the
separating hyperplane with maximum margin. The Lagrangean in this case
is,

$$L(w,b,\lambda) = \frac{1}{2}\sum_{i=1}^{d} w_i^2 - \sum_{j=1}^{n} \lambda_j \{ y_j(\langle w,x_j \rangle + b) - 1 \},$$
and the KKT conditions for optimality are,

$$\nabla_w L = 0, \quad \text{i.e.,} \quad w = \sum_{j=1}^{n} \lambda_j y_j x_j,$$

$$\frac{\partial L}{\partial b} = 0, \quad \text{i.e.,} \quad \sum_{j=1}^{n} \lambda_j y_j = 0,$$

$$\lambda_j \{ y_j(\langle w,x_j \rangle + b) - 1 \} = 0 \quad \text{for all } j \le n.$$
These provide a complete characterization of the optimal plane. The normal
$w$ must be a linear combination of the observed vectors $x_j$; that's
the first set of equations. The coefficients $\lambda_j y_j$ of this linear
combination must add up to 0; that's the second equation. Finally, the
complementarity conditions tell us that the only nonzero Lagrange multipliers
$\lambda_j$ are those associated to the vectors $x_j$ right on the margin,
i.e., such that,

$$y_j(\langle w,x_j \rangle + b) = 1.$$

These are called support vectors and they are the only ones needed,
since

$$w = \sum_{j \in J_0} \lambda_j y_j x_j$$

where $J_0 = \{ j : x_j \text{ is a support vector} \}$. The support vectors are
the observations $x_j$ at the exact distance $\rho = 1/\|w\|$ from the
separating plane. The number of such vectors is usually much smaller than
$n$, and that makes it possible to consider very large numbers of examples
with $x_j$ having many coordinates.
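In particular, once the multipliers are known, a new point $x$ is classified by $\mathrm{sign}(\langle w,x \rangle + b) = \mathrm{sign}(\sum_{j \in J_0} \lambda_j y_j \langle x_j,x \rangle + b)$, so only inner products with the support vectors ever need to be computed.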
The Dual Problem for SVMs
The dual problem for SVMs turns out to be even simpler than the primal and
its formulation shows the way to a magnificent nonlinear generalization.
For a given vector $\lambda$ of Lagrange multipliers, the minimizer of
$L(w,b,\lambda)$ w.r.t. $(w,b)$ must satisfy the optimality conditions
obtained above, i.e., $w$ is a linear combination of the $x_j$'s with
coefficients $\lambda_j y_j$ that must add up to zero. Hence, replacing these
conditions into $L(w,b,\lambda)$ we obtain the dual formulation: maximize

$$W(\lambda) = \sum_{j=1}^{n} \lambda_j - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \langle x_i,x_j \rangle$$

over the $\lambda \ge 0$ satisfying,

$$\sum_{j=1}^{n} \lambda_j y_j = 0.$$

Maximizing $W(\lambda)$ over $\lambda \ge 0$ subject to this simple
linear constraint is the preferred form to feed to a QP algorithm.
Once an optimal $\lambda$ is obtained we find $w$ as the linear combination of
the $x_j$ as above, and we find $b$ by recalling that the plane must
be in canonical position, so

$$\min_{i \le n} y_i(\langle w,x_i \rangle + b) = 1 = y_j(\langle w,x_j \rangle + b) \quad \text{for all } j \in J_0,$$

and we get,

$$b = y_j - \langle w,x_j \rangle.$$
Multiplying through by $\lambda_j$, adding over $j$, and using
$\sum_j \lambda_j y_j = 0$, we find,

$$b = -\,\frac{\sum_{i,j} \lambda_i \lambda_j y_i \langle x_i,x_j \rangle}{\sum_j \lambda_j},$$
and it can be readily checked that this value coincides with the value of the
Lagrange multiplier $b$ associated to the constraint
$\sum_i \lambda_i y_i = 0$ (just find $\nabla_\lambda L = 0$ for
the Lagrangean associated to the dual, i.e.
$L = W(\lambda) - b\sum_i \lambda_i y_i$). The optimal values of $b$
and of $\lambda$ are often returned by the modern QP solvers based on
interior point algorithms.
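The following sketch (again with assumed toy data; the cvxopt interior point QP solver stands in for whatever solver one prefers, and is not referenced in the original notes) sets up the dual exactly as above and then reads off the support vectors, $w$, and $b$.

    # SVM dual: maximize sum(l) - (1/2) l'Ql with Q_ij = y_i y_j <x_i, x_j>,
    # subject to l >= 0 and sum_i y_i l_i = 0.  Rewritten as a minimization for cvxopt.
    import numpy as np
    from cvxopt import matrix, solvers

    # Assumed toy data: same two separable clusters as before.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
                  [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    n = len(y)

    Q = (y[:, None] * X) @ (y[:, None] * X).T
    P = matrix(Q + 1e-8 * np.eye(n))   # tiny ridge only for numerical stability
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))             # encodes l >= 0 as -l <= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))       # encodes sum_i y_i l_i = 0
    b_eq = matrix(np.zeros(1))

    solvers.options['show_progress'] = False
    lam = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()

    sv = lam > 1e-6                    # support vectors: strictly positive multipliers
    w = ((lam * y)[:, None] * X).sum(axis=0)   # w = sum_j l_j y_j x_j
    b = np.mean(y[sv] - X[sv] @ w)             # b = y_j - <w, x_j> on the margin
    print("support vectors:", np.where(sv)[0], " w =", w, " b =", b)

Interior point solvers of this kind also report a multiplier for the equality constraint, which, up to sign convention, is the same $b$ just discussed.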