\documentclass{article}
\usepackage[pdftex]{graphics}
\newcommand{\A}{\mbox{${\cal A}$\ }}
\newcommand{\B}{\mbox{${\cal B}$\ }}
\newcommand{\G}{\mbox{${\cal G}$\ }}
\newcommand{\C}{\mbox{${\cal C}$\ }}
\newcommand{\X}{\mbox{${\cal X}$\ }}
\newcommand{\SnA}{\mbox{${\cal S}~(n,\A)$\ }}
\newcommand{\Sn}{\mbox{${\cal S}$~}}
\newcommand{\sgn}{\mbox{sgn\ }}
\newcommand{\vol}{\mbox{vol}}
\title{Linear Classifiers}
\author{Carlos C. Rodr\'{\i}guez \\
{\em http://omega.albany.edu:8008/}}
\begin{document}
\maketitle
\section*{Classification with Hyperplanes}
Assume we have observed $n$ examples of labeled data
$(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ but now let us take the labels
$Y\in\{1,-1\}$. This convention simplifies some of the formulas below
and is the standard one for support vector machines.
We'll be interested in classification rules of the form,
\[ g(x) = \sgn \left( \langle w,x\rangle + b \right) \]
where $\langle\cdot,\cdot\rangle$ denotes the (Euclidean) inner product in $R^{d}$, and $w\in
R^{d}$ and $b\in R$ parametrize all the rules of this type. These
rules are called linear. The class $\G$ of linear classifiers is
about the simplest possible way of separating two kinds of objects in
$R^{d}$. A classifier in $\G$ assigns the label $+1$ or $-1$ to a data
point $x\in R^{d}$ depending on whether we arrive at $x$ from the plane
by going in the direction of the normal $w$ or of its opposite $-w$.
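The rule $g(x)=\sgn(\langle w,x\rangle+b)$ can be sketched in a few lines of
Python; the weight vector, offset, and points below are made-up
illustrations, not taken from the text.

```python
# Minimal sketch of a linear classifier g(x) = sgn(<w, x> + b).
# The parameters and sample points are hypothetical.
def linear_classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Convention: points exactly on the plane (s == 0) get label +1.
    return 1 if s >= 0 else -1

w, b = [1.0, -2.0], 0.5
print(linear_classify(w, b, [3.0, 1.0]))   # <w,x>+b = 1.5, so +1
print(linear_classify(w, b, [0.0, 2.0]))   # <w,x>+b = -3.5, so -1
```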
Recall that the set of vectors,
\[ \{ x : \langle w,x\rangle + b = 0\} \]
corresponds to the hyperplane perpendicular to $w$ that goes through every
point $x_{0}$ such that \( \langle w,x_{0}\rangle = -b \). For example, the following
picture shows such a hyperplane that perfectly separates the O's (with
label $+1$) from the X's (with label $-1$).
\begin{center}
%\epsfbox{two-gaus.eps}
\includegraphics{hyper}
\end{center}
The hyperplane with parameters $(w,b)$ is the same as the hyperplane with
parameters $(cw,cb)$ for any non-zero scalar $c$. To fix the scale we say
that a plane is in {\em canonical position} with respect to a set of points
$\X = \{x_{1},\ldots,x_{n} \}$ if,
\[ \min_{i\le n} | \langle w,x_{i}\rangle + b | = 1. \]
Clearly, the perpendicular distance from $x_{i}$ to the hyperplane
with parameters $(w,b)$ is the magnitude of the projection of
$(x_{i}-x_{0})$ onto the unit normal direction $w/|w|$. It is given by,
\[ d_{i} = \frac{|\langle w, x_{i}-x_{0}\rangle|}{|w|} = \frac{| \langle w,x_{i}\rangle + b |}{|w|} \]
as can be seen from the above picture. Thus, if we assume that the
plane is in canonical position w.r.t. $\{x_{i}: i\le n\}$ then,
\[ \rho = \min_{i\le n} d_{i} = \frac{1}{|w|}. \]
This minimum distance $\rho$ is known as the {\em margin} between a separating
hyperplane and the points w.r.t. which that plane is in canonical position.
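Canonical position and the margin $\rho = 1/|w|$ are easy to check
numerically. The sketch below rescales an arbitrary $(w,b)$ into canonical
position w.r.t. a made-up point set; all numbers are assumptions for
illustration only.

```python
import math

def canonicalize(w, b, xs):
    """Rescale (w, b) by c > 0 so that min_i |<w, x_i> + b| = 1."""
    m = min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in xs)
    c = 1.0 / m
    return [c * wi for wi in w], c * b

xs = [[2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]   # illustrative points
w, b = canonicalize([2.0, 2.0], -2.0, xs)   # rescales to w = [1.0, 1.0], b = -1.0
margin = 1.0 / math.hypot(*w)               # rho = 1/|w| = 1/sqrt(2)
```

Scaling $(w,b)$ by any positive constant leaves the hyperplane itself
unchanged, so this normalization costs nothing.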
\section*{Maximum Margin Classifiers}
When there is a hyperplane with
parameters $(w,b)$ that separates the points $\{x_{i}: y_{i} = 1, i\le n\}$
from the points $\{x_{i}: y_{i} = -1, i\le n\}$ then, if we take the
plane in canonical position w.r.t. $\X=\{x_{1},\ldots,x_{n}\}$ we have,
\[ y_{i}( \langle w,x_{i}\rangle + b ) \ge 1 \mbox{\ for all $i\le n$\ }. \]
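Whether a given $(w,b)$ in canonical position actually separates a labeled
sample can be tested directly against this constraint; the labeled points
below are hypothetical.

```python
def separates_canonically(w, b, data):
    """Check y * (<w, x> + b) >= 1 for every labeled pair (x, y)."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in data
    )

# Hypothetical labeled points in R^2:
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([0.0, 0.0], -1), ([-1.0, 1.0], -1)]
print(separates_canonically([1.0, 1.0], -3.0, data))   # True
print(separates_canonically([1.0, 1.0], 0.0, data))    # False
```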
There could be many such planes but it is geometrically obvious that
the one farthest from the points should be preferred over all the
others. The intuition behind this is the fact that the farther the
separating boundary is from the observations the more likely it seems
for future data to end up being correctly classified by this plane. In
other words, we expect better generalization power for boundaries with
larger margin, i.e., for planes with small $|w|$ (recall that $\rho =
1/|w|$). This intuition is confirmed by the following theorem.
\begin{description}
\item[Theorem] Consider hyperplanes through the origin in canonical position
w.r.t. $\X=\{x_{1},\ldots,x_{n}\}$. Then, the set of linear classifiers,
$g_{w}(x) = \sgn \langle w,x\rangle$ with $|w| \le \Lambda$ has VC dimension,
\[ V \le \min\{ R^{2}\Lambda^{2}, R^{d}\Lambda^{d}\} \]
where $R$ is the radius of the smallest sphere around the origin containing
$\X$.
\end{description}
Before proving it notice that in fact the length of $w$ controls the VC
dimension so that the larger the margin the larger the generalization
power as expected.
{\bf Proof:}
We need to show both, $V\le R^{d}\Lambda^{d}$ and $V\le R^{2}\Lambda^{2}$.
The first inequality follows from the fact that the maximum number
of spheres of radius $1/\Lambda$ that still fit inside a sphere of radius $R$
is at most $\vol(R)/\vol(1/\Lambda)= (R\Lambda)^{d}$, where
$\vol(r)= C_{d}r^{d}$ is the volume of the sphere of radius $r$ in $R^{d}$.
More than this number of points is impossible to shatter with margin of
at least $1/\Lambda$. Thus, $V\le (R\Lambda)^{d}$. For the other inequality
again we show that the number of points that can be shattered with margin $1/\Lambda$
is at most $(R\Lambda)^{2}$. If $n$ points can be shattered then for all possible
choices of $y_{i}\in \{1,-1\}$ there is a hyperplane with parameter $w$ with
$|w|\le \Lambda$, in canonical position w.r.t. $\{x_{1},\ldots,x_{n}\}$ separating
$+1$'s from $-1$'s, i.e., satisfying,
\[ y_{i} \langle w,x_{i}\rangle \ge 1 \mbox{\ for all $i=1,\ldots,n$}. \]
But this together with the Cauchy-Schwarz inequality and the fact that $|w|\le \Lambda$
imply,
\begin{eqnarray*}
n &\le& \sum_{i=1}^{n} y_{i}\langle w,x_{i}\rangle
  = \left\langle w, \sum_{i=1}^{n} y_{i}x_{i} \right\rangle
  \le |w| \left| \sum_{i=1}^{n} y_{i}x_{i} \right| \\
&\le& \Lambda \left| \sum_{i=1}^{n} y_{i}x_{i} \right|.
\end{eqnarray*}
To bound the sum consider the case when the labels $y_{i}$ are Rademacher variables, i.e.
iid with $P\{y_{i}=1\}=1-P\{y_{i}= -1\} = 1/2$. From the fact that $E\{y_{i}y_{j}\}=0$
when $i\ne j$ and that $y_{i}^{2}=1$ for all $i\le n$, we obtain
\begin{eqnarray*}
E\left|\sum y_{i}x_{i}\right|^{2} &=& \sum_{i=1}^{n}
E\left\{ \left\langle y_{i}x_{i}, \sum_{j=1}^{n} y_{j}x_{j} \right\rangle \right\} \\
&=& \sum_{i=1}^{n} \left\{ \sum_{j\ne i} E[y_{i}y_{j}\langle x_{i},x_{j}\rangle] +
E[y_{i}^{2}|x_{i}|^{2}] \right\} \\
&=& \sum_{i=1}^{n} |x_{i}|^{2} \le n R^{2}.
\end{eqnarray*}
Since this bound holds for the expectation over Rademacher labels, there
must exist a particular choice of $y_{i}$'s for which,
\[ |\sum y_{i}x_{i}|^{2} \le n R^{2}. \]
Squaring the initial inequality and using the above bound, we obtain
\[ n^{2} \le \Lambda^{2} n R^{2} \]
and the result follows after dividing through by $n$. $\bullet$
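The Rademacher identity $E|\sum_{i} y_{i}x_{i}|^{2} = \sum_{i} |x_{i}|^{2}$
used in the proof can be verified exactly, without sampling, by averaging
over all $2^{n}$ sign patterns; the point set below is an illustrative
assumption.

```python
from itertools import product

def avg_sq_norm(xs):
    """Average of |sum_i y_i x_i|^2 over all 2^n sign vectors y in {+1,-1}^n."""
    d = len(xs[0])
    total = 0.0
    for ys in product((1, -1), repeat=len(xs)):
        s = [sum(y * x[k] for y, x in zip(ys, xs)) for k in range(d)]
        total += sum(v * v for v in s)
    return total / 2 ** len(xs)

xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
print(avg_sq_norm(xs))   # 7.0 = |x_1|^2 + |x_2|^2 + |x_3|^2 = 1 + 4 + 2
```

The cross terms $\langle x_{i},x_{j}\rangle$ average to zero over the sign
patterns, which is exactly the $E\{y_{i}y_{j}\}=0$ step of the proof.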
\end{document}