\documentclass{article}
\usepackage[pdftex]{graphicx}
\newcommand{\F}{\mbox{${\cal F}$\ }}
\newcommand{\La}{\mbox{${\cal L}$\ }}
\newcommand{\Hs}{\mbox{${\cal H}$\ }}
\title{Regression}
\author{Carlos C. Rodr\'{\i}guez \\
{\em http://omega.albany.edu:8008/}}
\begin{document}
\maketitle
\section*{ Classical Regression }
Recall the classical regression problem. Given observed data
$(x_{1},y_{1}),\ldots, (x_{n},y_{n})$, iid copies of a vector
$(X,Y)\in R^{d+1}$, we want to estimate $f^{*}(x) = E(Y|X=x)$, i.e., the regression
function of $Y$ on $X$. When the vector $(X,Y)$ is multivariate gaussian
the regression function is $f^{*}(x) = \alpha + L(x)$ with $L(x)$ linear,
and the Ordinary (yak!) Least Squares (OLS) estimator coincides with the MLE
(Maximum Likelihood Estimator). Very often the distribution of $(X,Y)$
is not explicitly known beyond the $n$ observations and the available
prior information about the meaning of these data. A typical assumption
is to think of the $y_{j}$ as the result of sampling the regression
function at $x_{j}$, i.e. $f^{*}(x_{j})$, with gaussian measurement error. The model is
that conditionally on $x_{1},\ldots, x_{n}$ the values of $y_{1},\ldots,y_{n}$
are independent with $Y_{j}$ depending on $X_{j}$ only and
$Y_{j}|X_{j}=x_{j}$ being $N(f^{*}(x_{j}), \sigma^{2})$. Thus, for
$j= 1,2,\ldots, n$
\[ y_{j} = f^{*}(x_{j}) + \epsilon_{j} \]
where $\epsilon_{1},\epsilon_{2},\ldots, \epsilon_{n}$ are iid $N(0,\sigma^{2})$.
In this way the distribution of $(X,Y)$ is modeled semiparametrically
with $(f,\sigma)$ where $f\in \F$ is a function in some space of functions \F
and $\sigma > 0$ is a scalar parameter.
When \F is taken as the $m$-dimensional space $\F_{m}$ generated by functions
$g_{1}(x),\ldots,g_{m}(x)$, estimation of the regression function reduces to
the linear optimization problem,
\[ \hat{f} = \mbox{arg}\min_{f\in\F_{m}} \sum_{i=1}^{n} \left(
y_{i} - f(x_{i}) \right)^{2} \]
The solution is then given as the orthogonal (euclidean) projection of
the observed vector $y^{T} = (y_{1},y_{2},\ldots,y_{n})$ onto the
space generated by the columns of the matrix $G\in R^{n\times m}$
with entries $g_{ij} = g_{j}(x_{i})$. In fact, the above optimization problem
can be written as,
\[ || y - \hat{y} ||^{2} = \min_{w\in R^{m}} || y - Gw ||^{2} \]
\begin{center}
%make sure you have file.pdf and file.gif
\includegraphics[scale=0.5]{regplot}
\end{center}
As shown by the picture, the rejection vector (i.e. $y$ minus its projection)
must be orthogonal to the linear space generated by the columns of $G$,
in particular (and equivalently) to each of these columns, obtaining the
standard set of normal equations,
\[ 0 = G^{T}(y - \hat{y}) = G^{T}(y - G\hat{w}) \]
with solution,
\[ \hat{w} = (G^{T}G)^{-1}G^{T}y \]
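As a concrete illustration (the basis $g_{1}(x)=1$, $g_{2}(x)=x$ and the simulated data are hypothetical), the normal equations can be solved directly with numpy:

```python
import numpy as np

# Hypothetical basis g_1(x) = 1, g_2(x) = x, sampled at n = 20 points.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)

# Design matrix G in R^{n x m}: column j holds g_j evaluated at the x_i.
G = np.column_stack([np.ones_like(x), x])

# Normal equations: w_hat = (G^T G)^{-1} G^T y.
w_hat = np.linalg.solve(G.T @ G, G.T @ y)
y_hat = G @ w_hat   # orthogonal projection of y onto the column space of G

# The rejection vector y - y_hat is orthogonal to the columns of G.
assert np.allclose(G.T @ (y - y_hat), 0.0)
```

For larger or ill-conditioned bases, `np.linalg.lstsq(G, y)` is the numerically preferable route.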
The more general case of Weighted Least Squares (WLS), corresponding to
the inner product $\langle x, z\rangle = x^{T}Az$ generated by a symmetric positive
definite matrix $A$, is just,
\[ \hat{w} = (G^{T}AG)^{-1}G^{T}A y. \]
The matrix $A$ encodes a covariance structure for the measurement errors,
$\epsilon_{1},\ldots,\epsilon_{n}$.
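A minimal WLS sketch under the same finite-basis setup, taking $A = \Sigma^{-1}$ for a hypothetical heteroscedastic error covariance $\Sigma$:

```python
import numpy as np

# Same linear basis as before, but with a known (hypothetical) noise model.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
sig = 0.1 * (1.0 + x)                  # noise sd grows with x (assumption)
y = 1.0 - 2.0 * x + sig * rng.standard_normal(x.size)
G = np.column_stack([np.ones_like(x), x])

# SPD weight matrix A = Sigma^{-1}; here Sigma is diagonal.
A = np.diag(1.0 / sig**2)

# WLS estimate: w_hat = (G^T A G)^{-1} G^T A y.
w_hat = np.linalg.solve(G.T @ A @ G, G.T @ A @ y)
```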
\section*{Overfitting and Kernel Regression}
How should \F be chosen? On the one hand, we would like \F to be big so
as not to constrain the form of the true regression function too much.
On the other hand, big \F's make the task of searching for the best
$f\in \F$ more difficult and, more importantly, without a constraint on
the explanatory capacity of \F the solution will show no power of generalization.
A big enough \F will always have at least one member $f$ able to fit all the
observations perfectly, without error, but this $f$ provides no assurance
that $f(x)$ is not as bad as it can possibly be for any point $x$ not
in the training set. To be able to assure that the size of the mistake
on future data will not exceed a given value with high probability,
(i.e. to have PAC bounds) we must constrain the capacity of \F somehow.
Over the years, statisticians and numerical analysts have invented all kinds
of ad-hoc devices for achieving this goal. These are known as regularization
methods. They boil down to adding a penalty term to the OLS empirical
term, often of the form $\Omega(||f||)$ where $\Omega$ is an increasing
function and \F is assumed to be a space with a norm. The problem to
be solved becomes,
\[ \min_{f\in \F} \sum_{i=1}^{n} (y_{i} - f(x_{i}))^{2}
\mbox{\ subject to\ } ||f|| \le r_{n} \]
where the sequence of radii $r_{n} \rightarrow \infty$ as $n \rightarrow \infty$,
but not too quickly (at a given rate that depends on \F), so that some
form of asymptotic stochastic convergence of the solution $f_{n}$ towards the
projection of the true regression function $f^{*}$ onto \F is achieved.
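In the penalized (Lagrangian) form of this constrained problem, with $\F = \F_{m}$ and $\Omega(||f||) = \lambda ||w||^{2}$, the minimizer has the closed form $w = (G^{T}G + \lambda I)^{-1}G^{T}y$ (ridge regression). A sketch, with a hypothetical polynomial basis and penalty weight:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Rich polynomial basis: without a penalty this tends to overfit.
G = np.column_stack([x**k for k in range(10)])
lam = 1e-2   # hypothetical penalty weight

# min ||y - G w||^2 + lam ||w||^2  =>  w = (G^T G + lam I)^{-1} G^T y
w = np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ y)
```

The penalty shrinks every coefficient toward zero, trading a little bias for a large reduction in variance.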
\subsection*{Kernel Regression}
Reproducing Kernel Hilbert Spaces (RKHS) provide convenient choices
for \F.
\begin{description}
\item[Theorem:]
Let $K$ be a Mercer kernel and let $\Hs_{K}$ be the associated RKHS.
If
\[ C((x_{1},y_{1},f(x_{1})),\ldots, (x_{n},y_{n},f(x_{n})))\]
is any cost function that depends on $f\in \Hs_{K}$ only through the
values of $f(x_{j})$ at the observed $x_{j}$, then the minimizer of
\[ U(f) = C\left((x_{1},y_{1},f(x_{1})),\ldots, (x_{n},y_{n},f(x_{n}))\right) +
\Omega(|| f ||) \]
where $\Omega$ is an increasing function, is always achieved at a point
$f_{n}\in \Hs_{K}$ of the form,
\[ f_{n}(x) = \sum_{j=1}^{n} w_{j} K(x_{j},x). \]
Thus, when $\F = \Hs_{K}$, a big fat infinite dimensional space, the regularized
empirical cost $U = C+\Omega$ is minimized by solving a classic regression
problem with $\F_{n} = \mbox{span}\{K(x_{1},\cdot),\ldots, K(x_{n},\cdot)\}$.
\end{description}
{\bf Proof:} The proof is surprisingly simple. Every $f\in \Hs_{K}$
can be written as $f = g + h$ where $g\in \F_{n}$ and $h\in \F_{n}^{\perp}$.
We show that,
\[
U(f) = U(g+h) \ge U(g)
\]
with equality if and only if $h=0$. This follows easily from the reproducing
property of the kernel spaces. For all $j \le n$,
\[ f(x_{j}) = \langle K(x_{j},\cdot), g+h\rangle
 = \langle K(x_{j},\cdot), g\rangle + \langle K(x_{j},\cdot), h\rangle
 = g(x_{j}) \]
since $h \perp \F_{n}$ by hypothesis. Thus, $C(f)=C(g)$. On the other hand,
since $\Omega$ is strictly increasing and $g \perp h$, by the pythagorean
theorem we have,
\begin{eqnarray*}
\Omega(||f||) &=& \Omega\left( (||g||^{2} + ||h||^{2})^{1/2} \right) \\
&\ge & \Omega(||g||)
\end{eqnarray*}
with equality if and only if $h=0$. Hence,
$U(f)= C(g)+\Omega(||g+h||) \ge C(g)+\Omega(||g||) = U(g). \bullet$
\section*{Support Vector Regression}
For given values $\alpha >0$ and $\epsilon >0$ define the empirical cost
function,
\[ C = \alpha \sum_{i=1}^{n} |y_{i} - f(x_{i})|_{\epsilon} \]
where,
\[ |z|_{\epsilon} = \max \{0, |z| - \epsilon \} \]
is known as the $\epsilon$-insensitive function, and take,
\[ \Omega(||f||) = \frac{1}{2} ||f||^{2}. \]
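A numpy one-liner for the $\epsilon$-insensitive function (the default $\epsilon$ is ours):

```python
import numpy as np

def eps_insensitive(z, eps=0.4):
    """epsilon-insensitive loss |z|_eps = max(0, |z| - eps)."""
    return np.maximum(0.0, np.abs(z) - eps)
```

Residuals smaller than $\epsilon$ cost nothing, which is what makes the fitted solution sparse.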
With these choices, kernel regression becomes support vector regression.
The parameter $\epsilon$ controls the sparness of the solution. The
smoothing parameter $\alpha$ controls the relative importance of
the empirical cost $C$ relative to the complexity penalty $\Omega$.
The derivation of the support vector regression problem follows closely
the derivation of support vector machines for classification.
We first set up a primal optimization problem for minimizing the above
$\epsilon$-insensitive regularized empirical cost over functions,
$f(x) = \langle w, x\rangle + b$, for the euclidean inner product.
Then we consider the dual problem. This turns out to be a simple
quadratic programming problem that depends on the observed data
only through the values of the inner products $\langle x_{i}, x_{j}\rangle$.
Just as in the classification
case, we can apply the kernel trick and reap the benefits of
nonlinear kernel regression at the linear regression cost!
\subsection*{The Primal Problem for SV Regression}
We seek the solution of,
\[ \mbox{minimize}\ \alpha \sum_{i=1}^{n} |y_{i} - f(x_{i})|_{\epsilon}
+ \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} \]
over $b,w$ when $f(x) = \langle w, x\rangle + b$. This is equivalent to,
\[ \mbox{minimize}\ \alpha \sum_{i=1}^{n} u_{i}
+ \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} \]
over $u_{i},w_{i},b$
subject to: $u_{i} \ge |y_{i}-f(x_{i})|_{\epsilon}$ for $i\le n$. Each
of the last $n$ inequalities corresponds to three inequalities,
\[ u_{i} \ge 0,\ u_{i} \ge y_{i} - f(x_{i}) - \epsilon, \
u_{i} \ge f(x_{i}) - y_{i} - \epsilon \]
Applying the standard trick of adding nonnegative slack variables
$\xi_{i}$ and $\xi^{*}_{i}$ we soften the inequalities and allow
small violations. So we replace the above constrained optimization
problem with,
\[ \mbox{minimize}\ \alpha \sum_{i=1}^{n} u_{i} + \frac{\alpha}{2}
\sum_{i=1}^{n} (\xi_{i} + \xi^{*}_{i})
+ \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} \]
subject to: for $i\le n$,
\begin{eqnarray*}
& & y_{i} - f(x_{i}) - \epsilon \le u_{i} + \xi_{i} \\
& & f(x_{i}) - y_{i} - \epsilon \le u_{i} + \xi^{*}_{i} \\
& & u_{i} \ge 0, \ \xi_{i} \ge 0, \ \xi^{*}_{i} \ge 0.
\end{eqnarray*}
The objective function was chosen so that we can factorize out
$\alpha/2$ and write,
\[ \frac{\alpha}{2}\sum_{i=1}^{n} (\{u_{i}+\xi_{i}\} + \{u_{i}+\xi^{*}_{i}\})
+ \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} \]
In this way we can get rid of the $u_{i}$ by just replacing $u_{i}+\xi_{i}$ by
$\xi_{i}$ and $u_{i}+\xi^{*}_{i}$ by $\xi^{*}_{i}$ everywhere. Also, replace
$\alpha/2$ by a new $\alpha$ to obtain the problem:
\[ (Primal)\ \ \mbox{minimize} \ \alpha \sum_{i=1}^{n} (\xi_{i} + \xi^{*}_{i})
+ \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} \]
over, $w_{i},b, \xi_{i}, \xi^{*}_{i}$ subject to: for $i\le n$,
\begin{eqnarray*}
& & y_{i} - f(x_{i}) - \epsilon \le \xi_{i} \\
& & f(x_{i}) - y_{i} - \epsilon \le \xi^{*}_{i} \\
& & \xi_{i} \ge 0, \ \xi^{*}_{i} \ge 0 \\
& & \mbox{ and \ } f(x) = \langle w, x\rangle + b.
\end{eqnarray*}
\subsection*{The Dual Problem for SV Regression}
The Lagrangian in terms of nonnegative Lagrange multipliers is,
\begin{eqnarray*}
\La &=& \alpha \sum_{i}(\xi_{i}+\xi^{*}_{i}) + \frac{1}{2} \sum_{i} w_{i}^{2} \\
& & + \sum_{i} \lambda_{i} \{ y_{i} - f(x_{i}) - \epsilon - \xi_{i} \} \\
& & + \sum_{i} \lambda^{*}_{i} \{ f(x_{i}) - y_{i} - \epsilon - \xi^{*}_{i} \} \\
& & - \sum_{i} \beta_{i} \xi_{i} - \sum_{i} \beta^{*}_{i} \xi^{*}_{i}.
\end{eqnarray*}
To compute the dual we need to find,
\[ W(\lambda,\lambda^{*},\beta,\beta^{*}) = \min_{w,b,\xi,\xi^{*}} \La \]
The values of $w,b,\xi,\xi^{*}$ where the minimum is achieved must satisfy,
\begin{eqnarray*}
\nabla_{w}\La &= 0& \Longleftrightarrow
w = \sum_{i} (\lambda_{i}-\lambda^{*}_{i}) x_{i} \\
\nabla_{b}\La &= 0& \Longleftrightarrow
\sum_{i} \lambda_{i} = \sum_{i} \lambda^{*}_{i} \\
\nabla_{\xi}\La &= 0& \Longleftrightarrow \lambda_{j} + \beta_{j} = \alpha
\mbox{\ for\ } j \le n. \\
\nabla_{\xi^{*}}\La &= 0& \Longleftrightarrow \lambda^{*}_{j} + \beta^{*}_{j} = \alpha
\mbox{\ for\ } j \le n.
\end{eqnarray*}
Replacing these equations into $\La$ we obtain that all the terms involving $\xi_{i}$
and $\xi^{*}_{i}$ disappear from $\La$, and with them, $\beta$ and $\beta^{*}$.
Therefore, $W$ is only a function of $\lambda$ and $\lambda^{*}$. We get,
substituting $K(x_{i},x_{j})$ for the inner products $\langle x_{i}, x_{j}\rangle$ (the kernel trick!) that,
\begin{eqnarray*}
W(\lambda,\lambda^{*}) & = & \frac{1}{2} \sum_{i,j}
(\lambda_{i}-\lambda^{*}_{i})(\lambda_{j}-\lambda^{*}_{j}) K(x_{i},x_{j}) \\
& & + \sum_{i} (\lambda_{i}-\lambda^{*}_{i}) y_{i}
- \epsilon \sum_{i} (\lambda_{i}+\lambda^{*}_{i}) \\
& & - \sum_{i,j}
(\lambda_{i}-\lambda^{*}_{i})(\lambda_{j}-\lambda^{*}_{j}) K(x_{i},x_{j})
\end{eqnarray*}
The first and last terms simplify to produce,
\begin{eqnarray*}
W(\lambda,\lambda^{*}) & = & - \epsilon \sum_{i} (\lambda_{i}+\lambda^{*}_{i}) \\
& & - \frac{1}{2} \sum_{i,j}
(\lambda_{i}-\lambda^{*}_{i})(\lambda_{j}-\lambda^{*}_{j}) K(x_{i},x_{j}) \\
& & + \sum_{i} (\lambda_{i}-\lambda^{*}_{i}) y_{i}.
\end{eqnarray*}
The dual problem becomes,
\[ (Dual)\ \ \max_{\lambda,\lambda^{*}} W(\lambda,\lambda^{*}) \]
subject to:
\begin{eqnarray*}
& & \sum_{j} \lambda_{j} = \sum_{j} \lambda^{*}_{j} \\
& & 0 \le \lambda_{j} \le \alpha, \ 0 \le \lambda^{*}_{j} \le \alpha.
\end{eqnarray*}
where we have replaced the equalities $\lambda_{j}+\beta_{j} = \alpha$,
involving $\lambda_{j} \ge 0$, $\beta_{j} \ge 0$, by the equivalent inequalities
shown above, which do not involve the $\beta$s.
As was the case for classification, the dual problem is a simple quadratic
programming problem that can be solved with efficient algorithms that are
publicly available.
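As a sketch of this step (not the author's Maple code), the dual can be handed to a generic solver such as scipy's SLSQP; the default $\alpha$, $\epsilon$ and the toy kernel in the usage below are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def svr_dual(y, K, alpha=1.5, eps=0.4):
    """Maximize W(lambda, lambda*) subject to sum(lambda) = sum(lambda*)
    and 0 <= lambda_j, lambda*_j <= alpha, by minimizing -W."""
    n = y.size

    def neg_W(theta):
        lam, lam_s = theta[:n], theta[n:]
        d = lam - lam_s
        # -W = eps*sum(lam + lam*) + (1/2) d^T K d - d^T y
        return eps * np.sum(lam + lam_s) + 0.5 * d @ K @ d - d @ y

    cons = {"type": "eq", "fun": lambda t: np.sum(t[:n]) - np.sum(t[n:])}
    res = minimize(neg_W, np.zeros(2 * n),
                   bounds=[(0.0, alpha)] * (2 * n), constraints=cons)
    return res.x[:n], res.x[n:]
```

A dedicated QP solver exploits the quadratic structure and scales better; SLSQP is merely convenient for small $n$.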
The solution from the QP solver is then used to produce the estimate,
\[ \hat{w} = \sum_{i} (\lambda_{i}-\lambda^{*}_{i}) x_{i} \]
The KKT complementarity conditions for the slack variables are
of the type $\xi_{j}(\alpha - \lambda_{j}) = 0$, so we can write the
other complementarity conditions as follows,
\begin{eqnarray*}
\lambda_{j}\{ y_{j} - \hat{f}(x_{j}) - \epsilon \} &= 0&
\mbox{\ provided\ } \lambda_{j} < \alpha \\
\lambda^{*}_{j}\{\hat{f}(x_{j}) - y_{j} - \epsilon \} &= 0&
\mbox{\ provided\ } \lambda^{*}_{j} < \alpha
\end{eqnarray*}
are valid (and nontrivial) for all
$j\in J_{0}$ and $j\in J^{*}_{0}$ (resp.), where
\[ J_{0} = \{ j : j\le n, \mbox{\ and\ } 0 < \lambda_{j} < \alpha \}\]
with $J^{*}_{0}$ defined analogously.
These, and the complementarity conditions
associated to the inequalities $\lambda_{j} \le \alpha$ and
$\lambda^{*}_{j} \le \alpha$, force many of the $\lambda_{j}$ and $\lambda^{*}_{j}$
to be either $0$ or $\alpha$, producing a sparse solution. The value
for $b$ can be obtained from any of the above complementarity conditions,
but a more accurate value is obtained by combining efforts. Replacing
the estimated values of the regression function at the training points,
\[ \hat{f}(x_{j}) = \sum_{i} (\lambda_{i}-\lambda^{*}_{i}) K(x_{i},x_{j}) + b \]
into the complementarity conditions, solving for $b$, multiplying through
by $\lambda_{j}$ and $\lambda^{*}_{j}$ respectively, and adding over $j\in J$ with
$J= J_{0}\cup J^{*}_{0}$ (the two sets are disjoint, since
$\lambda_{j}\lambda^{*}_{j} = 0$), we get,
\begin{eqnarray*}
\sum_{j\in J}\lambda_{j} b &=& \sum_{j\in J} \lambda_{j}\{ y_{j} -
\sum_{i} (\lambda_{i}-\lambda^{*}_{i}) K(x_{i},x_{j}) - \epsilon \} \\
\sum_{j\in J}\lambda^{*}_{j} b &=& \sum_{j\in J}
\lambda^{*}_{j}\{ y_{j} -
\sum_{i} (\lambda_{i}-\lambda^{*}_{i}) K(x_{i},x_{j}) + \epsilon \}
\end{eqnarray*}
Adding the two equations, we finally obtain the estimate,
\[
b = \frac{\sum_{j\in J} (\lambda_{j}+\lambda^{*}_{j}) \{ y_{j} -
\sum_{i} (\lambda_{i}-\lambda^{*}_{i}) K(x_{i},x_{j}) \}
- \epsilon \sum_{j\in J} (\lambda_{j}-\lambda^{*}_{j})}
{ \sum_{j\in J} (\lambda_{j}+\lambda^{*}_{j}) }
\]
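A numpy sketch of this intercept estimate (tolerance and variable names are ours), averaging the complementarity conditions over the unbounded support vectors; it assumes at least one multiplier lies strictly between $0$ and $\alpha$:

```python
import numpy as np

def svr_intercept(y, K, lam, lam_s, alpha, eps, tol=1e-6):
    """Estimate b from the complementarity conditions, averaged over the
    support vectors with 0 < lambda_j < alpha (and likewise for lambda*)."""
    d = lam - lam_s
    fx = K @ d                               # f_hat(x_j) - b at training points
    J0 = (lam > tol) & (lam < alpha - tol)   # y_j - f_hat(x_j) - eps = 0 here
    J0s = (lam_s > tol) & (lam_s < alpha - tol)
    num = (lam[J0] * (y[J0] - fx[J0] - eps)).sum() \
        + (lam_s[J0s] * (y[J0s] - fx[J0s] + eps)).sum()
    den = lam[J0].sum() + lam_s[J0s].sum()
    return num / den
```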
\section*{Example: SV Regression in Action}
The following picture shows $n=30$ samples (the circles) from the true
regression line (the red curve) with gaussian error and $\sigma = 0.5$.
The green curve is the estimated regression line computed using a gaussian
kernel. The blue curves show the $\epsilon = 0.4$ insensitive tube around
the estimate. The support vectors are marked with plus signs and a value
of $\alpha = 1.5$ was used. The Maple(9.5) code is available from this site
and uses the QPsolve program in efficient matrix form from
the new Maple optimization package.
\begin{center}
%make sure you have file.pdf and file.gif
\includegraphics[scale=0.5]{svreg}
\end{center}
\end{document}