
Regression
Carlos C. Rodríguez
http://omega.albany.edu:8008/
Classical Regression
Recall the classical regression problem. Given observed data
(x_1, y_1), …, (x_n, y_n), iid as a vector
(X, Y) ∈ R^{d+1}, we want to estimate f*(x) = E(Y | X = x), i.e., the regression
function of Y on X. When the vector (X, Y) is multivariate gaussian
the regression function is f*(x) = a + L(x) with L(x) linear,
and the Ordinary (yak!) Least Squares (OLS) estimator coincides with the MLE
(Maximum Likelihood Estimator). Very often the distribution of (X, Y)
is not explicitly known beyond the n observations and the available
prior information about the meaning of these data. A typical assumption
is to think of the y_j as the result of sampling the regression
function at the x_j with gaussian measurement error. The model is
that conditionally on x_1, …, x_n the values of y_1, …, y_n
are independent, with Y_j depending on X_j only and
Y_j | X_j = x_j being N(f*(x_j), σ²). Thus, for
j = 1, 2, …, n,

  y_j = f*(x_j) + e_j

where e_1, e_2, …, e_n are iid N(0, σ²).
In this way the distribution of (X, Y) is modeled semiparametrically
by (f, σ), where f ∈ F is a function in some space of functions F
and σ > 0 is a scalar parameter.
When F is taken as the m-dimensional space F_m generated by functions
g_1(x), …, g_m(x), estimation of the regression function reduces to
the linear optimization problem,

  f̂ = arg min_{f ∈ F_m} ∑_{i=1}^{n} (y_i - f(x_i))²

The solution is then given as the orthogonal (euclidean) projection of
the observed vector y^T = (y_1, y_2, …, y_n) onto the
space generated by the columns of the matrix G ∈ R^{n×m}
with entries g_{ij} = g_j(x_i). In fact, the above optimization problem
can be written as,

  ||y - ŷ||² = min_{w ∈ R^m} ||y - Gw||²

Geometrically, the rejection vector (i.e., y minus its projection)
must be orthogonal to the linear space generated by the columns of G,
in particular (and equivalently) to each of these columns, giving the
standard set of normal equations,

  0 = G^T (y - ŷ) = G^T (y - G ŵ)

with solution,

  ŵ = (G^T G)^{-1} G^T y

The more general case of Weighted Least Squares (WLS), corresponding to
the inner product ⟨x, z⟩ = x^T A z generated by a symmetric positive
definite matrix A, is just,

  ŵ = (G^T A G)^{-1} G^T A y.

The matrix A encodes the covariance structure of the measurement errors
e_1, …, e_n; typically A is taken to be the inverse of their covariance matrix.
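Both estimators are a single linear solve. A minimal numpy sketch, on hypothetical data with an assumed basis g_1(x) = 1, g_2(x) = x, and an assumed weight matrix A (here a multiple of the identity, so WLS and OLS must agree):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=x.size)  # noisy line, iid errors

# n x m design matrix G with entries g_ij = g_j(x_i) for the basis {1, x}
G = np.column_stack([np.ones_like(x), x])

# OLS: w_hat = (G^T G)^{-1} G^T y (solve the normal equations; don't invert)
w_ols = np.linalg.solve(G.T @ G, G.T @ y)

# WLS for the inner product <u, v> = u^T A v; A = 4I rescales every equation
# the same way, so the minimizer is unchanged
A = 4.0 * np.eye(x.size)
w_wls = np.linalg.solve(G.T @ A @ G, G.T @ A @ y)
```

With a non-constant diagonal A (e.g., the precisions of heteroscedastic errors) the two estimates would differ.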
Overfitting and Kernel Regression
How should F be chosen? On the one hand, we would like F to be big so
as not to constrain the form of the true regression function too much.
On the other hand, big F's make the task of searching for the best
f ∈ F more difficult and, more importantly, without a constraint on
the explanatory capacity of F the solution will show no power of generalization.
A big enough F will always have at least one member f able to fit all the
observations perfectly, without error, but this f provides no assurance
that f(x) is not as bad as it can possibly be at any point x not
in the training set. To be able to assure that the size of the mistake
on future data will not exceed a given value with high probability
(i.e., to have PAC bounds) we must constrain the capacity of F somehow.
Over the years, statisticians and numerical analysts have invented all kinds
of ad hoc devices for achieving this goal. These are known as regularization
methods. They boil down to adding a penalty term to the OLS empirical
term, often of the form Ω(||f||) where Ω is an increasing
function and F is assumed to be a space with a norm. The problem to
be solved becomes,

  min_{f ∈ F} ∑_{i=1}^{n} (y_i - f(x_i))²   subject to   ||f|| ≤ r_n

where the sequence of radii r_n → ∞ as n → ∞,
but not too quickly (at a given rate that depends on F), so that some
form of asymptotic stochastic convergence of the solution f_n towards the
projection of the true regression function f* onto F is achieved.
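The constrained problem above has an equivalent penalized form. A small numpy sketch of that form, with an assumed polynomial basis and an assumed penalty weight lam; the ridge penalty keeps the coefficient norm small, while plain OLS on the same rich basis is free to chase the noise:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

# Degree-12 polynomial basis: rich enough to fit the noise almost perfectly
G = np.vander(x, 13, increasing=True)

# Penalized problem: min ||y - G w||^2 + lam ||w||^2  (ridge regression)
lam = 1e-2
w_ridge = np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ y)

# Unpenalized OLS on the same basis, for comparison
w_ols = np.linalg.lstsq(G, y, rcond=None)[0]
```

Ridge shrinks every coefficient toward zero, so the norm of w_ridge (the effective "radius") is strictly smaller than the norm of the OLS solution.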
Kernel Regression
Reproducing Kernel Hilbert Spaces (RKHS) provide convenient choices for F.
Theorem:

Let K be a Mercer kernel and let H_K be the associated RKHS.
If

  C((x_1, y_1, f(x_1)), …, (x_n, y_n, f(x_n)))

is any cost function that depends on f ∈ H_K only through the
values f(x_j) at the observed x_j, then the minimizer of

  U(f) = C((x_1, y_1, f(x_1)), …, (x_n, y_n, f(x_n))) + Ω(||f||)

where Ω is an increasing function, is always achieved at a point
f_n ∈ H_K of the form,

  f_n(x) = ∑_{j=1}^{n} w_j K(x_j, x).

Thus, when F = H_K, a big fat infinite dimensional space, the regularized
empirical cost U = C + Ω is minimized by solving a classic regression
problem with F_n = span{K(x_1, ·), …, K(x_n, ·)}.
Proof: The proof is surprisingly simple. Every f ∈ H_K
can be written as f = g + h where g ∈ F_n and h ∈ F_n^⊥.
We show that U(f) ≥ U(g), with equality if and only if h = 0. This follows
easily from the reproducing property of the kernel spaces. For all j ≤ n,

  f(x_j) = ⟨K(x_j, ·), g + h⟩ = ⟨K(x_j, ·), g⟩ = g(x_j)

since h ⊥ F_n by hypothesis. Thus, C(f) = C(g). On the other hand,
since Ω is strictly increasing and g ⊥ h, by the pythagorean
theorem we have,

  Ω(||f||) = Ω( (||g||² + ||h||²)^{1/2} ) ≥ Ω(||g||)

with equality if and only if h = 0. Hence,

  U(f) = C(g) + Ω(||g + h||) ≥ C(g) + Ω(||g||) = U(g). ∎
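For the particular choice of squared-error cost and Ω(||f||) = λ||f||², the finite problem delivered by the theorem is kernel ridge regression: writing K for the n×n matrix of values K(x_i, x_j), the coefficients solve (K + λI)w = y. A numpy sketch with an assumed gaussian kernel and assumed values of γ and λ:

```python
import numpy as np

def rbf_kernel(s, t, gamma=10.0):
    # Gaussian Mercer kernel K(s, t) = exp(-gamma (s - t)^2) for 1-d inputs
    return np.exp(-gamma * (np.asarray(s)[:, None] - np.asarray(t)[None, :]) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)

# Representer theorem: f_n = sum_j w_j K(x_j, .); for squared-error cost
# plus lam ||f||^2 the coefficients solve (K + lam I) w = y
lam = 1e-2
K = rbf_kernel(x, x)
w = np.linalg.solve(K + lam * np.eye(x.size), y)

def f_n(t):
    # Evaluate the fitted function at new points t
    return rbf_kernel(np.atleast_1d(t), x) @ w
```

The infinite-dimensional search over H_K has collapsed to solving one n×n linear system.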
Support Vector Regression
For given values α > 0 and ε > 0 define the empirical cost
function,

  C = α ∑_{i=1}^{n} |y_i - f(x_i)|_ε

where,

  |z|_ε = max{0, |z| - ε}

is known as the ε-insensitive function, and take,

  Ω(||f||) = (1/2) ||f||².

With these choices, kernel regression becomes support vector regression. The
parameter ε controls the sparsity of the solution. The
smoothing parameter α controls the importance of
the empirical cost C relative to the complexity penalty Ω.
The derivation of the support vector regression problem follows closely
the derivation of support vector machines for classification.
We first set up a primal optimization problem for minimizing the above
ε-insensitive regularized empirical cost over functions
f(x) = ⟨w, x⟩ + b, for the euclidean inner product.
Then we consider the dual problem. This turns out to be a simple
quadratic programming problem that depends on the observed data
only through the values ⟨x_i, x_j⟩. Just as in the classification
case, we can apply the kernel trick and reap the benefits of
nonlinear kernel regression at the linear regression cost!
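The ε-insensitive function is the only new ingredient, so it is worth seeing in isolation. A minimal sketch (the tube half-width eps = 0.4 matches the example at the end of these notes):

```python
import numpy as np

def eps_insensitive(z, eps=0.4):
    # |z|_eps = max{0, |z| - eps}: residuals inside the tube cost nothing;
    # outside the tube the cost grows linearly, as with an L1 loss
    return np.maximum(0.0, np.abs(z) - eps)

residuals = np.array([-1.0, -0.2, 0.0, 0.3, 0.9])
print(eps_insensitive(residuals))
```

Only the two residuals outside the tube (±1.0 and 0.9) contribute to the cost; this is what makes the eventual solution sparse.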
The Primal Problem for SV Regression
We seek the solution of,

  minimize   α ∑_{i=1}^{n} |y_i - f(x_i)|_ε + (1/2) ∑_{i=1}^{n} w_i²

over b, w when f(x) = ⟨w, x⟩ + b. This is equivalent to,

  minimize   α ∑_{i=1}^{n} u_i + (1/2) ∑_{i=1}^{n} w_i²

over u_i, w_i, b
subject to: u_i ≥ |y_i - f(x_i)|_ε for i ≤ n. Each
of the last n inequalities corresponds to three inequalities,

  u_i ≥ 0,   u_i ≥ y_i - f(x_i) - ε,   u_i ≥ f(x_i) - y_i - ε

Applying the standard trick of adding nonnegative slack variables
ξ_i and ξ*_i we soften the inequalities and allow
small violations. So we replace the above constrained optimization
problem with,

  minimize   α ∑_{i=1}^{n} u_i + (α/2) ∑_{i=1}^{n} (ξ_i + ξ*_i) + (1/2) ∑_{i=1}^{n} w_i²

subject to: for i ≤ n,

  y_i - f(x_i) - ε ≤ u_i + ξ_i
  f(x_i) - y_i - ε ≤ u_i + ξ*_i
  u_i ≥ 0,   ξ_i ≥ 0,   ξ*_i ≥ 0.

The objective function was chosen so that we can factor out
α/2 and write,

  (α/2) ∑_{i=1}^{n} ( (u_i + ξ_i) + (u_i + ξ*_i) ) + (1/2) ∑_{i=1}^{n} w_i²

In this way we can get rid of the u_i by just replacing u_i + ξ_i by
ξ_i and u_i + ξ*_i by ξ*_i everywhere. Also, replace
α/2 by a new α to obtain the problem:

  (Primal)   minimize   α ∑_{i=1}^{n} (ξ_i + ξ*_i) + (1/2) ∑_{i=1}^{n} w_i²

over w_i, b, ξ_i, ξ*_i subject to: for i ≤ n,

  y_i - f(x_i) - ε ≤ ξ_i
  f(x_i) - y_i - ε ≤ ξ*_i
  ξ_i ≥ 0,   ξ*_i ≥ 0.
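The primal is a convex quadratic program that any general-purpose solver can handle. A sketch with scipy's SLSQP (not the Maple QPsolve code these notes refer to) on hypothetical 1-d data with assumed values ε = 0.2 and α = 1.5; the variables are packed as z = [w, b, ξ_1..ξ_n, ξ*_1..ξ*_n]:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([0.0, 1.0, 2.0, 3.0])      # hypothetical toy data
y = np.array([0.1, 1.1, 1.9, 3.2])
n, eps, alpha = x.size, 0.2, 1.5

def objective(z):
    # alpha * sum(xi + xi*) + (1/2) w^2 for a scalar w (d = 1)
    w, xi, xis = z[0], z[2:2 + n], z[2 + n:]
    return alpha * np.sum(xi + xis) + 0.5 * w ** 2

def ineq(z):
    # scipy convention: every entry must be >= 0 at a feasible point
    w, b, xi, xis = z[0], z[1], z[2:2 + n], z[2 + n:]
    f = w * x + b
    return np.concatenate([xi - (y - f - eps), xis - (f - y - eps)])

bounds = [(None, None)] * 2 + [(0.0, None)] * (2 * n)   # xi, xi* >= 0
res = minimize(objective, np.zeros(2 + 2 * n), method="SLSQP",
               bounds=bounds, constraints={"type": "ineq", "fun": ineq})
w_hat, b_hat = res.x[0], res.x[1]
```

On this toy data the penalty on w² flattens the fit until two residuals sit exactly on the tube boundary, giving roughly f(x) = 0.9 x + 0.3 with all slacks zero.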
The Dual Problem for SV Regression
The Lagrangian, in terms of nonnegative Lagrange multipliers λ_i, λ*_i, β_i, β*_i, is,

  L = α ∑_i (ξ_i + ξ*_i) + (1/2) ∑_i w_i²
      + ∑_i λ_i { y_i - f(x_i) - ε - ξ_i }
      + ∑_i λ*_i { f(x_i) - y_i - ε - ξ*_i }
      - ∑_i β_i ξ_i - ∑_i β*_i ξ*_i.

To compute the dual we need to find,

  W(λ, λ*, β, β*) = min_{w, b, ξ, ξ*} L

The values of w, b, ξ, ξ* where the minimum is achieved must satisfy,

  ∂L/∂w = 0    ⇔   w = ∑_i (λ_i - λ*_i) x_i
  ∂L/∂b = 0    ⇔   ∑_i λ_i = ∑_i λ*_i
  ∂L/∂ξ_j = 0   ⇔   λ_j + β_j = α   for j ≤ n
  ∂L/∂ξ*_j = 0  ⇔   λ*_j + β*_j = α   for j ≤ n.

Replacing these equations into L, all the terms involving ξ_i
and ξ*_i disappear from L, and with them β and β*.
Therefore, W is only a function of λ and λ*. Replacing
K(x_i, x_j) for the inner products ⟨x_i, x_j⟩ (the kernel trick!), we get,

  W = (1/2) ∑_{i,j} (λ_i - λ*_i)(λ_j - λ*_j) K(x_i, x_j)
      + ∑_i (λ_i - λ*_i) y_i - ε ∑_i (λ_i + λ*_i)
      - ∑_{i,j} (λ_i - λ*_i)(λ_j - λ*_j) K(x_i, x_j)
The first and last terms simplify to produce,

  W(λ, λ*) = - ε ∑_i (λ_i + λ*_i)
             - (1/2) ∑_{i,j} (λ_i - λ*_i)(λ_j - λ*_j) K(x_i, x_j)
             + ∑_i (λ_i - λ*_i) y_i.
The dual problem becomes,

  (Dual)   max_{λ, λ*}   W(λ, λ*)

subject to:

  ∑_j λ_j = ∑_j λ*_j
  0 ≤ λ_j ≤ α,   0 ≤ λ*_j ≤ α,

where we have replaced the equalities λ_j + β_j = α,
involving λ_j ≥ 0 and β_j ≥ 0, by the equivalent inequalities
shown above, which do not involve the βs.
As was the case for classification, the dual problem is a simple quadratic
programming problem that can be solved with efficient algorithms that are
publicly available.
The solution from the QP solver is then used to produce the estimate,

  ŵ = ∑_i (λ_i - λ*_i) x_i

The KKT complementarity conditions for the slack variables are
of the type ξ_j (λ_j - α) = 0, so we can write the
other complementarity conditions as follows,

  λ_j { y_j - f̂(x_j) - ε } = 0
  λ*_j { f̂(x_j) - y_j - ε } = 0

which are valid (and nontrivial) for all
j ∈ J_0 and j ∈ J*_0 (resp.), where

  J_0 = { j : j ≤ n and 0 < λ_j < α }

with J*_0 defined analogously.
These, together with the complementarity conditions
associated to the inequalities λ_j ≤ α and
λ*_j ≤ α, force many of the λ_j and λ*_j
to be either 0 or α, producing a sparse solution. The value
for b can be obtained from any of the above complementarity conditions,
but a more accurate value is obtained by combining efforts. Replacing
the estimated values of the regression function at the training points,

  f̂(x_j) = ∑_i (λ_i - λ*_i) K(x_i, x_j) + b

into the complementarity conditions, solving for b, multiplying through
by λ_j and λ*_j, and adding over j ∈ J with
J = J_0 ∪ J*_0, we get,

  ∑_{j ∈ J} λ_j { y_j - ∑_i (λ_i - λ*_i) K(x_i, x_j) } = (b + ε) ∑_{j ∈ J} λ_j
  ∑_{j ∈ J} λ*_j { y_j - ∑_i (λ_i - λ*_i) K(x_i, x_j) } = (b - ε) ∑_{j ∈ J} λ*_j

Adding the two equations and solving for b, we finally obtain the estimate

  b = [ ∑_{j ∈ J} (λ_j + λ*_j) { y_j - ∑_i (λ_i - λ*_i) K(x_i, x_j) }
        - ε ( ∑_{j ∈ J} λ_j - ∑_{j ∈ J} λ*_j ) ] / ∑_{j ∈ J} (λ_j + λ*_j)
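The whole pipeline, solve the dual, read off ŵ, recover b from the complementarity conditions, can be sketched numerically. This uses scipy's general-purpose SLSQP solver rather than a dedicated QP code (these notes use Maple's QPsolve), with the linear kernel K(x_i, x_j) = ⟨x_i, x_j⟩ and assumed toy data and parameters:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([0.0, 1.0, 2.0, 3.0])      # hypothetical toy data
y = np.array([0.1, 1.1, 1.9, 3.2])
n, eps, alpha = x.size, 0.2, 1.5
K = np.outer(x, x)                      # linear kernel <x_i, x_j>

def neg_W(z):                           # maximizing W == minimizing -W
    lam, lams = z[:n], z[n:]
    d = lam - lams
    return 0.5 * d @ K @ d + eps * np.sum(lam + lams) - d @ y

cons = {"type": "eq", "fun": lambda z: np.sum(z[:n]) - np.sum(z[n:])}
res = minimize(neg_W, np.full(2 * n, 0.1), method="SLSQP",
               bounds=[(0.0, alpha)] * (2 * n), constraints=cons)
lam, lams = res.x[:n], res.x[n:]
d = lam - lams

# Recover b by averaging the complementarity conditions over multipliers
# strictly inside (0, alpha); tol guards against round-off at the bounds
tol = 1e-3
b_terms = ([y[j] - K[j] @ d - eps for j in range(n) if tol < lam[j] < alpha - tol]
           + [y[j] - K[j] @ d + eps for j in range(n) if tol < lams[j] < alpha - tol])
b = float(np.mean(b_terms))
w_hat = float(d @ x)                    # w = sum_i (lam_i - lam*_i) x_i
```

The multipliers with 0 < λ_j < α (or 0 < λ*_j < α) mark the support vectors sitting exactly on the ε-tube; on this data the dual recovers roughly ŵ = 0.9 and b = 0.3, matching a direct solve of the primal.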



Example: SV Regression in Action
The following picture shows n=30 samples (the circles) from the true
regression line (the red curve) with gaussian error and s = 0.5.
The green curve is the estimated regression line computed using a gaussian
kernel. The blue curves show the e = 0.4 insensitive tube around
the estimate. The support vectors are marked with plus signs and a value
of a = 1.5 was used. The Maple(9.5) code is available from this site
and uses the QPsolve program in efficient matrix form from
the new Maple optimization package.
File translated from TeX by TTH, version 3.63. On 1 Nov 2004, 15:37.
