Lecture VI
An Introduction to Markov Chain Monte Carlo
Abstract
Neural Networks as a way to specify nonparametric regression and
classification models.
Feed Forward Neural Net Models
Feed forward Neural Nets are also known as multilayer perceptrons or
backpropagation networks. The figure shows a network with a layer of 4 hidden
units.
Figure 1: A Feed Forward Network
The outputs are computed from the following formulas,
$$ g_k(x) = b_k + \sum_j v_{jk} h_j(x) \qquad (1) $$
$$ h_j(x) = \tanh\Big( a_j + \sum_i u_{ij} x_i \Big) \qquad (2) $$
where $\{a_j\}, \{b_k\}, \{u_{ij}\}, \{v_{jk}\}$ are the parameters of the
network. The parameters with one index are known as biases and those with two
indices are known as weights. We assume that
$x = (x_1,\ldots,x_d) \in \mathbb{R}^d$,
$h(x) = (h_1(x),\ldots,h_l(x)) \in \mathbb{R}^l$,
$g(x) = (g_1(x),\ldots,g_p(x)) \in \mathbb{R}^p$. The hyperbolic tangent,
$$ \tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} = \frac{1-e^{-2z}}{1+e^{-2z}} $$
Figure 2: Hyperbolic Tangent
is an example of a sigmoid function. A sigmoid is a nonlinear function,
$s(z)$, that goes through the origin, approaches $+1$ as $z \to \infty$
and approaches $-1$ as $z \to -\infty$.
It has been known only since 1989 that, as the number of hidden units
increases, any continuous function defined on a compact set can be
approximated arbitrarily well by linear combinations of sigmoids.
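As a minimal sketch of equations (1) and (2), the forward pass of such a
network can be written in a few lines of plain Python. The function names, the
nested-list layout (u[i][j] for input-to-hidden weights, v[j][k] for
hidden-to-output weights) and the toy parameter values are assumptions made
for illustration only.

```python
import math

def hidden_units(x, a, u):
    """h_j(x) = tanh(a_j + sum_i u[i][j] * x_i), as in equation (2)."""
    return [math.tanh(a[j] + sum(u[i][j] * x[i] for i in range(len(x))))
            for j in range(len(a))]

def outputs(x, a, b, u, v):
    """g_k(x) = b_k + sum_j v[j][k] * h_j(x), as in equation (1)."""
    h = hidden_units(x, a, u)
    return [b[k] + sum(v[j][k] * h[j] for j in range(len(h)))
            for k in range(len(b))]

# Hypothetical toy network: d = 2 inputs, l = 4 hidden units, p = 1 output.
a = [0.0, 0.5, -0.5, 1.0]          # hidden biases a_j
b = [0.1]                          # output biases b_k
u = [[1.0, -1.0, 0.5, 0.0],        # u[i][j]: weight from input i to hidden j
     [0.0, 2.0, -0.5, 1.0]]
v = [[0.5], [-0.25], [1.0], [0.75]]  # v[j][k]: weight from hidden j to output k

g_zero = outputs([0.0, 0.0], a, b, u, v)   # at x = 0 only the biases matter
print(g_zero)
```

At x = 0 each hidden unit reduces to tanh(a_j), so the output depends only on
the biases and the hidden-to-output weights, which gives a quick sanity check
on the indexing.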
Multilayer perceptrons are often used as flexible models for nonparametric
regression and classification. Given data,
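$$ (x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \ldots, (x^{(n)},y^{(n)}) $$
with,
$$ y^{(k)} = g(x^{(k)},\theta) + \epsilon^{(k)} $$
where
$$ \epsilon^{(1)},\epsilon^{(2)},\ldots,\epsilon^{(n)} \text{ are iid with } E\,\epsilon^{(k)} = 0 $$
Hence, $g$ is the regression of $y$ on $x$, i.e.,
$$ E(y \mid x,\theta) = g(x,\theta) \quad \text{with } \theta \in \Theta $$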

The multilayer perceptrons provide a practical way to define the functions
$g$ with high dimensional parameter spaces $\Theta$. We take
$\theta = \{ \{a_j\},\{b_k\},\{u_{ij}\},\{v_{jk}\} \}$. The objective is
to find the predictive distribution of a new target vector $y$, given the
examples $D = ((x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),\ldots,(x^{(n)},y^{(n)}))$
and the new vector of inputs $x$, i.e.,
$$ f(y \mid x, D) = \int f(y \mid x,\theta)\, \pi(\theta \mid D)\, d\theta $$

Under the assumption of quadratic loss, the best guess for $y$ will be its
mean,
$$ \hat{y} = E(y \mid x,D) = \int g(x,\theta)\, \pi(\theta \mid D)\, d\theta $$
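The claim that the predictive mean is the best guess under quadratic loss
follows from a standard decomposition, sketched here for an arbitrary guess
$c$ and assuming that $y$ has finite second moments given $x$ and $D$ (the
symbol $c$ is introduced only for this illustration):

```latex
\begin{aligned}
E\big(\|y - c\|^2 \mid x, D\big)
  &= E\big(\|y - \hat{y}\|^2 \mid x, D\big)
     + 2\, E\big(y - \hat{y} \mid x, D\big)^{T} (\hat{y} - c)
     + \|\hat{y} - c\|^2 \\
  &= E\big(\|y - \hat{y}\|^2 \mid x, D\big) + \|\hat{y} - c\|^2 ,
\end{aligned}
```

since $E(y - \hat{y} \mid x, D) = 0$ when $\hat{y} = E(y \mid x, D)$. The
first term does not depend on $c$, so the expected loss is minimized at
$c = \hat{y}$.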

These estimates can be approximated by MCMC by sampling
$\theta^{(1)},\ldots,\theta^{(N)}$ from the posterior
and then computing empirical averages,
$$ \hat{y}_N = \frac{1}{N} \sum_{j=1}^{N} g(x,\theta^{(j)}) $$
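The empirical average above can be illustrated with a very small sampler. The
sketch below uses a random-walk Metropolis chain, which is only one of many
possible MCMC schemes, and it rests on several assumptions that are not in the
lecture: a network with one input, two hidden units and one output, Gaussian
noise with known standard deviation, an independent N(0, 1) prior on every
parameter (chosen purely for simplicity), and synthetic data. All names are
hypothetical.

```python
import math
import random

L_HIDDEN = 2  # assumed network size: 1 input, 2 hidden units, 1 output

def g(x, theta):
    """Scalar network output b + sum_j v_j * tanh(a_j + u_j * x)."""
    a = theta[0:L_HIDDEN]
    u = theta[L_HIDDEN:2 * L_HIDDEN]
    v = theta[2 * L_HIDDEN:3 * L_HIDDEN]
    b = theta[3 * L_HIDDEN]
    return b + sum(v[j] * math.tanh(a[j] + u[j] * x) for j in range(L_HIDDEN))

def log_posterior(theta, data, noise_sd=0.1, prior_sd=1.0):
    """Unnormalized log posterior: Gaussian likelihood plus Gaussian prior."""
    log_lik = -sum((y - g(x, theta)) ** 2 for x, y in data) / (2 * noise_sd ** 2)
    log_prior = -sum(t * t for t in theta) / (2 * prior_sd ** 2)
    return log_lik + log_prior

def metropolis(data, n_samples, step=0.05, burn_in=2000, seed=0):
    """Random-walk Metropolis; returns n_samples draws kept after burn-in."""
    rng = random.Random(seed)
    theta = [0.0] * (3 * L_HIDDEN + 1)
    current = log_posterior(theta, data)
    samples = []
    for it in range(burn_in + n_samples):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        candidate = log_posterior(proposal, data)
        diff = candidate - current
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if diff >= 0 or rng.random() < math.exp(diff):
            theta, current = proposal, candidate
        if it >= burn_in:
            samples.append(theta)
    return samples

def predictive_mean(x, samples):
    """Empirical average (1/N) * sum_j g(x, theta_j) over the sampled thetas."""
    return sum(g(x, th) for th in samples) / len(samples)

# Synthetic, noise-free data from a curve the network can represent.
data = [(k / 5, 0.5 * math.tanh(2 * k / 5)) for k in range(-5, 6)]
samples = metropolis(data, n_samples=2000)
y_hat = predictive_mean(0.5, samples)
print(y_hat)
```

A chain this short is not a serious approximation to the posterior; the step
size, burn-in length and chain length would all need tuning and convergence
checks in practice.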

Useful Priors on Feed Forward Networks
In the absence of specific information, the following assumptions about the
prior $\pi(\theta)$ are reasonable,
- The components of $\theta$ are independent and symmetric about 0.
- Parameters of the same kind have the same a priori distributions, i.e.,
$$ u_{1j},u_{2j},\ldots \text{ iid for all } j $$
$$ v_{1k},v_{2k},\ldots \text{ iid for all } k $$

With these assumptions, if $\mathrm{var}_{\pi}(v_{jk}) = l^{-1}\sigma_v^2 < \infty$
then by the Central Limit Theorem, as $l \to \infty$ the prior on the output
units converges to a Gaussian process. Gaussian processes are characterized
by their covariance functions and they are often considered inadequate for
modeling complex interdependence of the outputs. To avoid the Gaussian trap,
it is convenient to use a priori distributions for the components of $\theta$
that have infinite variance.
A practical choice (used by Neal) is to take $v_{jk}$ distributed as a
t-distribution,
$$ \pi(v_{jk}) \propto \left( 1 + \frac{v_{jk}^2}{\alpha \sigma_v^2} \right)^{-(\alpha+1)/2} \quad \text{with } 0 < \alpha < 2 $$

Furthermore, if we take $\sigma_v = \omega_v l^{-1/\alpha}$ then the
resulting prior will converge as $l \to \infty$ to a symmetric stable
process of index $\alpha$.
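The effect of this scaling can be explored numerically. The sketch below
draws hidden-to-output weights from a t-distribution with $\alpha$ degrees of
freedom and scale $\omega_v l^{-1/\alpha}$, then sums $v_j h_j$ over $l$ hidden
units. The hidden-unit values are simulated as tanh of a standard normal,
which is a stand-in assumed for illustration and not a prior taken from the
lecture; the function names and the values of $\alpha$ and $\omega_v$ are
likewise hypothetical.

```python
import math
import random

def sample_t(alpha, scale, rng):
    """Draw scale * T, where T has a t-distribution with alpha degrees of freedom.

    The density is proportional to
    (1 + v**2 / (alpha * scale**2)) ** (-(alpha + 1) / 2).
    """
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(alpha / 2.0, 2.0)  # chi-square, alpha degrees of freedom
    return scale * z / math.sqrt(chi2 / alpha)

def prior_output_draw(l, alpha, omega_v, rng):
    """One prior draw of sum_j v_j * h_j with sigma_v = omega_v * l**(-1/alpha)."""
    sigma_v = omega_v * l ** (-1.0 / alpha)
    total = 0.0
    for _ in range(l):
        h = math.tanh(rng.gauss(0.0, 1.0))  # assumed stand-in for h_j(x)
        total += sample_t(alpha, sigma_v, rng) * h
    return total

def median_abs(values):
    """Median of the absolute values (upper median for even-length input)."""
    ordered = sorted(abs(v) for v in values)
    return ordered[len(ordered) // 2]

rng = random.Random(0)
medians = {}
for l in (10, 100, 1000):
    draws = [prior_output_draw(l, alpha=1.5, omega_v=1.0, rng=rng) for _ in range(200)]
    medians[l] = median_abs(draws)
print(medians)
```

If the scaling is right, the typical size of the output (measured here by the
median absolute value, since the variance is infinite) should stay on roughly
the same scale as $l$ grows rather than blowing up or shrinking to zero.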
Recall that $Z_1,Z_2,\ldots,Z_n$ iid with distribution symmetric about
0 are said to be stable of index $\alpha$ if
$$ \frac{Z_1 + \cdots + Z_n}{n^{1/\alpha}} \text{ has the same law as } Z_1 $$
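Two familiar cases make the definition concrete; the check below uses
characteristic functions, with $\varphi$ and $c$ introduced only for this
illustration. If $Z_1$ has characteristic function
$\varphi(t) = \exp(-c|t|^{\alpha})$ for some $c > 0$, then

```latex
E \exp\!\Big( i t \, \frac{Z_1 + \cdots + Z_n}{n^{1/\alpha}} \Big)
  = \varphi\big( t / n^{1/\alpha} \big)^{n}
  = \exp\!\Big( -c \, n \, \frac{|t|^{\alpha}}{n} \Big)
  = \exp\big( -c |t|^{\alpha} \big)
  = \varphi(t).
```

Taking $\alpha = 2$ gives the centered Gaussian, for which the sum is
normalized by $\sqrt{n}$, and $\alpha = 1$ gives the Cauchy distribution, for
which the plain average of $n$ observations has the same law as a single one.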

A distribution is said to be in the domain of attraction of a stable law
if properly normalized sums of independent observations from this
distribution converge in law to a stable distribution. Hence, by the Central
Limit Theorem, distributions with finite variance are in the domain of
attraction of the Gaussians. It is also well known that distributions with
tails going to zero as $|x|^{-(\alpha+1)}$ as $|x| \to \infty$ are in the
domain of attraction of stable laws of index $\alpha$, which justifies the
choice of the t-distribution above.
Postulating an Energy for the Net
An alternative approach without priors (apparently...) is to postulate
directly an energy function for the network, e.g.,
$$ E(\theta,\gamma) = \frac{1}{n} \sum_{k=1}^{n} L(y^{(k)}, g(x^{(k)},\theta)) + \gamma \|\theta\|^2 $$

where $L(y,z)$ is the assumed loss when we estimate $y$ with $z$. Typical
choices are $L(y,z) = R(\|y-z\|)$ for some nondecreasing function $R$ and
some norm $\|\cdot\|$. Then choose $\theta$ to minimize this energy function.
Often, the smoothness parameter $\gamma > 0$ is chosen by cross-validation or
by plain trial and error.
For complicated multimodal energy functions, a combination of simulated
annealing with a classical gradient method (such as conjugate gradients) has
been the most successful.
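To make the energy function concrete, here is a minimal sketch that evaluates
it with squared-error loss and minimizes it for a fixed $\gamma$. It
deliberately uses plain gradient descent with finite-difference gradients and
step halving, not the simulated annealing and conjugate gradient combination
mentioned above, so it will only find a local minimum. The one-hidden-unit
network, the synthetic data, the starting point and all function names are
assumptions for illustration.

```python
import math

def g(x, theta):
    """One-hidden-unit network b + v * tanh(a + u * x), theta = [a, u, v, b]."""
    a, u, v, b = theta
    return b + v * math.tanh(a + u * x)

def energy(theta, gamma, data):
    """Mean squared-error loss plus gamma times the squared norm of theta."""
    loss = sum((y - g(x, theta)) ** 2 for x, y in data) / len(data)
    return loss + gamma * sum(t * t for t in theta)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation to the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def minimize(data, gamma, theta0, rate=0.1, steps=500):
    """Gradient descent that only accepts steps which lower the energy."""
    def f(t):
        return energy(t, gamma, data)
    theta = list(theta0)
    for _ in range(steps):
        grad = numerical_gradient(f, theta)
        trial = [t - rate * d for t, d in zip(theta, grad)]
        if f(trial) < f(theta):
            theta = trial
        else:
            rate *= 0.5  # step too long: shrink it and retry from the same point
    return theta

# Synthetic data and starting point, chosen only for the demonstration.
data = [(k / 4, math.sin(k / 4)) for k in range(-8, 9)]
theta0 = [0.1, 0.5, 0.5, 0.0]
theta_hat = minimize(data, gamma=0.01, theta0=theta0)
print(energy(theta0, 0.01, data), energy(theta_hat, 0.01, data))
```

Repeating the minimization over a grid of $\gamma$ values and comparing the
loss on held-out data is one way to carry out the cross-validation mentioned
above.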
File translated from TeX by TtH, version 2.32.
On 5 Jul 1999, 22:59.