# Lecture VI

## Abstract

Neural Networks as a way to specify nonparametric regression and classification models.

## Feed Forward Neural Net Models

Feed forward neural nets are also known as multilayer perceptrons or backpropagation networks. Figure 1 shows a network with one layer of 4 hidden units.

Figure 1: A Feed Forward Network

The outputs are computed from the following formulas,

$$
g_k(x) = b_k + \sum_j v_{jk}\, h_j(x) \tag{1}
$$

$$
h_j(x) = \tanh\Big(a_j + \sum_i u_{ij}\, x_i\Big) \tag{2}
$$

where $\{a_j\},\{b_k\},\{u_{ij}\},\{v_{jk}\}$ are the parameters of the network. The parameters with one index are known as biases and those with two indices are known as weights. We assume that $x = (x_1,\ldots,x_d) \in \mathbb{R}^d$, $h(x) = (h_1(x),\ldots,h_l(x)) \in \mathbb{R}^l$, $g(x) = (g_1(x),\ldots,g_p(x)) \in \mathbb{R}^p$. The hyperbolic tangent,

$$
\tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} = \frac{1-e^{-2z}}{1+e^{-2z}}
$$

Figure 2: Hyperbolic Tangent

is an example of a sigmoid function. A sigmoid is a non-linear function, $s(z)$, that goes through the origin, approaches $+1$ as $z \to \infty$ and approaches $-1$ as $z \to -\infty$.
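Equations (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration rather than an efficient implementation; the function names `hidden` and `forward`, the nested-list layout (`u[i][j]` for input `i` to hidden unit `j`, `v[j][k]` for hidden unit `j` to output `k`), and the toy parameter values are assumptions made here, not part of the lecture.

```python
import math

def hidden(x, a, u):
    """Equation (2): h_j(x) = tanh(a_j + sum_i u[i][j] * x_i)."""
    return [
        math.tanh(a[j] + sum(u[i][j] * x[i] for i in range(len(x))))
        for j in range(len(a))
    ]

def forward(x, a, b, u, v):
    """Equation (1): g_k(x) = b_k + sum_j v[j][k] * h_j(x)."""
    h = hidden(x, a, u)
    return [
        b[k] + sum(v[j][k] * h[j] for j in range(len(h)))
        for k in range(len(b))
    ]

# Hypothetical toy network: d = 2 inputs, l = 4 hidden units, p = 1 output.
a = [0.0, 0.1, -0.1, 0.2]          # hidden biases
b = [0.5]                          # output biases
u = [[1.0, -1.0, 0.5, 0.0],        # input-to-hidden weights, u[i][j]
     [0.0, 1.0, -0.5, 2.0]]
v = [[1.0], [0.5], [-1.0], [0.25]]  # hidden-to-output weights, v[j][k]

y = forward([0.3, -0.7], a, b, u, v)  # list with p = 1 entry
```

Because every hidden unit lies in $(-1, 1)$, each output is bounded by $|b_k| + \sum_j |v_{jk}|$, which is a convenient sanity check on any implementation.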

It has been known only since 1989 that, as the number of hidden units increases, any continuous function defined on a compact set can be approximated arbitrarily well by linear combinations of sigmoids.

Multilayer perceptrons are often used as flexible models for nonparametric regression and classification. Given data,

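$$
(x^{(1)},y^{(1)}),\ (x^{(2)},y^{(2)}),\ \ldots,\ (x^{(n)},y^{(n)})
$$

with,

$$
y^{(k)} = g(x^{(k)},\theta) + \epsilon^{(k)}
$$

where

$$
\epsilon^{(1)},\epsilon^{(2)},\ldots,\epsilon^{(n)} \ \text{are iid with}\ E\,\epsilon^{(k)} = 0
$$

Hence, $g$ is the regression of $y$ on $x$, i.e.,

$$
E(y \mid x,\theta) = g(x,\theta) \quad\text{with}\quad \theta \in \Theta
$$
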
The multilayer perceptrons provide a practical way to define the functions $g$ with high dimensional parameter spaces $\Theta$. We take $\theta = \{ \{a_j\},\{b_k\},\{u_{ij}\},\{v_{jk}\} \}$. The objective is to find the predictive distribution of a new target vector $y$, given the examples $D = ((x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),\ldots,(x^{(n)},y^{(n)}))$ and the new vector of inputs $x$, i.e.,

$$
f(y \mid x, D) = \int f(y \mid x, \theta)\, p(\theta \mid D)\, d\theta
$$

Under the assumption of quadratic loss, the best guess for $y$ will be its mean,

$$
\hat{y} = E(y \mid x,D) = \int g(x,\theta)\, p(\theta \mid D)\, d\theta
$$

These estimates can be approximated by MCMC by sampling $\theta^{(1)},\ldots,\theta^{(N)}$ from the posterior and then computing empirical averages,

$$
\hat{y}_N = \frac{1}{N} \sum_{j=1}^{N} g(x,\theta^{(j)})
$$
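The empirical average above can be sketched on a toy problem. The lecture does not say which MCMC method is used, so the random-walk Metropolis sampler below is only one possible choice, picked here for brevity. The one-parameter model `g(x, theta) = tanh(theta * x)`, the Gaussian noise and prior standard deviations, the step size, and the synthetic data are all assumptions made for this illustration.

```python
import math
import random

def g(x, theta):
    """Toy one-parameter regression function (assumed for illustration)."""
    return math.tanh(theta * x)

def log_posterior(theta, data, noise_sd=0.1, prior_sd=10.0):
    """Unnormalized log posterior: Gaussian noise model plus Gaussian prior."""
    sse = sum((y - g(x, theta)) ** 2 for x, y in data)
    return -sse / (2 * noise_sd ** 2) - theta ** 2 / (2 * prior_sd ** 2)

def metropolis(data, n_samples, rng, burn_in=1000, step=0.3):
    """Random-walk Metropolis draws theta^(1), ..., theta^(N) after burn-in."""
    theta = 0.0
    lp = log_posterior(theta, data)
    samples = []
    for it in range(burn_in + n_samples):
        proposal = theta + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal, data)
        # Accept with probability min(1, exp(lp_new - lp)).
        if math.log(1.0 - rng.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        if it >= burn_in:
            samples.append(theta)
    return samples

def predictive_mean(x, samples):
    """Empirical average (1/N) * sum_j g(x, theta^(j))."""
    return sum(g(x, th) for th in samples) / len(samples)

# Synthetic data from a hypothetical "true" parameter.
rng = random.Random(0)
true_theta = 1.5
xs = [-2.0 + 4.0 * i / 29 for i in range(30)]
data = [(x, g(x, true_theta) + rng.gauss(0.0, 0.1)) for x in xs]

samples = metropolis(data, n_samples=2000, rng=rng)
y_hat = predictive_mean(1.0, samples)  # estimate of E(y | x = 1, D)
```

Note that the average is taken over the predictions $g(x,\theta^{(j)})$ and not over the parameters; the two differ whenever $g$ is non-linear in $\theta$.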

### Useful Priors on Feed Forward Networks

In the absence of specific information, the following assumptions about the prior $p(\theta)$ are reasonable,

1. The components of $\theta$ are independent and symmetric about 0.
2. Parameters of the same kind have the same a priori distributions, i.e.,
   - $a_1, a_2, \ldots$ iid
   - $b_1, b_2, \ldots$ iid
   - $u_{1j}, u_{2j}, \ldots$ iid for all $j$
   - $v_{1k}, v_{2k}, \ldots$ iid for all $k$

With these assumptions, if $\operatorname{var}_p(v_{jk}) = l^{-1}\sigma_v^2 < \infty$ then by the Central Limit Theorem, as $l \to \infty$ the prior on the output units converges to a Gaussian process. Gaussian processes are characterized by their covariance functions and they are often considered inadequate for modeling complex inter-dependence of the outputs. To avoid the Gaussian trap, it is convenient to use a priori distributions for the components of $\theta$ that have infinite variance.
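The role of the $l^{-1}$ scaling can be checked by simulation. The sketch below draws many networks from a Gaussian prior with hidden-to-output variance $\sigma_v^2 / l$ and compares the prior variance of one output at a fixed input for a small and a large number of hidden units. The Gaussian choice for every parameter, the unit standard deviations, the single input and output, and the zero output bias are simplifying assumptions made here.

```python
import math
import random

def prior_output(x, l, rng, sigma_v=1.0, sigma_a=1.0, sigma_u=1.0):
    """Draw one network from the prior and return its scalar output at x.

    Hidden-to-output weights have variance sigma_v**2 / l. The output bias
    is fixed at zero to keep the sketch short.
    """
    total = 0.0
    for _ in range(l):
        a_j = rng.gauss(0.0, sigma_a)
        u_j = rng.gauss(0.0, sigma_u)
        v_j = rng.gauss(0.0, sigma_v / math.sqrt(l))
        total += v_j * math.tanh(a_j + u_j * x)
    return total

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((val - mean) ** 2 for val in values) / (len(values) - 1)

rng = random.Random(1)
x = 0.5
var_small = sample_variance([prior_output(x, 10, rng) for _ in range(2000)])
var_large = sample_variance([prior_output(x, 200, rng) for _ in range(2000)])
# With the 1/l scaling the two variances agree up to Monte Carlo error.
```

Without the $l^{-1}$ factor the output variance would grow linearly in $l$, and the prior over functions would have no sensible limit.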

A practical choice (used by Neal) is to take,

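$$
v_{jk} \sim t\text{-distribution} \;\propto\; \left( 1 + \frac{v_{jk}^2}{\alpha\,\sigma_v^2} \right)^{-(\alpha+1)/2} \quad\text{with}\quad 0 < \alpha < 2
$$

Furthermore, if we take $\sigma_v = \omega_v\, l^{-1/\alpha}$ then the resulting prior will converge as $l \to \infty$ to a symmetric stable process of index $\alpha$.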
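Sampling from this heavy-tailed prior is straightforward. The sketch below uses the standard representation of a t variate with $\alpha$ degrees of freedom as a normal divided by the square root of a scaled chi-square, then multiplies by $\sigma_v = \omega_v\, l^{-1/\alpha}$. The function names and the example values of `l`, `alpha` and `omega_v` are hypothetical.

```python
import math
import random

def student_t(alpha, rng):
    """Standard t variate with alpha degrees of freedom.

    Uses T = Z / sqrt(C / alpha) with Z standard normal and C chi-square
    with alpha degrees of freedom, i.e. Gamma(shape=alpha/2, scale=2).
    """
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(alpha / 2.0, 2.0)
    return z / math.sqrt(chi2 / alpha)

def sample_output_weights(l, alpha, omega_v, rng):
    """Draw l hidden-to-output weights with scale sigma_v = omega_v * l**(-1/alpha)."""
    sigma_v = omega_v * l ** (-1.0 / alpha)
    return [sigma_v * student_t(alpha, rng) for _ in range(l)]

rng = random.Random(2)
weights = sample_output_weights(l=4000, alpha=1.0, omega_v=1.0, rng=rng)
```

For $\alpha = 1$ the t-distribution is the Cauchy distribution, so about half of the draws should fall within $\pm\sigma_v$ while a few are very large. Those few large weights are what keep the limit from being Gaussian.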

Recall that $Z_1,Z_2,\ldots,Z_n$ iid with distribution symmetric about 0 are said to be stable of index $\alpha$ if

$$
\frac{Z_1 + \cdots + Z_n}{n^{1/\alpha}} \ \text{has the same law as}\ Z_1
$$

A distribution is said to be in the domain of attraction of a stable law if properly normalized sums of independent observations from this distribution converge in law to a stable distribution. By the Central Limit Theorem, distributions with finite variance are in the domain of attraction of the Gaussians. It is also well known that distributions with tails going to zero as $|x|^{-(\alpha+1)}$ as $|x| \to \infty$ are in the domain of attraction of stable laws of index $\alpha$, which justifies the choice of the t-distribution above.
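The stability property can be illustrated numerically with the standard Cauchy distribution, which is symmetric and stable of index $\alpha = 1$. The sketch below compares single draws with normalized sums of 20 draws. The sample sizes and the probability-within-one summary are choices made here for illustration.

```python
import math
import random

def cauchy(rng):
    """Standard Cauchy draw by inverting the CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def normalized_sum(n, alpha, rng):
    """(Z_1 + ... + Z_n) / n**(1/alpha) for iid standard Cauchy Z_i."""
    return sum(cauchy(rng) for _ in range(n)) / n ** (1.0 / alpha)

def frac_within_one(values):
    """Fraction of values in [-1, 1]; equals 1/2 in law for a standard Cauchy."""
    return sum(1 for z in values if abs(z) <= 1.0) / len(values)

rng = random.Random(3)
reps = 4000
single = [cauchy(rng) for _ in range(reps)]
sums = [normalized_sum(20, 1.0, rng) for _ in range(reps)]

p_single = frac_within_one(single)
p_sums = frac_within_one(sums)
# Both fractions should be close to 0.5 if the two laws coincide.
```

Dividing by $n^{1/2}$ instead, as one would for finite-variance summands, would make the sums spread out as $n$ grows.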

#### Postulating an Energy for the Net

An alternative approach without priors (apparently...) is to postulate directly an energy function for the network, e.g.,

$$
E(\theta,\gamma) = \frac{1}{n} \sum_{k=1}^{n} L\big(y^{(k)}, g(x^{(k)},\theta)\big) + \gamma \|\theta\|^2
$$

where $L(y,z)$ is the assumed loss when we estimate $y$ with $z$. Typical choices are $L(y,z) = R(\|y-z\|)$ for some nondecreasing function $R$ and some norm $\|\cdot\|$. Then choose $\theta$ to minimize this energy function. Often, the smoothness parameter $\gamma > 0$ is chosen by cross-validation or by plain trial and error.
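As a concrete instance, the energy can be written down for a network with one input, one hidden unit and one output. The squared-error loss (that is, $R(r) = r^2$ with the absolute value as norm), the parameter ordering `theta = [a, b, u, v]`, and the toy data are assumptions made for this sketch.

```python
import math

def g(x, theta):
    """One-hidden-unit network: g(x) = b + v * tanh(a + u * x)."""
    a, b, u, v = theta
    return b + v * math.tanh(a + u * x)

def squared_loss(y, z):
    """L(y, z) = R(|y - z|) with R(r) = r**2."""
    return (y - z) ** 2

def energy(theta, gamma, data, g=g, loss=squared_loss):
    """Average loss over the data plus gamma times the squared norm of theta."""
    n = len(data)
    fit = sum(loss(y, g(x, theta)) for x, y in data) / n
    penalty = gamma * sum(t * t for t in theta)
    return fit + penalty

# Hypothetical toy data and parameters.
data = [(0.0, 1.0), (1.0, 3.0)]
theta = [0.0, 1.0, 0.0, 0.0]   # the network outputs the constant 1
value = energy(theta, gamma=0.5, data=data)   # (0 + 4) / 2 + 0.5 * 1 = 2.5
```

Larger values of `gamma` pull the minimizer toward $\theta = 0$, which for this architecture means flatter, smoother fitted functions.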

For complicated multimodal energy functions, a combination of simulated annealing with a classical gradient method (such as conjugate gradients) has been the most successful.
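The two-stage idea can be sketched on a one-dimensional multimodal energy. Simulated annealing is used first to locate a promising basin, and a gradient method then refines the result. For brevity the second stage here is plain gradient descent with a numerical derivative, standing in for conjugate gradients. The energy function, the linear cooling schedule, and all step sizes are assumptions chosen only for illustration.

```python
import math
import random

def energy(t):
    """Hypothetical multimodal 1-D energy with several local minima."""
    return 0.1 * t * t + math.sin(3.0 * t)

def grad(f, t, eps=1e-6):
    """Central-difference approximation to f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)

def anneal(f, t0, rng, n_steps=5000, temp0=2.0, step=0.5):
    """Simulated annealing; returns the lowest-energy point visited."""
    t, e = t0, f(t0)
    best_t, best_e = t, e
    for k in range(n_steps):
        temp = temp0 * (1.0 - k / n_steps) + 1e-3  # linear cooling
        cand = t + rng.gauss(0.0, step)
        e_cand = f(cand)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-(e_cand - e) / temp).
        if e_cand <= e or rng.random() < math.exp((e - e_cand) / temp):
            t, e = cand, e_cand
            if e < best_e:
                best_t, best_e = t, e
    return best_t

def gradient_descent(f, t, lr=0.05, n_steps=500):
    """Local refinement by fixed-step gradient descent."""
    for _ in range(n_steps):
        t -= lr * grad(f, t)
    return t

rng = random.Random(4)
t0 = 4.0
t_coarse = anneal(energy, t0, rng)
t_final = gradient_descent(energy, t_coarse)
```

Started from `t0` directly, gradient descent alone would stop at the nearest local minimum; the annealing stage is what lets the search leave that basin.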
