Another service from Omega

*****
Other available formats: [.PDF] [.TEX]

PAC Bounds

Carlos C. Rodríguez
http://omega.albany.edu:8008/

What do we have so far?

Let's recall what the VC-theory, introduced in the previous lectures, has accomplished so far.
First we played God and assumed we knew the underlying distribution of Z=(X,Y) that is generating the labeled data. In this case we can find the bayes classifier d* by simply assigning the label y to an observed vector x that maximizes the posterior probability given the observation. Then, we tried the obvious next thing. Instead of assuming that we knew the distribution of Z we assumed that we had the possibility of collecting a sample Z1,,Zn of independent observations with the same distribution as Z. We can then use these examples (e.g. provided by a trusted teacher) to approximate the necessary probabilities P{Z A} with the empirical frequencies Pn{A} defined by,

Pn{A} = 1

n
n

i=1 
1[Zi A].
In particular the VC-theory estimates the risk R(g)=P{g(X) Y} of a given classifier g with the observed frequency of errors Rn(g) = Pn{A} where,

A = {z = (x,y) : g(x) y }.
Hence, the question becomes: How good is PnA as an estimate of PA ? In this way the LLNs got naturally into the picture. It is intuitively clear that some kind of constraint on g is needed to avoid over fitting the observations. For otherwise we could always take gn(Xi) = Yi with no empirical error (i.e. fitting the observed data perfectly) but making a mistake on all other possible data, i.e. gn(X) = 1-Y so that the true error is worst possible, e.g. R(gn) = 1 when P has no atoms. We are then forced to consider g G  only. There is still another compelling reason for restricting classifiers to a given class G . That has to do with the issue of consistency. We mentioned (without proof) the existence of No Free Lunch theorems stating that there are no universally consistent classification rules with fix rates of convergence. Sure, there are universally consistent rules gn (e.g. the knn) for which R(gn) R* (the bayes risk) no matter what the underlying distribution of (X,Y) is. However, one can show that for any sequence of positive numbers an converging to zero, there exist distributions for which R(gn) an for all n. In other words, truth is asymptotically learnable but it may take forever to find out what it actually is!
The need to consider prior information is unavoidable. Data by itself cannot provide the ability to generalize. That may come as a surprise to some empiricists but for the bayesians (specially for those who have read Jaynes) data is another logical proposition and it only has meaning in a given domain of discourse. For example the number 2.5 or the binary string 010011 are not data. They are only data if they are understood as a logical proposition like "the result of such and such experiment was 2.5". In other words we cannot isolate data from prior knowledge. Data is a function of the theory, or the way I like to put it:
There is no data in the vacuum.
Yes, the currently accepted interpretations of quantum theory are in conflict with this view. But the currently accepted interpretations of quantum theory are in conflict with themselves, so there is nothing to worry about.

The need for Uniform LLNs

Thus, we need g G . What G s are good G s? To answer this let gn G  be the classifier minimizing the empirical risk over G , i.e.

Rn(gn) Rn(g)   for all  g G 
and let g* G  be the best classifier in G , i.e.,

R(g*) R(g)   for all  g G .
Finally let R* be the bayes risk, i.e.,

R* R(g)   for all  g.
The estimated error relative to the bayes risk decomposes into two parts,

Rn(gn) - R* = [ Rn(gn) - R(g*) ] + [ R(g*) - R* ]
The first term in square brackets is the estimation error and the second term is the approximation error . Better and bigger sampling reduces the first error and better and bigger models G reduce the second. It is important to notice the tug-of-war about model size implicit in the error decomposition. On the one hand, small models allowing few choices g G  should make the task of identifying the best g* G, with the information provided by the observed data, easier. In other words, the smaller G  is, the more helpful the sample becomes. On the other hand, the bigger the model the closer g* G  is from the bayes rule, and the smaller the approximation error. We'll concentrate primarily on the sampling error.
Three important inequalities to consider:
  1. Observed error compared to minimum true risk in G .

    Rn(gn) - R(g*) = [ Rn(gn) - Rn(g*) ] + [ Rn(g*) - R(g*) ]
    the first term in brackets is non positive, so

    Rn(gn) - R(g*) [ Rn(g*) - R(g*) ]
    thus,
    Rn(gn) - R(g*)
    sup
    G  
    | Rn(g) - R(g) |
  2. True error of estimator compared to best in G .

    R(gn) - R(g*) = [R(gn)-Rn(gn)] + [Rn(gn)-R(g*)]
    and using x |x| and the previous inequality we get,
    R(gn) - R(g*) 2
    sup
    G  
    | Rn(g) - R(g) |
  3. True error compared to empirical error.

    R(gn)-Rn(gn)
    sup
    G  
    | Rn(g) - R(g) |
The three inequalities above led us to consider,


sup
G  
|Rn(g)-R(g)| =
sup
A  
|PnA - PA|  : = ||Rn - R||
and the uniform laws of large numbers came to the table.

Probably Approximately Correct Bounds

The VC inequality,

E||Rn-R|| 2
log 2S (n,A ) 

n

1/2

 
,
and MacDiarmid's concentration inequality,

P{ ||Rn-R|| - E||Rn-R|| > t } e-2nt2
imply,

P{ ||Rn-R|| > e} exp{-2n(e- 2
log 2S (n,A ) 

n

1/2

 
)2}.

Inversion

The following simple fact is often useful for obtaining PAC bounds. If
P{X > t} F(t)
then with probability at least 1-d,

X F-1(d).

PAC Bounds for R(gn)

Thus, for any small d > 0 with probability at least 1-d,

R(gn) R(g*) + 2 Dn
and
R(gn) Rn(gn) + Dn
where,

Dn
1

2n
log 1

d

1/2

 
+ 2
log 2S (n,A ) 

n

1/2

 

The Finite Case

It is highly informative to compare the general bound above with the one obtained when the class G  is finite, i.e., when |G | = N < . Using Boole's and Hoeffding's inequalities we can write,

P{
sup
G  
|Rn(g)-R(g)| > t }
P



g G  
{|Rn(g)-R(g)| > t}



G  
P{ |Rn(g)-R(g)| > t }
N 2 e-2nt2
and applying the inversion method (see above) we get,

Dn
1

2n
log 1

d

1/2

 
+ 2
log 2N

n

1/2

 
The interpretation of the shatter coefficient S (n,A )  as an effective size N when the class is infinite is plain to see. The shatter coefficients (and thus the VC dimension as well) give us the maximum number of points in G that can be distinguished on the basis of a finite sample. Recalling that N equally likely choices have entropy log N we could naturally interpret the logarithm of the shatter coefficients as a kind of entropy for the infinite class G . This, I believe, is the most important lesson to be learned from the VC theory. We have now a method to reduce large (infinite dimensional) non-parametric classes to an effective finite number that we can actually tell apart from each other on the basis of only n < observations. The rest is counting. The bottom line is combinatorics. The fruits of this theory are immense!



File translated from TEX by TTH, version 3.63.
On 7 Oct 2004, 09:11.

http://omega.albany.edu:8008/