
PAC Bounds
Carlos C. Rodríguez
http://omega.albany.edu:8008/
What do we have so far?
Let's recall what the VC theory, introduced in the previous lectures, has
accomplished so far.
First we played God and assumed we knew the underlying distribution of
$Z=(X,Y)$ that is generating the labeled data. In this case we can
find the Bayes classifier $d^{*}$ by simply assigning to an observed
vector $x$ the label $y$ that maximizes the posterior probability
given the observation. Then we tried the obvious next thing. Instead
of assuming that we knew the distribution of $Z$, we assumed that we
had the possibility of collecting a sample $Z_1,\ldots,Z_n$ of
independent observations with the same distribution as $Z$. We can
then use these examples (e.g. provided by a trusted teacher) to
approximate the necessary probabilities $P\{Z \in A\}$ with the
empirical frequencies $P_n\{A\}$ defined by,
$$ P_n\{A\} \;=\; \frac{1}{n} \sum_{i=1}^{n} 1[Z_i \in A]. $$

In particular, the VC theory estimates the risk $R(g) = P\{g(X) \neq Y\}$
of a given classifier $g$ with the observed frequency of errors
$R_n(g) = P_n\{A\}$ where,

$$ A = \{\, z = (x,y) : g(x) \neq y \,\}. $$

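As a concrete illustration (not from the lecture itself), the empirical risk $R_n(g)$ is just the fraction of sample points that $g$ misclassifies. A minimal sketch in Python, under an assumed toy model where $X$ is uniform on $[0,1]$ and $P(Y=1 \mid X=x) = x$, so the Bayes rule is exactly $1[x > 1/2]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # a simple threshold classifier (an illustrative choice, not from the text)
    return (x > 0.5).astype(int)

n = 10_000
X = rng.uniform(0.0, 1.0, size=n)
# Assumed model: P(Y = 1 | X = x) = x.
Y = (rng.uniform(size=n) < X).astype(int)

# Empirical risk R_n(g) = P_n{A} with A = {(x, y) : g(x) != y}.
R_n = np.mean(g(X) != Y)
print(R_n)  # close to the true risk R(g) = E[min(X, 1-X)] = 0.25
```

For this model $R(g) = E[\min(X, 1-X)] = 1/4$, and the law of large numbers makes $R_n(g)$ concentrate around it.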
Hence, the question becomes: how good is $P_n\{A\}$ as an estimate of $P\{A\}$?
In this way the laws of large numbers (LLNs) entered the picture
naturally. It is intuitively clear that some kind of constraint on $g$
is needed to avoid overfitting the observations. Otherwise we could
always take $g_n(X_i) = Y_i$ with no empirical error (i.e. fitting the
observed data perfectly) but making a mistake on all other possible data,
i.e. $g_n(X) = 1 - Y$, so that the true error is the worst possible,
e.g. $R(g_n) = 1$ when $P$ has no atoms. We are then forced to consider
$g \in \mathcal{G}$ only. There is still another compelling reason for
restricting classifiers to a given class $\mathcal{G}$.
That has to do with the issue of consistency. We mentioned (without proof)
the existence of No Free Lunch theorems stating that there are no
universally consistent classification rules with fixed rates of convergence.
Sure, there are universally consistent rules $g_n$ (e.g. the $k$-nearest
neighbors rule) for which $R(g_n) \to R^{*}$ (the Bayes risk) no matter
what the underlying distribution of $(X,Y)$ is. However, one can show that
for any sequence of positive numbers $a_n$ converging to zero, there exist
distributions for which $R(g_n) \ge a_n$ for all $n$. In other words, truth
is asymptotically learnable, but it may take forever to find out what it
actually is!
The need to consider prior information is unavoidable. Data by itself
cannot provide the ability to generalize. That may come as a surprise to
some empiricists, but for the Bayesians (especially those who have read
Jaynes) data is another logical proposition, and it only has meaning in a
given domain of discourse. For example, the number 2.5 or the binary
string 010011 are not data. They are only data if they are understood as
a logical proposition like "the result of such and such experiment
was 2.5". In other words, we cannot isolate data from
prior knowledge. Data is a function of the theory, or the way I like to put
it:
There is no data in the vacuum.
Yes, the currently accepted interpretations of quantum theory are in conflict
with this view. But the currently accepted interpretations of quantum theory
are in conflict with themselves, so there is nothing to worry about.
The need for Uniform LLNs
Thus, we need $g \in \mathcal{G}$. Which classes $\mathcal{G}$ are good ones?
To answer this, let $g_n \in \mathcal{G}$ be the classifier minimizing the
empirical risk over $\mathcal{G}$, i.e.

$$ R_n(g_n) \le R_n(g) \quad \text{for all } g \in \mathcal{G}, $$

and let $g^{*} \in \mathcal{G}$ be the best classifier in $\mathcal{G}$, i.e.,

$$ R(g^{*}) \le R(g) \quad \text{for all } g \in \mathcal{G}. $$

Finally, let $R^{*}$ be the Bayes risk, i.e., the infimum of $R(g)$ over
all (measurable) classifiers $g$, attained by the Bayes rule.
The estimated error relative to the Bayes risk decomposes into two parts,

$$ R_n(g_n) - R^{*} \;=\; [\, R_n(g_n) - R(g^{*}) \,] \;+\; [\, R(g^{*}) - R^{*} \,]. $$

The first term in square brackets is the estimation error and the second
term is the approximation error. Better and bigger sampling reduces the
first error, and better and bigger models $\mathcal{G}$ reduce the second.
It is important to notice the tug-of-war about model size implicit in the
error decomposition. On the one hand, small models allowing few choices
$g \in \mathcal{G}$ should make the task of identifying the best
$g^{*} \in \mathcal{G}$, with the information provided by the observed
data, easier. In other words, the smaller $\mathcal{G}$ is, the more
helpful the sample becomes. On the other hand, the bigger the model, the
closer $g^{*} \in \mathcal{G}$ is to the Bayes rule, and the smaller the
approximation error. We'll concentrate primarily on the estimation error.
Three important inequalities to consider:

1. Observed error compared to minimum true risk in $\mathcal{G}$:

$$ R_n(g_n) - R(g^{*}) \;=\; [\, R_n(g_n) - R_n(g^{*}) \,] \;+\; [\, R_n(g^{*}) - R(g^{*}) \,]. $$

The first term in brackets is non-positive, so

$$ R_n(g_n) - R(g^{*}) \;\le\; R_n(g^{*}) - R(g^{*}), $$

thus,

$$ R_n(g_n) - R(g^{*}) \;\le\; \sup_{g \in \mathcal{G}} |\, R_n(g) - R(g) \,|. $$

2. True error of the estimator compared to the best in $\mathcal{G}$:

$$ R(g_n) - R(g^{*}) \;=\; [\, R(g_n) - R_n(g_n) \,] \;+\; [\, R_n(g_n) - R(g^{*}) \,], $$

and using $x \le |x|$ and the previous inequality we get,

$$ R(g_n) - R(g^{*}) \;\le\; 2 \sup_{g \in \mathcal{G}} |\, R_n(g) - R(g) \,|. $$

3. True error compared to empirical error:

$$ R(g_n) - R_n(g_n) \;\le\; \sup_{g \in \mathcal{G}} |\, R_n(g) - R(g) \,|. $$

The three inequalities above led us to consider,

$$ \sup_{g \in \mathcal{G}} |\, R_n(g) - R(g) \,| \;=\; \sup_{A \in \mathcal{A}} |\, P_n A - P A \,| \;=:\; \| R_n - R \|, $$

and the uniform laws of large numbers came to the table.
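The uniform deviation above can be watched in action. Here is a small Monte Carlo sketch (an illustration, not part of the lecture) for a finite class of threshold classifiers $g_t(x) = 1[x > t]$, under the same assumed toy model as before ($X$ uniform on $[0,1]$, $P(Y=1 \mid X=x) = x$), where the true risks are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0.1, 0.9, 9)   # the class G: g_t(x) = 1[x > t]

def true_risk(t):
    # R(g_t): errors are Y=1 on [0,t] (prob. x there) and Y=0 on (t,1]
    # (prob. 1-x there), so R(g_t) = t^2/2 + (1-t)^2/2 for this model.
    return t**2 / 2 + (1 - t)**2 / 2

n = 5000
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < X).astype(int)

emp = np.array([np.mean((X > t).astype(int) != Y) for t in thresholds])
tru = np.array([true_risk(t) for t in thresholds])

# sup over the class of |R_n(g) - R(g)|
sup_dev = np.max(np.abs(emp - tru))
print(sup_dev)  # small for n = 5000
```

With only nine classifiers the supremum behaves much like a single deviation; the interesting regime, addressed by the VC inequality below, is when the class is infinite.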
Probably Approximately Correct Bounds
The VC inequality,

$$ E\,\| R_n - R \| \;\le\; 2 \left( \frac{\log 2\,\mathcal{S}(n,\mathcal{A})}{n} \right)^{1/2}, $$

and McDiarmid's concentration inequality,

$$ P\left\{\, \| R_n - R \| - E\,\| R_n - R \| > t \,\right\} \;\le\; e^{-2nt^2}, $$

imply, for $\epsilon > 2\left(\log 2\,\mathcal{S}(n,\mathcal{A})/n\right)^{1/2}$,

$$ P\left\{\, \| R_n - R \| > \epsilon \,\right\} \;\le\; \exp\left\{ -2n \left( \epsilon - 2\left( \frac{\log 2\,\mathcal{S}(n,\mathcal{A})}{n} \right)^{1/2} \right)^{2} \right\}. $$


Inversion
The following simple fact is often useful for obtaining PAC bounds.
If, for all $t > 0$,

$$ P\{\, X > t \,\} \;\le\; e^{-2nt^2}, $$

then, with probability at least $1-\delta$,

$$ X \;\le\; \left( \frac{1}{2n} \log \frac{1}{\delta} \right)^{1/2}, $$

which follows by setting the tail bound equal to $\delta$ and solving for $t$.
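The inversion step is mechanical enough to write down as a one-line function (a sketch; the function name is mine):

```python
import math

def inversion_t(delta: float, n: int) -> float:
    """Solve exp(-2 n t^2) = delta for t: the deviation level that
    holds with probability at least 1 - delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# e.g. with n = 1000 samples at 95% confidence (delta = 0.05):
print(inversion_t(0.05, 1000))  # about 0.0387
```

Note the characteristic PAC trade-off: the deviation level grows only like $\sqrt{\log(1/\delta)}$ as the confidence $1-\delta$ increases, but shrinks like $1/\sqrt{n}$ with the sample size.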
PAC Bounds for R(g_{n})
Thus, for any small $\delta > 0$, with probability at least $1-\delta$,

$$ R(g_n) \;\le\; R(g^{*}) + 2 \Delta_n $$

and

$$ R(g_n) \;\le\; R_n(g_n) + \Delta_n, $$

where,

$$ \Delta_n \;=\; \left( \frac{1}{2n} \log \frac{1}{\delta} \right)^{1/2} \;+\; 2 \left( \frac{\log 2\,\mathcal{S}(n,\mathcal{A})}{n} \right)^{1/2}. $$

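To put numbers to $\Delta_n$ one needs the shatter coefficient; a standard fact of VC theory (Sauer's lemma) bounds it by $\mathcal{S}(n,\mathcal{A}) \le (n+1)^V$ where $V$ is the VC dimension. A minimal sketch, assuming that bound (the function name and example parameters are mine):

```python
import math

def delta_n(n: int, delta: float, vc_dim: int) -> float:
    """PAC deviation term Delta_n, using Sauer's lemma to bound the
    shatter coefficient: S(n, A) <= (n + 1)^V for VC dimension V."""
    log_shatter = math.log(2.0) + vc_dim * math.log(n + 1)
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n)) \
         + 2.0 * math.sqrt(log_shatter / n)

# e.g. half-planes in R^2 have VC dimension 3:
print(delta_n(100_000, 0.05, 3))  # about 0.041
```

Even for a class as simple as half-planes, guaranteeing a few percentage points of deviation takes on the order of $10^5$ samples; the $\sqrt{V \log n / n}$ shape of the bound makes this explicit.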
The Finite Case
It is highly informative to compare the general bound above with the
one obtained when the class $\mathcal{G}$ is finite, i.e., when
$|\mathcal{G}| = N < \infty$. Using Boole's and Hoeffding's inequalities
we can write,

$$ P\left\{ \sup_{g \in \mathcal{G}} |\, R_n(g) - R(g) \,| > t \right\}
\;=\; P\left\{ \bigcup_{g \in \mathcal{G}} \left\{ |\, R_n(g) - R(g) \,| > t \right\} \right\}
\;\le\; \sum_{g \in \mathcal{G}} P\left\{ |\, R_n(g) - R(g) \,| > t \right\}
\;\le\; 2 N e^{-2nt^2}, $$

and applying the inversion method (see above) we get,

$$ \Delta_n \;\le\; \left( \frac{1}{2n} \log \frac{1}{\delta} \right)^{1/2} \;+\; 2 \left( \frac{\log 2N}{n} \right)^{1/2}. $$

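For a quick numerical sense of the finite case, the union-bound tail $2N e^{-2nt^2} \le \delta$ can be inverted directly (a sketch; the function name and the values of $N$ and $n$ below are made up for illustration):

```python
import math

def delta_n_finite(n: int, delta: float, N: int) -> float:
    """Finite-class uniform deviation: invert 2 N exp(-2 n t^2) = delta."""
    return math.sqrt(math.log(2.0 * N / delta) / (2.0 * n))

# With N = 1000 classifiers, n = 10000 samples, 95% confidence:
print(delta_n_finite(10_000, 0.05, 1000))  # about 0.023
```

The class size enters only through $\log N$, which is exactly the "entropy" reading of $\log \mathcal{S}(n,\mathcal{A})$ discussed next.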
The interpretation of the shatter coefficient $\mathcal{S}(n,\mathcal{A})$
as an effective size $N$ when the class is infinite is plain to see. The
shatter coefficients (and thus the VC dimension as well) give us the
maximum number of classifiers in $\mathcal{G}$ that can be distinguished
on the basis of a finite sample. Recalling that $N$ equally likely choices
have entropy $\log N$, we could naturally interpret the logarithm of the
shatter coefficients as a kind of entropy for the infinite class
$\mathcal{G}$. This, I believe, is the most important lesson to be learned
from the VC theory. We now have a method to reduce large (infinite
dimensional) nonparametric classes to an effective finite number of
classifiers that we can actually tell apart on the basis of only
$n < \infty$ observations. The rest is counting. The bottom line is
combinatorics. The fruits of this theory are immense!
