
Uniform Laws of Large Numbers

Carlos C. Rodríguez
http://omega.albany.edu:8008/

What is a Law of Large Numbers?

I am glad you asked! The Laws of Large Numbers, or LLNs for short, come in three basic flavors: Weak, Strong and Uniform. They all state that the observed frequencies of events tend to approach the actual probabilities as the number of observations increases. Put another way, the LLNs show that under certain conditions we can asymptotically learn the probabilities of events from their observed frequencies. To add some drama, we could say that if God is not cheating and S/he doesn't change the initial standard probabilistic model too much then, in principle, we (or other machines, or even the universe as a whole) could eventually find out the Truth, the whole Truth, and nothing but the Truth.
Bull! The Devil is in the details.
I suspect that for reasons not too different in spirit from the ones above, famous minds of the past took the slippery slope of defining probabilities as the limits of relative frequencies. They became known as "frequentists". They wrote the books and indoctrinated generations of confused students.
As we shall see below, all the LLNs follow from the addition and product rules of probability theory. So, no matter what interpretation is ascribed to the concept of probability, if the numerical values assigned to events follow the addition and product rules, then the LLNs are an inevitable logical consequence. In other words, you don't have to be a frequentist to enjoy the LLNs. In fact, due to the very existence of the LLNs, it is not possible to define probabilities as limits of frequencies in a consistent way. This is simply because all LLNs state only probabilistic convergence of frequencies to probabilities (the convergence is either in probability or with probability 1). The concept that we want to interpret (namely probability) is needed to define the very concept (namely the LLNs) that is supposed to explain it. The frequentist concept of probability eats its own tail!

The Weak Law

The Weak Law of Large Numbers (WLLN) goes back to the beginnings of probability theory. It was discovered for the case of random coin flips by James Bernoulli around 1700 but only appeared in print posthumously in his Ars Conjectandi in 1713. Later on, in 1837, Poisson generalized the result to general independent coin flips. After that, Tchebychev in 1867 discovered his inequality and generalized the law to arbitrary sequences of independent random variables with second moments. Finally, his student Markov extended it to some classes of dependent random variables. Markov's inequality is almost a triviality, but it has found innumerable applications.
Theorem 1 [Markov's inequality] If X is nonnegative and t > 0,

    P\{ X \ge t \} \le \frac{EX}{t}.

Proof: for t > 0,

    X \ge X \, 1[X \ge t] \ge t \, 1[X \ge t]

and by the monotonicity of expectations we find that,

    EX \ge t \, P\{ X \ge t \}.
Two important consequences of Markov's inequality are:
Tchebychev's inequality
If V(X) denotes the variance of X then,

    P\{ |X - EX| \ge t \} = P\{ |X - EX|^2 \ge t^2 \} \le \frac{V(X)}{t^2}.
Chernoff's method
For t > 0 find the best s > 0 in,

    P\{ X \ge t \} = P\{ e^{sX} \ge e^{st} \} \le \frac{E\, e^{sX}}{e^{st}}.
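A quick numerical sketch (my addition, not part of the original text): the bounds are easy to compare for an Exponential(1) variable, whose exact tail P{X >= t} = e^{-t} is known in closed form. The distribution and the value t = 5 are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

# Compare the exact tail of X ~ Exponential(1) (EX = 1, P{X >= t} = e^{-t})
# with Markov's bound EX/t and the optimized Chernoff bound.
n = 200_000
sample = [random.expovariate(1.0) for _ in range(n)]

t = 5.0
tail = sum(x >= t for x in sample) / n      # empirical P{X >= t}
markov = 1.0 / t                            # EX / t
# Chernoff: E e^{sX} = 1/(1-s) for s < 1, so the bound is e^{-st}/(1-s);
# minimizing over s gives s = 1 - 1/t and the bound t e^{1-t}.
chernoff = t * math.exp(1.0 - t)

print(f"exact     {math.exp(-t):.4f}")   # 0.0067
print(f"empirical {tail:.4f}")
print(f"Markov    {markov:.4f}")         # 0.2000
print(f"Chernoff  {chernoff:.4f}")       # 0.0916
```

Both bounds hold, but Chernoff's optimized exponential bound is already an order of magnitude tighter than Markov's at t = 5.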
Thus, when X_1, X_2, \dots, X_n are independent and identically distributed (iid) as X, the sample mean

    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

has mean EX and variance V(X)/n, so by Tchebychev, for any \epsilon > 0,

    P\{ |\bar{X}_n - EX| \ge \epsilon \} \le \frac{V(X)}{n\epsilon^2}
and it immediately follows that,


    \lim_{n\to\infty} P\{ |\bar{X}_n - EX| \ge \epsilon \} = 0
which is what is meant by the sentence "the sample mean converges in probability to the expected value". That's the WLLN. For the special case of coin flips, i.e. for binary r.v.'s Bin(p) with P{X=1} = 1 - P{X=0} = p, the Tchebychev bound gives,

    P\{ |\bar{X}_n - p| \ge \epsilon \} \le \frac{p(1-p)}{n\epsilon^2}

showing that the observed frequency of ones converges in probability to the true probability p of observing a 1.
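The coin-flip bound is easy to test by simulation. The sketch below is my illustration (p = 0.3 and epsilon = 0.1 are arbitrary): it estimates the deviation probability by Monte Carlo and compares it with p(1-p)/(n epsilon^2).

```python
import random

random.seed(1)

# Estimate P{ |mean_n - p| >= eps } for Bin(p) coin flips by simulation
# and compare with the Tchebychev bound p(1-p)/(n eps^2).
p, eps, trials = 0.3, 0.1, 2000

def deviation_prob(n):
    hits = 0
    for _ in range(trials):
        freq = sum(random.random() < p for _ in range(n)) / n
        hits += abs(freq - p) >= eps
    return hits / trials

for n in (10, 100, 1000):
    bound = min(1.0, p * (1 - p) / (n * eps * eps))
    print(n, deviation_prob(n), bound)
```

Both the estimated probability and the bound shrink as n grows, which is the WLLN in action; the bound is loose, as the next section explains.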

The Strong Law

The bounds above obtained from Tchebychev's inequality are very poor. By using Chernoff's method an exponential bound can be obtained. In fact we have,
Hoeffding's inequality

    P\left\{ |\bar{X}_n - p| \ge \epsilon \right\} \le 2\, e^{-2n\epsilon^2}
and by the classic Borel-Cantelli lemma it follows that,

    P\{ \omega : \lim_{n\to\infty} \bar{X}_n(\omega) = p \} = 1
which is the statement that the observed frequency of ones converges with probability one (or a.s., for almost surely) to the true probability p of observing a 1.
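To see how much stronger the exponential bound is, here is a small numeric comparison (my addition; p = 1/2 and epsilon = 0.05 are arbitrary). It also checks that Hoeffding's bound is summable in n, which is exactly what Borel-Cantelli requires.

```python
import math

# Tchebychev bound p(1-p)/(n eps^2) vs Hoeffding bound 2 exp(-2 n eps^2).
p, eps = 0.5, 0.05
for n in (100, 1000, 10_000):
    tcheby = p * (1 - p) / (n * eps ** 2)
    hoeff = 2 * math.exp(-2 * n * eps ** 2)
    print(n, tcheby, hoeff)

# Hoeffding's bound has a finite sum over n (Tchebychev's ~ 1/n does not),
# so Borel-Cantelli applies and gives convergence with probability 1.
total = sum(2 * math.exp(-2 * n * eps ** 2) for n in range(1, 100_000))
print("sum over n:", total)   # about 399
```

For n = 10,000 the exponential bound is astronomically smaller than the polynomial one, and its summability is what upgrades convergence in probability to convergence with probability 1.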
The proof of Hoeffding's inequality uses the following result for bounded r.v.'s with zero mean,
Lemma 1 If EX = 0 and a \le X \le b, then for any s > 0,

    E\, e^{sX} \le e^{s^2(b-a)^2/8}.
Proof: Let a \le x \le b and define \lambda \in [0,1] by

    \lambda = \frac{b-x}{b-a}.

Notice that for any s > 0 we have,

    sx = \lambda\, sa + (1-\lambda)\, sb.

Thus, the convexity of exp(\cdot) implies,

    e^{sx} \le \frac{b-x}{b-a}\, e^{sa} + \frac{x-a}{b-a}\, e^{sb}.
Replacing x with the r.v. X, taking expectations and letting p = -a/(b-a) (notice that EX = 0 implies p \in [0,1]) we can write,

    E\{e^{sX}\} \le \frac{b}{b-a}\, e^{sa} - \frac{a}{b-a}\, e^{sb}
                 = (1-p)\, e^{sa} + p\, e^{sb}
                 = (1 - p + p\, e^{s(b-a)})\, e^{-ps(b-a)}
                 \stackrel{\rm def}{=} e^{\phi(u)},

where u = s(b-a) and,

    \phi(u) = -pu + \log(1 - p + p\, e^u).
The lemma will follow from the last inequality above by showing that,

    \phi(u) \le \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}.
To see that this is true just expand \phi(u) about zero,

    \phi(u) = \phi(0) + u\,\phi'(0) + \frac{1}{2}\, u^2 \phi''(\theta)

where \theta \in [0,u] exists by Taylor's theorem, and notice that \phi(0) = \phi'(0) = 0 and

    \phi''(u) = \frac{p(1-p)\, e^{-u}}{(p + (1-p)e^{-u})^2} \le \frac{1}{4};

this is just a special case of z(1-z) \le 1/4 for z = p/(p + (1-p)e^{-u}). Alternatively, just set the derivative equal to 0 to find that the max (1/4) is achieved when e^{-u} = p/(1-p).
Notice that for the special case of X \in \{1,-1\} with equal probability 1/2 for each value the result follows at once from,

    E\, e^{sX} = \cosh s \le e^{s^2/2}

by comparing the two series term by term. It is just this case that is needed in the main VC-theorem below.
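Lemma 1 can also be checked numerically. The sketch below (my illustration; a = -1, b = 2 are arbitrary) uses the extreme case of the lemma, the two-point law with mass p = -a/(b-a) at b, which has mean zero by construction.

```python
import math

# Check Lemma 1 numerically: if EX = 0 and a <= X <= b then
# E e^{sX} <= e^{s^2 (b-a)^2 / 8}.  The extreme case is the two-point
# law P{X = a} = 1-p, P{X = b} = p with p = -a/(b-a), which makes EX = 0.
a, b = -1.0, 2.0
p = -a / (b - a)
for s in (0.1, 0.5, 1.0, 2.0):
    mgf = (1 - p) * math.exp(s * a) + p * math.exp(s * b)
    bound = math.exp(s * s * (b - a) ** 2 / 8)
    print(s, round(mgf, 4), round(bound, 4))
    assert mgf <= bound
```

For small s the two sides are close (the bound is essentially tight to second order in s), while for large s the slack grows.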
We are now ready to show
Proof: [Hoeffding's inequality]
We actually show a more general version for X_1, \dots, X_n independent with a_i \le X_i \le b_i. Let Z_i = X_i - EX_i. We have,

    P\left\{ \left| \sum_{i=1}^{n} Z_i \right| \ge t \right\}
      \le P\left\{ \sum_{i=1}^{n} Z_i \ge t \right\} + P\left\{ \sum_{i=1}^{n} (-Z_i) \ge t \right\}
      \le 2\, e^{-st} \prod_{i=1}^{n} e^{s^2(b_i-a_i)^2/8}
      = 2 \exp\left\{ \frac{s^2}{8} \sum_{i=1}^{n} (b_i-a_i)^2 - st \right\}

where we are using Chernoff's method and the previous lemma. The upper bound is optimized at s = 4t/\sum_{i=1}^{n}(b_i-a_i)^2, producing,

    P\left\{ \left| \sum_{i=1}^{n} Z_i \right| \ge t \right\} \le 2\, e^{-2t^2/\sum_{i=1}^{n}(b_i-a_i)^2}

which implies the claimed bound for the special case of coin flips. Just replace t = n\epsilon and notice that for binary variables \sum_{i=1}^{n}(b_i-a_i)^2 = n.

The Modern Strong Uniform Laws

The historical evolution of the laws of large numbers has coincided with important paradigm shifts in the theory of probability. The weak law of Bernoulli and Poisson, with the later refinements of Tchebychev and Markov, is characteristic of the early era of probability. Then came the strong laws of Borel, Cantelli, Kolmogorov and others. These characterized the time of the axiomatic formalization of probability as part of measure theory during the first part of the twentieth century. The latest addition to this saga is what we'll concentrate on here: the so-called strong uniform laws, which have a combinatorial flavor and were discovered by Vapnik and Chervonenkis in the 1970's in connection with statistical learning.
We start with a powerful generalization of Hoeffding's inequality for general functions of independent r.v.'s satisfying the bounded difference assumption. Let S \subset R^n and denote by e_i \in R^n the ith canonical vector with all zeros except for a 1 in the ith position. We say that a function h : S \to R has bounded differences in S if for all 1 \le i \le n,

    | h(x) - h(x + t\, e_i) | \le c_i

for all x \in S and all t \in R such that (x + t\, e_i) \in S. This means that the function does not change by more than c_i along the ith direction. We have,
McDiarmid's inequality
Let h have bounded differences. For all t > 0,

    P\{ | h(X_1,\dots,X_n) - Eh | \ge t \} \le 2\, e^{-2t^2/\sum_{i=1}^{n} c_i^2}

Notice that when h = \sum_{i=1}^{n} X_i we recover Hoeffding's inequality.
Proof: [McDiarmid's inequality] The idea is to write,

    h - Eh = \sum_{i=1}^{n} Z_i

by using,

    Z_i(X_1,\dots,X_i) = E\{ h \,|\, X_1,\dots,X_i \} - E\{ h \,|\, X_1,\dots,X_{i-1} \};

these Z_i have zero mean and are bounded a.s. within the interval [L_i, U_i], with the lower and upper limits given by the inf and sup over X_i = u of Z_i. Thus, L_i and U_i depend only on X_1,\dots,X_{i-1}, and U_i - L_i \le c_i is inherited from the bounded difference assumption about h. Therefore, using Chernoff's method and the previous lemma we have that for all s > 0,

    P\{ h - Eh \ge t \}
      \le e^{-st}\, E\left\{ e^{s\sum_{i=1}^{n-1} Z_i}\, E\{ e^{sZ_n} \,|\, X_1,\dots,X_{n-1} \} \right\}
      \le e^{-st}\, e^{s^2 \sum_{i=1}^{n} c_i^2/8}

where the lemma was used n times. Now optimize over s and copy the steps used in the proof of Hoeffding's inequality to obtain the result.
Corollary
Let \nu_n be the empirical probability measure based on the iid sample X_1, X_2, \dots, X_n. The function,

    h_n = h_n(X_1,\dots,X_n) = \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} |

has bounded differences for any class of sets \mathcal{A}.
Proof: By changing only one of the X_i the function h_n changes by at most c_i = 1/n.
It then follows immediately from McDiarmid's inequality that,

    P\{ | h_n - E h_n | \ge t \} \le 2\, e^{-2nt^2}
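To make this concrete, here is a simulation of my own (not from the original text): take the class of half-lines (-\infty, a] and Uniform(0,1) data, so that h_n is the Kolmogorov-Smirnov statistic, and check its concentration around its mean against 2 e^{-2nt^2}.

```python
import math
import random

random.seed(2)

# For the class of half-lines (-inf, a] and Uniform(0,1) data, h_n is the
# Kolmogorov-Smirnov statistic; changing one X_i moves it by at most 1/n,
# so McDiarmid gives  P{ |h_n - E h_n| >= t } <= 2 exp(-2 n t^2).
def h(n):
    xs = sorted(random.random() for _ in range(n))
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, 1))

n, trials, t = 200, 1000, 0.1
values = [h(n) for _ in range(trials)]
mean_h = sum(values) / trials
dev = sum(abs(v - mean_h) >= t for v in values) / trials
bound = 2 * math.exp(-2 * n * t * t)
print(round(mean_h, 3), dev, round(bound, 4))
```

In this experiment h_n fluctuates very little around its mean, well within the McDiarmid bound; note the inequality says nothing yet about the mean itself going to zero, which is the subject of the next section.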
Thus, if we can show that E h_n \to 0 as n \to \infty, we can deduce from the above inequality that, for any t > 0 and all n large enough that E h_n \le t/2,

    P\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \ge t \right\} \le 2\, e^{-nt^2/2}

and by the Borel-Cantelli lemma we would have obtained that,

    \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \to 0 \quad a.s.

as n \to \infty, i.e. we'll have a uniform strong law of large numbers over the class \mathcal{A}.

Enter Combinatorics

If \mathcal{A} is a collection of subsets of R^d we define the shatter coefficients associated to the class \mathcal{A} as,

    S(n,\mathcal{A}) = \max_{x_1,\dots,x_n \in R^d} \left| \{ A \cap \{x_1,\dots,x_n\} : A \in \mathcal{A} \} \right|.

The integer S(n,\mathcal{A}) is the maximum number of subsets of a set of n points that can be picked out by elements of \mathcal{A}. Here is a post-modern version of the Vapnik-Chervonenkis inequality due to Devroye and Lugosi.
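A tiny brute-force computation (my illustration) shows how small shatter coefficients can be compared to the 2^n subsets an unrestricted class could pick out: for half-lines on the real line, only the "prefixes" of the ordered points are achievable, so S(n, A) = n + 1.

```python
# Brute-force shatter coefficient for the class of half-lines
# A = { (-inf, a] : a in R } on the real line: a half-line can only pick
# out a prefix of the ordered points, so S(n, A) = n + 1, far fewer than
# the 2^n subsets an arbitrary class could pick out.
def shatter_halflines(points):
    # one threshold below all points (empty set) plus one at each point
    thresholds = [min(points) - 1] + sorted(points)
    picked = {frozenset(x for x in points if x <= a) for a in thresholds}
    return len(picked)

for n in (1, 2, 5, 10):
    print(n, shatter_halflines(list(range(n))))   # prints n, n + 1
```

Since log S(n, A) then grows like log n, much slower than n, this is exactly the kind of class the next theorem rewards.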
Theorem: [VC inequality]

    E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \right\} \le 2 \left( \frac{\log 2 S(n,\mathcal{A})}{n} \right)^{1/2}.

Before proving this, notice that classes \mathcal{A} for which the rhs of the above inequality goes to zero allow strong uniform laws of large numbers. In other words, the class \mathcal{A} must not be too populated, in the sense that the logarithm of its shatter coefficients must increase at a rate slower than n. The proof uses the following Lemma, which also has independent interest.
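Here is a Monte Carlo sanity check of my own (not in the original text): for half-lines on R we have S(n, A) = n + 1, and the expected uniform deviation for Uniform(0,1) data is the mean Kolmogorov-Smirnov statistic, which the VC bound must dominate.

```python
import math
import random

random.seed(3)

# For half-lines on R, S(n, A) = n + 1, so the VC inequality reads
#   E sup_A |nu_n(A) - nu(A)| <= 2 sqrt(log(2(n+1)) / n).
# The left side (the KS statistic for Uniform(0,1) data) is estimated
# by Monte Carlo and compared with the bound.
def sup_dev(n):
    xs = sorted(random.random() for _ in range(n))
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, 1))

n, trials = 500, 400
est = sum(sup_dev(n) for _ in range(trials)) / trials
bound = 2 * math.sqrt(math.log(2 * (n + 1)) / n)
print(round(est, 4), round(bound, 4))
```

The bound is generous (roughly a factor of five here) but has the right 1/sqrt(n) order, up to the logarithmic factor.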
Lemma
If E\, e^{sZ_i} \le e^{s^2c^2/2} for all s > 0, then,

    E\left\{ \max_{i \le n} Z_i \right\} \le c\, (2 \log n)^{1/2}.
Proof:

    e^{s E\{\max_{i \le n} Z_i\}}
      \le E\left\{ e^{s \max_{i \le n} Z_i} \right\}
      = E\left\{ \max_{i \le n} e^{sZ_i} \right\}
      \le \sum_{i \le n} E\, e^{sZ_i}
      \le n\, e^{s^2c^2/2}

where we have used Jensen's inequality and the hypothesis. Hence,

    E\left\{ \max_{i \le n} Z_i \right\} \le \frac{\log n}{s} + \frac{s c^2}{2}

is valid for any s > 0. The best bound, the one claimed by the lemma, is obtained at s = c^{-1}(2\log n)^{1/2}.
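The lemma is easy to test on Gaussians (my illustration, not from the original text): a standard normal satisfies E e^{sZ} = e^{s^2/2}, i.e. c = 1, so the expected maximum of n of them should stay below sqrt(2 log n).

```python
import math
import random

random.seed(4)

# Standard normals satisfy E e^{sZ} = e^{s^2/2}, i.e. c = 1 in the Lemma,
# so  E max_{i<=n} Z_i <= sqrt(2 log n).  Check the bound by simulation.
def mean_max(n, trials=2000):
    return sum(max(random.gauss(0, 1) for _ in range(n))
               for _ in range(trials)) / trials

for n in (10, 100, 1000):
    print(n, round(mean_max(n), 3), round(math.sqrt(2 * math.log(n)), 3))
```

The simulated means grow like the bound does, which is consistent with the known fact that sqrt(2 log n) is the correct growth rate for Gaussian maxima.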
Proof: [VC inequality]
We divide the proof into three simple parts. First we show,

First symmetrization


    E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \right\} \le E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu_n'\{A\} | \right\}

where \nu_n' denotes the empirical measure associated to an independent copy X_1',\dots,X_n' of the original sample X_1,\dots,X_n. This is just a simple fact that follows from two applications of Jensen's inequality and the fact that the unconditional expectation is the expectation of the expectation conditional on the original sample,

    E\left\{ \sup_{A} | \nu_n\{A\} - \nu\{A\} | \right\}
      = E\left\{ \sup_{A} \left| E\{ \nu_n\{A\} - \nu_n'\{A\} \,|\, X_1,\dots,X_n \} \right| \right\}
      \le E\left\{ \sup_{A} E\{ | \nu_n\{A\} - \nu_n'\{A\} | \,|\, X_1,\dots,X_n \} \right\}
      \le E\left\{ E\{ \sup_{A} | \nu_n\{A\} - \nu_n'\{A\} | \,|\, X_1,\dots,X_n \} \right\}
      = E\left\{ \sup_{A} | \nu_n\{A\} - \nu_n'\{A\} | \right\}.
The second step is,

Second symmetrization

Introduce, independently of the two samples, n independent random signs \epsilon_1,\dots,\epsilon_n, i.e., P\{\epsilon_i = 1\} = P\{\epsilon_i = -1\} = 1/2, and notice that if the Z_i are any independent r.v.'s symmetric about 0 then the joint distribution of \epsilon_1 Z_1,\dots,\epsilon_n Z_n is the same as the joint distribution of Z_1,\dots,Z_n. Hence,

    E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \right\} \le \frac{1}{n}\, E\left\{ \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^{n} \epsilon_i \left( 1[X_i \in A] - 1[X_i' \in A] \right) \right| \right\}

where we used Z_i = 1[X_i \in A] - 1[X_i' \in A]. Finally the third step,

Counting and bounding

Here is where combinatorics gets into the picture. To compute the sup over the class \mathcal{A} we only need to check a finite number of sets A \in \mathcal{A}, namely those that pick different subsets of the 2n values \{X_1, X_1',\dots,X_n, X_n'\}. Thus, we only need to check at most m = S(2n,\mathcal{A}) sets in \mathcal{A} to find the sup. Let's denote these sets by A_1, A_2, \dots, A_m and let,

    Y_j = \sum_{i=1}^{n} \epsilon_i \left( 1[X_i \in A_j] - 1[X_i' \in A_j] \right);
we can then write,

    E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \right\} \le \frac{1}{n}\, E\left\{ \max_{j \le m} |Y_j| \right\} = \frac{1}{n}\, E\left\{ \max \{ Y_1, -Y_1, \dots, Y_m, -Y_m \} \right\}
Now we apply the previous Lemma by noticing that,

    E\, e^{sY_j} = E\, e^{-sY_j} \le \prod_{i=1}^{n} e^{s^2/2} = e^{n s^2/2}
and obtain,

    E\left\{ \sup_{A \in \mathcal{A}} | \nu_n\{A\} - \nu\{A\} | \right\} \le \frac{\sqrt{n}}{n}\, (2 \log 2m)^{1/2};

the result follows by noticing that m = S(2n,\mathcal{A}) \le S(n,\mathcal{A})^2.



File translated from TEX by TTH, version 3.63.
On 30 Sep 2004, 09:57.
