Other available formats: [.PDF] [.TEX]

# Learning Patterns

## Set up

We assume labeled data X Î Rd with label Y Î {0,1}. For example X=x could be the vector of pixels of a digital image of a human face and Y=1 may mean that x is the face of a woman instead of a man. The random vector (X,Y) has distribution specified by, for example, a probability measure m for X and the regression function h(x) of Y on X. We have,

 P{X Î A} = óõ A m(dx)
and

 h(x) = P{Y=1 | X=x}
Clearly, by the sum and product rules of probability, we have

 P{(X,Y) Î C} = óõ C0 (1-h(x)) m(dx) + óõ C1 h(x) m(dx)
where C0 contains all the x's such that (x,0) Î C. The joint distribution of (X,Y) can be specified in other ways. Instead of picking first x according to m and then flipping a coin with probability h(x) to determine the label, we could also do it backwards.
First choose a label with (prior) probability P{Y=1} = p and then choose x with density (assuming there is one) f0 or f1 depending on the previously generated label y. Hence, for y Î {0,1}

 P{X Î A | Y=y} = óõ A fy(x) dx
By conditioning of Y we obtain,

 P{X Î A} = (1-p) óõ A f0(x) dx + p óõ A f1(x) dx = óõ A m(dx)
Thus, if m has a density f(x) = m(dx)/dx it must be a mixture,

 f(x) = (1-p) f0(x) + pf1(x)
Also, by the holy theorem,

 h(x) = P{Y=1 | X=x} = f1(x)p f(x)

## The Bayes Classifier

A classifier is a map d: Rd ® {0,1} that assigns 0-1 labels to data vectors x Î Rd. The quality of a classifier d is measured by its risk R(d) which is simply its probability of error,

 R(d) = P{d(X) ¹ Y}
The Bayes rule is the classifier d* with smallest possible risk, so that for all d,

 R(d*) £ R(d)
We call R*=R(d*) the Bayes risk. It provides a measure of difficulty for the pattern recognition task at hand. It is a characteristic of the joint distribution of (X,Y).
A standard result of statistical decision theory is that the Bayes rule for the all-nothing (i.e. 0-1) loss function is the mode of the posterior distribution. Thus, d*(x) = 1[h(x) > 1/2].
Here is an easy proof for the special case of pattern recognition. Let A(d) = {x : d(x) = 1}. Since f1 is a density that integrates to 1, we have:

 R(d) = (1-p) óõ A(d) f0(x) dx + p(1 - óõ A(d) f1(x) dx)
Which can be written as,

 R(d) = p+ óõ [(1-p) f0(x) - pf1(x)] 1A(d)(x) dx
Notice that the rule claimed to be Bayes, assigns 1 when h(x) > 1/2 and this is equivalent to require the expression in square brackets above to be negative.
Now denote by Q(x,d) the function that is being integrated above and show that for all x Î Rd, Q(x,d*) £ Q(x,d) (just consider each of the four cases of x in or outside A(d) and A(d*) separately). It then follows, by the last equation above and the monotonicity of integrals, that d* has the smallest possible probability of error and so it is the Bayes rule as claimed.

## An Example

To make things concrete consider the simple case of deciding wether an observed real number x was generated by a gaussian with mean m0 and variance 1 or a gaussian with mean m1 and variance 1, and the two choices are considered a priori equally likely (see figure).
We want to compute the Bayes rule when the distribution of (X,Y) is characterized by p = 1/2 and fy = N(my,1). For this case,

 A(d*) = { x : f0(x) < f1(x) } = { x : - (x-m0)2 2 < - (x-m1)2 2 }
and simplifies to,

 A(d*) = { x : |x-m1| < |x-m0| }
This shows that the best rule is to choose the distribution with mean closest to the observation x.
If we assume that m0 < m1 and let m=(m0+m1)/2 be the middle point, then the Bayes risk R* is,

 R* = 1 2 óõ ¥ m j(x-m0) dx + 1 2 óõ m -¥ j(x-m1) dx
By letting z=x-m0 in the first integral, z=x-m1 in the second and defining 2a=(m1-m0) we obtain a much simpler form,

 R* = óõ ¥ a j(z) dz
depending only on the separation between the two means. Here are some values,
 |m1-m0| R* 0.1 0.480 0.5 0.401 1.0 0.309 2.0 0.159 4.0 0.023 6.0 0.001

## Learning Patterns from a Teacher

If we knew the distribution of (X,Y) we could compute d*, the Bayes rule, as shown above. Unfortunately, we rearly know the distribution of (X,Y). This creates two problems. First, we cannot directly use d* since it depends on the parameters of the unknown distribution of (X,Y). We need to know either the sign of (h(x) - 1/2) or the sign of [(1-p)f0(x)-pf1(x)]. Second, we can not even evaluate the risk R(d) of any classifier d for that also depends on the unknown distribution of (X,Y).
But if we have available n independent observations (X1,Y1), (X2,Y2),¼, (Xn,Yn) of the vector (X,Y) then we could use them to estimate the missing distribution of (X,Y) with the empirical distribution based on the observations. This could be interpreted as past data that was labeled by a reliable teacher.
Thus, we could approximate the true probability P{(X,Y) Î C} with the observed frequency given by the empirical measure nn,

 nn(C) = 1 n nå i=i 1[(Xi,Yi) Î C]
The true risk R(d), for a given classifier d, is estimated from the observed data with the empirical risk,

 ^ R n (d) = 1 n nå i=1 1[d(Xi) ¹ Yi]
There is then the possibility of evaluating the quality of a d with its observed frequency of errors [^R]n on the available data Dn=(X1,Y1, ¼, Xn,Yn). But we should always keep in mind that this performance is different for different data sequences. A classifier that is built from given data Dn has as its true risk,

 Rn = P{ dn(X;Dn) ¹ Y | Dn }

### Consistency

It is desirable to be able to quantify not just how good a classifier is on a fix data sequence Dn but to also know how a given classification method or rule behaves on most data sequences. In particular the notion of a consistent sequence {dn} of classifiers is simply the requirement that,

 E Rn = P{ dn(X;Dn) ¹ Y } ® R*   as n® ¥.
A classification rule may be consistent for some distributions of (X,Y) but not for others. If however it is consistent for all of them we say that it is universally consistent. Are there any known universally consistent rules? Yes! for example the k-nearest neighbor rule (knn for short) is one of them. The knn is simply the rule that assigns to x the most common observed label among its knn's (in any metric topologically equivalent to the euclidean). Universal consistency for the knn is achieved only when k=k(n) is such that k® ¥ and k/n ®0 as n® ¥. These are the good news. The bad news: the convergence can be arbitrarily slow! Consistent rules with a given rate of convergence for all distributions of (X,Y) do not exist. This is interesting. On the one hand we know that given enough independent examples it is possible to get arbitrarily close to the performance of the Bayes rule. On the other hand it may take arbitrarily long for that to happen.
There are two ways out of this impasse. Use prior information to constraint the possible probability distributions for (X,Y) or change the target R* by considering only a subclass C of classifiers. We take the second route first and show that there are classes C that provide universal rates of convergence to the best dC* for that class, i.e.,

 R(dC*) £ R(d)   for all   d Î C

## Empirical Risk Minimization

Let us assume that we have data Dn as above from where we can estimate the true risk of a classifier d by using the observed frequency of errors on Dn, i.e., by its empirical risk. Let dn* be the classifier that minimizes the empirical risk over a given class C of rules, i.e.,

 ^ R n (dn*) £ ^ R n (d)   for all  d Î C
We would like to know what kinds of classes C allow universal empirical learning, in the sense that for all distributions of (X,Y),

 R(dn*) ® R(dC*)   as  n ® ¥
It is intuitively clear that what's needed is some kind of constraint on the "size" (a better term would be capacity) of C. For example if C contains all possible classification functions, then no matter what the sequence Dn is, we can always find a function in this class that fits the data perfectly. However we also feel intuitively that such a rule will over fit the observed data and it will show poor performance on sequences other than the observed Dn, i.e. we expect such rule to have poor generalization power. This intuitive reasoning will be rigorously confirmed by means of the celebrated Vapnik-Chervonenkis theory.
We first notice the following simple result:
Lemma 1
 R(dn*) - inf d Î C R(d) £ 2 sup d Î C | ^ R n (d) - R(d) |
and

 | ^ R (dn*) - R(dn*) | £ sup d Î C | ^ R n (d) - R(d) |
Proof:
 R(dn*) - inf d Î C R(d)  =   R(dn*) - ^ R (dn*) + ^ R (dn*) - inf d Î C R(d)

 £ sup d Î C | R(d)- ^ R n (d) | + sup d Î C | ^ R n (d) - R(d) |
The second part is trivial.
The above lemma shows that in order to achieve universal consistency on a given class, it is sufficient for the supremum appearing on the rhs of the inequalities to go to zero with probability 1, i.e., all we need to show is strong uniform convergence, over the given class, of empirical error to the true error.

File translated from TEX by TTH, version 3.63.
On 19 Sep 2004, 12:06.

http://omega.albany.edu:8008/