
I hope you find the material on this site helpful in your quest to learn.
My notes are not copyrighted in any way; in fact, you are encouraged to
copy them!
Introduction to Probability
Pattern Recognition

Learning Patterns
 Basic definitions, The Bayes Rule, An example, Learning with a teacher,
Definition of consistency, Empirical Risk Minimization and the need for
Uniform Laws of Large Numbers.
 Reference:
 Mostly from "A Probabilistic Theory of Pattern Recognition"
by Devroye, Györfi, and Lugosi. 1996. Springer.
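To make the Bayes rule concrete, here is a tiny sketch in Python. The joint distribution below is made up purely for illustration; the point is that the Bayes classifier predicts +1 exactly when the posterior eta(x) = P(Y = +1 | X = x) exceeds 1/2, and that no rule can beat its error probability (the Bayes risk).

```python
# Toy illustration of the Bayes rule (the numbers are hypothetical).
# X takes values in {0, 1}; Y takes values in {-1, +1}.

# Assumed joint distribution P(X = x, Y = y):
joint = {
    (0, -1): 0.30, (0, +1): 0.10,
    (1, -1): 0.15, (1, +1): 0.45,
}

def eta(x):
    """Posterior probability P(Y = +1 | X = x)."""
    p_plus = joint[(x, +1)]
    p_x = joint[(x, -1)] + joint[(x, +1)]
    return p_plus / p_x

def bayes_classifier(x):
    """Predict +1 exactly when eta(x) > 1/2."""
    return +1 if eta(x) > 0.5 else -1

# The Bayes risk is the probability that the Bayes classifier errs:
bayes_risk = sum(p for (x, y), p in joint.items()
                 if bayes_classifier(x) != y)

print(eta(0), eta(1))       # posteriors at x = 0 and x = 1
print(bayes_risk)           # the smallest achievable risk
```

Here eta(0) = 0.25 and eta(1) = 0.75, so the Bayes rule predicts -1 at x = 0 and +1 at x = 1, and its risk is 0.10 + 0.15 = 0.25.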
Laws of Large Numbers

Uniform Laws
 What's a Law of Large Numbers again?, The Weak Law, Markov's and
Chebyshev's inequalities, Chernoff's Method, The Strong Law,
Hoeffding's inequality, The modern Strong Uniform Laws, McDiarmid's
inequality, Enter Combinatorics.
 References: Mostly from "Combinatorial Methods in Density Estimation"
by Devroye and Lugosi. 2001. Springer.
 Classics: Chernoff 1952, with
his method. To learn more about the powerful new-age concentration
of measure phenomenon, see Talagrand's
1996 survey and the references therein.
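A quick empirical sanity check on Hoeffding's inequality: for [0,1]-valued variables, the chance that an empirical mean of n samples deviates from the true mean by more than t is at most 2 exp(-2 n t^2). The simulation below (sample sizes, seed, and t are arbitrary choices of mine) compares that bound with the observed deviation frequency for fair coin flips.

```python
# Empirical check of Hoeffding's inequality for Bernoulli(1/2) averages.
# n, t, and the number of trials are arbitrary illustrative choices.
import math
import random

random.seed(0)
n, t, trials = 200, 0.1, 2000
p = 0.5  # true mean of a fair coin

# Fraction of trials where the empirical mean misses p by more than t:
deviations = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    if abs(mean - p) > t:
        deviations += 1
empirical = deviations / trials

# Hoeffding: P(|mean - p| > t) <= 2 exp(-2 n t^2).
bound = 2 * math.exp(-2 * n * t * t)

print(empirical, bound)  # the observed frequency sits below the bound
```

With these numbers the bound is 2e^{-4}, roughly 0.037, while the observed frequency is much smaller; Hoeffding is loose but distribution-free.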
VC Dimension

Examples and Tools
 Direct computation of VC dimension for simple classes: half-lines and
intervals. General properties of shatter coefficients. Southwest sets
and rectangles in dimension d. The simplest (as far as I know) proof
of Sauer's Lemma. Upper bounds on shatter coefficients in terms of
VC dimension. The Steele-Dudley general class.
 References: The ones above plus a bit from Pollard's 1984 book:
"Convergence of Stochastic Processes". Springer.
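The "direct computation" for intervals can even be done by brute force on a computer. The sketch below (my own illustration, not from the references) enumerates the labelings of a point set realizable by intervals [a, b], using candidate endpoints built from the points themselves, and confirms that intervals shatter two points but never three, so their VC dimension is 2.

```python
# Brute-force check that intervals [a, b] on the line shatter two points
# but not three. Candidate endpoints sit at, just below, and just above
# each sample point, which suffices for interval classes.

def interval_labels(points, a, b):
    return tuple(1 if a <= x <= b else 0 for x in points)

def achievable_labelings(points):
    """All labelings of `points` realizable by some interval [a, b]."""
    cands = sorted({x + d for x in points for d in (-0.5, 0.0, 0.5)})
    labelings = {tuple(0 for _ in points)}  # the empty interval
    for a in cands:
        for b in cands:
            if a <= b:
                labelings.add(interval_labels(points, a, b))
    return labelings

def shattered(points):
    return len(achievable_labelings(points)) == 2 ** len(points)

print(shattered([0.0, 1.0]))        # two points: shattered
print(shattered([0.0, 1.0, 2.0]))   # three points: e.g. 101 is impossible
```

For three points only the labelings whose +1's are contiguous occur (7 of the 8), which is exactly what Sauer's Lemma predicts for VC dimension 2.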
Back to the Beginning: What have we learned?

PAC Bounds
 Decomposition of errors. Probably Approximately Correct bounds.
Inversion. PAC bounds for the true risk of an estimator.
The finite case. Interpretation of VC dimension and shatter coefficients
as measures of the effective size of an infinite class.
 References: The ones above plus
Boucheron, Bousquet, and Lugosi 2004
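For the finite case the "inversion" step can be carried out in a few lines. Assuming a class of N rules, Hoeffding plus a union bound give, with probability at least 1 - delta, a uniform deviation of at most sqrt(log(2N/delta)/(2n)); inverting for n tells you how many samples buy a target accuracy epsilon. The formulas are standard, but the numbers below are my own illustrative choices.

```python
# Inverting a finite-class PAC bound (a sketch).
# Hoeffding + a union bound over N rules give, w.p. >= 1 - delta,
#     sup_f |empirical risk(f) - true risk(f)| <= sqrt(log(2N/delta) / (2n)).
import math

def deviation_bound(N, n, delta):
    """Uniform deviation epsilon for N rules, n samples, confidence 1-delta."""
    return math.sqrt(math.log(2 * N / delta) / (2 * n))

def samples_needed(N, epsilon, delta):
    """Smallest n making the bound at most epsilon (the 'inversion')."""
    return math.ceil(math.log(2 * N / delta) / (2 * epsilon ** 2))

n = samples_needed(N=1000, epsilon=0.05, delta=0.05)
print(n, deviation_bound(1000, n, 0.05))
```

Note the logarithmic dependence on N: log(2N/delta) is the finite-class stand-in for the "effective size" that shatter coefficients and VC dimension measure for infinite classes.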
Linear Rules and Fat Margins

Classification with Hyperplanes
 Here is the line: the ones on this side are +1, the ones on the other
side are -1. Fat lines have better generalization power. Definition of
margin and of plane in canonical position.
 References: SVM and Kernel Methods. Slides by Bernhard Schölkopf.
learningwithkernels.org
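The margin and canonical-position definitions are easy to compute directly. In the sketch below (points and hyperplane are made-up examples of mine), (w, b) is in canonical position for the sample when min_i |<w, x_i> + b| = 1, and the geometric margin is then exactly 1/||w||.

```python
# Margin of a hyperplane in canonical position (made-up 2-D data).
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def side(w, b, x):
    """+1 on one side of the hyperplane {x : <w,x> + b = 0}, -1 on the other."""
    return 1 if dot(w, x) + b > 0 else -1

def geometric_margin(w, b, points):
    """Distance from the closest point to the hyperplane."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) + b) for x in points) / norm

# Hyperplane x1 = 0, scaled so the closest points satisfy
# |<w, x> + b| = 1 -- i.e. (w, b) is in canonical position:
w, b = (1.0, 0.0), 0.0
points = [(-1.0, 0.5), (-2.0, 1.0), (1.0, -0.5), (3.0, 2.0)]

print([side(w, b, x) for x in points])
print(geometric_margin(w, b, points))  # equals 1 / ||w||
```

Making the line "fat" means making this margin large, which in the canonical scaling is the same as making ||w|| small; that is the objective the SVM optimizes next.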
Support Vector Machines

SVMs
 Finding the hyperplane with largest margin. The Karush-Kuhn-Tucker
(KKT) conditions. The Dual Problem. The Support Vectors of SVMs.
The Quadratic Programming problem of SVMs.
 References: Bernhard Schölkopf's slides above, and
Tamás Terlaky's notes
on KKT. Examples in Octave.
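To see the KKT conditions in action, here is a hand-worked sketch on a two-point toy problem; the points, the closed-form solution, and the dual variables are my own choices for illustration, not taken from the references. With x1 = (-1, 0), y1 = -1 and x2 = (1, 0), y2 = +1, the max-margin hyperplane is w = (1, 0), b = 0, with alpha1 = alpha2 = 1/2, and the code verifies stationarity, the dual equality constraint, and complementary slackness.

```python
# Verifying the KKT conditions on a tiny hard-margin SVM (hand-worked toy).
xs = [(-1.0, 0.0), (1.0, 0.0)]
ys = [-1, 1]
w, b = (1.0, 0.0), 0.0
alphas = [0.5, 0.5]

# Stationarity: w = sum_i alpha_i * y_i * x_i
w_from_dual = tuple(
    sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs)) for j in range(2)
)

# Dual equality constraint: sum_i alpha_i * y_i = 0
balance = sum(a * y for a, y in zip(alphas, ys))

# Complementary slackness: alpha_i > 0 forces y_i(<w, x_i> + b) = 1,
# i.e. both points are support vectors sitting exactly on the margin.
margins = [y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
           for x, y in zip(xs, ys)]

print(w_from_dual)  # recovers w from the dual variables
print(balance)      # 0: the dual constraint holds
print(margins)      # both equal 1: both points are support vectors
```

In larger problems these same conditions are what the quadratic-programming solver enforces; the points with alpha_i > 0 are precisely the support vectors.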
The Kernel Trick

RKHS
 Introduction. Feature Space. Simple Example. Mercer Kernel.
 References: Bernhard Schölkopf's slides above,
Aronszajn's 1950 classic, and the
Cucker and Smale 2001
classic.
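The kernel trick in miniature: the polynomial kernel k(x, z) = (<x, z>)^2 on R^2 equals an ordinary inner product in the feature space phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). This is the standard textbook example; the test vectors below are my own.

```python
# Polynomial kernel = inner product in feature space (standard example).
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, z):
    """Polynomial kernel of degree 2 on R^2: evaluated without ever
    constructing feature vectors."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map into R^3 realizing the same inner product."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
print(k(x, z))              # kernel evaluation in R^2
print(dot(phi(x), phi(z)))  # same number, computed in feature space
```

The payoff is that k is cheap while phi can live in a huge (even infinite-dimensional, for Mercer kernels like the Gaussian) feature space that is never built explicitly.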
Regression

Regression in Hilbert Space
 Classical Regression. Kernel Regression. SV Regression.
 References: Bernhard Schölkopf's slides above,
Maple code
with an example of
how to use it, and
Alex Smola's 1998
thesis.
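Kernel regression in its simplest form is Nadaraya-Watson smoothing: estimate m(x) by a kernel-weighted average of the observed responses. The sketch below uses a Gaussian kernel on a made-up 1-D sample; the bandwidth h is an arbitrary choice of mine, not a tuned value.

```python
# Nadaraya-Watson kernel regression (a sketch on made-up data).
import math

def gaussian_kernel(u, h):
    return math.exp(-(u * u) / (2 * h * h))

def kernel_regress(x, xs, ys, h=0.5):
    """Weighted average of the y's, with weights decaying in |x - x_i|."""
    weights = [gaussian_kernel(x - xi, h) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Hypothetical training sample (roughly y = x^2):
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

print(kernel_regress(1.0, xs, ys))  # near 1, pulled up a bit by neighbors
print(kernel_regress(0.0, xs, ys))  # near 0
```

SV regression replaces this local averaging with an epsilon-insensitive loss and the same dual/kernel machinery as the SVM classifier, but the "predict by kernel-weighted combinations of the data" shape of the estimate is already visible here.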
Kernel PCA
 Any algorithm that depends on the observed data only through its Grammian
(i.e. its matrix of inner products) can be kernelized!
 Classical example: Principal Component Analysis.
 The original paper of
Schölkopf, Smola, and Müller still says it best.
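The Gram-matrix view can be sketched end to end in a few lines: double-center the Gram matrix K (which centers the feature vectors implicitly) and extract its top eigenvector by power iteration. The data below and the choice of a plain linear kernel (so this reduces to ordinary PCA) are mine, for illustration; swapping in any Mercer kernel for k is the whole "kernelization" step.

```python
# Kernel PCA through the Gram matrix: center K, then power-iterate.
import math

xs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]  # made-up data
n = len(xs)

def k(x, z):  # linear kernel here; any Mercer kernel kernelizes the rest
    return sum(a * b for a, b in zip(x, z))

K = [[k(xi, xj) for xj in xs] for xi in xs]

# Double-center K = implicitly center the feature vectors:
row = [sum(r) / n for r in K]
tot = sum(row) / n
Kc = [[K[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]

# Power iteration for the leading eigenvector of the centered Gram matrix
# (start off the all-ones vector, which centering sends to zero):
v = [1.0, 0.0, 0.0]
for _ in range(200):
    w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(c * c for c in w))
    v = [c / norm for c in w]

# Leading eigenvalue via the Rayleigh quotient:
lam = sum(v[i] * sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n))
print(lam, v)  # top principal direction, expressed in the data's coordinates
```

Notice the algorithm never touches the coordinates of the (possibly huge) feature space: everything runs on the n-by-n matrix of inner products, which is exactly the point of the remark above.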
