Optimal Recovery of Local Truth
C. C. RODRIGUEZ
Abstract
Probability mass curves the data space with horizons!. Let f be a
multivariate probability density function with continuous second
order partial derivatives. Consider the problem of estimating the
true value of f(z) > 0 at a single point z, from n independent
observations. It is shown that, the fastest possible estimators
(like the k-nearest neighbor and kernel) have minimum asymptotic
mean square errors when the space of observations is thought as
conformally curved. The optimal metric is shown to be generated by
the Hessian of f in the regions where the Hessian is definite.
Thus, the peaks and valleys of f are surrounded by singular
horizons when the Hessian changes signature from Riemannian to
pseudo-Riemannian. Adaptive estimators based on the optimal variable
metric show considerable theoretical and practical improvements over
traditional methods. The formulas simplify dramatically when the
dimension of the data space is 4. The similarities with General
Relativity are striking but possibly illusory at this point.
However, these results suggest that nonparametric density estimation
may have something new to say about current physical theory.
1 Introduction
During the past thirty years the theory of Nonparametrics has been
dominating the scene in mathematical statistics. Parallel to the
accelerating discovery of new technical results, a consensus has been
growing among some researchers in the area, that we may be witnessing a
promising solid road towards the elusive Universal Learning Machine
(see e.g. [,]).
The queen of nonparametrics is density estimation. All the fundamental
ideas for solving the new problems of statistical estimation in
functional spaces (smoothing, generalization, optimal minimax rates,
etc.) already appear in the problem of estimating the probability
density (i.e. the model) from the observed data. More over, it is now
well known that a solution for the density estimation problem
automatically implies solutions for the problems of pattern
recognition and nonparametric regression as well as for most problems
that can be expressed as a functional of the density.
In this paper I present a technical result, about optimal
nonparametric density estimation, that shows at least at a formal
level, a surprising similarity between nonparametrics and General
Relativity. Simply put,
probability mass curves the data space with horizons.
What exactly it is meant by this is the subject of this paper but
before proceeding further a few comments are in order. First of all,
let us assume that we have a set {x1,¼,xn} of data.
Each observation xj consisting of p measurements that are
thought as the p coordinates of a vector in Âp. To make the
data space into a probability space we endow Âp with the field
of Borelians but nothing beyond that. In particular no a priori metric
structure on the data space is assumed. The n observations are
assumed to be n independent realizations of a given probability
measure P on Âp. By the Lebesgue decomposition theorem, for
every Borel set B we can write,
|
P(B) = |
ó õ
|
B
|
f(x) l(dx) + n(B) |
| (1) |
where n is the singular part of P that assigns positive
probability mass to Borel sets of zero Lebesgue volume. Due to the
existence of pathologies like the Cantor set in one dimension and its
analogies in higher dimensions, the singular part n cannot be
empirically estimated (see e.g. []). Practically
all of the papers on density estimation rule out the singular part of
P a priori. The elimination of singularities by fiat has permitted
the construction of a rich mathematical theory for density estimation,
but it has also ruled out a priori the study of models of mixed
dimensionality (see []) that may be necessary for
understanding point masses and spacetime singularities coexisting with
absolutely continuous distributions.
We assume further that in the regions where f(x) > 0 the density f
is of class C2 i.e., it has continuous second order partial
derivatives.
1.1 Nonparametrics with the World in Mind
The road from Classical Newtonian Physics to the physics of today can
be seen as a path paved by an increasing use of fundamental concepts
that are statistical in nature. This is obvious for statistical
mechanics, becoming clearer for quantum theory, and appearing almost
as a shock in General Relativity. Not surprisingly there have been
several attempts to take this trend further (see e.g.
[,,,]) in the direction of
Physics as Inference.
Now suppose for a moment that in fact some kind of restatement of the
foundations of physics in terms of information and statistical
inference will eventually end up providing a way to advance our
current understanding of nature. As of today, that is either already a
solid fact or remains a wild speculation, depending on who you ask.
In any case, for the trend to take over, it will have to be able to
reproduce all the successes of current science and make new correct
predictions. In particular it would have to reproduce General
Relativity. Recall that the main lesson of General Relativity is that
space and time are not just a passive stage on top of which the
universe evolves. General Relativity is the theory that tells (through
the field equation) how to build the stage (left hand side of the
equation) from the data (right hand side of the equation). The
statistical theory that tells how to build the stage of inference (the
probabilistic model) from the observed data is: Nonparametric
Density Estimation. It is therefore reassuring to find typical
signatures of General Relativity in density estimation as this paper
does. Perhaps Physics is not just a special case of statistical
inference and all these are only coincidences of no more relevance
than for example the fact that multiplication or the logarithmic
function appear everywhere all the time. That may be so, but even in
that case I believe it is worth noticing the connection for
undoubtedly GR and density estimation have a common goal: The
dynamic building of the stage.
More formally. Let f be a multivariate probability density function
with continuous second order partial derivatives. Consider the
problem of estimating the true value of f(z) > 0 at a single point
z, from n independent observations. It is shown that, fastest
possible estimators (including the k-nearest neighbor and kernel as
well as the rich class of estimators in
[]) have minimum asymptotic mean square
errors when the space of observations is thought as conformally
curved. The optimal metric is shown to be generated by the Hessian of
f in the regions where the Hessian is definite. Thus, the peaks and
valleys of f are surrounded by horizons where the Hessian changes
signature from Riemannian to pseudo-Riemannian.
The result for the case of generalized k-nearest neighbor estimators
[] has circulated since 1988 in the form of a
technical report []. Recently I found that a special
case of this theorem has been known since 1972
[] and undergone continuous development in
the Pattern Recognition literature, (see e.g.
[,,,]).
2 Estimating Densities from Data
The canonical problem of density estimation at a point z Î Âp
can be stated as follows: Estimate f(z) > 0 from n independent
observations of a random variable with density f.
The most successful estimators of f(z) attempt to approximate the
density of probability at z by using the observations x1,¼,xn to build
both, a small volume around z and, a weight for this volume in terms
of probability mass. The density is then computed as the ratio of the
estimated mass over the estimated volume. The two classical examples
are the k-nearest neighbor (knn) and the kernel estimators.
2.1 The knn
The simplest and historically the first example of a nonparametric
density estimator is [] the knn. The knn estimator
of f(z) is defined for k Î {1,2,¼,n} as,
where lk is the volume of the sphere centered at the point
z Î Âp of radius R(k) given by the distance from z to the
kth-nearest neighbor observation. If l denotes the Lebesgue
measure on Âp we have,
where,
|
S(r) = { x Î Âp : ||x-z|| £ r } |
| (4) |
The sphere S(r) and the radius R(k) are defined relative to a
given norm, ||·|| in Âp. The stochastic behavior of the
knn depends on the specific value of the integer k chosen in
(2). Clearly, in order to achieve consistency (e.g.
stochastic convergence of hn(z) as n®¥ towards
the true value of f(z) > 0) it is necessary to choose k = k(n) as a
function of n. The volumes lk must shrink, to control the
bias, and consequently we must have k/n® 0 for hn(z)
to be able to approach a strictly positive number. On the other hand,
we must have k®¥ to make the estimator dependent on
an increasing number k of observations and in this way to control
its variance. Thus, for the knn to be consistent, we need k to
increase with n but at a rate slower than n itself.
The knn estimator depends not only on k but also on a choice of
norm. The main result of this paper follows from the characterization
of the ||·|| that, under some regularity conditions, produces
the best asymptotic (as n®¥) performance for density
estimators.
2.2 The kernel
If we consider only regular norms ||·||, in the sense that for
all sufficiently small values of r > 0,
|
l(S(r)) = l(S(1)) rp º brp |
| (5) |
then, the classical kernel density estimator can be written as:
where,
|
Mm = |
1 n
|
|
æ è
|
å
xj Î S(m)
|
Km-1(xj-z) |
ö ø
|
|
| (7) |
The smoothing parameter m = m(n) is such that k = [nmp]
satisfies the conditions for consistency of the knn, Km-1(x) = K(m-1x) where the kernel function K is a non negative
bounded function with support on the unit sphere (i.e. K(x) = 0 for
||x|| > 1) and satisfying,
Notice that for the constant kernel (i.e. K(x) = 1 for ||x|| £ 1)
the estimator (6) approximates f(z) by the proportion of
observations inside S(m) over the volume of S(m). The general
kernel function K acts as a weight function allocating different
weights Km-1(xj-z) to the xj's inside S(m). To
control bias (see () below) the kernel K is usually
taken as a decreasing radially symmetric function in the metric
generated by the norm ||·||. Thus, Km-1(xj-z)
assigns a weight to xj that decreases with its distance to z.
This has intuitive appeal, for the observations that lie closer to z
are less likely to fall off the sphere S(m), under repeated
sampling, than the observations that are close to the boundary of
S(m).
The performance of the kernel as an estimator for f(z) depends first
and foremost on the value of the smoothness parameter m. The
numerator and the denominator of gn(z) depend not only on m
but also on the norm ||·|| chosen and the form of the kernel
function K. As it is shown in theorem () these three
parameters are inter-related.
2.3 Double Smoothing Estimators
The knn (2) and the kernel (6) methods are two
extremes of a continuum. Both, hn(z) and gn(z) estimate
f(z) as probability-mass-per-unit-volume. The knn fixes the
mass to the deterministic value k/n and lets the volume
lk to be stochastic, while the kernel method fixes the
volume l(S(m)) and lets the mass Mm to be random.
The continuum gap between (2) and (6) is filled
up by estimators that stochastically estimate mass and volume by
smoothing the contribution of each sample point with different
smoothing functions for the numerator and denominator (see
[]).
Let b ³ 1 and assume, without loss of generality that bk is an
integer. The double smoothing estimators with deterministic weights
are defined as,
|
fn(z) = |
1 n
|
|
n å
i = 1
|
K |
æ ç
è
|
z-xi R(k)
|
ö ÷
ø
|
|
|
| (9) |
where,
|
wi = |
ó õ
|
i/bk
(i-1)/bk
|
w(u) du |
| (10) |
and w(·) is a probability density on [0,1] with mean c.
3 The Truth as n®¥ ?
In nonparametric statistics, in order to assess the quality of an
estimator fn(z) as an estimate for f(z), it is necessary to
choose a criterion for judging how far away is the estimator from what
it tries to estimate. This is sometimes regarded as revolting and
morally wrong by some Bayesian Fundamentalists. For once you choose a
loss function and a prior, logic alone provides you with the Bayes
estimator and the criterion for judging its quality. That is
desirable, but there is a problem in high dimensional spaces. In
infinite dimensional hypothesis spaces (i.e. in nonparametric
problems) almost all priors will convince you of the wrong thing! (see
e.g. [,] for a non-regular way
out see []). These kind of Bayesian nonparametric
results provide a mathematical proof that: almost all fundamental
religions are wrong, (more data can only make the believers more
sure that the wrong thing is true!). An immediate corollary is that:
Subjective Bayesians can't go to Heaven. Besides, the choice of
goodness of fit criterion is as ad-hoc (an equivalent) to the choice
of a loss function.
3.1 The Natural Invariant Loss Function and Why the MSE is not
that Bad
The most widely studied goodness of fit criterion is the Mean Square
Error (MSE) defined by,
where the expectation is over the joint distribution of the sample
x1,¼,xn . By adding and subtracting T = Efn(z) and expanding the square,
we can express the MSE in the computationally convenient form,
|
|
|
|
E|fn(z) - T|2 + |T - f(z)|2 |
| |
|
| (12) |
|
By integrating (11) over the z Î Âp and
interchanging E and ò (OK by Fubbini's theorem since the
integrand ³ 0) we obtain,
|
(MISE) = E |
ó õ
|
|fn(z) - f(z)|2 dz |
| (13) |
The Mean Integrated Square Error (MISE) is just the expected L2
distance of fn from f. Goodness of fit measures based on the
(MSE) have two main advantages: They are often easy to compute and
they enable the rich Hilbertian geometry of L2. On the other hand
the (MSE) is unnatural and undesirable for two reasons: Firstly, the
(MSE) is only defined for densities in L2 and this rules out
a priori all the densities in L1\L2 which is
unacceptable. Secondly, even when the (MISE) exists, it is difficult
to interpret (as a measure of distance between densities) due to its
lack of invariance under relabels of the data space. Many researchers
see the expected L1 distance between densities as the natural
loss function in density estimation. The L1 distance does in fact
exist for all densities and it is easy to interpret but it lacks the
rich geometry generated by the availability of the inner product in
L2. A clean way out is to use the expected L2
distance between the wave functions yn = Ö{fn} and
y = Öf.
Theorem 1
The L2 norm of wave functions is invariant under relabels of the
data space, i.e.,
|
|
ó õ
|
|yn(z) - y(z)|2 dz = |
ó õ
|
| |
~ y
|
n
|
(z¢) - |
~ y
|
(z¢)|2 dz¢ |
| (14) |
where z = h(z¢) with h any one-to-one smooth function.
Proof: Just change the variables. From, the change of variables
theorem the pdf of z¢ is,
|
|
~ f
|
(z¢) = f(h(z¢)) |
ê ê
ê
|
¶(h) ¶(z¢)
|
ê ê
ê
|
|
| (15) |
from where the wave function of z¢ is given by,
|
|
~ y
|
(z¢) = y(h(z¢)) |
ê ê
ê
|
¶(h) ¶(z¢)
|
ê ê
ê
|
1/2
|
|
| (16) |
Thus, making the substitution z = h(z¢) we get,
|
|
|
|
|
ó õ
|
|yn(h(z¢)) - y(h(z¢))|2 |
ê ê
ê
|
¶(h) ¶(z¢)
|
ê ê
ê
|
dz¢ |
| |
|
|
|
ó õ
|
|yn |
ê ê
ê
|
¶(h) ¶(z¢)
|
ê ê
ê
|
1/2
|
- y |
ê ê
ê
|
¶(h) ¶(z¢)
|
ê ê
ê
|
1/2
|
|2 dz¢ |
| |
|
| (17) |
|
Q.E.D.
The following theorem shows that a transformation of the MSE of a
consistent estimator provides an estimate for the expected L2
norm between wave functions.
Theorem 2
Let fn(z) be a consistent estimator of f(z). Then,
|
E |
ó õ
|
|yn - y|2 dz = |
1 4
|
|
ó õ
|
|
E|fn(z) - f(z)|2 f(z)
|
dz + (smaller order terms) |
| (18) |
Proof: A first order Taylor expansion of Öx about
x0 gives,
|
Öx - | Ö
|
x0
|
= |
1 2
|
|
(x-x0)
|
+ o((x-x0)2) |
| (19) |
Substituting x = fn(z), x0 = f(z) into (19) squaring
both sides and taking expectations we obtain,
|
E|yn(z) - y(z)|2 = |
1 4
|
|
E|fn(z)-f(z)|2 f(z)
|
+ o(E|fn(z)-f(z)|2) |
| (20) |
integrating over z and interchanging E and ò we arrive at
(18).
Q.E.D.
Proceeding as in the proof of theorem 3.1 we can show
that
where, as before, z« z¢ is any one-to-one smooth
transformation of the data space and [f\tilde] is the density of
z¢. Thus, it follows from (21) that the leading term on
the right hand side of (18) is also invariant under
relabels of the data space. The nice thing about the L2 norm of
wave functions, unlike (21), is that it is defined even
when f(z) = 0.
4 Some Classic Asymptotic Results
We collect here the well known Central Limit Theorems (CLT) for the
knn and kernel estimators together with some remarks about
nonparametric density estimation in general. The notation and the
formulas introduced here will be needed for computing the main result
about optimal norms in the next section.
Assumption 1
Let f be a pdf on Âp of class C2 with non singular
Hessian, H(z) at z Î Âp, and with f(z) > 0, i.e., the
matrix of second order partial derivatives of f at z exists, it is
non singular and its entries are continuous at z.
Assumption 2
Let K be a bounded non negative function defined on the unit sphere,
S0 = {x Î Âp: ||x|| £ 1} and satisfying,
|
|
ó õ
|
||x|| £ 1
|
K(x) dx = l(S0) º b |
| (22) | |
ó õ
|
||x|| £ 1
|
x K(x) dx = 0 Î Âp |
| (23) |
|
Theorem 3 [CLT for knn]
Under assumption 4, if k = k(n) is taken in the
definition of the knn (2) in such a way that for some a > 0
then, if we let Zn = Ök(hn(z) - f(z)) we have,
|
|
lim
n®¥
|
P(Zn £ t) = |
ó õ
|
t
-¥
|
|
1
|
exp |
æ ç
è
|
- |
(y-B(z))2 2V(z)
|
ö ÷
ø
|
dy |
| (25) |
where,
|
B(z) = |
æ ç
è
|
a[(p+4)/2p] 2f2/p(z)
|
ö ÷
ø
|
|
ì í
î
|
b-1-2/p |
ó õ
|
||x|| £ 1
|
xTH(z)x dx |
ü ý
þ
|
|
| (26) |
and,
Proof: This is a special case of [].
Theorem 4 [CLT for kernel]
Under assumptions 4, and 4 if m = m(n)
is taken in the definition of the kernel (6) in such a
way that for some a > 0, k = [nmp] satisfies (24)
then, if we let Zn = Ök(gn(z) - f(z)) we have
(25) where now,
|
B(z) = |
æ ç
è
|
a[(p+4)/2p] 2
|
ö ÷
ø
|
|
ì í
î
|
b-1 |
ó õ
|
||x|| £ 1
|
xTH(z)x K(x) dx |
ü ý
þ
|
|
| (28) |
and,
|
V(z) = f(z) |
ì í
î
|
b-2 |
ó õ
|
||x|| £ 1
|
K2(x) dx |
ü ý
þ
|
|
| (29) |
Proof: The sample x1,¼,xn is assumed to be iid f and therefore the
kernel estimator gn(z) given by (6) and (7)
is a sum of iid random variables. Thus, the classic CLT applies and we
only need to verify the rate (24) and the asymptotic
expressions for the bias (28) and variance
(29). We have,
|
|
|
|
|
1 bmp
|
|
1 n
|
|
n å
j = 1
|
|
ó õ
|
K |
æ ç
è
|
xj-z m
|
ö ÷
ø
|
f(xj) dxj |
| (30) | |
|
|
|
1 bmp
|
|
ó õ
|
K(y) f(z+my)mp dy |
| (31) | |
|
|
|
ó õ
|
|
K(y) b
|
|
ì í
î
|
f(z) + mÑf(z)·y + |
m2 2
|
yTH(z)y + o(m2) |
ü ý
þ
|
dy |
| (32) | |
|
| f(z) + |
m2 2b
|
|
ó õ
|
yTH(z)y K(y) dy + o(m2) |
| (33) |
|
where we have changed the variables of integration to get
(31), used assumption 4 and Taylor's theorem
to get (32) and used assumption 4 to obtain
(33). For the variance we have,
|
|
|
| (34) | |
|
|
|
1 nb2m2p
|
|
ì í
î
|
|
ó õ
|
||y|| £ 1
|
f(z+my) K2(y) mpdy |
| |
|
|
- |
æ è
|
ó õ
|
||y|| £ 1
|
f(z+my)K(y) mpdy |
ö ø
|
2
|
ü ý
þ
|
|
| (35) | |
|
|
f(z) nb2mp
|
|
ó õ
|
||y|| £ 1
|
K2(y)dy + o |
æ ç
è
|
1 nmp
|
ö ÷
ø
|
|
| (36) |
|
where we have used var(K) = EK2-(EK)2 and changed the
variables of integration to get (35), used
assumption 4 and (0th order) Taylor's theorem to get
(36). Hence, the theorem follows from (33) and
(36) after noticing that (24) and k = nmp
imply,
|
|
|
|
k[(4+p)/2p]n-2/p = (n-[4/(p+4)]k)[(p+4)/2p] ® a[(p+4)/2p] |
| (37) | |
|
| (38) |
|
Q.E.D.
Theorem 5 [CLT for double smoothers]
Consider the estimator fn(z) defined in (9).
Under assumptions 4, 4, and (24)
if we let Zn = Ök(fn(z) - f(z)) we have
(25) where now,
|
B(z) = |
æ ç
è
|
a[(p+4)/2p] 2[bf(z)]2/p
|
ö ÷
ø
|
b-1 |
ì í
î
|
ó õ
|
||x|| £ 1
|
xTH(z)x [K(x)+l0] dx |
ü ý
þ
|
|
| (39) |
and,
|
V(z) = f2(z) |
ì í
î
|
b-1 |
ó õ
|
||x|| £ 1
|
K2(x) dx - l1 |
ü ý
þ
|
|
| (40) |
with,
|
|
|
|
|
b2/p c
|
|
ó õ
|
1
0
|
u1+[2/p] w(u) du - 1 |
| (41) | |
|
| 1 - c-2b-1 |
ó õ
|
1
0
|
|
ì í
î
|
ó õ
|
1
y
|
w(x) dx |
ü ý
þ
|
2
|
dy |
| (42) |
|
Proof: See []. Remember to
substitute K by b-1K since in the reference the Kernels are
probability densities and in here we take them as weight functions
that integrate to b.
4.1 Asymptotic Mean Square Errors
Let fn be an arbitrary density estimator and let
Zn = Ök(fn(z)-f(z)). Now suppose that fn(z) is
asymptotically normal, in the sense that when k = k(n) satisfies
(24) for some a > 0, we have (25) true. Then,
all the moments of Zn will converge to the moments of the
asymptotic Gaussian. In particular the mean and the variance of
Zn will approach B(z) and V(z) respectively. Using,
(12) and (24) we can write,
|
|
lim
n®¥
|
n4/(p+4) E|fn(z)-f(z)|2 = |
V(z) a
|
+ |
B2(z) a
|
|
| (43) |
We call the right hand side of (43) the asymptotic mean
square error (AMSE) of the estimator fn(z). The value of a can
be optimized to obtain a global minimum for the (AMSE) but it is well
known in nonparametrics that the rate n-4/(p+4) is best possible
(in a minimax sense) under the smoothness asumption 4
(see e.g. []). We can take care of the knn, the
kernel, and the double smoothing estimators simultaneously by noticing
that in all cases,
has a global minimum of,
|
(AMSE)* = |
ì í
î
|
(1+4/p) |
æ ç
è
|
p 4
|
ö ÷
ø
|
[4/(p+4)]
|
ü ý
þ
|
a1[4/(p+4)]a2[p/(p+4)] |
| (45) |
achieved at,
|
a* = |
æ ç
è
|
pa1 4a2
|
ö ÷
ø
|
[p/(p+4)]
|
|
| (46) |
Replacing the corresponding values for a1 and a2
for the knn, for the kernel, and for the double smoothing estimators,
we obtain that in all cases,
|
(AMSE)* = (const. indep. of f) |
ì í
î
|
f(z) |
æ ç
è
|
D2 f(z)
|
ö ÷
ø
|
[p/(p+4)]
|
ü ý
þ
|
|
| (47) |
where,
|
|
|
|
|
ó õ
|
||x|| £ 1
|
xTH(z)x G(x) dx |
| (48) | |
|
| (49) |
|
with G(x) = 1 for the knn, G(x) = K(x) for the kernel,
G(x) = K(x)+l0 for the double smoothers (see (39)
and (41)) and, if ej denotes the jth canonical
basis vector (all zeroes except a 1 at position j),
|
rj = |
ó õ
|
||x|| £ 1
|
(x·ej)2 G(x) dx |
| (50) |
Notice that (49) follows from (48),
(23) and the fact that H(z) is the Hessian of f at
z. The generality of this result shows that (47) is
typical for density estimation. Thus, when fn is either the knn,
the kernel, or one of the estimators in (
rodriguez86), we have:
|
|
lim
n®¥
|
n4/(p+4)E|fn(z)-f(z)|2 ³ c f(z) |
æ ç
è
|
D2 f(z)
|
ö ÷
ø
|
[p/(p+4)]
|
|
| (51) |
The positive constant c may depend on the particular estimator but
it is independent of f. Dividing both sides of (51) by
f(z), integrating over z, using theorem 3.1 and
interchanging E and ò we obtain,
|
|
lim
n®¥
|
n4/(p+4)E |
ó õ
|
|yn(z)-y(z)|2 dz ³ 4c |
ó õ
|
|
ê ê
ê
|
D y(z)
|
ê ê
ê
|
[2p/(p+4)]
|
dz |
| (52) |
The worst case scenario is obtained by the model f = y2 that
maximizes the action given by the right hand side of
(52),
|
L = |
ó õ
|
|
ê ê
ê
|
|
1 y(z)
|
|
p å
j = 1
|
rj |
¶2y2 ¶zj2
|
(z) |
ê ê
ê
|
[2p/(p+4)]
|
dz |
| (53) |
This is a hard variational problem. However, it is worth noticing that
the simplest case is obtained when the exponent is 1, i.e. when the
dimension of the data space is p = 4. Assuming we were able to find a
solution, this solution would still depend on the p parameters
r1,¼,rp. A choice of rj's is equivalent to
the choice of a global metric for the data space. Notice also, that
the exponent becomes 2 for p = ¥ and that for p ³ 3 (but not
for p = 1 or 2) there is the possibility of non trivial (i.e.
different from uniform) super-efficient models for which estimation can
be done at rates higher than n-4/(p+4). These super-efficient
models are characterized as the non negative solutions of the Laplace
equation in the metric generated by the rj's, i.e., non
negative (f(z) ³ 0) solutions of,
|
|
p å
j = 1
|
rj |
¶2f ¶zj2
|
(z) = 0 |
| (54) |
Recall that there are no non trivial (different from constant) non
negative super-harmonic functions in dimensions one or two but there
are plenty of solutions in dimension three and higher. For example the
Newtonian potentials,
with the norm,
|
||z||r2 = |
p å
j = 1
|
|
æ ç ç
ç è
|
zj
|
ö ÷ ÷
÷ ø
|
2
|
|
| (56) |
will do, provided the data space is compact. The existence of (hand
picked) super-efficient models is what made necessary to consider best
rates only in the minimax sense. Even though we can estimate a
Newtonian potential model at faster than usual nonparametric rates, in
any neighborhood of the Newtonian model the worst case scenario is at
best estimated at rate n-4/(p+4) under second order smoothness
conditions.
5 Choosing the Optimal Norm
All finite (p < ¥) dimensional Banach spaces are isomorphic (as
Banach spaces) to Âp with the euclidian norm. This means, among
other things, that in finite dimensional vector spaces all norms
generate the same topology. Hence, the spheres {x Î Âp:||x|| £ r} are Borelians so they are Lebesgue measurable and thus,
estimators like the knn (2) are well defined for arbitrary
norms. It is possible, in principle, to consider norms that are not
coming from inner products but in this paper we look only at Hilbert
norms ||·||A of the form,
where A Î L with L defined as the open set of real
non-singular p×p matrices. For each A Î L define the
unit sphere,
|
SA = {x Î Âp: xTATAx £ 1} |
| (58) |
its volume,
and the A-symmetric (i.e. ||·||A radially symmetric)
kernel, KA,
where K satisfies assumption 4 and it is I-symmetric,
i.e., radially symmetric in the euclidian norm. This means that K(y)
depends on y only through the euclidian length of y, i.e. there
exists a function F such that,
The following simple theorem shows that all A-symmetric functions
are really of the form (60).
Theorem 6
For any A Î L, [K\tilde] is A-symmetric if and only if we
can write
|
|
~ K
|
(x) = K(Ax) for all x Î Âp |
| (62) |
for some I-symmetric K.
Proof: [K\tilde](x) is A-symmetric iff [K\tilde](x) = F(||x||A2) for some function F. Choose
K(x) = [K\tilde](A-1x). This K is I-symmetric since
K(x) = F((AA-1x)T(AA-1x)) = F(xTx). More over,
[K\tilde](x) = [K\tilde](A-1(Ax)) = K(Ax). Thus,
(62) is necessary for A-symmetry. It is also obviously
sufficient since the assumed I-symmetry of K in (62)
implies that [K\tilde](x) = F((Ax)T(Ax)) = F(||x||A2) so it
is A-symmetric.
Q.E.D.
An important corollary of theorem 5 is,
Theorem 7
Let A,B Î L. Then, [K\tilde] is AB-symmetric if and only
if [K\tilde]B-1 is A-symmetric.
Proof: By the first part of theorem 5 we have
that [K\tilde] = K°A°B with K some I-symmetric. Thus,
[K\tilde]°B-1 = K°A is A-symmetric by
the second part of theorem 5.
Q.E.D.
Let us denote by b(A,K) the total volume that a kernel K
assigns to the unit A-sphere SA, i.e.,
In order to understand the effect of changing the metric on a density
estimator like the kernel (6), it is convenient to write
gn explicitly as a function of everything it depends on; The
point z, the metric A, the A-symmetric kernel function
[K\tilde], the positive smoothness parameter m and, the data set
{x1,¼,xn}. Hence, we define,
|
gn(z;A, |
~ K
|
,m,{x1,¼,xn}) = |
1 n
|
|
n å
j = 1
|
|
~ K
|
|
æ ç
è
|
|
xj-z m
|
ö ÷
ø
|
|
|
| (64) |
The following result shows that kernel estimation with metric AB is
equivalent to estimation of a transformed problem with metric A.
The explicit form of the transformed problem indicates that a
perturbation of the metric should be regarded as composed of two
parts: Shape and volume. The shape is confounded with the form of the
kernel and the change of volume is equivalent to a change of the
smoothness parameter.
Theorem 8
Let A,B Î L, m > 0, and [K\tilde] an AB-symmetric
kernel. Then, for all z Î Âp and all data sets
{x1¼,xn} we have,
|
gn(z;AB, |
~ K
|
,m,{x1,¼,xn}) = gn( |
^ B
|
z;A, |
~ K
|
°B-1,|B|-1/pm, { |
^ B
|
x1,¼, |
^ B
|
xn}) |
| (65) |
where |B| denotes the determinant of B and [^B] = |B|-1/pB
is the matrix B re-scaled to have unit determinant.
Proof: To simplify the notation let us denote,
Substituting AB for A in (64) and using
theorem 5 we can write the left hand side of
(65) as,
|
|
1 n
|
|
n å
j = 1
|
K |
æ ç
è
|
AB |
æ ç
è
|
xj-z m
|
ö ÷
ø
|
ö ÷
ø
|
|
= |
1 n
|
|
n å
j = 1
|
(K°A) |
æ ç ç
ç è
|
|
mB
|
ö ÷ ÷
÷ ø
|
b(A,K°A) (mB)p
|
|
|
where K is some I-symmetric kernel and we have made the change of
variables x = B-1y in b(AB,[K\tilde]) (see (63)
). The last expression is the right hand side of (65).
Notice that, K°A = [K\tilde]B-1 is
A-symmetric.
Q.E.D.
5.1 Generalized knn Case with Uniform Kernel
In this section we find the norm of type (57) that
minimizes (47) for the estimators of the knn type with
uniform kernel which include the double smoothers with K(x) = 1. As it
is shown in theorem 5 a change in the determinant of
the matrix that defines the norm is equivalent to changing the
smoothness parameter. The quantity (57) to be minimized
was obtained by fixing the value of the smoothness parameter to the
best possible, i.e. the one that minimizes the expression of the
(AMSE) (43). Thus, to search for the best norm at a fix
value of the smoothness parameter we need to keep the determinant of
the matrix that defines the norm constant. We have,
Theorem 9
Consider the problem,
|
|
min
|A| = 1
|
|
æ è
|
ó õ
|
SA
|
xTH(z)x dx |
ö ø
|
2
|
|
| (67) |
where the minimum is taken over all p×p matrices with
determinant one, SA is the unit A-ball and H(z) is the
Hessian of the density f Î C2 at z which is
assumed to be nonsingular.
When H(z) is indefinite, i.e. when H(z) has both positive and
negative eigenvalues, the objective function in (67)
achieves its absolute minimum value of zero when A is taken as,
|
A = c-1 diag( |
æ ú
Ö
|
|
,¼, |
æ ú
Ö
|
|
, |
æ ú
Ö
|
|
,¼, |
æ ú
Ö
|
|
) M |
| (68) |
where the xj are the absolute value of the eigenvalues of
H(z), m is the number of positive eigenvalues, M is the matrix of
eigenvectors and c is a normalization constant to get detA = 1
(see the proof for more detailed definitions).
If H(z) is definite, i.e. when H(z) is either positive or
negative definite, then for all p×p real matrices A with
detA = 1 we have,
|
|
ê ê
|
|
ó õ
|
SA
|
xTH(z)x dx |
ê ê
|
³ |
2pp p+3
|
p |
ê ê
|
|
det
| H(z) |
ê ê
|
1/p
|
|
| (69) |
with equality if and only if,
|
A = |
H+1/2(z) |H+1/2(z)|1/p
|
|
| (70) |
where H+1/2(z) denotes the positive definite square-root of
H(z) (see the proof below for explicit definitions).
Proof: Since f Î C2 the Hessian is a real
symmetric matrix and we can therefore find an orthogonal matrix M
that diagonalizes H(z), i.e. such that,
|
H(z) = MT D M with MTM = I |
| (71) |
where,
|
D = diag(x1,x2,¼,xm,-xm+1,¼, -xp) |
| (72) |
where all the xj > 0 and we have assumed that the columns
of M have been ordered so that all the m positive eigenvalues
appear first and all the negative eigenvalues
-xm+1,¼,-xp appear last so that (71)
agrees with (72).
Split D as,
|
|
|
|
diag(x1,¼,xm,0,¼,0) - diag(0,¼,0,xm+1,¼,xp) |
| |
|
| (73) |
|
and use (71) and (73) to write,
|
|
|
| |
|
|
(D+1/2M)T(D+1/2M) - (D-1/2M)T(D-1/2M) |
| |
|
| (74) |
|
Using (74) and the substitution y = Ax we obtain,
|
|
|
|
|
ó õ
|
yTy £ 1
|
yT(A-1)T (S+TS+ - S-TS- ) A-1y dy |
| |
|
|
ó õ
|
yTy £ 1
|
< SA-1y, SA-1y > dy |
| (75) |
|
where,
|
S = S+ + S- = (D+ + D-)1/2 M |
| (76) |
and < .,. > denotes the pseudo-Riemannian inner product,
|
< x,y > = |
m å
i = 1
|
xiyi - |
p å
i = m+1
|
xiyi |
| (77) |
By letting G = diag(1,¼,1,-1,¼,-1) (i.e. m ones followed
by p-m negative ones) be the metric with the signature of H(z) we
can also write (77) as,
Let,
where b1,¼,bp denote the columns of B.
Substituting (79) into (75), using the linearity
of the inner product and the symmetry of the unit euclidian sphere we
obtain,
|
|
|
| |
|
|
|
å
j
|
|
å
k
|
< bj,bk > |
ó õ
|
SI
|
yjyk dy |
| (80) | |
|
| |
|
| (81) |
|
where r stands for,
|
r = |
ó õ
|
SI
|
(y1)2 dy = |
2pp p+3
|
|
| (82) |
From (79) and (81) it follows that problem
(67) is equivalent to,
|
|
min
|B| = |S|
|
|
æ è
|
p å
j = 1
|
< bj,bj > |
ö ø
|
2
|
|
| (83) |
When H(z) is indefinite, i.e. when m Ï {0,p} it is
possible to choose the columns of B so that
åj < bj,bj > = 0 achieving the global
minimum. There are many possible choices but the simplest one is,
|
B = c · diag( |
|
| ___ Öp-m
|
, |
| ___ Öp-m
|
,¼, |
| ___ Öp-m
|
m
|
, |
Öm,Öm,¼,Öm p-m
|
) |
| (84) |
since,
|
|
p å
j = 1
|
< bj,bj > = c2 m ( |
| ___ Öp-m
|
)2 - c2 (p-m) (Öm)2 = 0. |
| (85) |
The scalar constant c needs to be adjusted to satisfy the
constraint that detB = detS. From
(79), (84) and (76) we obtain
that the matrix for the optimal metric can be written as,
|
A = B-1S = |
c-1
|
S+ + |
c-1 Öm
|
S- |
| (86) |
From (86) we get,
|
ATA = |
c-2 p-m
|
S+TS+ + |
c-2 m
|
S-TS- |
| (87) |
Finally from (74) we can rewrite (87) as,
|
|
|
|
c-2 MT |
æ ç
è
|
|
1 p-m
|
D+ + |
1 m
|
D- |
ö ÷
ø
|
M |
| (88) | |
|
| c-2 MT diag( |
x1 p-m
|
,¼, |
xm p-m
|
, |
xm+1 m
|
,¼, |
xp m
|
) M |
| (89) |
|
Notice that when p-m = m (i.e. when the number of positive equals
the number of negative eigenvalues of H(z)) the factor 1/m can be
factorized out from the diagonal matrix in (89) and in this
case the optimal A can be expressed as,
|
A = |
H+1/2(z) |H+1/2(z)|1/p
|
|
| (90) |
where we have used the positive square-root of H(z) defined as,
|
H+1/2(z) = diag( | Ö
|
x1
|
,¼, | Ö
|
xp
|
)M |
| (91) |
In all the other cases for which H(z) is indefinite, i.e. when
m Ï {0,p/2,p} we have,
|
A = c-1 diag( |
æ ú
Ö
|
|
,¼, |
æ ú
Ö
|
|
, |
æ ú
Ö
|
|
,¼, |
æ ú
Ö
|
|
) M |
| (92) |
The normalization constant c is fixed by the constraint that
detA = 1 as,
|
c = (p-m)-[m/2p] m-[((p-m))/2p] | |
det
| H(z)|[1/2p] |
| (93) |
This shows (68).
Let us now consider the only other remaining case when H(z) is
definite, i.e. either positive definite (m = p) or negative definite
(m = 0). Introducing l0 as the Lagrange multiplier
associated to the constraint detB = detS we obtain that the
problem to be solved is,
|
|
min
b1,¼bp,l0
|
L(b1,b2,¼,bp,l0) |
| (94) |
where the Lagrangian L is written as a function of the
columns of B as,
|
L(b1,b2,¼,bp,l0) = |
æ è
|
p å
j = 1
|
< bj,bj > |
ö ø
|
2
|
- 4 l0 ( |
det
| (b1,¼,bp)- |
det
| S) |
| (95) |
The -4l0 instead of just l0 is chosen to simplify
the optimality equations below. The optimality conditions are,
|
|
¶L ¶bj
|
= 0 for j = 1,¼,p and |
¶L ¶l0
|
= 0 |
| (96) |
where the functional partial derivatives are taken in the Fréchet
sense with respect to the column vectors bj. The Fréchet
derivatives of quadratic and multi linear forms are standard text-book
exercises. Writing the derivatives as linear functions of the vector
parameter h we have,
|
|
|
| (97) | |
|
|
det
| (b1,¼, |
h j-th col.
|
,¼,bp) |
| (98) |
|
Thus, using (97) and (98) to compute the
derivative of (95) we obtain that for all h and all
j = 1,¼,p we must have,
|
|
¶L ¶bj
|
(h) = 2 |
ì í
î
|
p å
k = 1
|
< bk,bk > |
ü ý
þ
|
2 < bj,h > - 4l0 |
det
| (b1,¼,h,¼,bp) = 0 |
| (99) |
When åk < bk,bk > ¹ 0 we can rewrite (99)
as,
|
< bj,h > = c0-1 |
det
| (b1,¼,h,¼,bp) |
| (100) |
But now we can substitute h = bi with i ¹ j into
(100) and use the fact that the determinant of a matrix
with two equal columns is zero, to obtain,
|
< bj,bi > = 0 for all i ¹ j. |
| (101) |
In a similar way, replacing h = bj into (100), we get
|
< bj,bj > = c0-1 |
det
| B = c |
| (102) |
where c is a constant that needs to be fixed in order to satisfy
the constraint that detB = detS.
We have shown that the optimal matrix B must have orthogonal columns
of the same length for the G-metric. This can be expressed with a
single matrix equation as,
Substituting (79) into (103) and re-arranging terms
we obtain,
|
|
|
| (104) | |
|
|
c-1 (S+T + S-T) G (S+ + S-) |
| |
|
| |
|
| (105) |
|
From (105), (103), (82) and
(81) we obtain,
|
|
ê ê
|
|
ó õ
|
SA
|
xT H(z) x dx |
ê ê
|
³ rp |c| |
| (106) |
and replacing the values of r and c we obtain
(69).
Q.E.D.
5.2 Yet Another Proof When The Hessian is Definite
Consider the following lemma.
Lemma 1
Let A,B be two p×p non-singular matrices with the same
determinant. Then
|
|
ó õ
|
SA
|
||x||2B dx ³ |
ó õ
|
SB
|
||y||2B dy |
| (107) |
Proof:
Just split SA and SB as,
and write,
|
|
ó õ
|
SA
|
||x||2B dx = |
ó õ
|
SB
|
||x||2B dx - |
ó õ
|
ScASB
|
||x||2B dx + | |
|