A bitnet is a dag of binary nodes representing a manifold of probability
distributions for a sequence of binary variables. Bitnets are
Riemannian manifolds of finite volume when Fisher information is used as
the metric.
I compute the exact volumes for several interesting topologies, including
the complete graph, the directed line of n nodes,
the exploding star (naive Bayes), and its reverse, the collapsing star.
I show a fast algorithm for approximating the volumes of general bitnets.
Computer experiments show that the average Ricci scalar of bitnets is
always half an integer. Bitnets typically have singular boundaries,
obtained when some of the conditional probabilities for the binary
variables are either zero or one.
At the singular points the Ricci scalar diverges to negative infinity.
A bitnet is a regular statistical model for a sequence of binary variables.
The joint probabilities of the sequence are efficiently
described by a directed acyclic graph (dag) whose vertices are the variables
and whose directed edges indicate stochastic influence. Figure
(1) shows four examples of bitnets of 3 variables.
The assumption of a network of stochastic influences for the
variables (i.e., a bitnet) allows the following useful factorization of the
joint probabilities,

p(x_V) = \prod_{i \in V} p(x_i \mid x_{pa(i)}),    (1)

where V is the set of vertices of the bitnet, pa(i) \subset V is the set of
indices of the parents of x_i, and for A \subset V we
denote by x_A the set of bits with indices in A.
It follows from (1) that all the joint probabilities are obtained
from the values of p(x_{i}|x_{pa(i)}). There are 2^{|pa(i)|} possible
values for x_{pa(i)} and therefore we need that many independent parameters
in (0,1) for the i-th node. Thus, the total number of independent
parameters in (0,1) necessary to specify all the joint probabilities of
a bitnet is,

d = \sum_{i \in V} 2^{|pa(i)|}.    (2)

For example, the dimensions of the four bitnets in figure (1)
are (from left to right): 7, 5, 5, and 6.
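The dimension formula (2) is easy to check by machine. The sketch below is my illustration, not part of the paper: it encodes each bitnet of figure (1) as a map from node to parent list and reproduces the four dimensions above.

```python
# Dimension d = sum over nodes of 2^{|pa(i)|}, eq. (2).
# A bitnet is encoded as a dict: node -> list of parent nodes.

def bitnet_dim(parents):
    """Number of free parameters of a bitnet given its parent sets."""
    return sum(2 ** len(pa) for pa in parents.values())

# The four bitnets of figure (1), on nodes 1, 2, 3:
K3 = {1: [], 2: [1], 3: [1, 2]}   # complete dag
L3 = {1: [], 2: [1], 3: [2]}      # directed line
E3 = {1: [], 2: [1], 3: [1]}      # exploding star
C3 = {1: [], 2: [], 3: [1, 2]}    # collapsing star

dims = [bitnet_dim(g) for g in (K3, L3, E3, C3)]
print(dims)  # [7, 5, 5, 6]
```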
This paper studies the geometry and topology of the hypothesis space of
all the probability distributions of a sequence of binary variables that
satisfy the network of stochastic dependencies encoded by a bitnet.
By (1), there is a one-to-one map between
the objects in this set and the points in the d-dimensional
open unit cube. We call this kind of hypothesis space a bitnet model.
Figure 1: Four examples of bitnets. From left to right:
a) $\vec{K}_3$, the complete dag of 3; b) $\vec{L}_3$, the directed
line of 3; c) $\vec{E}_3$, the exploding star of 3; d) $\vec{C}_3$,
the collapsing star of 3.
When |V| = n the information metric (i.e., the Fisher information matrix)
has d^2 components,

g_{ij}(\theta) = E(\ell_i(X)\ell_j(X) \mid \theta)    (3)

with,

\ell_k(x) = \frac{\partial \log p_\theta(x)}{\partial \theta_k},    (4)

where \theta = (\theta_1, \ldots, \theta_d) is the vector of parameters
in the d-cube. In order to obtain a simple expression for the metric
and the volume of a bitnet, we need to introduce some notation.
If pa(i) = \{j_1, j_2, \ldots, j_k\} with j_1 < j_2 < \cdots < j_k,
then we denote by x_{pa(i)} both the set of variables that are parents
of i and the integer obtained when writing the bits that are parents of
i in ascending order, i.e.,

x_{pa(i)} = \sum_{r=1}^{k} x_{j_r} 2^{k-r}.    (5)

We also define

m(i) = 2^{|pa(i)|}, \qquad t(i) = 1 + \sum_{j<i} m(j), \qquad
v(j) = i \ \text{ for } \ t(i) \le j \le t(i)+m(i)-1.    (6)

In words, m(i) is the number of parameters associated to the i-th
bit and t(i) is the index of the first parameter of i. Hence,
(\theta_{t(i)}, \ldots, \theta_{t(i)+m(i)-1}) is the vector of parameters
associated to i. The integer k = k(i,x) is such that,

k = t(i) + x_{pa(i)}, \qquad \text{with} \qquad
\theta_k = p_\theta(x_i = 1 \mid x_{pa(i)}).    (7)
For example, for $\vec{K}_3$ (see figure (1)), the parameter
\theta_5 represents the probability that bit 3 is on given that bit
1 is off and bit 2 is on. In this case, x_{pa(3)} = 01, which is 1 in decimal.
Also t(3) = 4, k = 4+1 = 5, and v(5) = 3. Notice that v is a kind of inverse
of t; v(j) gives the bit number i associated to the parameter
\theta_j.
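As a sanity check on this bookkeeping, the following sketch (my illustration; the helper names are mine) computes m, t, v, and k for $\vec{K}_3$ and recovers the example above.

```python
# Parameter bookkeeping for a bitnet: m(i) = 2^{|pa(i)|},
# t(i) = 1 + sum_{j < i} m(j), and v(j) = node owning parameter j.

parents = {1: [], 2: [1], 3: [1, 2]}          # the complete dag K3
nodes = sorted(parents)

m = {i: 2 ** len(parents[i]) for i in nodes}
t = {}
nxt = 1
for i in nodes:
    t[i] = nxt
    nxt += m[i]

def v(j):
    """Node i whose parameter block t(i) .. t(i)+m(i)-1 contains j."""
    for i in nodes:
        if t[i] <= j <= t[i] + m[i] - 1:
            return i

def k(i, x):
    """Index of theta_k = p(x_i = 1 | x_pa(i)); x maps node -> bit.
    Parents are read in ascending order to form the binary code."""
    code = 0
    for b in sorted(parents[i]):
        code = 2 * code + x[b]
    return t[i] + code

# Example from the text: bit 1 off, bit 2 on -> x_pa(3) = 01 = 1,
# t(3) = 4, so theta_5 governs bit 3.
x = {1: 0, 2: 1}
print(t[3], k(3, x), v(5))  # 4 5 3
```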
We can now compute the derivative (4). Taking the logarithm
of (1) and using (7) we obtain,

\ell_k(x) = \frac{\partial \log p_\theta(x)}{\partial \theta_k} =
\begin{cases}
\dfrac{-1}{1-\theta_k} & \text{if } x_i = 0, \text{ when } i = v(k) \text{ and } x_{pa(i)} = k - t(i), \\[6pt]
\dfrac{1}{\theta_k} & \text{if } x_i = 1, \text{ when } i = v(k) \text{ and } x_{pa(i)} = k - t(i),
\end{cases}    (8)

and \ell_k(x) = 0 when x_{pa(i)} \ne k - t(i).
Notice that i is just v(k); it appears in the previous formula only
to simplify the notation. With (8) we compute g_{jk}(\theta)
by noticing that the product \ell_j \ell_k has only four possible values,

\frac{1}{(1-\theta_j)(1-\theta_k)}, \quad \frac{-1}{(1-\theta_j)\theta_k}, \quad
\frac{-1}{\theta_j(1-\theta_k)}, \quad \frac{1}{\theta_j \theta_k}.

Thus, the expected value is just the sum of each value times the probability
of obtaining it. It is now straightforward to see that when j \ne k,

g_{jk}(\theta) = 0,    (9)

where we have used the fact that the probabilities for each of the four
cases factorize as a product of two terms when j \ne k, due to the
conditional independence assumptions implied by the bitnet model. But
when j = k there is no such factorization and,
g_{jj}(\theta) = \left[ \frac{(1-\theta_j)}{(1-\theta_j)^2} - 0 - 0 +
\frac{\theta_j}{\theta_j^2} \right] \pi_j(\theta)
= \frac{\pi_j(\theta)}{\theta_j(1-\theta_j)},    (10)

where,

\pi_j(\theta) = P_\theta\!\left[ x_{pa(v(j))} = j - t(v(j)) \right].    (11)
Notice that the \pi_j form n sequences of probability distributions since,
for each i = 1, \ldots, n we have:

\sum_{j : v(j)=i} \pi_j(\theta) =
\sum_{j=t(i)}^{t(i)+m(i)-1} \pi_j(\theta) = 1.    (12)
Since the metric tensor is diagonal, its determinant is the product
of the diagonal entries. The total intrinsic volume occupied by a
given bitnet is then the integral over the d-cube of the element of
volume. We obtain the general expression,

vol(bitnet) = \int_{[0,1]^d}
\sqrt{ \prod_{j=1}^{d} \frac{\pi_j(\theta)}{\theta_j(1-\theta_j)} }
\; d\theta.    (13)
Collecting j's according to v(j) = i we can also write,

vol(bitnet) = \int_{[0,1]^d} \prod_{i=1}^{n} W_i^{1/2} \; d\theta    (14)

where,

W_i = \prod_{j : v(j)=i} \frac{\pi_j(\theta)}{\theta_j(1-\theta_j)}
= \frac{\prod_r P_\theta[x_{pa(i)} = r]}
       {\prod_r \theta_{t(i)+r}(1-\theta_{t(i)+r})}
= \frac{\prod_r \pi_i(r)}{\prod_r \rho_i(r)},    (15)

the index r = 0, \ldots, m(i)-1 runs over all the possible values of x_{pa(i)}
(see (5)) and the last equality serves as a definition for
\pi_i(r) and \rho_i(r).
Using the fact that all the \pi_j(\theta) \le 1 (they are probabilities),
the monotonicity of integrals, Fubini's theorem, and the Dirichlet integral

\int_0^1 \frac{dt}{\sqrt{t(1-t)}} = \pi    (16)

(i.e., Beta(1/2,1/2) = \pi) we have,

vol(bitnet) \le \pi^d,    (17)

with equality if, and only if, the bitnet consists of totally disconnected
nodes. In that case d = n, \pi_j = 1, and the bits are independent variables.
Thus, all bitnet models are Riemannian manifolds of finite volume in the
information metric. The upper bound (17) can be improved
(see below).
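The argument behind (17) also suggests a simple numerical approximation. Writing (13) as \pi^d times the expectation of \prod_j \sqrt{\pi_j(\theta)} when each \theta_j is drawn from the arcsine density Beta(1/2,1/2), whose normalizing constant is exactly the integral (16), gives a bounded integrand and hence a well-behaved Monte Carlo estimator. The sketch below is my illustration of this idea (the paper's own fast approximation algorithm may differ), specialized to $\vec{E}_3$, where the product of the \pi_j reduces to \rho_1^2 and the estimate recovers \pi^5/8 \approx 38.25.

```python
import math
import random

def arcsine():
    """Sample from Beta(1/2,1/2) on (0,1) by inverting its CDF."""
    u = random.random()
    return math.sin(math.pi * u / 2.0) ** 2

def mc_volume_E3(num_samples, seed=0):
    """Monte Carlo estimate of vol(E_3): root node 1 with children 2, 3.
    d = 5 parameters: theta_1 (root) and two per child. The factors
    1/sqrt(theta_j(1-theta_j)) are absorbed exactly by arcsine sampling
    (one factor pi each, hence pi^d), leaving the bounded integrand
    sqrt(prod_j pi_j) = theta_1(1-theta_1), which depends on theta_1 only."""
    random.seed(seed)
    d = 5
    total = 0.0
    for _ in range(num_samples):
        th1 = arcsine()
        # prod of pi_j over the 4 child parameters: (1-th1)*th1*(1-th1)*th1
        total += math.sqrt((th1 * (1.0 - th1)) ** 2)
    return math.pi ** d * total / num_samples

est = mc_volume_E3(100_000)
exact = math.pi ** 5 / 8.0
print(est, exact)  # both close to 38.25
```

Because the remaining integrand is bounded by 1, the estimator has finite variance, unlike naive uniform sampling of (13), whose integrand is not square-integrable near the faces of the cube.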
The complete bitnet of n nodes has all the available arrows.
It has the highest possible dimension,
d = 2^0 + 2^1 + \cdots + 2^{n-1} = 2^n - 1, for a dag of n nodes.
The hypothesis space $\vec{K}_n$ is nothing but the multinomial model of
the discrete variable whose 2^n outcomes correspond to each possible
binary sequence of n bits. Instead of using the parameters
\theta = (\theta_1, \ldots, \theta_d) of conditional probabilities
specified by the bitnet, one can always use p = (p_0, \ldots, p_d),
with \sum p_i = 1, the joint probabilities of each sequence of
n bits. This maps the d-cube onto the d-simplex. We can now
map the d-simplex to the positive part of the d-sphere of points
t = (t_0, \ldots, t_d) with t_i > 0 and \sum t_i^2 = 1 by simply
using the standard t_i = \sqrt{p_i} transformation. It follows
immediately from these considerations that the information volume of the
complete bitnet is the same as that of the multinomial, which turns out
(by using the standard \sqrt{p_i} transformation)
to be exactly half the volume of the unit d-sphere S^d, where

vol(S^d) = \frac{2\,\pi^{(d+1)/2}}{\Gamma\!\left(\frac{d+1}{2}\right)}.    (18)
Thus,

vol(\vec{K}_n) = \frac{\pi^{2^{n-1}}}{(2^{n-1}-1)!}.    (19)
This sequence becomes exponentially small very quickly,

vol(\vec{K}_n) = \pi,\ \pi^2,\ \frac{\pi^4}{6},\ \frac{\pi^8}{5040},\ \ldots
\sim \sqrt{\frac{k}{2\pi}} \left(\frac{\pi e}{k}\right)^{k}    (20)

= 3.14,\ 9.86,\ 16.2,\ 1.87,\ 0.0000683,\ \ldots

where the asymptotic expression is for k = 2^{n-1} as n \to \infty.
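Formula (19) is easy to evaluate; the snippet below (my check, using the Gamma function for the factorial) reproduces the numerical sequence just given.

```python
import math

def vol_complete(n):
    """Information volume of the complete bitnet K_n, eq. (19):
    pi^(2^(n-1)) / (2^(n-1) - 1)!."""
    k = 2 ** (n - 1)
    return math.pi ** k / math.gamma(k)  # gamma(k) = (k-1)!

vals = [vol_complete(n) for n in range(1, 6)]
print([round(v, 4) for v in vals])
# roughly: 3.14, 9.87, 16.24, 1.88, 0.0001
```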
The Ricci scalar is, not surprisingly, constant, and it can be obtained
from the corresponding value for the multinomial model, i.e., that of a
sphere of radius 2:

R = \frac{d(d-1)}{4}, \qquad d = 2^n - 1.    (21)
In the directed line $\vec{L}_n$ the n nodes are connected linearly, i.e.,
one is the parent of two, who is the parent of three, ..., who is the
parent of n. The dimension is d = 1 + (n-1)2 = 2n-1. The exact volume
becomes very difficult to compute for values of n \ge 4, but computer
experiments carried out with the aid of vTool^1 show that,
vol(\vec{L}_n) = 4 \left(\frac{\pi}{2}\right)^{3n-4}.    (22)
The first two cases n \in \{1,2\} are trivial (they are also complete bitnets)
so the values \pi and \pi^2 are immediately obtained. The case
n = 3 is already not easy (Maple can't do it alone), but a special change of
variables in 3D shows that this volume must be the same as the volume
of $\vec{E}_{2+1}$, which is \pi^5/8
(see equation (24) below). Many other values
of n have been estimated by Monte Carlo and they provide strong evidence
for the validity of (22).
The Ricci scalar for the case n = 3 is computed with vTool as,

R = 5 - \frac{1}{2\rho},    (23)

where \rho is the variance of the central node. I believe this to
hold in general, i.e., that the scalar curvature always depends on the
variance of the central nodes. By looking at the components of the
Riemann tensor it is possible to show that the scalar curvature is always
independent of the parameters of the leaf nodes.
The exploding star $\vec{E}_{n+1}$ has one parent node with n children.
This is what the machine learning
community calls Naïve Bayes. The children bits
(e.g., the observed symptoms) are assumed to be independent conditionally
on the parent (the presence or absence of the disease). This bitnet is
by far the most popular Bayesian network, due to its simplicity and to
its surprising accuracy and robustness. As I argue below, its volume
can be used to explain, at least in part, its success in applications.
The volume of $\vec{E}_{n+1}$ is easily obtained from the general formula
(14),

vol(\vec{E}_{n+1}) = \int \left[
\frac{1}{\rho_1} \cdot \frac{\rho_1}{\rho_2\rho_3} \cdot
\frac{\rho_1}{\rho_4\rho_5} \cdots \frac{\rho_1}{\rho_{2n}\rho_{2n+1}}
\right]^{1/2} d\theta,    (24)

where \rho_j = \theta_j(1-\theta_j).
Separating the variables by Fubini, using (16), and defining,

B(r) = \int_0^1 t^{r/2} (1-t)^{r/2} \, dt = \int_0^1 \rho_i^{r/2} \, d\theta_i
= \frac{\Gamma\!\left(\frac{r}{2}+1\right)^2}{\Gamma(r+2)},    (25)
we obtain,

vol(\vec{E}_{n+1}) = \pi^{2n} B(n-1).    (26)
The sequence of volumes explodes exponentially,

vol(\vec{E}_{n+1}) = \pi^2,\ \frac{\pi^5}{8},\ \frac{\pi^6}{6},\ \ldots
\sim \sqrt{\frac{2\pi}{n}} \left(\frac{\pi^2}{2}\right)^{n}    (27)

= 9.87,\ 38.25,\ 160.24,\ 698.69,\ 3121.6,\ \ldots
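The function B of (25) and formula (26) can be checked numerically; the snippet below (my verification) reproduces the sequence above.

```python
import math

def B(r):
    """B(r) = Gamma(r/2 + 1)^2 / Gamma(r + 2), eq. (25)."""
    return math.gamma(r / 2.0 + 1.0) ** 2 / math.gamma(r + 2.0)

def vol_explode(n):
    """vol(E_{n+1}) = pi^(2n) * B(n-1), eq. (26)."""
    return math.pi ** (2 * n) * B(n - 1)

print([round(vol_explode(n), 1) for n in range(1, 6)])
# roughly: 9.9, 38.2, 160.2, 698.7, 3121.6
```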
The computation of the scalar curvature was obtained with the help of
vTool. It depends only on the variance
\rho = \rho_1 associated to the parent (center) node,

Ricci(\vec{E}_{n+1}) = R(\rho) = \frac{n}{2}\left[ (2n+1) -
\frac{n-1}{2\rho} \right] \le \frac{3}{2}\, n.    (28)
As \rho \to 0 the scalar curvature diverges:
R(\rho) \to -\infty.
The boundary \partial\vec{E}_{n+1}, consisting of all the models with
coordinates \theta \in [0,1]^d having at least one component
\theta_j equal to zero or one, contains the surface \rho = 0 of singularities.
This has an easy explanation. The estimation of the parameters of the
model obviously becomes more difficult when the variance of the parent
node is close to zero. If there is only a one in a billion chance of
observing a given disease, we need billions of observations to estimate
the probabilities with a Naïve Bayes model. In the limit of zero variance
we are at an inferential black hole: no information (about any of
the parameters of the model) can ever be extracted from the data.
Notice that the scalar curvature does not depend on the parameters of the
leaf nodes. That is always the case for all bitnets.
The average scalar curvature, \langle R \rangle, is computed by integrating R given by
(28), with respect to the volume element of $\vec{E}_{n+1}$,
and dividing by the total volume (26). Since
\langle 1/\rho_1 \rangle = B(n-3)/B(n-1) = 4n/(n-1), we obtain:

\langle R \rangle = \frac{n}{2}.    (29)
The collapsing star $\vec{C}_{n+1}$ corresponds to the one child of a
promiscuous mother, i.e., n parent nodes with only one child. It has
dimension d = 2^n + n and its volume is computed straightforwardly from (14),
vol(\vec{C}_{n+1}) = \int \left\{
\frac{1}{\rho_1} \cdot \frac{1}{\rho_2} \cdots \frac{1}{\rho_n} \cdot
\frac{[\rho_1\rho_2\cdots\rho_n]^{2^{n-1}}}{\rho_{n+1}\cdots\rho_{n+2^n}}
\right\}^{1/2} d\theta.    (30)
Integrating out the parameters of the 2^n leaf nodes, using Fubini and
the function B defined in (25), we can write,

vol(\vec{C}_{n+1}) = \pi^{2^n} B(2^{n-1}-1)^n.    (31)
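Formula (31) is likewise easy to tabulate; the snippet below (my check, reusing the B of (25)) evaluates the first volumes and confirms that the n = 1 case coincides with the complete bitnet $\vec{K}_2$.

```python
import math

def B(r):
    """B(r) = Gamma(r/2 + 1)^2 / Gamma(r + 2), eq. (25)."""
    return math.gamma(r / 2.0 + 1.0) ** 2 / math.gamma(r + 2.0)

def vol_collapse(n):
    """vol(C_{n+1}) = pi^(2^n) * B(2^(n-1) - 1)^n, eq. (31)."""
    return math.pi ** (2 ** n) * B(2 ** (n - 1) - 1) ** n

# n = 1 is two nodes joined by one arrow, i.e. the complete bitnet K_2:
print(vol_collapse(1), math.pi ** 2)   # both 9.8696...
print(round(vol_collapse(2), 2))       # pi^6/64 = 15.02
```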
With the aid of vTool we can show that the Ricci scalar has the form,

R = a - b \left( \frac{1}{\rho_1} + \frac{1}{\rho_2} + \cdots +
\frac{1}{\rho_n} \right),    (32)
where (a,b) depend only on n. The only known values are for
n = 1,2,3,4. They are (a,b) = (3/2,0), (10,1/2), (54,3), (272,14), with
average Ricci scalars \langle R \rangle = 3/2, 2, 6, ?. Equation (32)
tells us that, as geometric objects, a child node with only one parent
is radically different from a child node with two or more parents. When
n = 1 the space has constant scalar curvature, but when n \ge 2 the
curvature diverges to minus infinity as we let the variance of any of the
parent nodes go to zero. So what is so different about the cases
n = 1 and n \ge 2? What does curvature really mean statistically?
I think what makes these cases different is that with
only two nodes we can reverse the arrow, obtaining a bitnet which is
Markov equivalent to the original one (same v-structures), but when
n \ge 2 reversing arrows produces a non Markov equivalent bitnet. Thus,
with only one parent and one child, if the parent has a variance very
close to zero the apparent singularity disappears under the reparametrization
implied by reversing the arrow, which makes the parent a leaf node. Being
a true geometric invariant, the information contained in the Ricci scalar
holds in all possible descriptions (reparametrizations, Markov equivalence
transformations, etc.) of the model, and it must be telling us something
significant about the difficulty of estimation at each point.
For most large bitnets the exact computation of their volumes becomes
impractical. This section gives cookbook formulas for a lower and an
upper bound on the volumes of general bitnets. These bounds can be
shown to be exact for complete bitnets and for bitnets with maximum
depth 1 (i.e., bitnets without grandparents). The geometric mean
of the lower and upper bounds has been observed to perform
remarkably well in large simulation studies [3].
I claim that the exact volume Z of a general bitnet is always bounded
between L and U, i.e., L \le Z \le U, where the upper and lower
bounds are given in terms of the function B defined in (25)
by

L = \prod_{i=1}^{n} B(a_i)^{m(i)}    (33)

with

a_i = -1 + \sum_{j \in ch(i)} 2^{|pa(j)|-|pa(i)|-1},    (34)

and

U = \prod_{i=1}^{n} B(b_i)^{m(i)}    (35)

with b_i defined as in (34), except that now the sum does not
run over all the children of i, but only over those children j of i
for which pa(j) \subset pa(i).
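The bounds (33)-(35) translate directly into code. The sketch below is my implementation of the formulas exactly as stated; it checks that L recovers the exact volume \pi^5/8 of the depth-1 bitnet $\vec{E}_3$, while U gives \pi^5.

```python
import math

def B(r):
    """B(r) = Gamma(r/2 + 1)^2 / Gamma(r + 2), eq. (25)."""
    return math.gamma(r / 2.0 + 1.0) ** 2 / math.gamma(r + 2.0)

def bounds(parents):
    """Lower and upper volume bounds (33)-(35) for a bitnet given as
    a dict node -> list of parents."""
    nodes = sorted(parents)
    children = {i: [j for j in nodes if i in parents[j]] for i in nodes}
    L = U = 1.0
    for i in nodes:
        m = 2 ** len(parents[i])
        # eq. (34): sum over all children of i
        a = -1 + sum(2 ** (len(parents[j]) - len(parents[i]) - 1)
                     for j in children[i])
        # eq. (35): sum only over children whose parents lie in pa(i)
        b = -1 + sum(2 ** (len(parents[j]) - len(parents[i]) - 1)
                     for j in children[i]
                     if set(parents[j]) <= set(parents[i]))
        L *= B(a) ** m
        U *= B(b) ** m
    return L, U

E3 = {1: [], 2: [1], 3: [1]}       # exploding star, depth 1
L_, U_ = bounds(E3)
print(L_, math.pi ** 5 / 8, U_)    # L equals pi^5/8 here, U equals pi^5
```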
Why should we care about the volumes of bitnets?
One answer is model selection.
On the one hand we want our models to be sufficiently large
to be able to approximate the true distribution closely. On the other
hand we only have limited resources and we need small models that can
be handled efficiently. There are many ways to measure the size
of a model but, not surprisingly, the information volume explicitly
shows up in the formulas. Consider, for example, the latest expression
for the minimum description length (MDL) criterion for model selection
[4,1],

MDL = -\sum_{i=1}^{N} \log p(y_i \mid \hat{\theta}) +
\frac{d}{2} \log \frac{N}{2\pi} + \log V,    (36)

where N is the sample size, (y_1, \ldots, y_N) is the observed data,
\hat{\theta} is the MLE of the vector of parameters, d is the dimension
of the model (number of free parameters), and V is the information volume
of the model. The MDL approximates the length of the best possible code for
the observed data that can be built with the help of the model. According to
the MDL criterion, the best model is the one that allows the shortest possible
encoding of the observations [4].
It so happens that (36) is also the o(1) approximation
to -\log P(M \mid y) (see [1,5]), i.e., minus
the logarithm of the posterior probability of the model M. Thus, on the
basis of the observed data alone, with no other prior information at hand,
given the choice between two models of the same dimensionality providing
the same fit to the data, we must prefer the one with the smaller volume V.
One must notice, however, that the volume only appears in the third term
(of order N^0 = 1) of the asymptotic expansion (as N \to \infty).
The first term (minus the log-likelihood fit) scales linearly with N; the
second term scales linearly in the number of parameters d but
logarithmically in the sample size N. Thus, for sufficiently large N,
the volume term will eventually become negligible, but exactly when
it becomes negligible depends on the specific models being considered.
Simulation studies [3] show improvements in model selection
of the order of 28% on average when the expression (36)
is used instead of the traditional MDL without the volume term.
The modern expression for the MDL (36) exemplifies the natural
desirability of models with small volume. Desirability of large volumes
is naturally found in measures of generalization power. As just one example,
consider the expression for the generalization power of heat kernels
[2],

\log N(\epsilon, F_R(x)) = O\!\left( \frac{V}{t^{d/2}}
\log^{(d+2)/2}\!\left(\frac{1}{\epsilon}\right) \right),    (37)

where N(\epsilon, F_R(x)) denotes the size of the
smallest \epsilon-cover of the space F_R(x), which is
the ball of radius R (in the sup-norm) of functions defined on the
data x in terms of a heat kernel K_t. The specific technicalities
of this method of estimation are not very important for us here. The main
point is that for a given accuracy of estimation \epsilon,
between two competing models
of the same dimension d, we must choose the one with the largest information
volume V in order to increase generalization power.
Figure 2: Complexity of $\vec{E}_n$ and $\vec{L}_n$ for different sample
sizes N = 100, 500, 1000. The magnitude of the curves increases with N,
and Explode > Line.
Here, by "Complexity" we simply mean the sum of the last two terms
in (36). These are the terms that do not involve the
observed data, only their number N, the dimension of the model d,
and its volume V. Figure (2) shows the complexity
terms for the $\vec{E}_n$ and $\vec{L}_n$ bitnets for binary
sequences of different lengths n < 12 and for three sample sizes
N = 100, 500, 1000. The picture is clear. The complexities (actually
just their volumes, for they have the same d) of the exploding star
and the line are very similar straight lines, with the exploding star
always above the line. Complexity increases with both the size of
the network n and the sample size N. This means that by adding
more leaves (symptoms) to a Naïve Bayes network we increase its
complexity, and by increasing its volume we (probably) increase its
power of generalization.
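The complexity term of (36) for these two families is easy to tabulate. The sketch below is my reconstruction of the comparison, using the volume formulas (22) and (26); both families have the same dimension d = 2n-1, and the exploding star stays at or above the line for N = 100 and n up to 11.

```python
import math

def B(r):
    """B(r) = Gamma(r/2 + 1)^2 / Gamma(r + 2), eq. (25)."""
    return math.gamma(r / 2.0 + 1.0) ** 2 / math.gamma(r + 2.0)

def complexity(N, d, V):
    """Last two terms of the MDL expression (36)."""
    return 0.5 * d * math.log(N / (2.0 * math.pi)) + math.log(V)

def vol_explode_total(n):
    """vol(E_n) for n total nodes (one root, n-1 leaves), eq. (26)."""
    return math.pi ** (2 * (n - 1)) * B(n - 2)

def vol_line(n):
    """vol(L_n), eq. (22)."""
    return 4.0 * (math.pi / 2.0) ** (3 * n - 4)

N = 100
for n in range(2, 12):
    d = 2 * n - 1                         # same dimension for both families
    cE = complexity(N, d, vol_explode_total(n))
    cL = complexity(N, d, vol_line(n))
    print(n, round(cE, 2), round(cL, 2))  # cE >= cL for every n here
```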
On the other hand, figure (3) shows a very different behavior for
the $\vec{C}_n$ bitnet. The complexity reaches a maximum saturation point and
then decreases without bound. Thus, adding more parents may help increase
generalization power, but after crossing the saturation size the bitnet will
lose complexity, and probably its power to generalize correctly as well.
I do believe that there is more to the story of model complexity than what is
available from (36). Recall that (36) contains only the
first three terms of an asymptotic expansion. The neglected terms contain the
model curvatures (see [5]), and by neglecting them we
fail to account for the difficulty of extracting information from the
sample due to the presence of high model curvatures. One may define model
complexity in purely geometric terms as \langle R \rangle and then try to
characterize the models of extreme
complexity with data and, in the vacuum, without data. This type of variational
problem has been very successful at describing observational data in physics
and a lot is already known about it.
V. Balasubramanian.
A geometric formulation of Occam's razor for inference of parametric
distributions.
Technical report, Princeton Physics PUPT-1588, Sept. 1996.

E.J.M. Lauria.
Learning the structure and parameters of Bayesian belief
networks: An application and a methodology in IT implementation.
PhD thesis, The University at Albany, School of Information Science
and Policy, 2003.