\documentclass{article}
\usepackage{epsfig}
\newenvironment{code}{\begin{verse}
\tt $\{$ \\
}{$\}$ \\
\end{verse}}
\def\latex{\LaTeX}
\def\N{{\rm I\!N}} %natuerliche Zahlen
\newcommand{\baseref}{{\it http://omega.stat.psu.edu:8008/summer99/lecture5}}
\newcommand{\href}[2]{{#2 {\center ({\it \baseref/#1})}}}
\newtheorem{mytheorem}{Theorem}
\newtheorem{lemma}{Lemma}
\title{Lecture VIII}
\begin{document}
\begin{abstract}
Using the exponential and mixture connections in the space of distributions
for sampling. Applications: thermodynamic integration, and the Full
Monty-Carlos method for sampling from \(f\) by generating from \(g\).
\end{abstract}
\section*{ Paths in the Space of Distributions }
Given two probability densities (or probability mass functions if the space
of observations is discrete) \(f\) and \(g\), we define a path connecting them
as a piecewise smooth map from the interval \([0,1]\) to the space of
distributions for the data, \(t \longrightarrow \gamma_{t}\), such that
\(\gamma_{0} = g\) and \(\gamma_{1} = f\). See the picture.
\begin{figure}
\begin{center}
\epsfysize=2.0in
\epsfbox{path.eps}
\end{center}
\caption{Path Connecting \(g\) and \(f\)}
\end{figure}
We assume that \(f(x)\) is a complicated expression (with \(x\) possibly high
dimensional) from which we want to sample. Due to the complexity of \(f(x)\)
we assume that there is no simple method for generating samples from \(f\),
but that a method for generating samples from \(g\) is available.
It is intuitively clear how a path connecting the target distribution \(f\)
with the simpler distribution \(g\) may help, by providing a way to arrive at
the target in small steps along the path. In the following sections we show
two examples of the application of this simple but powerful idea.
There are many ways of building paths between densities but there are two
special connections that have been proven useful, not only for generating
random variables with the computer but also in the general theory of
information geometry. I am of the opinion that the utility of the
exponential and mixture connections is just the tip of the iceberg. I think
that there is a whole unexplored geometric world lurking behind computer
sampling.
\subsubsection*{The Exponential Connection}
A path can be obtained by mixing the loglikelihoods of two extreme
distributions,
\[ \log \gamma_{t}(x) = t \log f(x) + (1-t)\log g(x) - \log Z_{t}\]
where \(Z_{t}\) is a normalization constant. We can also write,
\[ \gamma_{t}(x) = \frac{ f^{t}(x) g^{1-t}(x) }{Z_{t}} \]
with,
\[ Z_{t} = \int f^{t}(x) g^{1-t}(x) dx \]
This path is known as the exponential connection between the densities \(g\)
and \(f\). The path is an exponential family with \(t\) as the natural
parameter and the loglikelihood ratio \(\log(f(x)/g(x))\) as sufficient
statistic. This connection provides a notion of a straight line in the space
of distributions that is diametrically opposed to the notion of straightness
provided by the mixture connection introduced next.
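As a quick numerical sanity check (a sketch, not part of the lecture's development), for two unit-variance Gaussians the exponential connection is available in closed form: completing the square shows that for the assumed toy pair \(f = N(4,1)\) and \(g = N(0,1)\) the connection is \(\gamma_{t} = N(4t,1)\), with \(Z_{t} = e^{-8t(1-t)}\).

```python
import numpy as np

def norm_pdf(x, mu):
    # Unit-variance Gaussian density with mean mu.
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Assumed toy pair: f = N(4, 1), g = N(0, 1).
x = np.linspace(-10.0, 14.0, 4001)
dx = x[1] - x[0]
t = 0.3

unnorm = norm_pdf(x, 4.0) ** t * norm_pdf(x, 0.0) ** (1 - t)  # f^t g^(1-t)
Z_t = unnorm.sum() * dx                                       # numerical Z_t
gamma_t = unnorm / Z_t

# Completing the square shows gamma_t is exactly N(4t, 1),
# with Z_t = exp(-8 t (1 - t)).
assert np.max(np.abs(gamma_t - norm_pdf(x, 4.0 * t))) < 1e-6
assert abs(Z_t - np.exp(-8 * t * (1 - t))) < 1e-9
```

Note that \(Z_{t} < 1\) in the interior of the path even though both endpoints are normalized; this is typical of the exponential connection.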
\subsubsection*{ The Mixture Connection}
Another path connecting the target density \(f\) to the easier-to-sample-from
density \(g\) is given by the mixture connection,
\[ \gamma_{t}(x) = t f(x) + (1-t) g(x) \]
This path provides an alternative notion of a straight line in the space of
distributions. Different notions of straightness give, for example, different
notions of curvature and parallelism.
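Unlike the exponential connection, the mixture connection is trivial to sample from whenever \(f\) and \(g\) are: flip a coin with probability \(t\) of heads and sample the corresponding component. A minimal Python sketch, with \(f = N(4,1)\) and \(g = N(0,1)\) assumed as a toy pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(t, n, rng):
    # Draw from gamma_t = t*f + (1-t)*g by first flipping a t-coin
    # to choose the component, then sampling that component.
    pick_f = rng.random(n) < t
    return np.where(pick_f, rng.normal(4.0, 1.0, n), rng.normal(0.0, 1.0, n))

xs = sample_mixture(0.25, 200_000, rng)
# The mean of gamma_t is t*4 + (1-t)*0 = 1.0 for t = 0.25.
assert abs(xs.mean() - 1.0) < 0.05
```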
\subsection*{ Thermodynamic Integration }
Thermodynamic integration is a technique that makes use of the exponential
connection for estimating the ratio of two normalization constants. The
problem of computing the ratio of two normalization constants appears in the
important problem of model selection or hypothesis testing. Suppose that we
want to compare two competing models \(M_{0}, M_{1}\) for the available data
\(D\). We need to compute the ratio of the two posterior probabilities for
the models. By Bayes' theorem we get,
\[ \log \frac{P(M_{0} | D)}{P(M_{1} | D)} = \log\frac{P(D|M_{0})}{P(D|M_{1})}
+ \log\frac{\pi(M_{0})}{\pi(M_{1})} \]
where \(\pi(M_{i})\) are the prior probabilities for the two models. The
above expression is a recurrent headache for faithful Bayesians, for two
reasons. First, the standard improper noninformative priors cannot be used
here: the ratio of infinities in the second term on the right is undefined
when improper priors are used. Second, it is not enough to know the
likelihood for the data up to a normalization constant, since the first term
on the right needs the ratio of both normalization constants in order to
produce a definite numerical answer.
Let us then assume that the target density is only known up to a
proportionality constant, i.e.,
\[ f(x) = \frac{1}{Z_{1}} h(x) \]
with,
\[ Z_{1} = \int h(x) dx \]
unknown. Allow also the possibility of having \(g\) unnormalized, even though
this is less common, i.e. define,
\[ Z_{0} = \int g(x) dx \]
which is of course 1 when \(g\) is a proper density. The objective is to
compute,
\begin{eqnarray*}
\log \frac{Z_{1}}{Z_{0}} &=& \log\frac{\int h}{\int g} \\
&=& \log\frac{\int\frac{h}{g} g}{\int g} \\
&=& \log\left<\frac{h(x)}{g(x)}\right>_{g} \\
&\approx&
\log\left\{\frac{1}{N}\sum_{j=1}^{N}\frac{h(x_{j})}{g(x_{j})} \right\}
\end{eqnarray*}
where the \(x_{1},x_{2},\ldots \) are iid with density proportional to
\(g\). In a few dimensions, when \(g\) is very close in shape to \(h\), the
above approximation may work. In general, however, this simple importance
sampling approximation fails to produce useful estimates. The use of the
exponential connection can improve the estimates dramatically in the
following way. Define,
\[ Z_{t} = \int h^{t}(x) g^{1-t}(x) dx \]
which is the normalization constant for the mixture of the log likelihoods of
\(h\) and \(g\). Notice that the definition includes the two extremes that we
are after, \(Z_{0}\) and \(Z_{1}\). From,
\[ Z_{t+\epsilon} = \int \left(\frac{h(x)}{g(x)}\right)^{\epsilon} h^{t}(x)
g^{1-t}(x) dx \]
we obtain,
\[ \frac{Z_{t+\epsilon}}{Z_{t}} =
\left<\left(\frac{h(x)}{g(x)}\right)^{\epsilon} \right>_{t} \]
where by the last notation we mean the expected value of what is inside the
brackets when \(x\) follows the exponential connection density with parameter
\(t\). For values of \(\epsilon\) close to zero, the sample means obtained by
using the Metropolis algorithm produce very accurate estimates of these
ratios for different values of \(t\). Finally, by climbing up the path we
obtain,
\[ \frac{Z_{1}}{Z_{0}} = \frac{Z_{1/n}}{Z_{0}}\frac{Z_{2/n}}{Z_{1/n}}\cdots
\frac{Z_{(n-1)/n}}{Z_{(n-2)/n}}\frac{Z_{1}}{Z_{(n-1)/n}} \]
which is estimated as,
\begin{eqnarray*}
\log\frac{Z_{1}}{Z_{0}} &=& \sum_{i=0}^{n-1}
\log\left<\left(\frac{h(x)}{g(x)}\right)^{1/n}\right>_{i/n} \\
&\approx& \sum_{i=0}^{n-1} \log \left(\frac{1}{N}\sum_{j=1}^{N}\left(\frac{h(x_{i,j})}{g(x_{i,j})}\right)^{1/n}\right)
\end{eqnarray*}
where \(x_{i,1},x_{i,2},\ldots, x_{i,N}\) are samples from the exponential
connection with parameter \(i/n\). Thus, in order to get the ratio we need to
run a sequence of MCMC simulations, one for each level.
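Here is a minimal Python sketch of the telescoping estimator on an assumed toy pair where every level can be sampled exactly: \(h(x) = 3\,\phi(x-4)\) with \(\phi\) the standard normal density (so \(Z_{1}=3\)) and \(g = \phi\) (so \(Z_{0}=1\)); the true answer is \(\log 3\). For this pair the exponential connection at parameter \(t\) is exactly \(N(4t,1)\); in a real problem each level would need its own Metropolis run.

```python
import numpy as np

rng = np.random.default_rng(1)
SQRT2PI = np.sqrt(2 * np.pi)

# Assumed toy pair: h(x) = 3 * N(4,1) density, so Z_1 = 3,
# and g = N(0,1) density, so Z_0 = 1; true log ratio is log 3.
def h(x):
    return 3.0 * np.exp(-0.5 * (x - 4.0) ** 2) / SQRT2PI

def g(x):
    return np.exp(-0.5 * x ** 2) / SQRT2PI

n, N = 20, 20_000
log_ratio = 0.0
for i in range(n):
    # For this pair the exponential connection at t = i/n is exactly
    # N(4t, 1), so we can sample it directly; in general one would
    # run Metropolis at each level.
    x = rng.normal(4.0 * (i / n), 1.0, N)
    log_ratio += np.log(np.mean((h(x) / g(x)) ** (1.0 / n)))

assert abs(log_ratio - np.log(3.0)) < 0.05
```

Each term estimates \(\log(Z_{(i+1)/n}/Z_{i/n})\), and the sum telescopes to \(\log(Z_{1}/Z_{0})\).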
It is also possible to get a good estimate of the ratio by running a single
MCMC, but at the expense of having to generate from an equiprobable mixture
of exponential connections. Here is how: write,
\begin{eqnarray*}
\lim_{\epsilon\rightarrow 0}\frac{Z_{t+\epsilon} - Z_{t}}{\epsilon Z_{t}} &=&
\lim_{\epsilon\rightarrow
0}\frac{\left<\left(\frac{h(x)}{g(x)}\right)^{\epsilon} \right>_{t} -
1}{\epsilon} \\
\frac{d}{dt}\log Z_{t} &=&
\left<\left.\frac{d}{d\epsilon}\exp\left(\epsilon\log\frac{h}{g}\right)\right|_{\epsilon=0}\right>_{t} = \left<\log\frac{h(x)}{g(x)}\right>_{t}
\end{eqnarray*}
from which we get,
\[ \log Z_{t} = \int_{0}^{t} \left<\log\frac{h(x)}{g(x)} \right>_{\lambda}
d\lambda + \log Z_{0} \]
Denoting the exponential connection with parameter \(\lambda\) by
\(\gamma_{\lambda}\) and interchanging the order of integration, we obtain,
\[ \log\frac{Z_{1}}{Z_{0}} = \int \log\frac{h(x)}{g(x)}\left[ \int_{0}^{1}
\gamma_{\lambda}(x) d\lambda \right] dx \]
and by calling,
\[ G^{*}(x) = \int_{0}^{1}\gamma_{t}(x) dt \]
the continuous equiprobable mixture of the elements of the path connecting
\(f\) with \(g\) we can write,
\[\log\frac{Z_{1}}{Z_{0}} = \left<\log\frac{h(x)}{g(x)}\right>_{G^{*}} \]
a formula that is beautiful in its own right, and useful as well. It says
something like:
\begin{center}
\begin{verse}
{\it the log of the ratio of averages of two positive functions\\
equals the average of log of ratios \\
with respect to the average exponential connection \\
linking those two functions}
\end{verse}
\end{center}
\subsubsection*{The Plain Vanilla Thermodynamic Integrator}
\begin{code}
sum \(\leftarrow 0\) \\
for \(j=1,2,\ldots,N\) \\
\begin{code}
\( u \leftarrow \mbox{unif}(0,1) \) \\
\( x \leftarrow \) sample \(\gamma_{u}\) by Metropolis \\
sum \(\leftarrow\) sum \( + \log\frac{h(x)}{g(x)} \) \\
\end{code}
return sum\(/N\)
\end{code}
In the above algorithm the Metropolis steps can be dramatically improved by
starting from a previously observed sample from \(\gamma_{t}\) with \(t\) as
close as possible to the current \(u\), so that only a few iterations are
necessary.
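The plain vanilla integrator can be sketched in Python. The toy pair is an assumption made for illustration: \(h(x) = 3\,\phi(x-4)\) and \(g = \phi\) (the standard normal density), so the true answer is \(\log 3\). For this pair a short random-walk Metropolis run started near the mode of \(\gamma_{u}\) suffices; the warm-start strategy suggested above would be used in a real problem.

```python
import numpy as np

rng = np.random.default_rng(2)
LOG_SQRT2PI = 0.5 * np.log(2 * np.pi)

# Assumed toy pair: h = 3 * N(4,1) density and g = N(0,1) density,
# so the true value of log(Z_1/Z_0) is log 3.
def log_h(x):
    return np.log(3.0) - 0.5 * (x - 4.0) ** 2 - LOG_SQRT2PI

def log_g(x):
    return -0.5 * x ** 2 - LOG_SQRT2PI

def sample_gamma(u, rng, steps=30):
    # Random-walk Metropolis targeting gamma_u ~ h^u g^(1-u).
    # For this toy we start near the mode 4u; the lecture suggests
    # warm-starting from a sample at a nearby value of t instead.
    def logp(x):
        return u * log_h(x) + (1 - u) * log_g(x)
    x = 4.0 * u
    for _ in range(steps):
        y = x + rng.normal()
        if np.log(rng.random()) < logp(y) - logp(x):
            x = y
    return x

N = 10_000
total = 0.0
for _ in range(N):
    u = rng.random()              # u ~ unif(0,1)
    x = sample_gamma(u, rng)      # x ~ gamma_u (approximately)
    total += log_h(x) - log_g(x)  # accumulate log(h/g)

assert abs(total / N - np.log(3.0)) < 0.25
```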
\section*{The Full Monty-Carlos Method}
This is, as far as I know, a previously unknown method that allows one to
sample from any \(f\) starting from any \(g\). The simplest version of the
method works as long as the ratio of normalization constants of \(f\) and
\(g\) is known. If that ratio is unknown, it must first be
estimated by one of the methods of the previous sections.
A modification of the simpler version allows the freedom of using
unnormalized densities.
The idea is very simple and it exploits a simple but super useful property of
the mixture connection.
\begin{mytheorem}
Let,
\[ \gamma_{t}(x) = t f(x) + (1-t) g(x) \]
be the mixture connection between two fully normalized pdfs (or probability
mass functions) \(f\) and \(g\). Then, for all values of \(x\),
\[ \gamma_{t+\epsilon}(x) < \left(1 + \frac{\epsilon}{t}\right)
\gamma_{t}(x) \]
for all \(t > 0\) and \(\epsilon > 0\) such that \(t + \epsilon \le 1\).
\end{mytheorem}
{\bf Proof:} Just write,
\begin{eqnarray*}
\lambda = \frac{\gamma_{t+\epsilon}(x)}{\gamma_{t}(x)}
&=& \frac{\gamma_{t}(x) +
\epsilon (f(x)-g(x))}{\gamma_{t}(x)} \\
&=& 1 + \epsilon\frac{f(x)-g(x)}{\gamma_{t}(x)}
\end{eqnarray*}
and consider two cases.
\begin{description}
\item[Case I: \(f(x) \le g(x)\)]
In this case it follows from the last equation that
\(\lambda \le 1 < 1 + \epsilon/t\).
\item[Case II: \(f(x) > g(x)\)]
For this case write the denominator in the last equation above as,
\(\gamma_{t}(x) = g(x)+t(f(x)-g(x))\) and divide the numerator
and the denominator by \((f(x)-g(x))\) to obtain,
\[ \lambda = 1 + \frac{\epsilon}{t + g(x)/(f(x)-g(x))} < 1 +
\frac{\epsilon}{t} \]
\end{description}
{\bf Q.E.D.} \\
The Monty-Carlos method makes use of this theorem in the following way. Start
by generating a sample from \(g=\gamma_{0}\), then move (horizontally in the
picture) with a few Metropolis iterations to get a sample from
\(\gamma_{1/n}\). Then try to climb up towards \(f\) as much as possible by
using the exact rejection constants provided by the previous theorem. When
the sample gets rejected, fall back to the last accepted level \(k\), which
assures that the sample is from \(\gamma_{k/n}\). This \(x\) is now used as an
initial value for Metropolis to try to go one step forward to
\(\gamma_{(k+1)/n}\). Continue in this way until arriving at \(f\).
\begin{figure}
\begin{center}
\epsfysize=2.0in
\epsfbox{monty.eps}
\end{center}
\caption{The Full Monty-Carlos Method}
\end{figure}
When \(n\) is small and \(g\) is close to \(f\) the method can be used for
generating independent samples from \(f\). For really complicated high
dimensional \(f\) a high value for \(n\) is needed and there is no guarantee
that \(g\) is close to \(f\). In such cases, the method can be used for
producing a single sample from \(f\), after which one can continue with a
version of Metropolis, but now knowing that the chain has converged.
In order to provide a short description of the algorithm we will make use of
two functions:
\begin{description}
\item[\(Met(x,t,Stop\_crit)\)] Returns a sample from \(\gamma_{t}\) starting from
\(x\) using the Metropolis algorithm with a given stopping criterion
(e.g. maximum number of iterations).
\item[\(Climb(x,k,n)\)] Returns an integer \(k'\) with \(k\le k' \le n\). The
sample \(x\) from \(\gamma_{k/n}\) is pushed up as high as possible; \(k'\) is
the highest level at which \(x\) is accepted as a sample from \(\gamma_{k'/n}\).
\end{description}
\subsubsection*{The Full Monty-Carlos Algorithm}
\begin{code}
\(x \leftarrow\) sample from \(g(\cdot)\) \\
\(k \leftarrow 1\)\\
while \((k < n) \)
\begin{code}
\( x \leftarrow \) Met\((x,k/n,Stop\_crit)\) \\
\( k \leftarrow \) Climb\((x,k,n)\) \\
\end{code}
Return \(x\)\\
\end{code}
\subsubsection*{Met\((x,t,Stop\_crit)\)}
\begin{code}
until (Stop\_crit) \\
\begin{code}
\( y \leftarrow \) sample \(g(\cdot)\) \\
\( u \leftarrow \) unif(0,1) \\
if \( u < \min\left(1, \frac{\gamma_{t}(y)\,g(x)}{\gamma_{t}(x)\,g(y)}\right)\)
\ \ \ then \( x \leftarrow y \) \\
\end{code}
Return \(x\)
\end{code}
\subsubsection*{ Climb\((x,k,n)\)}
\begin{code}
\( j \leftarrow 0 \) \\
\( u \leftarrow\) unif(0,1) \\
while \( ( k+j < n ) \) AND \\
\[ \left( u*\left[1+\frac{1}{k+j}\right]*\gamma_{(k+j)/n}(x)
< \gamma_{(k+j+1)/n}(x) \right) \] \\
\begin{code}
\( u \leftarrow\) unif(0,1) \\
\( j \leftarrow j+1 \) \\
\end{code}
Return \( k+j \)
\end{code}
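Here is a minimal Python sketch of the whole algorithm on an assumed toy pair \(f = N(2,1)\), \(g = N(0,1)\). One liberty is taken: this \(Met\) uses a random-walk proposal rather than independence proposals from \(g\), since for this toy the importance weights \(\gamma_{t}/g\) are unbounded and a random walk mixes more reliably; everything else follows the pseudocode.

```python
import numpy as np

rng = np.random.default_rng(4)
SQRT2PI = np.sqrt(2 * np.pi)

# Assumed toy pair: target f = N(2,1) and easy density g = N(0,1),
# both normalized, joined by the mixture connection.
def f(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) / SQRT2PI

def g(x):
    return np.exp(-0.5 * x ** 2) / SQRT2PI

def gamma(t, x):
    return t * f(x) + (1 - t) * g(x)

def met(x, t, iters, rng):
    # Random-walk Metropolis targeting gamma_t (the pseudocode's Met
    # uses independence proposals from g; see the note above).
    for _ in range(iters):
        y = x + 1.5 * rng.normal()
        if rng.random() * gamma(t, x) < gamma(t, y):
            x = y
    return x

def climb(x, k, n, rng):
    # Push a sample from gamma_{k/n} up the path using the exact
    # rejection constants (1 + 1/m) from the theorem.
    j = 0
    while k + j < n:
        m = k + j
        u = rng.random()
        if u * (1 + 1.0 / m) * gamma(m / n, x) >= gamma((m + 1) / n, x):
            break  # rejected: x is still a sample at level m
        j += 1
    return k + j

def full_monty(n, rng, met_iters=60):
    x = rng.normal()      # exact sample from g = gamma_0
    k = 1
    while k < n:
        x = met(x, k / n, met_iters, rng)
        k = climb(x, k, n, rng)
    return x

samples = np.array([full_monty(10, rng) for _ in range(3_000)])
# The output should behave like draws from f = N(2, 1).
assert abs(samples.mean() - 2.0) < 0.15
assert abs(samples.std() - 1.0) < 0.15
```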
\subsection*{The Full-Monty Without Normalization Constants}
If \(f\) and \(g\) are only known up to normalization constants
the method works if we just change the {\bf Climb} function to:
\subsubsection*{ Climb\((x,k,n)\)}
\begin{code}
\( j \leftarrow 0 \) \\
\( u \leftarrow\) sample from \(\exp(1)\) \\
while \( ( k+j < n ) \) AND \\
\[ \left( u \ge \log\left(\frac{(k+j+1)\,\gamma_{(k+j)/n}(x)}
{(k+j)\,\gamma_{(k+j+1)/n}(x)}\right) \right) \] \\
\begin{code}
\( u \leftarrow\) sample from \(\exp(1)\) \\
\( j \leftarrow j+1 \) \\
\end{code}
Return \( k+j \)
\end{code}
The fact that this works follows from the following (well-known)
lemma:
\begin{lemma}
For a nonnegative function \(h(x)\) the algorithm,
\begin{code}
Repeat
\begin{code}
\(x \leftarrow\) sample from \(g()\)\\
\(u \leftarrow\) sample from \(\exp(1)\) \\
\end{code}
Until \(h(x) \le u\)\\
Return \(x\)
\end{code}
produces a sample with density
\[ f(x) = c\, g(x) \exp(-h(x)) \]
where \(c\) is the normalization constant.
\end{lemma}
{\bf Proof:} \\
For any Borel set \(B\) we have,
\begin{eqnarray*}
P[X\in B] &=& \frac{P[X\in B, U \ge h(X)]}{P[U \ge h(X)]} \\
&=& \frac{\int_{B}\int_{y=h(x)}^{\infty}g(x)\exp(-y) dy dx}
{\int g(x)\exp(-h(x)) dx} \\
&=& \frac{\int_{B} g(x)\exp(-h(x)) dx}{\int g(x) \exp(-h(x)) dx}\\
&=& \int_{B} f(x) dx
\end{eqnarray*}
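The lemma is also easy to check empirically. A minimal sketch with the assumed toy choices \(g = N(0,1)\) and \(h(x) = x^{2}/2\), for which \(c\,g(x)e^{-h(x)} \propto e^{-x^{2}}\), so the output should be \(N(0,1/2)\):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy choices: g = N(0,1) and h(x) = x^2/2 (nonnegative),
# so that the output density is proportional to exp(-x^2), i.e. N(0, 1/2).
def rejection_exp1(rng):
    while True:
        x = rng.normal()            # x ~ g
        u = rng.exponential(1.0)    # u ~ exp(1)
        if 0.5 * x * x <= u:        # accept when h(x) <= u
            return x

samples = np.array([rejection_exp1(rng) for _ in range(20_000)])
# N(0, 1/2) has mean 0 and standard deviation 1/sqrt(2).
assert abs(samples.mean()) < 0.03
assert abs(samples.std() - 1 / np.sqrt(2)) < 0.03
```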
\end{document}