Lecture IV
An Introduction to Markov Chain Monte Carlo
Lecture IV
Lecture IV
Abstract
The Rejection and Acceptance Complement Methods.
Exact Sampling Methods
The Metropolis algorithm and its variants are great, for they can be used to
generate ergodic Markov chains with, in principle, any pre specified
stationary distribution. Just choose an arbitrary starting point and
eventually the chain will begin sampling from its stationary distribution.
The problem is, of course, that (usually) we can't be 100% sure about when
this will happen. Exact sampling methods, when available, are therefore more
reliable and, often more efficient than MCMC methods. For this reason MCMC
sampling is almost always used in combination with exact methods. This is
specially true for the Gibbs sampler where we need to generate from the
full conditionals. What exact method to use depends on the particular
problem. There are literally hundreds of methods available. The current
bible is Devroye's book which is
unfortunately not available online. Besides the only truly universal
exact generator (the inverse cdf method introduced in
lecture I) the most flexible exact algorithm is
the rejection method.
The Rejection Method
The idea, I think first implemented by Von Neumann, is very simple. To
sample from f(x) find another simpler density g(x) from which you
know how to sample and such that c g(x) ³ f(x) for some c > 1 (we
say that c g envelopes f). Like in the following picture:
Then generate uniformly under the graph of
the envelope and accept only samples that fall under the graph of f.
In other words reject a sample if it falls in between the envelope
and the function f.
Copying from the bible:
Theorem 1
If f and g are densities on R^{p} with, f(x) £ c g(x) for
all x Î R^{p}, some c ³ 1. Then the following algorithm
The Rejection Method
{
REPEAT
{
x ¬ sample from g
u ¬ unif(0,1)
y ¬ c [g(x)/f(x)]
}
UNTIL uy £ 1
RETURN x
}
will produce a sample X = x with density f(x).
Proof: Consider the following two lemmas
Lemma 1
Let X be a random pvector with density f(x). Let U be a
uniform(0,1) r.v. independent of X, and let c > 0. Then,
 I.
 (X,c U f(X)) is uniform on,
A = {(x,u) : x Î R^{p}, 0 £ u £ c f(x) }
 II.
 If (X,U) is uniform on A, then X has density f(x).
Proof:
 I.
 Let B Ì A, then




ó õ

P[(x,c U f(x)) Î B  X = x ] f(x) dx 
 


ó õ


ì í
î


ó õ

B_{x} = {u:(x,u) Î B}


1 c f(x)

du 
ü ý
þ

f(x) dx 
 


 

where the second equality follows from Tonelli's theorem, the independence
of X and U and the fact that cf(x)U is uniform(0,cf(x)).
But,
A = 
ó õ

A

du dx = 
ó õ

R^{d}

( 
ó õ

cf(x)
0

du)dx = c 

thus, for any B Ì A measurable we have,
P( (X,cUf(X)) Î B ) = 
B A



which means that (X,cUf(X)) is uniform on A.
 II.
 We just need to show that, "B measurable,
P[X Î B] = ò_{B} f(x) dx. But,



P[ X Î B, 0 £ U £ c f(X) ] 
 

P[(X,U) Î B_{1} = {(x,u):x Î B, 0 £ u £ cf(x)}] 
 

 


1 c


ó õ

B

c f(x) dx = 
ó õ

B

f(x) dx. 

 

Lemma 2
Let X_{1},X_{2},¼ be iid vectors in R^{p}. Let A Ì R^{p} s.t.
P[X Î A] = a > 0. Let Y be the first X_{i} Î A. Then,
"B Ì R^{p} measurable,
Moreover, if X_{1} is uniform on A_{0} É A then Y is uniform on
A.
Proof:




¥ å
i = 1

P[X_{1}\not Î A,¼, X_{i1}\not Î A,X_{i} Î B 
Ç
 A ] 
 


¥ å
i = 1

(ia)^{i1} P[X_{1} Î B 
Ç
 A] 
 


 

Also, if X_{1} is uniform on A_{0} É A then for all measurable B,




P[X_{1} Î ABA_{0}) a

= 
ABA_{0} / A_{0} A_{0}A / A_{0}


 


 

therefore, Y is uniform on A.
We are now ready for the Theorem.
Proof: (that the Rejection Method is valid). In the first line
the algorithm generates (by Lemma 1, I) (X,cUg(X)) uniform on
C = {(x,u): x Î R^{p}, 0 £ u £ c g(x) }. However at
exit time (by Lemma 2), (X,cUg(X)) is uniform on
C = {(x,u): x Î R^{p}, 0 £ u £ f(x) }.
Notice that, by the assumption that f(x) £ cg(x)
(i.e. that g envelopes f), C É A.
Thus, (by Lemma 1, II) X has density f(x).
Best g is f itself
Let N be the number of pairs (x,u) generated by the rejection
algorithm to exit. We have,



P[ reject at least (i1) pairs ] 
 

P[ UY ³ 1 ]^{i1} = (1a)^{i1} 

 

where,



P[ U c 
g(X) f(X)

£ 1 ] = 
ó õ

P[Ucg(x) £ f(x)] g(x)] dx 
 


ó õ


f(x) c g(x)

g(x) dx = 
1 c

. 

 

The expected number of pairs generated by the algorithm is, < N > where,
< N > = 
¥ å
i = 1

P[N ³ i] = 
¥ å
i = 1

(1a)^{i1} = 
1 1(1a)

= 
1 a

= c 

Thus, c is the expected number of rejections and therefore it should be
kept as small as possible, i.e. as close as possible to its minimum value
of 1. Since, f(x) £ c g(x), in order for c to be small g must be
close to f.
Example: To sample from the standard Gaussian we can use the
rejection method with g as the Laplace distribution. Notice that,
f(x) = (2p)^{1/2} exp(x^{2}/2) 

and in order to get an upper bound for f we need a lower bound
for the energy, i.e. for x^{2}/2. But that follows easily from,

1 2

(x  1)^{2} = 
x^{2} 2

+ 
1 2

 x ³ 0 

from where we get,
exp(x^{2}/2) £ exp( 
1 2

 x) = (2p)^{1/2} 
æ ç
è

2e p

ö ÷
ø

1/2


æ ç
è

1 2

exp(x) 
ö ÷
ø



the last term in parenthesis above, is the density of the Laplace
distribution which the above inequality shows to envelope the N(0,1)
with,
c = 
æ ç
è

2e p

ö ÷
ø

1/2

» 1.3155 

Javascript Demo Implementation of the above in Javascript.
Solution by Ke Yang
There are several variations of the rejection method in the bible.
One fairly general method that never rejects any samples is:
The Acceptance Complement Method
Suppose that we don't know how to envelope f but,
f(x) = f_{1}(x) + f_{2}(x) 

with, f_{1}(x) ³ 0, f_{2}(x) ³ 0 and f_{1}(x) ³ g(x) where
g is a density. Furthermore suppose that we know how to sample from
g and from f_{2} (properly normalized), then,
The Acceptance Complement Method
{
x ¬ sample from g
u ¬ unif(0,1)
IF u > f_{1}(x) / g(x) THEN x ¬ sample from f_{2}/òf_{2}
RETURN x
}
Theorem 2
X = x is a sample from f.
Proof: Let a = òf_{2} and suppose that Y has density g. We
have,



P[Y Î B, U £ 
f_{1}(Y) g(Y)

] +P[U > 
f_{1}(Y) g(Y)

] 
ó õ

B

f_{2}(x) dx / a 
 


ó õ

B


f_{1}(x) g(x)

g(x) dx + 
æ ç
è

1  
ó õ


f_{1}(y) g(y)

g(y) dy 
ö ÷
ø


ó õ

B

f_{2}/a 
 


ó õ

B

f_{1} + 
ó õ

B

f_{2} = 
ó õ

B

f 

 

Example: f_{1}(x) = min{f(x),g(x)} and f_{2}(x) = (f(x)g(x))_{+}.
Then, clearly f_{1}(x) £ g(x) and f(x) = f_{1}(x) + f_{2}(x). So if
we know how to sample from g and f_{2}, we are done.
Another useful general split is available for almost flat densities on
[1,1], i.e. densities f(x) such that,
then we may take,



 

f(x)  (M  
1 2

), with M = 
sup
 f(x) 
 


 

which works since g and f_{2} are proportional to densities uniform on
[1,1] (so easy to sample from). That f_{1} is bounded above by g just
follows from the almost flat condition and that f_{1} > 0 follows from the
fact that,
0 £ 
inf
 f(x) £ 
1 2

£ 
sup
 f(x) £ 1 

Hence, the method can be used to generate from many densities symmetric about
with a single mode at 0, for which M = f(0) and inf f = f(1). For example
we can generate from the truncated Cauchy for x £ 1. We have,
f(x) = 
2 p(1+x^{2})

for x £ 1 

which is almost flat since 2/p 1/p = 1/p » 0.318 < 0.5. Using
the property that 1/X is Cauchy when X is Cauchy we can generate a
complete Cauchy by just using the Acceptance Complement Method to generate
from the truncated Cauchy and then with probability 1/2 return 1/X
instead of X.
Example: Javascript implementation of the Cauchy with the above method.
A Trivial Perfect MCMC Method
If there is available a good envelope for f then the rejection method is
preferable to asymptotic methods based on Markov chains. The problem is that
often good envelopes are not easy to find. A bad envelope can be easier to
find and still be useful when combined with a Markov chain method. For
example it may be possible to show that f(x) £ 10000 g(x) which means
that we can use the rejection method but expect to generate, on the average,
10000 (X,U) pairs before accepting one vector X. In this case we can
still use the rejection method to generate a single observation of X which
is warranted to have the correct density f. Then, use the observation
generated by the rejection method as the initial point for a Markov chain
method with statitionary distribution f and harvest the complete path of
the chain that is now sampling from its asymptotic distribution.
File translated from T_{E}X by T_{T}H, version 2.32.
On 5 Jul 1999, 22:59.