Advanced computation.
General strategies
Work in log-posterior scale to avoid under/overflow.
Factor posterior distribution where possible
Reparameterisation sometimes helps
Simulation from posterior
Three basic methods:
- inverse cdf
- rejection sampling
- posterior mode approximation
Rejection sampling
Requirements:
- Target density: `f(x)`
- Instrumental density: `g(x)`
- Constant `M` such that `f(x) <= M g(x)`
Method:
- generate `X ~ g` and `U ~ U[0,1]`
- set `Y = X` if `U <= (f(X)) / (Mg(X))`, reject otherwise
Finding `M` can be a problem, but in a Bayesian analysis we can use the following trick. Recall that `q(theta | y) = p(theta) p(y | theta)`. Let `g(theta) = p(theta)`, draw a value `theta^**` from the prior and `u` from `U[0,1]`. Let `M = p(y | hat theta)` where `hat theta` is the mode of the likelihood `p(y | theta)`. Accept `theta^**` if `u <= (q(theta^** | y))/(M p(theta^**)) = (p(y | theta^**))/(p(y | hat theta))`
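A minimal sketch of this trick, under an assumed conjugate model (`theta ~ N(0,1)` prior, `y | theta ~ N(theta, 1)` likelihood, so the likelihood mode is `hat theta = y`):

```python
import math
import random

random.seed(0)

def rejection_sample(y, n_draws):
    """Rejection sampling with the prior as the proposal g.

    Hypothetical model: theta ~ N(0, 1) prior, y | theta ~ N(theta, 1)
    likelihood, so the likelihood mode is hat_theta = y.
    """
    def lik(theta):
        # Unnormalised likelihood p(y | theta)
        return math.exp(-0.5 * (y - theta) ** 2)

    M = lik(y)  # bound: p(y | theta) <= p(y | hat_theta) for all theta
    draws = []
    while len(draws) < n_draws:
        theta_star = random.gauss(0.0, 1.0)        # theta** ~ prior
        if random.random() <= lik(theta_star) / M:  # u <= p(y|theta**)/p(y|hat_theta)
            draws.append(theta_star)
    return draws

draws = rejection_sample(y=1.0, n_draws=5000)
post_mean = sum(draws) / len(draws)
```

With these conjugate pieces the exact posterior is `N(y/2, 1/2)`, so the accepted draws can be checked against it.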
Importance sampling
Aka sampling importance resampling (SIR). Suppose we have draws `(theta_1, ..., theta_n)` from a distribution `g(theta)`: can we "convert" them to a sample from `f(theta | y)`? For each `theta_i` compute `phi_i = (f(theta_i | y))/(g(theta_i))` and `w_i = phi_i / sum_j phi_j`, then draw `theta^**` values from `(theta_1, ..., theta_n)` with weights `w_i`, without replacement.
Most often used to improve an approximation to the posterior.
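A minimal SIR sketch, assuming a hypothetical proposal `g = N(0, 2^2)` and target `f = N(1, 1)` standing in for `f(theta | y)`:

```python
import math
import random

random.seed(1)

def g_pdf(x):  # proposal density g = N(0, 2^2)
    return math.exp(-x * x / 8.0) / math.sqrt(8.0 * math.pi)

def f_pdf(x):  # target density f = N(1, 1), standing in for f(theta | y)
    return math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2.0 * math.pi)

thetas = [random.gauss(0.0, 2.0) for _ in range(5000)]
phi = [f_pdf(t) / g_pdf(t) for t in thetas]           # importance ratios phi_i
total = sum(phi)
pool = [(t, p / total) for t, p in zip(thetas, phi)]  # normalised weights w_i

# Resample 500 values without replacement, with probability proportional
# to weight (sequential scan: simple rather than efficient).
resampled = []
for _ in range(500):
    r = random.random() * sum(w for _, w in pool)
    acc = 0.0
    for i, (t, w) in enumerate(pool):
        acc += w
        if acc >= r:
            resampled.append(t)
            pool.pop(i)
            break

sir_mean = sum(resampled) / len(resampled)  # should be near the target mean 1
```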
Posterior mode approximation
Useful in themselves and also as good starting points for more sophisticated methods. Before embarking on complicated simulation, it's a good idea to explore the target distribution, developing crude estimators etc. Computational strategy:
- obtain initial crude estimates
- find joint posterior mode or marginal posterior modes (better if possible)
- use normal expansions at the mode
Finding mode is a (relatively) simple optimisation problem - easier to find than means as we don't need to integrate, and can use unnormalised densities. If joint can be factored, is easier to find joint mode by working along the marginal/conditional modes.
Three common methods are: conditional maximisation, Newton-Raphson and EM.
Newton-Raphson
Iterative algorithm based on a second-order Taylor approximation of `log p(theta | y)` - the linear term disappears because the gradient is zero at the mode. The constant term doesn't affect the density, so the second-order term leads to the Normal approximation `N(hat theta, I(hat theta)^-1)`, where `I` is the observed information. If there are multiple modes, a finite mixture of normal approximations is needed.
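A sketch of Newton-Raphson for a posterior mode, under an assumed model `y ~ Binomial(n, logit^-1(theta))` with a flat prior, so the mode is `logit(y/n)` and the normal approximation has variance `-1/L''(hat theta)`:

```python
import math

def grad_hess(theta, y, n):
    """Gradient and Hessian of log p(theta | y) for the assumed model
    y ~ Binomial(n, logit^-1(theta)) with a flat prior."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return y - n * p, -n * p * (1.0 - p)

def newton_mode(y, n, theta=0.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        g, h = grad_hess(theta, y, n)
        step = g / h
        theta -= step              # Newton step: theta - L'(theta)/L''(theta)
        if abs(step) < tol:
            break
    _, h = grad_hess(theta, y, n)
    return theta, -1.0 / h         # mode and variance of N(mode, I^-1)

mode, var = newton_mode(y=7, n=10)  # mode = logit(0.7), var = 1/(n*0.7*0.3)
```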
EM
If we want to maximise the marginal posterior `p(phi | y)`, where `p(gamma, phi | y) = p(gamma | phi, y) p(phi | y)`, we can treat `gamma` as missing data: the E-step averages the log joint posterior over `p(gamma | phi_t, y)`, and the M-step maximises that expectation over `phi`.
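A minimal EM sketch for a hypothetical two-component normal mixture with known equal weights and unit variances: the component indicators play the role of the missing `gamma`, the two means the role of `phi`:

```python
import math
import random

random.seed(2)

# Simulated data: 200 points from N(-2, 1) and 200 from N(3, 1).
data = ([random.gauss(-2.0, 1.0) for _ in range(200)] +
        [random.gauss(3.0, 1.0) for _ in range(200)])

mu1, mu2 = -1.0, 1.0  # initial guesses for phi = (mu1, mu2)
for _ in range(100):
    # E-step: posterior probability each point belongs to component 1
    # (equal weights and unit variances cancel in the ratio)
    resp = []
    for y in data:
        a = math.exp(-0.5 * (y - mu1) ** 2)
        b = math.exp(-0.5 * (y - mu2) ** 2)
        resp.append(a / (a + b))
    # M-step: weighted means maximise the expected complete-data log posterior
    mu1 = sum(r * y for r, y in zip(resp, data)) / sum(resp)
    mu2 = sum((1 - r) * y for r, y in zip(resp, data)) / (len(data) - sum(resp))
```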
Markov Chain Monte Carlo
Based on stochastic process theory, useful for approximating target posterior distributions. Can be used even in high-dimensional examples. Three methods:
- Metropolis-Hastings
- Metropolis
- Gibbs sampler
(M & GS both special cases of MH)
Definitions:
- irreducible if every state can be reached from every other state
- periodic with period `d` if `p^n(j|i) = 0` unless `n = kd`, `k in NN`. If `d = 1` the chain is aperiodic
- if a chain is irreducible and aperiodic, then it is ergodic: `bar a_n = sum_t a(X_t)/n -> E{a(X)}` (with geometric rate of convergence and standard error `sqrt( (sigma^2/n) (1 + 2 sum_i rho_i(a)) )`, where `rho_i` is the lag-`i` autocorrelation)
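An illustration of the ergodic average, using an assumed toy chain (AR(1) with coefficient 0.5, stationary mean 0, lag-`i` autocorrelation `0.5^i`, so the inflation factor `1 + 2 sum_i rho_i` is exactly 3):

```python
import random

random.seed(3)

# AR(1) chain X_t = 0.5 * X_{t-1} + e_t, e_t ~ N(0, 1): irreducible and
# aperiodic, stationary mean 0, so the ergodic average converges to 0.
x, n, total = 0.0, 200000, 0.0
for _ in range(n):
    x = 0.5 * x + random.gauss(0.0, 1.0)
    total += x
a_bar = total / n

# Lag-i autocorrelation is 0.5**i, so the standard error is inflated by
# 1 + 2 * sum_i 0.5**i = 1 + 2 * (0.5 / (1 - 0.5)) = 3 relative to iid draws.
```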
Finite-state, irreducible, aperiodic Markov chains have a limiting distribution `lim_(n->oo) p^n(j|i) = pi`. We want to use this in reverse - we have a target distribution, can we generate a Markov chain with it as its limiting distribution?
The Gibbs sampler
Useful when simulating from the conditionals is easy.
- Start with vector of initial guesses
- Draw first parameter from conditional with all others fixed
- Draw second parameter ...
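The steps above can be sketched for a bivariate normal target (assumed: zero means, unit variances, correlation `rho`), whose full conditionals are both normal:

```python
import math
import random

random.seed(4)

rho = 0.8
sd = math.sqrt(1.0 - rho * rho)  # conditional sd for both coordinates
x, y = 2.0, -2.0                 # vector of initial guesses
xs, ys = [], []
for t in range(20000):
    x = random.gauss(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2), y fixed
    y = random.gauss(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2), using new x
    if t >= 1000:                  # discard burn-in
        xs.append(x)
        ys.append(y)

mean_x = sum(xs) / len(xs)                               # ~ 0
corr_hat = sum(a * b for a, b in zip(xs, ys)) / len(xs)  # ~ rho
```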
Metropolis-Hastings algorithm
More flexible than Gibbs - rather than sampling from conditionals, MH permits many proposal distributions.
- Given draw `theta_t` at iteration `t`, sample a candidate `theta^**` from a proposal distribution `J(theta^** | theta_t)`
- Accept the draw with probability `min(1, r)`, where `r = (p(theta^** | y) / J(theta^** | theta_t))/(p(theta_t | y) / J(theta_t | theta^**))`
For `p(theta | y)` to be the desired stationary distribution, the chain needs to be reversible (satisfy detailed balance): `p(theta^(t) | y) J(theta^((t+1)) | theta^(t)) = p(theta^((t+1)) | y) J(theta^(t) | theta^((t+1)))`. Reversibility is a sufficient condition, and since it is typically not satisfied by `J` alone, we accept the move from `theta^(t)` to `theta^**` with probability `min(1, r)`. When the proposal is symmetric, `r = (p(theta^** | y))/(p(theta^(t) | y))` - this is the Metropolis algorithm.
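A sketch of the full MH ratio with a deliberately non-symmetric proposal, so both `J` terms matter. The target is an assumed unnormalised density `p(theta | y) ∝ exp(-theta^4/4)`, and everything is done on the log scale, per the general strategies above:

```python
import math
import random

random.seed(5)

def log_target(theta):
    # Unnormalised log target, standing in for log p(theta | y)
    return -theta ** 4 / 4.0

def log_J(to, frm):
    # Log density of the proposal J(to | frm) = N(0.9 * frm, 1): depends
    # on the current draw but is NOT symmetric in its arguments.
    return -0.5 * (to - 0.9 * frm) ** 2

theta, draws = 0.0, []
for _ in range(50000):
    theta_star = random.gauss(0.9 * theta, 1.0)
    log_r = (log_target(theta_star) + log_J(theta, theta_star)
             - log_target(theta) - log_J(theta_star, theta))
    if log_r >= 0 or random.random() <= math.exp(log_r):  # accept w.p. min(1, r)
        theta = theta_star
    draws.append(theta)

mh_mean = sum(draws) / len(draws)  # target is symmetric about 0
```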
Markov chain will converge as long as:
- chain has nonzero probability of reaching any point in parameter space
- chain does not shoot off to infinity
`J` affects rate of convergence - optimal `J` is `p(theta | y)` in which case `r=1`. Otherwise choose `J` so that it is easy to compute, easy to sample from, and it leads to rapid convergence and mixing. There are three main approaches:
- independence sampler
- random walk MH (most popular)
- approximation MH
Independence sampler
Proposal distribution does not depend on `theta_t`: just find a distribution `g(theta)` and generate values from it. Works very well when `g` is a good approximation to the posterior with heavier tails. With large samples the proposal could be normal, centred at the mode, with variance larger than the inverse Fisher information evaluated at the mode.
Random walk MH
Easy to use. Proposal is often normal, centred at the current draw, with variance `V`. `V` is difficult to choose - too small: takes too long to move through the parameter space; too large: many draws are rejected because of jumps to the extremes. Optimal acceptance rates are 25-50%, getting lower with increasing dimensionality.
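A random-walk Metropolis sketch against an assumed standard-normal target; with proposal variance `V ≈ (2.4)^2` times the target variance, the acceptance rate lands in a reasonable range:

```python
import math
import random

random.seed(6)

def run_chain(V, n=20000):
    """Random-walk Metropolis for an assumed N(0, 1) target.
    Symmetric proposal N(theta_t, V), so r = p(theta**|y)/p(theta_t|y)."""
    theta, accepted, draws = 0.0, 0, []
    sd = math.sqrt(V)
    for _ in range(n):
        theta_star = random.gauss(theta, sd)
        log_r = 0.5 * theta ** 2 - 0.5 * theta_star ** 2
        if log_r >= 0 or random.random() <= math.exp(log_r):
            theta = theta_star
            accepted += 1
        draws.append(theta)
    return draws, accepted / n

draws, rate = run_chain(V=5.76)  # proposal sd 2.4: near-optimal in 1-D
rw_mean = sum(draws) / len(draws)
```

Rerunning with `V=0.01` or `V=1000` shows the two failure modes: slow exploration versus heavy rejection.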
Approximation MH
Idea is to improve the approximation `J` as we learn more about `theta`, e.g. in random walk MH the variance could also vary with the draw. Proposals are typically not symmetric.
Other considerations
Starting values: If the chain is irreducible, the choice of `theta^0` will not affect convergence. With multiple chains, can choose overdispersed starting values.
How many draws? Even after convergence, draws from the stationary distribution are not iid. May need to thin the chain by only keeping every `k`th draw
Convergence: Can check for convergence by running multiple chains. The Gelman-Rubin statistic compares within-chain to between-chain sums of squares.
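A minimal sketch of the Gelman-Rubin diagnostic, computed here for chains that are by construction already draws from the same distribution, so the statistic should be close to 1:

```python
import random

random.seed(7)

def gelman_rubin(chains):
    """Potential scale reduction factor from m chains of length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)           # within-chain
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled posterior-variance estimate
    return (var_hat / W) ** 0.5

# Four "converged" chains: iid N(0, 1) draws.
chains = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
```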