Advanced computation.
General strategies
Work in log-posterior scale to avoid under/overflow.
Factor posterior distribution where possible
Reparameterisation sometimes helps
Simulation from posterior
Three basic methods:
- inverse cdf
- rejection sampling
- posterior mode approximation
Rejection sampling
Requirements:
- Target density: `f(x)`
- Instrumental density: `g(x)`
- Constant `M` such that `f(x) <= M g(x)`
Method:
- generate `X ~ g` and `U ~ U[0,1]`
- set `Y = X` if `U <= (f(X)) / (Mg(X))`, reject otherwise
Finding `M` can be a problem, but in a Bayesian analysis we can use the following trick. Recall that `q(theta | y) = p(theta) p(y | theta)`. Let `g(theta) = p(theta)`, draw a value `theta^**` from the prior and `u` from `U[0,1]`. Let `M = p(y | hat theta)` where `hat theta` is the mode of the likelihood `p(y | theta)`. Accept `theta^**` if `u <= (q(theta^** | y))/(M p(theta^**)) = (p(y | theta^**))/(p(y | hat theta))`
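A minimal sketch of this trick, under an assumed conjugate model (`theta ~ N(0,1)` prior, `y | theta ~ N(theta, 1)` likelihood, so the likelihood mode is `hat theta = y`):

```python
import math
import random

random.seed(0)

def rejection_sample(y, n_draws):
    """Rejection sampling with the prior as the proposal g.

    Hypothetical model: theta ~ N(0, 1) prior, y | theta ~ N(theta, 1)
    likelihood, so the likelihood mode is hat_theta = y.
    """
    def lik(theta):
        # Unnormalised likelihood p(y | theta)
        return math.exp(-0.5 * (y - theta) ** 2)

    M = lik(y)  # bound: p(y | theta) <= p(y | hat_theta) for all theta
    draws = []
    while len(draws) < n_draws:
        theta_star = random.gauss(0.0, 1.0)        # theta** ~ prior
        if random.random() <= lik(theta_star) / M:  # u <= p(y|theta**)/p(y|hat_theta)
            draws.append(theta_star)
    return draws

draws = rejection_sample(y=1.0, n_draws=5000)
post_mean = sum(draws) / len(draws)
```

With these conjugate pieces the exact posterior is `N(y/2, 1/2)`, so the accepted draws can be checked against it.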
Importance sampling
Aka sampling importance resampling (SIR). Suppose we have draws `(theta_1, ..., theta_n)` from a distribution `g(theta)`: can we "convert" them to a sample from `f(theta | y)`? For each `theta_i` compute `phi_i = (f(theta_i | y))/(g(theta_i))` and `w_i = phi_i / sum_j phi_j`, then draw `theta^**` values from `(theta_1, ..., theta_n)` with weights `w_i`, without replacement.
Most often used to improve an approximation to the posterior.
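A minimal SIR sketch, assuming a hypothetical proposal `g = N(0, 2^2)` and target `f = N(1, 1)` standing in for `f(theta | y)`:

```python
import math
import random

random.seed(1)

def g_pdf(x):  # proposal density g = N(0, 2^2)
    return math.exp(-x * x / 8.0) / math.sqrt(8.0 * math.pi)

def f_pdf(x):  # target density f = N(1, 1), standing in for f(theta | y)
    return math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2.0 * math.pi)

thetas = [random.gauss(0.0, 2.0) for _ in range(5000)]
phi = [f_pdf(t) / g_pdf(t) for t in thetas]           # importance ratios phi_i
total = sum(phi)
pool = [(t, p / total) for t, p in zip(thetas, phi)]  # normalised weights w_i

# Resample 500 values without replacement, with probability proportional
# to weight (sequential scan: simple rather than efficient).
resampled = []
for _ in range(500):
    r = random.random() * sum(w for _, w in pool)
    acc = 0.0
    for i, (t, w) in enumerate(pool):
        acc += w
        if acc >= r:
            resampled.append(t)
            pool.pop(i)
            break

sir_mean = sum(resampled) / len(resampled)  # should be near the target mean 1
```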
Posterior mode approximation
Useful in themselves and also as good starting points for more sophisticated methods. Before embarking on complicated simulation, it's a good idea to explore the target distribution, developing crude estimators etc. Computational strategy:
- obtain initial crude estimates
- find joint posterior mode or marginal posterior modes (better if possible)
- use normal expansions at the mode
Finding mode is a (relatively) simple optimisation problem - easier to find than means as we don't need to integrate, and can use unnormalised densities. If joint can be factored, is easier to find joint mode by working along the marginal/conditional modes.
Three common methods are: conditional maximisation, Newton-Raphson and EM.
Newton-Raphson
Iterative algorithm based on a second-order Taylor approximation of `log p(theta | y)` - the linear term disappears because the gradient is zero at the mode. The constant term doesn't affect the density, so the second-order term leads to the Normal approximation `N(hat theta, I(hat theta)^-1)`, where `I` is the observed information. If there are multiple modes, a finite mixture of normal approximations is needed.
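A sketch of Newton-Raphson for a posterior mode, under an assumed model `y ~ Binomial(n, logit^-1(theta))` with a flat prior, so the mode is `logit(y/n)` and the normal approximation has variance `-1/L''(hat theta)`:

```python
import math

def grad_hess(theta, y, n):
    """Gradient and Hessian of log p(theta | y) for the assumed model
    y ~ Binomial(n, logit^-1(theta)) with a flat prior."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return y - n * p, -n * p * (1.0 - p)

def newton_mode(y, n, theta=0.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        g, h = grad_hess(theta, y, n)
        step = g / h
        theta -= step              # Newton step: theta - L'(theta)/L''(theta)
        if abs(step) < tol:
            break
    _, h = grad_hess(theta, y, n)
    return theta, -1.0 / h         # mode and variance of N(mode, I^-1)

mode, var = newton_mode(y=7, n=10)  # mode = logit(0.7), var = 1/(n*0.7*0.3)
```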
EM
If we want to maximise the marginal posterior `p(phi | y)`, where `p(gamma, phi | y) = p(gamma | phi, y) p(phi | y)`, we can treat `gamma` as missing data: the E-step averages the log joint posterior over `p(gamma | phi_t, y)`, and the M-step maximises that expectation over `phi`.
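A minimal EM sketch for a hypothetical two-component normal mixture with known equal weights and unit variances: the component indicators play the role of the missing `gamma`, the two means the role of `phi`:

```python
import math
import random

random.seed(2)

# Simulated data: 200 points from N(-2, 1) and 200 from N(3, 1).
data = ([random.gauss(-2.0, 1.0) for _ in range(200)] +
        [random.gauss(3.0, 1.0) for _ in range(200)])

mu1, mu2 = -1.0, 1.0  # initial guesses for phi = (mu1, mu2)
for _ in range(100):
    # E-step: posterior probability each point belongs to component 1
    # (equal weights and unit variances cancel in the ratio)
    resp = []
    for y in data:
        a = math.exp(-0.5 * (y - mu1) ** 2)
        b = math.exp(-0.5 * (y - mu2) ** 2)
        resp.append(a / (a + b))
    # M-step: weighted means maximise the expected complete-data log posterior
    mu1 = sum(r * y for r, y in zip(resp, data)) / sum(resp)
    mu2 = sum((1 - r) * y for r, y in zip(resp, data)) / (len(data) - sum(resp))
```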
Markov Chain Monte Carlo
Based on stochastic process theory, useful for approximating target posterior distributions. Can be used even in high-dimensional examples. Three methods:
- Metropolis-Hastings
- Metropolis
- Gibbs sampler
(M & GS both special cases of MH)
Definitions:
- irreducible if every state can be reached from every other state
- periodic with period `d` if `p^n(j|i) = 0` unless `n = kd`, `k in NN`. If `d = 1` the chain is aperiodic
- if a chain is irreducible and aperiodic, then it is ergodic: `bar a_n = sum_t a(X_t)/n -> E{a(X)}` (with geometric rate of convergence and standard error `sqrt( (sigma^2/n) (1 + 2 sum_i rho_i(a)) )`, where `rho_i` is the lag-`i` autocorrelation)
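An illustration of the ergodic average, using an assumed toy chain (AR(1) with coefficient 0.5, stationary mean 0, lag-`i` autocorrelation `0.5^i`, so the inflation factor `1 + 2 sum_i rho_i` is exactly 3):

```python
import random

random.seed(3)

# AR(1) chain X_t = 0.5 * X_{t-1} + e_t, e_t ~ N(0, 1): irreducible and
# aperiodic, stationary mean 0, so the ergodic average converges to 0.
x, n, total = 0.0, 200000, 0.0
for _ in range(n):
    x = 0.5 * x + random.gauss(0.0, 1.0)
    total += x
a_bar = total / n

# Lag-i autocorrelation is 0.5**i, so the standard error is inflated by
# 1 + 2 * sum_i 0.5**i = 1 + 2 * (0.5 / (1 - 0.5)) = 3 relative to iid draws.
```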
Finite-state, irreducible, aperiodic Markov chains have a limiting distribution `lim_(n->oo) p^n(j|i) = pi`. We want to use this in reverse - we have a target distribution, can we generate a Markov chain with it as its limiting distribution?
The Gibbs sampler
Useful when simulating from the conditionals is easy.
- Start with vector of initial guesses
- Draw first parameter from conditional with all others fixed
- Draw second parameter ...
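The steps above can be sketched for a bivariate normal target (assumed: zero means, unit variances, correlation `rho`), whose full conditionals are both normal:

```python
import math
import random

random.seed(4)

rho = 0.8
sd = math.sqrt(1.0 - rho * rho)  # conditional sd for both coordinates
x, y = 2.0, -2.0                 # vector of initial guesses
xs, ys = [], []
for t in range(20000):
    x = random.gauss(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2), y fixed
    y = random.gauss(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2), using new x
    if t >= 1000:                  # discard burn-in
        xs.append(x)
        ys.append(y)

mean_x = sum(xs) / len(xs)                               # ~ 0
corr_hat = sum(a * b for a, b in zip(xs, ys)) / len(xs)  # ~ rho
```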
Metropolis-Hastings algorithm
More flexible than Gibbs - rather than sampling from conditionals, MH permits many proposal distributions.
- Given draw `theta_t` at iteration `t`, sample a candidate `theta^**` from a proposal distribution `J(theta^** | theta_t)`
- Accept the draw with probability `min(1, r)`, where `r = (p(theta^** | y) / J(theta^** | theta_t))/(p(theta_t | y) / J(theta_t | theta^**))`
For `p(theta | y)` to be the desired stationary distribution, the chain needs to be reversible (satisfy detailed balance): `p(theta^(t) | y) J(theta^((t+1)) | theta^(t)) = p(theta^((t+1)) | y) J(theta^(t) | theta^((t+1)))`. Reversibility is a sufficient condition, and since it is typically not satisfied by `J` alone, we accept the move from `theta^(t)` to `theta^**` with probability `min(1, r)`. When the proposal is symmetric, `r = (p(theta^** | y))/(p(theta^(t) | y))` - this is the Metropolis algorithm.
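A sketch of the full MH ratio with a deliberately non-symmetric proposal, so both `J` terms matter. The target is an assumed unnormalised density `p(theta | y) ∝ exp(-theta^4/4)`, and everything is done on the log scale, per the general strategies above:

```python
import math
import random

random.seed(5)

def log_target(theta):
    # Unnormalised log target, standing in for log p(theta | y)
    return -theta ** 4 / 4.0

def log_J(to, frm):
    # Log density of the proposal J(to | frm) = N(0.9 * frm, 1): depends
    # on the current draw but is NOT symmetric in its arguments.
    return -0.5 * (to - 0.9 * frm) ** 2

theta, draws = 0.0, []
for _ in range(50000):
    theta_star = random.gauss(0.9 * theta, 1.0)
    log_r = (log_target(theta_star) + log_J(theta, theta_star)
             - log_target(theta) - log_J(theta_star, theta))
    if log_r >= 0 or random.random() <= math.exp(log_r):  # accept w.p. min(1, r)
        theta = theta_star
    draws.append(theta)

mh_mean = sum(draws) / len(draws)  # target is symmetric about 0
```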
Markov chain will converge as long as:
- chain has nonzero probability of reaching any point in parameter space
- chain does not shoot off to infinity
`J` affects rate of convergence - optimal `J` is `p(theta | y)` in which case `r=1`. Otherwise choose `J` so that it is easy to compute, easy to sample from, and it leads to rapid convergence and mixing. There are three main approaches:
- independence sampler
- random walk MH (most popular)
- approximation MH
Independence sampler
Proposal distribution does not depend on `theta_t`: just find a distribution `g(theta)` and generate values from it. Works very well when `g` is a good approximation to the posterior with heavier tails. With large samples the proposal could be normal, centred at the mode, with variance larger than the inverse Fisher information evaluated at the mode.
Random walk MH
Easy to use. Proposal is often normal, centred at the current draw, with variance `V`. `V` is difficult to choose - too small: takes too long to move through the parameter space; too large: many draws are rejected because of jumps to the extremes. Optimal acceptance rates are 25-50%, getting lower with increasing dimensionality.
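A random-walk Metropolis sketch against an assumed standard-normal target; with proposal variance `V ≈ (2.4)^2` times the target variance, the acceptance rate lands in a reasonable range:

```python
import math
import random

random.seed(6)

def run_chain(V, n=20000):
    """Random-walk Metropolis for an assumed N(0, 1) target.
    Symmetric proposal N(theta_t, V), so r = p(theta**|y)/p(theta_t|y)."""
    theta, accepted, draws = 0.0, 0, []
    sd = math.sqrt(V)
    for _ in range(n):
        theta_star = random.gauss(theta, sd)
        log_r = 0.5 * theta ** 2 - 0.5 * theta_star ** 2
        if log_r >= 0 or random.random() <= math.exp(log_r):
            theta = theta_star
            accepted += 1
        draws.append(theta)
    return draws, accepted / n

draws, rate = run_chain(V=5.76)  # proposal sd 2.4: near-optimal in 1-D
rw_mean = sum(draws) / len(draws)
```

Rerunning with `V=0.01` or `V=1000` shows the two failure modes: slow exploration versus heavy rejection.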
Approximation MH
Idea is to improve the approximation `J` as we learn more about `theta`, e.g. in random walk MH the variance could also vary with the draw. Proposals are typically not symmetric.
Other considerations
Starting values: If the chain is irreducible, the choice of `theta^0` will not affect convergence. With multiple chains, can choose overdispersed starting values.
How many draws? Even after convergence, draws from the stationary distribution are not iid. May need to thin the chain by only keeping every `k`th draw
Convergence: Can check for convergence by running multiple chains. The Gelman-Rubin statistic compares within-chain to between-chain sums of squares.
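A minimal sketch of the Gelman-Rubin diagnostic, computed here for chains that are by construction already draws from the same distribution, so the statistic should be close to 1:

```python
import random

random.seed(7)

def gelman_rubin(chains):
    """Potential scale reduction factor from m chains of length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)           # within-chain
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled posterior-variance estimate
    return (var_hat / W) ** 0.5

# Four "converged" chains: iid N(0, 1) draws.
chains = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
```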