Single parameter models.

Conjugate priors

`F` a class of sampling distributions, `P` a class of prior distributions. Then `P` is conjugate for `F` if `p(theta) in P` and `p(y|theta) in F` implies `p(theta | y) in P`. Exponential family sampling distributions have natural conjugate priors: `p(y | theta) prop g(theta)^eta exp(phi(theta)^T t(y))`, `p(theta) prop g(theta)^n exp(phi(theta)^T nu)`, then posterior also in exponential form `p(theta | y) prop g(theta)^(n+eta) exp(phi(theta)^T (t(y) + nu))`.
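A minimal sketch of conjugate updating with the Beta-Binomial pair (a standard exponential-family example; all numbers below are illustrative, not from the text):

```python
# Conjugate updating: Binomial likelihood with a Beta prior.
# Posterior stays in the Beta family, so updating is just arithmetic.

def beta_binomial_update(alpha, beta, y, n):
    """Beta(alpha, beta) prior + y successes in n trials
    -> Beta(alpha + y, beta + n - y) posterior."""
    return alpha + y, beta + (n - y)

a_post, b_post = beta_binomial_update(2.0, 2.0, y=7, n=10)
print(a_post, b_post)               # posterior is Beta(9, 5)
print(a_post / (a_post + b_post))   # posterior mean 9/14 ~ 0.643
```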

Table of standard conjugate priors to go here:

Non-informative priors

Minimal effect on posterior - let the data speak for themselves. Conjugate priors can be almost non-informative.

Uniform prior is the natural non-informative prior for location parameters

If the density of `y` is of the form `p(y - theta | theta)` then it is a location density with location parameter `theta`. Invariance under shifts then gives `pi(theta) prop 1`

A scale density is of the form `p(y | sigma) prop sigma^(-1) p(y/sigma)`. Arguing similarly to above, the non-informative prior must have the form `pi(sigma) prop sigma^(-1)`.

Jeffreys prior

Non-informative, and invariant to one-to-one transformations. `p(theta) prop [I(theta)]^(1/2)` where `I(theta) = -E((del^2)/(del theta^2) log p(y | theta))` is the Fisher information. Jeffreys priors are locally uniform (ie. uniformly distributed in every small region) and therefore non-informative. However, sometimes a Jeffreys prior violates the likelihood principle.

Multiparameter models.

Classical approach: maximise a joint likelihood, proceed in steps. Bayesian approach: base inference on marginal posterior distributions of interest, eg `p(theta_1 | y) = int p(theta_1, theta_2 | y) dtheta_2 = int p(theta_1 | theta_2, y) p(theta_2 | y) dtheta_2`

Integral not usually calculated directly: take a two-step simulation process - draw `theta_2` from `p(theta_2 | y)`, then `theta_1` from `p(theta_1 | theta_2, y)`

Normal model

Sampling distribution: `y_i ~ N(mu, sigma^2)`

Non-informative prior `p(mu, sigma^2) prop sigma^(-2)`

Conjugate prior: `mu | sigma^2 ~ N(mu_0, sigma^2/kappa_0)`, `sigma^2 ~ "Inv-chi"^2(nu_0, sigma^2_0)`.

Both lead to posteriors of the same form.
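A sketch of the two-step simulation for the non-informative case, where `sigma^2 | y ~ "Inv-chi"^2(n-1, s^2)` and `mu | sigma^2, y ~ N(bar y, sigma^2/n)` (the data below are made up):

```python
import math
import random

random.seed(1)

# Draws from the joint posterior of (mu, sigma^2) under p(mu, sigma^2) prop sigma^(-2).
y = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7]   # assumed data
n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

def draw_posterior():
    # Step 1: sigma^2 | y ~ Inv-chi^2(n-1, s^2): draw X ~ chi^2(n-1), set (n-1)s^2/X
    x = random.gammavariate((n - 1) / 2, 2.0)   # chi^2(n-1) draw
    sigma2 = (n - 1) * s2 / x
    # Step 2: mu | sigma^2, y ~ N(ybar, sigma^2/n)
    mu = random.gauss(ybar, math.sqrt(sigma2 / n))
    return mu, sigma2

draws = [draw_posterior() for _ in range(5000)]
mu_mean = sum(d[0] for d in draws) / len(draws)
print(round(mu_mean, 2))   # close to ybar
```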

Semi-conjugate prior: `mu ~ N(mu_0, tau^2_0)`, `sigma^2 ~ "Inv-chi"^2(nu_0, sigma^2_0)`. The posterior does not have a standard form.

Multinomial model

Generalisation of binomial model.

Sampling distribution: `p(y | theta) prop prod^k_(j=1) theta_j^(y_j)`. Conjugate prior: Dirichlet, `p(theta | alpha) prop prod^k_(j=1) theta_j^(alpha_j - 1)`. Posterior: Dirichlet with `alpha_j' = alpha_j + y_j`

Can also be represented as the product of `k` independent Poisson rvs `y_j ~ Poi(lambda_j)` with restriction `sum_j y_j = n`, then `theta_j = lambda_j / sum(lambda_i)`

Multivariate normal

Useful distributions and equivalences

Dirichlet-Gamma: draw `x_1, ..., x_k` from `"Gamma"(alpha_j, delta)` for any common `delta` and set `theta_j = x_j / sum_i x_i`; then `theta ~ "Dirichlet"(alpha)`. `"Inv-chi"^2`-`chi^2`: if `X ~ chi^2(nu)` then `nu sigma_0^2 / X ~ "Inv-chi"^2(nu, sigma_0^2)`.
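Both equivalences as sampling tricks (parameter values assumed; `gammavariate` takes shape then scale):

```python
import random

random.seed(0)

# 1. Dirichlet via Gamma: x_j ~ Gamma(alpha_j, common scale), theta_j = x_j / sum_i x_i.
def draw_dirichlet(alphas):
    xs = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(xs)
    return [x / total for x in xs]

# 2. Inv-chi^2 via chi^2: X ~ chi^2(nu)  =>  nu * s2 / X ~ Inv-chi^2(nu, s2).
def draw_inv_chi2(nu, s2):
    x = random.gammavariate(nu / 2, 2.0)   # chi^2(nu) draw
    return nu * s2 / x

theta = draw_dirichlet([3.0, 1.0, 2.0])
print(round(sum(theta), 6))   # components sum to 1
print(draw_inv_chi2(5, 1.0) > 0)
```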

Advanced computation.

General strategies

Work in log-posterior scale to avoid under/overflow.
Factor posterior distribution where possible
Reparameterisation sometimes helps

Simulation from posterior

Three basic methods:

Rejection sampling

Requirements:

Method:

Finding `M` can be a problem, but in a Bayesian analysis we can use the following trick. Recall that the unnormalised posterior is `q(theta | y) = p(theta) p(y | theta)`. Let `g(theta) = p(theta)` and `M = p(y | hat theta)`, where `hat theta` is the mode of the likelihood `p(y | theta)`. Draw `theta^**` from the prior and `u` from `U[0,1]`, and accept `theta^**` if `u <= (q(theta^** | y))/(M g(theta^**)) = (p(y | theta^**))/(p(y | hat theta))`
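A sketch of this trick for a binomial likelihood with a uniform prior (the data are assumed):

```python
import random

random.seed(2)

# Rejection sampling with the prior as proposal and M set at the likelihood mode.
n, y = 20, 14   # assumed data: 14 successes in 20 trials

def lik(theta):
    return theta ** y * (1 - theta) ** (n - y)

theta_hat = y / n          # likelihood mode (MLE)
M = lik(theta_hat)         # bound: lik(theta) <= M for all theta

def draw_posterior():
    while True:
        theta = random.random()                 # draw from Uniform(0,1) prior
        if random.random() <= lik(theta) / M:   # accept w.p. lik(theta)/lik(theta_hat)
            return theta

draws = [draw_posterior() for _ in range(2000)]
print(round(sum(draws) / len(draws), 2))   # near the Beta(15, 7) mean 15/22
```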

Importance sampling

Also known as sampling importance resampling (SIR). Suppose we have draws `(theta_1, ..., theta_n)` from a distribution `g(theta)`; can we "convert" them to a sample from `f(theta | y)`? For each `theta_i` compute `phi_i = (f(theta_i | y))/(g(theta_i))` and `w_i = phi_i / sum_j phi_j`, then draw `theta^**` from `(theta_1, ..., theta_n)` with weights `w_i`, without replacement.

Most often used to improve an approximation to the posterior.
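A SIR sketch with a uniform `g` and a `Beta(3, 2)`-shaped target (resampling here is with replacement for simplicity; the recipe above draws without replacement):

```python
import random

random.seed(3)

# Target: f(theta) prop theta^2 (1 - theta), i.e. Beta(3, 2) unnormalised.
def f(theta):
    return theta ** 2 * (1 - theta)

thetas = [random.random() for _ in range(20000)]   # draws from g = Uniform(0,1)
phis = [f(t) for t in thetas]                      # phi_i = f/g with g = 1
total = sum(phis)
weights = [p / total for p in phis]                # normalised importance weights

# Resample according to the weights (with replacement, for simplicity).
resampled = random.choices(thetas, weights=weights, k=2000)
print(round(sum(resampled) / len(resampled), 2))   # near the Beta(3, 2) mean 0.6
```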

Posterior mode approximation

Useful in themselves and also as good starting points for more sophisticated methods. Before embarking on complicated simulation, it's a good idea to explore the target distribution, developing crude estimators etc. Computational strategy:

Finding the mode is a (relatively) simple optimisation problem - modes are easier to find than means as we don't need to integrate, and can use unnormalised densities. If the joint can be factored, it is easier to find the joint mode by working along the marginal/conditional modes.

Three common methods are: conditional maximisation, Newton-Raphson and EM.

Newton-Raphson

Iterative algorithm based on a second-order Taylor approximation of `log p(theta | y)` - the linear term disappears because the gradient is zero at the mode. The constant term doesn't affect the density, so the second-order term leads to the Normal approximation `theta | y ~ N(hat theta, I(hat theta)^(-1))`. If there are multiple modes we will need to use a finite mixture approximation.
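A Newton-Raphson sketch on a 1-D unnormalised log posterior (a `Beta(3, 2)`-shaped example, assumed for illustration):

```python
# Newton-Raphson for the mode of log p(theta) = 2 log(theta) + log(1 - theta) + const,
# i.e. an unnormalised Beta(3, 2) log density. Analytic mode: (3-1)/(3+2-2) = 2/3.

def grad(theta):   # first derivative of the log density
    return 2 / theta - 1 / (1 - theta)

def hess(theta):   # second derivative of the log density
    return -2 / theta ** 2 - 1 / (1 - theta) ** 2

theta = 0.5        # starting guess
for _ in range(20):
    theta = theta - grad(theta) / hess(theta)   # Newton step

print(round(theta, 4))   # converges to the analytic mode 2/3
```

Note that `-hess(theta)` evaluated at the mode is the observed information, whose inverse gives the variance of the normal approximation above.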

EM

If we want to maximise `p(phi | y)`, where `p(gamma, phi | y) = p(gamma | phi, y) p(phi | y)`, we can treat `gamma` as missing data

Markov Chain Monte Carlo

Based on stochastic process theory, useful for approximating target posterior distributions. Can be used even in high-dimensional examples. Three methods:

(Metropolis and the Gibbs sampler are both special cases of MH)

Definitions:

Finite-state, irreducible, aperiodic Markov chains have a limiting distribution `lim_(n->oo) p^n(j | i) = pi_j`. We want to use this in reverse - we have a target distribution, can we generate a Markov chain with it as its limiting distribution?

The Gibbs sampler

Useful when simulating from the conditionals is easy.

  1. Start with vector of initial guesses
  2. Draw first parameter from conditional with all others fixed
  3. Draw second parameter ...
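A sketch with a bivariate normal target, where both full conditionals are univariate normals (a standard illustrative example, not from the text):

```python
import math
import random

random.seed(4)

# Gibbs sampler for (x1, x2) ~ bivariate normal, means 0, variances 1, correlation rho.
# Full conditionals: x1 | x2 ~ N(rho*x2, 1-rho^2) and symmetrically for x2.
rho = 0.8
x1, x2 = 0.0, 0.0            # step 1: initial guesses
draws = []
for t in range(20000):
    x1 = random.gauss(rho * x2, math.sqrt(1 - rho ** 2))   # step 2: x1 | x2
    x2 = random.gauss(rho * x1, math.sqrt(1 - rho ** 2))   # step 3: x2 | x1
    if t >= 1000:            # discard burn-in
        draws.append((x1, x2))

m1 = sum(d[0] for d in draws) / len(draws)
print(round(m1, 2))          # near the true marginal mean 0
```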

Metropolis-Hastings algorithm

More flexible than Gibbs - rather than sampling from conditionals, MH permits many proposal distributions.

  1. Given draw `theta_t` in iteration `t`, sample a candidate draw `theta^**` from a proposal distribution `J(theta^** | theta_t)`
  2. Accept the draw with probability `min(r, 1)`, where `r = (p(theta^** | y) / J(theta^** | theta_t))/(p(theta_t | y) / J(theta_t | theta^**))`

For `p(theta | y)` to be the desired stationary distribution, the chain needs to be reversible (in detailed balance): `p(theta^(t) | y) J(theta^((t+1)) | theta^(t)) = p(theta^((t+1)) | y) J(theta^(t) | theta^((t+1)))`. Reversibility is a sufficient condition, and since balance is typically not achieved by `J` alone, we accept the move from `theta^(t)` to `theta^**` with probability `min(r, 1)`. When the proposal is symmetric this reduces to `r = (p(theta^** | y))/(p(theta^(t) | y))` - this is the Metropolis algorithm.

Markov chain will converge as long as:

`J` affects rate of convergence - optimal `J` is `p(theta | y)` in which case `r=1`. Otherwise choose `J` so that it is easy to compute, easy to sample from, and it leads to rapid convergence and mixing. There are three main approaches:

Independence sampler

Proposal distribution does not depend on `theta_t`. Just find a distribution `g(theta)` and generate values from it. Works very well when `g` is a good approximation to the posterior with heavier tails. With large samples the proposal could be normal, centred at the mode, with variance larger than the inverse Fisher information matrix evaluated at the mode.

Random walk MH

Easy to use. Proposal is often normal, centred at the current draw with variance `V`. `V` is difficult to choose - too small and it takes too long to move through the parameter space; too large and many draws are rejected because jumps land in the extremes. Optimal acceptance rates are 25-50%, decreasing with dimensionality.
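A random-walk Metropolis sketch targeting a standard normal (the proposal scale 2.4 is an assumption based on the usual 1-D tuning heuristic; the symmetric proposal means `r` is just the density ratio):

```python
import math
import random

random.seed(5)

# Random-walk Metropolis for a N(0, 1) target, working on the log scale.
def log_p(theta):
    return -0.5 * theta ** 2   # unnormalised log density of N(0, 1)

theta = 0.0
accepted = 0
draws = []
for t in range(50000):
    prop = theta + random.gauss(0, 2.4)   # symmetric normal jump
    # accept with probability min(r, 1), r = p(prop)/p(theta)
    if math.log(random.random()) < log_p(prop) - log_p(theta):
        theta, accepted = prop, accepted + 1
    draws.append(theta)

print(round(accepted / len(draws), 2))    # acceptance rate, roughly in the optimal band
print(round(sum(draws) / len(draws), 2))  # near the target mean 0
```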

Approximation MH

Idea is to improve approximation to `J` as we know more about `theta`, eg. in random walk MH could also vary variance with draw. Proposals typically not symmetric.

Other considerations

Starting values: If chain is irreducible, choice of `theta^0` will not affect convergence. With multiple chains, can choose overdispersed starting values.

How many chains? Even after convergence, draws from the stationary distribution are not iid. May need to thin the chain by only taking every `k`th draw

Convergence: Can check for convergence by running multiple chains. The Gelman-Rubin statistic compares within- to between-chain sums of squares.

Hierarchical models

Combine variation (and covariates) from multiple levels (analogous to mixed models). Need to preserve exchangeability, usually in terms of conditional independence (given parameters). The priors on the parameters of the population distribution are called hyperpriors. Need to be very careful when using improper hyperpriors.

Often two posterior predictive distributions of interest:

Computation is harder than before because we have more parameters. Easiest when the population distribution is conjugate to the likelihood.

Steps for computation are usually:

Beta-binomial

`y_j ~ "Bin"(n_j, theta_j)`, `theta_j ~ "Beta"(alpha, beta)`. `p(alpha, beta)` non-informative (will need to check integrability of the posterior). `p(alpha, beta) propto 1` doesn't work; a uniform prior on `(alpha/(alpha + beta), (alpha + beta)^(-1/2))` does.

Normal hierarchical model

(equivalent to one-way random effects). Set up: `J` independent experiments, want to estimate mean `theta_j` from each. Sampling model: `y_ij | theta_j ~ N(theta_j, sigma^2)` (`sigma^2` known), and `bar y_j | theta_j ~ N(theta_j, sigma_j^2)`. This sampling model is quite general because of CLTs. What type of posterior estimates for the `theta_j`'s might be reasonable? `hat theta_j = bar y_j` if `n_j` is large, `hat theta_j = bar y` is reasonable if we believe all means are the same, or the more general estimator `hat theta_j = lambda_j bar y_j + (1 - lambda_j) bar y`. `(1 - lambda_j)` is called a shrinkage factor.

Joint prior can be written as `p(mu, tau) = p(mu | tau) p(tau)`. Will consider a conditionally flat prior so that `p(mu, tau) propto p(tau)`

For `p(mu | tau) propto 1`, `p(theta_j | mu, tau, y) = N(hat theta_j, V_j)` where `hat theta_j = (sigma_j^2 mu + tau^2 bar y_j)/(sigma_j^2 + tau^2)`, `1/(V_j) = 1/(sigma_j^2) + 1/(tau^2)`
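The shrinkage formula as code (all inputs below are assumed, illustrative values):

```python
# Conditional posterior mean of theta_j in the normal hierarchical model:
# a precision-weighted average that pools bar y_j toward mu.

def theta_hat(ybar_j, sigma2_j, mu, tau2):
    return (sigma2_j * mu + tau2 * ybar_j) / (sigma2_j + tau2)

def V_j(sigma2_j, tau2):
    return 1 / (1 / sigma2_j + 1 / tau2)

print(theta_hat(10.0, 4.0, 6.0, 4.0))   # equal variances -> halfway: 8.0
print(theta_hat(10.0, 4.0, 6.0, 1e9))   # tau^2 large -> essentially ybar_j
```

With `sigma_j^2 = tau^2` the estimate sits halfway between `bar y_j` and `mu`; as `tau^2 -> oo` the pooling disappears.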

`p(mu, tau | y) propto p(tau) product_j N(bar y_j; mu, sigma_j^2 + tau^2)`. Now `p(tau | y) propto p(tau) V_mu^(1/2) product_j N(bar y_j; hat mu, sigma_j^2 + tau^2)`. What should we use for `p(tau)`? A conjugate is always a safe choice, so `tau^2 ~ "Inv-chi"^2(nu_0, tau_0^2)`