Single parameter models.
Conjugate priors
`F` a class of sampling distributions, `P` a class of prior distributions. Then `P` is conjugate for `F` if `p(theta) in P` and `p(y|theta) in F` implies `p(theta | y) in P`. Exponential family sampling distributions have natural conjugate priors: `p(y | theta) prop g(theta)^eta exp(phi(theta)^T t(y))`, `p(theta) prop g(theta)^n exp(phi(theta)^T nu)`, then posterior also in exponential form `p(theta | y) prop g(theta)^(n+eta) exp(phi(theta)^T (t(y) + nu))`.
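As a concrete instance of conjugate updating, here is a sketch using the Beta-binomial pair (the counts are invented for illustration):

```python
# Beta prior is conjugate for the binomial likelihood:
# theta ~ Beta(a, b), y | theta ~ Bin(n, theta)  =>  theta | y ~ Beta(a + y, b + n - y)
def beta_binomial_update(a, b, y, n):
    """Posterior Beta parameters after observing y successes in n trials."""
    return a + y, b + n - y

# uniform Beta(1, 1) prior, 7 successes in 10 trials (toy numbers)
a_post, b_post = beta_binomial_update(1, 1, y=7, n=10)
post_mean = a_post / (a_post + b_post)  # shrinks the MLE 0.7 toward the prior mean 0.5
```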
Table of standard conjugate priors to go here:
Non-informative priors
Minimal effect on posterior - lets the data speak for themselves. Conjugate priors can be almost non-informative.
Uniform prior is natural non-informative prior for location parameters
- it is improper `int p(theta) dtheta = int dtheta = oo`
- but posterior may still be proper
- not invariant to one-to-one transformations
If density of `y` is of form `p(y - theta | theta)` then it is a location density with location parameter `theta`
- consider transformation `u = y+c` and `eta = theta + c`, if `p(y|theta)` is location density, then so is `p(u | eta)`
- let `(y, theta)` and `(u, eta)` have priors `pi` and `pi^**`; identical structure, so they should have the same non-informative distribution
- we want `P_pi(theta in A) = P_(pi^**)(theta in A)`, ie `P_(pi^**)(A) = P_pi(A - c) AA c in RR`
- so `int_A pi(theta) dtheta = int_(A-c) pi(theta) dtheta = int_A pi(theta - c) dtheta`, ie `pi(theta) = pi(theta -c)`, a constant function
A scale density is of the form `p(y | sigma) prop sigma^(-1) p(y/sigma)`. Arguing similarly to above, the prior must have form `pi(sigma) prop sigma^(-1)`.
Jeffreys prior
Non-informative, and invariant to one-to-one transformations. `p(theta) prop [I(theta)]^(1/2)` where `I(theta) = -E((del^2)/(del theta^2) log p(y|theta))` is the Fisher information. Jeffreys priors are locally uniform (ie. uniformly distributed in every small region) and therefore non-informative. However, sometimes a Jeffreys prior violates the likelihood principle.
Multiparameter models.
Classical approach: maximize a joint likelihood, proceed in steps. Bayesian approach: base inference on marginal posterior distributions of interest, eg `p(theta_1 | y) = int p(theta_1, theta_2 | y) dtheta_2 = int p(theta_1 | theta_2, y) p(theta_2 | y) dtheta_2`
Integral not usually calculated directly: use a two-step simulation process - draw `theta_2` from `p(theta_2 | y)`, then draw `theta_1` from `p(theta_1 | theta_2, y)`.
Normal model
Sampling distribution: `y_i ~ N(mu, sigma^2)`
Non-informative prior `p(mu, sigma^2) prop sigma^(-2)`
- `p(mu, sigma^2 | y) prop sigma^(-n-2) exp(-1/(2sigma^2) sum(y_i - mu)^2)`.
- posterior predictive: `p(tilde y | y) = int int p(tilde y | mu, sigma^2) p(mu, sigma^2 | y) dmu dsigma^2 = t_(n-1)(bar y, (1 + 1/n)^(1/2) s)`
Conjugate prior: `mu | sigma^2 ~ N(mu_0, sigma^2/kappa_0)`, `sigma^2 ~ Inv-chi^2(nu_0, sigma^2_0)`:
- `mu` and `sigma^2` not independent, a priori
- `p(mu, sigma^2 | y) prop N-Inv-chi^2(mu_n, sigma^2_n/kappa_n; nu_n, sigma^2_n)`
- `mu_n = kappa_0/(kappa_0 + n) mu_0 + n/(kappa_0 + n) bar y`
- `kappa_n = kappa_0 + n`
- `nu_n = nu_0 + n`
- `nu_n sigma^2_n = nu_0 sigma^2_0 + (n-1)s^2 + (kappa_0 n)/(kappa_0 +n)(bar y -mu_0)^2`
Under the non-informative prior, the resulting posteriors are:
- `p(mu | sigma^2, y) = N(bar y, sigma^2/n)`
- `p(sigma^2 | y) prop (sigma^2)^(-(n+1)/2) exp(-(n-1)s^2/(2sigma^2)) prop Inv-chi^2(n-1, s^2)`
- `p(mu | y ) prop [1 + (n(mu-bar y)^2)/((n-1)s^2)]^(-n/2) prop t_(n-1)(bar y, s^2/n)`
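The two-step simulation from this posterior (draw `sigma^2` from its marginal, then `mu` given `sigma^2`) can be sketched with the standard library; the data here are invented for illustration:

```python
import random, math, statistics

random.seed(1)
y = [2.1, 1.7, 3.0, 2.4, 1.9, 2.6]        # toy data (assumed for illustration)
n, ybar = len(y), statistics.fmean(y)
s2 = statistics.variance(y)               # sample variance s^2

def draw_posterior():
    # step 1: sigma^2 | y ~ Inv-chi^2(n-1, s^2), i.e. (n-1) s^2 / chi^2_(n-1)
    chi2 = random.gammavariate((n - 1) / 2, 2)   # a chi^2_(n-1) draw
    sigma2 = (n - 1) * s2 / chi2
    # step 2: mu | sigma^2, y ~ N(ybar, sigma^2 / n)
    mu = random.gauss(ybar, math.sqrt(sigma2 / n))
    return mu, sigma2

draws = [draw_posterior() for _ in range(5000)]
mu_mean = statistics.fmean(m for m, _ in draws)  # should be close to ybar
```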
Semi-conjugate prior: `mu ~ N(mu_0, tau^2_0)`, `sigma^2 ~ Inv-chi^2(nu_0, sigma^2_0)`. The posterior doesn't end up in a nice closed form.
Multinomial model
Generalisation of binomial model.
Sampling distribution: `p(y | theta) prop prod_(j=1)^k theta_j^(y_j)`. Conjugate prior: Dirichlet, `p(theta | alpha) prop prod_(j=1)^k theta_j^(alpha_j - 1)`. Posterior: Dirichlet with `alpha_j' = alpha_j + y_j`
Can also be represented as the product of `k` independent Poisson rvs `y_j ~ Poi(lambda_j)` with restriction `sum_j y_j = n`, then `theta_j = lambda_j / sum(lambda_i)`
Multivariate normal
Useful distributions and equivalences
- Dirichlet-Gamma: draw `x_1, ..., x_k` with `x_j ~ "Gamma"(alpha_j, delta)` for any common scale `delta`; set `theta_j = x_j / sum_i x_i`, then `(theta_1, ..., theta_k) ~ "Dirichlet"(alpha)`
- `Inv-chi^2`-`chi^2`: if `X ~ chi^2(nu)` then `nu s^2 / X ~ Inv-chi^2(nu, s^2)`
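The Dirichlet-Gamma equivalence in code (a minimal sketch; the parameterisation assumed here is shape `alpha_j` with a common scale):

```python
import random

def dirichlet_draw(alpha, delta=1.0):
    # draw x_j ~ Gamma(shape alpha_j, common scale delta), then normalise:
    # theta_j = x_j / sum_i x_i  ~  Dirichlet(alpha)
    x = [random.gammavariate(a, delta) for a in alpha]
    total = sum(x)
    return [xi / total for xi in x]

random.seed(0)
theta = dirichlet_draw([2.0, 3.0, 5.0])
```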
Advanced computation.
General strategies
Work in log-posterior scale to avoid under/overflow.
Factor posterior distribution where possible
Reparameterisation sometimes helps
Simulation from posterior
Three basic methods:
- inverse cdf
- rejection sampling
- posterior mode approximation
Rejection sampling
Requirements:
- Target density: `f(x)`
- Instrumental density: `g(x)`
- Constant `M` such that `f(x) <= M g(x)`
Method:
- generate `X ~ g` and `U ~ U[0,1]`
- set `Y = X` if `U <= (f(X)) / (Mg(X))`, reject otherwise
Finding `M` can be a problem, but in a Bayesian analysis we can use the following trick. Recall that the unnormalised posterior is `q(theta | y) = p(theta) p(y | theta)`. Let `g(theta) = p(theta)`, draw a value `theta^**` from the prior and `U` from `U[0,1]`, and let `M = p(y | hat theta)` where `hat theta` is the mode of the likelihood `p(y | theta)`. Accept `theta^**` if `u <= q(theta^** | y)/(M p(theta^**)) = p(y | theta^**)/p(y | hat theta)`
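The likelihood-ratio form of the acceptance step can be sketched as follows, for a binomial likelihood with a uniform prior (toy data; the exact posterior here is Beta(8, 4)):

```python
import random

# target posterior prop to p(theta) p(y | theta); with g = prior, the bound is
# M g(theta) with M = p(y | theta_hat), theta_hat the likelihood mode, so the
# acceptance ratio is the likelihood ratio p(y | theta*) / p(y | theta_hat)
y_obs, n = 7, 10
def lik(theta):                        # binomial likelihood (constants dropped)
    return theta ** y_obs * (1 - theta) ** (n - y_obs)

theta_hat = y_obs / n                  # likelihood mode (MLE)
lik_max = lik(theta_hat)

def rejection_draw():
    while True:
        theta = random.random()        # theta* ~ prior (uniform here)
        if random.random() <= lik(theta) / lik_max:
            return theta

random.seed(2)
draws = [rejection_draw() for _ in range(2000)]
post_mean = sum(draws) / len(draws)    # exact posterior Beta(8, 4) has mean 2/3
```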
Importance sampling
Aka sampling importance resampling (SIR). Suppose we have draws `(theta_1, ..., theta_n)` from a distribution `g(theta)`; can we "convert" them to a sample from `f(theta | y)`? For each `theta_i` compute `phi_i = f(theta_i | y) / g(theta_i)` and `w_i = phi_i / sum_j phi_j`, then draw `theta^**` from `(theta_1, ..., theta_n)` with weights `w_i`, without replacement.
Most often used to improve approximation to prior.
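A sketch of the weighting-and-resampling step. The target here is an unnormalised Beta(3, 2) density and the proposal is uniform, both assumed for illustration; note `random.choices` resamples with replacement, whereas the notes resample without:

```python
import random

# draws from the proposal g (uniform on [0,1] here); the target f is an
# unnormalised Beta(3, 2) density, f(theta) prop theta^2 (1 - theta)
random.seed(3)
m = 5000
theta = [random.random() for _ in range(m)]
phi = [t * t * (1 - t) for t in theta]        # f(theta_i) / g(theta_i), g = 1
total = sum(phi)
w = [p / total for p in phi]                  # normalised importance weights

# resample a smaller set with probability w_i
resampled = random.choices(theta, weights=w, k=1000)
approx_mean = sum(resampled) / len(resampled)  # Beta(3, 2) has mean 0.6
```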
Posterior mode approximation
Useful in themselves and also as good starting points for more sophisticated methods. Before embarking on complicated simulation, it's a good idea to explore the target distribution, developing crude estimators etc. Computational strategy:
- obtain initial crude estimates
- find joint posterior mode or marginal posterior modes (better if possible)
- use normal expansions at the mode
Finding the mode is a (relatively) simple optimisation problem - modes are easier to find than means because no integration is needed, and unnormalised densities can be used. If the joint density can be factored, it is easier to find the joint mode by working along the marginal/conditional modes.
Three common methods are: conditional maximisation, Newton-Raphson and EM.
Newton-Raphson
Iterative algorithm based on a second-order Taylor approximation of `log p(theta | y)` - the linear term disappears because the gradient is zero at the mode. The constant term doesn't affect the density, so the second-order term leads to the Normal approximation `N(hat theta, I(hat theta)^(-1))`. If there are multiple modes, we will need to use a finite mixture model.
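A one-dimensional Newton-Raphson sketch; the Beta(8, 4) log posterior is an assumed example, with analytic mode 7/10:

```python
import math

# log posterior of a Beta(8, 4) (an assumed example), its gradient and Hessian
def lp(t):   return 7 * math.log(t) + 3 * math.log(1 - t)
def grad(t): return 7 / t - 3 / (1 - t)
def hess(t): return -7 / t ** 2 - 3 / (1 - t) ** 2

t = 0.5                        # crude starting estimate
for _ in range(50):            # Newton-Raphson iterations
    step = grad(t) / hess(t)
    t -= step
    if abs(step) < 1e-12:
        break

mode = t                       # analytic mode is 7/10
approx_var = -1 / hess(mode)   # normal approximation N(mode, approx_var)
```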
EM
If we want to maximise `p(gamma, phi | y) = p(gamma | phi, y) p(phi | y)`, we can treat `gamma` as missing data: alternate between averaging over `gamma` (E-step) and maximising over `phi` (M-step).
Markov Chain Monte Carlo
Based on stochastic process theory; useful for approximating target posterior distributions, even in high-dimensional examples. Three methods:
- Metropolis-Hastings
- Metropolis
- Gibbs sampler
(M & GS both special cases of MH)
Definitions:
- irreducible if every state can be reached from every other state
- periodic with period `d` if `p^n(j|i)=0` unless `n=kd` `k in NN`. If `d=1` the chain is aperiodic
- if a chain is irreducible and aperiodic, then it is ergodic: `bar a_n = sum_t a(X_t)/n -> E{a(X)}` (with geometric rate of convergence and standard error `sqrt(sigma_a^2/n (1 + 2 sum_i rho_i(a)))`, where `rho_i(a)` is the lag-`i` autocorrelation)
Finite-state, irreducible, aperiodic Markov chains have a limiting distribution `lim_(n->oo) p^n(j|i) = pi`. We want to use this in reverse - we have a target distribution, can we generate a Markov chain with it as its limiting distribution?
The Gibbs sampler
Useful when simulating from the conditionals is easy.
- Start with vector of initial guesses
- Draw first parameter from conditional with all others fixed
- Draw second parameter ...
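The steps above, sketched for the textbook bivariate normal example with correlation `rho`, where each full conditional is `N(rho * "other", 1 - rho^2)`:

```python
import random, math

# Gibbs sampler for a bivariate normal with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho^2) and symmetrically for x2 | x1
rho = 0.8
sd = math.sqrt(1 - rho ** 2)

random.seed(4)
x1, x2 = 0.0, 0.0                     # vector of initial guesses
draws = []
for _ in range(10000):
    x1 = random.gauss(rho * x2, sd)   # draw first component given the second
    x2 = random.gauss(rho * x1, sd)   # draw second component given the first
    draws.append((x1, x2))

burned = draws[1000:]                 # discard burn-in
mean1 = sum(a for a, _ in burned) / len(burned)   # marginal mean is 0
```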
Metropolis-Hastings algorithm
More flexible than Gibbs - rather than sampling from conditionals, MH permits many proposal distributions.
- Given draw `theta_t` in iteration `t`, sample a candidate draw `theta^**` from a proposal distribution `J(theta^** | theta_t)`
- Accept the draw with probability `min(r, 1)`, where `r = (p(theta^** | y) / J(theta^** | theta_t))/(p(theta_t | y) / J(theta_t | theta^**))`
For `p(theta | y)` to be the desired stationary distribution, the chain needs to be reversible (in detailed balance): `p(theta^(t) | y) J(theta^((t+1)) | theta^(t)) = p(theta^((t+1)) | y) J(theta^(t) | theta^((t+1)))`. Reversibility is a sufficient condition, and since balance is typically not achieved by `J` alone, we accept the move from `theta^(t)` to `theta^**` with probability `r`; when the proposal is symmetric this reduces to `r = p(theta^** | y)/(p(theta^(t) | y))` - this is the Metropolis algorithm.
Markov chain will converge as long as:
- chain has nonzero probability of reaching any point in parameter space
- chain does not shoot off to infinity
`J` affects rate of convergence - optimal `J` is `p(theta | y)` in which case `r=1`. Otherwise choose `J` so that it is easy to compute, easy to sample from, and it leads to rapid convergence and mixing. There are three main approaches:
- independence sampler
- random walk MH (most popular)
- approximation MH
Independence sampler
Proposal distribution does not depend on `theta_t`. Just find a distribution `g(theta)` and generate values from it. Works very well when `g` is a good approximation to the posterior with heavier tails. With large samples, the proposal could be a normal centred at the posterior mode, with variance larger than the inverse Fisher information matrix evaluated at the mode.
Random walk MH
Easy to use. Proposal is often a normal centred at the current draw, `J(theta^** | theta_t) = N(theta^**; theta_t, V)`. Difficult to choose the scale `V` - too small: takes too long to move through the parameter space; too large: many draws rejected because of jumps to extremes. Optimal acceptance rates are 25-50%, decreasing as dimensionality increases.
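A random walk Metropolis sketch targeting a standard normal (an assumed toy target; with a symmetric proposal the acceptance ratio reduces to the posterior ratio):

```python
import random, math

def log_post(t):                      # assumed target: standard normal log density
    return -0.5 * t * t

random.seed(5)
theta, scale = 0.0, 2.4               # proposal scale tuned toward ~25-50% acceptance
draws, accepted = [], 0
for _ in range(20000):
    prop = random.gauss(theta, scale)   # symmetric proposal centred at current draw
    # symmetric J cancels, so the ratio is just the posterior ratio (Metropolis)
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta, accepted = prop, accepted + 1
    draws.append(theta)

accept_rate = accepted / len(draws)
post_mean = sum(draws[2000:]) / len(draws[2000:])   # target mean is 0
```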
Approximation MH
Idea is to improve approximation to `J` as we know more about `theta`, eg. in random walk MH could also vary variance with draw. Proposals typically not symmetric.
Other considerations
Starting values: If chain is irreducible, choice of `theta^0` will not affect convergence. With multiple chains, can choose overdispersed starting values.
How many chains? Even after convergence, draws from stationary distribution are not iid. May need to thin chain by only taking every `k`th draw
Convergence: Can check for convergence by running multiple chains. The Gelman-Rubin (GR) statistic compares within-chain to between-chain sums of squares.
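One common form of the GR computation, sketched from the description above (the implementation details here are assumptions, not from the notes; well-mixed chains give a value close to 1):

```python
import random, statistics

def gelman_rubin(chains):
    """Potential scale reduction factor from m chains of length n
    (one common form of the GR statistic; assumed implementation)."""
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    B = n * statistics.variance(means)                             # between-chain
    W = statistics.fmean(statistics.variance(c) for c in chains)   # within-chain
    var_plus = (n - 1) / n * W + B / n    # pooled posterior variance estimate
    return (var_plus / W) ** 0.5

# chains drawn from the same distribution have mixed, so R-hat should be near 1
random.seed(6)
chains = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
```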
Hierarchical models
Combine variation (and covariates) from multiple levels (analogous to mixed models). Need to preserve exchangeability, usually in terms of conditional independence (given parameters). Prior distributions for the parameters of the priors are called hyperpriors. Need to be very careful when using improper hyperpriors.
Often two posterior predictive distributions of interest:
- distribution of future observations from this experiment
- distribution of future observations from all possible experiments (ie. hyperparameters varied as well)
Computation is harder than before because we have more parameters. Easiest when population distribution is conjugate to likelihood. Usual steps:
- write `p(phi, theta | y) propto p(phi) p(theta | phi) p(y | theta)`
- determine `p(theta | phi, y)` analytically – easy for conjugate models
- derive marginal `p(phi | y)` by integrating out `theta`, or `p(phi, theta|y) / p(theta|y, phi)` (but need to be careful because normalising “constant” might not be constant)
Steps for computation are usually:
- draw `phi` from `p(phi | y)`
- draw `theta` from `p(theta | phi, y)` – usually factors into a product of independents
- draw `tilde y` from appropriate posterior distribution
Beta-binomial
`y_j ~ "Bin"(n_j, theta_j)`, `theta_j ~ "Beta"(alpha, beta)`, `p(alpha, beta)` non-informative (will need to check integrability of the posterior). `p(alpha, beta) propto 1` doesn't work, but `p(alpha/(alpha + beta), (alpha + beta)^(-1/2)) propto 1` does.
Normal hierarchical model
(equivalent to one-way random effects). Set up: `J` independent experiments, want to estimate mean `theta_j` from each; sampling model: `y_ij | theta_j ~ N(theta_j, sigma^2)` (`sigma^2` known), so `bar y_j | theta_j ~ N(theta_j, sigma_j^2)`. This sampling model is quite general because of CLTs. What type of posterior estimates for the `theta_j` might be reasonable? `hat theta_j = bar y_j` if `n_j` large, `hat theta_j = bar y` reasonable if we believe all means are the same, or the more general estimator `hat theta_j = lambda_j bar y_j + (1 - lambda_j) bar y`, where `(1 - lambda_j)` is called a shrinkage factor.
- `bar y_j | theta_j ~ N(theta_j, sigma_j^2)`
- `theta_1, ..., theta_J | mu, tau ~ prod_j N(mu, tau^2)`
- so `p(theta_1, ..., theta_J) = int prod_j N(theta_j; mu, tau^2) p(mu, tau^2) dmu dtau^2`
Joint prior can be written as `p(mu, tau) = p(mu | tau) p(tau)`. We will consider a conditionally flat prior, so that `p(mu, tau) propto p(tau)`
For `p(mu | tau) propto 1`, `p(theta_j | mu, tau, y) = N(hat theta_j, V_j)` where `hat theta_j = (sigma_j^2 mu + tau^2 bar y_j)/(sigma_j^2 + tau^2)`, `1/V_j = 1/sigma_j^2 + 1/tau^2`
`p(mu, tau | y) propto p(tau) prod_j N(bar y_j; mu, sigma_j^2 + tau^2)`. Now `p(tau | y) propto p(tau) V_mu^(1/2) prod_j N(bar y_j; hat mu, sigma_j^2 + tau^2)`. What should we use for `p(tau)`? A conjugate is always a safe choice, so take `tau^2 ~ Inv-chi^2(nu_0, tau_0^2)`
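The conditional draw of the `theta_j` given one `(mu, tau)` value can be sketched as follows (the group means, standard errors, and hyperparameter values are invented for illustration):

```python
import random, math

# toy group means, known standard errors, and one hyperparameter draw (all invented)
ybar = [28.0, 8.0, -3.0, 7.0]
sigma = [15.0, 10.0, 16.0, 11.0]
mu, tau = 8.0, 6.0

# precision-weighted conditional posterior means hat theta_j: each ybar_j
# is shrunk toward the population mean mu
shrunk = [(y / s ** 2 + mu / tau ** 2) / (1 / s ** 2 + 1 / tau ** 2)
          for y, s in zip(ybar, sigma)]

def draw_theta(y_j, s_j):
    # theta_j | mu, tau, y ~ N(hat theta_j, V_j) with 1/V_j = 1/sigma_j^2 + 1/tau^2
    prec = 1 / s_j ** 2 + 1 / tau ** 2
    theta_hat = (y_j / s_j ** 2 + mu / tau ** 2) / prec
    return random.gauss(theta_hat, math.sqrt(1 / prec))

random.seed(7)
thetas = [draw_theta(y, s) for y, s in zip(ybar, sigma)]
```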