Fundamental paradigms and principles of inference
- Frequentist/classical: widely used, built around the idea of repeated experiments
- Fisherian: uncommon, bases inference on likelihood alone
- Bayesian: growing in popularity, bases inference on likelihood and prior probabilities, views probabilities as expressions of belief
Two principles are widely considered to be fundamental: sufficiency and conditionality. Intuitively, if we partition the data $(x, y)$ into the marginal piece $y$ and the conditional piece $x \mid y$, sufficiency says that $x \mid y$ carries no information about $\theta$ (when $y$ is sufficient, so we may base inference on $y$ alone), and conditionality says that $y$ carries no information about $\theta$ (when $y$ is ancillary, so we should condition on it).
Let $\mathbf{X} = (X_1, X_2, \cdots, X_n)$ be a random vector of observations, with joint pdf $f_n(\mathbf{x}; \mathbf{\theta}) \equiv f_n(x_1, \cdots, x_n;\mathbf{\theta})$ , where $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{\theta} \in \Theta$ . Given observations $\mathbf{x} = (x_1, \cdots, x_n)$ , any $\mathbf{\hat{\theta}}$ that maximises $L(\mathbf{\theta}) \equiv L(\mathbf{\theta}; \mathbf{x}) = f_n(\mathbf{x}; \mathbf{\theta})$ over $\Theta$ is called a maximum likelihood estimate (MLE) of the unknown true parameter.
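A standard worked example (not taken from these notes): for $X_1, \dots, X_n$ iid Bernoulli($\theta$), the MLE follows in closed form by setting the derivative of the log-likelihood to zero.

```latex
\[
L(\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}
          = \theta^{\sum_i x_i}(1-\theta)^{n-\sum_i x_i},
\qquad
\frac{d \log L(\theta)}{d\theta}
  = \frac{\sum_i x_i}{\theta} - \frac{n-\sum_i x_i}{1-\theta} = 0
\;\Longrightarrow\; \hat{\theta} = \bar{x}.
\]
```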
Sufficiency
$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ if the distribution of $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\mathbf{\theta}$ . This means we can partition the data into the pieces $T(\mathbf{X})$ and $\mathbf{X} \mid T(\mathbf{X})$ . The latter contains no information about $\theta$ , so we should base inference solely on $T(\mathbf{X})$ . You can establish that a statistic is sufficient by calculating the conditional distribution directly, or by using the factorisation theorem.
$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ iff we can factorise the joint density into two functions, one involving only $T(\mathbf{x})$ and $\mathbf{\theta}$ , the other only $\mathbf{x}$ , ie. $f_n(\mathbf{x}; \mathbf{\theta}) = g(T(\mathbf{x}); \mathbf{\theta})\, h(\mathbf{x})$ . This shows that maximum likelihood estimators are functions of sufficient statistics, because the maximisation of $L(\mathbf{\theta}; \mathbf{x})$ wrt $\mathbf{\theta}$ depends on $\mathbf{x}$ only through the sufficient statistic.
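A standard illustration of the factorisation theorem (not from these notes): for $X_1, \dots, X_n$ iid $N(\theta, 1)$, using $\sum_i (x_i - \theta)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2$,

```latex
\[
f_n(\mathbf{x};\theta)
  = (2\pi)^{-n/2} \exp\Big(-\tfrac{1}{2}\sum_i (x_i-\theta)^2\Big)
  = \underbrace{\exp\Big(-\tfrac{n}{2}(\bar{x}-\theta)^2\Big)}_{g(T(\mathbf{x});\,\theta)}
    \underbrace{(2\pi)^{-n/2}\exp\Big(-\tfrac{1}{2}\sum_i (x_i-\bar{x})^2\Big)}_{h(\mathbf{x})},
\]
```

so $T(\mathbf{X}) = \bar{X}$ is sufficient for $\theta$.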
Conditioning
Let $E_1$ and $E_2$ be two experiments with the same parameter space $\Theta$ and the same unknown parameter, with densities $f_1$ and $f_2$ . Let $E$ be the mixture experiment that performs $E_1$ (yielding $(E_1, x_1)$ ) with probability $p$ and $E_2$ (yielding $(E_2, x_2)$ ) with probability $1-p$ , where $p$ does not depend on $\theta$ .
The conditionality principle: If we observe $(E_i, x_i)$ , then the information from $(E, (E_i, x_i))$ is the same as that from $(E_i, x_i)$ alone. The idea is to condition on (ie. treat as fixed) random variables that contain no information about $\theta$ – very important for frequentists, as it defines what “repetition of the experiment” means
The likelihood principle
The likelihood principle: If $x_1$ observed from $E_1$ and $x_2$ observed from $E_2$ have the same likelihood functions (to within a constant) then the “information content” wrt inference about $\theta$ is the same.
The likelihood principle implies both the sufficiency and conditionality principles. However, it is not particularly palatable to frequentists because it implies that information content does not depend on the notion of sample spaces or on the repeatability of the experiment.
We will perform inference based on likelihoods because for almost all practical problems it (or some modification of it) works well. Later we will see that maximum likelihood leads to estimators with good asymptotic performance.
Properties of maximum likelihood estimation
Nice things:
- intuitive
- very widely applicable, can combine data from multiple experiments
- unaffected by monotonic transformations of the data
- MLE of a function of the parameters is that function of the MLE
- theory provides large sample properties
- asymptotically efficient estimators
- provides general methods of inference
Not-so-nice things:
- may be slightly biased
- parametric model required and must adequately describe statistical process generating data
- can be computationally demanding
- fails in some cases, eg. if there are too many nuisance parameters
It is usually easier to maximise the log-likelihood $l(\mathbf{\theta}) = \log L(\mathbf{\theta})$ , which has the same maximiser.
ML estimation doesn’t depend on the parameterisation of the model. If $g$ is a 1:1 function, then the MLE of $g(\theta)$ is $g(\hat{\theta})$ , and more generally we will define $g(\hat{\theta})$ to be the MLE of $g(\theta)$ . This means we can use the most convenient parameterisation (although some parameterisations may have better properties than others)
ML estimation is also invariant to (one-to-one, differentiable) transformations of the observations. If $Y = t(X)$ is such a transformation, then $f_Y(y;\theta) = |dx/dy|\, f_X(x;\theta)$ , where the Jacobian $|dx/dy|$ does not depend on $\theta$ , so the likelihood is only rescaled by a factor not involving $\theta$ and the MLE is unchanged.
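A minimal numerical sketch (the exponential model, sample size, seed, and use of scipy are illustrative assumptions, not part of the notes) of maximising a log-likelihood and of the invariance of the MLE under reparameterisation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch: numerical MLE for the rate of an exponential sample
# by maximising the log-likelihood. The closed form 1/mean(x) is a sanity check.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)      # true rate theta = 0.5

def neg_log_lik(theta):
    # l(theta) = n*log(theta) - theta*sum(x) for iid Exponential(rate theta)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
theta_hat = res.x
print(theta_hat, 1 / x.mean())                # numerical MLE vs closed form

# Invariance: the MLE of g(theta) = 1/theta (the mean) is g(theta_hat).
print(1 / theta_hat, x.mean())
```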
Cramer-Rao inequality
Motivation: we will eventually show that ML estimators converge in distribution at a $\sqrt{n}$ rate, subject to very general regularity conditions. That is, if the data are iid, $\sqrt{n}(\hat{\theta}_n - \theta) \rightarrow_D N_s(0, V(\theta))$ . So for large n, the MLE is approximately $\sim N(\theta, V(\theta)/n)$ .
Cramer-Rao inequality for $\theta \in \mathbb{R}$
Let:
- $X = (X_1, ..., X_n)$ be a sample from a distribution with joint pdf $f_n(x;\theta)$ .
- $\delta(X)$ be any unbiased estimator of $\theta$ .
If:
- $Var_\theta(\delta) \lt \infty$ , and
- $\int \delta(x) f_n(x;\theta)\, dx$ and $\int f_n(x;\theta)\, dx$ can be differentiated wrt $\theta$ under the integral sign
Then:
- $Var_\theta(\delta) \ge 1 / I(\theta)$
- where $I(\theta) = E_\theta \left[ \left( \frac{d\, l(\theta;X)}{d\theta} \right)^2 \right]$ and $l(\theta;X) = \log f_n(X;\theta)$ is the log-likelihood
- $1 / I(\theta)$ is the Cramer-Rao lower bound
- $I(\theta)$ is the (expected) Fisher information that $X$ contains about $\theta$ . It quantifies the amount of info the random vector $X$ provides about $\theta$ . Large $I(\theta)$ is good.
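A standard worked example (not from these notes): for $X_1, \dots, X_n$ iid Poisson($\theta$),

```latex
\[
l(\theta; \mathbf{x}) = \Big(\sum_i x_i\Big)\log\theta - n\theta - \sum_i \log x_i!,
\qquad
\frac{d\, l}{d\theta} = \frac{\sum_i X_i}{\theta} - n,
\qquad
I(\theta) = E_\theta\Big[\Big(\tfrac{\sum_i X_i - n\theta}{\theta}\Big)^2\Big]
          = \frac{n\theta}{\theta^2} = \frac{n}{\theta}.
\]
```

The Cramer-Rao lower bound is therefore $\theta/n$, and $\bar{X}$ is unbiased with $Var_\theta(\bar{X}) = \theta/n$, so the bound is attained here.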
Proof:
- want to show $Var_\theta (\delta )I(\theta ) \ge 1$
- $Var_\theta(\delta) I(\theta) = E\left[ (\delta(X) - \theta)^2 \right] E\left[ \left( \frac{d\, l(\theta;X)}{d\theta} \right)^2 \right] \ge E\left[ (\delta(X) - \theta) \frac{d\, l(\theta;X)}{d\theta} \right]^2$ , by the Cauchy-Schwarz inequality
- the right-hand side equals $\left( \int (\delta(x) - \theta) \frac{d \log f_n}{d\theta} f_n\, dx \right)^2 = \left( \int (\delta(x) - \theta) \frac{f'_n}{f_n} f_n\, dx \right)^2 = \left( \int (\delta(x) - \theta) f'_n\, dx \right)^2$
- $\int \theta f'_n\, dx = \theta \frac{d \int f_n\, dx}{d\theta} = \theta \frac{d 1}{d\theta} = 0$
- and $\int \delta(x) f'_n\, dx = \frac{d}{d\theta} \int \delta(x) f_n\, dx = \frac{d\theta}{d\theta} = 1$ , since $\delta$ is unbiased
- so the right-hand side is $(1 - 0)^2 = 1$ , giving $Var_\theta(\delta) I(\theta) \ge 1$ as required
The CR inequality does not imply the existence of an unbiased estimator that achieves the lower bound, or indeed of any unbiased estimator at all.
CR inequality for $g(\theta)$
If $\delta(X)$ is an unbiased estimator of $g(\theta)$ then $Var_\theta(\delta) \ge \frac{g'(\theta)^2}{I(\theta)}$ . Can be proved through a minor modification of the above proof, or, when $g$ is invertible, by reparameterising the likelihood function in terms of $\zeta = g(\theta)$ .
Alternative formulae
- $I(\theta) = Var_\theta [ \frac{d l(\theta;X)}{d \theta} ]$
- $I(\theta) = -E_\theta [ \frac{d^2 l(\theta;X)}{d \theta^2} ]$
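A quick check of both formulae on a single Bernoulli($\theta$) observation (a standard example, not from these notes):

```latex
\[
l(\theta; x) = x\log\theta + (1-x)\log(1-\theta),
\qquad
\frac{d\, l}{d\theta} = \frac{x-\theta}{\theta(1-\theta)},
\qquad
-\frac{d^2 l}{d\theta^2} = \frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}.
\]
```

Both routes give $I_1(\theta) = 1/(\theta(1-\theta))$: the variance of $dl/d\theta$ is $\theta(1-\theta)/(\theta(1-\theta))^2$, and the expectation of the second expression is $1/\theta + 1/(1-\theta)$.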
iid case
If the observations are independent, the likelihoods multiply and the log-likelihoods add; if they are also identically distributed, each observation contributes the same information, so $I_n(\theta) = n I_1(\theta)$ , where $I_1$ is the information in a single observation.
Multiparameter case
Results as for single parameter case, but in vector form:
- $Var_\theta(\delta) \ge g'(\theta)^T I^{-1}(\theta)\, g'(\theta)$ , where $g'(\theta)$ is the vector of partial derivatives of $g$
- $I_{ij}(\theta) = E_\theta \left[ \frac{\partial l}{\partial \theta_i} \frac{\partial l}{\partial \theta_j} \right]$
- $I_{ij}(\theta) = -E_\theta \left[ \frac{\partial^2 l}{\partial \theta_i \partial \theta_j} \right]$
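A standard worked example (not derived in these notes): for a single observation from $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, the cross terms vanish and

```latex
\[
I_1(\mu, \sigma^2) =
\begin{pmatrix}
1/\sigma^2 & 0 \\
0 & 1/(2\sigma^4)
\end{pmatrix},
\]
```

and by the iid result $I_n = n I_1$.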
Asymptotics
Test statistics ($\theta \in \mathbb{R}$ )
Likelihood ratio test
If appropriate assumptions hold and $\theta_0$ is the true parameter value, then $2[l(\hat{\theta}_n; X) - l(\theta_0; X)] \rightarrow_D \chi^2_1$ .
Proof:
- expand $l(\theta, X)$ about $\hat{\theta}_n$ and evaluate at true parameter value
- $l(\theta_0) = l(\hat{\theta}_n) + l'(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n) + \frac{1}{2} l''(\theta^*_n)(\theta_0 - \hat{\theta}_n)^2$ , for some $\theta^*_n$ between $\theta_0$ and $\hat{\theta}_n$
- since $l'(\hat{\theta}_n) = 0$ , this gives $2[l(\hat{\theta}_n) - l(\theta_0)] = -l''(\theta^*_n)(\theta_0 - \hat{\theta}_n)^2$
- $(\sqrt{n}(\theta_0 - \hat{\theta}_n))^2 \rightarrow_D N(0, 1 / I_1(\theta_0))^2 = \chi^2_1 / I_1(\theta_0)$
- $l''(\theta^*_n)/n \rightarrow_p -I_1(\theta_0)$ since $\theta^*_n \rightarrow_p \theta_0$
- combining via Slutsky’s theorem, $2[l(\hat{\theta}_n) - l(\theta_0)] = \left( -l''(\theta^*_n)/n \right) \cdot n(\theta_0 - \hat{\theta}_n)^2 \rightarrow_D I_1(\theta_0) \cdot \chi^2_1 / I_1(\theta_0) = \chi^2_1$
For sufficiently large n, an approximate size $\alpha$ hypothesis test of $H_0: \theta = \theta_0$ is given by: Reject $H_0$ if $2[l(\hat{\theta}_n) - l(\theta_0)] \gt \chi^2_{1,\alpha}$ , where $\chi^2_{1,\alpha}$ is the upper $\alpha$ quantile of $\chi^2_1$ .
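A minimal numerical sketch of this test (the Bernoulli model, sample size, seed, and significance level are illustrative assumptions, not part of the notes):

```python
import numpy as np
from scipy.stats import chi2

# Illustrative sketch: likelihood ratio test of H0: theta = 0.5 for iid
# Bernoulli(theta) data.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.6, size=100)

def log_lik(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_hat = x.mean()                          # MLE of the success probability
lrt = 2 * (log_lik(theta_hat) - log_lik(0.5))
crit = chi2.ppf(0.95, df=1)                   # upper 0.05 quantile of chi^2_1
print(lrt, crit, lrt > crit)                  # reject H0 if lrt > crit
```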
Wald test
We can replace $-l''(\theta^*_n)$ by other asymptotically equivalent quantities. The Wald test replaces it with $I(\hat{\theta}_n)$ (or sometimes with $-l''(\theta_0)$ or $I(\theta_0)$ ) to get $I(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n)^2 \rightarrow_D \chi^2_1$ . This can be viewed as the square of an approximately standardised, asymptotically normal random variable. In practice the LR test is used most often.
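For a Bernoulli($\theta$) sample (an illustration, not from the notes), $I_n(\hat{\theta}_n) = n / (\hat{\theta}_n(1-\hat{\theta}_n))$, so the Wald statistic is

```latex
\[
W = \frac{n(\hat{\theta}_n - \theta_0)^2}{\hat{\theta}_n(1 - \hat{\theta}_n)}
  \;\rightarrow_D\; \chi^2_1 \quad \text{under } H_0: \theta = \theta_0.
\]
```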
Confidence intervals
Invert the results of the hypothesis tests. Eg. an approximate $1 - \alpha$ confidence interval is the set of all values $\theta_0$ such that $l(\theta_0) \gt l(\hat{\theta}_n) - \chi^2_{1,\alpha} / 2$ , i.e. all values not rejected by the likelihood ratio test.
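A minimal numerical sketch (reusing the illustrative Bernoulli setup from the test sketch above; the grid and seed are arbitrary assumptions) of inverting the likelihood ratio test to get an approximate 95% confidence interval:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative sketch: approximate 95% confidence interval for a Bernoulli
# probability, obtained by inverting the likelihood ratio test over a grid.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.6, size=100)

def log_lik(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_hat = x.mean()
cutoff = log_lik(theta_hat) - chi2.ppf(0.95, df=1) / 2

grid = np.linspace(0.001, 0.999, 999)
inside = [t for t in grid if log_lik(t) > cutoff]
print(min(inside), max(inside))               # approximate interval endpoints
```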
Test statistics ($\theta \in \mathbb{R}^s$ )
Extending results to multiparameter case is reasonably straightforward, but very tedious. Results follow.
Asymptotic normality and efficiency of $\hat{\theta}_n \in \mathbb{R}^s$ : exactly the same as for the single-parameter case, but in vector form. The delta theorem extends similarly.
Simple hypothesis
Want to test if $\theta$ is a particular value. $H_0: \theta = \theta_0$ . Likelihood and Wald score tests extend in the obvious way.
Composite hypothesis
$H_0: \theta \in \Theta_0 \subset \Theta$ , where $\Theta_0$ is an $(s - r)$ -dimensional subset of $\Theta$ . We shall assume (reparameterising if necessary) that $\theta = (\psi, \lambda)^T$ , where $\psi \in \mathbb{R}^r$ , $\lambda \in \mathbb{R}^{s-r}$ , and $\Theta_0$ consists of all points in $\Theta$ for which $\psi$ equals some specified value $\psi_0$ .
Likelihood ratio: $2[l(\hat{\theta}_n) - l(\hat{\theta}_{0n})] \rightarrow_D \chi^2_r$ , where $\hat{\theta}_{0n}$ is the MLE over $\Theta_0$ .
Wald: $(\hat{\psi}_n - \psi_0)^T [[I^{-1}(\hat{\theta}_n)]_{\psi\psi}]^{-1}(\hat{\psi}_n - \psi_0) \rightarrow_D \chi^2_r$ , where $[I^{-1}(\hat{\theta}_n)]_{\psi\psi}$ is the upper-left $r \times r$ submatrix of $I^{-1}(\hat{\theta}_n)$ , whose inverse is called the Fisher information for $\psi$ in the presence of the nuisance parameter $\lambda$ .
Some probability theory and inequalities
Convergence in distribution
Let $X_1, X_2, ...,$ and $X$ be random variables with cdfs $F_1, F_2, ...$ and $F$ . If $F_n(x) \to F(x)$ at every point $x$ at which $F$ is continuous, then the sequence $X_n$ is said to converge in distribution to $X$ ($X_n \to_D X$ ).
Convergence in probability
Let $X_1, X_2, ...,$ and $X$ be random variables on the same probability space. If $P(|X_n - X| \gt \epsilon) \to 0$ as $n \to \infty$ for all $\epsilon \gt 0$ , then the sequence $X_n$ is said to converge in probability to $X$ ($X_n \to_p X$ ).
- $X_n \to_p X \implies X_n \to_D X$
- $X_n \to_D c \implies X_n \to_p c$ , for a constant $c$
- If $g$ is continuous and $X_n \to_p X$ , then $g(X_n) \to_p g(X)$ .
Slutsky’s theorem
If $X_n \to_D X$ , $A_n \to_p a$ and $B_n \to_p b$ , where $a$ and $b$ are constants,
then $A_n + B_n X_n \to_D a + bX$ .
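A standard application (not from these notes): if $\sqrt{n}(\hat{\theta}_n - \theta) \to_D N(0, v)$ and $\hat{v}_n \to_p v > 0$, then taking $A_n = 0$ and $B_n = 1/\sqrt{\hat{v}_n}$ (so $B_n \to_p 1/\sqrt{v}$ by continuity),

```latex
\[
\frac{\sqrt{n}(\hat{\theta}_n - \theta)}{\sqrt{\hat{v}_n}} \;\to_D\; N(0, 1),
\]
```

which is what justifies plugging estimated variances into test statistics.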
Delta theorem
Suppose $\sqrt{n}(X_n - b) \to_D X$ . If $g: \mathbb{R} \to \mathbb{R}$ is differentiable and $g'$ is continuous at $b$ , then $\sqrt{n}(g(X_n) - g(b)) \to_D g'(b)X$ .
The delta theorem ensures that a reparameterised MLE has the same asymptotic properties.
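A standard worked example (not from these notes): with $g(\theta) = \log\theta$ for $\theta > 0$, so $g'(\theta) = 1/\theta$,

```latex
\[
\sqrt{n}(\hat{\theta}_n - \theta) \to_D N(0, \sigma^2)
\;\Longrightarrow\;
\sqrt{n}\big(\log\hat{\theta}_n - \log\theta\big) \to_D N(0, \sigma^2/\theta^2).
\]
```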
Jensen’s inequality
Let $D$ be an interval in $\mathbb{R}$ . If $\phi: D \to \mathbb{R}$ is convex, then for any random variable $X$ on $D$ , $\phi(E[X]) \le E[\phi(X)]$ .
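For example, taking $\phi(x) = x^2$ (convex on $\mathbb{R}$) recovers a familiar fact:

```latex
\[
(E[X])^2 \le E[X^2], \qquad \text{i.e. } Var(X) = E[X^2] - (E[X])^2 \ge 0.
\]
```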
Cauchy-Schwarz inequality
For any two random variables $X$ and $Y$ such that $E(X^2) \lt \infty$ and $E(Y^2) \lt \infty$ , $(E[XY])^2 \le E[X^2]E[Y^2]$ .