Fundamental paradigms and principles of inference
- Frequentist/classical: widely used, built around the idea of repeated experiments
- Fisherian: uncommon, bases inference on likelihood alone
- Bayesian: growing in popularity, bases inference on likelihood and prior probabilities, views probabilities as expressions of belief
Two principles are widely considered to be fundamental: sufficiency and conditionality. Intuitively, if we partition the data $(x, y)$ into the marginal piece $y$ and the conditional piece $x \mid y$, sufficiency says that $x \mid y$ carries no information about $\theta$ (when $y$ is sufficient, so we may base inference on $y$ alone), and conditionality says that $y$ carries no information about $\theta$ (when $y$ is ancillary, so we should condition on it).
Let $\mathbf{X} = (X_1, X_2, \cdots, X_n)$ be a random vector of observations, with joint pdf $f_n(\mathbf{x}; \mathbf{\theta}) \equiv f_n(x_1, \cdots, x_n;\mathbf{\theta})$ , where $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{\theta} \in \Theta$ . Given observations $\mathbf{x} = (x_1, \cdots, x_n)$ , any $\mathbf{\hat{\theta}}$ that maximises $L(\mathbf{\theta}) \equiv L(\mathbf{\theta}; \mathbf{x}) = f_n(\mathbf{x}; \mathbf{\theta})$ over $\Theta$ is called a maximum likelihood estimate (MLE) of the unknown true parameter.
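A standard worked example (not taken from these notes): for $X_1, \dots, X_n$ iid Bernoulli($\theta$), the MLE follows in closed form by setting the derivative of the log-likelihood to zero.

```latex
\[
L(\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}
          = \theta^{\sum_i x_i}(1-\theta)^{n-\sum_i x_i},
\qquad
\frac{d \log L(\theta)}{d\theta}
  = \frac{\sum_i x_i}{\theta} - \frac{n-\sum_i x_i}{1-\theta} = 0
\;\Longrightarrow\; \hat{\theta} = \bar{x}.
\]
```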
Sufficiency
$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ if the distribution of $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\mathbf{\theta}$ . This means we can partition the data into the pieces $T(\mathbf{X})$ and $\mathbf{X} \mid T(\mathbf{X})$ . The latter contains no information about $\theta$ , so we should base inference solely on $T(\mathbf{X})$ . You can establish that a statistic is sufficient by calculating the conditional distribution directly, or by using the factorisation theorem.
$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ iff we can factorise the joint density into two functions, one involving only $T(\mathbf{x})$ and $\mathbf{\theta}$ , the other only $\mathbf{x}$ , ie. $f_n(\mathbf{x}; \mathbf{\theta}) = g(T(\mathbf{x}); \mathbf{\theta})\, h(\mathbf{x})$ . This shows that maximum likelihood estimators are functions of sufficient statistics, because the maximisation of $L(\mathbf{\theta}; \mathbf{x})$ wrt $\mathbf{\theta}$ depends on $\mathbf{x}$ only through the sufficient statistic.
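A standard illustration of the factorisation theorem (not from these notes): for $X_1, \dots, X_n$ iid $N(\theta, 1)$, using $\sum_i (x_i - \theta)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2$,

```latex
\[
f_n(\mathbf{x};\theta)
  = (2\pi)^{-n/2} \exp\Big(-\tfrac{1}{2}\sum_i (x_i-\theta)^2\Big)
  = \underbrace{\exp\Big(-\tfrac{n}{2}(\bar{x}-\theta)^2\Big)}_{g(T(\mathbf{x});\,\theta)}
    \underbrace{(2\pi)^{-n/2}\exp\Big(-\tfrac{1}{2}\sum_i (x_i-\bar{x})^2\Big)}_{h(\mathbf{x})},
\]
```

so $T(\mathbf{X}) = \bar{X}$ is sufficient for $\theta$.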
Conditioning
Let $E_1$ and $E_2$ be two experiments with the same parameter space $\Theta$ and the same unknown parameter, with densities $f_1$ and $f_2$ . Let $E$ be the mixture experiment that performs $E_1$ (yielding $(E_1, x_1)$ ) with probability $p$ and $E_2$ (yielding $(E_2, x_2)$ ) with probability $1-p$ , where $p$ does not depend on $\theta$ .
The conditionality principle: If we observe $(E_i, x_i)$ , then the information from $(E, (E_i, x_i))$ is the same as that from $(E_i, x_i)$ alone. The idea is to condition on (ie. treat as fixed) random variables that contain no information about $\theta$ – very important for frequentists, as it defines what “repetition of the experiment” means
The likelihood principle
The likelihood principle: If $x_1$ observed from $E_1$ and $x_2$ observed from $E_2$ have the same likelihood functions (to within a constant) then the “information content” wrt inference about $\theta$ is the same.
The likelihood principle implies both the sufficiency and conditionality principles. However, it is not particularly palatable to frequentists because it implies that information content does not depend on the notion of sample spaces or on the repeatability of the experiment.
We will perform inference based on likelihoods because for almost all practical problems it (or some modification of it) works well. Later we will see that maximum likelihood leads to estimators with good asymptotic performance.
Properties of maximum likelihood estimation
Nice things:
- intuitive
- very widely applicable, can combine data from multiple experiments
- unaffected by monotonic transformations of the data
- MLE of a function of the parameters is that function of the MLE
- theory provides large sample properties
- asymptotically efficient estimators
- provides general methods of inference
Not-so-nice things:
- may be slightly biased
- parametric model required and must adequately describe statistical process generating data
- can be computationally demanding
- fails in some cases, eg. if there are too many nuisance parameters
It is usually easier to maximise the log-likelihood $l(\mathbf{\theta}) = \log L(\mathbf{\theta})$ , which has the same maximiser.
ML estimation doesn’t depend on the parameterisation of the model. If $g$ is a 1:1 function, then the MLE of $g(\theta)$ is $g(\hat{\theta})$ , and more generally we will define $g(\hat{\theta})$ to be the MLE of $g(\theta)$ . This means we can use the most convenient parameterisation (although some parameterisations may have better properties than others)
ML estimation is also invariant to (one-to-one, differentiable) transformations of the observations. If $Y = t(X)$ is such a transformation, then $f_Y(y;\theta) = |dx/dy|\, f_X(x;\theta)$ , where the Jacobian $|dx/dy|$ does not depend on $\theta$ , so the likelihood is only rescaled by a factor not involving $\theta$ and the MLE is unchanged.
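A minimal numerical sketch (the exponential model, sample size, seed, and use of scipy are illustrative assumptions, not part of the notes) of maximising a log-likelihood and of the invariance of the MLE under reparameterisation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch: numerical MLE for the rate of an exponential sample
# by maximising the log-likelihood. The closed form 1/mean(x) is a sanity check.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)      # true rate theta = 0.5

def neg_log_lik(theta):
    # l(theta) = n*log(theta) - theta*sum(x) for iid Exponential(rate theta)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
theta_hat = res.x
print(theta_hat, 1 / x.mean())                # numerical MLE vs closed form

# Invariance: the MLE of g(theta) = 1/theta (the mean) is g(theta_hat).
print(1 / theta_hat, x.mean())
```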
Cramer-Rao inequality
Motivation: we will eventually show that ML estimators converge in distribution at a $\sqrt{n}$ rate, subject to very general regularity conditions. That is, if the data are iid, $\sqrt{n}(\hat{\theta}_n - \theta) \rightarrow_D N_s(0, V(\theta))$ . So for large n, the MLE is approximately $\sim N(\theta, V(\theta)/n)$ .
Cramer-Rao inequality for $\theta \in \mathbb{R}$
Let:
- $X = (X_1, ..., X_n)$ be a sample from a distribution with joint pdf $f_n(x;\theta)$ .
- $\delta(X)$ be any unbiased estimator of $\theta$ .
If:
- $Var_\theta(\delta) \lt \infty$ , and
- $\int \delta(x) f_n(x;\theta)\, dx$ and $\int f_n(x;\theta)\, dx$ can be differentiated wrt $\theta$ under the integral sign
Then:
- $Var_\theta(\delta) \ge 1 / I(\theta)$
- where $I(\theta) = E_\theta \left[ \left( \frac{d\, l(\theta;X)}{d\theta} \right)^2 \right]$ and $l(\theta;X) = \log f_n(X;\theta)$ is the log-likelihood
- $1 / I(\theta)$ is the Cramer-Rao lower bound
- $I(\theta)$ is the (expected) Fisher information that $X$ contains about $\theta$ . It quantifies the amount of info the random vector $X$ provides about $\theta$ . Large $I(\theta)$ is good.
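A standard worked example (not from these notes): for $X_1, \dots, X_n$ iid Poisson($\theta$),

```latex
\[
l(\theta; \mathbf{x}) = \Big(\sum_i x_i\Big)\log\theta - n\theta - \sum_i \log x_i!,
\qquad
\frac{d\, l}{d\theta} = \frac{\sum_i X_i}{\theta} - n,
\qquad
I(\theta) = E_\theta\Big[\Big(\tfrac{\sum_i X_i - n\theta}{\theta}\Big)^2\Big]
          = \frac{n\theta}{\theta^2} = \frac{n}{\theta}.
\]
```

The Cramer-Rao lower bound is therefore $\theta/n$, and $\bar{X}$ is unbiased with $Var_\theta(\bar{X}) = \theta/n$, so the bound is attained here.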
Proof:
- want to show $Var_\theta (\delta )I(\theta ) \ge 1$
- $Var_\theta(\delta) I(\theta) = E\left[ (\delta(X) - \theta)^2 \right] E\left[ \left( \frac{d\, l(\theta;X)}{d\theta} \right)^2 \right] \ge E\left[ (\delta(X) - \theta) \frac{d\, l(\theta;X)}{d\theta} \right]^2$ , by the Cauchy-Schwarz inequality
- the right-hand side equals $\left( \int (\delta(x) - \theta) \frac{d \log f_n}{d\theta} f_n\, dx \right)^2 = \left( \int (\delta(x) - \theta) \frac{f'_n}{f_n} f_n\, dx \right)^2 = \left( \int (\delta(x) - \theta) f'_n\, dx \right)^2$
- $\int \theta f'_n\, dx = \theta \frac{d \int f_n\, dx}{d\theta} = \theta \frac{d 1}{d\theta} = 0$
- and $\int \delta(x) f'_n\, dx = \frac{d}{d\theta} \int \delta(x) f_n\, dx = \frac{d\theta}{d\theta} = 1$ , since $\delta$ is unbiased
- so the right-hand side is $(1 - 0)^2 = 1$ , giving $Var_\theta(\delta) I(\theta) \ge 1$ as required
The CR inequality does not imply the existence of an unbiased estimator that achieves the lower bound, or indeed of any unbiased estimator at all.
CR inequality for $g(\theta)$
If $\delta(X)$ is an unbiased estimator of $g(\theta)$ then $Var_\theta(\delta) \ge \frac{g'(\theta)^2}{I(\theta)}$ . Can be proved through a minor modification of the above proof, or, when $g$ is invertible, by reparameterising the likelihood function in terms of $\zeta = g(\theta)$ .
Alternative formulae
- $I(\theta) = Var_\theta [ \frac{d l(\theta;X)}{d \theta} ]$
- $I(\theta) = -E_\theta [ \frac{d^2 l(\theta;X)}{d \theta^2} ]$
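A quick check of both formulae on a single Bernoulli($\theta$) observation (a standard example, not from these notes):

```latex
\[
l(\theta; x) = x\log\theta + (1-x)\log(1-\theta),
\qquad
\frac{d\, l}{d\theta} = \frac{x-\theta}{\theta(1-\theta)},
\qquad
-\frac{d^2 l}{d\theta^2} = \frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}.
\]
```

Both routes give $I_1(\theta) = 1/(\theta(1-\theta))$: the variance of $dl/d\theta$ is $\theta(1-\theta)/(\theta(1-\theta))^2$, and the expectation of the second expression is $1/\theta + 1/(1-\theta)$.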
iid case
If the observations are independent, the likelihoods multiply and the log-likelihoods add; if they are also identically distributed, each observation contributes the same information, so $I_n(\theta) = n I_1(\theta)$ , where $I_1$ is the information in a single observation.
Multiparameter case
Results as for single parameter case, but in vector form:
- $Var_\theta(\delta) \ge g'(\theta)^T I^{-1}(\theta)\, g'(\theta)$ , where $g'(\theta)$ is the vector of partial derivatives of $g$
- $I_{ij}(\theta) = E_\theta \left[ \frac{\partial l}{\partial \theta_i} \frac{\partial l}{\partial \theta_j} \right]$
- $I_{ij}(\theta) = -E_\theta \left[ \frac{\partial^2 l}{\partial \theta_i \partial \theta_j} \right]$
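A standard worked example (not derived in these notes): for a single observation from $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, the cross terms vanish and

```latex
\[
I_1(\mu, \sigma^2) =
\begin{pmatrix}
1/\sigma^2 & 0 \\
0 & 1/(2\sigma^4)
\end{pmatrix},
\]
```

and by the iid result $I_n = n I_1$.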
Asymptotics
Test statistics ($\theta \in \mathbb{R}$ )
Likelihood ratio test
If appropriate assumptions hold and $\theta_0$ is the true parameter value, then $2[l(\hat{\theta}_n; X) - l(\theta_0; X)] \rightarrow_D \chi^2_1$ .
Proof:
- expand $l(\theta, X)$ about $\hat{\theta}_n$ and evaluate at true parameter value
- $l(\theta_0) = l(\hat{\theta}_n) + l'(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n) + \frac{1}{2} l''(\theta^*_n)(\theta_0 - \hat{\theta}_n)^2$ , for some $\theta^*_n$ between $\theta_0$ and $\hat{\theta}_n$
- since $l'(\hat{\theta}_n) = 0$ , this gives $2[l(\hat{\theta}_n) - l(\theta_0)] = -l''(\theta^*_n)(\theta_0 - \hat{\theta}_n)^2$
- $(\sqrt{n}(\theta_0 - \hat{\theta}_n))^2 \rightarrow_D N(0, 1 / I_1(\theta_0))^2 = \chi^2_1 / I_1(\theta_0)$
- $l''(\theta^*_n)/n \rightarrow_p -I_1(\theta_0)$ since $\theta^*_n \rightarrow_p \theta_0$
- combining via Slutsky’s theorem, $2[l(\hat{\theta}_n) - l(\theta_0)] = \left( -l''(\theta^*_n)/n \right) \cdot n(\theta_0 - \hat{\theta}_n)^2 \rightarrow_D I_1(\theta_0) \cdot \chi^2_1 / I_1(\theta_0) = \chi^2_1$
For sufficiently large n, an approximate size $\alpha$ hypothesis test of $H_0: \theta = \theta_0$ is given by: Reject $H_0$ if $2[l(\hat{\theta}_n) - l(\theta_0)] \gt \chi^2_{1,\alpha}$ , where $\chi^2_{1,\alpha}$ is the upper $\alpha$ quantile of $\chi^2_1$ .
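A minimal numerical sketch of this test (the Bernoulli model, sample size, seed, and significance level are illustrative assumptions, not part of the notes):

```python
import numpy as np
from scipy.stats import chi2

# Illustrative sketch: likelihood ratio test of H0: theta = 0.5 for iid
# Bernoulli(theta) data.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.6, size=100)

def log_lik(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_hat = x.mean()                          # MLE of the success probability
lrt = 2 * (log_lik(theta_hat) - log_lik(0.5))
crit = chi2.ppf(0.95, df=1)                   # upper 0.05 quantile of chi^2_1
print(lrt, crit, lrt > crit)                  # reject H0 if lrt > crit
```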
Wald test
We can replace $-l''(\theta^*_n)$ by other asymptotically equivalent quantities. The Wald test replaces it with $I(\hat{\theta}_n)$ (or sometimes with $-l''(\theta_0)$ or $I(\theta_0)$ ) to get $I(\hat{\theta}_n)(\theta_0 - \hat{\theta}_n)^2 \rightarrow_D \chi^2_1$ . This can be viewed as the square of an approximately standardised, asymptotically normal random variable. In practice the LR test is used most often.
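For a Bernoulli($\theta$) sample (an illustration, not from the notes), $I_n(\hat{\theta}_n) = n / (\hat{\theta}_n(1-\hat{\theta}_n))$, so the Wald statistic is

```latex
\[
W = \frac{n(\hat{\theta}_n - \theta_0)^2}{\hat{\theta}_n(1 - \hat{\theta}_n)}
  \;\rightarrow_D\; \chi^2_1 \quad \text{under } H_0: \theta = \theta_0.
\]
```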
Confidence intervals
Invert the results of the hypothesis tests. Eg. an approximate $1 - \alpha$ confidence interval is the set of all values $\theta_0$ such that $l(\theta_0) \gt l(\hat{\theta}_n) - \chi^2_{1,\alpha} / 2$ , i.e. all values not rejected by the likelihood ratio test.
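A minimal numerical sketch (reusing the illustrative Bernoulli setup from the test sketch above; the grid and seed are arbitrary assumptions) of inverting the likelihood ratio test to get an approximate 95% confidence interval:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative sketch: approximate 95% confidence interval for a Bernoulli
# probability, obtained by inverting the likelihood ratio test over a grid.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.6, size=100)

def log_lik(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_hat = x.mean()
cutoff = log_lik(theta_hat) - chi2.ppf(0.95, df=1) / 2

grid = np.linspace(0.001, 0.999, 999)
inside = [t for t in grid if log_lik(t) > cutoff]
print(min(inside), max(inside))               # approximate interval endpoints
```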
Test statistics ($\theta \in \mathbb{R}^s$ )
Extending results to multiparameter case is reasonably straightforward, but very tedious. Results follow.
Asymptotic normality and efficiency of $\hat{\theta}_n \in \mathbb{R}^s$ : exactly the same as for the single-parameter case, but in vector form. The delta theorem extends similarly.
Simple hypothesis
Want to test if $\theta$ is a particular value. $H_0: \theta = \theta_0$ . Likelihood and Wald score tests extend in the obvious way.
Composite hypothesis
$H_0: \theta \in \Theta_0 \subset \Theta$ , where $\Theta_0$ is an $(s - r)$ -dimensional subset of $\Theta$ . We shall assume (reparameterising if necessary) that $\theta = (\psi, \lambda)^T$ , where $\psi \in \mathbb{R}^r$ , $\lambda \in \mathbb{R}^{s-r}$ , and $\Theta_0$ consists of all points in $\Theta$ for which $\psi$ equals some specified value $\psi_0$ .
Likelihood ratio: $2[l(\hat{\theta}_n) - l(\hat{\theta}_{0n})] \rightarrow_D \chi^2_r$ , where $\hat{\theta}_{0n}$ is the MLE over $\Theta_0$ .
Wald: $(\hat{\psi}_n - \psi_0)^T [[I^{-1}(\hat{\theta}_n)]_{\psi\psi}]^{-1}(\hat{\psi}_n - \psi_0) \rightarrow_D \chi^2_r$ , where $[I^{-1}(\hat{\theta}_n)]_{\psi\psi}$ is the upper-left $r \times r$ submatrix of $I^{-1}(\hat{\theta}_n)$ , whose inverse is called the Fisher information for $\psi$ in the presence of the nuisance parameter $\lambda$ .
Some probability theory and inequalities
Convergence in distribution
Let $X_1, X_2, ...,$ and $X$ be random variables with cdfs $F_1, F_2, ...$ and $F$ . If $F_n(x) \to F(x)$ at every point $x$ at which $F$ is continuous, then the sequence $X_n$ is said to converge in distribution to $X$ ($X_n \to_D X$ ).
Convergence in probability
Let $X_1, X_2, ...,$ and $X$ be random variables on the same probability space. If $P(|X_n - X| \gt \epsilon) \to 0$ as $n \to \infty$ for all $\epsilon \gt 0$ , then the sequence $X_n$ is said to converge in probability to $X$ ($X_n \to_p X$ ).
- $X_n \to_p X \implies X_n \to_D X$
- $X_n \to_D c \implies X_n \to_p c$ , for a constant $c$
- If $g$ is continuous and $X_n \to_p X$ , then $g(X_n) \to_p g(X)$ .
Slutsky’s theorem
If $X_n \to_D X$ , $A_n \to_p a$ and $B_n \to_p b$ , where $a$ and $b$ are constants,
then $A_n + B_n X_n \to_D a + bX$ .
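A standard application (not from these notes): if $\sqrt{n}(\hat{\theta}_n - \theta) \to_D N(0, v)$ and $\hat{v}_n \to_p v > 0$, then taking $A_n = 0$ and $B_n = 1/\sqrt{\hat{v}_n}$ (so $B_n \to_p 1/\sqrt{v}$ by continuity),

```latex
\[
\frac{\sqrt{n}(\hat{\theta}_n - \theta)}{\sqrt{\hat{v}_n}} \;\to_D\; N(0, 1),
\]
```

which is what justifies plugging estimated variances into test statistics.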
Delta theorem
Suppose $\sqrt{n}(X_n - b) \to_D X$ . If $g: \mathbb{R} \to \mathbb{R}$ is differentiable and $g'$ is continuous at $b$ , then $\sqrt{n}(g(X_n) - g(b)) \to_D g'(b)X$ .
The delta theorem ensures that a reparameterised MLE has the same asymptotic properties.
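A standard worked example (not from these notes): with $g(\theta) = \log\theta$ for $\theta > 0$, so $g'(\theta) = 1/\theta$,

```latex
\[
\sqrt{n}(\hat{\theta}_n - \theta) \to_D N(0, \sigma^2)
\;\Longrightarrow\;
\sqrt{n}\big(\log\hat{\theta}_n - \log\theta\big) \to_D N(0, \sigma^2/\theta^2).
\]
```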
Jensen’s inequality
Let $D$ be an interval in $\mathbb{R}$ . If $\phi: D \to \mathbb{R}$ is convex, then for any random variable $X$ on $D$ , $\phi(E[X]) \le E[\phi(X)]$ .
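For example, taking $\phi(x) = x^2$ (convex on $\mathbb{R}$) recovers a familiar fact:

```latex
\[
(E[X])^2 \le E[X^2], \qquad \text{i.e. } Var(X) = E[X^2] - (E[X])^2 \ge 0.
\]
```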
Cauchy-Schwarz inequality
For any two random variables $X$ and $Y$ such that $E(X^2) \lt \infty$ and $E(Y^2) \lt \infty$ , $(E[XY])^2 \le E[X^2]E[Y^2]$ .