Maximum likelihood estimation
Basic likelihood estimation and inference
`bbY = (Y_1, Y_2, ..., Y_n)^T` vector of independent random variables with possible values in `Omega_1, ..., Omega_n`, and assume that `bbY in Omega = Omega_1 xx ... xx Omega_n` (positivity condition). `bbtheta = (theta_1, ..., theta_p)^T` vector of parameters such that `bbtheta in Theta sube RR^p`, `p < n`. Likelihood function `l_n(theta) = prod f_i(y_i | theta)`. The maximum likelihood estimator of `theta` is `hat theta in Theta` such that `l_n(hat theta) >= l_n(theta) AA theta in Theta`. Typically found by taking logs, then derivatives, and solving for the roots.
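A minimal numerical sketch (assuming Python with `numpy` and `scipy`, which these notes do not prescribe): maximize `L_n(theta)` by minimizing the negative log-likelihood, here for simulated Gamma data.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=200)  # simulated data, true theta = (2, 1.5)

def neg_log_lik(theta):
    a, scale = theta
    if a <= 0 or scale <= 0:                   # stay inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(y, a, scale=scale))

# maximizing l_n(theta) = maximizing L_n(theta) = minimizing -L_n(theta)
res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                                   # (hat a, hat scale), near (2, 1.5)
```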
Let `P_theta` be the distribution of the random variables, indexed by the parameter `theta`. Suppose that for `theta in Theta`:
- `P_theta` have common support
- random variables are iid with common density function `f(y_i | theta)`
- true value of `theta`, `theta_0` lies in interior of `Theta`
Then as `n -> oo`, `P[ prod f(y_i | theta_0) > prod f(y_i | theta) ] -> 1` for any fixed `theta != theta_0`. This provides the connection between the ML estimate and the true value: with probability tending to one, the likelihood is larger at `theta_0` than at any other fixed `theta`.
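A quick Monte Carlo check of this claim (a sketch; the normal model, sample sizes, and replication count are illustrative choices, not from the notes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta0, theta1 = 0.0, 0.5                 # true mean vs. a fixed alternative
for n in (5, 50, 500):
    wins = 0
    for _ in range(2000):
        y = rng.normal(theta0, 1.0, size=n)
        # comparing log-likelihoods is equivalent to comparing the products
        if norm.logpdf(y, theta0).sum() > norm.logpdf(y, theta1).sum():
            wins += 1
    print(n, wins / 2000)                 # fraction of wins approaches 1
```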
If, in addition, the observations are iid (a concrete example follows this list):
- `l_n(theta) = prod f(y_i | theta)`
- `L_n(theta) = sum log(f(y_i | theta))`
- `U_(n,k)(theta) = sum 1/(f(y_i|theta)) (grad/(grad theta_k)) f(y_i | theta)` (score function)
- `I_(n,j,k)(theta) = nE[grad/(grad theta_k) log(f(y_i|theta)) grad/(grad theta_j) log(f(y_i | theta))]` (information matrix)
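For a concrete iid case (our choice of example), take the exponential density `f(y | lambda) = lambda e^(-lambda y)`: the score is `U_n(lambda) = n/lambda - sum y_i` and the information is `I_n(lambda) = n/lambda^2`.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0
y = rng.exponential(1.0 / lam, size=1000)   # numpy parameterizes by scale = 1/lambda
n = len(y)

# score: U_n(lambda) = sum d/dlambda log f(y_i | lambda) = n/lambda - sum y_i
def score(lam):
    return n / lam - y.sum()

# information: I_n(lambda) = n E[(d/dlambda log f)^2] = n / lambda^2
info = n / lam**2

# the MLE solves U_n(lambda) = 0  =>  hat lambda = n / sum y_i
lam_hat = n / y.sum()
print(lam_hat, score(lam_hat), info)        # score vanishes at the MLE
```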
Properties of estimators
Developed under sets of technical conditions called regularity conditions. Many different sets exist, each yielding a range of properties. We focus on two: the first guarantees a consistent estimator of `theta`, and the second adds asymptotic normality.
Regularity conditions set 1
- Distributions of `Y_1, ..., Y_n` are distinct and have common support
- True value lies in interior of open interval contained within parameter space
- For almost all `y` the density function is differentiable with respect to all elements of `theta`
Corollary 1: If the parameter space is finite, then there is a sequence of consistent, unique ML estimates.
Corollary 2: If the likelihood equation has a unique root for each `n`, then that sequence of estimators is consistent.
There are four basic (essentially independent) things we want to happen:
- existence of mle (or sequence)
- existence of roots of the likelihood equation
- uniqueness of estimators
- consistency of sequences of estimators
Regularity conditions set 2
- `E[grad/(grad theta_k) log(f(Y|theta))] = 0`
- `I_(j,k)(theta) = -E[ grad^2/(grad theta_k grad theta_j) log(f(Y|theta))]`
- the information matrix is positive definite
- `I_n(theta) -> oo` as `n -> oo`
Conditions 1 and 2 can be replaced by the requirement that we may exchange differentiation and integration (differentiate under the integral sign).
`=>` there exists a sequence of solutions to the likelihood equations such that:
- `hat theta_n` is consistent for `theta`
- `n^(1/2)(hat theta_n - theta)` is asymptotically normal with mean `bb0` and covariance `nI_n^(-1)(bb theta)` (see the simulation sketch after this list)
- `hat theta_(n,k)` is asymptotically efficient
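A simulation sketch of the normality claim under the same illustrative exponential model: with iid Exp(`lambda`) data, `I_n(lambda) = n/lambda^2`, so `n^(1/2)(hat lambda_n - lambda)` should be roughly normal with variance `nI_n^(-1)(lambda) = lambda^2`.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 500, 5000
draws = rng.exponential(1.0 / lam, size=(reps, n))
lam_hat = n / draws.sum(axis=1)             # MLE in each replication
z = np.sqrt(n) * (lam_hat - lam)
print(z.mean(), z.var())                    # roughly 0 and lambda**2 = 4
```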
Two additional properties of MLEs are useful:
- if a given scalar parameter `theta` has a single sufficient statistic, then the MLE must be a function of that statistic. If the statistic is minimal and complete, then the MLE is unique; if, further, the MLE is unbiased, then it is the UMVU estimator
- invariance: the MLE of a function of the parameters is that function of the MLEs of the parameters (demonstrated in the sketch below)
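A sketch of invariance under an assumed normal model (the example is ours): maximizing the likelihood directly over `sigma` recovers the square root of the closed-form MLE of `sigma^2`.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(0.0, 3.0, size=300)
mu_hat = y.mean()

# closed-form MLE of sigma^2
sigma2_hat = np.mean((y - mu_hat) ** 2)

# maximize the likelihood directly in terms of sigma
res = minimize_scalar(lambda s: -norm.logpdf(y, mu_hat, s).sum(),
                      bounds=(1e-6, 10.0), method="bounded")
print(res.x, np.sqrt(sigma2_hat))   # agree: MLE of g(theta) equals g(MLE)
```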
Wald theory inference
`(hat theta_n - theta)^T I_n(hat theta_n) (hat theta_n - theta) -> Chi^2_p`. Let `b(theta) = (R_1(theta), ..., R_r(theta))^T` be an `r xx 1` vector of restrictions on the model parameters, and let `C(theta)` be the `r xx p` matrix with entries `c_(j,k) = grad/(grad theta_k) R_j(theta)`. Then `W_n = b^T (C I_n^(-1) C^T)^(-1) b -> Chi^2_r`, with `b`, `C`, and `I_n` evaluated at `hat theta_n`.
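A sketch of the Wald statistic for a single restriction, again under the illustrative exponential model with `R_1(lambda) = lambda - 2` (so `r = 1`, `C = 1`):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
y = rng.exponential(1.0 / 2.0, size=200)    # data generated with lambda = 2
n = len(y)
lam_hat = n / y.sum()

b = lam_hat - 2.0                           # restriction b(lambda) = lambda - 2
C = 1.0                                     # db/dlambda
I_n = n / lam_hat**2                        # information at hat lambda
W = b * (C * (1.0 / I_n) * C) ** (-1) * b   # b^T (C I_n^{-1} C^T)^{-1} b
print(W, chi2.ppf(0.95, df=1))              # compare W to the chi^2_1 cutoff
```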
Likelihood inference
Let `dim{Theta} = p` and `dim{Theta_0} = r` with `Theta_0 sube Theta`, and let `hat theta_n = arg max_(theta in Theta) L_n(theta)`, `bar theta_n = arg max_(theta in Theta_0) L_n(theta)`. Then `T_n = -2(L_n(bar theta_n) - L_n(hat theta_n)) -> Chi^2_(p-r)`.
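A sketch of the likelihood ratio statistic (normal model testing `H_0: mu = 0`, so `p = 2`, `r = 1`; the example is ours):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(6)
y = rng.normal(0.0, 2.0, size=150)          # data generated under H_0

# unrestricted MLEs over Theta (mu and sigma both free)
mu_hat = y.mean()
sig_hat = np.sqrt(np.mean((y - mu_hat) ** 2))
L_full = norm.logpdf(y, mu_hat, sig_hat).sum()

# restricted MLE over Theta_0 (mu fixed at 0)
sig0_hat = np.sqrt(np.mean(y ** 2))
L_restr = norm.logpdf(y, 0.0, sig0_hat).sum()

T = -2.0 * (L_restr - L_full)
print(T, chi2.ppf(0.95, df=1))              # T -> chi^2_(p-r) = chi^2_1 under H_0
```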