Fundamental paradigms and principles of inference

Two principles are widely considered to be fundamental: sufficiency and conditionality. Intuitively, if we partition the data $(x, y)$ into the marginal piece $x$ and the conditional piece $y \mid x$, sufficiency says that $y \mid x$ carries no information about $\theta$ (when $x$ is a sufficient statistic), while conditionality says that $x$ carries no information about $\theta$ (when $x$ is ancillary), so in each case inference should be based on the other piece.

Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ be a random vector of observations, with joint pdf $f_n(\mathbf{x}; \mathbf{\theta}) \equiv f_n(x_1, \ldots, x_n; \mathbf{\theta})$ , where $\mathbf{x} \in \mathbb{R}^n$ . Given $\mathbf{x} = (x_1, \ldots, x_n)$ , any $\mathbf{\hat{\theta}}$ that maximises $L(\mathbf{\theta}) \equiv L(\mathbf{\theta}; \mathbf{x}) = f_n(\mathbf{x}; \mathbf{\theta})$ over $\Theta$ is called a maximum likelihood estimate of the unknown true parameter.
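
To make the definition concrete, here is a standard worked example (an illustration, not from the notes above): suppose $X_1, \ldots, X_n$ are i.i.d. Bernoulli($\theta$). Then $L(\theta) = \theta^{\sum_i x_i}(1-\theta)^{n - \sum_i x_i}$, so $\log L(\theta) = (\sum_i x_i)\log\theta + (n - \sum_i x_i)\log(1-\theta)$, and setting the derivative to zero gives $\hat{\theta} = \bar{x}$, the sample proportion of successes.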

Sufficiency

$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ if the distribution of $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\mathbf{\theta}$ .

This means we can partition the data into two pieces, $T(\mathbf{X})$ and $\mathbf{X} \mid T(\mathbf{X})$. The latter contains no information about $\theta$, so we should base inference solely on $T(\mathbf{X})$. You can establish that a statistic is sufficient by calculating the conditional distribution directly, or by using the factorisation theorem.
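
As an illustration of the first route (a standard example, not from the notes above): for $X_1, \ldots, X_n$ i.i.d. Bernoulli($\theta$) and $T(\mathbf{X}) = \sum_i X_i$, we have $P(\mathbf{X} = \mathbf{x} \mid T = t) = \frac{\theta^t (1-\theta)^{n-t}}{\binom{n}{t}\theta^t (1-\theta)^{n-t}} = \frac{1}{\binom{n}{t}}$, which does not depend on $\theta$, so $T(\mathbf{X})$ is sufficient.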

$T(\mathbf{X})$ is sufficient for $\mathbf{\theta}$ iff the joint pdf can be factorised into two functions, one involving only $T(\mathbf{x})$ and $\mathbf{\theta}$, the other only $\mathbf{x}$, i.e. $f_n(\mathbf{x}; \mathbf{\theta}) = g(T(\mathbf{x}); \mathbf{\theta})\,h(\mathbf{x})$.
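
For example (a standard illustration, not from the notes above), if $X_1, \ldots, X_n$ are i.i.d. Poisson($\theta$), then $f_n(\mathbf{x}; \theta) = \prod_i \frac{\theta^{x_i} e^{-\theta}}{x_i!} = \theta^{\sum_i x_i} e^{-n\theta} \cdot \prod_i \frac{1}{x_i!}$. Taking $g(t; \theta) = \theta^{t} e^{-n\theta}$ and $h(\mathbf{x}) = \prod_i 1/x_i!$, the factorisation theorem gives that $T(\mathbf{X}) = \sum_i X_i$ is sufficient for $\theta$.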

The factorisation theorem also shows that maximum likelihood estimators are functions of sufficient statistics: since $L(\mathbf{\theta}; \mathbf{x}) = g(T(\mathbf{x}); \mathbf{\theta})\,h(\mathbf{x})$ and $h(\mathbf{x})$ does not involve $\mathbf{\theta}$, the maximisation of $L(\mathbf{\theta}; \mathbf{x})$ wrt $\mathbf{\theta}$ depends on $\mathbf{x}$ only through the sufficient statistic.
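
Continuing the illustrative Poisson example above: maximising $\log g(T(\mathbf{x}); \theta) = T(\mathbf{x})\log\theta - n\theta$ gives $\hat{\theta} = T(\mathbf{x})/n = \bar{x}$, a function of the data only through the sufficient statistic.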

Conditioning

Let $E_1$ and $E_2$ be two experiments with the same parameter space $\Theta$ and the same unknown parameter, with densities $f_1$ and $f_2$ respectively. Let $E$ be the mixture experiment in which $E_1$ is performed with probability $p$ and $E_2$ with probability $1-p$ (with $p$ known and not depending on $\theta$), and the outcome is recorded as $(E_i, x_i)$.

The conditionality principle: if we observe $(E_i, x_i)$, then the information from $(E, (E_i, x_i))$ is the same as that from $(E_i, x_i)$ alone. The idea is to condition on (i.e. treat as fixed) random variables that contain no information about $\theta$. This is very important for frequentists, as it defines what “repetition of the experiment” means.
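
A standard illustration (Cox’s two-instruments example; not from the notes above): a fair coin decides whether a measurement is made with a precise instrument, giving $X \sim N(\theta, \sigma_1^2)$, or an imprecise one, giving $X \sim N(\theta, \sigma_2^2)$ with $\sigma_2 \gg \sigma_1$. The conditionality principle says that once we know which instrument was actually used, inference about $\theta$ should be based on that instrument alone; “repetitions of the experiment” are repetitions with the same instrument, not re-tosses of the coin.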

The likelihood principle

The likelihood principle: if $x_1$ observed from $E_1$ and $x_2$ observed from $E_2$ yield the same likelihood function (up to a multiplicative constant not depending on $\theta$), then their “information content” with respect to inference about $\theta$ is the same.
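
A standard illustration (not from the notes above): suppose we observe 9 successes and 3 failures in Bernoulli($\theta$) trials. Under binomial sampling (stop after $n = 12$ trials) the likelihood is $\binom{12}{9}\theta^9(1-\theta)^3$; under negative binomial sampling (stop at the 3rd failure) it is $\binom{11}{9}\theta^9(1-\theta)^3$. The two likelihoods are proportional, so the likelihood principle says inference about $\theta$ should be identical in the two cases, even though frequentist procedures based on the two different sample spaces can disagree.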

The likelihood principle implies both the sufficiency and conditionality principles. However, it is not particularly palatable to frequentists because it implies that the information content does not depend on the sample space or on the repeatability of the experiment.

We will perform inference based on likelihoods because, for almost all practical problems, the likelihood (or some modification of it) works well. Later we will see that maximum likelihood estimation leads to estimators with good asymptotic properties.