Concepts of probability

Probability is not a thing – it is many concepts. All concepts obey the same fundamental rules, but the concept employed in a specific analysis determines the manner in which probably is introduced to the problem, and the interpretation of inferential statements. Difference concepts lead to different approaches to developing an analysis

Laplacian probability

(aka, Classical, gambling probability)

Elements of the sample space (basic outcomes) are equally likely. `P(E) = (|E|) / (|S|)`. Fundamental characteristic of equally likely outcomes usually result of physical properties of operation.

Limited, but used more often that you might think. Used directly in some randomisation procedures.

Relative frequency probability

Probability is a direct consequence of physical realities (ie. things that actually happened). Clearly inadequate, but can apply in problems that involve finite populations.

Hypothetical limiting relative frequency probability

`P(E) = lim_(n->oo) (E / n)`

Epistemic probability

`P(E|y) = (P(y|E)P(E)) /((P(y|E)P(E) + P(y | E^c)P(E^c)))`

Randomisation based approaches

Randomisation based approaches included the survey sampling and experiment approach.


Pure or strict view is that population must be finite and existing (not subject to arbitrary definition). All units may be individually identified and assigned discrete labels. Common problems:

Attributes and responses

Sampling approach assumes that there is a fixed value of interest, an attribute, associated with each unit. Not realisation of some random mechanism but fixed, immutable characteristics.

Under experimental approach, call measured or observed values as responses. Essentially the same as for sampling, but are influenced by external factors within the time frame of the study.

Survey sampling

In the context of a basic sampling approach, statistics computed over the entire population are considered to be parameters (eg. mean, total, variance, proportion belonging to some class). We are interested in estimating these parameters, because it is not normally feasible to measure every unit in the population. We will only consider sampling error, ie. error that arises because we don’t have complete coverage.

Simple random sampling

A simple random sample of `n` units, selected from a population of `N` units, st all possible distinct subets of size `n` are equally likely to be chosen.

Can obtain sample through either group or sequential selection – in practice sequential is much easier to use and gives same inclusion probabilities.

Basic estimators

Horvitz-Thompson estimator `hat tau` of the population total:

`hat tau = sum x_i I(U_i in S_n^**) 1/pi_i = sum_{U_i in S_n^**) x_i/pi_i` `var{hat tau} = var{ sum x_i/pi_i Z_i} = sum ( (1-pi_i)/pi_i) x_i^2 + 2 sum sum ( (pi_{i,j} - pi_i * pi_j) / (pi_i pi_j)) x_i x_j`.

Unequal probability samples

Have so far worked within Laplacian probability, but what if we have a set of possible samples that should be selected with unequal probability? Generally easier to think about generating equally likely samples with population units having unequal probability. (Not possible to use material concept of probability). Often called restricted randomisation, eg. cluster or stratified sampling.

Inclusion probabilities and linear estimators

Want to be able to connect averaging over samples with expectation over population units.

Given attributs of population units, and a population parameter `theta` to be estimated, a linear estimator with `theta` has the form `hat theta_k = sum B_i x_i I(U in S_{n,k}`. Play central role in survey sampling and many estimators may be written in linear form. Variances for non-linear estimators often derived by forming a Taylor series expansion.

Defined the inclusion probability for population unit `U_i` as `pi_i = Pr(U_i in S_n^**`. In general a design unbiased estimator can be found for provided the inclusion probabilities are known.

Second order probability = `pi_{i,j} = Pr{ (U_i in S_n^**) nn (U_j in S_n^8*)}`

Overall generalisation

  1. For a finite population of units `{U_i : i = 1, ..., N}` define the binary random variables `Z_i = I(U_i in S_n^**}`
  2. Define inclusion probability as `pi_i = Pr(Z_i =1)` and `pi_{i,j} = Pr{(Z_i = 1) nn (Z_j = 1) }`.
  3. For a given population consider estimators of the form `hat theta = sum_(i=1)^N beta_i(pi_i) x_i Z_i`

We can then define their properties as follows:

And under the formulation of estimators

For estimations of variances, we need “plug-in” unbiased estimators – will be unbiased if they are a linear combination of quantities that can be estimated without bias.

Experimental approach

Nested syllogism of experimentation

A: chance alone is in effect B: design (systematic force) is in effect C: observable implication of chance alone

Either A or B. If A then C. Not C. Therefore not A. Therefore B. (not replaced by exceeding low probability of)

Randomised treatment assignment

Treatments are randomly allocated to units. This is where probability enters.

Qualifying differences

  1. Assume there is no difference. (ie. treatment group is not related to
  2. attribute)
  3. Work out how some test statistic distributed in this case.
  4. How does the actual assignment compare to this distribution? (calculate a
  5. p-value by counting number greater than or equal to and dividing by total
  6. number of permutations)

Too hard to work out all possible permutations, so usually just take random sample. Traditional t-tests (etc.) asymptotically based on this principle.

Families of distributions useful in modelling

Exponential families

Multiple representations:


The parameters `theta_j` are called canonical parameters. Often the easiest for derivation of properties, not always the best. Helpful to have other parameterisations with certain desirable properties.

Mean value parameterisation 1. Expected value often of interest, but usually none of the canonical parameters correspond to the mean. Can transform `(theta_1, ..., theta_s)` to `(mu = E(Y), theta_1, ..., theta_(s_1))`.

Mean value parameterisation 2.In canonical representation, clear relationship between parameters and sufficient statistics, so we could parameterise using expected values of the `T_j`, transforming `(theta_1, ..., theta_s)` to `(mu_1(theta), ..., mu_s(theta))` where `mu_j(theta) = E[T_j(Y)] = del/(del theta_j) B(theta)`. Has the potential advantage that each parmater of the density is the expected value of a random variable associated with an observable quantity.

Mixed parameterisations. Also possible to write exponential family in terms of both canoncial and mean value eg. `(mu_1(theta), theta_2)`

Reasons for choosing one parameterisation over another:

Exponential dispersion families

Only talk about small subset – essentially one parameter families extended to include additional dispersion parameter (most common for applications). Important family is those where the sufficient statistics is `T(Y) = Y`, called natural exponential families. eg. binomial with fixed n, normal with fixed variance. Have the form `f(Y| theta) = exp phi(Y theta - b(theta)) + c(Y phi)` and `E[X] = b'(theta)`, `"Var"[x] = 1/phi b``(theta) = 1/phi V(mu)`.

If we have more than one (iid) variable then there joint distribution will be `f(Y | theta) = exp( sum theta_j sum T_j(Y_i) - B(theta) + sum c(Y_i))`, which is still in the exponential family form.

Location-Scale families

Families of distributions formed from classes of transformations are useful, particularly location-scale transformations. Let `U` be an rv with distribution `F`, if `U` is transformed into `V = U + mu` then `V` will have distribution `F(Y - mu)`. The set of distributions generated from all `mu in (-oo, oo)` is called a location family. Similarly, the transformation `V = sigma U` generates a scale family of distributions, and the composition generates `V = mu + sigma U` with distribution `F((Y-mu)/sigma)`.

Location-scale families include double exponential, uniform, logistic and normal.


`E(V) = E(U) + mu` `Var(V) = sigma^2 var(U)`

Prominence and limitations of normal distribution


Additive error models

Exemplifies idea of signal + noise. `Y = f(x, beta) + epsi` and `epsi` distributed with some absolutely continuous density function, typically with `E(epsi) = 0`. Modelling task with these models focus on:

Constant variance models

“Backbone” of statistical modelling. Can use transformation if assumption violated (but often introduces other problems).

Linear and non-linear models

Define a model as non-linear if at least of the derivatives of `beta` depends on one or more other parameters. Two notions of non-linearity:

Models with known variance parameters

Relax assumption of constant variance. Use weighted regression either with known weights (ie. `1/sigma`) or weights as a function of other parameters, most typically power of the mean model. If we use functional weights, don’t get nice asymptotic theory.

Models with unknown variance parameters

`Y_p = g_1(x, beta) + sigma g_2(x, beta, z, theta)`

Common models:

Transformaing both sides

Attempt to allow separate adjustment for nonconstant variance, nonsymmetric error distributions and incorrect specification of expectation function. Idea is to transform both sides to help with first two problems without affecting third.

Models based on response distribution

Greatly influnced by generalised linear models, but they form a small, restricted, subset.

Alternative to thinking about model in terms of signal and noise, is to think about systematic and random components that can combine in a non-additive way. Systematic equivalent to signal, but random refers to basic distribution of response variables, not additive error. In this type of model, usually first think about type of response distribution if other factors in model held constant.

Generalised linear models

Class of non-linear models with response distributions from the exponential family. Don’t have to be iid but can only differ through their natural parameters (`theta_i`) but not dispersion parameter (a constant `phi`).

`f(y_i | theta_i) = exp( phi(y_i theta_i 0 b(theta_i)) + c(y_i, phi))` – gives random component. Systematic component consists of link and linear predictor (usually a typical linear model). The link is defined as `g(mu_i) = eta_i`, `g` monotonic. Special set of link functions called canonical `g(mu_i) = b'^(-1)(mu_i) = theta_i`

Nice properties, but not especially wonderful. Most important thing is that link function maps `RR` to appropriate range.

Other important aspect is the variance function `V(mu_i)` which is proportional to the variance of the response (constant of proportionality = `1/phi`). Not open to specification, but determined by random component.

Least squares estimation

At heart, a geometric problem – want to minimise `(y - X beta)^T A (y- X beta)`, ie. when `beta = beta^** = (X^T A X)^(-1) X^T A y`.

For OLS, where `Y = X beta + sigma epsi`, `hat beta = (X^T X)^(-1) X^T Y`. Attach statistical properties to `hat beta` through Gauss-Markov theorem which states that `hat beta` is UMVU among all estimators that are linear functions of `Y`. `cov(hat beta) = (X^T X)^(-1) sigma^2`

For WLS, `Y_i = X beta + (sigma / sqrt(W)) epsi`, which leads to `hat beta = (X^T WX)^(-1) X^T W Y` and `cov(hat beta) = (X^T W X)^(-1) sigma^2`

For GLS, Gauss-Markov non-longer applies but we can use an iterative method to get similar results:

  1. Calculate initial estimates `beta^((0))`
  2. Calculate `n xx n` diagonal weight matrix `W^((j))`, `w_i(beta^((j))) = g^2_2(x^T_i beta^((j)), theta)`
  3. Calculate `V^((j))` matrix, `V^((j))_(i,k) = {: del/(del beta_k) g_1(x_i, beta) |_(beta = beta^((j))`
  4. Calculate `Y^((j))_i = Y_i - g_1(x_i, beta^((j)))`
  5. Calculate step `delta^((j)) = (V^T W V)^(-1) V^T W Y` and update parameters `beta = beta + delta`

Attach statistical properties through fundamental theorem of generalised least squares which states that if `beta^((0))` is `n^(1/2)`-consistent then `beta^((j))` will converge to true values, with asymptotically normal distributions. `hat sigma^2 = 1/(n-p) sum ( (Y_i - g_1(x_i, hat beta)) / (g_2(x_i, hat beta, theta)))^2` and `hat cov(hat beta) = hat sigma^2/n [ 1/n sum (v(x_i, hat beta) v(x_i, hat beta)^T )/(g^2_2(x_i, hat beta, theta)]^(-1)`.

Maximum likelihood estimation

Basic likelihood estimation and inference

`bbY = (Y_1, Y_2, ..., Y_n)^T` vector of iid rv with possible values in `Omega_1, ..., Omega_n`, and assume that `bbY in Omega = Omega_1 xx ... xx Omega_n` (positivity condition). `bbtheta = (theta_1, ..., theta_p)^T` vector of parameters st `theta in Theta in RR^p`, `p < n`.

Likelihood function `l_n (theta) = prod f_i(y_i | theta)`. Maximum likelihood estimator of `theta` is `hat theta in Theta` st ` l_n(hat theta) >= l_n(theta) AA theta in Theta`. Typically solved by taking logs, then derivatives and finding roots.

Let `P_theta` be distribution of rv indexed by parameter `theta`. Suppose for `theta in Theta`

Then as `n -> oo` `P[ prod f(y_i | theta_0) > prod f(y_i | theta) ] -> 1`. Provides connection between ML estimate and true value.

If independent and iid:

Properties of estimators

Developed under sets of technical conditions called regularity conditions. Many different sets from which we can derive a range of properties. Will focus on two sets, the first guarantees a consistent estimator of `theta`, and the second provides asymptotic normality.

Regularity conditions set 1

`=>` Then there exists a sequence of values `{hat theta_n}` which solve the likelihood equations and is a consistent sequence of estimators for `theta`. (does say solutions are unique, or maximum likelihood esimators.

Corollary 1: If the parameter space is finite, then there is a sequence of consistent unique ML estimates Corollary 2: If the likelihood has a unique root for each `n` then that sequence of estimators is consistent.

There are four basic (essentially indepdent) things we want to happen:

Regularly conditions set 2

  1. `E[grad/(grad theta_k) log(f(Y|theta))] = 0`
  2. `I_(j,k)(theta) = -E[ grad^2/(grad theta_k grad theta_j) log(f(Y|theta))]`
  3. the information matrix is positive definite
  4. `I_n(theta) -> oo`

Can replace 1 and 2 with being under to exchange differentiate under integral.

`=>` there exists a sequence of solutions to the likelihood equations st:

Two additional properties of MLEs are useful:

Wald theory inference

`(hat theta_n - theta)^T I_n(hat theta_n) (hat theta_n - theta) -> Chi^2_p`

Let `b(theta) = (R_1(theta), ..., R_r(theta))^T` be an `r xx 1` matrix of defined restrictions on model parameters. Let `C(theta) = (c)_(k,j) = grad/(grad theta_k) R_j(theta)`. Then `W_n = b^T (C I_n^(-1) C^T) b -> X^2_r`.

Likelihood inference

Let `dim{Theta} = p` and `dim{Theta_0} = r` and `hat theta_n = spr_(theta in Theta) L_n(theta)`, `bar theta_n = spr_(theta in Theta_0) L_n(theta)`, then `T_n = -2(L_n(bar theta_n) - L_n(hat theta_n)) -> Chi^2_(p-r)`.