# Concepts of probability

Probability is not a thing – it is many concepts. All concepts obey the same fundamental rules, but the concept employed in a specific analysis determines the manner in which probably is introduced to the problem, and the interpretation of inferential statements. Difference concepts lead to different approaches to developing an analysis

### Laplacian probability

(aka, Classical, gambling probability)

Elements of the sample space (basic outcomes) are equally likely. `P(E) = (|E|) / (|S|)`. Fundamental characteristic of equally likely outcomes usually result of physical properties of operation.

Limited, but used more often that you might think. Used directly in some randomisation procedures.

### Relative frequency probability

- Finite number of physically existing objects in a class `B`
- An operation consists of observing whether a selected object also belongs
- to class `A`.
- `P(A | B) = (|A|) / (|B|)`

Probability is a direct consequence of physical realities (ie. things that actually happened). Clearly inadequate, but can apply in problems that involve finite populations.

### Hypothetical limiting relative frequency probability

- Operations that can (at least hypothetically) be repeated an infinite
- number of times
- Sample space
- Events

### Epistemic probability

- Probability = knowledge or belief
- Belief updated in light of observed information
- Mathmatical formalism necessary so that belief modifed in coherent manner

# Randomisation based approaches

Randomisation based approaches included the survey sampling and experiment approach.

- Have common concept of popluation (usually finite and existing).
- Use Laplacian concept of probability.
- Probability enters only through random selection or assignment.

### Populations

Pure or strict view is that population must be finite and existing (not subject to arbitrary definition). All units may be individually identified and assigned discrete labels. Common problems:

- Non-static populations (restrict inference to time period of interest)
- Populations of unknown size
- Hypothetical and constructed populations
- Arbitrary defined population units (inference conditional on same
- definition, be careful that attributes still appropriate)

### Attributes and responses

Sampling approach assumes that there is a fixed value of interest, an
**attribute**, associated with each unit. Not realisation of some random
mechanism but fixed, immutable characteristics.

Under experimental approach, call measured or observed values as
**responses**. Essentially the same as for sampling, but are influenced by
external factors within the time frame of the study.

# Survey sampling

In the context of a basic sampling approach, statistics computed over the
entire population are considered to be **parameters** (eg. mean, total,
variance, proportion belonging to some class). We are interested in
estimating these parameters, because it is not normally feasible to measure
every unit in the population. We will only consider **sampling error**, ie.
error that arises because we don’t have complete coverage.

### Simple random sampling

A simple random sample of `n` units, selected from a population of `N` units, st all possible distinct subets of size `n` are equally likely to be chosen.

Can obtain sample through either **group** or **sequential** selection – in
practice sequential is much easier to use and gives same inclusion
probabilities.

#### Basic estimators

Horvitz-Thompson estimator `hat tau` of the population total:

`hat tau = sum x_i I(U_i in S_n^**) 1/pi_i = sum_{U_i in S_n^**) x_i/pi_i` `var{hat tau} = var{ sum x_i/pi_i Z_i} = sum ( (1-pi_i)/pi_i) x_i^2 + 2 sum sum ( (pi_{i,j} - pi_i * pi_j) / (pi_i pi_j)) x_i x_j`.### Unequal probability samples

Have so far worked within Laplacian probability, but what if we have a set
of possible samples that should be selected with **unequal** probability?
Generally easier to think about generating equally likely samples with
population units having unequal probability. (Not possible to use material
concept of probability). Often called **restricted randomisation**, eg.
cluster or stratified sampling.

### Inclusion probabilities and linear estimators

Want to be able to connect averaging over samples with expectation over population units.

Given attributs of population units, and a population parameter `theta` to be estimated, a linear estimator with `theta` has the form `hat theta_k = sum B_i x_i I(U in S_{n,k}`. Play central role in survey sampling and many estimators may be written in linear form. Variances for non-linear estimators often derived by forming a Taylor series expansion.

Defined the **inclusion probability** for population unit `U_i` as `pi_i = Pr(U_i in S_n^**`. In general a design unbiased estimator can be found for provided the inclusion probabilities are known.

**Second order probability** = `pi_{i,j} = Pr{ (U_i in S_n^**) nn (U_j in S_n^8*)}`

### Overall generalisation

- For a finite population of units `{U_i : i = 1, ..., N}` define the binary random variables `Z_i = I(U_i in S_n^**}`
- Define inclusion probability as `pi_i = Pr(Z_i =1)` and `pi_{i,j} = Pr{(Z_i = 1) nn (Z_j = 1) }`.
- For a given population consider estimators of the form `hat theta = sum_(i=1)^N beta_i(pi_i) x_i Z_i`

We can then define their properties as follows:

- `E{hat theta} = sum_{i=1}^N beta_i(pi_i) x_i E{Z_i}`
- `Var(hat theta) = sum_{i=1}^N beta_i^2(pi_i) x_i^2 var{Z_i} + 2 sum sum x_i x_j cov{Z_i, Z_j}`.

And under the formulation of estimators

- `E{Z_i} = pi_i`
- `var{Z_i} = pi_i(1 - pi_i)`
- `cov{Z_i, Z_j} = pi_{i,j} - pi_i * pi_j`

For estimations of variances, we need “plug-in” unbiased estimators – will be unbiased if they are a linear combination of quantities that can be estimated without bias.

# Experimental approach

### Nested syllogism of experimentation

A: chance alone is in effect B: design (systematic force) is in effect C: observable implication of chance alone

Either A or B. If A then C. Not C. Therefore not A. Therefore B. (not replaced by exceeding low probability of)

### Randomised treatment assignment

Treatments are randomly allocated to units. This is where probability enters.

### Qualifying differences

- Assume there is no difference. (ie. treatment group is not related to
- attribute)
- Work out how some test statistic distributed in this case.
- How does the actual assignment compare to this distribution? (calculate a
- p-value by counting number greater than or equal to and dividing by total
- number of permutations)

Too hard to work out all possible permutations, so usually just take random sample. Traditional t-tests (etc.) asymptotically based on this principle.

# Families of distributions useful in modelling

## Exponential families

Multiple representations:

- `f(Y | theta) = exp( sum q_j(eta) T_j(Y) ) c(eta) h(Y)`
- `f(Y | theta) = a(theta) t(Y) exp(theta^T t(Y))`
- `f(Y | theta) = exp( sum q_j(eta) T_j(Y) - B(eta)) c(Y)`
- `f(Y | theta) = exp( sum theta_j T_j(Y) - B(theta) + c(Y))` (this is the one we use)

Properties:

- The parameter space (`Theta`) is a convex set. To avoid difficulties will assume that neither `T_j(Y)` nor `theta_j` satisfy a linear constraint. If `Theta` contains an open s-dimensional rectangle then family said to be of full rank or
**regular** - For a minimal, regular family the statistic `T = (T_1, ..., T_s)` is minimal sufficiennt for `theta`
- For an integrable function `h(*)` can exchange integration and differentiation
- `E[T_j(Y) = del/(del theta_j) B(theta`)
- `cov(T_j(Y), T_k(Y)) = del^2/(del theta_j del theta_k) B(theta)`
- the moment generating function of an exponential family is defined to be that for the moments of the `T_j`’s

The parameters `theta_j` are called **canonical** parameters. Often the easiest for derivation of properties, not always the best. Helpful to have other parameterisations with certain desirable properties.

**Mean value parameterisation 1**. Expected value often of interest, but usually none of the canonical parameters correspond to the mean. Can transform `(theta_1, ..., theta_s)` to `(mu = E(Y), theta_1, ..., theta_(s_1))`.

**Mean value parameterisation 2**.In canonical representation, clear relationship between parameters and sufficient statistics, so we could parameterise using expected values of the `T_j`, transforming `(theta_1, ..., theta_s)` to `(mu_1(theta), ..., mu_s(theta))` where `mu_j(theta) = E[T_j(Y)] = del/(del theta_j) B(theta)`. Has the potential advantage that each parmater of the density is the expected value of a random variable associated with an observable quantity.

**Mixed parameterisations**. Also possible to write exponential family in terms of both canoncial and mean value eg. `(mu_1(theta), theta_2)`

Reasons for choosing one parameterisation over another:

- for interpretation, performed after estimation
- for numerical stability
- may connect model with elements of experiment more directly
- make it more clear how to incorporate covariates
- parameters restrictions more appropriate to actual situation

## Exponential dispersion families

Only talk about small subset – essentially one parameter families extended to include additional dispersion parameter (most common for applications). Important family is those where the sufficient statistics is `T(Y) = Y`, called **natural exponential families**. eg. binomial with fixed n, normal with fixed variance. Have the form `f(Y| theta) = exp phi(Y theta - b(theta)) + c(Y phi)` and `E[X] = b'(theta)`, `"Var"[x] = 1/phi b``(theta) = 1/phi V(mu)`.

- Have essentially coerced a two parameter family to look “almost” like a natural exponential family with the addition of a nuisance parameter `phi` called the
**dispersion parameter**which is a scale factor for the variance - Can only write in this form if one of the sufficient statistics is given by the identity function (eg. binomial, Poisson, normal, gamma, inverse Gaussian, also used for glm’s)

If we have more than one (iid) variable then there joint distribution will be `f(Y | theta) = exp( sum theta_j sum T_j(Y_i) - B(theta) + sum c(Y_i))`, which is still in the exponential family form.

## Location-Scale families

Families of distributions formed from classes of transformations are useful, particularly **location-scale transformations**. Let `U` be an rv with distribution `F`, if `U` is transformed into `V = U + mu` then `V` will have distribution `F(Y - mu)`. The set of distributions generated from all `mu in (-oo, oo)` is called a **location family**. Similarly, the transformation `V = sigma U` generates a **scale family** of distributions, and the composition generates `V = mu + sigma U` with distribution `F((Y-mu)/sigma)`.

Location-scale families include double exponential, uniform, logistic and normal.

### Properties

`E(V) = E(U) + mu` `Var(V) = sigma^2 var(U)`## Prominence and limitations of normal distribution

- Can easily be expressed in terms of variance independent parameters
- A `N(mu, sigma^2)` parameterisation gives location and scale directly
- In samples, allows reduction through sufficiency (generally true for exponential, but not location-scale families)

Limitations:

- Is continuous, so not good for discrete
- Has fixed “shape”
- Range is entire real line

# Additive error models

Exemplifies idea of signal + noise. `Y = f(x, beta) + epsi` and `epsi` distributed with some absolutely continuous density function, typically with `E(epsi) = 0`. Modelling task with these models focus on:

- appropriate specification of `f`
- modelling of the variance of the additive errors

## Constant variance models

“Backbone” of statistical modelling. Can use transformation if assumption violated (but often introduces other problems).

## Linear and non-linear models

Define a model as non-linear if at least of the derivatives of `beta` depends on one or more other parameters. Two notions of non-linearity:

- intrinsic curvature: cannot be reparameterised away, reflection of intrinsic curvature of exepectation surface
- parameter effects: if equally spaced points on domain of expectation surface are mapped to inequally spaced points in the range

## Models with known variance parameters

Relax assumption of constant variance. Use weighted regression either with known weights (ie. `1/sigma`) or weights as a function of other parameters, most typically power of the mean model. If we use functional weights, don’t get nice asymptotic theory.

## Models with unknown variance parameters

`Y_p = g_1(x, beta) + sigma g_2(x, beta, z, theta)`Common models:

- `g = sigma(mu(beta))^theta`
- `g = sigma exp(theta mu(beta))`
- `g = theta_0 + theta_1 x + theta_2 x^2`
- `g = theta_0 + theta_1 z + theta_2 x^2`

## Transformaing both sides

Attempt to allow separate adjustment for nonconstant variance, nonsymmetric error distributions and incorrect specification of expectation function. Idea is to transform both sides to help with first two problems without affecting third.

# Models based on response distribution

Greatly influnced by generalised linear models, but they form a small, restricted, subset.

Alternative to thinking about model in terms of signal and noise, is to think about **systematic** and **random** components that can combine in a non-additive way. Systematic equivalent to signal, but random refers to basic distribution of response variables, not additive error. In this type of model, usually first think about type of response distribution if other factors in model held constant.

## Generalised linear models

Class of non-linear models with response distributions from the exponential family. Don’t have to be iid but can only differ through their natural parameters (`theta_i`) but not dispersion parameter (a constant `phi`).

`f(y_i | theta_i) = exp( phi(y_i theta_i 0 b(theta_i)) + c(y_i, phi))` – gives random component. Systematic component consists of**link**and

**linear predictor**(usually a typical linear model). The link is defined as `g(mu_i) = eta_i`, `g` monotonic. Special set of link functions called

**canonical**`g(mu_i) = b'^(-1)(mu_i) = theta_i`

- Normal: `g(mu_i) = mu_i`
- Poisson: `g(mu_i) = log(mu_i)`
- Binomial: `g(mu_i) = log(mu_i / (1-mu_i))`
- Gamma: `g(mu_i) = 1/mu_i`
- Inversion gaussian: `g(mu_i) = 1/mu_i^2`

Nice properties, but not especially wonderful. Most important thing is that link function maps `RR` to appropriate range.

- Log link
- Power link `g(mu_i) = mu_i^lambda`
- Complimentary log-log link: `log(1-log(1-mu_i))`

Other important aspect is the variance function `V(mu_i)` which is proportional to the variance of the response (constant of proportionality = `1/phi`). Not open to specification, but determined by random component.

- normal: 1
- poission: `mu_i`
- binomial: `mu_i(1-mu_i)`
- gamma: `mu_i^2`

# Least squares estimation

At heart, a geometric problem – want to minimise `(y - X beta)^T A (y- X beta)`, ie. when `beta = beta^** = (X^T A X)^(-1) X^T A y`.

- used almost exclusively with additive error models

For OLS, where `Y = X beta + sigma epsi`, `hat beta = (X^T X)^(-1) X^T Y`. Attach statistical properties to `hat beta` through **Gauss-Markov theorem** which states that `hat beta` is UMVU among all estimators that are linear functions of `Y`. `cov(hat beta) = (X^T X)^(-1) sigma^2`

For WLS, `Y_i = X beta + (sigma / sqrt(W)) epsi`, which leads to `hat beta = (X^T WX)^(-1) X^T W Y` and `cov(hat beta) = (X^T W X)^(-1) sigma^2`

For GLS, Gauss-Markov non-longer applies but we can use an iterative method to get similar results:

- Calculate initial estimates `beta^((0))`
- Calculate `n xx n` diagonal weight matrix `W^((j))`, `w_i(beta^((j))) = g^2_2(x^T_i beta^((j)), theta)`
- Calculate `V^((j))` matrix, `V^((j))_(i,k) = {: del/(del beta_k) g_1(x_i, beta) |_(beta = beta^((j))`
- Calculate `Y^((j))_i = Y_i - g_1(x_i, beta^((j)))`
- Calculate step `delta^((j)) = (V^T W V)^(-1) V^T W Y` and update parameters `beta = beta + delta`

Attach statistical properties through **fundamental theorem of generalised least squares** which states that if `beta^((0))` is `n^(1/2)`-consistent then `beta^((j))` will converge to true values, with asymptotically normal distributions. `hat sigma^2 = 1/(n-p) sum ( (Y_i - g_1(x_i, hat beta)) / (g_2(x_i, hat beta, theta)))^2` and `hat cov(hat beta) = hat sigma^2/n [ 1/n sum (v(x_i, hat beta) v(x_i, hat beta)^T )/(g^2_2(x_i, hat beta, theta)]^(-1)`.

# Maximum likelihood estimation

## Basic likelihood estimation and inference

`bbY = (Y_1, Y_2, ..., Y_n)^T` vector of iid rv with possible values in `Omega_1, ..., Omega_n`, and assume that `bbY in Omega = Omega_1 xx ... xx Omega_n` (positivity condition). `bbtheta = (theta_1, ..., theta_p)^T` vector of parameters st `theta in Theta in RR^p`, `p < n`.Likelihood function `l_n (theta) = prod f_i(y_i | theta)`. **Maximum likelihood estimator** of `theta` is `hat theta in Theta` st ` l_n(hat theta) >= l_n(theta) AA theta in Theta`. Typically solved by taking logs, then derivatives and finding roots.

Let `P_theta` be distribution of rv indexed by parameter `theta`. Suppose for `theta in Theta`

- `P_theta` have common support
- random variables are iid with common density function `f(y_i | theta)`
- true value of `theta`, `theta_0` lies in interior of `Theta`

Then as `n -> oo` `P[ prod f(y_i | theta_0) > prod f(y_i | theta) ] -> 1`. Provides connection between ML estimate and true value.

If independent and iid:

- `l_n(theta) = prod f(y_i | theta)`
- `L_n(theta) = sum log(f(y_i | theta))`
- `U_(n,k)(theta) = sum 1/(f(y_i|theta)) (grad/(grad theta_k)) f(y_i | theta)` (score function)
- `I_(n,j,k)(theta) = nE[grad/(grad theta_k) log(f(y_i|theta)) grad/(grad theta_j) log(f(y_i | theta))]` (information matrix)

## Properties of estimators

Developed under sets of technical conditions called **regularity conditions**. Many different sets from which we can derive a range of properties. Will focus on two sets, the first guarantees a consistent estimator of `theta`, and the second provides asymptotic normality.

### Regularity conditions set 1

- Distributions of `Y_1, ..., Y_n` are distinct and have common support
- True value lies in interior of open interval contained within parameter space
- For almost all `y` the density function is differentiable with respect to all elements of `theta`

Corollary 1: If the parameter space is finite, then there is a sequence of consistent unique ML estimates Corollary 2: If the likelihood has a unique root for each `n` then that sequence of estimators is consistent.

There are four basic (essentially indepdent) things we want to happen:

- existence of mle (or sequence)
- existence of roots of the likelihood equation
- uniqueness of estimators
- consistency of sequences of estimators

### Regularly conditions set 2

- `E[grad/(grad theta_k) log(f(Y|theta))] = 0`
- `I_(j,k)(theta) = -E[ grad^2/(grad theta_k grad theta_j) log(f(Y|theta))]`
- the information matrix is positive definite
- `I_n(theta) -> oo`

Can replace 1 and 2 with being under to exchange differentiate under integral.

`=>` there exists a sequence of solutions to the likelihood equations st:- `hat theta_n` is consistent for `theta`
- `n^(1/2)(hat theta_n - theta)` is asymptotically normal with mean `bb0` and covariance `nI_n^(-1)(bb theta)`.
- `hat theta_(n,k)` is asymptotically efficient

Two additional properties of MLEs are useful:

- if a given scalar parameter `theta` has a single sufficient statistic, then the MLE must be a function of that statistic. If the statistic is minimal and complete, then the MLE is unique; if the MLE is unbiased, then it is the UMVU
**invariance**: the MLE of a function of parameters, is the function of the MLEs of the parmeters

## Wald theory inference

`(hat theta_n - theta)^T I_n(hat theta_n) (hat theta_n - theta) -> Chi^2_p`Let `b(theta) = (R_1(theta), ..., R_r(theta))^T` be an `r xx 1` matrix of defined restrictions on model parameters. Let `C(theta) = (c)_(k,j) = grad/(grad theta_k) R_j(theta)`. Then `W_n = b^T (C I_n^(-1) C^T) b -> X^2_r`.

## Likelihood inference

Let `dim{Theta} = p` and `dim{Theta_0} = r` and `hat theta_n = spr_(theta in Theta) L_n(theta)`, `bar theta_n = spr_(theta in Theta_0) L_n(theta)`, then `T_n = -2(L_n(bar theta_n) - L_n(hat theta_n)) -> Chi^2_(p-r)`.