Survey sampling

In the context of a basic sampling approach, statistics computed over the entire population are considered to be parameters (e.g. mean, total, variance, proportion belonging to some class). We are interested in estimating these parameters because it is not normally feasible to measure every unit in the population. We will only consider sampling error, i.e. error that arises because we don’t have complete coverage.

Simple random sampling

A simple random sample of `n` units, selected from a population of `N` units, is one chosen such that all possible distinct subsets of size `n` are equally likely to be chosen.

The sample can be obtained through either group or sequential selection; in practice sequential selection is much easier to use and gives the same inclusion probabilities.
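The two selection schemes can be sketched as follows. This is a minimal illustration under one reading of the terms: "group" selection picks one of the possible subsets directly, and "sequential" selection draws units one at a time without replacement, each remaining unit being equally likely. The function names `sequential_srs` and `group_srs` are hypothetical and not part of the notes.

```python
import random
from itertools import combinations


def sequential_srs(units, n, rng=None):
    """Draw n units one at a time, without replacement.

    At each step every remaining unit is equally likely to be drawn.
    """
    rng = rng or random.Random()
    remaining = list(units)
    if not 0 <= n <= len(remaining):
        raise ValueError("n must be between 0 and the population size")
    sample = []
    for _ in range(n):
        k = rng.randrange(len(remaining))  # uniform index into remaining units
        sample.append(remaining.pop(k))
    return sample


def group_srs(units, n, rng=None):
    """Pick one of the distinct size-n subsets directly, all equally likely.

    Enumerates every subset, so it is only practical for small populations.
    """
    rng = rng or random.Random()
    subsets = list(combinations(units, n))
    return list(rng.choice(subsets))


rng = random.Random(2024)  # seeded only so the demonstration is repeatable
print(sequential_srs(range(10), 3, rng))
print(group_srs(range(10), 3, rng))
```

Under sequential selection each ordered sequence of `n` draws has probability `1/(N(N-1)...(N-n+1))`, and each subset corresponds to `n!` orderings, so every subset of size `n` is equally likely, as in group selection.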

Basic estimators

Horvitz-Thompson estimator `hat tau` of the population total:

`hat tau = sum_{i=1}^N x_i I(U_i in S_n^**) 1/pi_i = sum_{U_i in S_n^**} x_i/pi_i`

Writing `Z_i = I(U_i in S_n^**)`, its variance is

`var{hat tau} = var{sum_{i=1}^N x_i/pi_i Z_i} = sum_{i=1}^N ((1-pi_i)/pi_i) x_i^2 + 2 sum_{i=1}^{N-1} sum_{j=i+1}^N ((pi_{i,j} - pi_i * pi_j) / (pi_i pi_j)) x_i x_j`.
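As a numerical check of the estimator and its variance, the sketch below enumerates every simple random sample from a small population. The attribute values in `x` are made up, and it assumes the standard simple-random-sampling inclusion probabilities `pi_i = n/N` and `pi_{i,j} = n(n-1)/(N(N-1))`. Exact rational arithmetic is used so the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations

x = [3, 7, 8, 12, 20]  # hypothetical attribute values, one per population unit
N, n = len(x), 2

# Inclusion probabilities under simple random sampling (assumed design).
pi = Fraction(n, N)
pi2 = Fraction(n * (n - 1), N * (N - 1))


def ht_total(sample):
    """Horvitz-Thompson estimate of the population total from one sample."""
    return sum(Fraction(x[i]) / pi for i in sample)


# Every possible sample is equally likely, so averaging over the list of
# samples gives exact expectations.
samples = list(combinations(range(N), n))
estimates = [ht_total(s) for s in samples]
mean_est = sum(estimates) / len(samples)
var_enum = sum((e - mean_est) ** 2 for e in estimates) / len(samples)

# Variance from the closed-form expression in terms of pi_i and pi_{i,j}.
var_formula = sum((1 - pi) / pi * x[i] ** 2 for i in range(N)) + 2 * sum(
    (pi2 - pi * pi) / (pi * pi) * x[i] * x[j]
    for i, j in combinations(range(N), 2)
)

print(mean_est, sum(x))       # expectation of the estimator vs. true total
print(var_enum, var_formula)  # enumerated variance vs. formula
```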

Unequal probability samples

So far we have worked within Laplacian probability, but what if we have a set of possible samples that should be selected with unequal probability? It is generally easier to think about generating equally likely samples in which population units have unequal probabilities of inclusion. (It is not possible to use the material concept of probability.) This is often called restricted randomisation, e.g. cluster or stratified sampling.

Inclusion probabilities and linear estimators

Want to be able to connect averaging over samples with expectation over population units.

Given attributes of population units, and a population parameter `theta` to be estimated, a linear estimator of `theta` has the form `hat theta_k = sum_{i=1}^N beta_i x_i I(U_i in S_{n,k})`. Linear estimators play a central role in survey sampling, and many estimators may be written in linear form. Variances for non-linear estimators are often derived by forming a Taylor series expansion.

Define the inclusion probability for population unit `U_i` as `pi_i = Pr(U_i in S_n^**)`. In general, a design-unbiased estimator can be found provided the inclusion probabilities are known.

Second-order inclusion probability: `pi_{i,j} = Pr{ (U_i in S_n^**) nn (U_j in S_n^**) }`
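For a restricted design, both kinds of inclusion probability can be computed by summing the selection probabilities of the samples that contain the unit (or pair of units). The sketch below assumes a made-up design in which only four of the six possible samples of size 2 are allowed, each equally likely. The function name `inclusion_probabilities` and the `design` dictionary are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probabilities(design, N):
    """Return first- and second-order inclusion probabilities.

    design: dict mapping a sample (tuple of unit indices) to its
            selection probability; probabilities should sum to 1.
    """
    pi = [Fraction(0)] * N
    pi2 = {pair: Fraction(0) for pair in combinations(range(N), 2)}
    for sample, p in design.items():
        for i in sample:
            pi[i] += p  # unit i is included whenever this sample is drawn
        for pair in combinations(sorted(sample), 2):
            pi2[pair] += p  # both units of the pair are included together
    return pi, pi2


# Hypothetical restricted design: equally likely samples, unequal unit
# inclusion probabilities.
allowed = [(0, 1), (0, 2), (1, 2), (0, 3)]
design = {s: Fraction(1, len(allowed)) for s in allowed}

pi, pi2 = inclusion_probabilities(design, N=4)
print(pi)
print(pi2)
```

In this design units 1 and 3 never appear together, so their second-order inclusion probability is zero even though both first-order probabilities are positive.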

Overall generalisation

For a finite population of units `{U_i : i = 1, ..., N}` define the binary random variables `Z_i = I(U_i in S_n^**)`.
Define the inclusion probabilities as `pi_i = Pr(Z_i = 1)` and `pi_{i,j} = Pr{(Z_i = 1) nn (Z_j = 1)}`.
For a given population consider estimators of the form `hat theta = sum_{i=1}^N beta_i(pi_i) x_i Z_i`.

We can then define their properties as follows:

`E{hat theta} = sum_{i=1}^N beta_i(pi_i) x_i E{Z_i}`
`var{hat theta} = sum_{i=1}^N beta_i^2(pi_i) x_i^2 var{Z_i} + 2 sum_{i=1}^{N-1} sum_{j=i+1}^N beta_i(pi_i) beta_j(pi_j) x_i x_j cov{Z_i, Z_j}`.

Under this formulation, the indicator variables `Z_i` have moments

`E{Z_i} = pi_i`
`var{Z_i} = pi_i(1 - pi_i)`
`cov{Z_i, Z_j} = pi_{i,j} - pi_i * pi_j`
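These moment expressions can be checked against direct enumeration for a small unequal-probability design. The sketch below uses made-up attribute values and a made-up set of four equally likely samples. It takes `beta_i(pi_i) = 1/pi_i` as one possible choice of weights, and weights each covariance term by `beta_i beta_j x_i x_j`.

```python
from fractions import Fraction
from itertools import combinations

x = [4, 10, 6, 8]                            # hypothetical attribute values
allowed = [(0, 1), (0, 2), (1, 2), (0, 3)]   # hypothetical restricted design
p = Fraction(1, len(allowed))                # each allowed sample equally likely
N = len(x)
pairs = list(combinations(range(N), 2))

# Inclusion probabilities: pi_i = Pr(Z_i = 1), pi_ij = Pr(Z_i = 1 and Z_j = 1).
pi = [sum(p for s in allowed if i in s) for i in range(N)]
pi2 = {(i, j): sum(p for s in allowed if i in s and j in s) for i, j in pairs}

beta = [1 / pi_i for pi_i in pi]  # assumed weights beta_i(pi_i) = 1 / pi_i

# Moments of the indicators Z_i.
mean_z = pi
var_z = [pi_i * (1 - pi_i) for pi_i in pi]
cov_z = {(i, j): pi2[(i, j)] - pi[i] * pi[j] for i, j in pairs}

# Moments of the linear estimator from the indicator moments.
e_formula = sum(beta[i] * x[i] * mean_z[i] for i in range(N))
v_formula = sum(beta[i] ** 2 * x[i] ** 2 * var_z[i] for i in range(N)) + 2 * sum(
    beta[i] * beta[j] * x[i] * x[j] * cov_z[(i, j)] for i, j in pairs
)

# The same moments by averaging the estimator over the possible samples.
estimates = [sum(beta[i] * x[i] for i in s) for s in allowed]
e_direct = sum(p * e for e in estimates)
v_direct = sum(p * (e - e_direct) ** 2 for e in estimates)

print(e_formula, e_direct, sum(x))
print(v_formula, v_direct)
```

With weights `1/pi_i` the expectation equals the population total, which is the Horvitz-Thompson case; other choices of `beta_i(pi_i)` can be substituted in the `beta` list.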

To estimate variances, we need “plug-in” unbiased estimators: a variance estimator will be unbiased if it is a linear combination of quantities that can each be estimated without bias.
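One way to illustrate the plug-in idea is the standard Horvitz-Thompson variance estimator, which the notes do not state explicitly and which is assumed here. Each single-unit term of the variance is divided by `pi_i` and each pair term by `pi_{i,j}`, then summed over the sample only, which requires every `pi_{i,j}` to be positive. The sketch checks unbiasedness by enumerating all simple random samples from a made-up population.

```python
from fractions import Fraction
from itertools import combinations

x = [3, 7, 8, 12, 20]  # hypothetical attribute values
N, n = len(x), 2

# Inclusion probabilities under simple random sampling (assumed design).
pi = Fraction(n, N)
pi2 = Fraction(n * (n - 1), N * (N - 1))


def true_variance():
    """Variance of the Horvitz-Thompson total, summed over the population."""
    single = sum((1 - pi) / pi * x[i] ** 2 for i in range(N))
    cross = sum(
        (pi2 - pi * pi) / (pi * pi) * x[i] * x[j]
        for i, j in combinations(range(N), 2)
    )
    return single + 2 * cross


def variance_estimate(sample):
    """Plug-in estimate: the same terms, restricted to sampled units and
    inverse-weighted by the probability that each term is observed."""
    single = sum((1 - pi) / pi * x[i] ** 2 / pi for i in sample)
    cross = sum(
        (pi2 - pi * pi) / (pi * pi) * x[i] * x[j] / pi2
        for i, j in combinations(sample, 2)
    )
    return single + 2 * cross


samples = list(combinations(range(N), n))
expected_vhat = sum(variance_estimate(s) for s in samples) / len(samples)

print(expected_vhat, true_variance())
```

The estimate from an individual sample can be negative; unbiasedness only says that the average over all possible samples matches the true variance.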