Cluster sampling

SRS and stratified sampling both need a list of all experimental units, and if the units have to be visited, data collection can be expensive. Cluster sampling reduces both problems by sampling only whole clusters of the population. It is cheaper, but gives higher standard errors for the same sample size (although usually lower for the same cost).

Basic results of cluster sampling

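$N$ (possibly unknown) experimental units are grouped in $C$ (known) clusters of size $M_i$, and a simple random sample of $c$ clusters is taken, so the sampling fraction is $f_1 = c/C$. Sums over $C$ run over all clusters in the population, sums over $c$ over the sampled clusters. The population mean is

\[ \mu_Y = \frac{\sum_C M_i \bar{Y}_i}{\sum_C M_i} = \frac{\sum_C T_i}{\sum_C M_i} \]

with the obvious estimator $\bar{y} = \frac{\sum_c m_i \bar{y}_i}{\sum_c m_i}$. Its variance is approximately

\[ Var(\bar{y}) = \frac{1- f_1}{c\bar{M}^2} \frac{\sum_C M_i^2 (\bar{Y}_i - \mu_Y)^2}{C - 1} \]

which can be estimated with $\frac{1-f_1}{c\bar{m}^2}s_r^2$, where $s_r^2 = \frac{\sum_c m_i^2 (\bar{y}_i - \bar{y})^2}{c-1}$ (using $\bar{M} = N / C$ in place of $\bar{m}$ if $N$ is known). Remember that t-values for confidence intervals depend on the number of sampled clusters $c$ (degrees of freedom $c - 1$), not on the number of units.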
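The estimator and its estimated standard error can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple random sample of clusters with $f_1 = c/C$; the function name `cluster_ratio_mean` and its arguments are hypothetical and not part of the notes.

```python
import math

def cluster_ratio_mean(sizes, means, C, N=None):
    """Estimate mu_Y and its standard error from a sample of clusters.

    sizes: sizes m_i of the c sampled clusters
    means: means ybar_i of the c sampled clusters
    C:     number of clusters in the population
    N:     population size, if known (then Mbar = N / C replaces mbar)
    """
    c = len(sizes)
    # ybar = sum(m_i * ybar_i) / sum(m_i)
    ybar = sum(m * y for m, y in zip(sizes, means)) / sum(sizes)
    f1 = c / C  # assumed definition of the cluster sampling fraction
    mbar = N / C if N is not None else sum(sizes) / c
    # s_r^2 = sum(m_i^2 * (ybar_i - ybar)^2) / (c - 1)
    s_r2 = sum(m**2 * (y - ybar) ** 2 for m, y in zip(sizes, means)) / (c - 1)
    se = math.sqrt((1 - f1) / c * s_r2) / mbar
    return ybar, se

# Made-up numbers: 3 clusters sampled out of 30.
ybar, se = cluster_ratio_mean([10, 20, 30], [2.0, 3.0, 4.0], C=30)
print(ybar, se)
```

A confidence interval would then use a t-value with $c - 1$ degrees of freedom.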

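To estimate the population total $\tau$ we can treat clusters as sampling units and cluster totals as unit values, giving $\hat{\tau} = C\bar{m}\bar{y}$, with standard error $C\sqrt{\frac{1-f_1}{c}}s_t$, where $s_t$ is the sample standard deviation of the cluster totals.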
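Since $\bar{m}\bar{y}$ is just the mean of the sampled cluster totals, the total estimate is the usual SRS expansion estimator applied to clusters. A short sketch, assuming $f_1 = c/C$ and that $s_t$ is the sample standard deviation of the cluster totals (the name `cluster_total` is hypothetical):

```python
import math
import statistics

def cluster_total(totals, C):
    """Estimate the population total from the totals of c sampled clusters."""
    c = len(totals)
    f1 = c / C                              # assumed cluster sampling fraction
    tau_hat = C * statistics.mean(totals)   # C * mbar * ybar
    s_t = statistics.stdev(totals)          # sample sd of cluster totals
    se = C * math.sqrt((1 - f1) / c) * s_t
    return tau_hat, se

# Made-up cluster totals: 3 clusters sampled out of 30.
tau_hat, se = cluster_total([20, 60, 120], C=30)
print(tau_hat, se)
```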

Sampling with probability proportional to size (PPS)

Another way to estimate $\mu_Y$ is to select clusters with probability proportional to size. Then $\bar{y}_{pps} = (\bar{y}_1 + ... + \bar{y}_c) / c$ is an unbiased estimator of $\mu_Y$. The variance is harder to calculate, but if the sampling fraction ($f_1$) is small we can pretend the clusters are drawn independently (with replacement), giving

\[ \hat{V}_{pps}(\bar{y}_{pps}) = \frac{1}{c}\left( \sum_{i=1}^c \frac{(\bar{y}_i - \bar{y}_{pps})^2}{c-1} \right) \]
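A minimal sketch of PPS selection and the resulting estimate, under the with-replacement approximation described above. The function names are hypothetical; `random.choices` is used because it draws with replacement according to the given weights.

```python
import math
import random
import statistics

def pps_sample(sizes, c, rng):
    """Draw c cluster indices with replacement, probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=c)

def pps_mean(sample_means):
    """Unweighted mean of the sampled cluster means and its standard error."""
    c = len(sample_means)
    ybar_pps = statistics.mean(sample_means)
    # V_hat = (1/c) * sum((ybar_i - ybar_pps)^2) / (c - 1), i.e. s^2 / c
    se = statistics.stdev(sample_means) / math.sqrt(c)
    return ybar_pps, se

# Made-up population of 6 clusters with known sizes and (normally unknown) means.
sizes = [10, 40, 25, 5, 60, 20]
cluster_means = [2.1, 3.4, 2.9, 1.8, 3.9, 2.5]
idx = pps_sample(sizes, c=3, rng=random.Random(0))
print(idx, pps_mean([cluster_means[i] for i in idx]))
```

Note that the cluster sizes enter only through the selection step; the estimate itself is an unweighted average.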

In general, precision is high when cluster means are similar. The PPS strategy tends to have a slight edge over SRS in practice.

Special case: Equal cluster sizes

Both estimators reduce to the same formula for the standard error, i.e. $se(\bar{y}) = \sqrt{\frac{1-f_1}{c}}s_1$, where $s_1$ is the standard deviation of the cluster means.

Special case: Estimating proportions

The general formulae for the estimator and its standard error do not simplify much when estimating a population proportion.

\[ p = \frac{\sum_c m_i p_i}{\sum_c m_i} \] \[ se(p) = \sqrt{\frac{1-f_1}{c}} \frac{s_r}{\bar{m}} \]

where

\[ s_r^2 = \sum_c m_i^2 \frac{(p_i - p)^2}{c-1} \]
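The proportion formulas are the mean formulas with $\bar{y}_i$ replaced by $p_i$. A small sketch, assuming $f_1 = c/C$ and that each sampled cluster reports its size and its count of units with the attribute (the name `cluster_proportion` is hypothetical):

```python
import math

def cluster_proportion(sizes, counts, C):
    """Estimate a population proportion from c sampled clusters.

    sizes:  sizes m_i of the sampled clusters
    counts: number of units with the attribute in each sampled cluster
    C:      number of clusters in the population
    """
    c = len(sizes)
    p = sum(counts) / sum(sizes)            # sum(m_i p_i) / sum(m_i)
    props = [a / m for a, m in zip(counts, sizes)]
    mbar = sum(sizes) / c
    f1 = c / C                              # assumed cluster sampling fraction
    s_r2 = sum(m**2 * (pi - p) ** 2 for m, pi in zip(sizes, props)) / (c - 1)
    se = math.sqrt((1 - f1) / c) * math.sqrt(s_r2) / mbar
    return p, se

# Made-up data: 2 clusters of 10 units sampled out of 20 clusters.
print(cluster_proportion([10, 10], [2, 6], C=20))
```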

The PPS estimator remains nearly unchanged:

\[ \bar{\pi}_{pps} = \frac{\sum_c p_i}{c} \] \[ se(\bar{\pi}_{pps}) = \frac{s_p}{\sqrt{c}} \]

where $s_p^2$ is the variance of the cluster proportions.

As above, both reduce to the same formula in the case of equal cluster sizes.

Special case: Systematic sampling

Choose a unit at random from the first $k$ units, and then take every $k$-th unit after that. This can be thought of as a type of cluster sampling in which the clusters are the partition of the unit indices mod $k$, and we select one cluster at random. Systematic sampling works well if a trend is present (built-in stratification effect) and for time series output, but badly for periodic data when the sampling interval is a multiple of the period. It is operationally easy.
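The selection rule, and the failure on periodic data, can be illustrated with a short sketch. The function name `systematic_sample` is hypothetical, and units are indexed from 0 for convenience.

```python
import random

def systematic_sample(n_units, k, rng):
    """Return the indices of a 1-in-k systematic sample of n_units units."""
    start = rng.randrange(k)            # random start among the first k units
    return list(range(start, n_units, k))

rng = random.Random(1)
idx = systematic_sample(20, 5, rng)
print(idx)

# Periodic data whose period equals the sampling interval: every sampled
# value is identical, so the sample says nothing about the variation.
periodic = [i % 5 for i in range(20)]
print([periodic[i] for i in idx])
```

All sampled indices share the same remainder mod $k$, which is exactly the cluster view described above.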

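How to get the variance? A single systematic sample is a sample of one cluster, so the cluster formulae above (which divide by $c - 1$) cannot be used directly. Options: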

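Take more than one systematic sample, each with its own random start, and use the variation between the sample means.
Use the SRS formula (tends to overestimate).
Post-stratify (tends to underestimate).
Build a model for $Y$ as a function of $i$ and use it to suggest a variance.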
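The first option can be sketched by treating each systematic sample as one cluster and applying the equal-cluster-size formula $\sqrt{\frac{1-f_1}{c}}s_1$. This is an illustration under stated assumptions, not a prescribed method: the number of units is taken to be a multiple of $k$ (so all clusters have equal size), starts are drawn without replacement, and the function name is hypothetical.

```python
import math
import random
import statistics

def replicated_systematic_mean(values, k, n_reps, rng):
    """Mean and standard error from n_reps (>= 2) systematic samples."""
    starts = rng.sample(range(k), n_reps)        # distinct random starts
    means = [statistics.mean(values[s::k]) for s in starts]
    estimate = statistics.mean(means)
    f1 = n_reps / k                              # fraction of the k clusters sampled
    se = math.sqrt((1 - f1) / n_reps) * statistics.stdev(means)
    return estimate, se

# Made-up data with a linear trend: 40 units, interval k = 8, 3 replicates.
values = [0.5 * i for i in range(40)]
print(replicated_systematic_mean(values, k=8, n_reps=3, rng=random.Random(2)))
```

To keep the overall sample size fixed, each replicate would use a correspondingly larger interval $k$ than a single systematic sample.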