ANOVA model

Assumptions:

interested in fluorescence intensities from the spots, denoted $X_ijklm$
since they are $\ge 0$ , let $Y_ijklm = log(X_ijklm)$ , if $X_ijklm = 0$ , then $Y_ijklm = 0$
array, dye, treatment and gene are additive on $Y_ijklm$
some random error $\epsilon_ijklm$ associated with each data point

We can formulate this as an ANOVA model:

\[ Y_ijklm = \mu + A_i + D_j + T_k + G_l + AT_ij + AG_il + DG_jl + TG_kl + \epsilon_ijklm \]

However, $AT$ is aliased with $D$ : if an $AT$ interaction effect exists it will look the same as a dye effect. We need to consider which of these is most likely, possibly using a control array to detect if a AT interaction exists. We can write the (full 3 factor) model as:

\[ Y_ijklm = \mu + A_i + T_j + G_k + AT_ij + AG_ik + TG_kl + ATG_ijk + \epsilon_ijklm \]

where $i = 1, \cdots, a$ , $j = 1, \cdots, t$ , $k = 1, \cdots, g$ , $l = 1, \cdots, r$

Degrees of freedom

In the dye swap $a = t = 2$ , so we have $n = atgr = 4gr$ data points, giving $4gr$ degrees of freedom for model fitting. Model takes $atg = 4g$ df leaving $4g(r-1)$ df for estimating variability.

Replication is essential! Without it we can not estimate variability, without sacrificing some of our model (eg. $ATG$ or $AG$ interaction terms)

Parameter estimation

You can’t just use a standard ANOVA tool because they typically invert the $X^{T}X$ matrix, which will be too large when it involves thousands of genes.

Using ANOVA formulae

Can use traditional ANOVA formal to estimate parameters, fit model, calculate sums of squares and estimate the error term. (Calculating interaction terms kind of like doing set addition, or calculating determinants)

Two-stage model

Global effects: $Y_ijkl = \mu + A_i + T_j + AT_ij + \delta_ijkl$
Gene specific effects: $\delta_ijkl = G_k + AG_ik + TG_jk + ATG_ijk + \epsilon_ijkl$

Determining differential expression

Average treatment effect over all genes is measured by $T$ , treatment specific effect beyond average by $TG$ .

Hypothesis tests:

$H_{0k} = (T_1 + TG_{1k}) - (T_2 + TG_{2k})$
$H_{0k} = TG_{1k} - TG_{2k}$

These ask quite different questions. In experiments with widespread change the $T_j$ terms will be non-zero, and we need to include these terms in our hypothesis test. (Note that $(\hat{T}_1 + \hat{TG}_{1k}) - (\hat{T}_2 - \hat{TG}_{2k}) = \bar{Y}_{.1k.} - \bar{Y}_{.2k.}$ , an intuitive estimate).

If we assume the errors are normally distributed we can use a t-test, with test statistic $t_g = \frac{\hat{\mu}_1 - \hat{\mu}_2}{SE(\hat{\mu}_1 - \hat{\mu}_2)}$ .

Standard errors:

$SE((T_1 + TG_{1k}) - (T_2 + TG_{2k})) = \frac{\sigma}{\sqrt{r}}$
$SE(TG_{1k} - TG_{2k}) = \frac{\sigma}{\sqrt{r}}\sqrt{1-\frac{1}{g}}$

$\sigma$ estimates will be either common or per gene as determined by the ANOVA model. Degrees of freedom will be $atg(r-1) \approx \infinity$ with common variance, or $at(r-1)$ with per gene variances.

Normality assumption

We have assumed that the residuals are normally distributed. What if they aren’t? Non-parametric tests or resampling based methods.

Bootstrapping: repeatedly sample residual with replacement to form test distribution. Permutation tests: generate all samples without replacement to form test distribution.

Generally not very useful with microarrays because we don’t have much information about the distribution shape (especially in the tails), and we tend to get more liberal p-values. With such small samples we need the extra “power” of parametric assumptions.