ANOVA model
Assumptions:
- interested in fluorescence intensities from the spots, denoted $X_ijklm$
- since they are $\ge 0$ , let $Y_ijklm = log(X_ijklm)$ , if $X_ijklm = 0$ , then $Y_ijklm = 0$
- array, dye, treatment and gene are additive on $Y_ijklm$
- some random error $\epsilon_ijklm$ associated with each data point
We can formulate this as an ANOVA model:
\[ Y_ijklm = \mu + A_i + D_j + T_k + G_l + AT_ij + AG_il + DG_jl + TG_kl + \epsilon_ijklm \]However, $AT$ is aliased with $D$ : if an $AT$ interaction effect exists it will look the same as a dye effect. We need to consider which of these is most likely, possibly using a control array to detect if a AT interaction exists. We can write the (full 3 factor) model as:
\[ Y_ijklm = \mu + A_i + T_j + G_k + AT_ij + AG_ik + TG_kl + ATG_ijk + \epsilon_ijklm \]where $i = 1, \cdots, a$ , $j = 1, \cdots, t$ , $k = 1, \cdots, g$ , $l = 1, \cdots, r$
Degrees of freedom
In the dye swap $a = t = 2$ , so we have $n = atgr = 4gr$ data points, giving $4gr$ degrees of freedom for model fitting. Model takes $atg = 4g$ df leaving $4g(r-1)$ df for estimating variability.
Replication is essential! Without it we can not estimate variability, without sacrificing some of our model (eg. $ATG$ or $AG$ interaction terms)
Parameter estimation
You can’t just use a standard ANOVA tool because they typically invert the $X^{T}X$ matrix, which will be too large when it involves thousands of genes.
Using ANOVA formulae
Can use traditional ANOVA formal to estimate parameters, fit model, calculate sums of squares and estimate the error term. (Calculating interaction terms kind of like doing set addition, or calculating determinants)
Two-stage model
- Global effects: $Y_ijkl = \mu + A_i + T_j + AT_ij + \delta_ijkl$
- Gene specific effects: $\delta_ijkl = G_k + AG_ik + TG_jk + ATG_ijk + \epsilon_ijkl$
Determining differential expression
Average treatment effect over all genes is measured by $T$ , treatment specific effect beyond average by $TG$ .
Hypothesis tests:
- $H_{0k} = (T_1 + TG_{1k}) - (T_2 + TG_{2k})$
- $H_{0k} = TG_{1k} - TG_{2k}$
These ask quite different questions. In experiments with widespread change the $T_j$ terms will be non-zero, and we need to include these terms in our hypothesis test. (Note that $(\hat{T}_1 + \hat{TG}_{1k}) - (\hat{T}_2 - \hat{TG}_{2k}) = \bar{Y}_{.1k.} - \bar{Y}_{.2k.}$ , an intuitive estimate).
If we assume the errors are normally distributed we can use a t-test, with test statistic $t_g = \frac{\hat{\mu}_1 - \hat{\mu}_2}{SE(\hat{\mu}_1 - \hat{\mu}_2)}$ .
Standard errors:
- $SE((T_1 + TG_{1k}) - (T_2 + TG_{2k})) = \frac{\sigma}{\sqrt{r}}$
- $SE(TG_{1k} - TG_{2k}) = \frac{\sigma}{\sqrt{r}}\sqrt{1-\frac{1}{g}}$
Normality assumption
We have assumed that the residuals are normally distributed. What if they aren’t? Non-parametric tests or resampling based methods.
Bootstrapping: repeatedly sample residual with replacement to form test distribution. Permutation tests: generate all samples without replacement to form test distribution.
Generally not very useful with microarrays because we don’t have much information about the distribution shape (especially in the tails), and we tend to get more liberal p-values. With such small samples we need the extra “power” of parametric assumptions.