Experimental process
Data
- GenePix files have a .gpr extension
- 8 data points recorded for each spot (mean & median, foreground & background, red & green)
- usually use median as less affected by outliers
Goal of experiment is to discover genes undergoing differential expression, ie. different intensities between treatments.
Background correction
Foreground intensity consists of both signal and noise. Background correction attempts to remove the noise by subtracting the background intensity.
Cons:
- sometimes background intensities higher than foreground (“black hole effect”)
- often very little background noise spread uniformly
- binding characteristics of background (ie. blank slide) are quite different to those of foreground (ie. DNA)
Pros:
- can help to remove spatial effects
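Background correction is just a per-spot subtraction; a toy sketch in Python (made-up intensities) shows how the black hole effect arises:

```python
import numpy as np

# Median foreground and background intensities for three spots (toy values)
fg = np.array([5200.0, 880.0, 150.0])  # foreground medians
bg = np.array([120.0, 95.0, 210.0])    # background medians

# Background correction: subtract background from foreground
corrected = fg - bg
print(corrected.tolist())  # → [5080.0, 785.0, -60.0]
```

The third spot has a higher background than foreground, so correction produces a negative intensity: the black hole effect.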
Fold changes
Regardless of whether background correction is performed, we have two data points for each spot: a red intensity (Cy5) and a green intensity (Cy3). Using these we wanted to work out if there is a difference in expression between the two treatments. Differential expression is often reported as fold change = red intensity / green intensity, or log(fold change). Initially biologists used a cutoff of > 2 and < 0.5.
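A toy illustration of the fold-change cutoff (Python, made-up intensities):

```python
import numpy as np

# Toy red (Cy5) and green (Cy3) intensities for three genes
red = np.array([900.0, 400.0, 1500.0])
green = np.array([300.0, 450.0, 1400.0])

fold = red / green         # fold change
log_fold = np.log2(fold)   # log fold change

# The early cutoff: call a gene if fold change > 2 or < 0.5
called = (fold > 2) | (fold < 0.5)
print(called.tolist())  # → [True, False, False]
```

Only the first gene passes the cutoff; the other two sit inside the [0.5, 2] band.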
There are a number of things that might cause different intensity to be measured (differential expression, hybridisation variability, dye effects). A statistical approach needs to take this into account, using replication to estimate variability and experimental design to balance unwanted effects.
The effect of hybridisation variability can be reduced by increasing replication of genes on the array, but this reduces the number of genes we can test. Dye effects can be reduced by balancing across treatments, but we can’t double label each sample on a single array. Solution: use more arrays! This introduces another issue: variability between arrays.
Dye Swap
- Split each sample into two subsamples
- Label one green and one red.
- On each array use one colour from each sample
Equivalent to a Latin-square design of size 2. The array, treatment, dye and gene effects are balanced, allowing their effects to be separated and estimated.
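The dye-swap layout can be written out explicitly; a small Python check (hypothetical treatment names T1/T2) confirms the Latin-square balance:

```python
# Dye-swap layout: each sample is split, one subsample labelled with each
# dye, and each array carries one colour from each sample.
layout = [
    ("array1", "red",   "T1"),
    ("array1", "green", "T2"),
    ("array2", "red",   "T2"),
    ("array2", "green", "T1"),
]

# Each treatment sees both dyes and both arrays: a 2x2 Latin square,
# so array, dye and treatment effects can be separated.
for trt in ("T1", "T2"):
    assert {dye for _, dye, t in layout if t == trt} == {"red", "green"}
    assert {arr for arr, _, t in layout if t == trt} == {"array1", "array2"}
print("balanced")
```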
ANOVA model
Assumptions:
- interested in fluorescence intensities from the spots, denoted $X_{ijklm}$
- since they are $\ge 0$, let $Y_{ijklm} = \log(X_{ijklm})$; if $X_{ijklm} = 0$, then $Y_{ijklm} = 0$
- array, dye, treatment and gene effects are additive on $Y_{ijklm}$
- some random error $\epsilon_{ijklm}$ associated with each data point
We can formulate this as an ANOVA model:
\[ Y_{ijklm} = \mu + A_i + D_j + T_k + G_l + AT_{ik} + AG_{il} + DG_{jl} + TG_{kl} + \epsilon_{ijklm} \]However, $AT$ is aliased with $D$: if an $AT$ interaction effect exists it will look the same as a dye effect. We need to consider which of these is most likely, possibly using a control array to detect whether an $AT$ interaction exists. We can write the (full 3 factor) model as:
\[ Y_{ijkl} = \mu + A_i + T_j + G_k + AT_{ij} + AG_{ik} + TG_{jk} + ATG_{ijk} + \epsilon_{ijkl} \]where $i = 1, \cdots, a$, $j = 1, \cdots, t$, $k = 1, \cdots, g$, $l = 1, \cdots, r$
Degrees of freedom
In the dye swap $a = t = 2$ , so we have $n = atgr = 4gr$ data points, giving $4gr$ degrees of freedom for model fitting. Model takes $atg = 4g$ df leaving $4g(r-1)$ df for estimating variability.
Replication is essential! Without it we cannot estimate variability without sacrificing part of our model (eg. the $ATG$ or $AG$ interaction terms)
Parameter estimation
You can’t just use a standard ANOVA tool because they typically invert the $X^{T}X$ matrix, which is too large to handle when the model involves thousands of genes.
Using ANOVA formulae
Can use traditional ANOVA formulae to estimate parameters, fit the model, calculate sums of squares and estimate the error term. (Calculating interaction terms is a bit like doing set addition, or calculating determinants.)
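For balanced data the ANOVA estimates are just deviations of marginal means from the grand mean; a minimal numeric sketch (toy two-array, two-treatment cell means, Python):

```python
import numpy as np

# Toy cell means Y[array, treatment] (averaged over genes/replicates)
Y = np.array([[8.0, 10.0],
              [9.0, 13.0]])

mu_hat = Y.mean()                              # grand mean
A_hat = Y.mean(axis=1) - mu_hat                # array main effects
T_hat = Y.mean(axis=0) - mu_hat                # treatment main effects
AT_hat = Y - mu_hat - A_hat[:, None] - T_hat   # interaction: what's left over

print(mu_hat)           # → 10.0
print(A_hat.tolist())   # → [-1.0, 1.0]
print(T_hat.tolist())   # → [-1.5, 1.5]
```

Note the effects sum to zero across each factor's levels, as the formulae require.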
Two-stage model
- Global effects: $Y_{ijkl} = \mu + A_i + T_j + AT_{ij} + \delta_{ijkl}$
- Gene specific effects: $\delta_{ijkl} = G_k + AG_{ik} + TG_{jk} + ATG_{ijk} + \epsilon_{ijkl}$
Determining differential expression
Average treatment effect over all genes is measured by $T$ , treatment specific effect beyond average by $TG$ .
Hypothesis tests:
- $H_{0k}: (T_1 + TG_{1k}) - (T_2 + TG_{2k}) = 0$
- $H_{0k}: TG_{1k} - TG_{2k} = 0$
These ask quite different questions. In experiments with widespread change the $T_j$ terms will be non-zero, and we need to include these terms in our hypothesis test. (Note that $(\hat{T}_1 + \hat{TG}_{1k}) - (\hat{T}_2 + \hat{TG}_{2k}) = \bar{Y}_{.1k.} - \bar{Y}_{.2k.}$, an intuitive estimate).
If we assume the errors are normally distributed we can use a t-test, with test statistic $t_g = \frac{\hat{\mu}_1 - \hat{\mu}_2}{SE(\hat{\mu}_1 - \hat{\mu}_2)}$ .
Standard errors:
- $SE((T_1 + TG_{1k}) - (T_2 + TG_{2k})) = \frac{\sigma}{\sqrt{r}}$
- $SE(TG_{1k} - TG_{2k}) = \frac{\sigma}{\sqrt{r}}\sqrt{1-\frac{1}{g}}$
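With normal errors the per-gene comparison reduces to a two-sample t-test; a sketch with made-up log-intensities (Python, using scipy):

```python
import numpy as np
from scipy import stats

# Toy log-intensities for one gene, r = 4 replicates per treatment
y1 = np.array([8.1, 7.9, 8.2, 8.0])  # treatment 1
y2 = np.array([9.0, 9.3, 8.9, 9.2])  # treatment 2

# Equal-variance two-sample t-test: t = (mean1 - mean2) / SE
t_stat, p_value = stats.ttest_ind(y1, y2)
print(t_stat, p_value)  # large negative t, very small p
```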
Normality assumption
We have assumed that the residuals are normally distributed. What if they aren’t? Non-parametric tests or resampling based methods.
Bootstrapping: repeatedly sample residuals with replacement to form the test distribution. Permutation tests: generate all samples without replacement to form the test distribution.
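A permutation test on the same idea: reshuffle the treatment labels and count how often a difference as large as the observed one arises (toy data, Python, Monte Carlo sampling of permutations rather than full enumeration):

```python
import numpy as np

y1 = np.array([8.1, 7.9, 8.2, 8.0])
y2 = np.array([9.0, 9.3, 8.9, 9.2])
observed = abs(y1.mean() - y2.mean())

pooled = np.concatenate([y1, y2])
rng = np.random.default_rng(1)
n_perm = 5000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)  # reshuffle treatment labels
    diff = abs(perm[:4].mean() - perm[4:].mean())
    if diff >= observed:
        count += 1
p_value = (count + 1) / (n_perm + 1)
print(p_value)  # limited granularity: only C(8,4) = 70 distinct splits exist
```

With r = 4 per group the smallest achievable p-value is about 2/70, which is exactly the granularity problem mentioned above.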
Generally not very useful with microarrays because we don’t have much information about the distribution shape (especially in the tails), and we tend to get more liberal p-values. With such small samples we need the extra “power” of parametric assumptions.
Multiple comparison procedures
When we’re trying to find significant changes in expression, we’re basically testing $H_{0k}$: gene $k$ is not differentially expressed, for each gene. We’ve seen two possible hypothesis tests based on ANOVA parameters, but there are many others that can be formulated. Most will produce a p-value for each gene, which represents the probability of observing a result this large or larger under the null hypothesis.
We need to decide how small a p-value needs to be before we consider it to be significant. When testing single hypotheses, we often fix a type I error rate of $\alpha = 0.05$ or $\alpha = 0.01$ . (Type I error = reject null hypothesis when it is true = false positive).
Setting a type I error rate of 0.05 means we are willing to make a type I error in 5% of our tests. If we’re testing 20,000 genes for differential expression, we would expect 1,000 type I errors! To get around this problem we use multiple comparison procedures.
Family wise error rate control
Many multiple testing procedures try to control the FWER, where the FWER is defined as the probability that there is at least one type I error in the entire set (family) of hypotheses tested.
Bonferroni correction
The simplest MCP is the Bonferroni procedure, where we use $\alpha^{*} = \alpha / n$. This guarantees control regardless of the true number of null hypotheses, known as strong control. The Bonferroni correction is simple, but very conservative, and has low power.
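A sketch of the correction (Python, made-up p-values):

```python
# Bonferroni: test each of n hypotheses at alpha / n
alpha = 0.05
p_values = [0.00001, 0.004, 0.03, 0.2]
n = len(p_values)

threshold = alpha / n
rejected = [p < threshold for p in p_values]
print(threshold)  # → 0.0125
print(rejected)   # → [True, True, False, False]
```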
Fisher’s least significant difference test
Fisher also proposed the LSD, which involves performing an overall F-test (to test the global null hypothesis) and only proceeding with individual significance tests if the global null is rejected. This controls the FWER only when all null hypotheses are true, called weak control.
Multi-stage procedures: Holm’s sequential rejection procedure
Multistage procedures (as demonstrated by Fisher’s LSD) can give very tight strong control of FWER.
Holm’s sequential rejection provides strong control, but has more power than the Bonferroni procedure:
Put the p-values in ascending order, $p_{(1)} \le \cdots \le p_{(m)}$. Let $\alpha^{*}_i = \alpha / (m - i + 1)$. Step down: start from the smallest p-value and reject while $p_{(i)} \lt \alpha^{*}_i$; stop at the first p-value that fails and accept it and all larger p-values.
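A minimal implementation of Holm's step-down procedure (Python; in the example below, p = 0.013 is rejected by Holm but would fail a plain Bonferroni cutoff of 0.05/4 = 0.0125, illustrating the extra power):

```python
def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: strong FWER control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):   # rank 0 = smallest p-value
        if p_values[idx] < alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # stop at the first p-value that fails
    return rejected

print(holm([0.013, 0.002, 0.03, 0.3]))  # → [True, True, False, False]
```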
FWER gives a high level of certainty, but is very conservative, and at very small p-values our assumptions (eg. normality) may not hold well.
False discovery rate
FDR introduced in 1995 by Benjamini and Hochberg. Provides a less conservative approach than FWER methods: greater power, but at the cost of more type I errors.
Some notation:
- $m_0$ true null hypothesis, of which $U$ are declared non-significant (true negatives) and $V$ significant (false positives)
- $m-m_0$ non-true null hypotheses, of which $T$ are declared non-significant (false negatives) and $S$ significant (true positives)
- total of $R$ tests declared significant (positives)
FWER tries to make sure $P(V \gt 0) \le \alpha$. FDR is concerned with controlling $V / R$, the proportion of incorrectly rejected null hypotheses.
Formally B&H defined FDR as $E( \frac{V}{R} | R \gt 0 ) P(R \gt 0)$. Others have defined FDR as $\frac{V}{R}$ and $E( \frac{V}{R} | R \gt 0 )$ (positive FDR).
Procedure
Put the p-values in ascending order, $p_{(1)} \le \cdots \le p_{(m)}$. Let $\alpha^{*}_i = i\alpha / m$. Step up: start from the largest p-value and descend until the first $i$ with $p_{(i)} \le \alpha^{*}_i$; reject hypotheses $1, \cdots, i$.
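A minimal implementation of the B&H step-up procedure (Python; note how p = 0.03 is rejected because a larger index gives it a more generous threshold):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """B&H step-up procedure: controls the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest i with p_(i) <= i * alpha / m
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:  # reject hypotheses 1..k
        rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.01, 0.001, 0.03, 0.3]))  # → [True, True, True, False]
```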
Operating characteristics
Guarantees $FDR \le \alpha$ regardless of how many null hypotheses are true. In fact, it has been proved that $FDR = \alpha \pi_0$, where $\pi_0$ is the proportion of true null hypotheses, so when $\pi_0$ is small the procedure is considerably conservative. Adaptive FDR attempts to estimate $\pi_0$ and use this to compensate, but in practice it is hard to get good estimates of $\pi_0$.
FDR is defined as an expectation, which means there must be an underlying distribution. What is this distribution? How variable is it? How does changing $\pi_0$ affect the distribution?
Other MCP alternatives
Fixed rejection regions: assign an arbitrary rejection region then calculate the error rates. Same as regular frequentist approaches.
Bayesian approaches use posterior probabilities. Works really well when the model is correct.
Summary
Choose procedure based on the level of control you’re interested in. (You can’t do both!)
All MCPs above assume that tests are independent, which is probably not true in practice. Various adjustments can overcome this.
Experimental design
Want to determine the best way to put samples on microarrays, rather like field trials. We’re interested in the contrasts between treatments ($T_{k_1} - T_{k_2}$), known as elementary contrasts. We want to design an experiment that allows us to estimate these contrasts, while accounting for other effects.
Blocking
Often we can’t observe different treatments under the same experimental conditions; we try to account for this using blocks.
Some notation:
- $b$ blocks, $v$ treatments and $r_k$ replicates for each treatment
- if $r_k = r$ , design is equireplicate
- if all blocks same size, experiment is proper
- if treatment occurs at most once in each block, design is binary
- design is connected if all elementary contrasts can be estimated
For microarrays, we’re interested in proper, equireplicate, binary, connected designs.
If the block size $k \lt v$ then not every treatment can occur in every block; this is called an incomplete block design. The best we can do is to find a balanced design, where the variance of every elementary contrast is equal, known as a balanced incomplete block design.
If there is more than one blocking factor, we can use a Latin square design (provided all factors have the same number of levels).
Microarray experimental design
Visualising microarray designs
An arrow denotes an array (a block); two treatments joined by a line appear on the same array. The treatment at the arrow head is labelled red, the other green.
Want to minimise the variance of elementary contrasts, which depends on distance in the design graph. Treatments that share a line have the smallest variance.
BIBD
Because microarrays have a very small block size (two), a BIBD may not be the best option, as it will require many (expensive) arrays.
Reference design
Each treatment compared to reference sample on same array.
Disadvantages: half the available spots are wasted, no direct comparison between treatments, often no dye balance. Advantages: can easily add more treatments later, or repeat arrays.
Loop design
Circular design graph.
Advantages: no channels wasted, dyes balanced. Disadvantages: variance gets big for large loops; easy to disconnect the loop, but tricky to fix; hard to add more treatments.
Testing contrasts
With the designs above, as the number of treatments and arrays increases, the number of comparisons does too (even though the model stays the same). For each gene we need to test each combination of treatments (eg. $T_1 - T_2$, $T_2 - T_3$, $T_3 - T_1$). A complicating factor is that the tests aren’t independent, eg. if $T_1 = T_2$ and $T_2 = T_3$, then $T_1 = T_3$. Although random error may obscure this, dependence is still present.
R can test all pairwise comparisons using the multcomp library. The simtest function provides p-values adjusted for dependence, by using a multivariate t-distribution. Tukey’s HSD also provides a way of dealing with correlations between tests.
Things get trickier with more than 2 treatments on one array, but these are uncommon anyway.
Additional variance components
Until now we have assumed that the model parameters are fixed, and only the error is random. It’s possible to describe additional variance components for effects we are interested in viewing as random variables. Mixed models is the field of statistics concerned with this.
In microarray analysis, we’re usually interested in modelling the array effect as random. Wolfinger et al (2001) were the first to suggest this, using a two stage model:
$Y_{ijkl} = \mu + A_i + T_j + AT_{ij} + \delta_{ijkl}$, $\delta_{ijkl} = G_k + AG_{ik} + TG_{jk} + ATG_{ijk} + \epsilon_{ijkl}$, where all array effects and interactions are considered to be normally distributed with mean 0 and appropriate variance.
In general, it’s probably more appropriate to use a random model, but it does make calculations more complicated. It also has implications for experimental design, as there will be increased variance for contrasts not made on the same array (eg. in a reference design).
IBD with random blocks
Possible to calculate intra- and inter-block estimates, which can be combined to give improved estimates.
Power
Given a testing procedure and an estimate of variability, it is possible to calculate the probability of detecting effects of certain sizes; this is called a power study.
Power measures the ability to detect deviations from the null hypothesis: power = 1 – P(type II error) = P(reject $H_0$ when $H_0$ is false), a measure of sensitivity. Generally we fix specificity (the $\alpha$ level, type I error) and then estimate sensitivity.
Decision to reject $H_0$ based on size of test statistic, which under $H_0$ follows the t distribution. If our alternative hypothesis is a specific value, then we can try various values and look at the probability of rejecting $H_0$ .
How large does the test statistic have to be? $T = (\bar{Y}_{.1g.} - \bar{Y}_{.2g.}) / (\hat{\sigma} / \sqrt{r})$. We reject $H_0$ if $|T| \gt t_{\frac{\alpha}{2}}(df)$, where $df = at(r-1)$. This is equivalent to rejecting $H_0$ if $|\bar{Y}_{.1g.} - \bar{Y}_{.2g.}| \gt t_{\frac{\alpha}{2}} \frac{\hat{\sigma}}{\sqrt{r}}$
Under the alternative hypothesis, for a particular value of $\mu_A$, we want to calculate $P(|\bar{Y}_{.1g.} - \bar{Y}_{.2g.}| \gt y^*)$. The difference $\bar{Y}_{.1g.} - \bar{Y}_{.2g.}$ is normally distributed with mean $\mu_A$, which implies that $(\bar{Y}_{.1g.} - \bar{Y}_{.2g.} - \mu_A) / \frac{\hat{\sigma}}{\sqrt{r}} \sim t_{at(r-1)}$.
\[ P(\text{Reject } H_0 \mid \mu_A) = P\left( T \gt t_{\frac{\alpha}{2}} - \frac{\mu_A}{\hat{\sigma}/\sqrt{r}} \right) + P\left( T \gt t_{\frac{\alpha}{2}} + \frac{\mu_A}{\hat{\sigma}/\sqrt{r}} \right) \]
For microarrays
Use a common variance (gene-specific variances would be crazy). Need to take the multiple comparison technique into account – Bonferroni is easiest ($\alpha / g$).
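The power formula above can be evaluated directly; a sketch in Python (toy values for $\hat{\sigma}$ and $\mu_A$ are assumptions, with the Bonferroni-adjusted level $\alpha/g$):

```python
from scipy import stats

def power_per_gene(mu_A, sigma, r, a=2, ntrt=2, alpha=0.05, g=10000):
    """Shifted-t approximation to power, Bonferroni-adjusted to alpha/g."""
    df = a * ntrt * (r - 1)
    t_crit = stats.t.ppf(1 - (alpha / g) / 2, df)  # critical value
    delta = mu_A / (sigma / r ** 0.5)              # standardised shift
    return stats.t.sf(t_crit - delta, df) + stats.t.sf(t_crit + delta, df)

print(power_per_gene(mu_A=1.0, sigma=0.3, r=4))
print(power_per_gene(mu_A=1.0, sigma=0.3, r=8))  # more replicates, more power
```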
Normalisation
So far have used linear models approach to remove experimental effects from data. Removal of experimental effects also known as normalisation. Most methods assume that relatively few genes are undergoing differential expression.
Original method: divide background corrected values by channel median (average differential expression for each channel should be about the same)
ANOVA method: basically the same, but instead of dividing by the median, we subtract the logged mean from the logged data; this has better statistical properties but is not really any different from the original method.
What if the experiment effects are non-linear?
Non-linear dye effect
Some non-linear dye effects are often found (they can be detected with a control hybridisation) and can lead to false discoveries. An MA plot seems to be a good graphical way of detecting these intensity-dependent effects, which are not easily seen on a simple R vs G plot.
$M = \log_2(R/G) = \log_2 R - \log_2 G$, $A = \frac{1}{2} \log_2(RG) = \frac{1}{2}(\log_2 R + \log_2 G)$.
The ANOVA approach would simply subtract off the average effect – this would centre the cloud about zero, but wouldn’t change its shape.
Alternative: we assume non-differentially expressed data should be distributed around the diagonal, with differential expression roughly symmetric about it. This is equivalent to an MA plot symmetric about $M = 0$. Can straighten this out using some kind of locally weighted regression (eg. lowess or loess).
In R (one way to do it): fit <- lowess(A, M); M <- M - approx(fit, xout = A)$y
Then transform back to R and G values: $R' = \sqrt{2^{2A' + M'}}$ , $G' = \sqrt{2^{2A' - M'}}$ .
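The M/A transform is invertible, so intensities can be recovered exactly after adjusting M; a quick round-trip check (Python, toy intensities):

```python
import numpy as np

R = np.array([900.0, 400.0, 1500.0])
G = np.array([300.0, 450.0, 1400.0])

M = np.log2(R / G)        # log ratio
A = 0.5 * np.log2(R * G)  # average log intensity

# Inverse transform (here with M, A unchanged, so we recover R and G)
R2 = np.sqrt(2 ** (2 * A + M))
G2 = np.sqrt(2 ** (2 * A - M))
assert np.allclose(R2, R) and np.allclose(G2, G)
print("round trip ok")
```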
Problems
We assume most genes aren’t differentially expressed – what if they are? What if expression is not symmetric? Default smoothing parameters can also cause problems. Make sure to check the smoothed fit!
Pin tip correction
It's possible there were problems with some pin tips, so do a separate correction for each pin tip. Beware – this can be a disaster!
Clustering
Expression profiles
Represent change in expression from baseline. Can plot as line graph, or as colour strips. Useful for displaying thousands of genes.
Expect that genes with similar expression profiles have similar functions or work in same process.
Clustering
- Choose distance measure (eg. euclidean, correlation)
- Choose clustering method (eg. hierarchical agglomerative, divisive)
- Choose linkage method (single, complete, average)
- Group together two closest clusters
- Repeat
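The steps above can be sketched with scipy's hierarchical clustering (toy expression profiles; euclidean distance, average linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy profiles (genes x timepoints): two rising, two falling
profiles = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.1, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0],
    [2.9, 2.1, 0.9, 0.1],
])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(profiles, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels.tolist())  # rising genes share one label, falling genes the other
```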
Affymetrix arrays
Spotted arrays
Spotted arrays consist of strands of denatured DNA arranged in grid of spots. DNA can come from expressed mRNA libraries, or manufactured artificially. Common feature is method of DNA attachment: robotic spotting. Developed in research environment and can be used by anyone. A number of alternative proprietary techniques have been developed.
Affymetrix
Sells a complete analysis package – arrays, scanner and analysis software. Builds short sequences of oligos directly on the slide surface using photolithography. Must know the sequence of interest, but currently this is not a big problem. The process is expensive and creating one-off custom arrays is not practical. Very precise and allows extremely high density.
Experimental process
- Probe selection. For a given target, probes aim to optimise hybridisation while maintaining selectivity. 25 bp long, 11-20 probes per gene, 1 probe per spot. PM (perfect match) and MM (mismatch, with the base at position 13 changed) probes attempt to control for non-specific hybridisation. Probe pairs for a particular gene are spread across the array to control for spatial effects. Each probe is represented by a cell, 18-20 microns in size. Using 11 PM-MM probe pairs per gene and 18 micron cells, the human genome can be represented on 2 arrays.
- Array fabrication. Multiple identical arrays constructed simultaneously on quartz wafer. Proceeds in stepwise manner, oligos formed one base at a time. At each step photolithographic masks placed over the wafer. UV light through open windows exposes linkers which will bind to a nucleotide. Single wafer yields 50-400 arrays.
- Hybridisation. Similar to spotted arrays, but only one sample per array – makes experimental design much easier! Inter-array variability is very low.
- Data analysis