Useful SAS code

v = variable, k = keyword, o = options (see the SAS user’s guide)

Univariate statistics

PROC UNIVARIATE [o];

VAR v’s; (variables to process)
BY v’s; (sort by)
CLASS v-1 v-2; (classify by)
FREQ v; (variable represents frequencies)
HIST [v’s]; (produce histogram)
ID v-1 …; (observation identifier)
OUTPUT [OUT = data set] keyword-1=name-1 …; (save to data set)
PROBPLOT v’s; (produce probability plot)
QQPLOT v’s; (duh)

RUN;

Will produce everything you could ever possibly want
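A minimal sketch, assuming a dataset heights with a variable height (names hypothetical):

proc univariate data=heights normal plot;
  var height;
  histogram height / normal; /* histogram with a fitted normal curve */
  qqplot height;
run;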

Options:

Methods for continuous data

Student’s t-test is commonly used for detecting differences in means between two groups

Unpaired: PROC TTEST [o]; VAR v’s; (variables to test) CLASS variable; (classification variable) BY v’s; (group by) FREQ v; (frequency variable) RUN;

Paired: PROC TTEST [o]; PAIRED pairlist; (variables to compare) BY v’s; FREQ v; RUN;

Pairlist: eg. a*b compares a with b; (a b)*(c d) compares a with c, a with d, b with c, and b with d.

Output: variables, n, mean, SD, SE of the mean, t-tests, and the folded F-test for equality of variances.
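A minimal sketch, assuming a dataset trial with a two-level group variable, a response variable, and before/after measurements on the same subjects (all names hypothetical):

proc ttest data=trial;
  class group; /* two-level classification variable */
  var response; /* variable to test */
run;

proc ttest data=trial;
  paired before*after; /* paired comparison on the same subjects */
run;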

Wilcoxon rank sum

PROC NPAR1WAY [o]; VAR v’s; CLASS variable; BY v’s; FREQ v; EXACT statistic-groups; (compute exact p values) RUN;
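A minimal sketch, reusing the same hypothetical trial dataset:

proc npar1way data=trial wilcoxon;
  class group;
  var response;
  exact wilcoxon; /* exact p-values; slow for all but small samples */
run;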

One-way ANOVA

Same assumptions as the t-test. The non-parametric equivalent is the Kruskal-Wallis test.

PROC GLM [o]; CLASS v’s; MODEL dependents=independents [/ o]; (model formula) RUN;

ABSORB v's; (absorb out variable effects)
BY v's; FREQ v; ID v; WEIGHT v;
CONTRAST 'label' effect values ... ; (contrasts)
ESTIMATE 'label' effect values  ...; (estimate linear combs of parameters)
LSMEANS effects; (essential for ANOVA!)
MANOVA ...;
MEANS effects;
OUTPUT ...;
RANDOM effects;
REPEATED;
TEST;

Prints the multi-group equivalents of the t-test outputs.
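A minimal one-way ANOVA sketch (dataset and variable names hypothetical):

proc glm data=trial;
  class treat;
  model response = treat;
  means treat / hovtest; /* raw means plus a homogeneity-of-variance test */
  lsmeans treat / pdiff; /* least squares means and pairwise differences */
run;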

Scatter plots

PROC PLOT [o]; BY v’s; PLOT y * x; (vertical * horizontal) RUN;

Correlation

PROC CORR [o]; VAR v’s; (variables to correlate) PARTIAL v’s; (variables to calculate partial correlation coefficients for) RUN;

BY v's;
FREQ v;
WEIGHT v;
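A minimal sketch (dataset and variable names hypothetical):

proc corr data=trial pearson spearman;
  var height weight;
  partial age; /* partial correlations adjusting for age */
run;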

Regression

PROC REG [o]; MODEL y = x1 x2 x3 x4 / SELECTION = stepwise; RUN;

BY v's; FREQ v;  ID v's; VAR v's; WEIGHT w;
ADD v's; DELETE v's; (interactively add and delete variables from model)
MTEST... ; (test linear combination hypotheses)
OUTPUT...;
PAINT ...; PLOT ...; (generate scatter plots)
PRINT...;
RESTRICT...; (restrict parameters)
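A minimal sketch (names hypothetical); since PROC REG’s MODEL statement accepts only variables, an interaction such as x3*x4 has to be built in a DATA step first:

data trial2;
  set trial;
  x3x4 = x3 * x4; /* interaction term built by hand */
run;

proc reg data=trial2;
  model y = x1 x2 x3 x4 x3x4 / selection=stepwise;
run;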

Produces typical shitload of output.

Introduction

What is a mixed model?

Generalisation of linear models where observations are not independent. Mixed models model the covariance structure of data.

Three basic types:

Why use mixed models?

Useful definitions

Containment

Basically the same as nesting (?).

Balance

When a design is balanced, estimated effects will equal raw means. Balance occurs for a fixed effect when:

It’s also important to identify when fixed effect means will differ depending on whether a fixed or mixed model is used. They will be the same provided:

Rules of thumb:

Error strata

An error stratum is defined by each random effect and by the residuals. The containment stratum of a fixed effect is the error stratum that contains the effect – the residual stratum, unless the effect is contained within a random effect. Usually an effect has only one containment stratum, but it can have more in more complicated situations.

Higher level strata are defined by random effects that are themselves contained within another random effect. If higher level strata are present and the data are unbalanced across random effects, fixed effects will be estimated using information from the higher level strata as well.

Monte Carlo pricing

Instead of tricky integrals, price contingent claims by simulating asset price paths.

If we generate independent random variates $X_1, ..., X_n$ with the distribution of $X / S_0(T)$ under $Q$, then $\bar{X} = \frac{1}{n} \sum X_i$ estimates $\pi_X (0)$, with the usual CLT confidence intervals.

To generate the $X_i$, we need to simulate asset price paths $(S_1(t))$. With a Black-Scholes model, $S_1(t) = S_1(0) \exp((r - \frac{1}{2}\sigma^2)t + \sigma \tilde{W}_t)$, so we need to simulate Brownian motion.

What to do next depends on the type of claim. For example, with European options we only need the final value, so we only need $\tilde{W}_T \sim N(0, T)$. But in general we will need the whole price path. We discretise into $m$ steps of size $\Delta t = T / m$, then fill in other values via linear interpolation. This works well for complicated options. It doesn’t work well for American options, because it doesn’t provide the optimal exercise strategy.
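A minimal sketch of one discretised path as a SAS data step; the parameter values and the seed are assumed purely for illustration ($T$ is named horizon because SAS variable names are case-insensitive):

data path;
  s0 = 100; r = 0.05; sigma = 0.2; horizon = 1; m = 250; /* assumed parameters */
  dt = horizon / m;
  t = 0; w = 0; s = s0;
  output;
  do i = 1 to m;
    w = w + sqrt(dt) * rannor(20010101); /* Brownian increment ~ N(0, dt) */
    t = t + dt;
    s = s0 * exp((r - 0.5*sigma**2)*t + sigma*w); /* Black-Scholes price at t */
    output;
  end;
  keep t w s;
run;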

Variance reduction techniques

High accuracy is required for financial applications; usually we want the error to be less than 1 part per thousand. Standard errors are $O(1/\sqrt{n})$, so it is useful to use variance reduction techniques.

Antithetic variates

The simplest of these is antithetic variates – if $(W_t)$ is a Brownian motion, then so is $(-W_t)$. Then let $X_i = \frac{1}{2}\left(f(S^{(i)}(T)) + f(S^{(i')}(T))\right)$, where $S^{(i')}$ is the price path driven by the negated Brownian motion. Variance will be reduced because prices from positive and negative paths are likely to be negatively correlated.
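A minimal sketch for a European call under Black-Scholes (payoff, parameter values, and seed all assumed for illustration); each draw averages the discounted payoffs from the $W$ and $-W$ paths:

data anti;
  s0 = 100; strike = 100; r = 0.05; sigma = 0.2; horizon = 1; /* assumed */
  do i = 1 to 10000;
    w = sqrt(horizon) * rannor(8675309); /* terminal Brownian value */
    splus = s0 * exp((r - 0.5*sigma**2)*horizon + sigma*w);
    sminus = s0 * exp((r - 0.5*sigma**2)*horizon - sigma*w); /* antithetic path */
    /* average of the two discounted call payoffs */
    x = 0.5 * exp(-r*horizon) * (max(splus - strike, 0) + max(sminus - strike, 0));
    output;
  end;
  keep x;
run;

proc means data=anti mean stderr;
  var x;
run;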

Control variates

Idea: Use known “model payoffs” $Y$, similar to $X$, with $E[Y] = \mu_Y$. Then $\frac{1}{n} \sum (X_i + c(Y_i - \mu_Y))$ estimates $E[X]$, since $E[X + c(Y - \mu_Y)] = E[X]$.

This will be better than simple sampling if $Var(X + c(Y - \mu_Y)) \lt Var(X)$, ie. $Var(X) + 2c Cov(X,Y) + c^2 Var(Y) \lt Var(X)$. A pilot sample may help to estimate the best value $c^{*}$ of $c$ (see below). Alternatively could just take $c$ to be $-1$, ie. use $\frac{1}{n} \sum (X_i - Y_i + \mu_Y)$. This relies on $Var(X-Y)$ being smaller than $Var(X)$.
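Minimising this variance over $c$ gives the usual optimal coefficient (a standard result, filled in here for completeness):

\[ \frac{d}{dc} Var(X + c(Y - \mu_Y)) = 2 Cov(X,Y) + 2c Var(Y) = 0 \quad \Rightarrow \quad c^{*} = -\frac{Cov(X,Y)}{Var(Y)} \]

with minimised variance $Var(X)(1 - \rho^2_{XY})$ – the more correlated the model payoff, the greater the reduction.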

Another approach uses several model payoffs $Y^{(1)}, Y^{(2)}, \ldots$, with $E[Y^{(j)}] = \mu_j$. Then estimate $E[X]$ using $\frac{1}{n} \sum_i (X_i + \sum_j c_j(Y^{(j)}_i - \mu_j))$. This works as above. You can find the $c_j$’s by regressing $X_i$ on $Y^{(1)}_i, \ldots$.

Importance sampling

Idea: Use $E_R [X \frac{dQ}{dR}]$, ie. simulate $X \frac{dQ}{dR}$ under $R$ rather than $X$ under $Q$. Hopefully $Var_R (X \frac{dQ}{dR})$ will be smaller than $Var_Q (X)$.

\[ Var_R \left( X \frac{dQ}{dR} \right) = E_R \left[ X^2 \left( \frac{dQ}{dR} \right)^2 \right] - \mu^2 \]

We want $\frac{dQ}{dR}$ to be small where $X$ is large. (Note: $\frac{dQ}{dR} \gt 0$ and $E_R [\frac{dQ}{dR}] = 1$.) Can do this with GBM (geometric Brownian motion). If $S_1(t) = S_1(0) \exp((r-\frac{1}{2}\sigma^2)t + \sigma \tilde{W}_t)$, we can find an equivalent measure $R$ s.t. $\frac{dR}{dQ} = \exp(c\tilde{W}_T - \frac{1}{2}c^2 T)$, under which $\bar{W}_t = \tilde{W}_t - ct$ is a Brownian motion. So simulating the asset price under $R$ is just a matter of simulating a GBM with shifted drift.

Also:

\[ \frac{dQ}{dR} = \exp(-c\tilde{W}_T + \frac{1}{2} c^2 T) \]

Repeated measures

Any dataset in which subjects are measured repeatedly over time can be described as repeated measures data. Measurements can be made at pre-determined times or in an uncontrolled fashion; this will determine the types of analysis available.

Fixed effect approaches

Mixed model approaches

Advantages:

There are several ways to use mixed models. The simplest is a random effects model with patient effects fitted as random. This allows for constant correlation between all observations on the same patient – but this is often not the case. Use a covariance pattern or random coefficients model instead.

Covariance pattern models

Define the covariance structure directly rather than using random effects. Observations within each category of a blocking effect are assumed to have the same covariance structure, so the residual matrix $\mathbf{R}$ is block diagonal, with one block per category of the blocking effect.

Covariance patterns

Large selection of covariance patterns available. Most depend on observations being taken at fixed times, and some are more easily justified when observations are evenly spaced. Some patterns take into account exact value of time, and are best used when intervals are irregular.

Some simple patterns are:

If variability in a measurement is different at each time point, there are obvious heterogeneous generalisations. Can also fit a separate covariance structure to each treatment group.

If widely separated observations appear to be uncorrelated, can create a banded covariance structure by setting all covariances more than $t$ steps apart to 0. Can be done to any covariance pattern, and reduces number of variance components to be estimated.

Covariance patterns using the exact separation in time also exist (eg. power: $r_{ijk} = \sigma^2 \rho^{d_{ijk}}$). Useful when time points do not occur at fixed intervals. Most cause covariance to decrease exponentially with increasing distance.

Which covariance pattern should be used?

Want to choose the covariance pattern that best fits the true covariance pattern – not easy! Increasing the number of parameters will improve the fit, but can lead to over-fitting. Can test using likelihood-ratio tests (provided the models are nested), or by comparing goodness of fit statistics adjusted for the number of parameters $q$ (eg. AIC $= \log(L) - q$ or BIC $= \log(L) - \frac{q}{2} \log(N)$).

Which patterns should be considered? Not usually practical to test large numbers. Can either start with the simplest and work up, or use the general (unstructured) pattern to get some idea of what it should look like. Often the covariance pattern will make little difference to the estimate of the treatment effect and its standard errors; in this case compound symmetry is reasonable (roughly check by comparing to the general pattern), or you could just use the empirical estimator.

General points

Missing data occur frequently in repeated measures experiments. They are less of a problem with a specified covariance structure, as patients with only a few observations still contribute information through the covariance pattern. Still need to be careful.

Significance testing should be performed using F-tests with Satterthwaite’s df. If Satterthwaite’s method is not available, use patient df to compare treatment effects (this will be conservative).

Fixed effect standard error estimates will be biased downward, because the covariance parameters are estimated rather than known. Can use the robust ‘empirical’ variance estimator as described previously, but this will ignore the model specification and use the covariance pattern from the data.

Residuals are assumed $\sim N(\mathbf{0}, \mathbf{R})$. Difficult to check formally, but plots of residuals should be sufficient to identify any major outliers or deviations from normality.

Random coefficients models

Random coefficients models develop an explicit relationship between the measurement and time. The most common model is linear in time, with interest in whether the slope differs between treatment groups. Usually fit time and treatment.time as fixed effects, and patient and patient.time as random, to allow patients to vary randomly around the treatment mean.
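A minimal PROC MIXED sketch of this model (dataset and variable names hypothetical; time is continuous, so it is not in the CLASS statement):

proc mixed data=rmdata;
  class treat patient;
  model y = treat time treat*time / solution ddfm=satterth;
  /* random intercept and slope per patient; type=un lets them correlate */
  random intercept time / subject=patient type=un;
run;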

Fitting polynomial models to the data can be accomplished by successively adding polynomials of higher order (as fixed and random effects) until the variance component becomes negative (for random effects) or the effect becomes non-significant (for fixed effects).

General points

If a negative variance component estimate is obtained, refit the model without that component. However, not all software will report negative variance components. In SAS, non-convergence or a non-positive-definite $\mathbf{G}$ matrix are signs of a negative variance component. In this case, remove components in order of complexity until the problem is resolved.

Sample size estimation

Often calculated as for a simple between-patient trial; because of the correlation between repeated measures, fewer patients are actually required. Obviously we won’t know the covariance pattern in advance, but a compound symmetry pattern will probably be adequate. If no estimate of the within-patient correlation is known, a conservative prediction of the correlation could be used.

\[ Var\Big(\sum_j y_{ij}\Big) = m \sigma^2_p (1 + (m-1)\rho), \qquad Var(\bar{y}_i) = \sigma^2_p (1 + (m-1)\rho)/m, \qquad SE(t_i - t_j) = \sqrt{2 Var(\bar{y}_i)/n} \]

with $m$ repeated measurements, $n$ patients in each group, $\sigma^2_p$ the between-patient variance, and $\rho$ the correlation between measurements.
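For illustration (numbers assumed): with $m = 4$ measurements per patient and $\rho = 0.5$,

\[ Var(\bar{y}_i) = \sigma^2_p (1 + 3 \times 0.5)/4 = 0.625 \sigma^2_p \]

so each patient mean is appreciably less variable than a single measurement ($\sigma^2_p$), and fewer patients are needed than in a single-measurement trial.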

Multicentre trials

A multicentre trial is carried out at several centres to increase patient numbers or to assess treatments in different settings. Centre differences sometimes increase variability in treatment effects (most likely in surgical trials) – these can be included in the model as centre and centre.treatment effects. Need to decide whether these effects should be fitted as fixed or random, based on what population you want to make inferences about.

Important to check for any outlying centres – they may indicate a lapse in protocol. In a fixed effects model, spurious outliers may be caused by small sample variation – a random effects model does not have this problem.

Both fixed

Treatment effects: Estimated with equal weight given to each centre. Cannot be estimated unless every treatment is received at every centre.
Treatment SE: $Var(t_i - t_j) = \sigma^2_r (1/n_i + 1/n_j)$. Always $\le$ the variance when centre.treatment effects are random.
Outlying centres: Determine from centre and centre.treatment means. Small centres may appear outlying because of random variation.
Inference: Only to the centres in the trial.

Centre effects fixed, centre.treatment effects omitted

Often used when the fixed centre.treatment effect is non-significant, but small centre.treatment effects can be difficult to detect, and variability in the treatment effect between centres is then ignored.

Treatment estimates: Take account of differing centre sizes.
Treatment SE: As above.
Outlying centres: As above.
Inference: As above, but extrapolation to other centres seems more reasonable if there is no centre.treatment effect.

Both random

Centre.treatment effects are retained, even if non-significant, provided $\sigma^2_{ct} \ge 0$.

Treatment estimates: Take account of differing centre sizes. Estimated from the centre.treatment and centre error strata. Estimable even if some treatments are not received in all centres.
Treatment SE: Based on centre.treatment variation. If an equal number of patients receive each treatment at every centre, $Var(t_i - t_j) = 2(\sigma^2_r / r + \sigma^2_{ct})/c$. Not easily specified if centre sizes are unequal.
Outlying centres: Determined using shrunken centre and centre.treatment estimates.
Inference: Over the population of possible centres.

Centre effects random, centre.treatment effects omitted

Useful when local estimates are required, because treatment SEs will be smaller.

Treatment estimates: Take account of differing centre sizes. Additional information from the centre stratum.
Treatment SE: Based on residual (within-centre) variation. If balanced, the SE has the same formula as in the fixed effects case; otherwise, expected to be less.
Outlying centres: As above.
Inference: Related to the sampled centres only – only under the strong assumption that no centre.treatment effect exists can inference be extended to the entire population of centres.

Practical considerations

Plausibility of centre.treatment interaction

One approach to analysing multicentre drug trials is to assume that there is no centre.treatment interaction, and that if there is, it is a fault of the study design. However, a drug effect can vary between centres because of different underlying populations, so centre.treatment effects are always plausible.

Generalisation

Strictly speaking, we shouldn’t generalise unless centres have been drawn at random from a population of interest. This is rarely the case in practice (as with random selection of patients) – but we usually do anyway, along with subjective judgements. Generally, multicentre trials are more generalisable than single-centre trials.

Number of centres

The accuracy of variance components depends on the number of centres used – if $\lt 5$ centres are used, $\sigma^2_{ct}$ and the treatment effect SE may not be accurate. Main conclusions should then be based on local results from a fixed effects model – usually omitting the centre.treatment term.

Centre size

Sometimes centres pull out after collecting only a little information. Small centres are often combined into one in fixed effects analyses, but this makes little difference with random effects models.

Negative variance components

As for the general case, can remove the component or set it to zero (preferred). If both the centre and centre.treatment components are negative, can drop both and analyse as a simple between-patient trial.

Balance

If the centre.treatment term is included, balance will only occur if the same number of patients receive each treatment at every centre (unlikely!) – in this case treatment estimates will equal raw means. In the more usual (unbalanced) situation, means will differ between fixed and random models.

If the centre.treatment term is omitted, balance will occur when the same proportion of patients receive each treatment at each centre.

Sample size estimation

($c \times r \times t$ patients in total: $c$ centres, $t$ treatments, $r$ patients per treatment per centre)

In a balanced trial $Var(t_i - t_j) = \frac{2}{c} (\sigma^2_r / r + \sigma^2_{ct})$, and estimates for $c$ and $r$ can be obtained from the usual sample size estimation equation $\Delta = (t_{\alpha/2} + t_{1-\beta}) SE(t_i - t_j)$. $DF = (c-1)(t-1)$ is not always known in advance, but can use the normal distribution instead of $t$ to get an initial value, then iterate.

One difficulty is that estimates of both $\sigma^2_r$ and $\sigma^2_{ct}$ are required. Can use results from a previous (or pilot) study, or make an educated (conservative) guess.

If only the relative cost $g$ of a centre to a patient is known, the total cost will be $crt + cg$, which is minimised when $r = \sqrt{\frac{g\sigma^2_r}{t \sigma^2_{ct}}}$; this can then be substituted into the above formula.

Other applications

Repeated measurements within visits

Often done when data collection methods are inaccurate/imprecise. Could just use summary statistics from each visit, but this does not deal well with missing data or with testing interactions. Same basic approach as for ordinary repeated measurements: a covariance pattern or random coefficients model.

Covariance pattern models

Several ways to model the covariance; five reasonable models are presented below:

Many options – how to choose? If interested in the pattern itself, start with the simplest and build up. If interested in estimates of means and standard errors, the second option is probably adequate.

Random coefficients models

Again, more appropriate if interested in the relationship with time, but now there are two time scales, within and between visits, and slopes can be fitted to either or both of these.

Visit time: take reps as categorical fixed effects; not usually used for cross-over data.
Rep time:
Both: not usually used for cross-over data.

Matched case-control studies

Match a group of subjects with a disease to those without on age, sex, etc., and determine which other factors differ between the groups. Usually, take candidate risk factors as outcome variables and the matched set as a random effect. Similar to a cross-over trial.

Results will be the same as a fixed effects model when there is the same number of controls for every case and the matched set variance component is positive. In a fixed effects model, information from matched sets that are all cases or all controls is completely lost. Another fixed effects approach is to fit the matching variables as covariates but otherwise ignore matching – this can cause bias if the matching variables are associated with group.

With binary variables

If an outcome variable is binary, it is likely that several matched sets will be uniform (all the same value). This may cause bias in the random effects model, which should then be reparameterised as a covariance pattern model with compound symmetry. Preferable to conditional logistic regression, where information is lost on all tied matched sets.

Different variances for patient groups

In a simple between-patient trial, treatment groups sometimes have different variances – allowing for this will give more appropriate standard errors, and the variance estimates may be useful for understanding treatment mechanisms. Easy to do – just allow $\mathbf{R}$ to have different variances for each group (use the REPEATED statement in SAS, eg. repeated int / subject=patient group=treat).
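A minimal sketch (names hypothetical; one observation per patient assumed):

proc mixed data=trial;
  class treat patient;
  model y = treat / ddfm=satterth;
  repeated / subject=patient group=treat; /* separate residual variance per treatment group */
run;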

Estimating variance components

Sometimes interested in just estimating variance components so that they can be used to help design future trials. Set $\Delta = (z_{\alpha/2} + z_{1-\beta}) SE(t_i - t_j)$ and then use the standard variance formula $Var(t_i - t_j) = \sum_{\text{error strata}} \sigma^2 / n$. The required sample sizes will usually need to be determined via a grid search.

Small sample sizes

Mixed models are useful here, because their shrinkage properties reduce small sample variability.

Meta analysis

Useful/important when:

Most meta-analyses in the medical literature are of clinical trials; recently epidemiological studies have been reanalysed as well.

Main purposes:

Practical steps:

If the original data are available, then analyse as for a multicentre trial, with trials replacing centres. Trial.treatment variance will be larger than centre.treatment variance because of differing protocols. Taking trial effects as random can increase the accuracy of treatment estimates.

Meta-analysis is usually done on binomial data – with normal data, individual trials usually achieve the desired conclusions on their own.

Generalised linear mixed models

Many situations arise where residuals are not normally distributed – presence/absence data, count data, etc. Generalised linear models and generalised linear mixed models relax the assumption of normally distributed residuals (to exponential-family distributions) for fixed effects and mixed models respectively.

Generalised linear models

A GLM is derived from the simple fixed effects model by modelling $g(\mu) = X \alpha$ instead of $\mu = X \alpha$, where $g$ is the link function. Can also think of the link function as a way of mapping data from their scale to the reals (the scale of normally distributed data).

Distributions

Binomial: parameter of interest is the proportion of successes $\mu$, $f(y) = \frac{n!}{y!(n-y)!} \mu^y (1-\mu)^{n-y}$.
Poisson: count data, $f(y) = \mu^y e^{-\mu}/y!$.
Poisson with offset: counts per unit scale $t$, $f(y) = (\mu t)^y e^{-\mu t}/y!$.

All are members of the exponential family of distributions: $f(y; \theta, \phi) = \exp\{(y\theta - b(\theta))/a(\phi) + c(y, \phi)\}$. $\theta$, a function of the original parameters, is known as the canonical link.

Model selection and measure of fit

Fit is expressed by the deviance $= 2(\ell_{\text{full}} - \ell_{\text{model}})$, a difference of log-likelihoods, or by the generalised Pearson $\chi^2 = \sum_i (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i)$; both are asymptotically chi-squared. Pearson and deviance residuals are defined as the contributions of individual observations.

A common problem with Poisson and binomial models is over-dispersion, where the variance is greater than the assumed distribution implies. Can use a quasi-likelihood model, where the variance is modelled independently of the underlying distribution.

PROC MIXED

Uses REML and Newton-Raphson; the first iteration uses Fisher scoring.

Syntax

PROC MIXED (options);
CLASS ...;
MODEL ...;
RANDOM ...;
REPEATED ...;
LSMEANS ...;
RUN;
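For concreteness, a covariance pattern model sketch (dataset and variable names hypothetical):

proc mixed data=rmdata;
  class treat patient visit;
  model y = treat visit treat*visit / ddfm=satterth;
  repeated visit / subject=patient type=cs; /* compound symmetry within patient */
  lsmeans treat / pdiff cl;
run;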

Convergence

Check the model has converged and the criterion is close to 0. If the -2 Res Log Like value is very large and negative, the covariance matrix is likely singular: results will be invalid, so respecify the model omitting random effects. If the Hessian matrix is not positive definite, a local maximum may have been reached, and a grid search will be needed.

Asymptotic covariance matrix of estimates

Gives a rough indication of the correlation between variance components. Only asymptotically normal, and only accurate with large DF. Using the EMPIRICAL option gives empirical estimates instead.

MODEL statement

MODEL dependent = fixed effects / options;
Use * for interactions (written as . in these notes). Non-linear terms need to be defined in a DATA step.

Useful options:

RANDOM statement

Used to specify random effects/coefficients $\mathbf{\beta}$ , and form of variance matrix $\mathbf{G}$ .

RANDOM random effects / options

Useful options:

REPEATED statement

Specifies correlation structure of residual matrix $\mathbf{R}$ , used for covariance pattern models.

REPEATED repeated effect / SUBJECT = blocking effect TYPE = pattern options

Covariance types:

Useful options:

When the REPEATED statement is used alone, the residual DF is incorrect for between-patient comparisons of fixed effects. If a compound symmetry structure is required, can instead fit subjects (nested in treatment) as RANDOM effects.

LSMEANS statement

Calculates least squares mean estimates of specified fixed effects.
LSMEANS fixed effects / options

Useful options: