Introduction

Statistical models summarise relationships between variables. Regression models focus on one variable (response) and how it can be modelled by other explanatory variables. In general, response = function of explanatory variables + random scatter. We need to describe function and scatter. Balance between simplicity and accuracy.

Our approach:

Role of graphics

Data cleaning

All real data contain errors. Data should be carefully checked prior to analysis, but suitable plots are likely to reveal gross errors.

Exploratory graphics

Basic plots

Single variable:

Two variables:

More than two variables:

Trellis graphics

Can generalise the coplot to condition on more than one variable. Easy if the variables are categorical, but if they're continuous we need to divide them up into subranges. Conditioning variables determine the layout of the cells; x/y variables determine the type of graph in each cell.

In R:

library(lattice)
xyplot(y ~ x | w * z, data=a.df)
xyplot(y ~ x | equal.count(w, number=4, overlap=0), data=a.df)

The multiple regression model

The regression model assumes:

The relationship $\mu = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k$ can be visualised as a plane, determined by the $\beta$'s. Data will be scattered above and below the plane. An alternative model that emphasises this is $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.

How can we tell when this model is suitable, ie. planar?

Coefficients

Interpretation

$\beta_0$ = average response when all covariates are 0. $\beta_1$ = slope of plane in $x_1$ direction, average change in response for unit change in $x_1$ when all others held constant, slope of coplot of $y$ vs. $x_1$

Estimation

Minimise least squares criterion. Done in R using lm function.
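
For example (placeholder variable names):

model <- lm(y ~ x1 + x2, data = a.df)   # least squares fit
summary(model)                          # coefficients, standard errors, t-tests, R^2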

How well does the plane fit?

Judge by looking at fitted values and residuals (difference between fitted and observed values). In a good fit, residuals will be small compared to the $y$ ’s and will be randomly distributed. For model to be useful, need strong relationship between response & explanatory variables.

How do we measure this? Think about predicting a new response.

If we don’t know the $x$ ’s, we’d predict $y^{*} = \bar{y}$ , with error $\frac{1}{n-1} \sum{(y_i - \bar{y})^2}$ . If we do know the $x$ ’s, then we’d predict $y^{*} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$ , with error $\frac{1}{n-1}\sum(y_i - \hat{y}_i)^2$ = $ \frac{1}{n-1}\sum e_i^2$ .

The ratio of these prediction errors is $RSS / TSS$, where $RSS$ = residual sum of squares and $TSS$ = total sum of squares. The smaller this value is, the better the fit. RSS + RegSS = TSS, which implies $RSS / TSS$ must lie between 0 and 1.

Another way is to look at $RegSS / TSS$ or $1 - RSS / TSS$; this is known as the coefficient of determination, or $R^2$. Clearly a big $R^2$ is desirable: 1 means a perfect fit, 0 means the plane explains nothing (a flat plane at $\bar{y}$).

$\sigma^2$ is also related to $R^2$ : more scatter equals poorer fit. $\sigma^2$ is estimated by $RSS / (n-k-1)$ .

Inference

Imagine we keep the x's fixed, but resample the errors. The new fit would give us some idea of the variability of the coefficients. The variability depends on the arrangement of the x's (more correlation among them = more variability) and on $\sigma^2$.

Testing if single variable is required

Often we want to know if a particular variable is necessary in a given model (this is not the same as asking if a variable is related to the response). We can test this by comparing the ratio of a coefficient and its standard error to a t-distribution.

Testing if any variable is required

Can test whether any variable is necessary/useful by looking at F-value (automatically generated by R).

Testing if subset is required

Often want to test if a smaller model is as good. Compare the $RSS$ of the two models: $F = (RSS_{sub} - RSS_{full}) / (d s^2)$, where $d$ is the number of variables dropped and $s^2$ is the estimate of $\sigma^2$ from the full model. Use anova in R to do the calculations.
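
In R (hypothetical model names):

anova(sub.model, full.model)   # F-test comparing the reduced and full models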

Prediction

Can predict a new response value from our model by substituting in the values of the explanatory variables. The prediction standard error is $\sqrt{Var(\hat{y}) + \sigma^2}$. We normally give a 95% prediction interval of $\pm$ ~2 se's. In R:

predict(model, data, se.fit=T, interval="prediction")

Can estimate the average response similarly, but the confidence intervals will be narrower because the variance does not include $\sigma^2$ (ie. standard error = $\sqrt{Var(\hat{y})}$).

predict(model, data, se.fit=T, interval="confidence")

Collinearity

When two explanatory variables are strongly related we have collinearity, which increases the standard errors of our coefficients.

\[ var(\hat{\beta}_j) = \frac{\sigma^2 / (n-1)}{var(x_j)(1-R_j^2)} \]

Where $R_j^2$ is the $R^2$ from regressing $x_j$ against all the other explanatory variables. If $x_j$ is orthogonal to the other variables, then $R_j^2 = 0$, and the variance is $\sigma^2 / ((n-1)\,var(x_j))$. The factor $1 / (1- R_j^2)$ represents the increase in variance due to linear dependence, and is called the variance inflation factor (VIF). VIF = var(variable) / var(residuals from regressing it on the others), and is given by the diagonal of the inverse of the correlation matrix.

vif <- diag(solve(cor(X[,-5])))

If one or more variables has a high VIF, the regression is said to be collinear. This results in imprecise estimation of the regression coefficients, high standard errors, and non-significant variables. If a variable is non-significant it is either because it is not related to the response, or because its information has already been explained by the other variables.

Partial regression plots/Added variable plots

Plot the residuals from the regression of y on everything except x vs. the residuals from regressing x on the other explanatory variables (unexplained variation in Y vs. variation in X not explained by the others). A relationship in the graph indicates a relationship between the two sets, ie. a relationship between x and the response not accounted for by the other variables.

added.variable.plots(model) (from 330 code)

Things to note:

Diagnostics

General modelling cycle is:

  1. fit
  2. examine residuals
  3. transform if necessary
  4. repeat above until happy

Reasons for bad fit:

Detecting non-planar data

Residual plots

Any pattern in residuals suggestive of non-planarity.

Partial residual plots

Suppose the true model is $y = \beta_0 + \beta_1 x + g(z) + \epsilon = \beta_0 + \beta_1 x + \beta_2 z + g(z) - \beta_2 z + \epsilon$. If we fit a linear model using $x$ and $z$ and the errors aren't too large, the residuals satisfy $e \approx g(z) - \beta_2 z$, or $g(z) \approx e + \beta_2 z$. If we define the partial residuals for z as $e_z = e + \hat{\beta}_2 z$, then plotting $e_z$ vs $z$ should reveal the shape of g. Called a partial residual plot; very similar to the added variable plot and has similar least squares properties.

Use 330 function partial.res.plots.

Additive models

Can use the more general class of generalised additive models (GAMs) to estimate the smoothed function directly; use gam in R, eg.

library(mgcv)   # gam lives in the mgcv (or gam) package
model.gam <- gam(y ~ x + s(z), data=a.df)
plot(model.gam)

Fixing non-planar data

If the data isn’t planar, we need to transform it so that it is – we need a function the same shape as the pattern we see in the partial residual or gam plot. Use trial and error, looking at $R^2$ and residuals plots to measure success.

Identifying non-constant scatter

Look at residuals. Common problem is funnel effect, where $\sigma^2 \propto \mu$ .

Fixing non-constant scatter

Can either transform the response (ie. take reciprocal or log) or estimate variances of observations and use weighted least squares regression.

Weighted least squares

Instead of minimising $\sum r_i^2$, minimise the weighted sum of squares $\sum r_i^2 / v_i$, where the ith observation has variance $v_i \sigma^2$. How do we calculate the $v_i$?

  1. plot squared residuals
  2. smooth
  3. estimate variance of observation using smooth function
  4. weight = 1 / estimated variance

Can use 330 function est.var.function(model) to calculate weights, and then supply them to lm using weights argument.
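
A minimal sketch of the recipe above in base R (placeholder names; lowess used for the smooth):

fit1  <- lm(y ~ x, data = a.df)
sm    <- lowess(fitted(fit1), residuals(fit1)^2)      # smooth the squared residuals
v.hat <- approx(sm$x, sm$y, xout = fitted(fit1))$y    # estimated variance for each observation
fit.w <- lm(y ~ x, data = a.df, weights = 1 / v.hat)  # weight = 1 / estimated variance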

Identifying outliers and high leverage points

An outlier has extreme $y$ -value. A high leverage point has extreme $x$ -values.

The effect of an outlier depends on its leverage:

It can be shown that $\hat{y}_i = h_{i1} y_1 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$, where the $h$'s measure the influence that each $y_j$ has on the ith fitted value. The diagonal terms $h_{ii}$ are the hat matrix diagonals (HMDs), and can also be interpreted as the distance between the $x_i$'s and the average $x$ values.

Each HMD lies between 0 and 1. The average HMD is $p/n$; values greater than 3 times this are considered extreme.

Identifying outliers

If its leverage is low, an outlier will normally have a large residual, but if its leverage is high the residual may be small. How do we identify outliers with high leverage? We can drop the point and refit the regression: if the coefficients change a lot it's probably an outlier, or influential point.

Can calculate a number of measures with leave-one-out techniques by looking at change in key regression quantities (coefficients, fitted values, etc.).

In R:
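
For example, with the standard base R functions (model is the fitted lm):

influence.measures(model)   # dfbetas, dffits, covratio, Cook's distance and hat values together
cooks.distance(model)       # overall influence of each point on the fitted values
hatvalues(model)            # hat matrix diagonals (leverage)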

Identifying dependence

An important assumption is that the data are independent; if this is not so, standard errors will be incorrect. Usually talk about autocorrelation, which can be positive or negative. Look at plots of:

Durbin-Watson is a formal test for significance. It assumes an AR(1) model for the errors, and tests the hypothesis that $\rho = 0$. $\rho$ is estimated by $\hat{\rho} = \sum e_i e_{i-1} / \sum e_i^2$, and the test statistic is $DW \approx 2(1-\hat{\rho})$, with $0 \le DW \le 4$. Values of DW close to 0 indicate positive autocorrelation, values close to 4 negative autocorrelation, and values near 2 no autocorrelation. Exact critical values depend on $p$ and $n$.

Identifying non-normality

The normality assumption is not so critical, but longer tailed distributions will imply more outliers. It is also important for prediction intervals. Use a qqplot; points should fall on a straight line (j-shape = right skew, s-shape = short tails). Can use the Weisberg-Bingham test for a formal hypothesis test: WB = square of the correlation of the normal plot, values close to 1 indicate normality.

In R, use 330 function WB.test.

Remedies for non-normality

Transform the response, using a Box-Cox plot to suggest a suitable power.
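
For example, using boxcox from the MASS package (model a fitted lm):

library(MASS)
boxcox(model)   # profile log-likelihood against the power; the peak suggests a transformation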

Variable selection

Often many variables to select from to build model, which ones should we use?

If we use too many, including some unrelated to the response, then we are overfitting: model not good for prediction (large error), model too elaborate, coefficient standard errors often inflated.

If we use too few, leaving out useful variables, we are underfitting: model not good for prediction (biased), regression coefficients biased, overestimate error variance.

If we have $p$ variables and a constant term, then we have $2^p - 1$ possible additive models. How do we select which one is best? There are two approaches: all possible regressions and stepwise methods.

APR

Define criterion of “model goodness” which balances simplicity and quality of fit. Calculate criterion for every possible model, then choose best based on criterion.

Possible criteria:

$R^2$

Big values good. Since $R^2$ always increases with increasing variables, will always select model with all variables, but is ok for comparing models with same number of variables. Can adjust $R^2$ to penalise complicated models: $\bar{R}_p^2 = 1 - \frac{n-1}{n-p-1}(1-R^2_p)$ .

Residual mean square ($\hat{\sigma}^2$)

Small values good. Equivalent to choosing largest adjusted $R^2$ .

Mallow’s Cp

Estimate of prediction quality. $C_p = \frac{RSS_p}{EMS_{full}} + 2(p+1) - n$. Values around $p + 1$ are good (note: for the full model, $C_p = p+1$ exactly).

If p-variable model contains all important information, then $RSS_p \approx (n-p-1)\sigma^2$ , and $EMS \approx \sigma^2$ , so $C_p \approx p +1$ .

Cp plot: for each model we plot Cp vs. p, with line Cp = p +1. Models close to line are good.

AIC and BIC

$AIC = n \log(RSS_p) + 2p$. Small is good. $BIC = RSS_p / EMS_{full} + p \log(n)$. Small is good; tends to favour smaller models than AIC.
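
Outside the 330 code, one way to compute these criteria for every subset size is the leaps package (a sketch with placeholder data):

library(leaps)
rs <- summary(regsubsets(y ~ ., data = a.df, nvmax = 10))
rs$adjr2   # adjusted R^2 for the best model of each size
rs$cp      # Mallow's Cp
rs$bic     # BIC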

Stepwise methods

In R, with 330 function stepwise(model, direction="forward")
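
Base R's step function is an alternative; it selects by AIC rather than p-values (placeholder formulas):

null.model <- lm(y ~ 1, data = a.df)
full.model <- lm(y ~ ., data = a.df)
step(null.model, scope = formula(full.model), direction = "forward")
step(full.model, direction = "backward")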

Backwards elimination

  1. Start with full model
  2. Find variable with largest p-value
  3. Remove it and refit model
  4. Repeat until the largest remaining p-value is below some cutoff (usually 0.10)

Forward selection

  1. Start with null model
  2. Choose variable with greatest partial correlation coefficient
  3. Add and refit model
  4. Repeat until the best candidate's p-value exceeds some cutoff (around 0.10)

Full stepwise

Combination of forward and backwards, one step forward, one step back. Stop when no change.

Case study

Modelling cycle:

  1. Graphical check (suitability for regression, gross outliers)
  2. Model selection (APR, stepwise etc)
  3. Transformation, removal of points, if required
  4. Use model for inference

Factors

Up to now have used numerical variables, but can use categorical too, called factors. Regression will assign a value to each level of the factor, equivalent to specifying a different intercept for each level. Can use interaction terms (:) to specify that the model should use different slopes as well.
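
For example, with a numeric variable x and a factor f (placeholder names):

lm(y ~ x + f, data = a.df)         # separate intercept for each level of f
lm(y ~ x + f + x:f, data = a.df)   # separate slopes as well (equivalent to y ~ x * f)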

Anova models

Regression models where all explanatory variables are categorical. Assume response normally distributed around mean determined by factor levels. Other standard assumptions (equal variances) still apply. Can write in offset or dummy variable forms. R expects first factor level to be baseline, so may need to rearrange level order.

F-test tests for equality of means. HSD plot (HSD.plot(lm())) displays differences graphically.

Two-factor models

Usually want to split each cell-mean into:

Use plot.design(df) to look at level means, interaction.plot(factor1, factor2, response) to look at interactions, and anova(lm) for formal significance tests.

Models with many continuous and categorical variables

General approach is to fit a separate regression on the continuous variables for each combination of categorical variable levels. This will produce many possible models, and variable selection techniques are needed. (Remember: a high-order interaction requires its lower-order terms; if there is not enough data, drop the high-order interactions.)

Logistic regression

Binary data

Grouped or ungrouped (in R, use tapply to go from ungrouped to grouped). Grouped data has form covariates, number of successes ($r$ ) and number of trials ($n$ ). Probability of success = $r / n$ . This is the binomial distribution!

Relationship between response and covariates: $\pi = \frac{e^{\eta}}{1+ e^{\eta}}$, where $\eta$ is the linear predictor $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. (The logit is used to avoid nasty boundary problems.) Called logistic regression. Model coefficients are estimated using maximum likelihood (the numerical method used in R is iteratively reweighted least squares), implemented by glm.

Grouped: glm(cbind(successes, failures) ~ covariates, family = binomial)
Ungrouped: glm(response ~ covariates, family = binomial)

Probability, odds and log-odds

Probability: $\pi = e^{\alpha + \beta x} / (1 + e^{\alpha + \beta x})$
Odds: $\pi / (1 - \pi) = e^{\alpha + \beta x}$
Log odds: $\alpha + \beta x$

$\beta$ measures effect of unit change of x: increases log odds by $\beta$ , increases odds by factor of $e^\beta$ .
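
For example, a fitted model's coefficients can be put on the odds scale directly (model a hypothetical fitted glm):

exp(coef(model))              # multiplicative change in odds per unit change in each covariate
exp(confint.default(model))   # Wald-type confidence intervals on the odds scale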

Prediction

Probability: predict(model, new.data, type="response", se.fit=T)
Log odds: predict(model, new.data, se.fit=T)

Plotting estimated probability

Easiest with grouped data:
plot(x, r/n)
lines(x, predict(model, type="response"))
A bit harder with ungrouped (need to order by x so the line is drawn correctly):
plot(x, occurred)
ord <- order(x)
lines(x[ord], predict(model, type="response")[ord])

Inference

Providing there’s enough data, estimated coefficients will be approx. normal. ML gives ways of estimating standard errors. Therefore tests and confidence intervals proceed exactly as for linear regression.

Deviance

The model has 2 parts: the binomial assumption ($r_i \sim Bin(n_i, \pi_i)$) and the logistic assumption (the logit of $\pi$ is linear). If we only make the binomial assumption, we have the most general (maximal) model, since we put no restrictions on the probabilities. Maximising this (ie. $\hat{\pi}_i = r_i / n_i$) gives the maximum possible value of the likelihood, $L_{max}$. If we compare this to the likelihood from the logistic model, $L_{mod}$, then the residual deviance $= 2(\log L_{max} - \log L_{mod}) \sim \chi^2_{m-k-1}$, where m is the number of covariate patterns and k the number of covariates. If 1 - pchisq(deviance, df) is small, the model is significantly different from the maximal model. (Only applies to grouped data, with small m and large n.)
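
In R, the goodness-of-fit p-value against the maximal model is (model a fitted glm on grouped data):

1 - pchisq(deviance(model), df.residual(model))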

At the other extreme, the most restrictive (null) model has all probabilities equal, $\pi_i = \pi$, estimated by total successes / total trials. The deviance of this model is called the null deviance. Use it to see if any model will be useful.

Also use deviances when comparing models. If two nested models fit the data equally well, then the difference in deviances will be $\chi^2$ with df equal to the difference in the number of parameters. (Works for both grouped and ungrouped data.)
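
In R (hypothetical nested models):

anova(smaller.model, larger.model, test = "Chisq")   # deviance-difference test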

Residuals

Two kinds: Pearson and deviance residuals.

Diagnostics

Influential points:

Non-planarity

For a single x, plot $logit(r_i / n_i)$ vs. x. For multiple x's, plot the Pearson residuals vs. each x. For either, use a GAM.

If covariate groups mostly have only one member, not much can be done.

Over/under dispersion

The variance of the binomial distribution is $np(1-p)$, which is always less than the mean. Sometimes the individuals with the same covariate pattern are positively (or negatively) correlated; this will result in the variance being greater (or less) than $np(1-p)$. This is called over (or under) dispersion, and standard errors will be incorrect. Can use the quasibinomial family, which estimates a scale parameter from the data.
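
For example, for grouped data (placeholder names):

glm(cbind(r, n - r) ~ x, family = quasibinomial, data = a.df)   # dispersion estimated from the data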

Other points

ANOVA

Construction and interpretation same as for linear ANOVA, but in terms of log-odds rather than means. Remember to convert parameter estimates to odds form!

Variable selection

Proceeds as for linear regression. AIC = deviance + 2 $\times$ (number of parameters). Fit the full model, then use stepwise and anova to select a simpler model if appropriate.

Poisson regression

The response is a count, with a Poisson distribution, mean $\mu$. $\log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. Parameters are estimated by maximum likelihood (ML) with IRLS.

In R: glm(response ~ variables, family=poisson, weight=weight)

Count data is often collected as rates, or the number of events occurring in a given population. Could use a binomial model, but more commonly use offsets. $\mu = \text{rate} \times \text{population size}$, so $\log(\mu) = \log(\text{rate}) + \log(\text{size})$. $\log(\text{size})$ is just another variable in the model, but its regression coefficient is fixed at 1. In R: glm(response ~ variables, family=poisson, offset=log(population))

Contingency tables

Arise when we classify individuals into categories using one or more criteria. Two sampling models are used for contingency tables: multinomial and Poisson.

Multinomial model

Assume: table has $m$ cells, $n$ individuals classified independently with probability $\pi_i$ of being in cell $i$ , $\sum_i \pi_i = 1$ , $Y_i$ number of individuals in cell $i$ .

Then $Y_1, ..., Y_m$ has multinomial distribution: $P(Y_1 = y_1, ..., Y_m = y_m) = \frac{n!}{y_1! y_2! ... y_m !}\pi_1^{y_1}...\pi_m^{y_m}$ .

If there are no restrictions on the $\pi$'s, then the log-likelihood ($\sum_i y_i \log(\pi_i) + \text{constant}$) will be maximised when $\hat{\pi}_i = y_i / n$. Can test this maximal likelihood against a more constrained model using the usual $\chi^2$ test.

Poisson model

Assume each cell count $Y_i \sim$ Poisson($\mu_i$); this is equivalent to a Poisson regression model. The maximal and null likelihoods are the same as for the multinomial, so the models are equivalent. Can estimate the multinomial parameters by fitting a Poisson model, where $\pi_i = \mu_i / \sum \mu_i$. Can test hypotheses about the multinomial by testing equivalent Poisson hypotheses.

For example, can test for independence of the classifying factors by looking at the interaction term in the Poisson model.
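
A sketch, assuming the table is held in long format with factors A, B and cell counts:

indep <- glm(count ~ A + B, family = poisson, data = tab.df)
satur <- glm(count ~ A * B, family = poisson, data = tab.df)
anova(indep, satur, test = "Chisq")   # tests the A:B interaction, ie. independence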

Product multinomial sampling model

So far have assumed sampling at random from single population, all factors were “responses”, and were examining joint distribution. Sometimes we want to think of a factor (A) as representing different populations, and want to compare distribution of another factor (B) between populations. We want to look at the conditional distribution of B given A.

Can compare visually using bar or dot plot.

In general, populations may be defined as level combinations of 2 or more "conditioning factors" (C & D), and we want to compare the joint distribution of the "response factors" (A & B) across them. Want to check if the joint distribution of A & B conditional on C & D is the same for all populations. Do this by comparing the maximal model to the model where the joint distribution of A & B is independent of C & D (ie. a model containing the A*B and C*D terms but no interactions between them).

General solution

This is an example of a more general type of problem: does our data fit a certain distribution? Basic solution: create a contingency table comparing observed and expected counts under the distributional hypothesis, then use a likelihood ratio (LR) test.

eg. $LR = 2\left(\sum o_i \log(o_i) - \sum o_i \log(e_i)\right) = 2 \sum o_i \log(o_i / e_i)$.
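
A sketch with observed counts o and expected counts e (placeholder vectors):

lr <- 2 * sum(o * log(o / e))
1 - pchisq(lr, df)   # df = number of cells - 1 - number of estimated parameters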

Related topics

Association reversal

Regression coefficients can change sign when new variables are added to the model. A similar thing happens in contingency tables when collapsing over variables. Known as Simpson's paradox, or association reversal. It is dangerous to collapse over variables strongly related to the remaining variables; better to look at conditional tables rather than marginal (collapsed) tables.