Introduction
Statistical models summarise relationships between variables. Regression models focus on one variable (response) and how it can be modelled by other explanatory variables. In general, response = function of explanatory variables + random scatter. We need to describe function and scatter. Balance between simplicity and accuracy.
Our approach:
- explore data using graphics and summary statistics
- construct useful model (iterative)
- use model to gain knowledge about system
- communicate findings
Role of graphics
- data cleaning: locate outliers, gross errors, special codings etc.
- exploration: what type of model could be useful?
- diagnostics: what is wrong with our model?
Data cleaning
All real data contain errors. Data should be carefully checked prior to analysis, but suitable plots are likely to reveal gross errors.
Exploratory graphics
Basic plots
Single variable:
- histogram/density plot
- qqplots
Two variables:
- scatterplot (may need multiple plot to distinguish different groups)
- boxplots
More than two variables:
- cloud plot (brush & spin etc.)
- coplots (show relationship between x and y, conditioned on z)
Trellis graphics
Can generalise the coplot to condition on more than one variable. Easy if the variables are categorical, but if they’re continuous we need to divide them up in to subranges. Conditoning variables determine layout of cells. x/y variables determine type of graph in each cell.
In R:
library(lattice) xyplot(y ~ x | w * z, data=a.df) xyplot(y ~ x | equal.count(w, number=4, overlap=0)
The multiple regression model
The regression model assumes:
- responses normally distributed with means $\mu$ and constant variance $\sigma^2$
- mean response depends on linear combination of covariates
The relationship $\mu = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k$ can be visualised as a plane, determined by the $\beta$ ’s. Data will be scattered above and below the plane. An alternative model than emphasises this is $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \epsilon$ , where $\epsilon$ ~ N(0, $\sigma^2$ ).
How can we tell when this model is suitable, ie. planar?
- $k=2$ , use spinner and look for edge
- $k \ge 2$ , use coplots, and make sure all plots are parallel
Coefficients
Interpretation
$\beta_0$ = average response when all covariates are 0. $\beta_1$ = slope of plane in $x_1$ direction, average change in response for unit change in $x_1$ when all others held constant, slope of coplot of $y$ vs. $x_1$Estimation
Minimise least squares criterion. Done in R using lm
function.
How well does the plane fit?
Judge by looking at fitted values and residuals (difference between fitted and observed values). In a good fit, residuals will be small compared to the $y$ ’s and will be randomly distributed. For model to be useful, need strong relationship between response & explanatory variables.
How do we measure this? Think about predicting a new response.
If we don’t know the $x$ ’s, we’d predict $y^{*} = \bar{y}$ , with error $\frac{1}{n-1} \sum{(y_i - \bar{y})^2}$ . If we do know the $x$ ’s, then we’d predict $y^{*} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$ , with error $\frac{1}{n-1}\sum(y_i - \hat{y}_i)^2$ = $ \frac{1}{n-1}\sum e_i^2$ .
The ratio of these prediction errors is $RSS / TSS$ where $RSS$ = residual sum of squares and $TSS$ = total sum of squares. The smaller this value is, the better the fit. RSS + RegSS = TSS, which implies RSS must lie between 0 and 1.
Another way is to look at $RegSS / TSS$ or $1 - RSS / TSS$ , this is known as the coefficient of determination, or $R^2$ . Clearly a big $R^2$ is desirable, 1 means a perfect fit, 0 means a flat plane.
$\sigma^2$ is also related to $R^2$ : more scatter equals poorer fit. $\sigma^2$ is estimated by $RSS / (n-k-1)$ .Inference
Imagine we keep the x’s fixed, but resample the errors. The new fit would give us some idea of the variability of the coefficients. The variability depends on the arrangement of x’s (more correlation = more change) and the $\sigma^2$ .
Testing if single variable is required
Often we want to know if a particular variable is necessary in a given model (this is not the same as asking if a variable is related to the response). We can test this by comparing the ratio of a coefficient and its standard error to a t-distribution.
Testing if any variable is required
Can test whether any variable is necessary/useful by looking at F-value (automatically generated by R).
Testing if subset is required
Often want to test if a smaller model is as good. Compare $RSS$
of the models. $F = (RSS_sub - RSS_full) /ds^2$
, where $d$
is number of variable dropped, $s^2$
estimate of $\sigma^2$
from full model. Use anova
in R to do calculations.
Prediction
Can predict a new response value from our model by substituting in the values of the response variables. Prediction error is calculated from $\sqrt{Var(predictor) + \sigma^2}$ . We normally give a 95% confidence interval of $\pm ~2$ se’s. In R:
predict(model, data, se.fit=T, interval="prediction")
Can estimate average response similarly, but confidence intervals will be smaller because variance does not include sigma (ie. standard error = $\sqrt{Var(predictor)}$ .
predict(model, data, se.fit=T, interval="confidence")
Collinearity
When two explanatory variables are strongly related we have collinearity, which increases the standard errors of our coefficients.
\[ var(\hat{\beta}_j) = \frac{\sigma^2 / (n-1)}{var(x)(1-R_j^2)} \]Where $R_j$ is $R^2$ from regressing $x_j$ against all the other explanatory variables. If $x_j$ is orthogonal to the other variables, then $R_j = 0$ , and the variance is $\sigma^2 / (n-1) / var(x)$ . The factor $1 / (1- R_j^2)$ represents increase in variance from linear dependence, and is called the variance inflation factor (VIF). VIF = var(variable) / var(residuals), and is given by diagonal of inverse of correlation matrix.
vif <- diag(solve(cor(X[,-5])))
If one or more variables has high VIF, regression said to be collinear. Results in imprecise estimation of regression coefficients, high standard errors, and non-significant variables. If variable non-significant it is because either is not related to explanatory, or information has already been explained.
Partial regression plots/Added variable plots
Plot residuals from regression without x vs. residuals from full model. (unexplained variation in Y vs. variation in X not explained by others). Relationship in graph indicates relationship between the two sets, ie. relationship between x and response not accounted for by other variables.
added.variable.plots(model)
(from 330 code)
Things to note:
- slope = partial correlation coefficient
- residuals from line of best, same as residuals from full model
- 0 intercept
Diagnostics
General modelling cycle is:
- fit
- examine residuals
- transform if necessary
- repeat above until happy
Reasons for bad fit:
- non-planar data (major problem, coefficients effectively meaningless)
- non-constant scatter
- errors not independent
- outliers
- non-normal residuals
Detecting non-planar data
- plot residuals versus fitted values, and residuals vs. variables
- partial residual plots
- fitting additive models
Residual plots
Any pattern in residuals suggestive of non-planarity.
Partial residual plots
Suppose true model is $y = \beta_0 + \beta_1 x + g(z) + \epsilon = \beta_0 + \beta_1 x + \beta_2 z + g(z) - \beta_2 z + \epsilon$ . If we fit a linear model using $x$ and $z$ and the errors aren’t too large $\epsilon \approx g(z) - \beta_2(z)$ , or $g(z) \approx e + \beta_2 z$ . If we define the partial residuals for z as $e + \hat{\beta}_2 z$ , then plotting $e_z$ vs $z$ should reveal shape of g. Called partial residual plot, very similar to added variable plot and has similar least square properties.
Use 330 function partial.res.plots
.
Additive models
Can use more general class of additive general models to estimate smoothed function directly, use gam
in R. eg.
model.gam <-(y ~ x + s(z))
plot(model.gam)
Fixing non-planar data
If the data isn’t planar, we need to transform it so that it is – we need a function the same shape as the pattern we see in the partial residual or gam plot. Use trial and error, looking at $R^2$ and residuals plots to measure success.
Identifying non-constant scatter
Look at residuals. Common problem is funnel effect, where $\sigma^2 \propto \mu$ .
Fixing non-constant scatter
Can either transform the response (ie. take reciprocal or log) or estimate variances of observations and use weighted least squares regression.
Weighted least squares
Instead of minimising $\sum r_i^2$ , minimise weighted residuals: $\sum r_i^2 / v_i$ , where ith observation has variance $v_i \sigma_2$ . How do we calculate $v_i$ ?
- plot squared residuals
- smooth
- estimate variance of observation using smooth function
- weight = 1 / estimated variance
Can use 330 function est.var.function(model)
to calculate weights, and then supply them to lm
using weights
argument.
Identifying outliers and high leverage points
An outlier has extreme $y$ -value. A high leverage point has extreme $x$ -values.
Effect on an outlier depends on its leverage:
- high leverage: can attract fitted plane, may have small residual, can increase $R^2$
- low leverage: doesn’t distort fit as much, usually has big residual, increases standard errors, decreases $R^2$
Can be shown the $y_i = h_{i1} y_1 + \cdots + h_ii y_i _ \cdots + h_{in} y_n$ , where the the $h$ ’s are the hat matrix diagonals, and measure the influence that $y_j$ has on the ith fitted value. Can also be interpreted as distance between $x_i$ ’s and average $x$ values.
Each HMD lies between 0 and 1. Average HMD = $p/n$ , > than 3 times this considered extreme.
Identifying outliers
If influence low, then outlier will normally have large residual, but if influence is high, the residual may be small. How do we identify outliers with high leverage? We can drop the point and refit the regression. If the coefficients change a lot it’s probably an outlier, or influential point.
Can calculate a number of measures with leave-one-out techniques by looking at change in key regression quantities (coefficients, fitted values, etc.).
- Standardised difference in coefficients. $|\hat{\beta}_j - \hat{\beta}_j(-1)| / se(\hat{\beta}_j)$ . Problem when $\gt 2 / \sqrt{n}$ .
- Standardised difference in fitted values. $|\hat{y}_j - \hat{y}_j(-1)| / se(\hat{y}_i)$ . Problem when $\gt 3 \sqrt{p / (n-p)}$ .
- Cov ratio. Change in standard errors of coefficients. Problem when > $1 + 3p/n$ , or < $1 - 3p/n$ .
- Cook’s D. Overall change in coefficients. Problem when in upper 10% of F distribution
In R:
- leverage-residual plot:
lrplot(model)
- influence measures:
influence.measures(model)
- Cook’s D plot:
influence.plots(educ.lm)
Identifying dependence
Important assumption is that data is independent, if this is not so standard errors will be incorrect. Usually talk about autocorrelation, which can be positive or negative. Look at plots of :
- residuals over time
- residuals vs. previous residual
- correlograms
Durbin-Watson is formal test for significance. Assumes AR (1) model, and tests hypothesis that $\theta = \rho = 0$ . $\rho$ estimated by $\sum e_i e_{i-1} / \sum e_i^2$ , and test statistic is $DW \approx 2(1-\hat{\rho})$ . $0 \lt DW \lt 4$ , values of DW close to 0 indicate negative autocorrelation, values close ot 4 positive. Exact values based on $p$ and $n$ .
Identifying non-normality
Normality assumption not so critical, but longer tailed distributions will imply more outliers. Also important for prediction errors. Use qqplot
, and should fall in straight line ( j-shape = right skew, s-shape = short tailed). Can use Weisberg-Bingham test for formal hypothesis test, WB = square of correlation of normal plot, values close to 1 indicate normality.
In R, use 330 function WB.test
.
Remedies for non-normality
Transform response, using box-cox plot to suggest suitable power.
Variable selection
Often many variables to select from to build model, which ones should we use?
If we use too many, including some unrelated to the response, then we are overfitting: model not good for prediction (large error), model too elaborate, coefficient standard errors often inflated.
If we use too few, leaving out useful variables, we are underfitting: model not good for prediction (biased), regression coefficients biased, overestimate error variance.
If we have $p$ variables and a constant term, then we have $2^p - 1$ possible additive models. How do we select which one is best? There are two approaches: all possible regressions and stepwise methods.
APR
Define criterion of “model goodness” which balances simplicity and quality of fit. Calculate criterion for every possible model, then choose best based on criterion.
Possible criteria:
- $R^2$
- residual mean square
- Mallow’s Cp
- AIC and BIC
$R^2$
Big values good. Since $R^2$ always increases with increasing variables, will always select model with all variables, but is ok for comparing models with same number of variables. Can adjust $R^2$ to penalise complicated models: $\bar{R}_p^2 = 1 - \frac{n-1}{n-p-1}(1-R^2_p)$ .
Residual mean square ($\hat{\sigma}$ )
Small values good. Equivalent to choosing largest adjusted $R^2$ .
Mallow’s Cp
Estimate of prediction quality. $C_k = \frac{RSS_k}{EMS_full} + 2(k +1) - n$ . Values around $p + 1$ are good (note: $C_p = p+1$ )
If p-variable model contains all important information, then $RSS_p \approx (n-p-1)\sigma^2$ , and $EMS \approx \sigma^2$ , so $C_p \approx p +1$ .
Cp plot: for each model we plot Cp vs. p, with line Cp = p +1. Models close to line are good.
AIC and BIC
$AIC = n log(RSS_p) + 2p$ . Small is good. $BIC = RSS / EMS_full + p log(n)$ . Small is good. Tends to favour smaller models than AIC.Stepwise methods
In R, with 330 function stepwise(model, direction="forward")
Backwards elimination
- Start with full model
- Find variable with largest p-value
- Remove it and refit model
- Repeat until p-value < some cutoff (usually 0.10)
Forward selection
- Start with null model
- Choose variable with greatest partial correlation coefficient
- Add and refit model
- Repeat until p-value > some cutoff (around 0.10)
Full stepwise
Combination of forward and backwards, one step forward, one step back. Stop when no change.
Case study
Modelling cycle:
- Graphical check (suitability for regression, gross outliers)
- Model selection (APR, step wise etc)
- Transformation, removal of points, if required
- Use model for inference
Factors
Up to now have used numerical variables, but can use categorical too, called factors. Regression will assign a value to each level of the factor, equivalent to specifying a different intercept for each level. Can use interaction terms (:
) to specify that the model should use different slopes as well.
Anova models
Regression models where all explanatory variables are categorical. Assume response normally distributed around mean determined by factor levels. Other standard assumptions (equal variances) still apply. Can write in offset or dummy variable forms. R expects first factor level to be baseline, so may need to rearrange level order.
F-test tests for equality of means. HSD plot (HSD.plot(lm())
) displays differences graphically.
Two-factor models
Usually want to split each cell-mean into:
- overall baseline response, $\mu_11$ ,
- effect of factor A, $\mu_i1 - \mu_11$ ,
- effect of factor B, $\mu_1j - \mu_11$ ,
- how A & B interact, what’s left
Use plot.design(df)
to look at level means, interaction.plot(factor1, factor2, explanatory)
to look at interactions, anova(lm)
for formal significance tests.
Models with many continuous and categorical variables
General approach is to fit separate regressions to each continuous variable for each combination of categorical variables. Will produce many possible models, and need variable selection techniques. (Remember: need low order interaction for high level, if not enough data, drop high level interactions)
Logistic regression
Binary data
Grouped or ungrouped (in R, use tapply
to go from ungrouped to grouped). Grouped data has form covariates, number of successes ($r$
) and number of trials ($n$
). Probability of success = $r / n$
. This is the binomial distribution!
Relationship between response and covariates $ Y = \frac{e^{model formula}}{1+ e^{model formula}}$
. (Logit used to avoid nasty boundary problems). Called logistic regression. Estimate model coefficients using maximum likelihood (numerical method used in R: iteratively reweighted least squares), implemented by glm
).
Grouped: glm(cbind(sucesses, failures) ~ covariates, family = binomial)
Ungrouped: glm(response ~ covariates, family = binomial)
Probability, odds and log-odds
Probability: $\pi = e^{\alpha + \beta x} / (1 + e^{\alpha + \beta x})$ Odds: $\pi / (1 - \pi) = e^{\alpha + \beta x}$ . Log odds: $\alpha + \beta x$ .
$\beta$ measures effect of unit change of x: increases log odds by $\beta$ , increases odds by factor of $e^\beta$ .Prediction
Probability: predict(model, new.data, type="response", se.fit =T)
Log odds: predict(model, new.data, se.fit=T)
Plotting estimated probability
Easiest with grouped data:plot(response, r/n)
lines(response, predict(model, data.frame, type="response"))
A bit harder with ungrouped:
plot(response, occurred)
lines(sort(response), predict(model, sort(data.frame), type="response"))
Inference
Providing there’s enough data, estimated coefficients will be approx. normal. ML gives ways of estimating standard errors. Therefore tests and confidence intervals proceed exactly as for linear regression.
Deviance
Model had 2 parts: binomial assumption ($r ~ Bin(n, \pi)$
) and logistic assumption (logit of $\pi$
is linear). If we only make binomial assumption, we have most general model, since we put no restrictions on the probabilities. Maximising this (ie. $\pi_i = r_i / n_i$
) gives maximum value of likelihood, $L_max$
. If we compare this to the likelihood from logistic model, $L_mod$
, then residual deviance = $2(log L_max - log L_mod)$
~ $\chi^2_{m-k-1}$
, where m is number of covariate patterns, k number of covariates. If 1 - pchisq(deviance, df)
small, model is significantly different from maximal model. (Only applies to grouped data, with small m and large n)
Other extreme, most restrictive model has all probabilities $\pi_i = \pi = r / n$ . Deviance of this model is called null deviance. Use to see if any model will be useful.
Also use deviances when comparing models. If two models fit the data equally well, then the difference in deviances will be $\chi^2_{df difference}$ . (Works for both grouped and ungrouped data)
Residuals
Two kinds:
- Pearsons: useful for grouped only, similar to residuals in linear regression (actual – fitted), standardised to unit variance. $\frac{r_i - n_i \hat{\pi}_i}{\sqrt{n_i \hat{\pi}_i(1 - \hat{\pi})}}$
.
residuals(model.glm, type="pearson")
- Deviance: useful for both, measure contribution of each covariate pattern to total deviance. Big value indicates large influence.
residuals(model.glm, type="deviance")
Diagnostics
Influential points:
- Hat matrix diagonals: defined similarly as in linear models. Measures how outlying covariates are, but down-weighted according to estimated probability of observation
- Cook’s distance
- Leave-one-out deviance change
Non-planarity
For single x, plot $logit(r_i / n_i)$ vs. x For multiple x, plot Pearson residuals vs. each x For either, use GAM.
If covariate groups mostly have only one member, not much can be done.
Over/under dispersion
Variance of binomial distribution is $np(1-p)$ , which is always less than the mean. Sometimes the individuals with the same covariate pattern are positively (or negatively) correlated, this will result in the variance being greater (or less) than $np(1-p)$ . This is called over (or under) dispersion – and standard errors will be correct. Can use quasibinomial family which estimates scale parameter from data.
Other points
ANOVA
Construction and interpretation same as for linear ANOVA, but in terms of log-odds rather than means. Remember to convert parameter estimates to odds form!
Variable selection
Proceeds as for linear. AIC = $deviance - 2 * (number of parameters)$ . Fit full then use stepwise and anova to select simpler model if appropriate.
Poisson regression
Response is a count, with Poisson distribution, mean $\mu$ . $log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ . Parameters estimated by ML(maximum likelihood) with IRLS.
In R: glm(response ~ variables, family=poisson, weight=weight)
Count data often collected as with rates, or number of events occurring for a given population. Could use a binomial model, but more commonly use offsets. $\mu = rate * population size$
, $log(\mu) = log(rate) + log(size)$
. $log(size)$
is just another variable in model, but regression coefficient is fixed at 1. In R: glm(response ~ variables, family=poisson, offset=log(population))
Contingency tables
Arise when when classify individuals into categories using one or more criteria. Two sampling models used for contingency tables: multinomial and poisson.
Multinomial model
Assume: table has $m$ cells, $n$ individuals classified independently with probability $\pi_i$ of being in cell $i$ , $\sum_i \pi_i = 1$ , $Y_i$ number of individuals in cell $i$ .
Then $Y_1, ..., Y_m$ has multinomial distribution: $P(Y_1 = y_1, ..., Y_m = y_m) = \frac{n!}{y_1! y_2! ... y_m !}\pi_1^{y_1}...\pi_m^{y_m}$ .If there are no restrictions on the $\pi$ ’s, then the likelihood ($\y_i log(\pi_i) + \cdots $ ) will be maximised when $\pi_i = \y_i / n$ . Can test this maximal likelihood against more constrained model using usual $\chi^2$ test.
Poisson model
Assume: each cell is ~Poisson($\mu_i$ ), equivalent to Poisson regression model. Maximal and null likelihoods same as for multinomial – models are equivalent! Can estimate multinomial parameters by fitting a poisson model where $\pi_i = \mu_i / \sum \mu_i$ . Can test hypotheses about the multinomial by testing equivalent Poisson hypotheses.
For example, can test for independence of classifying factors by looking at interaction term in poisson model.
Product multinomial sampling model
So far have assumed sampling at random from single population, all factors were “responses”, and were examining joint distribution. Sometimes we want to think of a factor (A) as representing different populations, and want to compare distribution of another factor (B) between populations. We want to look at the conditional distribution of B given A.
Can compare visually using bar or dot plot.
In general, populations may be defined as level combinations of 2 or more “conditioning factors” (C & D) and want to compare joint distribution of “response factors” (A & B). Want to check if joint distributions of A & B conditional on C & D are the same. Do this by comparing maximal model to model where A & B is independent of C & D, ie. AB.
General solution
Example of a more general type of problem – does our data fit a certain distribution? Basic solution: create contigency table comparing observed and expected under distributional hypotheses, then use LR test.
eg. $2(sum(n_i * log(o_i)) - sum(n_i * log(e_i)))$ .
Related topics
Association reversal
Regression coefficients can change sign when new variables added to model. Similar thing happends in contingency table when collapsing over variables. Known as Simpson’s paradox, or association reversal. Dangerous to collapse over variables strongly related to remaining variables. Better to look at conditional tables rather than marginal (collapsed) tables.