Diagnostics
General modelling cycle is:
- fit
- examine residuals
- transform if necessary
- repeat above until happy
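The cycle above can be sketched in R with simulated data (the data and the log transform are illustrative assumptions, not a general recipe):

```r
# Sketch of the fit / examine / transform cycle (simulated data;
# the log transform happens to be the right fix here).
set.seed(42)
x <- runif(100, 1, 10)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))  # true relationship is log-linear

fit1 <- lm(y ~ x)
plot(fitted(fit1), residuals(fit1))   # curved pattern: fit unsatisfactory

fit2 <- lm(log(y) ~ x)                # transform and refit
plot(fitted(fit2), residuals(fit2))   # pattern gone: stop here
summary(fit2)$r.squared
```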
Reasons for bad fit:
- non-planar data (major problem, coefficients effectively meaningless)
- non-constant scatter
- errors not independent
- outliers
- non-normal residuals
Detecting non-planar data
- plot residuals versus fitted values, and residuals vs. variables
- partial residual plots
- fitting additive models
Residual plots
Any pattern in the residuals is suggestive of non-planarity.
Partial residual plots
Suppose the true model is $y = \beta_0 + \beta_1 x + g(z) + \epsilon = \beta_0 + \beta_1 x + \beta_2 z + g(z) - \beta_2 z + \epsilon$ . If we fit a linear model using $x$ and $z$ and the errors aren’t too large, the residuals satisfy $e \approx g(z) - \beta_2 z$ , or $g(z) \approx e + \beta_2 z$ . If we define the partial residuals for $z$ as $e_z = e + \hat{\beta}_2 z$ , then plotting $e_z$ vs $z$ should reveal the shape of $g$. This is called a partial residual plot; it is very similar to an added variable plot and has similar least squares properties.
Use the 330 function partial.res.plots.
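partial.res.plots is a course function, but the construction is easy to do by hand; a sketch with simulated data, where the true $g(z) = z^2$:

```r
# Partial residual plot for z by hand (simulated data, true g(z) = z^2).
set.seed(1)
n <- 200
x <- runif(n)
z <- runif(n, -2, 2)
y <- 1 + 2 * x + z^2 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x + z)
e_z <- residuals(fit) + coef(fit)["z"] * z  # partial residuals e + beta2.hat * z
plot(z, e_z)                                # reveals the quadratic shape of g
```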
Additive models
Can use the more general class of generalized additive models to estimate the smooth function $g$ directly, using gam
in R (in the mgcv package). eg.
model.gam <- gam(y ~ x + s(z))
plot(model.gam)
Fixing non-planar data
If the data isn’t planar, we need to transform it so that it is – we need a function with the same shape as the pattern we see in the partial residual or gam plot. Use trial and error, looking at $R^2$ and residual plots to measure success.
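A small illustration of the trial-and-error approach, with simulated data where the true relationship is in $\log x$:

```r
# Trial and error: compare R^2 for candidate transformations
# (simulated data where y really depends on log(x)).
set.seed(2)
x <- runif(100, 1, 50)
y <- 3 + 2 * log(x) + rnorm(100, sd = 0.3)

r2.raw <- summary(lm(y ~ x))$r.squared       # mediocre fit
r2.log <- summary(lm(y ~ log(x)))$r.squared  # matches the shape, higher R^2
c(r2.raw, r2.log)
```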
Identifying non-constant scatter
Look at the residuals. A common problem is the funnel effect, where $\sigma^2 \propto \mu$ .
Fixing non-constant scatter
Can either transform the response (ie. take reciprocal or log) or estimate variances of observations and use weighted least squares regression.
Weighted least squares
Instead of minimising $\sum r_i^2$ , minimise the weighted residuals $\sum r_i^2 / v_i$ , where the $i$th observation has variance $v_i \sigma^2$ . How do we calculate the $v_i$?
- plot squared residuals
- smooth
- estimate variance of observation using smooth function
- weight = 1 / estimated variance
Can use the 330 function est.var.function(model) to calculate the weights, and then supply them to lm using the weights argument.
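A sketch of the four-step recipe above, using lowess in place of the course's est.var.function (simulated funnel-shaped data):

```r
# Weighted least squares: smooth the squared residuals, estimate each
# observation's variance, weight = 1 / estimated variance.
set.seed(3)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)        # funnel: sd grows with the mean

fit <- lm(y ~ x)
sm <- lowess(fitted(fit), residuals(fit)^2)       # steps 1-2: plot and smooth
v <- approx(sm$x, sm$y, xout = fitted(fit))$y     # step 3: variance estimates
v[v <= 0] <- min(v[v > 0])                        # guard against negative smooths
wfit <- lm(y ~ x, weights = 1 / v)                # step 4: weighted fit
coef(wfit)
```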
Identifying outliers and high leverage points
An outlier has an extreme $y$ -value. A high leverage point has extreme $x$ -values.
The effect of an outlier depends on its leverage:
- high leverage: can attract fitted plane, may have small residual, can increase $R^2$
- low leverage: doesn’t distort fit as much, usually has big residual, increases standard errors, decreases $R^2$
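The contrast can be seen with simulated data: a similar-sized disturbance distorts the slope far more at an extreme $x$-value.

```r
# Same-sized disturbance, different leverage (simulated data).
set.seed(9)
x <- rnorm(30)
y <- 2 * x + rnorm(30, sd = 0.3)
fit.clean <- lm(y ~ x)

# High leverage: extreme x, point far off the line -> slope is dragged
fit.hi <- lm(c(y, 0) ~ c(x, 8))

# Low leverage: central x, big residual -> slope barely moves, sigma inflates
fit.lo <- lm(c(y, 6) ~ c(x, 0))

c(clean = coef(fit.clean)[2], hi = coef(fit.hi)[2], lo = coef(fit.lo)[2])
```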
It can be shown that $\hat{y}_i = h_{i1} y_1 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$ , where the $h_{ij}$ are entries of the hat matrix and measure the influence that $y_j$ has on the $i$th fitted value. The diagonals $h_{ii}$ (the HMDs) can also be interpreted as the distance between the $x_i$ ’s and the average $x$ values.
Each HMD lies between 0 and 1, and the average HMD is $p/n$ ; values greater than 3 times this are considered extreme.
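In R, hatvalues gives the HMDs directly; a sketch with simulated data where the last point has a deliberately extreme $x$-value:

```r
# Hat matrix diagonals via hatvalues() (simulated data; the last
# observation is planted far out in x-space).
set.seed(4)
x1 <- c(rnorm(49), 10)
x2 <- rnorm(50)
y <- x1 + x2 + rnorm(50)

fit <- lm(y ~ x1 + x2)
h <- hatvalues(fit)
p <- length(coef(fit))
n <- length(y)
mean(h)               # always equals p/n
which(h > 3 * p / n)  # flags the high-leverage point
```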
Identifying outliers
If leverage is low, an outlier will normally have a large residual, but if leverage is high, the residual may be small. How do we identify outliers with high leverage? We can drop the point and refit the regression: if the coefficients change a lot, the point is probably an outlier, or influential point.
Can calculate a number of measures with leave-one-out techniques by looking at change in key regression quantities (coefficients, fitted values, etc.).
- Standardised difference in coefficients. $|\hat{\beta}_j - \hat{\beta}_j(-i)| / se(\hat{\beta}_j)$ . Problem when $\gt 2 / \sqrt{n}$ .
- Standardised difference in fitted values. $|\hat{y}_i - \hat{y}_i(-i)| / se(\hat{y}_i)$ . Problem when $\gt 3 \sqrt{p / (n-p)}$ .
- Cov ratio. Change in the standard errors of the coefficients. Problem when > $1 + 3p/n$ , or < $1 - 3p/n$ .
- Cook’s D. Overall change in the coefficients. Problem when in the upper 10% of the relevant F distribution.
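All four measures are built into base R (dfbetas, dffits, covratio, cooks.distance); a sketch with simulated data and a planted influential point:

```r
# Leave-one-out influence measures (simulated data; observation 30 is
# planted to be extreme in both x and y).
set.seed(5)
x <- rnorm(30); x[30] <- 2.5
y <- x + rnorm(30); y[30] <- y[30] + 8

fit <- lm(y ~ x)
n <- length(y); p <- length(coef(fit))

dfb <- dfbetas(fit)          # standardised coefficient changes, cf. 2/sqrt(n)
dff <- dffits(fit)           # standardised fitted-value changes, cf. 3*sqrt(p/(n-p))
cvr <- covratio(fit)         # cf. 1 +/- 3p/n
cd  <- cooks.distance(fit)   # overall change in coefficients

which.max(cd)                # observation 30 stands out
```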
In R:
- leverage-residual plot:
lrplot(model)
- influence measures:
influence.measures(model)
- Cook’s D plot:
influence.plots(model)
Identifying dependence
An important assumption is that the errors are independent; if this is not so, the standard errors will be incorrect. We usually talk about autocorrelation, which can be positive or negative. Look at plots of:
- residuals over time
- residuals vs. previous residual
- correlograms
Durbin-Watson is a formal test for significance. It assumes an AR(1) model for the errors and tests the hypothesis that the autocorrelation $\rho = 0$ . $\rho$ is estimated by $\hat{\rho} = \sum e_i e_{i-1} / \sum e_i^2$ , and the test statistic is $DW \approx 2(1-\hat{\rho})$ . $0 \lt DW \lt 4$ ; values of DW close to 0 indicate positive autocorrelation, values close to 4 negative. Exact critical values depend on $p$ and $n$ .
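The estimate $\hat{\rho}$ and DW are easy to compute from the residuals directly; a sketch with simulated AR(1) errors (for the formal test with proper critical values, dwtest in the lmtest package can be used):

```r
# rho-hat and DW from the residuals (simulated positively autocorrelated
# AR(1) errors; expect DW well below 2).
set.seed(6)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.7), n))
y <- 1 + 2 * x + e
r <- residuals(lm(y ~ x))

rho.hat <- sum(r[-1] * r[-n]) / sum(r^2)
DW <- 2 * (1 - rho.hat)
c(rho.hat = rho.hat, DW = DW)
```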
Identifying non-normality
The normality assumption is not so critical, but longer-tailed distributions will imply more outliers. It is also important for prediction intervals. Use a normal Q-Q plot (qqnorm in R); the points should fall in a straight line (j-shape = right skew, s-shape = short tails). Can use the Weisberg-Bingham test for a formal hypothesis test: WB = squared correlation of the points in the normal plot, and values close to 1 indicate normality.
In R, use the 330 function WB.test.
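WB.test is a course function, but the statistic itself is just the squared correlation of the points in the normal plot; a sketch with well-behaved simulated residuals:

```r
# The WB statistic: squared correlation between the sorted residuals and
# the corresponding normal quantiles (simulated, genuinely normal errors).
set.seed(7)
x <- rnorm(100)
y <- 1 + x + rnorm(100)
e <- residuals(lm(y ~ x))

WB <- cor(sort(e), qnorm(ppoints(length(e))))^2
WB   # close to 1: no evidence of non-normality
```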
Remedies for non-normality
Transform the response, using a Box-Cox plot to suggest a suitable power.
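A sketch using boxcox from the MASS package, with data simulated so that the true power is 0 (i.e. a log transform is appropriate):

```r
# Box-Cox plot via MASS::boxcox (simulated data with true power 0).
library(MASS)
set.seed(8)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))

bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # power maximising the profile likelihood
lambda                            # near 0, suggesting log(y)
```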