Diagnostics
General modelling cycle is:
- fit
- examine residuals
- transform if necessary
- repeat above until happy
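The cycle above can be sketched in R with simulated data (the data and the log transform are illustrative assumptions, not a general recipe):

```r
# Sketch of the fit / examine / transform cycle (simulated data;
# the log transform happens to be the right fix here).
set.seed(42)
x <- runif(100, 1, 10)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))  # true relationship is log-linear

fit1 <- lm(y ~ x)
plot(fitted(fit1), residuals(fit1))   # curved pattern: fit unsatisfactory

fit2 <- lm(log(y) ~ x)                # transform and refit
plot(fitted(fit2), residuals(fit2))   # pattern gone: stop here
summary(fit2)$r.squared
```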
Reasons for bad fit:
- non-planar data (major problem, coefficients effectively meaningless)
- non-constant scatter
- errors not independent
- outliers
- non-normal residuals
Detecting non-planar data
- plot residuals versus fitted values, and residuals vs. variables
- partial residual plots
- fitting additive models
Residual plots
Any pattern in the residuals is suggestive of non-planarity.
Partial residual plots
Suppose the true model is $y = \beta_0 + \beta_1 x + g(z) + \epsilon = \beta_0 + \beta_1 x + \beta_2 z + g(z) - \beta_2 z + \epsilon$ . If we fit a linear model using $x$ and $z$ and the errors aren’t too large, the residuals satisfy $e \approx g(z) - \beta_2 z$ , or $g(z) \approx e + \beta_2 z$ . If we define the partial residuals for $z$ as $e_z = e + \hat{\beta}_2 z$ , then plotting $e_z$ vs $z$ should reveal the shape of $g$. This is called a partial residual plot; it is very similar to an added variable plot and has similar least squares properties.
Use the 330 function partial.res.plots.
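partial.res.plots is a course function, but the construction is easy to do by hand; a sketch with simulated data, where the true $g(z) = z^2$:

```r
# Partial residual plot for z by hand (simulated data, true g(z) = z^2).
set.seed(1)
n <- 200
x <- runif(n)
z <- runif(n, -2, 2)
y <- 1 + 2 * x + z^2 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x + z)
e_z <- residuals(fit) + coef(fit)["z"] * z  # partial residuals e + beta2.hat * z
plot(z, e_z)                                # reveals the quadratic shape of g
```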
Additive models
Can use the more general class of generalized additive models to estimate the smooth function $g$ directly, using gam
in R (in the mgcv package). eg.
model.gam <- gam(y ~ x + s(z))
plot(model.gam)
Fixing non-planar data
If the data isn’t planar, we need to transform it so that it is – we need a function with the same shape as the pattern we see in the partial residual or gam plot. Use trial and error, looking at $R^2$ and residual plots to measure success.
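A small illustration of the trial-and-error approach, with simulated data where the true relationship is in $\log x$:

```r
# Trial and error: compare R^2 for candidate transformations
# (simulated data where y really depends on log(x)).
set.seed(2)
x <- runif(100, 1, 50)
y <- 3 + 2 * log(x) + rnorm(100, sd = 0.3)

r2.raw <- summary(lm(y ~ x))$r.squared       # mediocre fit
r2.log <- summary(lm(y ~ log(x)))$r.squared  # matches the shape, higher R^2
c(r2.raw, r2.log)
```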
Identifying non-constant scatter
Look at the residuals. A common problem is the funnel effect, where $\sigma^2 \propto \mu$ .
Fixing non-constant scatter
Can either transform the response (ie. take reciprocal or log) or estimate variances of observations and use weighted least squares regression.
Weighted least squares
Instead of minimising $\sum r_i^2$ , minimise the weighted residuals $\sum r_i^2 / v_i$ , where the $i$th observation has variance $v_i \sigma^2$ . How do we calculate the $v_i$?
- plot squared residuals
- smooth
- estimate variance of observation using smooth function
- weight = 1 / estimated variance
Can use the 330 function est.var.function(model) to calculate the weights, and then supply them to lm using the weights argument.
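A sketch of the four-step recipe above, using lowess in place of the course's est.var.function (simulated funnel-shaped data):

```r
# Weighted least squares: smooth the squared residuals, estimate each
# observation's variance, weight = 1 / estimated variance.
set.seed(3)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)        # funnel: sd grows with the mean

fit <- lm(y ~ x)
sm <- lowess(fitted(fit), residuals(fit)^2)       # steps 1-2: plot and smooth
v <- approx(sm$x, sm$y, xout = fitted(fit))$y     # step 3: variance estimates
v[v <= 0] <- min(v[v > 0])                        # guard against negative smooths
wfit <- lm(y ~ x, weights = 1 / v)                # step 4: weighted fit
coef(wfit)
```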
Identifying outliers and high leverage points
An outlier has an extreme $y$ -value. A high leverage point has extreme $x$ -values.
The effect of an outlier depends on its leverage:
- high leverage: can attract fitted plane, may have small residual, can increase $R^2$
- low leverage: doesn’t distort fit as much, usually has big residual, increases standard errors, decreases $R^2$
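The contrast can be seen with simulated data: a similar-sized disturbance distorts the slope far more at an extreme $x$-value.

```r
# Same-sized disturbance, different leverage (simulated data).
set.seed(9)
x <- rnorm(30)
y <- 2 * x + rnorm(30, sd = 0.3)
fit.clean <- lm(y ~ x)

# High leverage: extreme x, point far off the line -> slope is dragged
fit.hi <- lm(c(y, 0) ~ c(x, 8))

# Low leverage: central x, big residual -> slope barely moves, sigma inflates
fit.lo <- lm(c(y, 6) ~ c(x, 0))

c(clean = coef(fit.clean)[2], hi = coef(fit.hi)[2], lo = coef(fit.lo)[2])
```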
It can be shown that $\hat{y}_i = h_{i1} y_1 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$ , where the $h_{ij}$ are entries of the hat matrix and measure the influence that $y_j$ has on the $i$th fitted value. The diagonals $h_{ii}$ (the HMDs) can also be interpreted as the distance between the $x_i$ ’s and the average $x$ values.
Each HMD lies between 0 and 1, and the average HMD is $p/n$ ; values greater than 3 times this are considered extreme.
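In R, hatvalues gives the HMDs directly; a sketch with simulated data where the last point has a deliberately extreme $x$-value:

```r
# Hat matrix diagonals via hatvalues() (simulated data; the last
# observation is planted far out in x-space).
set.seed(4)
x1 <- c(rnorm(49), 10)
x2 <- rnorm(50)
y <- x1 + x2 + rnorm(50)

fit <- lm(y ~ x1 + x2)
h <- hatvalues(fit)
p <- length(coef(fit))
n <- length(y)
mean(h)               # always equals p/n
which(h > 3 * p / n)  # flags the high-leverage point
```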
Identifying outliers
If leverage is low, an outlier will normally have a large residual, but if leverage is high, the residual may be small. How do we identify outliers with high leverage? We can drop the point and refit the regression: if the coefficients change a lot, the point is probably an outlier, or influential point.
Can calculate a number of measures with leave-one-out techniques by looking at change in key regression quantities (coefficients, fitted values, etc.).
- Standardised difference in coefficients. $|\hat{\beta}_j - \hat{\beta}_j(-i)| / se(\hat{\beta}_j)$ . Problem when $\gt 2 / \sqrt{n}$ .
- Standardised difference in fitted values. $|\hat{y}_i - \hat{y}_i(-i)| / se(\hat{y}_i)$ . Problem when $\gt 3 \sqrt{p / (n-p)}$ .
- Cov ratio. Change in the standard errors of the coefficients. Problem when > $1 + 3p/n$ , or < $1 - 3p/n$ .
- Cook’s D. Overall change in the coefficients. Problem when in the upper 10% of the relevant F distribution.
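All four measures are built into base R (dfbetas, dffits, covratio, cooks.distance); a sketch with simulated data and a planted influential point:

```r
# Leave-one-out influence measures (simulated data; observation 30 is
# planted to be extreme in both x and y).
set.seed(5)
x <- rnorm(30); x[30] <- 2.5
y <- x + rnorm(30); y[30] <- y[30] + 8

fit <- lm(y ~ x)
n <- length(y); p <- length(coef(fit))

dfb <- dfbetas(fit)          # standardised coefficient changes, cf. 2/sqrt(n)
dff <- dffits(fit)           # standardised fitted-value changes, cf. 3*sqrt(p/(n-p))
cvr <- covratio(fit)         # cf. 1 +/- 3p/n
cd  <- cooks.distance(fit)   # overall change in coefficients

which.max(cd)                # observation 30 stands out
```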
In R:
- leverage-residual plot:
lrplot(model)
- influence measures:
influence.measures(model)
- Cook’s D plot:
influence.plots(model)
Identifying dependence
An important assumption is that the errors are independent; if this is not so, the standard errors will be incorrect. We usually talk about autocorrelation, which can be positive or negative. Look at plots of:
- residuals over time
- residuals vs. previous residual
- correlograms
Durbin-Watson is a formal test for significance. It assumes an AR(1) model for the errors and tests the hypothesis that the autocorrelation $\rho = 0$ . $\rho$ is estimated by $\hat{\rho} = \sum e_i e_{i-1} / \sum e_i^2$ , and the test statistic is $DW \approx 2(1-\hat{\rho})$ . $0 \lt DW \lt 4$ ; values of DW close to 0 indicate positive autocorrelation, values close to 4 negative. Exact critical values depend on $p$ and $n$ .
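The estimate $\hat{\rho}$ and DW are easy to compute from the residuals directly; a sketch with simulated AR(1) errors (for the formal test with proper critical values, dwtest in the lmtest package can be used):

```r
# rho-hat and DW from the residuals (simulated positively autocorrelated
# AR(1) errors; expect DW well below 2).
set.seed(6)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.7), n))
y <- 1 + 2 * x + e
r <- residuals(lm(y ~ x))

rho.hat <- sum(r[-1] * r[-n]) / sum(r^2)
DW <- 2 * (1 - rho.hat)
c(rho.hat = rho.hat, DW = DW)
```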
Identifying non-normality
The normality assumption is not so critical, but longer-tailed distributions will imply more outliers. It is also important for prediction intervals. Use a normal Q-Q plot (qqnorm in R); the points should fall in a straight line (j-shape = right skew, s-shape = short tails). Can use the Weisberg-Bingham test for a formal hypothesis test: WB = squared correlation of the points in the normal plot, and values close to 1 indicate normality.
In R, use the 330 function WB.test.
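WB.test is a course function, but the statistic itself is just the squared correlation of the points in the normal plot; a sketch with well-behaved simulated residuals:

```r
# The WB statistic: squared correlation between the sorted residuals and
# the corresponding normal quantiles (simulated, genuinely normal errors).
set.seed(7)
x <- rnorm(100)
y <- 1 + x + rnorm(100)
e <- residuals(lm(y ~ x))

WB <- cor(sort(e), qnorm(ppoints(length(e))))^2
WB   # close to 1: no evidence of non-normality
```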
Remedies for non-normality
Transform the response, using a Box-Cox plot to suggest a suitable power.
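A sketch using boxcox from the MASS package, with data simulated so that the true power is 0 (i.e. a log transform is appropriate):

```r
# Box-Cox plot via MASS::boxcox (simulated data with true power 0).
library(MASS)
set.seed(8)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))

bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # power maximising the profile likelihood
lambda                            # near 0, suggesting log(y)
```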