Diagnostics

General modelling cycle is:

  1. fit
  2. examine residuals
  3. transform if necessary
  4. repeat above until happy
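
The cycle above might look like the following sketch (simulated data, chosen so the log transform is the right fix; names are illustrative):

```r
# 1. fit an initial model (here log(y) is actually linear in x)
set.seed(1)
x <- runif(50)
y <- exp(1 + 2 * x + rnorm(50, sd = 0.2))
model <- lm(y ~ x)
# 2. examine residuals
plot(fitted(model), resid(model))   # curved pattern, increasing spread
# 3. transform if necessary
model <- lm(log(y) ~ x)
# 4. repeat until happy
plot(fitted(model), resid(model))   # now patternless
```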

Reasons for a bad fit: non-planarity, non-constant scatter, outliers and high leverage points, dependence, and non-normality.

Detecting non-planar data

Residual plots

Any pattern in the residuals is suggestive of non-planarity.

Partial residual plots

Suppose the true model is $y = \beta_0 + \beta_1 x + g(z) + \epsilon = \beta_0 + \beta_1 x + \beta_2 z + (g(z) - \beta_2 z) + \epsilon$ . If we fit a linear model using $x$ and $z$ and the errors aren’t too large, the residuals satisfy $e \approx g(z) - \beta_2 z$ , i.e. $g(z) \approx e + \beta_2 z$ . If we define the partial residuals for $z$ as $e_z = e + \hat{\beta}_2 z$ , then plotting $e_z$ vs $z$ should reveal the shape of $g$. This is called a partial residual plot; it is very similar to an added variable plot and has similar least squares properties.
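
A hand-rolled sketch of this, using simulated data with $g(z) = z^2$ (in practice the 330 function partial.res.plots handles this for each variable):

```r
set.seed(42)
x <- runif(100)
z <- runif(100)
y <- 1 + 2 * x + z^2 + rnorm(100, sd = 0.1)   # true g(z) = z^2
model <- lm(y ~ x + z)
# partial residuals for z: ordinary residuals plus beta2.hat * z
e.z <- resid(model) + coef(model)["z"] * z
plot(z, e.z)   # the quadratic shape of g should show up
```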

Use 330 function partial.res.plots.

Additive models

Can use the more general class of additive models to estimate the smooth function directly; use gam in R. eg.

model.gam <- gam(y ~ x + s(z))
plot(model.gam)

Fixing non-planar data

If the data isn’t planar, we need to transform it so that it is – we need a function with the same shape as the pattern we see in the partial residual or gam plot. Use trial and error, looking at $R^2$ and residual plots to measure success.
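
For example, with simulated data where a quadratic term in $z$ happens to be the right fix (the choice of transformation is illustrative):

```r
set.seed(1)
x <- runif(100)
z <- runif(100)
y <- 1 + x + 4 * z^2 + rnorm(100, sd = 0.2)
plain <- lm(y ~ x + z)
trans <- lm(y ~ x + z + I(z^2))   # candidate transformation of z
c(summary(plain)$r.squared, summary(trans)$r.squared)   # did it improve?
plot(fitted(trans), resid(trans))                       # and recheck residuals
```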

Identifying non-constant scatter

Look at residuals. A common problem is the funnel effect, where $\sigma^2 \propto \mu$ .

Fixing non-constant scatter

Can either transform the response (e.g. take the reciprocal or log) or estimate the variances of the observations and use weighted least squares regression.

Weighted least squares

Instead of minimising $\sum r_i^2$ , minimise the weighted residuals: $\sum r_i^2 / v_i$ , where the $i$th observation has variance $v_i \sigma^2$ . How do we calculate $v_i$ ?

  1. plot squared residuals
  2. smooth
  3. estimate variance of observation using smooth function
  4. weight = 1 / estimated variance
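
The four steps above can be sketched with lowess as a rough stand-in for what est.var.function computes (the exact smoother the 330 function uses is an assumption here):

```r
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.2 + x)   # funnel: sd grows with the mean
model <- lm(y ~ x)
sq <- resid(model)^2                        # 1. squared residuals
sm <- lowess(fitted(model), sq)             # 2. smooth them
v <- approx(sm$x, sm$y, xout = fitted(model))$y   # 3. estimated variances
v <- pmax(v, 1e-6)                          # guard against non-positive values
w <- 1 / v                                  # 4. weights
model.w <- lm(y ~ x, weights = w)
```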

Can use 330 function est.var.function(model) to calculate weights, and then supply them to lm using weights argument.

Identifying outliers and high leverage points

An outlier has an extreme $y$ -value. A high leverage point has extreme $x$ -values.

The effect of an outlier depends on its leverage:

It can be shown that $\hat{y}_i = h_{i1} y_1 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$ , where the $h_{ij}$ ’s are entries of the hat matrix and measure the influence that $y_j$ has on the $i$th fitted value. The diagonal entry $h_{ii}$ (the hat matrix diagonal, HMD) can also be interpreted as the distance between the $x_i$ ’s and the average $x$ values.

Each HMD lies between 0 and 1, and the average HMD is $p/n$ ; values greater than 3 times this are considered extreme.
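
A sketch with one planted high leverage point (data simulated for illustration):

```r
set.seed(1)
x <- c(runif(49), 10)            # one extreme x-value: high leverage
y <- 1 + 2 * x + rnorm(50)
model <- lm(y ~ x)
h <- hatvalues(model)            # the hat matrix diagonals
p <- length(coef(model))
n <- length(y)
which(h > 3 * p / n)             # flag points with extreme leverage
```

Note that the HMDs always sum to $p$, which is why their average is $p/n$.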

Identifying outliers

If leverage is low, then an outlier will normally have a large residual, but if leverage is high, the residual may be small. How do we identify outliers with high leverage? We can drop the point and refit the regression: if the coefficients change a lot, it’s probably an outlier, or influential point.

Can calculate a number of measures with leave-one-out techniques by looking at change in key regression quantities (coefficients, fitted values, etc.).

In R: use influence.measures(model), which reports the leave-one-out diagnostics (DFBETAS, DFFITS, Cook’s distance, covariance ratio) together with the hat values.
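
A sketch of these measures on a model with a planted outlier (simulated data):

```r
set.seed(1)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.2)
y[30] <- y[30] + 5                 # plant an outlier
model <- lm(y ~ x)
cooks.distance(model)              # overall change in the fitted values
dfbeta(model)                      # change in each coefficient
summary(influence.measures(model)) # all leave-one-out measures, flagged
```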

Identifying dependence

An important assumption is that the data are independent; if this is not so, the standard errors will be incorrect. We usually talk about autocorrelation, which can be positive or negative. Look at plots of the residuals against time order, and of each residual against the previous one.

Durbin-Watson is a formal test for significance. It assumes an AR(1) model for the errors and tests the hypothesis that $\rho = 0$ . $\rho$ is estimated by $\hat{\rho} = \sum e_i e_{i-1} / \sum e_i^2$ , and the test statistic is $DW \approx 2(1-\hat{\rho})$ . $0 \lt DW \lt 4$ ; values of DW close to 0 indicate positive autocorrelation, values close to 4 negative. Exact critical values depend on $p$ and $n$ .
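
The statistic can be computed directly from the residuals (for the formal test with proper critical values, durbinWatsonTest in the car package is one option):

```r
set.seed(1)
n <- 100
x <- runif(n)
e <- as.numeric(arima.sim(list(ar = 0.8), n))   # positively autocorrelated errors
y <- 1 + 2 * x + e
model <- lm(y ~ x)
r <- resid(model)
rho.hat <- sum(r[-1] * r[-n]) / sum(r^2)   # estimate of rho
DW <- sum(diff(r)^2) / sum(r^2)            # approx 2 * (1 - rho.hat), well below 2 here
```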

Identifying non-normality

The normality assumption is not so critical, but longer-tailed distributions will imply more outliers. It is also important for prediction errors. Use a qqplot; the points should fall in a straight line (a j-shape indicates right skew, an s-shape a short-tailed distribution). Can use the Weisberg-Bingham test for a formal hypothesis test: WB = the squared correlation of the normal plot, and values close to 1 indicate normality.

In R, use 330 function WB.test.
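
Assuming WB.test uses the definition above, the statistic can be computed directly from the normal plot coordinates:

```r
set.seed(1)
e <- rnorm(100)                    # stand-in for residuals from a fit
qq <- qqnorm(e, plot.it = FALSE)   # coordinates of the normal plot
WB <- cor(qq$x, qq$y)^2            # close to 1 indicates normality
```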

Remedies for non-normality

Transform the response, using a Box-Cox plot to suggest a suitable power.
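
Using boxcox from the MASS package (simulated data where the log transform, $\lambda = 0$ , is appropriate):

```r
library(MASS)   # for boxcox
set.seed(1)
x <- runif(100)
y <- exp(1 + 2 * x + rnorm(100, sd = 0.3))   # log(y) is linear in x
bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # power with the highest likelihood, near 0
```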