Variable selection

Often there are many candidate variables available for building a model: which ones should we use?

If we use too many, including some unrelated to the response, then we are overfitting: the model is not good for prediction (large prediction variance), the model is more elaborate than necessary, and coefficient standard errors are often inflated.

If we use too few, leaving out useful variables, we are underfitting: the model is not good for prediction (predictions are biased), the regression coefficients are biased, and we overestimate the error variance.

If we have $p$ variables and a constant term, then we have $2^p - 1$ possible additive models (e.g. $p = 10$ gives $2^{10} - 1 = 1023$ candidates). How do we select the best one? There are two approaches: all possible regressions and stepwise methods.

All possible regressions (APR)

Define criterion of “model goodness” which balances simplicity and quality of fit. Calculate criterion for every possible model, then choose best based on criterion.

Possible criteria:

$R^2$

Big values are good. Since $R^2$ never decreases as variables are added, it will always select the model with all the variables; it is fine, however, for comparing models with the same number of variables. We can adjust $R^2$ to penalise complicated models: $\bar{R}_p^2 = 1 - \frac{n-1}{n-p-1}(1-R^2_p)$.
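A quick check in R (using the built-in mtcars data purely for illustration; the choice of predictors is arbitrary):

```r
# R^2 never decreases when a variable is added, but adjusted R^2 can.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big   <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

summary(fit_small)$r.squared       # R^2 for the smaller model
summary(fit_big)$r.squared         # always at least as large

# Adjusted R^2 via the formula above, compared with R's built-in value:
n  <- nrow(mtcars)
p  <- length(coef(fit_big)) - 1    # number of predictors (excluding intercept)
R2 <- summary(fit_big)$r.squared
1 - (n - 1) / (n - p - 1) * (1 - R2)
summary(fit_big)$adj.r.squared
```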

Residual mean square ($\hat{\sigma}^2$)

Small values are good. Since $\hat{\sigma}^2_p = \frac{RSS_p}{n-p-1}$ and hence $\bar{R}^2_p = 1 - \hat{\sigma}^2_p \frac{n-1}{\mathrm{TSS}}$, choosing the smallest residual mean square is equivalent to choosing the largest adjusted $R^2$.

Mallows’ Cp

An estimate of prediction quality. For a model with $k$ variables, $C_k = \frac{RSS_k}{\mathrm{EMS}_{\mathrm{full}}} + 2(k+1) - n$, where $\mathrm{EMS}_{\mathrm{full}}$ is the error mean square of the full model. Values around $k + 1$ are good (note: for the full model, $C_p = p+1$ exactly).

If the $k$-variable model contains all the important information, then $RSS_k \approx (n-k-1)\sigma^2$ and $\mathrm{EMS}_{\mathrm{full}} \approx \sigma^2$, so $C_k \approx k+1$.

$C_p$ plot: for each model we plot $C_k$ against $k$, with the line $C = k + 1$ drawn in. Models close to the line are good.
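One way to compute $C_k$ for every subset and draw the plot, in base R (a sketch on the built-in mtcars data; the predictor pool is arbitrary):

```r
y    <- "mpg"
vars <- c("wt", "hp", "disp", "drat")
full <- lm(reformulate(vars, y), data = mtcars)
ems  <- summary(full)$sigma^2          # EMS_full: error mean square of full model
n    <- nrow(mtcars)

# Cp for every non-empty subset of the predictor pool
cp <- do.call(rbind, lapply(seq_along(vars), function(k) {
  do.call(rbind, lapply(combn(vars, k, simplify = FALSE), function(v) {
    rss <- deviance(lm(reformulate(v, y), data = mtcars))   # RSS_k
    data.frame(k = k, Cp = rss / ems + 2 * (k + 1) - n)
  }))
}))

plot(Cp ~ k, data = cp, main = "Cp plot")
abline(1, 1)                           # the line Cp = k + 1
```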

AIC and BIC

$\mathrm{AIC} = n\log(RSS_p) + 2p$. Small is good. $\mathrm{BIC} = \frac{RSS_p}{\mathrm{EMS}_{\mathrm{full}}} + p\log(n)$. Small is good; because BIC's per-variable penalty $\log(n)$ grows with $n$, it tends to favour smaller models than AIC.
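For reference, base R's AIC() and BIC() are built in; they use the full log-likelihood, so the numbers differ from the simplified formulas above by constants, but smaller is still better:

```r
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit1, fit2)    # smaller is better
BIC(fit1, fit2)    # log(n) penalty: favours smaller models as n grows
```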

Stepwise methods

In R, stepwise selection can be run with step(model, direction = "forward"); the direction argument also accepts "backward" and "both".

Backwards elimination

  1. Start with the full model
  2. Find the variable with the largest p-value
  3. If its p-value exceeds the cutoff (usually 0.10), remove it and refit the model
  4. Repeat until every remaining variable has a p-value below the cutoff (see the sketch below)
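A sketch of this loop using F-test p-values from drop1() (mtcars again; the predictor pool and 0.10 cutoff are illustrative):

```r
fit <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
repeat {
  pvals <- drop1(fit, test = "F")[["Pr(>F)"]][-1]   # drop the "<none>" row
  terms <- attr(terms(fit), "term.labels")
  worst <- which.max(pvals)                         # variable with largest p-value
  if (length(pvals) == 0 || pvals[worst] < 0.10) break   # all significant: stop
  fit <- update(fit, as.formula(paste(". ~ . -", terms[worst])))
}
formula(fit)
```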

Forward selection

  1. Start with the null model
  2. Among the variables not yet in the model, choose the one with the greatest partial correlation with the response (equivalently, the smallest p-value when added)
  3. If its p-value is below the cutoff (around 0.10), add it and refit the model
  4. Repeat until no candidate variable has a p-value below the cutoff (see the sketch below)
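The analogous sketch with add1(), where the candidate with the smallest p-value is the one with the greatest partial correlation (same illustrative data and cutoff):

```r
fit   <- lm(mpg ~ 1, data = mtcars)          # null model
scope <- ~ wt + hp + disp + drat + qsec      # candidate pool (illustrative)
repeat {
  tab   <- add1(fit, scope, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]               # drop the "<none>" row
  cands <- rownames(tab)[-1]
  if (length(pvals) == 0) break              # pool exhausted
  best <- which.min(pvals)
  if (pvals[best] > 0.10) break              # nothing worth adding
  fit <- update(fit, as.formula(paste(". ~ . +", cands[best])))
}
formula(fit)
```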

Full stepwise

A combination of forward and backwards: after each forward step, check whether any variable already in the model can be removed (one step forward, one step back). Stop when no variable can be added or removed.
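Base R's step() implements this alternation, though it ranks moves by AIC rather than p-values (a usage sketch on the same illustrative data):

```r
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

step(null, scope = formula(full), direction = "both")
# adding k = log(nrow(mtcars)) to the call would use the BIC penalty instead
```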