Collinearity

When two or more explanatory variables are strongly linearly related we have collinearity, which inflates the standard errors of the estimated coefficients.

\[ var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)\,var(x_j)\,(1-R_j^2)} \]

where $R_j^2$ is the $R^2$ from regressing $x_j$ on all the other explanatory variables. If $x_j$ is orthogonal to the other variables, then $R_j^2 = 0$ and the variance reduces to $\sigma^2 / ((n-1)\,var(x_j))$. The factor $1 / (1 - R_j^2)$ measures the increase in variance due to linear dependence, and is called the variance inflation factor (VIF). Equivalently, $VIF_j$ is the total variance of $x_j$ divided by the residual variance after regressing $x_j$ on the other explanatory variables, and the VIFs are given by the diagonal of the inverse of the correlation matrix of the explanatory variables.

# VIFs: diagonal of the inverse of the correlation matrix of the
# explanatory variables (column 5 of X is excluded here)
vif <- diag(solve(cor(X[,-5])))

If one or more variables has a high VIF, the regression is said to be collinear. Collinearity results in imprecise estimation of the regression coefficients, high standard errors, and non-significant variables. If a variable is non-significant, it is either because it is genuinely unrelated to the response, or because its information has already been explained by the other explanatory variables.
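The identity between the two VIF formulas above can be checked directly. A minimal sketch with simulated data (all variable names here are illustrative, not from the course data):

```r
# Three simulated explanatory variables; x2 is strongly correlated with x1.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # correlated with x1
x3 <- rnorm(n)                               # roughly orthogonal to both
X  <- cbind(x1, x2, x3)

# VIF from the definition: 1 / (1 - R_j^2), regressing x_j on the others
r2 <- sapply(seq_len(ncol(X)),
             function(j) summary(lm(X[, j] ~ X[, -j]))$r.squared)
vif_def <- 1 / (1 - r2)

# VIF from the diagonal of the inverse correlation matrix
vif_cor <- diag(solve(cor(X)))

all.equal(unname(vif_def), unname(vif_cor))
```

The two calculations agree, and the correlated variables (x1, x2) show much larger VIFs than the orthogonal one (x3).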

Partial regression plots/Added variable plots

Plot the residuals from regressing the response on all the explanatory variables except x against the residuals from regressing x on those same variables (unexplained variation in Y vs. variation in x not explained by the others). A relationship in the graph indicates a relationship between the two sets of residuals, i.e. a relationship between x and the response that is not accounted for by the other variables.

added.variable.plots(model) (from 330 code)
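The same plot can be built by hand from the two residual regressions described above; the data and variable names below are illustrative. (The `avPlots()` function in the car package also produces these plots.)

```r
# Hand-rolled added-variable plot for x1, using simulated data.
set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

res_y <- resid(lm(y  ~ x2 + x3))  # variation in y not explained by the others
res_x <- resid(lm(x1 ~ x2 + x3))  # variation in x1 not explained by the others

plot(res_x, res_y,
     xlab = "x1 | others", ylab = "y | others",
     main = "Added-variable plot for x1")
abline(lm(res_y ~ res_x))

# The slope of this line equals the coefficient of x1 in the full model
coef(lm(res_y ~ res_x))[2]
coef(lm(y ~ x1 + x2 + x3))["x1"]
```

The slope of the resid-on-resid regression matching the full-model coefficient is exactly why the plot shows the relationship between x and the response after adjusting for the other variables.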

Things to note: