56 Regularised Regression

56.1 Basic Idea

  1. Fit a regression model
  2. Penalise (or shrink) large coefficients

Pros:

  • Can help with the bias/variance trade-off
  • Can help with model selection

Cons:

  • May be computationally demanding on large data sets
  • Does not perform as well as random forests or boosting

56.2 A Motivating Example

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

Where:

  • \(X_1\) and \(X_2\) are nearly perfectly correlated (collinear).

This model can be approximated by leaving out one of the regressors:

\[Y = \beta_0 + (\beta_1 + \beta_2)X_1 + \epsilon\]

The result is:

  • You will get a good estimate of \(Y\)
  • The estimate (of \(Y\)) will be biased
  • We may reduce the variance of the estimate
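
A small base-R simulation of this trade-off (the variable names, coefficients and correlation level are illustrative assumptions, not from the source):

# Simulate two nearly collinear predictors and compare the full model
# with a model that drops x2 (illustrative sketch).
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)          # x2 is almost identical to x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)              # unstable individual coefficients
reduced <- lm(y ~ x1)                   # coefficient on x1 approximates beta1 + beta2

summary(full)$coefficients
summary(reduced)$coefficients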

56.3 Common Pattern

When the complexity of the model is increased, its accuracy on the training set will always increase, because more of the variance in the training data can be explained. However, fitting the training set too closely can overfit the model to the point that it no longer generalises, and it will then perform poorly on the testing set.

56.4 Model Selection Approach: Split Samples

There is no better method when the data and computation time permit it. The approach is as follows:

  1. Divide the data into training, testing and validation sets.
  2. Treat the validation set as test data: train all competing models on the training data and pick the best one on the validation set.
  3. To get an approximate assessment of performance on new data, apply the chosen model to the testing set.
  4. You can then re-split the data and repeat steps 1-3.

There are two common problems with this approach:

  • Limited Data
  • Computational complexity
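
A minimal sketch of such a split using caret's createDataPartition (the prostate data, the seed and the roughly 60/20/20 proportions are illustrative choices, not from the source):

library(caret)
library(ElemStatLearn)   # assumed source of the prostate data used below
data(prostate)

# Illustrative 60/20/20 split: build set vs test set, then train vs validation.
set.seed(1234)
inBuild    <- createDataPartition(prostate$lpsa, p = 0.8, list = FALSE)
testing    <- prostate[-inBuild, ]      # held back until the very end
buildData  <- prostate[inBuild, ]

inTrain    <- createDataPartition(buildData$lpsa, p = 0.75, list = FALSE)
training   <- buildData[inTrain, ]      # fit all competing models here
validation <- buildData[-inTrain, ]     # compare models and pick the best here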

56.5 Decomposing Expected Prediction Error

Another approach is to decompose the expected prediction error, to see whether there is a way to directly include only the variables that need to be in the model.

Assume that the outcome \(Y_i\) can be predicted as a function of \(X_i\) with the addition of an error term: \(Y_i = f(X_i)+\epsilon_i\)

Then the expected prediction error \((EPE)\) is the expected squared difference between the outcome and the predicted outcome: \[EPE(\lambda)= E[(Y - \hat{f_\lambda}(X))^2]\] Here \(\hat{f_\lambda}\) is the estimate built from the training data using a particular set of tuning parameters \(\lambda\), and it is evaluated at a new data point \(X = x^*\). The expected prediction error can then be decomposed into three components:

\[E[(Y - \hat{f_\lambda}(x^*))^2] = \sigma^2 + [E[\hat{f_\lambda}(x^*)] - f(x^*)]^2 + var[\hat{f_\lambda}(x^*)]\]

Where:

  • \(\sigma^2\) is the irreducible error
  • \([E[\hat{f_\lambda}(x^*)] - f(x^*)]^2\) is the squared bias (the difference between the expected prediction and the truth)
  • \(var[\hat{f_\lambda}(x^*)]\) is the variance of our estimate
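
The decomposition follows by adding and subtracting \(E[\hat{f_\lambda}(x^*)]\) inside the square; the cross terms vanish because \(\epsilon\) has mean zero and is independent of \(\hat{f_\lambda}\):

\[E[(Y - \hat{f_\lambda}(x^*))^2] = E[(\epsilon + f(x^*) - E[\hat{f_\lambda}(x^*)] + E[\hat{f_\lambda}(x^*)] - \hat{f_\lambda}(x^*))^2]\]

Expanding the square and taking expectations leaves exactly the three terms above: \(var(\epsilon) = \sigma^2\), the squared bias, and the variance of \(\hat{f_\lambda}(x^*)\).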

56.6 Another Issue for High-Dimensional Data

If the data set is extremely high-dimensional, with more predictors than data points, the regression algorithms cannot estimate coefficients for all of the variables. For example, if only five data points are taken from the prostate data set, R cannot estimate coefficients for every variable in the data set.
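
A sketch of the kind of code that produces the output below, assuming the prostate data comes from the ElemStatLearn package:

# Illustrative: more predictors than observations, so lm() returns NA
# for the coefficients it cannot estimate.
library(ElemStatLearn)   # assumed source of the prostate data
data(prostate)

small <- prostate[1:5, ]              # only five observations
lm(lpsa ~ ., data = small)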

## 
## Call:
## lm(formula = lpsa ~ ., data = small)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##     9.60615      0.13901     -0.79142      0.09516           NA           NA  
##         lcp      gleason        pgg45    trainTRUE  
##          NA     -2.08710           NA           NA

56.7 Hard Thresholding

One way to get around this problem is hard thresholding:

  • Assume the model is a linear regression problem: \(Y = f(X) + \epsilon\)
  • Set \(\hat{f_{\lambda}}(x) = x^\prime \beta\)
  • Constrain the model so that only \(\lambda\) coefficients are non-zero
  • The selection problem is then, after choosing \(\lambda\), to figure out which \(p - \lambda\) coefficients to set to zero (equivalently, which \(\lambda\) coefficients to keep non-zero).

Note that after we pick \(\lambda\) (suppose only three non-zero coefficients are allowed), we still have to try all possible combinations of three non-zero coefficients and then fit the best model, so the approach remains computationally demanding.
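
As an illustrative sketch (not the source's code), this kind of exhaustive subset search can be carried out with the leaps package:

library(leaps)           # provides exhaustive best-subset search
library(ElemStatLearn)   # assumed source of the prostate data
data(prostate)

# Illustrative hard-thresholding-style search: among all models with at most
# three non-zero coefficients, find the best-fitting subset of predictors.
subsets <- regsubsets(lpsa ~ . - train, data = prostate, nvmax = 3,
                      method = "exhaustive")
summary(subsets)$which   # which predictors enter the 1-, 2- and 3-variable models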

56.8 Regularisation for Regression

If the \(\beta_j\)’s are unconstrained and the predictors are highly correlated, the coefficients can explode, so they are susceptible to very high variance, which leads to less accurate predictions. To control the variance we can regularise (shrink) the coefficients by minimising a penalised residual sum of squares: \[PRSS(\beta) = \sum_{i=1}^n \left(Y_i - \sum_{j=1}^p \beta_j X_{ij}\right)^2 + P(\lambda;\beta)\]

Where:

  • \(\sum_{i=1}^n (Y_i - \sum_{j=1}^p \beta_j X_{ij})^2\) is the residual sum of squares (the difference between the outcome and the linear model fit).
  • \(P(\lambda;\beta)\) is the penalty term

Here PRSS is the penalised residual sum of squares. Commonly we look for a penalty with these properties:

  • Penalty reduces complexity
  • Penalty reduces variance
  • Penalty respects structure of the problem

The first approach used for penalised regression was Ridge Regression.

56.9 Ridge Regression

Solve: \[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\] This is equivalent to minimising \(\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\), where \(s\) is inversely proportional to \(\lambda\).

Inclusion of \(\lambda\) makes the problem non-singular even if \(X^TX\) is not invertible.

56.9.1 Tuning Parameter \(\lambda\)

  • \(\lambda\) controls the size of the coefficients
  • \(\lambda\) controls the amount of regularisation
  • As \(\lambda \rightarrow 0\) we obtain the least squares solution
  • As \(\lambda \rightarrow \infty\) all of the coefficients shrink towards zero: \(\hat{\beta_{\lambda=\infty}^{ridge}} = 0\)

Cross-validation can be used to choose the value of \(\lambda\).
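
As a sketch of how that might be done (the glmnet package is an assumption here, not mentioned in these notes), cv.glmnet cross-validates over a grid of \(\lambda\) values; alpha = 0 selects the ridge penalty:

library(glmnet)          # assumed; not mentioned in these notes
library(ElemStatLearn)   # assumed source of the prostate data
data(prostate)

# Illustrative: choose the ridge penalty lambda by cross-validation.
x <- as.matrix(prostate[, c("lcavol", "lweight", "age", "lbph",
                            "svi", "lcp", "gleason", "pgg45")])
y <- prostate$lpsa

set.seed(1234)
cvfit <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
cvfit$lambda.min                      # lambda with the smallest CV error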

56.9.2 Lasso

The lasso method of regularisation uses a different constraint: the sum of the absolute values of the coefficients.

\[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2\] Subject to:

\[\sum_{j=1}^p |\beta_j| \le s\] It also has a Lagrangian form:

\[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\] For orthonormal design matrices (not the typical case) this has a closed-form solution, but not in general: \[\hat{\beta_j} = sign(\hat{\beta_j^0})(|\hat{\beta_j^0}| - \gamma)^+\]
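
A minimal R sketch of this soft-thresholding rule (the function name is illustrative):

# A direct R translation of the soft-thresholding rule above:
# shrink each least-squares coefficient towards zero by gamma and
# set coefficients that would cross zero exactly to zero.
soft_threshold <- function(beta0, gamma) {
  sign(beta0) * pmax(abs(beta0) - gamma, 0)
}

soft_threshold(c(-2.5, -0.3, 0.1, 1.7), gamma = 0.5)
# [1] -2.0  0.0  0.0  1.2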

To fit penalised regression models in caret, the methods to use are:

  • ridge
  • lasso
  • relaxo
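
A usage sketch with caret (the prostate data, the seed and the cross-validation settings are illustrative choices, not from the source):

library(caret)
library(ElemStatLearn)   # assumed source of the prostate data
data(prostate)

# Illustrative: fit a lasso model with caret, choosing the tuning
# parameter by 10-fold cross-validation.
set.seed(1234)
lassoFit <- train(lpsa ~ . - train, data = prostate,
                  method = "lasso",
                  trControl = trainControl(method = "cv", number = 10))
lassoFit$bestTune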