56 Regularised Regression
56.1 Basic Idea
- Fit a regression model
- Penalise (or shrink) large coefficients
Pros:
- Can help with the bias/variance trade-off
- Can help with model selection
Cons:
- May be computationally demanding on large data sets
- May not perform as well as random forests or boosting
56.2 A Motivating Example
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]
Where:
- \(X_1\) and \(X_2\) are nearly perfectly correlated (collinear).
This model can be approximated by leaving out one of the regressors:
\[Y = \beta_0 + (\beta_1 + \beta_2)X_1 + \epsilon\]
The result is (a short simulation follows this list):
- You will get a good estimate of \(Y\)
- The estimate of \(Y\) will be biased
- We may reduce the variance of the estimate
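A minimal simulation sketch of this idea, using only base R (the data, coefficients and noise level are invented for illustration):

```r
# Two nearly collinear predictors: dropping one biases the estimate, but the
# remaining coefficient approximates beta_1 + beta_2 with lower variance.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)          # nearly perfectly correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true beta_1 = 2, beta_2 = 3

coef(lm(y ~ x1 + x2))   # individual estimates of beta_1 and beta_2 are unstable
coef(lm(y ~ x1))        # the x1 coefficient is close to beta_1 + beta_2 = 5
```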
56.3 Common Pattern
As the complexity of a model increases, its accuracy on the training set always increases, because more of the variance in the training data can be explained. Fitting the training set too closely, however, overfits the model to the point that it no longer generalises and performs poorly on the test set.
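A quick simulated illustration of this pattern, where polynomial degree stands in for model complexity (everything here is made up for the example):

```r
# Training error keeps falling as the polynomial degree grows, while test
# error eventually rises again once the model starts to overfit.
set.seed(1)
n     <- 100
x     <- runif(n, -2, 2)
y     <- sin(x) + rnorm(n, sd = 0.3)
dat   <- data.frame(x = x, y = y)
train <- sample(n, n / 2)

errors <- sapply(1:8, function(degree) {
  fit  <- lm(y ~ poly(x, degree), data = dat[train, ])
  pred <- predict(fit, newdata = dat)
  c(train = mean((y[train] - pred[train])^2),
    test  = mean((y[-train] - pred[-train])^2))
})
round(errors, 3)   # columns correspond to polynomial degrees 1 to 8
```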
56.4 Model Selection Approach: Split Samples
If data and computation time permit, there is no better method. The approach, sketched in R below, is as follows:
1. Divide the data into training, validation and test sets.
2. Treat the validation set as test data: train all competing models on the training data and pick the one that performs best on the validation set.
3. To get an approximate assessment of performance on new data, apply the chosen model to the test set.
After this, you can re-split the data and repeat steps 1-3.
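A sketch of steps 1-3 using caret (the 60/20/20 split, the iris data and the two competing models are assumptions made for illustration):

```r
# Split into training (60%), validation (20%) and test (20%) sets, pick the
# best of several competing models on the validation set, then estimate its
# performance on the held-out test set.
library(caret)

set.seed(1)
in_train   <- createDataPartition(iris$Species, p = 0.6, list = FALSE)
training   <- iris[in_train, ]
rest       <- iris[-in_train, ]
in_valid   <- createDataPartition(rest$Species, p = 0.5, list = FALSE)
validation <- rest[in_valid, ]
testing    <- rest[-in_valid, ]

fits <- list(
  rpart = train(Species ~ ., data = training, method = "rpart"),
  lda   = train(Species ~ ., data = training, method = "lda")
)
valid_acc <- sapply(fits, function(f)
  mean(predict(f, validation) == validation$Species))
best <- fits[[which.max(valid_acc)]]

mean(predict(best, testing) == testing$Species)   # approximate out-of-sample accuracy
```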
There are two common problems to this approach:
- Limited Data
- Computational complexity
56.5 Decomposing Expected Prediction Error
Another approach is to decompose the expected prediction error and see whether that points to a way of including only the variables that really need to be in the model.
Assume that the outcome \(Y_i\) can be predicted as a function of \(X_i\) with the addition of an error term: \(Y_i = f(X_i)+\epsilon_i\)
Then the expected prediction error \((EPE)\) is the expected squared difference between the outcome and the predicted outcome: \[EPE(\lambda)= E[(Y - \hat{f_\lambda}(X))^2]\] where \(\hat{f_\lambda}\) is the estimate built from the training data using a particular set of tuning parameters \(\lambda\). Evaluated at a new data point \(X = x^*\), the expected prediction error can be decomposed into three components (a simulation sketch of this decomposition follows the list below):
\[E[(Y - \hat{f_\lambda}(x^*))^2] = \sigma^2 + [E[\hat{f_\lambda}(x^*)] - f(x^*)]^2 + var[\hat{f_\lambda}(x^*)]\]
Where:
- \(\sigma^2\) is the irreducible error
- \([E[\hat{f_\lambda}(x^*)] - f(x^*)]\) is the bias (difference between the expected prediction and the truth)
- \(var[\hat{f_\lambda}(x^*)]\) is the variance of our estimate
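The decomposition can be illustrated by simulation. The sketch below uses a ridge fit from the glmnet package as \(\hat{f_\lambda}\); the true function, the noise level and the test point \(x^*\) are all invented for the example:

```r
# Monte Carlo estimate of the bias-variance decomposition at a single point x*.
library(glmnet)

set.seed(1)
f      <- function(x) 2 * x[, 1] - x[, 2]         # "true" f(X), known here by construction
sigma  <- 1                                       # sd of the irreducible error
x_star <- matrix(c(0.5, -0.5), nrow = 1)
lambda <- 0.5
n_sims <- 500

preds <- replicate(n_sims, {
  x <- matrix(rnorm(50 * 2), ncol = 2)            # a fresh training set each time
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- glmnet(x, y, alpha = 0, lambda = lambda) # alpha = 0 gives a ridge penalty
  as.numeric(predict(fit, newx = x_star))
})

bias_sq  <- (mean(preds) - f(x_star))^2
variance <- var(preds)
c(irreducible = sigma^2, bias_sq = bias_sq, variance = variance,
  epe = sigma^2 + bias_sq + variance)
```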
56.6 Another Issue for High-Dimensional Data
If you have an extremely high-dimensional data set with more predictors than data points, the regression algorithm will not be able to calculate coefficients for all of the variables. For example, if only five data points are taken from the prostate data set, R cannot estimate coefficients for every variable and reports NA for several of them.
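The output below appears to come from a call along the following lines; this is a sketch that assumes the prostate data from the ElemStatLearn package (now archived on CRAN), which includes a logical train column:

```r
# Fit a linear model to only five rows of the prostate data: there are more
# coefficients to estimate than observations, so several come back as NA.
library(ElemStatLearn)
data(prostate)
small <- prostate[1:5, ]
lm(lpsa ~ ., data = small)
```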
##
## Call:
## lm(formula = lpsa ~ ., data = small)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph svi
## 9.60615 0.13901 -0.79142 0.09516 NA NA
## lcp gleason pgg45 trainTRUE
## NA -2.08710 NA NA
56.7 Hard Thresholding
One way to get around this problem is by using hard thresholding:
- Assume the model is a linear regression problem: \(Y = f(X) + \epsilon\)
- Set \(\hat{f_{\lambda}}(x) = x^\prime \beta\)
- Constrain only \(\lambda\) coefficients to be nonzero
- The selection problem is then, after choosing \(\lambda\), to figure out which \(\lambda\) coefficients should be nonzero (equivalently, which \(p - \lambda\) to set to zero)
For example, suppose \(\lambda = 3\), so only three coefficients may be nonzero. We then have to try every possible combination of three nonzero coefficients and pick the best-fitting model, which is still computationally demanding (see the sketch below).
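A brute-force sketch of this idea (the function name and the simulated data are invented for illustration): enumerate every subset of \(\lambda = 3\) predictors, fit each one, and keep the best.

```r
# Try every combination of n_nonzero predictors and keep the fit with the
# smallest residual sum of squares.
hard_threshold <- function(dat, outcome, n_nonzero = 3) {
  predictors <- setdiff(names(dat), outcome)
  subsets    <- combn(predictors, n_nonzero, simplify = FALSE)
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = outcome), data = dat)
  })
  rss <- sapply(fits, function(f) sum(resid(f)^2))
  fits[[which.min(rss)]]
}

set.seed(1)
x   <- matrix(rnorm(100 * 8), ncol = 8, dimnames = list(NULL, paste0("x", 1:8)))
y   <- 2 * x[, 1] - 3 * x[, 2] + x[, 3] + rnorm(100)
dat <- data.frame(y = y, x)
hard_threshold(dat, outcome = "y", n_nonzero = 3)   # should recover x1, x2 and x3
```

Even this small example fits \(\binom{8}{3} = 56\) models; with many predictors the number of subsets grows combinatorially.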
56.8 Regularisation for Regression
If the \(\beta_j\)’s are unconstrained and the predictors are highly correlated, the coefficients can explode, so they are susceptible to very high variance, which can lead to less accurate predictions. To control the variance we might regularise/shrink the coefficients by minimising a penalised residual sum of squares: \[PRSS(\beta) = \sum_{i=1}^n \left(Y_i - \sum_{j=1}^p \beta_j X_{ij}\right)^2 + P(\lambda;\beta)\]
Where:
- \(\sum_{i=1}^n \left(Y_i - \sum_{j=1}^p \beta_j X_{ij}\right)^2\) is the squared difference between the outcome and the linear model fit (the residual sum of squares).
- \(P(\lambda;\beta)\) is the penalty term
\(PRSS\) is a penalised form of the residual sum of squares. Things commonly looked for in the penalty:
- Penalty reduces complexity
- Penalty reduces variance
- Penalty respects structure of the problem
The first approach used for penalised regression was Ridge Regression.
56.9 Ridge Regression
Solve: \[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\] This is equivalent to solving \(\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\), where \(s\) is inversely proportional to \(\lambda\).
Inclusion of \(\lambda\) makes the problem non-singular even if \(X^TX\) is not invertible.
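In matrix form (a standard result, not spelled out in the original notes) the ridge estimate is:
\[\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^TY\]
Adding \(\lambda I\) to \(X^TX\) is exactly what makes the matrix invertible even when \(X^TX\) itself is singular.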
56.9.1 Tuning Parameter \(\lambda\)
- \(\lambda\) controls the size of the coefficients
- \(\lambda\) controls the amount of regularisation
- As \(\lambda \rightarrow 0\) we obtain the least square solution
- As \(\lambda \rightarrow \infty\) all of the coefficients shrink to zero: \(\hat{\beta_{\lambda=\infty}^{ridge}} = 0\)
Cross-validation can be used to choose a good value of \(\lambda\) (see the sketch below).
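For example, cv.glmnet from the glmnet package performs this cross-validation directly (a sketch with simulated data; alpha = 0 selects the ridge penalty):

```r
# Choose the ridge penalty lambda by 10-fold cross-validation.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- as.numeric(x %*% c(2, -1, rep(0, 8)) + rnorm(100))

cv_fit <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 -> ridge
cv_fit$lambda.min                      # lambda with the smallest cross-validated error
coef(cv_fit, s = "lambda.min")         # coefficients are shrunken but not exactly zero
```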
56.9.2 Lasso
The lasso method of regularisation uses a different constraint: the sum of the absolute values of the coefficients.
\[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2\] Subject to:
\[\sum_{j=1}^p |\beta_j| \le s\] It also has a Lagrangian form:
\[\sum_{i=1}^N \left(Y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\] For orthonormal design matrices (which are not the norm) this has a closed-form soft-threshold solution, but not in general: \[\hat{\beta_j} = sign(\hat{\beta_j^0})\left(|\hat{\beta_j^0}| - \gamma\right)^+\]
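A short sketch of both pieces: the soft-threshold closed form above, and a general lasso fit with cross-validation via glmnet (alpha = 1). The simulated data are invented for illustration.

```r
# Closed-form soft thresholding (the orthonormal-design case), plus a general
# lasso fit with glmnet; the lasso sets some coefficients exactly to zero.
library(glmnet)

soft_threshold <- function(beta_ols, gamma) {
  sign(beta_ols) * pmax(abs(beta_ols) - gamma, 0)   # shrink, then zero out small ones
}
soft_threshold(c(-2.0, 0.3, 1.5), gamma = 0.5)      # -1.5  0.0  1.0

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- as.numeric(x %*% c(3, -2, rep(0, 8)) + rnorm(100))
lasso_cv <- cv.glmnet(x, y, alpha = 1)              # alpha = 1 -> lasso
coef(lasso_cv, s = "lambda.min")                    # several coefficients are exactly zero
```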
To fit penalised regression models in caret, the methods to use are (a minimal sketch follows the list):
- ridge
- lasso
- relaxo
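A minimal caret sketch using the lasso method (the simulated data are made up; method = "lasso" relies on the elasticnet package being installed):

```r
# Fit a lasso model with caret::train and pick its tuning parameter by CV.
library(caret)

set.seed(1)
x   <- matrix(rnorm(100 * 10), ncol = 10, dimnames = list(NULL, paste0("x", 1:10)))
y   <- as.numeric(x %*% c(3, -2, rep(0, 8)) + rnorm(100))
dat <- data.frame(y = y, x)

fit <- train(y ~ ., data = dat, method = "lasso",
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune               # the fraction of the full lasso path chosen by CV
predict(fit, dat[1:3, ])   # predictions from the tuned model
```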