49 Predicting with Regression

49.1 Key Ideas

  • Fit a simple regression model
  • Plug in new covariates and multiply them by the coefficients
  • Useful when the linear model is (nearly) correct

Pros: * Easy to implement * easy to interpret

Cons: * Often poor performance on non-linear settings

49.2 Example: Geyser Eruptions

##    eruptions waiting
## 6      2.883      55
## 7      4.700      88
## 9      1.950      51
## 11     1.833      54
## 12     3.917      84
## 13     4.200      78

## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29776 -0.37733  0.03967  0.35578  1.20490 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.82108    0.23306  -7.814  1.4e-12 ***
## waiting      0.07478    0.00323  23.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5183 on 135 degrees of freedom
## Multiple R-squared:  0.7988, Adjusted R-squared:  0.7973 
## F-statistic: 535.8 on 1 and 135 DF,  p-value: < 2.2e-16

49.3 Predicting a New Value

So according to our model, what’s the predicted value of the eruption duration eruption given a waiting time of \(80\)?

Using the linear model: \[\hat{ED} = \hat{\beta_0} + \hat{\beta_1} WT\]

## (Intercept) 
##    4.161219
##        1 
## 4.161219

49.5 Get Training and Test set Errors

The measure of error for the linear model here will be \(RMSE\). This is the total difference between the fitted points and the actual points.

The \(RMSE\) for the test set is the exact same bench

## [1] 6.021535
## [1] 5.509391

49.6 Prediction Intervals

Prediction intervals can be useful for looking at possible ranges that the values are likely to take. However, confidence intervals in general are probably more useful, this looks more or less intuitive going by the variation around the regression line.

Below we’ll use the regular plot(), matlines() and prediciton() functions, however this can also be done in the caret package.

This process can also be done in the caret package.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29776 -0.37733  0.03967  0.35578  1.20490 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.82108    0.23306  -7.814  1.4e-12 ***
## waiting      0.07478    0.00323  23.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5183 on 135 degrees of freedom
## Multiple R-squared:  0.7988, Adjusted R-squared:  0.7973 
## F-statistic: 535.8 on 1 and 135 DF,  p-value: < 2.2e-16