40 ROC Curves

40.1 Why a Curve?

In binary classification, you're predicting one of two categories, such as alive or dead, but your predictions are often quantitative:

  • Probability of being alive
  • Prediction scale from 1 to 10
  • The cutoff you choose gives different results. (For example, if we classify everyone with a probability of being alive above 40% as alive, more people will be labeled alive than if the cutoff is 80%.)

The ROC curve plots the true positive rate against the false positive rate, i.e., sensitivity vs. 1 − specificity, at every possible cutoff. This tells you whether the algorithm is any good; the standard benchmark is the area under the curve (AUC).

  • An AUC of 0.5 is effectively random guessing for a binary classifier; values below 0.5 are worse than chance.
  • An AUC of 1 is a perfect classifier.
  • In general, an AUC above 0.8 is considered “good”, but make sure you have an estimate of the Bayes optimal error beforehand.
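
For illustration, here is a minimal ROC/AUC sketch using the pROC package (the package choice and the simulated data are assumptions for the example, not part of the original):

library(pROC)

# Simulated true labels and prediction scores (stand-ins for a real model's output)
set.seed(1)
labels <- factor(sample(c("dead", "alive"), 100, replace = TRUE))
scores <- runif(100)            # e.g., predicted probability of being alive

roc_obj <- roc(labels, scores)  # sensitivity/specificity at every possible cutoff
plot(roc_obj)                   # draws the ROC curve
auc(roc_obj)                    # area under the curve; near 0.5 here, since scores are random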

40.2 Cross Validation

40.2.1 Key Idea

  1. Accuracy on the training data (resubstitution accuracy) is optimistic.
  2. A better estimate comes from an independent set (test set accuracy).
  3. But we can't use the test set when building the model, or it effectively becomes part of the training set.
  4. So we estimate the test set accuracy with the training set.

Approach:

  1. Use the training set
  2. Split it into training/test sets
  3. Build a model on the training set
  4. Evaluate on the test set
  5. Repeat and average the estimated errors
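
A minimal hand-rolled version of this loop, assuming caret's createFolds and the built-in mtcars data (both illustrative choices):

library(caret)

set.seed(125)
folds <- createFolds(mtcars$mpg, k = 10)   # split the training set into 10 folds

rmse_per_fold <- sapply(folds, function(test_idx) {
  train_df <- mtcars[-test_idx, ]          # build a model on the training part
  test_df  <- mtcars[test_idx, ]
  fit <- lm(mpg ~ wt + hp, data = train_df)
  sqrt(mean((predict(fit, test_df) - test_df$mpg)^2))  # evaluate (RMSE) on the held-out part
})

mean(rmse_per_fold)                        # average the estimated errors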

Used for:

  1. Picking variables to include in the model
  2. Picking the type of prediction function to use
  3. Picking the parameters in the prediction function
  4. Comparing different predictors

40.3 Example
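
The code that produced the output below is not shown; what follows is a plausible reconstruction, assuming the built-in swiss data (47 rows, Fertility as the outcome, 5 predictors) and the caret package. Exact numbers depend on the seed and split, and the final relative-error line is a guess at what produced the single number printed below.

library(caret)

data(swiss)
set.seed(123)
in_train <- createDataPartition(swiss$Fertility, p = 0.8, list = FALSE)
train_df <- swiss[in_train, ]
test_df  <- swiss[-in_train, ]

fit  <- lm(Fertility ~ ., data = train_df)   # plain linear regression on the training split
pred <- predict(fit, test_df)

data.frame(R2   = R2(pred, test_df$Fertility),
           RMSE = RMSE(pred, test_df$Fertility),
           MAE  = MAE(pred, test_df$Fertility))

RMSE(pred, test_df$Fertility) / mean(test_df$Fertility)  # relative prediction error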

##          R2     RMSE      MAE
## 1 0.5946201 6.410914 5.651552
## [1] 0.08800157
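
Leave-one-out cross-validation on the full data, presumably via caret::train (again a hedged reconstruction):

model_loocv <- train(Fertility ~ ., data = swiss, method = "lm",
                     trControl = trainControl(method = "LOOCV"))
print(model_loocv)
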
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.738618  0.6128307  6.116021
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
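
The same model under 10-fold cross-validation:

set.seed(123)
model_cv <- train(Fertility ~ ., data = swiss, method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
print(model_cv)
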
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 42, 44, 42, 43, 41, 42, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.126707  0.6863589  6.046966
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
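
And 10-fold cross-validation repeated 3 times:

set.seed(123)
model_rcv <- train(Fertility ~ ., data = swiss, method = "lm",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 10, repeats = 3))
print(model_rcv)
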
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 43, 42, 42, 43, 41, 42, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.304991  0.7211256  6.030067
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

40.4 Considerations

  • For time-series data, the data must be used in contiguous chunks that preserve temporal order (see the sketch after this list).
  • For k-fold cross-validation:
    • Larger k = less bias, more variance
    • Smaller k = more bias, less variance
  • Random sampling must be done without replacement
  • Random sampling with replacement is the bootstrap
    • Underestimates the error
    • Can be corrected but is complicated (0.632 Bootstrap)
  • If you cross-validate to pick predictors, you must estimate errors on an independent dataset.
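
Both the chunked time-series resampling and the bootstrap have caret helpers; a minimal sketch (the window sizes and resample count are illustrative assumptions):

library(caret)

# Time-series chunks: contiguous training windows followed by a forecast horizon
time_slices <- createTimeSlices(1:30, initialWindow = 20, horizon = 5,
                                fixedWindow = TRUE)
str(time_slices$train[1:2])   # each slice is a contiguous block, not a random sample

# Bootstrap resampling (random sampling with replacement)
boot_ctrl <- trainControl(method = "boot", number = 25)
# The 0.632 correction mentioned above is available as method = "boot632"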