40 ROC Curves
40.1 Why a Curve?
In binary classification, you're predicting one of two categories, such as alive or dead, but your predictions are often quantitative:
- Probability of being alive
- Prediction scale from 1 to 10
- The cutoff you choose gives different results. (For example, if we classify everyone with a probability of being alive above 40% as alive, more people will be classified as alive than if the cutoff is 80%.)
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the cutoff varies. It shows how good the classifier is; the standard summary benchmark is the area under the curve (AUC).
- AUC of 0.5 is effectively random guessing for a binary classifier. Lower is worse.
- AUC = 1 is a perfect classifier
- In general, an AUC above 0.8 is considered “good”, but make sure you have an estimate of the Bayes optimal error beforehand.
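As a concrete sketch, an ROC curve and its AUC can be computed with the pROC package (pROC and the simulated scores below are illustrative assumptions, not part of the notes above):

library(pROC)

# Simulate a binary outcome and a quantitative prediction score (toy data)
set.seed(123)
n <- 500
score <- rnorm(n)                          # quantitative prediction, e.g. a risk score
outcome <- rbinom(n, 1, plogis(2 * score)) # binary outcome related to the score

# Build the ROC object and compute the area under the curve
roc_obj <- roc(response = outcome, predictor = score)
auc(roc_obj)

# Plot sensitivity vs 1 - specificity
plot(roc_obj, legacy.axes = TRUE)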
40.2 Cross Validation
40.2.1 Key Idea
- Accuracy on the training data (resubstitution accuracy) is optimistic.
- A better estimate comes from an independent set (test set accuracy).
- But we can't use the test set when building the model, or it becomes part of the training set.
- So we estimate the test set accuracy with the training set.
Approach:
- Use the training set
- Split it into training/test sets
- Build a model on the training set
- Evaluate on the test set
- Repeat and average the estimated errors
Used for:
- Picking variables to include in the model
- Picking the type of prediction function to use
- Picking the parameters in the prediction function
- Comparing different predictors
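To make the split/build/evaluate/repeat/average steps concrete before the caret example below, here is a minimal base-R sketch on the swiss data (the 80/20 split, 25 repeats, and RMSE as the error measure are illustrative choices, not prescribed above):

# Repeat a random 80/20 split: fit on the training part, evaluate on the held-out part
set.seed(123)
rmse_per_split <- replicate(25, {
  idx <- sample(nrow(swiss), size = round(0.8 * nrow(swiss)))
  fit <- lm(Fertility ~ ., data = swiss[idx, ])
  pred <- predict(fit, newdata = swiss[-idx, ])
  sqrt(mean((swiss[-idx, "Fertility"] - pred)^2))  # RMSE on the held-out rows
})
# Average the estimated errors across the repeated splits
mean(rmse_per_split)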
40.3 Example
library(caret)
library(tidyverse)
# Load data
data("swiss")
# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(R2 = R2(predictions, test.data$Fertility),
           RMSE = RMSE(predictions, test.data$Fertility),
           MAE = MAE(predictions, test.data$Fertility))
## R2 RMSE MAE
## 1 0.5946201 6.410914 5.651552
# Prediction error rate (RMSE divided by the mean outcome)
RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)
## [1] 0.08800157
### 1. Try 'Leave One Out Cross Validation' ###
# Define training control using 'Leave One Out Cross Validation'
train.control <- trainControl(method = "LOOCV")
# Train the model
model_cross_val <- train(Fertility ~., data = swiss, method = "lm",
                         trControl = train.control)
# Summarize the results
print(model_cross_val)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.738618 0.6128307 6.116021
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
### 2. Try 'K-Folds Cross Validation' ###
# Define training control
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model_k_fold <- train(Fertility ~., data = swiss, method = "lm",
                      trControl = train.control)
# Summarize the results
print(model_k_fold)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 42, 44, 42, 43, 41, 42, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.126707 0.6863589 6.046966
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
### 3. Try 'Repeated K-Folds Cross Validation' ###
# Define training control
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model_k_fold_rep <- train(Fertility ~., data = swiss, method = "lm",
                          trControl = train.control)
# Summarize the results
print(model_k_fold_rep)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 43, 42, 42, 43, 41, 42, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.304991 0.7211256 6.030067
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
40.4 Considerations
- For time series data, data must be used in contiguous chunks (see the time-slice sketch after this list).
- For K-Fold cross validation
- Larger K = less bias, more variance
- Smaller K = more bias, less variance
- Random sampling must be done without replacement
- Random sampling with replacement is the bootstrap
- Underestimates the error
- Can be corrected but is complicated (0.632 Bootstrap)
- If you cross-validate to pick predictors, you must estimate errors on an independent dataset.
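For the time-series point above, one option is caret's rolling-origin resampling via trainControl(method = "timeslice"). The sketch below uses toy trend data and illustrative window sizes; both are assumptions for demonstration only:

library(caret)

# Toy time-ordered data: a linear trend plus noise (purely illustrative)
set.seed(123)
ts_df <- data.frame(t = 1:100)
ts_df$y <- 0.5 * ts_df$t + rnorm(100, sd = 5)

# Rolling-origin resampling: train on a chunk of consecutive points,
# then evaluate on the points immediately after that chunk
train.control <- trainControl(method = "timeslice",
                              initialWindow = 60,  # size of each training chunk
                              horizon = 10,        # size of each test chunk
                              fixedWindow = TRUE)
model_ts <- train(y ~ t, data = ts_df, method = "lm",
                  trControl = train.control)
print(model_ts)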