42 The Caret Package

caret is a front-end package that wraps many of the prediction algorithms and tools available in R behind a common interface.

The caret package provides functions for:

  • Some preprocessing (cleaning)
    • preProcess
  • Data splitting
    • createDataPartition
    • createResample
    • createTimeSlices
  • Training / Testing functions
    • train
    • predict
  • Model comparison
    • confusionMatrix
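
As a rough illustration of the preprocessing and data-splitting functions listed above, here is a minimal sketch. The toy data frame, its column names, and the seed are made up for the example and do not come from the text.

```r
library(caret)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy data (hypothetical, not from the text): 100 rows, two numeric predictors
toy <- data.frame(x1 = rnorm(100, mean = 5, sd = 2),
                  x2 = runif(100),
                  y  = factor(rep(c("a", "b"), times = c(60, 40))))

# Preprocessing: estimate centering/scaling, then apply it with predict()
pp     <- preProcess(toy[, c("x1", "x2")], method = c("center", "scale"))
scaled <- predict(pp, toy[, c("x1", "x2")])

# Data splitting: stratified 75/25 split on the outcome, returned as row indices
inTrain  <- createDataPartition(toy$y, p = 0.75, list = FALSE)
training <- toy[inTrain, ]
testing  <- toy[-inTrain, ]

# Bootstrap resamples: a list of index vectors drawn with replacement
boots <- createResample(toy$y, times = 3, list = TRUE)

# Rolling windows for time-ordered data: 20 training points, then 10 test points
slices <- createTimeSlices(1:100, initialWindow = 20, horizon = 10)
names(slices)
```

Note that preProcess() only estimates the transformation. The same fitted object can later be applied to a test set with predict(), so that the test data are scaled with the training-set parameters.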

42.1 Machine Learning Algorithms in R

Many popular machine learning algorithms are available in R, including:

  • Linear Discriminant Analysis
  • Regression
  • Naive Bayes
  • Support Vector Machines
  • Classification and Regression Trees
  • Random Forests
  • Boosting
  • More…
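
One reason to use caret with these algorithms is that train() selects the algorithm through its method argument, so switching models is mostly a matter of changing one string. The sketch below assumes the built-in iris data and 5-fold cross-validation, neither of which appears in the text. It fits linear discriminant analysis ("lda") and a classification tree ("rpart"). Other method strings such as "glm", "svmRadial", "rf", or "gbm" should work the same way, provided the underlying package is installed.

```r
library(caret)

data(iris)    # built-in toy data, used here only for illustration
set.seed(1)   # arbitrary seed

# 5-fold cross-validation shared by both fits
ctrl <- trainControl(method = "cv", number = 5)

# Same call, different algorithm: only the method string changes
fit_lda  <- train(Species ~ ., data = iris, method = "lda",   trControl = ctrl)
fit_tree <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Cross-validated accuracy for each model
fit_lda$results$Accuracy
max(fit_tree$results$Accuracy)
```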

42.2 SPAM Example: Data Splitting

Important functions here:

  • createDataPartition()
  • train()
  • predict()
  • confusionMatrix()
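
The code that produced the output in this section is not shown. A plausible reconstruction might look like the sketch below, which rests on three assumptions:

  • The data are the spam data from the kernlab package.
  • The training split is 75%, inferred from the 3451 and 1150 row counts.
  • The model is a logistic regression fitted with method = "glm", suggested by the "Generalized Linear Model" header.

The seed is arbitrary, so the accuracy figures from a rerun will not match the printed ones exactly.

```r
library(caret)
library(kernlab)   # assumed source of the spam data set

data(spam)
set.seed(1)        # arbitrary seed, not the one used for the printed output

# Stratified split on the outcome; p = 0.75 is inferred from the row counts
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]
dim(training)
dim(testing)

# Fit a logistic regression; train() resamples with 25 bootstrap reps by default
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit

# Inspect a few components of the underlying glm object
modelFit$finalModel[c("family", "deviance", "df.residual")]

# Predict on the held-out data and compare with the true labels
predictions <- predict(modelFit, newdata = testing)
confusionMatrix(predictions, testing$type)
```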
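
Dimensions of the training set, then the testing set (rows, columns):

## [1] 3451   58
## [1] 1150   58

Summary printed for the fitted model:

## Generalized Linear Model 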
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9191452  0.8288818

Selected components of the final fitted model:

## $family
## 
## Family: binomial 
## Link function: logit 
## 
## 
## $deviance
## [1] 1386.676
## 
## $df.residual
## [1] 3393

Confusion matrix for the predictions on the testing set:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     667   56
##    spam         30  397
##                                           
##                Accuracy : 0.9252          
##                  95% CI : (0.9085, 0.9398)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8418          
##                                           
##  Mcnemar's Test P-Value : 0.007022        
##                                           
##             Sensitivity : 0.9570          
##             Specificity : 0.8764          
##          Pos Pred Value : 0.9225          
##          Neg Pred Value : 0.9297          
##              Prevalence : 0.6061          
##          Detection Rate : 0.5800          
##    Detection Prevalence : 0.6287          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : nonspam         
##
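
To see where the headline statistics come from, the sketch below recomputes several of them in base R from the four cell counts printed above, treating nonspam as the positive class as the output does. The variable names are chosen for this example only.

```r
# Cell counts copied from the confusion matrix above (rows = prediction)
cm <- matrix(c(667, 30, 56, 397), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

tp <- cm["nonspam", "nonspam"]  # nonspam correctly predicted as nonspam
fn <- cm["spam",    "nonspam"]  # nonspam wrongly predicted as spam
fp <- cm["nonspam", "spam"]     # spam wrongly predicted as nonspam
tn <- cm["spam",    "spam"]     # spam correctly predicted as spam
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```

The rounded values agree with the Accuracy, Kappa, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, and Balanced Accuracy lines in the printed output.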