42 The Caret Package
Caret is a useful front-end package that wraps many of the prediction algorithms and tools available in R behind a single, consistent interface.
Caret has a lot of functionality:
- Some preprocessing (cleaning)
  - preProcess
- Data splitting
  - createDataPartition
  - createResample
  - createTimeSlices
- Training / Testing functions
  - train
  - predict
- Model comparison
  - confusionMatrix
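The SPAM example later in this chapter covers createDataPartition, train, predict and confusionMatrix, so here is a minimal sketch of the remaining helpers. It assumes the built-in iris data as a stand-in, and the window sizes passed to createTimeSlices are arbitrary illustrative values rather than recommendations.

```r
library(caret)

set.seed(123)

# preProcess(): estimate centering/scaling from the predictors, then apply it
pp     <- preProcess(iris[, 1:4], method = c("center", "scale"))
scaled <- predict(pp, iris[, 1:4])
col_means <- colMeans(scaled)   # each column is now centered at (numerically) zero

# createResample(): bootstrap samples, i.e. row indices drawn with replacement
boot <- createResample(y = iris$Species, times = 3, list = TRUE)
boot_sizes <- sapply(boot, length)   # each resample is as long as the data

# createTimeSlices(): rolling training windows with a forecast horizon
# (illustrative values: 20 time points, window of 10, horizon of 2)
slices <- createTimeSlices(y = 1:20, initialWindow = 10, horizon = 2)
first_train <- slices$train[[1]]   # time points 1 to 10
first_test  <- slices$test[[1]]    # time points 11 and 12
```

The object returned by preProcess only stores the transformation; the same `pp` can be reused with predict() on a test set so that test data are scaled with the training set's means and standard deviations.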
42.1 Machine Learning Algorithms in R
R offers many popular machine learning algorithms, including:
- Linear Discriminant Analysis
- Regression
- Naive Bayes
- Support Vector Machines
- Classification and Regression Trees
- Random Forests
- Boosting
- More…
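One practical consequence of caret wrapping these algorithms is that switching between them is mostly a matter of changing the `method` argument of train(). The sketch below illustrates this with two of the algorithms listed above. It assumes the built-in iris data, 5-fold cross-validation (chosen only to keep the run short), and that the MASS and rpart packages, which back the "lda" and "rpart" methods, are installed.

```r
library(caret)

# 5-fold cross-validation; an illustrative choice, not a recommendation
ctrl <- trainControl(method = "cv", number = 5)

# Same formula, data and resampling scheme; only `method` changes.
# Resetting the seed before each call gives both models the same folds.
set.seed(123)
fit_lda <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl)

set.seed(123)
fit_cart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Collect the per-fold results of both models for a side-by-side comparison
comparison <- resamples(list(LDA = fit_lda, CART = fit_cart))
summary(comparison)
```

Because both fits share the same folds, the accuracy and kappa distributions reported by summary() are directly comparable.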
42.2 SPAM Example: Data Splitting
Important functions here:
- createDataPartition()
- train()
- predict()
- confusionMatrix()
# Load the package and the data
library(caret); library(kernlab); data(spam)
set.seed(123)
# Use data partition functions to create the training and testing sets
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75,
                               list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
# Check that it works
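dim(training)
## [1] 3451 58
dim(testing)
## [1] 1150 58
# Fit a logistic regression (GLM) on the training set, using all predictors
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit
## Generalized Linear Model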
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
##   Accuracy   Kappa
##   0.9191452  0.8288818
# Inspect selected components of the fitted model (calling summary() on modelFit$finalModel is often more informative)
modelFit$finalModel[c(8,10, 16)]
## $family
##
## Family: binomial
## Link function: logit
##
##
## $deviance
## [1] 1386.676
##
## $df.residual
## [1] 3393
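The positional indices 8, 10 and 16 used above are easy to get wrong, so the same components can be selected by name instead. The following is a minimal, self-contained sketch on a plain glm fit. The mtcars data and the formula am ~ wt are stand-ins chosen only so the block runs without the spam data; with method = "glm" the `finalModel` stored by caret is also a glm object, so the same accessors should apply to it.

```r
# Stand-in logistic regression on a built-in data set (illustrative only)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Positions 8, 10 and 16 of a glm object correspond to these names
picked_names <- names(fit)[c(8, 10, 16)]

# Selecting by name is more readable and more robust than by position
parts <- fit[c("family", "deviance", "df.residual")]

# summary() adds the coefficient table with standard errors and p-values
coef_table <- summary(fit)$coefficients
coef_table
```

For the model in this section, the equivalent calls would be `modelFit$finalModel[c("family", "deviance", "df.residual")]` and `summary(modelFit$finalModel)`.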
# Predict on new data using the 'predict' function
predictions <- predict(modelFit, newdata = testing)
# To diagnose the model (evaluate its predictions) we use the 'confusionMatrix' function
confusionMatrix(predictions, testing$type)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction nonspam spam
##    nonspam     667   56
##    spam         30  397
##
## Accuracy : 0.9252
## 95% CI : (0.9085, 0.9398)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8418
##
## Mcnemar's Test P-Value : 0.007022
##
## Sensitivity : 0.9570
## Specificity : 0.8764
## Pos Pred Value : 0.9225
## Neg Pred Value : 0.9297
## Prevalence : 0.6061
## Detection Rate : 0.5800
## Detection Prevalence : 0.6287
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : nonspam
##
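To make the statistics in this output less opaque, the sketch below recomputes the main ones by hand from the four cell counts, using only base R. The counts are copied from the confusion matrix above, and 'nonspam' is treated as the positive class, as the output states. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(667, 30, 56, 397), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

n  <- sum(cm)
tp <- cm["nonspam", "nonspam"]  # predicted nonspam, actually nonspam
fn <- cm["spam",    "nonspam"]  # predicted spam, actually nonspam
fp <- cm["nonspam", "spam"]     # predicted nonspam, actually spam
tn <- cm["spam",    "spam"]     # predicted spam, actually spam

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of true nonspam that was caught
specificity <- tn / (tn + fp)   # share of true spam that was caught
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row/column totals give by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa_stat,
        Sensitivity = sensitivity, Specificity = specificity,
        PPV = ppv, NPV = npv, BalancedAccuracy = balanced), 4)
```

The rounded values agree with the Accuracy, Kappa, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Balanced Accuracy lines reported by confusionMatrix().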