43 Data Slicing
When using the createDataPartition() function, the first argument, y, is the outcome you want the function to split on. Here we set y to the spam$type factor, so the sampling is done within each class and both sets keep roughly the same share of spam and non-spam messages. The second argument, p, is the proportion of the data that goes into the training set; in the code below it is set to 0.75, giving a \(75\% ~:~25\%\) split.
# Load the package and the data
library(caret); library(kernlab); data(spam)
set.seed(123)
# Use data partition functions to create the training and testing sets
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75,
                               list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
# Check that it works: the two sets should add up to the 4601 rows of spam
dim(training)
## [1] 3451 58
dim(testing)
## [1] 1150 58
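Because y is a factor, createDataPartition() samples within each class, so the class balance should carry over to both sets. A minimal sketch to check this, rebuilding the split from above (the helper name class_props is ours, not part of caret):

```r
library(caret); library(kernlab); data(spam)
set.seed(123)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]

# Hypothetical helper: proportion of each class in a factor
class_props <- function(x) prop.table(table(x))

# Stack the proportions for the full data and for both subsets
props <- rbind(full     = class_props(spam$type),
               training = class_props(training$type),
               testing  = class_props(testing$type))
round(props, 3)
```

If the three rows are close to each other, the split has preserved the spam to non-spam ratio.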
43.1 SPAM Example: K-Fold
Important functions here:
createFolds()
set.seed(123)
# Create 10 folds; returnTrain = TRUE returns the training-row indices for each fold
folds <- createFolds(y = spam$type,
                     k = 10,
                     list = TRUE,
                     returnTrain = TRUE)
# Check how many training rows each fold has; the counts should be nearly equal
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 4141 4140 4140 4142 4142 4141 4141 4141 4140 4141
# Look at the first 10 training indices of the first fold
folds[[1]][1:10]
## [1] 1 2 3 4 5 6 7 8 10 11
# Create 10 folds; returnTrain = FALSE returns the held-out (testing) row indices for each fold
folds <- createFolds(y = spam$type,
                     k = 10,
                     list = TRUE,
                     returnTrain = FALSE)
# Check how many testing rows each fold has; the counts should be nearly equal
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 460 460 460 459 460 461 459 461 461 460
# Look at the first 10 testing indices of the first fold
folds[[1]][1:10]
## [1] 1 16 28 35 40 68 78 87 111 114
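With returnTrain = FALSE each element of folds holds the held-out rows, so together the folds should cover every row exactly once, and the matching training rows are simply everything else. A short sketch of that check and of a typical loop over the folds (the names covers_all and fold_sizes are illustrative):

```r
library(caret); library(kernlab); data(spam)
set.seed(123)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = FALSE)

# Every row index should appear in exactly one test fold
all_rows <- seq_len(nrow(spam))
covers_all <- isTRUE(all.equal(sort(unlist(folds, use.names = FALSE)), all_rows))
covers_all

# Loop over the folds, deriving each training set as the complement
fold_sizes <- sapply(folds, function(test_idx) {
  train_idx <- setdiff(all_rows, test_idx)
  c(train = length(train_idx), test = length(test_idx))
})
fold_sizes
```

Inside the loop you would normally fit a model on spam[train_idx, ] and evaluate it on spam[test_idx, ]; here only the sizes are returned to keep the sketch short.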
43.2 SPAM Example: Resampling
If you want to resample (bootstrap) the data instead of doing k-fold cross-validation, you can use the createResample() function. The times argument sets how many resamples to draw, and the list argument controls whether the result is returned as a list (TRUE) or as a matrix (FALSE).
set.seed(123)
# Create 10 bootstrap resamples using the createResample() function
folds <- createResample(y = spam$type, times = 10,
                        list = TRUE)
sapply(folds, length)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06 Resample07
## 4601 4601 4601 4601 4601 4601 4601
## Resample08 Resample09 Resample10
## 4601 4601 4601
# Because we sample with replacement, the same row can be drawn more than once; for example, rows 1 and 5 each appear twice in the first resample.
folds[[1]][1:10]
## [1] 1 1 2 5 5 9 13 16 17 18
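Since each resample draws nrow(spam) rows with replacement, some rows are left out of any given resample; these out-of-bag rows are the natural ones to evaluate on. A minimal sketch, rebuilding the resamples from above (oob_frac is our own name); for a data set of this size the left-out share is expected to be near \(1/e \approx 0.37\):

```r
library(caret); library(kernlab); data(spam)
set.seed(123)
folds <- createResample(y = spam$type, times = 10, list = TRUE)

# Rows never drawn in a resample are "out-of-bag" for that resample
all_rows <- seq_len(nrow(spam))
oob_frac <- sapply(folds, function(idx) {
  length(setdiff(all_rows, idx)) / length(all_rows)
})
round(oob_frac, 3)

# How many rows were drawn once, twice, three times, ... in the first resample
table(table(folds[[1]]))
```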
43.3 SPAM Example: Time Slices
If we want time slices, for example because the observations are ordered in time and each test set should come after its training set, we can use the createTimeSlices() function. The initialWindow argument sets how many consecutive samples go into each training slice (taken from our arbitrary time vector below), and the horizon argument sets how many of the samples immediately after that window we would like to predict.
set.seed(123)
# Create our arbitrary time vector
time <- 1:1000
# Create the folds using the createTimeSlices() function
folds <- createTimeSlices(y = time, initialWindow = 20,
                          horizon = 10)
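# What does the 'folds' object look like now?
names(folds)
## [1] "train" "test"
# The first training slice holds the first 20 samples
folds$train[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# The first testing slice holds the 10 samples that follow
folds$test[[1]]
## [1] 21 22 23 24 25 26 27 28 29 30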
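By default createTimeSlices() keeps the training window at a fixed length and slides it forward one step at a time; setting fixedWindow = FALSE instead lets the window grow from the first sample. A sketch comparing the two, assuming the same time vector as above (the names sliding, growing and n_expected are ours):

```r
library(caret)
time <- 1:1000

sliding <- createTimeSlices(y = time, initialWindow = 20, horizon = 10)
growing <- createTimeSlices(y = time, initialWindow = 20, horizon = 10,
                            fixedWindow = FALSE)

# One slice per position where the window plus the horizon still fits
n_expected <- length(time) - 20 - 10 + 1
c(slices = length(sliding$train), expected = n_expected)

# Second slice: the sliding window moves forward, the growing one extends
range(sliding$train[[2]])
range(growing$train[[2]])
range(sliding$test[[2]])
```

In both variants the test slice always starts right after the last training sample, so no future observation leaks into training.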