46 Preprocessing
46.1 Why Preprocess?
From the SPAM data below it is evident that preprocessing is needed: some predictors vary so wildly that, left untransformed, they can mislead our machine learning models.
library(caret)
library(kernlab)
data(spam)
set.seed(123)
# Use createDataPartition to split the data into training and testing sets
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75,
                               list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
# Let's plot a variable to see its distribution
hist(training$capitalAve, main = "", xlab = "Average capital run length")
# The distribution is extremely right-skewed; let's try the log of the same variable
hist(log10(training$capitalAve), main = "", xlab = "Log 10 Average capital run length")
# The summary statistics confirm the skew: the mean far exceeds the median
summary(training$capitalAve)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.575   2.277   4.864   3.714 1102.500
sd(training$capitalAve)
## [1] 27.80173
46.2 Standardising
One way of getting around this is to standardise the data: \[\text{Standardised data} ~=~ \frac{x - \overline{x}}{\mathrm{sd}(x)}\] Just be sure not to forget that if you standardise the training set, you also have to standardise the testing set, and crucially you must use the mean and standard deviation computed from the training set; otherwise the two sets end up on different scales and you're not in for a good time.
trainCapAve <- training$capitalAve
# Standardise the data
trainCapAveS <- (trainCapAve - mean(trainCapAve))/sd(trainCapAve)
# Check it worked: the mean should be ~0 and the sd exactly 1
mean(trainCapAveS)
## [1] 9.854945e-18
sd(trainCapAveS)
## [1] 1
# Don't forget to also standardise the test set, using the TRAINING set's
# mean and sd: the test set must be treated as unseen data
testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve - mean(trainCapAve))/sd(trainCapAve)
# Check: the test-set mean and sd will be close to, but not exactly, 0 and 1
mean(testCapAveS)
sd(testCapAveS)
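One way to make this harder to get wrong is to capture the training parameters once and reuse them everywhere; a minimal sketch (the standardiser helper below is illustrative, not part of caret):
# Illustrative helper: returns a function that standardises any vector
# using the mean and sd of the training data it was built from
standardiser <- function(x_train) {
  m <- mean(x_train)
  s <- sd(x_train)
  function(x) (x - m) / s
}
std <- standardiser(training$capitalAve)
head(std(testing$capitalAve))  # test data scaled with the training parameters
This mirrors what caret's preProcess object does for you automatically, as shown next.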
46.2.1 Using the preProcess function
A lot of the standard preprocessing techniques can be done automatically using the preProcess() function from the caret package. The method argument tells the function which transformations to apply; passing c("center", "scale") reproduces the standardisation above, and other options (such as "BoxCox" and "knnImpute") are used later in this section.
preObj <- preProcess(training[,-58], method = c("center", "scale"))
trainCapAveS <- predict(preObj, training[, -58])$capitalAve
mean(trainCapAveS)
## [1] 8.680584e-18
sd(trainCapAveS)
## [1] 1
# Apply the SAME preProcess object to the test set
testCapAveS <- predict(preObj, testing[, -58])$capitalAve
mean(testCapAveS)
## [1] 0.04713308
sd(testCapAveS)
## [1] 1.486708
Preprocessing can also be requested when fitting a model, by passing the same method names to the preProcess argument of the train() function.
modelFit <- train(type ~ ., data = training,
                  preProcess = c("center", "scale"),
                  method = "glm")
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9191452 0.8288818
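As a quick sanity check, the fitted model can be applied to the held-out test set; a short sketch using caret's confusionMatrix() (output not shown here):
# Predict on the held-out test set and summarise performance
predictions <- predict(modelFit, newdata = testing)
confusionMatrix(predictions, testing$type)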
46.3 Standardising: Box-Cox Transforms
Standardising helps to tame heavily skewed predictors and predictors with very high variability. The Box-Cox transform takes a set of continuous, positive data and tries to make it look normally distributed by estimating a transformation parameter using MLE (maximum likelihood estimation). Note that the Box-Cox transformation still has problems with values of exactly zero and with repeated values: there is still a fairly large stack of them in the histogram plot below.
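For reference, the one-parameter Box-Cox transform is \[x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases}\] where \(\lambda\) is the parameter estimated by MLE. Because the transform is monotone and continuous, a stack of identical values stays a stack after transformation, which is why the spike survives in the histogram.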
# Preprocess your data using the Box-Cox transformation
preObj <- preProcess(training[, -58], method = c("BoxCox"))
# Predict using the preprocessed data
trainCapAveS <- predict(preObj, training[, -58])$capitalAve
testCapAveS <- predict(preObj, testing[, -58])$capitalAve
par(mfrow = c(2,2))
# Plot the resulting data
hist(trainCapAveS)
qqnorm(trainCapAveS)
hist(testCapAveS)
qqnorm(testCapAveS)
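To inspect what was estimated for a single variable, caret also exports BoxCoxTrans(); a short sketch (printing the returned object reports the estimated lambda):
# Estimate the Box-Cox lambda for capitalAve on its own
bct <- BoxCoxTrans(training$capitalAve)
bct  # the printed summary includes the estimated lambda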
46.4 Standardising: Imputing Data
Often you will be forced to work with data sets that contain missing values, and most predictive algorithms will fail on them. One way around this is to impute the missing data. Here we use k-nearest-neighbours imputation: for each row with a missing value, caret finds the k most similar complete rows (k = 5 by default) and fills in the average of their values; note that knnImpute also centers and scales the predictors. Since the example below removes values at random (and the data partition itself is random), set the seed to ensure reproducible results.
library(RANN)
set.seed(123)
# Use data partition functions to create the training and testing sets
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75,
                               list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
# First, start by making some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA
# Now impute and standardise the data
preObj <- preProcess(training[,-58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve
# Standardise the true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth))/sd(capAveTruth)
# Let's see how close the imputed values come to the truth
quantile(capAve - capAveTruth)
##            0%           25%           50%           75%          100%
## -2.3541187775  0.0006568801  0.0018603563  0.0024353679  0.2137068700
# Differences for just the values that were imputed
quantile((capAve - capAveTruth)[selectNA])
##           0%          25%          50%          75%         100%
## -2.354118777 -0.025095587  0.002530341  0.012163980  0.213706870
# Differences for the values that were not imputed
quantile((capAve - capAveTruth)[!selectNA])
##            0%           25%           50%           75%          100%
## -0.8630004021  0.0007258579  0.0018568190  0.0024090345  0.0028578816
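If the RANN package is not available, caret offers simpler imputation methods as well; a minimal sketch using medianImpute (which, unlike knnImpute, does not also center and scale the data):
# Fill each NA with the column median learned from the training data
preObjMed <- preProcess(training[, -58], method = "medianImpute")
capAveMed <- predict(preObjMed, training[, -58])$capAve
sum(is.na(capAveMed))  # should be 0: every missing value has been filled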