46 Preprocessing

46.2 Standardising

One way of getting around this is to standardise the data: subtract the sample mean from each value and divide by the sample standard deviation.

\[\text{Standardised Data} = \frac{\text{Sample} - \text{Mean}(\text{Sample})}{\text{Sd}(\text{Sample})}\]

Don't forget that if you standardise the training set, you also have to standardise the testing set; otherwise the model is fed predictors on a different scale from the one it was trained on.

## [1] 9.854945e-18
## [1] 1
## [1] 8.507133e-18
## [1] 1
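As a minimal sketch of the formula above in base R, the following uses simulated data (the vectors `train_x` and `test_x` are made up for illustration and are not the data behind the output shown). It assumes the usual convention of estimating the mean and standard deviation on the training set and reusing them for the test set.

```r
# Simulated, right-skewed predictor (hypothetical stand-in for real data)
set.seed(1)
train_x <- rexp(300, rate = 0.2)
test_x  <- rexp(100, rate = 0.2)

# Estimate the standardisation parameters on the training set only
train_mean <- mean(train_x)
train_sd   <- sd(train_x)

# Apply the same parameters to both sets
train_std <- (train_x - train_mean) / train_sd
test_std  <- (test_x  - train_mean) / train_sd

# The training values have mean 0 and sd 1 up to floating-point rounding
mean(train_std)
sd(train_std)

# The test values are only approximately centred and scaled,
# because the parameters came from the training set
mean(test_std)
sd(test_std)
```

Standardising the test set with its own mean and standard deviation would also give exactly 0 and 1, but it would put the two sets on slightly different scales.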

46.2.1 Using the preProcess function

A lot of the standard preprocessing techniques can be done automatically by using the preProcess() function from the caret package. The ‘method’ argument tells the function which preprocessing steps you would like applied; here the values ‘center’ and ‘scale’ are used, which together standardise each predictor.

## [1] 8.680584e-18
## [1] 1
## [1] 0.04713308
## [1] 1.486708
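A call along the following lines would produce this kind of output. It is a sketch under assumptions: it supposes the data are the `spam` data set from the kernlab package, split 75/25 with `createDataPartition()`, and that the variable being checked is `capitalAve`. The seed is arbitrary, so the exact numbers will differ from those shown.

```r
library(caret)
library(kernlab)   # assumed source of the spam data set

data(spam)
set.seed(32343)    # arbitrary seed, chosen for illustration
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# All columns except the outcome are predictors
predictors <- setdiff(names(training), "type")

# Learn the centring and scaling parameters from the training predictors
preObj <- preProcess(training[, predictors], method = c("center", "scale"))

# Apply the same preprocessing object to both sets
trainCapAveS <- predict(preObj, training[, predictors])$capitalAve
testCapAveS  <- predict(preObj, testing[, predictors])$capitalAve

mean(trainCapAveS)
sd(trainCapAveS)
mean(testCapAveS)
sd(testCapAveS)
```

Because `preObj` stores the training set's means and standard deviations, the test set comes out close to, but not exactly at, mean 0 and standard deviation 1.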

The same preprocessing can also be requested directly in the train() function through its preProcess argument.

## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9191452  0.8288818
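A model summary like the one above could come from a call of roughly this shape. As before, this is a sketch that assumes the kernlab `spam` data and a 75/25 split with an arbitrary seed; only the `preProcess` and `method` arguments of `train()` are the point here.

```r
library(caret)
library(kernlab)   # assumed source of the spam data set

data(spam)
set.seed(32343)    # arbitrary seed, chosen for illustration
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# train() centres and scales the predictors within each resample,
# then fits a logistic regression (glm may warn about fitted
# probabilities of 0 or 1 on this kind of data)
modelFit <- train(type ~ ., data = training,
                  preProcess = c("center", "scale"),
                  method = "glm")
modelFit
```

Passing the preprocessing to `train()` means the same transformation is stored with the model and reapplied automatically when `predict()` is called on new data.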

46.3 Standardising: Box-Cox Transforms

Standardising helps to deal with strongly skewed predictors or predictors with really high variability. The Box-Cox transforms take a set of continuous data and try to make them look like normal data by estimating a transformation parameter using maximum likelihood estimation (MLE). Note that the Box-Cox transformations still encounter problems with 0 value samples: there is still a fairly large stack of them in the histogram plot below.
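For reference, the Box-Cox family for a strictly positive value \(y\) is usually written with a single parameter \(\lambda\), which is the quantity estimated by maximum likelihood:

\[y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log(y) & \lambda = 0 \end{cases}\]

In caret this can be requested with `method = "BoxCox"`. The sketch below again assumes the kernlab `spam` data and the `capitalAve` predictor, neither of which is named in the text, and uses an arbitrary seed.

```r
library(caret)
library(kernlab)   # assumed source of the spam data set

data(spam)
set.seed(32343)    # arbitrary seed, chosen for illustration
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

predictors <- setdiff(names(training), "type")

# Estimate a Box-Cox parameter for each predictor that can be transformed
preObj <- preProcess(training[, predictors], method = "BoxCox")

# Apply the transform and pull out one predictor to inspect
trainCapAveBC <- predict(preObj, training[, predictors])$capitalAve

# Compare the shape before and after the transform
par(mfrow = c(1, 2))
hist(training$capitalAve, main = "Original", xlab = "capitalAve")
hist(trainCapAveBC, main = "Box-Cox", xlab = "capitalAve (transformed)")
```

A transform of this kind is monotone, so values that are tied in the original data stay tied afterwards, which is why a stack of identical values survives in the histogram.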

46.4 Standardising: Imputing Data

Often you will be forced to work with data sets that contain missing data. Many predictive algorithms fail in these cases. One way around this is to impute the missing data. In this case we will use K-Nearest Neighbors imputation to impute the missing values. Because the procedure involves randomness, it is important to set the seed to ensure reproducible results.

##            0%           25%           50%           75%          100% 
## -2.3541187775  0.0006568801  0.0018603563  0.0024353679  0.2137068700
##           0%          25%          50%          75%         100% 
## -2.354118777 -0.025095587  0.002530341  0.012163980  0.213706870
##            0%           25%           50%           75%          100% 
## -0.8630004021  0.0007258579  0.0018568190  0.0024090345  0.0028578816
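One way such a comparison could be set up is sketched below. Everything specific in it is an assumption made for illustration: the kernlab `spam` data, the `capitalAve` predictor, the copy `capAve` with about 5% of its values randomly set to missing, and both seeds. caret's `knnImpute` method also centres and scales the predictors, so the true values are standardised before comparing, and it may require the RANN package to be installed.

```r
library(caret)
library(kernlab)   # assumed source of the spam data set

data(spam)
set.seed(32343)    # arbitrary seed for the split
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Make a copy of one predictor and knock out roughly 5% of its values
set.seed(13343)    # arbitrary seed for the random missingness
training$capAve <- training$capitalAve
selectNA <- rbinom(nrow(training), size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

predictors <- setdiff(names(training), "type")

# Impute with k-nearest neighbours (predictors are centred and scaled too)
preObj <- preProcess(training[, predictors], method = "knnImpute")
capAve <- predict(preObj, training[, predictors])$capAve

# Standardise the true values so they are on the same scale
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)

# Differences overall, for the imputed values only, and for the rest
quantile(capAve - capAveTruth)
quantile((capAve - capAveTruth)[selectNA])
quantile((capAve - capAveTruth)[!selectNA])
```

Differences close to zero indicate that the imputed values are near the values that were removed.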