48 Preprocessing With Principal Components Analysis (PCA)

The main idea behind using PCA is to reduce the dimensionality of a dataset. Often a dataset contains a large number of strongly correlated covariates that carry much the same information. Including more of these variables helps to explain more of the variance in the training set, but fitting a predictive model on every one of them is not computationally efficient. PCA acts as a summary: it reduces the dimensionality of the dataset by creating new covariates that aim to capture the underlying trends in the existing ones.

##        row col
## num415  34  32
## num857  32  34

## [1] "num857" "num415"

Because the two covariates above, num415 and num857, are very highly correlated, it is not necessary to include both.
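A pair like this can be located by scanning the correlation matrix of the predictors. The following is a minimal sketch on simulated data, not the data behind the output above: the variable names `x1`, `x2`, `x3` and the 0.8 cut-off are assumptions chosen for illustration.

```r
# Simulated predictors (hypothetical names): x2 is x1 plus a little noise,
# x3 is unrelated to both.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
df <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(n, sd = 0.1),
                 x3 = rnorm(n))

M <- abs(cor(df))   # absolute pairwise correlations
diag(M) <- 0        # ignore each variable's correlation with itself

# Row/column indices of pairs above the threshold (0.8 is an assumed cut-off)
high <- which(M > 0.8, arr.ind = TRUE)
high

# Names of the variables involved
names(df)[unique(as.vector(high))]
```

Each correlated pair appears twice in `high` because the correlation matrix is symmetric.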

48.1 Basic PCA Idea

  • We might not want to look at every predictor
  • A weighted combination of predictors might be better for computational efficiency
  • We should pick this combination to capture the “most information” possible
  • Benefits of this are:
    • Reduced number of predictors
    • Reduced noise in the data due to averaging out our predictors

If we plot the sum of the two covariates on one axis and their difference on the other, we can see where most of the variance between the two covariates lies.

This plot suggests that we should use the sum of the two covariates instead of using them independently. This would both reduce the amount of data used to train the model and reduce the amount of noise in our data.
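The effect can be reproduced on simulated data. This is only a sketch: the two variables `a` and `b` are hypothetical stand-ins for a pair of highly correlated covariates, and the noise level is an arbitrary choice.

```r
set.seed(2)
n <- 500
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.2)   # b is strongly correlated with a

s <- a + b   # sum: carries the signal the two variables share
d <- a - b   # difference: mostly noise

# The sum has far more variance than the difference
c(var_sum = var(s), var_diff = var(d))

# Scatter plot of sum against difference
plot(s, d, xlab = "a + b", ylab = "a - b")
```

In the plot the points spread widely along the horizontal axis and stay close to zero on the vertical axis, which is why the sum alone retains most of the information.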

48.3 Singular Value Decomposition (SVD) and Principal Components Analysis (PCA)

SVD

If \(X\) is a matrix with each variable in a column and each observation in a row, then the SVD is a matrix decomposition: \[X = UDV^T\]

Where:

  • The columns of \(U\) are orthogonal (left singular vectors)
  • The columns of \(V\) are orthogonal (right singular vectors)
  • \(D\) is a diagonal matrix (singular values)
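These properties can be checked numerically with base R's `svd()`. The sketch below uses a small random matrix as an assumed example; note that `svd()` returns the diagonal of \(D\) as a vector, so it is expanded with `diag()` before multiplying.

```r
set.seed(3)
X <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)  # 20 observations, 3 variables

s <- svd(X)       # s$u = U, s$d = diagonal of D (as a vector), s$v = V
D <- diag(s$d)

# Rebuild X from its three factors
X_hat <- s$u %*% D %*% t(s$v)
all.equal(X, X_hat)

# The columns of U and of V are orthogonal with unit length,
# so U^T U and V^T V are both the identity matrix
all.equal(crossprod(s$u), diag(3))
all.equal(crossprod(s$v), diag(3))
```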

PCA

The principal components are equal to the right singular vectors (the columns of \(V\)) if you first scale (standardise) the variables.
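This relationship can be verified by comparing base R's `prcomp()` with `svd()` applied to the standardised data. The sketch uses simulated data as an assumption; the comparison is made on absolute values because the sign of each component is arbitrary.

```r
set.seed(4)
X <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)
X[, 2] <- X[, 1] + rnorm(50, sd = 0.3)   # make two columns correlated

Xs <- scale(X)                           # centre and standardise each column
sv <- svd(Xs)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Largest discrepancy between V and the PCA rotation matrix, ignoring sign
max(abs(abs(sv$v) - abs(pc$rotation)))

# The component standard deviations are the singular values
# divided by sqrt(n - 1)
all.equal(pc$sdev, sv$d / sqrt(nrow(X) - 1))
```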

48.6 PCA with Caret

PCA can be performed using functions from the caret package. The preProcess() function is useful here: called with method = "pca", it computes the principal components of the predictors.
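A minimal sketch of this workflow is shown below, assuming the caret package is installed. The data frame `train_df` and its columns are simulated, hypothetical stand-ins for a real training set, and keeping two components via `pcaComp` is an arbitrary choice.

```r
library(caret)

set.seed(5)
n  <- 200
x1 <- rnorm(n)
train_df <- data.frame(x1 = x1,
                       x2 = x1 + rnorm(n, sd = 0.1),
                       x3 = rnorm(n))

# method = "pca" centres and scales the predictors before rotating them;
# pcaComp fixes how many components are kept
pre <- preProcess(train_df, method = "pca", pcaComp = 2)

# Apply the fitted transformation to obtain the component scores
train_pc <- predict(pre, train_df)
head(train_pc)
```

The same `pre` object would then be passed to `predict()` with any test data, so that the test set is projected using the components estimated on the training set.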