51 Predicting with Trees

51.1 Key Ideas

  • Iteratively split the data into groups based on the values of individual variables
  • Evaluate “homogeneity” within each group
  • Split again if necessary

Pros:

  • Easy to interpret
  • Better performance in non-linear settings

Cons:

  • Without pruning/cross-validation, this can lead to overfitting
  • Harder to estimate uncertainty
  • Results can be unstable: small changes in the data may produce a very different tree

51.2 The Basic Algorithm

  1. Start with all variables in one group
  2. Find the variable/split that best separates the outcomes (sketched in code below)
  3. Divide the data into two groups (“leaves”) on that split (“node”)
  4. Within each group, find the best variable/split that separates the outcomes
  5. Continue until the groups are too small or sufficiently homogeneous
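
As a concrete illustration of step 2, here is a minimal sketch in R that scans one variable for the cutpoint minimizing weighted Gini impurity (defined in the next section). The names `gini` and `best_split` are made up for this example; real implementations such as rpart are far more elaborate.

```r
## Toy version of step 2: score every candidate cutpoint of one variable
## by the weighted Gini impurity of the two resulting groups.
gini <- function(y) {
  p <- table(y) / length(y)   # class proportions in the group
  1 - sum(p^2)
}

best_split <- function(x, y) {
  s    <- sort(unique(x))
  cuts <- head(s, -1) + diff(s) / 2   # midpoints between adjacent values
  score <- sapply(cuts, function(cc) {
    left  <- y[x <  cc]
    right <- y[x >= cc]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.min(score)], impurity = min(score))
}

data(iris)
best_split(iris$Petal.Length, iris$Species)  # cut near 2.45 isolates setosa
```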

51.3 Measures of Impurity

\[\hat{P}_{mk} = \frac{1}{N_m} \sum_{x_i \in \text{Leaf}_m} \mathbb{I}(y_i = k)\]

Where:

  • \(x_i\) is a particular observation in leaf \(m\).
  • \(N_m\) is the number of observations in leaf \(m\).
  • \(y_i\) is the class label of observation \(i\), so the indicator \(\mathbb{I}(y_i = k)\) counts how many observations in leaf \(m\) belong to class \(k\).
  • \(\hat{P}_{mk}\) is the estimated probability (the proportion) of class \(k\) in leaf \(m\).

Misclassification Error: \[1-\hat{P}_{mk(m)}\] where \(k(m)\) is the most common class in leaf \(m\).

  • 0 = perfect purity (every observation in the leaf belongs to the majority class)
  • 0.5 = no purity (for two classes, a 50/50 split)

Gini Index: \[\sum_{k \ne k'} \hat{P}_{mk} \, \hat{P}_{mk'} = \sum_{k=1}^K \hat{P}_{mk}(1- \hat{P}_{mk}) = 1 - \sum_{k=1}^K \hat{P}_{mk}^2\]

  • 0 = perfect purity
  • 0.5 = no purity (two balanced classes)
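
As a quick check of both formulas, here is a hand computation in R for a hypothetical leaf holding 8 observations of class “A” and 2 of class “B” (values chosen purely for illustration):

```r
## Hypothetical leaf: 8 observations of class "A" and 2 of class "B"
y <- c(rep("A", 8), rep("B", 2))

p_hat <- table(y) / length(y)   # class proportions P_mk in the leaf
p_hat                           # A = 0.8, B = 0.2

1 - max(p_hat)    # misclassification error: 1 - 0.8 = 0.2
1 - sum(p_hat^2)  # Gini index: 1 - (0.8^2 + 0.2^2) = 0.32
```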

51.4 Example: Iris Data
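
The R code that produced the output below is omitted in these notes. A plausible reconstruction, assuming caret’s `createDataPartition` for the split (`p = 0.7` is a guess that matches the 105/45 row counts), is:

```r
library(caret)
data(iris)
names(iris)
table(iris$Species)

## 70/30 train/test split; p = 0.7 is assumed from the row counts below
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]
dim(training)
dim(testing)
```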

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
## 
##     setosa versicolor  virginica 
##         50         50         50
## [1] 105   5
## [1] 45  5
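
The tree printed next was presumably fit with something like the following, continuing from the split above (`method = "rpart"` via caret’s `train` is an assumption consistent with the `node), split, n, loss, yval` print format):

```r
library(caret)
## method = "rpart" is assumed from the output format below
modFit <- train(Species ~ ., method = "rpart", data = training)
print(modFit$finalModel)
```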

## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.3333333 0.3333333 0.3333333)  
##   2) Petal.Length< 2.6 35  0 setosa (1.0000000 0.0000000 0.0000000) *
##   3) Petal.Length>=2.6 70 35 versicolor (0.0000000 0.5000000 0.5000000)  
##     6) Petal.Length< 4.75 31  0 versicolor (0.0000000 1.0000000 0.0000000) *
##     7) Petal.Length>=4.75 39  4 virginica (0.0000000 0.1025641 0.8974359) *
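
From here the fitted tree can be visualized and applied to the held-out data. This continuation is not in the original notes, but it uses only standard rpart plotting and caret prediction calls:

```r
## Plot the fitted tree (base rpart plotting functions)
plot(modFit$finalModel, uniform = TRUE, main = "Classification Tree")
text(modFit$finalModel, use.n = TRUE, all = TRUE, cex = 0.8)

## Predict species for the held-out test set
predict(modFit, newdata = testing)
```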

51.5 Notes

  • Classification trees are non-linear models
    • Because they are non-linear, they naturally capture interactions between variables
    • Monotone data transformations may be less important
    • Trees can also be used for regression problems (continuous outcomes), as sketched below
  • Other R packages for building trees include: party, rpart, C50 and tree.
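
A minimal sketch of a regression tree, using rpart directly (the mtcars example is illustrative and not from the course material):

```r
library(rpart)

## For a continuous outcome, rpart splits to reduce within-node variance
## (method = "anova") rather than a classification impurity measure.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
print(fit)
```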