27 Regression Trees and Model Trees

Trees for numeric prediction fall into two categories:

Regression trees
Model trees

Regression trees predict the mean outcome of the training examples that reach each leaf. Model trees are grown in much the same way, but at each leaf a multiple linear regression model is built from the examples reaching that node. Depending on the number of leaf nodes, a model tree may build tens or even hundreds of such models. This can make a model tree harder to interpret than the equivalent regression tree, but it may also yield more accurate predictions.
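The difference at a single leaf can be sketched in a few lines of R. The data frame `leaf` below is hypothetical toy data standing in for the examples that reach one leaf, and a single feature is used for brevity, whereas a real model tree would fit a multiple regression on several features.

```r
# Toy examples assumed to have reached one leaf (hypothetical values).
leaf <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 10.5),
  quality = c(5.0, 5.5, 6.0, 6.5)
)

# Regression tree leaf: one constant prediction, the mean outcome.
regression_leaf <- mean(leaf$quality)

# Model tree leaf: a linear model fitted to the same examples.
model_leaf <- lm(quality ~ alcohol, data = leaf)

new_wine <- data.frame(alcohol = 10.2)
regression_leaf                 # the same value for every wine in this leaf
predict(model_leaf, new_wine)   # changes with alcohol inside the leaf
```

The regression tree leaf returns 5.75 for any wine routed to it, while the fitted line lets the model tree leaf give different predictions to wines within the same leaf.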

Although traditional regression methods are typically the first choice for numeric prediction, numeric decision trees offer distinct advantages in some cases. For instance, they may be better suited to tasks with many features, or with many complex, non-linear relationships between the features and the outcome, both of which are challenging for regression. Regression modeling also makes assumptions about how numeric data are distributed, and real-world data often violate them. Trees make no such assumptions.

One common splitting criterion is known as the standard deviation reduction (SDR): \[SDR = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i)\]

Where:

\(sd(T)\) refers to the standard deviation of the values in set \(T\).
\(T_1, T_2, \ldots, T_n\) are the sets resulting from a split on a feature.
\(|T|\) is the number of observations in set \(T\).

Essentially, the formula measures the reduction in standard deviation by comparing the standard deviation pre-split to the weighted standard deviation post-split.
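As a rough illustration, the formula can be written as a small R function and applied to two candidate splits. The function name `sdr`, the toy `quality` vector, and the two splits are assumptions made for this sketch. Note also that R's `sd()` computes the sample standard deviation (dividing by n - 1), which is one possible reading of \(sd(T)\).

```r
# Standard deviation reduction for one candidate split.
# `parent` holds the outcome values before the split; `children` is a list
# with one vector of outcome values per branch. Both names are illustrative.
sdr <- function(parent, children) {
  weights <- sapply(children, length) / length(parent)  # |T_i| / |T|
  sd(parent) - sum(weights * sapply(children, sd))
}

# Hypothetical outcome values, already sorted by the feature being split on.
quality <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)

split_a <- list(quality[1:9], quality[10:15])  # cut after the 9th value
split_b <- list(quality[1:7], quality[8:15])   # cut after the 7th value

sdr_a <- sdr(quality, split_a)
sdr_b <- sdr(quality, split_b)
sdr_a  # about 1.20
sdr_b  # about 1.39
```

Under these assumptions the second split reduces the standard deviation more, so a tree using this criterion would prefer it.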

27.1 Building the Model

redwine <- read.csv("wine.data", header=FALSE)
whitewine <- read.csv("whitewine.csv", header = TRUE)

whitewine$quality <- as.numeric(whitewine$quality)

# add some names to the features in the redwine data
name <- c("cultivar", "alcohol", "malic acid", "ash", "alcalinity of ash", "magnesium", "total phenols", "flavanoids", "nonflavanoid phenols", "proanthocyanins", "colour intensity", "hue", "od280/od315 of diluted wines", "proline")
names(redwine) <- name

str(whitewine)

## 'data.frame':    4899 obs. of  12 variables:
##  $ fixedAcidity      : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatileAcidity   : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citricAcid        : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residualSugar     : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides         : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ freeSulfurDioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ totalSulfurDioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density           : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates         : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol           : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality           : num  6 6 6 6 6 6 6 6 6 6 ...

hist(whitewine$quality)

wine_train <- whitewine[1:3750, ]
wine_test <- whitewine[3751:4898, ]
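Taking the first 3,750 rows for training is only safe if the rows are already in random order. If that is uncertain, a random partition is one alternative. The sketch below uses a hypothetical data frame `toy` and an assumed 75% training share. The same pattern could be applied to `whitewine`, although the tree and results shown in this chapter come from the sequential split.

```r
set.seed(123)  # fixed seed so the partition is reproducible

# Hypothetical stand-in for a data set with a numeric outcome.
toy <- data.frame(id = 1:20, quality = rep(3:7, times = 4))

# Draw 75% of the row indices at random for training.
train_rows <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]  # all rows not drawn for training

nrow(toy_train)  # 15
nrow(toy_test)   # 5
```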

library(rpart)  # provides rpart()

rp_model <- rpart(quality ~ ., data = wine_train)

rp_model

## n= 3750 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3750 3140.06000 5.886933  
##    2) alcohol< 10.85 2473 1510.66200 5.609381  
##      4) volatileAcidity>=0.2425 1406  740.15080 5.402560  
##        8) volatileAcidity>=0.4225 182   92.99451 4.994505 *
##        9) volatileAcidity< 0.4225 1224  612.34560 5.463235 *
##      5) volatileAcidity< 0.2425 1067  631.12090 5.881912 *
##    3) alcohol>=10.85 1277 1069.95800 6.424432  
##      6) freeSulfurDioxide< 11.5 93   99.18280 5.473118 *
##      7) freeSulfurDioxide>=11.5 1184  879.99920 6.499155  
##       14) alcohol< 11.85 611  447.38130 6.296236 *
##       15) alcohol>=11.85 573  380.63180 6.715532 *
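Each line of the printed tree gives the node number, the split condition, the number of examples `n`, the deviance (for a numeric outcome, the sum of squared differences from the node mean), and `yval`, the predicted quality. To make the reading concrete, the printed splits can be transcribed by hand into nested conditions. The function `predict_by_hand` below is purely illustrative and hard-codes the thresholds and leaf values shown above. In practice `predict()` does this work.

```r
# Hand transcription of the printed tree, for illustration only.
predict_by_hand <- function(alcohol, volatileAcidity, freeSulfurDioxide) {
  if (alcohol < 10.85) {
    if (volatileAcidity >= 0.2425) {
      # nodes 8 and 9
      if (volatileAcidity >= 0.4225) 4.994505 else 5.463235
    } else {
      5.881912  # node 5
    }
  } else {
    if (freeSulfurDioxide < 11.5) {
      5.473118  # node 6
    } else if (alcohol < 11.85) {
      6.296236  # node 14
    } else {
      6.715532  # node 15
    }
  }
}

# First wine in the str() output: alcohol 8.8, volatile acidity 0.27,
# free sulfur dioxide 45. It follows nodes 1, 2, 4 and ends in leaf 9.
first_wine <- predict_by_hand(alcohol = 8.8, volatileAcidity = 0.27,
                              freeSulfurDioxide = 45)
first_wine  # 5.463235
```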

library(rpart.plot)  # provides rpart.plot()

rpart.plot(rp_model, digits = 3)

rpart.plot(rp_model, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

27.2 Evaluating the Model

rp_predict <- predict(rp_model, wine_test)

summary(rp_predict)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.995   5.463   5.882   5.999   6.296   6.716

summary(wine_test$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.848   6.000   8.000
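The two summaries show that the predictions span a narrower range (about 4.995 to 6.716) than the actual quality scores (3 to 8), so the tree does not reach the extremes. Two common ways to quantify the fit further are the correlation between predicted and actual values and the mean absolute error. The sketch below is a minimal example. The helper `mae` and the toy vectors are assumptions for illustration, not results from the wine data.

```r
# Mean absolute error: the average size of the prediction errors.
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical values, used only to show the calls.
actual    <- c(5, 6, 6, 7, 4)
predicted <- c(5.5, 5.9, 6.3, 6.7, 5.0)

toy_cor <- cor(predicted, actual)  # closer to 1 means predictions track the actual values
toy_mae <- mae(actual, predicted)
toy_mae  # 0.44

# With the objects from this chapter, the calls would be:
# cor(rp_predict, wine_test$quality)
# mae(wine_test$quality, rp_predict)
```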