37 In and out of sample error
In sample error: The error rate you get on the same data set you used to build your predictor. (Sometimes called resubstitution error)
Out of sample error: The error rate you get on a new dataset. (Sometimes called generalisation error)
Key Ideas:
- Out of sample error is what you care about.
- In sample error < out of sample error (almost always), so in sample error is an optimistic estimate.
- The reason is overfitting: your algorithm is matched too closely to the data you have.
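library(kernlab)  # provides the spam data set
data(spam)
set.seed(1)
# Draw a random subset of 200 messages
smallSpam <- spam[sample(dim(spam)[1], size = 200), ]
# Colour code for plotting: 1 = nonspam, 2 = spam
spamLabel <- (smallSpam$type == "spam") * 1 + 1
# capitalAve: average run length of capital letters in a message
plot(smallSpam$capitalAve, col = spamLabel, ylim = c(0, 10))
abline(h = 3)
# Spam messages tend to have higher capitalAve values (roughly above 3),
# so predict from a single cutoff; the rule below uses 2.6
rule1 <- function(x) {
  prediction <- rep(NA, length(x))
  prediction[x >= 2.6] <- "spam"
  prediction[x < 2.6] <- "nonspam"
  return(prediction)
}
# Rows: predicted class, columns: true class
table(rule1(smallSpam$capitalAve), smallSpam$type)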
##
##           nonspam spam
##   nonspam      85   28
##   spam         26   61
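The table above can be turned into an in sample error rate. The following is a minimal sketch, assuming the counts are exactly as printed above; they are copied in by hand, and the names `confusion`, `in_sample_accuracy` and `in_sample_error` are illustrative rather than part of the original code.

```r
# Counts copied from the table above (rows: predicted, columns: actual).
# matrix() fills column by column, so the first two values are the
# "actual nonspam" column and the last two the "actual spam" column.
confusion <- matrix(c(85, 26, 28, 61), nrow = 2,
                    dimnames = list(predicted = c("nonspam", "spam"),
                                    actual = c("nonspam", "spam")))

# Correct predictions sit on the diagonal
in_sample_accuracy <- sum(diag(confusion)) / sum(confusion)
in_sample_error <- 1 - in_sample_accuracy

in_sample_accuracy  # 146 / 200 = 0.73
in_sample_error     # 54 / 200 = 0.27
```

This is an in sample figure, because the cutoff was chosen by looking at the same 200 messages it is scored on. It says nothing yet about how the rule would do on messages outside `smallSpam`.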
37.0.1 What’s going on?
Data have two components:
- Signal
- Noise
The goal of the predictor is to find the signal. You can always design a perfect in-sample predictor, but doing so captures both the signal and the noise. Because the noise in a new sample is different, that predictor will not perform as well on new samples.
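The idea of a "perfect in-sample predictor" can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated rather than taken from the spam data set, and `simulate`, `memorise`, `memo_in_error` and `memo_out_error` are hypothetical names introduced here. The rule simply copies the label of the closest training point (1-nearest neighbour), so it memorises the noise along with the signal.

```r
set.seed(1)
n <- 200

# Hypothetical data: the label depends on x (signal) plus random noise
simulate <- function(n) {
  x <- runif(n)
  label <- ifelse(x + rnorm(n, sd = 0.3) > 0.5, "spam", "nonspam")
  data.frame(x = x, label = label, stringsAsFactors = FALSE)
}
train <- simulate(n)  # data used to build the predictor
test <- simulate(n)   # a new sample from the same process

# "Memorising" rule: copy the label of the closest training point
memorise <- function(x_new, train) {
  nearest <- vapply(x_new,
                    function(x) which.min(abs(train$x - x)),
                    integer(1))
  train$label[nearest]
}

# Error on the data the rule was built from
memo_in_error <- mean(memorise(train$x, train) != train$label)
# Error on the new sample
memo_out_error <- mean(memorise(test$x, train) != test$label)

memo_in_error
memo_out_error
```

The in sample error is zero by construction, since every training point is its own nearest neighbour. The out of sample error is not, because the noise that was memorised does not repeat in the new sample.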