35 Prediction Motivation

This will cover the basic ideas behind machine learning and prediction.

Study Design - Training vs Tests
Conceptual issues - Out of sample error & ROC curves
Practical implementation - The caret package

35.1 Who Predicts?

Local governments - Pension payments
Google - Ad preferences
Amazon/Netflix - Suggestion Models
Insurance Companies - Risk of death
Uni - Successes in programs

View The Elements Of Statistical Learning

View Stanford Machine Learning Ng Course

View Notes from MIT Open Courseware

35.2 What is Prediction?

If we start with a dataset containing blue and orange dots, from all data we’ll use probability/sampling to create a training set that we’ll use to define a prediction function that will take a dot of unknown color and using predictors, it will separate out these two colors of dots. Then we will view whether these predictions are correct or not.

Prediction/Sampling is extremely important and often overlooked, looking further into this. (Parable of Google Flu)

35.3 Components of a Predictor

Question - Very specific and well defined. What are you trying to predict and what are you trying to predict it with?
Input-data - Best possible.
Features - Measured characteristics that we have, or use computation to build useful features.
Algorithm
Parameters - Estimate parameters of the algorithms and use the parameters to apply the algorithm to a new dataset for evaluation.
Evaluation

35.3.1 SPAM example

Start with a general question
Can I automatically detect emails that are spam?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

library(kernlab)

data(spam)
head(spam)[, 1:5]

##   make address  all num3d  our
## 1 0.00    0.64 0.64     0 0.32
## 2 0.21    0.28 0.50     0 0.14
## 3 0.06    0.00 0.71     0 1.23
## 4 0.00    0.00 0.00     0 0.63
## 5 0.00    0.00 0.00     0 0.63
## 6 0.00    0.00 0.00     0 1.85

# Spam mails have more frequent uses of the word "Your"
plot(density(spam$your[spam$type=="nonspam"]), col = "blue", main="", xlab="Frequency of 'Your'")
lines(density(spam$your[spam$type=="spam"]), col = "red")
abline(v=0.3, col = "black")

From the plot above, we could build a simple model to predict spam mails by only using the value ‘C’:

Our simple algorithm:

Find the value ‘C’
Frequency of ‘Your’ > C - predict “SPAM”

prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
table(prediction, spam$type)/length(spam$type)

##           
## prediction   nonspam      spam
##    nonspam 0.4590306 0.1017170
##    spam    0.1469246 0.2923278

Accuracy = 0.459 + 0.292 = 0.751