35 Prediction Motivation

This will cover the basic ideas behind machine learning and prediction.

  • Study Design - Training vs Tests
  • Conceptual issues - Out of sample error & ROC curves
  • Practical implementation - The caret package

35.1 Who Predicts?

  • Local governments - Pension payments
  • Google - Ad preferences
  • Amazon/Netflix - Suggestion Models
  • Insurance Companies - Risk of death
  • Uni - Successes in programs

View The Elements Of Statistical Learning

View Stanford Machine Learning Ng Course

View Notes from MIT Open Courseware

35.2 What is Prediction?

If we start with a dataset containing blue and orange dots, from all data we’ll use probability/sampling to create a training set that we’ll use to define a prediction function that will take a dot of unknown color and using predictors, it will separate out these two colors of dots. Then we will view whether these predictions are correct or not.

Prediction/Sampling is extremely important and often overlooked, looking further into this. (Parable of Google Flu)

35.3 Components of a Predictor

  • Question - Very specific and well defined. What are you trying to predict and what are you trying to predict it with?
  • Input-data - Best possible.
  • Features - Measured characteristics that we have, or use computation to build useful features.
  • Algorithm
  • Parameters - Estimate parameters of the algorithms and use the parameters to apply the algorithm to a new dataset for evaluation.
  • Evaluation

35.3.1 SPAM example

  • Start with a general question
  • Can I automatically detect emails that are spam?
  • Make it concrete
  • Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
##   make address  all num3d  our
## 1 0.00    0.64 0.64     0 0.32
## 2 0.21    0.28 0.50     0 0.14
## 3 0.06    0.00 0.71     0 1.23
## 4 0.00    0.00 0.00     0 0.63
## 5 0.00    0.00 0.00     0 0.63
## 6 0.00    0.00 0.00     0 1.85

From the plot above, we could build a simple model to predict spam mails by only using the value ‘C’:

Our simple algorithm:

  • Find the value ‘C’
  • Frequency of ‘Your’ > C - predict “SPAM”
##           
## prediction   nonspam      spam
##    nonspam 0.4590306 0.1017170
##    spam    0.1469246 0.2923278

Accuracy = 0.459 + 0.292 = 0.751