38 Prediction Study Design
This section covers how to minimize the problems caused by in-sample versus out-of-sample errors. The key steps in prediction study design are to:
- Define the error rate
- Split data into: Training, Testing and Validation sets
- On training set, pick features (Use cross validation)
- On training set, pick prediction function (Use cross validation)
- If there is no validation set, apply the model exactly once to the test set
- If there is a validation set, apply to the test set and refine, then apply exactly once to the validation set
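The splitting step above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation; the `split_indices` helper and the 60/20/20 fractions are assumptions chosen to match the rules of thumb later in this section.

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=42):
    """Randomly partition n sample indices into training, testing,
    and validation index lists (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)                    # random assignment of samples
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],              # training set
            idx[n_train:n_train + n_test],   # testing set
            idx[n_train + n_test:])     # validation set (the remainder)

train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 60 20 20
```

In practice a library splitter (e.g. scikit-learn's `train_test_split`) does the same job; the point is that the partition is random and each sample lands in exactly one set.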
38.1 Avoid Small Sample Sizes
- Suppose you’re predicting a binary outcome like diseased or healthy
- One classifier is flipping a coin
- Probability of perfect classification is:
- \(0.5^{n}\), where \(n\) is the sample size
- n = 1: flipping a coin has a 50% chance of 100% accuracy
- n = 2: flipping a coin has a 25% chance of 100% accuracy
- n = 10: flipping a coin has about a 0.10% chance of 100% accuracy
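The numbers above follow directly from the formula; a short check (the function name `p_perfect` is just for illustration):

```python
def p_perfect(n):
    """Probability that a fair-coin classifier labels all n samples correctly."""
    return 0.5 ** n

for n in (1, 2, 10):
    print(n, f"{p_perfect(n):.2%}")
```

So even a useless classifier looks perfect half the time when n = 1, which is why small test sets are so misleading.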
38.2 Rules of thumb for prediction study design
If you have a large sample size, try to use 60% training, 20% testing and 20% validation. (The validation set is your second layer of insurance against overfitting to the test set.)
If you have a medium sample size, try to use 60% training and 40% testing.
If you have a small sample size, do cross-validation and report the caveat of the small sample size.
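The rules of thumb above might be encoded as a small lookup. The size cutoffs below are assumptions for illustration only; the notes give no exact thresholds for "large", "medium", and "small".

```python
def suggested_split(n_samples):
    """Rule-of-thumb split fractions by sample size.
    The numeric cutoffs (1000 and 100) are assumed, not from the notes."""
    if n_samples >= 1000:   # "large" -- assumed threshold
        return {"train": 0.6, "test": 0.2, "validation": 0.2}
    if n_samples >= 100:    # "medium" -- assumed threshold
        return {"train": 0.6, "test": 0.4}
    return "cross-validate and report the small-sample caveat"

print(suggested_split(5000))
print(suggested_split(500))
print(suggested_split(30))
```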
38.3 Some Principles to Remember
- Unless you’re using time series data, randomly sample your test and training sets. If you do have time series data, use backtesting, where you test and train in chunks of time.
- Your datasets must reflect the structure of the problem you’re trying to predict.
- All subsets should reflect as much diversity as possible
- Random assignment does this
- You can also try to balance by features but this is tricky.
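The backtesting idea mentioned above (train and test in chunks of time) can be sketched as an expanding window: train on all data up to a cutoff, then test on the next chunk. The `backtest_chunks` helper and equal-sized chunks are assumptions for illustration.

```python
def backtest_chunks(n, n_splits=3):
    """Expanding-window backtesting for time-ordered data:
    train on everything before the cutoff, test on the next chunk.
    Indices 0..n-1 are assumed to be in time order."""
    chunk = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * chunk))              # all earlier data
        test = list(range(i * chunk, (i + 1) * chunk)) # the next time chunk
        yield train, test

for train, test in backtest_chunks(12, 3):
    print(len(train), test)
```

Unlike random sampling, every test observation here comes strictly after its training data, which respects the temporal structure. scikit-learn's `TimeSeriesSplit` implements the same pattern.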