38 Prediction Study Design
This section covers how to minimize the problems caused by in-sample versus out-of-sample errors. The key steps in prediction study design are to:
- Define the error rate
- Split data into: Training, Testing and Validation sets
- On training set, pick features (Use cross validation)
- On training set, pick prediction function (Use cross validation)
- If there is no validation set, apply the model exactly once to the test set
- If there is a validation set, apply to the test set and refine, then apply exactly once to the validation set
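The splitting step above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation; the `split_indices` helper and the 60/20/20 fractions are assumptions chosen to match the rules of thumb later in this section.

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=42):
    """Randomly partition n sample indices into training, testing,
    and validation index lists (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)                    # random assignment of samples
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],              # training set
            idx[n_train:n_train + n_test],   # testing set
            idx[n_train + n_test:])     # validation set (the remainder)

train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 60 20 20
```

In practice a library splitter (e.g. scikit-learn's `train_test_split`) does the same job; the point is that the partition is random and each sample lands in exactly one set.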
38.1 Avoid Small Sample Sizes
- Suppose you’re predicting a binary outcome like diseased or healthy
- One classifier is flipping a coin
- Probability of perfect classification is:
- \(0.5^{n}\), where \(n\) is the sample size
- n = 1: flipping a coin has a 50% chance of 100% accuracy
- n = 2: flipping a coin has a 25% chance of 100% accuracy
- n = 10: flipping a coin has about a 0.10% chance of 100% accuracy
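The numbers above follow directly from the formula; a short check (the function name `p_perfect` is just for illustration):

```python
def p_perfect(n):
    """Probability that a fair-coin classifier labels all n samples correctly."""
    return 0.5 ** n

for n in (1, 2, 10):
    print(n, f"{p_perfect(n):.2%}")
```

So even a useless classifier looks perfect half the time when n = 1, which is why small test sets are so misleading.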
38.2 Rules of thumb for prediction study design
If you have a large sample size, try to use 60% training, 20% testing and 20% validation. (The validation set is your second layer of insurance against overfitting to the test set.)
If you have a medium sample size, try to use 60% training and 40% testing.
If you have a small sample size, do cross-validation and report the caveat of the small sample size.
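The rules of thumb above might be encoded as a small lookup. The size cutoffs below are assumptions for illustration only; the notes give no exact thresholds for "large", "medium", and "small".

```python
def suggested_split(n_samples):
    """Rule-of-thumb split fractions by sample size.
    The numeric cutoffs (1000 and 100) are assumed, not from the notes."""
    if n_samples >= 1000:   # "large" -- assumed threshold
        return {"train": 0.6, "test": 0.2, "validation": 0.2}
    if n_samples >= 100:    # "medium" -- assumed threshold
        return {"train": 0.6, "test": 0.4}
    return "cross-validate and report the small-sample caveat"

print(suggested_split(5000))
print(suggested_split(500))
print(suggested_split(30))
```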
38.3 Some Principles to Remember
- Unless you’re using time series data, randomly sample your test and training sets. If you do have time series data, use backtesting, where you test and train in chunks of time.
- Your datasets must reflect the structure of the problem you’re trying to predict.
- All subsets should reflect as much diversity as possible
- Random assignment does this
- You can also try to balance by features but this is tricky.
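The backtesting idea mentioned above (train and test in chunks of time) can be sketched as an expanding window: train on all data up to a cutoff, then test on the next chunk. The `backtest_chunks` helper and equal-sized chunks are assumptions for illustration.

```python
def backtest_chunks(n, n_splits=3):
    """Expanding-window backtesting for time-ordered data:
    train on everything before the cutoff, test on the next chunk.
    Indices 0..n-1 are assumed to be in time order."""
    chunk = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * chunk))              # all earlier data
        test = list(range(i * chunk, (i + 1) * chunk)) # the next time chunk
        yield train, test

for train, test in backtest_chunks(12, 3):
    print(len(train), test)
```

Unlike random sampling, every test observation here comes strictly after its training data, which respects the temporal structure. scikit-learn's `TimeSeriesSplit` implements the same pattern.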