41 What Data Should You Use?

After defining a question, what is the next important step? It is to identify the data set that will let the machine learning algorithm make the best possible predictions.

Example: FiveThirtyEight

For predicting X, use data as closely related to X as possible. (This holds most of the time but is not a hard rule; Google's flu outbreak predictor broke because of this.)

For predicting baseball player performance, use that player's past performance and the performance of similar players.
For predicting which movies people would watch (the Netflix Prize), use the movies that they have watched in the past.
For predicting data about hospitalizations, use data related to hospitalizations. (Heritage Health Prize)
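
The baseball example above can be sketched as a nearest-neighbour lookup: find the players most similar to the one of interest and average what they did next. This is only a minimal illustration, not the method any particular forecaster uses. The names `k_nearest` and `predict_from_similar`, the feature values, and the `next_season` numbers are all made up for the example, and the features are assumed to be already standardized so that distances are comparable.

```python
import math

def k_nearest(target, candidates, k):
    """Return the k candidates whose feature vectors are closest to target."""
    return sorted(candidates, key=lambda c: math.dist(target, c["features"]))[:k]

def predict_from_similar(target_features, history, k=2):
    """Predict by averaging the next-season outcome of the k most similar players."""
    neighbours = k_nearest(target_features, history, k)
    return sum(p["next_season"] for p in neighbours) / len(neighbours)

# Hypothetical, already-standardized features (e.g. age, last-season average)
# paired with a made-up next-season batting average for each past player.
history = [
    {"features": (0.0, 0.1), "next_season": 0.280},
    {"features": (0.2, 0.0), "next_season": 0.300},
    {"features": (2.0, 2.0), "next_season": 0.220},
    {"features": (-1.5, 1.8), "next_season": 0.240},
]

prediction = predict_from_similar((0.1, 0.0), history, k=2)
print(round(prediction, 3))
```

The prediction uses only data that is directly about the quantity of interest: past performance of comparable players.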

The looser the connection, the harder the prediction can be. (Example: the Oncotype DX model, which looks into the underlying biology that changes gene expression.)

Data properties matter. Comparing the CDC flu data with the Google Flu Trends data reveals how heavily the Google Flu Trends model is influenced by search terms related to 'flu', which may simply reflect people searching around flu season. Know how the data connects to the thing that you want to predict, and make that connection explicit.

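Using unrelated data is the most common mistake when building machine learning models. An example is the plot of chocolate consumption (kg/yr/capita) against the number of Nobel Prize winners per 10 million people in the population. The relationship is roughly linear, with \(r = 0.791,\ p < 0.0001\), yet a strong correlation on its own does not make chocolate consumption sensible data for predicting Nobel Prizes.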
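
One way a strong correlation can arise between two variables with no direct link is a shared hidden factor. The sketch below simulates that situation with synthetic numbers; it does not use the chocolate or Nobel Prize data. The hidden factor, the noise levels, and the function names `pearson_r` and `simulate` are assumptions chosen purely for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(n=200, seed=0):
    """Generate two variables that both depend on one hidden factor."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]   # hypothetical shared driver
    a = [h + rng.gauss(0, 0.5) for h in hidden]    # depends only on hidden
    b = [h + rng.gauss(0, 0.5) for h in hidden]    # depends only on hidden
    return hidden, a, b

hidden, a, b = simulate()

# a and b never influence each other, yet they correlate strongly.
r_raw = pearson_r(a, b)

# Subtracting the hidden factor leaves independent noise, so the link vanishes.
a_resid = [x - h for x, h in zip(a, hidden)]
b_resid = [y - h for y, h in zip(b, hidden)]
r_resid = pearson_r(a_resid, b_resid)

print(f"raw correlation: {r_raw:.2f}")
print(f"after removing the hidden factor: {r_resid:.2f}")
```

The raw correlation is high even though neither variable carries information about the other beyond the shared factor, which is why the connection between the data and the prediction target should be made explicit.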