31 GLMs for Binary Data
Binary GLMs arise from modeling outcomes that can take only two values. Examples include: survival or death at the end of a study, a team winning versus losing, and success versus failure of a treatment or product. These are often called Bernoulli outcomes, after the Bernoulli distribution, named for the mathematician Jacob Bernoulli.
If we happen to have several exchangeable binary outcomes for the same level of covariate values, then that is binomial data and we can aggregate the 0’s and 1’s into the count of 1’s. As an example, imagine we sprayed insect pests with 4 different pesticides and recorded whether each insect died. Then for each spray, we could summarize the data with the count of dead insects and the total number sprayed, and treat the data as binomial rather than Bernoulli.
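As a minimal R sketch of that aggregation (the data here are simulated and the object names are hypothetical):

set.seed(1)
# Hypothetical Bernoulli data: one row per insect, died = 1, survived = 0
spray <- rep(c("A", "B", "C", "D"), each = 25)
died <- rbinom(100, size = 1, prob = 0.6)
# Aggregate the 0/1 outcomes into binomial counts per spray
dead <- tapply(died, spray, sum)
total <- tapply(died, spray, length)
# A binomial GLM accepts aggregated data as cbind(successes, failures)
fit <- glm(cbind(dead, total - dead) ~ factor(names(dead)), family = "binomial")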
31.1 Key Ideas
Frequently we care about outcomes with two values:
- Alive/Dead
- Win/Loss
- Success/Failure
These are all binary results, 0/1 outcomes.
- We are going to model data as if it were a collection of coin flips where a success probability depends on a collection of covariates.
- If you have a collection of zeros and ones where the success probability is constant and the outcomes are independent, then the total number of successes is a binomial random variable; the simulation sketch below illustrates this.
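A quick simulation sketch of that equivalence: summing independent Bernoulli draws with a common success probability matches a single binomial draw.

set.seed(42)
n <- 10; p <- 0.3
# Sum n independent Bernoulli(p) flips, many times over
sums <- replicate(10000, sum(rbinom(n, size = 1, prob = p)))
# The empirical distribution closely matches Binomial(n, p)
round(table(sums) / 10000, 3)
round(dbinom(0:n, size = n, prob = p), 3)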
31.2 Ravens Data
If we try to predict whether a team will win based on its score and the score of its opponents using linear regression, the fitted values can fall below 0 or above 1, which makes no sense for probabilities. It would be better to model the odds in this case, as the sketch below shows.
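A sketch of what goes wrong, assuming the ravensData data frame used later in this chapter (with a 0/1 ravenWinNum column and a numeric ravenScore column):

# Naive linear regression of the binary win indicator on the score
linRavens <- lm(ravenWinNum ~ ravenScore, data = ravensData)
# Fitted "probabilities" are not guaranteed to stay inside [0, 1]
range(fitted(linRavens))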
- Binary Outcome 0/1 \[RW_i\]
- Probability (0,1) \[Pr(RW_i|RS_i, b_0, b_1)\]
- Odds (0, \(\infty\) ) \[\frac{Pr(RW_i|RS_i, b_0, b_1)}{1-Pr(RW_i|RS_i, b_0, b_1)}\]
- Log Odds (\(-\infty, \infty\)) \[log(\frac{Pr(RW_i|RS_i, b_0, b_1)}{1-Pr(RW_i|RS_i, b_0, b_1)})\]
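To make the four scales concrete, a small sketch converting a single probability through each of them:

p <- 0.8                 # a probability, in (0, 1)
odds <- p / (1 - p)      # the odds, in (0, Inf): here 4, i.e. "4 to 1"
logOdds <- log(odds)     # the log odds, in (-Inf, Inf)
# Inverting the log odds recovers the original probability
exp(logOdds) / (1 + exp(logOdds))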
Linear: \[RW_i = b_0 + b_1RS_i + \epsilon_i\] or
\[E[RW_i|RS_i, b_0, b_1] = b_0 + b_1RS_i\]
Logistic: \[Pr(RW_i|RS_i, b_0, b_1) = \frac{exp(b_0+b_1RS_i)}{1+exp(b_0+b_1RS_i)}\] or \[log(\frac{Pr(RW_i|RS_i, b_0, b_1)}{1-Pr(RW_i|RS_i, b_0, b_1)}) = b_0+b_1RS_i\]
- The function that inverts the log odds is the logistic (expit) function \(\frac{e^a}{1+e^a}\); see the sketch below.
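In R this function is built in as plogis(), with qlogis() as its inverse (the logit); a quick check:

a <- seq(-4, 4, by = 2)
# Hand-coded expit: maps log odds to probabilities
expit <- function(a) exp(a) / (1 + exp(a))
expit(a)
# plogis() is the same function; qlogis() undoes it
plogis(a)
qlogis(plogis(a))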
31.3 Interpreting Logistic Regression
\[log(\frac{Pr(RW_i|RS_i, b_0, b_1)}{1-Pr(RW_i|RS_i, b_0, b_1)}) = b_0+b_1RS_i\]
- \(b_0\) = Log odds of a Ravens win if they score zero points
- \(b_1\) = Log odds ratio of win probability for each additional point scored (compared to zero points)
- \(exp(b_1)\) = Odds ratio of win probability for each additional point scored (compared to zero points)
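A sketch of these interpretations in action, using hypothetical coefficient values (not the actual fitted values from the Ravens model):

b0 <- -1.7  # hypothetical log odds of a win at 0 points scored
b1 <- 0.1   # hypothetical log odds ratio per additional point
# Odds ratio per point scored
exp(b1)
# Implied win probability if the Ravens score 30 points
exp(b0 + b1 * 30) / (1 + exp(b0 + b1 * 30))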
31.4 Making the Model in R
- GLMs can be fit with the glm function, in exactly the same way as lm() works:
# Fit a logistic (binomial) regression of wins on the Ravens' score
logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family="binomial")
summary(logRegRavens)
- To plot the fitted win probabilities against the score:
plot(ravensData$ravenScore, logRegRavens$fitted, pch=19, col="blue", xlab="Score", ylab="Prob Ravens Win")
- To find the odds ratios and confidence intervals:
exp(logRegRavens$coeff)
exp(confint(logRegRavens))
- To perform an analysis of deviance (the GLM analogue of ANOVA) on the fitted model:
anova(logRegRavens,test="Chisq")
31.5 Interpreting Odds Ratios
- These are not probabilities
- Odds ratio of 1 = no difference in odds
- Log odds ratio of 0 = no difference in odds
- Odds ratios below 0.5 or above 2 are commonly taken to indicate a ‘moderate effect’
- Relative risk is the ratio of two probabilities (which, since probabilities are bounded by 1, does impose some limitations)
- Relative risk is not an odds ratio; the sketch below shows how the two differ
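A numeric sketch of the distinction, using two made-up probabilities:

p1 <- 0.4  # success probability in group 1
p2 <- 0.2  # success probability in group 2
p1 / p2                               # relative risk: 2
(p1 / (1 - p1)) / (p2 / (1 - p2))     # odds ratio: about 2.67
# The two are close only when both probabilities are small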