15 Permutation Tests
15.1 Group Comparisons
Consider comparing two indpendent groups. Example: comparing different bug sprays
- Consider the null hypothesis that the distribution of the observations from each group is the same
- Then, the group labels are irrelevant
- Consider a data frame with counts in one column and spray label in another
- Permute the spray (group) labels
- Recalcuate the statistic
- Mean difference in counts
- Geomatric means
- T-statistic
- Calculate the percentage of simulations where the simulted statistic was mroe extreme (toward the alternative) than the observed. This will create a permutation-based p-value.
15.2 Variations on Permutation Testing
Data Type, Statistic, Test Name
- Ranks, Rank Sum, Rank Sum Test
- Binary, Hypergeometric Prob, Fisher’s exact test
- Raw data, …, Ordinary Permutation Test
So-called Randomization tests are exactly oerutation tests, with different motivations.
- For matched data, one can randomize the signs
- For ranks, the results in the signed rank test
- Permutation strategies work for regression as well
- Permuting a regressor of interest
- Permutation tests work very well in multivariate settings
15.3 Permutation Test B v C
# Select sprays B and C
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
# Set y as the outcome (counts in this case)
y <- subdata$count
# Set group as group labels
group <- as.character(subdata$spray)
# Test statistic here is just the difference in average between the two sprays
# Averaged out over batches
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
# The observed statistic here is just the test statistic applied to the outcome and group
observedStat <- testStat(y, group)
# Now we're going to resample the group labels to breka up any association between the outcome and the group labels and measure a test statistic 10,000 times
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
observedStat
## [1] 13.25
# Calculate the percentage of permuted test statistics that are larger or more extreme, in favor of the alternative in our observed statistic
mean(permutations > observedStat)
## [1] 0
# Plotting a histogram allows us to see the distribution of permuted means. The red line indicates the observed mean from the original group
qplot(x = permutations) +
geom_histogram(color = "black", fill = "lightblue", binwidth = 1) +
geom_vline(aes(xintercept=observedStat),color="red", size=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.