13 Multiple Comparisons

13.1 Multiple Testing

13.1.1 Key Ideas

  • Hypothesis testing / significance analysis is commonly overused
  • Correcting for multiple testing helps avoid false positives (false discoveries)
  • Two key components
    • Error measure
    • Correction

13.1.2 Why Correct for Multiple Tests?

If we perform a single test at \(\alpha = 5\%\), we accept a \(5\%\) chance of calling a truly null result significant by chance. If we then go away and test 20 different hypotheses at that level, the probability that at least one of them is a false positive rises to \(1 - (1 - 0.05)^{20} \approx 64\%\) (assuming the tests are independent), so at least one coincidental “discovery” becomes extremely likely.
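
For 20 independent tests this is easy to verify directly:

```r
# Chance of at least one false positive across 20 independent null tests,
# each performed at alpha = 0.05
alpha <- 0.05
m <- 20
1 - (1 - alpha)^m  # approximately 0.64
```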

Here, it becomes necessary to point out that P-values and hypothesis tests are not interchangeable: a P-value quantifies the evidence against the null hypothesis, while a hypothesis test is a reject/fail-to-reject decision made at a fixed level.

13.2 Error Rates

Throughout, suppose we perform \(m\) tests, of which \(m_0\) are truly null. Let \(V\) be the number of truly null hypotheses called significant (false positives), and let \(R\) be the total number of hypotheses called significant (discoveries).

False positive rate

  • The rate at which truly null results \((\beta=0)\) are called significant: \(E[\frac{V}{m_0}]\)

Family wise error rate (FWER)

  • The probability of at least one false positive \(Pr(V\ge1)\)

False discovery rate (FDR)

  • The rate at which claims of significance are false \(E [\frac{V}{R}]\)
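
To make \(V\), \(R\), and these three quantities concrete, here is a minimal simulation sketch; the mix of null and non-null tests, the distribution of the alternative P-values, and the seed are all illustrative assumptions:

```r
# Minimal sketch: one realization of the quantities behind the error rates
set.seed(42)
m  <- 1000   # total tests
m0 <- 800    # truly null tests
# Null P-values are uniform; alternative P-values pile up near 0
p    <- c(runif(m0), rbeta(m - m0, 1, 50))
null <- c(rep(TRUE, m0), rep(FALSE, m - m0))

sig <- p < 0.05
V <- sum(sig & null)   # false positives: true nulls called significant
R <- sum(sig)          # total called significant

V / m0          # one realization of V / m0 (about 0.05)
V >= 1          # the FWER event: at least one false positive occurred
V / max(R, 1)   # one realization of V / R (the false discovery proportion)
```

Because the false positive rate and the FDR are defined as expectations, in practice you would average these ratios over many replications of the simulation.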

13.3 Controlling the False Positive Rate

If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

Problem: suppose that you perform 10,000 tests and \(\beta = 0\) for all of them.

  • Suppose that you call \(P < 0.05\) significant.
  • The expected number of false positives is \(10,000 \times 0.05 = 500\).
  • How do we avoid so many false positives?
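
A quick sanity check of that expectation (the seed is arbitrary):

```r
# 10,000 truly null tests: under the null, P-values are uniform on (0, 1),
# so about 5% of them fall below 0.05 by chance alone
set.seed(1)
p <- runif(10000)
sum(p < 0.05)  # typically close to the expected 500
```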

13.3.1 Controlling the Family-wise Error Rate

The Bonferroni correction is the oldest multiple testing correction.

Basic Idea:

  • Suppose you do \(m\) tests
  • You want to control FWER at level \(\alpha\) so \(Pr(V\ge1) < \alpha\)
  • Calculate P-Values normally
  • Set \(\alpha_{fwer} = \frac{\alpha}{m}\)
  • Call all P-values less than \(\alpha_{fwer}\) significant

Pros: Easy to calculate.

Cons: May be very conservative.
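
A minimal sketch in R, using hypothetical P-values:

```r
# Bonferroni: control the FWER at alpha across m tests
alpha <- 0.05
p <- c(0.001, 0.012, 0.049, 0.2, 0.8)  # hypothetical P-values
m <- length(p)

# Compare each raw P-value to the corrected threshold alpha / m
p < alpha / m
```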

13.3.2 Controlling the False Discovery Rate (FDR)

This is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.

Basic idea:

  • Suppose that you do \(m\) tests.
  • You want to control FDR at level \(\alpha\) so that \(E[\frac{V}{R}] \le \alpha\)
  • Calculate P-values normally
  • Order the P-values from smallest to largest: \(P_{[1]}, P_{[2]}, \ldots, P_{[m]}\)
  • Call any \(P_{[i]} \le \alpha \times \frac{i}{m}\) significant

Pros: Still pretty easy to calculate, and less conservative (maybe much less).

Cons: Allows for more false positives, and may behave strangely under dependence.
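
A sketch of the step-up rule with hypothetical P-values, checked against R’s built-in `p.adjust`:

```r
# Benjamini-Hochberg: control the FDR at alpha across m tests
alpha <- 0.05
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9)  # hypothetical
m <- length(p)

# Find the largest i with P_[i] <= alpha * i / m, then call that
# P-value and all smaller ones significant
o <- order(p)
k <- max(which(sort(p) <= alpha * (1:m) / m), 0)  # 0 if none qualifies
significant <- rep(FALSE, m)
if (k > 0) significant[o[1:k]] <- TRUE
significant

# Equivalent, using the built-in adjustment
p.adjust(p, method = "BH") < alpha
```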

13.4 Adjusted P-Values

  • One approach is to adjust the threshold \(\alpha\)
  • A different approach is to calculate “adjusted p-values”
  • These are no longer p-values
  • These can, however, be used directly without adjusting \(\alpha\)

Example:

  • Suppose that your P-values are \(P_{[1]}, P_{[2]}, \ldots, P_{[m]}\)
  • You could adjust them by taking \(P_{i}^{fwer} = \min(m \times P_i,\ 1)\) for each P-value
  • Then if you call all \(P_{i}^{fwer} < \alpha\) significant, you will control the FWER
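
In R, `p.adjust` implements exactly this adjustment; a small check with hypothetical P-values:

```r
# Bonferroni-adjusted P-values: multiply by m, then cap at 1
p <- c(0.0001, 0.004, 0.02, 0.4)  # hypothetical P-values
m <- length(p)

pmin(m * p, 1)                      # manual adjustment
p.adjust(p, method = "bonferroni")  # built-in, gives the same result

# Comparing the adjusted values to alpha = 0.05 controls the FWER
p.adjust(p, method = "bonferroni") < 0.05
```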

13.5 Case Study 1: No True Positives

The code below will fit 1000 linear models to randomly generated, normally distributed data with no underlying relationship. We’ll record the p-value for the slope coefficient of each regression and count how many times we falsely detect a statistically significant relationship.

We’ll then look at different ways that we can adjust the p-values to mitigate these errors.
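
A sketch of the simulation, assuming a simple slope-only model (the seed is arbitrary, so exact counts will vary with it):

```r
set.seed(1010093)  # arbitrary seed; counts change from run to run
pValues <- rep(NA, 1000)
for (i in 1:1000) {
  x <- rnorm(20)
  y <- rnorm(20)  # y is generated independently of x: no true relationship
  # record the p-value for the slope coefficient of y ~ x
  pValues[i] <- summary(lm(y ~ x))$coefficients[2, 4]
}

sum(pValues < 0.05)                                   # raw P-values
sum(p.adjust(pValues, method = "bonferroni") < 0.05)  # FWER-controlled
sum(p.adjust(pValues, method = "BH") < 0.05)          # FDR-controlled
```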

With the raw P-values, 47 of the 1000 truly null slopes are falsely called significant; after either the Bonferroni or the BH adjustment, none are:

## [1] 47
## [1] 0
## [1] 0

13.6 Notes

  • Multiple testing is an entire subfield
  • A basic Bonferroni/BH correction is usually good enough
  • If there is strong dependence between tests, there may be problems
    • Consider method = “BY” (the Benjamini–Yekutieli correction, available in R’s p.adjust)
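
For example, reusing the pValues vector from the case study sketch above:

```r
# BY behaves like BH but remains valid under arbitrary dependence
# between the tests, at the cost of being more conservative
sum(p.adjust(pValues, method = "BY") < 0.05)
```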