
Multiple testing

The main idea...

The problem of multiple testing (also known as multiple comparisons) occurs when the same statistical test is used to assess numerous hypotheses simultaneously, without taking into account the increased probability of detecting a significant result by chance alone.

To illustrate, suppose an acceptable Type I error rate (the probability of rejecting the null hypothesis when it is true; an incorrect result) is deemed to be 0.05, or 5%, for a given test. This means that in 95% of cases, a true null hypothesis would correctly not be rejected by that test. However, if two such independent tests were conducted simultaneously, there would be a 0.95 × 0.95, or 90.25%, chance of obtaining a correct result in both. If 100 tests were conducted simultaneously, there would be only a 0.6% (0.95^100) chance of obtaining a correct result in all cases, i.e. a 99.4% chance of committing at least one Type I error. This 99.4% risk is far from the 5% threshold set for a single test and casts doubt on any result drawn from the set of 100 tests conducted.
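The arithmetic above can be checked with a few lines of plain Python (a minimal sketch; the 0.95^n calculation assumes the tests are statistically independent):

```python
# Probability of at least one Type I error (the family-wise error rate, FWER)
# when n independent tests are each run at alpha = 0.05.
alpha = 0.05

for n in (1, 2, 100):
    p_all_correct = (1 - alpha) ** n   # all n tests avoid a false positive
    fwer = 1 - p_all_correct           # at least one false positive
    print(f"{n:3d} tests: P(no Type I error) = {p_all_correct:.4f}, FWER = {fwer:.4f}")
    # 100 tests: FWER comes out near 0.994, as described in the text
```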

Corrections for multiple testing

The family-wise error rate

The choice of correction should reflect the cost of committing either a Type I or Type II error in your individual experiment. If the cost of a Type I error (i.e. declaring a false positive) is higher than the cost of a Type II error (a false negative), a more conservative correction may be a good choice. If, however, the nature of the experiment is more speculative and exploratory, a higher Type I error rate is often more acceptable and a more sensitive correction may be used. Below, a few popular family-wise error rate (FWER) correction measures are described. These methods are of particular interest when one wants to control the risk of committing any Type I errors in the entire family of tests.

 Bonferroni This is an example of a "single-step" correction. The desired Type I error rate for a single test (α) is divided by the number of tests to be conducted (n), and the resulting value (α/n) is used as the adjusted, per-test Type I error rate. The Bonferroni correction is considered a relatively harsh or conservative correction: while the risk of committing a Type I error is reduced, the risk of committing a Type II error (accepting the null hypothesis when it is false) is likely to increase, reducing the power of the test.
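A minimal Python sketch of the Bonferroni procedure (the function name `bonferroni_reject` is illustrative, not from any particular library):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return True (reject the null) for each p-value that is
    significant after the Bonferroni correction."""
    threshold = alpha / len(p_values)   # adjusted per-test threshold: alpha / n
    return [p <= threshold for p in p_values]

pvals = [0.001, 0.004, 0.03, 0.20]
# With 4 tests the adjusted threshold is 0.05 / 4 = 0.0125
print(bonferroni_reject(pvals))   # [True, True, False, False]
```

Note that 0.03 would be significant at the unadjusted α = 0.05 but fails the corrected threshold, illustrating the loss of power mentioned above.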

 Holm-Bonferroni This is an example of a "sequential" (step-down) correction. The Holm-Bonferroni correction is less conservative than the Bonferroni correction and thus has more statistical power. The hypotheses are considered one at a time, in order of their p-values, against progressively less stringent thresholds. The hypothesis with the most extreme (smallest) p-value has its p-value compared to α/n, where n is the total number of tests conducted, as in the Bonferroni correction. The hypothesis with the second smallest p-value is compared to α/(n − 1), the third to α/(n − 2), and so forth; in general, the i-th smallest p-value is compared to α/(n − i + 1). The procedure stops as soon as a p-value exceeds its threshold: that hypothesis and all remaining hypotheses with larger p-values fail the test (i.e. their null hypotheses are retained), while all hypotheses considered before the stopping point are declared significant.
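The step-down procedure can be sketched in a few lines of Python (again, the function name is illustrative):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.
    Returns a list of booleans (True = reject the null) in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * n
    for rank, i in enumerate(order):              # rank 0 = smallest p-value
        if p_values[i] <= alpha / (n - rank):     # thresholds: a/n, a/(n-1), ...
            reject[i] = True
        else:
            break                                 # stop at the first failure
    return reject

pvals = [0.01, 0.04, 0.03, 0.005]
# Sorted: 0.005 vs 0.0125 (pass), 0.01 vs 0.0167 (pass), 0.03 vs 0.025 (fail, stop)
print(holm_reject(pvals))   # [True, False, False, True]
```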

 Šidák Also known as the Dunn-Šidák correction, the Šidák correction is less conservative (and hence more sensitive) than the Bonferroni correction; however, it assumes that the hypotheses being tested are statistically independent. The form of this correction is 1 - (1 - α)^(1/n), where α is the desired Type I error rate and n is the number of tests being conducted. The resulting value is the adjusted, per-test Type I error rate threshold.
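As a quick sketch, the Šidák threshold can be computed directly and compared with the Bonferroni value (function name illustrative):

```python
def sidak_threshold(alpha, n_tests):
    """Per-test significance threshold under the Sidak correction.
    Assumes the n tests are statistically independent."""
    return 1 - (1 - alpha) ** (1 / n_tests)

# For 10 tests at alpha = 0.05:
print(round(sidak_threshold(0.05, 10), 6))   # 0.005116
# Bonferroni gives 0.05 / 10 = 0.005, so Sidak is slightly less conservative.
```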


Large data sets and the false discovery rate

Methods other than FWER corrections may be more appropriate for large data sets, as large numbers of tests are likely to produce adjusted p-value thresholds that are far too conservative, resulting in many false negatives (Type II errors). Large sampling campaigns in microbial ecology and the use of technologies such as microarrays and genomic sequencing require techniques that provide reasonable significance corrections when thousands or millions of comparisons are performed. In this scenario, one is generally willing to accept a certain proportion of "false positives" or "false discoveries".

False discovery rate
The false discovery rate (FDR; Benjamini & Hochberg, 1995) is the expected proportion of "false discoveries" (i.e. Type I errors) among the total number of "discoveries" (instances where the null hypothesis has been rejected, whether it is true or false); formally, this proportion is multiplied by the probability of making at least one discovery. This proportion (Q) may be set to an acceptable level, for example 5%, or the number of false discoveries may be estimated directly. The FDR scales to any number of tests: a Q of 5% may correspond to 5 false discoveries among 100 discoveries or 50 among 1000. Further, the FDR is adaptive in the sense that the number of false discoveries is only meaningful relative to the total number of discoveries made.
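The FDR is commonly controlled with the Benjamini-Hochberg step-up procedure, which can be sketched in plain Python (function name illustrative):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q.
    Returns a list of booleans (True = discovery) in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest rank k (1-indexed) with p_(k) <= (k / n) * q
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * q:
            k = rank
    # Declare the k smallest p-values to be discoveries
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))   # [True, True, False, False, False, False, False, False]
```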
Positive false discovery rate
The positive false discovery rate (pFDR) is very similar to the FDR; however, it has some different properties that allow the estimation of an FDR, and hence a measure of significance, for each hypothesis tested (Storey, 2002; Storey, 2003). This measure is called the q-value and is a function of the p-value of an individual test and the distribution of the p-values of all the tests performed. As most statistical tests report unadjusted p-values in their output, it is quite straightforward to estimate the corresponding q-values from these p-values (Storey and Tibshirani, 2003; see implementations). Storey (2002, 2003) argues that the pFDR offers more power than the FDR. A q-value reflects the proportion of test results (discoveries) that are likely to be false positives (false discoveries) at a given p-value threshold in a given set of tests. For example, if a hypothesis test result, Hi, has a q-value of 0.02, then approximately 2% of the hypothesis tests with p-values less than or equal to that of Hi are expected to be false discoveries.

Many of these approaches assume the distribution of p-values obtained from a set of tests is "true". This can be problematic if p-values have been estimated from distributions that do not describe the distribution of a test statistic given a specific data set. Estimating p-values through resampling methods such as permutation may improve this situation.


Consider a data matrix which records the abundances of 100 species (variables) across 30 sites (objects). Pair-wise correlations may be calculated (between species 1 and 2, 1 and 3, 2 and 3, etc.) and tested for significance by, e.g., permutation. Suppose an alpha level of 0.05 is deemed acceptable for a single test. As the correlation between two species is symmetric, testing each species against every other species involves 4950 (100 × 99 ÷ 2) unique pairs, and hence 4950 tests. The Bonferroni correction would require the alpha level to be adjusted to 0.05 ÷ 4950, or roughly 1.01 × 10^-5.
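This arithmetic can be checked directly; note that the count uses unordered pairs, since the correlation between species i and j is the same as between j and i:

```python
from math import comb

n_species = 100
n_tests = comb(n_species, 2)   # unordered pairs: 100 * 99 / 2 = 4950
alpha = 0.05
adjusted = alpha / n_tests     # Bonferroni-adjusted threshold, about 1.01e-05
print(n_tests, adjusted)
```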


The following implementations often use different approaches to calculate FDRs and may require different input (e.g. p-values or z-scores). Please read their documentation and ensure that the FDR calculated is appropriate to your data.
  • R
    • Package fdrtool
    • Package mixfdr
    • Package qvalue (Bioconductor) 
    • Package nFDR
    • Package multtest (Bioconductor)