The problem of multiple testing (also known as multiple comparisons) occurs when the same statistical test is applied to numerous hypotheses simultaneously, without taking into account the increased probability of detecting a significant result by chance alone.
To illustrate, suppose an acceptable Type I error rate (the probability of rejecting the null hypothesis when it is true; an incorrect result) was deemed to be 0.05, or 5%, for a given test. This means that in 95% of cases, a true null hypothesis would not be rejected (a correct result) for that test. However, if two such tests were conducted simultaneously, there would be only a 0.95 × 0.95, or 90.25%, chance of avoiding a Type I error in both. If 100 tests were conducted simultaneously, there is only a 0.6% (0.95^{100}) chance of avoiding a Type I error in all cases, i.e. a 99.4% chance of committing at least one Type I error. This 99.4% risk is far from the 5% threshold set for the single test and casts doubt on any result drawn from the set of 100 tests conducted.
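The arithmetic above can be checked with a one-line function (plain Python; the function name is illustrative):

```python
# Probability of at least one Type I error across n independent tests,
# each run at a per-test significance level alpha.
def familywise_error_rate(alpha: float, n: int) -> float:
    return 1 - (1 - alpha) ** n

print(round(familywise_error_rate(0.05, 1), 4))    # 0.05
print(round(familywise_error_rate(0.05, 2), 4))    # 0.0975 (1 - 0.9025)
print(round(familywise_error_rate(0.05, 100), 4))  # 0.9941
```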
Corrections for multiple testing
The familywise error rate
The choice of correction should reflect the cost of committing either a Type I or Type II error in your individual experiment. If the cost of a Type I error (i.e. declaring a false positive) is higher than the cost of a Type II error (a false negative), a more conservative correction may be a good choice. If, however, the nature of the experiment is more speculative and exploratory, a higher Type I error rate is often more acceptable and a more sensitive correction may be used. Below, a few popular familywise error rate (FWER) correction measures are described. These methods are of particular interest when one wants to control the risk of committing any Type I errors in the entire family of tests.
Bonferroni 
This is an example of a "single-step" correction. The desired Type I error rate for a single test (α) is divided by the number of tests to be conducted (n). The resulting value (α/n) is the adjusted Type I error rate the researcher should use for each individual test. The Bonferroni correction is considered a relatively harsh or conservative correction. While the risk of committing a Type I error is reduced, the risk of committing a Type II error (accepting the null hypothesis when it is false) is likely to increase, reducing the power of the test.
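A minimal sketch of the correction in plain Python (function names are illustrative):

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Adjusted per-test significance threshold: alpha / n."""
    return alpha / n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """True where the null hypothesis is rejected at the adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Four tests at alpha = 0.05: each p-value is compared to 0.05/4 = 0.0125.
print(bonferroni_reject([0.001, 0.013, 0.04, 0.2]))  # [True, False, False, False]
```

Note that p = 0.013, significant for a single test at α = 0.05, no longer passes the adjusted threshold.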

Holm-Bonferroni 
This is an example of a "sequential" (step-down) correction. The Holm-Bonferroni correction is less conservative than the Bonferroni correction and thus has more statistical power. Rather than comparing every p-value to a single adjusted threshold, this correction tests the hypotheses one at a time against progressively less stringent thresholds. The p-values are first sorted in ascending order. The hypothesis with the most extreme p-value (i.e. the lowest) is compared against α/n, where n is the total number of tests conducted, as in the Bonferroni correction. The hypothesis with the second most extreme p-value is compared against α/(n - 1), the third against α/(n - 2), and so forth. At each step i, the algorithm checks whether the i-th smallest p-value falls below its threshold, α/(n - i + 1), stopping as soon as the p-value of a given hypothesis exceeds this threshold. That hypothesis, and all other hypotheses with higher p-values, are declared non-significant (i.e. their null hypotheses are retained).
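A minimal sketch of the step-down procedure in plain Python (names are illustrative):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down test.

    Returns a list of booleans, True where the null hypothesis is rejected.
    The i-th smallest p-value (0-indexed) is compared to alpha / (n - i);
    testing stops at the first failure, and that hypothesis plus all
    hypotheses with larger p-values are retained.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    reject = [False] * n
    for i, k in enumerate(order):
        if p_values[k] < alpha / (n - i):   # thresholds: a/n, a/(n-1), ...
            reject[k] = True
        else:
            break                           # stop at the first failure
    return reject

# Thresholds here are 0.05/3, 0.05/2, 0.05/1: all three are rejected.
print(holm_reject([0.01, 0.02, 0.011]))  # [True, True, True]
```

For comparison, a plain Bonferroni correction (threshold 0.05/3 ≈ 0.0167 for every test) would reject only the first and third hypotheses in this example, illustrating the extra power of the sequential approach.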

Šidák 
Also known as the Dunn-Šidák correction, the Šidák correction is less conservative (and hence more sensitive) than the Bonferroni correction; however, it assumes that the hypotheses being tested are statistically independent. The form of this correction is 1 - (1 - α)^{1/n}, where α is the desired Type I error rate and n is the number of tests being conducted. The resulting value is the adjusted Type I error rate threshold. 
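The formula translates directly (plain Python; the function name is illustrative):

```python
def sidak_alpha(alpha: float, n_tests: int) -> float:
    """Šidák-adjusted per-test threshold: 1 - (1 - alpha)**(1/n).

    Valid when the n tests are statistically independent.
    """
    return 1 - (1 - alpha) ** (1 / n_tests)

# For 10 independent tests the threshold is ~0.00512, slightly more
# permissive than the corresponding Bonferroni value of 0.005.
print(sidak_alpha(0.05, 10))
```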


Large data sets and the false discovery rate
Methods other than FWER corrections may be more appropriate for large data sets as large numbers of tests are likely to produce adjusted p-value thresholds that are far too conservative, resulting in many false negatives (Type II errors). Large sampling campaigns in microbial ecology and the use of technologies such as microarrays and genomic sequencing require the use of techniques that provide reasonable significance corrections when thousands or millions of comparisons are performed. In this scenario, one is generally willing to accept a certain proportion of "false positives" or "false discoveries".
False discovery rate (FDR) 
The false discovery rate (FDR; Benjamini &amp; Hochberg, 1995) is the expected proportion of "false discoveries" (i.e. Type I errors) among the total number of "discoveries" (instances where the null hypothesis has been rejected, whether it is true or false), multiplied by the probability of making at least one discovery. This proportion (Q) may be set to an acceptable level, for example 5%, or the number of false discoveries may be directly estimated. The FDR is scalable to any test size: a Q of 5% may correspond to 5 false discoveries in 100 discoveries or 50 in 1000. Further, the FDR is adaptive in the sense that the number of false discoveries is only meaningful when compared to the total number of discoveries made. 
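The Benjamini-Hochberg step-up procedure, the standard way of controlling the FDR at a chosen level Q, can be sketched in a few lines of plain Python (names are illustrative):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q.

    Sort the p-values in ascending order, find the largest rank i
    (1-indexed) with p_(i) <= (i / n) * q, and reject that hypothesis
    together with all hypotheses having smaller p-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    cutoff_rank = 0
    for i, k in enumerate(order, start=1):
        if p_values[k] <= i / n * q:
            cutoff_rank = i
    reject = [False] * n
    for k in order[:cutoff_rank]:
        reject[k] = True
    return reject

# Rank thresholds (i/n)*q are 0.0125, 0.025, 0.0375, 0.05:
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))  # [True, True, True, False]
```

A Bonferroni correction on the same four p-values (threshold 0.0125) would reject only the first hypothesis.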
Positive false discovery rate (pFDR) 
The positive false discovery rate (pFDR) is very similar to the FDR; however, it has some different properties that allow the estimation of an FDR, and hence a measure of significance, for each hypothesis tested (Storey, 2002; Storey, 2003). This measure is called the q-value and is a function of the p-value of an individual test and the distribution of the p-values for all the tests performed. As most statistical tests give non-adjusted p-values in their output, it is quite straightforward to estimate the corresponding q-values from these p-values (Storey and Tibshirani, 2003; see implementations). Storey (2002, 2003) argues that the pFDR offers more power than the FDR. A q-value reflects the percentage of test results (discoveries) that are likely to be false positives (false discoveries) for a given p-value in a given set of tests. For example, if a hypothesis test result, H_{i}, has a q-value of 0.02, 2% of hypothesis tests with a p-value less than or equal to that of H_{i} are likely to be false discoveries. 
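A rough sketch of how q-values relate to p-values, with a major simplification: the proportion of true null hypotheses (pi0) is taken as a given rather than estimated from the p-value distribution, which is the hard part that real implementations such as the qvalue package handle. With pi0 = 1 the result reduces to Benjamini-Hochberg adjusted p-values.

```python
def q_values(p_values, pi0=1.0):
    """Sketch of q-value computation (Storey-style).

    pi0 is the ASSUMED proportion of true null hypotheses; a real
    implementation estimates it from the p-value distribution.
    q(p_i) is the smallest estimated FDR over all rejection thresholds
    that include p_i, enforced by the running minimum below.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for i in range(n - 1, -1, -1):
        k = order[i]
        running_min = min(running_min, pi0 * n * p_values[k] / (i + 1))
        q[k] = running_min
    return q

print(q_values([0.01, 0.02, 0.03, 0.5]))  # [0.04, 0.04, 0.04, 0.5]
```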
Many of these approaches assume the distribution of p-values obtained from a set of tests is "true". This can be problematic if p-values have been estimated from distributions that do not describe the distribution of a test statistic given a specific data set. Estimating p-values through resampling methods such as permutation may improve this situation.
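As an illustration of the resampling idea, a simple two-sample permutation test on the difference in means (plain Python; names and the choice of statistic are illustrative):

```python
import random

def permutation_p_value(x, y, n_perm=9999, seed=0):
    """Permutation p-value for the difference in means of two samples.

    The p-value is the proportion of label permutations whose absolute
    mean difference is at least as extreme as the observed one; the
    observed statistic is counted once, hence the +1 terms.
    """
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution is built from the data themselves, no distributional assumptions about the test statistic are needed.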
Consider a data matrix which records the abundances of 100 species (variables) across 30 sites (objects). Pairwise correlations may be calculated (between species 1 and 2, 1 and 3, 2 and 3, etc.) and tested for significance by e.g. permutation. An alpha level of 0.05 was deemed acceptable for a single test. If each species is to be tested against every other species, a total of 4950 (100 × 99 ÷ 2) unordered pairs would be tested. The Bonferroni correction would require the alpha level to be adjusted to roughly 1.01 × 10^{-5}, or 0.05 ÷ 4950.
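Since the correlation between species i and j is the same test as between j and i, the number of distinct tests is the number of unordered pairs, n(n - 1)/2. A quick check of the adjusted threshold:

```python
n_species = 100
n_tests = n_species * (n_species - 1) // 2   # 4950 unordered pairs
adjusted_alpha = 0.05 / n_tests              # ~1.01e-05
print(n_tests, adjusted_alpha)
```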
The following implementations often use different approaches to calculate FDRs and may require different input (e.g. p-values or z-scores). Please read their documentation and ensure that the FDR calculated is appropriate to your data.
R
 Package fdrtool
 Package mixfdr
 Package qvalue (Bioconductor)
 Package nFDR
 Package multtest (Bioconductor)
