
Permutation

The main idea...

Informally, permutation is simply the act of rearranging objects (Figure 1). Within a single data matrix, rearranging values can disrupt relationships between variables and objects. Additionally, given a data set that comprises one matrix of response data and one matrix of explanatory data, randomly permuting row or column values in one of these matrices is likely to disrupt any relationships (if present) between the two matrices.

In hypothesis testing procedures, this effect can be used to create a numerical representation of a null hypothesis that does not depend on a predefined data distribution. This representation is called the "null distribution". To create the null distribution, a chosen test statistic is calculated for each permutation of a given data set. If the value of the test statistic calculated from the non-permuted data is sufficiently different from those in the null distribution, statistical significance may be asserted (Figure 2).
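
The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the test statistic (a difference in group means), the function names, the seed, and the data values are all assumptions made for the example.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_null(group_a, group_b, n_perm=999, seed=0):
    """Return the observed difference in means and a permuted null distribution.

    Group membership is treated as exchangeable under the null hypothesis,
    so the pooled values are shuffled and re-split n_perm times.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return observed, null


# Hypothetical measurements for two groups of objects.
group_a = [4.1, 5.0, 4.8, 5.6, 5.2]
group_b = [6.9, 7.4, 6.1, 7.8, 7.0]

observed, null = permutation_null(group_a, group_b)
print("observed statistic:", observed)
print("range of null distribution:", min(null), max(null))
```

The observed statistic is then compared with the spread of values in `null` to judge how unusual it is under the null hypothesis.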

Creating a null distribution by permutation is particularly suited to data sets that do not follow standard distributions or those with few samples (objects). Be aware, however, that the permutational approach you choose must correspond to a valid null hypothesis for your data and be supported by your experimental design. This hinges on correctly identifying which units are exchangeable under the null hypothesis (Anderson, 2001). See below for some examples.

Importantly, if an experimental or sampling design features nestedness or hierarchy, permutations must be restricted appropriately. For example, if sediment samples were taken from coastal and pelagic regions and there is no good reason to believe that the sediment types are the same between these two sets, permutations should be restricted to occur within each region rather than across the whole data set.
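
A restricted scheme of this kind can be sketched as follows: values are shuffled only among samples that belong to the same stratum (here, region). The function name `permute_within_strata` and the measurement values are hypothetical and serve only to illustrate the idea.

```python
import random


def permute_within_strata(values, strata, rng):
    """Shuffle values only among samples that share a stratum label."""
    permuted = list(values)
    # Collect the positions of the samples belonging to each stratum.
    positions = {}
    for index, stratum in enumerate(strata):
        positions.setdefault(stratum, []).append(index)
    # Shuffle the values inside each stratum independently.
    for indices in positions.values():
        shuffled = [permuted[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted


# Hypothetical measurements from four coastal and four pelagic samples.
strata = ["coastal"] * 4 + ["pelagic"] * 4
values = [1, 2, 3, 4, 10, 20, 30, 40]

rng = random.Random(1)
permuted = permute_within_strata(values, strata, rng)
print(permuted)
```

Coastal values never move into pelagic positions (and vice versa), so the null distribution reflects only rearrangements that the sampling design allows.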


Examples of permutational schemes and their corresponding null hypotheses. Each null hypothesis is also illustrated using sites as objects and species as variables.

Within rows: Permuting values within rows (Figure 1, b) suggests that there is no meaning in the order of variable values within an object. That is, an object is equally likely to realise any variable value present across all variables which describe that particular object.

This suggests that, at a given site, species' presences or abundances (and hence names) are arbitrary.

Whole rows: Permuting whole rows (Figure 1, c) suggests that there is no meaningful relationship between a given object and the complete set of variable values associated with it. In other words, any object could just as well be described by the variable values (taken as a complete set) of another object. Objects, in their entirety, are treated as the exchangeable units.

This suggests that sites are interchangeable, i.e. there is no difference between them.

Whole columns: Permuting whole columns (Figure 1, d) suggests that there is no meaningful relationship or distinction between variables. That is, any variable could just as well be any other variable. The exchangeable units are the variables themselves, in their entirety. This permutational scheme is useful in several R mode analyses.

This suggests that species are interchangeable, i.e. species names are irrelevant.
 
Within columns: Permuting values within columns (Figure 1, e) suggests that there is no meaningful relationship between a variable value and an object. That is, an object is equally likely to realise any variable value present within a given variable in the data set.

This suggests that the presence or abundance of a given species across all sites is arbitrary.
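
The four schemes can be sketched on a small sites-by-species matrix stored as a list of rows. The function names and abundance values below are hypothetical; the sketch only shows which elements move under each scheme.

```python
import random


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def permute_within_rows(matrix, rng):
    """Shuffle the values inside each row independently (Figure 1, b)."""
    return [rng.sample(row, len(row)) for row in matrix]


def permute_whole_rows(matrix, rng):
    """Reorder complete rows, keeping each row intact (Figure 1, c)."""
    rows = [list(row) for row in matrix]
    rng.shuffle(rows)
    return rows


def permute_whole_columns(matrix, rng):
    """Reorder complete columns, keeping each column intact (Figure 1, d)."""
    return transpose(permute_whole_rows(transpose(matrix), rng))


def permute_within_columns(matrix, rng):
    """Shuffle the values inside each column independently (Figure 1, e)."""
    return transpose(permute_within_rows(transpose(matrix), rng))


# Hypothetical abundances: rows are sites (objects), columns are species (variables).
sites_by_species = [
    [5, 0, 2],
    [1, 3, 0],
    [0, 4, 7],
    [2, 2, 1],
]

rng = random.Random(42)
within_rows = permute_within_rows(sites_by_species, rng)
whole_rows = permute_whole_rows(sites_by_species, rng)
whole_columns = permute_whole_columns(sites_by_species, rng)
within_columns = permute_within_columns(sites_by_species, rng)

for name, result in [("within rows", within_rows), ("whole rows", whole_rows),
                     ("whole columns", whole_columns), ("within columns", within_columns)]:
    print(name, result)
```

Each function returns a new matrix and leaves the original untouched, so the same data can be permuted repeatedly when building a null distribution.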



How many permutations?

Occasionally, it isn't possible (or useful) to calculate all the possible permutations of large data sets due to limited computation power and time. In these scenarios, a random subset of all possible permutations can be calculated. However, the number of permutations performed determines the smallest p-value that can be obtained, and hence the strictest significance level at which the null hypothesis could be rejected. For example, using 999 permutations and counting the observed value as a member of the null distribution, the smallest possible p-value is 1 ÷ (1+999) = 0.001.
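
To give a sense of scale, the sketch below counts how many permutations exist for a few small designs and computes the smallest p-value attainable from a random subset. The function names are hypothetical, and the smallest p-value assumes the convention in which the observed value is counted as one member of the null distribution.

```python
import math


def n_orderings(n_objects):
    """Number of ways to order n distinct objects."""
    return math.factorial(n_objects)


def n_two_group_splits(n_a, n_b):
    """Number of distinct ways to split n_a + n_b objects into two groups."""
    return math.comb(n_a + n_b, n_a)


def smallest_p_value(n_perm):
    """Smallest attainable p-value when the observed value is also counted."""
    return 1 / (1 + n_perm)


print(n_orderings(10))             # orderings of 10 objects
print(n_two_group_splits(5, 5))    # two groups of 5
print(n_two_group_splits(20, 20))  # two groups of 20
print(smallest_p_value(999))
print(smallest_p_value(9999))
```

Two groups of five objects can be fully enumerated, whereas two groups of twenty already give more than 10^11 distinct splits, which is where random subsets become the practical option.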

The rank of the observed value of a given test statistic among the permuted test statistics gives its p-value. For example, if 999 permutations are performed to create the null distribution, and 4 of these yield a value at least as extreme as the observed statistic (here, less than or equal to it, as in a lower-tailed test), the resulting p-value is:

            (1+4) ÷ (1+999) = 0.005

Note that the "1" added to both the numerator and the denominator represents the observed value itself, which is counted as one member of the null distribution.
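
The same calculation can be written as a small function. This is a sketch with hypothetical names (`permutation_p_value`, `tail`); the null distribution below is constructed artificially so that exactly 4 of its 999 values are less than or equal to the observed value.

```python
def permutation_p_value(observed, null, tail="lower"):
    """One-tailed permutation p-value, counting the observed value itself."""
    if tail == "lower":
        n_extreme = sum(1 for value in null if value <= observed)
    elif tail == "upper":
        n_extreme = sum(1 for value in null if value >= observed)
    else:
        raise ValueError("tail must be 'lower' or 'upper'")
    # The added 1 in numerator and denominator is the observed value.
    return (1 + n_extreme) / (1 + len(null))


# Artificial null distribution: 4 low values plus 995 values of 0 or more.
observed = -2.1
null = [-3.0, -2.5, -2.2, -2.1] + [0.01 * i for i in range(995)]

p_value = permutation_p_value(observed, null, tail="lower")
print(p_value)
```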

Assumptions

  • Permuted units are assumed to be exchangeable under the null hypothesis. 
  • Homogeneity of variances or, in the multivariate case, of dispersions is often required.
  • Permuted units should be independent. This assumption can be difficult to meet in ecological scenarios. Any decision to overlook it should be made consciously and adequately justified.

Warnings 

  • Permutation is not totally assumption free. See "Assumptions".
  • During hypothesis testing, be clear whether you are performing a one- or two-tailed test. For a two-tailed test at a significance level of 0.05 (assuming the number of permutations is large enough to support p-values this small), the test statistic must lie above the 97.5th percentile or below the 2.5th percentile of the null distribution to be significant. A one-tailed test requires the test statistic to lie either above the 95th percentile or below the 5th percentile of the null distribution, depending on the direction being tested.
  • While often equated in prose, permutation tests differ from randomised exact tests. Most notably, the latter perform all possible permutations of a data set to arrive at a null distribution.
  • Very small data sets which have correspondingly small numbers of possible permutations may not be suited to a permutational assessment of significance.
  • Data sets from unbalanced experimental designs can limit power when employing permutation.
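
The difference between one- and two-tailed decisions noted above can be sketched as follows. The function names and the artificial null distribution (the integers 0 to 999) are hypothetical, and the rule shown is the simple percentile comparison described in the warning, not the procedure of any particular package.

```python
def fraction_below(observed, null):
    """Proportion of null values strictly smaller than the observed value."""
    return sum(1 for value in null if value < observed) / len(null)


def fraction_above(observed, null):
    """Proportion of null values strictly larger than the observed value."""
    return sum(1 for value in null if value > observed) / len(null)


def is_significant(observed, null, alpha=0.05, tails="two"):
    """Percentile-based decision for 'two', 'upper' or 'lower' tailed tests."""
    below = fraction_below(observed, null)
    above = fraction_above(observed, null)
    if tails == "two":
        # alpha is split equally between the two tails.
        return below >= 1 - alpha / 2 or above >= 1 - alpha / 2
    if tails == "upper":
        return below >= 1 - alpha
    if tails == "lower":
        return above >= 1 - alpha
    raise ValueError("tails must be 'two', 'upper' or 'lower'")


# Artificial null distribution of 1,000 evenly spaced values.
null = list(range(1000))

print(is_significant(960, null, tails="upper"))  # exceeds 96% of the null values
print(is_significant(960, null, tails="two"))    # but not 97.5% of them
print(is_significant(990, null, tails="two"))
```

A statistic that exceeds 96% of the null values passes the one-tailed threshold but not the two-tailed one, which is why the choice of tails should be fixed before the test is run.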

Figure 1: Schematic illustrating different permutational schemes. a) Original data. Filled circles represent data values. Rows are coded by colour hue and columns by brightness / luminosity. b) Permutations within rows c) Permutation of whole rows d) Permutation of whole columns e) Permutation within columns.





Figure 2: The values of two population means, μ1 and μ2, are compared to a distribution of n population means generated from permuted data (μ1*...n). While μ1 is likely to be lower than what could be expected by chance (i.e. given randomised data) at most accepted cut-offs, μ2 appears to belong to the null distribution.

Implementations

    • The "permute" package offers a range of permutation functions originally found in CANOCO. It supports restricted permutational schemes familiar to ecologists. 
    • The "perm" package offers a set of functions allowing exact and asymptotic permutation tests. 
    • The "coin" package offers a range of functions, including some to build permutational distributions for conditional independence tests (Hothorn et al., 2008). 
    • The "lmPerm" package offers permutational alternatives for linear model approaches. Multivariate support is not implemented, however.

References
  • Anderson MJ (2001) Permutation tests for univariate and multivariate analysis of variance and regression. Can J Fish Aquat Sci. 58: 626–639.
  • Hothorn T, Hornik K, van de Wiel MA, Zeileis A (2008) Implementing a class of permutation tests: The coin package. J Stat Soft. 28(8): 1–23.
  • Knijnenburg TA, Wessels LFA, Reinders MJT, Shmulevich I (2009) Fewer permutations, more accurate P-values. Bioinformatics. 25(12): i161–i168.