Choosing the right measure

Your choice of (dis)similarity measure is likely to have a major impact on your results. Understanding how each measure affects your data, and which one is suitable, is an essential part of many analyses. The page below discusses some of these measures. If you're unable to decide on a measure, consider using our (dis)similarity wizard to help you decide what sort of measure may be most appropriate.
(Dis)similarity, distance, and dependence measures are powerful tools for determining ecological association and resemblance. Choosing an appropriate measure is essential, as it will strongly affect how your data are treated during analysis and what kinds of interpretation are meaningful. Non-metric multidimensional scaling, principal coordinate analysis, and cluster analysis are examples of analyses that are strongly influenced by the choice of (dis)similarity measure used. Note that while these measures may draw out certain types of relationships in your raw data, they may do so at the expense of other information present therein. Below, several key measures for assessing ecological resemblance are introduced. For a more complete overview, see chapter seven of Legendre & Legendre's Numerical Ecology (1998). For a critical view on the use of dissimilarity and distance measures, see Warton et al. (2012).
When choosing a distance measure, ensure that the measure reflects the ecological relationships you are concerned with. Further, some measures have mathematical properties that make them unsuitable for certain analyses. Similarly, certain analyses will only produce meaningful results when certain measures are used. If a measure listed below sounds suited to your data, use more detailed resources to learn about its properties and limitations before drawing any conclusions from analyses based upon it. The list below is not exhaustive, but aims to familiarise you with a set of commonly used measures and their uses.
Q mode similarity measures
As noted above, similarity measures (S) are never metric, thus objects cannot be ordinated in a metric or Euclidean space based on their similarities. Converting similarities to distances can allow such ordination. This can be done simply by taking their one-complement (1 − S) or its square root, √(1 − S). A few common measures are described below. For an extensive overview, see Legendre and Legendre (1998).
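As a minimal sketch, both conversions are one-liners in R; the similarity matrix `S` here is hypothetical.

```r
# A minimal sketch: converting a similarity matrix S (values in [0, 1])
# to dissimilarities suitable for ordination. The matrix S is hypothetical.
S <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3, byrow = TRUE)

D1 <- 1 - S        # one-complement
D2 <- sqrt(1 - S)  # square root of the one-complement; for many coefficients
                   # this form embeds in Euclidean space (Gower & Legendre, 1986)

as.dist(D2)  # strip the redundant diagonal and upper triangle
```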
Measure | Description | Reference
--- | --- | ---
Simple-matching coefficient | This coefficient gives equal weight to both forms of match - double zeros and double ones - and is thus a symmetrical coefficient. |
Jaccard coefficient | This coefficient excludes double zeros, giving equal weight to non-zero agreements ("1", "1") and disagreements ("1", "0" and "0", "1") when comparing two objects. Given a "sites x species" matrix, the Jaccard coefficient can be used to express species/OTU turnover. |
Sørensen / Dice coefficient | This coefficient is similar to the Jaccard coefficient; however, it gives double weight to non-zero agreements. This asserts that the co-occurrence of variable states among objects is more informative or important than disagreements. The coefficient is based on the logic of the harmonic mean and is thus suitable for data sets with large-valued outliers. It may, however, increase the influence of small-valued outliers. |
Other binary measures are available which treat double-zero agreements, double-one agreements, and disagreements differently for a variety of reasons. Consider carefully whether any special meaning is indicated by the different matching states of the binary variables in your data set, and ensure that the measure chosen adequately reflects these.
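To make this concrete, the sketch below computes a few binary (dis)similarities in R. The presence/absence matrix `pa` is hypothetical, and note that vegdist() reports one-complements (dissimilarities), not similarities; the `smc_dist` helper is an illustrative function of our own, not part of any package.

```r
library(vegan)  # provides vegdist()

# Hypothetical presence/absence matrix: 4 sites x 5 species
pa <- matrix(c(1, 1, 0, 0, 1,
               1, 0, 0, 1, 1,
               0, 0, 1, 1, 0,
               1, 1, 0, 0, 0), nrow = 4, byrow = TRUE)

# One-complement of the Jaccard similarity; binary = TRUE ensures the data
# are treated as presence/absence
vegdist(pa, method = "jaccard", binary = TRUE)

# The binary form of Bray-Curtis in vegan is the Sørensen dissimilarity
vegdist(pa, method = "bray", binary = TRUE)

# Simple matching counts double zeros as agreements; this small helper
# (hypothetical, not from any package) computes its one-complement
smc_dist <- function(x) {
  as.dist(outer(seq_len(nrow(x)), seq_len(nrow(x)),
                Vectorize(function(i, j) mean(x[i, ] != x[j, ]))))
}
smc_dist(pa)
```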
Quantitative measures
Quantitative coefficients take into account values other than "0" and "1". Some quantitative measures lessen the effect of relatively large or small variable values in a data set to preserve overall interpretability. However, other measures are sensitive to large quantitative differences and perform better on transformed data.
Measure | Description | Reference
--- | --- | ---
Gower coefficient | This coefficient may be used for heterogeneous data sets (i.e. data sets including numerous variable types). It calculates a partial similarity value of two objects for each variable describing them. The final similarity score is the average of all partial similarities. Binary, qualitative, semi-quantitative, and quantitative variables are treated differently. | Gower, 1971
Steinhaus coefficient | This asymmetric coefficient is widely used for raw count data. It compares the sum of the minimum, per-variable values between two objects to the average value of all variables describing these objects. If applied to binary data, it is equivalent to the Sørensen coefficient. The one-complement of this coefficient is the popular Bray-Curtis dissimilarity measure. |
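As a sketch of how these coefficients are typically computed in R (using hypothetical data throughout), daisy() from the cluster package handles the mixed-type case covered by the Gower coefficient, while vegdist() returns the Bray-Curtis dissimilarity, the one-complement of the Steinhaus coefficient.

```r
library(cluster)  # provides daisy()
library(vegan)    # provides vegdist()

# Hypothetical mixed data: one quantitative, one ordinal, one qualitative variable
mixed <- data.frame(
  biomass  = c(2.5, 0.4, 3.1, 1.2),
  moisture = ordered(c("low", "high", "mid", "low"),
                     levels = c("low", "mid", "high")),
  soil     = factor(c("clay", "sand", "clay", "peat"))
)

# Gower dissimilarity averages per-variable partial (dis)similarities and
# treats each variable type appropriately
daisy(mixed, metric = "gower")

# Hypothetical count matrix: 4 sites x 3 species
counts <- matrix(c(10, 0,  3,
                    8, 1,  0,
                    0, 5, 12,
                    2, 4,  9), nrow = 4, byrow = TRUE)

# Bray-Curtis dissimilarity, i.e. 1 - Steinhaus similarity
vegdist(counts, method = "bray")
```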
There are three groups of dissimilarity measures: metric, semimetric, and nonmetric. See the "Key terminology" section of this page for definitions.
Metric distances
Measure | Description | Reference
--- | --- | ---
Euclidean distance | A simple, symmetrical metric calculated with the Pythagorean formula. The more variables present in a data set, the larger one may expect Euclidean distances to be. Further, double zeros result in decreased distances. This property makes the Euclidean distance unsuitable for many ecological data sets, and ecologically motivated transformations should be considered. Principal components analysis and redundancy analysis ordinate objects using Euclidean distances. |
Chord distance | This asymmetric distance measure is simply the Euclidean distance calculated on a row-standardised matrix (see the chord transformation). Rather than comparing absolute values, the chord distance compares objects based on the proportion of a given value to the sum of all variable values across the row corresponding to that object. Thus, even if objects have different raw values for two or more variables, as long as these values are proportionately equivalent when standardised, the sites will be considered similar. The chord distance is insensitive to double zeros. | Orlóci, 1967
Mahalanobis distance | Appropriate for comparing groups of objects described by the same variables, this coefficient eliminates the effect of correlations between variables and is arrived at through the calculation of a covariance matrix from the input matrix. It also eliminates differences in scale between variables. Alternative forms of this measure may be used to calculate the distance between a group and a single object. | Mahalanobis, 1936
Coefficient of racial likeness | Appropriate for comparing groups of objects described by the same variables, this coefficient does not eliminate the effect of correlations between variables. This may be desirable when samples are too small to effectively remove correlative effects (see e.g. Penrose, 1952). | Pearson, 1926
χ² metric | The calculation of this asymmetric metric transforms a matrix of quantitative values into a matrix of conditional probabilities (i.e. the quotient of a given value in a cell and either the row or column totals). A weighted Euclidean distance is then computed based on the values in the rows (or columns, in R mode analysis) of the conditional probability matrix. The weights, which are the reciprocals of the variable (column) totals from the raw data matrix, serve to reduce the influence of the highest values measured. |
χ² distance | This asymmetric distance is similar to the χ² metric; however, the weighted Euclidean distances are multiplied by the total of all values in the raw data matrix. This converts the weights in the Euclidean distances to probabilities rather than column totals. This is the measure used in correspondence analysis and related analyses. | Lebart & Fénelon, 1971
Hellinger distance | This asymmetric distance is similar to the χ² metric. While no weights are applied, the square roots of conditional probabilities are used as a variance-stabilising data transformation. This distance measure performs well in linear ordination. Variables with few non-zero counts (such as rare species) are given lower weights. | Hellinger, 1909; Rao, 1995
Manhattan metric | Similar to the Euclidean distance; however, rather than using the Pythagorean formula, the Manhattan distance simply sums the absolute differences across pairs of variable values for a given pair of objects. Just like the Euclidean distance, this metric suffers from the double-zero problem, and distances reported will increase with the number of variables assessed. |
Canberra metric | This metric excludes double zeros and increases the effect of differences between variables with low values or many zeros. | Lance & Williams, 1966
Jaccard distance | The one-complement of the Jaccard similarity (described above) is a metric distance. |
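Several of the distances above can be obtained in R as Euclidean distances on transformed data. The sketch below (with a hypothetical count matrix) uses vegan's decostand() for the chord, Hellinger, and χ² transformations; per vegan's documentation, the chi.square transformation paired with Euclidean distance approximates the χ² distance used in correspondence analysis.

```r
library(vegan)  # provides decostand()

# Hypothetical site-by-species count matrix
counts <- matrix(c(10, 0,  3,
                    8, 1,  0,
                    0, 5, 12,
                    2, 4,  9), nrow = 4, byrow = TRUE)

dist(counts)                       # Euclidean distance (stats package)
dist(counts, method = "manhattan") # Manhattan metric
dist(counts, method = "canberra")  # Canberra metric

# Chord distance: Euclidean distance on row-normalised (chord-transformed) data
dist(decostand(counts, method = "normalize"))

# Hellinger distance: Euclidean distance on Hellinger-transformed data
dist(decostand(counts, method = "hellinger"))

# χ² distance: Euclidean distance on chi-square-transformed data
dist(decostand(counts, method = "chi.square"))
```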
Semimetric measures
As described above, semimetric measures do not always satisfy the triangle inequality and hence cannot be fully relied upon to represent dissimilarities in a Euclidean space without appropriate transformation. That said, they often do behave metrically and can be used in principal coordinates analysis (following an adjustment for negative eigenvalues, if necessary) and non-metric multidimensional scaling.
Measure | Description | Reference
--- | --- | ---
Bray-Curtis dissimilarity | This is an asymmetrical measure often used for raw count data. It is the one-complement of the Steinhaus similarity coefficient and a popular measure of dissimilarity in ecology. This measure treats differences between high and low variable values equally. | Bray & Curtis, 1957
Sørensen dissimilarity | The one-complement of the Sørensen similarity coefficient (described above) is a semimetric dissimilarity measure. |
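To illustrate the remark about negative eigenvalues, a minimal sketch: Bray-Curtis dissimilarities are passed to stats::cmdscale() for principal coordinates analysis, with add = TRUE applying an additive constant so that the semimetric input yields no negative eigenvalues. The count matrix is hypothetical, and a real data set would be larger.

```r
library(vegan)  # provides vegdist() and metaMDS()

counts <- matrix(c(10, 0,  3,
                    8, 1,  0,
                    0, 5, 12,
                    2, 4,  9), nrow = 4, byrow = TRUE)

bc <- vegdist(counts, method = "bray")  # semimetric Bray-Curtis dissimilarities

# Principal coordinates analysis; add = TRUE applies an additive constant
# so that no negative eigenvalues arise from the semimetric input
pcoa <- cmdscale(bc, k = 2, eig = TRUE, add = TRUE)
pcoa$points  # object coordinates in the ordination space

# Non-metric multidimensional scaling accepts semimetric dissimilarities directly
nmds <- metaMDS(counts, distance = "bray", trace = FALSE)
```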
Nonmetric measures
As noted by Legendre and Legendre (1998), nonmetric dissimilarity measures, such as the binary coefficient proposed by Kulczynski (1928), which is the quotient of double presences and disagreements, may assume negative values. As negative dissimilarities are intuitively nonsensical, they are problematic for interpretation. In general, these should be avoided unless there is a very clear reason to use them.
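For illustration only (not a recommendation to use the measure), vegan's designdist() can build Kulczynski's binary coefficient directly from its formula; the presence/absence matrix below is hypothetical.

```r
library(vegan)  # designdist() builds dissimilarities from a user-given formula

pa <- matrix(c(1, 1, 0, 0, 1,
               1, 0, 0, 1, 1,
               0, 0, 1, 1, 0,
               1, 1, 0, 0, 0), nrow = 4, byrow = TRUE)

# Kulczynski's (1928) binary coefficient: double presences (a) over
# disagreements (b + c); note its one-complement can fall below zero,
# which is exactly the interpretive problem noted above
designdist(pa, method = "a/(b+c)", abcd = TRUE)
```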
R mode measures of dependence
R mode measures express the relationships between variables. With some exceptions, Q mode measures are generally not useful or meaningful in R mode analysis. See Legendre and Legendre (1998) and Ludwig and Reynolds (1988) for an explanation of what constitutes a permissible R mode measure. R mode measures are often referred to as dependence coefficients, as they express how much the values of one variable can be said to depend on the states of another. Well-known correlation measures are examples of R mode measures.
Measure | Description | Reference
--- | --- | ---
Pearson's r | This familiar measure of linear correlation between two variables is suitable only for detecting linear relationships. It is the covariance of two variables divided by the product of their standard deviations. If your variables have many zeros, this correlation coefficient will not be reliable, as double zeros will be understood as an "agreement" when, in fact, they simply reflect the absence of an observation. This will inflate the correlation coefficient. |
Spearman's ρ | This is a non-parametric measure of correlation which uses ranks rather than the original variable values. Variables should have monotonic relationships: that is, their ranks should either go up or down across objects, but not necessarily in a linear fashion. Like Pearson's r, Spearman's ρ is based on the principle of least squares, but is concerned with how strongly the rankings of two variables disagree. The larger the disagreement, the lower the ρ value. This statistic is sensitive to large disagreements: if one variable ranks an object as "1" and another variable ranks the same object as "100", the correlation reported by Spearman's ρ will be strongly affected (relative to Kendall's τ, for example), even if these variables agree on all other ranks. This measure is suitable for raw or standardised abundance data and any monotonically related variables. |
Kendall's τ | Like Spearman's ρ, Kendall's τ uses ranked values to calculate correlation. This measure, however, is not based on the principle of least squares and instead expresses the degree of concordance between two rankings. The τ statistic is the quotient of 1) the difference between the numbers of concordant and discordant pairs (i.e. ranks that agree and ranks that differ) and 2) the total number of pairs compared. This statistic is not sensitive to the scale of the disagreement. As above, variables should have monotonic relationships. This measure is suitable for raw or standardised abundance data and any monotonically related variables. |
χ² similarity, metric, and distance | The χ² similarity, metric, and distance measures (see above for descriptions) may also be used in R mode analysis. These are useful when monotonic relationships are not present, and are appropriate for raw abundance data as well as qualitative and ordinal data. |
Hellinger distance | Described above, the Hellinger distance is useful for variables populated with abundance data. |
Symmetric uncertainty coefficient | This coefficient is based on the logic of information theory. It expresses the amount of information shared between two variables using contingency tables and Shannon's information formula. Resorting to contingency tables is useful when dealing with qualitative variables with no monotonic relationships. Probabilities of association can be calculated and then translated into measures of dependence. Legendre and Legendre (1998) offer a developed discussion of information theory in numerical ecology. |
Jaccard coefficient | Described above, this coefficient excludes double zeros, giving equal weight to non-zero agreements ("1", "1") and disagreements ("1", "0" and "0", "1"); in R mode, it compares two binary variables rather than two objects. |
Dice coefficient | Described above, this coefficient is similar to the Jaccard coefficient but gives double weight to non-zero agreements, asserting that the co-occurrence of variable states is more informative or important than disagreements. |
Ochiai index | The Ochiai index is the quotient of the number of non-zero agreements ("1", "1") between two variables and the product of the square roots of the sums of non-zero agreements with each form of disagreement (i.e. "0", "1" and "1", "0"). Thus, this measure is based on the logic of the geometric mean, and values with different ranges will be normalised before a central value is proposed. It is particularly suitable when the ranges and variance of agreements and disagreements are very different from one another. |
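For the three correlation measures above, base R's cor() is sufficient; the sketch below assumes a hypothetical variables-in-columns matrix.

```r
# Hypothetical matrix with variables (e.g. species) as columns
set.seed(1)
x <- cbind(sp1 = rpois(10, 5),
           sp2 = rpois(10, 2),
           sp3 = rpois(10, 8))

cor(x, method = "pearson")   # linear relationships only
cor(x, method = "spearman")  # rank-based; monotonic relationships
cor(x, method = "kendall")   # concordance of rankings; robust to large rank gaps
```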
Implementations
- R
  - vegdist() in the vegan package
  - dist() in the stats package (base R)
  - distance() or bcdist() in the ecodist package
  - daisy() in the cluster package can compute a Gower index for both quantitative and categorical variables
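A brief usage sketch for the ecodist functions listed above, assuming a hypothetical count matrix; the method name "bray-curtis" follows ecodist's documentation.

```r
library(ecodist)  # provides distance() and bcdist()

counts <- matrix(c(10, 0,  3,
                    8, 1,  0,
                    0, 5, 12,
                    2, 4,  9), nrow = 4, byrow = TRUE)

distance(counts, method = "bray-curtis")  # general-purpose distance interface
bcdist(counts)                            # shortcut for Bray-Curtis
```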
References
- Bray JR, Curtis JT (1957) An ordination of upland forest communities of southern Wisconsin. Ecol Monogr. 27:325-349.
- Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics. 27(4):857-871.
- Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif. 3(1):5-48.
- Hellinger E (1909) Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J Reine Angew Math. 136:210-271.
- Kulczynski S (1928) Die Pflanzenassoziationen der Pieninen. Bull Int Acad Pol Sci Lett Cl Sci Math Nat Ser B. Suppl. II (1927):57-203.
- Lance GN, Williams WT (1966) Computer programs for hierarchical polythetic classification ("similarity analysis"). Comput J. 9:60-64.
- Legendre P, Legendre L (1998) Numerical Ecology. 2nd ed. Amsterdam: Elsevier. ISBN 978-0444892508.
- Ludwig JA, Reynolds JF (1988) Statistical Ecology: A Primer on Methods and Computing. New York: Wiley.
- Mahalanobis PC (1936) On the generalised distance in statistics. Proc Natl Inst Sci India. 2(1):49-55.
- Orlóci L (1967) An agglomerative method for classification of plant communities. J Ecol. 55:193-205.
- Pearson K (1926) On the coefficient of racial likeness. Biometrika. 18:105-117.
- Penrose LS (1952) Distance, size and shape. Ann Eugen. 17(1):337-343.
- Rao CR (1995) The use of Hellinger distance in graphical displays of contingency table data. In: Tiit EM, Kollo T, Niemi H (eds) Multivariate Statistics and Matrices in Statistics: Proceedings of the 5th Tartu Conference, Tartu, Pühajärve, Estonia, 23-25 May 1994. Zeist: VSP BV. ISBN 90-6764-195-2.
- Warton DI, Wright ST, Wang Y (2012) Distance-based multivariate analyses confound location and dispersion effects. Methods Ecol Evol. 3(1):89-101.