Qualitative variation
An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. Examples include the variation ratio and the information entropy.

Properties

There are several types of indices used for the analysis of nominal data. Several are standard statistics that are used elsewhere: range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation. In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox (Wilcox 1967), (Wilcox 1973), who requires the following standardization properties to be satisfied:

- Variation should vary between 0 and 1.
- Variation should be 0 if and only if all cases belong to a single category.
- Variation should be 1 if and only if cases are evenly divided across all categories.
In particular, the value of these standardized indices does not depend on the number of categories or the number of samples. For any index, the closer the distribution is to uniform, the larger the variation, and the larger the differences in frequencies across categories, the smaller the variation.

Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation. One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.

Wilcox's indexes

Wilcox gives a number of formulae for various indices of QV (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.

ModVR

The formula for the variation around the mode (ModVR) is derived as follows:

$$M = \sum_{i=1}^{K} (f_m - f_i)$$

where $f_m$ is the modal frequency, $K$ is the number of categories and $f_i$ is the frequency of the $i$th group. This can be simplified to

$$M = K f_m - N$$

where $N$ is the total size of the sample.

Freeman's index (or variation ratio) is[2]

$$v = 1 - \frac{f_m}{N}$$

This is related to $M$ as follows:

$$v = 1 - \frac{M + N}{KN}$$

The ModVR is defined as

$$\text{ModVR} = \frac{K(N - f_m)}{N(K - 1)} = \frac{Kv}{K - 1}$$

where $v$ is Freeman's index. Low values of ModVR correspond to a small amount of variation and high values to larger amounts of variation. When $K$ is large, ModVR is approximately equal to Freeman's index $v$.

RanVR

This is based on the range around the mode. It is defined to be

$$\text{RanVR} = 1 - \frac{f_m - f_l}{f_m} = \frac{f_l}{f_m}$$

where $f_m$ is the modal frequency and $f_l$ is the lowest frequency.

AvDev

This is an analog of the mean deviation. It is defined in terms of the arithmetic mean of the absolute differences of each value from the mean:

$$\text{AvDev} = 1 - \frac{1}{2N} \frac{K}{K-1} \sum_{i=1}^{K} \left| f_i - \frac{N}{K} \right|$$

MNDif

This is an analog of the mean difference – the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it depends on the spread of the variate values among themselves rather than on the deviations from some central value:[3]

$$\text{MNDif} = 1 - \frac{1}{N(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \left| f_i - f_j \right|$$

where $f_i$ and $f_j$ are the $i$th and $j$th frequencies respectively. The MNDif is the Gini coefficient applied to qualitative data.

VarNC

This is an analog of the variance:

$$\text{VarNC} = 1 - \frac{1}{N^2} \frac{K}{K-1} \sum_{i=1}^{K} \left( f_i - \frac{N}{K} \right)^2$$

It is the same index as Mueller and Schussler's Index of Qualitative Variation[4] and Gibbs' M2 index. It is distributed as a chi square variable with K – 1 degrees of freedom.[5]

StDev

Wilson has suggested two versions of this statistic. The first is based on AvDev; the second is based on MNDif.

HRel

This index was originally developed by Claude Shannon for use in specifying the properties of communication channels:

$$\text{HRel} = \frac{-\sum_{i=1}^{K} p_i \log_2 p_i}{\log_2 K}$$

where $p_i = f_i / N$. This is equivalent to information entropy divided by $\log_2 K$, its maximum possible value, and is useful for comparing relative variation between frequency tables of multiple sizes.

B index

Wilcox adapted a proposal of Kaiser[6] based on the geometric mean and created the B' index. The B index is defined as

$$B = 1 - \sqrt{1 - \left( \sqrt[K]{\prod_{i=1}^{K} \frac{f_i K}{N}} \right)^2}$$

R packages

Several of these indices have been implemented in the R language.[7]

Gibbs' indices and related formulae

Gibbs & Poston Jr (1975) proposed six indexes.[8]

M1

The unstandardized index (M1) (Gibbs & Poston Jr 1975, p. 471) is

$$M1 = 1 - \sum_{i=1}^{K} p_i^2$$

where $K$ is the number of categories and $p_i$ is the proportion of observations that fall in a given category $i$. M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category,[9] so this formula for IQV is a standardized likelihood of a random pair falling in the same category.
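Because the indices defined so far are simple functions of a frequency table, they are easy to compute directly. The following Python sketch implements the Wilcox formulas as reconstructed above; the function name and example frequencies are illustrative only, and at least two categories are assumed.

```python
from math import log2

def wilcox_indices(freqs):
    """Several of Wilcox's indices from a list of category frequencies.

    A minimal sketch following the formulas above; assumes K >= 2 and
    all frequencies non-negative with a positive total."""
    K = len(freqs)                            # number of categories
    N = sum(freqs)                            # total sample size
    fm = max(freqs)                           # modal frequency
    fl = min(freqs)                           # lowest frequency

    v = 1 - fm / N                            # Freeman's index (variation ratio)
    modvr = K * (N - fm) / (N * (K - 1))      # variation around the mode
    ranvr = fl / fm                           # range around the mode
    avdev = 1 - (K / (K - 1)) * sum(abs(f - N / K) for f in freqs) / (2 * N)
    mndif = 1 - sum(abs(freqs[i] - freqs[j])
                    for i in range(K) for j in range(i + 1, K)) / (N * (K - 1))
    varnc = 1 - (K / (K - 1)) * sum((f - N / K) ** 2 for f in freqs) / N ** 2
    hrel = -sum((f / N) * log2(f / N) for f in freqs if f > 0) / log2(K)

    return {"v": v, "ModVR": modvr, "RanVR": ranvr,
            "AvDev": avdev, "MNDif": mndif, "VarNC": varnc, "HRel": hrel}

# A uniform table drives the standardized indices to 1 (Freeman's v is 1 - 1/K);
# a table concentrated in one category drives them toward 0.
print(wilcox_indices([10, 10, 10, 10]))
print(wilcox_indices([37, 1, 1, 1]))
```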
The M1 index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.

M2

A second index is the M2[10] (Gibbs & Poston Jr 1975, p. 472):

$$M2 = \frac{K}{K-1} \left( 1 - \sum_{i=1}^{K} p_i^2 \right)$$

where $K$ is the number of categories and $p_i$ is the proportion of observations that fall in a given category $i$. The factor of $K/(K-1)$ is for standardization.

M1 and M2 can be interpreted in terms of the variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.

M4

The M4 index is

$$M4 = 1 - \frac{\sum_{i=1}^{K} |X_i - m|}{2 \sum_{i=1}^{K} X_i}$$

where $m$ is the mean.

M6

The formula for M6 is

$$M6 = K \left[ 1 - \frac{\sum_{i=1}^{K} |X_i - m|}{2N} \right]$$

where $K$ is the number of categories, $X_i$ is the number of data points in the $i$th category, $N$ is the total number of data points, $|\cdot|$ is the absolute value (modulus) and $m$ is the mean. This formula can be simplified by substituting $p_i = X_i / N$, the proportion of the sample in the $i$th category.

In practice M1 and M6 tend to be highly correlated, which militates against their combined use.

Related indices

The sum

$$\sum_{i=1}^{K} p_i^2$$

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics. A variant of it is known as the Hunter–Gaston index in microbiology.[11]

In linguistics and cryptanalysis this sum is known as the repeat rate. The index of coincidence (IC) is an unbiased estimator of this statistic:[12]

$$IC = \frac{\sum_{i} f_i (f_i - 1)}{n(n-1)}$$

where $f_i$ is the count of the $i$th grapheme in the text and $n$ is the total number of graphemes in the text.
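Since the Simpson index, the HHI and the repeat rate are all the same sum of squared proportions, one small function covers them, and the unbiased IC estimator differs only in using $f(f-1)$ terms. A minimal sketch, with illustrative names and data:

```python
def herfindahl(freqs):
    """Sum of squared proportions: Simpson index / HHI / repeat rate."""
    N = sum(freqs)
    return sum((f / N) ** 2 for f in freqs)

def index_of_coincidence(freqs):
    """Unbiased estimator of the repeat rate from grapheme counts."""
    n = sum(freqs)
    return sum(f * (f - 1) for f in freqs) / (n * (n - 1))

# Grapheme counts from a short text; the unbiased IC sits slightly below
# the naive sum of squared proportions for small samples.
counts = [8, 5, 4, 2, 1]
print(herfindahl(counts))            # 0.275
print(index_of_coincidence(counts))  # ~0.237
```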
The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,[13] Simpson's measure of diversity,[14] Bachi's index of linguistic homogeneity,[15] Mueller and Schuessler's index of qualitative variation,[16] Gibbs and Martin's index of industry diversification,[17] Lieberson's index[18] and Blau's index in sociology, psychology and management studies.[19] The formulations of all these indices are identical.

Simpson's D is defined as

$$D = 1 - \frac{\sum_{i=1}^{K} n_i (n_i - 1)}{n(n - 1)}$$

where $n$ is the total sample size and $n_i$ is the number of items in the $i$th category. For large $n$ we have

$$D \approx 1 - \sum_{i=1}^{K} p_i^2$$

Another statistic that has been proposed is the coefficient of unalikeability, which ranges between 0 and 1:[20]

$$u = \frac{\sum_{x \neq y} c(x, y)}{n^2 - n}$$

where $n$ is the sample size and $c(x, y) = 1$ if $x$ and $y$ are unalike and 0 otherwise. For large $n$ we have

$$u \approx 1 - \sum_{i=1}^{K} p_i^2$$

where $K$ is the number of categories.

Another related statistic is the quadratic entropy, which is itself related to the Gini index.
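The equivalence between the pair-counting definition of unalikeability and the finite-sample form of Simpson's D can be checked directly. A minimal sketch; the function names and data are ours:

```python
from itertools import combinations

def simpsons_D(counts):
    """Finite-sample Simpson's D: the probability that two items drawn
    without replacement belong to different categories."""
    n = sum(counts)
    return 1 - sum(c * (c - 1) for c in counts) / (n * (n - 1))

def unalikeability(labels):
    """Coefficient of unalikeability: the proportion of ordered pairs
    (x, y), x != y, whose labels differ."""
    n = len(labels)
    unalike = sum(2 for x, y in combinations(labels, 2) if x != y)
    return unalike / (n * n - n)

data = ["a"] * 6 + ["b"] * 3 + ["c"]
print(simpsons_D([6, 3, 1]))   # 0.6
print(unalikeability(data))    # 0.6 - identical, as expected
```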
Greenberg's monolingual non-weighted index of linguistic diversity[21] is the M2 statistic defined above.
Another index – the M7 – was created based on the M4 index of Gibbs & Poston Jr (1975).[22] It is defined in terms of K, the number of categories; L, the number of subtypes; $O_{ij}$ and $E_{ij}$, the observed and expected numbers respectively of subtype $j$ in the $i$th category; $n_i$, the number in the $i$th category; and $p_j$, the proportion of subtype $j$ in the complete sample.

Note: this index was designed to measure women's participation in the workplace; the two subtypes it was developed for were male and female.

Other single sample indices

These indices are summary statistics of the variation within the sample.

Berger–Parker index

The Berger–Parker index, named after Wolfgang H. Berger and Frances Lawrence Parker, equals the maximum value in the dataset, i.e. the proportional abundance of the most abundant type.[23] This corresponds to the weighted generalized mean of the values when q approaches infinity, and hence equals the inverse of true diversity of order infinity (1/∞D).

Brillouin index of diversity

This index is strictly applicable only to entire populations rather than to finite samples. It is defined as

$$I_B = \frac{1}{N} \ln \frac{N!}{\prod_{i=1}^{K} n_i!}$$

where $N$ is the total number of individuals in the population, $n_i$ is the number of individuals in the $i$th category and $N!$ is the factorial of $N$. Brillouin's index of evenness is defined as

$$E_B = \frac{I_B}{I_{B(\max)}}$$

where $I_{B(\max)}$ is the maximum value of $I_B$.

Hill's diversity numbers

Hill suggested a family of diversity numbers[24]

$$N_a = \left( \sum_{i=1}^{K} p_i^a \right)^{1/(1-a)}$$

For given values of a, several of the other indices can be computed:

- a = 0: $N_0 = K$, the number of categories
- a = 1: $N_1 = e^H$ (taking the limit as a → 1), where H is the Shannon entropy
- a = 2: $N_2 = 1 / \sum p_i^2$, the reciprocal of the Simpson index
- a = ∞: $N_\infty = 1 / \max_i p_i$, the reciprocal of the Berger–Parker index
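Before turning to Hill's evenness measures below, the diversity numbers themselves can be sketched in a few lines; the a → 1 case is handled through its Shannon-entropy limit, and the Brillouin index uses log-gamma to avoid huge factorials. A minimal sketch with illustrative names:

```python
from math import exp, log, lgamma

def hill_number(p, a):
    """Hill's diversity number N_a = (sum p_i^a)^(1/(1-a)); the a -> 1
    limit is exp(Shannon entropy)."""
    if a == 1:
        return exp(-sum(pi * log(pi) for pi in p if pi > 0))
    return sum(pi ** a for pi in p if pi > 0) ** (1 / (1 - a))

def brillouin(counts):
    """Brillouin's index I_B = (ln N! - sum ln n_i!) / N."""
    N = sum(counts)
    return (lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)) / N

p = [0.5, 0.3, 0.2]
print(hill_number(p, 0))   # 3.0: the number of categories
print(hill_number(p, 2))   # reciprocal of the Simpson index
print(1 / max(p))          # reciprocal Berger-Parker: the a -> infinity limit
print(brillouin([50, 30, 20]))
```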
Hill also suggested a family of evenness measures

$$E_{a,b} = \frac{N_a}{N_b}$$

where a > b. Hill's E4 is

$$E_4 = \frac{N_2}{N_1}$$

Hill's E5 is

$$E_5 = \frac{N_2 - 1}{N_1 - 1}$$

Margalef's index

$$I_{Mg} = \frac{S - 1}{\ln N}$$

where S is the number of data types in the sample and N is the total size of the sample.[25]

Menhinick's index

$$I_{Mn} = \frac{S}{\sqrt{N}}$$

where S is the number of data types in the sample and N is the total size of the sample.[26]

In linguistics this index is identical to the Kuraszkiewicz index (Guiard index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined.[27][28] This index can be derived as a special case of the Generalised Torquist function.[29]

Q statistic

This is a statistic invented by Kempton and Taylor[30] and involves the quartiles of the sample. It is defined as

$$Q = \frac{\tfrac{1}{2} n_{R_1} + \sum_{j=R_1+1}^{R_2-1} n_j + \tfrac{1}{2} n_{R_2}}{\ln(R_2 / R_1)}$$

where R1 and R2 are the 25% and 75% quartiles respectively on the cumulative species curve, nj is the number of species in the jth category and nRi is the number of species in the class where Ri falls (i = 1 or 2).

Shannon–Wiener index

This is taken from information theory:

$$H = -\sum_{i=1}^{K} p_i \ln p_i$$

where N is the total number in the sample and pi is the proportion in the ith category. In ecology, where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.

An approximate formula for the standard deviation (SD) of H is

$$SD(H) = \frac{1}{\sqrt{N}} \left[ \sum_{i=1}^{K} p_i (\ln p_i)^2 - H^2 \right]^{1/2}$$

where pi is the proportion made up by the ith category and N is the total in the sample.

A more accurate approximate value of the variance of H (var(H)) is given by[31]

$$\operatorname{var}(H) = \frac{\sum p_i (\ln p_i)^2 - \left( \sum p_i \ln p_i \right)^2}{N} + \frac{K - 1}{2N^2}$$

where N is the sample size and K is the number of categories.

A related index is the Pielou J, defined as

$$J = \frac{H}{\ln S}$$

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.

Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to values of q other than unity. It can be expressed:

$${}^qH = \frac{1}{1 - q} \ln \sum_{i=1}^{K} p_i^q$$

which equals

$${}^qH = \ln \left( \left( \sum_{i=1}^{K} p_i^q \right)^{1/(1-q)} \right) = \ln({}^qD)$$

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q. The value of ${}^qD$ is also known as the Hill number.[24]

McIntosh's D and E

McIntosh proposed a measure of diversity:[32]

$$U = \sqrt{\sum_{i=1}^{K} n_i^2}$$

where ni is the number in the ith category and K is the number of categories. He also proposed several normalized versions of this index. First is D:

$$D = \frac{N - U}{N - \sqrt{N}}$$

where N is the total sample size. This index has the advantage of expressing the observed diversity as a proportion of the absolute maximum diversity at a given N.

Another proposed normalization is E – the ratio of observed diversity to the maximum possible diversity of a given N and K (i.e., if all species are equal in number of individuals):

$$E = \frac{N - U}{N - \frac{N}{\sqrt{K}}}$$
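The Shannon-based quantities above reduce to a few lines of code. A minimal sketch; the variance uses the corrected approximation quoted above, and the function names are ours:

```python
from math import log

def shannon_H(counts):
    """Shannon-Wiener index H = -sum p_i ln p_i."""
    N = sum(counts)
    return -sum((c / N) * log(c / N) for c in counts if c > 0)

def var_H(counts):
    """Approximate variance of H, including the (K - 1)/(2N^2) correction."""
    N = sum(counts)
    K = len(counts)
    p = [c / N for c in counts if c > 0]
    H = -sum(pi * log(pi) for pi in p)
    return (sum(pi * log(pi) ** 2 for pi in p) - H ** 2) / N + (K - 1) / (2 * N ** 2)

def renyi(counts, q):
    """Renyi entropy of order q (q != 1): the log of the Hill number."""
    N = sum(counts)
    p = [c / N for c in counts if c > 0]
    return log(sum(pi ** q for pi in p)) / (1 - q)

counts = [40, 30, 20, 10]
H = shannon_H(counts)
print(H, var_H(counts))
print(H / log(len(counts)))   # Pielou's J, with S taken as the observed K
print(renyi(counts, 2))       # ln(1 / Simpson index)
```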
Fisher's alpha

This was the first index to be derived for diversity.[33] It is defined implicitly by

$$K = \alpha \ln\left(1 + \frac{N}{\alpha}\right)$$

where K is the number of categories and N is the number of data points in the sample. Fisher's α has to be estimated numerically from the data.

The expected number of individuals in the rth category, where the categories have been placed in increasing size, is

$$E(n_r) = \alpha \frac{X^r}{r}$$

where X is an empirical parameter lying between 0 and 1. While X is best estimated numerically, an approximate value can be obtained by solving the following two equations:

$$N = \frac{\alpha X}{1 - X}$$

$$K = -\alpha \ln(1 - X)$$

where K is the number of categories and N is the total sample size. An approximate formula for the variance of α has also been derived.[34]

Strong's index

This index (Dw) is the distance between the Lorenz curve of the species distribution and the 45 degree line. It is closely related to the Gini coefficient.[35] In symbols it is

$$D_w = \max_i \left[ \frac{c_i}{N} - \frac{i}{K} \right]$$

where max() is the maximum value taken over the N data points, K is the number of categories (or species) in the data set and ci is the cumulative total up to and including the ith category.

Simpson's E

This is related to Simpson's D and is defined as

$$E = \frac{1/D}{K}$$

where D is Simpson's D and K is the number of categories in the sample.

Smith & Wilson's indices

Smith and Wilson suggested a number of indices based on Simpson's D; each is expressed in terms of D and K, the number of categories.

Heip's index

$$E = \frac{e^H - 1}{K - 1}$$

where H is the Shannon entropy and K is the number of categories. This index is closely related to Sheldon's index, which is

$$E = \frac{e^H}{K}$$

where H is the Shannon entropy and K is the number of categories.
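Returning to Fisher's α: the defining relation K = α ln(1 + N/α) is monotone in α, so the numerical estimation mentioned above can be done by simple bisection. A minimal sketch; the bracketing interval and tolerance are our assumptions and suit 1 < K < N:

```python
from math import log

def fisher_alpha(K, N, tol=1e-10):
    """Solve K = alpha * ln(1 + N/alpha) for alpha by bisection.

    A minimal numerical sketch; assumes 1 < K < N so that the root is
    bracketed by the interval below."""
    f = lambda a: a * log(1 + N / a) - K
    lo, hi = 1e-9, 1e9            # f(lo) < 0 < f(hi) for 1 < K < N
    for _ in range(200):          # halve the bracket each step
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# 50 species among 1000 individuals:
alpha = fisher_alpha(50, 1000)
print(alpha, alpha * log(1 + 1000 / alpha))   # the second value recovers K = 50
```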
Camargo's index

This index was created by Camargo in 1993.[36] It is defined as

$$E = 1 - \sum_{i=1}^{K} \sum_{j=i+1}^{K} \frac{|p_i - p_j|}{K}$$

where K is the number of categories and pi is the proportion in the ith category.

Smith and Wilson's B

This index was proposed by Smith and Wilson in 1996.[37] It is a function of θ, the slope of the log(abundance)-rank curve.

Nee, Harvey, and Cotgreave's index

This is the slope of the log(abundance)-rank curve.

Bulla's E

There are two versions of this index – one for continuous distributions (Ec) and the other for discrete (Ed).[38] Both are functions of the Schoener–Czekanoski index, K, the number of categories, and N, the sample size.

Horn's information theory index

This index (Rik) is based on Shannon's entropy.[39] It is defined in terms of xij and xkj, the number of times the jth data type appears in the ith or kth sample respectively.

Rarefaction index

In a rarefied sample a random subsample n is chosen from the total N items. In this sample some groups may be necessarily absent. Let $K_n$ be the number of groups still present in the subsample of n items. $K_n$ is less than K, the number of categories, whenever at least one group is missing from this subsample. The rarefaction curve $f(n)$ is defined as the expected value of $K_n$:

$$f(n) = K - \binom{N}{n}^{-1} \sum_{i=1}^{K} \binom{N - N_i}{n}$$

Note that 0 ≤ f(n) ≤ K. Furthermore, f(0) = 0, f(1) = 1 and f(N) = K.

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.[40] This index is discussed further in Rarefaction (ecology).

Caswell's V

This is a z-type statistic based on Shannon's entropy:[41]

$$V = \frac{H - E(H)}{SD(H)}$$

where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou:

$$SD(H) = \frac{1}{\sqrt{N}} \left[ \sum_{i=1}^{K} p_i (\ln p_i)^2 - H^2 \right]^{1/2}$$

where pi is the proportion made up by the ith category and N is the total in the sample.

Lloyd & Ghelardi's index

This is

$$J = \frac{K'}{K}$$

where K is the number of categories and K' is the number of categories according to MacArthur's broken stick model yielding the observed diversity.

Average taxonomic distinctness index

This index is used to compare the relationship between hosts and their parasites.[42] It incorporates information about the phylogenetic relationship amongst the host species:

$$\Delta^+ = \frac{2 \sum \sum_{i<j} \omega_{ij}}{s(s - 1)}$$

where s is the number of host species used by a parasite and ωij is the taxonomic distinctness between host species i and j.

Index of qualitative variation

Several indices with this name have been proposed. One of these is

$$IQV = \frac{K}{K - 1} \left( 1 - \sum_{i=1}^{K} p_i^2 \right)$$

where K is the number of categories and pi is the proportion of the sample that lies in the ith category.
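Returning to the rarefaction curve defined above: because f(n) is an expectation under sampling without replacement, it can be evaluated exactly with binomial coefficients. A minimal sketch with illustrative names:

```python
from math import comb

def rarefaction(counts, n):
    """Expected number of groups in a random subsample of size n drawn
    without replacement: f(n) = K - C(N, n)^(-1) * sum C(N - N_i, n)."""
    K = len(counts)
    N = sum(counts)
    return K - sum(comb(N - Ni, n) for Ni in counts) / comb(N, n)

counts = [50, 30, 15, 4, 1]
print(rarefaction(counts, 1))             # 1.0, as required
print(rarefaction(counts, 10))            # expected richness at n = 10
print(rarefaction(counts, sum(counts)))   # 5.0 = K, the full sample
```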
Theil's H

This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972.[43] The index is a weighted average of the sample entropies. Let

$$E_a = -\sum_{i} p_{ia} \ln p_{ia}$$

and

$$E = -\sum_{i} p_i \ln p_i$$

Then

$$H = \sum_{a=1}^{r} \frac{n_a (E - E_a)}{E N}$$

where $p_{ia}$ is the proportion of type i in the ath sample, r is the total number of samples, $n_a$ is the size of the ath sample, N is the size of the population from which the samples were obtained and E is the entropy of the population.

Indices for comparison of two or more data types within a single sample

Several of these indices have been developed to document the degree to which different data types of interest may coexist within a geographic area.

Index of dissimilarity

Let A and B be two types of data item. Then the index of dissimilarity is

$$D = \frac{1}{2} \sum_{i=1}^{K} \left| \frac{A_i}{A} - \frac{B_i}{B} \right|$$

where Ai is the number of data type A at sample site i, Bi is the number of data type B at sample site i, K is the number of sites sampled, || is the absolute value, and A and B are the total counts of the two types in the study.

This index is probably better known as the index of dissimilarity (D).[44] It is closely related to the Gini index. This index is biased, as its expectation under a uniform distribution is > 0.

A modification of this index has been proposed by Gorard and Taylor.[45]

Index of segregation

The index of segregation (IS)[46] is

$$IS = \frac{1}{2} \sum_{i=1}^{K} \left| \frac{A_i}{A} - \frac{t_i - A_i}{T - A} \right|$$

where $A = \sum A_i$ and $T = \sum t_i$, K is the number of units, and Ai and ti are the number of data type A in unit i and the total number of all data types in unit i respectively.

Hutchens' square root index

This index (H) is defined in terms of pi, the proportion of the sample composed of the ith variate.[47]

Lieberson's isolation index

This index (Lxy) was invented by Lieberson in 1981.[48] It is defined in terms of Xi and Yi, the variables of interest at the ith site; K, the number of sites examined; and Xtot, the total number of variates of type X in the study.

Bell's index

This index is defined in terms of px, the proportion of the sample made up of variates of type X; Nx, the total number of variates of type X in the study; K, the number of samples in the study; and xi and pi, the number of variates and the proportion of variates of type X respectively in the ith sample.[49]

Index of isolation

The index of isolation is

$$II = \sum_{i=1}^{K} \frac{A_i}{A} \frac{A_i}{t_i}$$

where K is the number of units in the study, and Ai and ti are the number of units of type A and the number of all units in the ith sample respectively. A modified index of isolation (MII) has also been proposed; the MII lies between 0 and 1.

Gorard's index of segregation

This index (GS) is defined as

$$GS = \frac{1}{2} \sum_{i=1}^{K} \left| \frac{A_i}{A} - \frac{t_i}{T} \right|$$

where $A = \sum A_i$ and $T = \sum t_i$, and Ai and ti are the number of data items of type A and the total number of items in the ith sample respectively.

Index of exposure

This index is defined as

$$P = \sum_{i=1}^{K} \frac{A_i}{A} \frac{B_i}{t_i}$$

where $A = \sum A_i$, and Ai and Bi are the number of types A and B in the ith category and ti is the total number of data points in the ith category.

Ochiai index

This is a binary form of the cosine index.[50] It is used to compare presence/absence data of two data types (here A and B). It is defined as

$$O = \frac{a}{\sqrt{(a + b)(a + c)}}$$

where a is the number of sample units where both A and B are found, b is the number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

Kulczyński's coefficient

This coefficient was invented by Stanisław Kulczyński in 1927[51] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

$$S = \frac{1}{2} \left( \frac{a}{a + b} + \frac{a}{a + c} \right)$$

where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.
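The site-based indices above (dissimilarity, segregation, isolation, exposure) all take the same input, per-site counts of each type. A minimal sketch of two of them, under the reconstructions given above and with illustrative names:

```python
def dissimilarity(A, B):
    """Index of dissimilarity D = 0.5 * sum |A_i/A - B_i/B| over sites.

    A and B are lists of per-site counts of the two data types."""
    At, Bt = sum(A), sum(B)
    return 0.5 * sum(abs(a / At - b / Bt) for a, b in zip(A, B))

def isolation(A, t):
    """Index of isolation: sum (A_i/A) * (A_i/t_i), with t the per-site totals."""
    At = sum(A)
    return sum((a / At) * (a / ti) for a, ti in zip(A, t))

# Three sites; type A is concentrated in the first site.
A = [80, 10, 10]
B = [20, 40, 40]
print(dissimilarity(A, B))                            # 0.6
print(isolation(A, [a + b for a, b in zip(A, B)]))
```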
Yule's Q

This index was invented by Yule in 1900.[52] It concerns the association of two different types (here A and B). It is defined as

$$Q = \frac{ad - bc}{ad + bc}$$

where a is the number of samples where types A and B are both present, b is the number where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B is present. Q varies in value between −1 and +1. In the ordinal case Q is known as the Goodman–Kruskal γ.

Because the denominator may be zero, Leinhert and Sporer have recommended adding +1 to a, b, c and d.[53]

Yule's Y

This index is defined as

$$Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

where a, b, c and d are as defined above.

Baroni–Urbani–Buser coefficient

This index was invented by Baroni-Urbani and Buser in 1976.[54] It varies between 0 and 1 in value. It is defined as
$$BUB = \frac{\sqrt{ad} + a}{\sqrt{ad} + a + b + c}$$

where a, b, c and d are as defined above and N = a + b + c + d is the sample size. When d = 0, this index is identical to the Jaccard index. The coefficients that follow use the same notation.

Hamman coefficient

This coefficient is defined as

$$H = \frac{(a + d) - (b + c)}{N}$$

Rogers–Tanimoto coefficient

This coefficient is defined as

$$RT = \frac{a + d}{a + 2(b + c) + d}$$

Sokal–Sneath coefficient

This coefficient is defined as

$$SS = \frac{2(a + d)}{N + a + d}$$

Sokal's binary distance

This coefficient is defined as

$$SBD = \sqrt{\frac{b + c}{N}}$$

Russel–Rao coefficient

This coefficient is defined as

$$RR = \frac{a}{N}$$

Phi coefficient

This coefficient is defined as

$$\phi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

Soergel's coefficient

This coefficient is defined in terms of b, c, d and N, where N is the sample size.

Simpson's coefficient

This coefficient is defined as

$$S = \frac{a}{a + \min(b, c)}$$

Dennis' coefficient

This coefficient is defined as

$$D = \frac{ad - bc}{\sqrt{N(a + b)(a + c)}}$$

Forbes' coefficient

This coefficient was proposed by Stephen Alfred Forbes in 1907.[55] It is defined as

$$F = \frac{aN}{(a + b)(a + c)}$$

where N = a + b + c + d.
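Since all of these presence/absence coefficients are functions of the same four counts, a selection of them can be generated in one place. A minimal sketch; the dictionary keys are ours, and zero denominators are not guarded against:

```python
from math import sqrt

def binary_coefficients(a, b, c, d):
    """Several 2x2 presence/absence coefficients defined above.

    a: both types present; b: A only; c: B only; d: neither."""
    N = a + b + c + d
    return {
        "Jaccard": a / (a + b + c),
        "Ochiai": a / sqrt((a + b) * (a + c)),
        "Kulczynski": 0.5 * (a / (a + b) + a / (a + c)),
        "Yule_Q": (a * d - b * c) / (a * d + b * c),
        "Yule_Y": (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c)),
        "Baroni_Urbani_Buser": (sqrt(a * d) + a) / (sqrt(a * d) + a + b + c),
        "Hamman": ((a + d) - (b + c)) / N,
        "Rogers_Tanimoto": (a + d) / (a + 2 * (b + c) + d),
        "Russel_Rao": a / N,
        "Phi": (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d)),
        "Forbes": a * N / ((a + b) * (a + c)),
    }

for name, value in binary_coefficients(a=30, b=10, c=5, d=55).items():
    print(f"{name}: {value:.3f}")
```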
A modification of Forbes' coefficient that does not require knowledge of d has been proposed by Alroy;[56] it is expressed in terms of n = a + b + c.

Simple match coefficient

This coefficient is defined as

$$SM = \frac{a + d}{N}$$

Fossum's coefficient

This coefficient is defined as

$$F = \frac{N(a - 0.5)^2}{(a + b)(a + c)}$$

Stile's coefficient

This coefficient is defined as

$$S = \log_{10} \frac{N \left( |ad - bc| - \frac{N}{2} \right)^2}{(a + b)(a + c)(b + d)(c + d)}$$

where || is the modulus (absolute value) of the difference.

Michael's coefficient

This coefficient is defined as

$$M = \frac{4(ad - bc)}{(a + d)^2 + (b + c)^2}$$

Peirce's coefficient

In 1884 Charles Peirce suggested a coefficient of this type,[57] defined in terms of a, b, c and d.

Hawkin–Dotson coefficient

In 1975 Hawkin and Dotson proposed the following coefficient:

$$HD = \frac{1}{2} \left( \frac{a}{a + b + c} + \frac{d}{b + c + d} \right)$$

Benini coefficient

In 1901 Benini proposed a coefficient defined in terms of a, b and c, where min(b, c) is the minimum of b and c.

Gilbert coefficient

Gilbert proposed a coefficient defined in terms of a, b, c, d and N.

Gini index

The Gini index for two binary types is defined in terms of a, b and c.

Modified Gini index

The modified Gini index is likewise defined in terms of a, b and c.

Kuhn's index

Kuhn proposed an index of this type in 1965, defined in terms of a, b and c; K is a normalizing parameter and N is the sample size.
This index is also known as the coefficient of arithmetic means.

Eyraud index

Eyraud proposed a coefficient of this type in 1936, defined in terms of a, b, c and d.

Soergel distance

This distance is defined in terms of b, c, d and N, where N is the sample size.

Tanimoto index

This index is defined in terms of a, b, c, d and N, where N is the sample size.

Piatetsky–Shapiro's index

This index is defined in terms of a, b and c.

Indices for comparison between two or more samples

Czekanowski's quantitative index

This is also known as the Bray–Curtis index, Schoener's index, the least common percentage index, the index of affinity or the proportional similarity index. It is related to the Sørensen similarity index:

$$CZ_{ij} = \frac{2 \sum_{k} \min(x_{ik}, x_{jk})}{\sum_{k} (x_{ik} + x_{jk})}$$

where $x_{ik}$ and $x_{jk}$ are the numbers of the kth species at sites i and j respectively, and the minimum is taken over the species in common between the two sites.

Canberra metric

The Canberra distance is a weighted version of the L1 metric. It was introduced in 1966[58] and refined in 1967[59] by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors – here two sites with K categories within each site. The Canberra distance d between vectors p and q in a K-dimensional real vector space is

$$d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{K} \frac{|p_i - q_i|}{p_i + q_i}$$

where pi and qi are the values of the ith category of the two vectors.

Sorensen's coefficient of community

This is used to measure similarities between communities:

$$CC = \frac{2c}{s_1 + s_2}$$

where s1 and s2 are the numbers of species in communities 1 and 2 respectively and c is the number of species common to both areas.

Jaccard's index

This is a measure of the similarity between two samples:

$$J = \frac{A}{A + B + C}$$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

This index was invented in 1902 by the Swiss botanist Paul Jaccard.[60] Expressions have been derived for the expected value of J under a random distribution[61] and for its standard error under the same assumption; both involve the total sample size N.
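For abundance data at two sites, the Czekanowski (Bray–Curtis) similarity, the Canberra distance and the Jaccard index can all be computed from the same pair of vectors. A minimal sketch with illustrative names and data:

```python
def czekanowski(x, y):
    """Bray-Curtis / Czekanowski similarity: 2 * sum min / sum (x + y)."""
    return 2 * sum(min(a, b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def canberra(p, q):
    """Canberra distance: sum |p_i - q_i| / (p_i + q_i), skipping empty pairs."""
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)

def jaccard(x, y):
    """Jaccard index from two abundance vectors, via presence/absence."""
    A = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)    # shared
    B = sum(1 for a, b in zip(x, y) if a > 0 and b == 0)   # first only
    C = sum(1 for a, b in zip(x, y) if a == 0 and b > 0)   # second only
    return A / (A + B + C)

site1 = [12, 0, 5, 3, 0]
site2 = [10, 2, 0, 6, 1]
print(czekanowski(site1, site2), canberra(site1, site2), jaccard(site1, site2))
```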
Dice's index

This is a measure of the similarity between two samples:

$$D = \frac{2A}{2A + B + C}$$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

Match coefficient

This is a measure of the similarity between two samples:

$$M = \frac{N - B - C}{N}$$

where N is the number of data points in the two samples and B and C are the data points found only in the first and second samples respectively.

Morisita's index

Masaaki Morisita's index of dispersion (Im) is the scaled probability that two points chosen at random from the whole population are in the same sample:[62]

$$I_m = n \frac{\sum x (x - 1)}{\left( \sum x \right) \left( \sum x - 1 \right)}$$

Higher values indicate a more clumped distribution. An alternative formulation is

$$I_m = \frac{\sum x^2 - nm}{nm^2 - m}$$

where n is the total sample size, m is the sample mean and x are the individual values, with the sums taken over the whole sample. It can also be expressed in terms of Lloyd's index of crowding (IMC).[63] This index is relatively independent of the population density but is affected by the sample size.

Morisita showed that the statistic[62]

$$I_m \left( \sum x - 1 \right) + n - \sum x$$

is distributed as a chi-squared variable with n − 1 degrees of freedom.

An alternative significance test for this index has been developed for large samples,[64] based on the overall sample mean m, the number of sample units n and the normal distribution abscissa z. Significance is tested by comparing the value of z against the values of the normal distribution.
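A minimal sketch of Morisita's index of dispersion and its chi-squared test statistic, following the formulation above; the quadrat counts are illustrative:

```python
def morisita_Id(counts):
    """Morisita's index of dispersion over n sample units:
    Id = n * sum x(x - 1) / (T(T - 1)), with T the grand total."""
    n = len(counts)
    T = sum(counts)
    return n * sum(x * (x - 1) for x in counts) / (T * (T - 1))

# Counts of individuals in 6 quadrats; clumping gives Id > 1.
quadrats = [20, 1, 0, 2, 1, 0]
Id = morisita_Id(quadrats)
T = sum(quadrats)
chi2_stat = Id * (T - 1) + len(quadrats) - T   # ~ chi-squared, n - 1 df
print(Id, chi2_stat)
```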
Morisita's overlap index

Morisita's overlap index is used to compare overlap among samples.[65] The index is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats:

$$C_D = \frac{2 \sum_{i=1}^{S} x_i y_i}{(D_x + D_y) X Y}$$

where xi and yi are the numbers of times species i is represented in the totals X and Y of the two samples, and

$$D_x = \frac{\sum_{i=1}^{S} x_i^2}{X^2}, \qquad D_y = \frac{\sum_{i=1}^{S} y_i^2}{Y^2}$$

CD = 0 if the two samples do not overlap in terms of species, and CD = 1 if the species occur in the same proportions in both samples. Horn introduced a modification of the index.[66]

Standardised Morisita's index

Smith-Gill developed a statistic based on Morisita's index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows.[67]

First determine Morisita's index (Id) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values

$$M_u = \frac{\chi^2_{0.975} - k + \sum x_i}{\sum x_i - 1}$$

$$M_c = \frac{\chi^2_{0.025} - k + \sum x_i}{\sum x_i - 1}$$

where χ2 is the chi square value for k − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence (i.e. the values having 97.5% and 2.5% of the area to their right).

The standardised index (Ip) is then calculated from one of the formulae below.

When Id ≥ Mc > 1:

$$I_p = 0.5 + 0.5 \left( \frac{I_d - M_c}{k - M_c} \right)$$

When Mc > Id ≥ 1:

$$I_p = 0.5 \left( \frac{I_d - 1}{M_c - 1} \right)$$

When 1 > Id ≥ Mu:

$$I_p = -0.5 \left( \frac{I_d - 1}{M_u - 1} \right)$$

When 1 > Mu > Id:

$$I_p = -0.5 + 0.5 \left( \frac{I_d - M_u}{M_u} \right)$$

Ip ranges between +1 and −1 with 95% confidence intervals of ±0.5. Ip has the value of 0 if the pattern is random; if the pattern is uniform, Ip < 0, and if the pattern shows aggregation, Ip > 0.

Peet's evenness indices

These indices are a measure of evenness between samples,[68] expressed in terms of I, an index of diversity, and Imax, the maximum value of I.
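The four-case computation of the standardised Morisita index can be sketched as follows, assuming SciPy for the chi-squared quantiles; note that, per the convention above, χ²0.975 denotes the value with 97.5% of the area to its right, i.e. the lower quantile. A sketch of the procedure as reconstructed here, not a definitive implementation:

```python
from scipy.stats import chi2

def standardised_morisita(counts):
    """Smith-Gill standardised Morisita index Ip (four-case formula above)."""
    k = len(counts)                 # number of sample units
    T = sum(counts)
    Id = k * sum(x * (x - 1) for x in counts) / (T * (T - 1))
    # Critical values; ppf(0.025) has 97.5% of the area to its right.
    Mu = (chi2.ppf(0.025, k - 1) - k + T) / (T - 1)   # uniform critical value
    Mc = (chi2.ppf(0.975, k - 1) - k + T) / (T - 1)   # clumped critical value
    if Id >= Mc > 1:
        return 0.5 + 0.5 * (Id - Mc) / (k - Mc)
    if Mc > Id >= 1:
        return 0.5 * (Id - 1) / (Mc - 1)
    if 1 > Id >= Mu:
        return -0.5 * (Id - 1) / (Mu - 1)
    return -0.5 + 0.5 * (Id - Mu) / Mu                # 1 > Mu > Id

print(standardised_morisita([20, 1, 0, 2, 1, 0]))   # aggregated: Ip > 0.5
print(standardised_morisita([4, 4, 4, 4, 4, 4]))    # uniform: Ip < 0
```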