Aller au contenu principal

Fowlkes–Mallows index


Fowlkes–Mallows index


The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm), and also a metric to measure confusion matrices. This measure of similarity could be either between two hierarchical clusterings or a clustering and a benchmark classification. A higher value for the Fowlkes–Mallows index indicates a greater similarity between the clusters and the benchmark classifications. It was invented by Bell Labs statisticians Edward Fowlkes and Collin Mallows in 1983.

Preliminaries

The Fowlkes–Mallows index, when results of two clustering algorithms are used to evaluate the results, is defined as

F M = P P V T P R = T P T P + F P T P T P + F N {\displaystyle FM={\sqrt {PPV\cdot TPR}}={\sqrt {{\frac {TP}{TP+FP}}\cdot {\frac {TP}{TP+FN}}}}}

where T P {\displaystyle TP} is the number of true positives, F P {\displaystyle FP} is the number of false positives, and F N {\displaystyle FN} is the number of false negatives. T P R {\displaystyle TPR} is the true positive rate, also called sensitivity or recall, and P P V {\displaystyle PPV} is the positive predictive rate, also known as precision.

The minimum possible value of the Fowlkes–Mallows index is 0, which corresponds to the worst binary classification possible, where all the elements have been misclassified. And the maximum possible value of the Fowlkes–Mallows index is 1, which corresponds to the best binary classification possible, where all the elements have been perfectly classified.

Definition

Consider two hierarchical clusterings of n {\displaystyle n} objects labeled A 1 {\displaystyle A_{1}} and A 2 {\displaystyle A_{2}} . The trees A 1 {\displaystyle A_{1}} and A 2 {\displaystyle A_{2}} can be cut to produce k = 2 , , n 1 {\displaystyle k=2,\ldots ,n-1} clusters for each tree (by either selecting clusters at a particular height of the tree or setting different strength of the hierarchical clustering). For each value of k {\displaystyle k} , the following table can then be created

M = [ m i , j ] ( i = 1 , , k  and  j = 1 , , k ) {\displaystyle M=[m_{i,j}]\qquad (i=1,\ldots ,k{\text{ and }}j=1,\ldots ,k)}

where m i , j {\displaystyle m_{i,j}} is of objects common between the i {\displaystyle i} th cluster of A 1 {\displaystyle A_{1}} and j {\displaystyle j} th cluster of A 2 {\displaystyle A_{2}} . The Fowlkes–Mallows index for the specific value of k {\displaystyle k} is then defined as

B k = T k P k Q k {\displaystyle B_{k}={\frac {T_{k}}{\sqrt {P_{k}Q_{k}}}}}

where

T k = i = 1 k j = 1 k m i , j 2 n {\displaystyle T_{k}=\sum _{i=1}^{k}\sum _{j=1}^{k}m_{i,j}^{2}-n}
P k = i = 1 k ( j = 1 k m i , j ) 2 n {\displaystyle P_{k}=\sum _{i=1}^{k}(\sum _{j=1}^{k}m_{i,j})^{2}-n}
Q k = j = 1 k ( i = 1 k m i , j ) 2 n {\displaystyle Q_{k}=\sum _{j=1}^{k}(\sum _{i=1}^{k}m_{i,j})^{2}-n}

B k {\displaystyle B_{k}} can then be calculated for every value of k {\displaystyle k} and the similarity between the two clusterings can be shown by plotting B k {\displaystyle B_{k}} versus k {\displaystyle k} . For each k {\displaystyle k} we have 0 B k 1 {\displaystyle 0\leq B_{k}\leq 1} .

Fowlkes–Mallows index can also be defined based on the number of points that are common or uncommon in the two hierarchical clusterings. If we define

T P {\displaystyle TP} as the number of pairs of points that are present in the same cluster in both A 1 {\displaystyle A_{1}} and A 2 {\displaystyle A_{2}} .
F P {\displaystyle FP} as the number of pairs of points that are present in the same cluster in A 1 {\displaystyle A_{1}} but not in A 2 {\displaystyle A_{2}} .
F N {\displaystyle FN} as the number of pairs of points that are present in the same cluster in A 2 {\displaystyle A_{2}} but not in A 1 {\displaystyle A_{1}} .
T N {\displaystyle TN} as the number of pairs of points that are in different clusters in both A 1 {\displaystyle A_{1}} and A 2 {\displaystyle A_{2}} .

It can be shown that the four counts have the following property

T P + F P + F N + T N = n ( n 1 ) / 2 {\displaystyle TP+FP+FN+TN=n(n-1)/2}

and that the Fowlkes–Mallows index for two clusterings can be defined as

F M = P P V T P R = T P T P + F P T P T P + F N {\displaystyle FM={\sqrt {PPV\cdot TPR}}={\sqrt {{\frac {TP}{TP+FP}}\cdot {\frac {TP}{TP+FN}}}}}
where T P {\displaystyle TP} is the number of true positives, F P {\displaystyle FP} is the number of false positives, and F N {\displaystyle FN} is the number of false negatives.
T P R {\displaystyle TPR} is the true positive rate, also called sensitivity or recall, and P P V {\displaystyle PPV} is the positive predictive rate, also known as precision.
The Fowlkes–Mallows index is the geometric mean of precision and recall.

Discussion

Since the index is directly proportional to the number of true positives, a higher index means greater similarity between the two clusterings used to determine the index. One basic way to test the validity of this index is to compare two clusterings that are unrelated to each other. Fowlkes and Mallows showed that on using two unrelated clusterings, the value of this index approaches zero as the number of total data points chosen for clustering increase; whereas the value for the Rand index for the same data quickly approaches 1 {\displaystyle 1} making Fowlkes–Mallows index a much more accurate representation for unrelated data. This index also performs well if noise is added to an existing dataset and their similarity compared. Fowlkes and Mallows showed that the value of the index decreases as the component of the noise increases. The index also showed similarity even when the noisy dataset had a different number of clusters than the clusters of the original dataset. Thus making it a reliable tool for measuring similarity between two clusters.

Further reading

  • Chicco, Davide; Jurman, Giuseppe (2023). "A statistical comparison between Matthews correlation coefficient (MCC), prevalence threshold, and Fowlkes–Mallows index". Journal of Biomedical Informatics. 144 (104426): 1–7. doi:10.1016/j.jbi.2023.104426. PMID 37352899. S2CID 259240662.

See also

  • F1 score
  • Matthews correlation coefficient
  • Confusion matrix

References

External links

  • Implementation of Fowlkes–Mallows index in R.

Text submitted to CC-BY-SA license. Source: Fowlkes–Mallows index by Wikipedia (Historical)