Aller au contenu principal

Rand index


Rand index


The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. The Rand index is the accuracy of determining if a link belongs within a cluster or not.

Rand index

Definition

Given a set of n {\displaystyle n} elements S = { o 1 , , o n } {\displaystyle S=\{o_{1},\ldots ,o_{n}\}} and two partitions of S {\displaystyle S} to compare, X = { X 1 , , X r } {\displaystyle X=\{X_{1},\ldots ,X_{r}\}} , a partition of S into r subsets, and Y = { Y 1 , , Y s } {\displaystyle Y=\{Y_{1},\ldots ,Y_{s}\}} , a partition of S into s subsets, define the following:

  • a {\displaystyle a} , the number of pairs of elements in S {\displaystyle S} that are in the same subset in X {\displaystyle X} and in the same subset in Y {\displaystyle Y}
  • b {\displaystyle b} , the number of pairs of elements in S {\displaystyle S} that are in different subsets in X {\displaystyle X} and in different subsets in Y {\displaystyle Y}
  • c {\displaystyle c} , the number of pairs of elements in S {\displaystyle S} that are in the same subset in X {\displaystyle X} and in different subsets in Y {\displaystyle Y}
  • d {\displaystyle d} , the number of pairs of elements in S {\displaystyle S} that are in different subsets in X {\displaystyle X} and in the same subset in Y {\displaystyle Y}

The Rand index, R {\displaystyle R} , is:

R = a + b a + b + c + d = a + b ( n 2 ) {\displaystyle R={\frac {a+b}{a+b+c+d}}={\frac {a+b}{n \choose 2}}}

Intuitively, a + b {\displaystyle a+b} can be considered as the number of agreements between X {\displaystyle X} and Y {\displaystyle Y} and c + d {\displaystyle c+d} as the number of disagreements between X {\displaystyle X} and Y {\displaystyle Y} .

Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that X {\displaystyle X} and Y {\displaystyle Y} will agree on a randomly chosen pair.

( n 2 ) {\displaystyle {n \choose 2}} is calculated as n ( n 1 ) / 2 {\displaystyle n(n-1)/2} .

Similarly, one can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:

R I = T P + T N T P + F P + F N + T N {\displaystyle RI={\frac {TP+TN}{TP+FP+FN+TN}}}
where T P {\displaystyle TP} is the number of true positives, T N {\displaystyle TN} is the number of true negatives, F P {\displaystyle FP} is the number of false positives, and F N {\displaystyle FN} is the number of false negatives.

Properties

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

In mathematical terms, a, b, c, d are defined as follows:

  • a = | S | {\displaystyle a=|S^{*}|} , where S = { ( o i , o j ) o i , o j X k , o i , o j Y l } {\displaystyle S^{*}=\{(o_{i},o_{j})\mid o_{i},o_{j}\in X_{k},o_{i},o_{j}\in Y_{l}\}}
  • b = | S | {\displaystyle b=|S^{*}|} , where S = { ( o i , o j ) o i X k 1 , o j X k 2 , o i Y l 1 , o j Y l 2 } {\displaystyle S^{*}=\{(o_{i},o_{j})\mid o_{i}\in X_{k_{1}},o_{j}\in X_{k_{2}},o_{i}\in Y_{l_{1}},o_{j}\in Y_{l_{2}}\}}
  • c = | S | {\displaystyle c=|S^{*}|} , where S = { ( o i , o j ) o i , o j X k , o i Y l 1 , o j Y l 2 } {\displaystyle S^{*}=\{(o_{i},o_{j})\mid o_{i},o_{j}\in X_{k},o_{i}\in Y_{l_{1}},o_{j}\in Y_{l_{2}}\}}
  • d = | S | {\displaystyle d=|S^{*}|} , where S = { ( o i , o j ) o i X k 1 , o j X k 2 , o i , o j Y l } {\displaystyle S^{*}=\{(o_{i},o_{j})\mid o_{i}\in X_{k_{1}},o_{j}\in X_{k_{2}},o_{i},o_{j}\in Y_{l}\}}

for some 1 i , j n , i j , 1 k , k 1 , k 2 r , k 1 k 2 , 1 l , l 1 , l 2 s , l 1 l 2 {\displaystyle 1\leq i,j\leq n,i\neq j,1\leq k,k_{1},k_{2}\leq r,k_{1}\neq k_{2},1\leq l,l_{1},l_{2}\leq s,l_{1}\neq l_{2}}

Relationship with classification accuracy

The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in S {\displaystyle S} . The two class labels are " o i {\displaystyle o_{i}} and o j {\displaystyle o_{j}} are in the same subset in X {\displaystyle X} and Y {\displaystyle Y} " and " o i {\displaystyle o_{i}} and o j {\displaystyle o_{j}} are in different subsets in X {\displaystyle X} and Y {\displaystyle Y} ".

In that setting, a {\displaystyle a} is the number of pairs correctly labeled as belonging to the same subset (true positives), and b {\displaystyle b} is the number of pairs correctly labeled as belonging to different subsets (true negatives).

Adjusted Rand index

The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in K-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings.

Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.

The contingency table

Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely X = { X 1 , X 2 , , X r } {\displaystyle X=\{X_{1},X_{2},\ldots ,X_{r}\}} and Y = { Y 1 , Y 2 , , Y s } {\displaystyle Y=\{Y_{1},Y_{2},\ldots ,Y_{s}\}} , the overlap between X and Y can be summarized in a contingency table [ n i j ] {\displaystyle \left[n_{ij}\right]} where each entry n i j {\displaystyle n_{ij}} denotes the number of objects in common between X i {\displaystyle X_{i}} and Y j {\displaystyle Y_{j}}  : n i j = | X i Y j | {\displaystyle n_{ij}=|X_{i}\cap Y_{j}|} .

X Y Y 1 Y 2 Y s sums X 1 n 11 n 12 n 1 s a 1 X 2 n 21 n 22 n 2 s a 2 X r n r 1 n r 2 n r s a r sums b 1 b 2 b s {\displaystyle {\begin{array}{c|cccc|c}{{} \atop X}\!\diagdown \!^{Y}&Y_{1}&Y_{2}&\cdots &Y_{s}&{\text{sums}}\\\hline X_{1}&n_{11}&n_{12}&\cdots &n_{1s}&a_{1}\\X_{2}&n_{21}&n_{22}&\cdots &n_{2s}&a_{2}\\\vdots &\vdots &\vdots &\ddots &\vdots &\vdots \\X_{r}&n_{r1}&n_{r2}&\cdots &n_{rs}&a_{r}\\\hline {\text{sums}}&b_{1}&b_{2}&\cdots &b_{s}&\end{array}}}

Definition

The original Adjusted Rand Index using the Permutation Model is

A R I = i j ( n i j 2 ) [ i ( a i 2 ) j ( b j 2 ) ] / ( n 2 ) 1 2 [ i ( a i 2 ) + j ( b j 2 ) ] [ i ( a i 2 ) j ( b j 2 ) ] / ( n 2 ) {\displaystyle ARI={\frac {\left.\sum _{ij}{\binom {n_{ij}}{2}}-\left[\sum _{i}{\binom {a_{i}}{2}}\sum _{j}{\binom {b_{j}}{2}}\right]\right/{\binom {n}{2}}}{\left.{\frac {1}{2}}\left[\sum _{i}{\binom {a_{i}}{2}}+\sum _{j}{\binom {b_{j}}{2}}\right]-\left[\sum _{i}{\binom {a_{i}}{2}}\sum _{j}{\binom {b_{j}}{2}}\right]\right/{\binom {n}{2}}}}}

where n i j , a i , b j {\displaystyle n_{ij},a_{i},b_{j}} are values from the contingency table.

See also

  • Simple matching coefficient

References

External links

  • C++ implementation with MATLAB mex files
Giuseppe Zanotti Luxury Sneakers

Text submitted to CC-BY-SA license. Source: Rand index by Wikipedia (Historical)


INVESTIGATION