Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.
Types of clustering
Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Many clustering algorithms require specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem for which a number of techniques have been developed.
Distance measure
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
* The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
* The Manhattan distance (aka taxicab norm or 1-norm)
* The maximum norm (aka infinity norm)
* The Mahalanobis distance corrects data for different scales and correlations in the variables
* The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.
* The Hamming distance measures the minimum number of substitutions required to change one member into another.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. (A true metric gives symmetric measures of distance.)
Hierarchical clustering
Main article: Hierarchical clustering
Merge-arrow.svg
It has been suggested that this article or section be merged into Hierarchical clustering. (Discuss)
Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.
Any valid metric may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
Agglomerative hierarchical clustering
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric.
Raw data
The hierarchical clustering dendrogram would be as such:
Traditional representation
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters \mathcal{A} and \mathcal{B} is one of the following:
* The maximum distance between elements of each cluster (also called complete linkage clustering):
\max \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B}\,\}.
* The minimum distance between elements of each cluster (also called single-linkage clustering):
\min \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\}.
* The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):
{1 \over {|\mathcal{A}|\cdot|\mathcal{B}|}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y).
* The sum of all intra-cluster variance.
* The increase in variance for the cluster being merged (Ward's criterion).
* The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.
Types of clustering
Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Many clustering algorithms require specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem for which a number of techniques have been developed.
Distance measure
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
* The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
* The Manhattan distance (aka taxicab norm or 1-norm)
* The maximum norm (aka infinity norm)
* The Mahalanobis distance corrects data for different scales and correlations in the variables
* The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.
* The Hamming distance measures the minimum number of substitutions required to change one member into another.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. (A true metric gives symmetric measures of distance.)
Hierarchical clustering
Main article: Hierarchical clustering
Merge-arrow.svg
It has been suggested that this article or section be merged into Hierarchical clustering. (Discuss)
Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.
Any valid metric may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
Agglomerative hierarchical clustering
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric.
Raw data
The hierarchical clustering dendrogram would be as such:
Traditional representation
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters \mathcal{A} and \mathcal{B} is one of the following:
* The maximum distance between elements of each cluster (also called complete linkage clustering):
\max \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B}\,\}.
* The minimum distance between elements of each cluster (also called single-linkage clustering):
\min \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\}.
* The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):
{1 \over {|\mathcal{A}|\cdot|\mathcal{B}|}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y).
* The sum of all intra-cluster variance.
* The increase in variance for the cluster being merged (Ward's criterion).
* The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).