The DBSCAN algorithm creates overlapping clusters

Cluster (data analysis)

As Cluster (occasionally also Agglomerations) is a term used in computer science and statistics group of data objects with similar properties. The set of clusters found in a data set is known as Clustering, Method for calculating such a grouping as cluster analysis. Data objects that do not belong to a cluster are called outliersoutlier) or noise (Englishnoise).

The core idea of ​​a cluster is that objects in the same cluster have “similar” properties and are different from objects that are not in the same cluster.

Cluster membership

Models of clusters

Comparison of k-means and EM algorithm on an artificial data set, visualized with ELKI. By using variances, EM can accurately describe the different normal distributions, while k-means divides the data into unfavorable Voronoi cells.

Different algorithms for cluster analysis often use different terms for clusters. This often leads to problems of understanding, since the results of one method need not be similar in the sense of another method.

The k-means algorithm describes clusters by their center points (or the Voronoi cells resulting from them), the EM algorithm describes clusters by the center point and a covariance matrix, while DBSCAN calculates "dense-connected" sets of any shape as a cluster.

Depending on the cluster term used, different structures may or may not be found. In the example shown here, the existing clusters cannot be found accurately by the k-means algorithm using its cluster model. The more complex model of the EM algorithm, on the other hand, is ideal for describing this data, since it was generated from a normal distribution.

Subspace cluster

As Subspace cluster is a term used to describe a cluster that is not conspicuous in all attributes or attribute combinations. Only when the data is appropriately projected can one recognize the higher similarity of the cluster objects compared to the others.

A distinction can be made between subspace clusters Axis-parallel clusters (based on a selection of attributes) and freely oriented Correlation Clusters.

Methods for subspace cluster methods are, for example, CLIQUE, ORCLUS, SubClu, PreDeCon, PROCLUS, HiSC, HiCO, 4C, ERiC and CASH.

Calculation of clusters

There are numerous methods (so-called cluster analysis algorithms) for calculating clusters. These differ significantly in what models they use for clusters. In many classic methods such as the k-means algorithm, the EM algorithm, hierarchical cluster analysis and DBSCAN, the cluster model is in the foreground, and there are sometimes several specific algorithms that allow an (at least locally) optimal solution for this model Find. Many newer methods, however, no longer have a correspondingly clearly defined model.

Evaluation of clusters

The evaluation of found clusters is not a simple problem, especially if the clusters come from different methods. There is a risk of overfitting if the valuation method is too similar to one of the methods used - that is, one is ultimately investigating which method is most similar to the valuation method.

Internal evaluation

From one internal Evaluation is used when no additional information is used for evaluation, but only the objects of the data set are used for evaluation. Typically, distance measures are used for this, for example the average distance between two cluster objects. Internal scoring usually favors clustering results built according to the same model. For example, from -Means found clusters naturally shorter average distances than DBSCAN clusters. This type of evaluation is therefore particularly useful if you want to evaluate different results of the same procedure, for example from several runs of a randomized procedure like this -Means algorithm. The silhouette coefficient is an internal measure for evaluating distance-based clusterings that is independent of the number of clusters. It is particularly suitable for measuring results from -Means with different values ​​of to compare as it depends on the number of clusters is independent.

External evaluation

In the external Scoring adds information that was not used during cluster analysis. If, for example, there is a classification of the data, then the agreement of the cluster with a class can be used for evaluation. The problems with this approach are that, on the one hand, suitable information is not always available and, on the other hand, the aim of the cluster analysis is precisely to discover a new structure, and the evaluation based on a known structure therefore only makes limited sense. Furthermore, several overlapping structures can exist in the data.[1] Because it is linked to the existing classification, this evaluation prefers informed methods from the area of ​​machine learning over uninformed methods from (real) cluster analysis.

See also

Individual evidence

  1. ↑ I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek: On Using Class Labels in Evaluation of Clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC. 2010 (uni-muenchen.de [PDF]).

literature

  • Martin Ester, Jörg Sander: Knowledge Discovery in Databases. Techniques and Applications. Springer, Berlin 2000, ISBN 3-540-67328-8.