Partition a set of objects into homogeneous clusters

Cluster Analysis

The term cluster analysis encompasses a number of different classification algorithms. Hierarchical (Tree) clustering is one of the most widely used. Agglomerative hierarchical clustering algorithms build a cluster hierarchy that is displayed as a tree diagram called a dendrogram. At the first step, each object forms its own cluster, and the distances between objects are given by the chosen distance measure (e.g. Euclidean). At each subsequent step, the two most similar clusters are joined into a single new cluster. Once fused, objects are never separated. There are many rules, called linkage rules, for defining the similarity between clusters. In Single linkage (nearest neighbor), the distance between two clusters is the smallest distance between any two objects in the different clusters, so two clusters are joined as soon as any one pair of their objects is close enough. In Complete linkage (furthest neighbor), the distance between two clusters is the greatest distance between any two objects in the different clusters.
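
The agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (agglomerate, cluster_distance) and the four sample points are made up for the example, Euclidean distance is assumed, and the brute-force search over all cluster pairs is chosen for readability rather than speed.

```python
import math
from itertools import combinations

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, given as tuples of point indices."""
    pairwise = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    # Single linkage uses the closest pair, complete linkage the farthest pair.
    return min(pairwise) if linkage == "single" else max(pairwise)

def agglomerate(points, linkage="single"):
    """Return the sequence of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two most similar (closest) clusters under the linkage rule.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        d = cluster_distance(a, b, points, linkage)
        # Fuse them; fused objects are never separated again.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical sample data: two tight pairs that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
single_merges = agglomerate(points, "single")
complete_merges = agglomerate(points, "complete")
```

Both rules join objects 0 and 1 first and objects 2 and 3 second; they differ only in the height of the final merge, which is the smallest cross-cluster distance under single linkage and the largest under complete linkage. The list of merges is the information a dendrogram displays.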

K-means Clustering produces exactly k clusters that are as distinct from one another as possible, where k is chosen in advance. Computationally, you may think of this method as ANOVA "in reverse": instead of testing whether given groups differ, the program looks for the grouping that makes the groups differ most. The program starts with k random clusters and then moves objects between those clusters with the goal of (1) minimizing variability within clusters and (2) maximizing variability between clusters.
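
One common way to carry out this kind of reassignment is Lloyd's algorithm, sketched below. It is an assumption that this variant matches what any particular program does: here the random start consists of k objects picked as initial centers, and the names k_means, squared_distance and the sample data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed random start: k objects as centers
    labels = None
    for _ in range(max_iter):
        # Move each object to the cluster whose center is nearest.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean(members)
    return labels, centers

# Hypothetical sample data: two well-separated groups of three objects.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
```

Assigning each object to its nearest center reduces the within-cluster sum of squares, and for a fixed data set that is equivalent to increasing the between-cluster sum of squares, which is why the two goals in the text go together. The result can depend on the random start, so in practice the procedure is often run several times.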

Medoid Clustering finds a set of representative objects called medoids. The medoid of a cluster is the object in that cluster whose average dissimilarity to all other objects in the cluster is minimal. If k clusters are desired, k medoids are found. Once the medoids are found, each object is assigned to the cluster of its nearest medoid.
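
The definitions above translate directly into code. The sketch below assumes Euclidean distance as the dissimilarity and, purely for brevity, finds the k medoids by trying every possible set of k objects, which is only feasible for very small data sets; practical programs use a smarter search. All function names and the sample data are hypothetical.

```python
import math
from itertools import combinations

def medoid(cluster):
    """Object with the smallest total dissimilarity to the cluster's objects.

    Minimizing the total is the same as minimizing the average.
    """
    return min(cluster, key=lambda c: sum(math.dist(c, o) for o in cluster))

def assign(points, medoids):
    """Label each object with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda m: math.dist(p, medoids[m]))
        for p in points
    ]

def k_medoids_brute_force(points, k):
    """Try every set of k objects as medoids (tiny data sets only)."""
    def cost(candidates):
        # Total dissimilarity of every object to its nearest candidate medoid.
        return sum(min(math.dist(p, m) for m in candidates) for p in points)
    best = min(combinations(points, k), key=cost)
    return list(best), assign(points, best)

# Hypothetical sample data: two groups of three objects on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids, labels = k_medoids_brute_force(points, 2)
```

For this data the middle object of each group is chosen as its medoid. Note that a medoid is always one of the observed objects, so the method needs only the dissimilarities between objects, not their coordinates.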

Fuzzy Clustering allows an individual to be partially classified into more than one cluster. Suppose we have K clusters and we define, for each object i and each cluster k, a membership value that represents the probability that object i is classified into cluster k. In regular clustering, each individual is a member of only one cluster, so one of these values is 1 and the rest are 0. In fuzzy clustering, the membership is spread among all clusters.
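
To make the idea of spread-out membership concrete, the sketch below computes the memberships of a single object with the update rule used in fuzzy c-means. Several things here are assumptions not stated in the text: the choice of that particular rule, the fuzziness exponent m = 2, Euclidean distance, and cluster centers that are already known. The function name and the sample values are hypothetical.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means memberships of one object, given fixed cluster centers.

    The values are nonnegative and sum to 1; m > 1 controls the fuzziness.
    """
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # The object coincides with a center: all membership goes there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # The closer a center is relative to the others, the larger the membership.
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
        for d_k in dists
    ]

# Hypothetical example with K = 2 cluster centers on a line.
centers = [(0.0, 0.0), (4.0, 0.0)]
u_near = fuzzy_memberships((1.0, 0.0), centers)  # close to the first center
u_mid = fuzzy_memberships((2.0, 0.0), centers)   # halfway between the centers
```

An object close to the first center belongs mostly, but not entirely, to the first cluster (memberships of about 0.9 and 0.1 here), while an object halfway between the centers is split evenly. Regular clustering would force both objects into a single cluster.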

Cluster analysis is frequently used in market research. Consumers of a product are sorted into several groups so that persons within each group are similar to one another and the groups differ from each other. Consumers are characterized by variables such as age, gender, income, amount of consumption, frequency of purchase, etc.