The main idea...
Nonhierarchical cluster analysis aims to find a grouping of objects which maximises or minimises some evaluation criterion. Many of these algorithms iteratively assign objects to different groups while searching for an optimal value of the criterion. A popular example of nonhierarchical cluster analysis is described below.
K-means clustering
K-means clustering aims to assign objects to a user-defined number of clusters (k) in a way that maximises the separation between those clusters while minimising intra-cluster distances relative to each cluster's mean, or centroid (Figure 1). The algorithm typically defaults to Euclidean distances; however, many implementations accept alternative distance or dissimilarity measures. If not, alternative distances or dissimilarities may be transformed to be compatible with a Euclidean representation. The user is usually expected to set three parameters: 1) the number of clusters expected (k), 2) the initialisation method, and 3) the distance metric to be used.
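The criterion being minimised is the sum of squared distances from each object to its cluster's centroid. A minimal sketch in Python (not part of the original text; the function name is hypothetical) makes this concrete:

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def within_cluster_ss(objects, labels, centroids):
    """Sum of squared Euclidean distances from each object to its
    cluster centroid -- the criterion k-means seeks to minimise."""
    return sum(dist(x, centroids[lab]) ** 2
               for x, lab in zip(objects, labels))

# Two tight clusters around (0, 0) and (10, 10):
objects   = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels    = [0, 0, 1, 1]
centroids = [(0, 0.5), (10, 10.5)]
print(within_cluster_ss(objects, labels, centroids))  # -> 1.0
```

Each of the four objects sits 0.5 units from its centroid, contributing 0.25 to the total; a worse assignment (or worse centroids) would yield a larger value.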
This algorithm has a variety of implementations (see Jain, 2010); one of the most popular uses an iterative refinement approach (Figure 2).
 1. An initialisation step creates k centroids. Some implementations use existing objects as initial centroids, while others randomly assign objects to k clusters and calculate the first centroids from these. It is often possible to define custom initial placements for the centroids.
 2. An assignment step places each object into the cluster whose centroid (mean) is closest to it.
 3. An update or reassignment step then calculates new centroids based on the membership of each cluster.
 4. Steps 2 and 3 are repeated until the solution converges, i.e. when the centroid positions no longer change.
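The steps above can be sketched in a few lines of Python (a minimal Lloyd-style illustration, not the original page's code; initial centroids are supplied explicitly so the example stays deterministic):

```python
from math import dist
from statistics import mean

def k_means(objects, k, initial_centroids):
    """Minimal k-means: alternate assignment and update steps until
    the centroid positions no longer change."""
    centroids = list(initial_centroids)  # step 1: initialisation
    while True:
        # Step 2: assign each object to the nearest centroid.
        labels = [min(range(k), key=lambda j: dist(x, centroids[j]))
                  for x in objects]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [x for x, lab in zip(objects, labels) if lab == j]
            new_centroids.append(tuple(mean(c) for c in zip(*members))
                                 if members else centroids[j])
        # Step 4: stop once the centroids are stable.
        if new_centroids == centroids:
            return labels, centroids
        centroids = new_centroids

# Two well-separated groups; start one centroid on an object from each.
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
labels, centroids = k_means(pts, k=2, initial_centroids=[(0, 0), (9, 9)])
print(labels)  # -> [0, 0, 0, 1, 1, 1]
```

With well-separated groups the solution converges after a single update; in practice convergence may take many iterations and, as noted below, depends on the initialisation.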
While simple to implement, the k-means algorithm has several properties which are important to grasp in order to use the method appropriately. Please go through the Warnings section of this page and Davidson (2002).
The clustering algorithm can be constrained by rules governing object membership. For example, must-link and cannot-link constraints can be imposed on individual objects.
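One way such constraints can enter the algorithm is during the assignment step: each object goes to the nearest centroid that does not violate a constraint. The sketch below (a simplified, COP-KMeans-style illustration; the function name and constraint encoding are this example's own, not from the original text) shows cannot-link constraints only:

```python
from math import dist

def constrained_assign(objects, centroids, cannot_link):
    """Assign each object, in order, to the nearest centroid whose
    current members do not cannot-link with it. `cannot_link` is a
    set of index pairs that may not share a cluster."""
    labels = {}
    for i, x in enumerate(objects):
        # Objects that may not share a cluster with object i:
        forbidden = {b for a, b in cannot_link if a == i} | \
                    {a for a, b in cannot_link if b == i}
        # Try centroids from nearest to farthest.
        for j in sorted(range(len(centroids)),
                        key=lambda c: dist(x, centroids[c])):
            if all(labels.get(o) != j for o in forbidden):
                labels[i] = j
                break
        else:
            raise ValueError(f"no feasible cluster for object {i}")
    return [labels[i] for i in range(len(objects))]

pts = [(0, 0), (0.5, 0), (10, 10)]
# Unconstrained, objects 0 and 1 would both join the cluster at (0, 0);
# the cannot-link constraint pushes object 1 to the other cluster.
print(constrained_assign(pts, [(0, 0), (10, 10)], cannot_link={(0, 1)}))
# -> [0, 1, 1]
```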


Figure 1: A schematic of a two-dimensional k-means clustering solution with (a) k = 3 and (b) k = 6. 




Figure 2: A schematic illustration of a common k-means clustering algorithm. (a) After an initialisation step which places k centroids (filled diamonds) among the objects (filled circles), objects are assigned to a cluster based on their distances to these centroids (b). An update step then recalculates the position of the centroids to reflect the mean location of the objects in a given cluster (c). Steps (b) and (c) are repeated until a stable solution is found (d). 
Warnings
 The k-means algorithm can become 'stuck' in local optima. Repeating the clustering with different initialisations and adding noise to the data can help evaluate the robustness of a solution.
 The k-means criterion will favour higher values of k. This is not necessarily desirable, and users should consider carefully which values of k are sensible for their data set.
 The full k-means algorithm is computationally intensive. Heuristic algorithms are available for large data sets.
 Solutions may vary with different distance measures.
 The distances of objects to their cluster centroids are the primary factor shaping a k-means solution; the algorithm is insensitive to other features of an object population. Thus, verify that the populations of interest can be classified well by their distance to their multivariate mean.
 In the original k-means algorithm, each object must be assigned to exactly one cluster. This may be problematic when an object is equidistant from two or more centroids. It is also problematic when outliers are present: these will be forced into a cluster, affecting the position of its centroid.
 If populations of objects overlap, the k-means algorithm may provide biased estimates of their centroids, which are 'pulled' towards regions where less (or no) overlap occurs.
 Standardising variables may change the solution. Prior to standardisation, variables with larger ranges will contribute more to the distances between objects.
 Comparing clustering solutions with different values of k may be problematic, as many solutions for each value would have to be examined to avoid comparing solutions that represent local optima.
 When using Euclidean distances, variables which are correlated will naturally influence object positioning in similar ways. Thus, if a large proportion of the variables in an analysis are strongly correlated, the influence of other variables may be poorly represented. To avoid this, Davidson (2002) recommends performing factor analysis on the variables and, if it is successful, using the extracted factors as input for k-means clustering.
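The standardisation warning above is easy to demonstrate numerically (a small illustration added here, not from the original text): with one variable spanning hundreds of units and another spanning less than one, the raw Euclidean distance is decided almost entirely by the first variable, whereas z-scored variables contribute equally.

```python
from math import dist
from statistics import mean, stdev

# Two objects measured on two variables with very different ranges.
a, b = (100.0, 0.1), (900.0, 0.9)

# Unstandardised, the Euclidean distance is dominated by variable 1:
print(round(dist(a, b), 3))  # -> 800.0

def standardise(points):
    """z-score each variable (column) across the given points."""
    cols = list(zip(*points))
    return [tuple((v - mean(col)) / stdev(col) for v, col in zip(p, cols))
            for p in points]

# After standardisation, both variables contribute equally:
za, zb = standardise([a, b])
print(round(dist(za, zb), 3))  # -> 2.0
```

Whether such equal weighting is appropriate depends on whether the variables' raw scales carry meaning for the populations being clustered.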
References
