Statistical Associates Publishers

## Cluster Analysis: 10 Worst Pitfalls and Mistakes

1. Thinking cluster analysis and factor analysis are equivalent methods.
Factor analysis finds similarities based on partial coefficients, which control for other variables in the model. Cluster analysis finds similarities based on paired distances and does not control for other variables in the model.
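To make the distinction concrete, here is a minimal Python sketch (using illustrative random data, not data from the text) showing that the raw material of cluster analysis is simply a matrix of paired distances between observations:

```python
# Sketch: cluster analysis groups cases by pairwise distances, not by
# partial coefficients. The data below are illustrative only.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two illustrative groups of observations measured on two variables
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])

# The basis of cluster analysis: a matrix of paired distances
D = squareform(pdist(X, metric="euclidean"))
print(D.shape)  # (10, 10)

# Hierarchical clustering is built directly from those distances
labels = fcluster(linkage(pdist(X), method="average"), t=2, criterion="maxclust")
print(labels)
```

No variable is "controlled for" anywhere in this computation; every variable contributes directly to the distance metric.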

2. Not randomizing the data.
The solutions in K-means cluster analysis, two-step cluster analysis, and certain other types of cluster analysis depend on the order in which observations are entered, so multiple randomized runs are needed.
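One way to implement this precaution, sketched below in Python with scikit-learn on illustrative data: repeat k-means from many random starts and confirm that reordering the rows does not change the solution.

```python
# Sketch (illustrative data): guard against order/initialization dependence
# by running k-means from multiple random starts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# n_init repeats the algorithm with different random starts; the run with
# the lowest within-cluster sum of squares (inertia) is retained.
best = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Shuffling the row order should not change the substantive solution
perm = rng.permutation(len(X))
shuffled = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[perm])
print(np.isclose(best.inertia_, shuffled.inertia_))
```

If the two inertias differ, the solution is order- or start-dependent and should not be trusted without further randomized runs.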

3. Using ordinary cluster analysis methods with repeated measures data.
Cluster analysis usually assumes data independence and will give erroneous results when this assumption is violated. The FAQ section in our "Cluster Analysis" book discusses special strategies used for dependent data.

4. Not standardizing variables used for clustering.
Failing to standardize is usually a mistake in cluster analysis because variables with larger magnitudes will dominate the distance calculations. However, standardizing can also be a mistake for certain research purposes, as discussed in a FAQ section of our "Cluster Analysis" book.
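A brief Python sketch of the standard remedy, using invented variables for illustration: z-score standardization puts a small-scale variable (age, in tens) and a large-scale variable (income, in tens of thousands) on an equal footing before distances are computed.

```python
# Sketch (illustrative data): z-score standardization keeps the
# large-magnitude variable from dominating the distance metric.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
age = rng.normal(40, 10, 100)             # scale: tens
income = rng.normal(50_000, 15_000, 100)  # scale: tens of thousands
X = np.column_stack([age, income])

# After standardization both variables have mean ~0 and std ~1,
# so each contributes comparably to Euclidean distances.
Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0).round(6))  # approximately [0, 0]
print(Z.std(axis=0).round(6))   # approximately [1, 1]
```

Without this step, distances between cases would be driven almost entirely by income, and age would be effectively ignored.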

5. Predetermining the optimal number of clusters.
One of the usual purposes of doing cluster analysis is to determine (not assume beforehand) the optimal number of clusters. This is not strictly a statistical question but rather depends on the research purpose. Researchers should examine a range of possible solutions and undertake sensitivity analysis.
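Examining a range of solutions can be sketched as follows in Python, on illustrative data with three built-in groups; the silhouette score used here is one common guide, not a definitive criterion.

```python
# Sketch: fit k-means for a range of k and compare, rather than fixing
# k in advance. Data are illustrative, with three planted groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)  # inspect the whole range; a clear peak suggests a candidate k
```

The final choice should still weigh interpretability and research purpose, not the index alone.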

6. Using automatic methods to select the optimal number of clusters.
Some clustering procedures allow for automatic selection of the "optimal" number of clusters. Like other stepwise procedures, this is a data-driven approach which may overfit the model to noise in the data and which may not replicate. Cross-validation, using a validation dataset separate from the development dataset, is a well-advised precaution if automatic selection is to be used.
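One simple form of this cross-validation check, sketched in Python on illustrative data: cluster centers learned on a development sample should assign a separate validation sample in essentially the same way as clustering the validation sample from scratch.

```python
# Sketch (illustrative data): holdout check of whether a cluster
# solution replicates on a separate validation dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.7, (100, 2)) for c in (0, 5)])
rng.shuffle(X)
dev, val = X[:100], X[100:]

km_dev = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dev)
transferred = km_dev.predict(val)  # development solution applied to validation
refit = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(val)

# Agreement near 1 (adjusted Rand index) suggests the solution replicates
print(adjusted_rand_score(transferred, refit))
```

A solution that was overfit to noise in the development data would show much lower agreement on the holdout sample.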

7. Ignoring the shape of the clusters.
Hierarchical cluster analysis, k-means cluster analysis, and most forms of clustering assume spherical clusters. However, elongated ellipses and other irregular shapes are not uncommon, and it may be necessary to sphericize the data in order to obtain optimal clustering. In SAS, PROC ACECLUS may be used as a pre-processing step to sphericize the data and better separate clusters.
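The following Python sketch illustrates the problem and the remedy on invented data: k-means splits two elongated clusters along the wrong (long) axis, while rescaling the axes so that within-cluster spread is spherical (here done with the known spread, by loose analogy with what PROC ACECLUS estimates from the data) restores the correct split.

```python
# Sketch (illustrative data): spherical-cluster methods can split
# elongated clusters along the wrong axis; sphericizing first helps.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
within_sd = np.array([6.0, 0.4])                          # elongated along x
a = rng.normal(0, 1, (150, 2)) * within_sd
b = rng.normal(0, 1, (150, 2)) * within_sd + [0.0, 3.0]   # separated in y
X = np.vstack([a, b])
truth = np.repeat([0, 1], 150)

km_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari_raw = adjusted_rand_score(truth, km_raw)   # low: splits along long x axis

# Sphericize: rescale so within-cluster spread is equal on both axes
# (here we use the known spread purely for illustration)
X_sph = X / within_sd
km_sph = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sph)
ari_sph = adjusted_rand_score(truth, km_sph)   # high: recovers the true split
print(ari_raw, ari_sph)
```

In practice the within-cluster covariance is unknown and must be estimated, which is precisely the service ACECLUS provides.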

8. Applying significance tests in the usual way.
Although significance tests, such as F tests, are output in cluster analysis, they should be considered useful for exploratory purposes only and not used as if they were conventional significance tests, the assumptions of which are inherently violated by the clustering process. An exception is nonparametric density cluster analysis (PROC MODECLUS in SAS), which does generate valid p values and may be selected for this reason.

9. Thinking a given cluster analysis technique can only cluster variables or only cluster observations.
While a given clustering technique may be associated with clustering observations or with clustering variables, it is always possible to transpose the data matrix and cluster either, even if the software is not explicitly set up for this option. Of course, depending on the research context, it is not always the case that both are appropriate.
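The transposition trick can be shown in a few lines of Python on illustrative data: the same hierarchical procedure clusters observations when given the data matrix and clusters variables when given its transpose.

```python
# Sketch (illustrative data): transposing the matrix switches between
# clustering observations (rows) and clustering variables (columns).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(11)
base = rng.normal(size=(30, 1))
# Four variables: the first two track +base, the last two track -base
X = np.hstack([base + rng.normal(0, 0.1, (30, 2)),
               -base + rng.normal(0, 0.1, (30, 2))])

obs_labels = fcluster(linkage(pdist(X)), t=2, criterion="maxclust")    # rows
var_labels = fcluster(linkage(pdist(X.T)), t=2, criterion="maxclust")  # columns
print(len(obs_labels), len(var_labels))  # 30 observations, 4 variables
```

Here the variable clustering correctly pairs the first two columns together and the last two together, because they were constructed to co-vary.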

10. Not meeting the assumptions of cluster analysis.
Our book, listed below, enumerates eight key assumptions of cluster analysis in its "Assumptions" section.