Cluster Analysis

Background to K-Means Clustering

Objective: create clusters of items, individuals, or objects so that the cases within a cluster are similar to one another but differ from the cases in other clusters.

Document prepared by Robert L. Andrews, April 2005; revised April 2011
  • The items, individuals or objects being placed into clusters will be referred to as cases. The degree of similarity or dissimilarity may be determined from the recorded values for one or multiple characteristics for the cases. There are no dependent variables for cluster analysis.
  • Clustering procedures require that similarity be quantified.
    One quantitative measure for interval scale data is the distance between cases.
  • Euclidean distance is the length of the straight line between two cases in the space of the measured characteristics. The numeric value of the distance between cases depends on the measurement scales used.
  • If the measurements are recorded using different measurement scales, then one should standardize them (for example, by converting each characteristic to z-scores) to assure similar variability of measurements for all characteristics being used to create the clusters, as in the sketch after this list.

  • Other measures may be used to create a dissimilarity or distance matrix that can be used as the basis for creating clusters (see the Measures for Interval Data section under Hierarchical Cluster Analysis).

  • A key issue in obtaining a set of clusters is determining the number of clusters. Hierarchical procedures provide output that allows the analyst to decide on the number of clusters, often by examining tabular or graphical output to identify the gaps that define logical clusters.
  • Hierarchical clustering requires a distance or similarity matrix between all pairs of cases: n cases yield n(n-1)/2 pairwise distances, which is an extremely large matrix if the data file contains tens of thousands of cases (see the sketch below).
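
The ideas in the bullets above (standardizing to z-scores, computing Euclidean distances, and looking for gaps that suggest a number of clusters) can be sketched in a few lines of Python. This sketch is an illustration added for readers working outside SPSS; the data, the variable names, and the choice of three clusters are assumptions, not part of the original procedure.

# Minimal sketch (illustrative, not the handout's SPSS procedure): z-score
# standardization, Euclidean distance, and hierarchical clustering with SciPy.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# 20 cases measured on two characteristics recorded on very different scales
# (hypothetical example: age in years and income in dollars).
X = np.column_stack([rng.normal(50, 10, 20),
                     rng.normal(60000, 15000, 20)])

# Standardize each characteristic to mean 0 and standard deviation 1 (z-scores)
# so that no single characteristic dominates the distance calculation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Euclidean distance between the first two cases:
# the square root of the sum of squared differences across characteristics.
d01 = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))
print(f"distance between cases 0 and 1: {d01:.2f}")

# Full matrix of pairwise distances: n cases give n*(n-1)/2 distinct distances,
# which is why hierarchical clustering becomes impractical for very large files.
D = squareform(pdist(Z, metric="euclidean"))
print("distance matrix shape:", D.shape)

# Agglomerative (hierarchical) clustering; large gaps between successive merge
# distances suggest a natural number of clusters.
merges = linkage(Z, method="ward")
print("merge distances:", np.round(merges[:, 2], 2))
labels = fcluster(merges, t=3, criterion="maxclust")  # cut the tree at 3 clusters
print("cluster sizes:", np.bincount(labels)[1:])

Examining the printed merge distances for a large jump plays the same role as examining the tabular or graphical output mentioned above when deciding how many clusters to keep.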

K-means Clustering




Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

To obtain a K-Means cluster analysis, from the menus choose:

Analyze  >  Classify  >  K-Means Cluster...    
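
As a point of comparison only, the following is a rough sketch of an equivalent K-means run outside SPSS, using Python and scikit-learn; the data, the number of clusters (three), and all names are illustrative assumptions rather than part of the SPSS procedure.

# Rough non-SPSS equivalent of Analyze > Classify > K-Means Cluster...,
# sketched with scikit-learn on invented data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))          # 200 cases, 4 standardized characteristics

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print("final cluster centers:\n", np.round(km.cluster_centers_, 2))
print("cluster membership of first 10 cases:", km.labels_[:10])

# Distance of each case from its own cluster center (comparable to the
# per-case distance from cluster center that SPSS can report).
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("distance from cluster center, first 5 cases:", np.round(dist[:5], 2))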

(Descriptions below were originally copied from SPSS 13.0 Help; I have made slight additions.)

Cluster Analysis

Cluster analysis can be an effective tool for identifying extreme data values in a multivariate data set: extreme points tend to form small clusters by themselves, while the vast majority of the other points fall into one or more well-populated clusters.
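
As a sketch of this use of clustering (an illustration added here, with invented data, an assumed choice of four clusters, and an assumed size cutoff of three cases), one can plant a few extreme cases, run K-means, and flag the clusters that contain only a handful of cases:

# Sketch: flagging extreme cases as the members of very small clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(197, 3)),                  # well-populated main body
               [[9, 9, 9], [10, 8, 9], [-8, -9, -10]]])    # three planted extreme cases

km = KMeans(n_clusters=4, n_init=10, random_state=2).fit(X)
sizes = np.bincount(km.labels_)
print("cluster sizes:", sizes)

# Clusters with only a few members are candidates for extreme data values.
for c in np.where(sizes <= 3)[0]:
    print(f"cluster {c}: cases", np.where(km.labels_ == c)[0])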

When performing hierarchical cluster analysis in SPSS, one can cluster either cases or variables by selecting the corresponding option in the initial menu. The default is cases, since clustering cases is more common, but clustering variables may be desirable in certain situations.
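
If variables are to be clustered, one common approach (shown below as an assumed illustration, not the SPSS procedure) is to treat 1 minus the absolute correlation between two variables as their dissimilarity and pass that matrix to a hierarchical routine:

# Sketch: clustering variables rather than cases, using 1 - |correlation|
# as the dissimilarity between variables. The data are invented so that
# columns 0-2 and columns 3-5 form two related groups.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0] + rng.normal(0, 0.3, 100) for _ in range(3)] +
                    [base[:, 1] + rng.normal(0, 0.3, 100) for _ in range(3)])

corr = np.corrcoef(X, rowvar=False)   # correlations between the six variables
dissim = 1 - np.abs(corr)             # highly correlated variables -> small distance
np.fill_diagonal(dissim, 0.0)         # remove rounding noise on the diagonal
merges = linkage(squareform(dissim, checks=False), method="average")
print("variable cluster labels:", fcluster(merges, t=2, criterion="maxclust"))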

In K-means and similar nonhierarchical methods, the desired number of clusters is specified in advance and the 'best' solution is chosen. The steps in such a method are as follows: