Topic 7B: Big Data I (Clustering)

🏡 Welcome to the seventh computer lab for the Science/Health stream of STM1001.

In this computer lab we will carry out a variety of clustering analyses using penguin data from the palmerpenguins R package (Horst, Hill, and Gorman 2020).

By the end of this lab, you should feel comfortable applying various clustering techniques in jamovi. Let’s get started!

🏡 This week’s computer lab is based on the Topic 7B lecture, where we discussed Clustering. If you did not attend the lecture and have not yet reviewed the recording, you can review the slides here.

1 Cluster Analysis

🏡 A cluster is a subset of a data set that consists of observations which, using a chosen metric, are similar to each other, and also dissimilar to other observations¹.

Cluster Analysis is the method of grouping data into clusters, using a clustering technique. There is a large variety of clustering techniques, and each of these techniques uses a different metric for determining the similarity between observations. Don’t worry, we won’t go into all the complicated mathematics involved in these techniques - our focus is on learning how to conduct cluster analyses in jamovi.

Purpose

Clustering is often used as part of an exploratory data analysis procedure, to uncover hidden patterns in a new data set. The main purposes of clustering² are to:

Analyse the data structure
Relate the different elements of the data to each other, and then, most importantly
Aid in classifying the data into certain classes

It is worth noting that in general, we may not be aware of what these classes are, prior to the clustering analysis³.

Clustering Techniques

In this computer lab, we will apply the following popular clustering techniques:

\(k\)-means Clustering
Hierarchical Clustering (optional)

We will provide brief explanations of each of these techniques as we work through this computer lab.

2 Preparations

2.1 Penguin Data

🏡 The penguins data set in the palmerpenguins R package (Horst, Hill, and Gorman 2020) contains recent data on 3 species of penguin (Adelie, Chinstrap and Gentoo) living on islands in the Palmer archipelago, off the coast of Antarctica. We will use this data for our clustering analyses.

This penguin data contains information on numerous variables. The table below provides an example of some recorded values for these different variables (the column names), for 3 different penguins:

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Gentoo	Biscoe	49.2	15.2	221	6300	male	2007
Chinstrap	Dream	45.5	17.0	196	3500	female	2008

2.2

🏡 Consider the below matrix of scatter plots for the different variables in the penguins data set:

Note here that each scatter plot shows two variables from the penguins data set, plotted against each other.

The variable names are shown on the diagonal boxes - just cross reference a plot to the variable names in that row and column to determine the variables plotted on the \(y\)-axis and \(x\)-axis respectively.

Hint: For example, the plot shown in the sixth row, fifth column should be body_mass_g plotted on the y-axis against flipper_length_mm on the x-axis.

2.3

🏡 Looking at the scatter plots for the continuous random variables, does it appear that there are any natural clusters that will be easy to identify using a clustering technique? 💬

2.4 Preparing data for Cluster Analysis

🏡 For the purposes of this computer lab, we will assume that we don’t actually have data on the species of each penguin in the penguins data set. Instead, we will conduct cluster analyses using the other information contained in this data set.

If the cluster technique works well, we may end up with clusters for each penguin species!

However, before we begin our clustering analyses, it is important to note the following:

The clustering algorithms we will use in this lab work on continuous variables. As we have seen, the penguins data set contains both continuous and categorical (discrete) variables. Therefore, for our cluster analysis, we will only use a subset of the variables in the penguins data set, namely bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g.
Before conducting a cluster analysis, it is also generally a good idea to normalise our data (this process is like the standardisation process covered in section 4.2 of Topic 3). This will remove the effect of different variables being measured on different scales (e.g. flipper length in mm and body mass in grams), and can prevent any single variable overpowering others when being assessed by the cluster algorithm.

Normalising the data consists of two steps:

First, we compute the sample mean and sample standard deviation for each variable.
Second, we subtract the relevant sample mean from each observation, and then divide by the relevant sample standard deviation.

For example, across the whole data set the sample mean for bill_depth_mm is roughly 17.151, and the sample standard deviation is roughly 1.975. The first penguin in the data set has a recorded bill_depth_mm value of 18.7. The normalised version of this value would therefore be \[\displaystyle \frac{18.7 - 17.151}{1.975} \approx 0.78.\]
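The two normalisation steps can also be sketched in a few lines of Python. This is purely illustrative (the lab itself uses jamovi, and the data file has already been normalised for you). The function name `normalise` and the `toy_depths` values are made up for this example and are not the real penguin data.

```python
from statistics import mean, stdev

def normalise(values):
    """Return (value - sample mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]

# Worked example from the text, using the rounded summary statistics quoted above.
z_first = (18.7 - 17.151) / 1.975
print(round(z_first, 2))

# Hypothetical values for illustration only (NOT the real penguin data).
toy_depths = [18.7, 17.4, 18.0, 15.2, 17.0]
z = normalise(toy_depths)
print([round(v, 2) for v in z])
```

After normalisation, the values have a sample mean of 0 and a sample standard deviation of 1, which is what puts variables measured on different scales on an equal footing.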

The penguins_jamovi.csv file on the LMS has been prepared for our cluster analysis and contains the following variables:

index: Row number, can be used as identifier
Normalised (or standardised) versions of bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g
species (Adelie, Gentoo or Chinstrap)
species_numeric: Species stored in numeric format, i.e. with a number from 1-3 to represent each species.

2.4.1 Carry out these steps before the next question

🏡 Before moving on to the next question, carry out the following steps:

Download the file called penguins_jamovi.csv from the LMS and open it in jamovi.
Install the snowCluster module (Seol 2022; Kassambara and Mundt 2020) in jamovi (The jamovi project 2022).

3 \(k\)-means Clustering

💻 We will now begin our cluster analysis.

The first clustering technique we will consider is \(k\)-means clustering, which is a form of centroid-based clustering. A simplified description of \(k\)-means clustering is presented below.

We begin by arbitrarily choosing a number of clusters \(k\) to use.
Each data point from our data set is assigned to one of these \(k\) initial clusters, based on the distance between the point and the mean of all points.
The centroid, i.e. the mean of each cluster, is then calculated, and an iterative process begins:
- Points are moved between clusters, one point at a time, depending on how close they are to each centroid.
- This process continues, until no point can be moved between clusters without increasing the average distance between the points and the centroids⁴. At this point the clustering stops.
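For those curious about what such an algorithm looks like in code, here is a minimal Python sketch. Please note the assumptions: it implements Lloyd's algorithm, a common \(k\)-means variant that reassigns all points in each pass (rather than moving one point at a time, as described above), and it picks \(k\) random data points as starting centroids. It is not necessarily the algorithm jamovi runs in the background, and the toy points are made up.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=1):
    """Lloyd's algorithm sketch: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: k random points as starting centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of the points in its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster happens to be empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(toy_points, k=2)
print(labels)
```

With data this well separated, the first three points should end up in one cluster and the last three in the other, regardless of the starting centroids.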

The following video demonstrates \(k\)-means clustering in jamovi and will be useful for the remainder of this question.

Note: The snowCluster module has been updated since this video was recorded. All of the changes are minor, and will not impact our studies in STM1001. All of the required options are still available, but some of them now appear in slightly different locations as compared with the video.

Check the Clustering Lab 7B Update pdf on the LMS for a list of the changes.

3.1

💻 Following the steps shown in the above video, carry out a \(k\)-means cluster analysis in jamovi using \(k = 3\) clusters.

3.2

💻 Let’s discuss the results.

We are primarily interested in the cluster membership assigned by the algorithm. Check that this information has now been added to the data set in jamovi. (If you don’t see it, check that you ticked the Cluster number option in the Save section of the cluster analysis you carried out in jamovi.)
Calculate the between_ss / total_ss value. (Note that between_ss is the “Between clusters” value and total_ss is the “Total” value, both from the Sum of squares table in jamovi.)

To further understand the sum of squares table, note that:

between_ss denotes the sum of squares between clusters (i.e., how well the clusters are separated from each other),
total_ss denotes the total variability in the data.
The calculation between_ss / total_ss tells us how much of the variability in the data is accounted for by the clusters found by the algorithm. This value can range from 0% to 100%, with larger values indicating a better result. Here, 72.1% is decent, although we might have expected a slightly higher result.
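To make the sum of squares idea concrete, the following Python sketch computes the ratio by hand for a tiny made-up data set. It relies on the standard decomposition total_ss = within_ss + between_ss, where within_ss is the variability of points around their own cluster centroid. The function names and the toy points are hypothetical, and the result has nothing to do with the 72.1% obtained for the penguins.

```python
def centroid(points):
    """Mean of each coordinate across the given points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sum_sq(points, centre):
    """Sum of squared Euclidean distances from each point to the centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)

def between_over_total(points, labels):
    total_ss = sum_sq(points, centroid(points))
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within_ss += sum_sq(members, centroid(members))
    between_ss = total_ss - within_ss  # total = within + between
    return between_ss / total_ss

# Hypothetical toy data: two tight, well-separated clusters.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
ratio = between_over_total(toy_points, toy_labels)
print(f"{100 * ratio:.1f}%")
```

Here total_ss is 104 and within_ss is 4, so the clusters account for 100/104 of the variability, i.e. roughly 96%.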

3.3 Visualising our results

💻 We can visualise our results in several ways.

Firstly, consider the following matrix of scatter plots of the continuous variables, coloured by cluster.

Recall from 2.2 that each scatter plot here shows two of the variables in the penguins data set plotted against each other. Now however, they are coloured by cluster (as determined by the \(k\)-means method).

If, by inspecting these plots, you can clearly distinguish between clusters, then the clustering method has performed well.

Hint: If you are having trouble interpreting the graph, this note may help. For example, the plot shown in the fifth row, fourth column should be body_mass_g plotted on the y-axis against flipper_length_mm on the x-axis.

3.3.1

💻 Using jamovi, create the following plots:

A bar plot of species separated by cluster
A scatter plot with index on the \(x\)-axis, species_numeric on the \(y\)-axis, and grouped by Clustering

Based on these plots, and the plot produced in 3.3, do you think the \(k\)-means clustering has performed well? 💬

Note: If you are unable to drag the Clustering variable across to the Variables box in the Exploration -> Descriptives section, this may be because the Clustering variable isn’t classified correctly. The easiest fix for this is to copy the data in the Clustering column, and paste it into a fresh column, which you can call e.g. Clustering V2. You should be able to specify this variable to be a nominal variable in the data setup.

Please also check that you have the latest solid version of jamovi installed, and update your version if not - you can download jamovi here. The current solid version will be around 2.6.44 (as at September 2025).

3.4

💻 At this point, it might seem that our job is done. We have results for \(k=3\), and there are three species of penguins.

Remember though, normally we would not actually know what the number of clusters should be. Therefore, we would usually try a range of different \(k\) values.

Conduct new \(k\)-means cluster analyses on our data, for \(k\) values of 2 and 4. (Remember to click on the snowCluster module and again choose K-means clustering method to begin a new \(k\)-means cluster analysis.)

Note: If you notice an error message saying “number of cluster centres must lie between 1 and nrow(x)”, this message can be ignored as it is likely due to a bug in jamovi.

Note down the between_ss / total_ss results for each of the \(k\) values of 2, 3 and 4. What do you observe? 💬

3.5 Diagnostics

🏡 You might have noticed that the between_ss / total_ss value increases as we increase the number of clusters. This can happen even if adding another cluster isn’t actually that helpful. Therefore, we should also consider some other assessment methods.

If we did not know the number of clusters, we could use several methods to determine the appropriate value of \(k\).

Within the snowCluster module, jamovi uses the factoextra R package in the background. In R, this package can be used to produce three types of plots to help choose the most appropriate value for \(k\), with the third option being available in jamovi:

The “wss” method
The “silhouette” method
The “gap_stat” method (currently the method available in jamovi)

For the purposes of this lab, we will focus on just the silhouette (see Rousseeuw 1987) and gap methods.

3.5.1 The silhouette method

🏡 The silhouette assessment method measures the similarity of a point to other points in its cluster.

The silhouette assessment method (Rousseeuw 1987) assigns a value between -1 and 1 to each point to indicate how clearly the point belongs to its allocated cluster, compared to the closest alternative. Negative values suggest that a point may sit closer to a neighbouring cluster than to its own.

Average silhouette width values close to 1 indicate a clear grouping, while values close to 0 indicate many boundary points which could reasonably be allocated to a different cluster.

Thus higher values are preferable to lower values.
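As an optional illustration, the silhouette value of a single point can be written as \(s = (b - a)/\max(a, b)\), where \(a\) is the mean distance from the point to the other points in its own cluster and \(b\) is the mean distance to the points in the nearest other cluster (Rousseeuw 1987). In this general definition, \(s\) lies between -1 and 1. The Python sketch below applies the formula to a made-up data set. The function name and toy points are hypothetical, and this is not the code jamovi or factoextra runs.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value s = (b - a) / max(a, b) for every point."""
    cluster_ids = set(labels)
    values = []
    for i, p in enumerate(points):
        def mean_dist_to(cluster_id):
            # Mean distance from point i to the members of one cluster (excluding itself).
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == cluster_id and j != i]
            return sum(dists) / len(dists) if dists else None

        a = mean_dist_to(labels[i])
        if a is None:  # convention: a point alone in its cluster gets s = 0
            values.append(0.0)
            continue
        b = min(mean_dist_to(c) for c in cluster_ids if c != labels[i])
        values.append((b - a) / max(a, b))
    return values

# Hypothetical toy data: two obvious groups, on the left and on the right.
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_values(toy_points, [0, 0, 1, 1])  # sensible clustering
bad = silhouette_values(toy_points, [0, 1, 0, 1])   # deliberately poor clustering
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible clustering gives a high average silhouette width, while the deliberately poor one gives a negative average, which matches the idea that higher values are preferable.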

By considering the silhouette plot provided below, determine the optimal number of clusters for our data set:

3.5.2 The gap statistic method

🏡 The second \(k\)-means clustering assessment method we will consider is the gap_stat assessment method. This computes a statistic known as the ‘gap statistic’. We won’t go into the mathematical details of this statistic; suffice to say that higher values are considered preferable.

The “Optimal number of clusters” plot produced in jamovi uses the gap statistic method. Using this plot, determine the optimal number of clusters for our data. 💬

3.5.3 Cluster plots

🏡 We can also visualise the clusters using the Cluster plot produced in jamovi.

Compare the cluster plots produced for each of the \(k = 2\), \(k = 3\) and \(k = 4\) analyses. Which one do you prefer, and why? 💬

3.6

🏡 In conclusion, based on all your \(k\)-means clustering analyses, which value of \(k\) would you recommend using, and why? 💬

At this point, we’ve covered the main material for this computer lab - well done! If you would like, you can extend your knowledge on clustering by going through the Extension section below. Otherwise, if you have time left, you might like to work on your assessments.

4 Extension: Hierarchical Clustering

Unlike \(k\)-means clustering, hierarchical clustering begins by assigning each point to its own cluster.

This means that, for our penguins data set, we would start with 333 clusters - 1 for each penguin. Then, the two most similar clusters are merged (continuing our example, this would result in us now having 332 clusters). We repeat this process, until we have just one large cluster (which contains all the points).

As a result, we now have a hierarchy of clusters, which looks a bit like an upside-down tree (with observations from the same cluster on the same ‘branch’).
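The merging process described above can be sketched in Python as follows. This is an illustration only: it assumes "average" linkage (the mean distance between all pairs of points from the two clusters) as the measure of similarity, whereas jamovi's default is ward.D2. The function names and toy points are made up.

```python
import math
from itertools import combinations

def average_linkage(a, b):
    """Mean distance over all pairs of points, one from each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, linkage=average_linkage):
    """Repeatedly merge the two closest clusters; record each merge."""
    clusters = [[p] for p in points]  # start with one cluster per point
    history = []                      # (merge height, clusters remaining)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append((height, len(clusters)))
    return history

# Hypothetical toy data: 4 points, so we expect 3 merges (4 -> 3 -> 2 -> 1 clusters).
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1.5)]
history = agglomerate(toy_points)
for height, n_clusters in history:
    print(f"merged at height {height:.2f}, {n_clusters} cluster(s) remain")
```

The recorded merge heights are what a dendrogram displays on its vertical axis: cutting the tree at a given height leaves the clusters that had not yet been merged at that point.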

4.1

By clicking on the snowCluster module and then choosing Hierarchical clustering dendrogram, carry out hierarchical clustering on the penguins data set.

Un-tick Standardize data
Decide on clusters by height (leave default as 15)

4.2

We can visualise the results of the hierarchical clustering using a dendrogram and a pairs plot.

In the Plots section, select the Plot dendrogram and Plot pairs of variables options.

4.3

Do you understand what is shown in both plots, and are you able to identify how many clusters have been decided on by looking at the dendrogram? Check with your computer lab demonstrator if you are not sure. Note that it is also possible to select the number of clusters directly in the Decide on clusters section.

4.4

There are several linkage method options we can choose from when conducting hierarchical clustering. These relate to the way in which distance is measured between two clusters, to determine their level of similarity.
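To give a feel for how linkage methods differ, the Python sketch below computes three common linkage distances between the same two made-up clusters: "single" (closest pair of points), "complete" (furthest pair) and "average" (mean over all pairs). The ward.D2 method is based on a different criterion and is not shown here. All names and values in this sketch are hypothetical.

```python
import math

def pair_distances(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(pair_distances(a, b))    # closest pair of points

def complete_linkage(a, b):
    return max(pair_distances(a, b))    # furthest pair of points

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)              # mean over all pairs

# Hypothetical toy clusters.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
single = single_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
print(single, round(average, 3), round(complete, 3))
```

Because each method measures the distance between clusters differently, the order in which clusters are merged, and therefore the shape of the dendrogram, can change with the linkage method.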

The default hierarchical clustering method in jamovi is ward.D2. Try re-running your hierarchical clustering analysis using some other options, by instead selecting "average" and then "single".

When you visualise the results, do the dendrograms look very different?

Great work, that’s everything for today!

References

Gan, G., C. Ma, and J. Wu. 2007. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability 20. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). https://doi.org/10.1137/1.9780898718348.

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.

Kassambara, A., and F. Mundt. 2020. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.

Mirkin, B. 1996. Mathematical Classification and Clustering. 1st ed. Nonconvex Optimization and Its Applications 11.

Rousseeuw, P. J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53–65.

Seol, H. 2022. snowCluster: Multivariate Analysis. https://github.com/hyunsooseol/snowCluster.

The jamovi project. 2022. Jamovi [Computer Software]. https://www.jamovi.org.

Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.

These notes have been prepared by Amanda Shaker and Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

¹ Mirkin (1996), p.25
² See e.g. Mirkin (1996), p.25
³ See e.g. Gan, Ma, and Wu (2007)
⁴ See e.g. section 4.10.3 of Thulin (2021)

STM1001: Computer Lab