Topic 7B: Big Data I (Clustering)


These are the solutions for the Topic 7B Computer Lab.

1 Cluster Analysis

No solution required.

2 Preparations

2.1 Penguin Data

No solution required.

2.2

No solution required.

2.3

For several of the scatter plots of the continuous variables, there does appear to be visible clustering; for example, flipper_length_mm vs bill_depth_mm suggests that there are two distinct clusters. Applying a clustering technique therefore seems warranted.

2.4 Preparing data for Cluster Analysis

No solution required.

2.4.1 Carry out these steps before the next question

No solution required.

3 \(k\)-means Clustering

3.1

No solution required.

3.2

For the between_ss / total_ss value, we have 958.347 / 1328 ≈ 0.72, i.e. 72%.
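As a rough illustration of where this ratio comes from, here is a minimal Python sketch (the lab itself presumably works from clustering output in other software, so this is not the lab's code). It assumes the usual decomposition total_ss = within_ss + between_ss; the toy points, the labels and the helper names `mean_point`, `sum_sq` and `ss_ratio` are all made up for the example.

```python
# Toy illustration of between_ss / total_ss (data and labels are invented).

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def sum_sq(points, centre):
    """Sum of squared Euclidean distances from each point to a centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)

def ss_ratio(points, labels):
    """between_ss / total_ss, using between_ss = total_ss - within_ss."""
    total_ss = sum_sq(points, mean_point(points))
    within_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within_ss += sum_sq(members, mean_point(members))
    return (total_ss - within_ss) / total_ss

# Two well-separated groups of three points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
toy_ratio = ss_ratio(points, labels)

# The ratio for the penguin data, from the two figures quoted in the solution
reported_ratio = 958.347 / 1328
print(round(toy_ratio, 2), round(reported_ratio, 2))
```

A ratio close to 1 means that most of the total variation is explained by differences between the cluster centres rather than by spread within the clusters.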

3.3 Visualising our results

As we can see, the clusters are actually doing a pretty good job of separating penguins by species!

3.3.1

Based on these plots, it seems that the \(k\)-means clustering has done a great job of grouping the penguins into clusters that correspond to the different species. If you look carefully, there is some error when distinguishing between Chinstrap and Adelie penguins. If we also included more data, such as sex or island, we might be able to distinguish more clearly between these two species.

3.4

The between_ss / total_ss values are:

  • \(k = 2\): 776.989 / 1328 = 0.59 = 59%
  • \(k = 3\): 958.347 / 1328 = 0.72 = 72%
  • \(k = 4\): 1037.978 / 1328 = 0.78 = 78%

The between_ss / total_ss value increases as we increase the number of clusters, so a higher value does not necessarily mean a better clustering. For example, we know that there are not four species of penguin in our data (what happens if we use \(k=10\)?!).
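To make the pattern concrete, the short Python sketch below recomputes the three ratios from the figures listed above and looks at how much each extra cluster adds. The variable names are made up for the example, and reading the shrinking gains as a sign of diminishing returns is an informal interpretation, not a formal test.

```python
# Ratios recomputed from the between_ss and total_ss figures listed above.
total_ss = 1328
between_ss = {2: 776.989, 3: 958.347, 4: 1037.978}

ratios = {k: b / total_ss for k, b in between_ss.items()}

# Gain in the ratio from adding one more cluster
gains = {k: ratios[k] - ratios[k - 1] for k in (3, 4)}

for k in sorted(ratios):
    print(f"k = {k}: ratio = {ratios[k]:.2f}")
for k in sorted(gains):
    print(f"gain from k = {k - 1} to k = {k}: {gains[k]:.3f}")
```

The ratio keeps rising, but the step from three to four clusters adds less than half of what the step from two to three adds. In the extreme case where every observation is its own cluster, the within-cluster sum of squares is zero and the ratio is exactly 100%, which is why the ratio alone cannot be used to choose \(k\).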

3.5 Diagnostics

No solution required.

3.5.1 The silhouette method

Based on this plot, it seems that the optimal number of clusters is two.
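For readers who want to see what the silhouette method measures, here is a minimal Python sketch of the average silhouette width, assuming the plot referred to above shows this quantity against the number of clusters. The toy data, the two labellings and the function name `average_silhouette` are invented for the illustration, and singleton clusters are given a silhouette of 0 by convention.

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette width s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for a cluster with one point
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
two_clusters = [0, 0, 0, 1, 1, 1]    # matches the two visible groups
three_clusters = [0, 0, 2, 1, 1, 1]  # splits one tight group unnecessarily

sil_two = average_silhouette(points, two_clusters)
sil_three = average_silhouette(points, three_clusters)
print(sil_two > sil_three)
```

The number of clusters with the largest average silhouette width is the one the method suggests; here the unnecessary split lowers the average, so the two-cluster labelling is preferred.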

3.5.2 The gap statistic method

Based on this plot, it seems that the optimal number of clusters is four.

3.5.3 Cluster plots

\(k = 2\):

\(k = 3\):

\(k = 4\):

Answers will vary.

3.6

Answers may vary here. Three or four clusters seem reasonable.

4 Extension: Hierarchical Clustering

4.1

No solution required.

4.2

4.3

Check with your computer lab demonstrator if you are not sure.

4.4

The dendrograms all look quite different. The one produced by the ward.D2 method is perhaps the cleanest and easiest to assess.

References

Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.


These notes have been prepared by Amanda Shaker and Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.