Science/Health Science/Data Science Module

Topic 6B: Big Data I (Clustering)


Example R code solutions for the Week 6 Science/Health Science/Data Science Computer Lab, which uses penguin data from the palmerpenguins R package1 (Horst, Hill, and Gorman 2020), are presented below.

1 Cluster Analysis

Purpose

No answer required.

Clustering Techniques

No answer required.

2 Preparations

install.packages(c("factoextra", "ggfortify", "palmerpenguins"))
library(cluster)
library(factoextra)
library(ggfortify)
library(palmerpenguins)

Note: If you are in the Data Science module, you should already have the palmerpenguins package installed.

2.1 Penguin Data

No answer required.

2.2

head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
plot(penguins)

For several of the continuous random variable scatter plots, there does appear to be visible clustering occuring - e.g. flipper_length_mm vs bill_depth_mm suggests that there are two distinct clusters. Therefore applying a clustering technique does seem to be warranted.

2.3 Preparing data for Cluster Analysis

No answer required.

2.3.1

penguins <- na.omit(penguins)
penguins_subset <- penguins[, 3:6]

2.3.2

No answer required.

2.3.3

penguins_scaled <- scale(penguins_subset)
head(penguins_scaled)
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
## [2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
## [3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
## [4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
## [5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
## [6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585

3 k-means Clustering

3.1

kmeans_fit3 <- kmeans(penguins_scaled, 3)

3.2

We can check the cluster membership as follows:

kmeans_fit3$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3
## [149] 2 3 3 2 2 3 2 2 2 3 2 3 2 3 2 3 2 3 3 2 2 2 2 2 3 2 3 3 2 2 3 3 3 2 3 2 3
## [186] 2 3 2 2 3 2 2 3 2 3 2 3 2 3 2 2 2 2 2 3 2 3 2 3 2 3 3 2 3 2 3 3 2 2 3 2 3
## [223] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 3 2 2 3 2 3 2 3 3 2 3 2 3 3 3 2 3 2 3
## [260] 3 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The between_ss /total_ss value is 72.1%.

3.3 Visualising our results

plot(penguins[, c(1,3:6)], col = kmeans_fit3$cluster)

As we can see, the clusters are actually doing a pretty good job of separating penguins by species!

3.4

plot(as.numeric(penguins$species), col = kmeans_fit3$cluster)

Based on these plots, it seems that the k-means clustering has done a great job of clustering the penguins into clusters of different species. If you look carefully, there is some error when distinguishing between Chinstrap and Adelie penguins. If we also include more data such as sex or island, we might be able to more clearly distinguish between these two species.

3.5

kmeans_fit2 <- kmeans(penguins_scaled, 2)
kmeans_fit4 <- kmeans(penguins_scaled, 4)
kmeans_fit5 <- kmeans(penguins_scaled, 5)

The between_ss /total_ss results for these new cluster fits are:

  • 58.5% for \(k=2\),
  • 77.1% for \(k=4\),
  • 82.8% for \(k=5\)

The between_ss /total_ss value actually increases as we increase the number of clusters, which isn’t necessarily accurate - we know there aren’t five species of penguin in our data (what happens if we use \(k=10\)?!).

3.6

No answer required.

3.6.1

kmeans_wss <- fviz_nbclust(penguins_scaled, kmeans, method = "wss")
kmeans_wss

Based on this plot, it seems that the optimal number of clusters is three. After \(k=3\), the Total Within Sum of Squares variance does not decrease as rapidly when more clusters are used.

3.6.2

kmeans_sil <- fviz_nbclust(penguins_scaled, kmeans, method = "silhouette")
kmeans_sil

Based on this plot, it seems that the optimal number of clusters is two.

3.6.3

kmeans_gap <- fviz_nbclust(penguins_scaled, kmeans, method = "gap_stat")
kmeans_gap

Based on this plot, it seems that the optimal number of clusters is four.

3.7

fviz_cluster(kmeans_fit3, data = penguins_scaled)

fviz_cluster(kmeans_fit2, data = penguins_scaled)

fviz_cluster(kmeans_fit4, data = penguins_scaled)

fviz_cluster(kmeans_fit5, data = penguins_scaled)

3.8

Answers may vary here. Three or four clusters seem reasonable.

4 PAM Clustering

4.1

pam_fit2 <- pam(penguins_scaled, 2)
pam_fit3 <- pam(penguins_scaled, 3)
pam_fit4 <- pam(penguins_scaled, 4)
pam_fit5 <- pam(penguins_scaled, 5)

4.2

pam_wss <- fviz_nbclust(penguins_scaled, pam, method = "wss")
pam_sil <- fviz_nbclust(penguins_scaled, pam, method = "silhouette")
pam_gap <- fviz_nbclust(penguins_scaled, pam, method = "gap_stat")

pam_wss

pam_sil

pam_gap

We have different results here, of three, two and five clusters being optimal, respectively.

4.3

The R code below shows results for a cluster choice of three.

fviz_cluster(pam_fit3, data = penguins_scaled)

plot(as.numeric(penguins$species), col = pam_fit3$clustering)

We can see from these plots that the PAM clustering does a very good job of clustering penguins into species (although there is still some trouble distinguishing Adelie and Chinstrap penguins). For the larger numbers of clusters, the algorithm may also be distinguishing between male and female penguins - try checking if this is true.

5 Extension 1: Hierarchical Clustering

5.1

hier_fit <- agnes(penguins_scaled)

5.2

plot(hier_fit, which = 2)

5.3

hier_fit <- agnes(penguins_scaled)
hier_fit_single <- agnes(penguins_scaled, method = "single")
hier_fit_complete <- agnes(penguins_scaled, method = "complete")
hier_fit_ward <- agnes(penguins_scaled, method = "ward")
plot(hier_fit_single, which = 2)

plot(hier_fit_complete, which = 2)

plot(hier_fit_ward, which = 2)

The dendrograms all look quite different. The complete or ward method ones look perhaps the cleanest and easiest to assess.

5.4

The code shown below produces a coloured dendrogram with three clusters, for the hierarchical clustering algorithm using the “complete” measurement method.

penguin_dendro <- hcut(penguins_scaled, 
                k = 3, 
                hc_func = "agnes",
                hc_method = "complete",
                hc_metric = "euclidean") 

fviz_dend(penguin_dendro)

6 Extension 2: Fuzzy Clustering

6.1

The R code below shows fits for the different cluster number options.

fuzzy_fit2 <- fanny(penguins_scaled, 2)
fuzzy_fit3 <- fanny(penguins_scaled, 3)
fuzzy_fit4 <- fanny(penguins_scaled, 4)
fuzzy_fit5 <- fanny(penguins_scaled, 5)

Please note that output for these fits is omitted due to the length of the results.

6.2

fuzzy_wss <- fviz_nbclust(penguins_scaled, fanny, method = "wss")
fuzzy_sil <- fviz_nbclust(penguins_scaled, fanny, method = "silhouette")

fuzzy_wss

fuzzy_sil

We obtain the same results as for the k-means and PAM clustering here, i.e. we have different results of three and two, respectively. (If we ran the gap_stat method, we would arrive at five clusters being optimal, but this takes quite a while to run).

6.3

The R code below shows results for a cluster choice of three.

fviz_cluster(fuzzy_fit3, data = penguins_scaled)

# This code assumes you are assessing a result stored in the object fuzzy_fit3
plot(as.numeric(penguins$species), col = fuzzy_fit3$cluster)

From these plots we can see that the fuzzy clustering performs similarly to the previous methods.


That’s all the content covered.


References

Horikoshi, M., Y. Tang, A. Dickey, M. Grenie, R. Thompson, L. Selzer, D. Strbenac, K. Voronin, and D. Pulatov. 2021. ggfortify: Data Visualization Tools for Statistical Analysis Results. https://github.com/sinhrks/ggfortify.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Kassambara, A., and F. Mundt. 2020. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.


  1. We also utilise the packages factoextra (Kassambara and Mundt 2020), ggfortify (Horikoshi et al. 2021) and cluster(R Core Team 2021).↩︎

---
title: "STM1001: Computer Lab 6B Solutions"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Science/Health Science/Data Science Module {-}


### Topic 6B: Big Data I (Clustering) {-}

<br>

Example R code solutions for the [Week 6 Science/Health Science/Data Science Computer Lab](https://rpubs.com/LTU_STM1001/SMDSMCL6_S), which uses penguin data from the `palmerpenguins` R package^[We also utilise the packages `factoextra` [@facto], `ggfortify` [@gg] and `cluster`[@R].] [@penguins], are presented below.

# Cluster Analysis

### Purpose {-}

No answer required.

### Clustering Techniques {-}

No answer required.

# Preparations {#prep}

```{r eval = T, include = F}
library(cluster)
library(factoextra)
library(ggfortify)
library(palmerpenguins)
```

```{r eval = F, echo = T}
install.packages(c("factoextra", "ggfortify", "palmerpenguins"))
library(cluster)
library(factoextra)
library(ggfortify)
library(palmerpenguins)
```

*Note: If you are in the Data Science module, you should already have the `palmerpenguins` package installed.*

## Penguin Data

No answer required.

##


```{r eval = T, echo = T}
head(penguins)
plot(penguins)
```

For several of the continuous random variable scatter plots, there does appear to be visible clustering occuring - e.g. `flipper_length_mm` vs `bill_depth_mm` suggests that there are two distinct clusters. Therefore applying a clustering technique does seem to be warranted.

## Preparing data for Cluster Analysis

No answer required.

###

```{r eval = T, echo = T}
penguins <- na.omit(penguins)
penguins_subset <- penguins[, 3:6]
```

###

No answer required.

###

```{r class.source = "fold-show", eval = T, echo = T}
penguins_scaled <- scale(penguins_subset)
head(penguins_scaled)
```

# k-means Clustering

## {#kmeans3}

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
kmeans_fit3 <- kmeans(penguins_scaled, 3)
```

##

We can check the cluster membership as follows:

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
kmeans_fit3$cluster
```

The `between_ss /total_ss` value is 72.1%.

## Visualising our results {#kmeansvisualise}

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(8, 10)}
plot(penguins[, c(1,3:6)], col = kmeans_fit3$cluster)
```

As we can see, the clusters are actually doing a pretty good job of separating penguins by species!

##

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
plot(as.numeric(penguins$species), col = kmeans_fit3$cluster)
```

Based on these plots, it seems that the k-means clustering has done a great job of clustering the penguins into clusters of different species. If you look carefully, there is some error when distinguishing between Chinstrap and Adelie penguins. If we also include more data such as `sex` or `island`, we might be able to more clearly distinguish between these two species.

## {#kmeansmore}

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
kmeans_fit2 <- kmeans(penguins_scaled, 2)
kmeans_fit4 <- kmeans(penguins_scaled, 4)
kmeans_fit5 <- kmeans(penguins_scaled, 5)
```

The `between_ss /total_ss` results for these new cluster fits are:

* 58.5% for $k=2$,
* 77.1% for $k=4$,
* 82.8% for $k=5$

The `between_ss /total_ss` value actually increases as we increase the number of clusters, which isn't necessarily accurate - we know there aren't five species of penguin in our data (what happens if we use $k=10$?!).

## {#nbclust}

No answer required.

###

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
kmeans_wss <- fviz_nbclust(penguins_scaled, kmeans, method = "wss")
kmeans_wss
```

Based on this plot, it seems that the optimal number of clusters is three. After $k=3$, the Total Within Sum of Squares variance does not decrease as rapidly when more clusters are used.

###

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
kmeans_sil <- fviz_nbclust(penguins_scaled, kmeans, method = "silhouette")
kmeans_sil
```

Based on this plot, it seems that the optimal number of clusters is two.

###

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
kmeans_gap <- fviz_nbclust(penguins_scaled, kmeans, method = "gap_stat")
kmeans_gap
```

Based on this plot, it seems that the optimal number of clusters is four.

##

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
fviz_cluster(kmeans_fit3, data = penguins_scaled)

fviz_cluster(kmeans_fit2, data = penguins_scaled)
fviz_cluster(kmeans_fit4, data = penguins_scaled)
fviz_cluster(kmeans_fit5, data = penguins_scaled)
```

## 

Answers may vary here. Three or four clusters seem reasonable.

# PAM Clustering

## {#pam}

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
pam_fit2 <- pam(penguins_scaled, 2)
pam_fit3 <- pam(penguins_scaled, 3)
pam_fit4 <- pam(penguins_scaled, 4)
pam_fit5 <- pam(penguins_scaled, 5)
```

## {#pambest}

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
pam_wss <- fviz_nbclust(penguins_scaled, pam, method = "wss")
pam_sil <- fviz_nbclust(penguins_scaled, pam, method = "silhouette")
pam_gap <- fviz_nbclust(penguins_scaled, pam, method = "gap_stat")

pam_wss
pam_sil
pam_gap
```

We have different results here, of three, two and five clusters being optimal, respectively.

##

The R code below shows results for a cluster choice of three.

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
fviz_cluster(pam_fit3, data = penguins_scaled)
```

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(8, 6)}
plot(as.numeric(penguins$species), col = pam_fit3$clustering)
```

We can see from these plots that the PAM clustering does a very good job of clustering penguins into species (although there is still some trouble distinguishing Adelie and Chinstrap penguins). For the larger numbers of clusters, the algorithm may also be distinguishing between male and female penguins - try checking if this is true.

# Extension 1: Hierarchical Clustering

## {#hierarchical}

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
hier_fit <- agnes(penguins_scaled)
```

##

```{r class.source = "fold-show", eval = T, echo = T, fig.dim = c(8, 6), cache = T}
plot(hier_fit, which = 2)
```

##

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
hier_fit <- agnes(penguins_scaled)
hier_fit_single <- agnes(penguins_scaled, method = "single")
hier_fit_complete <- agnes(penguins_scaled, method = "complete")
hier_fit_ward <- agnes(penguins_scaled, method = "ward")
```

```{r class.source = "fold-show", eval = T, echo = T, fig.dim = c(8, 6), cache = T}
plot(hier_fit_single, which = 2)
plot(hier_fit_complete, which = 2)
plot(hier_fit_ward, which = 2)
```

The dendrograms all look quite different. The `complete` or `ward` method ones look perhaps the cleanest and easiest to assess.

## 

The code shown below produces a coloured dendrogram with three clusters, for the hierarchical clustering algorithm using the "complete" measurement method.

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(8, 6), warning = F, message = F}
penguin_dendro <- hcut(penguins_scaled, 
                k = 3, 
                hc_func = "agnes",
                hc_method = "complete",
                hc_metric = "euclidean") 

fviz_dend(penguin_dendro)
```

# Extension 2: Fuzzy Clustering

##

The R code below shows fits for the different cluster number options.

```{r class.source = "fold-show", eval = T, echo = T, cache = T}
fuzzy_fit2 <- fanny(penguins_scaled, 2)
fuzzy_fit3 <- fanny(penguins_scaled, 3)
fuzzy_fit4 <- fanny(penguins_scaled, 4)
fuzzy_fit5 <- fanny(penguins_scaled, 5)
```

Please note that output for these fits is omitted due to the length of the results.

## {#fuzzybest}

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6), warning = F, message = F}
fuzzy_wss <- fviz_nbclust(penguins_scaled, fanny, method = "wss")
fuzzy_sil <- fviz_nbclust(penguins_scaled, fanny, method = "silhouette")

fuzzy_wss
fuzzy_sil
```

We obtain the same results as for the k-means and PAM clustering here, i.e. we have different results of three and two, respectively.
(If we ran the `gap_stat` method, we would arrive at five clusters being optimal, but this takes quite a while to run).

##

The R code below shows results for a cluster choice of three.

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(6, 6)}
fviz_cluster(fuzzy_fit3, data = penguins_scaled)
```

```{r class.source = "fold-show", eval = T, echo = T, cache = T, fig.dim = c(8, 6)}
# This code assumes you are assessing a result stored in the object fuzzy_fit3
plot(as.numeric(penguins$species), col = fuzzy_fit3$cluster)
```

From these plots we can see that the fuzzy clustering performs similarly to the previous methods.

<br>

#### That's all the content covered. #### {-}

<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in @ModStat. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>