class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 Lecture
# Introduction to Cluster Analysis
## Science/Health and Data Science streams
### La Trobe University

---

# Welcome!

### In this lecture we will cover an Introduction to Cluster Analysis, focusing on how to conduct cluster analyses in R/jamovi.

--

* By the end of this lecture you will:

--

  * understand what cluster analysis is and why we use it

--

  * have a foundational understanding of `\(k\)`-means clustering

--

We will practise cluster analyses in Computer Lab 7B, which also includes some additional extension material if you would like to go further.

--

*It is recommended that you cover this lecture material before you start Computer Lab 7B.*

---

# Overview

Over the following slides, we will cover:

--

* Motivation for using Clustering

--

* What is Clustering?

--

* Types of Clustering Methods

--

* `\(k\)`-means Clustering

--

* Extensions

--

Throughout this lecture, we will see how clustering works via an example.

---

# 1. Motivation for Clustering

We have looked at a range of data sets so far in STM1001. These have covered a variety of variables, from penguins to GDP per capita to onion yield.

--

For all of these data sets, for each individual (penguin, country, onion) we have observations which, broadly speaking, fall into one of two groups:

--

* A set of **feature** variables (e.g. flipper length, happiness score, onion density, etc.)

--

* A .teal_style[group label] variable (e.g. species, continent, planting location, etc.)

---

# 1. Motivation for Clustering

For our analyses of these data, we used the .teal_style[group label] to group together similar **feature variable** observations, e.g.:

* Assess **flipper lengths** across different penguin .teal_style[Species]

--

* Assess **yield of onions** grown across different .teal_style[Localities]

--

* Assess **Extraversion** across different US .teal_style[Regions]

--

What happens, though, if we have data with feature variables, but no group label?

--

* How can we determine the group in which the observations should be placed?

--

* What if we don't even know what groups exist, or how many there are?

--

We will need a new statistical technique to help us with our analyses - clustering!

---

# 1. Motivation for Clustering

* For example, let us again consider the Big Five data set (Hartnett, 2020)

* Recall that the Big Five personality traits are:

  * **E**xtraversion
  * **A**greeableness
  * **C**onscientiousness
  * **N**euroticism
  * **O**penness

---

# 1. Motivation for Clustering

* Suppose now that we do not have access to any grouping variables - all we have access to are the Big Five measurements:
---

# 1. Motivation for Clustering

* Can we use just this information to help us identify distinct groups, or "clusters"?

--

* We will see how cluster analysis can help us do just that

---

# 2. What is Clustering?

Clusters are reasonably intuitive, and are easy to understand visually.

--

* As an example, here we have used feature variables from our `Big Five` data set to cluster the data into two clusters:

<img src="data:image/png;base64,#STM1001_Data_Science_Clustering_Lecture_files/figure-html/unnamed-chunk-2-1.svg" width="55%" style="display: block; margin: auto;" />

---

# Purposes of Clustering

Clustering is often used as part of an exploratory data analysis procedure, to uncover hidden patterns in a new data set.

--

Typically, the main purposes of clustering are to:

--

* Analyse the data structure

--

* Relate the different elements of the data to each other

--

and most importantly

--

* Aid in classifying the data into certain groups or classes

  * Note that in general, we may not be aware of what these classes are prior to the clustering analysis.

--

**Note that clustering is usually not the final part of our analysis.**

---

# How to define a Cluster?

We have seen that clusters can be relatively easy to identify visually.

--

Clusters can, however, be difficult to define mathematically. We will avoid overly technical definitions, and use the following intuitive notion of clustering:

--

* A cluster is a set of entities which are alike, and entities from different clusters are not alike (Everitt, 1974).

--

Phrased slightly differently:

--

* A cluster is a subset of a data set that consists of observations which, using a chosen metric, are similar to each other, and also dissimilar to other observations (Mirkin, 1996).

---

# Types of Clustering Methods

There are numerous clustering algorithms (methods), but no single clustering method is universally preferred over all others.

--

What this means is that the choice of 'best' clustering method is context specific - different clustering methods can be better for different problems.

--

It is important to be able to use different types of clustering methods. In STM1001, we will look at:

--

* .teal_style[k-means clustering]

--

* .teal_style[Hierarchical clustering] (extension)

--

* .teal_style[PAM clustering] and .teal_style[Fuzzy clustering] (extension for Data Science students only)

--

**Don't worry**, we won't go into all the complicated mathematics involved in these techniques - our focus is on learning how to conduct and interpret cluster analyses in R and jamovi.

---

# Conducting Cluster Analyses in jamovi and RStudio

In Computer Lab 7B, you will learn how to conduct your own cluster analyses.

--

* We will need certain modules and packages for cluster analysis:

--

  * In **jamovi** (The jamovi project 2022), we can use the `snowCluster` module (Seol 2022; Kassambara and Mundt 2020).

--

  * In **RStudio** (Posit team 2023), we can use the `cluster` package (R Core Team 2021), along with the `factoextra` (Kassambara and Mundt 2020) and `ggfortify` (Horikoshi et al. 2021) packages.

---

# Data Preparations

Regardless of software, we must properly prepare our data before we begin our cluster analyses.

--

**Important: the clustering algorithms we will use only work on continuous feature variables.**

--

Let's take a look at the Big Five data as an example:
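--

In R, we might first check the variable types (a minimal sketch - the data frame name `big_five` is an assumption for illustration):

```r
# str() lists each variable along with its type, so we can
# see which variables are numeric and which are not
str(big_five)
```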
---

# Data Preparations

* Here, only `E`, `A`, `C`, `N` and `O` are continuous numeric variables.

--

* We won't be able to use the other variables as feature variables (unless we conduct some transformations, which are beyond the scope of this topic).

--

* So the data we input into the clustering algorithm will be limited to the numeric variables:
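--

In R, we could subset the data to just these columns as follows (a sketch, again assuming the data frame is named `big_five`):

```r
# Keep only the five numeric trait variables as features
big_five_numeric <- big_five[, c("E", "A", "C", "N", "O")]

head(big_five_numeric)
```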
---

# Data Preparations

Before conducting a cluster analysis, it is also generally a good idea to .teal_style[normalise] our data. This is like calculating `\(z\)`-scores for each number in a data set ([recall the standardisation process covered in Topic 3](https://bookdown.org/a_shaker/STM1001_Topic_3/4.2-standardisation.html)).

--

* Normalising our data helps remove any impact from different variables being measured on different scales (e.g. flipper length in mm and body mass in grams).

  * This can prevent any single variable unintentionally overpowering others when being assessed by the clustering algorithm.

--

* Normalising the data consists of two steps:

--

1. We compute the sample mean and sample standard deviation for each variable.

--

1. We subtract the relevant sample mean from each observation, and then divide by the relevant sample standard deviation.

(Recall the `\(z\)`-score formula: `\(z = \frac{x - \mu}{\sigma}\)`. If we use the sample values instead of the population parameters `\(\mu\)` and `\(\sigma\)`, we have `\(\frac{x - \overline{x}}{s}\)`.)

---

# Data Preparations - Big Five Example

Note that we don't have to do this normalisation by hand - we can use a simple line of code (R) or click a button (jamovi) to conduct the process.

--

Let's take a look at the Big Five data after normalisation:

--
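In R, that simple line of code uses `scale()` (a minimal sketch, continuing with the `big_five_numeric` data frame assumed earlier):

```r
# scale() subtracts each column's sample mean and divides by
# its sample standard deviation, i.e. it computes z-scores
big_five_z <- as.data.frame(scale(big_five_numeric))

# Each column should now have mean ~0 and standard deviation 1
round(colMeans(big_five_z), 10)
apply(big_five_z, 2, sd)
```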
---
class: middle

Once our data is prepared, we are ready to conduct our cluster analysis.

---

# 3. k-means Clustering

The main type of clustering we will focus on is called `\(k\)`-means clustering.

--

`\(k\)`-means clustering is a top-down or *divisive* form of clustering, whereby we take the entire data sample and divide the observations into similar groups.

--

We say that `\(k\)`-means clustering is a *centroid*-based type of clustering, since cluster membership is determined by closeness to cluster means (aka *centroids*).

---

# k-means Clustering Steps

A simplified description of the `\(k\)`-means clustering process is as follows:

--

1. We begin by arbitrarily choosing a number of clusters `\(k\)` to use.

--

1. `\(k\)` initial ***centroids*** (centre points) are chosen (normally at random).

--

1. Each data point from our data set is assigned to a cluster, depending on which ***centroid*** it is closest to, and an iterative process begins:

--

  * The ***centroid***, i.e. the mean of each cluster, is recalculated

--

  * Each data point is again assigned to the cluster with the closest ***centroid***

--

  * This process continues until the centroids no longer change and the points therefore remain stable within their clusters. At this point, the clustering stops.

--

  * This process can be seen visually at [this link](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/) (Harris, 2014), in a 2-dimensional setting (i.e. where clusters are formed based on two feature variables). Example settings to use: `\(k = 3\)`, Initial centroids: "`I'll choose`", using "`Gaussian Mixture`" data

---

# How to choose k?

In general, we do not know how many clusters are in our data set.

--

* Recall for example our Big Five data set, assuming we only have access to the Big Five measurements and no other variables

--

* We cannot easily plot this data and it is not obvious how many clusters there should be

--

As with the clustering algorithms themselves, there is no universally dominant or correct method for selecting the number of clusters. We will consider two options:

--

* .teal_style[The silhouette method]

--

* .teal_style[The gap statistic method]

Note that only the gap statistic method is available in jamovi.

---

# How to choose k - The silhouette method

The **silhouette** assessment method (Rousseeuw, 1987) measures how similar a point is to the other points in its cluster, compared with points in other clusters.

--

The silhouette value for each observation lies between -1 and 1.

--

* Values close to 0 suggest an observation lies between clusters (the clusters are very similar)

* Values close to 1 indicate an observation is well matched to its cluster (the clusters are very different)

* Negative values suggest an observation may have been assigned to the wrong cluster

--

Thus, the higher the average silhouette width value (averaged across all observations), the better.

---

# How to choose k - The silhouette method

To interpret a silhouette plot, we simply pick the number of clusters which results in the highest average silhouette width value - which is 3 in the example below:

--

<img src="data:image/png;base64,#STM1001_Data_Science_Clustering_Lecture_files/figure-html/unnamed-chunk-6-1.svg" width="55%" style="display: block; margin: auto;" />

---

# How to choose k - the gap statistic method

The other clustering assessment method we will consider is the **gap statistic** assessment method (Tibshirani, Walther & Hastie, 2001).

--

We won't go into the mathematical details of this statistic; suffice it to say that .bold_style[higher values are considered preferable], as with the silhouette method.

* However, .bold_style[if the best gap statistic values are similar between two potential cluster numbers, we may select the smaller number of clusters].
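--

Both diagnostic plots can be produced in R with the `factoextra` package (Kassambara and Mundt 2020) - a minimal sketch, where the normalised data frame name `big_five_z` is an assumption for illustration:

```r
library(factoextra)

# Average silhouette width for k = 2, 3, ..., 10 clusters
fviz_nbclust(big_five_z, kmeans, method = "silhouette")

# Gap statistic over a range of k
# (set a seed first, as the bootstrapping involves randomness)
set.seed(123)
fviz_nbclust(big_five_z, kmeans, method = "gap_stat", nboot = 50)
```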
---

# How to choose k - the gap statistic method

We can interpret the gap statistic plots in a similar way to the silhouette plots.

--

Just look for the dashed line highlighting the number of clusters determined to be optimal - in the example below it is 1.

--

<img src="data:image/png;base64,#STM1001_Data_Science_Clustering_Lecture_files/figure-html/unnamed-chunk-7-1.svg" width="50%" style="display: block; margin: auto;" />

---

# Determining the optimal number for k

* As we have seen, different diagnostic methods may propose different optimal numbers for `\(k\)`, for a given data set.

--

* In practice, we can carry out the cluster analysis for various values of `\(k\)` and make a decision after seeing the results.

--

* Sometimes, there is no "right" choice for `\(k\)`, so whatever value you choose, it is important to be able to justify that choice.

--

* Returning to the Big Five example, let's choose a value of `\(k = 3\)` and see the results.

---

# Big Five example

Choosing a value of `\(k = 3\)` results in the following cluster plot:

.pull-left[
<img src="data:image/png;base64,#STM1001_Data_Science_Clustering_Lecture_files/figure-html/unnamed-chunk-8-1.svg" width="120%" style="display: block; margin: auto;" />
]

--

.pull-right[

* As we can see, three distinct clusters have been identified

{{content}}

]

--

* Cluster 1 overlaps a little with both Cluster 2 and Cluster 3 (compare this to the result with `\(k = 2\)` which we saw earlier)

{{content}}

--

* However, a reasonable amount of separation has been achieved

Note that since there is some randomness involved in the clustering algorithm, the results can vary slightly each time the algorithm is run.

---

# Big Five example: Some diagnostics

As well as the cluster plot, there are a number of other diagnostics we can look at to understand the results. For example, consider the following "Means across clusters" plot:

--

<img src="data:image/png;base64,#Means_across_clusters.png" width="400px" style="display: block; margin: auto;" />

--

Taking Cluster 3 as an example, the cluster contains states with above-average levels of Extraversion, Agreeableness and Conscientiousness, and below-average levels of Neuroticism and Openness.

--

See if you can interpret the other two clusters.

---

# Big Five example: Some diagnostics

* We carried out our cluster analysis assuming we did not have any information about the `Region` variable.

--

* But an interesting question to ask is: how closely aligned are the identified clusters with the Regions that exist in the data?

---

# Big Five example: Some diagnostics

* Because we actually do have the `Region` variable, we can compare the Region categories with the identified clusters:

<img src="data:image/png;base64,#Regions_clusters.png" width="400px" style="display: block; margin: auto;" />

--

* All Midwest (MW) states except one have been assigned to Cluster 3;

--

all Northeast (NE) states except one have been assigned to Cluster 1;

--

most West (W) states have been assigned to Cluster 2;

--

South (S) states have been assigned to a mix of all clusters.

--

* Overall, the clusters do seem to have separated the regions into different groups, although there is some overlap

* We will consider some other diagnostic measures in Computer Lab 7B
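--

For reference, a minimal sketch of how this analysis might look in R (assuming the normalised data frame `big_five_z` from earlier, and the original `big_five` data frame with its `Region` column; the seed value is arbitrary):

```r
library(factoextra)

# k-means uses random starting centroids, so set a seed for
# reproducibility; nstart tries several random starts and
# keeps the best result
set.seed(123)
km <- kmeans(big_five_z, centers = 3, nstart = 25)

# Cluster plot of the k = 3 solution
fviz_cluster(km, data = big_five_z)

# Cross-tabulate the identified clusters against the known Regions
table(big_five$Region, km$cluster)
```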
---

# Cluster Analysis Extensions

There are many other cluster analysis methods out there to discover. In Computer Lab 7B, as extensions, we can look at:

--

* .teal_style[Hierarchical clustering]

--

* .teal_style[PAM clustering] and .teal_style[Fuzzy clustering] (**Data Science** students only)

---

# Extensions: Hierarchical Clustering

The hierarchical clustering process is very different to `\(k\)`-means clustering, with clusters visualised via a .teal_style[dendrogram], as shown below:

--

<img src="data:image/png;base64,#STM1001_dendrogram.png" width="600px" style="display: block; margin: auto;" />

*Note the y-axis values denote distance between clusters.*

---

# Extensions: Combining Cluster Analyses

While beyond the scope of this subject, it is worth noting that we can even combine information from multiple cluster analyses - see for example the .teal_style[heatmap] below, produced by dendrogram bi-clustering.

<img src="data:image/png;base64,#heatmap.jpg" width="750px" style="display: block; margin: auto;" />

---
name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Kahoot

## Go to [www.kahoot.it](https://www.kahoot.it) and use
## the code provided

---

# End

That concludes our Introduction to Cluster Analysis lecture.

--

What to do next:

* Before Computer Lab 7B, make sure you are up-to-date with the current assessments, and remember that you can ask your computer lab demonstrator during the lab if you have questions about the assessments

* If you have any cluster analysis questions, we can resolve them in the computer labs.

---
background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

---

# References

Everitt, B. 1974. *Cluster Analysis*. New York: Wiley.

Gan, G., C. Ma, & J. Wu. 2007. *Data Clustering: Theory, Algorithms, and Applications*. ASA-SIAM Series on Statistics and Applied Probability, 20. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). [https://doi.org/10.1137/1.9780898718348](https://doi.org/10.1137/1.9780898718348).

Harris, N. 2014. *Visualizing K-Means Clustering*. [https://www.naftaliharris.com/blog/visualizing-k-means-clustering/](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/).

Hartnett, J. 2020. *Not awful and boring ideas for teaching statistics*. [https://notawfulandboring.blogspot.com/2020/04/online-day-6-one-way-anova-example.html](https://notawfulandboring.blogspot.com/2020/04/online-day-6-one-way-anova-example.html).

Horikoshi, M., Y. Tang, A. Dickey, M. Grenie, R. Thompson, L. Selzer, D. Strbenac, K. Voronin, & D. Pulatov. 2021. *ggfortify: Data Visualization Tools for Statistical Analysis Results*. [https://github.com/sinhrks/ggfortify](https://github.com/sinhrks/ggfortify).

The jamovi project. 2022. *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org).

---

# References

Kassambara, A., & F. Mundt. 2020. *factoextra: Extract and Visualize the Results of Multivariate Data Analyses*. [http://www.sthda.com/english/rpkgs/factoextra](http://www.sthda.com/english/rpkgs/factoextra).

Mirkin, B. 1996. *Mathematical Classification and Clustering*. 1st ed. Nonconvex Optimization and Its Applications, 11.
Posit team. 2023. *RStudio: Integrated Development Environment for R*. Boston, MA: Posit Software, PBC. [http://www.posit.co/](http://www.posit.co/).

R Core Team. 2021. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. [https://www.R-project.org/](https://www.R-project.org/).

Rousseeuw, P. J. 1987. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. *Journal of Computational and Applied Mathematics* 20: 53–65.

Seol, H. 2022. *snowCluster: Multivariate Analysis*. [https://github.com/hyunsooseol/snowCluster](https://github.com/hyunsooseol/snowCluster).

---

# References

Tibshirani, R., G. Walther, & T. Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. *Journal of the Royal Statistical Society, Series B* 63 (2): 411–423. [https://doi.org/10.1111/1467-9868.00293](https://doi.org/10.1111/1467-9868.00293).

---
class: middle

<font color = "grey">
These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives licence <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>