Buckle up 🚀

In this learning path, we’ll learn how to create machine learning models using R 😊. Machine learning is the foundation for predictive modeling and artificial intelligence. We’ll learn the core principles of machine learning and how to use common tools and frameworks to train, evaluate, and use machine learning models.

Modules that will be covered in this learning path include:

  • Explore and analyze data with R

  • Train and evaluate regression models

  • Train and evaluate classification models

  • Train and evaluate clustering models

  • Train and evaluate deep learning models (work in progress)

Prerequisites

This learning path assumes knowledge of basic mathematical concepts. Some experience with R and the tidyverse is also beneficial, though we’ll try as much as possible to skim through the core concepts. To get started with R and the tidyverse, the best place would be R for Data Science, an O’Reilly book written by Hadley Wickham and Garrett Grolemund. It’s designed to take you from knowing nothing about R or the tidyverse to having all the basic tools of data science at your fingertips.

The Python edition of the learning path can be found at this learning path.

Why R?

R has emerged over the last couple decades as a first-class tool for scientific computing tasks, and has been a consistent leader in implementing statistical methodologies for analyzing data. The usefulness of R for data science stems from the large, active, and growing ecosystem of third-party packages: tidyverse for common data analysis activities; h2o, ranger, xgboost, and others for fast and scalable machine learning; iml, pdp, vip, and others for machine learning interpretability; and many more tools will be mentioned throughout the pages that follow. - Boehmke & Greenwell (2019) Hands-On Machine Learning with R

Now, let’s get started!

Artwork by @allison_horst

A gentle introduction to clustering

Clustering is a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features. This kind of machine learning is considered unsupervised because it does not make use of previously known label values to train a model; in a clustering model, the label is the cluster to which the observation is assigned, based purely on its features.

Clustering works by separating the training cases based on similarities that can be determined from their feature values. Perhaps the two best-known clustering approaches are: K-means clustering and hierarchical clustering. In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. On the other hand, in hierarchical clustering, we do not know in advance how many clusters we want.

Think of it this way; the numeric features of a given entity can be thought of as vector coordinates that define the entity’s position in n-dimensional space. What a clustering model seeks to do is to identify groups, or clusters, of entities that are close to one another while being separated from other clusters.

For example, suppose a botanist observes a sample of flowers and records the number of petals and leaves on each flower.

It may be useful to group these flowers into clusters based on similarities between their features.

There are many ways this could be done. For example, if most flowers have the same number of leaves, they could be grouped into those with many vs few petals. Alternatively, if both petal and leaf counts vary considerably there may be a pattern to discover, such as those with many leaves also having many petals. The goal of the clustering algorithm is to find the optimal way to split the dataset into groups. What ‘optimal’ means depends on both the algorithm used and the dataset that is provided.

Although this flower example may be simple for a human to achieve with only a few samples, as the dataset grows to thousands of samples or to more than two features, clustering algorithms become very useful to quickly dissect a dataset into groups.
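To make the idea of "closeness" a little more concrete, here is a minimal base R sketch. The flower counts below are made up purely for illustration; dist() then computes how far apart each pair of flowers is when petals and leaves are treated as coordinates.

```r
# Hypothetical flower counts, invented purely for illustration
flowers <- matrix(
  c(5, 2,
    6, 2,
    5, 3,
    12, 8,
    13, 9),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("flower_", 1:5), c("petals", "leaves"))
)

# Pairwise Euclidean distances between flowers in 2-dimensional feature space
flower_dist <- dist(flowers, method = "euclidean")
round(flower_dist, 2)

# Distance between flower_1 and flower_2, computed by hand for comparison
by_hand <- sqrt(sum((flowers["flower_1", ] - flowers["flower_2", ])^2))
by_hand
```

Flowers 1 to 3 sit close together and far from flowers 4 and 5, which is the kind of structure a clustering algorithm tries to pick up automatically.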

The best way to learn about clustering is to try it for yourself, so that’s what you’ll do in this exercise.

We’ll require some packages to complete this module. You can install them with: install.packages(c('tidyverse', 'tidymodels', 'skimr', 'here', 'plotly', 'factoextra', 'cluster'))

Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case some are missing.

suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'tidymodels', 'skimr', 'here', 'plotly', 'factoextra', 'cluster')

1. Principal Component Analysis (PCA)

Let’s take a look at a dataset that contains measurements of different species of wheat seed.

Citation: The seeds dataset used in this exercise was originally published by the Institute of Agrophysics of the Polish Academy of Sciences in Lublin, and can be downloaded from the UCI dataset repository (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science).

# Load the core tidyverse and make it available in your current R session
library(tidyverse)

# Read the csv file into a tibble
seeds <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/seeds.csv")

# Print the first 5 rows of the data
seeds %>% 
  slice_head(n = 5)

Sometimes, we may want a little more information on our data. We can have a look at the data, its structure and the data type of its features by using the glimpse() function as below:

# Explore dimension and type of columns
seeds %>% 
  glimpse()
## Rows: 210
## Columns: 8
## $ area                  <dbl> 15.26, 14.88, 14.29, 13.84, 16.14, 14.38, 14.69,~
## $ perimeter             <dbl> 14.84, 14.57, 14.09, 13.94, 14.99, 14.21, 14.49,~
## $ compactness           <dbl> 0.8710, 0.8811, 0.9050, 0.8955, 0.9034, 0.8951, ~
## $ kernel_length         <dbl> 5.763, 5.554, 5.291, 5.324, 5.658, 5.386, 5.563,~
## $ kernel_width          <dbl> 3.312, 3.333, 3.337, 3.379, 3.562, 3.312, 3.259,~
## $ asymmetry_coefficient <dbl> 2.2210, 1.0180, 2.6990, 2.2590, 1.3550, 2.4620, ~
## $ groove_length         <dbl> 5.220, 4.956, 4.825, 4.805, 5.175, 4.956, 5.219,~
## $ species               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

While at it, let’s use skimr::skim() to take a look at the summary statistics for the data.

library(skimr)

# Obtain Summary statistics
seeds %>% 
  skim()
Data summary
Name Piped data
Number of rows 210
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable          n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
area                           0              1  14.85  2.91  10.59  12.27  14.36  17.30  21.18  ▇▆▅▅▂
perimeter                      0              1  14.56  1.31  12.41  13.45  14.32  15.71  17.25  ▆▇▆▅▅
compactness                    0              1   0.87  0.02   0.81   0.86   0.87   0.89   0.92  ▂▃▇▇▃
kernel_length                  0              1   5.63  0.44   4.90   5.26   5.52   5.98   6.68  ▆▇▅▅▂
kernel_width                   0              1   3.26  0.38   2.63   2.94   3.24   3.56   4.03  ▇▇▇▆▅
asymmetry_coefficient          0              1   3.70  1.50   0.77   2.56   3.60   4.77   8.46  ▆▇▇▂▁
groove_length                  0              1   5.41  0.49   4.52   5.04   5.22   5.88   6.55  ▂▇▂▃▂
species                        0              1   1.00  0.82   0.00   0.00   1.00   2.00   2.00  ▇▁▇▁▇

🤩Take a moment and go through the quick data exploration we just performed. Do we have any missing values? What’s the dimension of our data (rows and columns)? What are the different column types? How are the values in our columns distributed?

For this module, we’ll work with the first 6 feature columns. For plotting purposes, let’s encode the label column as categorical. Tidymodels provides a neat way of excluding this variable when fitting a model to our data. Remember, we are dealing with unsupervised learning - which does not make use of previously known label values to train a model.

# Narrow down to desired features
seeds_select <- seeds %>% 
  select(!groove_length) %>% 
  mutate(species = factor(species))

# View first 5 rows of the data
seeds_select %>% 
  slice_head(n = 5)

As you can see, we now have six data points (or features) for each instance (observation) of a seed’s species. So you could interpret these as coordinates that describe each seed’s location in six-dimensional space.

Now, of course six-dimensional space is difficult to visualise in a three-dimensional world, or on a two-dimensional plot; so we’ll take advantage of a mathematical technique called Principal Component Analysis (PCA) to analyze the relationships between the features and summarize each observation as coordinates for two principal components - in other words, we’ll translate the six-dimensional feature values into two-dimensional coordinates.

Principal Component Analysis (PCA) is a dimension reduction method that aims at reducing the feature space, such that, most of the information or variability in the data set can be explained using fewer uncorrelated features.

PCA works by receiving as input P variables (in this case six) and calculating the normalized linear combination of the P variables. This new variable is the linear combination of the six variables that captures the greatest variance out of all of them. PCA continues to calculate other normalized linear combinations but with the constraint that they need to be completely uncorrelated to all the other normalized linear combinations. Please see:

for further reading.
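The description above can be checked numerically. The sketch below uses base R's prcomp() on the built-in iris data, which is only a stand-in here and not the seeds data used in this module. It shows that each component is a normalized linear combination of the (scaled) variables and that the resulting components are uncorrelated with each other.

```r
# Base R sketch of PCA on the four numeric columns of the built-in iris data
# (an assumption for illustration; this is not the seeds data)
X <- scale(iris[, 1:4])          # center and scale each variable

pca <- prcomp(X)                 # estimate the principal components

# Loadings: the weights of each linear combination (one column per component)
loadings <- pca$rotation
loadings

# Each loading vector is normalized: its squared weights sum to 1
colSums(loadings^2)

# PC1 scores are just the weighted sum of the scaled variables
pc1_by_hand <- as.vector(X %*% loadings[, 1])
all.equal(pc1_by_hand, as.vector(pca$x[, 1]))

# Scores of different components are uncorrelated (off-diagonal entries ~ 0)
round(cor(pca$x), 10)
```

The recipe-based workflow in this module goes through tidymodels instead, but the underlying idea is the same.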

Let’s see this in action by creating a specification of a recipe that will estimate the principal components based on our six variables. We’ll then prep and bake the recipe to apply the computations.

PCA works well when the variables are normalized (centered and scaled).

# Specify a recipe for pca
pca_rec <- recipe(~ ., data = seeds_select) %>% 
  update_role(species, new_role = "ID") %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 2, id = "pca")

# Print out recipe
pca_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         ID          1
##  predictor          6
## 
## Operations:
## 
## Centering and scaling for all_predictors()
## No PCA components were extracted.

Compared to supervised learning techniques, we have no outcome variable in this recipe.

By updating the role of the species column to ID, this tells the recipe to keep the variable but not use it as either an outcome or predictor.

By calling prep(), which estimates the statistics required by PCA, and then applying them to seeds_select using bake(new_data = NULL), we can get the fitted PC transformation of our features.

# Estimate required statistics
pca_estimates <- prep(pca_rec)

# Return preprocessed data using bake
features_2d <- pca_estimates %>% 
  bake(new_data = NULL)

# Print baked data set
features_2d %>% 
  slice_head(n = 5)

🤩 These two components capture the maximum amount of information (i.e. variance) in the original variables. From the output of our prepped recipe pca_estimates, we can examine how much variance each component accounts for:

# Examine how much variance each PC accounts for
pca_estimates %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(str_detect(terms, "percent"))
# Use a light theme for the plots that follow
theme_set(theme_light())
# Plot how much variance each PC accounts for
pca_estimates %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(terms == "percent variance") %>% 
  ggplot(mapping = aes(x = component, y = value)) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("% of total variance")

These output tibbles and plots show how well each principal component explains the original six variables. For example, the first principal component (PC1) explains about 72% of the variance of the six variables. The second principal component explains an additional 16.97%, giving a cumulative percent variance of 89.11%. That is a substantial share: it means that the first two principal components do a good job of summarizing the original six variables.

Naturally, the first PC (PC1) captures the most variance followed by PC2, then PC3, etc.
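If you are curious where these percentages come from, they can be reproduced by hand from the component standard deviations. The sketch below again assumes base R's prcomp() and the built-in iris data as a stand-in, so the numbers will differ from the seeds results above.

```r
# PCA on centered and scaled iris measurements (stand-in data, an assumption)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# The variance captured by each component is its squared standard deviation
variance <- pca$sdev^2

# Express each component's variance as a share of the total
percent_variance <- 100 * variance / sum(variance)
cumulative_percent <- cumsum(percent_variance)

data.frame(
  component = seq_along(variance),
  percent_variance = round(percent_variance, 2),
  cumulative_percent = round(cumulative_percent, 2)
)
```

Because the components are ordered by variance, the percent variance never increases from one component to the next, and the cumulative column ends at 100.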

Now that we have the data points translated to two dimensions PC1 and PC2, we can visualize them in a plot:

# Visualize PC scores
features_2d %>% 
  ggplot(mapping = aes(x = PC1, y = PC2)) +
  geom_point(size = 2, color = "dodgerblue3")

Hopefully you can see at least two, arguably three, reasonably distinct groups of data points; but here lies one of the fundamental problems with clustering - without known class labels, how do you know how many clusters to separate your data into?

One way we can try to find out is to use a data sample to create a series of clustering models with an incrementing number of clusters, and measure how tightly the data points are grouped within each cluster. A metric often used to measure this tightness is the within-cluster sum of squares (WCSS), with lower values meaning that the data points are closer. You can then plot the WCSS for each model.
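To see what WCSS actually measures, here is a small sketch that computes it by hand and compares it with the value reported by base R's kmeans(). It assumes the built-in iris data as a stand-in for the seeds features.

```r
set.seed(2056)

# Stand-in data (an assumption): scaled numeric columns of iris
X <- scale(iris[, 1:4])

# Fit a 3-cluster K-Means model
km <- kmeans(X, centers = 3, nstart = 20)

# Look up the centroid assigned to every observation
assigned_centers <- km$centers[km$cluster, ]

# WCSS: sum of squared distances from each point to its own centroid
wcss_by_hand <- sum((X - assigned_centers)^2)

c(by_hand = wcss_by_hand, kmeans = km$tot.withinss)
```

The two numbers should agree, which confirms that tot.withinss is simply the total squared distance of points to their cluster centers.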

We’ll use the built-in kmeans() function, which accepts a data frame with all numeric columns as its primary argument to perform clustering. This means we’ll have to drop the species column. For clustering, it is recommended that the data have the same scale. We can use the recipes package to perform these transformations.

# Drop target column and normalize data
seeds_features <- recipe(~ ., data = seeds_select) %>% 
  step_rm(species) %>% 
  step_normalize(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

# Print out data
seeds_features %>% 
  slice_head(n = 5)

Now, let’s explore the WCSS of different numbers of clusters.

We’ll get to use map() from the purrr package to apply functions to each element in a list.

map() functions allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.

broom::glance() accepts a model object and returns a tibble with exactly one row of model summaries. The summaries are typically goodness of fit measures, p-values for hypothesis tests on residuals, or model convergence information. For a kmeans fit, this includes the total within-cluster sum of squares (tot.withinss).

set.seed(2056)
# Create 10 models with 1 to 10 clusters
kclusts <- tibble(k = 1:10) %>% 
  mutate(
    model = map(k, ~ kmeans(x = seeds_features, centers = .x, nstart = 20)),
    glanced = map(model, glance)) %>% 
  unnest(cols = c(glanced))

# View results
kclusts
# Plot Total within-cluster sum of squares (tot.withinss)
kclusts %>% 
  ggplot(mapping = aes(x = k, y = tot.withinss)) +
  geom_line(size = 1.2, alpha = 0.5, color = "dodgerblue3") +
  geom_point(size = 2, color = "dodgerblue3")

We seek to minimize the total within-cluster sum of squares by performing K-means clustering. The plot shows a large reduction in WCSS (so greater tightness) as the number of clusters increases from one to two, and a further noticeable reduction from two to three clusters. After that, the reduction is less pronounced, resulting in an elbow 💪 in the chart at around three clusters. This is a good indication that there are two to three reasonably well separated clusters of data points.

2. K-Means Clustering

The algorithm we used to approximate the number of clusters in our data set is called K-Means. Let’s get to the finer details, shall we?

K-Means is a commonly used clustering algorithm that separates a dataset into K clusters of equal variance such that observations within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas observations from different clusters are as dissimilar as possible (i.e., low inter-class similarity). The number of clusters, K, is user defined.

The basic algorithm has the following steps:

  1. Specify the number of clusters to be created (this is done by the analyst). Taking the flowers example we used at the beginning of the lesson, this means deciding how many clusters you want to use to group the flowers.
  2. Next, the algorithm randomly selects K observations from the data set to serve as the initial centers for the clusters (i.e., centroids).
  3. Next, each of the remaining observations (in this case flowers) is assigned to its closest centroid.
  4. Next, the new mean of each cluster is computed and the centroid is moved to the mean.
  5. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means. The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e., when convergence is achieved). Typically, the algorithm terminates when each new iteration results in negligible movement of centroids and the clusters become static.
  6. Note that due to randomization of the initial k observations used as the starting centroids, we can get slightly different results each time we apply the procedure. For this reason, most algorithms use several random starts and choose the run with the lowest WCSS. As such, it is strongly recommended to always run K-Means with a value of nstart greater than 1 (for example, 20 or more) to avoid an undesirable local optimum.
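The steps above can be sketched in a few lines of base R. This is a simplified teaching version and not how stats::kmeans() is implemented; the function name simple_kmeans() is hypothetical, and the built-in iris data is used as a stand-in for the seeds features.

```r
# A minimal K-Means sketch (hypothetical helper, for illustration only)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 2: pick k random observations as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(0L, nrow(X))

  for (i in seq_len(max_iter)) {
    # Step 3: squared distance from every observation to every centroid,
    # then assign each observation to its closest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")

    # Step 5: stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Step 4: move each centroid to the mean of its cluster
    for (j in seq_len(k)) {
      members <- X[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }

  wcss <- sum((X - centroids[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centroids, tot.withinss = wcss)
}

# Step 6: try several random starts and keep the run with the lowest WCSS
set.seed(2056)
X <- scale(iris[, 1:4])
fits <- lapply(1:20, function(i) simple_kmeans(X, k = 3))
wcss_per_start <- sapply(fits, function(f) f$tot.withinss)
best <- fits[[which.min(wcss_per_start)]]

table(best$cluster)
best$tot.withinss
```

Looking at wcss_per_start is a nice way to see that different random starts can end at different solutions, which is exactly why the nstart argument of kmeans() exists.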

So training usually involves multiple iterations, reinitializing the centroids each time, and the model with the best (lowest) WCSS is selected. The following animation shows this process:

Now, back to our seeds example. After creating a series of clustering models with different numbers of clusters and plotting the WCSS across the clusters, we noticed a bend at around k = 3. This bend indicates that additional clusters beyond the third have little value and that there are two to three reasonably well separated clusters of data points.

So, let’s perform K-Means clustering specifying k = 3 clusters and add the classifications to the data set using augment.

set.seed(2056)
# Fit and predict clusters with k = 3
final_kmeans <- kmeans(seeds_features, centers = 3, nstart = 100, iter.max = 1000)

# Add cluster prediction to the data set
results <- augment(final_kmeans, seeds_features) %>% 
# Bind pca_data - features_2d
  bind_cols(features_2d)

results %>% 
  slice_head(n = 5)

Let’s see those cluster assignments with the two-dimensional data points. We’ll add a touch of interactivity using the plotly package, so feel free to hover.

# Plot km_cluster assignment on the PC data
cluster_plot <- results %>% 
  ggplot(mapping = aes(x = PC1, y = PC2)) +
  geom_point(aes(shape = .cluster, color = .cluster), size = 2) +
  scale_color_manual(values = c("darkorange","purple","cyan4"))

# Make plot interactive
ggplotly(cluster_plot)

🤩🤩 Hopefully, the data has been separated into three distinct clusters.

So what’s the practical use of clustering? In some cases, you may have data that you need to group into distinct clusters without knowing how many clusters there are or what they indicate. For example, a marketing organization might want to separate customers into distinct segments, and then investigate how those segments exhibit different purchasing behaviors.

Sometimes, clustering is used as an initial step towards creating a classification model. You start by identifying distinct groups of data points, and then assign class labels to those clusters. You can then use this labelled data to train a classification model.

In the case of the seeds data, the different species of seed are already known and encoded as 0 (Kama), 1 (Rosa), or 2 (Canadian), so we can use these identifiers to compare the species classifications to the clusters identified by our unsupervised algorithm.

# Plot km_cluster assignment on the PC data
clust_spc_plot <- results %>% 
  ggplot(mapping = aes(x = PC1, y = PC2)) +
  geom_point(aes(shape = .cluster, color = species), size = 2, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4"))

# Make plot interactive
ggplotly(clust_spc_plot)

There may be some differences between the cluster assignments and class labels as shown by the different colors (species) within each cluster (shape). But the K-Means model should have done a reasonable job of clustering the observations so that seeds of the same species are generally in the same cluster. 💪
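Besides the plot, a simple cross-tabulation can put numbers on how well clusters line up with known labels. The sketch below assumes the built-in iris data (where the species are known) as a stand-in, and the "purity" score is just one rough way to summarize the table.

```r
set.seed(2056)

# Stand-in data (an assumption): scaled numeric columns of iris
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 20)

# Rows are the known species, columns are the (arbitrarily numbered) clusters
cross_tab <- table(species = iris$Species, cluster = km$cluster)
cross_tab

# Purity: share of observations that belong to the most common species
# within their cluster (1 would mean every cluster holds a single species)
purity <- sum(apply(cross_tab, 2, max)) / sum(cross_tab)
purity
```

Remember that the cluster numbers themselves carry no meaning, so it is the pattern within each column that matters.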

3. Hierarchical clustering

The first step in K-Means clustering is the data scientist specifying the number of clusters K to partition the observations into. Hierarchical clustering is an alternative approach which does not require the number of clusters to be defined in advance. Furthermore, hierarchical clustering results can be easily visualized using an attractive tree-based representation called a dendrogram. Once the dendrogram has been constructed, we slice this structure horizontally to identify the clusters formed.

Hierarchical clustering creates clusters by either a divisive method or an agglomerative method. The divisive method is a top-down approach starting with the entire dataset and then finding partitions in a stepwise manner. Agglomerative clustering is a bottom-up approach. In this lab you will work with agglomerative clustering, commonly referred to as AGNES (AGglomerative NESting), which roughly works as follows:

  1. The linkage distances between each of the data points are computed.

  2. Points are clustered pairwise with their nearest neighbor.

  3. Linkage distances between the clusters are computed.

  4. Clusters are combined pairwise into larger clusters.

  5. Steps 3 and 4 are repeated until all data points are in a single cluster.
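These steps are easy to follow on a tiny example. The sketch below uses five made-up points on a line (invented for illustration) and base R's hclust() with average linkage, then inspects the order and height of the merges.

```r
# Five hypothetical points on a line, invented for illustration
pts <- matrix(c(1, 2, 6, 7.5, 15), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "x"))

# Step 1: pairwise distances; steps 2 to 5: repeated pairwise merging
hc <- hclust(dist(pts), method = "average")

# Each row of merge is one agglomeration step: negative numbers are single
# observations, positive numbers refer to clusters formed at an earlier step
hc$merge

# Linkage distance at which each merge happened
hc$height

# Cutting the tree into 2 groups separates the far-away point "e"
clusters <- cutree(hc, k = 2)
clusters
```

The close pairs (a, b) and (c, d) merge first at small heights, the two pairs are then joined, and the distant point e is only absorbed at the very last step.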

A fundamental question in hierarchical clustering is: How do we measure the dissimilarity between two clusters of observations? The linkage function/agglomeration methods can be computed in a number of ways:

  • Ward’s minimum variance method: Minimizes the total within-cluster variance. At each step, the pair of clusters whose merger leads to the smallest increase in total within-cluster variance is merged. Tends to produce more compact clusters.

  • Average linkage uses the mean pairwise distance between the members of the two clusters. Can vary in the compactness of the clusters it creates.

  • Complete or Maximal linkage uses the maximum distance between the members of the two clusters. Tends to produce more compact clusters.
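To see how these linkage choices differ, here is a small sketch that computes them by hand for two made-up clusters (the coordinates are invented for illustration). For Ward's method, it reports the increase in within-cluster sum of squares that merging the two clusters would cause.

```r
# Two small hypothetical clusters in 2-D, invented for illustration
cluster_a <- rbind(c(0, 0), c(1, 0))
cluster_b <- rbind(c(4, 0), c(6, 0))
merged <- rbind(cluster_a, cluster_b)

# Pairwise Euclidean distances between members of cluster_a (rows)
# and members of cluster_b (columns)
pairwise <- as.matrix(dist(merged))[1:2, 3:4]

# Within-cluster sum of squares around the cluster mean
ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

linkage <- c(
  complete = max(pairwise),                    # farthest pair of members
  average = mean(pairwise),                    # mean over all pairs
  ward_increase = ss(merged) - ss(cluster_a) - ss(cluster_b)
)
linkage
```

Complete linkage is driven by the single farthest pair, average linkage smooths over all pairs, and Ward's criterion looks at how much "spread" a merge would add.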

Several different distance metrics are used to compute linkage functions:

  • Euclidean or l2 distance is the most widely used. This is the only metric for the Ward linkage method.

  • Manhattan or l1 distance is robust to outliers and has other interesting properties.

  • Cosine similarity is the dot product between the location vectors divided by the magnitudes of the vectors. Notice that this metric is a measure of similarity, whereas the other two metrics are measures of difference. Similarity can be quite useful when working with data such as images or text documents.
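The three metrics above are short enough to compute by hand. The sketch below uses two made-up vectors (invented for illustration) and cross-checks the first two against base R's dist().

```r
# Two hypothetical feature vectors, invented for illustration
u <- c(1, 2, 3)
v <- c(4, 0, 3)

# Euclidean (l2) distance: square root of the summed squared differences
euclidean <- sqrt(sum((u - v)^2))

# Manhattan (l1) distance: sum of the absolute differences
manhattan <- sum(abs(u - v))

# Cosine similarity: dot product divided by the product of the magnitudes
cosine_similarity <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

c(euclidean = euclidean, manhattan = manhattan, cosine = cosine_similarity)

# dist() gives the same Euclidean and Manhattan values
dist(rbind(u, v), method = "euclidean")
dist(rbind(u, v), method = "manhattan")
```

Note that larger Euclidean and Manhattan values mean the vectors are further apart, while a larger cosine similarity means they point in a more similar direction.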

Please see:

for further reading.

Therefore, in Hierarchical clustering, the clusters themselves belong to a larger group, which belong to even larger groups, and so on. This is useful for not only breaking data into groups, but understanding the relationships between these groups.

For example, if we apply clustering to the meanings of words, we may get a group containing adjectives specific to emotions (‘angry’, ‘happy’, and so on), which itself belongs to a group containing all human-related adjectives (‘happy’, ‘handsome’, ‘young’), and this belongs to an even higher group containing all adjectives (‘happy’, ‘green’, ‘handsome’, ‘hard’ etc.).

Agglomerative Clustering

Let’s see an example of clustering the seeds data using an agglomerative clustering algorithm. There are many functions available in R for hierarchical clustering.

The hclust() function is one way to perform hierarchical clustering in R. It only needs one input: a distance matrix structure computed using a distance metric (e.g. euclidean), as produced by dist(). hclust() also allows us to specify the agglomeration method to be used (e.g. "complete", "average", "single", "ward.D", or "ward.D2").

Great! Let’s fit multiple hierarchical clustering models based on different agglomeration methods and see how the choice of agglomeration method changes the clustering.

# For reproducibility
set.seed(2056)

# Distance between observations matrix
d <- dist(x = seeds_features, method = "euclidean")

# Hierarchical clustering using Complete Linkage
seeds_hclust_complete <- hclust(d, method = "complete")

# Hierarchical clustering using Average Linkage
seeds_hclust_average <- hclust(d, method = "average")

# Hierarchical clustering using Ward Linkage
seeds_hclust_ward <- hclust(d, method = "ward.D2")

The factoextra package provides functions such as fviz_dend() to visualize hierarchical clustering. Let’s visualize the dendrogram representation of the clusters, starting with the Complete agglomeration method.

library(factoextra)

# Visualize cluster separations
fviz_dend(seeds_hclust_complete, main = "Complete Linkage")

What about Average linkage?

# Visualize cluster separations
fviz_dend(seeds_hclust_average, main = "Average Linkage")

Lastly, the Ward linkage.

# Visualize cluster separations
fviz_dend(seeds_hclust_ward, main = "Ward Linkage")

Note: If you are new to dendrograms please see the following resources on how to interpret dendrograms:

Perfect! Take a moment and analyze the nature of the clusters.

One way to compare them mathematically is to evaluate the agglomerative coefficient (AC), which measures the clustering structure of the dataset: values closer to 1 suggest a more balanced clustering structure, and values closer to 0 suggest less well-formed clusters. cluster::agnes() allows us to compute the hierarchical clustering as well as this metric.

library(cluster)
#Compute ac values
ac_metric <- list(
  complete_ac = agnes(seeds_features, metric = "euclidean", method = "complete")$ac,
  average_ac = agnes(seeds_features, metric = "euclidean", method = "average")$ac,
  ward_ac = agnes(seeds_features, metric = "euclidean", method = "ward")$ac
)

ac_metric
## $complete_ac
## [1] 0.9344451
## 
## $average_ac
## [1] 0.8769204
## 
## $ward_ac
## [1] 0.9856365

As we explained earlier, complete and Ward linkages tend to produce tight clustering of objects.

Now, let’s determine the optimal number of clusters. Although hierarchical clustering does not require one to pre-specify the number of clusters, one still needs to specify the number of clusters to extract. Let’s use the WCSS method to determine the optimal number of clusters.

# Determine and visualize the optimal number of clusters
#  hcut (for hierarchical clustering)
fviz_nbclust(seeds_features, FUNcluster = hcut, method = "wss")

Just like in K-Means clustering, the optimal number of clusters for this data set is 3.

Let’s color our dendrogram according to k = 3 and observe how observations will be grouped. We’ll go with the ward linkage method.

# Visualize clustering structure for 3 groups
fviz_dend(seeds_hclust_ward, k = 3, main = "Ward Linkage")

Plausible enough 🤩!

We can now go ahead and cut the hierarchical clustering model into three clusters and extract the cluster labels for each observation associated with a given cut. This is done using cutree().

# Hierarchical clustering using Ward Linkage
seeds_hclust_ward <- hclust(d, method = "ward.D2")

# Group data into 3 clusters
results_hclust <- tibble(
  cluster_id = cutree(seeds_hclust_ward, k = 3)) %>% 
  mutate(cluster_id = factor(cluster_id)) %>% 
  bind_cols(features_2d)

results_hclust %>% 
  slice_head(n = 5)

We could probably do a little comparison between K-Means and Hierarchical clustering by counting the number of observations of each species in the corresponding clusters.

# Compare k-m and hc
results_hclust %>% 
  count(species, cluster_id) %>% 
  rename(n_hclust = n) %>% 
  bind_cols(results %>% 
              count(species, .cluster) %>%
              select(!species) %>% 
              rename(n_kmclust = n))

Ignoring the cluster_id and .cluster column since they are arbitrary, we can see that the observations were grouped quite similarly by the two algorithms. We could of course make a confusion matrix to better visualize this, but we’ll leave it at that for now.
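As a rough idea of what such a table could look like, here is a sketch that cross-tabulates the cluster labels from the two algorithms directly. It assumes the built-in iris data as a stand-in for the seeds features, so the counts will not match the seeds results.

```r
set.seed(2056)

# Stand-in data (an assumption): scaled numeric columns of iris
X <- scale(iris[, 1:4])

# Cluster labels from each algorithm, both with 3 clusters
km_cluster <- kmeans(X, centers = 3, nstart = 20)$cluster
hc_cluster <- cutree(hclust(dist(X), method = "ward.D2"), k = 3)

# Rows are K-Means clusters, columns are hierarchical clusters
agreement <- table(kmeans = km_cluster, hclust = hc_cluster)
agreement

# Share of observations on which the two groupings agree, after pairing each
# K-Means cluster with the hierarchical cluster it overlaps most
sum(apply(agreement, 1, max)) / sum(agreement)
```

If the two algorithms grouped the observations similarly, each row of the table will have one large count and mostly small counts elsewhere.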

Let’s wrap it up by making some plots showing how our observations were grouped into clusters.🥳

# Plot h-cluster assignment on the PC data
hclust_spc_plot <- results_hclust %>% 
  ggplot(mapping = aes(x = PC1, y = PC2)) +
  geom_point(aes(shape = cluster_id, color = species), size = 2, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4"))

# Make plot interactive
ggplotly(hclust_spc_plot)

4. Summary

In this module, you learned how clustering can be used to create unsupervised machine learning models that group data observations into clusters. You then used the Tidymodels framework in R to perform dimension reduction using PCA, and various functions in the R ecosystem such as stats::kmeans(), stats::hclust(), and cluster::agnes() to train K-Means and Hierarchical clustering models.

While Tidymodels (R) and scikit-learn (Python) are popular frameworks for writing code to train clustering models, you can also create machine learning solutions for clustering using the graphical tools in Microsoft Azure Machine Learning. You can learn more about no-code development of clustering models using Azure Machine Learning in the Create a clustering model with Azure Machine Learning designer module.

Challenge: Cluster Unlabelled Data

Now that you’ve seen how to create a clustering model, why not try for yourself? You’ll find a clustering challenge in the 04 - Clustering Challenge.ipynb notebook!

THANK YOU TO:

Allison Horst for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her gallery.

Bethany, Gold Microsoft Learn Student Ambassador, for her valuable feedback and suggestions.

Happy leaRning,

Eric (R_ic), Gold Microsoft Learn Student Ambassador.

