Load Packages and Dataset
lineup data - positions of soccer players
purrr - for the map functions
dplyr - for data manipulation
lineup <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/lineup.rds")
customers_spend <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/ws_customers.rds")
library(purrr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(cluster)

Practice
leverage map_dbl from the purrr library to run k-means using values of k ranging from 1 to 10 and extract the total within-cluster sum of squares metric from each one.
first step towards visualizing the elbow plot (scree plot)
tot_withinss <- map_dbl(1:10, function (k) {
model <- kmeans(x = lineup, centers = k)
model$tot.withinss
})
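note that kmeans() chooses random starting centroids, so tot.withinss can vary slightly between runs. a minimal sketch for making the elbow values reproducible, assuming 25 random restarts is an acceptable choice (the seed and nstart values below are illustrative, not from the original exercise):

# fix the seed and use several random starts (kmeans keeps the best of the 25 fits)
set.seed(42)
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = lineup, centers = k, nstart = 25)
  model$tot.withinss
})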
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)

Additional note on map
map() returns a list. map_lgl() returns a logical vector, map_int() an integer vector, map_dbl() a double vector, and map_chr() a character vector.
all map functions return an output vector the same length as the input.
because map() returns a list, it can accommodate any type of output.
instead of passing map() an existing function, we can supply an inline anonymous function, as shown in the code above.
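as a quick illustration of the difference between the variants, a minimal sketch (the inputs here are made up for the example):

map(1:3, function(x) x^2)              # returns a list of length 3
map_dbl(1:3, function(x) x^2)          # returns a double vector: 1 4 9
map_chr(1:3, function(x) letters[x])   # returns a character vector: "a" "b" "c"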
perform k-means with k=2 (since we know there are two teams on the field) and determine which player belongs to which team. note that k is specified using the centers argument.
we are once again using the lineup dataset.
model_km2 <- kmeans(lineup, centers = 2)
extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster
create a new dataframe appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)
visualize the clustering
ggplot(lineup_km2, aes(x, y, color = factor(cluster))) +
geom_point()
repeat the whole process but with centers = 3. Though the algorithm still runs, the results won't make sense in this context because we only have two teams on the field, not three.
model_km3 <- kmeans(lineup, centers = 3)
clust_km3 <- model_km3$cluster
lineup_km3 <- mutate(lineup, cluster = clust_km3)
ggplot(lineup_km3, aes(x, y, color = factor(cluster))) +
geom_point()
important to remember that k-means will run for any k between 1 and the number of observations, but that doesn't always mean the results will be meaningful.
Silhouette analysis
silhouette width:
within-cluster distance, C(i): the average Euclidean distance from an observation to every other observation within the same cluster.
closest-neighbor distance, N(i): for an observation, compute the average distance to the points of each other cluster; the smallest of these averages is the closest-neighbor distance.
interpretation of Silhouette width S(i)
values close to 1: well matched to its cluster
values close to 0: on the border between two clusters
values close to -1: a better fit in the neighboring cluster
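putting the two distances together gives the standard definition of the silhouette width (spelled out here for completeness; the notes above imply it but do not state it):

S(i) = (N(i) - C(i)) / max(C(i), N(i))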
calculating S(i)
we use the pam() function from the cluster library. it requires a dataframe and the desired number of clusters.
pam() is very similar but not identical to kmeans: it is a k-medoids method, so cluster centers are actual observations (medoids) rather than averaged centroids.
the greater the average width, the better the individual observations match their clusters.
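to make the definition concrete, here is a minimal sketch that computes S(i) by hand for one observation, reusing the clust_km2 assignments from the k-means section above (illustrative only - pam() and silhouette() do this for every observation for us):

d <- as.matrix(dist(lineup))    # pairwise Euclidean distances
i <- 1                          # observation of interest
same <- setdiff(which(clust_km2 == clust_km2[i]), i)
C_i <- mean(d[i, same])         # within-cluster distance C(i)
other <- setdiff(unique(clust_km2), clust_km2[i])
N_i <- min(sapply(other, function(g) mean(d[i, clust_km2 == g])))  # closest-neighbor distance N(i)
S_i <- (N_i - C_i) / max(C_i, N_i)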
Practice
In this exercise we will leverage the pam() and silhouette() functions from the cluster library to perform silhouette analysis and compare the results of models with a k of 2 and a k of 3. We'll continue working with the lineup dataset.
generate a model using the pam() function with k = 2 and plot the silhouette visual
pam_k2 <- pam(lineup, k = 2)
plot(silhouette(pam_k2))
generate a model using the pam() function with k = 3 and plot the silhouette visual
pam_k3 <- pam(lineup, k = 3)
plot(silhouette(pam_k3))
note that for k = 2 no observation has a silhouette width close to 0, but for k = 3 observation 3 has a silhouette width that is slightly negative (close to 0) - this suggests that k = 3 is not the right number of clusters.
we can also visualize the average silhouette width across different values of k, similar to the scree plot we generated above.
first we can inspect the individual widths and the average width stored in the pam results,
and then plot the average width for each k. we can see that k = 2 results in the highest average silhouette width.
pam_k3$silinfo$widths
##    cluster neighbor    sil_width
## 4        1        2  0.465320054
## 2        1        3  0.321729341
## 10       1        2  0.311385893
## 1        1        3  0.271890169
## 9        2        1  0.443606497
## 8        2        1  0.398547473
## 12       2        1  0.393982685
## 3        2        1 -0.009151755
## 11       3        1  0.546797052
## 6        3        1  0.529967901
## 5        3        1  0.359014657
## 7        3        1  0.207878188
pam_k3$silinfo$avg.width
## [1] 0.353414
sil_width <- map_dbl(2:10, function(k) {
model <- pam(x = lineup, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
ggplot(sil_df, aes(k, sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)

Making sense of the k-means clusters
we are using the wholesale data once again
first, estimate the best k using average silhouette width
second, run k-means with the suggested k
third, characterize the spending habits of these clusters of customers
sil_width <- map_dbl(2:10, function(k) {
model <- pam(x = customers_spend, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
ggplot(sil_df, aes(k, sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
from the plot we can see that k = 2 has the highest average silhouette width and is the best value of k, so we will move forward with this selection
build a k-means model with centers = 2, extract the cluster assignment vector from the model, append the assignments as a column to the original dataset, calculate the size of each cluster, and finally calculate the mean spend for each category within each cluster
set.seed(22)
model_customers <- kmeans(customers_spend, centers = 2)
clust_customers <- model_customers$cluster
segment_customers <- mutate(customers_spend, cluster = clust_customers)
count(segment_customers, cluster)
##   cluster  n
## 1       1 10
## 2       2 35
segment_customers %>%
group_by(cluster) %>%
summarize_all(list(mean))
## # A tibble: 2 × 4
##   cluster   Milk Grocery Frozen
##     <int>  <dbl>   <dbl>  <dbl>
## 1       1 13701.   17721   1173
## 2       2  2296.    5004   3354.
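as an aside, newer versions of dplyr supersede summarize_all() with across(); a sketch of the equivalent call, assuming dplyr >= 1.0:

segment_customers %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean))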
recall that when we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gives us 2 clusters. both of these results are valid - which one is appropriate depends on the business context and subject matter expertise.
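for reference, a minimal sketch of the hierarchical side of that comparison, assuming Euclidean distances and complete linkage (the notes above do not restate the linkage that was actually used):

dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers, method = "complete")
clust_hc <- cutree(hc_customers, k = 4)
segment_customers_hc <- mutate(customers_spend, cluster = clust_hc)
count(segment_customers_hc, cluster)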