Load Packages and Dataset
lineup data - positions of soccer players
purrr - for the map functions
dplyr - for data manipulation
lineup <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/lineup.rds")
customers_spend <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/ws_customers.rds")
library(purrr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(cluster)

Practice
leverage map_dbl from the purrr library to run k-means using values of k ranging from 1 to 10 and extract the total within-cluster sum of squares metric from each one.
first step towards visualizing the elbow plot (scree plot)
tot_withinss <- map_dbl(1:10, function (k) {
model <- kmeans(x = lineup, centers = k)
model$tot.withinss
})
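note that kmeans() chooses random starting centroids, so tot.withinss can vary slightly between runs. a minimal sketch for making the elbow values reproducible, assuming 25 random restarts is an acceptable choice (the seed and nstart values below are illustrative, not from the original exercise):

# fix the seed and use several random starts (kmeans keeps the best of the 25 fits)
set.seed(42)
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = lineup, centers = k, nstart = 25)
  model$tot.withinss
})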
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)

Additional note on map
map() returns a list. map_lgl() returns a logical vector, map_int() an integer vector, map_dbl() a double vector, and map_chr() a character vector.
all map functions return an output vector the same length as the input.
because map() returns a list, it can accommodate any type of output.
instead of passing map() an existing function, we can supply an inline anonymous function, as shown in the code above.
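as a quick illustration of the difference between the variants, a minimal sketch (the inputs here are made up for the example):

map(1:3, function(x) x^2)              # returns a list of length 3
map_dbl(1:3, function(x) x^2)          # returns a double vector: 1 4 9
map_chr(1:3, function(x) letters[x])   # returns a character vector: "a" "b" "c"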
perform k-means with k=2 (since we know there are two teams on the field) and determine which player belongs to which team. note that k is specified using the centers argument.
we are once again using the lineup dataset.
model_km2 <- kmeans(lineup, centers = 2)
extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster
create a new dataframe appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)
visualize the clustering
ggplot(lineup_km2, aes(x, y, color = factor(cluster))) +
geom_point()
repeat the whole process but with centers = 3. Though the algorithm still runs, the results won't make sense in this context because we only have two teams on the field, not three.
model_km3 <- kmeans(lineup, centers = 3)
clust_km3 <- model_km3$cluster
lineup_km3 <- mutate(lineup, cluster = clust_km3)
ggplot(lineup_km3, aes(x, y, color = factor(cluster))) +
geom_point()
important to remember that k-means will run for any k between 1 and the number of observations, but that doesn't always mean the results will be meaningful.
Silhouette analysis
silhouette width:
within-cluster distance, C(i): the average Euclidean distance from an observation to every other observation within the same cluster.
closest-neighbor distance, N(i): for an observation, compute the average distance to the points of each other cluster; the smallest of these averages is the closest-neighbor distance.
interpretation of Silhouette width S(i)
values close to 1: well matched to its cluster
values close to 0: on the border between two clusters
values close to -1: a better fit in the neighboring cluster
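putting the two distances together gives the standard definition of the silhouette width (spelled out here for completeness; the notes above imply it but do not state it):

S(i) = (N(i) - C(i)) / max(C(i), N(i))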
calculating S(i)
we use the pam() function from the cluster library. it requires a dataframe and the desired number of clusters.
pam() is very similar but not identical to kmeans: it is a k-medoids method, so cluster centers are actual observations (medoids) rather than averaged centroids.
the greater the average width, the better the individual observations match their clusters.
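to make the definition concrete, here is a minimal sketch that computes S(i) by hand for one observation, reusing the clust_km2 assignments from the k-means section above (illustrative only - pam() and silhouette() do this for every observation for us):

d <- as.matrix(dist(lineup))    # pairwise Euclidean distances
i <- 1                          # observation of interest
same <- setdiff(which(clust_km2 == clust_km2[i]), i)
C_i <- mean(d[i, same])         # within-cluster distance C(i)
other <- setdiff(unique(clust_km2), clust_km2[i])
N_i <- min(sapply(other, function(g) mean(d[i, clust_km2 == g])))  # closest-neighbor distance N(i)
S_i <- (N_i - C_i) / max(C_i, N_i)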
Practice
In this exercise we will leverage the pam() and silhouette() functions from the cluster library to perform silhouette analysis and compare the results of models with a k of 2 and a k of 3. We'll continue working with the lineup dataset.
generate a model using the pam() function with k = 2 and plot the silhouette visual
pam_k2 <- pam(lineup, k = 2)
plot(silhouette(pam_k2))
generate a model using the pam() function with k = 3 and plot the silhouette visual
pam_k3 <- pam(lineup, k = 3)
plot(silhouette(pam_k3))
note that for k = 2 no observation has a silhouette width close to 0, but for k = 3 observation 3 has a silhouette width that is slightly negative (close to 0) - this suggests that k = 3 is not the right number of clusters.
we can also visualize the average silhouette width across different values of k, similar to the scree plot we generated above.
first we can inspect the individual widths and the average width stored in the pam results,
and then plot the average width for each k. we can see that k = 2 results in the highest average silhouette width.
pam_k3$silinfo$widths
##    cluster neighbor    sil_width
## 4        1        2  0.465320054
## 2        1        3  0.321729341
## 10       1        2  0.311385893
## 1        1        3  0.271890169
## 9        2        1  0.443606497
## 8        2        1  0.398547473
## 12       2        1  0.393982685
## 3        2        1 -0.009151755
## 11       3        1  0.546797052
## 6        3        1  0.529967901
## 5        3        1  0.359014657
## 7        3        1  0.207878188
pam_k3$silinfo$avg.width
## [1] 0.353414
sil_width <- map_dbl(2:10, function(k) {
model <- pam(x = lineup, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
ggplot(sil_df, aes(k, sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)

Making sense of the k-means clusters
we are using the wholesale data once again
first, estimate the best k using average silhouette width
second, run k-means with the suggested k
third, characterize the spending habits of these clusters of customers
sil_width <- map_dbl(2:10, function(k) {
model <- pam(x = customers_spend, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
ggplot(sil_df, aes(k, sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
from the plot we can see that k = 2 has the highest average silhouette width and is the best value of k, so we will move forward with this selection
build a k-means model with centers = 2, extract the cluster assignment vector from the model, append the assignments as a column to the original dataset, calculate the size of each cluster, and finally calculate the mean spend for each category within each cluster
set.seed(22)
model_customers <- kmeans(customers_spend, centers = 2)
clust_customers <- model_customers$cluster
segment_customers <- mutate(customers_spend, cluster = clust_customers)
count(segment_customers, cluster)
##   cluster  n
## 1       1 10
## 2       2 35
segment_customers %>%
group_by(cluster) %>%
summarize_all(list(mean))
## # A tibble: 2 × 4
##   cluster   Milk Grocery Frozen
##     <int>  <dbl>   <dbl>  <dbl>
## 1       1 13701.   17721   1173
## 2       2  2296.    5004   3354.
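as an aside, newer versions of dplyr supersede summarize_all() with across(); a sketch of the equivalent call, assuming dplyr >= 1.0:

segment_customers %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean))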
recall that when we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gives us 2 clusters. both of these results are valid - which one is appropriate depends on the business context and subject matter expertise.
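for reference, a minimal sketch of the hierarchical side of that comparison, assuming Euclidean distances and complete linkage (the notes above do not restate the linkage that was actually used):

dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers, method = "complete")
clust_hc <- cutree(hc_customers, k = 4)
segment_customers_hc <- mutate(customers_spend, cluster = clust_hc)
count(segment_customers_hc, cluster)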