1 K-Means Clustering

1.1 Brief

We will perform a clustering analysis using the K-means method. We will also see whether we can perform a dimensionality reduction using Principal Component Analysis (PCA).

1.2 Dataset

We will use the Wholesale customers dataset, prepared by Margarida Cardoso and available on the UCI Machine Learning Repository.

1.3 Library and Setup

Load the required libraries:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.1     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#library(ggforce)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(FactoMineR)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(class) 
library(ggplot2) 
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(inspectdf)  
library(tidymodels)  
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.3     ✔ rsample      1.1.1
## ✔ dials        1.1.0     ✔ tune         1.0.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.4     ✔ yardstick    1.1.0
## ✔ recipes      1.0.5     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ gridExtra::combine()     masks dplyr::combine()
## ✖ scales::discard()        masks purrr::discard()
## ✖ plotly::filter()         masks dplyr::filter(), stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ car::recode()            masks dplyr::recode()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some()              masks purrr::some()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
options(scipen = 100, max.print = 101)

2 Data Preparation

2.1 Data Set Information

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

Attribute Information:

  • FRESH: annual spending (m.u.) on fresh products (Continuous);
  • MILK: annual spending (m.u.) on milk products (Continuous);
  • GROCERY: annual spending (m.u.) on grocery products (Continuous);
  • FROZEN: annual spending (m.u.) on frozen products (Continuous);
  • DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
  • DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
  • CHANNEL: customer channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal);
  • REGION: customer region - Lisbon, Oporto or Other (Nominal).

2.2 Load Dataset

# Read the dataset in and drop the "Region" feature, which we won't use in this analysis
wholesale <- read.csv("data_input/wholesale.csv")
wholesale <- wholesale[,-2]

Let’s convert the ‘Channel’ feature into ‘Industry’ and make it a factor.

wholesale$Industry <- factor(wholesale$Channel, levels = c(1, 2), labels = c("horeca", "retail"))

# After doing that we can remove the original Channel feature
wholesale <- wholesale[,-1]
table(wholesale$Industry)
## 
## horeca retail 
##    298    142

Notice that, unlike the credit risk analysis example, we do not have a balanced dataset. The prior, or baseline, accuracy for always predicting the majority class would be 298/440 ≈ 67.7%.
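As a quick check (a one-line sketch, not part of the original workflow), the baseline accuracy is just the proportion of the majority class:

# Proportion of each Industry level; the larger share is the baseline accuracy
prop.table(table(wholesale$Industry))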

2.3 Scaling Data

Normalize each numeric feature to a z-score, z = (x - mean(x)) / sd(x), so that all features are on a comparable scale:

wholesale.z <- data.frame(scale(wholesale[,-7]))
summary(wholesale.z)
##      Fresh              Milk            Grocery            Frozen        
##  Min.   :-0.9486   Min.   :-0.7779   Min.   :-0.8364   Min.   :-0.62763  
##  1st Qu.:-0.7015   1st Qu.:-0.5776   1st Qu.:-0.6101   1st Qu.:-0.47988  
##  Median :-0.2764   Median :-0.2939   Median :-0.3363   Median :-0.31844  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.3901   3rd Qu.: 0.1889   3rd Qu.: 0.2846   3rd Qu.: 0.09935  
##  Max.   : 7.9187   Max.   : 9.1732   Max.   : 8.9264   Max.   :11.90545  
##  Detergents_Paper    Delicassen     
##  Min.   :-0.6037   Min.   :-0.5396  
##  1st Qu.:-0.5505   1st Qu.:-0.3960  
##  Median :-0.4331   Median :-0.1984  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2182   3rd Qu.: 0.1047  
##  Max.   : 7.9586   Max.   :16.4597
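As a sanity check, scale() is equivalent to manually centering and scaling each column. A minimal spot check on the Fresh column (assuming the column names shown above):

# scale() should match (x - mean(x)) / sd(x) for each column
all.equal(wholesale.z$Fresh,
          (wholesale$Fresh - mean(wholesale$Fresh)) / sd(wholesale$Fresh))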

Merge the new data frame containing the z-score values with the Industry vector:

wholesale.n <- data.frame(cbind(wholesale.z, Industry = wholesale$Industry))

2.4 Explore Data

glimpse(wholesale.n)
## Rows: 440
## Columns: 7
## $ Fresh            <dbl> 0.052873004, -0.390857056, -0.446520984, 0.099997579,…
## $ Milk             <dbl> 0.52297247, 0.54383861, 0.40807319, -0.62331041, -0.0…
## $ Grocery          <dbl> -0.04106815, 0.17012470, -0.02812509, -0.39253008, -0…
## $ Frozen           <dbl> -0.588697039, -0.269829035, -0.137379340, 0.686363006…
## $ Detergents_Paper <dbl> -0.04351919, 0.08630859, 0.13308016, -0.49802132, -0.…
## $ Delicassen       <dbl> -0.06626363, 0.08904969, 2.24074190, 0.09330484, 1.29…
## $ Industry         <fct> retail, retail, retail, horeca, retail, retail, retai…

The data has 440 rows and 7 columns.

2.5 Check Missing Values

First, we check for missing values:

anyNA(wholesale.n)
## [1] FALSE

Our data has no missing values.

3 Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns, such as correlations between variables.

3.1 Check Correlation

GGally::ggcorr(wholesale.n, hjust = 1, layout.exp = 2, label = T, label_size = 2.9)

Based on the plot above, some predictor variables are highly correlated with one another.
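To see the actual coefficients behind the plot, we can inspect the correlation matrix directly (a quick sketch; in this dataset, Grocery and Detergents_Paper in particular tend to be strongly correlated):

# Pairwise Pearson correlations of the numeric features, rounded for readability
round(cor(wholesale.n %>% select(-Industry)), 2)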

4 Clustering

4.1 Finding the Optimal Number of Clusters

Before we do the cluster analysis, we first need to determine the optimal number of clusters. In the K-means method, we seek to minimize the total within-cluster sum of squares (meaning the distances between observations in the same cluster are minimal). To find the optimal number of clusters, we can use three methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters by majority vote.

4.1.1 Elbow method

Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters at the “bend of the elbow”, the point where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.

library(factoextra)
fviz_nbclust(x = wholesale.n %>% select (-c(Industry)),  
             FUNcluster = kmeans, 
             method = "wss")  

Using the elbow method, 8 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. This method alone may not be enough, since the location of the elbow is ambiguous.
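Under the hood, the elbow plot is simply the tot.withinss component of kmeans() evaluated at each k. A minimal manual sketch (nstart = 25 here is an assumption for stability, not taken from fviz_nbclust):

# Total within-cluster sum of squares for k = 1..10
x <- wholesale.n %>% select(-Industry)
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")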

4.1.2 Silhouette Method

The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number with the highest average silhouette score (the peak).

fviz_nbclust(wholesale.n %>% select (-c(Industry)), kmeans, "silhouette") + labs(subtitle = "Silhouette method")

With the silhouette method, the number of clusters with the maximum score is taken as the optimal k. The graph shows that the optimal number of clusters is 2.
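The average silhouette width behind this plot can be reproduced with silhouette() from the cluster package loaded above (a sketch for k = 2; nstart = 25 is an assumption):

# Average silhouette width for k = 2
x <- wholesale.n %>% select(-Industry)
set.seed(123)
km2 <- kmeans(x, centers = 2, nstart = 25)
sil <- silhouette(km2$cluster, dist(x))
mean(sil[, "sil_width"])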

4.1.3 Gap Statistic

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value of k that maximizes the gap statistic.

fviz_nbclust(wholesale.n %>% select (-c(Industry)), kmeans, "gap_stat") + labs(subtitle = "Gap Statistic method")

Based on the gap statistic method, the optimal k is 4.
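fviz_nbclust() computes the gap statistic via clusGap() from the cluster package; calling it directly exposes the full table of gap values (B = 50 bootstrap references is an assumption to keep runtime modest):

# Gap statistic for k = 1..10
set.seed(123)
gap <- clusGap(wholesale.n %>% select(-Industry),
               FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap)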

The three methods suggest k = 8, 2, or 4 as the optimal number of clusters (from the elbow, silhouette, and gap statistic methods, respectively). Since we know that our target has only 2 categories, horeca and retail, let's use k = 2.

4.2 K-Means Clustering

Here is the algorithm behind K-means clustering:

  • Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments for the observations.
  • Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid, i.e. the vector of the p feature means for the observations in the kth cluster, then assign each observation to the cluster whose centroid is closest (using Euclidean distance or another distance measure). A from-scratch sketch of this loop follows below.
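Below is a minimal from-scratch sketch of the loop above (Lloyd's algorithm), for illustration only; simple_kmeans is a hypothetical helper, and it does not handle empty clusters or multiple random starts:

# A bare-bones K-means following the two steps above
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: random initial cluster assignments
  assignment <- sample(1:k, nrow(x), replace = TRUE)
  for (i in seq_len(max_iter)) {
    # Centroids: the p feature means of each cluster (a p x k matrix)
    centroids <- sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE]))
    # Euclidean distances from every observation to every centroid (n x k)
    d <- as.matrix(dist(rbind(t(centroids), x)))[-(1:k), 1:k]
    # Step 2: reassign each observation to its nearest centroid
    new_assignment <- max.col(-d)
    if (identical(new_assignment, assignment)) break  # assignments stopped changing
    assignment <- new_assignment
  }
  list(cluster = assignment, centers = t(centroids))
}

In practice we use the built-in kmeans(), with a seed for reproducibility: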
set.seed(123)
wholesale_km_opt <- kmeans(wholesale.n %>% select (-c(Industry)) , centers = 2)
wholesale_km_opt
## K-means clustering with 2 clusters of sizes 391, 49
## 
## Cluster means:
##         Fresh       Milk    Grocery      Frozen Detergents_Paper Delicassen
## 1  0.00279925 -0.2389321 -0.2590726 -0.03391377       -0.2449652 -0.0977538
## 2 -0.02233688  1.9065808  2.0672933  0.27061801        1.9547227  0.7800355
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [38] 1 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1
##  [75] 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##  [ reached getOption("max.print") -- omitted 339 entries ]
## 
## Within cluster sum of squares by cluster:
## [1]  905.7543 1043.8463
##  (between_SS / total_SS =  26.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
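The ratio reported above (between_SS / total_SS = 26.0 %) can be recomputed from the components listed at the bottom of the output:

# Share of the total sum of squares captured between clusters
wholesale_km_opt$betweenss / wholesale_km_opt$totss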
wholesale.n$cluster <- wholesale_km_opt$cluster
wholesale.n
##           Fresh        Milk     Grocery     Frozen Detergents_Paper  Delicassen
## 1   0.052873004  0.52297247 -0.04106815 -0.5886970      -0.04351919 -0.06626363
## 2  -0.390857056  0.54383861  0.17012470 -0.2698290       0.08630859  0.08904969
## 3  -0.446520984  0.40807319 -0.02812509 -0.1373793       0.13308016  2.24074190
## 4   0.099997579 -0.62331041 -0.39253008  0.6863630      -0.49802132  0.09330484
## 5   0.839284120 -0.05233688 -0.07926595  0.1736612      -0.23165413  1.29786952
## 6  -0.204572662  0.33368675 -0.29729863 -0.4955909      -0.22787885 -0.02619421
## 7   0.009939037 -0.35191506 -0.10273183 -0.5339045       0.05421869 -0.34745874
## 8  -0.349583519 -0.11385135  0.15518231 -0.2889858       0.09218126  0.36918101
## 9  -0.477357535 -0.29107807 -0.18512545 -0.5452338      -0.24444815 -0.27476643
## 10 -0.473957607  0.71767797  1.15011422 -0.3940392       0.95294579  0.20322979
## 11 -0.682697336 -0.05328534  0.52853169  0.2735649       0.64924524  0.07770259
## 12  0.090588478 -0.63306601 -0.36075119 -0.3402766      -0.48921233 -0.36447938
##    Industry cluster
## 1    retail       1
## 2    retail       1
## 3    retail       1
## 4    horeca       1
## 5    retail       1
## 6    retail       1
## 7    retail       1
## 8    retail       1
## 9    horeca       1
## 10   retail       2
## 11   retail       1
## 12   retail       1
##  [ reached 'max' / getOption("max.print") -- omitted 428 rows ]
library(ggiraphExtra)
# cast the cluster id to a factor so each cluster gets its own discrete colour
ggRadar(data = wholesale.n %>% mutate(cluster = as.factor(cluster)),
        aes(colour = cluster), interactive = TRUE)
# fviz_cluster projects the observations onto the first two principal components;
# exclude the helper columns so the projection uses only the six spending
# features that kmeans() saw
fviz_cluster(object = wholesale_km_opt, 
             data = wholesale.n %>% select(-c(Industry, cluster)))

# cluster profiling: drop the factor column first so every remaining
# column can be averaged
wholesale.n %>%
  select(-Industry) %>% 
  group_by(cluster) %>% 
  summarise_all(.funs = "mean")
## # A tibble: 2 × 7
##   cluster    Fresh   Milk Grocery  Frozen Detergents_Paper Delicassen
##     <int>    <dbl>  <dbl>   <dbl>   <dbl>            <dbl>      <dbl>
## 1       1  0.00280 -0.239  -0.259 -0.0339           -0.245    -0.0978
## 2       2 -0.0223   1.91    2.07   0.271             1.95      0.780

From the profiling result, we can derive some insights:

  • Cluster 1: low-spending customers, below the overall average on every category and roughly at the average on Fresh.
  • Cluster 2: high-spending customers, well above average on Milk, Grocery, Detergents_Paper, and Delicassen, and moderately above average on Frozen.
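Since we chose k = 2 to mirror the two industries, a quick cross-tabulation (a small check, not in the original workflow) shows how the clusters line up with the Industry label:

# Contingency table of cluster assignment vs. known industry
table(cluster = wholesale.n$cluster, industry = wholesale.n$Industry)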

5 Conclusion

We can draw some conclusions regarding our dataset based on the cluster analysis (and the principal-component projection used to visualize it):

  • We can separate our data into at least 2 clusters based on all of the numerical features, with 26.0% of the total sum of squares (between_SS / total_SS) attributable to the separation between clusters.