We will perform a clustering analysis using the K-means method. We will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
We will use a dataset prepared by Margarida Cardoso and available on the UCI Machine Learning Repository.
Load the required libraries:
library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
#library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(plotly)
library(dplyr)
library(car)
library(caret)
library(class)
library(ggplot2)
library(gridExtra)
library(inspectdf)
library(tidymodels)
options(scipen = 100, max.print = 101)
The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
Attribute Information: Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen record the annual spending (in m.u.) on each product category; Channel indicates whether the client is a horeca (hotel/restaurant/café) or retail customer; Region indicates the client's region.
# Read the dataset in, drop the "Region" feature because it's not interesting
wholesale <- read.csv("data_input/wholesale.csv")
wholesale <- wholesale[,-2]
Let’s convert the ‘Channel’ feature into ‘Industry’ and make it a factor.
wholesale$Industry <- factor(wholesale$Channel, levels = c(1, 2), labels = c("horeca", "retail"))
# After doing that we can remove the original Channel feature
wholesale <- wholesale[,-1]
table(wholesale$Industry)
##
## horeca retail
## 298 142
Notice here that, unlike the credit risk analysis example, we do not have a balanced dataset. The prior or baseline accuracy for predicting the majority class would be 67.7%.
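A quick way to verify this baseline directly:
# Share of each class; the majority-class share is the naive baseline accuracy
prop.table(table(wholesale$Industry))
# horeca: 298/440 ≈ 0.677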
Normalization to z-score:
wholesale.z <- data.frame(scale(wholesale[,-7]))
summary(wholesale.z)## Fresh Milk Grocery Frozen
## Min. :-0.9486 Min. :-0.7779 Min. :-0.8364 Min. :-0.62763
## 1st Qu.:-0.7015 1st Qu.:-0.5776 1st Qu.:-0.6101 1st Qu.:-0.47988
## Median :-0.2764 Median :-0.2939 Median :-0.3363 Median :-0.31844
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3901 3rd Qu.: 0.1889 3rd Qu.: 0.2846 3rd Qu.: 0.09935
## Max. : 7.9187 Max. : 9.1732 Max. : 8.9264 Max. :11.90545
## Detergents_Paper Delicassen
## Min. :-0.6037 Min. :-0.5396
## 1st Qu.:-0.5505 1st Qu.:-0.3960
## Median :-0.4331 Median :-0.1984
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2182 3rd Qu.: 0.1047
## Max. : 7.9586 Max. :16.4597
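As a reference, scale() simply applies the usual z-score transform to each column; a minimal manual equivalent for a single column (illustrative only) is:
# Manual z-score for the Fresh column; should match the corresponding column of scale()
fresh_z <- (wholesale$Fresh - mean(wholesale$Fresh)) / sd(wholesale$Fresh)
all.equal(fresh_z, wholesale.z$Fresh)  # expected to be TRUE (up to floating-point error)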
Merge the new data frame containing the z-score values with the Industry vector:
wholesale.n <- data.frame(cbind(wholesale.z, Industry = wholesale$Industry))
glimpse(wholesale.n)
## Rows: 440
## Columns: 7
## $ Fresh <dbl> 0.052873004, -0.390857056, -0.446520984, 0.099997579,…
## $ Milk <dbl> 0.52297247, 0.54383861, 0.40807319, -0.62331041, -0.0…
## $ Grocery <dbl> -0.04106815, 0.17012470, -0.02812509, -0.39253008, -0…
## $ Frozen <dbl> -0.588697039, -0.269829035, -0.137379340, 0.686363006…
## $ Detergents_Paper <dbl> -0.04351919, 0.08630859, 0.13308016, -0.49802132, -0.…
## $ Delicassen <dbl> -0.06626363, 0.08904969, 2.24074190, 0.09330484, 1.29…
## $ Industry <fct> retail, retail, retail, horeca, retail, retail, retai…
The data has 440 rows and 7 columns.
First, we check for missing values:
anyNA(wholesale.n)
## [1] FALSE
Our data has no missing values.
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlations between them.
GGally::ggcorr(wholesale.n, hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
Based on the plot above, some predictor variables are highly correlated with one another.
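The exact values behind the plot can also be inspected numerically:
# Pearson correlation matrix of the scaled spending variables, rounded for readability
round(cor(wholesale.z), 2)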
Before we do the cluster analysis, we first need to determine the optimal number of clusters. In K-means clustering we seek to minimize the total within-cluster sum of squares (so that observations within the same cluster are as close to each other as possible). To find the optimal number of clusters, we can use three methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters by majority vote.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to level off as the number of clusters increases.
library(factoextra)
fviz_nbclust(x = wholesale.n %>% select (-c(Industry)),
FUNcluster = kmeans,
method = "wss") Using the elbow method, we know that 8 cluster is good enough since there is no significant decline in total within-cluster sum of squares on higher number of clusters. This method may be not enough since the optimal number of clusters is vague.
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number of clusters with the highest average silhouette score (the peak).
fviz_nbclust(wholesale.n %>% select (-c(Industry)), kmeans, "silhouette") + labs(subtitle = "Silhouette method")
Based on the silhouette method, the number of clusters with the maximum score is considered the optimum k. The graph shows that the optimum number of clusters is 2.
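As a cross-check, the average silhouette width for k = 2 can also be computed directly with cluster::silhouette() (a sketch; the exact value depends on the random start):
# Average silhouette width of a k = 2 solution
set.seed(123)
km2 <- kmeans(wholesale.n %>% select(-Industry), centers = 2)
sil <- silhouette(km2$cluster, dist(wholesale.n %>% select(-Industry)))
mean(sil[, "sil_width"])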
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value of k that maximizes the gap statistic.
fviz_nbclust(wholesale.n %>% select (-c(Industry)), kmeans, "gap_stat") + labs(subtitle = "Gap Statistic method")
Based on the gap statistic method, the optimal k is 4.
The three methods suggest 8, 2, and 4 as candidate numbers of clusters. Since we know that our target has only 2 categories, retail and horeca, let's use k = 2.
Here is the algorithm behind K-means clustering:
1. Choose k initial centroids (for example, k randomly selected observations).
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
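To make these steps concrete, here is a minimal hand-rolled sketch of Lloyd's algorithm; the helper name simple_kmeans is ours and purely illustrative, and the actual analysis below uses the optimized stats::kmeans().
# Illustrative-only implementation of the K-means steps described above
simple_kmeans <- function(x, k, iter_max = 10, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  # 1. pick k observations at random as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iter_max)) {
    # 2. assign every observation to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    cluster <- max.col(-d)
    # 3. recompute each centroid as the mean of its members
    #    (assumes no cluster ends up empty -- acceptable for a sketch)
    new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # 4. stop when the centroids no longer move
    if (max(abs(new_centers - centers)) < 1e-8) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}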
set.seed(123)
wholesale_km_opt <- kmeans(wholesale.n %>% select (-c(Industry)) , centers = 2)
wholesale_km_opt
## K-means clustering with 2 clusters of sizes 391, 49
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 0.00279925 -0.2389321 -0.2590726 -0.03391377 -0.2449652 -0.0977538
## 2 -0.02233688 1.9065808 2.0672933 0.27061801 1.9547227 0.7800355
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
## [38] 1 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 339 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 905.7543 1043.8463
## (between_SS / total_SS = 26.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
wholesale.n$cluster <- wholesale_km_opt$cluster
wholesale.n
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 0.052873004 0.52297247 -0.04106815 -0.5886970 -0.04351919 -0.06626363
## 2 -0.390857056 0.54383861 0.17012470 -0.2698290 0.08630859 0.08904969
## 3 -0.446520984 0.40807319 -0.02812509 -0.1373793 0.13308016 2.24074190
## 4 0.099997579 -0.62331041 -0.39253008 0.6863630 -0.49802132 0.09330484
## 5 0.839284120 -0.05233688 -0.07926595 0.1736612 -0.23165413 1.29786952
## 6 -0.204572662 0.33368675 -0.29729863 -0.4955909 -0.22787885 -0.02619421
## 7 0.009939037 -0.35191506 -0.10273183 -0.5339045 0.05421869 -0.34745874
## 8 -0.349583519 -0.11385135 0.15518231 -0.2889858 0.09218126 0.36918101
## 9 -0.477357535 -0.29107807 -0.18512545 -0.5452338 -0.24444815 -0.27476643
## 10 -0.473957607 0.71767797 1.15011422 -0.3940392 0.95294579 0.20322979
## 11 -0.682697336 -0.05328534 0.52853169 0.2735649 0.64924524 0.07770259
## 12 0.090588478 -0.63306601 -0.36075119 -0.3402766 -0.48921233 -0.36447938
## Industry cluster
## 1 retail 1
## 2 retail 1
## 3 retail 1
## 4 horeca 1
## 5 retail 1
## 6 retail 1
## 7 retail 1
## 8 retail 1
## 9 horeca 1
## 10 retail 2
## 11 retail 1
## 12 retail 1
## [ reached 'max' / getOption("max.print") -- omitted 428 rows ]
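Since the true Industry labels are known for this data, we can also cross-tabulate them against the cluster assignments to see how well the unsupervised clusters line up with the two channels (an exploratory check; the cluster labels 1/2 are arbitrary):
# Contingency table of K-means cluster vs. the known Industry label
table(cluster = wholesale.n$cluster, industry = wholesale.n$Industry)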
library(ggiraphExtra)
ggRadar(data=wholesale.n, aes(colour=cluster), interactive=TRUE)
fviz_cluster(object = wholesale_km_opt,
             data = wholesale.n %>% select (-c(Industry)))
# cluster profiling
wholesale.n %>%
  select(-Industry) %>%
  group_by(cluster) %>%
  summarise_all(.funs = "mean")
## # A tibble: 2 × 7
## cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.00280 -0.239 -0.259 -0.0339 -0.245 -0.0978
## 2 2 -0.0223 1.91 2.07 0.271 1.95 0.780
From the profiling result, we can derive insights:
Cluster 2 spends well above the overall average on Milk, Grocery, Detergents_Paper, and Delicassen, while cluster 1 sits around or below the overall average in every category. We can pull some conclusions regarding our dataset based on the previous cluster and principal component analysis: