This document searches for similarities in sales data between Walmart stores. Such groupings might be useful for applying business strategies to particular locations, suggesting which stores are at risk of being shut down and which are worth investing in. The data was taken from Kaggle (link). Among other variables, it contains the weekly sales of each department of every store. There are 45 stores in total. Three clustering methods will be compared: K-means, Clara and PAM.
library(tidyverse)
library(cluster)
library(factoextra)
library(fpc)
data <- read.csv('walmart_cleaned.csv', header = TRUE, sep = ',')
head(data, 5)
## X Store Date IsHoliday Dept Weekly_Sales Temperature Fuel_Price
## 1 0 1 2010-02-05 0 1 24924.50 42.31 2.572
## 2 1 1 2010-02-05 0 26 11737.12 42.31 2.572
## 3 2 1 2010-02-05 0 17 13223.76 42.31 2.572
## 4 3 1 2010-02-05 0 45 37.44 42.31 2.572
## 5 4 1 2010-02-05 0 28 1085.29 42.31 2.572
## MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type
## 1 0 0 0 0 0 211.0964 8.106 3
## 2 0 0 0 0 0 211.0964 8.106 3
## 3 0 0 0 0 0 211.0964 8.106 3
## 4 0 0 0 0 0 211.0964 8.106 3
## 5 0 0 0 0 0 211.0964 8.106 3
## Size
## 1 151315
## 2 151315
## 3 151315
## 4 151315
## 5 151315
We select the relevant columns: Store, Date, Dept and Weekly_Sales (Date is needed later for the weekly grouping). To ensure there are no problems along the way, NAs are dropped from the Weekly_Sales column.
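The code for this step is not shown in the source; below is a minimal sketch, assuming the filtered frame is stored as data1 (the name used in the next chunk).
data1 <- data %>%
  select(Store, Date, Dept, Weekly_Sales) %>% # keep only the columns described above
  drop_na(Weekly_Sales) # remove rows with missing sales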
With NAs eliminated, the data is ready for further transformations. The department-level sales will now be summed into one total per store and week.
summed_data <- data1 %>%
  group_by(Store, Date) %>%
  summarize(Total_Weekly_Sales = sum(Weekly_Sales))
## `summarise()` has grouped output by 'Store'. You can override using the
## `.groups` argument.
head(summed_data, 5)
## # A tibble: 5 × 3
## # Groups: Store [1]
## Store Date Total_Weekly_Sales
## <int> <chr> <dbl>
## 1 1 2010-02-05 1643691.
## 2 1 2010-02-12 1641957.
## 3 1 2010-02-19 1611968.
## 4 1 2010-02-26 1409728.
## 5 1 2010-03-05 1554807.
The new column is aptly named Total_Weekly_Sales. The dataframe will now be pivoted to a wide format, so that the stores become the rows and the weeks become the columns.
pivot_data <- summed_data %>%
  pivot_wider(names_from = Date, values_from = Total_Weekly_Sales, values_fill = 0)
head(pivot_data, 5)
## # A tibble: 5 × 144
## # Groups: Store [5]
## Store `2010-02-05` `2010-02-12` `2010-02-19` `2010-02-26` `2010-03-05`
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1643691. 1641957. 1611968. 1409728. 1554807.
## 2 2 2136989. 2137810. 2124452. 1865097. 1991013.
## 3 3 461622. 420729. 421642. 407205. 415202.
## 4 4 2135144. 2188307. 2049860. 1925729. 1971057.
## 5 5 317173. 311826. 303448. 270282. 288856.
## # ℹ 138 more variables: `2010-03-12` <dbl>, `2010-03-19` <dbl>,
## # `2010-03-26` <dbl>, `2010-04-02` <dbl>, `2010-04-09` <dbl>,
## # `2010-04-16` <dbl>, `2010-04-23` <dbl>, `2010-04-30` <dbl>,
## # `2010-05-07` <dbl>, `2010-05-14` <dbl>, `2010-05-21` <dbl>,
## # `2010-05-28` <dbl>, `2010-06-04` <dbl>, `2010-06-11` <dbl>,
## # `2010-06-18` <dbl>, `2010-06-25` <dbl>, `2010-07-02` <dbl>,
## # `2010-07-09` <dbl>, `2010-07-16` <dbl>, `2010-07-23` <dbl>, …
What is left is to create another dataframe without the Store column. This is crucial for both normalization and clustering: the store ID is merely an identifier, carries no quantitative information, and would interfere with the results if treated as a feature. The remaining columns are then normalized.
# Keep only the 143 weekly-sales columns, dropping the Store ID
clustering_data <- data.frame(pivot_data[, 2:144])
# Standardize each week to zero mean and unit variance across stores
normalized_data <- scale(clustering_data)
head(normalized_data, 1)
## X2010.02.05 X2010.02.12 X2010.02.19 X2010.02.26 X2010.03.05 X2010.03.12
## [1,] 0.8681116 0.9944409 0.9279246 0.8339056 0.9286399 0.770161
## X2010.03.19 X2010.03.26 X2010.04.02 X2010.04.09 X2010.04.16 X2010.04.23
## [1,] 0.8906981 0.8266387 0.7718645 0.8900727 0.8809906 0.7560658
## X2010.04.30 X2010.05.07 X2010.05.14 X2010.05.21 X2010.05.28 X2010.06.04
## [1,] 0.8821617 0.9154209 0.9203883 0.7550353 0.6677022 0.84652
## X2010.06.11 X2010.06.18 X2010.06.25 X2010.07.02 X2010.07.09 X2010.07.16
## [1,] 0.8656116 0.8024653 0.713538 0.7094072 0.8695134 0.7920981
## X2010.07.23 X2010.07.30 X2010.08.06 X2010.08.13 X2010.08.20 X2010.08.27
## [1,] 0.7521591 0.7388255 0.9684423 0.8785393 0.8426111 0.8153112
## X2010.09.03 X2010.09.10 X2010.09.17 X2010.09.24 X2010.10.01 X2010.10.08
## [1,] 0.899097 0.9364531 0.9575039 0.9222943 1.062556 0.9718218
## X2010.10.15 X2010.10.22 X2010.10.29 X2010.11.05 X2010.11.12 X2010.11.19
## [1,] 1.000175 0.791923 0.8217848 0.9925369 0.8587862 0.8830504
## X2010.11.26 X2010.12.03 X2010.12.10 X2010.12.17 X2010.12.24 X2010.12.31
## [1,] 0.615483 0.7289015 0.6454819 0.6670592 0.5664755 1.004777
## X2011.01.07 X2011.01.14 X2011.01.21 X2011.01.28 X2011.02.04 X2011.02.11
## [1,] 1.030291 1.037841 0.8966259 0.9597115 1.048336 1.056747
## X2011.02.18 X2011.02.25 X2011.03.04 X2011.03.11 X2011.03.18 X2011.03.25
## [1,] 1.038516 0.9160916 1.069369 1.072603 1.096936 1.186869
## X2011.04.01 X2011.04.08 X2011.04.15 X2011.04.22 X2011.04.29 X2011.05.06
## [1,] 1.046361 1.116147 1.066403 0.8112394 0.9520887 1.076583
## X2011.05.13 X2011.05.20 X2011.05.27 X2011.06.03 X2011.06.10 X2011.06.17
## [1,] 1.129256 0.8796547 0.875 0.9744103 0.9674483 0.8645591
## X2011.06.24 X2011.07.01 X2011.07.08 X2011.07.15 X2011.07.22 X2011.07.29
## [1,] 0.7980379 0.7858423 0.8523684 0.8471669 0.7530059 0.7533436
## X2011.08.05 X2011.08.12 X2011.08.19 X2011.08.26 X2011.09.02 X2011.09.09
## [1,] 0.9900843 0.9205942 0.8749704 0.7391206 1.015985 0.9209632
## X2011.09.16 X2011.09.23 X2011.09.30 X2011.10.07 X2011.10.14 X2011.10.21
## [1,] 1.066779 0.8673592 0.9289371 1.04581 0.9805108 0.900259
## X2011.10.28 X2011.11.04 X2011.11.11 X2011.11.18 X2011.11.25 X2011.12.02
## [1,] 0.7940951 1.06813 0.8979752 0.9157069 0.6896366 0.8395216
## X2011.12.09 X2011.12.16 X2011.12.23 X2011.12.30 X2012.01.06 X2012.01.13
## [1,] 0.8277301 0.7343169 0.5675222 0.885992 1.098227 1.094852
## X2012.01.20 X2012.01.27 X2012.02.03 X2012.02.10 X2012.02.17 X2012.02.24
## [1,] 0.9430757 0.9464125 1.127028 1.142514 1.163962 0.964544
## X2012.03.02 X2012.03.09 X2012.03.16 X2012.03.23 X2012.03.30 X2012.04.06
## [1,] 1.179486 1.116429 1.146462 0.9877639 1.219847 1.101808
## X2012.04.13 X2012.04.20 X2012.04.27 X2012.05.04 X2012.05.11 X2012.05.18
## [1,] 1.068964 1.006389 0.982169 1.157151 1.042291 1.014675
## X2012.05.25 X2012.06.01 X2012.06.08 X2012.06.15 X2012.06.22 X2012.06.29
## [1,] 0.8834362 0.9826917 1.030526 0.9945156 0.8705539 0.9540289
## X2012.07.06 X2012.07.13 X2012.07.20 X2012.07.27 X2012.08.03 X2012.08.10
## [1,] 1.065088 0.9647313 0.9002376 0.9161771 1.050861 0.9825201
## X2012.08.17 X2012.08.24 X2012.08.31 X2012.09.07 X2012.09.14 X2012.09.21
## [1,] 0.9773254 0.8134265 0.9924001 1.063482 1.065404 1.01756
## X2012.09.28 X2012.10.05 X2012.10.12 X2012.10.19 X2012.10.26
## [1,] 0.9339692 1.110451 1.033205 0.9837559 0.91554
With the data normalized, the process of choosing the optimal number of clusters can begin. K-means, PAM and Clara will be evaluated; for each method the gap statistic plot will be shown and analyzed. The gap statistic compares the within-cluster dispersion for a given number of clusters k with its expectation under a reference null distribution; fviz_nbclust() by default recommends the smallest k whose gap lies within one standard error of the first local maximum.
fviz_nbclust(normalized_data, FUNcluster = cluster::clara, method = "gap_stat") + theme_classic() + ggtitle('Clara')
fviz_nbclust(normalized_data, FUNcluster = kmeans, method = "gap_stat") + theme_classic() + ggtitle('K means')
fviz_nbclust(normalized_data, FUNcluster = cluster::pam, method = "gap_stat") + theme_classic() + ggtitle('PAM')
From the gap statistic plots we can deduce that the optimal number of clusters is:
3 clusters for K-means;
4 clusters for Clara;
6 clusters for PAM.
The clusters will now be constructed using the eclust() function from the factoextra package.
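The construction code itself is not included in the source; what follows is a plausible sketch, assuming the object names km1, cl1 and pa1 used in the index computations further below and the cluster counts chosen above.
km1 <- eclust(normalized_data, FUNcluster = "kmeans", k = 3, graph = FALSE) # K-means, 3 clusters
cl1 <- eclust(normalized_data, FUNcluster = "clara", k = 4, graph = FALSE)  # Clara, 4 clusters
pa1 <- eclust(normalized_data, FUNcluster = "pam", k = 6, graph = FALSE)    # PAM, 6 clusters
Passing each object to fviz_cluster() would then draw the cluster maps whose explained variability is quoted next.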
As can be seen, all three methods manage to explain 98.8% of the variability.
The quality of the clusterings produced by the three methods can now be assessed with several validation measures from the fpc and factoextra packages.
The first step is to analyze the silhouettes of each method. The silhouette width of an observation i compares its average distance to the other members of its own cluster, a(i), with its average distance to the members of the nearest other cluster, b(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)). The score ranges from -1 to 1 and, in general, the average should be as high as possible for a high-quality clustering.
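The exact calls are not shown in the source; the summaries below match what factoextra's fviz_silhouette() prints for each clustering, so presumably something like:
fviz_silhouette(km1) # K-means silhouette plot and per-cluster summary
fviz_silhouette(cl1) # Clara
fviz_silhouette(pa1) # PAM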
## cluster size ave.sil.width
## 1 1 20 0.40
## 2 2 18 0.70
## 3 3 7 0.71
## cluster size ave.sil.width
## 1 1 10 0.59
## 2 2 7 0.60
## 3 3 16 0.61
## 4 4 12 0.51
## cluster size ave.sil.width
## 1 1 10 0.55
## 2 2 7 0.60
## 3 3 6 0.57
## 4 4 9 0.35
## 5 5 10 0.38
## 6 6 3 0.61
The average silhouette width for K-means is 0.57, very close to Clara's 0.58. PAM comes in at 0.49, with a visible outlier in the 4th cluster sporting a negative score. Judging by the silhouettes, K-means and Clara seem to be the most adequate methods for this scenario.
This conclusion will now be cross-checked using the Calinski-Harabasz and Dunn indices.
# Euclidean distance matrix between stores, computed on the normalized data
dist_Mtx <- dist(as.matrix(normalized_data), method = 'euclidean')
# Calinski-Harabasz index for each clustering
calinski_harabasz_index <- cluster.stats(dist_Mtx, km1$cluster)$ch
calinski_harabasz_index2 <- cluster.stats(dist_Mtx, cl1$cluster)$ch
calinski_harabasz_index3 <- cluster.stats(dist_Mtx, pa1$cluster)$ch
# Dunn index for each clustering
dunn_idx1 <- cluster.stats(dist_Mtx, km1$cluster)$dunn
dunn_idx2 <- cluster.stats(dist_Mtx, cl1$cluster)$dunn
dunn_idx3 <- cluster.stats(dist_Mtx, pa1$cluster)$dunn
cluster_comparison <- data.frame(Method = c('K-means', 'Clara', 'PAM'))
cluster_comparison['C-H_index'] <- c(calinski_harabasz_index, calinski_harabasz_index2, calinski_harabasz_index3)
cluster_comparison['Dunn_index'] <- c(dunn_idx1, dunn_idx2, dunn_idx3)
print(cluster_comparison)
## Method C-H_index Dunn_index
## 1 K-means 139.8957 0.2052408
## 2 Clara 211.4271 0.1821876
## 3 PAM 194.9478 0.1529446
As we can see, the indices give contradictory results. K-means fares the worst on the Calinski-Harabasz index, yet scores the best on the Dunn index. This tells us something about the cluster characteristics. The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance, while the Calinski-Harabasz index compares the between-cluster variance to the within-cluster variance. The combination of the highest Dunn score and the lowest Calinski-Harabasz score for K-means may therefore mean that its clusters are not compact and are internally heterogeneous, but nonetheless remain well separated from one another.
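To make the Dunn ratio explicit, here is a minimal hand-rolled version (a sketch, not from the source; with the default settings of cluster.stats() it should reproduce the values above):
dunn_manual <- function(dmat, labels) {
  d <- as.matrix(dmat)
  groups <- unique(labels)
  # maximum intra-cluster distance: the largest cluster diameter
  diam <- max(sapply(groups, function(g) {
    i <- which(labels == g)
    if (length(i) < 2) 0 else max(d[i, i])
  }))
  # minimum inter-cluster distance: closest pair from different clusters
  sep <- min(sapply(groups, function(g) min(d[labels == g, labels != g])))
  sep / diam
}
dunn_manual(dist_Mtx, km1$cluster) # should match dunn_idx1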
Based on these observations, the Clara result seems the most sensible basis for deciding which stores belong in one group. It provides the best average cluster silhouette width and the best Calinski-Harabasz index, along with a decent Dunn index, which could probably be improved further by reassigning stores 21 and 45 from the 4th cluster to the 3rd one (as deduced from visual analysis of the cluster plot).