Source Dataset: https://www.kaggle.com/binovi/wholesale-customers-data-set
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(FactoMineR)
library(factoextra)
library(scales)
library(GGally)
wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Description:
FRESH: annual spending (m.u.) on fresh products (Continuous).
MILK: annual spending (m.u.) on milk products (Continuous).
GROCERY: annual spending (m.u.) on grocery products (Continuous).
FROZEN: annual spending (m.u.) on frozen products (Continuous).
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous).
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous).
CHANNEL: 1 = Horeca (Hotel/Restaurant/Cafe) or 2 = Retail channel (Nominal).
REGION: 1 = Lisbon / 2 = Porto / 3 = other region (Nominal).
Check the first six rows
head(wholesale)
Check Missing Values
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
There are no missing values in our dataset.
Change data types
wholesale <- wholesale %>%
mutate(Channel = as.factor(Channel),
Region = as.factor(Region))
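If we want more readable labels, the codes can optionally be mapped to the names from the description above. This is just a convenience sketch; wholesale_labeled is a new object and is not used in the rest of the analysis.
# optional: map the coded factor levels to descriptive labels
wholesale_labeled <- wholesale %>%
  mutate(Channel = factor(Channel, levels = c("1", "2"), labels = c("Horeca", "Retail")),
         Region  = factor(Region,  levels = c("1", "2", "3"), labels = c("Lisbon", "Porto", "Other")))
head(wholesale_labeled)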
Check Summary
summary(wholesale)
## Channel Region Fresh Milk Grocery
## 1:298 1: 77 Min. : 3 Min. : 55 Min. : 3
## 2:142 2: 47 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153
## 3:316 Median : 8504 Median : 3627 Median : 4756
## Mean : 12000 Mean : 5796 Mean : 7951
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656
## Max. :112151 Max. :73498 Max. :92780
## Frozen Detergents_Paper Delicassen
## Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :60869.0 Max. :40827.0 Max. :47943.0
Data Visualization
ggplot(gather(wholesale %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10,fill="firebrick") +
facet_wrap(~key, scales = 'free_x',nrow=3) +
theme_bw()
Based on the summary and the visualization above, the maximum value of each column is far higher than its mean and median, so the data contain outliers. We will identify them later using PCA.
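As a quick numerical check of that claim, we can count how many observations exceed Q3 + 1.5*IQR in each spending column. This is a minimal sketch; the 1.5*IQR rule is a common outlier heuristic and is not part of the original analysis.
# count observations above Q3 + 1.5*IQR in each numeric column (common outlier heuristic)
wholesale %>%
  select_if(is.numeric) %>%
  summarise_all(~ sum(. > quantile(., 0.75) + 1.5 * IQR(.)))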
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
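Since PCA is sensitive to scaling, it is worth a quick cross-check with base R before using FactoMineR. This is a minimal sketch using prcomp() with scale. = TRUE; its squared standard deviations should closely match the eigenvalues reported by PCA() below.
# base-R cross-check: eigenvalues of the correlation matrix via prcomp()
pca_check <- prcomp(wholesale %>% select_if(is.numeric), scale. = TRUE)
pca_check$sdev^2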
quanti <- wholesale %>%
select_if(is.numeric) %>%
colnames()
quantivar <- which(colnames(wholesale) %in% quanti)
quali <- wholesale %>%
select_if(is.factor) %>%
colnames()
qualivar <- which(colnames(wholesale) %in% quali)
Run PCA with the PCA() function from the FactoMineR package
wholesale_pca <- PCA(wholesale,
scale.unit = T, #scaling
quali.sup = qualivar, #data categorical
graph = F, # not showing graph
ncp = 5)
wholesale_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.64497357 44.082893 44.08289
## comp 2 1.70258397 28.376400 72.45929
## comp 3 0.74006477 12.334413 84.79371
## comp 4 0.56373023 9.395504 94.18921
## comp 5 0.28567634 4.761272 98.95048
## comp 6 0.06297111 1.049519 100.00000
Let’s visualize the percentage of variance captured by each dimension.
fviz_eig(wholesale_pca, ncp = 6,
         addlabels = T, main = "Variance explained by each dimension")
Check Summary
summary(wholesale_pca)
##
## Call:
## PCA(X = wholesale, scale.unit = T, ncp = 5, quali.sup = qualivar,
## graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.645 1.703 0.740 0.564 0.286 0.063
## % of var. 44.083 28.376 12.334 9.396 4.761 1.050
## Cumulative % of var. 44.083 72.459 84.794 94.189 98.950 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.795 | 0.193 0.003 0.059 | -0.305 0.012 0.147 |
## 2 | 0.753 | 0.434 0.016 0.333 | -0.328 0.014 0.190 |
## 3 | 2.332 | 0.811 0.057 0.121 | 0.815 0.089 0.122 |
## 4 | 1.133 | -0.779 0.052 0.472 | 0.653 0.057 0.332 |
## 5 | 1.577 | 0.166 0.002 0.011 | 1.271 0.216 0.650 |
## 6 | 0.736 | -0.156 0.002 0.045 | -0.295 0.012 0.161 |
## 7 | 0.738 | -0.335 0.010 0.206 | -0.525 0.037 0.506 |
## 8 | 0.623 | 0.141 0.002 0.051 | -0.231 0.007 0.137 |
## 9 | 0.884 | -0.517 0.023 0.343 | -0.659 0.058 0.557 |
## 10 | 1.782 | 1.592 0.218 0.799 | -0.741 0.073 0.173 |
## Dim.3 ctr cos2
## 1 0.141 0.006 0.031 |
## 2 -0.319 0.031 0.179 |
## 3 -1.523 0.713 0.427 |
## 4 -0.163 0.008 0.021 |
## 5 -0.066 0.001 0.002 |
## 6 -0.148 0.007 0.040 |
## 7 0.303 0.028 0.169 |
## 8 -0.390 0.047 0.392 |
## 9 -0.183 0.010 0.043 |
## 10 -0.210 0.014 0.014 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Fresh | 0.070 0.184 0.005 | 0.689 27.871 0.475 | 0.699 65.976
## Milk | 0.887 29.715 0.786 | 0.109 0.692 0.012 | -0.052 0.365
## Grocery | 0.942 33.554 0.887 | -0.191 2.134 0.036 | 0.093 1.175
## Frozen | 0.083 0.262 0.007 | 0.798 37.366 0.636 | -0.153 3.182
## Detergents_Paper | 0.892 30.101 0.796 | -0.333 6.514 0.111 | 0.117 1.855
## Delicassen | 0.404 6.184 0.164 | 0.658 25.422 0.433 | -0.451 27.448
## cos2
## Fresh 0.488 |
## Milk 0.003 |
## Grocery 0.009 |
## Frozen 0.024 |
## Detergents_Paper 0.014 |
## Delicassen 0.203 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## Channel_1 | 0.711 | -0.655 0.850 -12.227 | 0.274 0.149 6.383
## Channel_2 | 1.491 | 1.375 0.850 12.227 | -0.576 0.149 -6.383
## Region_1 | 0.128 | -0.102 0.632 -0.604 | -0.060 0.218 -0.442
## Region_2 | 0.375 | 0.089 0.056 0.395 | -0.102 0.073 -0.564
## Region_3 | 0.067 | 0.012 0.030 0.238 | 0.030 0.198 0.760
## Dim.3 cos2 v.test
## Channel_1 | 0.004 0.000 0.144 |
## Channel_2 | -0.009 0.000 -0.144 |
## Region_1 | -0.034 0.070 -0.380 |
## Region_2 | -0.060 0.026 -0.509 |
## Region_3 | 0.017 0.067 0.670 |
Through PCA, I can retain a few informative PCs (those with high cumulative variance) to perform dimensionality reduction, shrinking the number of variables while retaining as much information as possible.
Based on the plot and summary above, I would like to retain at least 80% of the variance, so I am going to keep Dim 1-3 (84.79% cumulative), as extracted in the sketch below.
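The individuals' coordinates on those dimensions can be pulled straight from the PCA object. A minimal sketch; wholesale_pc3 is a new object name used only here.
# keep the individuals' coordinates on the first three dimensions (~84.8% of the variance)
wholesale_pc3 <- as.data.frame(wholesale_pca$ind$coord[, 1:3])
head(wholesale_pc3)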
We can spot the outliers using the individual plot below.
plot.PCA(wholesale_pca,
choix = "ind",
invisible = "quali",
select = "contrib 5",
habillage = 1)
Through the individual plot of PCA, Dim 1 alone covers 44.08% of the variance of the data.
We also find the 5 most extreme individuals (colored by Channel); they can be recovered programmatically as well, see the sketch after this list:
2 from Hotel/Restaurant/Cafe: 184, 326.
3 from Retail: 48, 62, 86.
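The same observations can be recovered from the contributions stored in the PCA object. A minimal sketch; ranking by the summed contribution to Dim 1 and Dim 2 approximates what select = "contrib 5" highlights on the plotted plane.
# rank individuals by their contribution to the first two dimensions
contrib12 <- rowSums(wholesale_pca$ind$contrib[, 1:2])
head(sort(contrib12, decreasing = TRUE), 5)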
To represent more than two components, the variables are positioned inside the correlation circle. A variable close to the circle (near the edge) is well reconstructed from the first two components; a variable close to the center of the plot is less important for those two components.
plot.PCA(wholesale_pca,
choix = "var")
Insight:
PC1 is mostly driven by two variables: Grocery and Detergents_Paper.
PC2 is mostly driven by two variables: Frozen and Fresh.
fviz_contrib(X = wholesale_pca,
choice = "var",
axes = 1)
Grocery, Detergents_Paper, and Milk are the variables that contribute most to dimension 1.
fviz_contrib(X = wholesale_pca,
choice = "var",
axes = 2)
Frozen, Fresh, and Delicassen are the variables that contribute most to dimension 2.
Choose the numeric columns
wholesale_num <- wholesale %>% select_if(is.numeric)
Scale dataset
wholesale_num_scale <- scale(wholesale_num)
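A quick sanity check (a minimal sketch using base R only): after scaling, every column should have mean roughly 0 and standard deviation 1.
# each scaled column should have mean ~0 and sd ~1
round(colMeans(wholesale_num_scale), 3)
apply(wholesale_num_scale, 2, sd)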
Finding the optimal K using the Elbow Method
Choosing the number of clusters with the elbow method is somewhat subjective. The rule of thumb is to pick the number of clusters at the "bend of the elbow", where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
fviz_nbclust(wholesale_num_scale, kmeans, method = "wss", k.max = 30)+ labs(subtitle = "Elbow method") + theme_bw()
Using the elbow method, 5 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. The elbow alone may not be conclusive, though, because the bend is vague; a silhouette check is sketched below.
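As a second opinion, the average silhouette width can be computed with the same helper. This is a minimal sketch; it may point to a different K than the elbow plot, in which case judgment (and the business context) decides.
# alternative check: average silhouette width per number of clusters
fviz_nbclust(wholesale_num_scale, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette method") +
  theme_bw()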
K-Means Clustering
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# k-means clustering
wholesale_kmeans <- kmeans(wholesale_num_scale, centers = 5)
fviz_cluster(wholesale_kmeans, data=wholesale_num_scale)
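A few quick diagnostics on the fitted object (a minimal sketch; size, betweenss, and totss are standard fields returned by kmeans()):
# cluster sizes and the share of variance captured by the clustering
wholesale_kmeans$size
wholesale_kmeans$betweenss / wholesale_kmeans$totss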
Cluster Profiling
# attach the cluster labels to the dataset
wholesale$cluster <- wholesale_kmeans$cluster
wholesale1 <- wholesale[,3:9]
# profile the clusters by their mean spending
wholesale1 %>%
group_by(cluster) %>%
summarise_all(mean)
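To make the profiles easier to read, we can also ask, for each spending column, which cluster has the highest and the lowest mean. A minimal sketch; cluster_means is a new object, and because group_by() sorts the clusters 1-5, the row index returned by which.max()/which.min() is the cluster number.
# for each spending column, find the cluster with the highest / lowest average spend
cluster_means <- wholesale1 %>%
  group_by(cluster) %>%
  summarise_all(mean)
sapply(cluster_means[-1], which.max)  # cluster with the highest mean per column
sapply(cluster_means[-1], which.min)  # cluster with the lowest mean per column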
Based on the result above, below are the profiles of each cluster:
Cluster 1: highest at Fresh and lowest at Delicassen.
Cluster 2: highest at Fresh and Frozen but lowest at Detergents_Paper.
Cluster 3: highest at Fresh and lowest at Detergents_Paper.
Cluster 4: highest at Grocery and lowest at Frozen.
Cluster 5: highest at Grocery and lowest at Delicassen.
After exploring our dataset using PCA and K-Means clustering, we are able to conclude that: