This markdown document comprises the following. The dataset is specially selected to showcase different technical aspects of a project and their business value.
The primary objectives are to demonstrate the following:
Explore, understand & cleanse data with explained reasoning (Exploratory Data Analysis)
Identify, implement and evaluate the best possible statistical model, supported by evidence from the data
Systematic documentation with reproducible R code that others can understand and improve
Summarise observed findings with concise business value and insights through an executive summary
Understand the full data project cycle needed to create a commercial data product, with the ability to execute the statistics in R
For this project, the use case is to segment wholesale customers based on the items they purchase. The dataset used is the public-domain Wholesale Customers dataset.
This dataset was chosen because customer segmentation is one of the most common clustering topics. Clustering features with unsupervised learning aims to discover groups with high intra-class similarity and high inter-class dissimilarity. The approach is easily reproducible in contexts beyond customer segmentation, for example automobile insurance, credit card users, telco customer churn and e-commerce shoppers.
As the modelling is based on historical data under certain assumptions, this project does not take into account external factors such as the mode of purchase (credit card or cash), direct delivery versus self-collection, or the time of the purchase transaction.
library(stats)
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(corrplot)
## corrplot 0.84 loaded
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
library (cluster)
df.original<-read.csv('Wholesale customers data.csv')
set.seed(1) # Ensure reproducible results
df <- sample_frac(df.original, 0.7) # split into train and test data in a 7:3 ratio
df.index<- as.numeric(rownames(df))
df.test<- df.original[-df.index,]
head(df)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 1 2 13360 944 11593 915 1679 573
## 2 2 3 4822 6721 9170 993 4973 3637
## 3 1 3 140 8847 3823 142 1062 3
## 4 1 3 5065 5499 11055 364 3485 1063
## 5 2 2 6758 4560 9965 934 4538 1037
## 6 1 1 15218 258 1138 2516 333 204
str(df)
## 'data.frame': 308 obs. of 8 variables:
## $ Channel : int 1 2 1 1 2 1 1 2 2 1 ...
## $ Region : int 2 3 3 3 2 1 3 2 3 3 ...
## $ Fresh : int 13360 4822 140 5065 6758 15218 3009 6468 11867 27901 ...
## $ Milk : int 944 6721 8847 5499 4560 258 521 12867 3327 3749 ...
## $ Grocery : int 11593 9170 3823 11055 9965 1138 854 21570 4814 6964 ...
## $ Frozen : int 915 993 142 364 934 2516 3470 1840 1178 4479 ...
## $ Detergents_Paper: int 1679 4973 1062 3485 4538 333 949 7558 3837 603 ...
## $ Delicassen : int 573 3637 3 1063 1037 204 727 1543 120 2503 ...
We read through the data types in the data frame (df) to confirm that the structure does not need any changes.
summary(df)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3124 1st Qu.: 1490
## Median :1.000 Median :3.000 Median : 8873 Median : 3364
## Mean :1.315 Mean :2.529 Mean :11950 Mean : 5764
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:16851 3rd Qu.: 7175
## Max. :2.000 Max. :3.000 Max. :76237 Max. :54259
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 137 Min. : 25.0 Min. : 3 Min. : 3.0
## 1st Qu.: 2212 1st Qu.: 684.2 1st Qu.: 287 1st Qu.: 411.8
## Median : 4684 Median : 1498.0 Median : 811 Median : 923.5
## Mean : 7895 Mean : 3022.7 Mean : 2731 Mean : 1609.5
## 3rd Qu.:10732 3rd Qu.: 3398.0 3rd Qu.: 3851 3rd Qu.: 1858.8
## Max. :92780 Max. :60869.0 Max. :40827 Max. :47943.0
A quick look at the summary statistics of the variables. From here, we can conclude the following:
1. Only the product spend categories make sense in terms of summary statistics; Channel and Region are categorical codes.
2. There is no need to scale across the product categories, as they share the same unit of measurement and they are the variables that will be used for clustering.
summary(is.na(df))
## Channel Region Fresh Milk
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:308 FALSE:308 FALSE:308 FALSE:308
## Grocery Frozen Detergents_Paper Delicassen
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:308 FALSE:308 FALSE:308 FALSE:308
Check whether there are any missing values (NA) in each variable. The training sample contains 308 of the 440 original observations, and every variable shows FALSE for missing values, so we can safely conclude this dataset is clean. If there were NA values, the following are the usual approaches to tackle the problem (a minimal sketch of these options follows the list):
1. Exclude them in vector operations using na.rm = TRUE.
2. Use default values pre-defined from domain knowledge; usual suspects are 0, -1 or 9999999999.
3. Recode the missing value with the mean value of that variable.
4. Omit the records with missing values using na.omit(df), which keeps a subset of the dataset.
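A minimal sketch of these options (illustrative only: this dataset has no missing values, so none of the lines below change anything here):
sum(df$Fresh, na.rm = TRUE) # 1. exclude NAs within a vector operation
df$Fresh[is.na(df$Fresh)] <- 0 # 2. impute a pre-defined default value
df$Fresh[is.na(df$Fresh)] <- mean(df$Fresh, na.rm = TRUE) # 3. impute the variable mean
df.complete <- na.omit(df) # 4. drop records containing any NA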
table(df$Channel)
##
## 1 2
## 211 97
Next, we observe the frequencies of the categorical variables. In the training sample, 211 observations purchased via the Horeca channel (Hotel/Restaurant/Café) and 97 via Retail.
table(df$Region)
##
## 1 2 3
## 56 33 219
For Region, there are 56 observations from Lisbon, 33 from Oporto and 219 from other regions.
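For reference, the channel and region shares in the training sample can be computed directly (a small sketch, not part of the original analysis):
round(prop.table(table(df$Channel)), 2) # share of Horeca vs Retail customers
round(prop.table(table(df$Region)), 2) # share of Lisbon, Oporto and other regions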
df %>%
group_by(Channel,Region) %>% # multiple group columns
summarise(total_fresh = sum(df$Fresh), total_Milk = sum(df$Milk), total_Grocery= sum(df$Grocery),total_Frozen=sum(df$Frozen),total_Detergents_Paper=sum(df$Detergents_Paper), total_Delicassen= sum(df$Delicassen)) # multiple summary columns
## `summarise()` regrouping output by 'Channel' (override with `.groups` argument)
## # A tibble: 6 x 8
## # Groups: Channel [2]
## Channel Region total_fresh total_Milk total_Grocery total_Frozen
## <int> <int> <int> <int> <int> <int>
## 1 1 1 3680600 1775418 2431602 930988
## 2 1 2 3680600 1775418 2431602 930988
## 3 1 3 3680600 1775418 2431602 930988
## 4 2 1 3680600 1775418 2431602 930988
## 5 2 2 3680600 1775418 2431602 930988
## 6 2 3 3680600 1775418 2431602 930988
## # ... with 2 more variables: total_Detergents_Paper <int>,
## # total_Delicassen <int>
The table above is intended to show the total spend per product category within each Channel/Region group; Fresh and Grocery are the top sellers overall. Note, however, that summarise() was given df$Fresh, df$Milk, and so on, which sum over the entire data frame, so every group row repeats the same overall totals (a corrected sketch follows).
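A minimal corrected sketch (not run in the original analysis): using the bare column names inside summarise() makes the sums group-wise rather than data-frame-wide:
df %>%
  group_by(Channel, Region) %>%
  summarise(total_Fresh = sum(Fresh),
            total_Milk = sum(Milk),
            total_Grocery = sum(Grocery),
            total_Frozen = sum(Frozen),
            total_Detergents_Paper = sum(Detergents_Paper),
            total_Delicassen = sum(Delicassen),
            .groups = "drop")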
temp <- reshape(df, direction="long", varying=c("Fresh","Milk","Grocery","Frozen","Detergents_Paper", "Delicassen"), v.names= "Total_price", timevar="Category", time=c("Fresh", "Milk","Grocery","Frozen","Detergents_Paper", "Delicassen"))
ggplot(temp, aes(x = Category, y = Total_price)) + geom_boxplot() + stat_boxplot(geom = 'errorbar') + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Product Category Distribution")
The plot above shows the following findings:
1. There are quite a few outliers in these categories. Two approaches are possible: (a) remove the outliers and run the k-means algorithm, or (b) keep the outliers and run k-medoids, i.e. the Partitioning Around Medoids (PAM) algorithm with Manhattan distance (a sketch of this alternative follows below).
2. There is higher variance in spend for Fresh and Grocery.
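A minimal sketch of approach (b), PAM with Manhattan distance on the scaled product categories. This is illustrative only: k = 2 is assumed for demonstration and the PAM model is not used further in this analysis:
library(cluster)
pam.fit <- pam(scale(df[, -c(1:2)]), k = 2, metric = "manhattan") # k-medoids, robust to outliers
pam.fit$medoids # the representative (medoid) observation of each cluster
table(pam.fit$clustering) # cluster sizes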
Next, we examine the pairwise relationships and correlations between the product categories to get a more holistic view of the transactions.
cor.result<- cor(df)
pairs(df[,-c(1:2)], col=df$Channel, pch=19, lower.panel = NULL, main = "Category by Channel")
plot(df[,-c(1:2)], col=df$Region, pch=19, lower.panel = NULL, main = "Category by Region")
corrplot(cor.result, method="ellipse", title = "Correlation by category", mar = c(0, 0, 1, 0))
The plots above show the following finding:
1. Detergents_Paper and Grocery have the highest correlation.
Next, we check which variables have outliers.
# Exclude the Region & Channel columns, as outlier counts are not meaningful for categorical codes
apply(X= df[,-c(1:2)],MARGIN=2, FUN = function(x)length(boxplot.stats(x)$out))
## Fresh Milk Grocery Frozen
## 14 21 16 35
## Detergents_Paper Delicassen
## 19 21
The numbers above are the counts of outliers per category. To handle the outliers, a Winsorizing technique will be used: outliers are replaced by a specific percentile of the data, normally the 90th or 95th percentile. First, we sort the outlying values for each category.
sort(boxplot.stats(df$Grocery)$out)
## [1] 23596 23998 24773 25957 26870 28540 28921 28986 30243 32034 33586 36486
## [13] 39694 55571 59598 92780
Next, examine the 90th, 95th, … percentile
quantile(df$Grocery, probs=seq(from =0.9, to=1,by=0.025))
## 90% 92.5% 95% 97.5% 100%
## 17938.10 20288.95 23140.30 29394.53 92780.00
From the above, the 95th percentile is selected based on the increments between successive percentiles. Next, values above the 95th percentile are capped at that value.
grocery.max <- as.numeric(quantile(df$Grocery,probs=0.95))
df$Grocery[df$Grocery > grocery.max] <- grocery.max
The same approach is applied to the Detergents_Paper category.
sort(boxplot.stats(df$Detergents_Paper)$out)
## [1] 9265 9529 9836 10069 11577 11783 12034 12408 12638 13308 13583 13726
## [13] 14841 18594 18906 19410 24171 26701 40827
quantile(df$Detergents_Paper, probs=seq(from =0.9, to=1,by=0.025))
## 90% 92.5% 95% 97.5% 100%
## 7243.00 8680.05 9987.45 13629.48 40827.00
# For Detergents_Paper, the 92.5th percentile is used as the cap
detergents.max <- as.numeric(quantile(df$Detergents_Paper, probs=0.925))
df$Detergents_Paper[df$Detergents_Paper > detergents.max] <- detergents.max
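The two capping steps could also be wrapped in a small reusable helper. This is a sketch only; the function name winsorize_upper is illustrative and not part of the original script:
# Cap values above a chosen upper percentile (upper-tail Winsorizing)
winsorize_upper <- function(x, probs = 0.95) {
  cap <- as.numeric(quantile(x, probs = probs, na.rm = TRUE))
  x[x > cap] <- cap
  x
}
# Equivalent to the two capping steps above:
# df$Grocery <- winsorize_upper(df$Grocery, probs = 0.95)
# df$Detergents_Paper <- winsorize_upper(df$Detergents_Paper, probs = 0.925)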
For this project, we will select the above two variables (Grocery and Detergents_Paper) for simplicity.
ggplot(data=df, aes(x=Grocery, y =Detergents_Paper)) + geom_point(shape=1) +geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
Create a subset on which to run k-means.
df.subset1<-as.data.frame(df[,c("Grocery","Detergents_Paper")])
summary(df.subset1)
## Grocery Detergents_Paper
## Min. : 137 Min. : 3
## 1st Qu.: 2212 1st Qu.: 287
## Median : 4684 Median : 811
## Mean : 7176 Mean :2280
## 3rd Qu.:10732 3rd Qu.:3851
## Max. :23140 Max. :8680
For the subset above, normalisation (scaling) will be applied, as the two variables span noticeably different ranges.
df.subset1<- as.data.frame(scale(df.subset1))
summary(df.subset1)
## Grocery Detergents_Paper
## Min. :-1.0898 Min. :-0.8228
## 1st Qu.:-0.7685 1st Qu.:-0.7201
## Median :-0.3859 Median :-0.5308
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5506 3rd Qu.: 0.5675
## Max. : 2.4715 Max. : 2.3124
After scaling is done, the following approaches will be run to find the optimal number of clusters:
1. Elbow method
2. Average silhouette method
3. Gap statistic method
For illustration purposes, all three tests will be run.
set.seed(102)
# Elbow method
fviz_nbclust(df.subset1, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df.subset1, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.subset1, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
Two clusters are selected, as indicated by two of the three tests above. Next, we visualise the clusters, with a triangle marking each centroid.
## K-means Clustering
set.seed(111)
kmean2.simple <- kmeans(df.subset1,centers=2, iter.max = 25, nstart=100)
df.subset1$cluster <- factor(kmean2.simple$cluster)
summary(df.subset1)
## Grocery Detergents_Paper cluster
## Min. :-1.0898 Min. :-0.8228 1: 71
## 1st Qu.:-0.7685 1st Qu.:-0.7201 2:237
## Median :-0.3859 Median :-0.5308
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5506 3rd Qu.: 0.5675
## Max. : 2.4715 Max. : 2.3124
ggplot(data=df.subset1, aes(x=Detergents_Paper, y=Grocery, colour=cluster))+geom_point()+geom_point(data=as.data.frame(kmean2.simple$centers), color ="black", size=4, shape =17)
## Evaluation & Interpretation
The plot above illustrates the two clusters produced by the k-means algorithm on the two variables, Grocery and Detergents_Paper. Next, we examine the summary of the k-means result object.
Clustering validation is done next. There are three broad approaches:
1. Internal cluster validation, which uses only the internal information of the clustering (for example silhouette widths) to evaluate the goodness of the structure.
2. External cluster validation, which compares the clustering results against externally provided class labels.
3. Relative cluster validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g. varying the number of clusters k). It is generally used for determining the optimal number of clusters, via the Elbow, Silhouette and Gap statistic methods.
## Internal Cluster Validation
D<- daisy(df.subset1)
plot(silhouette(kmean2.simple$cluster, D),col=1:2, border = NA)
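The average silhouette widths shown in the plot can also be read off numerically (a small check using the same dissimilarity object D):
sil <- silhouette(kmean2.simple$cluster, D)
summary(sil)$clus.avg.widths # average silhouette width per cluster
summary(sil)$avg.width # overall average silhouette width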
The plot above shows the following average silhouette widths:
1. Cluster 1: 0.77 (good, as it is close to 1)
2. Cluster 2: 0.88 (good, as it is close to 1)
The internal cluster quality evaluation is therefore good.
## External Cluster Validation
set.seed(111)
library("fpc")
# Compute cluster stats: no true external labels exist, so the k-means labels
# themselves are supplied as the "alternative" partition (agreement is perfect by construction)
species <- as.numeric(kmean2.simple$cluster)
clust_stats <- cluster.stats(d = dist(df), species, kmean2.simple$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 1
# Meila’s variation index VI
clust_stats$vi
## [1] 0
For illustration's sake, the result above shows external cluster validation using the corrected (adjusted) Rand index and Meila's variation of information (VI) index. Note that no true external class labels are available, so the k-means labels are compared against themselves; the perfect scores reported above (Rand = 1, VI = 0) would not occur with genuine external labels.
The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement).
Corrected Rand index: 1 (closer to 1 is better). VI: 0 (lower is better).
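To illustrate how these indices behave with a genuinely external partition, the Channel variable (1 = Horeca, 2 = Retail) could be used as a stand-in label and compared against the k-means assignment. This is a hypothetical check, not part of the original analysis, and the agreement would be imperfect:
ext <- cluster.stats(d = dist(df.subset1[, c("Grocery", "Detergents_Paper")]),
                     kmean2.simple$cluster, df$Channel)
ext$corrected.rand # adjusted Rand index versus the Channel labels
ext$vi # variation of information versus the Channel labels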
set.seed(111)
kmean2.simple
## K-means clustering with 2 clusters of sizes 71, 237
##
## Cluster means:
## Grocery Detergents_Paper
## 1 1.5282872 1.5779677
## 2 -0.4578413 -0.4727245
##
## Clustering vector:
## [1] 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2
## [38] 2 2 1 1 2 2 1 2 1 2 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2
## [75] 1 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 2 1 2 2 2 2 2 1 2 2 1
## [112] 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [149] 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2 2 1
## [186] 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1
## [260] 2 1 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2
## [297] 2 2 2 2 2 1 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 74.39923 94.33843
## (between_SS / total_SS = 72.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## checking betweenss, i.e. the between-cluster sum of squares
kmean2.simple$betweenss
## [1] 445.2623
Summary findings on the clusters:
1. The clusters contain 71 and 237 observations respectively, a split of roughly 23% to 77%.
2. Cluster 1's average spend is higher than cluster 2's, based on the centroid means.
3. Within-cluster sum of squares (intra-cluster cohesion): cluster 1: 74.40, cluster 2: 94.34.
4. Goodness of the k-means classification (between_SS / total_SS): 72.5%, a reasonable fit.
5. Between-cluster sum of squares: 445.26.
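Finally, the held-out df.test created at the start is not used above. The sketch below shows one way it could be scored: cap with the training maxima, scale with the training parameters and assign each test observation to the nearest trained centroid. This is an illustrative, assumption-laden sketch, not part of the original analysis:
train.raw <- df[, c("Grocery", "Detergents_Paper")] # capped, unscaled training values
test.raw <- df.test[, c("Grocery", "Detergents_Paper")]
test.raw$Grocery <- pmin(test.raw$Grocery, max(train.raw$Grocery)) # apply the training caps
test.raw$Detergents_Paper <- pmin(test.raw$Detergents_Paper, max(train.raw$Detergents_Paper))
test.scaled <- scale(test.raw, center = colMeans(train.raw), scale = apply(train.raw, 2, sd))
d.mat <- as.matrix(dist(rbind(kmean2.simple$centers, test.scaled))) # distances to the 2 centroids
test.cluster <- max.col(-d.mat[-(1:2), 1:2]) # index of the nearest centroid per test row
table(test.cluster) # test observations assigned to each cluster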