This markdown document is one of four data science (DS) projects. Each open dataset is specially selected to showcase different technical aspects of a project and its business value.
The primary objectives are to demonstrate the following knowledge:
Explore, understand & cleanse data with explained reasoning (Exploratory Data Analysis)
Identify, implement and evaluate the best possible statistical model, supported by evidence from the data
Systematic documentation with reproducible R code for others to understand & improve
Summarise observed findings into concise business value & insights through an executive summary
Truly understand the data project cycle needed to create a commercial data product, with the ability to execute statistical analysis through R
The following areas are not covered in this project:
Data product via dashboard: visualising the processed output data through visualisation tools such as Qlik or Tableau - though am trained in this ;)
Deep explanation of the business domain knowledge behind the dataset
External dataset(s) to enrich the data or add another dimension to tackle the business question
Working style or collaboration methods with other data team members (ranging from data analyst, business analyst, data scientist, data engineer, visualisation engineer and business stakeholder to project manager)
Other statistical models which might perform better or serve as alternatives
All data sources used are taken from the public domain and the projects are implemented in RStudio. For the full R Markdown version, kindly contact me via teochunwey@gmail.com
For this project, the use case is to find segment groups of wholesale customers based on the items purchased. The dataset used is the Wholesale Customers dataset from the public domain.
This dataset is chosen because customer segmentation is one of the most common clustering topics. Clustering observations on their variables/features with unsupervised learning aims for high intra-cluster similarity and high inter-cluster dissimilarity. The approach is easily reproducible in contexts beyond customer segmentation, such as automobile insurance, credit card users, telco customer churn and e-commerce shoppers.
As the modelling is based on historical data under certain assumptions, this project does not take into account other external factors such as mode of payment (credit card/cash), delivery mode (direct delivery or self-collection) and time of the purchase transaction.
Note: all spending values are in monetary units (m.u.)
Data Source: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
library(stats)
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(corrplot)
## corrplot 0.84 loaded
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(NbClust)
library(cluster)
df.original <- read.csv('Wholesale customers data.csv')
set.seed(1) # ensure the sampling below is reproducible
df <- sample_frac(df.original, 0.7) # split into train and test sets at a 7:3 ratio
df.index<- as.numeric(rownames(df))
df.test<- df.original[-df.index,]
head(df)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 117 1 3 11173 2521 3355 1517 310 222
## 164 2 3 5531 15726 26870 2367 13726 446
## 251 1 1 3191 1993 1799 1730 234 710
## 397 2 3 4515 11991 9345 2644 3378 2213
## 88 1 3 43265 5025 8117 6312 1579 14351
## 391 1 3 3352 1181 1328 5502 311 1000
Look at the sample records in the dataset to compare against the data dictionary.
There are 6 product categories with annual spending by the wholesale customers, grouped by Channel and Region.
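Per the UCI data dictionary, Channel 1 = Horeca and 2 = Retail, while Region 1 = Lisbon, 2 = Oporto and 3 = Other. A small optional sketch to attach these labels (df.labelled is a hypothetical name, not used in the rest of this analysis):
# Label the coded categorical variables per the data dictionary (illustration only)
df.labelled <- df
df.labelled$Channel <- factor(df$Channel, levels = 1:2, labels = c("Horeca", "Retail"))
df.labelled$Region <- factor(df$Region, levels = 1:3, labels = c("Lisbon", "Oporto", "Other"))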
str(df)
## 'data.frame': 308 obs. of 8 variables:
## $ Channel : int 1 2 1 2 1 1 1 1 1 1 ...
## $ Region : int 3 3 1 3 3 3 3 3 1 3 ...
## $ Fresh : int 11173 5531 3191 4515 43265 3352 8708 7149 2083 9898 ...
## $ Milk : int 2521 15726 1993 11991 5025 1181 3634 2247 5007 961 ...
## $ Grocery : int 3355 26870 1799 9345 8117 1328 6100 1242 1563 2861 ...
## $ Frozen : int 1517 2367 1730 2644 6312 5502 2349 1619 1120 3151 ...
## $ Detergents_Paper: int 310 13726 234 3378 1579 311 2123 1226 147 242 ...
## $ Delicassen : int 222 446 710 2213 14351 1000 5137 128 1550 833 ...
Read through the data types in the data frame (df) to confirm the structure does not need any changes.
summary(df)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 18 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3094 1st Qu.: 1610
## Median :1.000 Median :3.000 Median : 8130 Median : 3684
## Mean :1.331 Mean :2.513 Mean : 12059 Mean : 5716
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16851 3rd Qu.: 7119
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 33.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2156 1st Qu.: 805.8 1st Qu.: 282.0 1st Qu.: 395.0
## Median : 4904 Median : 1619.0 Median : 824.5 Median : 944.5
## Mean : 7729 Mean : 3089.2 Mean : 2764.6 Mean : 1558.2
## 3rd Qu.:10550 3rd Qu.: 3532.5 3rd Qu.: 4003.2 3rd Qu.: 1820.2
## Max. :59598 Max. :60869.0 Max. :26701.0 Max. :47943.0
A quick look at the summary statistics of the variables. From here, we can conclude that every spending category is heavily right-skewed: the maximum values sit far above the third quartiles, hinting at outliers to be handled later.
summary(is.na(df))
## Channel Region Fresh Milk
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:308 FALSE:308 FALSE:308 FALSE:308
## Grocery Frozen Detergents_Paper Delicassen
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:308 FALSE:308 FALSE:308 FALSE:308
Check if there are any missing values (NA) in each variable. There are 308 observations in this training set and all variables show FALSE for missing values, so we can safely conclude this dataset is clean. However, if there were NA values, the usual approaches are to remove the affected rows, or to impute the missing entries with a summary statistic (mean/median/mode) or a model-based estimate, as sketched below.
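A minimal sketch of the first two approaches, assuming a couple of NA values are injected into a copy of df purely for illustration:
# Illustration only: inject example NAs into a copy of the data
df.na <- df
df.na$Milk[c(1, 5)] <- NA
df.drop <- na.omit(df.na) # approach 1: drop rows containing NA
milk.med <- median(df.na$Milk, na.rm = TRUE) # approach 2: impute with the median
df.na$Milk[is.na(df.na$Milk)] <- milk.med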
table(df$Channel)
##
## 1 2
## 206 102
Next, the frequencies of the categorical variables are observed. In this training set, 206 observations purchased via the Horeca (Hotel/Restaurant/Café) channel and 102 via Retail.
table(df$Region)
##
## 1 2 3
## 57 36 215
On Region, there are 57 annual transactions from Lisbon, 36 from Oporto and 215 from other regions.
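The raw counts can also be expressed as shares, a quick sketch:
# Share of training observations per Region
round(prop.table(table(df$Region)), 2)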
df %>%
group_by(Channel, Region) %>% # multiple grouping columns
summarise(total_Fresh = sum(Fresh), total_Milk = sum(Milk), total_Grocery = sum(Grocery), total_Frozen = sum(Frozen), total_Detergents_Paper = sum(Detergents_Paper), total_Delicassen = sum(Delicassen)) # bare column names so the sums are computed within each group
## # A tibble: 6 x 8
## # Groups: Channel [?]
## (per-group totals of the six product categories for each Channel x Region combination)
The table above shows the total annual spend per product category for each Channel and Region combination. Overall, Fresh and Grocery are the top-selling categories.
temp <- reshape(df, direction = "long", varying = c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"), v.names = "Total_price", timevar = "Category", time = c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")) # reshape to long format: one row per customer x category
ggplot(temp, aes(x = Category, y = Total_price)) + geom_boxplot() + stat_boxplot(geom = 'errorbar') + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Product Category Distribution")
# if facets are needed instead -> + facet_wrap(~Category, ncol = 3)
The plot above shows that every product category has a heavily right-skewed distribution with a long tail of high-spend outliers.
Moving on, we examine the pairwise relationships and correlations between the product categories.
cor.result <- cor(df)
pairs(df[, -c(1:2)], col = df$Channel, pch = 19, lower.panel = NULL, main = "Category by Channel")
plot(df[, -c(1:2)], col = df$Region, pch = 19, lower.panel = NULL, main = "Category by Region")
corrplot(cor.result, method = "ellipse", title = "Correlation by category", mar = c(0, 0, 1, 0))
The plots above show a strong positive correlation between Grocery and Detergents_Paper (and, to a lesser extent, between Grocery and Milk), with Channel 2 (Retail) customers spending more in these categories.
Next, we check which variables have outliers.
# Exclude the Channel & Region columns as they are categorical, not spending values
apply(X = df[, -c(1:2)], MARGIN = 2, FUN = function(x) length(boxplot.stats(x)$out)) # count boxplot outliers per category
## Fresh Milk Grocery Frozen
## 14 20 17 29
## Detergents_Paper Delicassen
## 21 18
The figures above show the number of outliers per category. To handle them, the winsorizing technique will be used: each outlier is replaced by a specific percentile of the data, normally the 90th or 95th. First, we sort the outlier values for each category.
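As a reusable alternative to the manual steps below, a small helper could cap any vector at a chosen percentile (a hypothetical sketch, not part of the original analysis):
# One-sided winsorize: cap values above the given percentile
winsorize_upper <- function(x, prob = 0.95) {
  cap <- as.numeric(quantile(x, probs = prob))
  x[x > cap] <- cap
  x
}
# e.g. df$Grocery <- winsorize_upper(df$Grocery, 0.95)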
sort(boxplot.stats(df$Grocery)$out)
## [1] 23596 23998 24708 24773 25957 26839 26866 26870 28540 28921 28986
## [12] 32114 33586 34792 36486 39694 59598
Next, examine the percentiles from the 90th to the 100th, in 2.5% steps.
quantile(df$Grocery, probs=seq(from =0.9, to=1,by=0.025))
## 90% 92.5% 95% 97.5% 100%
## 19258.40 21198.98 23857.30 28663.83 59598.00
From the above, the 95th percentile is selected as the cap, since the values jump sharply between the 97.5th and 100th percentiles. Next, the 95th-percentile value replaces the outliers above it.
grocery.max <- as.numeric(quantile(df$Grocery,probs=0.95))
df$Grocery[df$Grocery > grocery.max] <- grocery.max
The same concept is applied to the Detergents_Paper category, this time capping at the 92.5th percentile.
sort(boxplot.stats(df$Detergents_Paper)$out)
## [1] 9606 9836 9959 10069 11577 11783 12034 12218 12420 12591 12638
## [12] 13308 13583 13726 14235 17120 17740 18594 19410 20070 26701
quantile(df$Detergents_Paper, probs=seq(from =0.9, to=1,by=0.025))
## 90% 92.5% 95% 97.5% 100%
## 7464.900 8926.725 11710.900 13629.475 26701.000
detergents.max <- as.numeric(quantile(df$Detergents_Paper, probs = 0.925)) # cap at the 92.5th percentile
df$Detergents_Paper[df$Detergents_Paper > detergents.max] <- detergents.max
For this project, only these two variables will be used, for simplicity.
ggplot(data=df, aes(x=Grocery, y =Detergents_Paper)) + geom_point(shape=1) +geom_smooth(method="lm")
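The strength of the linear relationship shown above can be quantified with a quick check:
# Pearson correlation between the two (winsorized) variables
cor(df$Grocery, df$Detergents_Paper)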
Create a subset on which to run k-means.
df.subset1<-as.data.frame(df[,c("Grocery","Detergents_Paper")])
summary(df.subset1)
## Grocery Detergents_Paper
## Min. : 3 Min. : 3.0
## 1st Qu.: 2156 1st Qu.: 282.0
## Median : 4904 Median : 824.5
## Mean : 7336 Mean :2398.7
## 3rd Qu.:10550 3rd Qu.:4003.2
## Max. :23857 Max. :8926.7
For the case above, normalisation (scaling) is required because the two variables are on different scales of magnitude and k-means relies on distances.
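For reference, scale() standardises each column to z-scores; a quick sketch of the equivalent manual computation:
# z-score by hand: subtract the mean, divide by the standard deviation
z <- (df.subset1$Grocery - mean(df.subset1$Grocery)) / sd(df.subset1$Grocery)
all.equal(as.numeric(scale(df.subset1$Grocery)), z) # should return TRUE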
df.subset1<- as.data.frame(scale(df.subset1))
summary(df.subset1)
## Grocery Detergents_Paper
## Min. :-1.0922 Min. :-0.8308
## 1st Qu.:-0.7715 1st Qu.:-0.7341
## Median :-0.3624 Median :-0.5459
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4786 3rd Qu.: 0.5565
## Max. : 2.4606 Max. : 2.2640
After scaling is done, the following approaches will be run to find the optimal number of clusters:
1. Elbow method
2. Average silhouette method
3. Gap statistic method
For illustration purposes, all 3 tests will be run.
set.seed(102)
# Elbow method
fviz_nbclust(df.subset1, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df.subset1, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.subset1, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
Two clusters are selected, as 2 out of the 3 tests above agree on k = 2. Next, we visualise the clusters, with a triangle marking each centroid.
set.seed(111)
kmean2.simple <- kmeans(df.subset1,centers=2, iter.max = 25, nstart=100)
df.subset1$cluster <- factor(kmean2.simple$cluster)
summary(df.subset1)
## Grocery Detergents_Paper cluster
## Min. :-1.0922 Min. :-0.8308 1: 74
## 1st Qu.:-0.7715 1st Qu.:-0.7341 2:234
## Median :-0.3624 Median :-0.5459
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4786 3rd Qu.: 0.5565
## Max. : 2.4606 Max. : 2.2640
ggplot(data=df.subset1, aes(x=Detergents_Paper, y=Grocery, colour=cluster))+geom_point()+geom_point(data=as.data.frame(kmean2.simple$centers), color ="black", size=4, shape =17)
The plot above illustrates the 2 clusters found by the k-means algorithm over the two variables, Grocery and Detergents_Paper. The summary of the k-means object itself is examined further below.
Clustering validation is done next.
External cluster validation consists of comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels. Since we know the "true" cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific data set. Common measures are:
- Adjusted Rand index: varies from -1 (no agreement) to 1 (perfect agreement)
- Meila's variation of information (VI) index
- Jaccard index
D <- daisy(df.subset1[, c("Grocery", "Detergents_Paper")]) # distances on the numeric features only, excluding the cluster column
plot(silhouette(kmean2.simple$cluster, D), col = 1:2, border = NA)
The silhouette plot above shows positive silhouette widths for nearly all observations in both clusters.
The internal cluster quality evaluation is therefore good, as shown in the plot.
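The average silhouette width, the headline number from the plot, can also be pulled out directly (a small sketch):
# Overall average silhouette width (closer to 1 is better)
sil <- silhouette(kmean2.simple$cluster, D)
summary(sil)$avg.width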
set.seed(111)
library("fpc")
# Compute cluster statistics; here the k-means labels are compared with themselves, for illustration
species <- as.numeric(kmean2.simple$cluster)
clust_stats <- cluster.stats(d = dist(df), species, kmean2.simple$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 1
# Meila’s variation index VI
clust_stats$vi
## [1] 0
For illustration's sake, the result above shows external cluster validation using the adjusted Rand index and Meila's variation of information (VI) index. Note that no genuine external class label is used, as we do not know the true labels beforehand; the clustering is compared with itself, so the findings are trivially perfect (impossible in the real world).
The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement).
Corrected Rand index: 1 (closer to 1 is better); VI: 0 (lower is better).
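As a sketch of what a more meaningful external validation could look like, the Channel variable can stand in as a pseudo class label (an assumption purely for illustration; Channel is not a true cluster label):
# Hypothetical external validation: agreement between Channel and the k-means clusters
ext_stats <- cluster.stats(d = dist(df.subset1[, c("Grocery", "Detergents_Paper")]), as.numeric(df$Channel), kmean2.simple$cluster)
ext_stats$corrected.rand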
set.seed(111)
kmean2.simple
## K-means clustering with 2 clusters of sizes 74, 234
##
## Cluster means:
## Grocery Detergents_Paper
## 1 1.5076718 1.5641070
## 2 -0.4767851 -0.4946321
##
## Clustering vector:
## 117 164 251 397 88 391 410 287 272 27 89 76 295 165 328 212 305 420
## 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## 161 426 393 430 273 53 112 422 6 158 359 140 198 246 202 429 336 271
## 1 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2
## 321 44 291 427 329 259 312 220 210 398 10 188 288 405 187 406 170 95
## 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1
## 28 39 122 199 253 156 347 416 174 126 245 97 179 286 32 325 377 310
## 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 1 1
## 128 123 175 326 315 142 282 380 157 256 144 440 270 72 252 43 87 51
## 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2
## 84 21 224 424 356 276 360 141 278 207 223 120 92 335 213 355 403 160
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
## 307 383 323 241 118 363 49 5 232 34 358 206 318 159 155 55 239 143
## 1 2 2 2 2 2 1 2 2 2 1 1 2 2 2 2 2 2
## 415 65 337 186 374 24 11 434 285 183 171 333 298 153 205 180 395 77
## 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2
## 216 134 52 219 31 354 369 162 387 131 322 294 151 22 79 60 80 250
## 1 2 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2
## 125 217 244 114 18 338 197 418 304 227 230 105 102 238 346 195 319 236
## 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
## 297 50 229 130 225 365 193 184 265 138 178 375 26 268 70 145 431 300
## 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2
## 260 189 313 417 367 64 353 384 132 292 281 166 222 23 366 296 311 409
## 2 1 1 1 2 1 2 2 2 2 2 1 1 2 2 2 2 2
## 146 389 425 407 58 37 71 111 200 110 56 394 419 181 73 386 78 433
## 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 1 2
## 81 327 343 113 352 91 62 116 211 29 82 42 350 248 247 332 376 177
## 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 1 2 2
## 192 203 201 349 46 435 68 388 392 57 399 411 124 432 12 341 107 228
## 2 2 1 2 1 2 1 2 2 1 2 2 2 2 2 1 1 2
## 308 19 240 264 279 372 261 30 208 342 371 194 400 59 8 283 231 309
## 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2
## 185 330 284 20 348 255 75 99 15 274 381 234 299 14 182 237 306 428
## 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 2 2 2
## 301 109
## 2 2
##
## Within cluster sum of squares by cluster:
## [1] 73.20304 81.10911
## (between_SS / total_SS = 74.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Check betweenss, i.e. the between-cluster sum of squares measuring cluster separation
kmean2.simple$betweenss
## [1] 459.6878
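Relating this to the printout above, the proportion of total variance captured by the clustering can be recomputed directly (a quick sketch):
# between_SS / total_SS, matching the ~74.9% shown in the k-means printout
kmean2.simple$betweenss / kmean2.simple$totss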
Summary findings on the clusters: k-means splits the customers into a small high-spend cluster (74 observations, with cluster means well above average on both Grocery and Detergents_Paper) and a large low-spend cluster (234 observations), with between_SS / total_SS = 74.9%, i.e. the two clusters capture roughly three quarters of the total variance in these two variables.
The next project will run k-means clustering on all product categories using the same dataset, followed by PAM clustering (with no outlier values changed). At the end, there will be a comparison of the cluster analyses using cluster.stats().
References (Clustering):
https://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/
https://stackoverflow.com/questions/32570693/make-silhouette-plot-legible-for-k-means
https://www.guru99.com/r-k-means-clustering.html
http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/
https://www.r-bloggers.com/k-means-clustering-in-r/
http://cowlet.org/2013/12/23/understanding-data-science-clustering-with-k-means-in-r.html
https://www.edureka.co/blog/k-means-clustering-algorithm/
http://www.sthda.com/english/wiki/print.php?id=241#clustering-validation-statistics
https://www.edureka.co/blog/clustering-on-bank-data-using-r/