Specific Data Analytics Question
Analyzing the Supermarket data to find the best performing branches, Generate Association rules from the market basket and performing Customer segmentation.
Metrics for Success
Building a Customer Segmentation model that groups Customers accurately into non-overlapping sub-groups distinct from each other.
Understanding the Context
Businesses are often indulged in the race to increase their customers and make revenue. Without it, a company cannot earn profit and stay viable in the long run. Retail Store XYZ has been in operation since 2008. It’s until 2016 that the store’s revenue started decreasing by 10%. This research seeks to populate a Customer Segmentation Model that can help the firm market and sell more effectively.
Customer segmentation is the process by which customers are divided based on Demographics or Behavior. The business impact of doing this is more important as it increases Customer lifetime value and drives greater customer loyalty.
While traditional mass marketing techniques work and are still effective, it is inefficient and costly compared to targeted advertisements where one can correctly identify their customers. In simple terms, if customers can be accurately clustered, customized advertisements and offers to increase engagements can be created.
Specific objectives
Determine the effect spending on Advertisement, Promotion and Administration have on the profit of the supermarket.
Identify the state with the best performing branches.
Generate the revenue from the branches.
Determine which advertisement attracted customers the most.
Find the most selling product in the supermarket.
Determine the customers demographic information.
Perform Clustering.
Experimental Design
1.Data Preparation
Loading the dataset
Data Uniformity
Checking for Missing/Duplicate Values
Checking Outliers
2.Exploratory Data Analysis
Univariate Analysis
Bivariate Analysis
3.Clustering
4.Conclusions and Recommendations
Data Relevance
Supermarket Branches Dataset:
Advertisement Spending ~ Total advertisement Spending across all the 3 branches
Promotion Spending ~ Total promotion Spending across all the 3 branches
Administration Spending ~ Total administration Spending across all the 3 branches
State ~ New York, California and Florida
Profit ~ Total amount of profit generated.
Supermarket Customers Dataset:
Customer ID ~ Unique customer ID number
Gender ~ Gender of the customer (Male or Female)
Age ~ Age of the Customer in Years
Annual Income ~ Income of the customers in thousands (K$)
Spending Score ~ Spending score of the customer from a scale of 1-100
Data Preparation
Loading Libraries
#Loading dependencies:
library(readxl)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
library(grid)
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(crosstable)
##
## Attaching package: 'crosstable'
## The following object is masked from 'package:purrr':
##
## compact
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:mlr':
##
## train
## The following object is masked from 'package:purrr':
##
## lift
library(ggcorrplot)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(superml)
## Loading required package: R6
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("ggdendro")
library(flashClust)
##
## Attaching package: 'flashClust'
## The following object is masked from 'package:stats':
##
## hclust
library(NbClust)
library(cluster)
library(purrr)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(moments)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
Loading the Dataset
#Importing the data to the Global Environment:
data <- read_excel("C:/Users/Elijah/Desktop/50_SupermarketBranches.xlsx")
#Printing the first 4 rows of the dataframe
head(data, n=4)
## # A tibble: 4 x 5
## `Advertisement Spend` `Promotion Spend` `Administration Spend` State Profit
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 165349. 136898. 471784. New York 1.92e5
## 2 162598. 151378. 443899. Califor~ 1.92e5
## 3 153442. 101146. 407935. Florida 1.91e5
## 4 144372. 118672. 383200. New York 1.83e5
#Checking the data structure:
str(data)
## tibble [50 x 5] (S3: tbl_df/tbl/data.frame)
## $ Advertisement Spend : num [1:50] 165349 162598 153442 144372 142107 ...
## $ Promotion Spend : num [1:50] 136898 151378 101146 118672 91392 ...
## $ Administration Spend: num [1:50] 471784 443899 407935 383200 366168 ...
## $ State : chr [1:50] "New York" "California" "Florida" "New York" ...
## $ Profit : num [1:50] 192262 191792 191050 182902 166188 ...
The data has 50 observations and 5 variables. 4 of the variables are numeric and 1 is character
#checking for distinct values in the data:
unique(data)
## # A tibble: 50 x 5
## `Advertisement Spend` `Promotion Spend` `Administration Spend` State Profit
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 165349. 136898. 471784. New Yo~ 1.92e5
## 2 162598. 151378. 443899. Califo~ 1.92e5
## 3 153442. 101146. 407935. Florida 1.91e5
## 4 144372. 118672. 383200. New Yo~ 1.83e5
## 5 142107. 91392. 366168. Florida 1.66e5
## 6 131877. 99815. 362861. New Yo~ 1.57e5
## 7 134615. 147199. 127717. Califo~ 1.56e5
## 8 130298. 145530. 323877. Florida 1.56e5
## 9 120543. 148719. 311613. New Yo~ 1.52e5
## 10 123335. 108679. 304982. Califo~ 1.50e5
## # ... with 40 more rows
Checking missing values
#checking for null values
sum(is.na(data))
## [1] 0
Checking for duplicates
#checking duplicates
anyDuplicated((data))
## [1] 0
The data has no missing or duplicate values
Checking for outliers
#Renaming the column names for ease of reference:
names(data) <- c('Advertisement.Spend', 'Promotion.Spend', 'Administration.Spend', 'State', 'Profit')
#subset of numeric data
data_numeric<-data[,c(1:3,5)]
#Using a Box plot to check for outliers on the numerical variables:
qplot( x = Advertisement.Spend, y = "", geom = "boxplot", data = data_numeric, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Advertisment Spending")
qplot( x = Promotion.Spend, y = "", geom = "boxplot", data = data_numeric, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Promotion Spending")
qplot( x = Administration.Spend, y = "", geom = "boxplot", data = data_numeric ,col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Administration Spending")
The dataset has no outlier
Exploratory Data Analysis
Univariate Analysis
#Getting the summary statistics:
summary(data_numeric)
## Advertisement.Spend Promotion.Spend Administration.Spend Profit
## Min. : 0 Min. : 51283 Min. : 0 Min. : 14681
## 1st Qu.: 39936 1st Qu.:103731 1st Qu.:129300 1st Qu.: 90139
## Median : 73051 Median :122700 Median :212716 Median :107978
## Mean : 73722 Mean :121345 Mean :211025 Mean :112013
## 3rd Qu.:101603 3rd Qu.:144842 3rd Qu.:299469 3rd Qu.:139766
## Max. :165349 Max. :182646 Max. :471784 Max. :192262
The leading average cost is the administration followed by promotion and finally the advertisement.the mean profit generated was 112013
#Finding the standard deviation and variance:
stats <- data.frame(
sd=apply(data_numeric, 2 ,sd),
var=apply(data_numeric,2,var),
kurtosis= apply(data_numeric,2,kurtosis),
skewness = apply(data_numeric,2,skewness)
)
stats
## sd var kurtosis skewness
## Advertisement.Spend 45902.26 2107017150 2.194932 0.15904052
## Promotion.Spend 28017.80 784997271 3.085538 -0.47423007
## Administration.Spend 122290.31 14954920097 2.275967 -0.04506632
## Profit 40306.18 1624588173 2.824704 0.02258638
The variables have high spread from the mean because they have very large standard deviation.
The variables are not highly skewed as they have skewness between 1 and -1.
The variables are mesokurtic as they have kurtosis not greater than 3 thus showing that the data has no outliers
#The number of branches per state:
state<- table(data$State)
#visualizing using a bar plot:
barplot(state,col="cyan3", main = "Distribution of States")
California and New York have 17 branches and Florida has 16
Bivariate Analysis
#Relationship between advert spending and profit:
ggplot(data,aes(x=Advertisement.Spend,y=Profit))+
geom_point() + labs(title = "Advertising Spending and Profit")
Advertisement and Profit have a strong positive correlation because the more the advertising the more the profits.
#Relationship between promotion and profit:
ggplot(data,aes(x=Promotion.Spend,y=Profit))+
geom_point() + labs(title = "Promotion Spending and Profit")
Profit and Promotion have a weak positive correlation meaning that Promotion spending has no great effect on the profit
#Relationship between administration nd profit:
ggplot(data,aes(x=Administration.Spend,y=Profit))+
geom_point() + labs(title = "Administration Spending and Profit")
Administration cost has a strong positive correlation with the profit meaning better remuneration of the employees increases the profit
#Comparing profits with the various state:
profit_per_state = data%>% group_by(State) %>%
summarise(total_Profits = sum(Profit),
,
.groups = 'drop')
head(profit_per_state)
## # A tibble: 3 x 2
## State total_Profits
## <chr> <dbl>
## 1 California 1766388.
## 2 Florida 1900384.
## 3 New York 1933860.
ggplot(data ) +
geom_point(mapping = aes(x = 1:nrow(data), y = Profit, color = State)) + labs(title = " Profit across the Different States")
The states with the best performing branches based on the profits are New York followed by Florida and then California
#Comparing Advertisement.Spend with the various state:
Advertisement.Spend_per_state = data%>% group_by(State) %>%
summarise(total_Advertisement.Spend = sum(Advertisement.Spend),
,
.groups = 'drop')
head(Advertisement.Spend_per_state)
## # A tibble: 3 x 2
## State total_Advertisement.Spend
## <chr> <dbl>
## 1 California 1099180.
## 2 Florida 1291584.
## 3 New York 1295316.
ggplot(data ) +
geom_point(mapping = aes(x = 1:nrow(data), y = Advertisement.Spend, shape = State)) + labs(title = "Advertisment Cost across States")
The states with the highest advertisement cost is New York,followed by Florida and the California. This could explain why New York has the leading profit since the profit and advertisement costs are highly correlated.
#Comparing Promotion.Spend with the various state:
Promotion.Spend_per_state = data%>% group_by(State) %>%
summarise(total_Promotion.Spend = sum(Promotion.Spend),
,
.groups = 'drop')
head(Promotion.Spend_per_state)
## # A tibble: 3 x 2
## State total_Promotion.Spend
## <chr> <dbl>
## 1 California 2052691.
## 2 Florida 1948302.
## 3 New York 2066239
ggplot(data ) +
geom_point(mapping = aes(x = 1:nrow(data), y = Promotion.Spend, shape = State)) + labs(title = "Promotion Cost across the States ")
The leading state with the promotion cost is the New York followed by California and then Florida .
New York has leading profit because of doing lots of promotion and California spends lots on promotion yet the correlation between the promotion and profit is weak thus the reason it has the least profits.
#comparing Administration.Spend with the various state:
Administration.Spend_per_state = data%>% group_by(State) %>%
summarise(total_Administration.Spend= sum(Administration.Spend),
,
.groups = 'drop')
head(Administration.Spend_per_state)
## # A tibble: 3 x 2
## State total_Administration.Spend
## <chr> <dbl>
## 1 California 3103196.
## 2 Florida 3957177.
## 3 New York 3490882.
ggplot(data ) +
geom_point(mapping = aes(x = 1:nrow(data), y = Administration.Spend, color = State)) + labs(title = " Administration Cost across States")
The leading state with high cost of administration is Florida ,followed by New York and finally California.
Florida has better profits since the administration cost has a positive correlation with the profits
#creating a new column containing the sum of all the expenses:
data$expenses<- rowSums(data[,c(1:3)])
head(data)
## # A tibble: 6 x 6
## Advertisement.Spend Promotion.Spend Administration.Spend State Profit expenses
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 165349. 136898. 471784. New ~ 1.92e5 774031.
## 2 162598. 151378. 443899. Cali~ 1.92e5 757874.
## 3 153442. 101146. 407935. Flor~ 1.91e5 662522.
## 4 144372. 118672. 383200. New ~ 1.83e5 646244.
## 5 142107. 91392. 366168. Flor~ 1.66e5 599668.
## 6 131877. 99815. 362861. New ~ 1.57e5 594553.
#plotting a scatter plot for the expenses and profit:
x<- data$expenses
y<-data$Profit
plot(x, y, main = "profit vs expensess",
xlab = "expenses", ylab = "profit")
abline(lm(y ~ x, data = data), col = "blue")
The profit and expenses have a positive correlation meaning for the profits to increase you ought to incur an extra cost
#Getting the state with the most expenses:
expenses_per_state = data%>% group_by(State) %>%
summarise(total_expenses= sum(expenses),
,
.groups = 'drop')
head(expenses_per_state)
## # A tibble: 3 x 2
## State total_expenses
## <chr> <dbl>
## 1 California 6255067.
## 2 Florida 7197063.
## 3 New York 6852437.
ggplot(data ) +
geom_point(mapping = aes(x = 1:nrow(data), y = expenses, color = State)) + labs(title = "Expenses across the States")
The leading state with high cost is Florida followed by New York and California.
This explains why California is the least in the profits meaning it does not incur lots of cost to generate profits but also a cause of concern since New York is performing better than Florida yet it does not include lot of of expenses
#The total revenue of a company equals to expenses plus profit thus we create a new column of revenue:
data$total.revenue<-rowSums(data[,c(5:6)])
#finding state with the most revenue
total.revenue_per_state = data%>% group_by(State) %>%
summarise(total_total.revenue= sum(total.revenue),
,
.groups = 'drop')
total.revenue_per_state
## # A tibble: 3 x 2
## State total_total.revenue
## <chr> <dbl>
## 1 California 8021455.
## 2 Florida 9097448.
## 3 New York 8786297.
#visualizing the revenue per stat
pct = round((total.revenue_per_state$total_total.revenue/sum(total.revenue_per_state$total_total.revenue))*100,1)
# Plot the chart.
pie(total.revenue_per_state$total_total.revenue,labels = pct,main='Revenue Collection across the States', col = rainbow((nrow(total.revenue_per_state))))
#adding legend to our pie chart
legend("topright", c("California","Florida","New York"),cex = 0.8,fill = rainbow(nrow(total.revenue_per_state)))
The leading states with the revenue is Florida followed by New York and then California.this means that based on Revenue generation the Florida branches are performing well but New York has the leading profits because it tends to focus its cost to the cost that has a high correlation with profits unlike Florida that uses it revenue on costs that has weak correlation with the profits
#Encoding the state column
#install.packages("superml")
library("superml")
lbl = LabelEncoder$new()
data$State = lbl$fit_transform(data$State)
head(data)
## # A tibble: 6 x 7
## Advertisement.Spend Promotion.Spend Administration.Spend State Profit expenses
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 165349. 136898. 471784. 0 1.92e5 774031.
## 2 162598. 151378. 443899. 1 1.92e5 757874.
## 3 153442. 101146. 407935. 2 1.91e5 662522.
## 4 144372. 118672. 383200. 0 1.83e5 646244.
## 5 142107. 91392. 366168. 2 1.66e5 599668.
## 6 131877. 99815. 362861. 0 1.57e5 594553.
## # ... with 1 more variable: total.revenue <dbl>
#Finding the correlation matrix:
cormat<- cor(data)
#Get upper triangle of the correlation matrix:
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
### Melt
library(reshape2)
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
melted_cormat <- melt(upper_tri, na.rm = TRUE)
### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+ (labs(title = "Correlation Matrix \n"))
coord_fixed()
## <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: FALSE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## ratio: 1
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
Summary:
The state with the highest number of branches is California and New York
New York is the best performing State in terms of Profit
Florida reported the highest amount of expenses
Loading the Data
#Importing the data to the Global Environment:
tr <- read.transactions("C:/Users/Elijah/Desktop/groceries - groceries.csv", sep=',')
#Exploring the data using the Summary function:
summary(tr)
## transactions as itemMatrix in sparse format with
## 9836 rows (elements/itemsets/transactions) and
## 231 columns (items) and a density of 0.0234297
##
## most frequent items:
## whole milk 1 other vegetables rolls/buns
## 2513 2159 1903 1809
## soda (Other)
## 1715 43136
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 18 19 20 21 22 23 24 25 27 28 29 30 33
## 29 14 14 9 11 4 6 1 1 1 1 3 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 5.412 7.000 33.000
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 11
There are 9,836 transactions and 231 products
Determining the frequency of items appearing in the Market Basket
#Create an item frequency plot for the top 10 items:
if (!require("RColorBrewer")) {
# install color package of R
install.packages("RColorBrewer")
#include library RColorBrewer
library(RColorBrewer)
}
## Loading required package: RColorBrewer
itemFrequencyPlot(tr,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
# absolute type parameter gives us numeric frequencies of items independently
#Relative Item Frequency Plot
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Pastel2'),main="Relative Item Frequency Plot")
#relative type parameter the number of times an item has appeared as compared to others
Relative type parameter gives the number of times an item has appeared as compared to others .
Generating rules using Apriori algorithm
# Min Support as 0.001, confidence as 0.8. which are the default values and max of 10 items
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[231 item(s), 9836 transaction(s)] done [0.01s].
## sorting and recoding items ... [177 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(association.rules)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 47 259 145 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.263 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001118 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001242 Mean :0.8663 Mean :0.001444 Mean : 3.933
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.307
## Max. :0.003152 Max. :1.0000 Max. :0.003558 Max. :11.236
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :11.00
## Mean :12.22
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## tr 9836 0.001 0.8
## call
## apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))
The total number of rules are 463
Distribution of rule length: rule with length of 4 has the highest number of rules while that with length of 6 has the lowest number of rules.
Inspecting the rules
# printing the first 8 rules
inspect(association.rules[1:8])
## lhs rhs support confidence
## [1] {17, tropical fruit} => {whole milk} 0.001118341 1.0000000
## [2] {cereals, curd} => {whole milk} 0.001016673 0.9090909
## [3] {cereals, yogurt} => {whole milk} 0.001728345 0.8095238
## [4] {butter, jam} => {whole milk} 0.001016673 0.8333333
## [5] {liquor, red/blush wine} => {bottled beer} 0.001931680 0.9047619
## [6] {bottled beer, soups} => {whole milk} 0.001118341 0.9166667
## [7] {7, specialty cheese} => {whole milk} 0.001016673 0.9090909
## [8] {16, whipped/sour cream} => {whole milk} 0.001220008 0.8571429
## coverage lift count
## [1] 0.001118341 3.914047 11
## [2] 0.001118341 3.558225 10
## [3] 0.002135014 3.168514 17
## [4] 0.001220008 3.261706 10
## [5] 0.002135014 11.236412 19
## [6] 0.001220008 3.587876 11
## [7] 0.001118341 3.558225 10
## [8] 0.001423343 3.354897 12
100% of customers of who bought (17,tropical fruit) also bought whole milk
91% of those who bought cereals and curd bought whole milk
85% of those who bought {16, whipped/sour cream} also bought whole milk
#To find out which items are likely to be purchased with whole milk:
rules<-apriori(data=tr, parameter=list(supp=0.001,conf = 0.8),
appearance = list(default="lhs",rhs="whole milk"),
control = list(verbose=F))
rules<-sort(rules, decreasing=TRUE,by="confidence")
inspect(rules[1:6])
## lhs rhs support confidence coverage lift count
## [1] {17,
## tropical fruit} => {whole milk} 0.001118341 1 0.001118341 3.914047 11
## [2] {rice,
## sugar} => {whole milk} 0.001220008 1 0.001220008 3.914047 12
## [3] {canned fish,
## hygiene articles} => {whole milk} 0.001118341 1 0.001118341 3.914047 11
## [4] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016673 1 0.001016673 3.914047 10
## [5] {14,
## curd,
## other vegetables} => {whole milk} 0.001016673 1 0.001016673 3.914047 10
## [6] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728345 1 0.001728345 3.914047 17
# Filter rules with confidence greater than 0.6 or 60%
subRules<-association.rules[quality(association.rules)$confidence>0.6]
#Plot SubRules
plot(subRules,jitter=0)
# rules with high lift have low surpport
Rules with high lift have low support while rules with low lift have high support
Plotting using 2-key- plot
#Using confidence as y and support as x
plot(subRules,method="two-key plot",jitter= 0)
# order is the number of items in a rule
rule with 6 items has low support which means a rule with more itemshas low support than a rule with few items
Graph Based visualizations
top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget")
Loading the Dataset
#Importing the data to the Global Environment:
Demographics <- read_excel("C:/Users/Elijah/Desktop/Customer Demographics.xlsx")
#Printing the first 4 rows of the dataframe
head(Demographics, n=4)
## # A tibble: 4 x 5
## CustomerID Genre Age `Annual Income (k$)` `Spending Score (1-100)`
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
#Checking the Number of Rows and Columns:
dim(Demographics)
## [1] 200 5
Data Uniformity
#Getting Information on the data types on each respective column:
sapply(Demographics, class)
## CustomerID Genre Age
## "numeric" "character" "numeric"
## Annual Income (k$) Spending Score (1-100)
## "numeric" "numeric"
#Renaming the column names for ease of reference:
names(Demographics) <- c('CustomerID', 'Gender', 'Age', 'AnnualIncome', 'SpendingScore')
#Viewing the full information:
str(Demographics)
## tibble [200 x 5] (S3: tbl_df/tbl/data.frame)
## $ CustomerID : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr [1:200] "Male" "Male" "Female" "Female" ...
## $ Age : num [1:200] 19 21 20 23 31 22 35 23 64 30 ...
## $ AnnualIncome : num [1:200] 15 15 16 16 17 17 18 18 19 19 ...
## $ SpendingScore: num [1:200] 39 81 6 77 40 76 6 94 3 72 ...
The Dataset is made up of 200 Rows and 5 Columns
The datatypes in the data are as follows:
Continuous: Customer ID, Age, Annual Income and Spending Score
Character: Gender
Missing Values
#Checking for null entries in each column:
colSums(is.na(Demographics))
## CustomerID Gender Age AnnualIncome SpendingScore
## 0 0 0 0 0
Duplicate values
#Checking for identical entries:
sum(duplicated(Demographics))
## [1] 0
The data has no missing or duplicate values.
Checking for Outliers
#Using a Box plot to check for outliers on the numerical variables:
qplot( x = Age, y = "", geom = "boxplot", data = Demographics, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Age ")
qplot( x = AnnualIncome, y = "", geom = "boxplot", data = Demographics, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Annual Income (K$)")
qplot( x = SpendingScore, y = "", geom = "boxplot", data = Demographics ,col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Spending Score (1-100)")
An outlier can be observed on Annual Income. Since there’s no basis to assume the entry is not valid, the outlier is not dropped.
Exploratory Data Analysis
This process involves investigating the dataset to discover patterns.
Univariate Analysis
This analysis aims to explore each demographic variable in the dataset separately
#To view the distribution of Gender:
Gen_table <- table(Demographics$Gender)
#Plotting the Information above:
x <- c(56, 44)
labels <- c('Females', 'Males')
colors <- c('cyan3','cyan4')
#pie_percent<- round(100*x/sum(x), 0)
pie(x, labels = percent(x/100), main=' Gender Distribution', density=30, col=colors)
legend("topright", c("Females", "Males"), cex = 0.9, fill = colors)
Of the 200 sampled customers, 112 were Females while 88 were Males.
#To view the Age distribution of the customers:
Age_table <- table(Demographics$Age)
#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = Age)) +
geom_histogram(fill = "cyan3", color = "black", binwidth = 2) + labs(x = "Age (Year's)", title = "Age Distribution")
The Age of the customers range from 18-70 years.
Those frequently visiting the supermarket are of the young and adult age groups (27-45)
Less frequent visitors are those aged 55 and above.
The distribution is close to a normal distribution.
#To view the Income distribution of the customers:
Income_table <- table(Demographics$AnnualIncome)
#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = AnnualIncome)) +
geom_histogram(fill = "cyan3", color = "black", binwidth = 10) + labs(x = "Income (K$)", title = "Income Levels")
The Income levels of the customers range from $15k - 137k.
People earning an average income of 70k$ have the highest frequency count.
The average annual income of the customers is 60.56k$.
#To view the Spending Score of the customers:
Spending_table <- table(Demographics$SpendingScore)
#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = SpendingScore)) +
geom_histogram(fill = "cyan3", color = "black", binwidth = 10) + labs(x = "Spending Score (1-100)", title = "Spending Scores")
Spending Score is a score given to a customer by the Supermarket authorities based on the money spent and the behavior of the customer.
This is an Important Chart as it gives an idea about the Spending rate of the Customers Visiting the Supermarket.
From the plot, most of the customers have a score in the range of 40 - 60.
There are customers also having a score of 99 showing that the Supermarket caters for the variety of customers with varying needs and requirements.
#Printing the Descriptive Summary:
summary(Demographics)
## CustomerID Gender Age AnnualIncome
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## SpendingScore
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
Bivariate Analysis
This analysis involves two variables being observed against each other.
#Creating a dataframe Cr:
Cr<- Demographics
#Visualizing the Plot:
corr_map <- ggcorr(Cr[,3:5], method=c("everything", "pearson"), label=TRUE, hjust = .90, size = 3, layout.exp = 2) + (labs(title = "Correlation Matrix \n"))
corr_map
Age is Negatively correlated with Spending Score.
No correlation exists between Annual Income and Age, Spending Score and Annual Income
The analysis seeks to further answer:
#To see the relationship between Income and Spending scores across the different genders:
ggplot(Demographics, aes(x=AnnualIncome, y = SpendingScore )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship Between Income and Spending Score')
Roughly, Annual income of $40–70k corresponds to a 40–60 spending score.
There seem to be a cluster-like pattern when looking at income and spending score: High Income Individuals with High Income spending, High Income Individuals with Low Spending Score.
There doesn’t seem to be any difference in spending score when comparing the gender of a customer.
#To see the relationship between Age and Spending scores across the different genders:
ggplot(Demographics, aes(x=Age, y = SpendingScore )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship between Age and Spending Score')
The lower the age higher the spending score.
Customers in the age range of 15-40 years old make up most the the customers with a spending score of above 60
Customers above 40 years old does not seem to have a spending score of above 60. This can be justified with the fact that they make a small number of the entire sample.
#To see the relationship between Age and Annual Income across the different genders:
ggplot(Demographics, aes(x=Age, y = AnnualIncome )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship between Age and Annual Income')
Income Levels are high among customers aged 30-50 years.
K-Means Clustering
This algorithm aims to partition observations into clusters based on Feature similarity.
#Creating the modelling dataframe:
Model <- select(Demographics , 'Age', 'Age', 'AnnualIncome', 'SpendingScore')
#Viewing the Data types:
sapply(Model, class)
## Age AnnualIncome SpendingScore
## "numeric" "numeric" "numeric"
Feature Engineering
#Transforming the ranges to the same scale of 0 to 1 using Min Max:
#Model <- as.data.frame(sapply(Model, function(x) (x-min(x))/(max(x)-min(x))))
Model Training
Determining the Best K using the 3 popular methods:
#Using the Elbow method to find the optimal k:
set.seed(123)
#Function to calculate total intra-cluster sum of square:
iss <- function(k) {
kmeans(Model,k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total intra-clusters sum of squares")
From the above graph, we conclude that 4 is the appropriate number of clusters since it seems to be appearing at the bend in the elbow plot.
2.Gap statistic
# Finding the optimal number of clusters using the Gap Statistic method:
set.seed(123)
stat_gap <- clusGap(Model, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
3.Silhouette method
#Finding the optimal number of clusters using the Silhouette method:
fviz_nbclust(Model, kmeans, method = "silhouette")
From the Methods above, the best K is 6
# Plotting the K_Mean Clusters with the Optimal K:
New_Model = kmeans(Model,6,iter.max=100,nstart=50,algorithm="Lloyd")
New_Model
## K-means clustering with 6 clusters of sizes 45, 21, 35, 39, 38, 22
##
## Cluster means:
## Age AnnualIncome SpendingScore
## 1 56.15556 53.37778 49.08889
## 2 44.14286 25.14286 19.52381
## 3 41.68571 88.22857 17.28571
## 4 32.69231 86.53846 82.12821
## 5 27.00000 56.65789 49.13158
## 6 25.27273 25.72727 79.36364
##
## Clustering vector:
## [1] 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2
## [38] 6 2 6 1 6 1 5 2 6 1 5 5 5 1 5 5 1 1 1 1 1 5 1 1 5 1 1 1 5 1 1 5 5 1 1 1 1
## [75] 1 5 1 5 5 1 1 5 1 1 5 1 1 5 5 1 1 5 1 5 5 5 1 5 1 5 5 1 1 5 1 5 1 1 1 1 1
## [112] 5 5 5 5 5 1 1 1 1 5 5 5 4 5 4 3 4 3 4 3 4 5 4 3 4 3 4 3 4 3 4 5 4 3 4 3 4
## [149] 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3
## [186] 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4
##
## Within cluster sum of squares by cluster:
## [1] 8062.133 7732.381 16690.857 13972.359 7742.895 4099.818
## (between_SS / total_SS = 81.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#Looking at the Important components using PCA:
pcclust=prcomp(Model,scale=FALSE)
summary(pcclust)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 26.4625 26.1597 12.9317
## Proportion of Variance 0.4512 0.4410 0.1078
## Cumulative Proportion 0.4512 0.8922 1.0000
pcclust$rotation[,1:2]
## PC1 PC2
## Age 0.1889742 -0.1309652
## AnnualIncome -0.5886410 -0.8083757
## SpendingScore -0.7859965 0.5739136
The first two components explain 89% of the Total variation in the data set.
Interpreting the Clusters
To get a more comprehensible understanding of the predicted clusters:
#To view the clustering:
set.seed(1)
ggplot(Model, aes(x =AnnualIncome, y = SpendingScore)) +
geom_point(stat = "identity", aes(color = as.factor(New_Model$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Supermarket Customers", subtitle = "Annual Income and Spending Score")
Cluster 2:
This Cluster represents customers with low annual income and low spending score.
This is quite reasonable as people having low salaries prefer to buy less.
The Supermarket would be least interested in people belonging to this cluster.
Cluster 3:
This Cluster represents customers with high annual income and low spending score
This might be those unsatisfied with the products/services being offered. These can be the prime targets as they have the potential to spend money. The Supermarket can look into adding new facilities that can attract these people and meet their needs
Cluster 1 and 5:
This clusters represents customers with average annual income and average spending scores.
These customers are careful with their spending scale as their income levels are not excessive.
Cluster 4:
This cluster represents customers with high annual income and high spending score.
This is the ideal case as this group forms the prime sources of profit.These are the regular customers who are satisfied with the products/services
Cluster 6:
This cluster represents customers with low annual income but high spending score.
These are those people who for some reason love to buy products more often even though they have a low income.
The supermarket might not target these people that effectively but still will not want to lose them.
Recommendations and Conclusion
From Customer Segmentation:
For those with high spending scores: These customers may potentially be impulsive buyers (cluster 6 and 4) that purchase based on what looks fresh or special discounts. Management could consider having daily/weekly sale items or limited-time new items for sale to keep these customers excited and coming back.
For those with lower spending scores: These customers may potentially be discount/value buyers (clusters 2 and 3) that purchase based on prices and discounts. As these customers may easily move their shopping experience to another supermarket if our prices are more expensive, management could consider creating value on top of competitive prices. Value can be created through excellent customer service, point reward programs, or have an essential item always priced lower than competitors.
General:
Find a way to reduce expenses in the branches that are making losses to maximize profits
Decrease revenue spent on advertising for products with less clicks
Arrange the shelves in a way that products that have high association go together or side by side
Also have promotions that includes the products with high association
As you have seen, the data they have been collecting have brought a wide range of insights. It’s not as useless as the xyz supermarket had suggested in the beginning.