Specific Data Analytics Question

Analyze the supermarket data to find the best-performing branches, generate association rules from the market basket, and perform customer segmentation.

Metrics for Success

Building a customer segmentation model that accurately groups customers into non-overlapping sub-groups that are distinct from each other.

Understanding the Context

Businesses are in a constant race to grow their customer base and increase revenue; without this, a company cannot earn a profit and stay viable in the long run. Retail Store XYZ has been in operation since 2008, but in 2016 the store’s revenue started decreasing by 10%. This research seeks to build a customer segmentation model that can help the firm market and sell more effectively.

Customer segmentation is the process of dividing customers into groups based on demographics or behavior. Its business impact is significant: it increases customer lifetime value and drives greater customer loyalty.

While traditional mass-marketing techniques still work, they are inefficient and costly compared to targeted advertising, where a business can correctly identify its customers. In simple terms, if customers can be accurately clustered, customized advertisements and offers can be created to increase engagement.

Specific objectives

  1. Determine the effect that spending on advertisement, promotion, and administration has on the profit of the supermarket.

  2. Identify the state with the best performing branches.

  3. Compute the revenue generated by the branches.

  4. Determine which advertisement attracted customers the most.

  5. Find the most selling product in the supermarket.

  6. Determine the customers' demographic information.

  7. Perform Clustering.

Experimental Design

1. Data Preparation

2. Exploratory Data Analysis

3. Clustering

4. Conclusions and Recommendations

Data Relevance

Supermarket Branches Dataset:

Supermarket Customers Dataset:

SUPERMARKET BRANCH ANALYSIS

Data Preparation

Loading Libraries

#Loading dependencies: 
library(readxl)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
library(grid)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(crosstable)
## 
## Attaching package: 'crosstable'
## The following object is masked from 'package:purrr':
## 
##     compact
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:mlr':
## 
##     train
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggcorrplot)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(superml)
## Loading required package: R6
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("ggdendro")
library(flashClust)
## 
## Attaching package: 'flashClust'
## The following object is masked from 'package:stats':
## 
##     hclust
library(NbClust)
library(cluster)
library(purrr)
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(moments)

#Helper function: returns the mode (most frequent value) of a vector
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
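For instance, a quick sanity check of this helper on a toy vector (illustrative values):

#Example usage of getmode:
getmode(c("Florida", "New York", "Florida"))
## [1] "Florida"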

Loading the Dataset

#Importing the data to the Global Environment:
data <- read_excel("C:/Users/Elijah/Desktop/50_SupermarketBranches.xlsx")

#Printing the first 4 rows of the dataframe
head(data, n=4)
## # A tibble: 4 x 5
##   `Advertisement Spend` `Promotion Spend` `Administration Spend` State    Profit
##                   <dbl>             <dbl>                  <dbl> <chr>     <dbl>
## 1               165349.           136898.                471784. New York 1.92e5
## 2               162598.           151378.                443899. Califor~ 1.92e5
## 3               153442.           101146.                407935. Florida  1.91e5
## 4               144372.           118672.                383200. New York 1.83e5
#Checking the data structure:
str(data)
## tibble [50 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Advertisement Spend : num [1:50] 165349 162598 153442 144372 142107 ...
##  $ Promotion Spend     : num [1:50] 136898 151378 101146 118672 91392 ...
##  $ Administration Spend: num [1:50] 471784 443899 407935 383200 366168 ...
##  $ State               : chr [1:50] "New York" "California" "Florida" "New York" ...
##  $ Profit              : num [1:50] 192262 191792 191050 182902 166188 ...

The data has 50 observations and 5 variables: four of the variables are numeric and one is character.

#checking for distinct values in the data:
unique(data)
## # A tibble: 50 x 5
##    `Advertisement Spend` `Promotion Spend` `Administration Spend` State   Profit
##                    <dbl>             <dbl>                  <dbl> <chr>    <dbl>
##  1               165349.           136898.                471784. New Yo~ 1.92e5
##  2               162598.           151378.                443899. Califo~ 1.92e5
##  3               153442.           101146.                407935. Florida 1.91e5
##  4               144372.           118672.                383200. New Yo~ 1.83e5
##  5               142107.            91392.                366168. Florida 1.66e5
##  6               131877.            99815.                362861. New Yo~ 1.57e5
##  7               134615.           147199.                127717. Califo~ 1.56e5
##  8               130298.           145530.                323877. Florida 1.56e5
##  9               120543.           148719.                311613. New Yo~ 1.52e5
## 10               123335.           108679.                304982. Califo~ 1.50e5
## # ... with 40 more rows

Checking missing values

#checking for null values
sum(is.na(data))
## [1] 0

Checking for duplicates

#checking duplicates
anyDuplicated((data))
## [1] 0

The data has no missing or duplicate values.

Checking for outliers

#Renaming the column names for ease of reference:
names(data) <- c('Advertisement.Spend', 'Promotion.Spend', 'Administration.Spend', 'State', 'Profit') 

#subset of numeric data
data_numeric<-data[,c(1:3,5)]

#Using a Box plot to check for outliers on the numerical variables:
qplot( x =  Advertisement.Spend, y = "", geom = "boxplot", data = data_numeric, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Advertisement Spending")

qplot( x =  Promotion.Spend,  y = "", geom = "boxplot", data = data_numeric, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Promotion Spending")

qplot( x =  Administration.Spend, y = "", geom = "boxplot", data = data_numeric ,col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Administration Spending")

The dataset has no outliers.
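As a complement to the visual check, a programmatic 1.5 * IQR rule can confirm this; a minimal sketch that counts flagged values per column:

#Counting values beyond 1.5 * IQR from the quartiles in each numeric column:
sapply(data_numeric, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})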

Exploratory Data Analysis

Univariate Analysis

#Getting the summary statistics:
summary(data_numeric)
##  Advertisement.Spend Promotion.Spend  Administration.Spend     Profit      
##  Min.   :     0      Min.   : 51283   Min.   :     0       Min.   : 14681  
##  1st Qu.: 39936      1st Qu.:103731   1st Qu.:129300       1st Qu.: 90139  
##  Median : 73051      Median :122700   Median :212716       Median :107978  
##  Mean   : 73722      Mean   :121345   Mean   :211025       Mean   :112013  
##  3rd Qu.:101603      3rd Qu.:144842   3rd Qu.:299469       3rd Qu.:139766  
##  Max.   :165349      Max.   :182646   Max.   :471784       Max.   :192262

The leading average cost is administration, followed by promotion and finally advertisement. The mean profit generated was 112,013.

#Finding the standard deviation and variance:
stats <- data.frame(
  sd=apply(data_numeric, 2 ,sd),
  var=apply(data_numeric,2,var),
  kurtosis= apply(data_numeric,2,kurtosis),
  skewness = apply(data_numeric,2,skewness)
)
stats
##                             sd         var kurtosis    skewness
## Advertisement.Spend   45902.26  2107017150 2.194932  0.15904052
## Promotion.Spend       28017.80   784997271 3.085538 -0.47423007
## Administration.Spend 122290.31 14954920097 2.275967 -0.04506632
## Profit                40306.18  1624588173 2.824704  0.02258638

The variables have a large spread around the mean, as indicated by their large standard deviations.

The variables are not highly skewed, as their skewness values lie between -1 and 1.

The kurtosis values are close to 3, so the distributions are approximately mesokurtic, which is consistent with the boxplots showing no outliers.

#The number of branches per state:
state<- table(data$State)

#visualizing using a bar plot:
barplot(state,col="cyan3", main = "Distribution of States")

California and New York have 17 branches each, while Florida has 16.

Bivariate Analysis

#Relationship between advert spending and profit:
ggplot(data,aes(x=Advertisement.Spend,y=Profit))+
  geom_point() + labs(title = "Advertising Spending and Profit")

Advertisement spend and profit have a strong positive correlation: the more a branch spends on advertising, the higher its profit tends to be.

#Relationship between promotion and profit:
ggplot(data,aes(x=Promotion.Spend,y=Profit))+
  geom_point() + labs(title = "Promotion Spending and Profit")

Profit and promotion spend have a weak positive correlation, meaning promotion spending has little effect on profit.

#Relationship between administration nd profit:
ggplot(data,aes(x=Administration.Spend,y=Profit))+
  geom_point() + labs(title = "Administration Spending and Profit")

Administration spend has a strong positive correlation with profit, which may indicate that investment in administration (for example, employee remuneration) is associated with higher profit.

#Comparing profits across the states:
profit_per_state <- data %>% group_by(State) %>%
                    summarise(total_Profits = sum(Profit),
                              .groups = 'drop')
head(profit_per_state)
## # A tibble: 3 x 2
##   State      total_Profits
##   <chr>              <dbl>
## 1 California      1766388.
## 2 Florida         1900384.
## 3 New York        1933860.
ggplot(data ) + 
   geom_point(mapping = aes(x = 1:nrow(data), y = Profit, color = State)) + labs(title = " Profit across the Different States")

Based on profit, the states with the best-performing branches are New York, followed by Florida and then California.

#Comparing Advertisement.Spend across the states:
Advertisement.Spend_per_state <- data %>% group_by(State) %>%
                    summarise(total_Advertisement.Spend = sum(Advertisement.Spend),
                              .groups = 'drop')
head(Advertisement.Spend_per_state)
## # A tibble: 3 x 2
##   State      total_Advertisement.Spend
##   <chr>                          <dbl>
## 1 California                  1099180.
## 2 Florida                     1291584.
## 3 New York                    1295316.
ggplot(data ) + 
   geom_point(mapping = aes(x = 1:nrow(data), y = Advertisement.Spend, shape = State)) + labs(title = "Advertisment Cost across States")

The state with the highest advertisement spend is New York, followed by Florida and then California. This could explain why New York has the leading profit, since profit and advertisement spend are highly correlated.

#Comparing Promotion.Spend across the states:
Promotion.Spend_per_state <- data %>% group_by(State) %>%
                    summarise(total_Promotion.Spend = sum(Promotion.Spend),
                              .groups = 'drop')
head(Promotion.Spend_per_state)
## # A tibble: 3 x 2
##   State      total_Promotion.Spend
##   <chr>                      <dbl>
## 1 California              2052691.
## 2 Florida                 1948302.
## 3 New York                2066239
ggplot(data ) + 
   geom_point(mapping = aes(x = 1:nrow(data), y = Promotion.Spend, shape = State)) + labs(title = "Promotion Cost across the States ")

The state with the highest promotion spend is New York, followed by California and then Florida.

New York leads in both promotion spend and profit, but since the correlation between promotion and profit is weak, promotion alone is unlikely to explain its lead. California's heavy promotion spending likewise does not translate into profit, which may explain why it has the lowest profits.

#Comparing Administration.Spend across the states:
Administration.Spend_per_state <- data %>% group_by(State) %>%
                    summarise(total_Administration.Spend = sum(Administration.Spend),
                              .groups = 'drop')
head(Administration.Spend_per_state)
## # A tibble: 3 x 2
##   State      total_Administration.Spend
##   <chr>                           <dbl>
## 1 California                   3103196.
## 2 Florida                      3957177.
## 3 New York                     3490882.
ggplot(data ) + 
   geom_point(mapping = aes(x = 1:nrow(data), y = Administration.Spend, color  = State)) + labs(title = " Administration Cost across States")

The state with the highest administration spend is Florida, followed by New York and finally California.

Florida posts strong profits, consistent with the positive correlation between administration spend and profit.

#creating a new column containing the sum of all the expenses:
data$expenses<- rowSums(data[,c(1:3)])
head(data)
## # A tibble: 6 x 6
##   Advertisement.Spend Promotion.Spend Administration.Spend State Profit expenses
##                 <dbl>           <dbl>                <dbl> <chr>  <dbl>    <dbl>
## 1             165349.         136898.              471784. New ~ 1.92e5  774031.
## 2             162598.         151378.              443899. Cali~ 1.92e5  757874.
## 3             153442.         101146.              407935. Flor~ 1.91e5  662522.
## 4             144372.         118672.              383200. New ~ 1.83e5  646244.
## 5             142107.          91392.              366168. Flor~ 1.66e5  599668.
## 6             131877.          99815.              362861. New ~ 1.57e5  594553.
#plotting a scatter plot for the expenses and profit:
x<- data$expenses
y<-data$Profit
plot(x, y, main = "profit vs expensess",
     xlab = "expenses", ylab = "profit")
abline(lm(y ~ x, data = data), col = "blue")

Profit and expenses have a positive correlation, meaning that increasing profit generally requires incurring additional costs.
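The strength of this relationship can be quantified with the Pearson correlation coefficient; a minimal check:

#Correlation between total expenses and profit:
cor(data$expenses, data$Profit)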

#Getting the state with the most expenses:
expenses_per_state <- data %>% group_by(State) %>%
                    summarise(total_expenses = sum(expenses),
                              .groups = 'drop')
head(expenses_per_state)
## # A tibble: 3 x 2
##   State      total_expenses
##   <chr>               <dbl>
## 1 California       6255067.
## 2 Florida          7197063.
## 3 New York         6852437.
ggplot(data ) + 
   geom_point(mapping = aes(x = 1:nrow(data), y = expenses, color  = State)) + labs(title = "Expenses across the States")

The state with the highest expenses is Florida, followed by New York and then California.

This is consistent with California having the lowest profits: it does not incur much cost to generate profit. It is also notable that New York outperforms Florida in profit despite incurring lower expenses.

#Total revenue equals expenses plus profit, so we create a revenue column:
data$total.revenue <- rowSums(data[,c(5:6)])
#Finding the state with the most revenue:
total.revenue_per_state <- data %>% group_by(State) %>%
                    summarise(total_total.revenue = sum(total.revenue),
                              .groups = 'drop')
total.revenue_per_state
## # A tibble: 3 x 2
##   State      total_total.revenue
##   <chr>                    <dbl>
## 1 California            8021455.
## 2 Florida               9097448.
## 3 New York              8786297.
#visualizing the revenue per state
pct = round((total.revenue_per_state$total_total.revenue/sum(total.revenue_per_state$total_total.revenue))*100,1)
# Plot the chart.
pie(total.revenue_per_state$total_total.revenue,labels = pct,main='Revenue Collection across the States', col = rainbow((nrow(total.revenue_per_state))))
#adding legend to our pie chart
legend("topright", c("California","Florida","New York"),cex = 0.8,fill = rainbow(nrow(total.revenue_per_state))) 

The state with the highest revenue is Florida, followed by New York and then California. Based on revenue generation, the Florida branches are performing well, but New York has the leading profits because it concentrates its spending on the costs most correlated with profit, unlike Florida, which directs more of its revenue to costs that are weakly correlated with profit.

#Encoding the state column
#install.packages("superml")
library("superml")
lbl = LabelEncoder$new()
data$State = lbl$fit_transform(data$State)
head(data)
## # A tibble: 6 x 7
##   Advertisement.Spend Promotion.Spend Administration.Spend State Profit expenses
##                 <dbl>           <dbl>                <dbl> <dbl>  <dbl>    <dbl>
## 1             165349.         136898.              471784.     0 1.92e5  774031.
## 2             162598.         151378.              443899.     1 1.92e5  757874.
## 3             153442.         101146.              407935.     2 1.91e5  662522.
## 4             144372.         118672.              383200.     0 1.83e5  646244.
## 5             142107.          91392.              366168.     2 1.66e5  599668.
## 6             131877.          99815.              362861.     0 1.57e5  594553.
## # ... with 1 more variable: total.revenue <dbl>
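For reference, an equivalent base-R approach without the superml dependency, left commented out since State is already encoded above (note that factor codes are assigned in alphabetical order, which may differ from LabelEncoder's order of first appearance):

#data$State <- as.integer(factor(data$State)) - 1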
#Finding the correlation matrix:
cormat<- cor(data)

#Get upper triangle of the correlation matrix:
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
### Melt
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following object is masked from 'package:tidyr':
## 
##     smiths
melted_cormat <- melt(upper_tri, na.rm = TRUE)
### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+ labs(title = "Correlation Matrix \n") +
 coord_fixed()

Summary:

ASSOCIATION ANALYSIS

Loading the Data

#Importing the data to the Global Environment:
tr <- read.transactions("C:/Users/Elijah/Desktop/groceries - groceries.csv", sep=',')

#Exploring the data using the Summary function:
summary(tr)
## transactions as itemMatrix in sparse format with
##  9836 rows (elements/itemsets/transactions) and
##  231 columns (items) and a density of 0.0234297 
## 
## most frequent items:
##       whole milk                1 other vegetables       rolls/buns 
##             2513             2159             1903             1809 
##             soda          (Other) 
##             1715            43136 
## 
## element (itemset/transaction) length distribution:
## sizes
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   18   19   20   21   22   23   24   25   27   28   29   30   33 
##   29   14   14    9   11    4    6    1    1    1    1    3    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   5.412   7.000  33.000 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3     11

There are 9,836 transactions and 231 items. Note that numeric labels such as "1" and "17" appear among the most frequent items; these appear to come from the CSV's leading item-count column being parsed as items, which is worth keeping in mind when reading the rules below.
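Inspecting a few raw transactions is a quick way to sanity-check the import; a minimal sketch:

#Viewing the first 3 transactions:
inspect(head(tr, 3))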

Determining the frequency of items appearing in the Market Basket

#Create an item frequency plot for the top 10 items:
if (!require("RColorBrewer")) {
# install color package of R
install.packages("RColorBrewer")
#include library RColorBrewer
library(RColorBrewer)
}
## Loading required package: RColorBrewer
itemFrequencyPlot(tr,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")

# absolute type parameter gives  us numeric frequencies of items  independently
#Relative Item Frequency Plot
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Pastel2'),main="Relative Item Frequency Plot")

#the relative type parameter shows how often an item appears relative to the other items

The relative type parameter shows how often an item appears in transactions relative to the other items.

Generating rules using Apriori algorithm

# Minimum support of 0.001, minimum confidence of 0.8, and a maximum rule length of 10 items
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[231 item(s), 9836 transaction(s)] done [0.01s].
## sorting and recoding items ... [177 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(association.rules) 
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  47 259 145  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.263   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001118   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001242   Mean   :0.8663   Mean   :0.001444   Mean   : 3.933  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.307  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003558   Max.   :11.236  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :11.00  
##  Mean   :12.22  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr          9836   0.001        0.8
##                                                                         call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))

The total number of rules generated is 463.

Distribution of rule lengths: rules of length 4 are the most common (259 rules), while rules of length 6 are the least common (12 rules).
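To surface the strongest associations, the rules can also be sorted by lift; a minimal sketch:

#Inspecting the 5 rules with the highest lift:
inspect(head(sort(association.rules, by = "lift"), 5))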

Inspecting the rules

# printing  the first 8 rules
inspect(association.rules[1:8])
##     lhs                         rhs            support     confidence
## [1] {17, tropical fruit}     => {whole milk}   0.001118341 1.0000000 
## [2] {cereals, curd}          => {whole milk}   0.001016673 0.9090909 
## [3] {cereals, yogurt}        => {whole milk}   0.001728345 0.8095238 
## [4] {butter, jam}            => {whole milk}   0.001016673 0.8333333 
## [5] {liquor, red/blush wine} => {bottled beer} 0.001931680 0.9047619 
## [6] {bottled beer, soups}    => {whole milk}   0.001118341 0.9166667 
## [7] {7, specialty cheese}    => {whole milk}   0.001016673 0.9090909 
## [8] {16, whipped/sour cream} => {whole milk}   0.001220008 0.8571429 
##     coverage    lift      count
## [1] 0.001118341  3.914047 11   
## [2] 0.001118341  3.558225 10   
## [3] 0.002135014  3.168514 17   
## [4] 0.001220008  3.261706 10   
## [5] 0.002135014 11.236412 19   
## [6] 0.001220008  3.587876 11   
## [7] 0.001118341  3.558225 10   
## [8] 0.001423343  3.354897 12
#To find out which items are likely to be purchased with whole milk:
rules <- apriori(data=tr, parameter=list(supp=0.001, conf = 0.8), 
                 appearance = list(default="lhs", rhs="whole milk"),
                 control = list(verbose=F))
rules <- sort(rules, decreasing=TRUE, by="confidence")
inspect(rules[1:6])
##     lhs                     rhs              support confidence    coverage     lift count
## [1] {17,                                                                                  
##      tropical fruit}     => {whole milk} 0.001118341          1 0.001118341 3.914047    11
## [2] {rice,                                                                                
##      sugar}              => {whole milk} 0.001220008          1 0.001220008 3.914047    12
## [3] {canned fish,                                                                         
##      hygiene articles}   => {whole milk} 0.001118341          1 0.001118341 3.914047    11
## [4] {butter,                                                                              
##      rice,                                                                                
##      root vegetables}    => {whole milk} 0.001016673          1 0.001016673 3.914047    10
## [5] {14,                                                                                  
##      curd,                                                                                
##      other vegetables}   => {whole milk} 0.001016673          1 0.001016673 3.914047    10
## [6] {flour,                                                                               
##      root vegetables,                                                                     
##      whipped/sour cream} => {whole milk} 0.001728345          1 0.001728345 3.914047    17
# Filter rules with confidence greater than 0.6 or 60%
subRules<-association.rules[quality(association.rules)$confidence>0.6]
#Plot SubRules
plot(subRules,jitter=0)

# rules with high lift tend to have low support

Rules with high lift have low support, while rules with low lift have high support.
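Before visualizing further, the rule set can optionally be pruned of redundant rules (a rule is redundant if a more general rule with the same or higher confidence exists); a sketch using arules:

#Removing redundant rules from the filtered set:
nonRedundantRules <- subRules[!is.redundant(subRules)]
length(nonRedundantRules)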

Plotting using a Two-Key Plot

#Using confidence as y and support as x
plot(subRules,method="two-key plot",jitter= 0)

# order is the number of items in a rule

The rules with 6 items have the lowest support, which suggests that rules with more items tend to have lower support than rules with fewer items.

Graph-Based Visualizations

top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph",  engine = "htmlwidget")

CUSTOMER SEGMENTATION

Loading the Dataset

#Importing the data to the Global Environment:
Demographics <- read_excel("C:/Users/Elijah/Desktop/Customer Demographics.xlsx")

#Printing the first 4 rows of the dataframe
head(Demographics, n=4)
## # A tibble: 4 x 5
##   CustomerID Genre    Age `Annual Income (k$)` `Spending Score (1-100)`
##        <dbl> <chr>  <dbl>                <dbl>                    <dbl>
## 1          1 Male      19                   15                       39
## 2          2 Male      21                   15                       81
## 3          3 Female    20                   16                        6
## 4          4 Female    23                   16                       77
#Checking the Number of Rows and Columns:
dim(Demographics)
## [1] 200   5

Data Uniformity

#Getting Information on the data types on each respective column:
sapply(Demographics, class)
##             CustomerID                  Genre                    Age 
##              "numeric"            "character"              "numeric" 
##     Annual Income (k$) Spending Score (1-100) 
##              "numeric"              "numeric"
#Renaming the column names for ease of reference:
names(Demographics) <- c('CustomerID', 'Gender', 'Age', 'AnnualIncome', 'SpendingScore')
#Viewing the full information:
str(Demographics)
## tibble [200 x 5] (S3: tbl_df/tbl/data.frame)
##  $ CustomerID   : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender       : chr [1:200] "Male" "Male" "Female" "Female" ...
##  $ Age          : num [1:200] 19 21 20 23 31 22 35 23 64 30 ...
##  $ AnnualIncome : num [1:200] 15 15 16 16 17 17 18 18 19 19 ...
##  $ SpendingScore: num [1:200] 39 81 6 77 40 76 6 94 3 72 ...

Missing Values

#Checking for null entries in each column:
colSums(is.na(Demographics))
##    CustomerID        Gender           Age  AnnualIncome SpendingScore 
##             0             0             0             0             0

Duplicate values

#Checking for identical entries:
sum(duplicated(Demographics))
## [1] 0

The data has no missing or duplicate values.

Checking for Outliers

#Using a Box plot to check for outliers on the numerical variables:
qplot( x =  Age, y = "", geom = "boxplot", data = Demographics, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Age ")

qplot( x =  AnnualIncome,  y = "", geom = "boxplot", data = Demographics, col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Annual Income (K$)")

qplot( x =  SpendingScore, y = "", geom = "boxplot", data = Demographics ,col = I("darkolivegreen4"), fill = I("cyan3"), main = "Boxplot on Spending Score (1-100)")

An outlier can be observed in Annual Income. Since there is no basis to assume the entry is invalid, the outlier is not dropped.
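The flagged entries can be identified with the same 1.5 * IQR rule the boxplot uses; a minimal sketch:

#Rows whose annual income lies above the upper whisker:
upper <- quantile(Demographics$AnnualIncome, 0.75) + 1.5 * IQR(Demographics$AnnualIncome)
Demographics[Demographics$AnnualIncome > upper, ]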

Exploratory Data Analysis

This process involves investigating the dataset to discover patterns.

Univariate Analysis

This analysis aims to explore each demographic variable in the dataset separately.

#To view the distribution of Gender:
Gen_table <- table(Demographics$Gender)

#Plotting the Information above (percentages computed from the table):
pct <- as.vector(round(100 * Gen_table / sum(Gen_table)))
colors <- c('cyan3','cyan4')
pie(pct, labels = percent(pct/100), main=' Gender Distribution', density=30, col=colors)
legend("topright", c("Females", "Males"), cex = 0.9, fill = colors)

Of the 200 sampled customers, 112 were female and 88 were male.

#To view the Age distribution of the customers:
Age_table <- table(Demographics$Age)

#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = Age)) + 
  geom_histogram(fill = "cyan3", color = "black", binwidth = 2) + labs(x = "Age (Year's)", title = "Age Distribution")

The ages of the customers range from 18 to 70 years.

Those who visit the supermarket most frequently are in the young-adult and adult age groups (27-45).

Less frequent visitors are those aged 55 and above.

The distribution is close to a normal distribution.

#To view the Income distribution of the customers:
Income_table <- table(Demographics$AnnualIncome)

#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = AnnualIncome)) + 
  geom_histogram(fill = "cyan3", color = "black", binwidth = 10) + labs(x = "Income (K$)", title = "Income Levels")

The income levels of the customers range from $15k to $137k.

Customers earning around $70k have the highest frequency count.

The average annual income of the customers is $60.56k.

#To view the Spending Score of the customers:
Spending_table <- table(Demographics$SpendingScore)

#Plotting the Information above:
ggplot(data = Demographics, mapping = aes(x = SpendingScore)) + 
  geom_histogram(fill = "cyan3", color = "black", binwidth = 10) + labs(x = "Spending Score (1-100)", title = "Spending Scores")

Spending Score is a score given to a customer by the Supermarket authorities based on the money spent and the behavior of the customer.

This is an important chart, as it gives an idea of the spending rate of the customers visiting the supermarket.

From the plot, most of the customers have a score in the range of 40 - 60.

There are also customers with a score of 99, showing that the supermarket caters to a variety of customers with differing needs and requirements.

#Printing the Descriptive Summary:
summary(Demographics)
##    CustomerID        Gender               Age         AnnualIncome   
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00  
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50  
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50  
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56  
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00  
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00  
##  SpendingScore  
##  Min.   : 1.00  
##  1st Qu.:34.75  
##  Median :50.00  
##  Mean   :50.20  
##  3rd Qu.:73.00  
##  Max.   :99.00

Bivariate Analysis

This analysis involves two variables being observed against each other.

#Creating a dataframe Cr:
Cr<- Demographics

#Visualizing the Plot:
corr_map <- ggcorr(Cr[,3:5], method=c("everything", "pearson"), label=TRUE, hjust = .90, size = 3, layout.exp = 2) + (labs(title = "Correlation Matrix \n"))
corr_map

Age is negatively correlated with spending score.

There is little or no correlation between annual income and age, or between spending score and annual income.

The following plots explore these relationships further:

#To see the relationship between Income and Spending scores across the different genders:
ggplot(Demographics, aes(x=AnnualIncome, y = SpendingScore )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship Between Income and Spending Score')

Roughly, an annual income of $40-70k corresponds to a spending score of 40-60.

A cluster-like pattern appears when looking at income against spending score: high-income customers with high spending scores, and high-income customers with low spending scores.

There does not appear to be any difference in spending score between male and female customers.

#To see the relationship between Age and Spending scores across the different genders:
ggplot(Demographics, aes(x=Age, y = SpendingScore )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship between Age and Spending Score')

The lower the age, the higher the spending score.

Customers aged 18-40 make up most of the customers with a spending score above 60.

Customers above 40 do not appear to have spending scores above 60, although they also make up a small share of the entire sample.

#To see the relationship between Age and Annual Income across the different genders:
ggplot(Demographics, aes(x=Age, y = AnnualIncome )) + geom_point(aes(colour= `Gender`))+labs(title='Relationship between Age and Annual Income')

Income Levels are high among customers aged 30-50 years.

K-Means Clustering

This algorithm partitions observations into clusters based on feature similarity.

#Creating the modelling dataframe:
Model <- select(Demographics, Age, AnnualIncome, SpendingScore)

#Viewing the Data types:
sapply(Model, class)
##           Age  AnnualIncome SpendingScore 
##     "numeric"     "numeric"     "numeric"

Feature Engineering

  1. Feature Scaling
#Transforming the ranges to the same scale of 0 to 1 using Min-Max scaling
#(left commented out: the clustering below runs on the raw scales):
#Model <- as.data.frame(sapply(Model, function(x) (x-min(x))/(max(x)-min(x))))
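An alternative would be z-score standardization; a sketch, also left commented out to match the unscaled analysis that follows:

#Model <- as.data.frame(scale(Model))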

Model Training

Determining the best K using three popular methods:

  1. Elbow Method
#Using the Elbow method to find the optimal k:
set.seed(123)

#Function to calculate total intra-cluster sum of square: 
iss <- function(k) {
  kmeans(Model,k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
    type="b", pch = 19, frame = FALSE, 
    xlab="Number of clusters K",
    ylab="Total intra-clusters sum of squares")

From the graph above, 4 appears to be an appropriate number of clusters, as that is where the bend in the elbow plot occurs.

2. Gap Statistic

# Finding the optimal number of clusters using the Gap Statistic method:
set.seed(123)
stat_gap <- clusGap(Model, FUN = kmeans, nstart = 25,
            K.max = 10, B = 50)
fviz_gap_stat(stat_gap)

3. Silhouette Method

#Finding the optimal number of clusters using the Silhouette method:
fviz_nbclust(Model, kmeans, method = "silhouette")

Based on the methods above, the best K is 6.

# Plotting the K_Mean Clusters with the Optimal K:
New_Model = kmeans(Model,6,iter.max=100,nstart=50,algorithm="Lloyd")
New_Model
## K-means clustering with 6 clusters of sizes 45, 21, 35, 39, 38, 22
## 
## Cluster means:
##        Age AnnualIncome SpendingScore
## 1 56.15556     53.37778      49.08889
## 2 44.14286     25.14286      19.52381
## 3 41.68571     88.22857      17.28571
## 4 32.69231     86.53846      82.12821
## 5 27.00000     56.65789      49.13158
## 6 25.27273     25.72727      79.36364
## 
## Clustering vector:
##   [1] 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2 6 2
##  [38] 6 2 6 1 6 1 5 2 6 1 5 5 5 1 5 5 1 1 1 1 1 5 1 1 5 1 1 1 5 1 1 5 5 1 1 1 1
##  [75] 1 5 1 5 5 1 1 5 1 1 5 1 1 5 5 1 1 5 1 5 5 5 1 5 1 5 5 1 1 5 1 5 1 1 1 1 1
## [112] 5 5 5 5 5 1 1 1 1 5 5 5 4 5 4 3 4 3 4 3 4 5 4 3 4 3 4 3 4 3 4 5 4 3 4 3 4
## [149] 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3
## [186] 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4
## 
## Within cluster sum of squares by cluster:
## [1]  8062.133  7732.381 16690.857 13972.359  7742.895  4099.818
##  (between_SS / total_SS =  81.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
#Looking at the Important components using PCA:
pcclust=prcomp(Model,scale=FALSE) 
summary(pcclust)
## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     26.4625 26.1597 12.9317
## Proportion of Variance  0.4512  0.4410  0.1078
## Cumulative Proportion   0.4512  0.8922  1.0000
pcclust$rotation[,1:2]
##                      PC1        PC2
## Age            0.1889742 -0.1309652
## AnnualIncome  -0.5886410 -0.8083757
## SpendingScore -0.7859965  0.5739136

The first two principal components explain 89% of the total variation in the dataset.
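A complementary view is to project the clusters onto those two components; a sketch using factoextra, which is loaded above:

#Plotting the clusters in PCA space:
fviz_cluster(New_Model, data = Model, geom = "point") +
  ggtitle("K-Means Clusters on the First Two Principal Components")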

Interpreting the Clusters

To get a more comprehensible understanding of the predicted clusters:

#To view the clustering:
set.seed(1)
ggplot(Model, aes(x =AnnualIncome, y = SpendingScore)) + 
  geom_point(stat = "identity", aes(color = as.factor(New_Model$cluster))) +
  scale_color_discrete(name=" ",
              breaks=c("1", "2", "3", "4", "5","6"),
              labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
  ggtitle("Segments of Supermarket Customers", subtitle = "Annual Income and Spending Score")

Cluster 2:

Cluster 3:

Cluster 1 and 5:

Cluster 4:

Cluster 6:

Recommendations and Conclusion

From Customer Segmentation:

General: