1.Problem statement

Esther and co.analytics have been contracted by Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). We will divide the project into four parts where we’ll explore a recent marketing dataset by performing various analysis and algorithms and later providing recommendations based on our insights.

2.Metrics of success

Performing indepth analysis and provide insights.Especially on the fast moving the most revenue growers products(product lines in our case).Carrefour deepest concern is simply strategies for shorter the conversion cycle.This invaluable supply chain metric will CARREFOUR Kenya can take the right measures to ensure that they run business with less money tied up in operations.

3.Business Understanding

On 17 May 2016, the Majid Al Futtaim Group - the official franchise partner opened its first Carrefour hypermarket in Kenya, in Nairobi county.The country’s .in 2019 it operated two hypermarket locations, one at the Hub and the other at Two Rivers.The European based supermarket Carrefour, which is the first hypermarket in the world has finally opened two branches in Kenya and is set out to open another one at Thika Road Mall.

4.Data Understanding

3.Loading and Exploring our dataset

library("readr")

cr <- read.csv("http://bit.ly/CarreFourDataset")

head(cr,n=3)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
##   Quantity     Tax     Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415 1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200 3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155 3/3/2019 13:23 Credit card 324.31                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
#previewing the bottom of the dataset

tail(cr,n=3)
##       Invoice.ID Branch Customer.type Gender        Product.line Unit.price
## 998  727-02-1313      A        Member   Male  Food and beverages      31.84
## 999  347-56-2442      A        Normal   Male  Home and lifestyle      65.82
## 1000 849-09-3807      A        Member Female Fashion accessories      88.34
##      Quantity    Tax      Date  Time Payment   cogs gross.margin.percentage
## 998         1  1.592  2/9/2019 13:22    Cash  31.84                4.761905
## 999         1  3.291 2/22/2019 15:33    Cash  65.82                4.761905
## 1000        7 30.919 2/18/2019 13:28    Cash 618.38                4.761905
##      gross.income Rating   Total
## 998         1.592    7.7  33.432
## 999         3.291    4.1  69.111
## 1000       30.919    6.6 649.299
#we check the column names we have
names(cr)
##  [1] "Invoice.ID"              "Branch"                 
##  [3] "Customer.type"           "Gender"                 
##  [5] "Product.line"            "Unit.price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Time"                   
## [11] "Payment"                 "cogs"                   
## [13] "gross.margin.percentage" "gross.income"           
## [15] "Rating"                  "Total"

we observe that we have total of 16 columns heads we can confirm with dim function below

#we can also check the no of rows and columns

dim(cr)
## [1] 1000   16
#we can also go ahead and check for the datatypes

str(cr)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
#we can go ahead and check for the number missing values per column

colSums(is.na(cr))
##              Invoice.ID                  Branch           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                     Tax                    Date 
##                       0                       0                       0 
##                    Time                 Payment                    cogs 
##                       0                       0                       0 
## gross.margin.percentage            gross.income                  Rating 
##                       0                       0                       0 
##                   Total 
##                       0

we observe that we have no missing values

#we can go ahead and check for duplicated values
duplicated_rows <- cr[duplicated(cr),]

dim(duplicated_rows)
## [1]  0 16

we have o rows in the 16 columns that have been duplicated

#we can go ahead and check for the outliers
#first we can go ahead and choose the numerical columns only

nums <- subset(cr, select = c(Unit.price, Quantity, Tax,cogs, gross.margin.percentage,gross.income, Rating ,Total))

colnames(nums)
## [1] "Unit.price"              "Quantity"               
## [3] "Tax"                     "cogs"                   
## [5] "gross.margin.percentage" "gross.income"           
## [7] "Rating"                  "Total"
boxplot(nums)

We observe that column cogs and total have a few outliers

5.Tidying the Dataset

#we can first start by changing the pesky column names 
#we will first change them to lowercase then we will rename them
# First we Change the type of the loaded dataset to a dataframe

cr = as.data.frame(cr)

# Change column names, by making them uniform

colnames(cr) = tolower(colnames(cr))

#to confirm the change

names (cr)
##  [1] "invoice.id"              "branch"                 
##  [3] "customer.type"           "gender"                 
##  [5] "product.line"            "unit.price"             
##  [7] "quantity"                "tax"                    
##  [9] "date"                    "time"                   
## [11] "payment"                 "cogs"                   
## [13] "gross.margin.percentage" "gross.income"           
## [15] "rating"                  "total"
library(reshape)
cr <- rename(cr, c(invoice.id="invoice"))
cr <- rename(cr, c(gross.margin.percentage="grossmargin")) 
cr <- rename(cr, c(product.line="product"))
cr <- rename(cr, c(customer.type="customer"))
cr <- rename(cr, c(gross.income="grossincome"))
cr <- rename(cr, c(unit.price="unitprice"))

names(cr)
##  [1] "invoice"     "branch"      "customer"    "gender"      "product"    
##  [6] "unitprice"   "quantity"    "tax"         "date"        "time"       
## [11] "payment"     "cogs"        "grossmargin" "grossincome" "rating"     
## [16] "total"
#since we have columns that are identical tax and gross income we drop one
#we also drop the constant column grossmargin
#we dont need invoiceid in our modelling or analysis

cr[ ,c('tax', 'grossmargin','invoice')] <- list(NULL)
#to confirm they have been deleted

names(cr)
##  [1] "branch"      "customer"    "gender"      "product"     "unitprice"  
##  [6] "quantity"    "date"        "time"        "payment"     "cogs"       
## [11] "grossincome" "rating"      "total"
#we can also separate the date column into 'year','month','day'

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:reshape':
## 
##     expand, smiths
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
crd <- separate(cr, date, c("month", "day", "year"))

head(crd,n=2) 
##   branch customer gender                product unitprice quantity month day
## 1      A   Member Female      Health and beauty     74.69        7     1   5
## 2      C   Normal Female Electronic accessories     15.28        5     3   8
##   year  time payment   cogs grossincome rating    total
## 1 2019 13:08 Ewallet 522.83     26.1415    9.1 548.9715
## 2 2019 10:29    Cash  76.40      3.8200    9.6  80.2200

After splitting the data we observe that our data is for the year 2019

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v stringr 1.4.0
## v tibble  3.0.4     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks reshape::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::rename() masks reshape::rename()
library(magrittr) 
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
#Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
#For the categorical data we change to levels

cat2= c('customer', 'payment','branch', 'gender','product','year','month','day')

# Changing columns to factors

crd[,cat2] %<>% lapply(function(x) as.factor(as.character(x)))

str(crd)
## 'data.frame':    1000 obs. of  15 variables:
##  $ branch     : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ customer   : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ gender     : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ product    : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ unitprice  : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ quantity   : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ month      : Factor w/ 3 levels "1","2","3": 1 3 3 1 2 3 2 2 1 2 ...
##  $ day        : Factor w/ 31 levels "1","10","11",..: 27 30 23 20 30 18 18 17 2 13 ...
##  $ year       : Factor w/ 1 level "2019": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time       : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ payment    : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs       : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ grossincome: num  26.14 3.82 16.22 23.29 30.21 ...
##  $ rating     : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ total      : num  549 80.2 340.5 489 634.4 ...

we observe that:

6.Preparing our dataset for Analysis

#we start by separating our datatypes columns we divide the into numerical and into categorical
#this helps alot when it comes to plotting and statistical analysis 
# we had already defined both but need subsets with the cleaned dataset

numcols <- subset(crd, select = c(unitprice, quantity,cogs,grossincome, rating ,total))

head(numcols,n=3)
##   unitprice quantity   cogs grossincome rating    total
## 1     74.69        7 522.83     26.1415    9.1 548.9715
## 2     15.28        5  76.40      3.8200    9.6  80.2200
## 3     46.33        7 324.31     16.2155    7.4 340.5255
catcols <- subset(crd, select = c(branch, customer,gender,product, month,day, year,payment))

head(catcols,n=3)
##   branch customer gender                product month day year     payment
## 1      A   Member Female      Health and beauty     1   5 2019     Ewallet
## 2      C   Normal Female Electronic accessories     3   8 2019        Cash
## 3      A   Normal   Male     Home and lifestyle     3   3 2019 Credit card
#we can go ahead and change the above dataframes into tibbles for easier 

numt<-as_tibble(numcols)

head(numt,n=3)
## # A tibble: 3 x 6
##   unitprice quantity  cogs grossincome rating total
##       <dbl>    <int> <dbl>       <dbl>  <dbl> <dbl>
## 1      74.7        7 523.        26.1     9.1 549. 
## 2      15.3        5  76.4        3.82    9.6  80.2
## 3      46.3        7 324.        16.2     7.4 341.
cat3<-as_tibble(catcols)

head(cat3,n=3)
## # A tibble: 3 x 8
##   branch customer gender product                month day   year  payment    
##   <fct>  <fct>    <fct>  <fct>                  <fct> <fct> <fct> <fct>      
## 1 A      Member   Female Health and beauty      1     5     2019  Ewallet    
## 2 C      Normal   Female Electronic accessories 3     8     2019  Cash       
## 3 A      Normal   Male   Home and lifestyle     3     3     2019  Credit card

7.Exploratory Data Analysis

Univariate Analysis
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
#Descriptive analysis into measures of central Tendency
#this is usually carried out for the numerical data in the datasets
#we get the statistics for the numerics and draw conclusions from them

describe(numt)
##             vars    n   mean     sd median trimmed    mad   min     max   range
## unitprice      1 1000  55.67  26.49  55.23   55.62  33.37 10.08   99.96   89.88
## quantity       2 1000   5.51   2.92   5.00    5.51   2.97  1.00   10.00    9.00
## cogs           3 1000 307.59 234.18 241.76  279.91 222.65 10.17  993.00  982.83
## grossincome    4 1000  15.38  11.71  12.09   14.00  11.13  0.51   49.65   49.14
## rating         5 1000   6.97   1.72   7.00    6.97   2.22  4.00   10.00    6.00
## total          6 1000 322.97 245.89 253.85  293.91 233.78 10.68 1042.65 1031.97
##             skew kurtosis   se
## unitprice   0.01    -1.22 0.84
## quantity    0.01    -1.22 0.09
## cogs        0.89    -0.09 7.41
## grossincome 0.89    -0.09 0.37
## rating      0.01    -1.16 0.05
## total       0.89    -0.09 7.78

we observe that we 13 numerical columns:

  • the mean of cogs and total are high and grossmargin and quantity low

  • grossincome is mostly 15units of cost with the highest being 49.6 and the lowest being 0.51.

  • quantities of most units is 5 with the highest being 10pcs and lowest being 1pc

  • most units cost around 56 units cost price with the highest being 99 and lowest 10.

  • We also observe that the mean of tax columns is the same with that of grosss income

par(mfrow = c(2, 2))
hist(numt$unitprice)
hist(numt$quantity)
hist(numt$cogs) 
hist(numt$grossincome) 

hist(numt$rating) 
hist(numt$total)

par(mfrow = c(2, 2))
f1 <- cat3$branch 
f1y<- table(f1) 
head.matrix(f1y)
## f1
##   A   B   C 
## 340 332 328
f2 <- cat3$customer 
f2y<- table(f2) 
head.matrix(f2y)
## f2
## Member Normal 
##    501    499
f3 <- cat3$gender 
f3y<- table(f3)
head.matrix(f3y)
## f3
## Female   Male 
##    501    499
f4 <- cat3$product 
f4y<- table(f4) 
head.matrix(f4y)
## f4
## Electronic accessories    Fashion accessories     Food and beverages 
##                    170                    178                    174 
##      Health and beauty     Home and lifestyle      Sports and travel 
##                    152                    160                    166
f5 <- cat3$month 
f5y<- table(f5)
head.matrix(f5y)
## f5
##   1   2   3 
## 352 303 345
f6 <- cat3$day 
f6y<- table(f6) 
head.matrix(f6y)
## f6
##  1 10 11 12 13 14 
## 28 32 27 31 28 39
f7 <- cat3$payment 
f7y<- table(f7)
head.matrix(f7y)
## f7
##        Cash Credit card     Ewallet 
##         344         311         345
par(mfrow = c(2, 2))
barplot(f1y,col="green")
barplot(f2y,col="red")
barplot(f3y,col="blue")
barplot(f4y,col="black")

barplot(f5y,col="yellow")
barplot(f6y,col="dark green")
barplot(f7y,col="purple")

Bivariate Analysis

crd %>%
ggplot() +
aes(x = quantity, month = ..count../nrow(crd), fill = month) +
geom_bar() +
ylab("monthly sale trends")

crd %>%
ggplot() +
aes(x = day, grossincome = ..count../nrow(crd), fill = grossincome) +
geom_bar() +
ylab("daily revenue trends")

crd %>%
ggplot() +
aes(x = quantity, product = ..count../nrow(crd), fill = product) +
geom_bar() +
ylab("product sales trends")

crd %>%
ggplot() +
aes(x = quantity, payment = ..count../nrow(crd), fill = payment) +
geom_bar() +
ylab("payment sales trends")

crd %>%
ggplot() +
aes(x = quantity, gender = ..count../nrow(crd), fill = gender) +
geom_bar() +
ylab("product quantity trends")

crd %>%
ggplot() +
aes(x = quantity, customer = ..count../nrow(crd), fill = customer) +
geom_bar() +
ylab("product sale trends")

crd %>%
ggplot() +
aes(x = quantity, branxh = ..count../nrow(crd), fill = branch) +
geom_bar() +
ylab("branch sale trends")

#Distribution of income per Gender

ggplot(crd,aes(x = total, fill = gender)) +geom_density(alpha = 0.4) +labs(title = "Distribution of total sales per Gender")

#Distribution of income per Gender

ggplot(crd,aes(x = total, fill = branch)) +geom_density(alpha = 0.4) +labs(title = "Distribution of total sales per branch")

#Distribution of income per Gender

ggplot(crd,aes(x = total, fill = customer)) +geom_density(alpha = 0.4) +labs(title = "does membership affect sales")

library(ggcorrplot)
library(corrplot)
## corrplot 0.84 loaded
# Plotting the Correlation Heatmap
crcor<-cor(numt)

library(ggcorrplot)
ggcorrplot(crcor, hc.order = F,type = 
"lower", lab = T ,
  ggtheme = ggplot2::theme_gray,
  colors = c("#00798c", "white", "#edae49"))

The highly correlated features in this dataset were the Cogs , grossincome and total

features with a correlation coefficient of 1(greater than 0.75).We can just drop the

above highly correlated columns during modelling.This is how feature selection is done

during multivariate analysis.

8.Feature Enginnering

#We had already ran the numericals and dropped redundant columns 
#we named the numt
pca <- prcomp(numt[,c(1:6)], center = TRUE, scale = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4      PC5       PC6
## Standard deviation     1.9817 1.0002 0.9939 0.2909 2.89e-16 1.377e-16
## Proportion of Variance 0.6545 0.1667 0.1646 0.0141 0.00e+00 0.000e+00
## Cumulative Proportion  0.6545 0.8213 0.9859 1.0000 1.00e+00 1.000e+00

we observe that after carrying the first principal component explained 65.45% of the total variance while the last three components explained 34.45% of the variance

#used the string function to know which variables were to be consindered

str(pca)
## List of 5
##  $ sdev    : num [1:6] 1.98 1.00 9.94e-01 2.91e-01 2.89e-16 ...
##  $ rotation: num [1:6, 1:6] -0.3281 -0.3649 -0.5029 -0.5029 0.0217 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "unitprice" "quantity" "cogs" "grossincome" ...
##   .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:6] 55.67 5.51 307.59 15.38 6.97 ...
##   ..- attr(*, "names")= chr [1:6] "unitprice" "quantity" "cogs" "grossincome" ...
##  $ scale   : Named num [1:6] 26.49 2.92 234.18 11.71 1.72 ...
##   ..- attr(*, "names")= chr [1:6] "unitprice" "quantity" "cogs" "grossincome" ...
##  $ x       : num [1:1000, 1:6] -1.781 2.087 -0.173 -1.343 -2.497 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

we observe that unit pricr,quantity,cogs and grossincome were thr four variables to be highly consindered during modelling.

#plotting the PCA to see the above visually
#we will use ggbiplot library

library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:reshape':
## 
##     rename, round_any
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
## Loading required package: grid
ggbiplot(pca, obs.scale = 0.5, var.scale = 0.5, groups = crd$customer,ellipse = TRUE, circle = TRUE)

we observe that whether the customer is loyal or new customer the four variables will explain 65.5 plus 16.7 making it 81.2 variance which is fairly high.

9.Wrapper methods

# Sequential forward greedy search (default)
# ---
#
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
## 
##     sim
## The following object is masked from 'package:purrr':
## 
##     map
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust)
out = clustvarsel(numt)
out
## ------------------------------------------------------ 
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------ 
## 
##  Variable proposed Type of step   BICclust Model G   BICdiff Decision
##           quantity          Add  -4308.761     E 9  687.4466 Accepted
##               cogs          Add -16306.851   VEV 9 1083.0132 Accepted
##          unitprice          Add -21393.079   EVV 7 2812.3733 Accepted
##          unitprice       Remove -16306.851   VEV 9 2812.3733 Rejected
##             rating          Add -25510.859   EVV 7 -184.0917 Rejected
##          unitprice       Remove -16306.851   VEV 9 2812.3733 Rejected
## 
## Selected subset: quantity, cogs, unitprice

we observe that the columns quantity,cogs and unitprice have been accepted and can be used for modelling.This is a more accurate way of feature selection it can be used to furtherbreakdown from PCA and CORRELATION matrix .It is a form of hyperparametee tuning.

# The selection algorithm would indicate that the subset 
# we use for the clustering model is composed of variables X1 and X2 
# and that other variables should be rejected. 
# Having identified the variables that we use, we proceed to build the clustering model:
# 
Subset1 = numt[,out$subset]

crc = Mclust(Subset1)

summary(crc)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EVV (ellipsoidal, equal volume) model with 7 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -10478.95 1000 63 -21393.08 -21446.89
## 
## Clustering table:
##   1   2   3   4   5   6   7 
## 173 134 126 148 128 142 149