Overview

You are a data analyst at Carrefour Kenya, currently undertaking a project to inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). The project is divided into four parts in which you’ll explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on your insights.

Loading Libraries

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## Loading required package: scales
## Loading required package: grid
library(Rtsne)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(corrplot)
## corrplot 0.92 loaded
library(wskm)
## Loading required package: latticeExtra
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
## Loading required package: fpc
library("cluster")
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()      masks plyr::arrange()
## ✖ readr::col_factor()   masks scales::col_factor()
## ✖ purrr::compact()      masks plyr::compact()
## ✖ dplyr::count()        masks plyr::count()
## ✖ purrr::discard()      masks scales::discard()
## ✖ tidyr::expand()       masks Matrix::expand()
## ✖ dplyr::failwith()     masks plyr::failwith()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::id()           masks plyr::id()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ latticeExtra::layer() masks ggplot2::layer()
## ✖ purrr::lift()         masks caret::lift()
## ✖ dplyr::mutate()       masks plyr::mutate()
## ✖ tidyr::pack()         masks Matrix::pack()
## ✖ dplyr::recode()       masks arules::recode()
## ✖ dplyr::rename()       masks plyr::rename()
## ✖ dplyr::src()          masks Hmisc::src()
## ✖ dplyr::summarise()    masks plyr::summarise()
## ✖ dplyr::summarize()    masks plyr::summarize(), Hmisc::summarize()
## ✖ tidyr::unpack()       masks Matrix::unpack()
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
library(dplyr)

Part 1: Dimensionality Reduction

This section of the project entails reducing the dataset to a lower-dimensional representation using the t-SNE algorithm or PCA. You will be required to perform your analysis and provide the insights gained from it.

Loading the data set

sales <- read.csv("http://bit.ly/CarreFourDataset")

Previewing the data set

# Preview the head
head(sales)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415  1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200  3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155  3/3/2019 13:23 Credit card 324.31                4.761905
## 4        8 23.2880 1/27/2019 20:33     Ewallet 465.76                4.761905
## 5        7 30.2085  2/8/2019 10:37     Ewallet 604.17                4.761905
## 6        7 29.8865 3/25/2019 18:30     Ewallet 597.73                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165
# preview the tail
tail(sales)
##       Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 995  652-49-6720      C        Member Female Electronic accessories      60.95
## 996  233-67-5758      C        Normal   Male      Health and beauty      40.35
## 997  303-96-2227      B        Normal Female     Home and lifestyle      97.38
## 998  727-02-1313      A        Member   Male     Food and beverages      31.84
## 999  347-56-2442      A        Normal   Male     Home and lifestyle      65.82
## 1000 849-09-3807      A        Member Female    Fashion accessories      88.34
##      Quantity     Tax      Date  Time Payment   cogs gross.margin.percentage
## 995         1  3.0475 2/18/2019 11:40 Ewallet  60.95                4.761905
## 996         1  2.0175 1/29/2019 13:46 Ewallet  40.35                4.761905
## 997        10 48.6900  3/2/2019 17:16 Ewallet 973.80                4.761905
## 998         1  1.5920  2/9/2019 13:22    Cash  31.84                4.761905
## 999         1  3.2910 2/22/2019 15:33    Cash  65.82                4.761905
## 1000        7 30.9190 2/18/2019 13:28    Cash 618.38                4.761905
##      gross.income Rating     Total
## 995        3.0475    5.9   63.9975
## 996        2.0175    6.2   42.3675
## 997       48.6900    4.4 1022.4900
## 998        1.5920    7.7   33.4320
## 999        3.2910    4.1   69.1110
## 1000      30.9190    6.6  649.2990
# preview the structure of the dataset 
str(sales)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
# preview the data summary
summary(sales)
##   Invoice.ID           Branch          Customer.type         Gender         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.line         Unit.price       Quantity          Tax         
##  Length:1000        Min.   :10.08   Min.   : 1.00   Min.   : 0.5085  
##  Class :character   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 5.9249  
##  Mode  :character   Median :55.23   Median : 5.00   Median :12.0880  
##                     Mean   :55.67   Mean   : 5.51   Mean   :15.3794  
##                     3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.:22.4453  
##                     Max.   :99.96   Max.   :10.00   Max.   :49.6500  
##      Date               Time             Payment               cogs       
##  Length:1000        Length:1000        Length:1000        Min.   : 10.17  
##  Class :character   Class :character   Class :character   1st Qu.:118.50  
##  Mode  :character   Mode  :character   Mode  :character   Median :241.76  
##                                                           Mean   :307.59  
##                                                           3rd Qu.:448.90  
##                                                           Max.   :993.00  
##  gross.margin.percentage  gross.income         Rating           Total        
##  Min.   :4.762           Min.   : 0.5085   Min.   : 4.000   Min.   :  10.68  
##  1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500   1st Qu.: 124.42  
##  Median :4.762           Median :12.0880   Median : 7.000   Median : 253.85  
##  Mean   :4.762           Mean   :15.3794   Mean   : 6.973   Mean   : 322.97  
##  3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500   3rd Qu.: 471.35  
##  Max.   :4.762           Max.   :49.6500   Max.   :10.000   Max.   :1042.65
# using describe to better understand our data set
describe(sales)
## sales 
## 
##  16  Variables      1000  Observations
## --------------------------------------------------------------------------------
## Invoice.ID 
##        n  missing distinct 
##     1000        0     1000 
## 
## lowest : 101-17-6199 101-81-4070 102-06-2002 102-77-2261 105-10-6182
## highest: 894-41-5205 895-03-6665 895-66-0685 896-34-0956 898-04-2717
## --------------------------------------------------------------------------------
## Branch 
##        n  missing distinct 
##     1000        0        3 
##                             
## Value          A     B     C
## Frequency    340   332   328
## Proportion 0.340 0.332 0.328
## --------------------------------------------------------------------------------
## Customer.type 
##        n  missing distinct 
##     1000        0        2 
##                         
## Value      Member Normal
## Frequency     501    499
## Proportion  0.501  0.499
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##     1000        0        2 
##                         
## Value      Female   Male
## Frequency     501    499
## Proportion  0.501  0.499
## --------------------------------------------------------------------------------
## Product.line 
##        n  missing distinct 
##     1000        0        6 
## 
## lowest : Electronic accessories Fashion accessories    Food and beverages     Health and beauty      Home and lifestyle    
## highest: Fashion accessories    Food and beverages     Health and beauty      Home and lifestyle     Sports and travel     
##                                                                                
## Value      Electronic accessories    Fashion accessories     Food and beverages
## Frequency                     170                    178                    174
## Proportion                  0.170                  0.178                  0.174
##                                                                                
## Value           Health and beauty     Home and lifestyle      Sports and travel
## Frequency                     152                    160                    166
## Proportion                  0.152                  0.160                  0.166
## --------------------------------------------------------------------------------
## Unit.price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      943        1    55.67     30.6    15.28    19.31 
##      .25      .50      .75      .90      .95 
##    32.88    55.23    77.94    93.12    97.22 
## 
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       10     0.99     5.51     3.36        1        1 
##      .25      .50      .75      .90      .95 
##        3        5        8       10       10 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    112    91    90   109   102    98   102    85    92   119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## Tax 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    15.38    12.89    1.956    3.243 
##      .25      .50      .75      .90      .95 
##    5.925   12.088   22.445   34.234   39.166 
## 
## lowest :  0.5085  0.6045  0.6270  0.6390  0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Date 
##        n  missing distinct 
##     1000        0       89 
## 
## lowest : 1/1/2019  1/10/2019 1/11/2019 1/12/2019 1/13/2019
## highest: 3/5/2019  3/6/2019  3/7/2019  3/8/2019  3/9/2019 
## --------------------------------------------------------------------------------
## Time 
##        n  missing distinct 
##     1000        0      506 
## 
## lowest : 10:00 10:01 10:02 10:03 10:04, highest: 20:52 20:54 20:55 20:57 20:59
## --------------------------------------------------------------------------------
## Payment 
##        n  missing distinct 
##     1000        0        3 
##                                               
## Value             Cash Credit card     Ewallet
## Frequency          344         311         345
## Proportion       0.344       0.311       0.345
## --------------------------------------------------------------------------------
## cogs 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    307.6    257.8    39.11    64.86 
##      .25      .50      .75      .90      .95 
##   118.50   241.76   448.91   684.68   783.33 
## 
## lowest :  10.17  12.09  12.54  12.78  13.98, highest: 973.80 975.00 985.20 989.80 993.00
## --------------------------------------------------------------------------------
## gross.margin.percentage 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0    4.762        0 
##                    
## Value      4.761905
## Frequency      1000
## Proportion        1
## --------------------------------------------------------------------------------
## gross.income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    15.38    12.89    1.956    3.243 
##      .25      .50      .75      .90      .95 
##    5.925   12.088   22.445   34.234   39.166 
## 
## lowest :  0.5085  0.6045  0.6270  0.6390  0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Rating 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       61        1    6.973    1.985    4.295    4.500 
##      .25      .50      .75      .90      .95 
##    5.500    7.000    8.500    9.400    9.700 
## 
## lowest :  4.0  4.1  4.2  4.3  4.4, highest:  9.6  9.7  9.8  9.9 10.0
## --------------------------------------------------------------------------------
## Total 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1      323    270.7    41.07    68.10 
##      .25      .50      .75      .90      .95 
##   124.42   253.85   471.35   718.91   822.50 
## 
## lowest :   10.6785   12.6945   13.1670   13.4190   14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------

From the describe function:

  • There are no missing values in this dataset

  • We have 16 Variables and 1000 Observations

# preview if there are duplicates
sales[duplicated(sales),]
##  [1] Invoice.ID              Branch                  Customer.type          
##  [4] Gender                  Product.line            Unit.price             
##  [7] Quantity                Tax                     Date                   
## [10] Time                    Payment                 cogs                   
## [13] gross.margin.percentage gross.income            Rating                 
## [16] Total                  
## <0 rows> (or 0-length row.names)

There are no duplicates in our dataset

# preview if there are any outliers
# Selecting numerical columns only
df = sales
num_cols <- unlist(lapply(df, is.numeric))        
num_df <- df[ , num_cols]                        
head(num_df)                                           
##   Unit.price Quantity     Tax   cogs gross.margin.percentage gross.income
## 1      74.69        7 26.1415 522.83                4.761905      26.1415
## 2      15.28        5  3.8200  76.40                4.761905       3.8200
## 3      46.33        7 16.2155 324.31                4.761905      16.2155
## 4      58.22        8 23.2880 465.76                4.761905      23.2880
## 5      86.31        7 30.2085 604.17                4.761905      30.2085
## 6      85.39        7 29.8865 597.73                4.761905      29.8865
##   Rating    Total
## 1    9.1 548.9715
## 2    9.6  80.2200
## 3    7.4 340.5255
## 4    8.4 489.0480
## 5    5.3 634.3785
## 6    4.1 627.6165
# Boxplots
par(mfrow = c(2, 2))
for (i in 1:length(num_df)){
  boxplot(num_df[i], main = paste('Boxplot of', names(num_df)[i]),
          ylab = 'Value')
}

There are a few outliers in Total, gross.income, Tax and cogs. We will keep them for further analysis.
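As a quick numeric check of what the boxplots show, we can count the points flagged by the 1.5 × IQR rule in each column (a minimal sketch using base R’s boxplot.stats):

# number of boxplot-rule outliers (beyond 1.5 * IQR from the quartiles) per column
sapply(num_df, function(x) length(boxplot.stats(x)$out))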

Principal Component Analysis

Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.

sales.pca <- prcomp(sales[,c(6,7,8,12,14,15,16)], center = TRUE, scale. = TRUE)
summary(sales.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4       PC5       PC6
## Standard deviation     2.2185 1.0002 0.9939 0.30001 2.981e-16 1.493e-16
## Proportion of Variance 0.7031 0.1429 0.1411 0.01286 0.000e+00 0.000e+00
## Cumulative Proportion  0.7031 0.8460 0.9871 1.00000 1.000e+00 1.000e+00
##                              PC7
## Standard deviation     9.831e-17
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

We obtain 7 principal components, each of which explains a percentage of the total variation in the dataset. PC1 explains 70.31% of the total variation, PC2 explains 14.29% and PC3 explains 14.11%. Together, the first 3 principal components explain 98.71% of the variation, so these three components capture most of the structure of the dataset.
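These percentages can also be computed directly from the fitted object (a small sketch using the component standard deviations that prcomp returns):

# proportion of variance explained: squared standard deviations over the total variance
pve <- sales.pca$sdev^2 / sum(sales.pca$sdev^2)
round(pve * 100, 2)           # percentage per component
round(cumsum(pve) * 100, 2)   # cumulative percentage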

# checking the PCA model
str(sales.pca)
## List of 5
##  $ sdev    : num [1:7] 2.22 1.00 9.94e-01 3.00e-01 2.98e-16 ...
##  $ rotation: num [1:7, 1:7] -0.292 -0.325 -0.45 -0.45 -0.45 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##   .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:7] 55.67 5.51 15.38 307.59 15.38 ...
##   ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##  $ scale   : Named num [1:7] 26.49 2.92 11.71 234.18 11.71 ...
##   ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##  $ x       : num [1:1000, 1:7] -2.005 2.306 -0.186 -1.504 -2.8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
# plots
ggbiplot(sales.pca)

From the plot, Unit.price, Rating, Quantity and gross.income are the major factors to consider in this analysis.

PC1 explains 70.3% of the total variance, meaning roughly seven-tenths of the variation in the dataset (7 variables) is encapsulated by that one principal component.

PC2 explains 14.3% of the variance.

# plotting PC3 and PC4
ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4))

We still have Unit.price, Rating, Quantity and gross.income as the main attributes.

t-SNE algorithm
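t-SNE (t-distributed Stochastic Neighbour Embedding) is a non-linear dimensionality reduction technique that embeds high-dimensional observations in two or three dimensions while preserving local neighbourhood structure, which makes it well suited to visualization.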

# curate data set for analysis
Labels <- sales$Product.line
sales$Product.line <- as.factor(sales$Product.line)
# setting up for plots
colors = rainbow(length(unique(sales$Product.line)))
names(colors) = unique(sales$Product.line)
# tsne model (Rtsne expects numeric input; non-numeric columns may need to be dropped or encoded first)
tsne <- Rtsne(sales[,-5], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1000 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.11 seconds (sparsity = 0.101260)!
## Learning embedding...
## Iteration 50: error is 59.570291 (50 iterations in 0.12 seconds)
## Iteration 100: error is 52.841815 (50 iterations in 0.11 seconds)
## Iteration 150: error is 51.800627 (50 iterations in 0.11 seconds)
## Iteration 200: error is 51.365338 (50 iterations in 0.11 seconds)
## Iteration 250: error is 51.142453 (50 iterations in 0.10 seconds)
## Iteration 300: error is 0.566525 (50 iterations in 0.11 seconds)
## Iteration 350: error is 0.406761 (50 iterations in 0.11 seconds)
## Iteration 400: error is 0.373841 (50 iterations in 0.10 seconds)
## Iteration 450: error is 0.360382 (50 iterations in 0.10 seconds)
## Iteration 500: error is 0.349284 (50 iterations in 0.11 seconds)
## Fitting performed in 1.07 seconds.
# model summary
summary(tsne)
##                     Length Class  Mode   
## N                      1   -none- numeric
## Y                   2000   -none- numeric
## costs               1000   -none- numeric
## itercosts             10   -none- numeric
## origD                  1   -none- numeric
## perplexity             1   -none- numeric
## theta                  1   -none- numeric
## max_iter               1   -none- numeric
## stop_lying_iter        1   -none- numeric
## mom_switch_iter        1   -none- numeric
## momentum               1   -none- numeric
## final_momentum         1   -none- numeric
## eta                    1   -none- numeric
## exaggeration_factor    1   -none- numeric
# Plots
plot(tsne$Y, t='n', main="tsne")
text(tsne$Y, labels=sales$Product.line, col=colors[sales$Product.line])

Conclusions and Recommendations

  • PCA is more effective on this dataset than t-SNE.

  • Carrefour should give particular weight to Unit.price, Rating, Quantity and gross.income in their marketing strategies.

Part 2: Feature Selection

This section requires you to perform feature selection using unsupervised learning methods. You will be required to perform your analysis and provide insights on the features that contribute the most information to the dataset.

Filter Methods

These methods apply a metric to score each feature; the features are then ranked by that score. Examples include dependency metrics (the ability to predict one feature from another), information metrics (such as entropy or information gain), and distance metrics (how effectively a feature separates the data).
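To illustrate the scoring idea before applying the correlation filter below, here is a minimal sketch that ranks the numeric features by the absolute value of their correlation with Total, a simple dependency metric:

# rank numeric features by |correlation| with Total (a simple dependency score)
num_only <- sales[sapply(sales, is.numeric)]
num_only <- num_only[sapply(num_only, sd) > 0]   # drop the constant gross.margin.percentage
scores <- abs(cor(num_only))["Total", ]
sort(scores[names(scores) != "Total"], decreasing = TRUE)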

sales_filter <- sales[,c(6,7,8,12,14,15,16)]
head(sales_filter)
##   Unit.price Quantity     Tax   cogs gross.income Rating    Total
## 1      74.69        7 26.1415 522.83      26.1415    9.1 548.9715
## 2      15.28        5  3.8200  76.40       3.8200    9.6  80.2200
## 3      46.33        7 16.2155 324.31      16.2155    7.4 340.5255
## 4      58.22        8 23.2880 465.76      23.2880    8.4 489.0480
## 5      86.31        7 30.2085 604.17      30.2085    5.3 634.3785
## 6      85.39        7 29.8865 597.73      29.8865    4.1 627.6165
# correlation matrix
corrmatrix <- cor(sales_filter)
corrmatrix
##                Unit.price    Quantity        Tax       cogs gross.income
## Unit.price    1.000000000  0.01077756  0.6339621  0.6339621    0.6339621
## Quantity      0.010777564  1.00000000  0.7055102  0.7055102    0.7055102
## Tax           0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## cogs          0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## gross.income  0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## Rating       -0.008777507 -0.01581490 -0.0364417 -0.0364417   -0.0364417
## Total         0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
##                    Rating      Total
## Unit.price   -0.008777507  0.6339621
## Quantity     -0.015814905  0.7055102
## Tax          -0.036441705  1.0000000
## cogs         -0.036441705  1.0000000
## gross.income -0.036441705  1.0000000
## Rating        1.000000000 -0.0364417
## Total        -0.036441705  1.0000000
# finding highly correlated variables
highlyCorrelated <- findCorrelation(corrmatrix, cutoff=0.75)
names(sales_filter[,highlyCorrelated])
## [1] "cogs"  "Total" "Tax"
  • cogs, Total and Tax are highly correlated. We will drop these variables.
sales_new <- sales_filter[-highlyCorrelated]
head(sales_new)
##   Unit.price Quantity gross.income Rating
## 1      74.69        7      26.1415    9.1
## 2      15.28        5       3.8200    9.6
## 3      46.33        7      16.2155    7.4
## 4      58.22        8      23.2880    8.4
## 5      86.31        7      30.2085    5.3
## 6      85.39        7      29.8865    4.1
# plots to preview
par(mfrow = c(1, 2))
corrplot(corrmatrix, order = "hclust")
corrplot(cor(sales_new), order = "hclust")

Embedded Methods

These methods learn which features best contribute to the accuracy of the model while the model is being created.
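The analysis below uses entropy-weighted k-means (ewkm), which learns per-cluster feature weights while it clusters. Another common embedded method is LASSO regression, whose L1 penalty shrinks the coefficients of uninformative features to exactly zero during fitting; a minimal sketch using the glmnet package (assumed installed, as it is not loaded above), with Total as the response:

library(glmnet)
x <- as.matrix(sales_filter[, setdiff(names(sales_filter), "Total")])
y <- sales_filter$Total
lasso <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 selects the LASSO penalty
coef(lasso, s = "lambda.min")         # features with non-zero coefficients were selected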

set.seed(23)
model <- ewkm(sales_filter,3, lambda = 2, maxiter = 1000)
clusplot(sales_filter, model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis for Supermarket sales')

# checking the cluster feature weights (as percentages)
round(model$weights * 100, 2)
##   Unit.price Quantity Tax cogs gross.income Rating Total
## 1          0        0   0    0            0    100     0
## 2          0        0  50    0           50      0     0
## 3          0        0  50    0           50      0     0

Conclusions and Recommendations

  • From our analysis, the most important variables for determining sales at the Carrefour supermarket are Tax, cogs, gross income, Quantity and Unit price.

  • Carrefour should build their marketing strategies around these items.

Part 3: Association Rules

This section requires you to create association rules that will allow you to identify relationships between variables in the dataset. You are provided with a separate dataset comprising groups of items purchased together (transactions). Just like in the other sections, you will also be required to provide insights from your analysis.

Loading Dataset

# loading the data set
path <- "http://bit.ly/SupermarketDatasetII"
rules <- read.transactions(path, sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions

Preview Dataset

# preview the data set 
rules
## transactions in sparse format with
##  7501 transactions (rows) and
##  119 items (columns)
# preview the data set as a dataframe
items<-as.data.frame(itemLabels(rules))
colnames(items) <- "Item"
head(items, 10) 
##                 Item
## 1            almonds
## 2  antioxydant juice
## 3          asparagus
## 4            avocado
## 5        babies food
## 6              bacon
## 7     barbecue sauce
## 8          black tea
## 9        blueberries
## 10        body spray
# summary
summary(rules)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

The most frequent items are:

  • mineral water

  • eggs

  • spaghetti

  • french fries

  • chocolate

# Plotting the top ten most frequent items
par(mfrow = c(1, 2))
itemFrequencyPlot(rules, topN = 10,col="darkgreen")

From the plot above, mineral water is the most frequently purchased item, followed by:

  • Eggs

  • Spaghetti

  • French Fries

  • Chocolate

  • Green Tea

  • Milk

  • Ground beef

  • Frozen Vegetables

  • Pancakes

# plotting items with at least 10% support (present in at least 10% of transactions)
itemFrequencyPlot(rules, support = 0.1, col = "darkred")

Model

# Building the model with the apriori function, using a minimum support of 0.001 and a confidence of 0.8
ass_rules <- apriori(rules, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Number of rules
ass_rules
## set of 74 rules

In this model, using a minimum support of 0.001 and a confidence of 0.8, we get a set of 74 rules. Let’s try increasing the confidence to 0.9 and see what happens.

ass1_rules <- apriori (rules, parameter = list(supp = 0.001, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Number of rules
ass1_rules
## set of 11 rules

We get only 11 rules. Lastly, let’s increase the support to 0.002 and lower the confidence to 0.7.

ass2_rules <- apriori (rules, parameter = list(supp = 0.002, conf = 0.7))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Number of rules
ass2_rules
## set of 11 rules

We again get 11 rules. Note:

  • Using a high minimum support makes the model lose interesting rules. For example, with a support of 0.002 and a confidence of 0.8 we get only 2 rules, compared to the 74 rules we get with a support of 0.001 and a confidence of 0.8.
ass3_rules <- apriori (rules, parameter = list(supp = 0.002, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Number of rules
ass3_rules
## set of 2 rules

Model Summary

# Let's preview a summary of our first model
summary(ass_rules)
## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.041   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001067   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001133   Median :0.8333   Median :0.001333   Median : 3.795  
##  Mean   :0.001256   Mean   :0.8504   Mean   :0.001479   Mean   : 4.823  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.877  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :12.722  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 8.500  
##  Mean   : 9.419  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##   data ntransactions support confidence
##  rules          7501   0.001        0.8
##                                                               call
##  apriori(data = rules, parameter = list(supp = 0.001, conf = 0.8))
# Preview the first 10 rules in our first model
# Order the rules by "confidence"; other ordering options include by = "lift" or by = "support".
ass_rules<-sort(ass_rules, by="confidence", decreasing=TRUE)
inspect(ass_rules[1:10])
##      lhs                        rhs                 support confidence    coverage      lift count
## [1]  {french fries,                                                                               
##       mushroom cream sauce,                                                                       
##       pasta}                 => {escalope}      0.001066524  1.0000000 0.001066524 12.606723     8
## [2]  {ground beef,                                                                                
##       light cream,                                                                                
##       olive oil}             => {mineral water} 0.001199840  1.0000000 0.001199840  4.195190     9
## [3]  {cake,                                                                                       
##       meatballs,                                                                                  
##       mineral water}         => {milk}          0.001066524  1.0000000 0.001066524  7.717078     8
## [4]  {cake,                                                                                       
##       olive oil,                                                                                  
##       shrimp}                => {mineral water} 0.001199840  1.0000000 0.001199840  4.195190     9
## [5]  {mushroom cream sauce,                                                                       
##       pasta}                 => {escalope}      0.002532996  0.9500000 0.002666311 11.976387    19
## [6]  {red wine,                                                                                   
##       soup}                  => {mineral water} 0.001866418  0.9333333 0.001999733  3.915511    14
## [7]  {eggs,                                                                                       
##       mineral water,                                                                              
##       pasta}                 => {shrimp}        0.001333156  0.9090909 0.001466471 12.722185    10
## [8]  {herb & pepper,                                                                              
##       mineral water,                                                                              
##       rice}                  => {ground beef}   0.001333156  0.9090909 0.001466471  9.252498    10
## [9]  {ground beef,                                                                                
##       pancakes,                                                                                   
##       whole wheat rice}      => {mineral water} 0.001333156  0.9090909 0.001466471  3.813809    10
## [10] {frozen vegetables,                                                                          
##       milk,                                                                                       
##       spaghetti,                                                                                  
##       turkey}                => {mineral water} 0.001199840  0.9000000 0.001333156  3.775671     9

Interpretation

Let’s interpret the 1st and the 10th rule:

  • If someone buys {french fries, mushroom cream sauce, pasta}, then there is a 100% chance they will also buy {escalope}.

  • If someone buys {frozen vegetables, milk, spaghetti, turkey}, then there is a 90% chance they will also buy {mineral water}. The sketch below shows how to pull out every rule that targets a given item.
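A small sketch doing exactly that, filtering with arules’ subset method for rules that conclude with mineral water:

# rules whose right-hand side is mineral water
water_rules <- subset(ass_rules, subset = rhs %in% "mineral water")
inspect(head(water_rules, 5))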

Conclusions and Recommendations

  • From the analysis, mineral water is the most frequently bought item, followed by eggs, spaghetti, french fries, chocolate, green tea, milk, ground beef, frozen vegetables and pancakes, in that order. These items should be prioritized in marketing campaigns.

  • When arranging the aisles in the supermarket, goods that are often bought together should be placed close together where customers can reach them easily; this will increase the sales of associated goods. The sketch below shows one way to surface such pairings.
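Here, arules’ crossTable counts how often each pair of items appears in the same basket (a small sketch on the transactions object loaded above):

# pairwise co-occurrence counts, sorted so the most frequent items come first
pairs <- crossTable(rules, measure = "count", sort = TRUE)
pairs[1:5, 1:5]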

Part 4: Anomaly Detection

You have also been requested to check whether there are any anomalies in the given sales dataset, the objective of this task being fraud detection.

Loading Dataset

sales <- read.csv("http://bit.ly/CarreFourSalesDataset")

sales$Date <- as.Date(sales$Date, format = "%m/%d/%Y")
# order the rows by date (sorting the Date column on its own would break the Date–Sales pairing)
sales <- sales[order(sales$Date), ]

sales <- as_tbl_time(sales, index = Date)

sales <- sales %>%
    as_period("daily")

Preview the Data set

# preview the dimensions
dim(sales)
## [1] 89  2

We have 89 rows/observations and 2 columns/variables

# preview head
head(sales)
## # A time tibble: 6 × 2
## # Index: Date
##   Date       Sales
##   <date>     <dbl>
## 1 2019-01-01  549.
## 2 2019-01-02  246.
## 3 2019-01-03  452.
## 4 2019-01-04  464.
## 5 2019-01-05  418.
## 6 2019-01-06  536.
# preview tail
tail(sales)
## # A time tibble: 6 × 2
## # Index: Date
##   Date       Sales
##   <date>     <dbl>
## 1 2019-03-25 361. 
## 2 2019-03-26 188. 
## 3 2019-03-27  43.9
## 4 2019-03-28 271. 
## 5 2019-03-29 244. 
## 6 2019-03-30 633.

Detecting Anomalies and Plotting

sales %>%
    time_decompose(Sales) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Conclusions

No anomalies are flagged in our data set.
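We can confirm this programmatically by keeping the decomposition as a table and filtering on the anomaly flag (the same pipeline as above, minus the plot):

# rows flagged as anomalous; an empty result confirms the plot
sales %>%
    time_decompose(Sales) %>%
    anomalize(remainder) %>%
    filter(anomaly == "Yes")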