You are a data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies, namely those that will result in the highest number of sales (total price including tax). The project is divided into four parts in which you will explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on your insights.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## Loading required package: scales
## Loading required package: grid
library(Rtsne)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(corrplot)
## corrplot 0.92 loaded
library(wskm)
## Loading required package: latticeExtra
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
## Loading required package: fpc
library("cluster")
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ purrr::discard() masks scales::discard()
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ latticeExtra::layer() masks ggplot2::layer()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize(), Hmisc::summarize()
## ✖ tidyr::unpack() masks Matrix::unpack()
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
This section of the project entails reducing the dataset to a low-dimensional representation using the t-SNE algorithm or PCA. You will be required to perform your analysis and provide the insights gained from it.
sales <- read.csv("http://bit.ly/CarreFourDataset")
# Preview the head
head(sales)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
# preview the tail
tail(sales)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 995 652-49-6720 C Member Female Electronic accessories 60.95
## 996 233-67-5758 C Normal Male Health and beauty 40.35
## 997 303-96-2227 B Normal Female Home and lifestyle 97.38
## 998 727-02-1313 A Member Male Food and beverages 31.84
## 999 347-56-2442 A Normal Male Home and lifestyle 65.82
## 1000 849-09-3807 A Member Female Fashion accessories 88.34
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 995 1 3.0475 2/18/2019 11:40 Ewallet 60.95 4.761905
## 996 1 2.0175 1/29/2019 13:46 Ewallet 40.35 4.761905
## 997 10 48.6900 3/2/2019 17:16 Ewallet 973.80 4.761905
## 998 1 1.5920 2/9/2019 13:22 Cash 31.84 4.761905
## 999 1 3.2910 2/22/2019 15:33 Cash 65.82 4.761905
## 1000 7 30.9190 2/18/2019 13:28 Cash 618.38 4.761905
## gross.income Rating Total
## 995 3.0475 5.9 63.9975
## 996 2.0175 6.2 42.3675
## 997 48.6900 4.4 1022.4900
## 998 1.5920 7.7 33.4320
## 999 3.2910 4.1 69.1110
## 1000 30.9190 6.6 649.2990
# preview the structure of the dataset
str(sales)
## 'data.frame': 1000 obs. of 16 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
# preview the data summary
summary(sales)
## Invoice.ID Branch Customer.type Gender
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.line Unit.price Quantity Tax
## Length:1000 Min. :10.08 Min. : 1.00 Min. : 0.5085
## Class :character 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249
## Mode :character Median :55.23 Median : 5.00 Median :12.0880
## Mean :55.67 Mean : 5.51 Mean :15.3794
## 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453
## Max. :99.96 Max. :10.00 Max. :49.6500
## Date Time Payment cogs
## Length:1000 Length:1000 Length:1000 Min. : 10.17
## Class :character Class :character Class :character 1st Qu.:118.50
## Mode :character Mode :character Mode :character Median :241.76
## Mean :307.59
## 3rd Qu.:448.90
## Max. :993.00
## gross.margin.percentage gross.income Rating Total
## Min. :4.762 Min. : 0.5085 Min. : 4.000 Min. : 10.68
## 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500 1st Qu.: 124.42
## Median :4.762 Median :12.0880 Median : 7.000 Median : 253.85
## Mean :4.762 Mean :15.3794 Mean : 6.973 Mean : 322.97
## 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500 3rd Qu.: 471.35
## Max. :4.762 Max. :49.6500 Max. :10.000 Max. :1042.65
# using describe to better understand our data set
describe(sales)
## sales
##
## 16 Variables 1000 Observations
## --------------------------------------------------------------------------------
## Invoice.ID
## n missing distinct
## 1000 0 1000
##
## lowest : 101-17-6199 101-81-4070 102-06-2002 102-77-2261 105-10-6182
## highest: 894-41-5205 895-03-6665 895-66-0685 896-34-0956 898-04-2717
## --------------------------------------------------------------------------------
## Branch
## n missing distinct
## 1000 0 3
##
## Value A B C
## Frequency 340 332 328
## Proportion 0.340 0.332 0.328
## --------------------------------------------------------------------------------
## Customer.type
## n missing distinct
## 1000 0 2
##
## Value Member Normal
## Frequency 501 499
## Proportion 0.501 0.499
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 1000 0 2
##
## Value Female Male
## Frequency 501 499
## Proportion 0.501 0.499
## --------------------------------------------------------------------------------
## Product.line
## n missing distinct
## 1000 0 6
##
## lowest : Electronic accessories Fashion accessories Food and beverages Health and beauty Home and lifestyle
## highest: Fashion accessories Food and beverages Health and beauty Home and lifestyle Sports and travel
##
## Value Electronic accessories Fashion accessories Food and beverages
## Frequency 170 178 174
## Proportion 0.170 0.178 0.174
##
## Value Health and beauty Home and lifestyle Sports and travel
## Frequency 152 160 166
## Proportion 0.152 0.160 0.166
## --------------------------------------------------------------------------------
## Unit.price
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 943 1 55.67 30.6 15.28 19.31
## .25 .50 .75 .90 .95
## 32.88 55.23 77.94 93.12 97.22
##
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 10 0.99 5.51 3.36 1 1
## .25 .50 .75 .90 .95
## 3 5 8 10 10
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 112 91 90 109 102 98 102 85 92 119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## Tax
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 15.38 12.89 1.956 3.243
## .25 .50 .75 .90 .95
## 5.925 12.088 22.445 34.234 39.166
##
## lowest : 0.5085 0.6045 0.6270 0.6390 0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Date
## n missing distinct
## 1000 0 89
##
## lowest : 1/1/2019 1/10/2019 1/11/2019 1/12/2019 1/13/2019
## highest: 3/5/2019 3/6/2019 3/7/2019 3/8/2019 3/9/2019
## --------------------------------------------------------------------------------
## Time
## n missing distinct
## 1000 0 506
##
## lowest : 10:00 10:01 10:02 10:03 10:04, highest: 20:52 20:54 20:55 20:57 20:59
## --------------------------------------------------------------------------------
## Payment
## n missing distinct
## 1000 0 3
##
## Value Cash Credit card Ewallet
## Frequency 344 311 345
## Proportion 0.344 0.311 0.345
## --------------------------------------------------------------------------------
## cogs
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 307.6 257.8 39.11 64.86
## .25 .50 .75 .90 .95
## 118.50 241.76 448.91 684.68 783.33
##
## lowest : 10.17 12.09 12.54 12.78 13.98, highest: 973.80 975.00 985.20 989.80 993.00
## --------------------------------------------------------------------------------
## gross.margin.percentage
## n missing distinct Info Mean Gmd
## 1000 0 1 0 4.762 0
##
## Value 4.761905
## Frequency 1000
## Proportion 1
## --------------------------------------------------------------------------------
## gross.income
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 15.38 12.89 1.956 3.243
## .25 .50 .75 .90 .95
## 5.925 12.088 22.445 34.234 39.166
##
## lowest : 0.5085 0.6045 0.6270 0.6390 0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Rating
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 61 1 6.973 1.985 4.295 4.500
## .25 .50 .75 .90 .95
## 5.500 7.000 8.500 9.400 9.700
##
## lowest : 4.0 4.1 4.2 4.3 4.4, highest: 9.6 9.7 9.8 9.9 10.0
## --------------------------------------------------------------------------------
## Total
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 323 270.7 41.07 68.10
## .25 .50 .75 .90 .95
## 124.42 253.85 471.35 718.91 822.50
##
## lowest : 10.6785 12.6945 13.1670 13.4190 14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------
From the describe output:
There are no missing values in this dataset.
We have 16 variables and 1,000 observations.
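As a quick cross-check of the missing-value claim, the NA counts can be computed directly (a minimal sketch added for illustration; not part of the original output):
# Column-wise count of missing values; every entry should be zero
colSums(is.na(sales))
# TRUE would indicate at least one missing value anywhere in the data frame
anyNA(sales)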
# preview if there are duplicates
sales[duplicated(sales),]
## [1] Invoice.ID Branch Customer.type
## [4] Gender Product.line Unit.price
## [7] Quantity Tax Date
## [10] Time Payment cogs
## [13] gross.margin.percentage gross.income Rating
## [16] Total
## <0 rows> (or 0-length row.names)
There are no duplicates in our dataset.
# preview if there are any outliers
# Selecting numerical columns only
df = sales
num_cols <- unlist(lapply(df, is.numeric))
num_df <- df[ , num_cols]
head(num_df)
## Unit.price Quantity Tax cogs gross.margin.percentage gross.income
## 1 74.69 7 26.1415 522.83 4.761905 26.1415
## 2 15.28 5 3.8200 76.40 4.761905 3.8200
## 3 46.33 7 16.2155 324.31 4.761905 16.2155
## 4 58.22 8 23.2880 465.76 4.761905 23.2880
## 5 86.31 7 30.2085 604.17 4.761905 30.2085
## 6 85.39 7 29.8865 597.73 4.761905 29.8865
## Rating Total
## 1 9.1 548.9715
## 2 9.6 80.2200
## 3 7.4 340.5255
## 4 8.4 489.0480
## 5 5.3 634.3785
## 6 4.1 627.6165
# Boxplots
par(mfrow = c(2,2))
for (i in 1:length(num_df)){
boxplot(num_df[i], main = paste('Boxplot of', names(num_df)[i]),
ylab = 'Count')
}
There are a few outliers in Total, gross.income, Tax and cogs. We will keep them for further analysis.
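To quantify what the boxplots show, the number of points beyond the 1.5 × IQR whiskers can be counted per column (a sketch added for illustration; not part of the original output):
# Count values falling outside the 1.5 * IQR whiskers for each numeric column
sapply(num_df, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})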
Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.
# Perform PCA on the numeric columns, centering and scaling the variables
sales.pca <- prcomp(sales[,c(6,7,8,12,14,15,16)], center = TRUE, scale. = TRUE)
summary(sales.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.2185 1.0002 0.9939 0.30001 2.981e-16 1.493e-16
## Proportion of Variance 0.7031 0.1429 0.1411 0.01286 0.000e+00 0.000e+00
## Cumulative Proportion 0.7031 0.8460 0.9871 1.00000 1.000e+00 1.000e+00
## PC7
## Standard deviation 9.831e-17
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
We obtain 7 principal components, each of which explains a share of the total variation in the dataset: PC1 explains 70.31% of the total variation, PC2 14.29% and PC3 14.11%. Together, the first three principal components explain 98.71% of the variation, so these three components capture nearly all of the information in the seven variables.
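These percentages can also be extracted programmatically from the component standard deviations, and a scree plot makes the drop-off visible (a sketch, not part of the original output):
# Proportion of variance explained by each principal component
pve <- sales.pca$sdev^2 / sum(sales.pca$sdev^2)
round(pve, 4)
round(cumsum(pve), 4)
# Scree plot: variance explained against component number
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")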
# checking the PCA model
str(sales.pca)
## List of 5
## $ sdev : num [1:7] 2.22 1.00 9.94e-01 3.00e-01 2.98e-16 ...
## $ rotation: num [1:7, 1:7] -0.292 -0.325 -0.45 -0.45 -0.45 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:7] 55.67 5.51 15.38 307.59 15.38 ...
## ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## $ scale : Named num [1:7] 26.49 2.92 11.71 234.18 11.71 ...
## ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## $ x : num [1:1000, 1:7] -2.005 2.306 -0.186 -1.504 -2.8 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
# plots
ggbiplot(sales.pca)
From the biplot, Unit.price, Rating, Quantity and gross.income are the major factors to consider in this analysis.
PC1 explains 70.3% of the total variance, meaning that roughly 70% of the variation in the seven variables is captured by this single principal component.
PC2 explains 14.3% of the variance.
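The biplot reading can be confirmed numerically by inspecting the loadings of the first two components (a sketch, not part of the original output):
# Variable loadings on PC1 and PC2; larger absolute values mean a stronger
# contribution to that component
round(sales.pca$rotation[, 1:2], 3)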
# plotting PC3 and PC4
ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4))
We still have Unit.price, Rating, Quantity and gross.income as the main attributes.
# curate data set for analysis
Labels <- sales$Product.line
sales$Product.line <- as.factor(sales$Product.line)
# setting up for plots
colors = rainbow(length(unique(sales$Product.line)))
names(colors) = unique(sales$Product.line)
# tsne model
tsne <- Rtsne(sales[,-5], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1000 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.11 seconds (sparsity = 0.101260)!
## Learning embedding...
## Iteration 50: error is 59.570291 (50 iterations in 0.12 seconds)
## Iteration 100: error is 52.841815 (50 iterations in 0.11 seconds)
## Iteration 150: error is 51.800627 (50 iterations in 0.11 seconds)
## Iteration 200: error is 51.365338 (50 iterations in 0.11 seconds)
## Iteration 250: error is 51.142453 (50 iterations in 0.10 seconds)
## Iteration 300: error is 0.566525 (50 iterations in 0.11 seconds)
## Iteration 350: error is 0.406761 (50 iterations in 0.11 seconds)
## Iteration 400: error is 0.373841 (50 iterations in 0.10 seconds)
## Iteration 450: error is 0.360382 (50 iterations in 0.10 seconds)
## Iteration 500: error is 0.349284 (50 iterations in 0.11 seconds)
## Fitting performed in 1.07 seconds.
# model summary
summary(tsne)
## Length Class Mode
## N 1 -none- numeric
## Y 2000 -none- numeric
## costs 1000 -none- numeric
## itercosts 10 -none- numeric
## origD 1 -none- numeric
## perplexity 1 -none- numeric
## theta 1 -none- numeric
## max_iter 1 -none- numeric
## stop_lying_iter 1 -none- numeric
## mom_switch_iter 1 -none- numeric
## momentum 1 -none- numeric
## final_momentum 1 -none- numeric
## eta 1 -none- numeric
## exaggeration_factor 1 -none- numeric
# Plots
plot(tsne$Y, t='n', main="tsne")
text(tsne$Y, labels=sales$Product.line, col=colors[sales$Product.line])
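For a less cluttered view of the same embedding, the points could also be drawn with ggplot2 and coloured by product line (a sketch using the objects created above; the data frame tsne_df is introduced only for this sketch and is not part of the original output):
# Put the 2-D embedding into a data frame for ggplot2
tsne_df <- data.frame(Dim1 = tsne$Y[, 1], Dim2 = tsne$Y[, 2],
                      Product.line = sales$Product.line)
ggplot(tsne_df, aes(Dim1, Dim2, colour = Product.line)) +
  geom_point(alpha = 0.6) +
  labs(title = "t-SNE embedding of the sales transactions")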
PCA is more effective than t-SNE on this dataset: its components have a direct interpretation in terms of the original variables.
Carrefour should focus on Unit.price, Rating, Quantity and gross.income in their marketing strategies.
This section requires you to perform feature selection using unsupervised learning methods. You will be required to perform your analysis and provide insights on the features that contribute the most information to the dataset.
Filter methods apply a metric to score each feature and then rank the features by that score. Examples of such metrics include dependency metrics (how well one feature can be predicted from another, e.g. correlation), information metrics (such as entropy or information gain) and distance metrics (how well a feature separates the observations).
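As a simple illustration of a dependency metric (a sketch added here; the helper objects num_feats and dep_score are not part of the original analysis), each numeric feature can be ranked by the absolute value of its correlation with the target variable Total:
# Rank numeric features by |correlation with Total|; higher values indicate
# features that carry more information about the total sales value
num_feats <- sales[, c("Unit.price", "Quantity", "Tax", "cogs",
                       "gross.income", "Rating", "Total")]
dep_score <- abs(cor(num_feats)[, "Total"])
sort(dep_score[names(dep_score) != "Total"], decreasing = TRUE)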
sales_filter <- sales[,c(6,7,8,12,14,15,16)]
head(sales_filter)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 74.69 7 26.1415 522.83 26.1415 9.1 548.9715
## 2 15.28 5 3.8200 76.40 3.8200 9.6 80.2200
## 3 46.33 7 16.2155 324.31 16.2155 7.4 340.5255
## 4 58.22 8 23.2880 465.76 23.2880 8.4 489.0480
## 5 86.31 7 30.2085 604.17 30.2085 5.3 634.3785
## 6 85.39 7 29.8865 597.73 29.8865 4.1 627.6165
# correlation matrix
corrmatrix <- cor(sales_filter)
corrmatrix
## Unit.price Quantity Tax cogs gross.income
## Unit.price 1.000000000 0.01077756 0.6339621 0.6339621 0.6339621
## Quantity 0.010777564 1.00000000 0.7055102 0.7055102 0.7055102
## Tax 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## cogs 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## gross.income 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## Rating -0.008777507 -0.01581490 -0.0364417 -0.0364417 -0.0364417
## Total 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## Rating Total
## Unit.price -0.008777507 0.6339621
## Quantity -0.015814905 0.7055102
## Tax -0.036441705 1.0000000
## cogs -0.036441705 1.0000000
## gross.income -0.036441705 1.0000000
## Rating 1.000000000 -0.0364417
## Total -0.036441705 1.0000000
# finding highly correlated variables
highlyCorrelated <- findCorrelation(corrmatrix, cutoff=0.75)
names(sales_filter[,highlyCorrelated])
## [1] "cogs" "Total" "Tax"
sales_new <- sales_filter[-highlyCorrelated]
head(sales_new)
## Unit.price Quantity gross.income Rating
## 1 74.69 7 26.1415 9.1
## 2 15.28 5 3.8200 9.6
## 3 46.33 7 16.2155 7.4
## 4 58.22 8 23.2880 8.4
## 5 86.31 7 30.2085 5.3
## 6 85.39 7 29.8865 4.1
# plots to preview
par(mfrow = c(1, 2))
corrplot(corrmatrix, order = "hclust")
corrplot(cor(sales_new), order = "hclust")
Embedded methods, in contrast, learn which features contribute most to the model while the model itself is being built. Here we use entropy-weighted k-means (ewkm), which assigns a weight to every feature within each cluster.
# Build an entropy-weighted k-means model with 3 clusters
set.seed(23)
model <- ewkm(sales_filter, 3, lambda = 2, maxiter = 1000)
clusplot(sales_filter, model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis for Supermarket sales')
# checking the feature weights (as percentages)
round(model$weights * 100)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 0 0 0 0 0 100 0
## 2 0 0 50 0 50 0 0
## 3 0 0 50 0 50 0 0
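To see how the three weighted clusters differ in sales value, they could be profiled against Total (a sketch added for illustration; not part of the original output):
# Cluster sizes and mean Total per ewkm cluster
table(model$cluster)
aggregate(Total ~ cluster, data = cbind(sales_filter, cluster = model$cluster),
          FUN = mean)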
From our analysis, the most important variables for determining sales at the Carrefour supermarket are Tax, cogs, gross income, Quantity and Unit price.
Carrefour should build its marketing strategies around these variables.
This section will require that you create association rules that will allow you to identify relationships between variables in the dataset. You are provided with a separate dataset that comprises groups of items that will be associated with others. Just like in the other sections, you will also be required to provide insights for your analysis.
# loading the data set
path <- "http://bit.ly/SupermarketDatasetII"
rules <- read.transactions(path, sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
# preview the data set
rules
## transactions in sparse format with
## 7501 transactions (rows) and
## 119 items (columns)
# preview the data set as a dataframe
items<-as.data.frame(itemLabels(rules))
colnames(items) <- "Item"
head(items, 10)
## Item
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
## 4 avocado
## 5 babies food
## 6 bacon
## 7 barbecue sauce
## 8 black tea
## 9 blueberries
## 10 body spray
# summary
summary(rules)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
The most frequent items are:
mineral water
eggs
spaghetti
french fries
chocolate
# Plotting the top ten most frequent items
par(mfrow = c(1, 2))
itemFrequencyPlot(rules, topN = 10,col="darkgreen")
From the plot above, mineral water is the most frequently purchased item followed by:
Eggs
Spaghetti
French Fries
Chocolate
Green Tea
Milk
Ground beef
Frozen Vegetables
Pancakes
# plotting items with at least 10% support
itemFrequencyPlot(rules, support = 0.1,col="darkred")
# Building the model with the apriori function, using a minimum support of 0.001 and confidence of 0.8
ass_rules <- apriori (rules, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Number of rules
ass_rules
## set of 74 rules
In this model, with a minimum support of 0.001 and a confidence of 0.8, we get a set of 74 rules. Let's try increasing the confidence to 0.9 and see what happens.
ass1_rules <- apriori (rules, parameter = list(supp = 0.001, conf = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Number of rules
ass1_rules
## set of 11 rules
We get only 11 rules. Next, let's increase the support to 0.002 and lower the confidence to 0.7.
ass2_rules <- apriori (rules, parameter = list(supp = 0.002, conf = 0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Number of rules
ass2_rules
## set of 11 rules
We again get 11 rules. Finally, keeping the support at 0.002 and raising the confidence back to 0.8:
ass3_rules <- apriori (rules, parameter = list(supp = 0.002, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Number of rules
ass3_rules
## set of 2 rules
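With these stricter thresholds only two rules survive; they can be examined directly (a sketch, not part of the original output):
# Inspect the two rules mined at support = 0.002 and confidence = 0.8
inspect(sort(ass3_rules, by = "lift"))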
# Let's preview a summary of our first model
summary(ass_rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## rules 7501 0.001 0.8
## call
## apriori(data = rules, parameter = list(supp = 0.001, conf = 0.8))
# Preview the first 10 rules in our first model
# Order the rules by "confidence". Other ordering methods include (by = “lift” or by = “support”).
ass_rules<-sort(ass_rules, by="confidence", decreasing=TRUE)
inspect(ass_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {french fries,
## mushroom cream sauce,
## pasta} => {escalope} 0.001066524 1.0000000 0.001066524 12.606723 8
## [2] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.0000000 0.001066524 7.717078 8
## [4] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [5] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.9500000 0.002666311 11.976387 19
## [6] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [7] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [8] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
## [9] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 0.001466471 3.813809 10
## [10] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
Let's interpret the 1st and the 10th rules:
If a customer buys {french fries, mushroom cream sauce, pasta}, there is a 100% chance (confidence = 1.0) that they will also buy {escalope}.
If a customer buys {frozen vegetables, milk, spaghetti, turkey}, there is a 90% chance (confidence = 0.9) that they will also buy {mineral water}.
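To act on rules like these, the rule set can be filtered for a particular consequent, for example everything that leads to mineral water purchases (a sketch added for illustration; water_rules is a helper object introduced only here and is not part of the original output):
# Rules whose right-hand side is mineral water, ordered by lift
water_rules <- subset(ass_rules, subset = rhs %in% "mineral water")
inspect(head(sort(water_rules, by = "lift"), 5))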
From the analysis, mineral water is the most frequently bought item, followed by eggs, spaghetti, french fries, chocolate, green tea, milk, ground beef, frozen vegetables and pancakes, in that order. These items should be prioritized in marketing campaigns.
When arranging the aisles in the supermarket, goods that are often bought together should be placed close together so that customers can find them easily. This will increase the sales of associated goods.
You have also been requested to check whether there are any anomalies in the given sales dataset. The objective of this task is fraud detection.
sales <- read.csv("http://bit.ly/CarreFourSalesDataset")
sales$Date <- as.Date(sales$Date, format ="%m/%d/%Y")
# Order the data frame by Date so that Sales values stay paired with their dates
sales <- sales[order(sales$Date), ]
# Convert to a tbl_time object indexed by Date and collapse to one observation per day
sales <- as_tbl_time(sales, index = Date)
sales <- sales %>%
as_period("daily")
# preview the dimensions
dim(sales)
## [1] 89 2
We have 89 rows/observations and 2 columns/variables in the daily series.
# preview head
head(sales)
## # A time tibble: 6 × 2
## # Index: Date
## Date Sales
## <date> <dbl>
## 1 2019-01-01 549.
## 2 2019-01-02 246.
## 3 2019-01-03 452.
## 4 2019-01-04 464.
## 5 2019-01-05 418.
## 6 2019-01-06 536.
# preview tail
tail(sales)
## # A time tibble: 6 × 2
## # Index: Date
## Date Sales
## <date> <dbl>
## 1 2019-03-25 361.
## 2 2019-03-26 188.
## 3 2019-03-27 43.9
## 4 2019-03-28 271.
## 5 2019-03-29 244.
## 6 2019-03-30 633.
sales %>%
time_decompose(Sales) %>%
anomalize(remainder) %>%
time_recompose() %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
No anomalies are flagged in the daily sales series, so there is no evidence of fraudulent transactions in this dataset.
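The same conclusion can be checked in tabular form by keeping only the observations flagged as anomalous (a sketch, not part of the original output):
# List any dates flagged as anomalies; an empty result confirms the plot
sales %>%
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  filter(anomaly == "Yes")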