Carrefour Sales and Marketing Strategies.

1. Research Question

Carrefour Kenya is currently undertaking a project that will inform the marketing department on the most relevant marketing strategies, namely those that result in the highest sales (total price including tax). This project analyses the dataset provided by Carrefour to create insights on how to achieve the highest sales.

2. Metric of Success

Successfully remove redundancy and any existing anomalies from the dataset.

3. Understanding the Context.

Carrefour is an international chain of retail supermarkets. It was set up in Kenya in 2016 and has performed well over the years. This project is aimed at creating insights from existing and current trends to develop marketing strategies that will enable the marketing team to achieve higher sales.

4. Recording the Experimental Design

  1. Data Loading
  2. Data Cleaning and Preprocessing
  3. Exploratory Data Analysis
  4. Implementation of the Solution
  5. Recommendations and Conclusions

5. Data Relevance.

The provided data is relevant for this study since it has been sourced from the Carrefour database and reflects current transactions.

Data Preview

Loading the libraries

library(modelr)
library(broom)
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap()    masks modelr::bootstrap()
## x dplyr::filter()       masks stats::filter()
## x data.table::first()   masks dplyr::first()
## x dplyr::lag()          masks stats::lag()
## x data.table::last()    masks dplyr::last()
## x purrr::lift()         masks caret::lift()
## x purrr::transpose()    masks data.table::transpose()

Loading the data

sales.dates <- fread('http://bit.ly/CarreFourSalesDataset')
sales.dates
##            Date     Sales
##    1:  1/5/2019  548.9715
##    2:  3/8/2019   80.2200
##    3:  3/3/2019  340.5255
##    4: 1/27/2019  489.0480
##    5:  2/8/2019  634.3785
##   ---                    
##  996: 1/29/2019   42.3675
##  997:  3/2/2019 1022.4900
##  998:  2/9/2019   33.4320
##  999: 2/22/2019   69.1110
## 1000: 2/18/2019  649.2990

Checking the first 6 entries

head(sales.dates, 6)
##         Date    Sales
## 1:  1/5/2019 548.9715
## 2:  3/8/2019  80.2200
## 3:  3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5:  2/8/2019 634.3785
## 6: 3/25/2019 627.6165

Checking the last 6 entries

tail(sales.dates, 6)
##         Date     Sales
## 1: 2/18/2019   63.9975
## 2: 1/29/2019   42.3675
## 3:  3/2/2019 1022.4900
## 4:  2/9/2019   33.4320
## 5: 2/22/2019   69.1110
## 6: 2/18/2019  649.2990

Checking the data types

str(sales.dates)
## Classes 'data.table' and 'data.frame':   1000 obs. of  2 variables:
##  $ Date : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Checking the dataset summary statistics

summary(sales.dates)
##      Date               Sales        
##  Length:1000        Min.   :  10.68  
##  Class :character   1st Qu.: 124.42  
##  Mode  :character   Median : 253.85  
##                     Mean   : 322.97  
##                     3rd Qu.: 471.35  
##                     Max.   :1042.65

Checking the size/shape of the dataframe

dim(sales.dates)
## [1] 1000    2

Data Preprocessing.

i. Completeness

This is achieved by checking for missing values and, if any exist, imputing them so that correct predictions can be made; a sketch of one possible imputation follows the checks below.

is.null(sales.dates)
## [1] FALSE
Total number of null values in the dataset
total_null <- sum(is.na(sales.dates))
total_null
## [1] 0
Checking for missing values in every column
lapply(sales.dates, function(x) sum(is.na(x)))
## $Date
## [1] 0
## 
## $Sales
## [1] 0
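
No missing values were found. Had there been any, one option is to impute the numeric column; the sketch below uses a simple median imputation (model-based imputation, e.g. via the already loaded Amelia package, is an alternative).

# A minimal sketch: median-impute Sales if it ever contained NAs (a no-op here).
if (anyNA(sales.dates$Sales)) {
  sales.dates$Sales[is.na(sales.dates$Sales)] <- median(sales.dates$Sales, na.rm = TRUE)
}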

ii. Consistency.

Consistency is achieved by removing all duplicated rows; had any been found here, they could be dropped as sketched after the check below.

duplicated_rows <- sales.dates[duplicated(sales.dates), ]
duplicated_rows
## Empty data.table (0 rows and 2 cols): Date,Sales
anyDuplicated(sales.dates)
## [1] 0
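
No duplicated rows exist in this dataset. If any had been found, data.table's unique() would drop them; a one-line sketch (a no-op here):

# A sketch: keep only distinct rows.
sales.dates <- unique(sales.dates)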

iii. Relevance.

Relevance is achieved by ensuring all the features provided for the analysis are relevant to the objective, which in this case they are.

iv. Accuracy.

Checking that all entries are correct.

Outliers

We can visualize any outliers in the dataset using boxplots; a numeric check follows the plot below.

colnames(sales.dates)
## [1] "Date"  "Sales"
  1. Sales
boxplot(sales.dates$Sales)
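
As a numeric complement to the boxplot, the sketch below counts values outside the conventional 1.5 * IQR fences; the threshold is a rule of thumb, not a property of this dataset.

# A sketch: count Sales values outside the 1.5 * IQR fences.
q <- quantile(sales.dates$Sales, probs = c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
sum(sales.dates$Sales < q[1] - fence | sales.dates$Sales > q[2] + fence)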

Anomaly Detection.

#install.packages('tidyverse')
#install.packages('tibbletime')
#install.packages('anomalize')
#install.packages('timetk')
library(tidyverse)
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
library(anomalize)
## == Use anomalize to improve your Forecasts by 50%! =============================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(timetk)
## 
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
## 
##     :=
head(sales.dates)
##         Date    Sales
## 1:  1/5/2019 548.9715
## 2:  3/8/2019  80.2200
## 3:  3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5:  2/8/2019 634.3785
## 6: 3/25/2019 627.6165
sales.dates$Date <- as.Date(sales.dates$Date, format = "%m/%d/%Y")  # 4-digit year, so %Y (with %y the dates mis-parse as 2020)
str(sales.dates)
## Classes 'data.table' and 'data.frame':   1000 obs. of  2 variables:
##  $ Date : Date, format: "2019-01-05" "2019-03-08" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Conversion to POSIXct type
sales.dates$Date <- as.POSIXct(sales.dates$Date)
sales_tbl <- as_tibble(sales.dates)
head(sales_tbl)
## # A tibble: 6 x 2
##   Date                Sales
##   <dttm>              <dbl>
## 1 2019-01-05 03:00:00 549. 
## 2 2019-03-08 03:00:00  80.2
## 3 2019-03-03 03:00:00 341. 
## 4 2019-01-27 03:00:00 489. 
## 5 2019-02-08 03:00:00 634. 
## 6 2019-03-25 03:00:00 628.
sales_tbl %>%
  time_decompose(Sales, method = "stl",  frequency = "auto", trend = "auto") %>%  
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% 
  plot_anomaly_decomposition()
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
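
The repeated "Index not ordered" notes mean the decomposition ran on the rows in file order rather than calendar order, which can distort the trend and seasonal estimates. A minimal sketch of the fix is to sort the series by Date first; the runs in this report were produced on the unsorted series, so the sorted copy below uses a new name.

# A sketch: order the series chronologically before running time_decompose().
sales_tbl_sorted <- sales_tbl %>% arrange(Date)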

Recomposing.

sales_tbl %>% 
  time_decompose(Sales, method = "stl",  frequency = "auto", trend = "auto") %>% 
  anomalize(remainder,  method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
  time_recompose() %>%  
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds

Finding Anomalies.

anomalies = sales_tbl %>% 
  time_decompose(Sales, method = "stl",  frequency = "auto", trend = "auto") %>%  
  anomalize(remainder,  method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%  
  time_recompose() %>%  
  filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
anomalies
## # A time tibble: 9 x 10
## # Index: Date
##   Date                observed season trend remainder remainder_l1 remainder_l2
##   <dttm>                 <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
## 1 2019-03-29 03:00:00     923.  -14.4 203.       734.        -738.         714.
## 2 2019-02-15 03:00:00    1043.   28.7 286.       728.        -738.         714.
## 3 2019-02-08 03:00:00    1021.   28.7 201.       791.        -738.         714.
## 4 2019-03-08 03:00:00     952.   12.7 217.       722.        -738.         714.
## 5 2019-01-30 03:00:00    1034.   10.4 209.       815.        -738.         714.
## 6 2019-03-09 03:00:00     935.  -34.6 211.       759.        -738.         714.
## 7 2019-01-15 03:00:00    1022.  -14.4 209.       828.        -738.         714.
## 8 2019-02-19 03:00:00     932.  -34.6 189.       778.        -738.         714.
## 9 2019-03-02 03:00:00    1022.  -14.4  91.6      945.        -738.         714.
## # ... with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## #   recomposed_l2 <dbl>

The dates listed above were flagged as anomalies.
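
For reporting, it can help to list only the flagged dates; a short sketch using the anomalize output above:

# A sketch: the anomalous dates, in calendar order.
sort(as.Date(anomalies$Date))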

Conclusion.

The listed anomalies should be investigated by the team, since they should not simply be treated as normal entries in the data.