We are data analysts at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). This project is aimed at analysing the dataset provided by Carrefour and creating insights on how to achieve the highest sales.
Be able to eliminate redundancy in the dataset and identify any existing anomalies.
Carrefour is an international chain of retail supermarkets. It was set up in Kenya in 2016 and has performed well over the years. This project is aimed at creating insights from existing and current trends to develop marketing strategies that will enable the marketing team to achieve higher sales.
The provided data is relevant for this study since it has been sourced from the Carrefour database and reflects current transactions.
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap() masks modelr::bootstrap()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::lift() masks caret::lift()
## x purrr::transpose() masks data.table::transpose()
sales.dates <- fread('http://bit.ly/CarreFourSalesDataset')
sales.dates
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## ---
## 996: 1/29/2019 42.3675
## 997: 3/2/2019 1022.4900
## 998: 2/9/2019 33.4320
## 999: 2/22/2019 69.1110
## 1000: 2/18/2019 649.2990
head(sales.dates, 6)
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## 6: 3/25/2019 627.6165
tail(sales.dates, 6)
## Date Sales
## 1: 2/18/2019 63.9975
## 2: 1/29/2019 42.3675
## 3: 3/2/2019 1022.4900
## 4: 2/9/2019 33.4320
## 5: 2/22/2019 69.1110
## 6: 2/18/2019 649.2990
str(sales.dates)
## Classes 'data.table' and 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(sales.dates)
## Date Sales
## Length:1000 Min. : 10.68
## Class :character 1st Qu.: 124.42
## Mode :character Median : 253.85
## Mean : 322.97
## 3rd Qu.: 471.35
## Max. :1042.65
dim(sales.dates)
## [1] 1000 2
This is achieved by checking for missing values and imputing any that are found, so that correct predictions can be made.
is.null(sales.dates)
## [1] FALSE
total_null <- sum(is.na(sales.dates))
total_null
## [1] 0
lapply(sales.dates, function(x) sum(is.na(x)))
## $Date
## [1] 0
##
## $Sales
## [1] 0
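No missing values were found in either column. Had any been present, a simple imputation could be applied before modelling; the sketch below is hypothetical (median imputation of Sales), since no NAs exist here.

# Hypothetical sketch: impute missing Sales with the column median (not needed here, as no NAs were found)
if (anyNA(sales.dates$Sales)) {
  sales.dates$Sales[is.na(sales.dates$Sales)] <- median(sales.dates$Sales, na.rm = TRUE)
}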
Consistency is achieved by removing all duplicated rows.
duplicated_rows <- sales.dates[duplicated(sales.dates), ]
duplicated_rows
## Empty data.table (0 rows and 2 cols): Date,Sales
anyDuplicated(sales.dates)
## [1] 0
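No duplicated rows were found. If any had existed, they could be dropped with unique(), which data.table applies row-wise; the line below is a sketch and is a no-op on this dataset.

# Sketch: drop duplicated rows, keeping the first occurrence (no effect here since none exist)
sales.dates <- unique(sales.dates)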
Relevance is achieved by ensuring that all the features provided for the analysis are relevant to the objective, which in this case they are.
We also check that all entries are correct. Any outliers in the dataset can be visualised using a boxplot.
colnames(sales.dates)
## [1] "Date" "Sales"
boxplot(sales.dates$Sales)
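As a complement to the boxplot, the whisker rule can be expressed in code to count how many Sales values fall beyond 1.5 times the interquartile range; this is a quick sketch of the same check.

# Sketch: count Sales values beyond 1.5 * IQR from the quartiles (the boxplot whisker rule)
q <- quantile(sales.dates$Sales, c(0.25, 0.75))
iqr <- IQR(sales.dates$Sales)
sum(sales.dates$Sales < q[1] - 1.5 * iqr | sales.dates$Sales > q[2] + 1.5 * iqr)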
#install.packages('tidyverse')
#install.packages('tibbletime')
#install.packages('anomalize')
#install.packages('timetk')
library(tidyverse)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(anomalize)
## == Use anomalize to improve your Forecasts by 50%! =============================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(timetk)
##
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
##
## :=
head(sales.dates)
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## 6: 3/25/2019 627.6165
# The source dates use a four-digit year (e.g. "1/5/2019"), so the format string is "%Y"
sales.dates$Date <- as.Date(sales.dates$Date, format = "%m/%d/%Y")
str(sales.dates)
## Classes 'data.table' and 'data.frame': 1000 obs. of 2 variables:
## $ Date : Date, format: "2020-01-05" "2020-03-08" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Conversion to POSIXct type
sales.dates$Date <- as.POSIXct(sales.dates$Date)
sales_tbl <- as_tibble(sales.dates)
head(sales_tbl)
## # A tibble: 6 x 2
## Date Sales
## <dttm> <dbl>
## 1 2020-01-05 03:00:00 549.
## 2 2020-03-08 03:00:00 80.2
## 3 2020-03-03 03:00:00 341.
## 4 2020-01-27 03:00:00 489.
## 5 2020-02-08 03:00:00 634.
## 6 2020-03-25 03:00:00 628.
#sales_tbl <- na.omit(sales_tbl)
head(sales_tbl)
## # A tibble: 6 x 2
## Date Sales
## <dttm> <dbl>
## 1 2020-01-05 03:00:00 549.
## 2 2020-03-08 03:00:00 80.2
## 3 2020-03-03 03:00:00 341.
## 4 2020-01-27 03:00:00 489.
## 5 2020-02-08 03:00:00 634.
## 6 2020-03-25 03:00:00 628.
sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
plot_anomaly_decomposition()
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
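The "Index not ordered" notes above indicate that the transactions are not sorted by date, which tibbletime expects for decomposition. A sketch of ordering the tibble by Date first is shown below (the hypothetical name sales_tbl_ordered is used so the results above, produced on the unordered data, are left as-is).

# Sketch: sort transactions by Date so the time index is ascending before decomposition
sales_tbl_ordered <- sales_tbl %>% arrange(Date)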
Recomposing.
sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
time_recompose() %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
Finding Anomalies.
anomalies = sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
time_recompose() %>%
filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
anomalies
## # A time tibble: 9 x 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-29 03:00:00 923. -14.4 203. 734. -738. 714.
## 2 2020-02-15 03:00:00 1043. 28.7 286. 728. -738. 714.
## 3 2020-02-08 03:00:00 1021. 28.7 201. 791. -738. 714.
## 4 2020-03-08 03:00:00 952. 12.7 217. 722. -738. 714.
## 5 2020-01-30 03:00:00 1034. 10.4 209. 815. -738. 714.
## 6 2020-03-09 03:00:00 935. -34.6 211. 759. -738. 714.
## 7 2020-01-15 03:00:00 1022. -14.4 209. 828. -738. 714.
## 8 2020-02-19 03:00:00 932. -34.6 189. 778. -738. 714.
## 9 2020-03-02 03:00:00 1022. -14.4 91.6 945. -738. 714.
## # ... with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## # recomposed_l2 <dbl>
The dates listed above contain anomalies.
The team should investigate these anomalies rather than treating them as normal entries in the data.
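For a quick hand-off to the marketing team, the anomalous dates and their observed sales can be pulled out of the table above; a short sketch:

# Sketch: extract the anomalous dates and observed sales for follow-up
anomalies %>%
  select(Date, observed) %>%
  arrange(Date)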