I am a data analyst at Carrefour Kenya, currently working on a project to inform the marketing department about the marketing strategies most likely to produce the highest sales (total price including tax). I’ll explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on my insights. In this part, I check whether there are any anomalies in the sales dataset, with fraud detection as the objective.
A sales dataset has been provided for anomaly detection. We begin by loading and previewing the dataset.
Install the anomalize and tibbletime packages.
install.packages("anomalize")
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("tibbletime")
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
Let’s load the libraries we’ll need for this task.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
Loading the tibbletime and dplyr libraries.
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
sales <- read.csv("http://bit.ly/CarreFourSalesDataset")
sales$Date <- as.Date(sales$Date, format ="%m/%d/%Y")
sales$Date <- sort(sales$Date, decreasing = FALSE)
sales <- as_tbl_time(sales, index = Date)
sales <- sales %>%
  as_period("daily")
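One point worth checking in the loading step: `sort()` applied to `sales$Date` alone reorders only that column, so the other columns stay in their original row order. A minimal sketch of ordering whole rows instead is shown here, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset). The last step, summing all transactions per day, is an assumption about what a daily series could look like, not what the code above does.

```r
# Hypothetical toy data in the same shape as the sales file (values are made up)
toy <- data.frame(
  Date  = c("1/3/2019", "1/1/2019", "1/2/2019", "1/1/2019"),
  Sales = c(30, 10, 20, 15)
)

# Parse month/day/year strings into Date objects
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")

# Reorder entire rows by date so each Sales value stays with its own date
toy_sorted <- toy[order(toy$Date), ]

# Optional (an assumption, not the original method): total sales per day
daily <- aggregate(Sales ~ Date, data = toy_sorted, FUN = sum)
print(daily)
```

Applying this ordering to the real data could change which Sales value is paired with each date, so the outputs shown in this document would not necessarily be reproduced.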
Checking the dataset’s dimensions
dim(sales)
## [1] 89 2
Let’s preview the top of our dataset
head(sales)
## # A time tibble: 6 × 2
## # Index: Date
## Date Sales
## <date> <dbl>
## 1 2019-01-01 549.
## 2 2019-01-02 246.
## 3 2019-01-03 452.
## 4 2019-01-04 464.
## 5 2019-01-05 418.
## 6 2019-01-06 536.
Checking the bottom of our dataset
tail(sales)
## # A time tibble: 6 × 2
## # Index: Date
## Date Sales
## <date> <dbl>
## 1 2019-03-25 361.
## 2 2019-03-26 188.
## 3 2019-03-27 43.9
## 4 2019-03-28 271.
## 5 2019-03-29 244.
## 6 2019-03-30 633.
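The previews show dates running from 2019-01-01 to 2019-03-30, and the 89 rows match the number of calendar days in that range, which suggests there is one row per day with no gaps. A small base R sketch of that check follows; the `observed` vector is a hypothetical example with one day removed, used only to show how a gap would surface.

```r
# Dates taken from the head() and tail() previews above
first_day <- as.Date("2019-01-01")
last_day  <- as.Date("2019-03-30")

# Every calendar day in that range
all_days <- seq(first_day, last_day, by = "day")
length(all_days)  # 89

# Hypothetical observed dates with the 10th day dropped, to illustrate a gap
observed <- all_days[-10]

# Days in the full range that are absent from the observed dates
missing_days <- all_days[!all_days %in% observed]
print(missing_days)
```

On the real data, `observed` would be replaced by `sales$Date`.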
Detecting anomalies. We decompose the series, flag anomalies in the remainder, recompose the limits, and plot the result to visualize our data.
library(anomalize)
library(dplyr)
sales %>%
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
With the default settings, no points are flagged as anomalies in our dataset.
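To make the flagging step more concrete, here is a rough base R sketch of the interquartile range rule that `anomalize()` documents as its default method, where the 25%/75% band is widened by a factor of 0.15 / alpha (3 at the default alpha of 0.05). The `remainder` values are made up for illustration, and the sketch leaves out other details of the package, such as the `max_anoms` cap, so it is an approximation of the idea and not the package's implementation.

```r
# Hypothetical remainder values (made up for illustration)
remainder <- c(-12, 5, 3, -7, 9, -2, 4, -5, 1, 250)

alpha      <- 0.05          # default alpha documented for anomalize()
iqr_factor <- 0.15 / alpha  # about 3 at the default alpha

# 25th and 75th percentiles and the spread between them
q   <- quantile(remainder, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Widen the quartile band by iqr_factor times the IQR on each side
lower <- q[1] - iqr_factor * iqr
upper <- q[2] + iqr_factor * iqr

# Points outside the band are flagged
flag <- ifelse(remainder < lower | remainder > upper, "Yes", "No")
print(data.frame(remainder, flag))
```

A smaller alpha widens the band and flags fewer points; a larger alpha narrows it and flags more.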