Research Question
You are a data analyst at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies, i.e. those that will result in the highest number of sales (total price including tax). The project has been divided into four parts in which you'll explore a recent marketing dataset by performing various unsupervised learning techniques and later provide recommendations based on your insights.
We need to check for anomalies in the dataset for fraud detection.
Finding anomalies in the dataset.
The dataset provided contains sales by date for Carrefour Supermarket.
Reading the data
Checking the data - data understanding
Implementing the solution
Challenge the solution
Follow up Questions
Conclusion
Recommendations
We aim to find sales anomalies in our dataset, as we have been provided with sales figures and their respective dates.
# import Tibble package
library(tibble)
# url <- http://bit.ly/CarreFourSalesDataset
# load the dataset as dataframe
df <- read.csv('http://bit.ly/CarreFourSalesDataset')
# convert dataframe to Tibble
df <- as_tibble(df)
#check data structure of our dataset
class(df)
## [1] "tbl_df" "tbl" "data.frame"
# Previewing our first 6 records
#
head(df)
## # A tibble: 6 × 2
## Date Sales
## <chr> <dbl>
## 1 1/5/2019 549.
## 2 3/8/2019 80.2
## 3 3/3/2019 341.
## 4 1/27/2019 489.
## 5 2/8/2019 634.
## 6 3/25/2019 628.
# Previewing our last 6 records
#
tail(df)
## # A tibble: 6 × 2
## Date Sales
## <chr> <dbl>
## 1 2/18/2019 64.0
## 2 1/29/2019 42.4
## 3 3/2/2019 1022.
## 4 2/9/2019 33.4
## 5 2/22/2019 69.1
## 6 2/18/2019 649.
# check shape of the dataset
dim(df)
## [1] 1000 2
Our dataset contains 1000 records and 2 columns.
# check column datatypes
sapply(df, class)
## Date Sales
## "character" "numeric"
We need to convert the Date column into a date datatype.
# inspect variable classes
str(df)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num [1:1000] 549 80.2 340.5 489 634.4 ...
# convert the Date column from character to Date class
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
# preview dataset
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-05 549.
## 2 2019-03-08 80.2
## 3 2019-03-03 341.
## 4 2019-01-27 489.
## 5 2019-02-08 634.
## 6 2019-03-25 628.
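As a quick sanity check (a minimal sketch using base R), we can confirm that the conversion succeeded, that no dates failed to parse, and what span of dates the data covers:
# confirm the Date column is now of class Date
class(df$Date)
# count parsing failures introduced by as.Date (should be 0)
sum(is.na(df$Date))
# first and last dates in the data
range(df$Date)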
The dataset was provided by the client.
# Check for outliers
# import ggplot2 for plotting
library(ggplot2)
# plot boxplot for sales
ggplot(data=df, mapping=aes(Sales))+geom_boxplot()
The only outliers are sales above roughly 1,000.
Most sales lie roughly between 125 and 470, and the median sale is about 254.
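To back up this reading of the boxplot, a quick count (a minimal sketch using base R subsetting) shows how many individual sales exceed 1,000:
# count individual sales above 1,000 (the apparent outliers in the boxplot)
sum(df$Sales > 1000)
# inspect those records
df[df$Sales > 1000, ]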
# check for missing values
colSums(is.na(df))
## Date Sales
## 0 0
Our dataset has no missing values
# check for duplicates
sum(duplicated(df))
## [1] 0
Our dataset has no duplicated rows.
# checking summary of dataset
summary(df)
## Date Sales
## Min. :2019-01-01 Min. : 10.68
## 1st Qu.:2019-01-24 1st Qu.: 124.42
## Median :2019-02-13 Median : 253.85
## Mean :2019-02-14 Mean : 322.97
## 3rd Qu.:2019-03-08 3rd Qu.: 471.35
## Max. :2019-03-30 Max. :1042.65
Sales records were collected between 1st January, 2019 and 30th March, 2019, a period of 3 months.
The maximum sale was 1042.65 and the minimum sale was 10.68.
The average sale over the three months is 322.97 and the median is 253.85; since the mean exceeds the median, the distribution of sales is right (positively) skewed.
The 25th percentile (first quartile) is 124.42 while the 75th percentile (third quartile) is 471.35.
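To support the skewness reading, we can compute the sample skewness directly in base R (a minimal sketch; a positive value indicates a right skew):
# sample skewness of sales: positive => right (positively) skewed
x <- df$Sales
mean((x - mean(x))^3) / sd(x)^3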
# Check sales interquartile range
IQR(df$Sales)
## [1] 346.9279
The interquartile range of sales is 346.9279.
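Using the Tukey/IQR rule, the whisker limits of the boxplot can be computed explicitly (a minimal sketch); any sale outside these fences is treated as an outlier:
# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q <- quantile(df$Sales, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(df$Sales), upper = q[[2]] + 1.5 * IQR(df$Sales))
fences
# sales beyond the upper fence are the boxplot outliers
sum(df$Sales > fences["upper"])
The upper fence works out to roughly 992 (471.35 + 1.5 × 346.93), consistent with the boxplot reading that only sales above about 1,000 are flagged.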
# plot line graph for sales vs day
ggplot(data = df, mapping = aes(x = Date, y = Sales, group = Date)) +
  geom_line(col = "blue") +
  geom_point() +
  scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  xlab("Day") + ylab("Sales") + ggtitle("Sales against Date")
Total daily sales were lowest on 13th February, 2019; further investigation could determine why this is the case.
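Assuming the claim refers to the total of all sales made on a given day, we can verify it with a quick aggregation (a minimal sketch using base R's aggregate):
# total sales per day
daily_sales <- aggregate(Sales ~ Date, data = df, FUN = sum)
# day with the lowest total sales
daily_sales[which.min(daily_sales$Sales), ]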
# sort dates in ascending order
df <- df[order(df$Date),]
# preview dataset
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-01 457.
## 2 2019-01-01 400.
## 3 2019-01-01 471.
## 4 2019-01-01 388.
## 5 2019-01-01 133.
## 6 2019-01-01 132.
# Group sales by Date by summing over sales
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- summarise_all(group_by(df, Date), sum)
# preview result
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-01 4745.
## 2 2019-01-02 1946.
## 3 2019-01-03 2078.
## 4 2019-01-04 1624.
## 5 2019-01-05 3537.
## 6 2019-01-06 3614.
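As a quick check (a minimal sketch), the grouped data should now contain exactly one row per distinct calendar day in the three-month window:
# number of rows after aggregation vs number of distinct dates
nrow(df)
length(unique(df$Date))
# span of days covered, inclusive
as.integer(max(df$Date) - min(df$Date)) + 1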
#Load necessary packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
# performing anomaly detection
df %>%
# decompose sales column into “observed”, “season”, “trend”, and “remainder” columns.
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
# perform anomaly detection on the decomposed data using
# the remainder column via the anomalize() function,
# which adds 3 new columns: “remainder_l1” (lower limit),
# “remainder_l2” (upper limit), and “anomaly” (Yes/No flag). We use the default method = "iqr"
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
# recomposes the lower and upper bounds of the anomalies around the observed values.
time_recompose() %>%
# visualize data
plot_anomaly_decomposition() + ggtitle("20% Anomalies on Sales(alpha=0.5)")
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
The anomalies appear fairly uniformly spread from late January onwards.
# plotting anomalies of recomposed data
df %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
time_recompose() %>%
# visualize data
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5) + ggtitle("20% Anomalies on Sales(alpha=0.5)")
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
There are several sharp spikes in sales between late January and March, as well as some days of very low sales in January and February. Further investigation of what events took place during these periods could help uncover the cause of the anomalies.
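To quantify these spikes and dips, we can tabulate the flagged points by month and by the sign of the remainder (a minimal sketch re-using the same anomalize pipeline; a positive remainder corresponds to a spike, a negative one to a dip):
df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
  filter(anomaly == "Yes") %>%
  mutate(direction = ifelse(remainder > 0, "spike", "dip"),
         month = format(Date, "%B")) %>%
  count(month, direction)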
# extracting anomaly data points
df %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
time_recompose() %>%
filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
## # A time tibble: 17 × 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2 anomaly
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2019-01-22 1705. 596. 4032. -2923. -1358. 1667. Yes
## 2 2019-01-23 5994. -216. 4022. 2188. -1358. 1667. Yes
## 3 2019-01-24 5402. -610. 4012. 2000. -1358. 1667. Yes
## 4 2019-01-31 5232. -610. 3876. 1967. -1358. 1667. Yes
## 5 2019-02-07 7228. -610. 3402. 4437. -1358. 1667. Yes
## 6 2019-02-08 5085. -351. 3345. 2091. -1358. 1667. Yes
## 7 2019-02-13 934. -216. 3101. -1951. -1358. 1667. Yes
## 8 2019-02-15 6831. -351. 3022. 4161. -1358. 1667. Yes
## 9 2019-02-17 5300. 210. 2994. 2096. -1358. 1667. Yes
## 10 2019-02-27 5859. -216. 3423. 2652. -1358. 1667. Yes
## 11 2019-03-02 6560. 696. 3529. 2335. -1358. 1667. Yes
## 12 2019-03-05 6231. 596. 3573. 2062. -1358. 1667. Yes
## 13 2019-03-09 7474. 696. 3470. 3308. -1358. 1667. Yes
## 14 2019-03-14 7215. -610. 3256. 4569. -1358. 1667. Yes
## 15 2019-03-19 5740. 596. 3157. 1988. -1358. 1667. Yes
## 16 2019-03-20 5458. -216. 3152. 2522. -1358. 1667. Yes
## 17 2019-03-26 1963. 596. 3181. -1814. -1358. 1667. Yes
## # … with 2 more variables: recomposed_l1 <dbl>, recomposed_l2 <dbl>
We are able to select the dates flagged as anomalies above; anomalies appear between 22nd January and 26th March, 2019.
We have no supporting facts explaining why the anomalies occur on these dates, as no additional data was provided by the client.
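For follow-up with the sales team, the flagged dates can be exported (a minimal sketch; the file name sales_anomalies.csv is only an assumed example):
# save the flagged dates and observed sales for follow-up (hypothetical file name)
anomalies <- df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")
write.csv(as.data.frame(anomalies), "sales_anomalies.csv", row.names = FALSE)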
The data was provided by the client.
The research question was also provided by the client.
We are able to detect anomalies in our dataset from late January (22nd) onwards. No additional data was provided to the analytics team to confirm whether these points are genuine anomalies or have ordinary explanations.
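One way to challenge the solution is to check how sensitive the flagged points are to the alpha parameter. Re-running the detection with anomalize's default alpha = 0.05 (a minimal sketch) widens the IQR bands considerably, so only the most extreme points should remain flagged:
# repeat the detection with the default, more conservative alpha
df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.05, max_anoms = 0.2) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")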
We recommend the following follow-up measures:

* Determine what events actually took place on these dates.
* Follow up with the sales team for more information on which products led to the sudden increases or decreases in sales during this period.
* Determine whether there were changes in how sales are made or in managerial positions, which could also influence an increase or decrease in sales.
* Confirm that the sales system is not flawed and recording erroneous sales values.
Note that more than one factor may be contributing to the presence of the anomalies.