Carrefour Kenya are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). We have been taxed have also been requested to check whether there are any anomalies in the given sales data set. The objective of this task being fraud detection.
Anormallies are detected in the sales data for the given time frame
The data set provided has two columns;
Date - The date the transaction was recorded Sale- The value of the sale
# loading libraries we are going to use
#
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggpubr)
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
# Loading the dataset
#
sales_df <- read.csv("http://bit.ly/CarreFourSalesDataset")
# Previewing the dataset
#
head(sales_df)
## Date Sales
## 1 1/5/2019 548.9715
## 2 3/8/2019 80.2200
## 3 3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5 2/8/2019 634.3785
## 6 3/25/2019 627.6165
# Previewing the dimensions of the dataset
#
dim(sales_df)
## [1] 1000 2
The dataset has 100o records and 2 varaibles
# Preview the dataset structure
#
str(sales_df)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
The date variable has inappropriate data structure.
Since the data set is provided by the client Carrefour supermarket the data will be assumed to be accurate and up to date
Here we are going to assign the appropriate data type to the date variable
# Converting variable date from object to date type
#
# sales_df <- mutate(sales_df, Date = as.Date(Date, format= "%m/%d/%Y"))
sales_df$Date <- as.POSIXct( sales_df$Date, format="%m/%d/%y" )
# ordering datset based on date
sales_df$Date<- sort(sales_df$Date,decreasing = FALSE)
# Previewing the date class
#
class(sales_df$Date)
## [1] "POSIXct" "POSIXt"
All columns have appropriate data types
# sort the date in descending order
#
# Checking the number of missing values per column in the data set
#
colSums(is.na(sales_df))
## Date Sales
## 0 0
The data set has no null values
# finding the duplicated rows in the data set and assign to a variable
# duplicated_rows below
#
duplicated_rows = sales_df[duplicated(sales_df),]
# Printing out the duplicated rows
duplicated_rows
## [1] Date Sales
## <0 rows> (or 0-length row.names)
The dataset has no duplicate records
# Selecting Numeric columns
num<- select_if(sales_df, is.numeric)
colnames(num)
## [1] "Sales"
#Function to calculate the percentage of outliers in each column
outlier_vals<- function(x){
out<- boxplot.stats(x)$out
return((length(out)/1000)*100)
}
#Evaluating outliers per column
sapply(num,outlier_vals)
## Sales
## 0.9
0.9% of the sales were classified as outliers
# Boxplot of sales to check for outliers
#
boxplot(sales_df$Sales)
The sales column has some outliers which shall be explored later
# describing the sales data
#
describe(num)
## vars n mean sd median trimmed mad min max range skew
## Sales 1 1000 322.97 245.89 253.85 293.91 233.78 10.68 1042.65 1031.97 0.89
## kurtosis se
## Sales -0.09 7.78
The sales has a positive skew to the right. This shows it is not normally distributed
This is a line chart representing the total sales over time
# line graph of date vs sales
#
ggplot(sales_df,aes(x=Date, y= Sales))+geom_line()
There is no pattern to the line chart.
We shall apply Tidy Anomaly Detection
The anomalies package is a feature-rich package for performing anomaly detection. Its geared towards time series analysis, which is one of the biggest needs for understanding when anomalies occur.
# Converting dataframe to tibble
#
sales_df <- as_tibble(sales_df)
# Plotting the plot_anomaly_decomposition() to visualize out data.
#
sales_df %>% time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>% anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Observation
We have some beautiful plots with the first plot being overall observed data, second being season, third being trend and the final plot analyzed for anomalies. The red points indicate anomalies according to the anomalize function. However, we are not looking for this plot. We only want the anomalies plot with trend and seasonality removed.
# Plotting the data again with recomposed data.
#
sales_df %>% time_decompose(Sales) %>% anomalize(remainder) %>% time_recompose() %>% plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
Observations
From the plots above, anomalies were observed in the data set but had no patterns to it. The obvious and prominent anormally was observed towards end of April
# Extract the anomalies
anomalies=sales_df %>% time_decompose(Sales) %>% anomalize(remainder) %>% time_recompose() %>% filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
anomalies
## # A time tibble: 1 × 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-30 00:00:00 1022. -38.0 98.8 962. -866. 895.
## # … with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## # recomposed_l2 <dbl>
There was one anormlally observed at the end of March.
In conclusion only one anormally was detected on the 20th of March 2020.
For this study the date and sales data is sufficient for us to analyses and determine which sale stood out as an anormally
Yes. It is important to scan for anormalies in the transactions of a supermarket in order to investigate further unusual activity.