1. Defining the question

Carrefour Kenya are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). We have been taxed have also been requested to check whether there are any anomalies in the given sales data set. The objective of this task being fraud detection.

2. Success criteria

Anormallies are detected in the sales data for the given time frame

3. Research Methodology

4. Undestanding the data

The data set provided has two columns;

Date - The date the transaction was recorded Sale- The value of the sale

Loading libraries

# loading libraries we are going to use
#
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggpubr)
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter

loading dataset

# Loading the dataset
#
sales_df <- read.csv("http://bit.ly/CarreFourSalesDataset")

previewing dataset

# Previewing the dataset
#
head(sales_df)
##        Date    Sales
## 1  1/5/2019 548.9715
## 2  3/8/2019  80.2200
## 3  3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5  2/8/2019 634.3785
## 6 3/25/2019 627.6165
# Previewing the dimensions of the dataset
#
dim(sales_df)
## [1] 1000    2

The dataset has 100o records and 2 varaibles

# Preview the dataset structure
#
str(sales_df)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...

The date variable has inappropriate data structure.

Validating the dataset

Since the data set is provided by the client Carrefour supermarket the data will be assumed to be accurate and up to date

Data cleaning

Here we are going to assign the appropriate data type to the date variable

# Converting variable date from object to date type
#
# sales_df <- mutate(sales_df, Date = as.Date(Date, format= "%m/%d/%Y"))
sales_df$Date <- as.POSIXct( sales_df$Date, format="%m/%d/%y" )

# ordering datset based on date
sales_df$Date<- sort(sales_df$Date,decreasing = FALSE)


# Previewing the date class
#
class(sales_df$Date)
## [1] "POSIXct" "POSIXt"

All columns have appropriate data types

# sort the date in descending order
#

Checking for missing values

# Checking the number of missing values per column in the data set
#
colSums(is.na(sales_df))
##  Date Sales 
##     0     0

The data set has no null values

Checking for duplicates

# finding the duplicated rows in the data set and assign to a variable 
# duplicated_rows below
#
duplicated_rows = sales_df[duplicated(sales_df),]

# Printing out the duplicated rows
duplicated_rows
## [1] Date  Sales
## <0 rows> (or 0-length row.names)

The dataset has no duplicate records

Checking for outliers

# Selecting Numeric columns
num<- select_if(sales_df, is.numeric)
colnames(num)
## [1] "Sales"
#Function to calculate the percentage of outliers in each column
outlier_vals<- function(x){
  out<- boxplot.stats(x)$out
  return((length(out)/1000)*100)
}

#Evaluating outliers per column
sapply(num,outlier_vals)
## Sales 
##   0.9

0.9% of the sales were classified as outliers

# Boxplot of sales to check for outliers
#
boxplot(sales_df$Sales)

The sales column has some outliers which shall be explored later

Univariate analysis

# describing the sales data
#
describe(num)
##       vars    n   mean     sd median trimmed    mad   min     max   range skew
## Sales    1 1000 322.97 245.89 253.85  293.91 233.78 10.68 1042.65 1031.97 0.89
##       kurtosis   se
## Sales    -0.09 7.78

The sales has a positive skew to the right. This shows it is not normally distributed

BIvariate analysis

This is a line chart representing the total sales over time

# line graph of date vs sales
#
ggplot(sales_df,aes(x=Date, y= Sales))+geom_line()

There is no pattern to the line chart.

Implementetion of solution

We shall apply Tidy Anomaly Detection

The anomalies package is a feature-rich package for performing anomaly detection. Its geared towards time series analysis, which is one of the biggest needs for understanding when anomalies occur.

# Converting dataframe to tibble
#
sales_df <- as_tibble(sales_df)
# Plotting the plot_anomaly_decomposition() to visualize out data.
# 
sales_df %>% time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>% plot_anomaly_decomposition()
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Observation

We have some beautiful plots with the first plot being overall observed data, second being season, third being trend and the final plot analyzed for anomalies. The red points indicate anomalies according to the anomalize function. However, we are not looking for this plot. We only want the anomalies plot with trend and seasonality removed.

# Plotting the data again with recomposed data.
#
sales_df %>% time_decompose(Sales) %>% anomalize(remainder) %>% time_recompose() %>%  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds

Observations

From the plots above, anomalies were observed in the data set but had no patterns to it. The obvious and prominent anormally was observed towards end of April

# Extract the anomalies
anomalies=sales_df %>% time_decompose(Sales) %>%  anomalize(remainder) %>%  time_recompose() %>%  filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
anomalies
## # A time tibble: 1 × 10
## # Index: Date
##   Date                observed season trend remainder remainder_l1 remainder_l2
##   <dttm>                 <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
## 1 2020-03-30 00:00:00    1022.  -38.0  98.8      962.        -866.         895.
## # … with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## #   recomposed_l2 <dbl>

There was one anormlally observed at the end of March.

Conclusion

In conclusion only one anormally was detected on the 20th of March 2020.

Further Questions

A) Do we have the right data

For this study the date and sales data is sufficient for us to analyses and determine which sale stood out as an anormally

B) Do we have the right question?

Yes. It is important to scan for anormalies in the transactions of a supermarket in order to investigate further unusual activity.