Research Question
You are a data analyst at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies, i.e. those that will result in the highest number of sales (total price including tax). The project has been divided into four parts in which you'll explore a recent marketing dataset by performing various unsupervised learning techniques and later provide recommendations based on your insights.
We need to check for anomalies in the dataset for fraud detection.
Finding anomalies in the dataset.
The dataset provided contains sales by date for Carrefour Supermarket.
Reading the data
Checking the data - data understanding
Implementing the solution
Challenge the solution
Follow up Questions
Conclusion
Recommendations
We aim to find sales anomalies in our dataset, as we have been provided with sales figures and their respective dates.
# import Tibble package
library(tibble)
# url <- http://bit.ly/CarreFourSalesDataset
# load the dataset as dataframe
df <- read.csv('http://bit.ly/CarreFourSalesDataset')
# convert dataframe to Tibble
df <- as_tibble(df)
#check data structure of our dataset
class(df)
## [1] "tbl_df" "tbl" "data.frame"
# Previewing our first 6 records
#
head(df)
## # A tibble: 6 × 2
## Date Sales
## <chr> <dbl>
## 1 1/5/2019 549.
## 2 3/8/2019 80.2
## 3 3/3/2019 341.
## 4 1/27/2019 489.
## 5 2/8/2019 634.
## 6 3/25/2019 628.
# Previewing our last 6 records
#
tail(df)
## # A tibble: 6 × 2
## Date Sales
## <chr> <dbl>
## 1 2/18/2019 64.0
## 2 1/29/2019 42.4
## 3 3/2/2019 1022.
## 4 2/9/2019 33.4
## 5 2/22/2019 69.1
## 6 2/18/2019 649.
# check shape of the dataset
dim(df)
## [1] 1000 2
Our dataset contains 1000 records and 2 columns.
# check column datatypes
sapply(df, class)
## Date Sales
## "character" "numeric"
We need to convert the Date column into a date datatype.
# inspect variable classes
str(df)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num [1:1000] 549 80.2 340.5 489 634.4 ...
# convert the Date column from character to Date class
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
# preview dataset
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-05 549.
## 2 2019-03-08 80.2
## 3 2019-03-03 341.
## 4 2019-01-27 489.
## 5 2019-02-08 634.
## 6 2019-03-25 628.
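As a quick sanity check (a minimal sketch using base R), we can confirm that the conversion succeeded, that no dates failed to parse, and what span of dates the data covers:
# confirm the Date column is now of class Date
class(df$Date)
# count parsing failures introduced by as.Date (should be 0)
sum(is.na(df$Date))
# first and last dates in the data
range(df$Date)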
The dataset was provided by the client.
# Check for outliers
# import ggplot2 for plotting
library(ggplot2)
# plot boxplot for sales
ggplot(data=df, mapping=aes(Sales))+geom_boxplot()
The only outliers are sales above roughly 1,000.
Most sales lie roughly between 125 and 470, and the median sale is about 254.
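To back up this reading of the boxplot, a quick count (a minimal sketch using base R subsetting) shows how many individual sales exceed 1,000:
# count individual sales above 1,000 (the apparent outliers in the boxplot)
sum(df$Sales > 1000)
# inspect those records
df[df$Sales > 1000, ]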
# check for missing values
colSums(is.na(df))
## Date Sales
## 0 0
Our dataset has no missing values
# check for duplicates
sum(duplicated(df))
## [1] 0
Our dataset has no duplicated rows.
# checking summary of dataset
summary(df)
## Date Sales
## Min. :2019-01-01 Min. : 10.68
## 1st Qu.:2019-01-24 1st Qu.: 124.42
## Median :2019-02-13 Median : 253.85
## Mean :2019-02-14 Mean : 322.97
## 3rd Qu.:2019-03-08 3rd Qu.: 471.35
## Max. :2019-03-30 Max. :1042.65
Sales records were collected between 1st January, 2019 and 30th March, 2019, a period of 3 months.
The maximum sale was 1042.65 and the minimum sale was 10.68.
The average sale over the three months is 322.97 and the median is 253.85; since the mean exceeds the median, the distribution of sales is right (positively) skewed.
The 25th percentile (first quartile) is 124.42 while the 75th percentile (third quartile) is 471.35.
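To support the skewness reading, we can compute the sample skewness directly in base R (a minimal sketch; a positive value indicates a right skew):
# sample skewness of sales: positive => right (positively) skewed
x <- df$Sales
mean((x - mean(x))^3) / sd(x)^3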
# Check sales interquartile range
IQR(df$Sales)
## [1] 346.9279
The interquartile range of sales is 346.9279.
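Using the Tukey/IQR rule, the whisker limits of the boxplot can be computed explicitly (a minimal sketch); any sale outside these fences is treated as an outlier:
# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q <- quantile(df$Sales, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(df$Sales), upper = q[[2]] + 1.5 * IQR(df$Sales))
fences
# sales beyond the upper fence are the boxplot outliers
sum(df$Sales > fences["upper"])
The upper fence works out to roughly 992 (471.35 + 1.5 × 346.93), consistent with the boxplot reading that only sales above about 1,000 are flagged.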
# plot line graph for sales vs day
ggplot(data = df, mapping = aes(x = Date, y = Sales, group = Date)) +
  geom_line(col = "blue") +
  geom_point() +
  scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  xlab("Day") + ylab("Sales") + ggtitle("Sales against Date")
Total daily sales were lowest on 13th February, 2019; further investigation could determine why this is the case.
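Assuming the claim refers to the total of all sales made on a given day, we can verify it with a quick aggregation (a minimal sketch using base R's aggregate):
# total sales per day
daily_sales <- aggregate(Sales ~ Date, data = df, FUN = sum)
# day with the lowest total sales
daily_sales[which.min(daily_sales$Sales), ]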
# sort dates in ascending order
df <- df[order(df$Date),]
# preview dataset
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-01 457.
## 2 2019-01-01 400.
## 3 2019-01-01 471.
## 4 2019-01-01 388.
## 5 2019-01-01 133.
## 6 2019-01-01 132.
# Group sales by Date by summing over sales
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- summarise_all(group_by(df, Date), sum)
# preview result
head(df)
## # A tibble: 6 × 2
## Date Sales
## <date> <dbl>
## 1 2019-01-01 4745.
## 2 2019-01-02 1946.
## 3 2019-01-03 2078.
## 4 2019-01-04 1624.
## 5 2019-01-05 3537.
## 6 2019-01-06 3614.
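As a quick check (a minimal sketch), the grouped data should now contain exactly one row per distinct calendar day in the three-month window:
# number of rows after aggregation vs number of distinct dates
nrow(df)
length(unique(df$Date))
# span of days covered, inclusive
as.integer(max(df$Date) - min(df$Date)) + 1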
#Load necessary packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
# performing anomaly detection
df %>%
# decompose sales column into “observed”, “season”, “trend”, and “remainder” columns.
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
# perform anomaly detection on the decomposed data using
# the remainder column via the anomalize() function,
# which adds 3 new columns: “remainder_l1” (lower limit),
# “remainder_l2” (upper limit), and “anomaly” (Yes/No flag). We use the default method = "iqr"
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
# recomposes the lower and upper bounds of the anomalies around the observed values.
time_recompose() %>%
# visualize data
plot_anomaly_decomposition() + ggtitle("20% Anomalies on Sales(alpha=0.5)")
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
The anomalies appear fairly uniformly spread from late January onwards.
# plotting anomalies of recomposed data
df %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
time_recompose() %>%
# visualize data
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5) + ggtitle("20% Anomalies on Sales(alpha=0.5)")
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
There are several sharp spikes in sales between late January and March, as well as some days of very low sales in January and February. Further investigation of what events took place during these periods could help uncover the cause of the anomalies.
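To quantify these spikes and dips, we can tabulate the flagged points by month and by the sign of the remainder (a minimal sketch re-using the same anomalize pipeline; a positive remainder corresponds to a spike, a negative one to a dip):
df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
  filter(anomaly == "Yes") %>%
  mutate(direction = ifelse(remainder > 0, "spike", "dip"),
         month = format(Date, "%B")) %>%
  count(month, direction)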
# extracting anomaly data points
df %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
time_recompose() %>%
filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 7 days
## trend = 30 days
## # A time tibble: 17 × 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2 anomaly
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2019-01-22 1705. 596. 4032. -2923. -1358. 1667. Yes
## 2 2019-01-23 5994. -216. 4022. 2188. -1358. 1667. Yes
## 3 2019-01-24 5402. -610. 4012. 2000. -1358. 1667. Yes
## 4 2019-01-31 5232. -610. 3876. 1967. -1358. 1667. Yes
## 5 2019-02-07 7228. -610. 3402. 4437. -1358. 1667. Yes
## 6 2019-02-08 5085. -351. 3345. 2091. -1358. 1667. Yes
## 7 2019-02-13 934. -216. 3101. -1951. -1358. 1667. Yes
## 8 2019-02-15 6831. -351. 3022. 4161. -1358. 1667. Yes
## 9 2019-02-17 5300. 210. 2994. 2096. -1358. 1667. Yes
## 10 2019-02-27 5859. -216. 3423. 2652. -1358. 1667. Yes
## 11 2019-03-02 6560. 696. 3529. 2335. -1358. 1667. Yes
## 12 2019-03-05 6231. 596. 3573. 2062. -1358. 1667. Yes
## 13 2019-03-09 7474. 696. 3470. 3308. -1358. 1667. Yes
## 14 2019-03-14 7215. -610. 3256. 4569. -1358. 1667. Yes
## 15 2019-03-19 5740. 596. 3157. 1988. -1358. 1667. Yes
## 16 2019-03-20 5458. -216. 3152. 2522. -1358. 1667. Yes
## 17 2019-03-26 1963. 596. 3181. -1814. -1358. 1667. Yes
## # … with 2 more variables: recomposed_l1 <dbl>, recomposed_l2 <dbl>
We are able to select the dates flagged as anomalies above; anomalies appear between 22nd January and 26th March, 2019.
We have no supporting facts explaining why the anomalies occur on these dates, as no additional data was provided by the client.
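For follow-up with the sales team, the flagged dates can be exported (a minimal sketch; the file name sales_anomalies.csv is only an assumed example):
# save the flagged dates and observed sales for follow-up (hypothetical file name)
anomalies <- df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.5, max_anoms = 0.2) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")
write.csv(as.data.frame(anomalies), "sales_anomalies.csv", row.names = FALSE)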
The data was provided by the client.
The research question was also provided by the client.
We are able to detect anomalies in our dataset from late January (22nd) onwards. No additional data was provided to the analytics team to confirm whether these points are genuine anomalies or have ordinary explanations.
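One way to challenge the solution is to check how sensitive the flagged points are to the alpha parameter. Re-running the detection with anomalize's default alpha = 0.05 (a minimal sketch) widens the IQR bands considerably, so only the most extreme points should remain flagged:
# repeat the detection with the default, more conservative alpha
df %>%
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, alpha = 0.05, max_anoms = 0.2) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")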
We recommend the following follow-up measures:

* Determine what events actually took place on these dates.
* Follow up with the sales team for more information on which products led to the sudden increases or decreases in sales during this period.
* Determine whether there were changes in how sales are made or in managerial positions, which could also influence an increase or decrease in sales.
* Confirm that the sales system is not flawed and recording erroneous sales values.
Note that more than one factor may be contributing to the presence of the anomalies.