Defining the context.

As a data analyst at Carrefour Kenya, you are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). The project has been divided into four parts, in which you will explore a recent marketing dataset using various unsupervised learning techniques and later provide recommendations based on your insights.

Part 4: Anomaly Detection in sales

We are going to use the provided dataset to check whether there are any anomalies in the given sales dataset.

Main objective

To detect fraud in sales.

Defining the question

i) Specifying the Data Analytic Question

Check whether there are any anomalies in the given sales dataset.

ii) Defining the Metric for Success

To detect and plot any anomalies in our dataset.

Implementing the solution

Load the required packages.

library(anomalize) 
## == Use anomalize to improve your Forecasts by 50%! =============================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(tidyverse) 
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tibble)
library(tibbletime) 
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(dplyr)

Loading the dataset.

df<- read.csv("http://bit.ly/CarreFourSalesDataset")
# Let's preview the head
head(df)
##        Date    Sales
## 1  1/5/2019 548.9715
## 2  3/8/2019  80.2200
## 3  3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5  2/8/2019 634.3785
## 6 3/25/2019 627.6165

The dataset contains two columns: the sales amounts and the dates on which those sales were made. The dates, however, are stored as strings.

tail(df)
##           Date     Sales
## 995  2/18/2019   63.9975
## 996  1/29/2019   42.3675
## 997   3/2/2019 1022.4900
## 998   2/9/2019   33.4320
## 999  2/22/2019   69.1110
## 1000 2/18/2019  649.2990

Checking the dimensions of the dataset.

dim(df)
## [1] 1000    2

It has 1000 observations of two variables: Date and Sales.

Checking for missing values.

colSums(is.na(df))
##  Date Sales 
##     0     0

There are no missing values in the dataset.

Checking the datatypes

#data structure
str(df)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...

The Date column is stored as a string, so it is converted to a Date type using as.Date(). The data frame is then ordered chronologically so it is ready for time-series decomposition.

df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
# Order the whole data frame by date (not just the Date column) so each
# sale stays paired with the date on which it occurred
df <- df[order(df$Date), ]

Rechecking the structure to confirm the operation.

#data structure
str(df)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : Date, format: "2019-01-01" "2019-01-01" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...

Convert the sales data frame into a tibble; anomalize will convert it to a tibbletime (tbl_time) object automatically when the decomposition runs.

library(timetk)
## 
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
## 
##     :=
df$Date <- as.POSIXct(df$Date)
library(tibble)
library(tibbletime) 
library(ggplot2)
df <- as_tibble(df)
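
Note that as_tibble() only creates an ordinary tibble; anomalize converts it to a tbl_time behind the scenes (see the "Converting from tbl_df to tbl_time" message in the next chunk). If an explicit tibbletime object is preferred, the chunk below is a minimal sketch using as_tbl_time() from tibbletime; the name df_time is only illustrative.

# Optional sketch: declare Date as the time index explicitly
df_time <- as_tbl_time(df, index = Date)
class(df_time)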

Anomaly Detection.

Decomposing

df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.1, max_anoms = 0.5) %>%
  plot_anomaly_decomposition(ncol = 3, alpha_dots = 0.7)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'

Recomposing

Using the time_recompose() function to generate bands around the normal levels of the observed values.

df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.1, max_anoms = 0.1) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'

We can now determine the exact dates on which anomalies occurred in our dataset.

anomalies <- df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.05, max_anoms = 0.1) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
anomalies
## # A time tibble: 20 x 10
## # Index: Date
##    Date                observed season trend remainder remainder_l1 remainder_l2
##    <dttm>                 <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
##  1 2019-01-06 03:00:00     868. -37.1  223.       682.        -699.         670.
##  2 2019-01-10 03:00:00     923.  12.4  181.       729.        -699.         670.
##  3 2019-01-19 03:00:00     830. -37.1  168.       699.        -699.         670.
##  4 2019-01-20 03:00:00     942. -25.2  261.       707.        -699.         670.
##  5 2019-01-28 03:00:00     820. -38.0  174.       684.        -699.         670.
##  6 2019-01-31 03:00:00    1043.  -9.34 276.       776.        -699.         670.
##  7 2019-02-01 03:00:00    1002. -37.1  269.       771.        -699.         670.
##  8 2019-02-07 03:00:00    1021.  12.5  197.       811.        -699.         670.
##  9 2019-02-07 03:00:00     952. -25.2  186.       791.        -699.         670.
## 10 2019-02-07 03:00:00     938. -38.0  211.       765.        -699.         670.
## 11 2019-02-19 03:00:00    1034. -18.9  204.       849.        -699.         670.
## 12 2019-02-25 03:00:00     935. -38.0  194.       779.        -699.         670.
## 13 2019-02-25 03:00:00     874. -18.9  182.       711.        -699.         670.
## 14 2019-02-26 03:00:00     867. -25.2  135.       757.        -699.         670.
## 15 2019-03-01 03:00:00     915.  14.8  201.       698.        -699.         670.
## 16 2019-03-04 03:00:00    1024. -38.0  387.       675.        -699.         670.
## 17 2019-03-05 03:00:00     856. -38.0  171.       724.        -699.         670.
## 18 2019-03-11 03:00:00    1022. -25.2  217.       831.        -699.         670.
## 19 2019-03-26 03:00:00     932.  22.1  170.       740.        -699.         670.
## 20 2019-03-30 03:00:00    1022. -38.0   98.8      962.        -699.         670.
## # ... with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## #   recomposed_l2 <dbl>

There are several anomalies in the Carrefour dataset, occurring between 6 January 2019 and 30 March 2019.
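
To see how these flagged observations are spread over time, they can be counted per calendar month. The chunk below is a minimal sketch (not run here), assuming the anomalies object created above is still in the workspace.

# Count flagged observations per calendar month
anomalies %>%
  as_tibble() %>%
  mutate(month = format(Date, "%B")) %>%
  count(month, sort = TRUE)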

Conclusion

There were several anomalies in the sales data, occurring in January, February and March 2019.

Recommendations

Carrefour should review the sales recorded between January and March 2019 to investigate why these anomalies occurred and whether any of them point to fraudulent transactions.
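
As a starting point for that review, the flagged dates can be matched back to the raw sales records. The chunk below is a minimal sketch (not run here), assuming the df and anomalies objects created above are still available; suspect_dates and suspect_sales are illustrative names.

# Dates flagged as anomalous in the decomposition above
suspect_dates <- as.Date(anomalies$Date)

# Pull the original sales rows falling on those dates, largest sales first
suspect_sales <- df %>%
  filter(as.Date(Date) %in% suspect_dates) %>%
  arrange(desc(Sales))

head(suspect_sales)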