Main objective
To detect fraud in sales.
You are a Data Analyst at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). The project has been divided into four parts in which you'll explore a recent marketing dataset by performing various unsupervised learning techniques and later provide recommendations based on your insights.
In this part, we are going to use the provided dataset to check whether there are any anomalies in the sales data, and to detect and plot any anomalies we find, since unusual sales values could be a sign of fraud.
Load the required packages.
library(anomalize)
## == Use anomalize to improve your Forecasts by 50%! =============================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tibble)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(dplyr)
df <- read.csv("http://bit.ly/CarreFourSalesDataset")
# Let's preview the first few rows
head(df)
## Date Sales
## 1 1/5/2019 548.9715
## 2 3/8/2019 80.2200
## 3 3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5 2/8/2019 634.3785
## 6 3/25/2019 627.6165
The dataset contains two columns: the dates on which sales were made and the corresponding sales values. Note, however, that the dates are stored as strings.
tail(df)
## Date Sales
## 995 2/18/2019 63.9975
## 996 1/29/2019 42.3675
## 997 3/2/2019 1022.4900
## 998 2/9/2019 33.4320
## 999 2/22/2019 69.1110
## 1000 2/18/2019 649.2990
dim(df)
## [1] 1000 2
The dataset has 1,000 observations of two variables: Date and Sales.
colSums(is.na(df))
## Date Sales
## 0 0
There are no missing values in the dataset.
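Beyond missing values, a quick additional sanity check (not part of the original walkthrough) is to count fully duplicated rows; a minimal sketch:
# Hypothetical extra check: rows where both Date and Sales repeat exactly
sum(duplicated(df))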
#data structure
str(df)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
The Date column is stored as a character vector and needs to be converted to a Date using as.Date().
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
# Sort the whole data frame by date so each Sales value stays paired with its date
df <- df[order(df$Date), ]
#Rechecking the structure to confirm the operation
#data structure
str(df)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : Date, format: "2019-01-01" "2019-01-01" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
library(timetk)
##
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
##
## :=
# Convert the index to POSIXct and coerce to a tibble, as expected by the anomalize workflow
df$Date <- as.POSIXct(df$Date)
df <- as_tibble(df)
# Decompose the series with STL, flag anomalous remainders with GESD, and plot the decomposition
df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.1, max_anoms = 0.5) %>%
  plot_anomaly_decomposition(ncol = 3, alpha_dots = 0.7)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
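The auto-detected frequency and trend of "11 seconds" suggest the decomposition is operating at transaction rather than daily resolution, likely because the POSIXct index contains many observations per day. An alternative worth trying, not part of the original analysis, is to aggregate sales to daily totals before decomposing; a minimal sketch (daily_sales is an illustrative name):
# Sketch: one Sales total per day, then the same decomposition pipeline
daily_sales <- df %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  ungroup()
daily_sales %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.1, max_anoms = 0.5) %>%
  plot_anomaly_decomposition(alpha_dots = 0.7)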
We then use the time_recompose() function to generate bands around the normal levels of the observed values and plot the anomalies against those bands.
df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.1, max_anoms = 0.1) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
We can go ahead and determine the exact dates on which anomalies occurred in our dataset.
# Keep only the observations flagged as anomalous
anomalies <- df %>%
  time_decompose(Sales, method = 'stl', frequency = 'auto', trend = 'auto') %>%
  anomalize(remainder, method = 'gesd', alpha = 0.05, max_anoms = 0.1) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## frequency = 11 seconds
## trend = 11 seconds
## Warning: `type_convert()` only converts columns of type 'character'.
## - `df` has no columns of type 'character'
anomalies
## # A time tibble: 20 x 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-01-06 03:00:00 868. -37.1 223. 682. -699. 670.
## 2 2019-01-10 03:00:00 923. 12.4 181. 729. -699. 670.
## 3 2019-01-19 03:00:00 830. -37.1 168. 699. -699. 670.
## 4 2019-01-20 03:00:00 942. -25.2 261. 707. -699. 670.
## 5 2019-01-28 03:00:00 820. -38.0 174. 684. -699. 670.
## 6 2019-01-31 03:00:00 1043. -9.34 276. 776. -699. 670.
## 7 2019-02-01 03:00:00 1002. -37.1 269. 771. -699. 670.
## 8 2019-02-07 03:00:00 1021. 12.5 197. 811. -699. 670.
## 9 2019-02-07 03:00:00 952. -25.2 186. 791. -699. 670.
## 10 2019-02-07 03:00:00 938. -38.0 211. 765. -699. 670.
## 11 2019-02-19 03:00:00 1034. -18.9 204. 849. -699. 670.
## 12 2019-02-25 03:00:00 935. -38.0 194. 779. -699. 670.
## 13 2019-02-25 03:00:00 874. -18.9 182. 711. -699. 670.
## 14 2019-02-26 03:00:00 867. -25.2 135. 757. -699. 670.
## 15 2019-03-01 03:00:00 915. 14.8 201. 698. -699. 670.
## 16 2019-03-04 03:00:00 1024. -38.0 387. 675. -699. 670.
## 17 2019-03-05 03:00:00 856. -38.0 171. 724. -699. 670.
## 18 2019-03-11 03:00:00 1022. -25.2 217. 831. -699. 670.
## 19 2019-03-26 03:00:00 932. 22.1 170. 740. -699. 670.
## 20 2019-03-30 03:00:00 1022. -38.0 98.8 962. -699. 670.
## # ... with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## # recomposed_l2 <dbl>
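To see how these flagged anomalies are distributed across the quarter, we could tally them by month; a short sketch using the anomalies table above (the month column name is illustrative):
# Count flagged anomalies per calendar month
anomalies %>%
  mutate(month = format(as.Date(Date), "%B")) %>%
  count(month)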
There are 20 anomalies in the Carrefour dataset, occurring between 6th January 2019 and 30th March 2019.
The flagged sales fall in all three months, January through March.
Carrefour should review the sales flagged in this period to investigate why these anomalies occurred, as such unusually high values could be a sign of fraud.