Are there fraudulent entries in the sales records?
Correct identification of anomalies within the provided data.
Carrefour Kenya was undertaking a project to inform its marketing department about the presence of fraudulent sales. They provided a data set from which anomalies were to be detected, if any existed.
Correct detection of anomalies was important because under-reporting of sales is a crime, since it results in under-payment of taxes. In addition, both under-reporting and over-reporting of sales can lead to incorrect restocking of goods, causing product shortages or overstocking respectively, which can result in heavy losses.
The data provided should contain the real records in order for the anomaly detection insights to be useful to the client.
# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library(plyr) # Data manipulation
library(psych) # Descriptive statistics
library(coda) # Output analysis and diagnostics
library(base) # Date-time conversion
library(ggplot2) # Plotting Library
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(moments) # Measures of distribution
library(ggcorrplot) # Correlation plotting
library(tidyverse, quietly = T)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ dplyr::between() masks data.table::between()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ✖ purrr::transpose() masks data.table::transpose()
library(tibbletime) # Time-aware tibbles
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(anomalize) # Anomaly detection
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(timetk) # Time series toolkit
##
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
##
## :=
# Loading data set
part4 <- read.csv('dataset_4.csv')
# Data set records summary
records.summary <- function(data){
  cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
}
# Data set summary
records.summary(part4)
## Number of rows = 1000 and columns = 2
# Top data set preview
head(part4)
## Date Sales
## 1 1/5/2019 548.9715
## 2 3/8/2019 80.2200
## 3 3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5 2/8/2019 634.3785
## 6 3/25/2019 627.6165
# Bottom data set preview
tail(part4)
## Date Sales
## 995 2/18/2019 63.9975
## 996 1/29/2019 42.3675
## 997 3/2/2019 1022.4900
## 998 2/9/2019 33.4320
## 999 2/22/2019 69.1110
## 1000 2/18/2019 649.2990
The data set contains only the Date and Sales columns, so there are no identifier columns to drop. The Date column will be changed to a date data type.
# Data types for the data set.
str(part4)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
The date variable will be changed to a date type.
# Unique time values
head(unique(part4$Date), 10)
## [1] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" "2/8/2019" "3/25/2019"
## [7] "2/25/2019" "2/24/2019" "1/10/2019" "2/20/2019"
The format is ‘%m/%d/%Y’.
# Data type conversion to Date
# Alternative: part4$Date <- strptime(part4$Date, format = '%m/%d/%Y') # POSIXlt
part4$Date <- as.Date(part4$Date, '%m/%d/%Y')
# Confirming changes
cat('Class =', class(part4$Date), '\n')
## Class = Date
# Preview
head(part4$Date)
## [1] "2019-01-05" "2019-03-08" "2019-03-03" "2019-01-27" "2019-02-08"
## [6] "2019-03-25"
The column data type has been changed to Date.
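As an additional sanity check (not part of the original steps), any dates that failed to parse would now appear as NA, so counting them confirms the conversion worked across all rows; a minimal sketch:
# Sanity check: rows that failed to parse would be NA after conversion
cat('Unparsed dates =', sum(is.na(part4$Date)), '\n')
# Range of dates covered by the records
range(part4$Date)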
The data sets have been provided by the client, therefore, external data set validation will not be used.
# Data set columns
colnames(part4)
## [1] "Date" "Sales"
Both columns will be kept in order to analyze the seasonality, trend and the presence and absence of anomalies.
# Missing values
colSums(is.na(part4))
## Date Sales
## 0 0
There are no missing values in the data set.
# Checking for duplicates
sum(duplicated(part4))
## [1] 0
There are no duplicates in the data set.
Checking the uniformity of column names and values.
# Column names
colnames(part4)
## [1] "Date" "Sales"
The column names are uniform, therefore, they will not be renamed.
# Selecting numerical columns
num <- unlist(lapply(part4, is.numeric))
# Numeric data frame (drop = FALSE keeps the result as a data frame)
num_df <- part4[, num, drop = FALSE]
# Number of numeric columns
length(names(num_df))
## [1] 1
# Box plot
par(mar = c(1, 5, 2, 2))
boxplot(num_df, main = 'Boxplot of Sales',
        ylab = 'Sales')
The outliers will be retained for anomaly detection.
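For reference, the boxplot whiskers above follow Tukey's 1.5 x IQR rule, which is also the idea behind the IQR method used by anomalize later on. A minimal sketch that counts the points flagged by that rule (the 1.5 multiplier is the boxplot default, not a value taken from this analysis):
# Tukey fences: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR (boxplot default)
q <- quantile(part4$Sales, probs = c(0.25, 0.75))
iqr <- IQR(part4$Sales)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
cat('Points outside the fences =', sum(part4$Sales < lower | part4$Sales > upper), '\n')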
# Year
unique(format(part4$Date, format = '%Y'))
## [1] "2019"
All records are from 2019.
# Months
unique(format(part4$Date, format = '%m'))
## [1] "01" "03" "02"
The data set only contains records for three months: January to March 2019.
# Copy of the data set.
df <- data.frame(part4)
# Creating month and day columns.
df$Month <- as.factor(format(part4$Date, format = '%m'))
df$Day <- as.integer(format(part4$Date, format = '%d'))
df$Weekday <- as.factor(wday(df$Date)) # wday() from data.table: 1 = Sunday
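For easier reading, the numeric weekday codes could also be mapped to names; a small optional sketch (not part of the original pipeline), assuming an English locale for base R's weekdays():
# Optional: human-readable weekday names, ordered Sunday to Saturday
df$WeekdayName <- factor(weekdays(df$Date),
                         levels = c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
                                    'Thursday', 'Friday', 'Saturday'))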
Function definition
# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend = NULL, colors = NULL, method, col2 = NULL){
  if (method == 'count'){
    ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
      ggtitle(paste(title, 'Frequency Plot')) +
      theme(plot.title = element_text(hjust = 0.5)) +
      scale_fill_manual(legend, values = colors)
  } else if (method == 'bar'){
    ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) +
      geom_bar(stat = 'identity') +
      ggtitle(paste(title, 'Bar Plot')) +
      theme(plot.title = element_text(hjust = 0.5)) +
      scale_fill_manual(legend, values = colors)
  }
}
Months
# Bar plot
bar.plt(df, Month, title = 'Month Column', method = 'count',
        legend = 'Month',
        colors = c('01' = '#0398dc', '02' = '#026592', '03' = '#001118'))
January has the highest frequency, while February has the lowest.
Day of the Week
unique(df$Weekday)
## [1] 7 6 1 2 5 4 3
## Levels: 1 2 3 4 5 6 7
# Bar plot
bar.plt(df, Weekday, title = 'Weekday Column', method = 'count',
        legend = 'Weekday',
        colors = c('1' = '#03a9f4', '2' = '#0398dc', '3' = '#0287c3',
                   '4' = '#0276AB', '5' = '#026592', '6' = '#02557A',
                   '7' = '#014462'))
Sunday, Wednesday and Thursday had the highest frequency. Tuesday had the lowest.
# Mode function
mode <- function(col, data) {
  unique.value <- unique(data[, col])
  unique.value[which.max(tabulate(match(data[, col], unique.value)))]
}
central.tendency <- function(col, data){
  cat('Measures of Central Tendency \n')
  # Mean
  cat('Mean = ', mean(data[, col]), '\n')
  # Median
  cat('Median = ', median(data[, col]), '\n')
  # Mode
  cat('Mode = ', mode(col, data), '\n')
}
dispersion <- function(col, data){
  # Range
  cat('Range = ', min(data[, col]), '-', max(data[, col]), '\n')
  # IQR
  cat('IQR = ', IQR(data[, col]), '\n')
  # Variance
  cat('Variance = ', var(data[, col]), '\n')
  # Standard Deviation
  cat('Standard Deviation = ', sd(data[, col]), '\n')
  # Skewness
  cat('Skewness = ', skewness(data[, col]), '\n')
  # Kurtosis
  cat('Kurtosis = ', kurtosis(data[, col]), '\n')
}
# Column names
colnames(df)
## [1] "Date" "Sales" "Month" "Day" "Weekday"
# Measures of central tendency
central.tendency(colnames(df)[2], df)
## Measures of Central Tendency
## Mean = 322.9667
## Median = 253.848
## Mode = 829.08
The average sale value is approximately 323, above the median of about 254, which is consistent with the positive skew seen below.
# Measures of dispersion
dispersion(colnames(df)[2], df)
## Range = 10.6785 - 1042.65
## IQR = 346.9279
## Variance = 60459.6
## Standard Deviation = 245.8853
## Skewness = 0.8912304
## Kurtosis = 2.91253
The distribution has a moderate positive skew (skewness of about 0.89) and is slightly platykurtic (kurtosis of about 2.91, just below 3).
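For transparency, the skewness and kurtosis reported by the moments package are the standardized third and fourth central moments (with population denominators); a minimal sketch reproducing them by hand, which should match the values above:
# Standardized moments computed manually (population denominators, as in 'moments')
x <- df$Sales
m <- mean(x)
n <- length(x)
s2 <- sum((x - m)^2) / n # population variance
cat('Skewness =', (sum((x - m)^3) / n) / s2^(3/2), '\n')
cat('Kurtosis =', (sum((x - m)^4) / n) / s2^2, '\n')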
# Sales Histogram
hist(df$Sales,
main = 'Histogram of Sales',
xlab = 'Sales')
Lower sales have a higher frequency.
Days
# Sales Histogram
hist(df$Day,
main = 'Histogram of Days',
xlab = 'Day')
The frequency of records reduces towards the end of the month, partly because days 29 to 31 only occur in January and March (February 2019 had 28 days).
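A quick cross-check of that explanation (a small addition to the original analysis): counting how many records fall on days beyond the 28th in each month.
# Records on days 29-31, by month (February 2019 contributes none)
table(df$Month[df$Day > 28])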
Summary
Sales
Sales Trend
# Line plot function
line.plt <- function(col1, col2, data, title){
  ggplot(data, aes(x = {{col1}}, y = {{col2}})) +
    geom_line(color = '#014462', size = 1) +
    ggtitle(paste(title, 'Line Plot')) +
    theme(plot.title = element_text(hjust = 0.5))
}
# Trend
line.plt(data = df, col1 = Date, col2 = Sales,
title = 'Sales Trend')
Pre-Processing
# Ordering the records by date (ordering whole rows keeps each sale paired with its date)
part4 <- part4[order(part4$Date), ]
# Converting the data to a time-aware tibble indexed by Date
part4 <- as_tbl_time(part4, index = Date)
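Before decomposing, it is worth confirming that the index is ordered and seeing how many distinct days the records cover, since the decomposition operates on one aggregated value per day; a minimal sketch (an addition to the original steps):
# The index should be non-decreasing, with one aggregated value per distinct day
cat('Index ordered =', !is.unsorted(part4$Date), '\n')
cat('Distinct days =', length(unique(part4$Date)), '\n')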
Anomaly Detection
# Detection
anomalies <- part4 %>%
group_by(Date) %>%
summarise(Sales = sum(Sales)) %>%
time_decompose(Sales, merge = T, method = 'stl') %>%
anomalize(remainder, method = 'iqr', alpha = 0.2) %>%
time_recompose()
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# View result
anomalies %>% glimpse()
## Rows: 89
## Columns: 11
## $ Date <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01…
## $ Sales <dbl> 4343.682, 3260.827, 1920.250, 2837.614, 4007.829, 2742.7…
## $ observed <dbl> 4343.682, 3260.827, 1920.250, 2837.614, 4007.829, 2742.7…
## $ season <dbl> 340.12150, -136.97236, -498.88780, -452.66977, 566.26578…
## $ trend <dbl> 3211.772, 3260.846, 3309.921, 3358.996, 3404.445, 3449.8…
## $ remainder <dbl> 791.78898, 136.95367, -890.78257, -68.71128, 37.11856, -…
## $ remainder_l1 <dbl> -2006.696, -2006.696, -2006.696, -2006.696, -2006.696, -…
## $ remainder_l2 <dbl> 2240.483, 2240.483, 2240.483, 2240.483, 2240.483, 2240.4…
## $ anomaly <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "…
## $ recomposed_l1 <dbl> 1545.1969, 1117.1777, 804.3369, 899.6296, 1964.0143, 146…
## $ recomposed_l2 <dbl> 5792.376, 5364.357, 5051.516, 5146.809, 6211.194, 5709.9…
The automatically selected frequency is 7 days and the trend is 30 days.
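A quick tally of how many of the 89 daily totals were flagged (a small addition, using dplyr's count()):
# Number of days flagged as anomalous vs. normal
anomalies %>% as_tibble() %>% count(anomaly)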
# Anomalies plot
anomalies %>% plot_anomalies(time_recomposed = TRUE, alpha_dots = 0.5) +
ggtitle('Anomalies Plot')
# Anomaly decomposition plot
anomalies %>%
  plot_anomaly_decomposition() +
  ggtitle("Decomposition of Anomalies Plot")
Adjusting Trend and Seasonality
# Default settings
auto <- anomalies %>% plot_anomaly_decomposition() +
ggtitle('Frequency and Trend = Auto')
auto
# Logical frequency
get_time_scale_template()
## # A tibble: 8 × 3
## time_scale frequency trend
## <chr> <chr> <chr>
## 1 second 1 hour 12 hours
## 2 minute 1 day 14 days
## 3 hour 1 day 1 month
## 4 day 1 week 3 months
## 5 week 1 quarter 1 year
## 6 month 1 year 5 years
## 7 quarter 1 year 10 years
## 8 year 5 years 30 years
The frequency being used is 1 week (7 days), and the trend is 30 days, roughly 1 month; the template's 3-month day-scale trend was likely shortened because the records only span three months.
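As an alternative to passing trend = '14 days' to each time_decompose() call (as done below), anomalize's time scale template could be edited globally; a hedged sketch of that approach, assuming the default template is restored afterwards:
# Alternative: override the day-scale trend in the global template
tpl_default <- get_time_scale_template() # keep a copy of the defaults
tpl <- tpl_default
tpl$trend[tpl$time_scale == 'day'] <- '14 days'
set_time_scale_template(tpl) # day-scale decompositions now default to a 14 day trend
# ... run time_decompose(Sales) here ...
set_time_scale_template(tpl_default) # restore the defaults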
# Changing the trend to 14 days, as the data set only contains values from
# January to March
half <- part4 %>%
group_by(Date) %>%
summarise(Sales = sum(Sales)) %>%
time_decompose(Sales, trend = '14 days') %>%
anomalize(remainder, alpha = 0.2) %>%
time_recompose() %>%
plot_anomaly_decomposition() +
ggtitle('Trend = 14 days(Local)')
## frequency = 7 days
## trend = 14 days
half
Extracting Anomalous Data Points
# Anomalies
part4 %>%
group_by(Date) %>%
summarise(Sales = sum(Sales)) %>%
time_decompose(Sales, trend = '14 days') %>%
anomalize(remainder, alpha = 0.2) %>%
time_recompose() %>%
filter(anomaly == 'Yes')
## frequency = 7 days
## trend = 14 days
## # A time tibble: 11 × 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2 anomaly
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2019-01-08 7893. 368. 3340. 4185. -1639. 1919. Yes
## 2 2019-01-22 1886. 368. 3821. -2303. -1639. 1919. Yes
## 3 2019-01-31 5650. -536. 3469. 2718. -1639. 1919. Yes
## 4 2019-02-07 7282. -536. 3035. 4783. -1639. 1919. Yes
## 5 2019-02-15 6343. -426. 2489. 4280. -1639. 1919. Yes
## 6 2019-02-23 1622. 613. 2683. -1673. -1639. 1919. Yes
## 7 2019-02-25 6385. 150. 2811. 3423. -1639. 1919. Yes
## 8 2019-03-09 6566. 613. 4025. 1928. -1639. 1919. Yes
## 9 2019-03-14 4909. -536. 3106. 2339. -1639. 1919. Yes
## 10 2019-03-17 1156. 46.7 2767. -1658. -1639. 1919. Yes
## 11 2019-03-20 6613. -215. 2753. 4076. -1639. 1919. Yes
## # … with 2 more variables: recomposed_l1 <dbl>, recomposed_l2 <dbl>
Analysis Summary
Eleven anomalies were found: 3 are cases of under-reported sales and 8 are over-stated sales.
Each month has a single case of under-reported sales.
The trend window was reduced from 30 days to 14 days because of the short time frame of the records. This shifted some anomalies, changing the monthly counts from Jan = 4, Feb = 4, Mar = 3 to Jan = 3, Feb = 4, Mar = 4.
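The under/over split quoted above comes from the sign of the remainder on the flagged dates; a minimal sketch reproducing that tally, reusing the 14-day-trend pipeline from the extraction step:
# Under-reported (negative remainder) vs. over-stated (positive remainder) anomalies
part4 %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  time_decompose(Sales, trend = '14 days') %>%
  anomalize(remainder, alpha = 0.2) %>%
  filter(anomaly == 'Yes') %>%
  mutate(direction = ifelse(remainder < 0, 'Under-reported', 'Over-stated')) %>%
  count(direction)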
In conclusion, these sales records might be instances of fraud, provided the possibility of erroneous recording can be ruled out.
Recommendations
Implement fully automatic and restricted recording of transactions to reduce the chances of errors or fraud.
Obtain more data to gain deeper insight into how frequently these anomalies occur.
Yes, we had the right data, as it was provided by the client from its previous sales records, and anomalies could be detected from it.
Yes, more data would provide further insight into how frequently these anomalies occur, and whether they are random or follow a pattern.
Yes, we had the right question, as the aim of this analysis was to address the client's request.
# Restoring the default warning setting
options(warn = defaultW)