MARKET BASKET ANALYSIS

1. Defining the Question

a) Specifying the Question

Are there fraudulent entries in the sales records?

b) Defining the Metric for Success

Correct identification of anomalies within the provided data.

c) Understanding the Context

Carrefour Kenya was undertaking a project to inform the marketing department about the presence of fraudulent sales. The client provided a data set from which anomalies, if any, were to be detected.

Correct detection of anomalies is important because under-reporting of sales is a crime, since it results in less tax being filed. In addition, both under-reporting and over-reporting of sales can lead to incorrect restocking of goods, causing product shortages or overstocking respectively, and ultimately heavy losses.

d) Recording the Experimental Design

  1. Data sourcing/loading
  2. Data Understanding
  3. Data Relevance
  4. External Dataset Validation
  5. Data Preparation
  6. Univariate Analysis
  7. Bivariate Analysis
  8. Anomaly detection
  9. Implementing the solution
  10. Challenging the solution
  11. Conclusion
  12. Follow up questions

e) Data Relevance

The data provided should contain the actual sales records for the anomaly detection insights to be useful to the client.

2. Data Understanding

a) Importing Required Libraries

# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data tables
library(plyr)       # Data manipulation tools
library(psych)      # Descriptive statistics
library(coda)       # Output analysis and diagnostics
library(base)       # Date-time conversion
library(ggplot2)    # Plotting library
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(moments)    # Measures of distribution 
library(ggcorrplot) # Correlation plotting 
library(tidyverse, quietly = T)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()     masks psych::%+%()
## ✖ ggplot2::alpha()   masks psych::alpha()
## ✖ dplyr::arrange()   masks plyr::arrange()
## ✖ dplyr::between()   masks data.table::between()
## ✖ purrr::compact()   masks plyr::compact()
## ✖ dplyr::count()     masks plyr::count()
## ✖ dplyr::failwith()  masks plyr::failwith()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::id()        masks plyr::id()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ dplyr::mutate()    masks plyr::mutate()
## ✖ dplyr::rename()    masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ✖ purrr::transpose() masks data.table::transpose()
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
library(anomalize)
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(timetk)
## 
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
## 
##     :=

b) Reading the Data

# Loading data set
part4 <- read.csv('dataset_4.csv')

c) Checking the Data

i) Number of Records

# Data set records summary
records.summary <- function(data){
  cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
  
}
# Data set summary
records.summary(part4)
## Number of rows =  1000  and columns =  2

ii) Top Dataset Preview

# Top data set preview
head(part4)
##        Date    Sales
## 1  1/5/2019 548.9715
## 2  3/8/2019  80.2200
## 3  3/3/2019 340.5255
## 4 1/27/2019 489.0480
## 5  2/8/2019 634.3785
## 6 3/25/2019 627.6165

iii) Bottom Dataset Preview

# Bottom  data set preview
tail(part4)
##           Date     Sales
## 995  2/18/2019   63.9975
## 996  1/29/2019   42.3675
## 997   3/2/2019 1022.4900
## 998   2/9/2019   33.4320
## 999  2/22/2019   69.1110
## 1000 2/18/2019  649.2990

No columns need to be dropped, as the data set contains only the Date and Sales columns; the Date column will be converted to a date data type.

d) Checking Datatypes

# Data types for the data set
str(part4)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...

The date variable will be changed to a date type.

# Unique time values
head(unique(part4$Date), 10)
##  [1] "1/5/2019"  "3/8/2019"  "3/3/2019"  "1/27/2019" "2/8/2019"  "3/25/2019"
##  [7] "2/25/2019" "2/24/2019" "1/10/2019" "2/20/2019"

The format is ‘%m/%d/%Y’.

# Data type conversion to Date
# part4$Date <- strptime(part4$Date, format = '%m/%d/%Y')
part4$Date <- as.Date(part4$Date, '%m/%d/%Y')
# Confirming changes
cat('Class =', class(part4$Date), '\n')
## Class = Date
# Preview
head(part4$Date)
## [1] "2019-01-05" "2019-03-08" "2019-03-03" "2019-01-27" "2019-02-08"
## [6] "2019-03-25"

The column data type has been changed to Date.
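
As a quick sanity check, the span of the converted dates and the number of distinct sales dates can be inspected. This is a minimal sketch; its output is not reproduced here.

# Sanity check: span of the converted Date column
range(part4$Date)
# Number of distinct sales dates
length(unique(part4$Date))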

3. External Data Set Validation

The data set was provided by the client; therefore, external data set validation will not be performed.

4. Data Preparation

a) Validation

# Data set columns
colnames(part4)
## [1] "Date"  "Sales"

Both columns will be kept in order to analyze the seasonality, the trend, and the presence or absence of anomalies.

b) Completeness

# Missing values
colSums(is.na(part4))
##  Date Sales 
##     0     0

There are no missing values in the data set.

c) Consistency

# Checking for duplicates
sum(duplicated(part4))
## [1] 0

There are no duplicate records in the data set.

d) Uniformity

Checking the uniformity of column names and values.

# Column names
colnames(part4)
## [1] "Date"  "Sales"

The column names are uniform, therefore, they will not be renamed.

e) Outliers

# Selecting numerical columns
num <- unlist(lapply(part4, is.numeric))
# Numeric data frame (drop = FALSE keeps the data frame structure)
num_df <- part4[, num, drop = FALSE]
# Number of numeric columns
length(names(num_df))
## [1] 1
# Box plots
par(mar=c(1,5,2,2))
boxplot(num_df, main = paste('Boxplot of Sales'), 
          ylab = 'Count')

The outliers will be retained for anomaly detection.
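
To put a number on what the box plot shows, a minimal sketch using the standard 1.5 × IQR rule counts the points beyond the whiskers; the variable names q and fence are illustrative.

# Counting Sales values beyond the 1.5 * IQR whiskers (Tukey's rule)
q <- quantile(part4$Sales, c(0.25, 0.75))
fence <- 1.5 * IQR(part4$Sales)
sum(part4$Sales < q[1] - fence | part4$Sales > q[2] + fence)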

5. Descriptive Analysis

a) Univariate Analysis

i) Categorical

# Year
unique(format(part4$Date, format = '%Y'))
## [1] "2019"

All records are from 2019.

# Months
unique(format(part4$Date, format = '%m'))
## [1] "01" "03" "02"

The data set only contains records for three months: January, February and March 2019.
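
A simple frequency table (a minimal sketch) makes the monthly record counts explicit before they are plotted below.

# Record counts per month
table(format(part4$Date, format = '%m'))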

# Copy of the data set.
df <- data.frame(part4)
# Creating month and day columns.
df$Month <- as.factor(format(part4$Date, format = '%m'))
df$Day <- as.integer(format(part4$Date, format = '%d'))
df$Weekday <- as.factor(lubridate::wday(df$Date)) # 1 = Sunday; lubridate is not attached, so the namespace is used

Function definition

# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend = NULL, colors = NULL, method, col2 = NULL){
  if (method == 'count'){
    ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
    ggtitle(paste(title, 'Frequency Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  else if (method == 'bar'){
    ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
}

Months

# Bar plot
bar.plt (df, Month, title = 'Month Column', method = 'count', legend = 
           'Month', colors = c( '01' = '#0398dc','02' = '#026592',
                                '03' = '#001118'),
         col2 = NULL)

January has the highest frequency, while February has the lowest.

Day of the Week

unique(df$Weekday)
## [1] 7 6 1 2 5 4 3
## Levels: 1 2 3 4 5 6 7
# Bar plot
bar.plt (df, Weekday, title = 'Weekday Column', method = 'count', legend = 
           'Weekday', colors = c( '1' = '#03a9f4', '2' = '#0398dc',
                               '3' = '#0287c3', '4' = '#0276AB',
                               '5' = '#026592', '6' = '#02557A',
                               '7' = '#014462'),
         col2 = NULL)

Sunday, Wednesday and Thursday had the highest frequency, while Tuesday had the lowest (the weekday codes start at 1 = Sunday).
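
Because the plot shows numeric weekday codes, a quick sketch with base R day names can be used to cross-check the mapping (weekdays() returns locale-dependent day names).

# Day-name counts to cross-check the numeric weekday codes above
table(weekdays(df$Date))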

ii) Numerical

# Mode function
mode <- function(col, data) {
   unique.value <- unique(data[, col])
   unique.value[which.max(tabulate(match(data[,col], unique.value)))]
}

central.tendency <- function(col, data){
  cat('Measures of Central Tendency \n')
  # Mean
  cat('Mean = ', mean(data[, col]), '\n')
  # Median
  cat('Median = ', median(data[,col]), '\n')
  # Mode
  cat('Mode = ', mode(col, data), '\n')
  
}

dispersion <- function(col, data){
  # Range
  cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
  # IQR
  cat('IQR = ', IQR(data[ ,col]), '\n')
  # Variance
  cat('Variance = ', var(data[ ,col]), '\n')
  # Standard Deviation
  cat('Standard Deviation = ', sd(data[ ,col]), '\n')
  # Skewness
  cat('Skewness = ', skewness(data[ ,col]), '\n')
  # Kurtosis
  cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
  
}
# Column names
colnames(df)
## [1] "Date"    "Sales"   "Month"   "Day"     "Weekday"
# Measures of central tendency
central.tendency(colnames(df)[2], df)
## Measures of Central Tendency 
## Mean =  322.9667 
## Median =  253.848 
## Mode =  829.08

The mean sale value is about 323, noticeably higher than the median of about 254, which already points to a right-skewed distribution.

# Measures of dispersion
dispersion(colnames(df)[2], df)
## Range =  10.6785 - 1042.65 
## IQR =  346.9279 
## Variance =  60459.6 
## Standard Deviation =  245.8853 
## Skewness =  0.8912304 
## Kurtosis =  2.91253

The distribution has a moderate positive skew (skewness ≈ 0.89) and is close to mesokurtic, with a kurtosis of about 2.91 (just under 3, i.e. slightly platykurtic).

# Sales Histogram
hist(df$Sales, 
     main = 'Histogram of Sales', 
     xlab = 'Sales')

Lower sales have a higher frequency.

Days

# Day of month histogram
hist(df$Day, 
     main = 'Histogram of Days', 
     xlab = 'Day')

The frequency of records reduces towards the end of the month, partly because days 29–31 do not occur in every month (February 2019 has only 28 days).

Summary

Sales

  • January has the highest frequency, while February has the lowest.
  • Sunday, Wednesday and Thursday had the highest frequency. Tuesday had the lowest.
  • The average sale value is about 323.
  • Lower sales have a higher frequency.

b) Bivariate Analysis

Sales Trend

# Line plot function
line.plt <- function(col1, col2, data, title){
  ggplot(data, aes(x = {{col1}}, y = {{col2}})) + 
    geom_line(color = '#014462', size = 1) + 
    ggtitle(paste(title, 'Line Plot')) +
    theme(plot.title = element_text(hjust = 0.5)) 
}
# Trend
line.plt(data = df, col1 =  Date, col2 = Sales,
         title = 'Sales Trend')

  • A very slight periodic trend can be seen (a smoothed view is sketched below).
  • The sales fluctuate on a daily basis.
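
To make any weekly periodicity easier to see, the series can be smoothed before plotting. This is a minimal sketch: the object names daily and Rolling are illustrative, and stats::filter() is called with its namespace because dplyr masks filter().

# Aggregating to daily totals and smoothing with a 7-day centered rolling mean
daily <- df %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales))
daily$Rolling <- stats::filter(daily$Sales, rep(1/7, 7), sides = 2)
plot(daily$Date, daily$Rolling, type = 'l',
     main = '7-Day Rolling Mean of Daily Sales', xlab = 'Date', ylab = 'Sales')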

Part 4: Anomaly Detection

Pre-Processing

# Ordering the records by date (whole rows, so Sales stays paired with its Date)
# and converting to a time tibble
part4 <- part4[order(part4$Date), ]
part4 <- as_tbl_time(part4, index = Date)

Anomaly Detection

# Detection
anomalies <- part4 %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  time_decompose(Sales, merge = T, method = 'stl') %>%
  anomalize(remainder, method = 'iqr', alpha = 0.2) %>%
  time_recompose()
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# View result
anomalies %>% glimpse()
## Rows: 89
## Columns: 11
## $ Date          <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01…
## $ Sales         <dbl> 4343.682, 3260.827, 1920.250, 2837.614, 4007.829, 2742.7…
## $ observed      <dbl> 4343.682, 3260.827, 1920.250, 2837.614, 4007.829, 2742.7…
## $ season        <dbl> 340.12150, -136.97236, -498.88780, -452.66977, 566.26578…
## $ trend         <dbl> 3211.772, 3260.846, 3309.921, 3358.996, 3404.445, 3449.8…
## $ remainder     <dbl> 791.78898, 136.95367, -890.78257, -68.71128, 37.11856, -…
## $ remainder_l1  <dbl> -2006.696, -2006.696, -2006.696, -2006.696, -2006.696, -…
## $ remainder_l2  <dbl> 2240.483, 2240.483, 2240.483, 2240.483, 2240.483, 2240.4…
## $ anomaly       <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "…
## $ recomposed_l1 <dbl> 1545.1969, 1117.1777, 804.3369, 899.6296, 1964.0143, 146…
## $ recomposed_l2 <dbl> 5792.376, 5364.357, 5051.516, 5146.809, 6211.194, 5709.9…

The automatically selected decomposition parameters are a frequency of 7 days and a trend window of 30 days.

# Anomalies plot
anomalies %>% plot_anomalies(time_recomposed = TRUE, alpha_dots = 0.5) +
  ggtitle('Anomalies Plot')

  • Alpha has been set to 0.2 so that the normal band is tight enough for the anomalies not to be included in the normal range (a sensitivity sketch follows this list).
  • There are more cases of over-reported sales than of under-reported sales.
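
In anomalize(), larger alpha values narrow the band of "normal" values and therefore flag more points. As a sensitivity check, this minimal sketch compares the number of flagged days at two alpha levels; the helper name count_anoms and the chosen levels are illustrative, not part of the original analysis.

# Comparing the number of flagged days at two alpha levels (illustrative helper)
count_anoms <- function(a){
  part4 %>%
    group_by(Date) %>%
    summarise(Sales = sum(Sales)) %>%
    time_decompose(Sales, method = 'stl') %>%
    anomalize(remainder, method = 'iqr', alpha = a) %>%
    filter(anomaly == 'Yes') %>%
    nrow()
}
c(alpha_0.05 = count_anoms(0.05), alpha_0.2 = count_anoms(0.2))
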
# Anomaly decomposition plot
anomalies%>%
  plot_anomaly_decomposition()+
  ggtitle("Decomposition of Anomalies Plot")

  • From the general trend, there are 3–4 clear anomalous instances. The trend is slightly periodic, dipping markedly in February.

Adjusting Trend and Seasonality

# Default settings
auto <- anomalies %>% plot_anomaly_decomposition() +
  ggtitle('Frequency and Trend = Auto')
auto

  • Resultant plot for default frequency and trend values.
  • Anomalies: Jan = 4, Feb = 4 and Mar = 3.
# Logical frequency
get_time_scale_template()
## # A tibble: 8 × 3
##   time_scale frequency trend   
##   <chr>      <chr>     <chr>   
## 1 second     1 hour    12 hours
## 2 minute     1 day     14 days 
## 3 hour       1 day     1 month 
## 4 day        1 week    3 months
## 5 week       1 quarter 1 year  
## 6 month      1 year    5 years 
## 7 quarter    1 year    10 years
## 8 year       5 years   30 years

For daily data the template suggests a frequency of 1 week and a trend of 3 months; the automatic selection above settled on a frequency of 7 days and a trend of 30 days, most likely because the records span only about three months.

# Changing the trend to 14 days, as the data set only contains values from 
# January to March
half <- part4 %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  time_decompose(Sales, trend = '14 days') %>%
  anomalize(remainder, alpha = 0.2) %>%
  time_recompose %>%
  plot_anomaly_decomposition() + 
  ggtitle('Trend = 14 days(Local)')
## frequency = 7 days
## trend = 14 days
half

  • The overall trend is fairly constant, with slight deviations around the center value.
  • Reducing the trend to 14 days shifts the anomalies slightly: Jan = 3, Feb = 4, Mar = 4. One anomaly moves into the following month, shifting the rest forward.

Extracting Anomalous Data Points

# Anomalies
part4 %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  time_decompose(Sales, trend = '14 days') %>%
  anomalize(remainder, alpha = 0.2) %>%
  time_recompose %>%
  filter(anomaly == 'Yes')
## frequency = 7 days
## trend = 14 days
## # A time tibble: 11 × 10
## # Index: Date
##    Date       observed season trend remainder remainder_l1 remainder_l2 anomaly
##    <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl> <chr>  
##  1 2019-01-08    7893.  368.  3340.     4185.       -1639.        1919. Yes    
##  2 2019-01-22    1886.  368.  3821.    -2303.       -1639.        1919. Yes    
##  3 2019-01-31    5650. -536.  3469.     2718.       -1639.        1919. Yes    
##  4 2019-02-07    7282. -536.  3035.     4783.       -1639.        1919. Yes    
##  5 2019-02-15    6343. -426.  2489.     4280.       -1639.        1919. Yes    
##  6 2019-02-23    1622.  613.  2683.    -1673.       -1639.        1919. Yes    
##  7 2019-02-25    6385.  150.  2811.     3423.       -1639.        1919. Yes    
##  8 2019-03-09    6566.  613.  4025.     1928.       -1639.        1919. Yes    
##  9 2019-03-14    4909. -536.  3106.     2339.       -1639.        1919. Yes    
## 10 2019-03-17    1156.   46.7 2767.    -1658.       -1639.        1919. Yes    
## 11 2019-03-20    6613. -215.  2753.     4076.       -1639.        1919. Yes    
## # … with 2 more variables: recomposed_l1 <dbl>, recomposed_l2 <dbl>
  • Shifting the trend window shifts the position of the anomalies. A trend of 14 days is used because the records cover a short period.
  • Each month has one case of under-reported sales.
  • Most anomalies are cases of over-reported sales (tallied in the sketch below).
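
The counts above can be tallied directly. This is a minimal sketch reusing the same pipeline; the object name flagged and the Direction labels are illustrative.

# Tallying flagged days by month and by direction of the deviation
flagged <- part4 %>%
  group_by(Date) %>%
  summarise(Sales = sum(Sales)) %>%
  time_decompose(Sales, trend = '14 days') %>%
  anomalize(remainder, alpha = 0.2) %>%
  filter(anomaly == 'Yes') %>%
  mutate(Month = format(Date, '%m'),
         Direction = ifelse(remainder > 0, 'Over-reported', 'Under-reported'))
table(flagged$Month, flagged$Direction)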

6. Implementing the Solution

Analysis Summary

  1. 11 anomalies were found: 3 are cases of under-reported sales and 8 are cases of over-reported sales.

  2. Each month has a single case of under-reported sales.

7. Challenging the Solution

The trend window was reduced from 30 days to 14 days; 14 days was selected because of the short time frame of the records. Some anomalies shifted position as a result, changing from Jan = 4, Feb = 4, Mar = 3 to Jan = 3, Feb = 4, Mar = 4. A comparison across several trend windows is sketched below.
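
As an additional way to challenge the solution, this minimal sketch compares the number of flagged days across a few trend windows; the specific windows tried are arbitrary choices, not part of the original analysis.

# Comparing the anomaly count across candidate trend windows (illustrative choices)
sapply(c('14 days', '30 days', '61 days'), function(tr){
  part4 %>%
    group_by(Date) %>%
    summarise(Sales = sum(Sales)) %>%
    time_decompose(Sales, trend = tr) %>%
    anomalize(remainder, alpha = 0.2) %>%
    filter(anomaly == 'Yes') %>%
    nrow()
})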

8. Conclusion

In conclusion, the flagged sales records may be instances of fraud, provided the possibility of erroneous recording has been ruled out.

9. Recommendations

  1. Implement fully automated, access-restricted recording of transactions to reduce the chances of errors or fraud.

  2. Obtain more data to gain deeper insight into the frequency of these anomalies.

10. Follow Up Questions

a) Did we have the right data?

Yes, we had the right data: it was provided by the client from previous sales records, and anomalies could be detected from it.

b) Do we need other data to answer our question?

Yes. More data would provide further insight into how frequently these anomalies occur, and whether they are random or follow a pattern.

c) Did we have the right question?

Yes, we had the right question, as the analysis directly addresses the client's request.

# Restoring the default warning setting
options(warn = defaultW)