Anomaly Detection

Overview

We check whether the given sales dataset contains any anomalies. The objective of this task is fraud detection.

Load data

# Installing anomalize package
#install.packages("anomalize",repos = "http://cran.us.r-project.org")
# Load tidyverse and anomalize
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(anomalize,warn.conflicts = FALSE)
## Warning: package 'anomalize' was built under R version 4.1.3
library(tibbletime)
## Warning: package 'tibbletime' was built under R version 4.1.3
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
# read data
forecast <- read.csv('http://bit.ly/CarreFourSalesDataset')
# preview the first rows (View() needs an interactive session)
head(forecast)
# checking the structure of our data
str(forecast)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...
# checking the shape
dim(forecast)
## [1] 1000    2

We have 1000 observations and 2 variables.

# converting Date from character to the Date class
forecast$Date <- as.Date(forecast$Date, "%m/%d/%Y")
str(forecast)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date : Date, format: "2019-01-05" "2019-03-08" ...
##  $ Sales: num  549 80.2 340.5 489 634.4 ...
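
A date string that does not match the format passed to as.Date() silently becomes NA, so it can be worth counting failed parses after the conversion. The snippet below is a minimal sketch on hypothetical sample strings (raw_dates is not part of the dataset); the same check could be applied to forecast$Date.

```r
# Hypothetical sample in the same month/day/year layout as the Date column;
# the last entry is deliberately invalid
raw_dates <- c("1/5/2019", "3/8/2019", "13/40/2019")

parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# as.Date() returns NA where a string does not match the format,
# so counting NAs gives the number of failed parses
n_failed <- sum(is.na(parsed))
n_failed
```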

Visualization

# visualizing our sales
hist(forecast$Sales,col="blue")

# Sales distribution over time
library(ggplot2)
ggplot(data = forecast, aes(x = Date, y = Sales)) +
      geom_bar(stat = "identity", fill = "green") +
      labs(title = "Sales distribution",
           x = "Date", y = "Sales(ksh)")

# Ordering the data by Date
forecast = forecast %>% arrange(Date)
head(forecast)
# Our data has many records per day, so we take the
# average per day to get one observation per date
forecast <- aggregate(Sales ~ Date, forecast, mean)
head(forecast)
# Converting the data frame to a time-aware tibble (tbl_time).
# A tbl_time has a time index that records which column should be
# used for time-based subsetting and other time-based manipulation.
forecast <- tbl_time(forecast, Date)
class(forecast)
## [1] "tbl_time"   "tbl_df"     "tbl"        "data.frame"
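
The daily averaging step can be checked on a small example. The sketch below uses a hypothetical toy data frame (toy and daily are illustrative names, not part of the analysis) to show that aggregate() with mean leaves exactly one row per date.

```r
# Hypothetical toy data: two records on the first day, one on the second
toy <- data.frame(
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02")),
  Sales = c(100, 300, 50)
)

# Average Sales within each Date, as done for the real data
daily <- aggregate(Sales ~ Date, toy, mean)

# After aggregation no date should appear more than once
stopifnot(!any(duplicated(daily$Date)))
daily
```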

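We now detect and visualize anomalies. time_decompose() splits Sales into seasonal, trend and remainder components, anomalize() flags outliers in the remainder, and time_recompose() computes the bounds that separate normal observations from anomalies: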

forecast %>%
    time_decompose(Sales) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
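
To make the pipeline above less of a black box, here is a rough base-R approximation of the idea behind it: decompose the series with stl(), then flag remainder values outside IQR-based fences. It is a sketch under assumptions, not the package's exact implementation. The series y is simulated with one injected spike, and the IQR factor of 3 is taken to correspond to the default alpha = 0.05 in anomalize().

```r
set.seed(42)

# Hypothetical daily series: 12 weeks with a weekly pattern plus noise
n <- 84
y <- 300 + 20 * sin(2 * pi * seq_len(n) / 7) + rnorm(n, sd = 5)
y[40] <- y[40] + 200  # inject one obvious spike

# Seasonal-trend decomposition with a weekly period (frequency = 7)
fit <- stl(ts(y, frequency = 7), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# IQR fences around the remainder (factor of 3 is an assumption, see above)
q <- quantile(remainder, c(0.25, 0.75))
iqr_factor <- 3
lower <- q[1] - iqr_factor * diff(q)
upper <- q[2] + iqr_factor * diff(q)

# Indices of observations whose remainder falls outside the fences
flagged <- which(remainder < lower | remainder > upper)
flagged
```

The injected spike at position 40 should appear in flagged. Lowering iqr_factor makes the detector more sensitive, at the cost of more false alarms.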

Conclusion

No anomalies were detected in the daily average sales using the default settings of time_decompose() and anomalize().