We check the given sales dataset for anomalies. The objective of this task is fraud detection.
# Installing anomalize package
#install.packages("anomalize",repos = "http://cran.us.r-project.org")
# Load tidyverse and anomalize
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(anomalize,warn.conflicts = FALSE)
## Warning: package 'anomalize' was built under R version 4.1.3
library(tibbletime)
## Warning: package 'tibbletime' was built under R version 4.1.3
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
# read data
forecast <- read.csv('http://bit.ly/CarreFourSalesDataset')
View(forecast) # opens the data viewer; only useful in an interactive session
# checking the structure of our data
str(forecast)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
# checking the shape
dim(forecast)
## [1] 1000 2
We have 1000 observations and 2 variables.
# converting variables to our preferred format
forecast$Date <- as.Date(forecast$Date, "%m/%d/%Y")
str(forecast)
## 'data.frame': 1000 obs. of 2 variables:
## $ Date : Date, format: "2019-01-05" "2019-03-08" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
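The format string passed to as.Date has to match the raw text, and a mismatch yields NA rather than an error. A minimal sketch on a few made-up strings (the vectors below are hypothetical, not taken from the dataset) shows the conversion and a check for failed parses:

```r
# Hypothetical sample strings in month/day/year order, mirroring the Date column
raw_dates <- c("1/5/2019", "3/8/2019", "1/27/2019")

# %m = month, %d = day, %Y = four-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# A string that does not fit the format becomes NA instead of raising an error
bad <- as.Date("not a date", format = "%m/%d/%Y")
print(bad)

# Count unparsed values before relying on the converted column
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Running the same `sum(is.na(...))` check on forecast$Date after the conversion would confirm that every row parsed.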
# visualizing our sales
hist(forecast$Sales,col="blue")
# Sales distribution over time
library(ggplot2)
ggplot(data = forecast, aes(x = Date, y = Sales)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Sales distribution",
       x = "Date", y = "Sales(ksh)")
# Ordering the data by Date
forecast <- forecast %>% arrange(Date)
head(forecast)
# Since our data has many records per day,
# we take the average per day, so that the data has one observation per date
forecast <- aggregate(Sales ~ Date, forecast, mean)
head(forecast)
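To see what the aggregation does, here is a small sketch on a made-up data frame (`toy` and its values are illustrative, not from the sales data): several sales on the same date collapse to one row holding their mean.

```r
# Hypothetical toy data: two dates with two records each, one date with one record
toy <- data.frame(
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02",
                    "2019-01-02", "2019-01-03")),
  Sales = c(100, 200, 50, 150, 300)
)

# One row per Date, with Sales replaced by the mean of that day's records
daily <- aggregate(Sales ~ Date, data = toy, FUN = mean)
print(daily)

# After aggregation no date should appear more than once
stopifnot(anyDuplicated(daily$Date) == 0)
```

Averaging keeps days with many transactions comparable to days with few; using `sum` instead would track total daily revenue, which is a different question.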
# Converting the data frame to a time-aware tibble (tbl_time)
# A tbl_time has a time index that records which column
# should be used for time-based subsetting and other time-based manipulation.
forecast <- tbl_time(forecast, Date)
class(forecast)
## [1] "tbl_time" "tbl_df" "tbl" "data.frame"
We now use the following functions to detect and visualize anomalies:
forecast %>%
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## frequency = 7 days
## trend = 30 days
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
No anomalies were detected in the daily average sales with anomalize's default settings.
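For intuition about what such a pipeline checks, the following is a rough base-R sketch of the same idea, not anomalize's actual implementation: decompose a daily series with `stats::stl()` using a weekly period, then flag remainders that fall outside interquartile-range fences. The synthetic series, the injected spike, and the fence multiplier of 3 are all assumptions chosen for illustration.

```r
set.seed(42)

# Synthetic daily series (assumption): weekly pattern plus noise over 89 days
n <- 89
day <- seq_len(n)
sales <- 300 + 20 * sin(2 * pi * day / 7) + rnorm(n, sd = 5)

# Inject one obvious outlier so there is something to find
spike_day <- 40
sales[spike_day] <- sales[spike_day] + 500

# Seasonal-trend decomposition with a 7-day period
fit <- stl(ts(sales, frequency = 7), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# IQR fences on the remainder; the multiplier 3 is an illustrative choice
q <- unname(quantile(remainder, c(0.25, 0.75)))
fence <- 3 * (q[2] - q[1])
flagged <- remainder < q[1] - fence | remainder > q[2] + fence

# Indices of the days flagged as anomalous
print(which(flagged))
```

On the real data, a comparable numeric check would be to stop the pipeline after anomalize(remainder) and count the rows whose `anomaly` column equals "Yes", rather than relying on the plot alone.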