Activity 2

The first step loads the dataset into R with read_excel(). The data is then prepared by converting the InvoiceDate column to a Date object and calculating the total quantity sold per day.

library(anomalize)

## Warning: package 'anomalize' was built under R version 4.2.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

library(timetk)

## Warning: package 'timetk' was built under R version 4.2.3

## 
## Attaching package: 'timetk'
## 
## The following objects are masked from 'package:anomalize':
## 
##     anomalize, plot_anomalies
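The masking messages above matter for the rest of the analysis: when two attached packages export a function with the same name, the package attached last is the one a bare call reaches. The following is a minimal sketch of how the `pkg::function` prefix selects a specific version, using the stats/dplyr `filter()` conflict listed earlier. The toy vector `x` is made up for illustration.

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)  # made-up values

# stats::filter() applies a linear filter, here a centred 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
kept_rows <- dplyr::filter(tibble(x = x), x > 3)
```

The same prefix can be used to state explicitly whether a call is meant to reach anomalize::anomalize() or timetk::anomalize().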

online_retail <- read_excel("C:/Users/patop/Downloads/online+retail/Online Retail.xlsx")

The InvoiceDate column is first converted with as.Date, which drops the time of day so that every invoice line from the same day shares one date, a format suitable for daily time series analysis. The data is then grouped by InvoiceDate and summarised into the total quantity sold per day. The final tk_tbl() call is meant to convert the result to a tibble; because summarise() already returns a tibble, tk_tbl() only warns that there is no index to preserve.

# Correctly format the date and convert to tibble
online_retail_tbl <- online_retail %>%
  mutate(InvoiceDate = as.Date(InvoiceDate, format = "%Y-%m-%d")) %>%
  select(InvoiceDate, Quantity) %>%
  group_by(InvoiceDate) %>%
  summarise(total_quantity = sum(Quantity)) %>%
  mutate(total_quantity = as.numeric(total_quantity)) %>%
  tk_tbl()

## Warning in tk_tbl.data.frame(.): Warning: No index to preserve. Object
## otherwise converted to tibble successfully.
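To make the preparation step concrete, here is a minimal sketch on made-up invoice lines. Only the column names InvoiceDate and Quantity come from the dataset; the values are assumptions. It shows how as.Date() collapses timestamps to calendar days and how negative quantities lower the daily total.

```r
library(dplyr)

# Made-up invoice lines with date-time stamps
toy <- tibble(
  InvoiceDate = as.POSIXct(
    c("2011-01-04 08:26:00", "2011-01-04 15:10:00", "2011-01-05 09:00:00"),
    tz = "UTC"
  ),
  Quantity = c(6, -2, 10)
)

daily <- toy %>%
  mutate(InvoiceDate = as.Date(InvoiceDate)) %>%  # drop the time of day
  group_by(InvoiceDate) %>%
  summarise(total_quantity = sum(Quantity))       # one row per calendar day
```

The two lines from 2011-01-04 end up in a single row, so the result has one observation per day, which is the shape the decomposition step expects.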

The time_decompose function splits the total_quantity time series into three components: trend, seasonality, and remainder. The remainder is what is left once the trend and the seasonal pattern are removed, so it is the component most likely to expose anomalies. The messages below report the automatically detected settings: a seasonal frequency of 6 days and a trend window of 72 days.

# Decompose the time series
decomposed_data <- online_retail_tbl %>%
  time_decompose(total_quantity, method = "stl")

## Converting from tbl_df to tbl_time.
## Auto-index message: index = InvoiceDate

## frequency = 6 days

## trend = 72 days

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

# Apply anomaly detection specifying both .date_var and .value
retail_anomalized <- decomposed_data %>%
  anomalize(.date_var = InvoiceDate, .value = remainder)

## frequency = 6 observations per 1 week

## trend = 78 observations per 3 months

# Remove any existing recomposed columns to avoid duplication
retail_anomalized <- retail_anomalized %>%
  select(-contains("recomposed_l1"), -contains("recomposed_l2"))

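# time_recompose() needs remainder_l1 and remainder_l2 (the lower and upper remainder limits).
# They are missing from this result, so lagged remainders are used as placeholders;
# the recomposed_l1/recomposed_l2 bounds built from them should be read with caution.
retail_anomalized <- retail_anomalized %>%
  mutate(remainder_l1 = lag(remainder), remainder_l2 = lag(remainder, 2))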

# Apply time recomposition
retail_anomalized <- retail_anomalized %>%
  time_recompose()
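As a rough sketch of what recomposition computes under the additive model: the lower and upper bounds of the expected range are obtained by adding the season and trend back to the remainder limits. The values below are made up, and the formula is an assumption about the idea behind time_recompose(), not a copy of its implementation.

```r
library(dplyr)

# Made-up components and remainder limits
toy_parts <- tibble(
  season       = c(5, -5, 5),
  trend        = c(100, 101, 102),
  remainder_l1 = c(-30, -30, -30),  # lower remainder limit
  remainder_l2 = c(30, 30, 30)      # upper remainder limit
)

toy_bounds <- toy_parts %>%
  mutate(
    recomposed_l1 = season + trend + remainder_l1,  # lower bound on the original scale
    recomposed_l2 = season + trend + remainder_l2   # upper bound on the original scale
  )
```

An observation falling outside the interval from recomposed_l1 to recomposed_l2 is one that would be flagged as an anomaly.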

The anomalize function is applied to the decomposed data, with .date_var naming the date column and .value naming the column to test, here the remainder. It flags values that deviate significantly from the expected pattern as anomalies. Because timetk was attached after anomalize, the call resolves to timetk's anomalize (see the masking message above), which is why it prints its own frequency and trend settings. The time_recompose function then rebuilds the lower and upper bounds of the expected range (recomposed_l1 and recomposed_l2) from the season, trend, and remainder-limit columns; it does not remove the anomalies from the series, and the cleaned values are stored separately in observed_clean. The anomalies are then plotted with plot_anomalies, and ggtitle supplies the plot title. This visualization shows the time series with the detected anomalies marked, making it easy to identify periods of unusual activity.
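To illustrate how outliers in a remainder can be flagged, here is a minimal sketch of an interquartile-range rule. The helper flag_iqr(), its multiplier k = 3, and the remainder values are hypothetical; this is a generic rule, not the exact procedure either package runs.

```r
# Hypothetical helper: flag values far outside the interquartile range
flag_iqr <- function(x, k = 3) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  width <- k * (q[2] - q[1])   # k times the IQR
  lower <- q[1] - width
  upper <- q[2] + width
  ifelse(x < lower | x > upper, "Yes", "No")
}

# Made-up remainders with one extreme negative value
toy_remainder <- c(-120, 80, 35, -60, 15, -27983, 90, -45, 10, 55)
toy_flags     <- flag_iqr(toy_remainder)
```

Only the extreme value is flagged here, since the other remainders stay within a few IQR widths of the quartiles.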

# Plot the anomalies; ggtitle() is a ggplot2 layer, so it is added with `+`
# (piping the plot into ggtitle() returns a labels object instead of a titled plot).
# .interactive = FALSE requests a static ggplot that accepts ggplot2 layers.
retail_anomalized %>%
  plot_anomalies(.date_var = InvoiceDate, .interactive = FALSE) +
  ggtitle("Anomalies in Retail Sales Time Series")

# View the anomalies
anomalies <- retail_anomalized %>%
  filter(anomaly == 'Yes')

print(anomalies)

## # A time tibble: 1 × 14
## # Index:         InvoiceDate
##   InvoiceDate observed season trend remainder seasadj anomaly anomaly_direction
##   <date>         <dbl>  <dbl> <dbl>     <dbl>   <dbl> <chr>               <dbl>
## 1 2011-06-14   -28279.  -61.7 -235.   -27983. -28217. Yes                    -1
## # ℹ 6 more variables: anomaly_score <dbl>, observed_clean <dbl>,
## #   remainder_l1 <dbl>, remainder_l2 <dbl>, recomposed_l1 <dbl>,
## #   recomposed_l2 <dbl>

The final part of the code keeps only the rows where an anomaly was detected (anomaly == 'Yes'). The output lists the date and magnitude of each anomaly: here a single row, 2011-06-14, with a remainder of about -27,983 and anomaly_direction = -1, meaning the value is unusually low.
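A natural follow-up is to look at the raw invoice lines behind a flagged date. The sketch below does this on made-up rows; the InvoiceNo column and all values are assumptions for illustration, and only the date matches the anomaly reported above.

```r
library(dplyr)

# Made-up invoice lines around the flagged date
toy_lines <- tibble(
  InvoiceNo   = c("A1", "A2", "C3", "A4"),
  InvoiceDate = as.POSIXct(
    c("2011-06-13 10:00:00", "2011-06-14 09:30:00",
      "2011-06-14 11:45:00", "2011-06-15 14:20:00"),
    tz = "UTC"
  ),
  Quantity    = c(12, 5, -300, 8)
)

flagged_day <- as.Date("2011-06-14")

# Lines from the flagged day, most negative quantity first
drivers <- toy_lines %>%
  filter(as.Date(InvoiceDate) == flagged_day) %>%
  arrange(Quantity)
```

Sorting by Quantity puts the lines that pull the daily total down at the top, which helps decide whether the anomaly reflects a large return, a correction, or a data entry problem.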

Patricio Pérez A01642110

2024-08-28