General Note

This example is part of a larger set of examples of using Google Analytics with R.

Overview

This example is an exploration of using Holt-Winters forecasting for anomaly detection. Mainly, it’s geared towards an explanation of time-series decomposition and how that technique can be used to identify whether some set of data falls outside of an expected range.

Setup/Config

Start with the initial configuration of variables and the theme.

# Load the necessary libraries. 
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(googleAnalyticsR,  # How we actually get the Google Analytics data
               tidyverse,         # Includes dplyr, ggplot2, and others; very key!
               splitstackshape,   # For some data munging (probably better to use a tidyverse package)
               plotly,            # We're going to make the charts interactive
               scales,            # Useful for some number formatting in the visualizations
               lubridate,         # For the wday function
               tools)             # Provides toTitleCase() for formatting labels

# Authorize GA. Depending on whether you've done this already and a .httr-oauth file
# has been saved, this may pop you over to a browser to authenticate.
ga_auth(token = ".httr-oauth")
## Token cache file: .httr-oauth
# Set the total number of weeks to be used. The most recent week will be the "checked" week,
# while the weeks before will be used to build the forecast.
total_weeks <- 7

# Set the view ID and the date range. If you want to, you can swap out the Sys.getenv()
# call and just replace that with a hardcoded value for the view ID. 
view_id <- Sys.getenv("GA_VIEW_ID")
end_date <- Sys.Date() - wday(Sys.Date())     # The most recent Saturday
start_date <- end_date - total_weeks * 7 + 1  # Start date based on total_weeks

# Define our theme -- this will be used later for visualizations of the full-sized charts
theme_hw <- theme_bw() +
  theme(text = element_text(family="Nunito"),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.margin = margin(1.5,0,0,0,"cm"),
        axis.text.x = element_blank(),
        axis.text.y = element_text(size = 14),
        axis.title = element_blank(),
        axis.title.x = element_text(size = 10, hjust = 0.5),
        axis.ticks = element_blank(),
        axis.line.x = element_line(color = "gray50"),
        axis.line.y = element_blank(),
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.position = "top",
        legend.justification = "center",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size=0.5, colour = "gray90"),
        panel.grid.minor = element_blank())

# And, a theme for the time-series decomposition
theme_sparklines <- theme_bw() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.line = element_blank(),
        legend.title = element_blank(),
        legend.background = element_blank(),
        legend.position = "none",
        legend.justification = "center",
        strip.text.x = element_text(face = "bold", size = 14, colour = "grey10", family="Nunito"),
        strip.text.y = element_text(face = "bold", size = 14, colour = "grey10", 
                                    angle = 180, hjust=1, family="Nunito"),
        strip.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.spacing = unit(0,"in"),
        panel.background = element_rect(fill = NA, color = NA))

Pull the Data

This is a simple data pull. We’re just pulling sessions by day (because each day is a different day of the week). Google Analytics does have a “Day of Week Name” dimension, but we don’t need to pull it: we can derive the day of week from the date later with lubridate’s wday() function.

# Pull the data. See ?google_analytics for additional parameters. The anti_sample = TRUE
# parameter will slow the query down a smidge and isn't strictly necessary, but it will
# ensure you do not get sampled data.
ga_data <- google_analytics(viewId = view_id,
                            date_range = c(start_date, end_date),
                            metrics = "sessions",
                            dimensions = "date",
                            anti_sample = TRUE)

Plot the Data as Weekly Data

We pulled the data daily, so we’re going to roll this up to be weekly and then plot it to show how the data might look in a dashboard that includes a weekly sparkline.

# Aggregate the data to be weekly
ga_data_weekly <- ga_data %>%
  mutate(week = date - wday(date) + 1) %>% 
  group_by(week) %>% 
  summarise(sessions = sum(sessions)) %>% 
  mutate(last_week_sessions = ifelse(week == max(week), sessions, NA ))

# Build a plot
ga_plot_weekly <- ggplot(ga_data_weekly, mapping = aes(x = week, y = sessions)) +
  geom_line(color = "#0060AF", size = 1) +
  geom_point(aes(y=sessions), size=3, shape = 19, colour = "#0060AF") +
  geom_point(aes(y=sessions), size=3, shape = 20, colour = "white") +
  geom_point(aes(y=last_week_sessions), size=5, colour = "white") +
  geom_point(aes(y=last_week_sessions), size=3.5, colour = "#9A258F") +
  scale_y_continuous(label=comma, expand = c(0, 0), limits = c(0, max(ga_data_weekly$sessions) * 1.03)) +
  labs(x = " ") +
  theme_hw

# Plot the data
ga_plot_weekly

Data Munging

A lot of this is for the purposes of the visualization, but, essentially, we need to break the data into the “data before last week” (this is the data we’ll use to build forecast) and the “data to be evaluated” (the data from last week).

# Determine how many rows of the data will be used to build the forecast. This
# is everything except the last week.
rowcount_forecast <- nrow(ga_data) - 7

# Also figure out the date where the cutoff is between training and forecast
cutoff_date <- ga_data[rowcount_forecast,1] 

# We actually want to shift this over half a day so that, when we plot it, the
# cutoff falls between two points rather than on top of one
cutoff_date <- as.numeric(cutoff_date) + 0.5

# Make a data set that removes the "rows to be evaluated." This will get 
# used both to generate the time series for the forecast as well as for modeling
ga_data_training <- ga_data %>%
  top_n(-rowcount_forecast, wt = date) 

# Get the date values for the forecast period
dates_forecast <- ga_data %>%
  top_n(7, wt = date) %>%
  dplyr::select(date)

# Make a time-series object using the data for the training period. This
# is what we'll use to build the forecast
ga_data_ts <- ga_data_training[[2]] %>%
  ts(frequency = 7)

# Start building out our master data for plotting by adding a column that
# has just the data being used for the training
ga_data_plot <- ga_data %>%
  left_join(ga_data_training, by = c(date = "date"))

# Rename columns to be a bit clearer
names(ga_data_plot) <- c("date", "sessions_all", "sessions_training")

# Add a column that is just the actuals data of interest
ga_data_plot <- ga_data_plot %>%
  mutate(sessions_assess = ifelse(is.na(sessions_training), sessions_all, NA))

Visualizing the Data

The basic view of the raw data:

We’re going to split this data up into two different groups:

The data of interest is the last seven days:

The earlier data is our data for context – this is the data we’ll actually use to build a forecast:

So, really, we’re just going to work with the data before the cutoff for now. This is the data that we’re going to “decompose” and, ultimately, use to build a forecast:
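
The plotting code for these views isn’t shown above, but a minimal sketch of the split view might look like the following (it assumes the ga_data_plot columns and the numeric cutoff_date created in the previous step):

# A sketch of the "split" view: the training data as one line, the last seven
# days (the data to be assessed) as another, with a dashed vertical line at
# the cutoff between the two.
ga_plot_split <- ggplot(ga_data_plot, aes(x = date)) +
  geom_line(aes(y = sessions_training), color = "#0060AF", size = 1, na.rm = TRUE) +
  geom_line(aes(y = sessions_assess), color = "#9A258F", size = 1, na.rm = TRUE) +
  geom_vline(xintercept = cutoff_date, color = "gray40", linetype = "dashed") +
  scale_y_continuous(labels = comma, expand = c(0, 0),
                     limits = c(0, max(ga_data_plot$sessions_all) * 1.03)) +
  labs(x = " ") +
  theme_hw

ga_plot_split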

Explaining / Visualizing Time-Series Decomposition

This step isn’t strictly necessary for this exercise. But, for pedagogical purposes, we’re going to walk through the actual “decomposing” of the training data. It’s an oversimplification of what happens in Holt-Winters forecasting, in that Holt-Winters also applies “exponential smoothing” at each step. That smoothing matters, but it’s easier to layer on conceptually once you understand the fundamental decomposition of the time series, so the decomposition is what we’ll walk through here.
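
To make “exponential smoothing” a little more concrete, here is an illustrative sketch (not code this example actually uses) of the simplest form of it: each smoothed value is a weighted blend of the current observation and the previous smoothed value. Holt-Winters applies this same sort of recursion separately to the level, trend, and seasonal components.

# Illustrative only: simple exponential smoothing by hand. An alpha close to 1
# weights recent observations heavily; an alpha close to 0 smooths aggressively.
exp_smooth <- function(x, alpha = 0.3) {
  smoothed <- numeric(length(x))
  smoothed[1] <- x[1]                  # seed with the first observation
  for (i in 2:length(x)) {
    smoothed[i] <- alpha * x[i] + (1 - alpha) * smoothed[i - 1]
  }
  smoothed
}

# e.g., exp_smooth(ga_data_training$sessions, alpha = 0.3)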

The plot below shows the decomposition of the actual data into three components: a Seasonal component (the recurring day-of-week pattern), a Trend component (the underlying level of the data, estimated with a moving average), and a Random component (whatever is left over after the seasonal and trend components are removed).

So, if you’re following along, this means that, for any day, the Seasonal component plus the Trend component plus the Random component exactly equals the Actual value.
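
The code for the decomposition plot isn’t shown here, but a minimal sketch of it, using base R’s decompose() on the ga_data_ts object built earlier, could look like this:

# Decompose the training time series into seasonal, trend, and random
# components (an additive decomposition, matching the description above)
ga_decomp <- decompose(ga_data_ts)

# Verify the identity: Seasonal + Trend + Random equals the Actual value
# wherever all three components are defined (the trend, and therefore the
# random component, is NA at the edges of the series)
recomposed <- ga_decomp$seasonal + ga_decomp$trend + ga_decomp$random
all(abs(recomposed - ga_data_ts) < 1e-8, na.rm = TRUE)  # TRUE

plot(ga_decomp)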

Build the Forecast

This is the actual building of the forecast. It does all of the time-series decomposition described above (but with a bit more complexity) and builds the model inherently within these function calls. To be clear, the time-series decomposition illustrated in the last step is not exactly what is being used for the forecasting but, conceptually, it’s pretty much what is happening under the hood in the HoltWinters() function.

# Generate a Holt-Winters forecast
hw <- HoltWinters(ga_data_ts)

# Predict the next 7 days (the 7 days of interest) with a 95% prediction
# interval, and go ahead and convert the result to a data frame
forecast_sessions <- predict(hw, n.ahead = 7, prediction.interval = TRUE, level = 0.95) %>%
  as.data.frame()

# Add in the dates so we can join this with the original data. We know the
# forecast covers the 7 days immediately following the cutoff
forecast_sessions$date <- dates_forecast$date

# Add these columns to the original data and add a column that IDs anomaly points by 
# checking to see if the actual value is outside the upper or lower bounds. If it is,
# put the value. We'll use this to highlight the anomalies.
ga_data_plot <- ga_data_plot %>%
  left_join(forecast_sessions, by = "date") %>%
  mutate(anomaly = ifelse(sessions_all < lwr | sessions_all > upr, sessions_all, NA))

# Figure out the max value across the data (we'll bump it up a bit when setting
# plot limits). na.rm = TRUE because several of the columns contain NAs.
max_y <- max(dplyr::select(ga_data_plot, -date), na.rm = TRUE)

Plot the Forecast

The forecast comes from the first two components of our time-series decomposition:
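
The original plotting code isn’t shown, but a minimal sketch, assuming the ga_data_plot columns built above, might be:

# Sketch: the training data with the Holt-Winters forecast (the "fit" column
# from forecast_sessions) drawn as a dashed line over the last seven days
ga_plot_forecast <- ggplot(ga_data_plot, aes(x = date)) +
  geom_line(aes(y = sessions_training), color = "#0060AF", size = 1, na.rm = TRUE) +
  geom_line(aes(y = fit), color = "gray40", linetype = "dashed", size = 1, na.rm = TRUE) +
  scale_y_continuous(labels = comma, expand = c(0, 0), limits = c(0, max_y * 1.03)) +
  labs(x = " ") +
  theme_hw

ga_plot_forecast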

The Forecast Will Not Be Perfect

We can add a “prediction interval” that is based on the variability of the Random component from our time-series decomposition. The more fluctuation there was in that component, the bigger the prediction interval (the farther off from the forecast an actual result can be and still be within an “expected” range). The chart below shows the 95% prediction interval, meaning that, if everything stays pretty much the same, this is the range we would expect the actual values to fall within 95% of the time.
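
Building on the sketch above, the interval could be layered on with geom_ribbon(), using the lwr and upr columns from the forecast:

# Sketch: add the 95% prediction interval as a shaded band around the forecast
ga_plot_interval <- ga_plot_forecast +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "gray70", alpha = 0.3, na.rm = TRUE)

ga_plot_interval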

Compare the Forecast to the Actuals

Now, we can layer back in the actual values for those seven days and have some useful context.
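
As a sketch, continuing from the hypothetical plot objects above:

# Sketch: layer the actual values for the assessed week back in, and highlight
# any anomalies (points that fall outside the prediction interval)
ga_plot_final <- ga_plot_interval +
  geom_line(aes(y = sessions_assess), color = "#9A258F", size = 1, na.rm = TRUE) +
  geom_point(aes(y = anomaly), color = "#9A258F", size = 3, na.rm = TRUE)

ga_plot_final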

Now we have meaningful context!