setwd("C:/Users/DellPC/Desktop/Corner/R_source_code/time_series")
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(fpp3))
suppressMessages(library(scales))
df <- read.csv("ads.csv")
The data set consists of the following two columns:
- Time: the specific date and time at which the event occurred
- Ads: the number of ads that appeared at that point in time

The goal of this notebook is to produce an accurate forecast of the number of ads that will appear over time. We will use feature engineering and transformations to get there.

In this notebook we will use the following libraries to deal with time series:
- timetk: for time series data wrangling and visualization
- modeltime: mainly used to implement forecasts for our time series data
- prophet: also used for forecasting; Prophet was developed at Facebook to easily produce accurate forecasts for time series data
# Data Wrangling
library(tidyverse)
library(timetk)
library(lubridate)
library(tidyquant)
# Visualization
library(ggthemes)
library(gghalves)
library(cowplot)
# Modeling
library(modeltime)
library(prophet)
library(tidymodels)
# Pipeline
library(workflows)
We can clearly see some seasonality patterns at a first glimpse of the time series visualization below. We will confirm this later in the notebook through processes already explained in Part 1 of this series, and we will revisit some of those concepts here as well.
hourly_tbl <- df %>%
set_names(names(.) %>% str_to_lower()) %>%
mutate(
time = ymd_hms(time)
)
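Before plotting, it is worth confirming that the index really is a regular hourly sequence with no gaps. A minimal sketch using timetk's tk_summary_diagnostics(), which reports the start, end, interval and number of observations:
# Quick sanity check of the time index (regular hourly spacing, no gaps)
hourly_tbl %>%
  tk_summary_diagnostics(.date_var = time)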
hourly_tbl %>%
ggplot(aes(x = time, y = ads)) +
geom_line(size = 1, color = "#4E79A7") +
geom_point(size = 2, color = "#2C3E50") +
theme_bw() +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(
title = "Number of ads across time"
) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The concept of finding seasonality patterns is really important if we want a better grasp of what the time series is telling us. In this section, we will split our time series by hour and by weekday and see how its distribution behaves.
Summary:
- Highest distribution of ads by hour: the highest distribution of ads occurs at hours 12, 18 and 19.
- Lowest distribution of ads by hour: not surprisingly, the lowest distribution of ads occurs during the early morning hours.
- Highest distribution of ads by weekday: the highest distribution of ads occurs on Sundays, followed by Saturdays.
- Lowest distribution of ads by weekday: the lowest distribution of ads occurs on Mondays and Thursdays, although there is not much variation across the business days.
This seasonal diagnostics plot comes from the timetk package.
hourly_tbl %>%
plot_seasonal_diagnostics(
.date_var = time,
.value = ads,
.interactive = TRUE,
.geom_color = "#4E79A7"
)
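If you want the numbers behind these seasonal diagnostics (for example, to confirm the peaks at hours 12, 18 and 19 mentioned above), a quick dplyr summary works as well. This is only a minimal sketch; hour_of_day and median_ads are column names introduced here for illustration:
# Median number of ads per hour of day, computed directly with dplyr/lubridate
hourly_tbl %>%
  mutate(hour_of_day = lubridate::hour(time)) %>%
  group_by(hour_of_day) %>%
  summarise(median_ads = median(ads), .groups = "drop") %>%
  arrange(desc(median_ads))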
What are raincloud plots?
Raincloud plots are a way to dig deeper into the distribution of our dataset. They combine the following (matching the layers used in the code below):
- Jittered points (the "rain")
- A half violin plot
- A half boxplot
Interpreting a boxplot
The half violin plot and the jittered points just show the overall distribution of the observations in our dataset; the boxplot summarizes it through its quartiles.
weekday_pal <- c("#DBB165", "#DEB18B", "#2E604A", "#27223C", "#D1362F", "#7496D2","#CECD7B")
hourly_tbl %>%
mutate(
weekday = as.factor(WEEKDAY(time, label = TRUE, abbr = FALSE))
) %>%
ggplot(aes(x = weekday, y = ads)) +
geom_jitter(aes(color = weekday),
alpha = 0.5,
size = 0.5,
show.legend = FALSE) +
geom_half_violin(aes(fill = weekday),
side = "l",
alpha = 0.45,
show.legend = FALSE,
trim = FALSE) +
geom_half_boxplot(aes(fill = weekday),
side = "r",
outlier.size = 3,
outlier.color = "red",
width = 0.4,
alpha = 0.45,
show.legend = FALSE) +
stat_summary(aes(group = 1), fun = mean, geom = "line") +
theme_bw() +
labs(
title = "Raincloud Plots",
subtitle = "A more descriptive way of distributions"
) +
scale_color_manual(values = weekday_pal) +
scale_fill_manual(values = weekday_pal)
Going back to this topic as a refresher from Part 1 of this series, I would like to discuss in a bit more detail the definitions of both autocorrelation (ACF) and partial autocorrelation (PACF). First, the reason I emphasize this is that lags at different points in time can be used as features to improve the predictive accuracy of our models. Before getting into the topic we must understand the following: it is possible that past values influence the behavior of future values in a time series. When this occurs, we say that the current value is correlated with past values of the series.
Definitions:
- Autocorrelation (ACF): autocorrelation is simply the correlation between the current value and past values. Because each lag is itself correlated with the lags before it, the autocorrelation at a given lag includes both direct and indirect effects.
- Partial autocorrelation (PACF): in short, partial autocorrelation removes the effect of the lags that sit between the current value and the lag you are evaluating. In other words, at lag k it is the correlation that remains after removing the effect of correlations at shorter lags. For instance, to find the partial correlation between the current value and lag k-3, we remove the effects of lags k-1 and k-2.
Interpretation of the ACF and PACF:
- Understanding our dataset: remember that our dataset is hourly, so each lag represents one hour.
- Correlation: we see a positive correlation roughly every 24 hours in both the ACF and the PACF.
- What does this mean? At any given hour, the number of ads tends to move similarly to the same hour on the previous day, hence the positive correlation at every 24 lags.
color_pal <- c("#5C5C5C", "#EA3B46")
hourly_tbl %>%
tk_acf_diagnostics(
.date_var = time,
.value = ads
) %>%
pivot_longer(cols = c("ACF", "PACF")) %>%
ggplot(aes(x = lag, y = value, color = name)) +
geom_line(size = 1) +
geom_point(size = 0.5) +
facet_wrap(~name) +
theme_bw() +
scale_color_manual(values = color_pal) +
theme(legend.position = "bottom", strip.background = element_rect(fill = "#BDBDBD")) +
labs(
title = "Autocorrelation and Partial Auto Correlation",
color = "Metric"
)
## Max lag exceeds data available. Using max lag: 215
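As a quick cross-check of the definitions above, base R's stats::acf() and stats::pacf() compute the same quantities without timetk. A minimal sketch (the choice of lag.max = 48, i.e. two daily cycles, is arbitrary):
# Base R ACF/PACF of the hourly ads series as a sanity check
acf(hourly_tbl$ads, lag.max = 48, main = "ACF of hourly ads")
pacf(hourly_tbl$ads, lag.max = 48, main = "PACF of hourly ads")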
In this section we will start implementing the concept of feature engineering. I will be using the timetk package to extract valuable features which could enhance the predictive power of our models later in the notebook. timetk is a powerful package that gives you a lot of leverage when analyzing time series datasets.
Summary of the feature engineering process:
- index.num: number of seconds elapsed since 1970-01-01 up to the current timestamp
- diff: number of seconds between consecutive timestamps
- half: which half of the year (1 or 2)
- quarter: quarter of the year
- month: month of the year
- day: day of the month
- hour: hour of the day
- hour12: hour on a 0-11 scale (it restarts in the afternoon) instead of 0-23
- weekday: day of the week
- mday: day of the month (same as day)
- qday: day of the quarter
- yday: day of the year
Note: the features with the most significance will most probably be the hour and the weekday, but we will confirm this later in the notebook.
# Feature engineering with one line of code
prepared_hourly_tbl <- hourly_tbl %>%
tk_augment_timeseries_signature() %>%
mutate(
ads_transformed = log_interval_vec(ads, limit_lower = 0, offset = 1),
ads_transformed = standardize_vec(ads_transformed)
) %>%
select(time, ads, ads_transformed, everything())
## tk_augment_timeseries_signature(): Using the following .date_var variable: time
## log_interval_vec():
## Using limit_lower: 0
## Using limit_upper: 179857.5
## Using offset: 1
## Standardization Parameters
## mean: 0.851871721269838
## standard deviation: 0.77381949281498
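Since the log-interval and standardization steps will eventually have to be undone to report forecasts in actual ad counts, it is worth keeping the parameters printed above. A hedged sketch of the inversion, assuming timetk's standardize_inv_vec() and log_interval_inv_vec() helpers and copying the parameters from the console output:
# Invert the two transformations in reverse order to recover the original counts
ads_recovered <- prepared_hourly_tbl$ads_transformed %>%
  standardize_inv_vec(mean = 0.851871721269838, sd = 0.77381949281498) %>%
  log_interval_inv_vec(limit_lower = 0, limit_upper = 179857.5, offset = 1)
all.equal(ads_recovered, as.numeric(prepared_hourly_tbl$ads), tolerance = 1e-6)  # should be TRUE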
prepared_hourly_tbl %>%
glimpse()
## Rows: 216
## Columns: 31
## $ time <dttm> 2017-09-13 00:00:00, 2017-09-13 01:00:00, 2017-09-13 ~
## $ ads <int> 80115, 79885, 89325, 101930, 121630, 116475, 106495, 1~
## $ ads_transformed <dbl> -1.3840152, -1.3907071, -1.1181897, -0.7538472, -0.148~
## $ index.num <dbl> 1505260800, 1505264400, 1505268000, 1505271600, 150527~
## $ diff <dbl> NA, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 36~
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ~
## $ year.iso <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ~
## $ half <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ quarter <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ month <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ~
## $ month.xts <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ~
## $ month.lbl <ord> September, September, September, September, September,~
## $ day <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13~
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ~
## $ minute <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ second <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ hour12 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5~
## $ am.pm <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, ~
## $ wday <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ wday.xts <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ wday.lbl <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wednesday,~
## $ mday <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13~
## $ qday <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75~
## $ yday <int> 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,~
## $ mweek <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ week <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37~
## $ week.iso <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37~
## $ week2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ week3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ week4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ mday7 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
Now we will see what impact some of the new features we extracted from the time column have on a linear model. Before showing the impact of those features, let's go over the main concept of linear regression.
Concept of linear regression: the concept can be confusing, especially for individuals just getting into the machine learning world. When I think about linear regression, I usually think of it as “the best fitted line through a set of observations”. Furthermore, a linear model can have “predictors”, a set of features that help the model reduce its residual error.
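To make this concrete before the timetk helpers below, here is a bare-bones lm() fit on a couple of the engineered features. This is only an illustrative sketch; the predictors chosen here (wday.lbl and hour) differ slightly from the formula used in the second plot:
# A toy linear model: the "best fitted line" through the transformed ads
toy_lm <- lm(ads_transformed ~ wday.lbl + hour, data = prepared_hourly_tbl)
summary(toy_lm)$adj.r.squared  # adjusted R-squared of this toy model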
Summary:
- Linear regression model based on time alone: as you can see, a linear regression on time itself is not that useful. We only get a straight line that cannot capture the movement of our time series, with an adjusted R-squared of -0.002508.
- Using features from feature engineering: our linear model improves once we include day, weekday and hour in the formula, reaching an adjusted R-squared of 0.2062, a big improvement over a single flat line.
- So now what? Is this good enough? Of course not; this was mainly to show a bit of the power of feature engineering. We will explore other transformation techniques later in the notebook to improve forecast accuracy.
- So why can feature engineering help us? As shown below, the features we extracted from the time column help the linear regression model capture the movement of our series, and this is what will help the later models improve their forecasting accuracy.
# Plot 1
p1 <- prepared_hourly_tbl %>%
plot_time_series_regression(
time,
.formula = ads_transformed ~ time,
.interactive = FALSE,
.show_summary = TRUE,
.title = "Linear Model Without Implementing Feature Engineering",
.legend_show = FALSE,
.line_color = "#D1362F",
.smooth_color = "#EA3B46",
.line_size = 1
)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61037 -0.83165 -0.08248 0.84333 2.57092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.107e+02 4.570e+02 0.68 0.497
## time -2.063e-07 3.035e-07 -0.68 0.497
##
## Residual standard error: 1.001 on 214 degrees of freedom
## Multiple R-squared: 0.002155, Adjusted R-squared: -0.002508
## F-statistic: 0.4622 on 1 and 214 DF, p-value: 0.4973
# Plot 2
seasonality_formula <- as.formula(
ads_transformed ~
splines::ns(index.num) +
day + wday.lbl + hour
)
p2 <- prepared_hourly_tbl %>%
plot_time_series_regression(
time,
.formula = seasonality_formula,
.interactive = FALSE,
.show_summary = TRUE,
.title = "Linear Model With Feature Engineering",
.legend_show = TRUE,
.line_color = "#D1362F",
.smooth_color = "#EA3B46",
.line_size = 1
)
##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64709 -0.56690 0.06925 0.68511 1.69159
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.37274 2.66914 6.883 6.84e-11 ***
## splines::ns(index.num) 15.86812 2.34830 6.757 1.40e-10 ***
## day -1.45289 0.21178 -6.861 7.79e-11 ***
## wday.lbl.L -0.26134 0.18746 -1.394 0.16476
## wday.lbl.Q 0.55521 0.17040 3.258 0.00131 **
## wday.lbl.C -0.15773 0.17803 -0.886 0.37667
## wday.lbl^4 0.27134 0.17074 1.589 0.11355
## wday.lbl^5 -0.08790 0.16808 -0.523 0.60155
## wday.lbl^6 0.06717 0.15460 0.434 0.66439
## hour NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8909 on 207 degrees of freedom
## Multiple R-squared: 0.2358, Adjusted R-squared: 0.2062
## F-statistic: 7.983 on 8 and 207 DF, p-value: 2.306e-09
plot_grid(p1, p2, nrow = 2)
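Note that in the second summary the hour coefficient is NA ("1 not defined because of singularities"): within this single month of data, hour is essentially a linear combination of the spline on index.num and day, so lm() drops the redundant term. A quick way to inspect such perfect collinearity, as a sketch, is stats::alias():
# Show which term was dropped due to perfect collinearity with the others
alias(lm(seasonality_formula, data = prepared_hourly_tbl))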
Why BoxCox Transformations?
Briefly, a Box-Cox transformation tries to make our data normally distributed, or at least close to a normal distribution, by stabilizing its variance. Many models tend to perform better on roughly normal data, although some models perform perfectly well without it. Why do some models perform better under a normal distribution? Stabilizing the variance reduces the influence of extreme values, which gives the model a cleaner signal to work with.
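For reference, the textbook form of the Box-Cox transform is easy to write down in plain R. This is only a sketch; box_cox_vec() below selects lambda for us and handles the details:
# Classic Box-Cox transform and its inverse (lambda is chosen elsewhere)
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox_manual_inv <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}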
# We use lambda to transform back later
lambda <- 0.91150846258978
ts_boxcox_ads_tbl <- hourly_tbl %>%
mutate(ads_box_cox = box_cox_vec(ads, lambda = "auto"))
## box_cox_vec(): Using value for lambda: 0.91150846258978
ts_boxcox_ads_tbl %>%
ggplot(aes(x = time, y = ads_box_cox)) +
geom_line(size = 1, color = "#5C5C5C") +
geom_point(size = 0.5, color = "#5C5C5C") +
theme_bw() +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(
title = "Number of ads across time",
subtitle = "Box Cox Transformation"
) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
When fitting a model to a dataset we usually split the data into training, validation and test sets. To keep things visually simple, I will split our dataset into a training set (approximately 80% of the data) and a test set (approximately 20%).
How do we split a dataset in time series?
ts_boxcox_ads_tbl <- ts_boxcox_ads_tbl %>%
select(time, ads_box_cox)
# the last 2 days (roughly 22% of the data) form the testing set
# Splitting dataframe
splitted_df <- time_series_split(ts_boxcox_ads_tbl, assess = "2 days", cumulative = TRUE)
## Using date_var: time
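Unlike a random split, this split keeps chronological order: the most recent 2 days (48 of the 216 hours, about 22%) become the test set. If you prefer to express the split as a proportion, rsample's initial_time_split() is an alternative that also respects time order; a minimal sketch (alt_split, alt_train and alt_test are names introduced here for illustration):
# Alternative split: the last 20% of the rows (most recent hours) as the test set
alt_split <- rsample::initial_time_split(ts_boxcox_ads_tbl, prop = 0.8)
alt_train <- rsample::training(alt_split)
alt_test  <- rsample::testing(alt_split)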
new_colors_pal <- c("#27223C", "#D1362F")
# Extracting training and testing sets
train_df <- training(splitted_df)
test_df <- testing(splitted_df)
splitted_df %>%
tk_time_series_cv_plan() %>%
ggplot(aes(x = time, y = ads_box_cox, color = .key)) +
geom_line(size = 1) +
scale_color_manual(values = new_colors_pal) +
theme_bw() +
labs(
title = "Time Series Split",
subtitle = "BoxCox Value"
) +
geom_label(
label = "Training Set",
color = "#27223C",
vjust = 0.5,
aes(x= as_datetime(c("2017-09-16 21:00:00")),
y = 65000
)) +
geom_label(
label = "Testing Set",
color = "#D1362F",
vjust = 0.5,
aes(x = as_datetime(c("2017-09-20 21:00:00")),
y = 62500)
) +
theme(legend.position = "none") +
scale_x_datetime(expand = c(0,0)) +
geom_vline(xintercept = as_datetime(c("2017-09-20 00:00:00")), lty = "dashed")
As described on the official Prophet website: “Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.” Prophet is open source software designed by Facebook's Core Data Science team, and you can use it in both Python and R.
Would Prophet work well on our time series? Although we do not have extensive historical data, the series has strong seasonality: hours where we have a high number of ads and hours where we don't, with repetitive patterns. This should help the Prophet model perform well, but we will see in a second.
Summary:
- Forecast on the testing data: our forecast seems to capture most of the seasonality and trend of the time series.
- Metrics: as shown below, the metrics resemble what the comparison of predictions against the actual data shows us.
- Forecasting the next 5 days: the same seasonality patterns are kept, but interestingly we see a slight downward trend in the forecast, as shown by the trend line.
- Forecasting with the inverted Box-Cox transformation: to invert the Box-Cox transformation from the previous section, we use lambda to map the values back to the original scale. Remember, we used the transformation to bring the distribution of the observations closer to a normal distribution.
prophet_fit <- prophet_reg() %>%
set_engine("prophet") %>%
fit(ads_box_cox ~ time, data = train_df)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
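The messages above show that Prophet automatically disabled yearly and weekly seasonality because the series only covers about nine days. If we wanted to control this ourselves, prophet_reg() exposes the seasonality switches directly. A hedged sketch (prophet_daily_fit is a hypothetical name and this variant is not used further in the notebook):
# Variant with the seasonal components set explicitly
prophet_daily_fit <- prophet_reg(
  seasonality_daily  = TRUE,   # the 24-hour pattern seen in the ACF/PACF
  seasonality_weekly = FALSE,  # less than two weeks of history
  seasonality_yearly = FALSE
) %>%
  set_engine("prophet") %>%
  fit(ads_box_cox ~ time, data = train_df)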
model_tbl <- modeltime_table(
prophet_fit
)
calibration_tbl <- model_tbl %>%
modeltime_calibrate(
new_data = test_df
)
forecast_tbl <- calibration_tbl %>%
modeltime_forecast(actual_data = ts_boxcox_ads_tbl)
## Using '.calibration_data' to forecast.
forecast_tbl %>%
rename(
"time" = ".index",
"value" = ".value"
) %>%
ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) + geom_point(size=0.5) + theme_bw() + scale_color_manual(values=new_colors_pal) +
theme(legend.position = "bottom") +
labs(
title = "Forecasting with Prophet",
subtitle = "Number of Ads",
y = "BoxCox Transform Value"
) + scale_x_datetime(expand = c(0,0))
calibration_tbl %>%
modeltime_accuracy() %>%
pivot_longer(cols = mae:rsq) %>%
mutate(
name = name %>% as_factor() %>% fct_reorder(value) %>% fct_rev()
) %>%
ggplot(aes(x=name, y=value, fill=name)) + geom_col(color="black") +
theme_bw() + scale_fill_tableau() + theme(legend.position = "bottom") +
geom_label(aes(label = round(value,2)), size=3, color="white") +
labs(
title = "Metrics for evaluating Accuracy",
subtitle = "Prophet Model with BoxCox Transformation"
)
# New color palette for the visualization
# new_color_pal <- c("#456355", "#B62A3D")
# We retrain the model on the actual data
refit_tbl <- calibration_tbl %>%
modeltime_refit(data = ts_boxcox_ads_tbl)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
# inverting transformation
forecast_ads_tbl <- refit_tbl %>%
modeltime_forecast(h = "5 days", actual_data = ts_boxcox_ads_tbl)
forecast_ads_tbl <- forecast_ads_tbl %>%
mutate(.value = box_cox_inv_vec(.value, lambda = lambda),
.conf_lo = box_cox_inv_vec(.conf_lo, lambda = lambda),
.conf_hi = box_cox_inv_vec(.conf_hi, lambda = lambda))
forecast_ads_tbl %>%
rename(
"time" = ".index",
"value" = ".value"
) %>%
ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) +
theme_bw() + scale_color_manual(values = new_colors_pal) +
labs(
title = "Prophet Forecast for the next 5 days",
subtitle = "Number of Ads (Boxcox Inverted)",
y = "Number of Ads",
caption = str_glue("Lambda: {lambda}")
) +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
theme(legend.position = "bottom") + geom_smooth(method = "loess", color = "#191919", lty="dashed") + scale_x_datetime(expand = c(0,0))
## `geom_smooth()` using formula 'y ~ x'
Now we will talk about an interesting topic that could help improve our model's predictive power. Fourier series are a great way to deal with multiple seasonalities (which, perhaps surprisingly, is our case). Before going into what a Fourier series is, let's revisit what we know about our data so far.
Summary of our dataset:
- What did the lags tell us? As we saw in the ACF and PACF plots, there is a strong correlation every 24 hours, which means that at a specific hour the next day the number of ads will tend to be similar to the same hour on the previous day.
- 24-hour periods: this time interval will be important for determining the seasonal movements in our time series.
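As a preview of how that 24-hour period can be turned into model features, timetk's tk_augment_fourier() adds sine/cosine terms for a given period, which can later be used as extra regressors. A minimal sketch (the choice of K = 2 pairs of terms is arbitrary):
# Add Fourier terms for the daily (24-hour) cycle as candidate regressors
fourier_tbl <- ts_boxcox_ads_tbl %>%
  tk_augment_fourier(.date_var = time, .periods = 24, .K = 2)
glimpse(fourier_tbl)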