1 Executive Summary

Topics that will be discussed in this notebook:

  • Data Exploration through Feature Engineering
  • Interpreting ACF and PACF
  • Understanding Box-Cox Transformations
  • Implementation of the Prophet Library

2 Loading libraries

# Data Wrangling
library(tidyverse)
library(timetk)
library(lubridate)
library(tidyquant)

# Visualization 
library(ggthemes)
library(gghalves)
library(cowplot)

# Modeling 
library(modeltime)
library(prophet)
library(tidymodels)

# Pipeline 
library(workflows)

3 Introduction

3.1 Visualizing time series

We can clearly see some seasonality patterns at first glance of our time series visualization.

df <- read_csv("/Users/nguyenbuiminh/Desktop/coding_python/ml/time_series/ads.csv")

df
## # A tibble: 216 × 2
##    Time                  Ads
##    <chr>               <dbl>
##  1 2017-09-13 0:00:00  80115
##  2 2017-09-13 1:00:00  79885
##  3 2017-09-13 2:00:00  89325
##  4 2017-09-13 3:00:00 101930
##  5 2017-09-13 4:00:00 121630
##  6 2017-09-13 5:00:00 116475
##  7 2017-09-13 6:00:00 106495
##  8 2017-09-13 7:00:00 102795
##  9 2017-09-13 8:00:00 108055
## 10 2017-09-13 9:00:00 116125
## # ℹ 206 more rows
hourly_tbl <- df %>%
  set_names(names(.) %>% str_to_lower()) %>%
  mutate(
    time = ymd_hms(time)
  )

hourly_tbl
## # A tibble: 216 × 2
##    time                   ads
##    <dttm>               <dbl>
##  1 2017-09-13 00:00:00  80115
##  2 2017-09-13 01:00:00  79885
##  3 2017-09-13 02:00:00  89325
##  4 2017-09-13 03:00:00 101930
##  5 2017-09-13 04:00:00 121630
##  6 2017-09-13 05:00:00 116475
##  7 2017-09-13 06:00:00 106495
##  8 2017-09-13 07:00:00 102795
##  9 2017-09-13 08:00:00 108055
## 10 2017-09-13 09:00:00 116125
## # ℹ 206 more rows
hourly_tbl %>%
  ggplot(aes(x=time, y=ads)) + 
  geom_line(size = 1, color = "#4E79A7") +
  geom_point(size=0.5, color = "#2C3E50") +
  theme_bw() +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  labs(
    title = "Number of ads across time"
  ) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

3.2 Seasonality Patterns

Finding seasonality patterns is really important if we want a better grasp of what the time series is telling us. In this section we will split our time series by hour and weekday and see how its distribution behaves.

Summary:

  • Highest distribution of ads by hour: the highest distribution of ads occurs at hours 12, 18, and 19.
  • Lowest distribution of ads by hour: the lowest distribution of ads occurs, not surprisingly, during the early morning hours.
  • Highest distribution of ads by weekday: the highest distribution occurs on Sundays, followed by Saturdays.
  • Lowest distribution of ads by weekday: the lowest distribution occurs on Mondays and Thursdays, although there is not much variation across the business days.

The seasonal diagnostics plot comes from the timetk package.

hourly_tbl %>%
  plot_seasonal_diagnostics(
    .date_var = time, 
    .value = ads,
    .interactive = TRUE,
    .geom_color = "#4E79A7"
  )
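
Since the seasonal diagnostics plot above is interactive and its output is not embedded here, a minimal dplyr sketch (reusing hourly_tbl and lubridate's hour()) can double-check the hourly claims numerically:

# Median ads per hour of day, highest first
hourly_tbl %>%
  mutate(hour = hour(time)) %>%
  group_by(hour) %>%
  summarise(median_ads = median(ads)) %>%
  arrange(desc(median_ads))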

4 Raincloud Plots

Raincloud plots are a way to dig deeper into the distribution of a dataset. We use a combination of the following:

  • Boxplot
  • Histogram
  • Violin plot

Interpreting a boxplot: the box spans the interquartile range with the median marked inside it; the whiskers extend to the typical extremes, and points beyond them are flagged as outliers.

The histogram and the violin plot simply show the overall distribution of the observations in our dataset.

weekday_pal <- c("#DBB165", "#DEB18B", "#2E604A", "#27223C", "#D1362F", "#7496D2",
                 "#CECD7B")

gg<- hourly_tbl %>%
  mutate(
    weekday = wday(time, label = TRUE, abbr = FALSE),
    weekday = weekday %>% as_factor()
  ) %>%
  ggplot(aes(x=weekday, y=ads)) +
  geom_jitter(aes(color=weekday),
              alpha=0.5,
              size=0.5,
              show.legend=FALSE) +
  geom_half_violin(aes(fill=weekday),
                   side ="l",
                   alpha =0.45,
                   show.legend = FALSE,
                   trim =FALSE) +
  geom_half_boxplot(aes(fill=weekday),
                    side= "r",
                    outlier.size = 3,
                    outlier.color = "red",
                    width = 0.4,
                    alpha = 0.45,
                    show.legend = FALSE) +
  stat_summary(fun=mean, geom="line") +
  theme_bw() +
  labs(
    title = "Raincloud Plots",
    subtitle = "A more descriptive way of distributions"
  ) +
  scale_color_manual(values = weekday_pal) + 
  scale_fill_manual(values=weekday_pal)

gg

5 AutoCorrelation and Partial Autocorrelation

Returning to this topic as a refresher from the first part of this time series series, I would like to discuss the definitions of both autocorrelation (ACF) and partial autocorrelation (PACF) in a bit more detail. First, the reason I emphasize this is that lags at different points in time can be used as features to improve the predictive accuracy of our models. Before getting into this topic we must understand the following: past values can influence the behavior of future values in a time series. When this occurs, we say that the current value is correlated with past values of the series.

Definitions:

  • Autocorrelation (ACF): the correlation between the current value and past values of the series. The autocorrelation at a given lag also picks up indirect effects that flow through the intermediate lags.
  • Partial autocorrelation (PACF): in short, partial autocorrelation removes the effect of the lags that sit between the lags you are evaluating. In other words, at lag k it is the correlation that remains after removing the effect of correlations at shorter lags. For instance, to find the partial correlation between lag k and lag k-3, we first remove the effects of lags k-1 and k-2.

Formula:

\(\LARGE r_{k} = \frac{\sum_{i=1}^{N-k}(Y_{i} - \bar{Y})(Y_{i+k} - \bar{Y})} {\sum_{i=1}^{N}(Y_{i} - \bar{Y})^{2} }\)



Where:
Y = Target variable (in this case, number of ads)
r = Autocorrelation at lag k
k = Lag (can be lag 1, 2, 3, …)
N = Number of observations

Interpretation of ACF and PACF:
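
As noted later in this notebook, these diagnostics show a strong correlation at 24-hour lags. Below is a minimal sketch that generates both plots with timetk's plot_acf_diagnostics(); the 72-hour lag window is my assumption, any horizon works:

# ACF and PACF in one panel; white-noise bars mark significance bounds
hourly_tbl %>%
  plot_acf_diagnostics(
    .date_var = time,
    .value = ads,
    .lags = 72,                     # three days of hourly lags
    .show_white_noise_bars = TRUE,
    .interactive = FALSE
  )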

6 Feature Engineering

6.1 Extracting Features

Summary of the feature engineering process:

  • index.num: number of seconds from 1970-01-01 to the timestamp
  • diff: number of seconds between consecutive timestamps
  • half: which half of the year (1 or 2)
  • quarter: quarter of the year
  • month: month of the year
  • day: day of the month
  • hour: hour of the day
  • hour12: hour on a 12-hour clock (0-11)
  • wday: day of the week
  • mday: day of the month
  • qday: day of the quarter
  • yday: day of the year

Note: the features with the most significance will most probably be the hour and the weekday, but we will find out in the upcoming sections.

# Feature engineering

prepared_hourly_tbl <- hourly_tbl %>%
  tk_augment_timeseries_signature() %>%
  mutate(
    ads_transformed = log_interval_vec(ads, limit_lower = 0, offset = 1)
  ) %>%
  select(time, ads, ads_transformed, everything())

prepared_hourly_tbl %>%
  glimpse()
## Rows: 216
## Columns: 31
## $ time            <dttm> 2017-09-13 00:00:00, 2017-09-13 01:00:00, 2017-09-13 …
## $ ads             <dbl> 80115, 79885, 89325, 101930, 121630, 116475, 106495, 1…
## $ ads_transformed <dbl> -0.21910625, -0.22428453, -0.01340526, 0.26853004, 0.7…
## $ index.num       <dbl> 1505260800, 1505264400, 1505268000, 1505271600, 150527…
## $ diff            <dbl> NA, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 36…
## $ year            <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ year.iso        <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ half            <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ quarter         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ month           <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
## $ month.xts       <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
## $ month.lbl       <ord> September, September, September, September, September,…
## $ day             <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ hour            <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ minute          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ second          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ hour12          <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5…
## $ am.pm           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ wday            <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ wday.xts        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ wday.lbl        <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wednesday,…
## $ mday            <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ qday            <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75…
## $ yday            <int> 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,…
## $ mweek           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ week            <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ week.iso        <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ week2           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ week3           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ week4           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mday7           <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …

6.2 Linear and Non-Linear Regression

Concept of linear regression: the concept of linear regression can be confusing, especially for individuals just getting into the ML world. Linear regression finds "the best fitted line through a set of observations".

Metrics to know for linear regression:

  • Mean Squared Error (MSE): the average of the squared differences between predicted and actual values
  • Mean Absolute Error (MAE): the average of the absolute differences between actual and predicted values
  • Root Mean Squared Error (RMSE): the square root of the average squared difference between predicted and actual values
  • Adjusted R-Squared: measures the variance explained by only those independent variables that actually affect the dependent variable
  • Residuals: think of them as the "leftover" when subtracting predicted values from actual values
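
As a quick illustration of these metrics, here is a minimal sketch using yardstick (attached with tidymodels); the actual/predicted values are made-up toy numbers:

# Toy data, purely illustrative
toy_tbl <- tibble(
  actual    = c(10, 12, 15, 11, 14),
  predicted = c(11, 11, 14, 13, 14)
)

# mae(), rmse(), and rsq() all follow the (data, truth, estimate) pattern
metric_set(mae, rmse, rsq)(toy_tbl, truth = actual, estimate = predicted)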

Summary:

  • A linear regression on time alone explains essentially nothing (Adjusted R-squared of -0.0025 below).
  • Adding calendar features (day, weekday, hour) lifts the Adjusted R-squared to 0.2062.

# Plot 1

p1 <- prepared_hourly_tbl %>%
  plot_time_series_regression(
    time,
    .formula = ads_transformed ~ time, 
    .interactive = FALSE,
    .show_summary = TRUE,
    .title = "Linear Model Without Implementing Feature Engineering",
    .legend_show = FALSE,
    .line_color = "#D1362F",
    .smooth_color = "#EA3B46",
    .line_size = 1 
    
  )
## 
## Call:
## stats::lm(formula = .formula, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24614 -0.64355 -0.06383  0.65258  1.98943 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.413e+02  3.536e+02   0.682    0.496
## time        -1.597e-07  2.349e-07  -0.680    0.497
## 
## Residual standard error: 0.7748 on 214 degrees of freedom
## Multiple R-squared:  0.002155,   Adjusted R-squared:  -0.002508 
## F-statistic: 0.4622 on 1 and 214 DF,  p-value: 0.4973
p1

# Plot 2

seasonality_formula <- as.formula(
  ads_transformed ~ 
    splines::ns(index.num) + day + wday.lbl + hour
)

p2 <- prepared_hourly_tbl %>%
  plot_time_series_regression(
    time, 
    .formula = seasonality_formula,
    .interactive = FALSE, 
    .show_summary = TRUE,
    .title = "Linear Model With Feature Engineering",
    .legend_show = FALSE,
    .line_color = "#D1362F",
    .smooth_color = "#EA3B46",
    .line_size = 1
  )
## 
## Call:
## stats::lm(formula = .formula, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04837 -0.43868  0.05359  0.53015  1.30898 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            15.06905    2.06543   7.296 6.22e-12 ***
## splines::ns(index.num) 12.27906    1.81716   6.757 1.40e-10 ***
## day                    -1.12428    0.16388  -6.861 7.79e-11 ***
## wday.lbl.L             -0.20223    0.14506  -1.394  0.16476    
## wday.lbl.Q              0.42963    0.13186   3.258  0.00131 ** 
## wday.lbl.C             -0.12205    0.13776  -0.886  0.37667    
## wday.lbl^4              0.20997    0.13213   1.589  0.11355    
## wday.lbl^5             -0.06802    0.13006  -0.523  0.60155    
## wday.lbl^6              0.05198    0.11963   0.434  0.66439    
## hour                         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6894 on 207 degrees of freedom
## Multiple R-squared:  0.2358, Adjusted R-squared:  0.2062 
## F-statistic: 7.983 on 8 and 207 DF,  p-value: 2.306e-09
p2

plot_grid(p1, p2)

7 Data Preparation

7.1 Box Cox Transformations

Why Box Cox Transformations?

Briefly, the Box-Cox transformation tries to reshape our data into a normal distribution, or at least something close to one. A closer-to-normal, variance-stabilized series contains less noise, which allows our model to perform much better.

BoxCox Transformation Formula:

\(\LARGE W_{t} = \begin{cases} \log(Y_{t}) & \text{if } \lambda = 0 \\ \dfrac{Y_{t}^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \end{cases}\)
Where:
  • \(W =\) Transformed variable
  • \(Y =\) Target variable (number of ads)
  • \(t =\) Time period
  • \(\lambda =\) Transformation parameter

Note: lambda is important for two reasons. It is the parameter that lets us transform our time series back to its original values after modeling, and its value is chosen so that the transformed series is as close to a normal distribution as possible.

# We use lambda to transform back 

lambda <- 0.91150846258978

ts_boxcox_ads_tbl <- hourly_tbl %>%
  mutate(
    ads_box_cox = box_cox_vec(ads, lambda = "auto")
  )
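
Hard-coding the printed lambda is fragile; a hedged alternative, assuming timetk's auto_lambda() helper (the routine box_cox_vec() relies on when lambda = "auto"), captures it programmatically:

# Assumption: auto_lambda() is exported by timetk; Guerrero method is its default
lambda <- auto_lambda(hourly_tbl$ads)
lambda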


p1<- ts_boxcox_ads_tbl %>%
  ggplot(aes(x=time, y=ads_box_cox)) + geom_line(size=1, color = "#5C5C5C") + geom_point(size=0.5, color="#5C5C5C") +
  theme_bw() + labs(
    title = "Number of ads across time",
    subtitle = "Box Cox Transformation"
  ) +
  theme(plot.title = element_text(hjust=0.5, face="bold"))

p2<- ts_boxcox_ads_tbl %>%
  ggplot(aes(x=time, y=ads)) + geom_line(size=1, color = "#5C5C5C") + geom_point(size=0.5, color="#5C5C5C") +
  theme_bw() + labs(
    title = "Number of ads across time",
    subtitle = "Box Cox Transformation"
  ) +
  theme(plot.title = element_text(hjust=0.5, face="bold"))
plot_grid(p1,p2)

7.2 Time Series Split

ts_boxcox_ads_tbl <- ts_boxcox_ads_tbl%>%
  select(time, ads_box_cox)

splitted_df <- time_series_split(ts_boxcox_ads_tbl, assess = "2 days", cumulative=TRUE)

splitted_df
## <Analysis/Assess/Total>
## <168/48/216>
new_colors_pal <- c("#27223C", "#D1362F")

# Extracting training and test sets
train_df <- training(splitted_df)
test_df <- testing(splitted_df)


splitted_df %>% 
  tk_time_series_cv_plan() %>% 
  ggplot(aes(x=time, y = ads_box_cox, color=.key)) + geom_line(size=1) + 
  scale_color_manual(values = new_colors_pal) + theme_bw() + 
  labs(
    title = "Time Series Split",
    subtitle = "Boxcox Value"
  ) +  geom_label(
    label = "Training Set",
    color = "#27223C",
    vjust = 0.5,
    aes(x=as_datetime(c("2017-09-16 21:00:00")),
        y = 65000)) + 
  geom_label(
    label = "Testing Set",
    color = "#D1362F",
    vjust = 0.5,
    aes(x=as_datetime(c("2017-09-20 21:00:00")),
        y = 62500)) +
  theme(legend.position = "none") + scale_x_datetime(expand = c(0,0)) + 
  geom_vline(xintercept = as_datetime(c("2017-09-20 00:00:00")), lty="dashed")

8 Forecasting with Prophet

8.1 Prophet definition

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Summary: we fit a Prophet model to the Box-Cox transformed training data, calibrate it against the 2-day test set, and plot the forecast over the full series.

prophet_fit <- prophet_reg() %>%
  set_engine("prophet") %>%
  # Fit on the training set only; the test window is held out for calibration
  fit(ads_box_cox ~ time, data = train_df)


model_tbl <- modeltime_table(
  prophet_fit
)

calibration_tbl <- model_tbl %>%
  modeltime_calibrate(
    new_data = test_df
  )

forecast_tbl <- calibration_tbl %>%
  modeltime_forecast(actual_data = ts_boxcox_ads_tbl)

forecast_tbl %>% rename(
    "time" = ".index",
    "value" = ".value"
  ) %>% 
  ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) + geom_point(size=0.5) + theme_bw() + scale_color_manual(values=new_colors_pal) + 
  theme(legend.position = "bottom") + 
  labs(
    title = "Forecasting with Prophet",
    subtitle = "Number of Ads",
    y = "BoxCox Transform Value"
  ) + scale_x_datetime(expand = c(0,0)) 

8.2 Accuracy Evaluation

calibration_tbl %>% 
  modeltime_accuracy() %>% 
  pivot_longer(cols = mae:rsq) %>% 
  mutate(
    name = name %>% as_factor() %>% fct_reorder(value) %>% fct_rev()
  ) %>% 
  ggplot(aes(x=name, y=value, fill=name)) + geom_col(color="black") + 
  theme_bw() + scale_fill_tableau() + theme(legend.position = "bottom") + 
  geom_label(aes(label = round(value,2)), size=3, color="white") + 
  labs(
    title = "Metrics for evaluating Accuracy",
    subtitle = "Prophet Model with BoxCox Transformation"
  )

8.3 Forecast with inverted Box Cox

# New color palette for the visualization
# new_color_pal <- c("#456355", "#B62A3D")
# We retrain the model on the actual data
refit_tbl <- calibration_tbl %>% 
  modeltime_refit(data = ts_boxcox_ads_tbl)

# inverting transformation
forecast_ads_tbl <- refit_tbl %>% 
  modeltime_forecast(h = "5 days", actual_data = ts_boxcox_ads_tbl)

forecast_ads_tbl <- forecast_ads_tbl %>% 
  mutate(.value = box_cox_inv_vec(.value, lambda = lambda),
         .conf_lo = box_cox_inv_vec(.conf_lo, lambda = lambda),
         .conf_hi = box_cox_inv_vec(.conf_hi, lambda = lambda))


forecast_ads_tbl %>% 
    rename(
    "time" = ".index",
    "value" = ".value"
  )  %>% 
    ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) + 
  theme_bw() + scale_color_manual(values = new_colors_pal) + 
  labs(
    title = "Prophet Forecast for the next 5 days",
    subtitle = "Number of Ads (Boxcox Inverted)",
    y = "Number of Ads",
    caption = str_glue("Lambda: {lambda}")
  ) + 
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) + 
  theme(legend.position = "bottom") + geom_smooth(method = "loess", color = "#191919", lty="dashed") + scale_x_datetime(expand = c(0,0)) 


8.4 Introduction to Fourier Series

Now we will talk about an interesting topic that could help improve our model's predictive power. Fourier series are a great way to deal with multiple seasonalities (which, surprisingly enough, is the case here). Before going into what Fourier series are, let's revisit what we know about our data so far.

Summary of our Dataset:
  • What did the lags tell us? As seen in the ACF and PACF plots, there is a strong correlation every 24 hours, which means that at a given hour the next day, the number of ads tends to be similar to the same hour on the previous day. A quick numeric check follows this list.
  • 24-hour periods: this interval will be important for capturing the seasonal movements in our time series.
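
A minimal numeric check of that 24-hour claim, using a lagged copy of the series:

# Correlation between the series and itself 24 hours earlier
hourly_tbl %>%
  mutate(ads_lag24 = lag(ads, 24)) %>%
  summarise(acf_lag24 = cor(ads, ads_lag24, use = "complete.obs"))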


8.5 What are Fourier Series

Fourier series help us represent periodic patterns through sums of sine and cosine functions. To make it simpler for beginners, think of each term as a wave repeating over a specific period of time. We will use a 24-hour period, since the strong autocorrelation at the 24-hour interval suggests it will be helpful for our model.
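
To demystify what a single Fourier term looks like, here is a toy sketch (not timetk's exact internals) that builds and plots the K = 1 sine/cosine pair for a 24-hour period by hand:

# One full sine/cosine cycle every 24 hours, traced over two days
tibble(hour = 0:47) %>%
  mutate(
    sin24_K1 = sin(2 * pi * hour / 24),
    cos24_K1 = cos(2 * pi * hour / 24)
  ) %>%
  pivot_longer(-hour) %>%
  ggplot(aes(x = hour, y = value, color = name)) +
  geom_line(size = 1) +
  theme_bw() +
  labs(title = "Hand-built 24-hour Fourier pair (K = 1)")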

Reference for Fourier series: if you are interested in finding out more about Fourier series, I really enjoyed this explanation, but there are tons of other sources that explain why Fourier series are important. Fourier series contributors: Matt DeCross, Steve the Philosophist and Jlmln Khlm.

Summary:
  • 24-hour distribution: the 24-hour sine and cosine terms have a positive impact on the regression model.
  • V-shaped form: the distributions show a V shape, which makes sense since we are trying to replicate the seasonal movements present in our time series data.
  • Regression model without vs. with Fourier series: the Adjusted R-squared improved from 0.2062 to 0.8282, a massive improvement for our regression model.


Fourier Series representation of a periodic Function:
\(\LARGE f(t)= a_{v} + \sum_{n=1}^{\infty }\left[a_{n}\cos (n\omega _{0}t) + b_{n}\sin (n\omega _{0}t)\right]\)

Where:

\(\large n\) = integer sequence 1, 2, 3, …

\(\large a_v\) = average value (constant term)

\(\large a_n\) = Fourier coefficient

\(\large b_n\) = Fourier coefficient

\(\large \omega_{0} = \frac{2\pi }{T}\) = fundamental frequency

\(\large 2\omega _{0}, 3\omega _{0}, 4\omega _{0}\) = harmonic frequencies

More information on how Fourier series work in detail can be found in Learn about Fourier Coefficients by Donald Krambeck.

8.6 Fourier Distribution

fourier_hourly_tbl <- prepared_hourly_tbl %>%
  tk_augment_fourier(time, .periods = c(24), .K = 2)




fourier_col_palletes <- c("#D1362F", "#27223C", "#2E604A", "#E6A2C5")  

fourier_hourly_tbl %>%
  select(matches("(cos24)|(sin24)")) %>%
  pivot_longer(cols = 1:4) %>%
  ggplot(aes(x=value, color=name, fill=name)) +
  geom_density(alpha=0.3) +
  facet_wrap(~name, scales = "free_x") + theme_bw() + 
    theme(plot.title = element_text(face = "bold", color = "black", size=14),
        plot.subtitle = element_text(face = "bold", color = "black", size=12),
        axis.text = element_text(color = "black"), legend.text = element_text(size=10),
        legend.title = element_text(size = 12), legend.position = "bottom",
        strip.background =element_rect(fill="#76A08A"), strip.text = element_text(color="white", face="bold")) + 
  labs(
    title = "Fourier Series Distribution",
    subtitle = "24 hour Fourier Series"
  ) + scale_x_continuous(expand = c(0,0)) + 
  scale_color_manual("Fourier Legend", values = fourier_col_palletes) + 
  scale_fill_manual("Fourier Legend", values = fourier_col_palletes) + 
  guides(color = guide_legend(ncol=2, byrow=TRUE))

8.7 Adjusted R-Squared Improvement with Fourier Series

model_formula_fourier <- as.formula(
  ads ~ splines::ns(index.num) + day + wday.lbl + hour +
    time_sin24_K1 + time_cos24_K1 + time_sin24_K2 + time_cos24_K2
)

fourier_hourly_tbl %>%
  plot_time_series_regression(
    .date_var = time,
    .formula = model_formula_fourier,
    .show_summary = TRUE,
    .interactive = F,
    .line_size = 1
  ) + theme_bw() + 
  theme(plot.title = element_text(face = "bold", color = "black", size=14),
        plot.subtitle = element_text(face = "bold", color = "black", size=12),
        axis.text = element_text(color = "black"), legend.text = element_text(size=10),
        legend.title = element_text(size = 12), legend.position = "bottom") + 
  scale_color_manual(values = new_colors_pal) + scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) + 
  labs(
    title = "Regression Model using Fourier Series",
    subtitle = "Capturing seasonality with Fourier 24",
    y = "Number of Ads", 
    x = "Time of the Day"
  )
## 
## Call:
## stats::lm(formula = .formula, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23960  -7405  -1579   7269  29059 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              140725      69562   2.023 0.044381 *  
## splines::ns(index.num)     3709      61868   0.060 0.952255    
## day                       -1143       5547  -0.206 0.836997    
## wday.lbl.L                -4089       2378  -1.720 0.086997 .  
## wday.lbl.Q                13739       2161   6.357 1.33e-09 ***
## wday.lbl.C                -2701       2258  -1.196 0.232947    
## wday.lbl^4                 6592       2166   3.044 0.002645 ** 
## wday.lbl^5                -1216       2132  -0.570 0.569137    
## wday.lbl^6                 1227       1961   0.626 0.532261    
## hour                         NA         NA      NA       NA    
## time_sin24_K1            -17494       2062  -8.482 4.57e-15 ***
## time_cos24_K1            -26331       1112 -23.689  < 2e-16 ***
## time_sin24_K2              5002       1387   3.607 0.000391 ***
## time_cos24_K2            -11696       1112 -10.522  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11300 on 203 degrees of freedom
## Multiple R-squared:  0.8378, Adjusted R-squared:  0.8282 
## F-statistic: 87.35 on 12 and 203 DF,  p-value: < 2.2e-16

8.8 Adding Lagged Features

8.9 The Workflow Way

A workflow is a way of putting together the different steps of the modeling process:

  • Preprocessing
  • Modeling
  • Postprocessing

8.10 Feature Engineering with the Recipes package

# 22% is the testing set
# Splitting dataframe
prepared_hour_split_tbl <- time_series_split(prepared_hourly_tbl, assess = "2 days", cumulative = TRUE)

# Extracting training and test sets
train_df <- training(prepared_hour_split_tbl)
test_df <- testing(prepared_hour_split_tbl)


# Preparing our recipe 

recipe_book <- recipe(ads_transformed ~ ., data = train_df) %>%
  # Remove unnecessary features
  step_rm(matches("(iso)|(xts)|(quarter)|(year)|(month)|(qday)|(diff)")) %>% 
  step_normalize(matches("(index.num)|(yday)")) %>% 
  step_dummy(all_nominal_predictors(), one_hot =TRUE) %>%
  #Combine features
  step_interact(~matches("am.pm")*matches("wday.lbl")) %>%
  step_fourier(time, period = c(24, 48, 76), K = 2)

# A glimpse of our recipe 

recipe_book %>% prep() %>% juice() %>% glimpse()

8.11 Creating a Pipeline with Workflow

When creating a workflow pipeline we implement 3 main steps:

  • Setting up the model: we use parsnip for this; in this example it is a random forest model. We need to set the engine and the type of problem, here regression (it could also be used for classification problems).

  • Establishing the workflow: here we first add the model, and then the recipe we previously prepared. The recipe performs all the preprocessing steps on the training data.

  • Fitting the model on the training data: finally, we fit the model so it is ready to predict against the testing set.

rf_model <- rand_forest() %>%
  set_engine("randomForest") %>%
  set_mode("regression")

workflow_model <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(recipe_book) %>%
  fit(train_df) 

workflow_model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_rm()
## • step_normalize()
## • step_dummy()
## • step_interact()
## • step_fourier()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 0.01066798
##                     % Var explained: 98.3

Summary:

  • Mean of squared residuals: this is randomForest's built-in out-of-bag error metric, and as we can see it is quite low, meaning the model fits the series well.
  • Residuals plot: the residuals plot below backs up the small errors; the model captures the actual observations well.

predict_tbl <- workflow_model %>%
  predict(test_df) %>%
  bind_cols(test_df) %>%
  select(time, .pred, ads_transformed) %>%
  pivot_longer(.pred:ads_transformed)

predict_tbl %>%
  ggplot(aes(x=time, y=value)) + geom_line(aes(color=name), size=1) +
  theme_fivethirtyeight() +
  theme(plot.background = element_rect(fill="white"),
        panel.background = element_rect(fill="white"),
        legend.background=element_rect(fill="white"),
        legend.key=element_rect(fill="white")) +
  scale_color_manual(values = c("#D1362F", "#2E604A")) + labs(
    title = "Random Forest Model",
    subtitle = "Feature Engineering with the Recipes Package",
    caption = "Note: Cross Validation not implemented, prone to overfitting"
  )

8.12 Residuals Plot

residuals_tbl <- predict_tbl %>%
  pivot_wider(names_from = name, values_from = value) %>%
  mutate(
    residuals = ads_transformed - .pred
  )

residuals_tbl %>%
  ggplot(aes(x=time, y=residuals)) +
  geom_point(color="red") +
  geom_hline(yintercept=0, color="black", size=1.5) + 
  scale_y_continuous(limits = c(-0.4, 0.4)) +
  theme_fivethirtyeight() +
  theme(plot.background = element_rect(fill="white"),
        panel.background = element_rect(fill="white"),
        legend.key = element_rect(fill="white")) + scale_color_manual(values=c("#D1362F", "#2E604A")) +
  labs(
    title = "Residuals Plot",
    subtitle = "Random Forest Model",
    caption = "Mean of squared residuals: 0.01474173"
  )

8.13 Comparing Different Models

We compare two models:

  • Linear model

  • Random forest model

The random forest model looks much more accurate and outperforms the linear model; the explanation lies in the accuracy metrics shown above in this notebook.

horizon <- 24*3
lag_period <- 24*3
rolling_periods <- c(24, 48, 72)

final_split_df <- time_series_split(hourly_tbl, assess=horizon, cumulative=TRUE)

final_split_df %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(time, ads, .interactive=FALSE, .line_size=1) +
  theme_fivethirtyeight() +
  theme(plot.background = element_rect(fill="white"),
        panel.background = element_rect(fill="white"),
        legend.key = element_rect(fill="white")) + scale_color_manual(values=c("#D1362F", "#2E604A"))  +
  labs(
    title = "Splitting the original dataframe"
  )

# Preparing the linear model before combining it with other models

linear_model <- linear_reg() %>%
  set_engine("lm")

# 3 day horizon

horizon <- 24*3
lag_period <- 24*3
rolling_periods <- c(24, 48, 72)

#Splitting dataframe
prepared_hour_split_tbl <- time_series_split(prepared_hourly_tbl, assess = horizon, cumulative = TRUE)

# Extracting training and test sets
train_df <- training(prepared_hour_split_tbl)
test_df <- testing(prepared_hour_split_tbl)


recipe_base <- recipe(ads_transformed ~ . , data = train_df) %>%
  # Remove unnecessary features
  step_rm(matches("(iso)|(xts)|(quarter)|(year)|(month)|(qday)|(diff)")) %>% 
  step_normalize(matches("(index.num)|(yday)")) %>% 
  step_dummy(all_nominal(), one_hot = TRUE) %>% 
  # Combine features
  step_interact(~ matches("am.pm") * matches("wday.lbl")) %>% 
  step_fourier(time, period = c(24, 48, 76), K=2) 
  


first_recipes <- recipe_base %>% 
  step_rm(time) %>% 
  step_ns(ends_with("index.num"), deg_free=2) 

# Building the workflow
lm_workflow_fitted <- workflow() %>%
  add_model(linear_model) %>%
  add_recipe(first_recipes) %>%
  fit(train_df)

# modeltime

final_calibration_tbl <- modeltime_table(
  workflow_model,
  lm_workflow_fitted
) %>%
  modeltime_calibrate(
    new_data = test_df
  )

final_calibration_tbl %>% 
  modeltime_forecast(new_data = test_df,
                     actual_data = prepared_hourly_tbl) %>% 
  plot_modeltime_forecast(.interactive = F, .line_size=1) + 
    theme_fivethirtyeight() + 
      theme(plot.background = element_rect(fill="white"),
                                  panel.background = element_rect(fill="white"),
                                  legend.background = element_rect(fill="white"),
                                  legend.key = element_rect(fill="white"),
            strip.background =element_rect(fill="#F8FAFB")) + 
  scale_color_calc() + 
  labs(
    title = "Measuring accuracy of our Models", 
    subtitle = "Linear and Random Forest Models"
  )

accuracy_tbl <- final_calibration_tbl %>%
  modeltime_accuracy()
accuracy_tbl %>%
  pivot_longer(mae:rsq) %>%
  mutate(
    name = name %>% as_factor() %>% fct_reorder(value) %>% fct_rev(),
    value_txt = round(value, 2) %>% as.character()
  ) %>%
  ggplot(aes(x=name, y=value)) + geom_col(aes(fill=.model_desc), color="black") + facet_wrap(~.model_desc) +
  scale_fill_calc() + 
  geom_label(aes(label=value_txt), size=3, color="black") +
  theme_fivethirtyeight() +
  theme(legend.position = "bottom",
        plot.background=element_rect(fill="white"),
        panel.background=element_rect(fill="white"),
        legend.background = element_rect(fill="white"),
        legend.key=element_rect(fill="white")) +
  labs(
    title = "Comparison of Model Accuracy"
  )

9 Transformations

# Preparing our data

hourly_tbl <- df %>%
  set_names(names(.) %>% str_to_lower()) %>%
  mutate(
    time = ymd_hms(time)
  ) %>% as_tibble()

# Find mean and std of logs
# This is to do the inversion at the end 

log_ads <- hourly_tbl %>%
  mutate(ads_log = log(ads))

inversion_values <- log_ads %>%
  summarise(
    std_mean = mean(ads_log),
    std_mean = format(round(std_mean, 13), nsmall = 13),
    std_sd = sd(ads_log),
    std_sd = format(round(std_sd, 13), nsmall = 13)
  )

inversion_values
## # A tibble: 1 × 2
##   std_mean         std_sd         
##   <chr>            <chr>          
## 1 11.6850111412288 0.2350963087660
# Values for inverting our time series target variable (ads)

std_mean <- inversion_values$std_mean
std_sd <- inversion_values$std_sd

# Transformed table
prepared_hourly_tbl <- hourly_tbl %>%
  mutate(
    ads_transformed = log(ads),
    ads_transformed = standardize_vec(ads_transformed)
  )
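
When predictions later need to go back to the original ad counts, here is a minimal inversion sketch, assuming timetk's standardize_inv_vec(); note that std_mean and std_sd were stored as formatted strings above, so they must be coerced back to numeric first:

# Invert the standardization, then the log
prepared_hourly_tbl %>%
  mutate(
    ads_back = standardize_inv_vec(
      ads_transformed,
      mean = as.numeric(std_mean),
      sd   = as.numeric(std_sd)
    ),
    ads_back = exp(ads_back)  # undo log()
  ) %>%
  select(time, ads, ads_back)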

horizon <- 24*3

# Split the prepared data table

splits <- time_series_split(prepared_hourly_tbl, assess = horizon, cumulative = TRUE)

9.1 Setting our Models

# 1. Prophet

model_fit_prophet <- prophet_reg(seasonality_daily = TRUE) %>% 
  set_engine("prophet") %>%
  set_mode("regression")


# 2. Random Forest 
model_spec_rf <- rand_forest(trees = 500, min_n = 50) %>% 
  set_engine("randomForest")%>%
  set_mode("regression")

9.2 Preprocessing for Feature Engineering

recipe_spec_base <- recipe(ads_transformed ~ time, training(splits)) %>% 
  step_timeseries_signature(time) %>% 
  step_rm(matches("(iso)|(xts)|(second)|(year)|(quarter)|(month)")) %>% 
  step_fourier(time, period = c(12, 24, 36, 48), K=2) %>% 
  step_dummy(all_nominal())



recipe_spec_base %>% prep() %>% juice() %>% glimpse()
## Rows: 144
## Columns: 41
## $ time            <dttm> 2017-09-13 00:00:00, 2017-09-13 01:00:00, 2017-09-13 …
## $ ads_transformed <dbl> -1.67502740, -1.68725644, -1.21216050, -0.65066763, 0.…
## $ time_index.num  <dbl> 1505260800, 1505264400, 1505268000, 1505271600, 150527…
## $ time_half       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ time_day        <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ time_hour       <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ time_minute     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_hour12     <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5…
## $ time_am.pm      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ time_wday       <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ time_mday       <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ time_qday       <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75…
## $ time_yday       <int> 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,…
## $ time_mweek      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ time_week       <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ time_week2      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_week3      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_week4      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_mday7      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ time_sin12_K1   <dbl> -1.913562e-11, 5.000000e-01, 8.660254e-01, 1.000000e+0…
## $ time_cos12_K1   <dbl> 1.000000e+00, 8.660254e-01, 5.000000e-01, 2.169896e-11…
## $ time_sin12_K2   <dbl> -3.827123e-11, 8.660254e-01, 8.660254e-01, 4.339792e-1…
## $ time_cos12_K2   <dbl> 1.0, 0.5, -0.5, -1.0, -0.5, 0.5, 1.0, 0.5, -0.5, -1.0,…
## $ time_sin24_K1   <dbl> -9.567808e-12, 2.588190e-01, 5.000000e-01, 7.071068e-0…
## $ time_cos24_K1   <dbl> 1.000000e+00, 9.659258e-01, 8.660254e-01, 7.071068e-01…
## $ time_sin24_K2   <dbl> -1.913562e-11, 5.000000e-01, 8.660254e-01, 1.000000e+0…
## $ time_cos24_K2   <dbl> 1.000000e+00, 8.660254e-01, 5.000000e-01, 2.169896e-11…
## $ time_sin36_K1   <dbl> -8.660254e-01, -9.396926e-01, -9.848078e-01, -1.000000…
## $ time_cos36_K1   <dbl> -5.000000e-01, -3.420201e-01, -1.736482e-01, 2.468290e…
## $ time_sin36_K2   <dbl> 8.660254e-01, 6.427876e-01, 3.420201e-01, -4.936580e-1…
## $ time_cos36_K2   <dbl> -0.5000000, -0.7660444, -0.9396926, -1.0000000, -0.939…
## $ time_sin48_K1   <dbl> -4.783904e-12, 1.305262e-01, 2.588190e-01, 3.826834e-0…
## $ time_cos48_K1   <dbl> 1.000000e+00, 9.914449e-01, 9.659258e-01, 9.238795e-01…
## $ time_sin48_K2   <dbl> -9.567808e-12, 2.588190e-01, 5.000000e-01, 7.071068e-0…
## $ time_cos48_K2   <dbl> 1.000000e+00, 9.659258e-01, 8.660254e-01, 7.071068e-01…
## $ time_wday.lbl_1 <dbl> 2.855301e-17, 2.855301e-17, 2.855301e-17, 2.855301e-17…
## $ time_wday.lbl_2 <dbl> -0.4364358, -0.4364358, -0.4364358, -0.4364358, -0.436…
## $ time_wday.lbl_3 <dbl> 5.779362e-17, 5.779362e-17, 5.779362e-17, 5.779362e-17…
## $ time_wday.lbl_4 <dbl> 0.4834938, 0.4834938, 0.4834938, 0.4834938, 0.4834938,…
## $ time_wday.lbl_5 <dbl> -1.231475e-15, -1.231475e-15, -1.231475e-15, -1.231475…
## $ time_wday.lbl_6 <dbl> -0.6579517, -0.6579517, -0.6579517, -0.6579517, -0.657…

9.3 Adding Workflows

# 1. Workflow Random Forest

workflow_fit_rf <- workflow() %>%
  add_model(model_spec_rf) %>%
  add_recipe(recipe_spec_base %>% step_rm(time)) %>%
  fit(training(splits))


# 2. Prophet Workflow
workflow_fit_prophet <- workflow() %>% 
  add_model(model_fit_prophet) %>% 
  add_recipe(recipe_spec_base) %>% 
  fit(training(splits))



# Combine models together
# Combined models with modeltime

model_table <- modeltime_table(
  workflow_fit_rf,
  workflow_fit_prophet
)

# Building the calibration table
calibration_table <- model_table %>% 
  modeltime_calibrate(testing(splits))

calibration_table
## # Modeltime Table
## # A tibble: 2 × 5
##   .model_id .model     .model_desc           .type .calibration_data
##       <int> <list>     <chr>                 <chr> <list>           
## 1         1 <workflow> RANDOMFOREST          Test  <tibble [72 × 4]>
## 2         2 <workflow> PROPHET W/ REGRESSORS Test  <tibble [72 × 4]>

9.4 Evaluate against the test set

calibration_table %>%
  modeltime_forecast(actual_data = prepared_hourly_tbl) %>%
  plot_modeltime_forecast(.interactive = TRUE)
# Accuracy of testing set

calibration_table %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = TRUE)

9.5 Forecasting

calibration_table %>%
  modeltime_refit(prepared_hourly_tbl) %>%
  modeltime_forecast(h=horizon, actual_data = prepared_hourly_tbl) %>%
  plot_modeltime_forecast(.interactive= TRUE, .line_size = 1)