Topics that will be discussed in this notebook:

- Data Exploration through Feature Engineering
- Interpreting ACF and PACF
- Understanding BoxCox Transformations
- Implementation of the Prophet Library
We can clearly see some seasonality patterns at a first glance at our time series visualization.
## # A tibble: 216 × 2
## Time Ads
## <chr> <dbl>
## 1 2017-09-13 0:00:00 80115
## 2 2017-09-13 1:00:00 79885
## 3 2017-09-13 2:00:00 89325
## 4 2017-09-13 3:00:00 101930
## 5 2017-09-13 4:00:00 121630
## 6 2017-09-13 5:00:00 116475
## 7 2017-09-13 6:00:00 106495
## 8 2017-09-13 7:00:00 102795
## 9 2017-09-13 8:00:00 108055
## 10 2017-09-13 9:00:00 116125
## # ℹ 206 more rows
hourly_tbl <- df %>%
  set_names(names(.) %>% str_to_lower()) %>%  # lowercase the column names
  mutate(
    time = ymd_hms(time)                      # parse the timestamp as a datetime
  )
hourly_tbl

## # A tibble: 216 × 2
## time ads
## <dttm> <dbl>
## 1 2017-09-13 00:00:00 80115
## 2 2017-09-13 01:00:00 79885
## 3 2017-09-13 02:00:00 89325
## 4 2017-09-13 03:00:00 101930
## 5 2017-09-13 04:00:00 121630
## 6 2017-09-13 05:00:00 116475
## 7 2017-09-13 06:00:00 106495
## 8 2017-09-13 07:00:00 102795
## 9 2017-09-13 08:00:00 108055
## 10 2017-09-13 09:00:00 116125
## # ℹ 206 more rows
hourly_tbl %>%
ggplot(aes(x=time, y=ads)) +
geom_line(size = 1, color = "#4E79A7") +
geom_point(size=0.5, color = "#2C3E50") +
theme_bw() +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(
title = "Number of ads across time"
) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Identifying seasonality patterns is really important if we want a better grasp of what the time series is telling us. In this section we will split our time series by hour and by weekday and see how the distribution of the time series behaves.
Summary:

- Highest distribution of ads by hour: the highest volume of ads occurs at hours 12, 18, and 19.
- Lowest distribution of ads by hour: not surprisingly, the lowest volume of ads occurs during the early morning hours.
- Highest distribution of ads by weekday: the highest volume of ads occurs on Sundays, followed by Saturdays.
- Lowest distribution of ads by weekday: the lowest volume of ads occurs on Mondays and Thursdays, although there is not much variation across the business days.
The seasonal diagnostics plot comes from the timetk package.
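A minimal sketch to reproduce it with our data, using timetk's plot_seasonal_diagnostics():

# Distribution of ads by hour, weekday, and other calendar features
hourly_tbl %>%
  plot_seasonal_diagnostics(
    .date_var    = time,
    .value       = ads,
    .interactive = FALSE
  )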
Raincloud plots are a way to dig deeper into the distribution of a dataset. We use a combination of the following:

- Boxplot
- Jittered points
- Violin plot
Interpreting a boxplot
The jittered points and the violin plot show the overall distribution of the observations in our dataset.
weekday_pal <- c("#DBB165", "#DEB18B", "#2E604A", "#27223C", "#D1362F", "#7496D2",
"#CECD7B")
gg <- hourly_tbl %>%
  mutate(
    weekday = wday(time, label = TRUE, abbr = FALSE),  # lubridate weekday labels
    weekday = weekday %>% as_factor()
  ) %>%
ggplot(aes(x=weekday, y=ads)) +
geom_jitter(aes(color=weekday),
alpha=0.5,
size=0.5,
show.legend=FALSE) +
geom_half_violin(aes(fill=weekday),
side ="l",
alpha =0.45,
show.legend = FALSE,
trim =FALSE) +
geom_half_boxplot(aes(fill=weekday),
side= "r",
outlier.size = 3,
outlier.color = "red",
width = 0.4,
alpha = 0.45,
show.legend = FALSE) +
stat_summary(fun=mean, geom="line") +
theme_bw() +
labs(
title = "Raincloud Plots",
subtitle = "A more descriptive way of distributions"
) +
scale_color_manual(values = weekday_pal) +
scale_fill_manual(values=weekday_pal)
gg

Going back to this topic as a refresher from the first part of this series, I would like to discuss in a bit more detail the definitions of both autocorrelation (ACF) and partial autocorrelation (PACF). First, let me say that the reason I emphasize this is that lags at different points in time could be used as features to improve the predictive accuracy of our models. Before getting into this topic we must understand the following: it is possible that past values have an influence on the behavior of future values in a time series. When this occurs, we say that the current value has a correlation with past values in the time series.
Definitions:

- Autocorrelation (ACF): autocorrelation is simply the correlation between the current value and past values. Autocorrelation at a given lag also carries an indirect effect, because the intermediate lags are correlated with each other.
- Partial autocorrelation (PACF): in short, partial autocorrelation removes the effect of the lags that lie between the lags you are evaluating. In other words, at lag k it is the correlation that remains after removing the effect of the correlations at shorter lags. For instance, the PACF at lag 3 is the correlation between the current value and the value three steps back, after accounting for lags 1 and 2.
Formula:
\(\LARGE r_{k} = \frac{\sum_{i=1}^{N-k}(Y_{i} - \bar{Y})(Y_{i+k} - \bar{Y})} {\sum_{i=1}^{N}(Y_{i} - \bar{Y})^{2} }\)
Where:

\(\large i\) = time index

\(\large Y\) = target variable (in this case, number of ads)

\(\large \bar{Y}\) = mean of the series

\(\large r_k\) = autocorrelation at lag \(k\)

\(\large k\) = lag (can be lag 1, 2, 3, …)

\(\large N\) = number of observations
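To see both quantities for our series, a minimal sketch using timetk's plot_acf_diagnostics() (base R's acf() and pacf() show the same information):

# ACF and PACF of the ads series across two full days of hourly lags
hourly_tbl %>%
  plot_acf_diagnostics(
    .date_var    = time,
    .value       = ads,
    .lags        = 48,
    .interactive = FALSE
  )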
Summary of the feature engineering process:

- index.num: number of seconds passed since 1970-01-01 up to the timestamp
- diff: number of seconds passed between timestamps
- half: which half of the year (1 or 2)
- quarter: quarter of the year
- month: month of the year
- day: day of the month
- hour: hour of the day
- hour12: hour on a 0-12 scale
- weekday: day of the week
- mday: day of the month
- qday: day of the quarter
- yday: day of the year
Note: the features that will most likely have the most significance are hour and weekday, but we will find out in the upcoming sections.
# Feature engineering
prepared_hourly_tbl <- hourly_tbl %>%
  tk_augment_timeseries_signature() %>%  # add calendar features from the time column
  mutate(
    # log-interval transform bounds the series so forecasts stay positive
    ads_transformed = log_interval_vec(ads, limit_lower = 0, offset = 1)
  ) %>%
  select(time, ads, ads_transformed, everything())
prepared_hourly_tbl %>%
glimpse()

## Rows: 216
## Columns: 31
## $ time <dttm> 2017-09-13 00:00:00, 2017-09-13 01:00:00, 2017-09-13 …
## $ ads <dbl> 80115, 79885, 89325, 101930, 121630, 116475, 106495, 1…
## $ ads_transformed <dbl> -0.21910625, -0.22428453, -0.01340526, 0.26853004, 0.7…
## $ index.num <dbl> 1505260800, 1505264400, 1505268000, 1505271600, 150527…
## $ diff <dbl> NA, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 36…
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ year.iso <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ half <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ quarter <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ month <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
## $ month.xts <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
## $ month.lbl <ord> September, September, September, September, September,…
## $ day <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ minute <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ second <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ hour12 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5…
## $ am.pm <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ wday <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ wday.xts <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ wday.lbl <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wednesday,…
## $ mday <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ qday <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75…
## $ yday <int> 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,…
## $ mweek <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ week <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ week.iso <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ week2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ week3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ week4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ mday7 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
Concept of Linear Regression: the concept of linear regression can be confusing, especially for individuals getting into the ML world. Linear regression finds "the best-fitting line through a set of observations".
Metrics to know for linear regression:

- Mean Squared Error (MSE): the average of the squared differences between predicted and actual values
- Mean Absolute Error (MAE): the average of the absolute differences between actual and predicted values
- Root Mean Squared Error (RMSE): the square root of the average of the squared differences between predicted and actual values
- Adjusted R-squared: measures the variance explained, counting only the independent variables that have a real effect on the dependent variable
- Residuals: think of them as the "leftover" after subtracting the predicted values from the actual values
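To make these definitions concrete, a tiny hand-rolled sketch with toy numbers (the yardstick package offers mae(), rmse(), and rsq() if you prefer ready-made functions):

# Toy vectors to illustrate the error metrics
actual    <- c(10, 12, 15, 11, 14)
predicted <- c(11, 11, 14, 13, 14)
residuals <- actual - predicted  # the "leftover"

mse  <- mean(residuals^2)        # Mean Squared Error
mae  <- mean(abs(residuals))     # Mean Absolute Error
rmse <- sqrt(mse)                # Root Mean Squared Error
c(mse = mse, mae = mae, rmse = rmse)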
Summary: below we compare a linear regression model based on time alone against one that adds the engineered seasonal features.
# Plot 1
p1 <- prepared_hourly_tbl %>%
plot_time_series_regression(
time,
.formula = ads_transformed ~ time,
.interactive = FALSE,
.show_summary = TRUE,
.title = "Linear Model Without Implementing Feature Engineering",
.legend_show = FALSE,
.line_color = "#D1362F",
.smooth_color = "#EA3B46",
.line_size = 1
)

##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24614 -0.64355 -0.06383 0.65258 1.98943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.413e+02 3.536e+02 0.682 0.496
## time -1.597e-07 2.349e-07 -0.680 0.497
##
## Residual standard error: 0.7748 on 214 degrees of freedom
## Multiple R-squared: 0.002155, Adjusted R-squared: -0.002508
## F-statistic: 0.4622 on 1 and 214 DF, p-value: 0.4973
# Plot 2
seasonality_formula <- as.formula(
ads_transformed ~
splines::ns(index.num) + day + wday.lbl + hour
)
p2 <- prepared_hourly_tbl %>%
plot_time_series_regression(
time,
.formula = seasonality_formula,
.interactive = FALSE,
.show_summary = TRUE,
.title = "Linear Model With Feature Engineering",
.legend_show = FALSE,
.line_color = "#D1362F",
.smooth_color = "#EA3B46",
.line_size = 1
)

##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04837 -0.43868 0.05359 0.53015 1.30898
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.06905 2.06543 7.296 6.22e-12 ***
## splines::ns(index.num) 12.27906 1.81716 6.757 1.40e-10 ***
## day -1.12428 0.16388 -6.861 7.79e-11 ***
## wday.lbl.L -0.20223 0.14506 -1.394 0.16476
## wday.lbl.Q 0.42963 0.13186 3.258 0.00131 **
## wday.lbl.C -0.12205 0.13776 -0.886 0.37667
## wday.lbl^4 0.20997 0.13213 1.589 0.11355
## wday.lbl^5 -0.06802 0.13006 -0.523 0.60155
## wday.lbl^6 0.05198 0.11963 0.434 0.66439
## hour NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6894 on 207 degrees of freedom
## Multiple R-squared: 0.2358, Adjusted R-squared: 0.2062
## F-statistic: 7.983 on 8 and 207 DF, p-value: 2.306e-09
Why BoxCox Transformations?

Briefly, a BoxCox transformation tries to reshape our data into a normal distribution, or at least something close to one. A roughly normal, variance-stabilized series carries less noise, which allows our models to perform much better.
BoxCox Transformation Formula:
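The standard one-parameter form of the transformation is:

\(\LARGE y^{(\lambda)} = \begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases}\)

Where \(\large y\) is the original value and \(\large \lambda\) is the transformation parameter.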
Note: lambda is important because it is the parameter that lets us transform our time series back to its original values after performing the transformation. Moreover, lambda matters because the value that is picked is the one that brings the distribution closest to normal.
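A quick round-trip sketch shows how lambda lets us move between the two scales (timetk's auto_lambda() estimates lambda from the data, which is what lambda = "auto" does below):

# Estimate lambda, transform, then invert back to the original scale
lam         <- auto_lambda(hourly_tbl$ads)
transformed <- box_cox_vec(hourly_tbl$ads, lambda = lam)
recovered   <- box_cox_inv_vec(transformed, lambda = lam)
all.equal(recovered, hourly_tbl$ads)  # TRUE, up to floating point error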
# We use lambda to transform back
lambda <- 0.91150846258978
ts_boxcox_ads_tbl <- hourly_tbl %>%
mutate(
ads_box_cox = box_cox_vec(ads, lambda = "auto")
)
p1<- ts_boxcox_ads_tbl %>%
ggplot(aes(x=time, y=ads_box_cox)) + geom_line(size=1, color = "#5C5C5C") + geom_point(size=0.5, color="#5C5C5C") +
theme_bw() + labs(
title = "Number of ads across time",
subtitle = "Box Cox Transformation"
) +
theme(plot.title = element_text(hjust=0.5, face="bold"))
p2 <- ts_boxcox_ads_tbl %>%
  ggplot(aes(x=time, y=ads)) + geom_line(size=1, color = "#5C5C5C") + geom_point(size=0.5, color="#5C5C5C") +
  theme_bw() + labs(
    title = "Number of ads across time",
    subtitle = "Original Values"
  ) +
  theme(plot.title = element_text(hjust=0.5, face="bold"))

ts_boxcox_ads_tbl <- ts_boxcox_ads_tbl %>%
select(time, ads_box_cox)
splitted_df <- time_series_split(ts_boxcox_ads_tbl, assess = "2 days", cumulative=TRUE)
splitted_df

## <Analysis/Assess/Total>
## <168/48/216>
new_colors_pal <- c("#27223C", "#D1362F")
# Extracting training and test sets
train_df <- training(splitted_df)
test_df <- testing(splitted_df)
splitted_df %>%
tk_time_series_cv_plan() %>%
ggplot(aes(x=time, y = ads_box_cox, color=.key)) + geom_line(size=1) +
scale_color_manual(values = new_colors_pal) + theme_bw() +
labs(
title = "Time Series Split",
subtitle = "Boxcox Value"
) + geom_label(
label = "Training Set",
color = "#27223C",
vjust = 0.5,
aes(x=as_datetime(c("2017-09-16 21:00:00")),
y = 65000)) +
geom_label(
label = "Testing Set",
color = "#D1362F",
vjust = 0.5,
aes(x=as_datetime(c("2017-09-20 21:00:00")),
y = 62500)) +
theme(legend.position = "none") + scale_x_datetime(expand = c(0,0)) +
geom_vline(xintercept = as_datetime(c("2017-09-20 00:00:00")), lty="dashed")

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Summary:
prophet_fit <- prophet_reg() %>%
  set_engine("prophet") %>%
  fit(ads_box_cox ~ time, data = train_df)  # fit on the training split only
model_tbl <- modeltime_table(
prophet_fit
)
calibration_tbl <- model_tbl %>%
modeltime_calibrate(
new_data = test_df
)
forecast_tbl <- calibration_tbl %>%
modeltime_forecast(actual_data = ts_boxcox_ads_tbl)
forecast_tbl %>% rename(
"time" = ".index",
"value" = ".value"
) %>%
ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) + geom_point(size=0.5) + theme_bw() + scale_color_manual(values=new_colors_pal) +
theme(legend.position = "bottom") +
labs(
title = "Forecasting with Prophet",
subtitle = "Number of Ads",
y = "BoxCox Transform Value"
) + scale_x_datetime(expand = c(0,0))

calibration_tbl %>%
modeltime_accuracy() %>%
pivot_longer(cols = mae:rsq) %>%
mutate(
name = name %>% as_factor() %>% fct_reorder(value) %>% fct_rev()
) %>%
ggplot(aes(x=name, y=value, fill=name)) + geom_col(color="black") +
theme_bw() + scale_fill_tableau() + theme(legend.position = "bottom") +
geom_label(aes(label = round(value,2)), size=3, color="white") +
labs(
title = "Metrics for evaluating Accuracy",
subtitle = "Prophet Model with BoxCox Transformation"
)

# New color palette for the visualization
# new_color_pal <- c("#456355", "#B62A3D")
# We retrain the model on the actual data
refit_tbl <- calibration_tbl %>%
modeltime_refit(data = ts_boxcox_ads_tbl)
# inverting transformation
forecast_ads_tbl <- refit_tbl %>%
modeltime_forecast(h = "5 days", actual_data = ts_boxcox_ads_tbl)
forecast_ads_tbl <- forecast_ads_tbl %>%
mutate(.value = box_cox_inv_vec(.value, lambda = lambda),
.conf_lo = box_cox_inv_vec(.conf_lo, lambda = lambda),
.conf_hi = box_cox_inv_vec(.conf_hi, lambda = lambda))
forecast_ads_tbl %>%
rename(
"time" = ".index",
"value" = ".value"
) %>%
ggplot(aes(x=time, y=value, color=.key)) + geom_line(size=1) +
theme_bw() + scale_color_manual(values = new_colors_pal) +
labs(
title = "Prophet Forecast for the next 5 days",
subtitle = "Number of Ads (Boxcox Inverted)",
y = "Number of Ads",
caption = str_glue("Lambda: {lambda}")
) +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
theme(legend.position = "bottom") + geom_smooth(method = "loess", color = "#191919", lty="dashed") + scale_x_datetime(expand = c(0,0))
# Fourier Series
Now we will talk about an interesting topic that could help improve our model's predictive power. Fourier series are a great way to deal with multiple seasonalities (which, surprisingly enough, is our case here). Before going into what a Fourier series is, let's revisit our data with what we know so far.
Summary of our Dataset:

Fourier series help us represent periodic patterns through the use of sine and cosine functions. To make it simpler for beginner users, think of them as describing the shape of the series over a specific period of time. We will use a 24-hour period, since we have strong autocorrelation at the 24-hour interval and it therefore seems it will be helpful for our model.
Reference for Fourier Series: if you are interested in finding out more about Fourier series, I really enjoyed this explanation, but there are tons of other sources that explain why Fourier series are important: Fourier Series by contributors Matt DeCross, Steve the Philosophist, and Jlmln Khlm.
Fourier Series representation of a periodic function:

\(\LARGE f(t) = a_{v} + \sum_{n=1}^{\infty}\left[ a_{n}\cos(n\omega_{0}t) + b_{n}\sin(n\omega_{0}t) \right]\)

Where:

\(\large n\) = integer sequence 1, 2, 3, …

\(\large a_v\) = Fourier coefficient (average value of the function)

\(\large a_n\) = Fourier coefficient

\(\large b_n\) = Fourier coefficient

\(\large \omega_0 = \frac{2\pi}{T}\) = fundamental frequency

\(\large 2\omega_0, 3\omega_0, 4\omega_0, \ldots\) = harmonic frequencies
More information on how Fourier series work in detail can be found in Learn about Fourier Coefficients by Donald Krambeck.
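To demystify what tk_augment_fourier() adds below, here is a hand-rolled sketch of the 24-hour sine and cosine terms for K = 1 and K = 2; since our observations are hourly, timetk's columns are equivalent to these up to a phase shift:

# Manually building 24-hour Fourier features (harmonics K = 1 and K = 2)
manual_fourier_tbl <- hourly_tbl %>%
  mutate(
    hours_elapsed = as.numeric(time) / 3600,  # hours since the epoch
    sin24_K1 = sin(2 * pi * 1 * hours_elapsed / 24),
    cos24_K1 = cos(2 * pi * 1 * hours_elapsed / 24),
    sin24_K2 = sin(2 * pi * 2 * hours_elapsed / 24),
    cos24_K2 = cos(2 * pi * 2 * hours_elapsed / 24)
  )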
fourier_hourly_tbl <- prepared_hourly_tbl %>%
tk_augment_fourier(time, .periods = c(24), .K = 2)
# -matches("(hour)|(minute)|(second)|(am.pm)")
# color: #76A08A
fourier_col_palletes <- c("#D1362F", "#27223C", "#2E604A", "#E6A2C5")
fourier_hourly_tbl %>%
select(matches("(cos24)|(sin24)")) %>%
pivot_longer(cols = 1:4) %>%
ggplot(aes(x=value, color=name, fill=name)) +
geom_density(alpha=0.3) +
facet_wrap(~name, scales = "free_x") + theme_bw() +
theme(plot.title = element_text(face = "bold", color = "black", size=14),
plot.subtitle = element_text(face = "bold", color = "black", size=12),
axis.text = element_text(color = "black"), legend.text = element_text(size=10),
legend.title = element_text(size = 12), legend.position = "bottom",
strip.background =element_rect(fill="#76A08A"), strip.text = element_text(color="white", face="bold")) +
labs(
title = "Fourier Series Distribution",
subtitle = "24 hour Fourier Series"
) + scale_x_continuous(expand = c(0,0)) +
scale_color_manual("Fourier Legend", values = fourier_col_palletes) +
scale_fill_manual("Fourier Legend", values = fourier_col_palletes) +
guides(color = guide_legend(ncol=2, byrow=TRUE))

model_formula_fourier <- as.formula(
ads ~ splines::ns(index.num) + day + wday.lbl + hour +
time_sin24_K1 + time_cos24_K1 + time_sin24_K2 + time_cos24_K2
)
fourier_hourly_tbl %>%
plot_time_series_regression(
.date_var = time,
.formula = model_formula_fourier,
.show_summary = TRUE,
.interactive = F,
.line_size = 1
) + theme_bw() +
theme(plot.title = element_text(face = "bold", color = "black", size=14),
plot.subtitle = element_text(face = "bold", color = "black", size=12),
axis.text = element_text(color = "black"), legend.text = element_text(size=10),
legend.title = element_text(size = 12), legend.position = "bottom") +
scale_color_manual(values = new_colors_pal) + scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(
title = "Regression Model using Fourier Series",
subtitle = "Capturing seasonality with Fourier 24",
y = "Number of Ads",
x = "Time of the Day"
)

##
## Call:
## stats::lm(formula = .formula, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23960 -7405 -1579 7269 29059
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140725 69562 2.023 0.044381 *
## splines::ns(index.num) 3709 61868 0.060 0.952255
## day -1143 5547 -0.206 0.836997
## wday.lbl.L -4089 2378 -1.720 0.086997 .
## wday.lbl.Q 13739 2161 6.357 1.33e-09 ***
## wday.lbl.C -2701 2258 -1.196 0.232947
## wday.lbl^4 6592 2166 3.044 0.002645 **
## wday.lbl^5 -1216 2132 -0.570 0.569137
## wday.lbl^6 1227 1961 0.626 0.532261
## hour NA NA NA NA
## time_sin24_K1 -17494 2062 -8.482 4.57e-15 ***
## time_cos24_K1 -26331 1112 -23.689 < 2e-16 ***
## time_sin24_K2 5002 1387 3.607 0.000391 ***
## time_cos24_K2 -11696 1112 -10.522 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11300 on 203 degrees of freedom
## Multiple R-squared: 0.8378, Adjusted R-squared: 0.8282
## F-statistic: 87.35 on 12 and 203 DF, p-value: < 2.2e-16
A workflow is a way of putting together the different steps of the modeling process.
# 22% is the testing set
# Splitting dataframe
prepared_hour_split_tbl <- time_series_split(prepared_hourly_tbl, assess = "2 days", cumulative = TRUE)
# Extracting training and test sets
train_df <- training(prepared_hour_split_tbl)
test_df <- testing(prepared_hour_split_tbl)
# Preparing our recipe
recipe_book <- recipe(ads_transformed ~ ., data = train_df) %>%
  # Remove unnecessary features
  step_rm(matches("(iso)|(xts)|(quarter)|(year)|(month)|(qday)|(diff)")) %>%
  step_normalize(matches("(index.num)|(yday)")) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  # Combine features
  step_interact(~matches("am.pm")*matches("wday.lbl")) %>%
  step_fourier(time, period = c(24, 48, 76), K = 2)
# A glimpse of our recipe
recipe_book %>% prep() %>% juice() %>% glimpse()
When creating a workflow pipeline we implement 3 main steps:

Setting up the model: I use a parsnip model for this; in this example we will use a random forest model. We need to set the engine and the type of problem, in this case regression (it could also be used for classification problems).

Establish the workflow: here we add first the model, and then the recipe we previously prepared. The recipe performs all the preprocessing steps on the training data.

Fit the model on the training data: finally, we fit our model so that it is ready to predict against the testing set.
rf_model <- rand_forest() %>%
set_engine("randomForest") %>%
set_mode("regression")
workflow_model <- workflow() %>%
add_model(rf_model) %>%
add_recipe(recipe_book) %>%
fit(train_df)
workflow_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_rm()
## • step_normalize()
## • step_dummy()
## • step_interact()
## • step_fourier()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.01066798
## % Var explained: 98.3
Summary:

- Mean of Squared Residuals: this is an error metric and, as we can see, it is quite low; since randomForest reports it on out-of-bag samples, it suggests the model should generalize well.
- Residuals Plot: the residuals plot confirms the small errors; the model captures the actual observations well.
predict_tbl <- workflow_model %>%
predict(test_df) %>%
bind_cols(test_df) %>%
select(time, .pred, ads_transformed) %>%
pivot_longer(.pred:ads_transformed)
predict_tbl %>%
ggplot(aes(x=time, y=value)) + geom_line(aes(color=name), size=1) +
theme_fivethirtyeight() +
theme(plot.background = element_rect(fill="white"),
panel.background = element_rect(fill="white"),
legend.background=element_rect(fill="white"),
legend.key=element_rect(fill="white")) +
scale_color_manual(values = c("#D1362F", "#2E604A")) + labs(
title = "Random Forest Model",
subtitle = "Feature Engineering with the Recipes Package",
caption = "Note: Cross Validation not implemented, prone to overfitting"
)

residuals_tbl <- predict_tbl %>%
pivot_wider(names_from = name, values_from = value) %>%
mutate(
residuals = ads_transformed - .pred
)
residuals_tbl %>%
ggplot(aes(x=time, y=residuals)) +
geom_point(color="red") +
geom_hline(yintercept=0, color="black", size=1.5) +
scale_y_continuous(limits = c(-0.4, 0.4)) +
theme_fivethirtyeight() +
theme(plot.background = element_rect(fill="white"),
panel.background = element_rect(fill="white"),
legend.key = element_rect(fill="white")) + scale_color_manual(values=c("#D1362F", "#2E604A")) +
labs(
title = "Residuals Plot",
subtitle = "Random Forest Model",
caption = "Mean of squared residuals: 0.01474173"
)
## Comparing Different Models
Linear Models
Random Forest Models
The random forest model looks much more accurate than the linear model; the explanation for these metrics was covered above in the notebook.
horizon <- 24*3
lag_period <- 24*3
rolling_periods <- c(24, 48, 72)
final_split_df <- time_series_split(hourly_tbl, assess=horizon, cumulative=TRUE)
final_split_df %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(time, ads, .interactive=FALSE, .line_size=1) +
theme_fivethirtyeight() +
theme(plot.background = element_rect(fill="white"),
panel.background = element_rect(fill="white"),
legend.key = element_rect(fill="white")) + scale_color_manual(values=c("#D1362F", "#2E604A")) +
labs(
title = "Splitting the original dataframe"
)

# Preparing the linear model before combining it with other models
linear_model <- linear_reg() %>%
set_engine("lm")
# 3 day horizon
horizon <- 24*3
lag_period <- 24*3
rolling_periods <- c(24, 48, 72)
#Splitting dataframe
prepared_hour_split_tbl <- time_series_split(prepared_hourly_tbl, assess = horizon, cumulative = TRUE)
# Extracting training and test sets
train_df <- training(prepared_hour_split_tbl)
test_df <- testing(prepared_hour_split_tbl)
recipe_base <- recipe(ads_transformed ~ . , data = train_df) %>%
# Remove unnecessary features
step_rm(matches("(iso)|(xts)|(quarter)|(year)|(month)|(qday)|(diff)")) %>%
step_normalize(matches("(index.num)|(yday)")) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
# Combine features
step_interact(~ matches("am.pm") * matches("wday.lbl")) %>%
step_fourier(time, period = c(24, 48, 76), K=2)
first_recipes <- recipe_base %>%
step_rm(time) %>%
step_ns(ends_with("index.num"), deg_free=2)
# Building the workflow
lm_workflow_fitted <- workflow() %>%
add_model(linear_model) %>%
add_recipe(first_recipes) %>%
fit(train_df)
# modeltime
final_calibration_tbl <- modeltime_table(
workflow_model,
lm_workflow_fitted
) %>%
modeltime_calibrate(
new_data = test_df
)
final_calibration_tbl %>%
modeltime_forecast(new_data = test_df,
actual_data = prepared_hourly_tbl) %>%
plot_modeltime_forecast(.interactive = F, .line_size=1) +
theme_fivethirtyeight() +
theme(plot.background = element_rect(fill="white"),
panel.background = element_rect(fill="white"),
legend.background = element_rect(fill="white"),
legend.key = element_rect(fill="white"),
strip.background =element_rect(fill="#F8FAFB")) +
scale_color_calc() +
labs(
title = "Measuring accuracy of our Models",
subtitle = "Linear and Random Forest Models"
)

# Compute accuracy metrics for both calibrated models
accuracy_tbl <- final_calibration_tbl %>%
  modeltime_accuracy()

accuracy_tbl %>%
pivot_longer(mae:rsq) %>%
mutate(
name = name %>% as_factor() %>% fct_reorder(value) %>% fct_rev(),
value_txt = str_glue("{round(value, 2)}%") %>% as.character()
) %>%
ggplot(aes(x=name, y=value)) + geom_col(aes(fill=.model_desc), color="black") + facet_wrap(~.model_desc) +
scale_fill_calc() +
geom_label(aes(label=value_txt), size=3, color="black") +
theme_fivethirtyeight() +
theme(legend.position = "bottom",
plot.background=element_rect(fill="white"),
panel.background=element_rect(fill="white"),
legend.background = element_rect(fill="white"),
legend.key=element_rect(fill="white")) +
labs(
title = "Comparison of Model Accuracy"
)

# Prepare our data
hourly_tbl <- df %>%
set_names(names(.) %>% str_to_lower()) %>%
mutate(
time = ymd_hms(time)
) %>% as_tibble()
# Find mean and std of logs
# This is to do the inversion at the end
log_ads <- hourly_tbl %>%
mutate(ads_log = log(ads))
inversion_values <- log_ads %>%
summarise(
std_mean = mean(ads_log),
std_mean = format(round(std_mean, 13), nsmall = 13),
std_sd = sd(ads_log),
std_sd = format(round(std_sd, 13), nsmall = 13)
)
inversion_values

## # A tibble: 1 × 2
## std_mean std_sd
## <chr> <chr>
## 1 11.6850111412288 0.2350963087660
# Values for inverting our time series target variable (ads)
# format() above returned characters, so convert back to numeric
std_mean <- as.numeric(inversion_values$std_mean)
std_sd <- as.numeric(inversion_values$std_sd)
# Transformed table
prepared_hourly_tbl <- hourly_tbl %>%
mutate(
ads_transformed = log(ads),
ads_transformed = standardize_vec(ads_transformed)
)
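When forecasts come back on the transformed scale, the saved mean and standard deviation undo the standardization, and exp() undoes the log. A minimal sketch, using the transformed column itself as a stand-in for model output:

# Invert: standardized-log scale -> log scale -> original number of ads
pred_transformed <- prepared_hourly_tbl$ads_transformed  # stand-in for predictions
pred_log <- standardize_inv_vec(pred_transformed, mean = std_mean, sd = std_sd)
pred_ads <- exp(pred_log)
all.equal(pred_ads, prepared_hourly_tbl$ads)  # TRUE, up to the rounding of mean/sd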
horizon <- 24*3
# Split the prepared data table
splits <- time_series_split(prepared_hourly_tbl, assess = horizon, cumulative = TRUE)

recipe_spec_base <- recipe(ads_transformed ~ time, training(splits)) %>%
step_timeseries_signature(time) %>%
step_rm(matches("(iso)|(xts)|(second)|(year)|(quarter)|(month)")) %>%
step_fourier(time, period = c(12, 24, 36, 48), K=2) %>%
step_dummy(all_nominal())
recipe_spec_base %>% prep() %>% juice() %>% glimpse()

## Rows: 144
## Columns: 41
## $ time <dttm> 2017-09-13 00:00:00, 2017-09-13 01:00:00, 2017-09-13 …
## $ ads_transformed <dbl> -1.67502740, -1.68725644, -1.21216050, -0.65066763, 0.…
## $ time_index.num <dbl> 1505260800, 1505264400, 1505268000, 1505271600, 150527…
## $ time_half <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ time_day <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ time_hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ time_minute <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ time_hour12 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5…
## $ time_am.pm <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ time_wday <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ time_mday <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ time_qday <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75…
## $ time_yday <int> 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256,…
## $ time_mweek <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ time_week <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ time_week2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_week3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_week4 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ time_mday7 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ time_sin12_K1 <dbl> -1.913562e-11, 5.000000e-01, 8.660254e-01, 1.000000e+0…
## $ time_cos12_K1 <dbl> 1.000000e+00, 8.660254e-01, 5.000000e-01, 2.169896e-11…
## $ time_sin12_K2 <dbl> -3.827123e-11, 8.660254e-01, 8.660254e-01, 4.339792e-1…
## $ time_cos12_K2 <dbl> 1.0, 0.5, -0.5, -1.0, -0.5, 0.5, 1.0, 0.5, -0.5, -1.0,…
## $ time_sin24_K1 <dbl> -9.567808e-12, 2.588190e-01, 5.000000e-01, 7.071068e-0…
## $ time_cos24_K1 <dbl> 1.000000e+00, 9.659258e-01, 8.660254e-01, 7.071068e-01…
## $ time_sin24_K2 <dbl> -1.913562e-11, 5.000000e-01, 8.660254e-01, 1.000000e+0…
## $ time_cos24_K2 <dbl> 1.000000e+00, 8.660254e-01, 5.000000e-01, 2.169896e-11…
## $ time_sin36_K1 <dbl> -8.660254e-01, -9.396926e-01, -9.848078e-01, -1.000000…
## $ time_cos36_K1 <dbl> -5.000000e-01, -3.420201e-01, -1.736482e-01, 2.468290e…
## $ time_sin36_K2 <dbl> 8.660254e-01, 6.427876e-01, 3.420201e-01, -4.936580e-1…
## $ time_cos36_K2 <dbl> -0.5000000, -0.7660444, -0.9396926, -1.0000000, -0.939…
## $ time_sin48_K1 <dbl> -4.783904e-12, 1.305262e-01, 2.588190e-01, 3.826834e-0…
## $ time_cos48_K1 <dbl> 1.000000e+00, 9.914449e-01, 9.659258e-01, 9.238795e-01…
## $ time_sin48_K2 <dbl> -9.567808e-12, 2.588190e-01, 5.000000e-01, 7.071068e-0…
## $ time_cos48_K2 <dbl> 1.000000e+00, 9.659258e-01, 8.660254e-01, 7.071068e-01…
## $ time_wday.lbl_1 <dbl> 2.855301e-17, 2.855301e-17, 2.855301e-17, 2.855301e-17…
## $ time_wday.lbl_2 <dbl> -0.4364358, -0.4364358, -0.4364358, -0.4364358, -0.436…
## $ time_wday.lbl_3 <dbl> 5.779362e-17, 5.779362e-17, 5.779362e-17, 5.779362e-17…
## $ time_wday.lbl_4 <dbl> 0.4834938, 0.4834938, 0.4834938, 0.4834938, 0.4834938,…
## $ time_wday.lbl_5 <dbl> -1.231475e-15, -1.231475e-15, -1.231475e-15, -1.231475…
## $ time_wday.lbl_6 <dbl> -0.6579517, -0.6579517, -0.6579517, -0.6579517, -0.657…
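The workflows below reference model_spec_rf and model_fit_prophet, which were defined in an earlier chunk. A minimal sketch of what those specifications would look like, assuming the same engines used earlier in the notebook:

# Assumed model specifications (matching the engines used earlier)
model_spec_rf <- rand_forest() %>%
  set_engine("randomForest") %>%
  set_mode("regression")

model_fit_prophet <- prophet_reg() %>%
  set_engine("prophet")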
# 1. Workflow Random Forest
workflow_fit_rf <- workflow() %>%
add_model(model_spec_rf) %>%
add_recipe(recipe_spec_base %>% step_rm(time)) %>%
fit(training(splits))
# 2. Prophet Workflow
workflow_fit_prophet <- workflow() %>%
add_model(model_fit_prophet) %>%
add_recipe(recipe_spec_base) %>%
fit(training(splits))
# Combine models together
# Combined models with modeltime
model_table <- modeltime_table(
workflow_fit_rf,
workflow_fit_prophet
)
# Building the calibration table
calibration_table <- model_table %>%
modeltime_calibrate(testing(splits))
calibration_table

## # Modeltime Table
## # A tibble: 2 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <workflow> RANDOMFOREST Test <tibble [72 × 4]>
## 2 2 <workflow> PROPHET W/ REGRESSORS Test <tibble [72 × 4]>