This analysis was initially published here I have redone the analysis for understanding how time series work.
library("timetk")
library("tidyquant")
library("readxl")
library("dplyr")
library("ggplot2")
The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales..
It’s always a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting. We’ll use tidyquant charting tools: mainly geom_ma to add a 12-period simple moving average to get an idea of the trend.
ggplot(data = beer_sales_tbl, aes(x = date, y = price)) +
geom_line(col = palette_light()[1]) +
geom_point(col = palette_light()[1]) +
geom_ma(ma_fun = SMA, n = 12, size = 1) +
theme_tq() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
labs(title = "Beer Sales: 2007 through 2016")
We can also see there appears to be both trend (moving average is increasing in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur at specific months).
Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.
Objective: We’ll predict the next 12 months of data for the time series using the time series signature.
We can quickly get a feel for the time series using tk_index() to extract the index and tk_get_timeseries_summary() to retrieve summary information of the index. We use glimpse() to output in a nice format for review.
beer_sales_tbl %>%
tk_index() %>%
tk_get_timeseries_summary() %>%
glimpse()
## Observations: 1
## Variables: 12
## $ n.obs <int> 84
## $ start <date> 2010-01-01
## $ end <date> 2016-12-01
## $ units <chr> "days"
## $ scale <chr> "month"
## $ tzone <chr> "UTC"
## $ diff.minimum <dbl> 2419200
## $ diff.q1 <dbl> 2592000
## $ diff.median <dbl> 2678400
## $ diff.mean <dbl> 2629475
## $ diff.q3 <dbl> 2678400
## $ diff.maximum <dbl> 2678400
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame.
# Augment (adds data frame columns)
beer_sales_tbl_aug <- beer_sales_tbl %>%
tk_augment_timeseries_signature()
head(beer_sales_tbl_aug, 10)
## # A tibble: 10 x 30
## date price index.num diff year year.iso half quarter month
## <date> <dbl> <int> <int> <int> <int> <int> <int> <int>
## 1 2010-01-01 6558 1262304000 NA 2010 2009 1 1 1
## 2 2010-02-01 7481 1264982400 2678400 2010 2010 1 1 2
## 3 2010-03-01 9475 1267401600 2419200 2010 2010 1 1 3
## 4 2010-04-01 9424 1270080000 2678400 2010 2010 1 2 4
## 5 2010-05-01 9351 1272672000 2592000 2010 2010 1 2 5
## 6 2010-06-01 10552 1275350400 2678400 2010 2010 1 2 6
## 7 2010-07-01 9077 1277942400 2592000 2010 2010 2 3 7
## 8 2010-08-01 9273 1280620800 2678400 2010 2010 2 3 8
## 9 2010-09-01 9420 1283299200 2678400 2010 2010 2 3 9
## 10 2010-10-01 9413 1285891200 2592000 2010 2010 2 4 10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## # hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## # wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## # yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## # week3 <int>, week4 <int>, mday7 <int>
Apply any regression model to the data. We’ll use lm(). Note that we drop the date and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it’s more useful for finding time gaps in the data).
fit_lm <- lm(price ~ ., data = select(beer_sales_tbl_aug, -c(date, diff)))
summary(fit_lm)
##
## Call:
## lm(formula = price ~ ., data = select(beer_sales_tbl_aug, -c(date,
## diff)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -447.3 -145.4 -18.2 169.8 421.4
##
## Coefficients: (16 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.660e+08 1.245e+08 2.940 0.004738 **
## index.num 5.900e-03 2.003e-03 2.946 0.004661 **
## year -1.974e+05 6.221e+04 -3.173 0.002434 **
## year.iso 1.159e+04 6.546e+03 1.770 0.082006 .
## half -2.132e+03 6.107e+02 -3.491 0.000935 ***
## quarter -1.239e+04 2.190e+04 -0.566 0.573919
## month -3.910e+03 7.355e+03 -0.532 0.597058
## month.xts NA NA NA NA
## month.lbl.L NA NA NA NA
## month.lbl.Q -1.643e+03 2.069e+02 -7.942 8.59e-11 ***
## month.lbl.C 8.368e+02 5.139e+02 1.628 0.108949
## month.lbl^4 6.452e+02 1.344e+02 4.801 1.18e-05 ***
## month.lbl^5 7.563e+02 4.241e+02 1.783 0.079852 .
## month.lbl^6 3.206e+02 1.609e+02 1.992 0.051135 .
## month.lbl^7 -3.537e+02 1.867e+02 -1.894 0.063263 .
## month.lbl^8 3.687e+02 3.217e+02 1.146 0.256510
## month.lbl^9 NA NA NA NA
## month.lbl^10 6.339e+02 2.240e+02 2.830 0.006414 **
## month.lbl^11 NA NA NA NA
## day NA NA NA NA
## hour NA NA NA NA
## minute NA NA NA NA
## second NA NA NA NA
## hour12 NA NA NA NA
## am.pm NA NA NA NA
## wday -8.264e+01 1.898e+01 -4.353 5.63e-05 ***
## wday.xts NA NA NA NA
## wday.lbl.L NA NA NA NA
## wday.lbl.Q -7.109e+02 1.093e+02 -6.503 2.13e-08 ***
## wday.lbl.C 2.355e+02 1.336e+02 1.763 0.083273 .
## wday.lbl^4 8.033e+01 1.133e+02 0.709 0.481281
## wday.lbl^5 6.480e+01 8.029e+01 0.807 0.422951
## wday.lbl^6 2.276e+01 8.200e+01 0.278 0.782319
## mday NA NA NA NA
## qday -1.362e+02 2.418e+02 -0.563 0.575326
## yday -2.356e+02 1.416e+02 -1.664 0.101627
## mweek -1.670e+02 1.477e+02 -1.131 0.262923
## week -1.764e+02 1.890e+02 -0.933 0.354618
## week.iso 2.315e+02 1.256e+02 1.842 0.070613 .
## week2 -1.235e+02 1.547e+02 -0.798 0.428283
## week3 NA NA NA NA
## week4 NA NA NA NA
## mday7 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.4 on 57 degrees of freedom
## Multiple R-squared: 0.9798, Adjusted R-squared: 0.9706
## F-statistic: 106.4 on 26 and 57 DF, p-value: < 2.2e-16
Use tk_index() to extract the index of date.
Make a future index from the existing index with tk_make_future_timeseries. The function internally checks the periodicity and returns the correct sequence.
future_idx <- beer_sales_idx %>%
tk_make_future_timeseries(n_future = 12)
future_idx
## [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
## [6] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
## [11] "2017-11-01" "2017-12-01"
From the future index, use tk_get_timeseries_signature() to turn index into time signature data frame.
new_data_tbl <- future_idx %>%
tk_get_timeseries_signature()
head(new_data_tbl,10)
## # A tibble: 10 x 29
## index index.num diff year year.iso half quarter month
## <date> <int> <int> <int> <int> <int> <int> <int>
## 1 2017-01-01 1483228800 NA 2017 2016 1 1 1
## 2 2017-02-01 1485907200 2678400 2017 2017 1 1 2
## 3 2017-03-01 1488326400 2419200 2017 2017 1 1 3
## 4 2017-04-01 1491004800 2678400 2017 2017 1 2 4
## 5 2017-05-01 1493596800 2592000 2017 2017 1 2 5
## 6 2017-06-01 1496275200 2678400 2017 2017 1 2 6
## 7 2017-07-01 1498867200 2592000 2017 2017 2 3 7
## 8 2017-08-01 1501545600 2678400 2017 2017 2 3 8
## 9 2017-09-01 1504224000 2678400 2017 2017 2 3 9
## 10 2017-10-01 1506816000 2592000 2017 2017 2 4 10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## # hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## # wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## # yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## # week3 <int>, week4 <int>, mday7 <int>
Use the predict() function for your regression model. Note that we drop the index and diff columns, the same as before when using the lm() function.
pred <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))
predictions_tbl <- tibble(date = future_idx,value = pred)
beer_sales_tbl %>%
ggplot(aes(x = date, y = price)) +
# Training data
geom_line(color = palette_light()[[1]]) +
geom_point(color = palette_light()[[1]]) +
# Predictions
geom_line(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
geom_point(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
# Actuals
geom_line(color = palette_light()[[1]], data = actual_sales) +
geom_point(color = palette_light()[[1]], data = actual_sales) +
# Aesthetics
theme_tq() +
labs(title = "Beer Sales Forecast: Time Series Machine Learning",
subtitle = "Using basic multivariate linear regression can yield accurate results")
# Investigate test error
error_tbl <- left_join(actual_sales, predictions_tbl) %>%
rename(actual = price, pred = value) %>%
mutate(error = actual - pred,error_pct = error / actual)
error_tbl
## # A tibble: 9 x 5
## date actual pred error error_pct
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2017-01-01 8664 9509.122 -845.1223 -0.097544127
## 2 2017-02-01 10017 10909.189 -892.1891 -0.089067495
## 3 2017-03-01 11960 12281.835 -321.8352 -0.026909301
## 4 2017-04-01 11019 11378.678 -359.6777 -0.032641592
## 5 2017-05-01 12971 13080.710 -109.7099 -0.008458092
## 6 2017-06-01 14113 13583.471 529.5292 0.037520667
## 7 2017-07-01 10928 11529.085 -601.0854 -0.055004156
## 8 2017-08-01 12831 12984.939 -153.9386 -0.011997396
## 9 2017-09-01 11412 11859.998 -447.9976 -0.039256708