This analysis was initially published here; I have redone the analysis to better understand how time series work.

Loading packages

library("timetk")
library("tidyquant")
library("readxl")
library("dplyr")
library("ggplot2")

Data loading and Data cleaning

The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
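
To make the analysis reproducible, the series can be pulled directly from FRED with tidyquant. A minimal sketch, assuming the FRED symbol S4248SM144NCEN (an assumption; use the series ID shown on the FRED page) and the 2010 through 2016 window summarized below:

# Pull the monthly sales series from FRED.
# NOTE: the symbol below is an assumption; substitute the series ID from the FRED page.
beer_sales_tbl <- tq_get("S4248SM144NCEN",
                         get  = "economic.data",
                         from = "2010-01-01",
                         to   = "2016-12-31") %>%
    select(date, price)   # keep only the date index and the sales value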

Visualizing Data

It’s always a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting. We’ll use tidyquant charting tools: mainly geom_ma to add a 12-period simple moving average to get an idea of the trend.

ggplot(data = beer_sales_tbl, aes(x = date, y = price)) +
    geom_line(col = palette_light()[1]) +
    geom_point(col = palette_light()[1]) +
    geom_ma(ma_fun = SMA, n = 12, size = 1) +
    theme_tq() +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    labs(title = "Beer Sales: 2010 through 2016")

We can see that there appears to be both a trend (the moving average increases in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur in the same months each year).

Time Series Machine Learning

Step 0: Review data

We can quickly get a feel for the time series using tk_index() to extract the index and tk_get_timeseries_summary() to retrieve summary information of the index. We use glimpse() to output in a nice format for review.

beer_sales_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary() %>%
    glimpse()
## Observations: 1
## Variables: 12
## $ n.obs        <int> 84
## $ start        <date> 2010-01-01
## $ end          <date> 2016-12-01
## $ units        <chr> "days"
## $ scale        <chr> "month"
## $ tzone        <chr> "UTC"
## $ diff.minimum <dbl> 2419200
## $ diff.q1      <dbl> 2592000
## $ diff.median  <dbl> 2678400
## $ diff.mean    <dbl> 2629475
## $ diff.q3      <dbl> 2678400
## $ diff.maximum <dbl> 2678400

Step 1: Augment Time Series Signature

The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame.

# Augment (adds data frame columns)
beer_sales_tbl_aug <- beer_sales_tbl %>%
    tk_augment_timeseries_signature()

head(beer_sales_tbl_aug, 10)
## # A tibble: 10 x 30
##          date price  index.num    diff  year year.iso  half quarter month
##        <date> <dbl>      <int>   <int> <int>    <int> <int>   <int> <int>
##  1 2010-01-01  6558 1262304000      NA  2010     2009     1       1     1
##  2 2010-02-01  7481 1264982400 2678400  2010     2010     1       1     2
##  3 2010-03-01  9475 1267401600 2419200  2010     2010     1       1     3
##  4 2010-04-01  9424 1270080000 2678400  2010     2010     1       2     4
##  5 2010-05-01  9351 1272672000 2592000  2010     2010     1       2     5
##  6 2010-06-01 10552 1275350400 2678400  2010     2010     1       2     6
##  7 2010-07-01  9077 1277942400 2592000  2010     2010     2       3     7
##  8 2010-08-01  9273 1280620800 2678400  2010     2010     2       3     8
##  9 2010-09-01  9420 1283299200 2678400  2010     2010     2       3     9
## 10 2010-10-01  9413 1285891200 2592000  2010     2010     2       4    10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## #   hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## #   wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## #   yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## #   week3 <int>, week4 <int>, mday7 <int>

Step 2: Model

Apply any regression model to the data. We’ll use lm(). Note that we drop the date and diff columns: most algorithms do not work with dates directly, and the diff column is not useful for machine learning (it’s more useful for finding time gaps in the data). Also, because the data is monthly, several signature columns (hour, minute, second, etc.) are constant or perfectly collinear with others, so lm() drops them and reports their coefficients as NA (the "singularities" noted in the summary below).

fit_lm <- lm(price ~ ., data = select(beer_sales_tbl_aug, -c(date, diff)))
summary(fit_lm)
## 
## Call:
## lm(formula = price ~ ., data = select(beer_sales_tbl_aug, -c(date, 
##     diff)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -447.3 -145.4  -18.2  169.8  421.4 
## 
## Coefficients: (16 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.660e+08  1.245e+08   2.940 0.004738 ** 
## index.num     5.900e-03  2.003e-03   2.946 0.004661 ** 
## year         -1.974e+05  6.221e+04  -3.173 0.002434 ** 
## year.iso      1.159e+04  6.546e+03   1.770 0.082006 .  
## half         -2.132e+03  6.107e+02  -3.491 0.000935 ***
## quarter      -1.239e+04  2.190e+04  -0.566 0.573919    
## month        -3.910e+03  7.355e+03  -0.532 0.597058    
## month.xts            NA         NA      NA       NA    
## month.lbl.L          NA         NA      NA       NA    
## month.lbl.Q  -1.643e+03  2.069e+02  -7.942 8.59e-11 ***
## month.lbl.C   8.368e+02  5.139e+02   1.628 0.108949    
## month.lbl^4   6.452e+02  1.344e+02   4.801 1.18e-05 ***
## month.lbl^5   7.563e+02  4.241e+02   1.783 0.079852 .  
## month.lbl^6   3.206e+02  1.609e+02   1.992 0.051135 .  
## month.lbl^7  -3.537e+02  1.867e+02  -1.894 0.063263 .  
## month.lbl^8   3.687e+02  3.217e+02   1.146 0.256510    
## month.lbl^9          NA         NA      NA       NA    
## month.lbl^10  6.339e+02  2.240e+02   2.830 0.006414 ** 
## month.lbl^11         NA         NA      NA       NA    
## day                  NA         NA      NA       NA    
## hour                 NA         NA      NA       NA    
## minute               NA         NA      NA       NA    
## second               NA         NA      NA       NA    
## hour12               NA         NA      NA       NA    
## am.pm                NA         NA      NA       NA    
## wday         -8.264e+01  1.898e+01  -4.353 5.63e-05 ***
## wday.xts             NA         NA      NA       NA    
## wday.lbl.L           NA         NA      NA       NA    
## wday.lbl.Q   -7.109e+02  1.093e+02  -6.503 2.13e-08 ***
## wday.lbl.C    2.355e+02  1.336e+02   1.763 0.083273 .  
## wday.lbl^4    8.033e+01  1.133e+02   0.709 0.481281    
## wday.lbl^5    6.480e+01  8.029e+01   0.807 0.422951    
## wday.lbl^6    2.276e+01  8.200e+01   0.278 0.782319    
## mday                 NA         NA      NA       NA    
## qday         -1.362e+02  2.418e+02  -0.563 0.575326    
## yday         -2.356e+02  1.416e+02  -1.664 0.101627    
## mweek        -1.670e+02  1.477e+02  -1.131 0.262923    
## week         -1.764e+02  1.890e+02  -0.933 0.354618    
## week.iso      2.315e+02  1.256e+02   1.842 0.070613 .  
## week2        -1.235e+02  1.547e+02  -0.798 0.428283    
## week3                NA         NA      NA       NA    
## week4                NA         NA      NA       NA    
## mday7                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260.4 on 57 degrees of freedom
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9706 
## F-statistic: 106.4 on 26 and 57 DF,  p-value: < 2.2e-16

Step 3: Build Future (New) Data

First, use tk_index() to extract the date index from the data.

Then make a future index from the existing index with tk_make_future_timeseries(). The function internally checks the periodicity and returns the correct sequence.

beer_sales_idx <- beer_sales_tbl %>%
    tk_index()

future_idx <- beer_sales_idx %>%
    tk_make_future_timeseries(n_future = 12)
future_idx
##  [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
##  [6] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
## [11] "2017-11-01" "2017-12-01"

From the future index, use tk_get_timeseries_signature() to turn the index into a time series signature data frame.

new_data_tbl <- future_idx %>%
    tk_get_timeseries_signature()
head(new_data_tbl, 10)
## # A tibble: 10 x 29
##         index  index.num    diff  year year.iso  half quarter month
##        <date>      <int>   <int> <int>    <int> <int>   <int> <int>
##  1 2017-01-01 1483228800      NA  2017     2016     1       1     1
##  2 2017-02-01 1485907200 2678400  2017     2017     1       1     2
##  3 2017-03-01 1488326400 2419200  2017     2017     1       1     3
##  4 2017-04-01 1491004800 2678400  2017     2017     1       2     4
##  5 2017-05-01 1493596800 2592000  2017     2017     1       2     5
##  6 2017-06-01 1496275200 2678400  2017     2017     1       2     6
##  7 2017-07-01 1498867200 2592000  2017     2017     2       3     7
##  8 2017-08-01 1501545600 2678400  2017     2017     2       3     8
##  9 2017-09-01 1504224000 2678400  2017     2017     2       3     9
## 10 2017-10-01 1506816000 2592000  2017     2017     2       4    10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## #   hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## #   wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## #   yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## #   week3 <int>, week4 <int>, mday7 <int>

Step 4: Predict the New Data

Use the predict() function with the fitted regression model. Note that we drop the index and diff columns, just as we dropped date and diff when fitting with lm().

pred <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))

predictions_tbl <- tibble(
    date  = future_idx,
    value = pred
)

Step 5: Compare Actual vs Predictions
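
The plot and error analysis below compare the forecast against actual 2017 sales, which were not used in training. A minimal sketch of pulling them, assuming the same FRED series as above (the symbol is an assumption):

# Actual sales for the forecast window (Jan-Sep 2017); the symbol is an assumption
actual_sales <- tq_get("S4248SM144NCEN",
                       get  = "economic.data",
                       from = "2017-01-01",
                       to   = "2017-09-30") %>%
    select(date, price)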

Now visualize our forecast:

# Plot Beer Sales Forecast
beer_sales_tbl %>%
    ggplot(aes(x = date, y = price)) +
    # Training data
    geom_line(color = palette_light()[[1]]) +
    geom_point(color = palette_light()[[1]]) +
    # Predictions
    geom_line(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
    geom_point(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
    # Actuals
    geom_line(color = palette_light()[[1]], data = actual_sales) +
    geom_point(color = palette_light()[[1]], data = actual_sales) +
    # Aesthetics
    theme_tq() +
    labs(title = "Beer Sales Forecast: Time Series Machine Learning",
         subtitle = "Using basic multivariate linear regression can yield accurate results")

We can investigate the error on our test set (actuals vs predictions).

# Investigate test error
error_tbl <- left_join(actual_sales, predictions_tbl, by = "date") %>%
    rename(actual = price, pred = value) %>%
    mutate(error     = actual - pred,
           error_pct = error / actual)
error_tbl
## # A tibble: 9 x 5
##         date actual      pred     error    error_pct
##       <date>  <dbl>     <dbl>     <dbl>        <dbl>
## 1 2017-01-01   8664  9509.122 -845.1223 -0.097544127
## 2 2017-02-01  10017 10909.189 -892.1891 -0.089067495
## 3 2017-03-01  11960 12281.835 -321.8352 -0.026909301
## 4 2017-04-01  11019 11378.678 -359.6777 -0.032641592
## 5 2017-05-01  12971 13080.710 -109.7099 -0.008458092
## 6 2017-06-01  14113 13583.471  529.5292  0.037520667
## 7 2017-07-01  10928 11529.085 -601.0854 -0.055004156
## 8 2017-08-01  12831 12984.939 -153.9386 -0.011997396
## 9 2017-09-01  11412 11859.998 -447.9976 -0.039256708
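
We can also summarize the test error with a few standard metrics. A quick sketch computing MAE, RMSE, and MAPE directly from error_tbl:

# Summarize test error: mean absolute error, root mean squared error,
# and mean absolute percentage error
error_tbl %>%
    summarise(
        mae  = mean(abs(error)),
        rmse = sqrt(mean(error^2)),
        mape = mean(abs(error_pct)) * 100
    )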