Undestanding Time series with timetk

This analysis was initially published here I have redone the analysis for understanding how time series work.

Time Series Machine Learning

Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.
Objective: We’ll predict the next 12 months of data for the time series using the time series signature.

Step 0: Review data

We can quickly get a feel for the time series using tk_index() to extract the index and tk_get_timeseries_summary() to retrieve summary information of the index. We use glimpse() to output in a nice format for review.

beer_sales_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary() %>%
    glimpse()

## Observations: 1
## Variables: 12
## $ n.obs        <int> 84
## $ start        <date> 2010-01-01
## $ end          <date> 2016-12-01
## $ units        <chr> "days"
## $ scale        <chr> "month"
## $ tzone        <chr> "UTC"
## $ diff.minimum <dbl> 2419200
## $ diff.q1      <dbl> 2592000
## $ diff.median  <dbl> 2678400
## $ diff.mean    <dbl> 2629475
## $ diff.q3      <dbl> 2678400
## $ diff.maximum <dbl> 2678400

Step 1: Augment Time Series Signature

The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame.

# Augment (adds data frame columns)
beer_sales_tbl_aug <- beer_sales_tbl %>%
    tk_augment_timeseries_signature()

head(beer_sales_tbl_aug, 10)

## # A tibble: 10 x 30
##          date price  index.num    diff  year year.iso  half quarter month
##        <date> <dbl>      <int>   <int> <int>    <int> <int>   <int> <int>
##  1 2010-01-01  6558 1262304000      NA  2010     2009     1       1     1
##  2 2010-02-01  7481 1264982400 2678400  2010     2010     1       1     2
##  3 2010-03-01  9475 1267401600 2419200  2010     2010     1       1     3
##  4 2010-04-01  9424 1270080000 2678400  2010     2010     1       2     4
##  5 2010-05-01  9351 1272672000 2592000  2010     2010     1       2     5
##  6 2010-06-01 10552 1275350400 2678400  2010     2010     1       2     6
##  7 2010-07-01  9077 1277942400 2592000  2010     2010     2       3     7
##  8 2010-08-01  9273 1280620800 2678400  2010     2010     2       3     8
##  9 2010-09-01  9420 1283299200 2678400  2010     2010     2       3     9
## 10 2010-10-01  9413 1285891200 2592000  2010     2010     2       4    10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## #   hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## #   wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## #   yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## #   week3 <int>, week4 <int>, mday7 <int>

Step 2: Model

Apply any regression model to the data. We’ll use lm(). Note that we drop the date and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it’s more useful for finding time gaps in the data).

fit_lm <- lm(price ~ ., data = select(beer_sales_tbl_aug, -c(date, diff)))
summary(fit_lm)

## 
## Call:
## lm(formula = price ~ ., data = select(beer_sales_tbl_aug, -c(date, 
##     diff)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -447.3 -145.4  -18.2  169.8  421.4 
## 
## Coefficients: (16 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.660e+08  1.245e+08   2.940 0.004738 ** 
## index.num     5.900e-03  2.003e-03   2.946 0.004661 ** 
## year         -1.974e+05  6.221e+04  -3.173 0.002434 ** 
## year.iso      1.159e+04  6.546e+03   1.770 0.082006 .  
## half         -2.132e+03  6.107e+02  -3.491 0.000935 ***
## quarter      -1.239e+04  2.190e+04  -0.566 0.573919    
## month        -3.910e+03  7.355e+03  -0.532 0.597058    
## month.xts            NA         NA      NA       NA    
## month.lbl.L          NA         NA      NA       NA    
## month.lbl.Q  -1.643e+03  2.069e+02  -7.942 8.59e-11 ***
## month.lbl.C   8.368e+02  5.139e+02   1.628 0.108949    
## month.lbl^4   6.452e+02  1.344e+02   4.801 1.18e-05 ***
## month.lbl^5   7.563e+02  4.241e+02   1.783 0.079852 .  
## month.lbl^6   3.206e+02  1.609e+02   1.992 0.051135 .  
## month.lbl^7  -3.537e+02  1.867e+02  -1.894 0.063263 .  
## month.lbl^8   3.687e+02  3.217e+02   1.146 0.256510    
## month.lbl^9          NA         NA      NA       NA    
## month.lbl^10  6.339e+02  2.240e+02   2.830 0.006414 ** 
## month.lbl^11         NA         NA      NA       NA    
## day                  NA         NA      NA       NA    
## hour                 NA         NA      NA       NA    
## minute               NA         NA      NA       NA    
## second               NA         NA      NA       NA    
## hour12               NA         NA      NA       NA    
## am.pm                NA         NA      NA       NA    
## wday         -8.264e+01  1.898e+01  -4.353 5.63e-05 ***
## wday.xts             NA         NA      NA       NA    
## wday.lbl.L           NA         NA      NA       NA    
## wday.lbl.Q   -7.109e+02  1.093e+02  -6.503 2.13e-08 ***
## wday.lbl.C    2.355e+02  1.336e+02   1.763 0.083273 .  
## wday.lbl^4    8.033e+01  1.133e+02   0.709 0.481281    
## wday.lbl^5    6.480e+01  8.029e+01   0.807 0.422951    
## wday.lbl^6    2.276e+01  8.200e+01   0.278 0.782319    
## mday                 NA         NA      NA       NA    
## qday         -1.362e+02  2.418e+02  -0.563 0.575326    
## yday         -2.356e+02  1.416e+02  -1.664 0.101627    
## mweek        -1.670e+02  1.477e+02  -1.131 0.262923    
## week         -1.764e+02  1.890e+02  -0.933 0.354618    
## week.iso      2.315e+02  1.256e+02   1.842 0.070613 .  
## week2        -1.235e+02  1.547e+02  -0.798 0.428283    
## week3                NA         NA      NA       NA    
## week4                NA         NA      NA       NA    
## mday7                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260.4 on 57 degrees of freedom
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9706 
## F-statistic: 106.4 on 26 and 57 DF,  p-value: < 2.2e-16

Step 3: Build Future (New) Data

Use tk_index() to extract the index of date.

Make a future index from the existing index with tk_make_future_timeseries. The function internally checks the periodicity and returns the correct sequence.

future_idx <- beer_sales_idx %>%
    tk_make_future_timeseries(n_future = 12)
future_idx

##  [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
##  [6] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
## [11] "2017-11-01" "2017-12-01"

From the future index, use tk_get_timeseries_signature() to turn index into time signature data frame.

new_data_tbl <- future_idx %>%
    tk_get_timeseries_signature()
head(new_data_tbl,10)

## # A tibble: 10 x 29
##         index  index.num    diff  year year.iso  half quarter month
##        <date>      <int>   <int> <int>    <int> <int>   <int> <int>
##  1 2017-01-01 1483228800      NA  2017     2016     1       1     1
##  2 2017-02-01 1485907200 2678400  2017     2017     1       1     2
##  3 2017-03-01 1488326400 2419200  2017     2017     1       1     3
##  4 2017-04-01 1491004800 2678400  2017     2017     1       2     4
##  5 2017-05-01 1493596800 2592000  2017     2017     1       2     5
##  6 2017-06-01 1496275200 2678400  2017     2017     1       2     6
##  7 2017-07-01 1498867200 2592000  2017     2017     2       3     7
##  8 2017-08-01 1501545600 2678400  2017     2017     2       3     8
##  9 2017-09-01 1504224000 2678400  2017     2017     2       3     9
## 10 2017-10-01 1506816000 2592000  2017     2017     2       4    10
## # ... with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## #   hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## #   wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## #   yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## #   week3 <int>, week4 <int>, mday7 <int>

Step 4: Predict the New Data

Use the predict() function for your regression model. Note that we drop the index and diff columns, the same as before when using the lm() function.

pred <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))

predictions_tbl <- tibble(date  = future_idx,value = pred)

Step 5: Compare Actual vs Predictions

Visualize our forecast: Plot Beer Sales Forecast

beer_sales_tbl %>%
    ggplot(aes(x = date, y = price)) +
    # Training data
    geom_line(color = palette_light()[[1]]) +
    geom_point(color = palette_light()[[1]]) +
    # Predictions
    geom_line(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
    geom_point(aes(y = value), color = palette_light()[[2]], data = predictions_tbl) +
    # Actuals
    geom_line(color = palette_light()[[1]], data = actual_sales) +
    geom_point(color = palette_light()[[1]], data = actual_sales) +
    # Aesthetics
    theme_tq() +
    labs(title = "Beer Sales Forecast: Time Series Machine Learning",
         subtitle = "Using basic multivariate linear regression can yield accurate results")

We can investigate the error on our test set (actuals vs predictions).

# Investigate test error
error_tbl <- left_join(actual_sales, predictions_tbl) %>%
    rename(actual = price, pred = value) %>%
    mutate(error = actual - pred,error_pct = error / actual) 
error_tbl

## # A tibble: 9 x 5
##         date actual      pred     error    error_pct
##       <date>  <dbl>     <dbl>     <dbl>        <dbl>
## 1 2017-01-01   8664  9509.122 -845.1223 -0.097544127
## 2 2017-02-01  10017 10909.189 -892.1891 -0.089067495
## 3 2017-03-01  11960 12281.835 -321.8352 -0.026909301
## 4 2017-04-01  11019 11378.678 -359.6777 -0.032641592
## 5 2017-05-01  12971 13080.710 -109.7099 -0.008458092
## 6 2017-06-01  14113 13583.471  529.5292  0.037520667
## 7 2017-07-01  10928 11529.085 -601.0854 -0.055004156
## 8 2017-08-01  12831 12984.939 -153.9386 -0.011997396
## 9 2017-09-01  11412 11859.998 -447.9976 -0.039256708

Undestanding Time series with timetk

Amoul Singhi

25 November 2017

Loading packages

Data loading and Data cleaning

Visualizing Data