Load the packages and functions for the current session

We’ll be using some forecasting-specific packages today, as well as the usual tidyverse ones.

library(dplyr)  # for manipulating data frames 
library(ggplot2)  # for data viz
library(here)  # for simplifying folder references 
library(readr)  # reading and writing data files 
library(fpp)  # forecasting principles and practices 
library(prophet)  # Facebook's forecasting package 
library(janitor)  # for cleaning up dataframes  
library(lubridate)  # for working with dates 
library(tidyr)  # for tidying data
library(magrittr)  # for easy piping 
library(knitr)  # for kable() tables 
library(kableExtra)  # for styling kable() tables 
library(ggpubr)  # for ggarrange(), used later to combine plots 

# import custom functions: 
# I will email these files to you; save them in your "src" 
#   folder
source(here::here("src", "stl.as.df_function.R"))  
source(here::here("src", "generate-time-series_function.R"))

Part 1

Random generating processes

Before we look at specific time series, let’s take a look at some randomly generated ones. Note that there are no systematic factors underlying these series, but they can display characteristics similar to those we see in real time series.

The point is that many of the patterns we think we see in time series are probably nothing but chance. If it is easy to generate a random dataset that looks very similar to your actual data, then your actual data may provide no real evidence of any systematic factors.

Inexperienced researchers and laypeople alike usually overestimate the role of systematic factors relative to chance factors.1

# try this to get lots of random walks: 
generate.ts_function(seed = FALSE)

# run it again to get a different set of random walks: 
generate.ts_function(seed = FALSE)
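
In case you haven't imported the custom function yet, here is a minimal base-R sketch of the same idea (an illustration only, not the emailed generate.ts_function): a random walk is just the cumulative sum of independent random steps, so any "trends" you see are pure chance.

# minimal sketch of a few random walks, using base R only: 
random.walks <- replicate(6, cumsum(rnorm(100)))  # 6 walks, 100 steps each

matplot(random.walks, 
        type = "l", 
        lty = 1, 
        xlab = "time", 
        ylab = "value", 
        main = "Random walks: no systematic factors, yet 'trends' appear")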

Features of time series data

Take a look at the following 2 time series. How are they different?

# this won't work unless you have imported the function first
generate.ts_function()

We’ll be examining three different components of a time series2:

  1. Trend: a long-term increase or decrease in the data. It does not have to be linear.
  2. Seasonality: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency.
  3. Smoothness/variability: Are neighbouring values of the time series correlated with one another, or are they completely independent of one another? If they are correlated, how strongly?

The series on the left has no trend, and very little variability. The one on the right has a significant positive trend, and is much “choppier”, or less smooth. It may also have some seasonality, but we can’t say for sure yet.
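
To see the smoothness/variability idea in isolation, here is a minimal sketch (base R only) comparing uncorrelated noise with a strongly autocorrelated AR(1) series:

# minimal sketch: uncorrelated noise vs. an autocorrelated AR(1) series 
set.seed(42)
white.noise <- ts(rnorm(100))                    # neighbouring values independent
ar1.series <- arima.sim(model = list(ar = 0.9),  # neighbouring values strongly correlated
                        n = 100)

par(mfrow = c(1, 2))
plot(white.noise, main = "White noise: choppy")
plot(ar1.series, main = "AR(1), phi = 0.9: smoother")
par(mfrow = c(1, 1))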

Let’s use some exploratory graphs to inspect the components of a time series.


Visualizing time series data with the fpp package

Time series objects

So far we have been working with dataframes. For example, mtcars is a dataframe:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Most time series functions require data to be in a time series format. An example time series object is elecequip:

str(elecequip)
##  Time-Series [1:191] from 1996 to 2012: 79.4 75.9 86.4 72.7 74.9 ...
start(elecequip)
## [1] 1996    1
end(elecequip)
## [1] 2011   11
frequency(elecequip)  
## [1] 12

Just to see how it works, let’s convert one column of mtcars into a time series object. We use the ts function for that.

eg.time.series <- mtcars %>% 
    select(mpg) %>% 
    ts(start = c(2018,1), 
       frequency = 12)

str(eg.time.series)
##  Time-Series [1:32, 1] from 2018 to 2021: 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "mpg"

The autoplot() function

The autoplot() function knows how to deal with time series objects, so you don't have to specify all the individual components in ggplot.

autoplot(elecequip)

The ggmonthplot() function

ggmonthplot(elecequip)

The ggseasonplot() function

ggseasonplot(elecequip)

The ggAcf() function

The autocorrelation function (ACF) is useful for identifying seasonality, and finding the correlation structure of a time series.

ggAcf(elecequip)

Mini Exercise

Try creating the same graphs with these series:

  • usdeaths
  • fancy
  • dj
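
For example, one possible set of graphs for the usdeaths series:

# e.g. the same four graphs for usdeaths: 
autoplot(usdeaths)
ggmonthplot(usdeaths)
ggseasonplot(usdeaths)
ggAcf(usdeaths)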

Time series decomposition

stl(elecequip, 
    s.window = "periodic") %>% plot

stl.as.df_fn(elecequip) %>% head
##    seasonal    trend  remainder  data timeperiod
## 1 -5.563267 80.49243  4.5008362 79.43          1
## 2 -6.088288 80.40279  1.5454971 75.86          2
## 3  7.977940 80.31315 -1.8910917 86.40          3
## 4 -6.347420 80.27340 -1.2559820 72.67          4
## 5 -4.818401 80.23365 -0.4852496 74.93          5
## 6  7.751458 80.22800 -4.0994583 83.88          6

Optional: Quick and dirty de-seasonalizing

A quick and dirty method to deseasonalize a time series is to fit a robust linear model to the data, predicting the value of the series using the month as a predictor.3 The residuals from this model are the deseasonalized series.

df1.elec <- data.frame(month = c(rep(month.abb, times = 15), 
                                 month.abb[1:11]),
                       # ^first we need to create a predictor column
                       # that identifies the month
                       
                       # next we'll create a column for the year: 
                       year = c(rep(1996:2010, each = 12), 
                                rep(2011, 11)),
                       
                       # next, a column with the data: 
                       elecequip = as.numeric(elecequip)) %>% 
    
    # finally, use this to add row numbers:  
    mutate(timeperiod = 1:n())

                       
library(MASS)  # conflicts with dplyr select, so we didn't 
# load at the beginning 

m1.elec <- rlm(elecequip ~ month -1,  # "-1" removes intercept
               data = df1.elec)

unloadNamespace("MASS")  # unload the package 

# summary(m1.elec) 
df1.elec %<>% 
    mutate(deseasonalized = resid(m1.elec))


# plot deseasonalized series: 
p1.deseasonalized <- df1.elec %>% 
    ggplot(aes(x = timeperiod, 
               y = deseasonalized)) + 
    geom_line() + 
    labs(title = "Deseasonalised")

# plot original series: 
p2.orig <- df1.elec %>% 
    ggplot(aes(x = timeperiod, 
               y = elecequip)) + 
    geom_line() + 
    labs(title = "Original series")


# arrange the plots in a single figure: 
ggarrange(p1.deseasonalized, 
          p2.orig, 
          nrow = 2)

Producing and evaluating forecasts

The basic steps are:

  1. Split into training dataset and test dataset
  2. Fit models on training dataset
  3. Use test set to compare models based on quantitative criteria and appropriateness for the given problem
  4. Use the best model to produce the forecast

# create training set: 
train <- window(elecequip, 
                start = c(1996, 1), 
                end = c(2010, 12))  # hold out 2011 (the last 11 observations) as a test set 

# fit ARIMA model: 
model1 <- auto.arima(train)

# examine result: 
summary(model1)
## Series: train 
## ARIMA(4,0,1)(0,1,1)[12] 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.1661  -0.0149  0.2053  -0.3825  -0.6083  -0.7975
## s.e.  0.1322   0.1348  0.1210   0.0718   0.1376   0.0738
## 
## sigma^2 estimated as 11.09:  log likelihood=-473.73
## AIC=961.46   AICc=962.12   BIC=983.77
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.170688 3.169897 2.373325 0.09769198 2.471343 0.2894476
##                     ACF1
## Training set 0.008846644
# validate on test set: 
forecast(model1, 11) %>% 
    autoplot 
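
To actually compare the model against the held-out 2011 observations (step 3 above), you can use accuracy() from the forecast package. A minimal sketch, assuming the train/test split created above:

# compare forecast accuracy on the held-out 2011 observations: 
test <- window(elecequip, start = c(2011, 1))
accuracy(forecast(model1, 11), test)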


Part 2

Daily parking tickets in Vancouver

Read in and clean data for 2016-17

# if you want the entire parking data set, run this: 
# warning: this will take a long time. 

# df1.1.tickets.2017 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2017Parking_Tickets.csv"))

# df1.2.tickets.2016 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2016Parking_Tickets.csv"))

# in case it's easier to load this off disk instead of 
# over network: 
df1.1.tickets.2017 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2017.csv")) %>%
    select(-"X1")

df1.2.tickets.2016 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2016.csv"))

# union of the two datasets: 
df1.tickets <- rbind(df1.2.tickets.2016, 
                     df1.1.tickets.2017)


# let's do some wrangling: 
df1.tickets %<>% 
    clean_names() %>% 
    mutate(entry_date = mdy(entry_date)) %>% 
    mutate_if(is.character, 
              factor)

# view result: 
# str(df1.tickets)
# summary(df1.tickets)

Take a quick look at the data:

head(df1.tickets) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
block street entry_date bylaw section status infraction_text
800 Burrard St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
800 Howe St. 2016-01-02 2952 5(4)(a)(ii) VA PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
700 Smithe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
700 Smithe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
800 Howe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
1000 Robson St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW

Read in 2018 data (test set)

To understand whether our forecasting methodology is any good, we must compare its forecasts against data that it has never seen. We’ll use 2018 data for this.

# df1.3.tickets.2018 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2018Parking_Tickets.csv"))

# finally, read in 2018 data (test dataset)
df1.3.tickets.2018 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2018.csv")) %>% 
    clean_names() %>% 
    mutate(entry_date = mdy(entry_date)) %>% 
    mutate_if(is.character, 
              factor) 

# view result: 
head(df1.3.tickets.2018) %>% 
    union_all(tail(df1.3.tickets.2018)) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
block street entry_date bylaw section status infraction_text
3700 OAK ST 2018-01-02 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
3100 OAK ST 2018-01-02 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
3600 W 2ND AVE 2018-01-02 2849 17.6(F) VA PARK ON A STREET ABUTTING RESIDENTIAL OR COMMERCIAL PREMISES FOR MORE THAN 3 HRS BETWEEN 8:00A.M. AND 6:00 P.M.
1800 BALSAM ST 2018-01-02 2849 17.5(B) IS STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET
2200 LARCH ST 2018-01-02 2849 18.1(B) IS STOP OR PARK OTHER THAN HEADED IN THE DIRECTION OF TRAFFIC
2200 STEPHENS ST 2018-01-02 2849 17.5(B) IS STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET
700 NELSON ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
600 HELMCKEN ST 2018-03-31 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
900 HOWE ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
700 NELSON ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
600 HELMCKEN ST 2018-03-31 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
800 SEYMOUR ST 2018-03-31 2849 17.3 IS STOP ON ANY PORTION OF A STREET DESIGNATED AS A BUS STOP

Group by date

Note that we have to join on a list of all dates to make sure there are no missing dates in the data. In case you're keeping track, joins are something you can do in Excel, but it's a bit of a pain.4

# group by 
df2.daily.tickets <- df1.tickets %>% 
    group_by(entry_date) %>% 
    summarise(count = n())

# all dates: 
df3.1.all.dates <- data.frame(dates = seq(as.Date("2016-01-01"),
                                        as.Date("2017-12-31"),
                                        by="1 day"))

# full join: 
df3.daily.tickets.joined <- df2.daily.tickets %>% 
    full_join(df3.1.all.dates, 
              by = c("entry_date" = "dates")) %>% 
    arrange(entry_date) %>% 
    
    # replace NAs with 0s: 
    mutate(count = replace_na(count, 0))


# view result: 
# str(df3.daily.tickets.joined)
# summary(df3.daily.tickets.joined)
head(df3.daily.tickets.joined) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count
2016-01-01 0
2016-01-02 1321
2016-01-03 1309
2016-01-04 1503
2016-01-05 1410
2016-01-06 1346
tail(df3.daily.tickets.joined) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count
2017-12-26 918
2017-12-27 1102
2017-12-28 1139
2017-12-29 1322
2017-12-30 1313
2017-12-31 1016

Interesting that there’s no data for Christmas and New Year’s.
Let’s do the same grouping for the 2018 data.

df4.daily.tickets.2018 <- 
    df1.3.tickets.2018 %>% 
    group_by(entry_date) %>% 
    summarise(count = n())

df4.1.all.dates <- data.frame(dates=seq(as.Date("2018-01-01"),
                                        as.Date("2018-12-31"),
                                        by="1 day"))

df4.daily.tickets.2018 %<>% 
    full_join(df4.1.all.dates, 
              by = c("entry_date" = "dates")) %>% 
    arrange(entry_date) 
    
    # replace NAs with 0s: 
    # mutate(count = replace_na(count, 0))

# str(df4.daily.tickets.2018)
# summary(df4.daily.tickets.2018)
# head(df4.daily.tickets.2018)
# tail(df4.daily.tickets.2018)


Visualize time series

p1.parking.2017 <- df2.daily.tickets %>% 
    ggplot(aes(x = entry_date, 
               y = count)) + 
    geom_line() + 
    labs(title = "Daily parking tickets in Vancouver, BC", 
         subtitle = "2016-01-01 to 2017-12-31\n\n", 
         caption = "\n\nSource:https://data.vancouver.ca/datacatalogue/parking-tickets.htm") + 
    
    geom_smooth() +
    
    theme_classic(base_size = 16) + 
    theme(plot.caption = element_text(size = 11))

p1.parking.2017

Break down into time series components

We are often asked to interpret trends in data. Common questions include:

  • Is there seasonality in this data? If yes, can we adjust for the seasonality?
  • Is there a trend over the last X years?
  • Is this data unusual, or is it in line with previous trends?

To do this, it's very useful to decompose the time series into trend, seasonality, and remainder components.

First we have to convert the counts into a time series object:

# reference: https://robjhyndman.com/hyndsight/dailydata/

t1.tickets.time.series <- df3.daily.tickets.joined %>% 
    pull(count) %>%   # pull out the count column as a vector
    ts(frequency = 365, 
       start = c(2016)) 

str(t1.tickets.time.series) 
##  Time-Series [1:731] from 2016 to 2018: 0 1321 1309 1503 1410 ...

Now we can run an STL decomposition.

stl(t1.tickets.time.series,
    s.window = "periodic") %>% plot

# get decomposition as a df: 
df5.stl.decomp <- stl.as.df_fn(t1.tickets.time.series)

# join back on orig data: 
df5.stl.decomp <- cbind(df3.daily.tickets.joined, 
                        df5.stl.decomp) %>% 
    select(-"data")

# str(df5.stl.decomp)
# summary(df5.stl.decomp)
head(df5.stl.decomp, 20) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count seasonal trend remainder timeperiod
2016-01-01 0 -469.69744 1171.656 -701.958216 1
2016-01-02 1321 -508.03829 1171.556 657.482145 2
2016-01-03 1309 -79.09461 1171.457 216.637964 3
2016-01-04 1503 326.34908 1171.357 5.293784 4
2016-01-05 1410 206.79277 1171.258 31.949604 5
2016-01-06 1346 183.23646 1171.158 -8.394576 6
2016-01-07 1360 217.68014 1171.059 -28.738757 7
2016-01-08 1390 -32.87617 1170.959 251.917063 8
2016-01-09 1359 42.06752 1170.860 146.072883 9
2016-01-10 1126 11.01120 1170.760 -55.771297 10
2016-01-11 1416 139.45489 1170.661 105.884522 11
2016-01-12 1411 163.89858 1170.561 76.540342 12
2016-01-13 1493 191.34226 1170.462 131.196162 13
2016-01-14 1230 130.28595 1170.362 -70.648018 14
2016-01-15 1293 73.22964 1170.263 49.507801 15
2016-01-16 1510 206.67333 1170.163 133.163621 16
2016-01-17 1468 82.11701 1170.064 215.819441 17
2016-01-18 1660 323.06070 1169.964 166.975261 18
2016-01-19 1704 403.50439 1169.865 130.631081 19
2016-01-20 1550 268.44807 1169.765 111.786900 20

Notice the distribution of the remainder. This is what’s “left over” after your decomposition model has been fitted to the past data. You should expect that the errors your forecast makes in future will have a similar distribution, at best.

It’s a good thing that the remainders have a nice bell shape with mean very close to zero.

df5.stl.decomp$remainder %>% 
    hist(50, main = "Distribution of remainders from STL decomposition")

abline(v = mean(df5.stl.decomp$remainder), 
       col = "red")

Optional: use prophet package

# fit model: 
m1 <- prophet(df3.daily.tickets.joined %>% 
                  rename(ds = entry_date, 
                         y = count))
## Initial log joint probability = -75.8183
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
# forecast the next 365 days (through the end of 2018): 
df6.future <- make_future_dataframe(m1, periods = 365)

# make prediction: 
df7.prophet.fcast <- predict(m1, df6.future)  

df7.prophet.fcast %>%
    tail(31) %>%
    select(ds,
           yhat_lower,
           yhat,
           yhat_upper) %>%
    mutate(ds = as.character(ds) %>% as.Date) %>%

    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
ds yhat_lower yhat yhat_upper
2018-12-01 868.8223 1105.4002 1327.429
2018-12-02 870.7166 1096.5452 1325.477
2018-12-03 1163.7032 1379.1682 1600.859
2018-12-04 1266.7934 1486.6810 1710.486
2018-12-05 1309.3204 1541.8534 1781.005
2018-12-06 1272.8076 1485.6709 1719.847
2018-12-07 1200.9808 1422.4885 1635.526
2018-12-08 902.0280 1128.5438 1353.356
2018-12-09 872.3750 1099.1804 1328.927
2018-12-10 1116.9861 1360.6006 1591.597
2018-12-11 1226.1962 1446.7122 1661.772
2018-12-12 1264.4228 1480.7966 1708.361
2018-12-13 1173.5500 1404.3530 1628.152
2018-12-14 1107.4456 1322.2394 1547.494
2018-12-15 790.1111 1011.1727 1234.720
2018-12-16 745.8085 966.9395 1196.864
2018-12-17 989.7505 1216.1366 1426.266
2018-12-18 1059.7491 1293.0081 1523.057
2018-12-19 1089.1449 1321.1033 1547.783
2018-12-20 1014.3432 1242.1139 1471.806
2018-12-21 934.1466 1161.0092 1397.109
2018-12-22 629.5014 854.5325 1089.895
2018-12-23 601.7879 818.4104 1040.774
2018-12-24 865.1183 1079.0941 1301.108
2018-12-25 948.3636 1170.5993 1407.238
2018-12-26 970.3576 1216.1694 1447.033
2018-12-27 925.8607 1157.1195 1368.546
2018-12-28 865.8403 1097.9808 1315.144
2018-12-29 593.2192 815.0071 1031.839
2018-12-30 563.1342 803.3967 1025.142
2018-12-31 864.1994 1089.0462 1306.806
# plot fcast:
p2.prophet.fcast <-
    plot(m1, df7.prophet.fcast) +
    labs(title = "Forecast of number of daily parking tickets in Vancouver for 2018",
         subtitle = "Validation set: 2018-01-01 to 2018-03-30 \nRMSE on validation set = 167 \n",
         caption = "\n\nData source: https://data.vancouver.ca/datacatalogue/parking-tickets.htm ") +
    theme_classic(base_size = 16); p2.prophet.fcast

# decompositions: 
prophet_plot_components(m1, df7.prophet.fcast)

Let’s plot prophet’s forecast alongside the actual data as well.

First we’ll prep the data for plotting.

df8.1.all.dates <- 
    data.frame(dates = seq(as.Date("2016-01-01"),
                           as.Date("2018-12-31"),
                           by="1 day"))
    
    
df8.hist.and.fcast <- 
    df8.1.all.dates %>% 
    full_join(df3.daily.tickets.joined, 
              by = c("dates" = "entry_date")) %>% 
    rename(count2017 = count) %>% 
    
    # now add 2018 data: 
    full_join(df4.daily.tickets.2018, 
              by = c("dates" = "entry_date")) %>% 
    rename(count2018 = count) %>% 
    
    # apparently have to change to POSIXct: 
    mutate(dates = as.POSIXct(dates)) %>% 
    
    # now join on prophet fcast: 
    full_join(df7.prophet.fcast %>% 
                  filter(ds >= "2018-01-01") %>% 
                  select(ds, 
                         yhat_lower, 
                         yhat, 
                         yhat_upper), 
              by = c("dates" = "ds"))  


str(df8.hist.and.fcast)
## 'data.frame':    1096 obs. of  6 variables:
##  $ dates     : POSIXct, format: "2016-01-01" "2016-01-02" ...
##  $ count2017 : num  0 1321 1309 1503 1410 ...
##  $ count2018 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat_lower: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat_upper: num  NA NA NA NA NA NA NA NA NA NA ...
df8.hist.and.fcast[729:740, ]
##          dates count2017 count2018 yhat_lower      yhat yhat_upper
## 729 2017-12-29      1322        NA         NA        NA         NA
## 730 2017-12-30      1313        NA         NA        NA         NA
## 731 2017-12-31      1016        NA         NA        NA         NA
## 732 2018-01-01        NA        NA         NA        NA         NA
## 733 2018-01-02        NA      1503  1003.1035 1218.3383   1445.706
## 734 2018-01-03        NA      1595  1067.3164 1291.0601   1517.842
## 735 2018-01-04        NA      1620  1018.8459 1257.3603   1502.796
## 736 2018-01-05        NA      1501  1003.1986 1221.1971   1464.912
## 737 2018-01-06        NA      1180   718.6805  958.3087   1178.859
## 738 2018-01-07        NA      1111   715.8292  963.4477   1196.713
## 739 2018-01-08        NA      1333  1029.9171 1262.1460   1479.912
## 740 2018-01-09        NA      1206  1164.3242 1387.5773   1626.816

Let’s evaluate the forecast error:

df9.validate.fcast <- 
    df8.hist.and.fcast %>% 
    filter(dates >= "2018-01-01", 
           dates <= "2018-03-30") %>% 
    select(dates, 
           count2018, 
           yhat) %>% 
    mutate(error = (count2018 - yhat),
           sq.error = (count2018 - yhat)^2)

rmse <- sqrt(mean(df9.validate.fcast$sq.error))
print(paste0("RMSE = ", round(rmse,1)))
## [1] "RMSE = 166.7"
mae <- mean(abs(df9.validate.fcast$error))
print(paste0("MAE = ", round(mae, 1)))
## [1] "MAE = 134.7"

Now the plot:

p3.fcast.validation <- 
    df8.hist.and.fcast %>% 
    filter(dates >= "2017-12-01",
           dates <= "2018-03-30") %>% 
    ggplot(aes(x = dates, 
               y = count2017)) + 
    geom_line() + 
    
    # add 2018 actuals: 
    geom_line(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"), 
              aes(x = dates, 
                  y = count2018), 
              colour = "blue", 
              size = 1) + 
    
    # add 2018 fcast: 
    geom_line(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"), 
              aes(x = dates, 
                  y = yhat), 
              colour = "steelblue") + 
    
    # add confidence intervals: 
    geom_ribbon(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"),
                aes(ymin = yhat_lower, 
                    ymax = yhat_upper), 
                fill = "grey60", 
                alpha = .2) + 
    
    labs(title = "Evaluating forecast of number of parking tickets in 2018", 
         subtitle = paste0("Training data: 2016-01-01 to 2017-12-31 \nValidation data: 2018-01-01 to 2018-03-30 \nRMSE = ", round(rmse), "\n")) + 
    theme_classic(base_size = 16)

p3.fcast.validation

Write outputs

ggsave(here::here("results", 
                  "output from src", 
                  "2018-11-03_forecast-parking-tickets-full-2018.pdf"), 
       p2.prophet.fcast, 
       width = 10)

ggsave(here::here("results", 
                  "output from src", 
                  "2018-11-03_validating-2018-fcast.pdf"), 
       p3.fcast.validation, 
       width = 10)

  1. Abelson, 1995, Statistics as Principled Argument, p. 7

  2. https://otexts.org/fpp2/tspatterns.html

  3. see here for example: https://www.jstatsoft.org/article/view/v040i01

  4. https://superuser.com/questions/420635/how-do-i-join-two-worksheets-in-excel-as-i-would-in-sql