Load the packages and functions for the current session

We’ll be using some forecasting-specific packages today, as well as the usual tidyverse ones.

library(dplyr)  # for manipulating data frames 
library(ggplot2)  # for data viz
library(here)  # for simplifying folder references 
library(readr)  # reading and writing data files 
library(fpp)  # forecasting principles and practices 
library(prophet)  # Facebook's forecasting package 
library(janitor)  # for cleaning up dataframes  
library(lubridate)  # for working with dates 
library(tidyr)  # for tidying data
library(magrittr)  # for easy piping 
library(knitr)  # for kable() tables 
library(kableExtra)  # for styling kable() tables 
library(ggpubr)  # for ggarrange(), used later to combine plots 

# import custom functions: 
# I will email these files to you; save them in your "src" 
#   folder
source(here::here("src", "stl.as.df_function.R"))  
source(here::here("src", "generate-time-series_function.R"))

Part 1

Random generating processes

Before we look at specific time series, let’s take a look at some randomly generated ones. Note that there are no systematic factors underlying these series, but they can display characteristics similar to those we see in real time series.

The point is that many of the patterns we think we see in time series are probably nothing but chance. If it is easy to generate a random dataset that looks very similar to your actual data, then your actual data may provide no real evidence of any systematic factors.

Inexperienced researchers and laypeople alike usually overestimate the role of systematic factors relative to chance factors.1

# try this to get lots of random walks: 
generate.ts_function(seed = FALSE)

# run it again to get a different set of random walks: 
generate.ts_function(seed = FALSE)
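
In case you haven't imported the custom function yet, here is a minimal base-R sketch of the same idea (an illustration only, not the emailed generate.ts_function): a random walk is just the cumulative sum of independent random steps, so any "trends" you see are pure chance.

# minimal sketch of a few random walks, using base R only: 
random.walks <- replicate(6, cumsum(rnorm(100)))  # 6 walks, 100 steps each

matplot(random.walks, 
        type = "l", 
        lty = 1, 
        xlab = "time", 
        ylab = "value", 
        main = "Random walks: no systematic factors, yet 'trends' appear")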

Features of time series data

Take a look at the following 2 time series. How are they different?

# this won't work unless you have imported the function first
generate.ts_function()

We’ll be examining three different components of a time series2:

  1. Trend: a long-term increase or decrease in the data. It does not have to be linear.
  2. Seasonality: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency.
  3. Smoothness/variability: Are neighbouring values of the time series correlated with one another, or are they completely independent of one another? If they are correlated, how strongly?

The series on the left has no trend, and very little variability. The one on the right has a significant positive trend, and is much “choppier”, or less smooth. It may also have some seasonality, but we can’t say for sure yet.
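
To see the smoothness/variability idea in isolation, here is a minimal sketch (base R only) comparing uncorrelated noise with a strongly autocorrelated AR(1) series:

# minimal sketch: uncorrelated noise vs. an autocorrelated AR(1) series 
set.seed(42)
white.noise <- ts(rnorm(100))                    # neighbouring values independent
ar1.series <- arima.sim(model = list(ar = 0.9),  # neighbouring values strongly correlated
                        n = 100)

par(mfrow = c(1, 2))
plot(white.noise, main = "White noise: choppy")
plot(ar1.series, main = "AR(1), phi = 0.9: smoother")
par(mfrow = c(1, 1))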

Let’s use some exploratory graphs to inspect the components of a time series.


Visualizing time series data with the fpp package

Time series objects

So far we have been working with dataframes. For example, mtcars is a dataframe:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Most time series functions require data to be in a time series format. An example time series object is elecequip:

str(elecequip)
##  Time-Series [1:191] from 1996 to 2012: 79.4 75.9 86.4 72.7 74.9 ...
start(elecequip)
## [1] 1996    1
end(elecequip)
## [1] 2011   11
frequency(elecequip)  
## [1] 12

Just to see how it works, let’s convert one column of mtcars into a time series object. We use the ts function for that.

eg.time.series <- mtcars %>% 
    select(mpg) %>% 
    ts(start = c(2018,1), 
       frequency = 12)

str(eg.time.series)
##  Time-Series [1:32, 1] from 2018 to 2021: 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "mpg"

The autoplot() function

The autoplot() function knows how to deal with time series objects, so you don't have to specify all the individual components in ggplot.

autoplot(elecequip)

The ggmonthplot() function

ggmonthplot(elecequip)

The ggseasonplot() function

ggseasonplot(elecequip)

The ggAcf() function

The autocorrelation function (ACF) is useful for identifying seasonality, and finding the correlation structure of a time series.

ggAcf(elecequip)

Mini Exercise

Try creating the same graphs with these series:

  • usdeaths
  • fancy
  • dj
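
For example, one possible set of graphs for the usdeaths series:

# e.g. the same four graphs for usdeaths: 
autoplot(usdeaths)
ggmonthplot(usdeaths)
ggseasonplot(usdeaths)
ggAcf(usdeaths)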

Time series decomposition

stl(elecequip, 
    s.window = "periodic") %>% plot

stl.as.df_fn(elecequip) %>% head
##    seasonal    trend  remainder  data timeperiod
## 1 -5.563267 80.49243  4.5008362 79.43          1
## 2 -6.088288 80.40279  1.5454971 75.86          2
## 3  7.977940 80.31315 -1.8910917 86.40          3
## 4 -6.347420 80.27340 -1.2559820 72.67          4
## 5 -4.818401 80.23365 -0.4852496 74.93          5
## 6  7.751458 80.22800 -4.0994583 83.88          6

Optional: Quick and dirty de-seasonalizing

A quick and dirty method to deseasonalize a time series is to fit a robust linear model to the data, predicting the value of the series using the month as a predictor.3 The residuals from this model are the deseasonalized series.

df1.elec <- data.frame(month = c(rep(month.abb, times = 15), 
                                 month.abb[1:11]),
                       # ^first we need to create a predictor column
                       # that identifies the month
                       
                       # next we'll create a column for the year: 
                       year = c(rep(1996:2010, each = 12), 
                                rep(2011, 11)),
                       
                       # next, a column with the data: 
                       elecequip = as.numeric(elecequip)) %>% 
    
    # finally, use this to add row numbers:  
    mutate(timeperiod = 1:n())

                       
library(MASS)  # conflicts with dplyr select, so we didn't 
# load at the beginning 

m1.elec <- rlm(elecequip ~ month -1,  # "-1" removes intercept
               data = df1.elec)

unloadNamespace("MASS")  # unload the package 

# summary(m1.elec) 
df1.elec %<>% 
    mutate(deseasonalized = resid(m1.elec))


# plot deseasonalized series: 
p1.deseasonalized <- df1.elec %>% 
    ggplot(aes(x = timeperiod, 
               y = deseasonalized)) + 
    geom_line() + 
    labs(title = "Deseasonalised")

# plot original series: 
p2.orig <- df1.elec %>% 
    ggplot(aes(x = timeperiod, 
               y = elecequip)) + 
    geom_line() + 
    labs(title = "Original series")


# arrange the plots in a single figure: 
ggarrange(p1.deseasonalized, 
          p2.orig, 
          nrow = 2)

Producing and evaluating forecasts

The basic steps are:

  1. Split into training dataset and test dataset
  2. Fit models on training dataset
  3. Use test set to compare models based on quantitative criteria and appropriateness for the given problem
  4. Use the best model to produce the forecast

# create training set: 
train <- window(elecequip, 
                start = c(1996, 1), 
                end = c(2010, 12))  # hold out 2011 (the last 11 observations) as a test set 

# fit ARIMA model: 
model1 <- auto.arima(train)

# examine result: 
summary(model1)
## Series: train 
## ARIMA(4,0,1)(0,1,1)[12] 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.1661  -0.0149  0.2053  -0.3825  -0.6083  -0.7975
## s.e.  0.1322   0.1348  0.1210   0.0718   0.1376   0.0738
## 
## sigma^2 estimated as 11.09:  log likelihood=-473.73
## AIC=961.46   AICc=962.12   BIC=983.77
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.170688 3.169897 2.373325 0.09769198 2.471343 0.2894476
##                     ACF1
## Training set 0.008846644
# validate on test set: 
forecast(model1, 11) %>% 
    autoplot 
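
To actually compare the model against the held-out 2011 observations (step 3 above), you can use accuracy() from the forecast package. A minimal sketch, assuming the train/test split created above:

# compare forecast accuracy on the held-out 2011 observations: 
test <- window(elecequip, start = c(2011, 1))
accuracy(forecast(model1, 11), test)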


Part 2

Daily parking tickets in Vancouver

Read in and clean data for 2016-17

# if you want the entire parking data set, run this: 
# warning: this will take a long time. 

# df1.1.tickets.2017 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2017Parking_Tickets.csv"))

# df1.2.tickets.2016 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2016Parking_Tickets.csv"))

# in case it's easier to load this off disk instead of 
# over network: 
df1.1.tickets.2017 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2017.csv")) %>%
    select(-"X1")

df1.2.tickets.2016 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2016.csv"))

# union of the two datasets: 
df1.tickets <- rbind(df1.2.tickets.2016, 
                     df1.1.tickets.2017)


# let's do some wrangling: 
df1.tickets %<>% 
    clean_names() %>% 
    mutate(entry_date = mdy(entry_date)) %>% 
    mutate_if(is.character, 
              factor)

# view result: 
# str(df1.tickets)
# summary(df1.tickets)

Take a quick look at the data:

head(df1.tickets) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
block street entry_date bylaw section status infraction_text
800 Burrard St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
800 Howe St. 2016-01-02 2952 5(4)(a)(ii) VA PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
700 Smithe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
700 Smithe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
800 Howe St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW
1000 Robson St. 2016-01-02 2952 5(4)(a)(ii) IS PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW

Read in 2018 data (test set)

To understand whether our forecasting methodology is any good, we must compare its forecasts against data that it has never seen. We’ll use 2018 data for this.

# df1.3.tickets.2018 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2018Parking_Tickets.csv"))

# finally, read in 2018 data (test dataset)
df1.3.tickets.2018 <- read_csv(here::here("data", 
                        "vancouver-parking-tickets-2018.csv")) %>% 
    clean_names() %>% 
    mutate(entry_date = mdy(entry_date)) %>% 
    mutate_if(is.character, 
              factor) 

# view result: 
head(df1.3.tickets.2018) %>% 
    union_all(tail(df1.3.tickets.2018)) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
block street entry_date bylaw section status infraction_text
3700 OAK ST 2018-01-02 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
3100 OAK ST 2018-01-02 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
3600 W 2ND AVE 2018-01-02 2849 17.6(F) VA PARK ON A STREET ABUTTING RESIDENTIAL OR COMMERCIAL PREMISES FOR MORE THAN 3 HRS BETWEEN 8:00A.M. AND 6:00 P.M.
1800 BALSAM ST 2018-01-02 2849 17.5(B) IS STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET
2200 LARCH ST 2018-01-02 2849 18.1(B) IS STOP OR PARK OTHER THAN HEADED IN THE DIRECTION OF TRAFFIC
2200 STEPHENS ST 2018-01-02 2849 17.5(B) IS STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET
700 NELSON ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
600 HELMCKEN ST 2018-03-31 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
900 HOWE ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
700 NELSON ST 2018-03-31 2849 17.2(J) IS STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS
600 HELMCKEN ST 2018-03-31 2849 17.1 IS STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING
800 SEYMOUR ST 2018-03-31 2849 17.3 IS STOP ON ANY PORTION OF A STREET DESIGNATED AS A BUS STOP

Group by date

Note that we have to join on a list of all dates to make sure there are no missing dates in the data. In case you're keeping track, joins are something you can do in Excel, but it's a bit of a pain.4

# group by 
df2.daily.tickets <- df1.tickets %>% 
    group_by(entry_date) %>% 
    summarise(count = n())

# all dates: 
df3.1.all.dates <- data.frame(dates = seq(as.Date("2016-01-01"),
                                        as.Date("2017-12-31"),
                                        by="1 day"))

# full join: 
df3.daily.tickets.joined <- df2.daily.tickets %>% 
    full_join(df3.1.all.dates, 
              by = c("entry_date" = "dates")) %>% 
    arrange(entry_date) %>% 
    
    # replace NAs with 0s: 
    mutate(count = replace_na(count, 0))


# view result: 
# str(df3.daily.tickets.joined)
# summary(df3.daily.tickets.joined)
head(df3.daily.tickets.joined) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count
2016-01-01 0
2016-01-02 1321
2016-01-03 1309
2016-01-04 1503
2016-01-05 1410
2016-01-06 1346
tail(df3.daily.tickets.joined) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count
2017-12-26 918
2017-12-27 1102
2017-12-28 1139
2017-12-29 1322
2017-12-30 1313
2017-12-31 1016

Interesting that there’s no data for Christmas and New Year’s.
Let’s do the same grouping for the 2018 data.

df4.daily.tickets.2018 <- 
    df1.3.tickets.2018 %>% 
    group_by(entry_date) %>% 
    summarise(count = n())

df4.1.all.dates <- data.frame(dates=seq(as.Date("2018-01-01"),
                                        as.Date("2018-12-31"),
                                        by="1 day"))

df4.daily.tickets.2018 %<>% 
    full_join(df4.1.all.dates, 
              by = c("entry_date" = "dates")) %>% 
    arrange(entry_date) 
    
    # replace NAs with 0s: 
    # mutate(count = replace_na(count, 0))

# str(df4.daily.tickets.2018)
# summary(df4.daily.tickets.2018)
# head(df4.daily.tickets.2018)
# tail(df4.daily.tickets.2018)


Visualize time series

p1.parking.2017 <- df2.daily.tickets %>% 
    ggplot(aes(x = entry_date, 
               y = count)) + 
    geom_line() + 
    labs(title = "Daily parking tickets in Vancouver, BC", 
         subtitle = "2016-01-01 to 2017-12-31\n\n", 
         caption = "\n\nSource:https://data.vancouver.ca/datacatalogue/parking-tickets.htm") + 
    
    geom_smooth() +
    
    theme_classic(base_size = 16) + 
    theme(plot.caption = element_text(size = 11))

p1.parking.2017

Break down into time series components

We are often asked to interpret trends in data. Common questions include:

  • Is there seasonality in this data? If yes, can we adjust for the seasonality?
  • Is there a trend over the last X years?
  • Is this data unusual, or is it in line with previous trends?

To do this, it's very useful to decompose the time series into trend, seasonality, and remainder components.

First we have to convert the counts into a time series object:

# reference: https://robjhyndman.com/hyndsight/dailydata/

t1.tickets.time.series <- df3.daily.tickets.joined %>% 
    pull(count) %>%   # pull out the count column as a vector
    ts(frequency = 365, 
       start = c(2016)) 

str(t1.tickets.time.series) 
##  Time-Series [1:731] from 2016 to 2018: 0 1321 1309 1503 1410 ...

Now we can run an STL decomposition.

stl(t1.tickets.time.series,
    s.window = "periodic") %>% plot

# get decomposition as a df: 
df5.stl.decomp <- stl.as.df_fn(t1.tickets.time.series)

# join back on orig data: 
df5.stl.decomp <- cbind(df3.daily.tickets.joined, 
                        df5.stl.decomp) %>% 
    select(-"data")

# str(df5.stl.decomp)
# summary(df5.stl.decomp)
head(df5.stl.decomp, 20) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
entry_date count seasonal trend remainder timeperiod
2016-01-01 0 -469.69744 1171.656 -701.958216 1
2016-01-02 1321 -508.03829 1171.556 657.482145 2
2016-01-03 1309 -79.09461 1171.457 216.637964 3
2016-01-04 1503 326.34908 1171.357 5.293784 4
2016-01-05 1410 206.79277 1171.258 31.949604 5
2016-01-06 1346 183.23646 1171.158 -8.394576 6
2016-01-07 1360 217.68014 1171.059 -28.738757 7
2016-01-08 1390 -32.87617 1170.959 251.917063 8
2016-01-09 1359 42.06752 1170.860 146.072883 9
2016-01-10 1126 11.01120 1170.760 -55.771297 10
2016-01-11 1416 139.45489 1170.661 105.884522 11
2016-01-12 1411 163.89858 1170.561 76.540342 12
2016-01-13 1493 191.34226 1170.462 131.196162 13
2016-01-14 1230 130.28595 1170.362 -70.648018 14
2016-01-15 1293 73.22964 1170.263 49.507801 15
2016-01-16 1510 206.67333 1170.163 133.163621 16
2016-01-17 1468 82.11701 1170.064 215.819441 17
2016-01-18 1660 323.06070 1169.964 166.975261 18
2016-01-19 1704 403.50439 1169.865 130.631081 19
2016-01-20 1550 268.44807 1169.765 111.786900 20

Notice the distribution of the remainder. This is what’s “left over” after your decomposition model has been fitted to the past data. You should expect that the errors your forecast makes in future will have a similar distribution, at best.

It’s a good thing that the remainders have a nice bell shape with mean very close to zero.

df5.stl.decomp$remainder %>% 
    hist(50, main = "Distribution of remainders from STL decomposition")

abline(v = mean(df5.stl.decomp$remainder), 
       col = "red")

Optional: use prophet package

# fit model: 
m1 <- prophet(df3.daily.tickets.joined %>% 
                  rename(ds = entry_date, 
                         y = count))
## Initial log joint probability = -75.8183
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
# forecast the next 365 days (through the end of 2018): 
df6.future <- make_future_dataframe(m1, periods = 365)

# make prediction: 
df7.prophet.fcast <- predict(m1, df6.future)  

df7.prophet.fcast %>%
    tail(31) %>%
    select(ds,
           yhat_lower,
           yhat,
           yhat_upper) %>%
    mutate(ds = as.character(ds) %>% as.Date) %>%

    kable() %>% 
    kable_styling(bootstrap_options = c("striped",
                                        "condensed", 
                                        "responsive"), 
                  full_width = FALSE, 
                  position = "left")
ds yhat_lower yhat yhat_upper
2018-12-01 868.8223 1105.4002 1327.429
2018-12-02 870.7166 1096.5452 1325.477
2018-12-03 1163.7032 1379.1682 1600.859
2018-12-04 1266.7934 1486.6810 1710.486
2018-12-05 1309.3204 1541.8534 1781.005
2018-12-06 1272.8076 1485.6709 1719.847
2018-12-07 1200.9808 1422.4885 1635.526
2018-12-08 902.0280 1128.5438 1353.356
2018-12-09 872.3750 1099.1804 1328.927
2018-12-10 1116.9861 1360.6006 1591.597
2018-12-11 1226.1962 1446.7122 1661.772
2018-12-12 1264.4228 1480.7966 1708.361
2018-12-13 1173.5500 1404.3530 1628.152
2018-12-14 1107.4456 1322.2394 1547.494
2018-12-15 790.1111 1011.1727 1234.720
2018-12-16 745.8085 966.9395 1196.864
2018-12-17 989.7505 1216.1366 1426.266
2018-12-18 1059.7491 1293.0081 1523.057
2018-12-19 1089.1449 1321.1033 1547.783
2018-12-20 1014.3432 1242.1139 1471.806
2018-12-21 934.1466 1161.0092 1397.109
2018-12-22 629.5014 854.5325 1089.895
2018-12-23 601.7879 818.4104 1040.774
2018-12-24 865.1183 1079.0941 1301.108
2018-12-25 948.3636 1170.5993 1407.238
2018-12-26 970.3576 1216.1694 1447.033
2018-12-27 925.8607 1157.1195 1368.546
2018-12-28 865.8403 1097.9808 1315.144
2018-12-29 593.2192 815.0071 1031.839
2018-12-30 563.1342 803.3967 1025.142
2018-12-31 864.1994 1089.0462 1306.806
# plot fcast:
p2.prophet.fcast <-
    plot(m1, df7.prophet.fcast) +
    labs(title = "Forecast of number of daily parking tickets in Vancouver for 2018",
         subtitle = "Validation set: 2018-01-01 to 2018-03-30 \nRMSE on validation set = 167 \n",
         caption = "\n\nData source: https://data.vancouver.ca/datacatalogue/parking-tickets.htm ") +
    theme_classic(base_size = 16); p2.prophet.fcast

# decompositions: 
prophet_plot_components(m1, df7.prophet.fcast)

Let’s plot prophet’s forecast alongside the actual data as well.

First we’ll prep the data for plotting.

df8.1.all.dates <- 
    data.frame(dates = seq(as.Date("2016-01-01"),
                           as.Date("2018-12-31"),
                           by="1 day"))
    
    
df8.hist.and.fcast <- 
    df8.1.all.dates %>% 
    full_join(df3.daily.tickets.joined, 
              by = c("dates" = "entry_date")) %>% 
    rename(count2017 = count) %>% 
    
    # now add 2018 data: 
    full_join(df4.daily.tickets.2018, 
              by = c("dates" = "entry_date")) %>% 
    rename(count2018 = count) %>% 
    
    # apparently have to change to POSIXct: 
    mutate(dates = as.POSIXct(dates)) %>% 
    
    # now join on prophet fcast: 
    full_join(df7.prophet.fcast %>% 
                  filter(ds >= "2018-01-01") %>% 
                  select(ds, 
                         yhat_lower, 
                         yhat, 
                         yhat_upper), 
              by = c("dates" = "ds"))  


str(df8.hist.and.fcast)
## 'data.frame':    1096 obs. of  6 variables:
##  $ dates     : POSIXct, format: "2016-01-01" "2016-01-02" ...
##  $ count2017 : num  0 1321 1309 1503 1410 ...
##  $ count2018 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat_lower: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ yhat_upper: num  NA NA NA NA NA NA NA NA NA NA ...
df8.hist.and.fcast[729:740, ]
##          dates count2017 count2018 yhat_lower      yhat yhat_upper
## 729 2017-12-29      1322        NA         NA        NA         NA
## 730 2017-12-30      1313        NA         NA        NA         NA
## 731 2017-12-31      1016        NA         NA        NA         NA
## 732 2018-01-01        NA        NA         NA        NA         NA
## 733 2018-01-02        NA      1503  1003.1035 1218.3383   1445.706
## 734 2018-01-03        NA      1595  1067.3164 1291.0601   1517.842
## 735 2018-01-04        NA      1620  1018.8459 1257.3603   1502.796
## 736 2018-01-05        NA      1501  1003.1986 1221.1971   1464.912
## 737 2018-01-06        NA      1180   718.6805  958.3087   1178.859
## 738 2018-01-07        NA      1111   715.8292  963.4477   1196.713
## 739 2018-01-08        NA      1333  1029.9171 1262.1460   1479.912
## 740 2018-01-09        NA      1206  1164.3242 1387.5773   1626.816

Let’s evaluate the forecast error:

df9.validate.fcast <- 
    df8.hist.and.fcast %>% 
    filter(dates >= "2018-01-01", 
           dates <= "2018-03-30") %>% 
    select(dates, 
           count2018, 
           yhat) %>% 
    mutate(error = (count2018 - yhat),
           sq.error = (count2018 - yhat)^2)

rmse <- sqrt(mean(df9.validate.fcast$sq.error))
print(paste0("RMSE = ", round(rmse,1)))
## [1] "RMSE = 166.7"
mae <- mean(abs(df9.validate.fcast$error))
print(paste0("MAE = ", round(mae, 1)))
## [1] "MAE = 134.7"

Now the plot:

p3.fcast.validation <- 
    df8.hist.and.fcast %>% 
    filter(dates >= "2017-12-01",
           dates <= "2018-03-30") %>% 
    ggplot(aes(x = dates, 
               y = count2017)) + 
    geom_line() + 
    
    # add 2018 actuals: 
    geom_line(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"), 
              aes(x = dates, 
                  y = count2018), 
              colour = "blue", 
              size = 1) + 
    
    # add 2018 fcast: 
    geom_line(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"), 
              aes(x = dates, 
                  y = yhat), 
              colour = "steelblue") + 
    
    # add confidence intervals: 
    geom_ribbon(data = df8.hist.and.fcast %>%
                  filter(dates >= "2017-12-01",
                         dates <= "2018-03-30"),
                aes(ymin = yhat_lower, 
                    ymax = yhat_upper), 
                fill = "grey60", 
                alpha = .2) + 
    
    labs(title = "Evaluating forecast of number of parking tickets in 2018", 
         subtitle = paste0("Training data: 2016-01-01 to 2017-12-31 \nValidation data: 2018-01-01 to 2018-03-30 \nRMSE = ", round(rmse), "\n")) + 
    theme_classic(base_size = 16)

p3.fcast.validation

Write outputs

ggsave(here::here("results", 
                  "output from src", 
                  "2018-11-03_forecast-parking-tickets-full-2018.pdf"), 
       p2.prophet.fcast, 
       width = 10)

ggsave(here::here("results", 
                  "output from src", 
                  "2018-11-03_validating-2018-fcast.pdf"), 
       p3.fcast.validation, 
       width = 10)

  1. Abelson, 1995, Statistics as Principled Argument, p. 7

  2. https://otexts.org/fpp2/tspatterns.html

  3. see here for example: https://www.jstatsoft.org/article/view/v040i01

  4. https://superuser.com/questions/420635/how-do-i-join-two-worksheets-in-excel-as-i-would-in-sql