We’ll be using some forecasting-specific packages today, as well as the usual tidyverse ones.
library(dplyr) # for manipulating data frames
library(ggplot2) # for data viz
library(here) # for simplifying folder references
library(readr) # reading and writing data files
library(fpp) # data from "Forecasting: Principles and Practice" (also loads the forecast package)
library(prophet) # facebook's forecasting package
library(janitor) # for cleaning up dataframes
library(lubridate) # for working with dates
library(tidyr) # for tidying data
library(magrittr) # for easy piping
library(knitr) # for kable() tables
library(kableExtra) # for kable_styling()
library(ggpubr) # for ggarrange(), used below to combine plots
# import custom functions:
# I will email these files to you; save them in your "src"
# folder
source(here::here("src", "stl.as.df_function.R"))
source(here::here("src", "generate-time-series_function.R"))
Before we look at specific time series, let’s take a look at some randomly generated ones. Note that there are no systematic factors underlying these series, but they can display characteristics similar to those we see in real time series.
The point is that many of the patterns we think we see in time series are probably nothing but chance. If it is easy to generate a random dataset that looks very similar to your actual data, then your actual data may not be evidence of any systematic factors at all.
Inexperienced researchers and laypeople alike usually overestimate the role of systematic factors relative to chance factors.1
# try this to get lots of random walks:
generate.ts_function(seed = FALSE)
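If you don't have the custom function handy, here is a minimal base-R sketch of the same idea: a random walk is just the cumulative sum of random noise. The seed and the 2 x 2 layout are arbitrary choices of mine, not part of the workshop function.
# minimal sketch of random walks without the custom function:
set.seed(2018)  # arbitrary; change or remove to get different walks
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
for (i in 1:4) {
  # each walk is the cumulative sum of pure noise:
  plot(ts(cumsum(rnorm(100))),
       ylab = "value",
       main = paste("Random walk", i))
}
par(mfrow = c(1, 1))  # reset the plotting layout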
Take a look at the following 2 time series. How are they different?
# this won't work unless you have imported the function first
generate.ts_function()
The series on the left has no trend and very little variability. The one on the right has a clear positive trend, and is much “choppier”, or less smooth. It may also have some seasonality, but we can’t say for sure yet.
We’ll be examining three different components of a time series2: the trend, the seasonality, and the remainder.
Let’s use some exploratory graphs to inspect the components of a time series.
So far we have been working with data frames. For example, mtcars is a data frame:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Most time series functions require data to be in a time series format. An example time series object is elecequip:
str(elecequip)
## Time-Series [1:191] from 1996 to 2012: 79.4 75.9 86.4 72.7 74.9 ...
start(elecequip)
## [1] 1996 1
end(elecequip)
## [1] 2011 11
frequency(elecequip)
## [1] 12
Just to see how it works, let’s convert one column of mtcars into a time series object. We use the ts() function for that.
eg.time.series <- mtcars %>%
select(mpg) %>%
ts(start = c(2018,1),
frequency = 12)
str(eg.time.series)
## Time-Series [1:32, 1] from 2018 to 2021: 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "mpg"
The autoplot() function knows how to deal with time series objects, so you don’t have to specify all the individual components in ggplot.
autoplot(elecequip)
ggmonthplot(elecequip)
ggseasonplot(elecequip)
The autocorrelation function (ACF) is useful for identifying seasonality, and finding the correlation structure of a time series.
ggAcf(elecequip)
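If you prefer numbers to a picture, here is a small sketch using Acf() from the forecast package; the lag.max value is an arbitrary choice.
# the same autocorrelations as numbers rather than a plot:
Acf(elecequip,
    lag.max = 24,
    plot = FALSE)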
Try creating the same graphs with these series:
usdeaths
fancy
dj
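In case you get stuck, a sketch of what this might look like for usdeaths (monthly accidental deaths in the US, loaded along with fpp):
autoplot(usdeaths)
ggmonthplot(usdeaths)
ggseasonplot(usdeaths)
ggAcf(usdeaths)
# note: the month/season plots only apply to seasonal series,
# so they won't work for dj, which is a non-seasonal daily series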
# STL decomposition of the series, plotted:
stl(elecequip,
    s.window = "periodic") %>% plot
# the same decomposition as a data frame, using the custom function from "src":
stl.as.df_fn(elecequip) %>% head
## seasonal trend remainder data timeperiod
## 1 -5.563267 80.49243 4.5008362 79.43 1
## 2 -6.088288 80.40279 1.5454971 75.86 2
## 3 7.977940 80.31315 -1.8910917 86.40 3
## 4 -6.347420 80.27340 -1.2559820 72.67 4
## 5 -4.818401 80.23365 -0.4852496 74.93 5
## 6 7.751458 80.22800 -4.0994583 83.88 6
A quick and dirty method to deseasonalize a time series is to fit a robust linear model to the data, predicting the value of the series using the month as a predictor.3 The residuals from this model are the deseasonalized series.
df1.elec <- data.frame(month = c(rep(month.abb, times = 15),
month.abb[1:11]),
# ^first we need to create a predictor column
# that identifies the month
# next we'll create a column for the year:
year = c(rep(1996:2010, each = 12),
rep(2011, 11)),
# next, a column with the data:
elecequip = as.numeric(elecequip)) %>%
# finally, use this to add row numbers:
mutate(timeperiod = 1:n())
library(MASS) # conflicts with dplyr select, so we didn't
# load at the beginning
m1.elec <- rlm(elecequip ~ month -1, # "-1" removes intercept
data = df1.elec)
unloadNamespace("MASS") # unload the package
# summary(m1.elec)
df1.elec %<>%
mutate(deseasonalized = resid(m1.elec))
# plot deseasonalized series:
p1.deseasonalized <- df1.elec %>%
ggplot(aes(x = timeperiod,
y = deseasonalized)) +
geom_line() +
labs(title = "Deseasonalised")
# plot original series:
p2.orig <- df1.elec %>%
ggplot(aes(x = timeperiod,
y = elecequip)) +
geom_line() +
labs(title = "Original series")
# arrange the plots in a single figure:
ggarrange(p1.deseasonalized,
p2.orig,
nrow = 2)
The basic steps are: create a training set, fit a model, examine the result, and validate the forecasts on a test set.
# create training set:
train <- window(elecequip,
                start = c(1996, 1),
                end = c(2011, 12)) # window() subsets a time series by date range
# fit ARIMA model:
model1 <- auto.arima(train)
# examine result:
summary(model1)
## Series: train
## ARIMA(4,0,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 sma1
## 1.1661 -0.0149 0.2053 -0.3825 -0.6083 -0.7975
## s.e. 0.1322 0.1348 0.1210 0.0718 0.1376 0.0738
##
## sigma^2 estimated as 11.09: log likelihood=-473.73
## AIC=961.46 AICc=962.12 BIC=983.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.170688 3.169897 2.373325 0.09769198 2.471343 0.2894476
## ACF1
## Training set 0.008846644
# validate on test set:
forecast(model1, 11) %>%
autoplot
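Plotting is a start, but we can also score the forecasts numerically. Here is a minimal sketch, assuming we hold back the final 11 months of elecequip as a test set and use accuracy() from the forecast package; the split point and the names train2, test2 and model2 are my own choices.
# hold back the last 11 months as a test set:
train2 <- window(elecequip, end = c(2010, 12))
test2 <- window(elecequip, start = c(2011, 1))
# fit on the training window only:
model2 <- auto.arima(train2)
# forecast 11 months ahead and compare against the held-out data:
accuracy(forecast(model2, h = 11), test2)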
# if you want the entire parking data set, run this:
# warning: this will take a long time.
# df1.1.tickets.2017 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2017Parking_Tickets.csv"))
# df1.2.tickets.2016 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2016Parking_Tickets.csv"))
# in case it's easier to load this off disk instead of
# over network:
df1.1.tickets.2017 <- read_csv(here::here("data",
"vancouver-parking-tickets-2017.csv")) %>%
select(-"X1")
df1.2.tickets.2016 <- read_csv(here::here("data",
"vancouver-parking-tickets-2016.csv"))
# union of the two datasets:
df1.tickets <- rbind(df1.2.tickets.2016,
df1.1.tickets.2017)
# let's do some wrangling:
df1.tickets %<>%
clean_names() %>%
mutate(entry_date = mdy(entry_date)) %>%
mutate_if(is.character,
factor)
# view result:
# str(df1.tickets)
# summary(df1.tickets)
Take a quick look at the data:
head(df1.tickets) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
block | street | entry_date | bylaw | section | status | infraction_text |
---|---|---|---|---|---|---|
800 | Burrard St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
800 | Howe St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | VA | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
700 | Smithe St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
700 | Smithe St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
800 | Howe St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
1000 | Robson St. | 2016-01-02 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER HEAD DISPLAYS FOUR FLASHING ZEROS IN A WINDOW |
To understand whether our forecasting methodology is any good, we must compare its forecasts against data that it has never seen. We’ll use 2018 data for this.
# df1.3.tickets.2018 <- read.csv(url("ftp://webftp.vancouver.ca/opendata/csv/2018Parking_Tickets.csv"))
# finally, read in 2018 data (test dataset)
df1.3.tickets.2018 <- read_csv(here::here("data",
"vancouver-parking-tickets-2018.csv"))%>% clean_names() %>%
mutate(entry_date = mdy(entry_date)) %>%
mutate_if(is.character,
factor)
# view result:
head(df1.3.tickets.2018) %>%
union_all(tail(df1.3.tickets.2018)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
block | street | entry_date | bylaw | section | status | infraction_text |
---|---|---|---|---|---|---|
3700 | OAK ST | 2018-01-02 | 2849 | 17.1 | IS | STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING |
3100 | OAK ST | 2018-01-02 | 2849 | 17.1 | IS | STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING |
3600 | W 2ND AVE | 2018-01-02 | 2849 | 17.6(F) | VA | PARK ON A STREET ABUTTING RESIDENTIAL OR COMMERCIAL PREMISES FOR MORE THAN 3 HRS BETWEEN 8:00A.M. AND 6:00 P.M. |
1800 | BALSAM ST | 2018-01-02 | 2849 | 17.5(B) | IS | STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET |
2200 | LARCH ST | 2018-01-02 | 2849 | 18.1(B) | IS | STOP OR PARK OTHER THAN HEADED IN THE DIRECTION OF TRAFFIC |
2200 | STEPHENS ST | 2018-01-02 | 2849 | 17.5(B) | IS | STOP WITHIN 6 METRES OF THE NEAREST EDGE OF THE CLOSEST SIDEWALK ON AN INTERSECTING STREET |
700 | NELSON ST | 2018-03-31 | 2849 | 17.2(J) | IS | STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS |
600 | HELMCKEN ST | 2018-03-31 | 2849 | 17.1 | IS | STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING |
900 | HOWE ST | 2018-03-31 | 2849 | 17.2(J) | IS | STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS |
700 | NELSON ST | 2018-03-31 | 2849 | 17.2(J) | IS | STOP ON ANY PORTION OF A STREET INDICATED BY A SIGN OR OTHER MARKER AS RESERVED FOR ONE OR MORE CLASS OF VEHICLE, EXCEPT FOR RECOGNIZED VEHICLES OF THAT CLASS |
600 | HELMCKEN ST | 2018-03-31 | 2849 | 17.1 | IS | STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS STOPPING |
800 | SEYMOUR ST | 2018-03-31 | 2849 | 17.3 | IS | STOP ON ANY PORTION OF A STREET DESIGNATED AS A BUS STOP |
Note that we have to join on a list of all dates in order to make sure there are no missing dates in the data. In case you’re keeping track, joins are something that you can do in Excel, but it’s a bit of a pain.4
# group by date and count tickets per day:
df2.daily.tickets <- df1.tickets %>%
group_by(entry_date) %>%
summarise(count = n())
# all dates:
df3.1.all.dates <- data.frame(dates = seq(as.Date("2016-01-01"),
as.Date("2017-12-31"),
by="1 day"))
# full join:
df3.daily.tickets.joined <- df2.daily.tickets %>%
full_join(df3.1.all.dates,
by = c("entry_date" = "dates")) %>%
arrange(entry_date) %>%
# replace NAs with 0s:
mutate(count = replace_na(count, 0))
# view result:
# str(df3.daily.tickets.joined)
# summary(df3.daily.tickets.joined)
head(df3.daily.tickets.joined) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
entry_date | count |
---|---|
2016-01-01 | 0 |
2016-01-02 | 1321 |
2016-01-03 | 1309 |
2016-01-04 | 1503 |
2016-01-05 | 1410 |
2016-01-06 | 1346 |
tail(df3.daily.tickets.joined) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
entry_date | count |
---|---|
2017-12-26 | 918 |
2017-12-27 | 1102 |
2017-12-28 | 1139 |
2017-12-29 | 1322 |
2017-12-30 | 1313 |
2017-12-31 | 1016 |
Interesting that there’s no data for Christmas and New Year’s.
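We can check that directly; a quick sketch, where the specific dates are just the obvious holidays to look at:
# check the holiday dates directly:
df3.daily.tickets.joined %>%
  filter(entry_date %in% as.Date(c("2016-01-01", "2016-12-25",
                                   "2017-01-01", "2017-12-25")))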
Let’s do the same grouping for the 2018 data.
df4.daily.tickets.2018 <-
df1.3.tickets.2018 %>%
group_by(entry_date) %>%
summarise(count = n())
df4.1.all.dates <- data.frame(dates=seq(as.Date("2018-01-01"),
as.Date("2018-12-31"),
by="1 day"))
df4.daily.tickets.2018 %<>%
full_join(df4.1.all.dates,
by = c("entry_date" = "dates")) %>%
arrange(entry_date)
# replace NAs with 0s:
# mutate(count = replace_na(count, 0))
# str(df4.daily.tickets.2018)
# summary(df4.daily.tickets.2018)
# head(df4.daily.tickets.2018)
# tail(df4.daily.tickets.2018)
p1.parking.2017 <- df2.daily.tickets %>%
ggplot(aes(x = entry_date,
y = count)) +
geom_line() +
labs(title = "Daily parking tickets in Vancouver, BC",
subtitle = "2016-01-01 to 2017-12-31\n\n",
caption = "\n\nSource:https://data.vancouver.ca/datacatalogue/parking-tickets.htm") +
geom_smooth() +
theme_classic(base_size = 16) +
theme(plot.caption = element_text(size = 11))
p1.parking.2017
We are often asked to interpret trends in data.
To do this, it’s very useful to decompose the time series into trend, seasonality, and remainder components.
First we have to convert the counts into a time series object:
# reference: https://robjhyndman.com/hyndsight/dailydata/
t1.tickets.time.series <- df3.daily.tickets.joined %>%
pull(count) %>% # pull out the count column as a vector
ts(frequency = 365,
start = c(2016))
str(t1.tickets.time.series)
## Time-Series [1:731] from 2016 to 2018: 0 1321 1309 1503 1410 ...
Now we can run an STL decomposition.
stl(t1.tickets.time.series,
s.window = "periodic") %>% plot
# get decomposition as a df:
df5.stl.decomp <- stl.as.df_fn(t1.tickets.time.series)
# join back on orig data:
df5.stl.decomp <- cbind(df3.daily.tickets.joined,
df5.stl.decomp) %>%
select(-"data")
# str(df5.stl.decomp)
# summary(df5.stl.decomp)
head(df5.stl.decomp, 20) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
entry_date | count | seasonal | trend | remainder | timeperiod |
---|---|---|---|---|---|
2016-01-01 | 0 | -469.69744 | 1171.656 | -701.958216 | 1 |
2016-01-02 | 1321 | -508.03829 | 1171.556 | 657.482145 | 2 |
2016-01-03 | 1309 | -79.09461 | 1171.457 | 216.637964 | 3 |
2016-01-04 | 1503 | 326.34908 | 1171.357 | 5.293784 | 4 |
2016-01-05 | 1410 | 206.79277 | 1171.258 | 31.949604 | 5 |
2016-01-06 | 1346 | 183.23646 | 1171.158 | -8.394576 | 6 |
2016-01-07 | 1360 | 217.68014 | 1171.059 | -28.738757 | 7 |
2016-01-08 | 1390 | -32.87617 | 1170.959 | 251.917063 | 8 |
2016-01-09 | 1359 | 42.06752 | 1170.860 | 146.072883 | 9 |
2016-01-10 | 1126 | 11.01120 | 1170.760 | -55.771297 | 10 |
2016-01-11 | 1416 | 139.45489 | 1170.661 | 105.884522 | 11 |
2016-01-12 | 1411 | 163.89858 | 1170.561 | 76.540342 | 12 |
2016-01-13 | 1493 | 191.34226 | 1170.462 | 131.196162 | 13 |
2016-01-14 | 1230 | 130.28595 | 1170.362 | -70.648018 | 14 |
2016-01-15 | 1293 | 73.22964 | 1170.263 | 49.507801 | 15 |
2016-01-16 | 1510 | 206.67333 | 1170.163 | 133.163621 | 16 |
2016-01-17 | 1468 | 82.11701 | 1170.064 | 215.819441 | 17 |
2016-01-18 | 1660 | 323.06070 | 1169.964 | 166.975261 | 18 |
2016-01-19 | 1704 | 403.50439 | 1169.865 | 130.631081 | 19 |
2016-01-20 | 1550 | 268.44807 | 1169.765 | 111.786900 | 20 |
Notice the distribution of the remainder. This is what’s “left over” after your decomposition model has been fitted to the past data. You should expect that the errors your forecast makes in future will have a similar distribution, at best.
It’s a good thing that the remainders have a nice bell shape with mean very close to zero.
df5.stl.decomp$remainder %>%
hist(50, main = "Distribution of remainders from STL decomposition")
abline(v = mean(df5.stl.decomp$remainder),
col = "red")
# fit model:
m1 <- prophet(df3.daily.tickets.joined %>%
rename(ds = entry_date,
y = count))
## Initial log joint probability = -75.8183
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
# forecast a full year ahead (we'll validate against actuals up to 2018-03-30):
df6.future <- make_future_dataframe(m1, periods = 365)
# make prediction:
df7.prophet.fcast <- predict(m1, df6.future)
df7.prophet.fcast %>%
tail(31) %>%
select(ds,
yhat_lower,
yhat,
yhat_upper) %>%
mutate(ds = as.character(ds) %>% as.Date) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
ds | yhat_lower | yhat | yhat_upper |
---|---|---|---|
2018-12-01 | 868.8223 | 1105.4002 | 1327.429 |
2018-12-02 | 870.7166 | 1096.5452 | 1325.477 |
2018-12-03 | 1163.7032 | 1379.1682 | 1600.859 |
2018-12-04 | 1266.7934 | 1486.6810 | 1710.486 |
2018-12-05 | 1309.3204 | 1541.8534 | 1781.005 |
2018-12-06 | 1272.8076 | 1485.6709 | 1719.847 |
2018-12-07 | 1200.9808 | 1422.4885 | 1635.526 |
2018-12-08 | 902.0280 | 1128.5438 | 1353.356 |
2018-12-09 | 872.3750 | 1099.1804 | 1328.927 |
2018-12-10 | 1116.9861 | 1360.6006 | 1591.597 |
2018-12-11 | 1226.1962 | 1446.7122 | 1661.772 |
2018-12-12 | 1264.4228 | 1480.7966 | 1708.361 |
2018-12-13 | 1173.5500 | 1404.3530 | 1628.152 |
2018-12-14 | 1107.4456 | 1322.2394 | 1547.494 |
2018-12-15 | 790.1111 | 1011.1727 | 1234.720 |
2018-12-16 | 745.8085 | 966.9395 | 1196.864 |
2018-12-17 | 989.7505 | 1216.1366 | 1426.266 |
2018-12-18 | 1059.7491 | 1293.0081 | 1523.057 |
2018-12-19 | 1089.1449 | 1321.1033 | 1547.783 |
2018-12-20 | 1014.3432 | 1242.1139 | 1471.806 |
2018-12-21 | 934.1466 | 1161.0092 | 1397.109 |
2018-12-22 | 629.5014 | 854.5325 | 1089.895 |
2018-12-23 | 601.7879 | 818.4104 | 1040.774 |
2018-12-24 | 865.1183 | 1079.0941 | 1301.108 |
2018-12-25 | 948.3636 | 1170.5993 | 1407.238 |
2018-12-26 | 970.3576 | 1216.1694 | 1447.033 |
2018-12-27 | 925.8607 | 1157.1195 | 1368.546 |
2018-12-28 | 865.8403 | 1097.9808 | 1315.144 |
2018-12-29 | 593.2192 | 815.0071 | 1031.839 |
2018-12-30 | 563.1342 | 803.3967 | 1025.142 |
2018-12-31 | 864.1994 | 1089.0462 | 1306.806 |
# plot fcast:
p2.prophet.fcast <-
plot(m1, df7.prophet.fcast) +
labs(title = "Forecast of number of daily parking tickets in Vancouver for 2018",
subtitle = "Validation set: 2018-01-01 to 2018-03-30 \nRMSE on validation set = 167 \n",
caption = "\n\nData source: https://data.vancouver.ca/datacatalogue/parking-tickets.htm ") +
theme_classic(base_size = 16); p2.prophet.fcast
# decompositions:
prophet_plot_components(m1, df7.prophet.fcast)
Let’s plot prophet’s forecast together with the actual data as well.
First we’ll prep the data for plotting.
df8.1.all.dates <-
data.frame(dates = seq(as.Date("2016-01-01"),
as.Date("2018-12-31"),
by="1 day"))
df8.hist.and.fcast <-
df8.1.all.dates %>%
full_join(df3.daily.tickets.joined,
by = c("dates" = "entry_date")) %>%
rename(count2017 = count) %>%
# now add 2018 data:
full_join(df4.daily.tickets.2018,
by = c("dates" = "entry_date")) %>%
rename(count2018 = count) %>%
# prophet's ds column is POSIXct, so convert dates to match before joining:
mutate(dates = as.POSIXct(dates)) %>%
# now join on prophet fcast:
full_join(df7.prophet.fcast %>%
filter(ds >= "2018-01-01") %>%
select(ds,
yhat_lower,
yhat,
yhat_upper),
by = c("dates" = "ds"))
str(df8.hist.and.fcast)
## 'data.frame': 1096 obs. of 6 variables:
## $ dates : POSIXct, format: "2016-01-01" "2016-01-02" ...
## $ count2017 : num 0 1321 1309 1503 1410 ...
## $ count2018 : int NA NA NA NA NA NA NA NA NA NA ...
## $ yhat_lower: num NA NA NA NA NA NA NA NA NA NA ...
## $ yhat : num NA NA NA NA NA NA NA NA NA NA ...
## $ yhat_upper: num NA NA NA NA NA NA NA NA NA NA ...
df8.hist.and.fcast[729:740, ]
## dates count2017 count2018 yhat_lower yhat yhat_upper
## 729 2017-12-29 1322 NA NA NA NA
## 730 2017-12-30 1313 NA NA NA NA
## 731 2017-12-31 1016 NA NA NA NA
## 732 2018-01-01 NA NA NA NA NA
## 733 2018-01-02 NA 1503 1003.1035 1218.3383 1445.706
## 734 2018-01-03 NA 1595 1067.3164 1291.0601 1517.842
## 735 2018-01-04 NA 1620 1018.8459 1257.3603 1502.796
## 736 2018-01-05 NA 1501 1003.1986 1221.1971 1464.912
## 737 2018-01-06 NA 1180 718.6805 958.3087 1178.859
## 738 2018-01-07 NA 1111 715.8292 963.4477 1196.713
## 739 2018-01-08 NA 1333 1029.9171 1262.1460 1479.912
## 740 2018-01-09 NA 1206 1164.3242 1387.5773 1626.816
Let’s evaluate the forecast error:
df9.validate.fcast <-
df8.hist.and.fcast %>%
filter(dates >= "2018-01-01",
dates <= "2018-03-30") %>%
select(dates,
count2018,
yhat) %>%
mutate(error = (count2018 - yhat),
sq.error = (count2018 - yhat)^2)
rmse <- sqrt(mean(df9.validate.fcast$sq.error))
print(paste0("RMSE = ", round(rmse,1)))
## [1] "RMSE = 166.7"
mae <- mean(abs(df9.validate.fcast$error))
print(paste0("MAE = ", round(mae, 1)))
## [1] "MAE = 134.7"
Now the plot:
p3.fcast.validation <-
df8.hist.and.fcast %>%
filter(dates >= "2017-12-01",
dates <= "2018-03-30") %>%
ggplot(aes(x = dates,
y = count2017)) +
geom_line() +
# add 2018 actuals:
geom_line(data = df8.hist.and.fcast %>%
filter(dates >= "2017-12-01",
dates <= "2018-03-30"),
aes(x = dates,
y = count2018),
colour = "blue",
size = 1) +
# add 2018 fcast:
geom_line(data = df8.hist.and.fcast %>%
filter(dates >= "2017-12-01",
dates <= "2018-03-30"),
aes(x = dates,
y = yhat),
colour = "steelblue") +
# add confidence intervals:
geom_ribbon(data = df8.hist.and.fcast %>%
filter(dates >= "2017-12-01",
dates <= "2018-03-30"),
aes(ymin = yhat_lower,
ymax = yhat_upper),
fill = "grey60",
alpha = .2) +
labs(title = "Evaluating forecast of number of parking tickets in 2018",
subtitle = paste0("Training data: 2016-01-01 to 2017-12-31 \nValidation data: 2018-01-01 to 2018-03-30 \nRMSE = ", round(rmse), "\n")) +
theme_classic(base_size = 16)
p3.fcast.validation
ggsave(here::here("results",
"output from src",
"2018-11-03_forecast-parking-tickets-full-2018.pdf"),
p2.prophet.fcast,
width = 10)
ggsave(here::here("results",
"output from src",
"2018-11-03_validating-2018-fcast.pdf"),
p3.fcast.validation,
width = 10)
Abelson, R. P. (1995). Statistics as Principled Argument, p. 7.
See, for example: https://www.jstatsoft.org/article/view/v040i01
https://superuser.com/questions/420635/how-do-i-join-two-worksheets-in-excel-as-i-would-in-sql