library(fpp3)
library(readxl)
library(writexl)
library(tidyverse)
library(zoo)
Project 01
Instructions
This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 30th. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
Part A - ATM Forecast
Data: ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.
Setup
Load the Data
This is our second time around. We found that DATE is not recognized as a date; it comes through as an odd serial number. But read_xlsx knows how to convert it, so we explicitly ask the reader to do so. That throws a ton of warnings, but we can ignore them. We would like to convert ATM to a factor, but that's not supported by read_xlsx, so we'll do that afterwards.
data <- read_xlsx(
  "data/ATM624Data.xlsx",
  col_types = c("date", "guess", "numeric")
)
Inspect the Data
head(data)
# A tibble: 6 × 3
DATE ATM Cash
<dttm> <chr> <dbl>
1 2009-05-01 00:00:00 ATM1 96
2 2009-05-01 00:00:00 ATM2 107
3 2009-05-02 00:00:00 ATM1 82
4 2009-05-02 00:00:00 ATM2 89
5 2009-05-03 00:00:00 ATM1 85
6 2009-05-03 00:00:00 ATM2 90
summary(data)
DATE ATM Cash
Min. :2009-05-01 00:00:00.00 Length:1474 Min. : 0.0
1st Qu.:2009-08-01 00:00:00.00 Class :character 1st Qu.: 0.5
Median :2009-11-01 00:00:00.00 Mode :character Median : 73.0
Mean :2009-10-31 19:11:48.27 Mean : 155.6
3rd Qu.:2010-02-01 00:00:00.00 3rd Qu.: 114.0
Max. :2010-05-14 00:00:00.00 Max. :10919.8
NA's :19
Fix the Data
Let's make ATM a factor. And we are doubling back to make DATE a Date, otherwise as_tsibble throws an error: "Can't obtain the interval due to the mismatched index class".
data <- data |>
  mutate(
    ATM = as_factor(ATM),
    DATE = as.Date(DATE)
  )
head(data)
# A tibble: 6 × 3
DATE ATM Cash
<date> <fct> <dbl>
1 2009-05-01 ATM1 96
2 2009-05-01 ATM2 107
3 2009-05-02 ATM1 82
4 2009-05-02 ATM2 89
5 2009-05-03 ATM1 85
6 2009-05-03 ATM2 90
summary(data)
DATE ATM Cash
Min. :2009-05-01 ATM1:365 Min. : 0.0
1st Qu.:2009-08-01 ATM2:365 1st Qu.: 0.5
Median :2009-11-01 ATM3:365 Median : 73.0
Mean :2009-10-31 ATM4:365 Mean : 155.6
3rd Qu.:2010-02-01 NA's: 14 3rd Qu.: 114.0
Max. :2010-05-14 Max. :10919.8
NA's :19
Hmmm, there are 14 NAs in ATM and 19 in Cash. Let's look at them.
data |>
  filter(is.na(ATM) | is.na(Cash)) |>
  print()
# A tibble: 19 × 3
DATE ATM Cash
<date> <fct> <dbl>
1 2009-06-13 ATM1 NA
2 2009-06-16 ATM1 NA
3 2009-06-18 ATM2 NA
4 2009-06-22 ATM1 NA
5 2009-06-24 ATM2 NA
6 2010-05-01 <NA> NA
7 2010-05-02 <NA> NA
8 2010-05-03 <NA> NA
9 2010-05-04 <NA> NA
10 2010-05-05 <NA> NA
11 2010-05-06 <NA> NA
12 2010-05-07 <NA> NA
13 2010-05-08 <NA> NA
14 2010-05-09 <NA> NA
15 2010-05-10 <NA> NA
16 2010-05-11 <NA> NA
17 2010-05-12 <NA> NA
18 2010-05-13 <NA> NA
19 2010-05-14 <NA> NA
Is this a timeseries? Are there duplicate dates per ATM?
data |>
  group_by(DATE, ATM) |>
  filter(n() > 1) |>
  ungroup()
# A tibble: 0 × 3
# ℹ 3 variables: DATE <date>, ATM <fct>, Cash <dbl>
There are not any duplicate dates. It looks like a timeseries. At the moment this data is in a long format. We are going to convert it to a wide format so that we can see and treat the data for each ATM separately. But first let's get rid of the NAs, otherwise we'll create empty columns; we shall address our NA problem shortly. We shall also convert the data's dates while we have a pipeline in our hands.
data <- data |>
  filter(!(is.na(ATM) | is.na(Cash))) |>
  mutate(DATE = as.Date(DATE))
Now we will make it wide instead of long.
wider <- data |>
  pivot_wider(
    names_from = ATM,
    values_from = Cash
  )
We are going to make the series continuous by filling in any missing dates. We shall build a daily sequence with seq() and merge it in, then handle the resulting NAs with the zoo package.
dseq <- seq(min(wider$DATE), max(wider$DATE), by = "day")
wider <- merge(wider, data.frame(DATE = dseq), all = TRUE)
rm(dseq)
As it turns out, our timeseries didn’t get any longer. No harm done.
Now let's fill values from the top down (LOCF: last observation carried forward).
wider <- wider |>
  mutate(
    ATM1 = na.locf(ATM1, na.rm = FALSE),
    ATM2 = na.locf(ATM2, na.rm = FALSE),
    ATM3 = na.locf(ATM3, na.rm = FALSE),
    ATM4 = na.locf(ATM4, na.rm = FALSE)
  )
And, finally (for data manipulation), let's make it long again and convert it to a tsibble.
ts <- wider |>
  pivot_longer(
    cols = ATM1:ATM4,
    names_to = "ATM",
    values_to = "Cash"
  ) |>
  as_tsibble(index = DATE, key = ATM)
# clean up
rm(wider)
Explore the Data
ts |>
  autoplot(Cash) +
  labs(
    title = "ATM Cash Withdrawals",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )
Oh dear, that's not good. Let's cap the extreme 10,000+ observation (remember, Cash is in hundreds of dollars) so that we can see what the general lay of the land is.
ts |>
  mutate(Cash = pmin(Cash, 1500)) |>
  autoplot(Cash) +
  labs(
    title = "ATM Cash Withdrawals",
    x = "Date",
    y = "Cash Withdrawn x $100"
  ) +
  facet_wrap(~ATM, ncol = 1, scales = "free")
Not only is ATM4 a little strange, ATM3 looks as if it has been in existence for only a few days. Let's have a look at its non-zero values.
ts |>
  filter(ATM == "ATM3") |>
  filter(Cash > 0) |>
  print()
# A tsibble: 3 x 3 [1D]
# Key: ATM [1]
DATE ATM Cash
<date> <chr> <dbl>
1 2010-04-28 ATM3 96
2 2010-04-29 ATM3 82
3 2010-04-30 ATM3 85
Hmm, we shall certainly treat him separately.
Regarding the other three, it looks like there is a strong seasonal component in ATM1 and ATM2, but ATM4 looks like noise. There doesn't appear to be a trend in any of them. Let's have a look at the ACF and PACF for all three.
ts |>
  filter(ATM != "ATM3") |>
  ACF(Cash) |>
  autoplot() +
  labs(
    title = "ACF of ATM Cash Withdrawals",
    x = "Lag",
    y = "ACF"
  ) +
  facet_wrap(~ATM, ncol = 1, scales = "free")
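The PACF mentioned above didn't make it into a plot; here is a minimal sketch of the companion call, mirroring the ACF code (same filtering and faceting):
ts |>
  filter(ATM != "ATM3") |>
  PACF(Cash) |>
  autoplot() +
  labs(
    title = "PACF of ATM Cash Withdrawals",
    x = "Lag",
    y = "PACF"
  ) +
  facet_wrap(~ATM, ncol = 1, scales = "free")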
ATM1 and ATM2 both have a strong weekly cycle, but ATM4 does appear to be noise. I think we should try to model ATM1 and ATM2 using ETS and see what the residuals look like. Let's make the data wide again, because it's become inconvenient to use it as long data.
wider <- ts |>
  pivot_wider(
    names_from = ATM,
    values_from = Cash
  )
wider |>
  model(
    ATM1 = ETS(ATM1 ~ error("A") + trend("N") + season("A"))
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM1
Model: ETS(A,N,A)
Smoothing parameters:
alpha = 0.0215764
gamma = 0.3261183
Initial states:
l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
82.82498 -58.64238 -1.973851 23.05811 5.631834 9.150982 13.77427 9.00104
sigma^2: 582.7142
AIC AICc BIC
4488.559 4489.181 4527.558
wider |>
  model(
    ATM2 = ETS(ATM2 ~ error("A") + trend("N") + season("A"))
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM2
Model: ETS(A,N,A)
Smoothing parameters:
alpha = 0.0001000239
gamma = 0.3583549
Initial states:
l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
77.10971 -42.86391 -40.91243 17.28339 9.408066 14.66191 18.32838 24.0946
sigma^2: 643.1448
AIC AICc BIC
4524.575 4525.196 4563.574
The residuals both look pretty good, though each has a few lags that slightly exceed the critical values. At the moment, I am thinking that we should use the ETS model for ATM1 and ATM2, the naive model for ATM3, and the mean method for ATM4.
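To put a number on "pretty good", a Ljung-Box test on the ETS innovation residuals is a quick check (a sketch; lag = 14 covers two weekly cycles, and a large p-value is consistent with white-noise residuals):
wider |>
  model(ATM1 = ETS(ATM1 ~ error("A") + trend("N") + season("A"))) |>
  augment() |>
  features(.innov, ljung_box, lag = 14)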
But, I am still curious whether we can do better for ATM1 and ATM2. I would like to try a seasonal naive model and a SARIMA model. We shall start with the seasonal naive model.
wider |>
  model(ATM1 = SNAIVE(ATM1 ~ lag(7))) |>
  gg_tsresiduals()
wider |>
  model(ATM2 = SNAIVE(ATM2 ~ lag(7))) |>
  gg_tsresiduals()
SNAIVE did not do a very good job: even though we specified the weekly lag, the seasonality remained strong in the residuals. Let's try SARIMA. We are going to use ARIMA() to find the best model.
wider |>
  model(
    ATM1 = ARIMA(ATM1 ~ pdq() + PDQ())
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM1
Model: ARIMA(0,0,1)(0,1,2)[7]
Coefficients:
ma1 sma1 sma2
0.1993 -0.5812 -0.1037
s.e. 0.0545 0.0505 0.0493
sigma^2 estimated as 558.6: log likelihood=-1640.72
AIC=3289.45 AICc=3289.56 BIC=3304.97
wider |>
  model(
    ATM2 = ARIMA(ATM2 ~ pdq() + PDQ())
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM2
Model: ARIMA(2,0,2)(0,1,1)[7]
Coefficients:
ar1 ar2 ma1 ma2 sma1
-0.4300 -0.9202 0.4676 0.8187 -0.7522
s.e. 0.0541 0.0390 0.0849 0.0539 0.0391
sigma^2 estimated as 605.3: log likelihood=-1654.47
AIC=3320.94 AICc=3321.18 BIC=3344.23
Wow, on a couple of levels. First, it found two different ARIMA models for ATM1 and ATM2:
- ATM1: ARIMA(0,0,1)(0,1,2)[7]
- ATM2: ARIMA(2,0,2)(0,1,1)[7]
Secondly, the AIC is roughly 1,200 points lower than the ETS model's (3,289 vs 4,489 for ATM1). And the residuals look much better. We are going to change our plan:
- ATM1: ARIMA(0,0,1)(0,1,2)[7]
- ATM2: ARIMA(2,0,2)(0,1,1)[7]
- ATM3: Naive
- ATM4: Mean
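Before committing to the new plan, a quick side-by-side of the information criteria for ATM1 (a sketch; note that ETS and ARIMA likelihoods are constructed differently, so this is a rough guide rather than a strict comparison):
wider |>
  model(
    ets = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
    arima = ARIMA(ATM1 ~ pdq(0, 0, 1) + PDQ(0, 1, 2))
  ) |>
  glance() |>
  select(.model, AIC, AICc, BIC)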
Forecasting
fc1 <- wider |>
  model(ATM1 = ARIMA(ATM1 ~ pdq(0, 0, 1) + PDQ(0, 1, 2))) |>
  forecast(h = 31)

fc2 <- wider |>
  model(ATM2 = ARIMA(ATM2 ~ pdq(2, 0, 2) + PDQ(0, 1, 1))) |>
  forecast(h = 31)

fc3 <- wider |>
  model(ATM3 = NAIVE(ATM3)) |>
  forecast(h = 31)

fc4 <- wider |>
  model(ATM4 = MEAN(ATM4)) |>
  forecast(h = 31)
fc1 |>
  autoplot(wider) +
  labs(
    title = "ATM1 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc2 |>
  autoplot(wider) +
  labs(
    title = "ATM2 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc3 |>
  autoplot(wider) +
  labs(
    title = "ATM3 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc4 |>
  autoplot(wider) +
  labs(
    title = "ATM4 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )
Now let's tabulate how much cash is taken out of each of the 4 different ATM machines for May 2010. The following totals are in the same units as the data in the spreadsheet (hundreds of dollars).
cat("ATM1: ", sum(fc1$.mean), " $100s\n")
ATM1: 2426.317 $100s
cat("ATM2: ", sum(fc2$.mean), " $100s\n")
ATM2: 1790.056 $100s
cat("ATM3: ", sum(fc3$.mean), " $100s\n")
ATM3: 2635 $100s
cat("ATM4: ", sum(fc4$.mean), " $100s")
ATM4: 14695.34 $100s
Finally, we shall write our forecast to an .xlsx file.
fc <- data.frame(
  DATE = c("2010 May", "2010 May", "2010 May", "2010 May"),
  ATM = c("ATM1", "ATM2", "ATM3", "ATM4"),
  Cash = c(sum(fc1$.mean), sum(fc2$.mean), sum(fc3$.mean), sum(fc4$.mean))
)
write_xlsx(fc, "./data/partA_forecast.xlsx")
Part B - Forecasting Power
Data: ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Setup
library(fpp3)
library(readxl)
library(writexl)
library(tidyverse)
library(zoo)
Load the Data
# `read_xlsx` not recognizing the date format. We shall fix it after it's loaded.
data <- read_xlsx(
  "./data/ResidentialCustomerForecastLoad-624.xlsx"
)
Inspect the Data
head(data)
# A tibble: 6 × 3
CaseSequence `YYYY-MMM` KWH
<dbl> <chr> <dbl>
1 733 1998-Jan 6862583
2 734 1998-Feb 5838198
3 735 1998-Mar 5420658
4 736 1998-Apr 5010364
5 737 1998-May 4665377
6 738 1998-Jun 6467147
summary(data)
CaseSequence YYYY-MMM KWH
Min. :733.0 Length:192 Min. : 770523
1st Qu.:780.8 Class :character 1st Qu.: 5429912
Median :828.5 Mode :character Median : 6283324
Mean :828.5 Mean : 6502475
3rd Qu.:876.2 3rd Qu.: 7620524
Max. :924.0 Max. :10655730
NA's :1
Fix the Data
The date did not get converted and it’s inconveniently named.
data <- data |>
  rename(
    Date = `YYYY-MMM`,
    # Let's shorten the name of this guy.
    ID = CaseSequence
  ) |>
  mutate(
    # if we don't convert it to year-month, then it will look like daily data
    # to `as_tsibble()`
    Date = yearmonth(
      # my goodness, as.Date and strptime are fussy!
      as.Date(paste(Date, "01", sep = "-"), format = "%Y-%b-%d")
    )
  )
One more glimpse
head(data)
# A tibble: 6 × 3
ID Date KWH
<dbl> <mth> <dbl>
1 733 1998 Jan 6862583
2 734 1998 Feb 5838198
3 735 1998 Mar 5420658
4 736 1998 Apr 5010364
5 737 1998 May 4665377
6 738 1998 Jun 6467147
# is ID unique
duplicates(data, ID)
Using `Date` as index variable.
# A tibble: 0 × 3
# ℹ 3 variables: ID <dbl>, Date <mth>, KWH <dbl>
# is the date unique
duplicates(data, Date)
Using `Date` as index variable.
# A tibble: 0 × 3
# ℹ 3 variables: ID <dbl>, Date <mth>, KWH <dbl>
Note: we were having a problem with "implicit gaps" down below. Subsequently, we thought we had missing dates. But it was a problem with the way we were converting the date. The following is not all for naught: it seems like good practice to make sure one's time sequence is complete. So, I'm leaving the code in the project.
dseq <- seq(
  from = min(data$Date),
  to = max(data$Date),
  by = 1
)
data <- merge(data, data.frame(Date = dseq), all = TRUE)
rm(dseq)
Let’s look at our missing values
data |>
  filter(is.na(KWH))
Date ID KWH
1 2008 Sep 861 NA
Just the one: 2008-09. Let’s look at the rest of the values for September
september <- data |>
  filter(month(Date) == 9) |>
  mutate(
    mean = mean(KWH, na.rm = TRUE),
    sd = sd(KWH, na.rm = TRUE)
  ) |>
  select(ID, mean, sd)
summary(september)
ID mean sd
Min. :741 Min. :7702333 Min. :650683
1st Qu.:786 1st Qu.:7702333 1st Qu.:650683
Median :831 Median :7702333 Median :650683
Mean :831 Mean :7702333 Mean :650683
3rd Qu.:876 3rd Qu.:7702333 3rd Qu.:650683
Max. :921 Max. :7702333 Max. :650683
There is a fair amount of variation in the month of September. We are going to impute the missing value with the mean of the month. At least for now.
data <- data |>
  left_join(september, by = "ID") |>
  mutate(
    KWH = ifelse(is.na(KWH), mean, KWH)
  ) |>
  select(-mean, -sd)
And lastly, let’s convert it to a time series.
ts <- data |>
  # arrange the columns as we'd prefer seeing them
  select(Date, ID, KWH) |>
  as_tsibble(index = Date)
# cleanup
rm(data)
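Since the note above was about implicit gaps, here is a one-line confirmation (a sketch) that the finished tsibble has none:
ts |>
  has_gaps()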
Explore the Data
We shall start with a line plot.
ts |>
  ggplot(aes(x = Date, y = KWH)) +
  geom_line() +
  labs(
    title = "Residential Power Usage"
  )
It looks like we may have a twice-yearly seasonal pattern. And it is very clear that we have a rogue outlier somewhere around 2010 or 2011. There appears to be a slight upward trend in the data as of 2007.
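A seasonal plot makes the within-year pattern easier to judge than the line plot alone (a sketch):
ts |>
  gg_season(KWH) +
  labs(title = "Seasonal plot: Residential Power Usage")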
Let’s see if we can find the outlier.
outlier <- ts |>
  filter(KWH < 3000000) |>
  print()
# A tsibble: 1 x 3 [1M]
Date ID KWH
<mth> <dbl> <dbl>
1 2010 Jul 883 770523
I'm not sure what to do about the outlier. It's easy to see what its trajectory would have been under normal conditions. Let's proceed and come back to it if it causes problems.
I'd like to see what components() will do. Let's look through the lens of STL and inspect the components. I think it will help us identify what the season and trend look like.
ts |>
  model(stl = STL(KWH)) |>
  components() |>
  autoplot()
That worked out nicely. We can very clearly see a trend, though it is going to be tricky because it's full of bumps and bruises; I think it can be described by a slight exponential curve. And the seasonal component looks pretty constant, though its shape does change. There are 10 cycles per 5 years, so we will treat it as a twice-yearly seasonal cycle. Due to the seasonal and trend components, I think we'll rule out all of the simple methods except for SNAIVE. I think it will be a battle between ETS, SARIMA and perhaps SNAIVE.
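As a supporting check before fitting anything (a sketch), the unit-root features indicate how many first and seasonal differences an ARIMA search is likely to apply:
ts |>
  features(KWH, list(unitroot_ndiffs, unitroot_nsdiffs))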
Forecasting
ETS
Let’s start with ETS. Let’s let him suggest a model.
ts |>
  model(ets = ETS(KWH)) |>
  report()
Series: KWH
Model: ETS(M,N,M)
Smoothing parameters:
alpha = 0.1143788
gamma = 0.0001053547
Initial states:
l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
6137630 0.955165 0.7487879 0.8652007 1.178319 1.265538 1.1881 0.9999747
s[-7] s[-8] s[-9] s[-10] s[-11]
0.7757071 0.8083132 0.9146982 1.074833 1.225362
sigma^2: 0.0142
AIC AICc BIC
6222.135 6224.862 6270.997
Hmmm, it's suggesting a multiplicative error, no trend and a multiplicative seasonal component. That surprises me. I wonder if our outlier is affecting the suggestion. Let's try correcting the outlier and see whether the results are any different.
# We shall add a column `KWHa` to `ts` and set the outlier to the mean of its
# adjacent observations
index <- which(ts$KWH == outlier$KWH)
ts$KWHa <- ts$KWH
ts$KWHa[index] <- (ts$KWH[index - 1] + ts$KWH[index + 1]) / 2
ts |>
  model(ets = ETS(KWHa)) |>
  report()
Series: KWHa
Model: ETS(M,N,M)
Smoothing parameters:
alpha = 0.1160681
gamma = 0.1972472
Initial states:
l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
6190943 0.9005789 0.7540081 0.9353756 1.224268 1.267743 1.231822 1.012795
s[-7] s[-8] s[-9] s[-10] s[-11]
0.7650256 0.8045656 0.8885978 1.021186 1.194035
sigma^2: 0.0094
AIC AICc BIC
6144.091 6146.818 6192.953
It didn't alter the ETS(M,N,M) suggestion, but the parameters and states have changed and the information criteria all improved. I'm still curious about ETS(A,M,M), though having plotted ETS(M,N,M) (below) I can see that it captures the amplitude of the seasonal component very nicely. So, let's model and plot ETS(M,N,M) and ETS(A,M,M) for both KWH (with the outlier) and the altered KWHa.
ts |>
  model(
    suggested = ETS(KWH ~ error("M") + trend("N") + season("M")),
    experiment = ETS(KWH ~ error("A") + trend("M") + season("M"))
  ) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "ETS(M,N,M) vs ETS(A,M,M) for KWH"
  ) +
  facet_wrap(~ .model, ncol = 1)
ts |>
  model(
    suggested = ETS(KWHa ~ error("M") + trend("N") + season("M")),
    experiment = ETS(KWHa ~ error("A") + trend("M") + season("M"))
  ) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "ETS(M,N,M) vs ETS(A,M,M) for KWHa (altered)"
  ) +
  facet_wrap(~ .model, ncol = 1)
First, the difference between the forecast based on data with 1 imputed value (KWH) versus the data with 1 imputed value and a corrected outlier (KWHa) isn't significant enough to justify changing an observation. But the difference between ETS(M,N,M) and ETS(A,M,M) is significant. ETS(A,M,M) picks up on the apparent (to me) trend line, where ETS(M,N,M) does not. Considering this difference, I am overriding ETS's suggestion; I believe that ETS(A,M,M) over KWH best forecasts the next 4 years. Incidentally, I know that we are forecasting 1 year, but I wanted to see a forecast for a longer term so that I could better see the forecasted trend line and fluctuations in seasonal amplitude.
SARIMA
We shall look to ARIMA() for our model parameters.
ts |>
  model(arima = ARIMA(KWH ~ pdq() + PDQ())) |>
  report()
Series: KWH
Model: ARIMA(0,0,1)(1,1,1)[12] w/ drift
Coefficients:
ma1 sar1 sma1 constant
0.2346 -0.147 -0.7368 105135.63
s.e. 0.0723 0.096 0.0816 26306.58
sigma^2 estimated as 7.406e+11: log likelihood=-2719.24
AIC=5448.47 AICc=5448.82 BIC=5464.44
We can see that the information criteria are all lower than those based on ETS's recommended parameters (keeping in mind that ETS and ARIMA criteria are not strictly comparable). Let's see how it changes when using KWHa.
ts |>
  model(arima = ARIMA(KWHa ~ pdq() + PDQ())) |>
  report()
Series: KWHa
Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
Coefficients:
ma1 sar1 sar2 constant
0.3136 -0.7260 -0.4103 225683.19
s.e. 0.0855 0.0774 0.0783 64115.78
sigma^2 estimated as 3.969e+11: log likelihood=-2661.13
AIC=5332.26 AICc=5332.61 BIC=5348.23
Ah, interesting. For the seasonal component, the AR order has changed from 1 to 2 (and the seasonal MA term has dropped out). And the information criteria have also improved. As we did with ETS, we shall look at both of them with a 4 year forecast.
ts |>
  model(arima = ARIMA(KWH ~ pdq() + PDQ())) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(1,1,1)[12] w/ drift for KWH"
  )
ts |>
  model(arima = ARIMA(KWHa ~ pdq() + PDQ())) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(2,1,0)[12] w/ drift for KWHa (altered)"
  )
I am selecting SARIMA(0,0,1)(2,1,0) with KWHa (adjusted) over SARIMA(0,0,1)(1,1,1) with KWH for two reasons:
- The information criteria are all better for SARIMA(0,0,1)(2,1,0) with KWHa than for SARIMA(0,0,1)(1,1,1) with KWH.
- I believe the greater amplitude of the seasonal model for SARIMA(0,0,1)(2,1,0) with KWHa better models the fairly erratic period leading up to 2014.
And I am selecting SARIMA(0,0,1)(2,1,0) with KWHa over ETS(A,M,M) with KWH. Not only are the information criteria better for SARIMA, it is also quicker to work with: the ETS(A,M,M) model takes a relatively long time to compute compared to SARIMA(0,0,1)(2,1,0).
At this point, I'm discounting SNAIVE() as an alternative. I think the trend is essential to an accurate forecast for this data. Perhaps it will not make much of a difference over the 1 year forecast period, and, with some work, we might even be able to include an underlying trend with SNAIVE() by adding a drift component. But the seasonal variability captured by ETS and SARIMA is essential to the forecast. When looking at the line plot from 2009 to 2014, one sees a good amount of variability in the seasonal cycle's amplitude. ETS and SARIMA are both able to model this variability where SNAIVE is not.
I know that I am basing my selection on a twice-modified timeseries:
- We imputed a missing value with the mean of observations made for the same month (in different years).
- We modified an outlier, setting it to the mean of the previous and following observations.
I believe this is where the art meets data science. The challenge is to predict the future, which, as of yet, is not a perfect science. I have based my decision partly on the information criteria, but also on a strong belief that SARIMA(0,0,1)(2,1,0)'s plot best projects the trend and seasonal component into the future (from a 2014 perspective).
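As one last sanity check on the chosen model (a sketch), a Ljung-Box test on its residuals; a large p-value is consistent with white noise:
ts |>
  model(arima = ARIMA(KWHa ~ pdq(0, 0, 1) + PDQ(2, 1, 0))) |>
  augment() |>
  features(.innov, ljung_box, lag = 24)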
Forecast
And now we shall create our forecast using SARIMA(0,0,1)(2,1,0).
fc <- ts |>
  model(arima = ARIMA(KWHa ~ pdq(0, 0, 1) + PDQ(2, 1, 0))) |>
  forecast(h = 12)
The monthly forecast for 2014 is as follows:
fc |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(2,1,0)[12] w/ drift for KWHa (altered)"
  )
fc <- fc |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)
fc
# A tsibble: 12 x 2 [1M]
Date Forecast
<mth> <dbl>
1 2014 Jan 10182990.
2 2014 Feb 8491929.
3 2014 Mar 6626605.
4 2014 Apr 5989578.
5 2014 May 5938476.
6 2014 Jun 8196380.
7 2014 Jul 9478425.
8 2014 Aug 9977759.
9 2014 Sep 8464995.
10 2014 Oct 5863644.
11 2014 Nov 6141030.
12 2014 Dec 8338904.
Lastly, we shall write our results to an Excel workbook.
write_xlsx(fc, "./data/partB_forecast.xlsx")
Part C – BONUS
Data: Waterflow_Pipe1.xlsx, Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast and present results via RPubs and .rmd, with the forecast in an Excel-readable file.
Setup
library(fpp3)
library(readxl)
library(tidyverse)
library(writexl)
library(zoo)
Load the Data
wfp1 <- read_xlsx(
  "data/Waterflow_Pipe1.xlsx",
  col_types = c("date", "numeric")
)
wfp2 <- read_xlsx(
  "data/Waterflow_Pipe2.xlsx",
  col_types = c("date", "numeric")
)
Inspect the Data
head(wfp1)
# A tibble: 6 × 2
`Date Time` WaterFlow
<dttm> <dbl>
1 2015-10-23 00:24:06 23.4
2 2015-10-23 00:40:02 28.0
3 2015-10-23 00:53:51 23.1
4 2015-10-23 00:55:40 30.0
5 2015-10-23 01:19:17 6.00
6 2015-10-23 01:23:58 15.9
head(wfp2)
# A tibble: 6 × 2
`Date Time` WaterFlow
<dttm> <dbl>
1 2015-10-23 01:00:00 18.8
2 2015-10-23 02:00:00 43.1
3 2015-10-23 03:00:00 38.0
4 2015-10-23 04:00:00 36.1
5 2015-10-23 05:00:00 31.9
6 2015-10-23 06:00:00 28.2
summary(wfp1)
Date Time WaterFlow
Min. :2015-10-23 00:24:06.49 Min. : 1.067
1st Qu.:2015-10-25 11:21:35.49 1st Qu.:13.683
Median :2015-10-27 20:07:30.88 Median :19.880
Mean :2015-10-27 20:49:15.93 Mean :19.897
3rd Qu.:2015-10-30 08:24:51.07 3rd Qu.:26.159
Max. :2015-11-01 23:35:43.11 Max. :38.913
summary(wfp2)
Date Time WaterFlow
Min. :2015-10-23 01:00:00 Min. : 1.885
1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
Median :2015-11-12 20:30:00 Median :39.682
Mean :2015-11-12 20:30:00 Mean :39.556
3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
Max. :2015-12-03 16:00:00 Max. :78.303
Fix the Data
Things look pretty good except for the date column's name. I was thinking about converting them to timeseries, but the intervals between their timestamps are erratic. Ah, that is part of the assignment.
wfp1 <- wfp1 |>
  rename(
    Date = `Date Time`
  )
wfp2 <- wfp2 |>
  rename(
    Date = `Date Time`
  )
Now we shall aggregate the data by hour.
wfp1_a <- wfp1 |>
  mutate(Date = floor_date(Date, "hours")) |>
  group_by(Date) |>
  summarize(
    WaterFlow = mean(WaterFlow)
  )

wfp2_a <- wfp2 |>
  mutate(Date = floor_date(Date, "hours")) |>
  group_by(Date) |>
  summarize(
    WaterFlow = mean(WaterFlow)
  )
Let’s make sure that our timeseries are complete/continuous:
impute_dates <- function(data) {
  dseq <- seq(
    from = min(data$Date),
    to = max(data$Date),
    by = 60 * 60
  )
  return(merge(data, data.frame(Date = dseq), all = TRUE))
}

wfp1_a <- impute_dates(wfp1_a)
wfp2_a <- impute_dates(wfp2_a)
There were hourly observations missing from both datasets. I think it makes more sense to take the mean of adjacent observations than it does to carry the last value forward. zoo has a method, na.approx, that I think will interpolate nicely.
wfp1_a$WaterFlow <- na.approx(wfp1_a$WaterFlow)
wfp2_a$WaterFlow <- na.approx(wfp2_a$WaterFlow)
Let’s convert them to timeseries
ts1 <- as_tsibble(wfp1_a, index = Date)
ts2 <- as_tsibble(wfp2_a, index = Date)
Let’s have one last look at it before we move on.
summary(ts1)
Date WaterFlow
Min. :2015-10-23 00:00:00 Min. : 8.923
1st Qu.:2015-10-25 11:45:00 1st Qu.:16.971
Median :2015-10-27 23:30:00 Median :19.766
Mean :2015-10-27 23:30:00 Mean :19.867
3rd Qu.:2015-10-30 11:15:00 3rd Qu.:22.737
Max. :2015-11-01 23:00:00 Max. :31.730
summary(ts2)
Date WaterFlow
Min. :2015-10-23 01:00:00 Min. : 1.885
1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
Median :2015-11-12 20:30:00 Median :39.682
Mean :2015-11-12 20:30:00 Mean :39.556
3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
Max. :2015-12-03 16:00:00 Max. :78.303
Everything looks good. Let's clean up and move on. We shall keep wfp1 and wfp2 around for reference.
rm(wfp1_a)
rm(wfp2_a)
Explore the Data
Let’s have a look at our data.
ts1 |>
  ggplot(aes(x = Date, y = WaterFlow)) +
  geom_line() +
  labs(
    title = "WaterFlow Pipe 1"
  )

ts2 |>
  ggplot(aes(x = Date, y = WaterFlow)) +
  geom_line() +
  labs(
    title = "WaterFlow Pipe 2"
  )
Oh boy, it's hard to see what is going on in there. I'm going to tackle these guys one at a time because, with a little probing, they appear to be statistically quite different.
Pipe 1
Let's try to use STL to pick it apart.
ts1 |>
  model(stl = STL(WaterFlow)) |>
  components() |>
  autoplot() +
  labs(
    title = "STL decomposition: Pipe 1"
  )
The trend line is lumpy, but relative to the range of WaterFlow I don't think it's very significant. It looks like it is picking up dips and spikes more than a trend. It does detect a complex daily pattern, but its range is about half the amplitude of the remainder. It will be interesting to look at the ACF.
ACF(ts1, WaterFlow) |>
autoplot()
The ACF doesn’t detect any lags that exceed the critical value. I think we can consider it stationary. Let’s see what SARIMA makes of it.
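First, though, a formal check to complement the eyeball reading of the ACF (a sketch): the KPSS test's null hypothesis is stationarity, so a large p-value (reported as 0.1 when it exceeds 0.1) supports treating the series as stationary.
ts1 |>
  features(WaterFlow, unitroot_kpss)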
ts1 |>
  model(sarima = ARIMA(WaterFlow ~ pdq() + PDQ())) |>
  report()
Series: WaterFlow
Model: ARIMA(0,0,0) w/ mean
Coefficients:
constant
19.8674
s.e. 0.2720
sigma^2 estimated as 17.83: log likelihood=-685.78
AIC=1375.56 AICc=1375.61 BIC=1382.52
It looks like ARIMA has resorted to the mean of the timeseries. I'm going to conclude that is the best we can do for water pipe #1.
Pipe 2
As we did with pipe #1, let's use STL to pick our timeseries apart and see what we have.
ts2 |>
  model(stl = STL(WaterFlow)) |>
  components() |>
  autoplot() +
  labs(
    title = "STL decomposition: Pipe 2"
  )
The trend line is lumpy, but relative to the range of WaterFlow I don't think it's very significant. It looks like it is picking up dips and spikes more than a trend. And it's detecting two seasonal components!?
Let's see what its ACF looks like.
acf <- ts2 |>
  ACF(WaterFlow)
autoplot(acf)

acf |>
  mutate(
    max = max(abs(acf))
  ) |>
  filter(max == abs(acf))
# A tsibble: 1 x 3 [1h]
lag acf max
<cf_lag> <dbl> <dbl>
1 15h 0.0929 0.0929
Yikes, this is getting worse, not better. The ACF is suggesting the lag with the most correlation is 15h. Let's open ACF up to greater lags with lag_max.
acf <- ts2 |>
  ACF(WaterFlow, lag_max = 7 * 24)
autoplot(acf)

acf |>
  mutate(
    max = max(abs(acf))
  ) |>
  filter(max == abs(acf))
# A tsibble: 1 x 3 [1h]
lag acf max
<cf_lag> <dbl> <dbl>
1 15h 0.0929 0.0929
Our greatest lag is still 15h.
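As with pipe #1, a KPSS stationarity check for pipe #2 (a sketch), since the assignment asks explicitly whether the data is stationary:
ts2 |>
  features(WaterFlow, unitroot_kpss)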
Let’s see what SARIMA makes of it.
ts2 |>
  model(sarima = ARIMA(WaterFlow ~ pdq() + PDQ(period = 15), stepwise = TRUE)) |>
  report()
Series: WaterFlow
Model: ARIMA(1,0,2)(1,0,0)[15] w/ mean
Coefficients:
ar1 ma1 ma2 sar1 constant
0.7180 -0.7416 0.0624 0.096 10.0901
s.e. 0.1646 0.1662 0.0331 0.032 0.1613
sigma^2 estimated as 255.1: log likelihood=-4187.41
AIC=8386.81 AICc=8386.9 BIC=8416.26
Well, that seems to have had an effect. The information criteria are very high, though they are not comparable across series, so that doesn't tell us much on its own. Let's see what happens when we use its suggestions.
fc2 <- ts2 |>
  model(sarima = ARIMA(WaterFlow ~ pdq(1, 0, 2) + PDQ(1, 0, 0, period = 15))) |>
  forecast(h = 24 * 7)

fc2 |>
  autoplot(ts2)
Haha, that's not very encouraging. I don't feel wonderful about either of my forecasts. Actually, I take that back: I don't think the mean is an unreasonable forecast for the entire week. But I feel certain that a better hourly forecast can be made for both datasets; at the moment, I'm at a loss as to how to make it. SNAIVE just occurred to me. If we have time, we will model it and see how it looks.
Let’s give it a shot.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(15))) |>
  gg_tsresiduals()
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).
What the heck is going on!? That inverted the 15th lag and made it worse. I've got to see this.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(15))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2)
Going off of the STL decomposition, I am going to try a seasonal naive model with a 24h lag and then one at 1 week. It’s a long shot but I’m curious.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(24))) |>
  gg_tsresiduals()
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_bin()`).
I see a pattern emerging. I think 1 week is hopeless, nonetheless I’ve got to see what it looks like.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(24 * 7))) |>
  gg_tsresiduals()
Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 168 rows containing non-finite outside the scale range
(`stat_bin()`).
Hmmm, interesting. Let’s see what its forecast looks like.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2)
I think that looks remarkably interesting. Let’s zoom in on the forecast.
ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2) +
  xlim(ts2$Date[24 * 30], max(ts2$Date) + days(7))
Warning: Removed 719 rows containing missing values or values outside the scale range
(`geom_line()`).
I think this forecast looks remarkably interesting, and better than any other I've seen yet. It includes the random element and the slight sinusoidal weekly pattern. The ACF looks a little worse than the ACF of the raw data, but the mean is a pretty awful hourly forecast. Ha, I feel like I'm gambling. I've already put forth a mean forecast for water pipe #1. And I've already put forth a (nearly flat) SARIMA forecast for water pipe #2, but I'm going to change it to the SNAIVE. Nothing ventured, nothing gained.
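Since I feel like I'm gambling, here is a rough way to referee the bet (a sketch; the train split below is my own construction, not part of the original analysis): hold out the final week, refit both candidates on the rest, and compare their accuracy on that holdout week.
train <- ts2 |>
  filter(Date <= max(Date) - dhours(24 * 7))
train |>
  model(
    sarima = ARIMA(WaterFlow ~ pdq(1, 0, 2) + PDQ(1, 0, 0, period = 15)),
    snaive = SNAIVE(WaterFlow ~ lag(7 * 24))
  ) |>
  forecast(h = 24 * 7) |>
  accuracy(ts2)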
Forecasting
As concluded above, we shall use the simple mean model to forecast water pipe #1.
fc1 <- ts1 |>
  model(mean = MEAN(WaterFlow)) |>
  forecast(h = 7 * 24)

fc1 |>
  autoplot(ts1) +
  labs(
    title = "Forecast: Water Pipe 1"
  )
Haha, that plot looks pretty ridiculous. But it's the best I've got. I won't pour 168 copies of mean(ts1$WaterFlow) into this document. The prediction, calculated below, is one week of hourly data at ~19.87.
mean(ts1$WaterFlow)
[1] 19.86744
And, as we saw above, our forecast for Water Pipe #2 was:
fc2 |>
  autoplot(ts2) +
  labs(
    title = "Forecast: Water Pipe 2"
  )
But I've since decided to change it to the SNAIVE model with a 7-day lag. See my argument above for the reasoning behind my choice.
fc2 <- ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7)

fc2 |>
  autoplot(ts2) +
  labs(
    title = "Revised Forecast: Water Pipe 2"
  )
And we shall write them to separate sheets of the spreadsheet for part C.
fc1 <- fc1 |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)
fc2 <- fc2 |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)

sheets <- list(
  `Pipe 1` = fc1,
  `Pipe 2` = fc2
)
write_xlsx(sheets, "./data/partC_forecast.xlsx")