In Part A, the task is to forecast how much cash is withdrawn from four different ATMs in May 2010. The variable ‘Cash’ is given in hundreds of dollars. The instructions are deliberately somewhat ambiguous, to give the task more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast, which goes into an Excel-readable file.
#atm = read_excel("data/ATM624Data.xlsx")
#write.csv(atm, "atm.csv")
atm = read.csv("https://raw.githubusercontent.com/ebhtra/msds-624/main/proj1/data/atm.csv") %>% select(-X)
atm$DATE = as.Date("1900-01-01") + (atm$DATE - 2) # undo Excel's serial-date encoding: day 1 is 1900-01-01, and Excel's phantom 1900 leap day accounts for the second day of offset
tail(atm)
## DATE ATM Cash
## 1469 2010-04-25 ATM4 542.28060
## 1470 2010-04-26 ATM4 403.83934
## 1471 2010-04-27 ATM4 13.69733
## 1472 2010-04-28 ATM4 348.20106
## 1473 2010-04-29 ATM4 44.24535
## 1474 2010-04-30 ATM4 482.28711
How many missing values, and from where?
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA "ATM3" "ATM4"
paste("There are", sum(is.na(atm$ATM)), "missing values for ATM")
## [1] "There are 14 missing values for ATM"
Hopefully those 14 overlap with the 19 NA’s for Cash.
# convert to tsibble
atm = atm %>%
as_tsibble(key = ATM, index = DATE)
atm[(is.na(atm$Cash)) | (is.na(atm$ATM)),]
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
Remove 14 of the 19 Cash NA’s by removing the dates we’re forecasting.
atm = atm %>%
filter(DATE < "2010-05-01")
Now there are just those 5 NA’s, in June, at ATM1 and ATM2. Let’s see how to deal with them.
atm %>%
filter_index("2009-06") %>%
filter(ATM %in% c("ATM1", "ATM2")) %>%
autoplot(Cash) +
labs(title = "Sparsely Missing Values for ATM's 1 and 2 in June")
Separate the tsibble by ATM for easier analysis
atm1 = atm %>%
filter(ATM=="ATM1")
atm2 = atm %>%
filter(ATM=="ATM2")
atm3 = atm %>%
filter(ATM=="ATM3")
atm4 = atm %>%
filter(ATM=="ATM4")
atm1 %>%
autoplot(Cash) +
labs(title = "A couple of missing June values in ATM 1")
na_kalman() from imputeTS can model the seasonality of this series (it fits a state-space model and imputes via Kalman smoothing), making it a good fit for these missing values.
library(imputeTS)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
atm1 = atm1 %>%
select(Cash) %>%
na_kalman()
atm2 = atm2 %>%
select(Cash) %>%
na_kalman()
ATM1 and ATM2 are similarly scaled. ATM3 only exists at the very end of the series, and ATM4 is busier.
atm %>%
filter(ATM %in% c("ATM1", "ATM2")) %>%
autoplot(Cash) +
labs(title = "Two ATM's are similar")
Correlation of atm1 and atm2:
cor(atm1$Cash, atm2$Cash)
## [1] 0.7120436
Correlation of atm1 and atm4:
cor(atm1$Cash, atm4$Cash)
## [1] 0.3786748
Not shown here: ATM’s 2 and 4 are much less correlated (around .2) by comparison.
The overall trend of the year’s worth of data is flattish, and the predictability of each day’s value seems to arise mainly from its seasonality (mostly day of week, eyeballing the plot). Therefore the strongest auto-correlation should be at the 7-day, 14-day, etc., lags. Maybe we’ll see some 30-day AC as well, for paycheck- or rent-type withdrawals.
atm1 %>%
ACF(Cash, lag_max = 35) %>%
autoplot() +
labs(title = "ATM 1 auto-correlation")
ATM2 is, predictably, similar in nature (not shown). How about ATM4?
atm4 %>%
ACF(Cash, lag_max = 50) %>%
autoplot() +
labs(title = "ATM 4 auto-correlation")
ATM 4 looks stationary, and may require a different modelling approach. How does it plot?
atm4 %>%
autoplot(Cash) +
labs(title = "ATM 4")
Maybe ATM 4 isn’t stationary after all: that one spiky outlier may be hiding patterns. Drill down on it a bit.
paste("Day", which(atm4$Cash==max(atm4$Cash)), "is the outlier.")
## [1] "Day 285 is the outlier."
Inspect the day and its surrounding days.
date(atm4$DATE[285])
## [1] "2010-02-09"
weekdays(atm4$DATE[285])
## [1] "Tuesday"
atm4$Cash[283:287]
## [1] 112.03063 417.91228 10919.76164 42.43808 280.04343
One value, off by so much, on a Tuesday in February, hints at a data error. Even if somehow one million dollars were withdrawn from this ATM on this random day, it would be unwise to use that knowledge in forecasts. For example, if there were some sort of ATM fraud, we would expect the bank to shore up its defenses, such that another such day became much less likely. Off-by-a-decimal makes more sense, such that 10919 should be 1091.9. But so does off-by-two-decimals (109.19).
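Before committing to either hypothesis, a simple screen can confirm this is the only point of its magnitude. A minimal sketch (the 5-IQR threshold is an arbitrary choice, not part of the original analysis):
# flag any day more than 5 IQRs above the upper quartile
q = quantile(atm4$Cash, c(0.25, 0.75))
atm4 %>% filter(Cash > q[2] + 5 * diff(q))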
Since there appears to be no seasonality here, you could also just take the mean of the series, which is …
# What is the actual mean of the series, including this inflated datapoint?
mean(atm4$Cash)
## [1] 474.0433
Before settling for the mean, it could help to look for more clues.
Here is where the days with Cash > 1000 (over $100,000) fall at ATM4, by weekday:
sort(table(weekdays(atm4$DATE[atm4$Cash>1000])))
##
## Monday Thursday Saturday Friday Sunday Tuesday
## 1 1 4 5 7 7
Because Tuesdays, for whatever reason, are one of the most likely days to have over 100K taken from ATM4, I’m going to go with the off-by-a-decimal hypothesis, and change the outlier day to be 1092 (from 10920).
atm4$Cash[285] = atm4$Cash[285] / 10
atm4 %>%
autoplot(Cash) +
labs(title = "ATM 4 after 'fixing' outlier")
Let’s see if that exposed any autoregressive patterns.
atm4 %>%
ACF(Cash, lag_max = 50) %>%
autoplot() +
labs(title = "ATM 4 auto-correlation after shrinking the outlier")
Indeed, we can now see the familiar weekly seasonality, though less pronounced than for ATM1 and ATM2.
The correlation between ATM1 and ATM4 was about .38 before mutating the outlier, so it should be higher now.
cor(atm1$Cash, atm4$Cash)
## [1] 0.6172861
Recall that correlation between ATM1 and ATM2 was about .71. By changing the one outlier in ATM4, similar modelling approaches may work for all ATM’s.
atm1 %>%
mutate(
Weekly = slider::slide_dbl(Cash, mean,
.before = 7, .after = 7, .complete = F),
Monthly = slider::slide_dbl(Cash, mean,
.before = 28, .after = 28, .complete = F)) %>%
autoplot(Cash) +
geom_line(aes(y = Weekly), colour = "yellow") +
geom_line(aes(y = Monthly), colour = "blue") +
ggtitle("Weekly (yellow) and Monthly (blue) MA for ATM 1")
atm2 %>%
mutate(
Weekly = slider::slide_dbl(Cash, mean,
.before = 7, .after = 7, .complete = F),
Monthly = slider::slide_dbl(Cash, mean,
.before = 28, .after = 28, .complete = F)) %>%
autoplot(Cash) +
geom_line(aes(y = Weekly), colour = "yellow") +
geom_line(aes(y = Monthly), colour = "blue") +
ggtitle("Weekly (yellow) and Monthly (blue) MA for ATM 2")
ATM 1 has a steady overall level, such that the forecasts for May 2010 should end up looking like the values for May 2009. ATM 2, on the other hand, has an overall downward trend. May 2009 starts around 75 and April 2010 ends around 62. So the May 2010 forecast might have to be lower than the May 2009 numbers.
atm4 %>%
mutate(
Weekly = slider::slide_dbl(Cash, mean,
.before = 7, .after = 7, .complete = F),
Monthly = slider::slide_dbl(Cash, mean,
.before = 28, .after = 28, .complete = F)) %>%
autoplot(Cash) +
geom_line(aes(y = Weekly), colour = "yellow") +
geom_line(aes(y = Monthly), colour = "blue") +
ggtitle("Weekly (yellow) and Monthly (blue) MA for ATM 4")
Because of the spikiness of the data, especially for ATM 4, even the moving averages are missing a lot of summary details. For example, the mean daily value by month looks like this for ATM 4:
atm4 %>%
index_by(month = yearmonth(DATE)) %>%
summarise(daymean = mean(Cash)) %>%
autoplot(daymean) +
labs(title = "Mean daily Cash, by Month, for ATM 4")
atm2 %>%
index_by(month = yearmonth(DATE)) %>%
summarise(daymean = mean(Cash)) %>%
autoplot(daymean) +
labs(title = "Mean daily Cash, by Month, for ATM 2")
atm1 %>%
index_by(month = yearmonth(DATE)) %>%
summarise(daymean = mean(Cash)) %>%
autoplot(daymean) +
labs(title = "Mean daily Cash, by Month, for ATM 1")
These line charts partly show why ATM’s 1 and 4 have relatively high correlation numbers: They have similarly shaped monthly patterns. Upon inspecting the yearlong trends, it’s doubtful that modelling them will help forecasts much.
Let’s start with the autocorrelation for ATM 1, using 7-day differences.
atm1 %>%
mutate(diffCash = difference(Cash, 7)) %>%
gg_tsdisplay(diffCash, plot_type = "partial") +
labs(title = "Week-on-week changes for ATM 1")
There are strong weekday patterns, and the 7-day differences still have significant spikes at 7-day lags. Hopefully a classical decomposition will show a fixed weekday pattern accounting for a lot of the movement.
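As a reminder of what that targets: the classical additive decomposition splits the series as \(y_t = \hat{T}_t + \hat{S}_t + \hat{R}_t\), with the seasonal component \(\hat{S}_t\) repeating a fixed (here, weekly) pattern, so a stable weekday effect would be captured entirely by \(\hat{S}_t\).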
atm1 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components() -> comps1
#comps1
autoplot(comps1) +
labs(title = "ATM 1 (additive) Decomposition")
atm2 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "ATM 2 Decomposition")
atm4 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "ATM 4 Decomposition")
For ATM 4, it looks like the random element is truly random and unpredictable throughout the year.
For ATM’s 1 and 2, however, a new pattern seems to set in around mid-February, such that the seasonal element of the decomposition for the earlier part of the series doesn’t account for the new pattern. It’s possible that just running a new fit on the data from that point on would produce better forecasts for May 2010, especially since the downward trend of ATM 2 changes at that same time point and stabilizes.
endATM1 = atm1 %>% filter_index("2010-03-01" ~.)
endATM2 = atm2 %>% filter_index("2010-03-01" ~.)
endATM4 = atm4 %>% filter_index("2010-03-01" ~.)
atm1comps = endATM1 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components()
atm1comps %>%
autoplot() +
labs(title = "ATM 1 Decomposition, last 2 months only")
This decomposition is nice in the sense that its seasonal component explains most of the levels, and the random component looks pretty random. There is still a mountain at the end of April’s trend that looks like it could be fit by another model, so as to produce higher forecasts for the beginning and end of May 2010. But having chopped the data down to 2 months for fitting, that possible trend will be hard to isolate.
atm2comps = endATM2 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components()
atm2comps %>%
autoplot() +
labs(title = "ATM 2 Decomposition, last 2 months only")
This same approach to ATM 2 isn’t quite as satisfying. The seasonal component explains a bit less of the levels and the random component looks perhaps less random. But it still looks like isolating the seasonal component to produce forecasts will be effective. If instead a time-weighted approach like ETS or ARIMA is used, it will produce forecasts closer to the mean, when in fact the day of the week is consistently pushing certain days very far from the mean.
atm4comps = endATM4 %>%
model(classical_decomposition(Cash, type = "additive")) %>%
components()
atm4comps%>%
autoplot() +
labs(title = "ATM 4 Decomposition, last 2 months only")
Even for ATM4, the seasonal component accounts for a higher proportion of the variability when only the data from March onward are modeled. The fixed seasonal (weekday) component accounts for about half of the daily swings, compared to only about a quarter when modeling the whole year’s data. Of course this makes sense: the model only has to fit a pattern to one part of the series, so the fit will be better, and in fact overfit, compared to the whole-year fit. But the reality of this project is that we have no real indication of whether there’s even a yearly pattern, since we only have one year of data. In the absence of any yearly pattern, a more naive fit may well turn out to be better anyway. And the one ATM (number 2) that had a trend for most of the year ended up reversing that trend. For these reasons, I’ll fit only the ends of the series in order to produce forecasts.
atm4comps %>%
gg_tsdisplay(season_adjust, plot_type = "partial") +
labs(title = "Breakdown of ATM 4 After Removing Seasonal Component")
Can ARIMA pick up on anything left over here?
endATM4 = endATM4 %>%
mutate(seasonless = atm4comps$season_adjust)
endATM4 %>%
model(ARIMA(seasonless, stepwise = F, approximation = F)) -> fit4
report(fit4)
## Series: seasonless
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean
##
## Coefficients:
## sar1 constant
## -0.3126 543.8977
## s.e. 0.1336 28.0468
##
## sigma^2 estimated as 46830: log likelihood=-413.9
## AIC=833.81 AICc=834.23 BIC=840.14
fc4 = fit4 %>% forecast(h=31)
fc4 %>%
autoplot(endATM4) +
labs(title = "ARIMA(0,0,0)(1,0,0)[7] Forecasts on De-seasoned ATM 4")
Because there isn’t much seasonality left in the data, the ACF and PACF plots are mostly insignificant, and the auto-fitted model finds a SAR \(\phi\) that is small enough to produce forecasts that flatten out to the mean fairly quickly. Still, the hope is that by adding the decomposed seasonal pattern back into these ARIMA forecasts, the results will be slightly better than just using the seasonal pattern added to the mean.
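For intuition, the fitted model above is \(y_t = c + \Phi\, y_{t-7} + \varepsilon_t\) with \(\Phi \approx -0.31\), so each successive week of forecasts shrinks its deviation from the long-run mean \(c/(1-\Phi) \approx 544/1.31 \approx 414\) by a factor of \(|\Phi|\); after two or three weeks the forecasts are essentially flat at that mean.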
Adding the seasonal pattern back in to those forecasts produces the final forecasts, shown here:
# make sure to add the correct (weekday) seasonal pattern
from = (dim(endATM4)[1] + 1) %% 7
to = from + 30 # to get 31 days of May 2010 forecasts
fc4vals = tsibble(DATE = fc4$DATE, vals = fc4$.mean + atm4comps$seasonal[from:to], index = DATE)
combo = tsibble(date = c(endATM4$DATE, fc4vals$DATE), cash = c(endATM4$Cash, fc4vals$vals), index = date)
fcs = combo %>% filter_index("2010-05" ~.)
combo %>% autoplot(cash) +
autolayer(fcs, cash, colour = "red") +
labs(title = "ATM 4, Decomposed Seasonal Pattern plus Leftover SAR(1) Forecasts (in red)")
Although the high parts of that forecast are clearly not high enough, the low and middle parts look right, and at least some of the higher trend at the start of months is forecasted. And even that potential trend is unpredictable, as witnessed by the first weekend of April dropping low (Easter Sunday was April 4, so depending on where this ATM is, that may have affected things).
ATM 4 forecasts
submitATM4 = fc4vals
#write.csv(submitATM4, "atm4Fcast.csv", row.names = F)
I’ll just plot the seasonal pattern added to the mean, on top of the actual levels:
atm1comps %>%
select(c(Cash, seasonal)) %>%
as_tsibble(index = DATE) %>%
autoplot(Cash, colour="orange") +
geom_line(aes(y=seasonal+mean(Cash), colour = "Seasonal Component + Mean")) +
scale_color_manual(name='', values = c("Seasonal Component + Mean" = "black")) +
labs(title = "ATM 1 Cash Withdrawn") +
theme(legend.position = c(0.1, 1.0))
ATM 1 looks to have stable enough seasonal (weekday) elements to just use the mean + seasonal element as forecasts.
n = dim(atm1comps)[1]
from = (n+1) %% 7
to = from + 30
submitATM1 = tsibble(DATE = atm1comps$DATE[1:31] + n,
Cash = mean(endATM1$Cash) + atm1comps$seasonal[from:to],
index = DATE)
#write.csv(submitATM1, "atm1Fcast.csv", row.names = F)
Here’s ATM 2:
atm2comps %>%
select(c(Cash, seasonal)) %>%
as_tsibble(index = DATE) %>%
autoplot(Cash, colour="orange") +
geom_line(aes(y=seasonal+mean(Cash), colour = "Seasonal Component + Mean")) +
scale_color_manual(name='', values = c("Seasonal Component + Mean" = "black")) +
labs(title = "ATM 2 Cash Withdrawn") +
theme(legend.position = c(0.1, 1.0))
ATM 2 looks like its swings have left enough autocorrelation behind for a similar approach to ATM 4.
endATM2 = endATM2 %>%
mutate(seasonless = atm2comps$season_adjust)
endATM2 %>%
model(ARIMA(seasonless, stepwise = F, approximation = F)) -> fit2222
report(fit2222)
## Series: seasonless
## Model: ARIMA(0,0,0)(0,0,2)[7] w/ mean
##
## Coefficients:
## sma1 sma2 constant
## 0.0684 0.4477 57.4007
## s.e. 0.1431 0.2251 2.7562
##
## sigma^2 estimated as 242.4: log likelihood=-254.05
## AIC=516.1 AICc=516.81 BIC=524.54
fit2222 %>%
forecast(h=31) %>%
autoplot(endATM2) +
labs(title = "ARIMA(0,0,0)(0,0,2)[7] Forecasts on De-seasoned ATM 2")
fc2 = fit2222 %>% forecast(h=31)
fc2vals = tsibble(DATE = fc2$DATE, vals = fc2$.mean + atm2comps$seasonal[from:to], index = DATE) # endATM1 and endATM2 span the same dates, so `from` and `to` carry over
combo2222 = tsibble(date = c(endATM2$DATE, fc2vals$DATE), cash = c(endATM2$Cash, fc2vals$vals), index = date)
fcs2222 = combo2222 %>% filter_index("2010-05" ~.)
combo2222 %>% autoplot(cash) +
autolayer(fcs2222, cash, colour = "red") +
labs(title = "ATM 2, Decomposed Seasonal Pattern plus Leftover SMA(2) Forecasts (in red)")
I’ll use those red forecasts as final ones.
submitATM2 = fc2vals
#write.csv(submitATM2, "atm2Fcast.csv", row.names = F)
There may not be much point in looking for help from other ATM’s when ATM 3 has only 3 days of data, but let’s see if its (partial) weekly pattern matches any of the others.
last3atm1 = atm1 %>% filter_index("2010-04-28" ~ "2010-04-30")
atm %>%
filter_index("2010-04-28" ~ "2010-04-30") %>%
autoplot(Cash) +
geom_point(data=last3atm1, aes(y=Cash), colour="black") +
labs(title = "ATM 3 series is 3 days long, and duplicates ATM 1 for those days",
x = "ATM 1 shown as black dots")
I had to add ATM 1 data as black dots here, since they were hidden behind ATM 3 data.
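A numeric check of that duplication (a small sketch, reusing last3atm1 from the plotting code above):
last3atm3 = atm3 %>% filter_index("2010-04-28" ~ "2010-04-30")
all(dplyr::near(last3atm3$Cash, last3atm1$Cash)) # expect TRUE, per the overlap in the plot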
There are many possible reasons why ATM 3 has just 3 days of data and all its values are the same as ATM 1 for those 3 days. In the absence of more helpful hints, I’ll use the ATM 1 forecasts for ATM 3. Other viable approaches include naive guesses, but those are equally blind shots in the dark. I’m inclined to follow the few clues provided, and to use the most non-random pattern I can see.
submitATM3 = submitATM1
#write.csv(submitATM3, "atm3Fcast.csv", row.names = F)
#power = read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
#write.csv(power, "power.csv")
power = read.csv("https://raw.githubusercontent.com/ebhtra/msds-624/main/proj1/data/power.csv") %>% select(-X)
tail(power)
## CaseSequence YYYY.MMM KWH
## 187 919 2013-Jul 8415321
## 188 920 2013-Aug 9080226
## 189 921 2013-Sep 7968220
## 190 922 2013-Oct 5759367
## 191 923 2013-Nov 5769083
## 192 924 2013-Dec 9606304
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. The assignment is to model these data and produce a monthly forecast for 2014, in Excel format. The variable ‘KWH’ is power consumption in kilowatt-hours.
# convert to tsibble
power = power %>%
mutate(date = yearmonth(power$YYYY.MMM)) %>%
as_tsibble(index = date)
power %>% autoplot(KWH) + labs(title = "Original Power Data, with Missing and Outlying Points")
min(power$KWH[!is.na(power$KWH)])
## [1] 770523
Another candidate for off-by-a-decimal.
off_index = which(power$KWH==min(power$KWH[!is.na(power$KWH)]))
power$KWH[off_index]
## [1] 770523
power$KWH[off_index] = power$KWH[off_index] * 10
power %>% autoplot(KWH) + labs(title = "After adjusting outlier")
That looks pretty convincing. Now it appears there is a missing value to handle as well.
sum(is.na(power$KWH))
## [1] 1
power = power %>%
select(KWH) %>%
na_kalman()
power %>% autoplot(KWH) +
labs(title = "After seasonally imputing the missing value")
That looks good so far. Now it seems as if the high months are increasing more quickly than the low months, which suggests that a power transform may help forecasts.
lambda = power %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
power = power %>%
mutate(boxcox = box_cox(KWH, lambda))
power %>%
autoplot(boxcox) +
labs(y = "Transformed KWH",
title = latex2exp::TeX(paste0(
"Box-Cox Transformed KWH with $\\lambda$ = ",
round(lambda,2))))
This looks ready for a seasonally fit model.
power %>% gg_tsdisplay(boxcox, plot_type = "partial")
The ACF plot shows two main seasonal patterns: the yearly and the quarterly.
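One optional way to see those patterns directly (a sketch, not part of the original output) is a seasonal plot of the transformed series, with one line per year:
power %>% gg_season(boxcox) + labs(title = "Seasonal plot of Box-Cox KWH")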
power %>% gg_tsdisplay(difference(boxcox, 12), plot_type = "partial", lag= 60) +
labs(title = "After differencing by 12 months")
The seasonally differenced data look reasonably stationary, and the remaining ACF and PACF spikes show there will be plenty for ARIMA to sort through. There should be at least one seasonal (12-month) component and one 1-month component, but I don’t know where the AR will go and where the MA will go, so I’ll compare both pdq(0,0,1) + PDQ(1,1,0) and pdq(1,0,0) + PDQ(0,1,1) to an automatically fit model:
fit = power %>%
model(
arima001110 = ARIMA(boxcox ~ pdq(0,0,1) + PDQ(1,1,0)),
arima100011 = ARIMA(boxcox ~ pdq(1,0,0) + PDQ(0,1,1)),
auto = ARIMA(boxcox, stepwise = FALSE, approx = FALSE)
)
report(fit %>% select(auto))
## Series: boxcox
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.2746 0.0776 0.2181 -0.7360 -0.3901 0.0013
## s.e. 0.0759 0.0844 0.0703 0.0738 0.0766 0.0004
##
## sigma^2 estimated as 1.288e-05: log likelihood=767.74
## AIC=-1521.49 AICc=-1520.84 BIC=-1499.14
glance(fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 3 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 0.0000129 768. -1521. -1521. -1499.
## 2 arima100011 0.0000132 763. -1518. -1518. -1505.
## 3 arima001110 0.0000149 754. -1499. -1499. -1486.
The model chose ARIMA(0,0,3)(2,1,0)[12] w/ drift, although the drift component is tiny. The largest coefficients were the seasonal AR(1) and AR(2), followed by the MA(1) and MA(3). The arima100011 actually had the lowest BIC, but the auto-chosen model had higher likelihood and lower AIC/AICc, so I’ll focus on it for now.
fit %>% select(auto) %>% gg_tsresiduals(lag=48) + labs(title = "ARIMA(0,0,3)(2,1,0)[12] Residual Analysis")
There are still slight patterns in the residual ACF, especially at lag 6, but the distribution is pretty normal and the significance levels aren’t troubling.
forecast(fit, h=12) %>%
filter(.model=='auto') %>%
autoplot(power, level=NULL) +
labs(title = "ARIMA(0,0,3)(2,1,0)[12] Forecasts")
The automatically chosen model had an MA(3) coefficient because of the nested seasonality mentioned earlier. We could also specify the double seasonality directly with a dshw model (which requires the longer period to be an integer multiple of the shorter one, as 12 is of 3), like so:
library(forecast)
dshw_model = dshw(power$boxcox, period1=3, period2=12, h=12)
dshw_model %>% autoplot()
The double-seasonal Holt Winters forecasts are similar to the ARIMA(0,0,3)(2,1,0) with drift, although they are slightly higher, suggesting they rely more on the recent, upward trend in power usage. In order to reduce the arbitrariness of a choice between the two models, we could see how each model does on the 2012-2013 data, after fitting to the earlier years.
train = power %>% filter_index( ~ "2011 Dec")
test = power %>% filter_index("2012 Jan" ~.)
arimaTrain = train %>% model(ARIMA(boxcox ~ 1 + pdq(0,0,3) + PDQ(2,1,0)))
dshwTrain = dshw(train$boxcox, period1=3, period2 = 12, h=dim(test)[1])
fcArima = arimaTrain %>% forecast(h=dim(test)[1])
fcArima = fcArima$.mean
fcDshw = dshwTrain$mean
true = test$boxcox
paste("Test RMSE for ARIMA model:", sqrt(mean((true - fcArima)^2)))
## [1] "Test RMSE for ARIMA model: 0.00348374898311607"
paste("Test RMSE for DSHW model:", sqrt(mean((true - fcDshw)^2)))
## [1] "Test RMSE for DSHW model: 0.00677183319418132"
So the fairly clear winner, testing on 2012-2013, is the ARIMA model. This is not surprising when you look at those two years on the chart; they are two relatively flat years that come on the heels of a few years of upward trend. Since the DSHW model carries that trend forward into its forecasts more than the ARIMA does, it most likely forecasted levels that were too high.
paste("DSHW forecasts - actual levels, on average:", mean(fcDshw - true))
## [1] "DSHW forecasts - actual levels, on average: 0.0059180728913845"
paste("ARIMA forecasts - actual levels, on average:", mean(fcArima - true))
## [1] "ARIMA forecasts - actual levels, on average: -0.000273139097479983"
Now, having seen how 2012-2013 actually looked, we might want to just go with the ARIMA model, which tested best on that data, or we might eyeball the fuller chart, speculate that the upward trend will resume, and go with the DSHW model. I’m going to use the DSHW forecasts: every time a year’s two low months have looked like the ones in the last year of the series, the following year has seen increased levels, so I’d rather use the more upward-trending forecasts of DSHW.
To use the DSHW forecasts, I still need to invert the Box-Cox transformation that was applied to the original series before fitting. The transform used \(\lambda = -0.21\), so I invert that function and apply it to the forecasts.
lambda
## [1] -0.2093916
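For reference, with \(\lambda \neq 0\) the Box-Cox transform is \(w = (y^{\lambda} - 1)/\lambda\), so inv_box_cox computes \(y = (\lambda w + 1)^{1/\lambda}\).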
fc = forecast(fit, h=12) %>%
filter(.model=='auto') # this is really just to get the months
unboxedDSHW = inv_box_cox(dshw_model$mean, lambda)
patchedDSHW = tsibble(date = c(power$date, fc$date),
KWH = c(power$KWH, unboxedDSHW),
index = date) # this is just to plot things
fcastDSHW = patchedDSHW %>% filter_index(as.character(fc$date[1]) ~.)
patchedDSHW %>% autoplot(KWH) +
autolayer(fcastDSHW, KWH, colour = "red") +
labs(title = "DSHW Untransformed Forecasts (in red)")
submitPower = tsibble(date = fc$date,
KWH = unboxedDSHW,
index=date)
#write.csv(submitPower, "powerFcast.csv", row.names = F)
Part C consists of two datasets. These are simple two-column sets, but they have different time stamps. The (optional) assignment is to time-base sequence the data and aggregate based on hour, taking the mean where there are multiple recordings within an hour. Then determine whether the data are stationary and whether they can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and .rmd, with the forecast in an Excel-readable file.
#water1 = read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
#write.csv(water1, "data/water1.csv")
#water2 = read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
#write.csv(water2, "data/water2.csv")
water1 = read.csv("https://raw.githubusercontent.com/ebhtra/msds-624/main/proj1/data/water1.csv") %>% select(-X)
tail(water1)
## Date.Time WaterFlow
## 995 2015-11-01 22:09:14 15.28143
## 996 2015-11-01 22:09:18 26.33668
## 997 2015-11-01 22:25:31 29.06427
## 998 2015-11-01 23:08:20 22.84410
## 999 2015-11-01 23:34:10 16.21870
## 1000 2015-11-01 23:35:43 21.21171
water2 = read.csv("https://raw.githubusercontent.com/ebhtra/msds-624/main/proj1/data/water2.csv") %>% select(-X)
tail(water2)
## Date.Time WaterFlow
## 995 2015-12-03 11:00:00 72.96677
## 996 2015-12-03 12:00:00 31.48311
## 997 2015-12-03 13:00:00 66.81673
## 998 2015-12-03 14:00:00 42.93666
## 999 2015-12-03 15:00:00 33.40133
## 1000 2015-12-03 16:00:00 66.68147
Convert datetimes
water1$Date.Time = floor_date(ymd_hms(water1$Date.Time), 'hour')
water2$Date.Time = floor_date(ymd_hms(water2$Date.Time), 'hour')
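For example, floor_date() truncates each timestamp to the top of its hour, so nearby readings share an hourly index and can then be averaged. A one-line illustration, using a timestamp from the tail of water1 shown above:
floor_date(ymd_hms("2015-11-01 23:34:10"), "hour") # -> 2015-11-01 23:00:00 UTC, the same bucket as the 23:35:43 reading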
Create tsibbles
water1 = tsibble(water1 %>% group_by(Date.Time) %>% summarise(water=mean(WaterFlow)), index = Date.Time)
water2 = tsibble(water2 %>% group_by(Date.Time) %>% summarise(water=mean(WaterFlow)), index = Date.Time)
water1 %>% autoplot(water) + labs(title = "water1")
To get a better comparison to water1, here are the water2 data for just the same dates:
water2 %>% filter_index(~ "2015-11-01") %>% autoplot(water) + labs(title = "Part of water2")
Check for missingness
count_gaps(water1)
## # A tibble: 4 × 3
## .from .to .n
## <dttm> <dttm> <int>
## 1 2015-10-27 17:00:00 2015-10-27 17:00:00 1
## 2 2015-10-28 00:00:00 2015-10-28 00:00:00 1
## 3 2015-11-01 05:00:00 2015-11-01 05:00:00 1
## 4 2015-11-01 09:00:00 2015-11-01 09:00:00 1
count_gaps(water2)
## # A tibble: 0 × 3
## # … with 3 variables: .from <dttm>, .to <dttm>, .n <int>
Just the 4 gaps in water1. So, as in previous problems, I’ll use na_kalman() from imputeTS, after creating NA’s by filling the datetime gaps in the index.
water1 = fill_gaps(water1)
sum(is.na(water1$water))
## [1] 4
#library(imputeTS)
water1 = water1 %>%
select(water) %>%
na_kalman()
count_gaps(water1)
## # A tibble: 0 × 3
## # … with 3 variables: .from <dttm>, .to <dttm>, .n <int>
Check autocorrelation patterns
water1 %>% gg_tsdisplay(water, plot_type = "partial", lag=168) + labs(title = "water1 AC analysis")
water1 does look stationary and white-noisy. Just one of the 336 lags shown across the ACF and PACF plots (barely) exceeds the blue significance lines.
water1 %>%
features(water, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.242 0.1
The KPSS null hypothesis is that the data are stationary, and any p-value over .1 is listed as .1, meaning that the null hypothesis cannot be rejected, as is the case here. So this lends statistical support to the claim that the data are stationary.
The following histogram shows a roughly normal distribution of values around the mean, as further evidence.
hist(water1$water - mean(water1$water), main = "Differences from the Mean of Water1", xlab = "")
water2 %>% gg_tsdisplay(water, plot_type = "partial", lag=168) + labs(title = "water2 AC analysis")
water2 %>%
features(water, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.105 0.1
water2 %>%
features(water, ljung_box) %>%
select(lb_pvalue)
## # A tibble: 1 × 1
## lb_pvalue
## <dbl>
## 1 0.422
The null hypothesis of this Ljung-Box test is that the water2 data are independently distributed, and thus not significantly auto-correlated, and the reported p-value of the test, 0.42, isn’t nearly low enough to reject that.
So even though the ACF and PACF plots show multiple statistically significant lags for water2, the data are technically stationary, or at least close enough statistically to fit models to them.
water2 %>%
model(classical_decomposition(water, type = "additive")) %>%
components() %>%
autoplot() + labs(title = "Additive Decomposition of water2")
The random component of the decomposition dominates in magnitude, which means that there will be a lot of error in the forecasts, and in fact a very naive approach may produce the lowest squared errors. Let’s see if water1 is equally difficult to model.
water1 %>%
model(classical_decomposition(water, type = "additive")) %>%
components() %>%
autoplot() + labs(title = "Additive Decomposition of water1")
Not surprisingly, considering the autocorrelation plots for water1 were even less significant than those for water2, the bulk of the decomposition is attributed to random variance.
ETS and/or ARIMA?
arim1 = water1 %>% model(ARIMA(water, stepwise = F, approximation = F))
arim2 = water2 %>% model(ARIMA(water, stepwise = F, approximation = F))
report(arim1)
## Series: water
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 19.8887
## s.e. 0.2708
##
## sigma^2 estimated as 17.68: log likelihood=-684.73
## AIC=1373.46 AICc=1373.51 BIC=1380.42
water1 proved every bit as hard to crack as it appeared: the auto-ARIMA function called it white noise and quit. We could just forecast the mean, or the last value (which is approximately the same as the mean), and thus expect the forecast MSE to be about the same as the \(\sigma^2\) reported by the model output, about 17.68, assuming the variance continues to be fairly steady. Or we could see how the seasonal pattern extracted from the classical decomposition does, maybe on the last 2 days of data, after fitting it to the previous few days, like this:
train = water1 %>% filter_index(~ "2015-10-29")
test = water1 %>% filter_index("2015-10-31" ~.)
train %>%
model(classical_decomposition(water, type = "additive")) %>%
components() -> comps1
fcast = rep(comps1$seasonal[1:24], length.out=dim(test)[1]) + mean(train$water) # recycle one day's 24 hourly seasonal values across the test window
actual = test$water
paste("MSE for test 2 days, using seasonal average (hour of day) predictions:",
mean((actual - fcast) ^ 2))
## [1] "MSE for test 2 days, using seasonal average (hour of day) predictions: 18.7723067746856"
paste("MSE for just predicting the mean of the 'training' series:",
mean((actual - mean(train$water)) ^ 2))
## [1] "MSE for just predicting the mean of the 'training' series: 15.2661259316151"
The variance of the last two days (our “test” series here) seems to have been somewhat low compared to the series overall, such that the mean prediction worked best for the forecasts. I’ll run with that safely naive concept and just predict the mean of the series for all forecasts. Now let’s see if water2, with its slightly more auto-correlated patterns, produced anything of interest from the ARIMA model.
report(arim2)
## Series: water
## Model: ARIMA(1,0,0)(1,0,0)[24] w/ mean
##
## Coefficients:
## ar1 sar1 constant
## -0.0180 0.0820 36.9636
## s.e. 0.0035 0.0035 0.5048
##
## sigma^2 estimated as 255.6: log likelihood=-4188.95
## AIC=8385.9 AICc=8385.95 BIC=8405.54
It did pick up a small pattern, one tiny \(\phi\) for the 1-hour lag and one slightly more significant \(\phi\) for the 24-hour lag.
Let’s do a cross-validation comparison of this ARIMA model and an ETS(“A”,“N”,“A”) model, although the latter will only be able to pick up on one seasonality. Note that I’m starting the 1-step forecast validations on an initial training series of 144 points, and adding 5 points for each subsequent validation fit and test. The 144 represents six days of data, which should give the model enough to fit patterns to, and the 5-hour steps sample all hours of the day equally, without bogging down the machine all day doing 850 different model fits and validations.
mods = water2 %>%
stretch_tsibble(.init = 144, .step = 5) %>%
model( 'ets' = ETS(water ~ error("A") + trend("N") + season("A")),
'arima' = ARIMA(water ~ pdq(1,0,0) + PDQ(1,0,0))) %>%
forecast(h=1)
mods %>% accuracy(water2) %>%
select(.model, RMSE)
## # A tibble: 2 × 2
## .model RMSE
## <chr> <dbl>
## 1 arima 16.0
## 2 ets 16.4
Cross-validate a Double Seasonal Holt-Winters model on the same chunks of the data. For periods I tried 4 and 24: the 24-hour lag was important, and no other factor or multiple of 24 was significant until 72, which was too far out to make useful forecasts; 4 was the factor of 24 that was most significant in the AC plot.
#library(forecast)
errors = c()
n = dim(water2)[1] - 1
for (hr in seq(144,n,5)) {
fcDSHW = dshw(water2$water[1:hr], period1=4, period2=24, h=1)
errors = c(errors, fcDSHW$mean - water2$water[(hr+1)])
}
paste("RMSE for DSHW:", sqrt(mean(errors * errors)))
## [1] "RMSE for DSHW: 16.8840902876066"
The ARIMA(1,0,0)(1,0,0)[24] looks to be slightly better than the ETS(A,N,A), and much better than the DSHW(4,24).
With a Cross-Validated error of just over 16, a fair comparison might be the standard deviation from the mean of the series for those points, each one validated based on the mean of the series up until it:
errors = c()
n = dim(water2)[1] - 1
for (hr in seq(144,n,5)) {
errors = c(errors, mean(water2$water[1:hr]) - water2$water[(hr+1)])
}
paste("RMSE for mean method:", sqrt(mean(errors * errors)))
## [1] "RMSE for mean method: 15.778394288517"
Again, unfortunately, no manner of fancy decomposition or smoothing or nested seasonal patterns can beat the standard mean method. At least not with me driving the bus. I will check one other thing before resorting to the mean forecast, however. If one of the methods performs much better on the second half of its predictions than the others, that could hint at better future forecasts.
armcast = mods %>% filter(.model == 'arima')
armcast = armcast$.mean
nas = is.na(armcast) # for whatever reason, the arima model produces NA's sporadically.
armcast = armcast[!nas]
etscast = mods %>% filter(.model == 'ets')
etscast = etscast$.mean
etscast = etscast[!nas]
actual = water2$water[seq(145, length(water2$water), 5)]
actual = actual[!nas]
meanErrs = errors[!nas]
etsErrs = etscast - actual
armErrs = armcast - actual
n = length(armErrs)
m = round(n/2)
# use the NA-filtered meanErrs so all three error vectors align
halves = data.frame(RMSE_1 = c(sqrt(mean(meanErrs[1:m] ^2)),
sqrt(mean(etsErrs[1:m] ^2)),
sqrt(mean(armErrs[1:m] ^2))),
RMSE_2 = c(sqrt(mean(meanErrs[(m+1):n] ^2)),
sqrt(mean(etsErrs[(m+1):n] ^2)),
sqrt(mean(armErrs[(m+1):n] ^2))),
row.names = c("mean forecast", "ETS forecast", "ARIMA forecast"))
halves
## RMSE_1 RMSE_2
## mean forecast 15.26968 14.66844
## ETS forecast 15.57899 17.43346
## ARIMA forecast 14.55948 17.35836
In this case, the mean forecast was the only one trending toward more accurate in the second half of the cross-validation series. The ARIMA actually did best on the first half, although I’m not sure why it forecasted NA’s about 15% of the time, mostly in the first half, perhaps not coincidentally. At any rate, the mean forecast is the simplest and also the most accurate way to predict water2 and water1, as far as I can tell, assuming RMSE is the metric of choice.
That does bring up the question of why I’m using squared errors as my measuring stick here, instead of, for example, absolute errors. My reasoning is that the variance from the mean is the best way to baseline a fit for these time series. In other words, a great target to beat is the variance of the series, possibly with drift (although drift mostly didn’t apply here). And if you compare your squared errors to the variance, you’re less likely to create overfitted models than if you compare your absolute errors to something like an average absolute deviation from the mean. Ideally, you’d like to think that you’re modelling a partly random process, whereby the parts of the observations that your model failed to explain fall into a normal distribution around the mean of the series. Without squaring the errors, the mean no longer matters, and many different fits have approximately the same overall errors, no matter how bizarre the fit and how far from the mean it lies.
tsibble(date = ymd("2022-03-04") + 1:4,
value = c(101, 105, 105, 101),
index = date) %>%
autoplot(value) +
geom_line(aes(y=103), color = 'orange') +
geom_line(aes(y=101.5), color = 'blue')
Let’s say you’re trying to fit a model to a series that includes these four days somewhere in the middle of it (black line). You have one candidate model that fits the orange line and another that fits the blue line. From a mean absolute error standpoint, both lines produce a mean error of 2 for these 4 points. But you’d like to think that the orange line is the more likely of the two, because then you can more easily attribute the errors to random variance. If the blue line is in fact the most likely fit, then the absolute errors are (.5, 3.5, 3.5, .5), which look more likely to be drawn from a random normal distribution with higher variance than the one that would explain the orange absolute errors of (2, 2, 2, 2). The mean squared error for the blue line is 6.25, compared to the orange’s 4. Our goal is to find a distribution with the lowest random variance, but mean absolute error doesn’t help us differentiate between candidates like these.
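To double-check that arithmetic, a quick sketch:
# both candidate lines tie on mean absolute error but differ on mean squared error
black = c(101, 105, 105, 101)
c(mae_orange = mean(abs(black - 103)), mae_blue = mean(abs(black - 101.5))) # 2 and 2
c(mse_orange = mean((black - 103)^2), mse_blue = mean((black - 101.5)^2)) # 4 and 6.25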
Having said that, here are the mean predictions for water1 and water2.
# 7 days of forecasts, one per hour (60 * 60 secs), 24 hours a day.
idx1 = water1$Date.Time[length(water1$Date.Time)] + (1:(24*7)) * 60 * 60
idx2 = water2$Date.Time[length(water2$Date.Time)] + (1:(24*7)) * 60 * 60
submitWater1 = tsibble(Date.Time = idx1,
water = mean(water1$water),
index = Date.Time)
submitWater2 = tsibble(Date.Time = idx2,
water = mean(water2$water),
index = Date.Time)
#write.csv(submitWater1, "water1Fcast.csv", row.names = F)
#write.csv(submitWater2, "water2Fcast.csv", row.names = F)