Project 01

Author

Curtis Elsasser

Published

March 30, 2025

Instructions

This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 30th. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

Part A - ATM Forecast

Data: ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

Setup

library(fpp3)
library(readxl)
library(writexl)
library(tidyverse)
library(zoo)

Load the Data

This is our second time around. We found that DATE is not recognized as a date; it comes through as an odd serial number. But read_xlsx knows how to convert it, so we are explicitly asking the reader to convert it via col_types. That throws a ton of warnings, which we can safely ignore. We would also like to convert ATM to a factor, but that’s not supported by read_xlsx, so we’ll do that afterwards.

data <- read_xlsx(
  "data/ATM624Data.xlsx",
  col_types = c("date", "guess", "numeric")
)

Inspect the Data

head(data)
# A tibble: 6 × 3
  DATE                ATM    Cash
  <dttm>              <chr> <dbl>
1 2009-05-01 00:00:00 ATM1     96
2 2009-05-01 00:00:00 ATM2    107
3 2009-05-02 00:00:00 ATM1     82
4 2009-05-02 00:00:00 ATM2     89
5 2009-05-03 00:00:00 ATM1     85
6 2009-05-03 00:00:00 ATM2     90
summary(data)
      DATE                            ATM                 Cash        
 Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0  
 1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5  
 Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0  
 Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6  
 3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0  
 Max.   :2010-05-14 00:00:00.00                      Max.   :10919.8  
                                                     NA's   :19       

Fix the Data

Let’s make ATM a factor. We are also doubling back to make DATE a Date; otherwise as_tsibble throws an error: “Can’t obtain the interval due to the mismatched index class”.

data <- data |>
  mutate(
    ATM = as_factor(ATM),
    DATE = as.Date(DATE)
  )

head(data)
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <fct> <dbl>
1 2009-05-01 ATM1     96
2 2009-05-01 ATM2    107
3 2009-05-02 ATM1     82
4 2009-05-02 ATM2     89
5 2009-05-03 ATM1     85
6 2009-05-03 ATM2     90
summary(data)
      DATE              ATM           Cash        
 Min.   :2009-05-01   ATM1:365   Min.   :    0.0  
 1st Qu.:2009-08-01   ATM2:365   1st Qu.:    0.5  
 Median :2009-11-01   ATM3:365   Median :   73.0  
 Mean   :2009-10-31   ATM4:365   Mean   :  155.6  
 3rd Qu.:2010-02-01   NA's: 14   3rd Qu.:  114.0  
 Max.   :2010-05-14              Max.   :10919.8  
                                 NA's   :19       

Hmmm, there are 14 NAs in ATM and 19 in Cash. Let’s look at them

data |>
  filter(is.na(ATM) | is.na(Cash)) |>
  print()
# A tibble: 19 × 3
   DATE       ATM    Cash
   <date>     <fct> <dbl>
 1 2009-06-13 ATM1     NA
 2 2009-06-16 ATM1     NA
 3 2009-06-18 ATM2     NA
 4 2009-06-22 ATM1     NA
 5 2009-06-24 ATM2     NA
 6 2010-05-01 <NA>     NA
 7 2010-05-02 <NA>     NA
 8 2010-05-03 <NA>     NA
 9 2010-05-04 <NA>     NA
10 2010-05-05 <NA>     NA
11 2010-05-06 <NA>     NA
12 2010-05-07 <NA>     NA
13 2010-05-08 <NA>     NA
14 2010-05-09 <NA>     NA
15 2010-05-10 <NA>     NA
16 2010-05-11 <NA>     NA
17 2010-05-12 <NA>     NA
18 2010-05-13 <NA>     NA
19 2010-05-14 <NA>     NA

Is this a timeseries? Are there duplicate dates per ATM?

data |>
  group_by(DATE, ATM) |>
  filter(n() > 1) |>
  ungroup()
# A tibble: 0 × 3
# ℹ 3 variables: DATE <date>, ATM <fct>, Cash <dbl>

There are no duplicate dates; it looks like a timeseries. At the moment this data is in a long format. We are going to convert it to a wide format so that we can see and treat the data for each ATM separately. But first let’s get rid of the NAs, otherwise we’ll create empty columns. We shall address our NA problem shortly.

data <- data |>
  filter(!(is.na(ATM) | is.na(Cash)))

Now we will pivot the data from long to wide.

wider <- data |>
  pivot_wider(
    names_from = ATM,
    values_from = Cash
  )

We are going to make the series continuous by filling in any missing dates. We shall build the full daily sequence with base R’s seq() and merge it in.

dseq <- seq(min(wider$DATE), max(wider$DATE), by = "day")
wider <- merge(wider, data.frame(DATE = dseq), all = TRUE)
rm(dseq)

As it turns out, our timeseries didn’t get any longer. No harm done.
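A quick way to confirm that, as a sketch that was not part of the original analysis: the merged frame should still have one row per day with no gaps in the date sequence.

# Sketch only: the merge should have added no rows; the daily sequence was
# already complete (365 consecutive days).
nrow(wider)
all(diff(wider$DATE) == 1)  # TRUE means consecutive days with no gaps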

Now let’s fill missing values from the top down (LOCF: last observation carried forward).

wider <- wider |>
  mutate(
    ATM1 = na.locf(ATM1, na.rm = FALSE),
    ATM2 = na.locf(ATM2, na.rm = FALSE),
    ATM3 = na.locf(ATM3, na.rm = FALSE),
    ATM4 = na.locf(ATM4, na.rm = FALSE)
  )

And, finally (for data manipulation), let’s make it long again and convert it to a tsibble.

ts <- wider |>
  pivot_longer(
    cols = ATM1:ATM4,
    names_to = "ATM",
    values_to = "Cash"
  ) |>
  as_tsibble(index = DATE, key = ATM)
# clean up
rm(wider)

Explore the Data

ts |>
  autoplot(Cash) +
  labs(
    title = "ATM Cash Withdrawals",
    x = "Date",
    y = "Cash Withdrawn<!----> x $100"
  )

Oh dear, that’s not good. Let’s do something about the 10,000+ (hundreds of dollars) withdrawal so that we can see the general lay of the land.

ts |>
  mutate(Cash = pmin(Cash, 1500)) |>
  autoplot(Cash) +
  labs(
    title = "ATM Cash Withdrawals",
    x = "Date",
    y = "Cash Withdrawn x $100"
  ) +
  facet_wrap(~ATM, ncol = 1, scales = "free")

Not only is ATM4 a little strange, ATM3 looks as if it has been in existence for only a day or two. Let’s have a look at its non-zero values.

ts |>
  filter(ATM == "ATM3") |>
  filter(Cash > 0) |>
  print()
# A tsibble: 3 x 3 [1D]
# Key:       ATM [1]
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2010-04-28 ATM3     96
2 2010-04-29 ATM3     82
3 2010-04-30 ATM3     85

Hmm, we shall certainly treat it separately.

Regarding the other three, it looks like there is a strong seasonal component in ATM1 and ATM2, but ATM4 looks like noise. There doesn’t appear to be a trend for any of them. Let’s have a look at the ACF for all three of them.

ts |>
  filter(ATM != "ATM3") |>
  ACF(Cash) |>
  autoplot() +
  labs(
    title = "ACF of ATM Cash Withdrawals",
    x = "Lag",
    y = "ACF"
  ) +
  facet_wrap(~ATM, ncol = 1, scales = "free")

ATM1 and ATM2 both have a strong weekly cycle, but ATM4 does appear to be noise. I think we should try to model ATM1 and ATM2 using ETS and see what the residuals look like. Let’s make the data wide again, because it has become inconvenient to work with in long form.

wider <- ts |>
  pivot_wider(
    names_from = ATM,
    values_from = Cash
  )
wider |>
  model(
    ATM1 = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM1 
Model: ETS(A,N,A) 
  Smoothing parameters:
    alpha = 0.0215764 
    gamma = 0.3261183 

  Initial states:
     l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
 82.82498 -58.64238 -1.973851 23.05811 5.631834 9.150982 13.77427 9.00104

  sigma^2:  582.7142

     AIC     AICc      BIC 
4488.559 4489.181 4527.558 

wider |>
  model(
    ATM2 = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM2 
Model: ETS(A,N,A) 
  Smoothing parameters:
    alpha = 0.0001000239 
    gamma = 0.3583549 

  Initial states:
     l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
 77.10971 -42.86391 -40.91243 17.28339 9.408066 14.66191 18.32838 24.0946

  sigma^2:  643.1448

     AIC     AICc      BIC 
4524.575 4525.196 4563.574 

The residuals both look pretty good, though each has a few lags that slightly exceed the critical values. At the moment, I am thinking that we should use the ETS model for ATM1 and ATM2, the naive model for ATM3, and the mean method for ATM4.
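As a more formal check on those residuals, a Ljung-Box test quantifies how much autocorrelation is left. This is a sketch that was not part of the original analysis; the choice of lag = 14 (two weeks of daily data) is an assumption.

# Sketch only: Ljung-Box test on the ETS innovation residuals.
# Small p-values would indicate autocorrelation the models have not captured.
wider |>
  model(
    ATM1 = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
    ATM2 = ETS(ATM2 ~ error("A") + trend("N") + season("A"))
  ) |>
  augment() |>
  features(.innov, ljung_box, lag = 14)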

But, I am still curious whether we can do better for ATM1 and ATM2. I would like to try a seasonal naive model and a SARIMA model. We shall start with the seasonal naive model.

wider |>
  model(ATM1 = SNAIVE(ATM1 ~ lag(7))) |>
  gg_tsresiduals()

wider |>
  model(ATM2 = SNAIVE(ATM2 ~ lag(7))) |>
  gg_tsresiduals()

SNAIVE did not do a very good job. Even though we specified the weekly lag, strong seasonal autocorrelation remains in the residuals. Let’s try SARIMA. We are going to use ARIMA() to search for the best model.

wider |>
  model(
    ATM1 = ARIMA(ATM1 ~ pdq(, , ) + PDQ(, , ))
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM1 
Model: ARIMA(0,0,1)(0,1,2)[7] 

Coefficients:
         ma1     sma1     sma2
      0.1993  -0.5812  -0.1037
s.e.  0.0545   0.0505   0.0493

sigma^2 estimated as 558.6:  log likelihood=-1640.72
AIC=3289.45   AICc=3289.56   BIC=3304.97

wider |>
  model(
    ATM2 = ARIMA(ATM2 ~ pdq(, , ) + PDQ(, , ))
  ) |>
  report() |>
  gg_tsresiduals()
Series: ATM2 
Model: ARIMA(2,0,2)(0,1,1)[7] 

Coefficients:
          ar1      ar2     ma1     ma2     sma1
      -0.4300  -0.9202  0.4676  0.8187  -0.7522
s.e.   0.0541   0.0390  0.0849  0.0539   0.0391

sigma^2 estimated as 605.3:  log likelihood=-1654.47
AIC=3320.94   AICc=3321.18   BIC=3344.23

Wow, on a couple of levels. First, it found two different ARIMA models for ATM1 and ATM2:

  • ATM1: ARIMA(0,0,1)(0,1,2)[7]
  • ATM2: ARIMA(2,0,2)(0,1,1)[7]

Secondly, the AICs are roughly 1,200 points lower than those of the ETS models (keeping in mind that information criteria are not directly comparable across model classes). And the residuals look much better. We are going to change our plan:

  • ATM1: ARIMA(0,0,1)(0,1,2)[7]
  • ATM2: ARIMA(2,0,2)(0,1,1)[7]
  • ATM3: Naive
  • ATM4: Mean

Forecasting

fc1 <- wider |>
  model(ATM1 = ARIMA(ATM1 ~ pdq(0, 0, 1) + PDQ(0, 1, 2))) |>
  forecast(h = 31)

fc2 <- wider |>
  model(ATM2 = ARIMA(ATM2 ~ pdq(2, 0, 2) + PDQ(0, 1, 1))) |>
  forecast(h = 31)

fc3 <- wider |>
  model(ATM3 = NAIVE(ATM3)) |>
  forecast(h = 31)

fc4 <- wider |>
  model(ATM4 = MEAN(ATM4)) |>
  forecast(h = 31)

fc1 |>
  autoplot(wider) +
  labs(
    title = "ATM1 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc2 |>
  autoplot(wider) +
  labs(
    title = "ATM2 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc3 |>
  autoplot(wider) +
  labs(
    title = "ATM3 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

fc4 |>
  autoplot(wider) +
  labs(
    title = "ATM4 Forecast",
    x = "Date",
    y = "Cash Withdrawn x $100"
  )

Now let’s tabulate how much cash is taken out of each of the 4 ATM machines for May 2010. The following totals are in the same units as the data in the spreadsheet (hundreds of dollars).

cat("ATM1: ", sum(fc1$.mean), " $100s\n")
ATM1:  2426.317  $100s
cat("ATM2: ", sum(fc2$.mean), " $100s\n")
ATM2:  1790.056  $100s
cat("ATM3: ", sum(fc3$.mean), " $100s\n")
ATM3:  2635  $100s
cat("ATM4: ", sum(fc4$.mean), " $100s")
ATM4:  14695.34  $100s

Finally, we shall write our forecast to an .xlsx file.

fc = data.frame(
  DATE = c("2010 May", "2010 May", "2010 May", "2010 May"),
  ATM = c("ATM1", "ATM2", "ATM3", "ATM4"),
  Cash = c(sum(fc1$.mean), sum(fc2$.mean), sum(fc3$.mean), sum(fc4$.mean))
)
write_xlsx(fc, "./data/partA_forecast.xlsx")

Part B - Forecasting Power

Data: ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Setup

library(fpp3)
library(readxl)
library(writexl)
library(tidyverse)
library(zoo)

Load the Data

# `read_xlsx` not recognizing the date format. We shall fix it after it's loaded.
data <- read_xlsx(
  "./data/ResidentialCustomerForecastLoad-624.xlsx"
)

Inspect the Data

head(data)
# A tibble: 6 × 3
  CaseSequence `YYYY-MMM`     KWH
         <dbl> <chr>        <dbl>
1          733 1998-Jan   6862583
2          734 1998-Feb   5838198
3          735 1998-Mar   5420658
4          736 1998-Apr   5010364
5          737 1998-May   4665377
6          738 1998-Jun   6467147
summary(data)
  CaseSequence     YYYY-MMM              KWH          
 Min.   :733.0   Length:192         Min.   :  770523  
 1st Qu.:780.8   Class :character   1st Qu.: 5429912  
 Median :828.5   Mode  :character   Median : 6283324  
 Mean   :828.5                      Mean   : 6502475  
 3rd Qu.:876.2                      3rd Qu.: 7620524  
 Max.   :924.0                      Max.   :10655730  
                                    NA's   :1         

Fix the Data

The date did not get converted and it’s inconveniently named.

data <- data |>
  rename(
    Date = `YYYY-MMM`,
    # Let's shorten the name of this guy.
    ID = CaseSequence
  ) |>
  mutate(
    # if we don't convert it to year-month, then it will look like daily data
    # to `as_tsibble()`
    Date = yearmonth(
      # my goodness, as.Date and strptime are fussy!
      as.Date(paste(Date, "01", sep = "-"), format = "%Y-%b-%d")
    )
  )

One more glimpse

head(data)
# A tibble: 6 × 3
     ID     Date     KWH
  <dbl>    <mth>   <dbl>
1   733 1998 Jan 6862583
2   734 1998 Feb 5838198
3   735 1998 Mar 5420658
4   736 1998 Apr 5010364
5   737 1998 May 4665377
6   738 1998 Jun 6467147
# is ID unique
duplicates(data, ID)
Using `Date` as index variable.
# A tibble: 0 × 3
# ℹ 3 variables: ID <dbl>, Date <mth>, KWH <dbl>
# is the date unique
duplicates(data, Date)
Using `Date` as index variable.
# A tibble: 0 × 3
# ℹ 3 variables: ID <dbl>, Date <mth>, KWH <dbl>

Note: we were having a problem with “implicit gaps” down below, and consequently thought we had missing dates. It turned out to be a problem with the way we were converting the date. Still, the following is not all for naught: it seems like good practice to make sure one’s time sequence is complete, so I’m leaving the code in the project.

dseq <- seq(
  from = min(data$Date),
  to = max(data$Date),
  by = 1
)
data <- merge(data, data.frame(Date = dseq), all = TRUE)
rm(dseq)
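For reference, once the data is converted to a tsibble (as we do below), tsibble’s own gap helpers perform the same completeness check. A minimal sketch, not part of the original analysis:

# Sketch only: `has_gaps()` reports whether the monthly index has implicit gaps.
data |>
  as_tsibble(index = Date) |>
  has_gaps()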

Let’s look at our missing values

data |>
  filter(is.na(KWH))
      Date  ID KWH
1 2008 Sep 861  NA

Just the one: 2008-09. Let’s look at the rest of the values for September

september <- data |>
  filter(month(Date) == 9) |>
  mutate(
    mean = mean(KWH, na.rm = TRUE),
    sd = sd(KWH, na.rm = TRUE)
  ) |>
  select(ID, mean, sd)

summary(september)
       ID           mean               sd        
 Min.   :741   Min.   :7702333   Min.   :650683  
 1st Qu.:786   1st Qu.:7702333   1st Qu.:650683  
 Median :831   Median :7702333   Median :650683  
 Mean   :831   Mean   :7702333   Mean   :650683  
 3rd Qu.:876   3rd Qu.:7702333   3rd Qu.:650683  
 Max.   :921   Max.   :7702333   Max.   :650683  

There is a fair amount of variation within September (a standard deviation of roughly 650,000 against a mean of about 7.7 million). We are going to impute the missing value with the September mean, at least for now.

data <- data |>
  left_join(september, by = "ID") |>
  mutate(
    KWH = ifelse(is.na(KWH), mean, KWH)
  ) |>
  select(-mean, -sd)

And lastly, let’s convert it to a time series.

ts <- data |>
  # arrange the columns as we'd prefer seeing them
  select(Date, ID, KWH) |>
  as_tsibble(index = Date)

# cleanup
rm(data)

Explore the Data

We shall start with a line plot.

ts |>
  ggplot(aes(x = Date, y = KWH)) +
  geom_line() +
  labs(
    title = "Residential Power Usage"
  )

It looks like we may have a seasonal pattern with two peaks per year. And it is very clear that we have a rogue outlier somewhere around 2010-2011. There appears to be a slight upward trend in the data as of 2007.

Let’s see if we can find the outlier.

outlier <- ts |>
  filter(KWH < 3000000) |>
  print()
# A tsibble: 1 x 3 [1M]
      Date    ID    KWH
     <mth> <dbl>  <dbl>
1 2010 Jul   883 770523

I’m not sure what to do about the outlier. It is easy to see what its trajectory would have been under normal conditions. Let’s proceed and come back to it if it causes problems.

I’d like to see what components() will do. Let’s look through the lens of STL and inspect the components. I think it will help us identify what the season and trend look like.

ts |>
  model(stl = STL(KWH)) |>
  components() |>
  autoplot()

That worked out nicely. We can very clearly see a trend that is going to be tricky because it is full of bumps and bruises; I think it can be described by a slight exponential curve. And the seasonal component looks pretty constant, though its shape does change. There are 10 cycles per 5 years, so we will treat it as a semi-annual cycle (two peaks within each year). Due to the seasonal and trend components, I think we’ll rule out all of the simple methods except for SNAIVE. I think it will be a battle between ETS, SARIMA and perhaps SNAIVE.
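A seasonal plot is a quick way to double-check that reading of the seasonality. This is a sketch that was not part of the original analysis:

# Sketch only: each year should trace out a winter peak and a summer peak.
ts |>
  gg_season(KWH) +
  labs(
    title = "Seasonal plot: Residential Power Usage",
    y = "KWH"
  )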

Forecasting

ETS

Let’s start with ETS and let it suggest a model.

ts |>
  model(ets = ETS(KWH)) |>
  report()
Series: KWH 
Model: ETS(M,N,M) 
  Smoothing parameters:
    alpha = 0.1143788 
    gamma = 0.0001053547 

  Initial states:
    l[0]     s[0]     s[-1]     s[-2]    s[-3]    s[-4]  s[-5]     s[-6]
 6137630 0.955165 0.7487879 0.8652007 1.178319 1.265538 1.1881 0.9999747
     s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
 0.7757071 0.8083132 0.9146982 1.074833 1.225362

  sigma^2:  0.0142

     AIC     AICc      BIC 
6222.135 6224.862 6270.997 

Hmmm, it’s suggesting a multiplicative error, no trend and a multiplicative seasonal component. That surprises me. I wonder if our outlier is affecting the suggestion. Let’s try correcting the outlier and see whether the results are any different.

# We shall add a column `KWHa` to `ts` and set the outlier to the mean of its
# adjacent observations
index <- which(ts$KWH == outlier$KWH)
ts$KWHa <- ts$KWH
ts$KWHa[index] = (ts$KWH[index-1] + ts$KWH[index+1]) / 2
ts |>
  model(ets = ETS(KWHa)) |>
  report()
Series: KWHa 
Model: ETS(M,N,M) 
  Smoothing parameters:
    alpha = 0.1160681 
    gamma = 0.1972472 

  Initial states:
    l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
 6190943 0.9005789 0.7540081 0.9353756 1.224268 1.267743 1.231822 1.012795
     s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
 0.7650256 0.8045656 0.8885978 1.021186 1.194035

  sigma^2:  0.0094

     AIC     AICc      BIC 
6144.091 6146.818 6192.953 

It didn’t alter the ETS(M,N,M) suggestion, but the parameters and states have changed and the information criteria all improved. I’m still curious about ETS(A,M,M), though having plotted ETS(M,N,M) (below) I can see that it captures the amplitude of the seasonal component very nicely. So, let’s model and plot ETS(M,N,M) and ETS(A,M,M) for both the KWH with the outlier and the altered KWHa.

ts |>
  model(
    suggested = ETS(KWH ~ error("M") + trend("N") + season("M")),
    experiment = ETS(KWH ~ error("A") + trend("M") + season("M"))
  ) |>
  forecast(h=12*4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "ETS(M,N,M) vs ETS(A,M,M) for KWH"
  ) +
  facet_wrap(~ .model, ncol = 1)

ts |>
  model(
    suggested = ETS(KWHa ~ error("M") + trend("N") + season("M")),
    experiment = ETS(KWHa ~ error("A") + trend("M") + season("M"))
  ) |>
  forecast(h=12*4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "ETS(M,N,M) vs ETS(A,M,M) for KWHa (altered)"
  ) +
  facet_wrap(~ .model, ncol = 1)

First, the difference between the forecast based on data with 1 imputed value (KWH) versus the data with 1 imputed value and a corrected outlier (KWHa) isn’t significant enough to justify changing an observation. But the difference between ETS(M,N,M) and ETS(A,M,M) is significant: ETS(A,M,M) picks up on the apparent (to me) trend line, whereas ETS(M,N,M) does not. Considering this difference, I am overriding ETS’s suggestion; I believe that ETS(A,M,M) over KWH best forecasts the next 4 years. Incidentally, I know that we are forecasting 1 year, but I wanted to see a longer-term forecast so that I could better see the forecasted trend line and fluctuations in seasonal amplitude.

SARIMA

We shall look to ARIMA() for our model parameters.

ts |>
  model(arima = ARIMA(KWH ~ pdq(,,) + PDQ(,,))) |>
  report()
Series: KWH 
Model: ARIMA(0,0,1)(1,1,1)[12] w/ drift 

Coefficients:
         ma1    sar1     sma1   constant
      0.2346  -0.147  -0.7368  105135.63
s.e.  0.0723   0.096   0.0816   26306.58

sigma^2 estimated as 7.406e+11:  log likelihood=-2719.24
AIC=5448.47   AICc=5448.82   BIC=5464.44

We can see that the information criteria are all lower than those of the ETS fit (though, as noted earlier, they are not directly comparable across model classes). Let’s see how the model changes when using KWHa.

ts |>
  model(arima = ARIMA(KWHa ~ pdq(,,) + PDQ(,,))) |>
  report()
Series: KWHa 
Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 

Coefficients:
         ma1     sar1     sar2   constant
      0.3136  -0.7260  -0.4103  225683.19
s.e.  0.0855   0.0774   0.0783   64115.78

sigma^2 estimated as 3.969e+11:  log likelihood=-2661.13
AIC=5332.26   AICc=5332.61   BIC=5348.23

Ah, interesting. The seasonal AR order has changed from 1 to 2, and the seasonal MA term has dropped out. The information criteria have also improved. As we did with ETS, we shall look at both of them with a 4 year forecast.

ts |>
  model(arima = ARIMA(KWH ~ pdq(,,) + PDQ(,,))) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(1,1,0)[12] w/ drift for KWH"
  )

ts |>
  model(arima = ARIMA(KWHa ~ pdq(,,) + PDQ(,,))) |>
  forecast(h = 12 * 4) |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(2,1,0)[12] w/ drift for KWHa (altered)"
  )

I am selecting SARIMA(0,0,1)(2,1,0) with KWHa (adjusted) over SARIMA(0,0,1)(1,1,1) with KWH for two reasons:

  1. The information criteria are all better for SARIMA(0,0,1)(2,1,0) with KWHa than for SARIMA(0,0,1)(1,1,1) with KWH.
  2. I believe the greater amplitude of the seasonal model for SARIMA(0,0,1)(2,1,0) with KWHa better models the fairly erratic period leading up to 2014.

And I am selecting SARIMA(0,0,1)(2,1,0) with KWHa over ETS(A,M,M) with KWH. Not only are the information criteria better for SARIMA (with the usual caveat that ETS and ARIMA criteria are not strictly comparable), SARIMA is also more practical: the ETS(A,M,M) model takes a relatively long time to compute compared to SARIMA(0,0,1)(2,1,0).

At this point, I’m discounting SNAIVE() as an alternative. I think the trend is essential to an accurate forecast for this data, even if it may not make much of a difference over a 1-year forecast period, and with some work we could include an underlying trend with SNAIVE() by adding a drift term. But the seasonal variability captured by ETS and SARIMA is essential to the forecast. Looking at the line plot from 2009 to 2014, one sees a good amount of variability in the seasonal cycle’s amplitude. ETS and SARIMA are both able to model this variability, whereas SNAIVE is not.
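One way to back this choice up beyond information criteria is to hold out the last two years and score the candidates on out-of-sample accuracy. This is a hedged sketch that was not part of the original analysis; the split year and the decision to fit everything on KWHa are assumptions made for comparability.

# Sketch only: fit the candidates on 1998-2011 and score their 2012-2013
# forecasts against the held-out observations.
train <- ts |> filter(year(Date) <= 2011)
train |>
  model(
    ets    = ETS(KWHa ~ error("A") + trend("M") + season("M")),
    sarima = ARIMA(KWHa ~ pdq(0, 0, 1) + PDQ(2, 1, 0)),
    snaive = SNAIVE(KWHa ~ lag(12))
  ) |>
  forecast(h = 24) |>
  accuracy(ts) |>
  select(.model, RMSE, MAE, MAPE)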

Art vs. Science

I know that I am basing my selection on a twice modified timeseries:

  1. We imputed a missing value with the mean of observations made for the same month (different years).
  2. We modified an outlier. We set it to the mean of the previous and following observation.

I believe this is where the art meets data science. The challenge is to predict the future, which, as of yet, is not a perfect science. I have based my decision partly on the information criteria, but also on a strong belief that SARIMA(0,0,1)(2,1,0)’s forecast best projects the trend and seasonal component into the future (from a 2014 perspective).

Forecast

And now we shall create our forecast using SARIMA(0,0,1)(2,1,0).

fc <- ts |>
  model(arima = ARIMA(KWHa ~ pdq(0,0,1) + PDQ(2,1,0))) |>
  forecast(h = 12)

The monthly forecast for 2014 is as follows:

fc |>
  autoplot(ts, level = NULL) +
  labs(
    title = "SARIMA(0,0,1)(2,1,0)[12] w/ drift for KWHa (altered)"
  )

fc <- fc |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)
fc
# A tsibble: 12 x 2 [1M]
       Date  Forecast
      <mth>     <dbl>
 1 2014 Jan 10182990.
 2 2014 Feb  8491929.
 3 2014 Mar  6626605.
 4 2014 Apr  5989578.
 5 2014 May  5938476.
 6 2014 Jun  8196380.
 7 2014 Jul  9478425.
 8 2014 Aug  9977759.
 9 2014 Sep  8464995.
10 2014 Oct  5863644.
11 2014 Nov  6141030.
12 2014 Dec  8338904.

Lastly, we shall write our results to an Excel workbook.

write_xlsx(fc, "./data/partB_forecast.xlsx")

Part C – BONUS

Data: Waterflow_Pipe1.xlsx, Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd, with the forecast in an Excel-readable file.

Setup

library(fpp3)
library(readxl)
library(tidyverse)
library(writexl)
library(zoo)

Load the Data

wfp1 <- read_xlsx(
  "data/Waterflow_Pipe1.xlsx",
  col_types = c("date", "numeric")
)
wfp2 <- read_xlsx(
  "data/Waterflow_Pipe2.xlsx",
  col_types = c("date", "numeric")
)

Inspect the Data

head(wfp1)
# A tibble: 6 × 2
  `Date Time`         WaterFlow
  <dttm>                  <dbl>
1 2015-10-23 00:24:06     23.4 
2 2015-10-23 00:40:02     28.0 
3 2015-10-23 00:53:51     23.1 
4 2015-10-23 00:55:40     30.0 
5 2015-10-23 01:19:17      6.00
6 2015-10-23 01:23:58     15.9 
head(wfp2)
# A tibble: 6 × 2
  `Date Time`         WaterFlow
  <dttm>                  <dbl>
1 2015-10-23 01:00:00      18.8
2 2015-10-23 02:00:00      43.1
3 2015-10-23 03:00:00      38.0
4 2015-10-23 04:00:00      36.1
5 2015-10-23 05:00:00      31.9
6 2015-10-23 06:00:00      28.2
summary(wfp1)
   Date Time                        WaterFlow     
 Min.   :2015-10-23 00:24:06.49   Min.   : 1.067  
 1st Qu.:2015-10-25 11:21:35.49   1st Qu.:13.683  
 Median :2015-10-27 20:07:30.88   Median :19.880  
 Mean   :2015-10-27 20:49:15.93   Mean   :19.897  
 3rd Qu.:2015-10-30 08:24:51.07   3rd Qu.:26.159  
 Max.   :2015-11-01 23:35:43.11   Max.   :38.913  
summary(wfp2)
   Date Time                     WaterFlow     
 Min.   :2015-10-23 01:00:00   Min.   : 1.885  
 1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
 Median :2015-11-12 20:30:00   Median :39.682  
 Mean   :2015-11-12 20:30:00   Mean   :39.556  
 3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
 Max.   :2015-12-03 16:00:00   Max.   :78.303  

Fix the Data

Things look pretty good except for the date columns’ names. I was thinking about converting them to timeseries right away, but the intervals between the timestamps are erratic. Ah, dealing with that is part of the assignment.

wfp1 <- wfp1 |>
  rename(
    Date = `Date Time`
  )
wfp2 <- wfp2 |>
  rename(
    Date = `Date Time`
  )

Now we shall aggregate the data by hour.

wfp1_a <- wfp1 |>
  mutate(Date = floor_date(Date, "hours")) |>
  group_by(Date) |>
  summarize(
    WaterFlow = mean(WaterFlow)
  )

wfp2_a <- wfp2 |>
  mutate(Date = floor_date(Date, "hours")) |>
  group_by(Date) |>
  summarize(
    WaterFlow = mean(WaterFlow)
  )

Let’s make sure that our timeseries are complete/continuous:

impute_dates <- function(data) {
  dseq <- seq(
    from = min(data$Date),
    to = max(data$Date),
    by = 60 * 60
  )
  return(merge(data, data.frame(Date = dseq), all = TRUE))
}

wfp1_a <- impute_dates(wfp1_a)
wfp2_a <- impute_dates(wfp2_a)

There were hourly observations missing in both datasets. I think it makes more sense to interpolate between adjacent observations than to carry the last value forward. zoo’s na.approx() should do that nicely.

wfp1_a$WaterFlow = na.approx(wfp1_a$WaterFlow)
wfp2_a$WaterFlow = na.approx(wfp2_a$WaterFlow)

Let’s convert them to timeseries

ts1 <- as_tsibble(wfp1_a, index = Date)
ts2 <- as_tsibble(wfp2_a, index = Date)

Let’s have one last look at it before we move on.

summary(ts1)
      Date                       WaterFlow     
 Min.   :2015-10-23 00:00:00   Min.   : 8.923  
 1st Qu.:2015-10-25 11:45:00   1st Qu.:16.971  
 Median :2015-10-27 23:30:00   Median :19.766  
 Mean   :2015-10-27 23:30:00   Mean   :19.867  
 3rd Qu.:2015-10-30 11:15:00   3rd Qu.:22.737  
 Max.   :2015-11-01 23:00:00   Max.   :31.730  
summary(ts2)
      Date                       WaterFlow     
 Min.   :2015-10-23 01:00:00   Min.   : 1.885  
 1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
 Median :2015-11-12 20:30:00   Median :39.682  
 Mean   :2015-11-12 20:30:00   Mean   :39.556  
 3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
 Max.   :2015-12-03 16:00:00   Max.   :78.303  

Everything looks good. Let’s clean up and move on. We shall keep wfp1 and wfp2 around for reference.

rm(wfp1_a)
rm(wfp2_a)

Explore the Data

Let’s have a look at our data.

ts1 |>
  ggplot(aes(x = Date, y = WaterFlow)) +
  geom_line() +
  labs(
    title = "WaterFlow Pipe 1"
  )

ts2 |>
  ggplot(aes(x = Date, y = WaterFlow)) +
  geom_line() +
  labs(
    title = "WaterFlow Pipe 2"
  )

Oh boy, it’s hard to see what is going on in there. I’m going to tackle these one at a time because, with a little probing, they appear to be statistically different.

Pipe 1

Let’s try to use STL to pick it apart.

ts1 |>
  model(stl = STL(WaterFlow)) |>
  components() |>
  autoplot() +
  labs(
    title = "STL decomposition: Pipe 1"
  )

The trend line is lumpy but, relative to the range of WaterFlow, I don’t think it’s very significant; it looks like it is picking up dips and spikes more than a trend. STL does detect a complex daily pattern, but its range is about half the amplitude of the remainder. It will be interesting to look at the ACF.

ACF(ts1, WaterFlow) |>
  autoplot()

None of the ACF lags exceed the critical values, so I think we can consider the series stationary; a more formal check follows below. Then let’s see what SARIMA makes of it.
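This is a sketch that was not part of the original analysis: the KPSS unit-root test from feasts, where a p-value reported at the 0.1 cap is consistent with stationarity.

# Sketch only: KPSS test for level stationarity on pipe 1.
ts1 |>
  features(WaterFlow, unitroot_kpss)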

ts1 |>
  model(sarima = ARIMA(WaterFlow ~ pdq(,,) + PDQ(,,))) |>
  report()
Series: WaterFlow 
Model: ARIMA(0,0,0) w/ mean 

Coefficients:
      constant
       19.8674
s.e.    0.2720

sigma^2 estimated as 17.83:  log likelihood=-685.78
AIC=1375.56   AICc=1375.61   BIC=1382.52

It looks like ARIMA has resorted to the mean of the timeseries. I’m going to conclude that is the best we can do for water pipe #1.

Pipe 2

As we did with pipe #1, let’s use STL to pick our timeseries apart and see what we have.

ts2 |>
  model(stl = STL(WaterFlow)) |>
  components() |>
  autoplot() +
  labs(
    title = "STL decomposition: Pipe 2"
  )

The trend line is lumpy but, relative to the range of WaterFlow, I don’t think it’s very significant; again, it looks like it is picking up dips and spikes more than a trend. And it’s detecting two seasonal components!?

Let’s see what its ACF looks like.

acf <- ts2 |>
  ACF(WaterFlow)
autoplot(acf)

acf |>
  mutate(
    max = max(abs(acf))
  ) |>
  filter(max == abs(acf))
# A tsibble: 1 x 3 [1h]
       lag    acf    max
  <cf_lag>  <dbl>  <dbl>
1      15h 0.0929 0.0929

Yikes, this is getting worse, not better. The ACF suggests the lag with the strongest correlation is 15h. Let’s open ACF up to greater lags with lag_max.

acf <- ts2 |>
  ACF(WaterFlow, lag_max = 7 * 24)
autoplot(acf)

acf |>
  mutate(
    max = max(abs(acf))
  ) |>
  filter(max == abs(acf))
# A tsibble: 1 x 3 [1h]
       lag    acf    max
  <cf_lag>  <dbl>  <dbl>
1      15h 0.0929 0.0929

Our strongest lag is still 15h.
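Since the assignment also asks whether the data is stationary, here is a sketch (not part of the original analysis) applying the same formal checks to pipe 2: the KPSS test plus the suggested number of first differences.

# Sketch only: stationarity checks for pipe 2.
ts2 |>
  features(
    WaterFlow,
    list(kpss = unitroot_kpss, ndiffs = unitroot_ndiffs)
  )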

Let’s see what SARIMA makes of it.

ts2 |>
  model(sarima = ARIMA(WaterFlow ~ pdq(,,) + PDQ(,,, period = 15), stepwise = TRUE)) |>
  report()
Series: WaterFlow 
Model: ARIMA(1,0,2)(1,0,0)[15] w/ mean 

Coefficients:
         ar1      ma1     ma2   sar1  constant
      0.7180  -0.7416  0.0624  0.096   10.0901
s.e.  0.1646   0.1662  0.0331  0.032    0.1613

sigma^2 estimated as 255.1:  log likelihood=-4187.41
AIC=8386.81   AICc=8386.9   BIC=8416.26

Well, that seems to have had an effect. The information criteria look very high, but they are not on a standardized scale, so the absolute numbers don’t mean much on their own. Let’s see what happens when we use its suggestions.

fc2 <- ts2 |>
  model(sarima = ARIMA(WaterFlow ~ pdq(1,0,2) + PDQ(1,0,0, period = 15))) |>
  forecast(h = 24 * 7)
  
fc2 |>
  autoplot(ts2)

Haha, that’s not very encouraging. I don’t feel wonderful about either of my forecasts. Actually, I take that back: I don’t think the mean is an unreasonable forecast for the entire week. But I feel certain that a better hourly forecast can be made for both datasets. At the moment, I’m at a loss as to how to make them. SNAIVE just occurred to me. If we have time, we will model it and see how it looks.

Let’s give it a shot.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(15))) |>
  gg_tsresiduals()
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

What the heck is going on!? That inverted the 15th lag and made it worse. I’ve got to see this.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(15))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2)

Going off of the STL decomposition, I am going to try a seasonal naive model with a 24h lag and then one at 1 week. It’s a long shot but I’m curious.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(24))) |>
  gg_tsresiduals()
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_bin()`).

I see a pattern emerging. I think 1 week is hopeless; nonetheless, I’ve got to see what it looks like.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(24*7))) |>
  gg_tsresiduals()
Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 168 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 168 rows containing non-finite outside the scale range
(`stat_bin()`).

Hmmm, interesting. Let’s see what its forecast looks like.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2)

I think that looks remarkably interesting. Let’s zoom in on the forecast.

ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7) |>
  autoplot(ts2) +
  xlim(ts2$Date[24*30], max(ts2$Date) + days(7))
Warning: Removed 719 rows containing missing values or values outside the scale range
(`geom_line()`).

I think this forecast looks more interesting and better than any other I’ve seen yet. It includes the random element and the slight sinusoidal weekly pattern. The residual ACF looks a little worse than the ACF of the raw data, but the mean is a pretty awful hourly forecast. Ha, I feel like I’m gambling. I’ve already put forth a mean forecast for water pipe #1, and a SARIMA forecast for water pipe #2, but I’m going to change the latter to the SNAIVE. Nothing ventured, nothing gained.
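Since this is admittedly a gamble, a holdout comparison can at least put numbers on it. This is a sketch that was not part of the original analysis; the one-week split is an assumption.

# Sketch only: hold out the final week of pipe 2 and compare one-week-ahead
# accuracy for the mean, the chosen SNAIVE, and the earlier SARIMA.
train2 <- ts2 |> filter(Date <= max(Date) - days(7))
train2 |>
  model(
    mean   = MEAN(WaterFlow),
    snaive = SNAIVE(WaterFlow ~ lag(7 * 24)),
    sarima = ARIMA(WaterFlow ~ pdq(1, 0, 2) + PDQ(1, 0, 0, period = 15))
  ) |>
  forecast(h = 24 * 7) |>
  accuracy(ts2) |>
  select(.model, RMSE, MAE)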

Forecasting

As concluded above, we shall use the simple mean model to forecast water pipe #1.

fc1 <- ts1 |>
  model(mean = MEAN(WaterFlow)) |>
  forecast(h = 7 * 24)

fc1 |>
  autoplot(ts1) +
  labs(
    title = "Forecast: Water Pipe 1"
  )

Haha, that plot looks pretty ridiculous, but it’s the best I’ve got. I won’t pour 168 copies of mean(ts1$WaterFlow) into this document. The prediction, calculated below, is one week of hourly values at ~19.87.

mean(ts1$WaterFlow)
[1] 19.86744

And, as we saw above, our forecast for Water Pipe #2 was:

fc2 |>
  autoplot(ts2) +
  labs(
    title = "Forecast: Water Pipe 2"
  )

But I’ve since decided to change it to the SNAIVE model with a 7 day lag. See my argument above for the reasoning behind my choice.

fc2 <- ts2 |>
  model(snaive = SNAIVE(WaterFlow ~ lag(7 * 24))) |>
  forecast(h = 24 * 7)

fc2 |>
  autoplot(ts2) +
  labs(
    title = "Revised Forecast: Water Pipe 2"
  )

And we shall write them to separate sheets of the spreadsheet for part C.

fc1 <- fc1 |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)

fc2 <- fc2 |>
  rename(Forecast = .mean) |>
  select(Date, Forecast)

sheets <- list(
  `Pipe 1` = fc1,
  `Pipe 2` = fc2
)

write_xlsx(sheets, "./data/partC_forecast.xlsx")