library(fpp3)
library(tidyverse)

Context

This project consists of 3 parts:

Part A – ATM Forecast, ATM.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward.

Part B – Forecasting Power, Power.xlsx

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

Part A

Introduction

We have received an Excel file (ATM.xlsx) containing cash withdrawals from 4 different ATMs between May 2009 and April 2010. We would like to utilize this data to create forecasts for each ATM for the month of May 2010. The way we will tackle this problem is through the tidy forecast workflow:

Tidy

The first part of the tidy forecast workflow is also the first word in the term. Tidying involves loading the data and then preparing it into the correct format for generating forecasting models.

We begin our tidying journey by loading in the data from the Excel file with readxl’s read_xlsx() function. By specifying the column types in the call itself we should, in theory, reduce the type conversions needed afterwards. However, read_xlsx has limitations with column types: it cannot create factors, and it reads dates in as datetimes rather than the Date class we want for a daily tsibble index. Thus, we need to mutate the DATE and ATM columns once more into their proper types. The loaded and correctly typed tsibble is then assigned to the variable tsb.

Taking a look at the tsibble with glimpse, we immediately notice that there are 5 key values for ATM, when there should only be 4 ATMs. This needs to be investigated.

file <- "ATM.xlsx"

suppressWarnings(
  readxl::read_xlsx(path = file, col_types = c('date','text','numeric'))
  ) |>
  mutate(Date = as_date(DATE), ATM = as_factor(ATM)) |>
  select(Date, ATM, Cash) |>
  as_tsibble(key = ATM, index = Date) -> tsb

glimpse(tsb)
## Rows: 1,474
## Columns: 3
## Key: ATM [5]
## $ Date <date> 2009-05-01, 2009-05-02, 2009-05-03, 2009-05-04, 2009-05-05, 2009…
## $ ATM  <fct> ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1,…
## $ Cash <dbl> 96, 82, 85, 90, 99, 88, 8, 104, 87, 93, 86, 111, 75, 6, 102, 73, …

Looking at the levels of the ATM factor we see that there are indeed only the four ATMs we are expecting. Thus, we need to figure out where this 5th key value is coming from.

levels(tsb$ATM)
## [1] "ATM1" "ATM2" "ATM3" "ATM4"

Filtering for rows whose ATM value is not one of these four levels reveals that the fifth key value comes from blank rows in May 2010 with no ATM or Cash value assigned. Since these rows carry no real data, we want to remove them from our tsibble.

tsb |>
  filter(!ATM %in% levels(tsb$ATM))
## # A tsibble: 14 x 3 [1D]
## # Key:       ATM [1]
##    Date       ATM    Cash
##    <date>     <fct> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA

We’ll kill two birds with one stone by removing these rows through subsetting out anything that falls in our forecast period of May 2010, since those dates are exactly the period we are asked to forecast. Checking glimpse again, we can see we’ve removed that 5th key value without dropping more than the 14 rows which belonged to it.

tsb |>
  filter(yearmonth(Date) < yearmonth("2010-05") ) -> tsb

glimpse(tsb)
## Rows: 1,460
## Columns: 3
## Key: ATM [4]
## $ Date <date> 2009-05-01, 2009-05-02, 2009-05-03, 2009-05-04, 2009-05-05, 2009…
## $ ATM  <fct> ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1, ATM1,…
## $ Cash <dbl> 96, 82, 85, 90, 99, 88, 8, 104, 87, 93, 86, 111, 75, 6, 102, 73, …

Since we have dealt with the missing ATM labels, it makes sense to also deal with missing Cash values at this point. We use the skim() function from the skimr package to check for any remaining missing data, grouping by ATM so that we also get summary information for each individual series. We can see that there are 5 missing values in the Cash column, 3 from ATM 1 and 2 from ATM 2, which we will need to explore further and impute next.

Our skim also shows that we have data from the beginning of May 2009 to the end of April 2010, a good sign that each series is continuous over our time frame apart from the missing values. There are also two anomalies visible in the Cash variable, for ATM3 and ATM4. ATM3 has a mode of $0 withdrawn that extends up to the 75th percentile; this could mean there is missing data hiding as 0s, or something else we will need to look into. ATM4 has a maximum Cash value of 10919.76, which is $1,091,976 withdrawn in a day. This is a huge outlier compared with even the 75th percentile of $70,450.70 withdrawn in a day, so we’ll need to look at it down the line.

tsb |>
  as_tibble() |>
  group_by(ATM) |>
  skimr::skim()
Data summary
Name group_by(as_tibble(tsb), …
Number of rows 1460
Number of columns 3
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables ATM

Variable type: Date

skim_variable ATM n_missing complete_rate min max median n_unique
Date ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365

Variable type: numeric

skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Cash ATM1 3 0.99 83.89 36.66 1.00 73.00 91.00 108.00 180.00 ▂▁▇▃▁
Cash ATM2 2 0.99 62.58 38.90 0.00 25.50 67.00 93.00 147.00 ▇▅▇▇▂
Cash ATM3 0 1.00 0.72 7.94 0.00 0.00 0.00 0.00 96.00 ▇▁▁▁▁
Cash ATM4 0 1.00 474.04 650.94 1.56 124.33 403.84 704.51 10919.76 ▇▁▁▁▁

We want to deal with missing values using ARIMA interpolation, so we should handle outliers before imputing; leaving outliers in while interpolating could skew the interpolated values towards them. Conveniently, once we convert the outliers into missing values they can be interpolated as well.

Let’s check how egregious the outlier in ATM4 is by plotting the tsibble with and without that value.

p1 <- tsb |>
  autoplot(Cash) +
  theme(legend.position = "none")
p2 <- tsb |>
  filter(Cash < 10919.76) |>
  autoplot(Cash) +
  theme(legend.position = "none")

cowplot::plot_grid(p1,p2, ncol = 1, labels = "AUTO")

While ATM4 does seem very volatile and occasionally peaks at high values, the outlier is an order of magnitude higher than anything else, with no trend or seasonal explanation apparent from visual inspection. Thus, we will replace it with a missing value by keeping only observations below the outlier and then restoring the missing date as an NA through tsibble’s fill_gaps() function. We can confirm that the previous outlier has been turned into an NA.

# The skim reports the ATM4 maximum as 10919.76 (rounded); the stored value is
# slightly larger, so strict comparisons against 10919.76 isolate exactly that row.
outlie <- tsb |>
  filter(ATM == "ATM4" & Cash > 10919.76) |>
  select(Date) |>
  pull()

# Drop the outlier row, then restore the now-missing date as an explicit NA
tsb |>
  filter(Cash < 10919.76) |>
  fill_gaps() -> tsb

tsb |>
  filter(ATM == "ATM4" & Date == outlie)
## # A tsibble: 1 x 3 [1D]
## # Key:       ATM [1]
##   Date       ATM    Cash
##   <date>     <fct> <dbl>
## 1 2010-02-09 ATM4     NA

Now we want to see what’s going on with ATM3 and its multitude of 0 values by plotting it with and without the 0 values. We end up with only 3 observations that do not have a 0 Cash value. There are two possible explanations: one is that ATM 3 has been in service the whole year and was only coincidentally used on three consecutive days at the end of the data set; the second is that the zero values are technically true, but only because ATM3 did not start service until April 28, 2010. The second explanation is the simpler and more reasonable one, so we will assume it in the absence of any other information about the data.

p1 <- tsb |>
  filter(ATM == "ATM3") |>
  autoplot(Cash)
p2 <- tsb |>
  filter(ATM == "ATM3" & Cash > 0) |>
  autoplot(Cash)
cowplot::plot_grid(p1,p2, ncol = 2, labels = "AUTO")

If we were to leave the 0 values in for ATM 3, any non-naive forecasting method would be thrown off and tend to forecast values very close to 0, since 0 makes up the vast majority of the data. However, fitting any sophisticated model to just 3 genuine data points would be foolish. Ideally we would simply drop this series, as forecasting from the data we have does not make sense.

However, the assignment calls for forecasting all 4 ATM series. Thus we will convert the 0s to missing values and impute replacement values purely for the sake of producing a forecast, keeping in mind that the true historical values for ATM 3 really were 0 because it was not yet in use.

tsb |>
  mutate(Cash = if_else(Cash == 0, NA, Cash)) -> tsb

tsb |>
  filter(ATM == "ATM3") |>
  tail(10)
## # A tsibble: 10 x 3 [1D]
## # Key:       ATM [1]
##    Date       ATM    Cash
##    <date>     <fct> <dbl>
##  1 2010-04-21 ATM3     NA
##  2 2010-04-22 ATM3     NA
##  3 2010-04-23 ATM3     NA
##  4 2010-04-24 ATM3     NA
##  5 2010-04-25 ATM3     NA
##  6 2010-04-26 ATM3     NA
##  7 2010-04-27 ATM3     NA
##  8 2010-04-28 ATM3     96
##  9 2010-04-29 ATM3     82
## 10 2010-04-30 ATM3     85

Now, let’s begin our exploration of the missing Cash values for ATM 1 and 2. We are looking for any pattern in the dates with missing data that we could derive meaning from, so we filter for the missing values along with the values immediately before and after them. There are no runs of consecutive missing days, which would have made imputation unreliable. However, for ATM 2 we can infer that these may be days when the ATM was down for service, because the day before or after a missing value sometimes has an extremely low Cash value compared to what is normal, suggesting the machine went into or came out of service partway through that day.

tsb |>
  filter((ATM %in% c("ATM1","ATM2") & is.na(Cash)) |
           (lag(ATM) %in% c("ATM1","ATM2") & is.na(lag(Cash))) |
           (lead(ATM) %in% c("ATM1","ATM2") & is.na(lead(Cash))) )
## # A tsibble: 21 x 3 [1D]
## # Key:       ATM [2]
##    Date       ATM    Cash
##    <date>     <fct> <dbl>
##  1 2009-06-12 ATM1    142
##  2 2009-06-13 ATM1     NA
##  3 2009-06-14 ATM1    120
##  4 2009-06-15 ATM1    106
##  5 2009-06-16 ATM1     NA
##  6 2009-06-17 ATM1    108
##  7 2009-06-21 ATM1    115
##  8 2009-06-22 ATM1     NA
##  9 2009-06-23 ATM1    108
## 10 2009-06-17 ATM2     24
## # ℹ 11 more rows

We will use this observation to convert the unusually low days adjacent to an ATM 2 outage into missing values as well. This should help us get a more accurate forecast, as the interpolated values will be more typical of a working ATM. We’ll define low values as being below 25, roughly the 25th percentile for ATM 2.
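
As a quick check on that cutoff (a hedged sketch; the skim above reported ATM 2’s 25th percentile as 25.50, so 25 is a slightly conservative threshold):

# 25th percentile of ATM2's Cash values, ignoring the missing entries
tsb |>
  filter(ATM == "ATM2") |>
  as_tibble() |>
  summarise(p25 = quantile(Cash, 0.25, na.rm = TRUE))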

tsb |>
  filter( !((lag(ATM) %in% c("ATM2") & is.na(lag(Cash)) & Cash < 25) |
            (lead(ATM) %in% c("ATM2") & is.na(lead(Cash)) & Cash < 25)) ) |>
  fill_gaps() -> tsb

tsb |>
  filter((ATM %in% c("ATM2") & is.na(Cash)) |
           (lag(ATM) %in% c("ATM2") & is.na(lag(Cash))) |
           (lead(ATM) %in% c("ATM2") & is.na(lead(Cash))) )
## # A tsibble: 15 x 3 [1D]
## # Key:       ATM [1]
##    Date       ATM    Cash
##    <date>     <fct> <dbl>
##  1 2009-06-16 ATM2     82
##  2 2009-06-17 ATM2     NA
##  3 2009-06-18 ATM2     NA
##  4 2009-06-19 ATM2    134
##  5 2009-06-23 ATM2     99
##  6 2009-06-24 ATM2     NA
##  7 2009-06-25 ATM2     NA
##  8 2009-06-26 ATM2    117
##  9 2009-10-24 ATM2     61
## 10 2009-10-25 ATM2     NA
## 11 2009-10-26 ATM2     95
## 12 2010-03-28 ATM2     64
## 13 2010-03-29 ATM2     NA
## 14 2010-03-30 ATM2     NA
## 15 2010-03-31 ATM2    102

Finally, with quite a few more missing values in our dataset, we’ll impute new values using ARIMA imputation: we fit an ARIMA model to each series and then use the interpolate function to fill the gaps in the tsibble with values from the fitted models. Below, we automatically model each series; for the most part the data now looks better, except that the model the ARIMA selection algorithm has chosen for ATM 3 does not produce realistic interpolations.

tsb |>
  model(ARIMA(Cash)) |>
  interpolate(tsb) |>
  autoplot(Cash)

Thus, we will use mean imputation for ATM 3, since we know that attempting to model it for interpolation will not turn out well, likely a foreshadowing of how well our forecasting will go for that series.

tsb |>
  mutate(Cash = if_else( (ATM == "ATM3" & is.na(Cash)),
                        tsb |> filter(ATM == "ATM3") |> pull(Cash) |> mean(na.rm = TRUE),
                        Cash)) -> tsb

tsb |>
  filter(ATM == "ATM3") |>
  tail(5)
## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [1]
##   Date       ATM    Cash
##   <date>     <fct> <dbl>
## 1 2010-04-26 ATM3   87.7
## 2 2010-04-27 ATM3   87.7
## 3 2010-04-28 ATM3   96  
## 4 2010-04-29 ATM3   82  
## 5 2010-04-30 ATM3   85

We go back to ARIMA imputation for the other ATMs’ missing values, then take another skim to compare what we started with against the data after removing the outlier and processing the missing values. First of all, there is no missing data left, which will allow us to use ETS models for forecasting. The summary statistics for ATM 1 and ATM 2 have barely moved. Next, we can see the reduction in mean and standard deviation for ATM 4 from removing the outlier. ATM 3 still has a heavily unimodal distribution, but it is now centered on the mean of the few real observations we actually had.

All in all, our data is now tidied and ready to move on to the next step of the tidy forecasting workflow.

tsb |>
  model(ARIMA(Cash)) |>
  interpolate(tsb) -> tsb

tsb |>
  as_tibble() |>
  group_by(ATM) |>
  skimr::skim()
Data summary
Name group_by(as_tibble(tsb), …
Number of rows 1460
Number of columns 3
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables ATM

Variable type: Date

skim_variable ATM n_missing complete_rate min max median n_unique
Date ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
Date ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365

Variable type: numeric

skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Cash ATM1 0 1 84.12 36.60 1.00 73.00 91.00 108.00 180.00 ▂▁▇▃▁
Cash ATM2 0 1 62.58 38.68 1.00 26.00 66.00 93.00 147.00 ▇▅▇▆▂
Cash ATM3 0 1 87.67 0.55 82.00 87.67 87.67 87.67 96.00 ▁▁▇▁▁
Cash ATM4 0 1 444.71 351.11 1.56 124.33 402.77 704.19 1712.07 ▇▅▃▁▁

Visualise

The next step in the workflow is to visualize our data in order to better understand it. We briefly visualized our data previously, but that was simply to be able to clean it. Here we want to identify what patterns exist in the data to inform model selection down the line.

We plot each of our series in its own facet on a shared x-axis to begin the initial pattern recognition. ATM 3 will not provide useful information in these visualizations because of how many of its values have been imputed. For the other ATMs, however, there does not seem to be a clear trend within each series; the average withdrawal remains roughly constant over longer periods of time. Seasonality is quite strong and likely weekly, but we’d like to take a closer look to also determine whether monthly seasonality exists.

tsb |>
  autoplot(Cash) + 
    facet_wrap(~ATM, scales = "free_y" , ncol = 1) +
    theme(legend.position = "none")

Taking a look at the seasonal plots adjusted to a monthly period we do not see any sort of clear pattern per month, thus dismissing the notion of monthly seasonality.

tsb |>
  filter(ATM != "ATM3") |>
  gg_season(Cash, period = "month") + 
    facet_wrap(~ATM, scales = "free_y" , ncol = 1) +
    theme(legend.position = "none")

We’ll try to further confirm the existence of weekly seasonality with another seasonal plot using a weekly period. Here we can see strong overlap in the pattern across most weeks for at least ATM 1 and ATM 2, but the earlier weeks seem to follow a different seasonal pattern: they have their lowest withdrawals on Tuesday, while over time Tuesday becomes one of the peak days and Thursday becomes the low point of the week. This looks like an ideal case for splitting each series into two regimes and modeling them separately, though we will not pursue that here (see the sketch after the plot below). ATM 4 seems to follow the general pattern as well, but given the variation in these plots we’d like to investigate the correlation plots further.

tsb |>
  filter(ATM != "ATM3") |>
  gg_season(Cash, period = "week") + 
    facet_wrap(~ATM, scales = "free_y" , ncol = 1)
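
Purely as an illustration of what splitting a series into two seasonal regimes could look like (we do not pursue this further in the report), one could fit a separate model to each side of an assumed changepoint. The cutoff date below is hypothetical and chosen only for this sketch.

# Hypothetical changepoint separating the two seasonal regimes of ATM1
cutoff <- as_date("2009-10-01")

tsb |>
  filter(ATM == "ATM1", Date < cutoff) |>
  model(ets_early = ETS(Cash ~ error("M") + trend("N") + season("M"))) -> fit_early

tsb |>
  filter(ATM == "ATM1", Date >= cutoff) |>
  model(ets_late = ETS(Cash ~ error("M") + trend("N") + season("M"))) -> fit_late

Only the later regime would matter for forecasting ahead, which is part of why we leave the series whole.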

Our ACF and PACF plots bring interesting results. We do see significant correlation at the weekly lags, indicating that weekly seasonality is in play for each series, although it is much weaker in ATM 4 than in the other ATMs, which could lead to non-seasonal models also fitting ATM 4. On the other hand, ATMs 2 and 4 show negative correlation at multiples of 2 and 5 days, which doesn’t quite make sense intuitively as a fixed seasonal pattern. This is probably an effect of the changing seasonality we saw in the seasonal plots, as a slight negative autocorrelation at certain days would gradually shift how the seasonal pattern looks as a whole. We acknowledge the effect this autocorrelation has, but we leave it in, as there is no good way to utilize it in the models we can use.

tsb |>
  filter(ATM == "ATM1") |>
  gg_tsdisplay(y= Cash, plot_type='partial', lag=31) +
  labs(title = "ATM 1")

tsb |>
  filter(ATM == "ATM2") |>
  gg_tsdisplay(y= Cash, plot_type='partial', lag=31) +
  labs(title = "ATM 2")

tsb |>
  filter(ATM == "ATM4") |>
  gg_tsdisplay(y= Cash, plot_type='partial', lag=31) +
  labs(title = "ATM 4")

Next we’ll take a look at how many differences would be appropriate when we build our ARIMA models. To determine the number of seasonal differences we use the unitroot_nsdiffs feature on our tsibble. We see that a single seasonal difference is suggested for ATM 1 and ATM 2 (and none for ATM 3 and ATM 4), and that after seasonal differencing there is no need for a non-seasonal difference in any series.

tsb |>
  features(Cash, unitroot_nsdiffs)
## # A tibble: 4 × 2
##   ATM   nsdiffs
##   <fct>   <int>
## 1 ATM1        1
## 2 ATM2        1
## 3 ATM3        0
## 4 ATM4        0
tsb |>
  features(difference(Cash,7), unitroot_ndiffs)
## # A tibble: 4 × 2
##   ATM   ndiffs
##   <fct>  <int>
## 1 ATM1       0
## 2 ATM2       0
## 3 ATM3       0
## 4 ATM4       0

Then we seasonally difference ATM 1 and 2 to help determine the ARIMA orders. From these graphs it appears that a seasonal moving average of order 1 could be appropriate, because the ACFs have a dominant spike at the seasonal lag of 7 while the PACFs show roughly decaying spikes at multiples of the seasonal lag. Realistically, the gradual decay we see is not going to be fully explained by the moving average term, but it is the closest estimate we can make.

tsb |>
  filter(ATM == "ATM1") |>
  gg_tsdisplay(difference(Cash,7), plot_type='partial', lag=31) +
  labs(title = "ATM 1")

tsb |>
  filter(ATM == "ATM2") |>
  gg_tsdisplay(difference(Cash,7), plot_type='partial', lag=31) +
  labs(title = "ATM 2")

Finally, we seasonally decompose each of our series to determine whether there is a significant trend we might be missing. Below we see that the trend is flat for all but ATM 4, where it looks more like noise that the STL decomposition failed to push into the remainder. So we can say that trend is not a big contributor to any of the ATM series.

tsb |>
  filter(ATM != "ATM3") |>
  model(STL(
    Cash
    )) |>
  components() |>
  autoplot()

Specify and Estimate

Now that we have a good understanding of the patterns within our data we can move on to the next step in the tidy forecasting workflow: specifying and estimating. We want to specify which models we will use to forecast our data and then fit our data to those models. These are formally two separate steps, but R’s fable package makes it easy to specify our models and fit them in the same stroke.

We’ll start by creating a cross-validation dataset to train our models on and evaluate them against later. This lets us choose the forecasting model by measuring its accuracy at multiple points within our dataset. We start with a 30-day window and stretch it a month at a time, balancing the number of folds against the amount of data in each while ensuring every fold contains multiple cycles of weekly seasonality.

tsb_tr <- tsb |>
  stretch_tsibble(.init = 30, .step = 30) |>
  relocate(Date, ATM, .id)
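
As a quick sanity check (a sketch), we can count how many folds the stretching produced; with 365 daily observations, .init = 30 and .step = 30 should give 12 folds per ATM.

# Number of cross-validation folds created for each ATM
tsb_tr |>
  as_tibble() |>
  distinct(ATM, .id) |>
  count(ATM)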

We want to train both exponential smoothing (ETS) and ARIMA models, as we can’t be certain which will forecast best without training and evaluating them. Both ETS and ARIMA have selection algorithms that attempt to find the best model by minimizing an information criterion, and we include these automatically selected models as they have a good chance of fitting the data and forecasting well.

Note that we do not take the log of Cash or otherwise transform it in any of our models: only ATM 4 has very noisy variation, and transforming the variable would make the models for the other ATMs worse.

As humans, we also have the power of pattern recognition to intuit a model that will fit our data and forecast well. Using the patterns we have discovered and visualized, we create manually specified versions of each model to try to beat the algorithms. Our manually specified ETS models will be MNM models: the trend is essentially flat in each series, leaving no need for a trend component, and the seasonal component is multiplicative because, even though only ATM 4 shows large swings in seasonal magnitude, a multiplicative seasonal term can also emulate the gradual shift in the seasonal pattern we saw in ATM 1 and ATM 2. The error will naturally follow the seasonality.

From our visual exploration we will specify an ARIMA(1,0,0)(0,1,1) model with no constant. The unit root tests suggested no non-seasonal differencing, and we include a single non-seasonal autoregressive term to capture short-lag correlation, which gives us the first part. The seasonal part comes from the seasonal difference that was required for two of the series and the seasonal moving average of order 1 suggested by the ACF and PACF plots.

Additionally, we use combination modeling by averaging different subsets of the models within our mable. In theory, the combination models should be more accurate than the individual models.

tsb_tr |>
  model(
    ets_auto = ETS(Cash),
    arima_auto = ARIMA(Cash),
    ets_man = ETS(Cash ~ error("M") + trend("N") + season("M")),
    arima_man = ARIMA(Cash ~ 0 + pdq(1,0,0) + PDQ(0,1,1))
  ) |>
  mutate(combination_all = (ets_auto + arima_auto + ets_man + arima_man) / 4,
         combination_auto = (ets_auto + arima_auto) / 2,
         combination_man = (ets_man + arima_man) / 2) -> fit

head(fit,10)
## # A mable: 10 x 9
## # Key:     ATM, .id [10]
##    ATM     .id     ets_auto               arima_auto      ets_man
##    <fct> <int>      <model>                  <model>      <model>
##  1 ATM1      1 <ETS(A,N,A)> <ARIMA(0,0,0)(0,1,1)[7]> <ETS(M,N,M)>
##  2 ATM1      2 <ETS(M,N,M)> <ARIMA(0,0,0)(2,1,1)[7]> <ETS(M,N,M)>
##  3 ATM1      3 <ETS(M,N,M)> <ARIMA(0,0,0)(2,1,0)[7]> <ETS(M,N,M)>
##  4 ATM1      4 <ETS(M,N,M)> <ARIMA(2,0,0)(0,1,1)[7]> <ETS(M,N,M)>
##  5 ATM1      5 <ETS(M,N,M)> <ARIMA(1,0,0)(0,1,1)[7]> <ETS(M,N,M)>
##  6 ATM1      6 <ETS(M,N,M)> <ARIMA(1,0,1)(1,1,0)[7]> <ETS(M,N,M)>
##  7 ATM1      7 <ETS(M,N,M)> <ARIMA(0,0,1)(0,1,1)[7]> <ETS(M,N,M)>
##  8 ATM1      8 <ETS(M,N,M)> <ARIMA(0,0,2)(0,1,1)[7]> <ETS(M,N,M)>
##  9 ATM1      9 <ETS(M,N,M)> <ARIMA(0,0,3)(0,1,1)[7]> <ETS(M,N,M)>
## 10 ATM1     10 <ETS(A,N,A)> <ARIMA(0,0,1)(0,1,1)[7]> <ETS(M,N,M)>
## # ℹ 4 more variables: arima_man <model>, combination_all <model>,
## #   combination_auto <model>, combination_man <model>

We also create a mable fitted to the whole dataset, since that lets us use the AICc to evaluate the performance of the different models as well.

tsb |>
  model(
    ets_auto = ETS(Cash),
    arima_auto = ARIMA(Cash),
    ets_man = ETS(Cash ~ error("M") + trend("N") + season("M")),
    arima_man = ARIMA(Cash ~ 0 + pdq(1,0,0) + PDQ(0,1,1))
  ) |>
  mutate(combination_all = (ets_auto + arima_auto + ets_man + arima_man) / 4,
         combination_auto = (ets_auto + arima_auto) / 2,
         combination_man = (ets_man + arima_man) / 2) -> fit2

fit2
## # A mable: 4 x 8
## # Key:     ATM [4]
##   ATM       ets_auto                       arima_auto      ets_man
##   <fct>      <model>                          <model>      <model>
## 1 ATM1  <ETS(A,N,A)>         <ARIMA(0,0,1)(0,1,2)[7]> <ETS(M,N,M)>
## 2 ATM2  <ETS(A,N,A)>         <ARIMA(2,0,2)(0,1,1)[7]> <ETS(M,N,M)>
## 3 ATM3  <ETS(M,N,N)>           <ARIMA(2,0,0) w/ mean> <ETS(M,N,M)>
## 4 ATM4  <ETS(A,N,A)> <ARIMA(3,0,2)(1,0,0)[7] w/ mean> <ETS(M,N,M)>
## # ℹ 4 more variables: arima_man <model>, combination_all <model>,
## #   combination_auto <model>, combination_man <model>

Evaluate

The evaluate step of our tidy forecasting workflow involves determining which of the fitted models estimates the data most accurately. We do this by comparing goodness-of-fit statistics, and we check whether the best model is valid by analyzing its residuals.

We now have 7 different models for each of our 4 ATMs, with 12 cross-validation folds each. So how do we determine which is the best model to forecast with? We forecast one step ahead from each fold and compare the models on RMSE. Within our cross-validation accuracy report we see that our manually specified ETS model has the lowest RMSE for ATM 1 and ATM 3, which is surprising given the relatively high MASE values, meaning these models are only marginally better than a naive forecast. It is while considering this point that we realize cross-validation does not combine well with the automatic selection algorithms of ETS and ARIMA.

If we revisit our cross-validation mable, we see that as .id changes, so does the specification chosen for ets_auto and arima_auto. Thus, we are not really cross-validating a single model but averaging the accuracy of different models into a single test statistic. Because of this we decide not to consider the cross-validation results further.

fit |>
  forecast(h = 1) |>
  accuracy(tsb) -> acc

acc |>
  group_by(ATM) |>
  slice_min(RMSE)
## # A tibble: 5 × 11
## # Groups:   ATM [4]
##   .model   ATM   .type     ME  RMSE   MAE     MPE  MAPE    MASE   RMSSE     ACF1
##   <chr>    <fct> <chr>  <dbl> <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>    <dbl>
## 1 ets_man  ATM1  Test    8.95  22.4  12.7    6.61  13.4   0.715   0.806  -0.0651
## 2 combina… ATM2  Test   -3.50  28.6  20.6  -35.4   58.3   1.02    0.974  -0.457 
## 3 ets_auto ATM3  Test    0      0     0      0      0   NaN     NaN     NaN     
## 4 ets_man  ATM3  Test    0      0     0      0      0   NaN     NaN     NaN     
## 5 arima_a… ATM4  Test  -68.0  308.  255.  -443.   465.    0.733   0.676   0.248

Instead we turn to the AICc statistic to determine which model fits our data best. Minimizing the AICc is asymptotically equivalent to leave-one-out cross-validation, which is why we pick the model per ATM that minimizes it, bearing in mind that AICc comparisons are only strictly meaningful between models of the same class. For every ATM except ATM 4 the automatically generated ARIMA model has the lowest AICc; for ATM 4 it is the manually specified ARIMA model. The ATM 3 values are artificially low because of the mean imputation, and an information criterion alone does not tell us whether a model is adequate, so before trusting these models we take a look at their residuals.
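
For reference, the AICc is simply the AIC with a small-sample correction; a minimal helper showing the standard formula, where k is the number of estimated parameters and n the number of observations:

# AICc = AIC + 2k(k + 1) / (n - k - 1)
aicc <- function(aic, k, n) aic + (2 * k * (k + 1)) / (n - k - 1)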

fit2 |>
  select(ATM, ets_auto, arima_auto, ets_man, arima_man) |>
  glance() |>
  group_by(ATM) |>
  slice_min(AICc)
## # A tibble: 4 × 12
## # Groups:   ATM [4]
##   ATM   .model       sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE   MAE ar_roots
##   <fct> <chr>         <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>  
## 1 ATM1  arima_auto  5.54e+2  -1639. 3286. 3286. 3302.    NA    NA    NA <cpl>   
## 2 ATM2  arima_auto  5.72e+2  -1644. 3300. 3301. 3324.    NA    NA    NA <cpl>   
## 3 ATM3  arima_auto  1.91e-1   -216.  439.  439.  455.    NA    NA    NA <cpl>   
## 4 ATM4  arima_man   1.11e+5  -2594. 5193. 5193. 5205.    NA    NA    NA <cpl>   
## # ℹ 1 more variable: ma_roots <list>

In the case of our model selected for ATM 1, the residuals fulfill all of their conditions. They are fairly normally distributed, as can be seen in the histogram. There is no significant autocorrelation at any lag in the ACF graph. Finally, the residual time plot shows a mean of 0 and is fairly homoscedastic despite occasional spikes in the residuals.

fit2 |>
  filter(ATM == "ATM1") |>
  select(arima_auto) |>
  gg_tsresiduals()
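
As a hedged numerical check on the lack of autocorrelation, we can also run a Ljung-Box test on the innovation residuals; the degrees-of-freedom adjustment below assumes the 3 coefficients of the reported ARIMA(0,0,1)(0,1,2)[7] specification.

fit2 |>
  filter(ATM == "ATM1") |>
  select(arima_auto) |>
  augment() |>
  features(.innov, ljung_box, lag = 14, dof = 3)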

In the case of our model selected for ATM 2, the residuals also fulfill all of their conditions. They are fairly normally distributed, as can be seen in the histogram, despite some outliers at the far ends. There is only a single instance of significant autocorrelation, at lag 23, which can be dismissed given how close it sits to the significance boundary. Finally, the residual time plot shows a mean of 0 and is fairly homoscedastic, although the variability might be starting to change after March.

fit2 |>
  filter(ATM == "ATM2") |>
  select(arima_auto) |>
  gg_tsresiduals()

In the case of our model selected for ATM 3, the residuals fulfill all of their conditions despite looking very much like they do not. The residuals are not normally distributed, as the histogram shows, but that is because nearly all of them are exactly 0, which is even better than being normally distributed. There is almost no autocorrelation at all in the ACF graph. Finally, the residual time plot shows a mean of 0 with almost every point at 0, and if we ignore the single spike of about 8 (tiny in the scale of everything else) the residuals are perfectly homoscedastic.

Note that our model for ATM 3 is fundamentally flawed, because the data it was trained on amounts to 3 real observations plus a run of imputed means.

fit2 |>
  filter(ATM == "ATM3") |>
  select(arima_auto) |>
  gg_tsresiduals()

In the case of our model selected for ATM 4, the residuals fulfill all of their conditions despite being our worst set. They are mostly normally distributed with an acceptable amount of right skew, as can be seen in the histogram. There is only a single instance of significant autocorrelation, at lag 3, which can be dismissed given how close it sits to the significance boundary. Finally, the residual time plot shows a mean of 0 and is fairly homoscedastic, although there is more variability overall than in any of our other residuals.

fit2 |>
  filter(ATM == "ATM4") |>
  select(arima_man) |>
  gg_tsresiduals()

Since all of our model residuals seem to be acceptable, we can move on to the final part of our tidy workflow.

Forecast

Finally, with our models chosen and residuals examined, we can forecast the next month.

We select the optimal model for each ATM from our mable, forecast 31 days ahead to cover all of May 2010, and join the results together. We then plot the forecasts to get an idea of how useful they might be.

For ATM 1 and 2, the forecasts follow the existing patterns fairly well and the prediction intervals are reasonably narrow; in fact they look almost like seasonal naive forecasts, despite coming from ARIMA models that are not equivalent to SNAIVE. However, we do see them dip into negative cash withdrawn, which is not possible, as our data is most likely separate from deposits. We see this behavior in all the series except ATM 3, whose forecast should not be held in much confidence anyway because the underlying data is flawed, and which visibly loses its seasonal behavior over time. Although the point forecasts for ATM 4 look like they could be a reasonable approximation, prediction intervals stretching from roughly -$50,000 to $100,000 withdrawn in a day instill the opposite of confidence; our original data had only a single point that reached near $100,000 in daily withdrawals.

fit2 |>
  filter(ATM != "ATM4") |>
  select(arima_auto) |>
  forecast(h=31) |>
  select(Date, ATM, Cash, .mean) |>
  full_join(
    fit2 |>
      filter(ATM == "ATM4") |>
      select(arima_man) |>
      forecast(h=31) |>
      mutate(ATM = as_factor("ATM4")) |>
      select(Date, ATM, Cash, .mean),
    by = join_by(Date, ATM, Cash, .mean)
  ) -> fb_fc

fb_fc|>
  autoplot(tsb |> subset(yearmonth(Date) > yearmonth("2010-03")))
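
One hedged way to rule out the negative dips would be to model on the log scale, since fable back-transforms automatically and the forecasts then stay strictly positive. A minimal sketch for ATM 1, which has no zero values after cleaning (we did not adopt this for the report):

# Sketch only: an ARIMA on log(Cash) keeps back-transformed forecasts positive
tsb |>
  filter(ATM == "ATM1") |>
  model(arima_log = ARIMA(log(Cash))) |>
  forecast(h = 31) |>
  autoplot(tsb |> filter(ATM == "ATM1", yearmonth(Date) > yearmonth("2010-03")))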

Even with two of our series’ forecasts not inspiring confidence, this is the result of minimizing the AICc of the models we fit. Thus, the forecasts could be used as a reasonable estimate of what to expect in the next month of withdrawals, except in the case of ATM 3.

To wrap up, we use the writexl package’s write_xlsx function to generate an Excel sheet of our forecasts to share with colleagues.

fb_fc |>
  as_tibble() |>
  mutate(Cash = round(.mean,2)) |>
  select(-.mean) |>
  writexl::write_xlsx("ATMForecasts.xlsx")

Part B

Introduction

We have received an Excel file (Power.xlsx) containing monthly residential power usage from January 1998 until December 2013. We would like to utilize this data to create monthly forecasts of power usage for 2014. The way we will tackle this problem is through the tidy forecast workflow:

Tidy

The first part of the tidy forecast workflow is also the first word in the term. Tidying involves loading the data and then preparing it into the correct format for generating forecasting models.

We begin our tidying journey by loading in the data from the Excel file with readxl’s read_xlsx() function. By specifying the column names and types in the call itself we should, in theory, reduce the type conversions needed afterwards. However, read_xlsx has limitations with column types and cannot properly convert the date format used here, so we read that column as text and then mutate Date into the yearmonth class that a monthly tsibble expects. The loaded and correctly typed tsibble is then assigned to the variable tsb.

file <- "Power.xlsx"

suppressWarnings(
  readxl::read_xlsx(path = file, col_names = c("ID", "Date", "KWH"), skip = 1, col_types = c('numeric','text','numeric'))
  ) |>
  mutate(Date = yearmonth(Date)) |>
  select(Date, KWH) |>
  as_tsibble(index = Date) -> tsb

glimpse(tsb)
## Rows: 192
## Columns: 2
## $ Date <mth> 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun, 1998 …
## $ KWH  <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 8914755, 86…

Since our glimpse doesn’t show anything out of the ordinary, we skim our data to identify missing values and outliers. At a glance there don’t seem to be any outliers, but there is one missing data point that we will have to deal with.

tsb |>
  as_tibble() |>
  skimr::skim()
## Warning: Couldn't find skimmers for class: yearmonth, vctrs_vctr; No
## user-defined `sfl` provided. Falling back to `character`.
Data summary
Name as_tibble(tsb)
Number of rows 192
Number of columns 2
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 5 5 0 192 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
KWH 1 0.99 6502475 1447571 770523 5429912 6283324 7620524 10655730 ▁▁▇▅▁

Plotting shows us that the missing value falls in 2008 and is surrounded by data points that follow the same general seasonal pattern. It therefore gives us no extra information, and we deem it missing completely at random. Plotting also reveals a very heavy downward outlier in mid-2010 that is likewise surrounded by data points following the usual pattern. Power consumption does not suddenly drop that heavily for a single month without explanation, so this data point is most likely erroneous. Based on the rest of the data, there is no systematic reason for either the missing value or the outlier.

tsb |>
  autoplot(KWH)

Thus, we convert the outlier to an NA value so that we can ARIMA-impute it together with the original missing value.

tsb |>
  mutate(KWH = if_else(KWH < 1000000, NA, KWH)) -> tsb

tsb |>
  filter(is.na(KWH))
## # A tsibble: 2 x 2 [1M]
##       Date   KWH
##      <mth> <dbl>
## 1 2008 Sep    NA
## 2 2010 Jul    NA

Now, with the missing values marked in our dataset, we’ll impute new values using ARIMA imputation: we fit an ARIMA model to the series and then use the interpolate function to fill in the missing values from the fitted model.

Below we find the optimal Box-Cox transformation for minimizing heteroscedasticity, to give the ARIMA model used for imputation the best chance. We then impute the values using the transformed KWH variable. Plotting shows a more complete and outlier-free time series that should be ready for visualization.

tsb |>
  features(KWH, guerrero) |>
  pull() -> lambda

tsb |>
  model(ARIMA(box_cox(KWH,lambda))) |>
  interpolate(tsb) -> tsb

tsb |>
  autoplot(KWH)
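
For reference, on a strictly positive series like KWH the transformation that box_cox() applies matches the classic Box-Cox definition; a minimal sketch of that formula:

# Classic Box-Cox for positive y; lambda = 0 reduces to a plain log transform
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}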

All in all, our data is now tidied and ready to move on to the next step of the tidy forecasting workflow.

Visualise

The next step in the workflow is to visualize our data in order to better understand it. We briefly visualized our data previously, but that was simply to be able to clean it. Here we want to identify what patterns exist in the data to inform model selection down the line.

We plot our cleaned series and are immediately able to detect patterns. There is a slight increasing trend as the years go by. Layered on top of this trend is yearly seasonality that follows the months. There are no apparent cycles in the data. Additionally, the data appears very slightly heteroscedastic, as the variance of the seasonal fluctuations increases a little along with the trend.

tsb |>
  autoplot(KWH)

Taking a look at the seasonal plot over a yearly period, there is a very clear pattern whose average level increases with the years. We have peaks in energy usage in the middle of July and at the start of January, with troughs in mid-April and early November. These correspond to the seasons when people use the most and the least heating or cooling: July and January are the peaks of summer and winter respectively, while April and November sit in spring and autumn.

tsb |>
  gg_season(KWH, period = "year") 

Our ACF plot heavily mirrors and confirms our seasonal pattern.

tsb |>
  ACF(KWH) |>
  autoplot()

Next we’ll take a look at how many differences would be appropriate when we build our ARIMA model. First, we determine the optimal Box-Cox transformation from the processed data, as that can affect how many differences are required. To determine the number of seasonal differences we use the unitroot_nsdiffs feature on our tsibble. We see that a single seasonal difference is suggested, and that after seasonal differencing there is no need for a non-seasonal difference.

tsb |>
  features(KWH, guerrero) |>
  pull() -> lambda

tsb |>
  features(box_cox(KWH, lambda), unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
tsb |>
  features(difference(box_cox(KWH, lambda),12), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

Then we seasonally difference the series to help determine the ARIMA orders. From these graphs it appears that a seasonal autoregressive order of 2 could be appropriate, because the ACF has decaying spikes at the seasonal lags while the PACF has a final spike at the second seasonal lag of 24.

tsb |>
  gg_tsdisplay(difference(box_cox(KWH, lambda),lag = 12), plot_type='partial', lag=36)

Finally, we apply an STL decomposition to the series to determine whether there is a significant trend we might be missing. The trend really picks up towards the end of the series, which was not something we had recognized from the earlier visualization.

tsb |>
  model(STL(
    KWH
    )) |>
  components() |>
  autoplot()

Specify and Estimate

Now that we have a good understanding of the patterns within our data we can move on to the next step in the tidy forecasting workflow: specifying and estimating. We want to specify which models we will use to forecast our data and then fit our data to those models. These are formally two separate steps, but R’s fable package makes it easy to specify our models and fit them in the same stroke.

We want to train both exponential smoothing (ETS) and ARIMA models, as we can’t be certain which will forecast best without training and evaluating them. Both ETS and ARIMA have selection algorithms that attempt to find the best model by minimizing an information criterion, and we include these automatically selected models as they have a good chance of fitting the data and forecasting well.

Note that we Box-Cox transform KWH to stabilize its variance, which makes the series easier to model well with both ETS and ARIMA.

As humans, we also have the power of pattern recognition to attempt to intuit a model that will fit our data and forecast well. Utilizing the trends we have discovered and visualized we will create manually crafted versions of each model to attempt to beat the algorithms.

Our manually specified ETS model will be an MAdM model. The trend is additive damped because the STL decomposition shows the trend rising, especially towards the end, and we do not want to extrapolate that rise indefinitely, so we damp it. Our seasonal component is multiplicative, as the mild heteroscedasticity we observed is best modeled with multiplicative seasonality. The error will naturally follow the seasonality.

From our visual exploration we will specify an ARIMA(2,0,0)(2,1,0) model. The unit root test suggested no non-seasonal differencing, and we go with an autoregressive order of 2 based on the ACF and PACF plots, which gives us the first part. The seasonal part comes from the single seasonal difference that was required and the seasonal autoregressive order of 2 suggested by the same plots.

tsb |>
  model(
    ets_auto = ETS(box_cox(KWH, lambda)),
    arima_auto = ARIMA(box_cox(KWH, lambda)),
    ets_man = ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M")),
    arima_man = ARIMA(box_cox(KWH, lambda) ~ pdq(2,0,0) + PDQ(2,1,0))
  ) -> fit

fit
## # A mable: 1 x 4
##       ets_auto                         arima_auto       ets_man
##        <model>                            <model>       <model>
## 1 <ETS(M,A,A)> <ARIMA(0,0,3)(2,1,0)[12] w/ drift> <ETS(M,Ad,M)>
## # ℹ 1 more variable: arima_man <model>

Evaluate

The evaluate step of our tidy forecasting workflow involves determining which of the fitted models estimates the data most accurately. We do this by comparing goodness-of-fit statistics, and we check whether the best model is valid by analyzing its residuals.

We now have 4 different models and will use the AICc statistic to judge how well each fits the data, bearing in mind that AICc values are only comparable between models of the same class, so the ETS and ARIMA numbers cannot be compared directly with one another. Within each class the automatically selected model comes out slightly ahead: ets_auto edges out our manual ETS specification (-1119 vs -1115), and arima_auto beats arima_man (-1500 vs -1493). Since the AICc cannot settle the choice between the two classes, we rely on residual diagnostics, starting with the ETS models.

fit |>
  glance() |>
  arrange(desc(AICc))
## # A tibble: 4 × 11
##   .model          sigma2 log_lik    AIC   AICc    BIC      MSE     AMSE      MAE
##   <chr>            <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
## 1 ets_man    0.000000639    577. -1119. -1115. -1060.  1.27e-5  1.32e-5  5.86e-4
## 2 ets_auto   0.000000627    579. -1123. -1119. -1068.  1.26e-5  1.30e-5  5.85e-4
## 3 arima_man  0.0000150      753. -1494. -1493. -1474. NA       NA       NA      
## 4 arima_auto 0.0000143      757. -1501. -1500. -1478. NA       NA       NA      
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>

We evaluate our manually specified ETS model first and, surprisingly, its residuals do not fulfill all of their conditions. In particular, the residuals are significantly autocorrelated at lags 1 and 4, indicating that there is still information left in them that the model has not captured. The residuals are fairly normally distributed, as can be seen in the histogram. Finally, the residual time plot shows a mean of 0 and is fairly homoscedastic despite occasional spikes.

fit |>
  select(ets_man) |>
  gg_tsresiduals()

As that model doesn’t fulfill all of the residual diagnostic conditions, we take a look at our other models to see whether they do. The automatically selected ETS model has very similar residual diagnostic plots and fails for the same reason, with significant autocorrelation at lags 1 and 4.

fit |>
  select(ets_auto) |>
  gg_tsresiduals()

The manually specified ARIMA model still shows autocorrelation at lags 3 and 6, suggesting there is some within-year structure left over in the residuals.

fit |>
  select(arima_man) |>
  gg_tsresiduals()

It is only when we get to the automatically selected ARIMA model that we find one that just barely passes all of the residual conditions: there is finally no significant autocorrelation. It also has the lowest AICc within the ARIMA class. We will use this model to forecast, since forecasts and prediction intervals are only trustworthy when the residuals are uncorrelated, with a mean around 0 and roughly normal.

fit |>
  select(arima_auto) |>
  gg_tsresiduals()
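
As a hedged numerical check on the lack of autocorrelation, we can run a Ljung-Box test on the innovation residuals; the degrees-of-freedom adjustment below assumes the 6 coefficients of the reported ARIMA(0,0,3)(2,1,0)[12] w/ drift specification.

fit |>
  select(arima_auto) |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 6)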

With a model in hand that passes the residual checks, we can move on to the final part of our tidy workflow.

Forecast

Finally, with our model chosen and residuals examined, we can forecast the next year.

Looking at our forecasts for 2014, they all follow the same general pattern that the series has shown historically, and the point forecasts from the different models sit quite close to one another. The widths of the prediction intervals differ between models, but an interval is only trustworthy for a model whose residual diagnostics are satisfied.

fit |>
  forecast(h=12) -> fb_fc

fb_fc |>
  autoplot(tsb |> subset(yearmonth(Date) > yearmonth("2010-01")))
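
To make the interval-width comparison concrete, one could tabulate the average 95% interval width per model directly from the forecast distributions (a quick sketch; we do not base the model choice on these numbers):

# Average width of the 95% prediction interval for each model
fb_fc |>
  as_tibble() |>
  group_by(.model) |>
  summarise(avg_95_width = mean(quantile(KWH, 0.975) - quantile(KWH, 0.025)))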

Thus we make do with the arima_auto model for our forecast, even though its prediction intervals are the widest.

fb_fc |>
  filter(.model == "arima_auto") |>
  autoplot(tsb |> subset(yearmonth(Date) > yearmonth("2010-01")))

To wrap up, we use the writexl package’s write_xlsx function to generate an Excel sheet of our forecasts to share with colleagues.

fb_fc |>
  as_tibble() |>
  mutate(KWH = round(.mean,2)) |>
  filter(.model == "arima_auto") |>
  select(-.mean, -.model) |>
  mutate(Date = as_date(Date)) |>
  writexl::write_xlsx("PowerForecasts.xlsx")