Project 1

Author

Amanda Rose Knudsen

Published

March 30, 2025

This is Project 1 for DATA-624: Predictive Analytics.

Before we begin, let’s load our libraries.

library(tidyverse)
library(fpp3)
library(fable)
library(readxl)
library(openxlsx)
library(tsibble)
library(tidyr)
library(ggplot2)
library(slider)
library(zoo)

Part A – ATM Forecast, ATM624Data.xlsx

Prompt: I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Loading and Inspecting the data

atm_data <- read_excel("ATM624Data.xlsx", sheet = "ATM Data")
glimpse(atm_data)
Rows: 1,474
Columns: 3
$ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
$ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
$ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …

We can see immediately that the DATE column is in a non-date format. Let’s convert it so we can properly read and analyze the data. The dates are stored as Excel serial numbers, so unless we convert them with origin = "1899-12-30" we’ll end up with incorrect dates.

atm_data <- atm_data |> 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
glimpse(atm_data)
Rows: 1,474
Columns: 3
$ DATE <date> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009…
$ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
$ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …

That looks much better.

Before we move on, let’s also modify the ‘Cash’ variable so it more accurately reflects the cash flow. We know that the original spreadsheet was provided in hundreds of dollars, so let’s convert the values to their actual dollar amounts. Later we can decide whether to convert back to ‘hundreds of dollars’ as the Cash unit before we produce our final output Excel file.

atm_data$Cash <- atm_data$Cash * 100

Summary Statistics and Missing Values

Let’s look, ATM by ATM, at the total, mean, median, standard deviation, daily maximum, daily minimum, and number of missing values across the entire period of available data. Since we have already converted the amounts from hundreds of dollars to actual dollars, these figures better reflect each ATM’s cash withdrawal flow.

atm_summary <- atm_data |> 
  group_by(ATM) |> 
  summarise(
    total_cash = sum(Cash, na.rm = TRUE),
    avg_cash = mean(Cash, na.rm = TRUE),
    median_cash = median(Cash, na.rm = TRUE),
    stddev_cash = sd(Cash, na.rm = TRUE),
    max_cash = max(Cash, na.rm = TRUE),
    min_cash = min(Cash, na.rm = TRUE),
    missing_values = sum(is.na(Cash)),
    .groups = "drop"
  )
atm_summary
# A tibble: 5 × 8
  ATM   total_cash avg_cash median_cash stddev_cash max_cash min_cash
  <chr>      <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>
1 ATM1    3036700    8389.        9100        3666.   18000      100 
2 ATM2    2271600    6258.        6700        3890.   14700        0 
3 ATM3      26300      72.1          0         794.    9600        0 
4 ATM4   17302582.  47404.       40384.      65094. 1091976.     156.
5 <NA>          0     NaN           NA          NA     -Inf      Inf 
# ℹ 1 more variable: missing_values <int>

This gives us a lot of information, including that ATM1 has 3 missing values, ATM2 has 2 missing values, ATM3 and ATM4 have 0 missing values, and then there are 14 observations that are missing an ATM designation and have no associated data. Let’s return to that in a moment.

Before determining how we want to handle missing values, let’s discuss summary statistics. We can see that ATM4 has the greatest total cash withdrawal volume overall ($17,302,582) which is a lot higher than the total in ATM1 ($3,036,700), ATM2 ($2,271,600), or ATM3 (which is much lower than the others, at just $26,300 total).

ATM4 also has the highest average (mean) cash withdrawal volume at $47,404.34, compared to ATM1’s average near $8,400, ATM2’s average near $6,260, and ATM3’s average of less than $75.00, by far the lowest average overall. In fact, ATM3 has a median value of 0.00, which tells us something different is likely going on with this ATM compared to the others.

We can also observe an apparent outlier for ATM4: it has a maximum cash withdrawal value of $1,091,976 – this seems extremely high, given factors including ATM4’s median cash withdrawal value of $40,383.93. (This also makes us wonder if it’s even possible for a single ATM to experience such a high withdrawal volume in a single day.) We’ll want to take a closer look at the overall shape of each ATM’s cash values per day so we can observe anything out of the ordinary.

Handling Missing Values

We already know that ATM3 and ATM4 are not missing any values, that ATM1 and ATM2 are missing 3 and 2 values, respectively, and that there are 14 missing values for records without any data (including without an ATM specified). Before we decide how to handle missing values, let’s identify the minimum and maximum date for each ATM, count the number of days where the value was 0.00 (zero_cash_days), and check the total records for each ATM to confirm we have a full year of data.

atm_date_summary <- atm_data |> 
  group_by(ATM) |> 
  summarise(
    min_date = min(DATE),
    max_date = max(DATE),
    zero_cash_days = sum(Cash == 0, na.rm = TRUE),
    total_records = n(),
    .groups = "drop"
  )
atm_date_summary
# A tibble: 5 × 5
  ATM   min_date   max_date   zero_cash_days total_records
  <chr> <date>     <date>              <int>         <int>
1 ATM1  2009-05-01 2010-04-30              0           365
2 ATM2  2009-05-01 2010-04-30              2           365
3 ATM3  2009-05-01 2010-04-30            362           365
4 ATM4  2009-05-01 2010-04-30              0           365
5 <NA>  2010-05-01 2010-05-14              0            14

First, we can confirm that ATM1, ATM2, ATM3, and ATM4 each have a full year of values (365 days), as expected.

Second, we can see that the observations with a null ATM all fall in May 2010, while ATM1 through ATM4 have values from 2009-05-01 to 2010-04-30, as expected. We will drop the null-ATM records covering May 1 to May 14 because our forecasts will be produced per ATM rather than as a single daily total. We will also hold off on deciding whether to interpolate the 2 zero-cash days for ATM2 until we validate whether they are truly anomalies: it may be completely legitimate for an ATM to have days with no withdrawals. Only if those days fall outside the interquartile range (IQR) of ATM2’s data would we treat them as issues to correct, as we will for the extreme outlier in ATM4 and the null values in ATM1 and ATM2.

Before we move on, it’s worth noting that ATM3 appears to be missing a significant amount of activity, though not in error; this is enough to warrant either excluding ATM3 from our forecasting or using a very different approach for it. From May 1, 2009 to April 30, 2010, 362 of 365 days were zero-cash days for ATM3, compared to just 2 for ATM2 and 0 for ATM1 and ATM4. This is not a situation that calls for imputation; rather, we simply do not have enough data to impute or to forecast with a model like ARIMA. Although ATM3’s records start and end on the same dates as the other ATMs, the series carries almost no information to feed into an imputation or forecasting process. We will confirm this with the outlier detection below, but it is likely that ATM3 is a newer machine with little activity in our period of interest, which points toward using a NAIVE model instead. Let’s take a look below when we look at outliers.
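
As a quick sanity check (a minimal sketch; output not shown here), we can list ATM3’s non-zero days directly; if only a handful of rows return, that confirms there is too little usable history for an ARIMA fit:

# Days on which ATM3 recorded any withdrawals at all
atm_data |> 
  filter(ATM == "ATM3", Cash > 0) |> 
  arrange(DATE)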

Let’s remove the records that are all but blank for May 2010, which will be our period of interest for forecasting.

atm_data <- atm_data |> 
  filter(!is.na(ATM))

Now that we’ve removed the records where ATM is null and there is no other data provided, we need to determine what to do about the missing values for ATM1 and ATM2, and the potential outliers we suspect in ATM4.

missing_dates <- atm_data |>  
  filter(is.na(Cash)) |> 
  select(DATE, ATM)

missing_dates
# A tibble: 5 × 2
  DATE       ATM  
  <date>     <chr>
1 2009-06-13 ATM1 
2 2009-06-16 ATM1 
3 2009-06-18 ATM2 
4 2009-06-22 ATM1 
5 2009-06-24 ATM2 

We can see that all missing values for ATM1 and ATM2 occurred in the same month, June 2009. This is relevant as we look toward imputation: it suggests the gaps are due to a mechanical or data-capture error rather than genuinely absent activity, so imputing these select missing values via interpolation is a valid approach. Before we do that, let’s take a look at outliers in the data.

Detect outliers

atm_data |> 
  ggplot(aes(x = ATM, y = Cash, fill = ATM)) +
  geom_boxplot(outlier.color = 'red') +
  labs(title = 'Boxplots of Cash Withdrawals by ATM', x = 'ATM', y = 'Cash') +
  theme_minimal()

Clearly there is one extreme outlier for ATM4, along with a small number of milder outliers close to the IQR; ATM1 and ATM3 also show outliers close to the IQR. The values close to the IQR do not need to be modified, as they appear to be legitimate observations rather than errors. The extreme outlier for ATM4, however, must be handled before we use this data for forecasting.

We can also see that there is no low outlier for ATM2, which indicates that the zero-cash days are not outliers and do not need to be imputed or replaced. The 2 zero-cash days for ATM2 appear to be accurate values that should not be modified.
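
To back up the visual read with numbers (a minimal sketch using the standard 1.5 x IQR boxplot rule; output not shown), we can compute the lower Tukey fence for ATM2 and check that a value of 0 does not fall below it:

# Lower Tukey fence for ATM2: values at or above Q1 - 1.5 * IQR are not flagged as outliers
atm_data |> 
  filter(ATM == "ATM2") |> 
  summarise(
    q1 = quantile(Cash, 0.25, na.rm = TRUE),
    q3 = quantile(Cash, 0.75, na.rm = TRUE),
    lower_fence = q1 - 1.5 * (q3 - q1)
  )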

Visualize Cash Over Time

Now we will visualize the cash per ATM over time.

atm_data  |> 
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Cash Withdrawals Over Time by ATM", x = "Date", y = "Cash") +
  theme_minimal()

This view of the data is also telling. The significant outlier for ATM4 is clearly (as we already knew) out of the ordinary and likely represents inaccurate data. We can also confirm that ATM3 was effectively inactive, recording $0.00 in withdrawals, for the majority of the year until April 2010. Finally, note that the y-axes differ across panels: ATM4 operates at a much higher overall volume than the other ATMs, consistent with the summary statistics we reviewed earlier.

Returning to the abnormal outlier for ATM4, it is clearly a major anomaly, and we should consider replacing it with a median value for ATM4 computed excluding outliers. The extreme value most likely reflects an equipment malfunction or data-entry error rather than an actual daily withdrawal total at ATM4.

Major Outlier for ATM4

We’ll use the 99th percentile as the threshold for identifying extreme outlier values for ATM4. We don’t need to modify all outliers, as outliers can be legitimate data points; we only need to modify the extreme value that would otherwise greatly distort our model and forecasts.

atm4_outlier_threshold <- 
  quantile(atm_data$Cash[atm_data$ATM == "ATM4"], 0.99, na.rm = TRUE)

We’ll now replace the outliers above the 99th percentile threshold with a median value of ATM4 calculated by excluding data past that outlier threshold.

atm_data <- atm_data |> 
  mutate(Cash = ifelse(ATM == "ATM4" & Cash > atm4_outlier_threshold,
                       median(Cash[ATM == "ATM4" & 
                                     Cash <= atm4_outlier_threshold],
                              na.rm = TRUE),
                       Cash))
atm_data  |> 
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Cash Withdrawals Over Time by ATM", x = "Date", y = "Cash") +
  theme_minimal()

The data for ATM4 now appears much more realistic and will be reasonable for our purposes. Now that the value above the 99th percentile has been replaced, the scales across the ATMs are much closer to one another than before. When the series are overlaid on a single y-axis, ATM4 is clearly still the highest-volume ATM and shows a broader range of variation than ATM1 and ATM2. For this reason, among others, we will forecast each ATM separately rather than forecasting the total cash across all ATMs.

atm_data  |> 
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time by ATM", x = "Date", y = "Cash") +
  theme_minimal()

Before we can move on to forecasting we must continue with pre-processing the data in order to handle those missing values in ATM1 and ATM2.

Impute Missing Values Using Linear Interpolation

Below we use linear interpolation for short gaps of missing data. For the gaps we know exist in ATM1 and ATM2 (isolated single days, fewer than 4 per ATM, all in June 2009), we can fill each missing value by drawing a straight line between the previous non-missing value and the next available non-missing value. This is an accepted method for replacing nulls because the data can reasonably be assumed to transition smoothly over a period of a few days. The method below interpolates each missing value from its neighboring data points and replaces the null with that interpolated value.

atm_data <- atm_data |> 
  group_by(ATM) |> 
  arrange(DATE) |> 
  mutate(Cash = zoo::na.approx(Cash, na.rm = TRUE)) |> 
  ungroup()

To ensure this worked let’s run the summary and date summary view again.

atm_summary_after_imputation <- atm_data |> 
  group_by(ATM) |> 
  summarise(Missing_Values = sum(is.na(Cash)), .groups = 'drop')

atm_summary_after_imputation
# A tibble: 4 × 2
  ATM   Missing_Values
  <chr>          <int>
1 ATM1               0
2 ATM2               0
3 ATM3               0
4 ATM4               0

We can confirm that imputation worked for ATM1 and ATM2, and we now have zero missing data.

atm_date_summary2 <- atm_data |> 
  group_by(ATM) |> 
  summarise(
    min_date = min(DATE),
    max_date = max(DATE),
    zero_cash_days = sum(Cash == 0, na.rm = TRUE),
    total_records = n(),
    .groups = "drop"
  )
atm_date_summary2
# A tibble: 4 × 5
  ATM   min_date   max_date   zero_cash_days total_records
  <chr> <date>     <date>              <int>         <int>
1 ATM1  2009-05-01 2010-04-30              0           365
2 ATM2  2009-05-01 2010-04-30              2           365
3 ATM3  2009-05-01 2010-04-30            362           365
4 ATM4  2009-05-01 2010-04-30              0           365

We also confirmed that our interpolation process did not merely remove the missing (null) values, but rather populated them with the method we described above. We have a full 365 days in the year from May 1, 2009 to April 30, 2010, for each one of the ATMs.

Forecasting Preparation

Now that we have zero missing values and have handled our extreme outliers, we are able to move on toward forecasting.

We will convert our dataset from a tibble to a tsibble to enable time series forecasting. Forecasting an entire month (May 2010) from fewer than 5 days of meaningful history is highly unreliable, but rather than drop ATM3 entirely we will forecast it with a NAIVE approach. The NAIVE model is well suited to very limited history because it uses the most recent observation as the predictor for all future values. Given that ATM3 has too few non-zero entries for ARIMA modeling, a NAIVE model lets us produce a forecast without over-fitting a highly limited set of values. This will necessarily be a coarse forecast, so we include it with the caveat that the business should weigh its usefulness against the value sought from each ATM’s forecast.

ARIMA models need a reasonably long history to identify trends, seasonality, and patterns in the data. For this reason we will use a different approach for ATM1, ATM2, and ATM4, each of which has a full year of data with which to predict May 2010.

Now, we will convert the data to a tsibble.

atm_tsibble <- atm_data |> 
  as_tsibble(index = DATE, key = ATM)

Our data is daily, so weekly seasonality is likely. We also saw earlier that each series has its own variability and potentially a non-seasonal trend. Before modeling, most likely with an ARIMA or seasonal ARIMA (SARIMA) model, let’s decompose each series using STL decomposition to see whether there is a strong seasonal pattern that would justify a seasonal model rather than a non-seasonal ARIMA.

One reason we will use ARIMA (or SARIMA) modeling rather than exponential smoothing (ETS) is that we need to model both seasonal and non-seasonal components with more flexibility than ETS offers. ARIMA isn’t always the better choice compared to ETS, but here it is: we will see below that the trend, seasonal, and remainder (error) components do not follow a simple, regular structure. ARIMA can handle these seasonal and non-seasonal patterns effectively and can also accommodate the stationarity and differencing requirements.

atm1_decomp <- atm_tsibble|> 
  filter(ATM == "ATM1") |> 
  model(STL(Cash ~ season(window = "periodic"))) |> components()
atm2_decomp <- atm_tsibble |> 
  filter(ATM == "ATM2") |> 
  model(STL(Cash ~ season(window = "periodic"))) |> components()
atm3_decomp <- atm_tsibble |> 
  filter(ATM == "ATM3") |> 
  model(STL(Cash ~ season(window = "periodic"))) |> components()
atm4_decomp <- atm_tsibble |> 
  filter(ATM == "ATM4") |> 
  model(STL(Cash ~ season(window = "periodic"))) |> components()

Below is the decomposition for ATM1.

atm1_decomp |> 
  autoplot() +
  labs(title = "Time Series Decomposition for ATM1", x = "Date", y = "Cash") +
  theme_minimal()

Below is the decomposition for ATM2.

atm2_decomp |> 
  autoplot() +
  labs(title = "Time Series Decomposition for ATM2", x = "Date", y = "Cash") +
  theme_minimal()

Below is the decomposition for ATM3.

atm3_decomp |> 
  autoplot() +
  labs(title = "Time Series Decomposition for ATM3", x = "Date", y = "Cash") +
  theme_minimal()

As stated earlier, ATM3 shows no meaningful trend or seasonal signal, confirming our choice to use a NAIVE model rather than ARIMA or ETS.

Below is the decomposition for ATM4.

atm4_decomp |> 
  autoplot() +
  labs(title = "Time Series Decomposition for ATM4", x = "Date", y = "Cash") +
  theme_minimal()

Across the decomposition plots for ATM1, ATM2, and ATM4, the trend components fluctuate over time, and each series shows a strong, regular weekly seasonal pattern. The remainder component for each shows notable random fluctuations that are not explained by trend or seasonality. Since seasonality is strong in each series, a SARIMA model is relevant and appropriate here; a non-seasonal ARIMA model is unlikely to adequately capture the repeating weekly patterns we see in the data.
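
As a more compact check of the weekly pattern (a sketch; output not shown), the sample autocorrelation functions for ATM1, ATM2, and ATM4 should show pronounced spikes at lags 7, 14, 21, and 28 if the weekly seasonality is as strong as the decompositions suggest:

# ACFs by ATM; strong spikes at multiples of lag 7 indicate weekly seasonality
atm_tsibble |> 
  filter(ATM != "ATM3") |> 
  ACF(Cash, lag_max = 28) |> 
  autoplot()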

We know that in order to apply ARIMA or SARIMA, we’ll need to ensure stationarity. We’ll use a KPSS Test to check for stationarity for each ATM that we’ll be forecasting. Before we move on to this next phase, we’re going to split the atm_data into 4 separate tsibbles (one for each ATM).

atm1_tsibble <- atm_data |> filter(ATM == "ATM1") |> as_tsibble(index = DATE)
atm2_tsibble <- atm_data |> filter(ATM == "ATM2") |> as_tsibble(index = DATE)
atm3_tsibble <- atm_data |> filter(ATM == "ATM3") |> as_tsibble(index = DATE)
atm4_tsibble <- atm_data |> filter(ATM == "ATM4") |> as_tsibble(index = DATE)

Stationarity and differencing for ATM1, ATM2, ATM4

ATM1

atm1_kpss <- unitroot_kpss(atm1_tsibble$Cash)
atm1_kpss
  kpss_stat kpss_pvalue 
 0.50941761  0.03954558 

It looks like ATM1, with a p-value <0.05, is not stationary.

atm2_kpss <- unitroot_kpss(atm2_tsibble$Cash)
atm2_kpss
  kpss_stat kpss_pvalue 
   2.145167    0.010000 

It appears that ATM2, with a p-value <0.05, is also not stationary.

As stated previously, since ATM3 has limited data, we use a NAIVE model for forecasting. We do not need to check for stationarity for ATM3.

atm4_kpss <- unitroot_kpss(atm4_tsibble$Cash)
print(atm4_kpss)
  kpss_stat kpss_pvalue 
  0.1144756   0.1000000 

It appears that ATM4, with a p-value of 0.10 (the largest value the KPSS test reports), can be treated as stationary: we fail to reject the null hypothesis of stationarity.
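
Since the KPSS tests above were run on the raw series, a complementary check (a minimal sketch using the feasts unit-root features; output not shown here) is to ask how many ordinary and seasonal differences are suggested for each series we plan to model with ARIMA; the automatic ARIMA search below should arrive at similar differencing choices:

# Suggested number of first differences and seasonal (weekly) differences per ATM
atm_tsibble |> 
  filter(ATM != "ATM3") |> 
  features(Cash, list(unitroot_ndiffs, unitroot_nsdiffs))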

We’ll move on to fit a seasonal ARIMA model for each of these series. The fable package will select the optimal model orders for each: specifying the model as below allows it to choose both the non-seasonal (pdq) and seasonal (PDQ) terms. It will also apply whatever seasonal differencing each model requires, so we do not need to difference the series manually before fitting.

Model Fit

The models for ATM1, ATM2, and ATM4 are all different and described below.

atm1_model <- atm1_tsibble |> 
  model(ATM_Model = ARIMA(Cash ~ pdq() + PDQ()))

report(atm1_model)
Series: Cash 
Model: ARIMA(0,0,1)(0,1,2)[7] 

Coefficients:
         ma1     sma1     sma2
      0.1950  -0.5806  -0.1037
s.e.  0.0546   0.0506   0.0494

sigma^2 estimated as 5563821:  log likelihood=-3288.66
AIC=6585.33   AICc=6585.44   BIC=6600.85

ATM1 is an ARIMA(0,0,1)(0,1,2)[7] model. The data is seasonally differenced once (D = 1, at the weekly lag) with a non-seasonal moving average term of order 1 (ma1) and a seasonal moving average component of order 2 (sma1, sma2). The model has no autoregressive components. This simpler structure suggests the primary driver is weekly seasonality, which makes sense since cash withdrawals repeat on a weekly cycle without much growth or decay (as we observed in the decomposition of the series previously).

atm2_model <- atm2_tsibble |> 
  model(ATM_Model = ARIMA(Cash ~ pdq() + PDQ()))

report(atm2_model)
Series: Cash 
Model: ARIMA(2,0,2)(0,1,1)[7] 

Coefficients:
          ar1      ar2     ma1     ma2     sma1
      -0.4320  -0.9130  0.4773  0.8048  -0.7547
s.e.   0.0553   0.0407  0.0861  0.0556   0.0381

sigma^2 estimated as 6025157:  log likelihood=-3302.32
AIC=6616.63   AICc=6616.87   BIC=6639.92

ATM2 is an ARIMA(2,0,2)(0,1,1)[7] model. It has autoregressive components (ar1, ar2) and moving average components (ma1, ma2) in addition to one seasonal difference (D = 1) and a seasonal moving average term (sma1). This model is more complex than ATM1’s: both autoregressive and moving average terms are needed to model ATM2’s cash withdrawals, indicating a pattern with more fluctuation than ATM1.

naive_model_atm3 <- atm3_tsibble |> model(NAIVE(Cash))
report(naive_model_atm3)
Series: Cash 
Model: NAIVE 

sigma^2: 258984.8788 
atm4_model <- atm4_tsibble |> 
  model(ATM_Model = ARIMA(Cash ~ pdq() + PDQ()))

report(atm4_model)
Series: Cash 
Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean 

Coefficients:
          ar1      ar2     ar3     ma1     ma2    sar1   constant
      -0.9237  -0.6897  0.1511  0.9875  0.7759  0.2430  80975.385
s.e.   0.1138   0.1151  0.0578  0.1060  0.1003  0.0553   4657.807

sigma^2 estimated as 1.071e+09:  log likelihood=-4309.13
AIC=8634.27   AICc=8634.67   BIC=8665.47

This is the most complex model among the ATMs. ATM4 is an ARIMA(3,0,2)(1,0,0)[7] with mean model. There are 3 autoregressive terms (ar1, ar2, ar3), two moving average terms (ma1, ma2), one seasonal autoregressive term (sar1) that confirms a weekly pattern, and a constant (mean) term. The strong autoregressive structure (p = 3) indicates that prior days’ withdrawals influence the current day’s withdrawals, consistent with the higher volume and higher variability we observed for ATM4. The constant (mean) term suggests a long-run average level that affects this ATM differently than ATM1 or ATM2. The seasonal part of the model is simpler than the non-seasonal part, indicating that much of the pattern comes from within-week dynamics.

Overall, the report() output shows that the models for ATM1, ATM2, and ATM4 differ significantly in structure, reflecting the distinct patterns and seasonalities in each ATM’s cash flow. ATM1 and ATM2 have similar structures, with non-zero ma1 and sma1 terms indicating short-term effects and weekly seasonality. ATM4, however, has a more complex model with additional autoregressive (ar) terms and a constant term suggesting a strong underlying mean level; this complexity is likely driven by the high variability and extreme values detected earlier during the EDA phase. The ATM1 model has the lowest AICc, followed by ATM2 and then ATM4. AICc is an indicator of goodness of fit, but these values cannot be compared across models fit to different series to determine which is ‘best’.
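
For reference (a sketch; output not shown), the fit statistics quoted in the reports above can be collected into a single table with glance(); as noted, these values document each fit but are not directly comparable across the different series:

# Gather AICc, BIC, and residual variance for the three ARIMA models into one table
bind_rows(
  glance(atm1_model) |> mutate(ATM = "ATM1"),
  glance(atm2_model) |> mutate(ATM = "ATM2"),
  glance(atm4_model) |> mutate(ATM = "ATM4")
) |> 
  select(ATM, .model, sigma2, AIC, AICc, BIC)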

atm1_model |> 
  gg_tsresiduals() +
  labs(title = "Residuals for ATM1 ARIMA(0,0,1)(0,1,2)[7] model")

atm2_model |> 
  gg_tsresiduals() +
  labs(title = "Residuals for ATM2 ARIMA(2,0,2)(0,1,1)[7] model")

atm4_model |> 
  gg_tsresiduals() +
  labs(title = "Residuals for ATM4 ARIMA(3,0,2)(1,0,0)[7] w/ mean model")

While each of these plots looks quite different, the residuals for each model resemble white noise.
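
To formalize the visual impression of white noise (a minimal sketch; the lag of 14, two weekly periods, is our choice and output is not shown), a Ljung-Box test can be applied to each model’s innovation residuals, where large p-values are consistent with white noise:

# Ljung-Box tests on the innovation residuals of each ARIMA model
bind_rows(
  augment(atm1_model) |> features(.innov, ljung_box, lag = 14) |> mutate(ATM = "ATM1"),
  augment(atm2_model) |> features(.innov, ljung_box, lag = 14) |> mutate(ATM = "ATM2"),
  augment(atm4_model) |> features(.innov, ljung_box, lag = 14) |> mutate(ATM = "ATM4")
)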

However, for ATM4, we may want to see what an ETS model produces. The ARIMA model selected for ATM4 involves no differencing; we previously confirmed that the series is already stationary. Let’s fit an ETS model for ATM4 and compare it to the ARIMA model fit reported above.

atm4_ets_model <- atm4_tsibble |> model(ETS(Cash))
report(atm4_ets_model)
Series: Cash 
Model: ETS(A,N,A) 
  Smoothing parameters:
    alpha = 0.003302771 
    gamma = 0.0001000113 

  Initial states:
     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
 46188.15 -26254.57 -2067.936 -113.7639 4669.727 9982.185 5744.055 8040.303

  sigma^2:  1011450461

     AIC     AICc      BIC 
9732.497 9733.119 9771.496 

The results of this ETS model have a higher AICc and higher BIC compared with the complex ARIMA model for ATM4. This indicates that the ARIMA model for ATM4, while more complex, is also a better fit. The selected ETS(A,N,A) specification stands for Additive error, No trend, Additive seasonality: the model captures the weekly seasonal pattern we saw in the time series decomposition but assumes no underlying trend.

Let’s see how the forecasts compare before we discount the ETS model altogether. Next we’ll generate forecasts for the month of May 2010 for each ATM.

Forecasts

atm1_forecast <- forecast(atm1_model, h = "31 days")
atm2_forecast <- forecast(atm2_model, h = "31 days")
atm3_forecast <- forecast(naive_model_atm3, h = "31 days")
atm4_forecast <- forecast(atm4_model, h = "31 days")
atm4_ets_forecast <- forecast(atm4_ets_model, h = "31 days")

Below, the forecasts are plotted with the historical data for each ATM in order to provide context for the forecast data.

atm1_forecast_plot <- atm1_tsibble |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  autolayer(atm1_forecast, series = "Forecast", PI = TRUE) +
  labs(title = "ATM1 ARIMA(0,0,1)(0,1,2)[7] model Forecast with Historical Data",
       x = "Date", y = "Cash") +
  theme_minimal()
atm1_forecast_plot

atm2_forecast_plot <- atm2_tsibble |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  autolayer(atm2_forecast, series = "Forecast", PI = TRUE) +
  labs(title = "ATM2 ARIMA(2,0,2)(0,1,1)[7] model Forecast with Historical Data", 
       x = "Date", y = "Cash") +
  theme_minimal()
atm2_forecast_plot

atm3_forecast_plot <- atm3_tsibble |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  autolayer(atm3_forecast, series = "Forecast", PI = TRUE) +
  labs(title = "ATM3 NAIVE model Forecast with Historical Data",
       x = "Date", y = "Cash") +
  theme_minimal()
atm3_forecast_plot

We can see that the NAIVE model forecast carries the prior available value forward, as expected, which is a suitable approach for the very limited data we have for ATM3.
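
Because a NAIVE forecast simply carries the last observation forward, a quick sanity check (a sketch; output not shown) is that every point forecast for ATM3 equals the final observed Cash value:

# The NAIVE point forecasts (.mean) should all equal the last observed Cash value for ATM3
tail(atm3_tsibble$Cash, 1)
atm3_forecast |> as_tibble() |> distinct(.mean)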

atm4_forecast_plot <- atm4_tsibble |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  autolayer(atm4_forecast, series = "Forecast", PI = TRUE) +
  labs(title = "ATM4 ARIMA(3,0,2)(1,0,0)[7] w/ mean model Forecast 
       with Historical Data", x = "Date", y = "Cash") +
  theme_minimal()
atm4_forecast_plot

atm4_ets_forecast_plot <- atm4_tsibble |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  autolayer(atm4_ets_forecast, series = "Forecast", PI = TRUE) +
  labs(title = "ATM4 ETS(A,N,A) Forecast with Historical Data", x = "Date", y = "Cash") +
  theme_minimal()
atm4_ets_forecast_plot

As expected, the ETS(A,N,A) forecast shows a clear repeating weekly pattern: the model includes an additive seasonal component, so the estimated weekly cycle is projected forward. With no trend component, that cycle simply repeats around a roughly constant level across the forecast horizon.

Summary

The forecasts for ATM1 and ATM2 display consistent seasonality with reasonable prediction intervals. These models effectively capture the weekly cyclic patterns, suggesting that the ARIMA models applied are well suited to the short-term seasonality observed in the historical data. The prediction intervals widen moderately over the forecast horizon, as expected, and remain narrow enough to provide usable predictions for these ATMs.

The forecast for ATM4 is noticeably more volatile than those for ATM1 and ATM2, with much wider prediction intervals. This is somewhat expected given the higher volume of cash withdrawals, the inherent variability in the data, and the more complex model structure required to fit the observed patterns. The constant (mean) term in the ARIMA(3,0,2)(1,0,0)[7] model means the model fits a non-zero average level of cash withdrawals, which may contribute to the apparent flattening of the forecast. With all this in mind, the model still captures the weekly seasonal behavior present in the data, even if it requires a more complex structure to do so.

In contrast, the ETS(A,N,A) model fitted for ATM4 resulted in higher AICc and BIC values than the ARIMA model, indicating a less optimal fit. This ETS specification assumes Additive error, No trend, and Additive seasonality, so it does model the weekly seasonality observed in the time series decomposition; however, without autoregressive or moving average terms it cannot capture the day-to-day dependence that the ARIMA model picks up, and its overall fit by AICc and BIC is weaker. This highlights the importance of checking fit statistics rather than judging models by the visual appearance of their forecasts alone.

Given the better fit of the ARIMA model for ATM4, we will provide forecasts from both the ARIMA and ETS models in our final forecast report. Because the forecast period is in the past relative to the present date of March 2025, we encourage a comparison against the real data from May 2010 to identify which model actually performed better. The ARIMA model is retained for its superior fit, while the ETS model is included for comparison and completeness.

For ATM3, due to the very limited historical data, a NAIVE model was applied. The NAIVE model is appropriate for such a limited dataset because it uses the most recent observation as the best predictor for future values. This ensures that we are able to still provide a forecast for the month of May 2010 without overfitting to insufficient data.

Overall, the forecasts for ATM1 and ATM2 are relatively stable and the prediction intervals are well-contained, confirming that the models effectively capture the weekly seasonality. ATM3’s NAIVE model is our best-guess based on the limited data available. The forecasts for ATM4 are more uncertain in their performance. This is due to the higher variability and complexity of the ATM4 dataset, which presents a greater challenge for accurate forecasting. However, the ARIMA model’s ability to incorporate seasonality makes it the preferred approach.

Combine and report forecasts

Now we will combine our forecasts into a file to provide along with this report. We’ll first convert the forecast fables back to tibbles, convert the cash values back to hundreds of dollars so they align with the original historical data file, then bind the rows and write an xlsx file. Please note, this file includes the two different model forecasts for ATM4. Since ATM4 is a high-volume ATM, and the forecast period is in the past relative to the present date, we expect the business will want to compare the ARIMA and ETS forecasts against the actuals to see which performed more accurately and which it prefers for its purposes. We look forward to that determination based on the actual data for May 2010.

# The point forecasts in a fable are stored in the .mean column (the Cash column holds the
# forecast distribution), so we extract .mean and convert back to hundreds of dollars.
atm1_forecast_tbl <- as_tibble(atm1_forecast) |> 
  transmute(DATE, Cash = .mean / 100, ATM = "ATM1", Model = "ARIMA")

atm2_forecast_tbl <- as_tibble(atm2_forecast) |> 
  transmute(DATE, Cash = .mean / 100, ATM = "ATM2", Model = "ARIMA")

atm3_forecast_tbl <- as_tibble(atm3_forecast) |> 
  transmute(DATE, Cash = .mean / 100, ATM = "ATM3", Model = "NAIVE")

atm4_forecast_tbl <- as_tibble(atm4_forecast) |> 
  transmute(DATE, Cash = .mean / 100, ATM = "ATM4", Model = "ARIMA")

atm4_forecast_ets_tbl <- as_tibble(atm4_ets_forecast) |> 
  transmute(DATE, Cash = .mean / 100, ATM = "ATM4", Model = "ETS")


combined_forecasts <-
  bind_rows(atm1_forecast_tbl, atm2_forecast_tbl, atm3_forecast_tbl, 
            atm4_forecast_tbl, atm4_forecast_ets_tbl)

write.xlsx(combined_forecasts, file = "ATMForecastsMay2010.xlsx")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Prompt: Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Loading and Inspecting the data

power_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx", 
                       sheet = "ResidentialCustomerForecastLoad")
glimpse(power_data)
Rows: 192
Columns: 3
$ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
$ `YYYY-MMM`   <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
$ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…

First, we’ll convert the `YYYY-MMM` column, currently a character string representing the year and month of each observation, to a yearmonth type so we can use it as the index of a tsibble below. We’ll store the result in a new column named DATE.

power_data <- power_data |> 
  mutate(DATE = yearmonth(`YYYY-MMM`))

It seems that the CaseSequence column is not relevant to our inquiry at hand, so we won’t select it when we convert power_data to a tsibble for time series analysis.

Checking for Null Values

sum(is.na(power_data))
[1] 1

We can see that we have one null value in the data, which occurs in September 2008. We’ll need to address this before we can utilize this in any modeling.

power_data |> filter(is.na(KWH))
# A tibble: 1 × 4
  CaseSequence `YYYY-MMM`   KWH     DATE
         <dbl> <chr>      <dbl>    <mth>
1          861 2008-Sep      NA 2008 Sep

Visualize power usage over time

power_data |> 
  ggplot(aes(x = DATE, y = KWH)) +
  geom_line() +
  labs(title = "Residential Power Usage Over Time", x = "Date", y = "KWH") +
  theme_minimal()

We can clearly see both the missing (null) value where there is a clear line break, and an outlier that is extremely outside the bounds of the otherwise relatively consistent variance.

The break in the monthly pattern corresponds to the null value, which sits between the summer peak in KWH usage and the drop in usage in fall. Rather than linear interpolation, we’ll replace the missing value (September 2008) with the average of all other September values across years.

Calculate the average KWH for all other September months

sept_avg <- power_data |> 
  filter(month(DATE) == 9 & !is.na(KWH)) |> 
  summarise(Avg_Sept = mean(KWH)) |> 
  pull(Avg_Sept)

sept_avg
[1] 7702333

Impute missing September 2008 with the calculated September average

power_data <- power_data |> 
  mutate(KWH = ifelse(is.na(KWH), sept_avg, KWH))
power_data |> 
  ggplot(aes(x = DATE, y = KWH)) +
  geom_line() +
  labs(title = "Residential Power Usage Over Time (Null Adjusted)", 
       x = "Date", y = "KWH") +
  theme_minimal()

Now that we’ve handled the null value, we need to address the glaring outlier.

Visualizing outliers

power_data |> 
  ggplot(aes(y = KWH)) +
  geom_boxplot() +
  labs(title = "Boxplot of KWH to Identify Outliers") +
  theme_minimal()

This confirms that there is just the one previously observed extreme outlier we need to address: a single value far below the range of all other observations, as seen in the earlier line plot and again in this boxplot.

Since this verified outlier is the lowest value in the entire dataset, we can locate it via the minimum KWH value and replace it with the average KWH for the same calendar month across all other years. This mirrors how we imputed the null value and bases the correction on seasonal patterns in the data rather than linear interpolation, which we might prefer if the data were finer-grained (daily or hourly) rather than monthly.

We ran View(power_data) and sorted by KWH to confirm that the outlier occurs in July 2010. The below code identifies the index at which it occurs, which we’ll use to replace the outlier with the average of the same month across all years we have in the dataset.

outlier_index <- which.min(power_data$KWH)
outlier_index
[1] 151
outlier_date <- power_data$DATE[outlier_index]
outlier_month <- month(outlier_date)
same_month_avg <- power_data |> 
  filter(month(DATE) == outlier_month & !is.na(KWH) & DATE != outlier_date) |> 
  summarise(Avg_Month = mean(KWH)) |> 
  pull(Avg_Month)

same_month_avg
[1] 7852521

This is the value that we’ll impute the outlier with.

power_data$KWH[outlier_index] <- same_month_avg
power_data |> 
  ggplot(aes(x = DATE, y = KWH)) +
  geom_line() +
  labs(title = "Residential Power Usage Over Time (Outlier Adjusted)", 
       x = "Date", y = "KWH") +
  theme_minimal()

power_data |> 
  ggplot(aes(y = KWH)) +
  geom_boxplot() +
  labs(title = "Boxplot of KWH After Outlier Adjustment") +
  theme_minimal()

The single extreme outlier we observed previously is no longer part of the dataset; it has been replaced with a value that is a much more plausible representation of July 2010, a month when KWH usage is certainly not the seasonal low the original value suggested. Other approaches exist for handling the outlier, such as multiplying the value by an order of magnitude (treating it as a dropped-digit data-entry error), but we believe the approach selected best represents the data over time, which is what we aim for when modeling and forecasting.

The missing value for September 2008 was filled with the average KWH from all other September months across years. This approach maintains seasonal consistency without introducing bias. The outlier, identified as a single extreme low value, was replaced with the average KWH of the same month from other years. This approach ensures the correction is based on realistic values. Now that we’ve handled missing values and outliers effectively and in a situationally relevant manner, we are ready to move on with our model fit analysis toward forecasting.

Converting to tsibble

Converting our dataset to a tsibble will enable us to perform time series decomposition, model selection, model fit and forecasting. We can use the DATE which is in yearmonth format as the index.

power_tsibble <- power_data |> 
  select(DATE, KWH) |> 
  as_tsibble(index = DATE)

power_tsibble
# A tsibble: 192 x 2 [1M]
       DATE     KWH
      <mth>   <dbl>
 1 1998 Jan 6862583
 2 1998 Feb 5838198
 3 1998 Mar 5420658
 4 1998 Apr 5010364
 5 1998 May 4665377
 6 1998 Jun 6467147
 7 1998 Jul 8914755
 8 1998 Aug 8607428
 9 1998 Sep 6989888
10 1998 Oct 6345620
# ℹ 182 more rows

Time Series Decomposition

dec_power <- power_tsibble |> 
  model(STL(KWH ~ season(window = "periodic")))
dec_components <- components(dec_power)

autoplot(dec_components) +
  labs(title = "Time Series Decomposition of Residential Power Usage") +
  theme_minimal()

We can observe several important points in this data. First, there is a steadily increasing trend overall, though with some dips and peaks over time. Second, there is a clear seasonal pattern with peaks in both summer and winter, when increased KWH usage can be attributed (in the northern hemisphere) to heating in winter and cooling in summer. There is also notable variability in the remainder component of the decomposition, spanning a range similar to that of the seasonal component.

Check stationarity using KPSS test

kpss_test <- unitroot_kpss(power_tsibble$KWH)
print(kpss_test)
  kpss_stat kpss_pvalue 
   1.678341    0.010000 

The KPSS test returns a p-value of 0.01, so we reject the null hypothesis of stationarity: the series is non-stationary and differencing will be required for ARIMA modeling.
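
As with the ATM series, a complementary check (a minimal sketch; output not shown) is to see how many seasonal and non-seasonal differences the unit-root features suggest before letting ARIMA() choose them automatically:

# Suggested seasonal (period 12) and first differences for the monthly KWH series
power_tsibble |> 
  features(KWH, list(unitroot_ndiffs, unitroot_nsdiffs))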

Model Fit

We’ll fit an ARIMA model first and then an ETS model for comparison.

arima_model <- power_tsibble |> 
  model(ARIMA(KWH ~ pdq() + PDQ()))

arima_report <- report(arima_model)
Series: KWH 
Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift 

Coefficients:
         ma1     ma2     ma3      ma4     sar1     sar2   constant
      0.3299  0.0327  0.2205  -0.0690  -0.6942  -0.4122  226018.24
s.e.  0.0781  0.0823  0.0748   0.0785   0.0779   0.0789   71599.96

sigma^2 estimated as 3.798e+11:  log likelihood=-2655.62
AIC=5327.25   AICc=5328.09   BIC=5352.79
arima_report
# A mable: 1 x 1
        `ARIMA(KWH ~ pdq() + PDQ())`
                             <model>
1 <ARIMA(0,0,4)(2,1,0)[12] w/ drift>

The ARIMA(0,0,4)(2,1,0)[12] w/ drift model was automatically selected by the fable package, which determined the appropriate AR, MA, and differencing orders. Differencing was applied at the seasonal level (D = 1) with a seasonal period of 12 (monthly). The AICc is 5328.09 and the BIC is 5352.79.

arima_model |> 
  gg_tsresiduals() +
  labs(title = "Residuals for ARIMA(0,0,4)(2,1,0)[12] w/ drift model")

We can confirm that the residuals appear to be white noise.

ets_model <- power_tsibble |> 
  model(ETS(KWH))

ets_report <- report(ets_model)
Series: KWH 
Model: ETS(M,A,M) 
  Smoothing parameters:
    alpha = 0.0732315 
    beta  = 0.001550427 
    gamma = 0.0001014573 

  Initial states:
    l[0]     b[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]   s[-5]
 6211908 6824.427 0.9597096 0.7471235 0.8643818 1.179736 1.261062 1.19889
   s[-6]     s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
 1.00242 0.7689967 0.8067894 0.9173301 1.068384 1.225178

  sigma^2:  0.009

     AIC     AICc      BIC 
6140.081 6143.599 6195.459 
ets_report
# A mable: 1 x 1
    `ETS(KWH)`
       <model>
1 <ETS(M,A,M)>
ets_model |> 
  gg_tsresiduals() +
  labs(title = "Residuals for ETS(M,A,M) model")

The residuals plot for the ETS model suggests the residuals are not purely white noise: the ACF plot in the bottom-left panel shows several significant spikes beyond the blue dashed bounds, which we did not see for the ARIMA model. This suggests there are elements of the data the ETS model does not fully capture.

The ETS model is suitable for capturing trend and seasonality and does not require stationarity. The AICc for the ETS model (6143.599) is higher than for the ARIMA model (AICc = 5328.09), and the ARIMA model also incorporates seasonal differencing. The BIC is likewise lower for the ARIMA model (BIC = 5352.79) than for the ETS model (BIC = 6195.459), again indicating that the ARIMA model is the better fit in this circumstance.

ARIMA appears better suited to this situation: the data shows trend and seasonality with non-stationarity, which the ARIMA model can handle, and it captures autoregressive and moving average structure in the data. However, the two models produce their forecasts differently, and there may be a business case for reviewing and comparing the forecast results from both.
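
Beyond in-sample fit statistics, a simple hold-out comparison (a sketch, not part of the original analysis; it assumes we reserve the final 12 months, calendar year 2013, as a test set and lets ARIMA() and ETS() re-select their specifications on the shortened training data) would let us compare out-of-sample accuracy:

# Refit both model classes on data through December 2012 and score their 2013 forecasts
power_train <- power_tsibble |> filter(DATE <= yearmonth("2012 Dec"))
holdout_fits <- power_train |> 
  model(arima = ARIMA(KWH), ets = ETS(KWH))
holdout_fits |> 
  forecast(h = "12 months") |> 
  accuracy(power_tsibble)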

Forecasts

To ensure we understand the differences in these approaches, we’ll take a look at the different forecasts resulting from the ARIMA vs ETS models.

power_forecast_arima <- forecast(arima_model, h = '12 months')
power_forecast_ets <- forecast(ets_model, h = '12 months')
autoplot(power_tsibble) +
  autolayer(power_forecast_arima, 
            series = 'ARIMA Forecast', PI = TRUE, color = "purple") +
  labs(title = 'ARIMA(0,0,4)(2,1,0)[12] w/ drift model Forecasts for 2014', 
       x = 'Date', y = 'KWH') +
  theme_minimal()
Plot variable not specified, automatically selected `.vars = KWH`
autoplot(power_tsibble) +
  autolayer(power_forecast_ets, 
            series = 'ETS Forecast', PI = TRUE, color = "green") +
  labs(title = 'ETS(M,A,M) Forecasts for 2014', x = 'Date', y = 'KWH') +
  theme_minimal()
Plot variable not specified, automatically selected `.vars = KWH`

Summary

While the ARIMA model is believed to be the superior choice for forecasting residential power usage, it is evident that the forecasts produced by the ARIMA and ETS models appear similar in their general trend and seasonal pattern. Both models successfully capture the overall increasing trend and the clear seasonal fluctuations present in the data, which are likely associated with increased power usage during the summer and winter months.

The ARIMA model’s prediction intervals, shown by the shaded areas, are notably wider than those produced by the ETS model. This indicates greater forecast uncertainty in the ARIMA model, which is not unexpected when differencing is applied to achieve stationarity. The wider intervals are a reminder of the trade-off between point-forecast precision and honestly quantified uncertainty when forecasting from historical data.

The ETS(M,A,M) model selected stands for Multiplicative error, Additive trend, and Multiplicative seasonality: forecast errors are proportional to the level of the series, the trend is additive, and the seasonal effect is multiplicative. While the ETS(M,A,M) model provides a smooth forecast with well-defined seasonal patterns, it may not capture the autoregressive structure the ARIMA model is better equipped to identify. Nonetheless, the ETS model appears reasonably effective at generating forecasts driven by the seasonal pattern and smooth level adjustments. However, its residual diagnostics suggest the white noise assumption may not be fully satisfied, indicating potential model inadequacy.

Although we prefer and recommend the ARIMA model due to its superior performance in terms of AICc and BIC values, we are including forecasts from both models in our final Excel file. This dual approach allows the business to review and determine which forecast better suits their needs. Additionally, providing both models facilitates a comparison of their performance once the actual data for 2014 becomes available. This will allow us to assess the relative accuracy of the models and confirm whether the ARIMA model’s complexity provides a substantial advantage over the simpler, seasonal-driven ETS model.

We look forward to the business reviewing the forecasts and providing feedback on which model proves more accurate based on their practical application and comparison with the actual data for 2014.

Combine and report for business

forecast_tbl_arima <- as_tibble(power_forecast_arima)
forecast_tbl_ets <- as_tibble(power_forecast_ets)

# As with the ATM forecasts, the point forecasts are in the .mean column (KWH holds the distribution)
combined_forecasts <- bind_rows(
  forecast_tbl_arima |> transmute(DATE, KWH = .mean, Model = 'ARIMA'),
  forecast_tbl_ets |> transmute(DATE, KWH = .mean, Model = 'ETS')
)

write.xlsx(combined_forecasts, file = 'ResidentialPowerForecast2014.xlsx')