Data 624: Project 1

Part A - ATM Forecast, ATM624Data.xlsx**

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010.The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

For part A, I read in the atm dataset and then formatted the date variable. I then filtered the data to create 4 different datasets, each containing the 4 atms. This process automatically removed some of the NAs/missing data. The total NAs/missing data before I filtered the data was 19 rows: 14 of which had missing ATMs and missing Cash. The other 5 were from ATM1 and ATM2 which I later decided to impute. Each dataset or each ATM has 365 observations.

atm <- read_excel("C:/Users/Kesha/Desktop/Spring 2025/DATA 624/ATM624Data.xlsx")
head(atm)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

# Change to sate format
atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")

glimpse(atm)

## Rows: 1,474
## Columns: 3
## $ DATE <date> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009…
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …

# Look at missing data
which(is.na(atm), arr.ind = TRUE)

##       row col
##  [1,] 731   2
##  [2,] 732   2
##  [3,] 733   2
##  [4,] 734   2
##  [5,] 735   2
##  [6,] 736   2
##  [7,] 737   2
##  [8,] 738   2
##  [9,] 739   2
## [10,] 740   2
## [11,] 741   2
## [12,] 742   2
## [13,] 743   2
## [14,] 744   2
## [15,]  87   3
## [16,]  93   3
## [17,]  98   3
## [18,] 105   3
## [19,] 110   3
## [20,] 731   3
## [21,] 732   3
## [22,] 733   3
## [23,] 734   3
## [24,] 735   3
## [25,] 736   3
## [26,] 737   3
## [27,] 738   3
## [28,] 739   3
## [29,] 740   3
## [30,] 741   3
## [31,] 742   3
## [32,] 743   3
## [33,] 744   3

atm1 <- atm |>
  filter(ATM == "ATM1")

atm2 <- atm |>
  filter(ATM == "ATM2")

atm3 <- atm |>
  filter(ATM == "ATM3")

atm4 <- atm |>
  filter(ATM == "ATM4")

Imputations: For dataset ATM1 and ATM2, I decided to do KNN (K-Nearest Neighbors) imputation which takes the average of the nearest neighbor data points to fill in the missing data.

atm1_imput <- kNN(atm1, k = 3)

atm2_imput <- kNN(atm2, k = 3)

Changing the datasets to a tsibble time series to prepare for forecasting.

atm1_ts <- atm1_imput |>
  as_tsibble(key = ATM, index = DATE)

atm2_ts <- atm2_imput |>
  as_tsibble(key = ATM, index = DATE)

atm3_ts <- atm3 |>
  as_tsibble(key = ATM, index = DATE)

atm4_ts <- atm4 |>
  as_tsibble(key = ATM, index = DATE)

Plotting each ATM time series:

ATM1: There does not seem to be an obvious trend in this time series, but there does appear to be some seasonality. ATM2: There does not seem to be an obvious trend in this time series, but there does appear to be some seasonality. ATM3: There is not enough data in this time series to determine trend or seasonality. ATM4: There does not seem to be an obvious trend in this time series, but there does appear to be some seasonality.

# ATM 1
autoplot(atm1_ts, Cash)

gg_season(atm1_ts, Cash)

ACF(atm1_ts, Cash) |> autoplot()

# ATM 2
autoplot(atm2_ts, Cash)

gg_season(atm2_ts, Cash)

ACF(atm2_ts, Cash) |> autoplot()

# ATM 3
autoplot(atm3_ts, Cash)

gg_season(atm3_ts, Cash)

ACF(atm3_ts, Cash) |> autoplot()

# ATM 4
autoplot(atm4_ts, Cash)

gg_season(atm4_ts, Cash)

ACF(atm4_ts, Cash) |> autoplot()

ATM1

I did an STL model to look at the trends and seasonality of the time series to figure out how to proceed. The results of the STL decomposition graph tell us that there is no obvious or apparent trend for ATM1. The trend component shows, upward and downward movement throughout. In terms of seasonality, the figure showing the seasonal component from the decomposition clearly tells us that throughout these two years, there are certain times of the month and year where the amount of cash increases and decreases - indicating seasonality. In terms of outliers, we see an outlier occurring a few times throughout, sometime in the early summer of 2009, late July of 2009, early Autumn of 2009, and later 2009 - where there are drastic decreases.

#STL
atm1_ts |>
  model(stl = STL(Cash)) |>
  components() |>
  autoplot()

I proceeded to do a box-cox transformation to make the data more normal and stabilize the variance at lambda = 0.2378545. The transformation helped us see where the drastic decreases in cash withdrawl occurred and the overall plot more clearly.

#Box cox tranformation

lambda_atm1 <- atm1_ts |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda_atm1

## [1] 0.2378545

atm1_ts |>
  autoplot(box_cox(Cash, lambda_atm1)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM 1 Cash with $\\lambda$ = ",
         round(lambda_atm1,4))))

For ATM1, we see there is seasonality but no significant trend. I decided to use SNAIVE() since it accounts for seasonality, but not trend. The forecast plot using SNAVIE() shows the last season’s pattern in the time series repeated for the next month (May 2010).

The histogram of residuals looks somewhat unimodal and normally distributed. The ACF of the residuals shows four lags out of the the bounds, one of them being far out the lower limit, but most of the lags are somewhat close to zero. The residuals do not appear to be white noise since there does appear to be autocorrelation in the residual series.

From the Box-Pierce test and the Ljung-Box test, we see that the results are significant since the p-values are 2.048584e-12 and 9.132695e-13, respectively. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise. This also suggest that the SNAIVE() model is probably not the best fit for the data.

The accuracy of the training data shows that the RMSE value is 27.85452, which will be used to compare with the proceeding models.

#SNAIVE model

atm1_ts %>% 
  model(SNAIVE(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm1_ts)

# Fit the model
fit <- atm1_ts |> model(SNAIVE(Cash))

# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

fit |> accuracy()

## # A tibble: 1 × 11
##   ATM   .model       .type         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr> <chr>        <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1  SNAIVE(Cash) Training -0.0363  27.9  17.8 -96.6  116.     1     1 0.152

# Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(fit) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 4
##   ATM   .model       bp_stat bp_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM1  SNAIVE(Cash)    76.9  2.05e-12

augment(fit) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 4
##   ATM   .model       lb_stat lb_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM1  SNAIVE(Cash)    78.7  9.13e-13

I then decided to use the ETS() function to estimate the equivalent model for simple exponential smoothing. The ETS model recommended is the ETS(A,N,A) which is error(“A”) + trend(“N”) + season(“A”) - additive error, no trend, and additive seasonality.

Comparing the SNAIVE() model with the ETS(A,N,A) model, I prefer the ETS(A,N,A) model since it has a smaller RMSE value, which means it may have best captured the seasonality better and predicted with better accuracy. I also opted to look at the AIC value for model comparison - the box-cox transformed ETS(A,N,A) model had an AIC of 2322.326, while the ETS(A,N,A) model had an AIC of 4487.370. However, since I wanted the better prediction, I went with the RMSE value and it looks like the ETS(A,N,A) model also preformed better than the box-cox transformed ETS(A,N,A) model with an RMSE value of 23.80119 compared to 25.07023.

#ETS model
atm_ets <- atm1_ts |>
  model(fit = ETS(Cash))
report(atm_ets)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.01401495 
##     gamma = 0.3279311 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]      s[-3]    s[-4]    s[-5]    s[-6]
##  81.58814 -64.55687 7.545899 15.56121 -0.8582085 16.66588 10.95651 14.68558
## 
##   sigma^2:  580.8183
## 
##      AIC     AICc      BIC 
## 4487.370 4487.991 4526.369

# ETS of ANA and box-cox ANA
atm_ets2 <- atm1_ts |>
  model(ets = ETS(Cash ~ error("A") + trend("N") + season("A")),
        ets_box = ETS(box_cox(Cash, lambda_atm1) ~ error("A") + trend("N") + season("A")),)

# Check RMSE
accuracy(atm_ets2)

## # A tibble: 2 × 11
##   ATM   .model  .type        ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr> <chr>   <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1  ets     Training -0.249  23.8  15.1 -106.   121. 0.847 0.854 0.127
## 2 ATM1  ets_box Training  2.35   25.1  16.1  -91.7  111. 0.905 0.900 0.117

# Check AIC
glance(atm_ets2) |> arrange(AICc) |> select(.model:BIC)

## # A tibble: 2 × 6
##   .model  sigma2 log_lik   AIC  AICc   BIC
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ets_box   1.54  -1151. 2322. 2323. 2361.
## 2 ets     581.    -2234. 4487. 4488. 4526.

# Forecast ets model
atm_fc <- atm_ets2 |> forecast(h = '30 days')

# Plot ets model of atm1
atm_fc |>
  autoplot(atm1_ts, level = NULL) +
  labs(y="Cash", title="ATM1 Cash: ETS(A,N,A) Model")

I then decided to try ARIMA, to see if it is a better model for forecasting ATM1. The ARIMA model recommended is ARIMA(0,0,1) which indicated no autoregressive (AR) terms, no differencing, and 1 MA term and the seasonal part (0,1,2) indicates no seasonal AR terms, 1 seasonal differencing, and 2 seasonal MA terms.

The ARIMA(0,0,1)(0,1,2) without a Box-Cox transformation, had an AIC of 3292.217 and an RMSE of 23.39736, being the smallest RMSE value of all of the models.

The histogram of the residuals looks unimodal and somewhat normally distributed. The ACF of the residuals shows that none of the lags are out of bound and all the lags are somewhat close to zero and there is no obvious patterns in the model. The residuals do look like its white noise since there appears to be some autocorrelation in the residual series.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.1708509 and 0.1573743, respectively. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This means the ARIMA model was fairly good at fitting and predicting ATM1.

#ARIMA
atm1_ts|>
  model(ARIMA(Cash))

## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM              `ARIMA(Cash)`
##   <chr>                  <model>
## 1 ATM1  <ARIMA(0,0,1)(0,1,2)[7]>

# ACF anf PCF of atm1
atm1_ts |> gg_tsdisplay(Cash)

#ARIMA of ATM1
arm_arm2  <- atm1_ts|>
  model(arima = ARIMA(Cash ~ pdq(0,0,1) + PDQ(0,1,2)),
        arima_box = ARIMA(box_cox(Cash, lambda_atm1)~ pdq(0,0,1) + PDQ(0,1,2))) 

#RMSE and AIC of ATM 1
report(arm_arm2)

## Warning in report.mdl_df(arm_arm2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.

## # A tibble: 2 × 9
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots  
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>    
## 1 ATM1  arima     563.    -1642. 3292. 3292. 3308. <cpl [0]> <cpl [15]>
## 2 ATM1  arima_box   1.53   -584. 1176. 1176. 1192. <cpl [0]> <cpl [15]>

accuracy(arm_arm2)

## # A tibble: 2 × 11
##   ATM   .model    .type         ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>     <chr>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM1  arima     Training -0.0736  23.4  14.7 -103.   118. 0.825 0.840 -0.00798
## 2 ATM1  arima_box Training  2.46    24.9  15.7  -90.3  109. 0.880 0.895 -0.00266

#Residuals plot
arm_arm2 |> select(arima)|> gg_tsresiduals(lag=36)

# Residuals: Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(arm_arm2) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 2 × 4
##   ATM   .model    bp_stat bp_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM1  arima        11.7     0.302
## 2 ATM1  arima_box    12.5     0.252

augment(arm_arm2) |> features(.innov, ljung_box, lag=10)

## # A tibble: 2 × 4
##   ATM   .model    lb_stat lb_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM1  arima        12.0     0.285
## 2 ATM1  arima_box    12.7     0.239

# Plot ARIMA
arm_arm2 %>% 
  forecast(h="30 days") %>%
  autoplot(atm1_ts) +
  labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)")

Conclusion for ATM1:

The best model to predict the next month (May 2010) of the time series ATM1 is ARIMA(0,0,1)(0,1,2), with an RMSE value of 23.39736 - the smallest RMSE value of all of the models. I filtered for April 2010 and onward to get a better look at the prediction. The forecast looks reasonable to me, especially since it followed the same pattern in the time series.

#Filter ATM1
atm1_ts_plot <- atm1_ts |>
  filter(DATE >= "2010-04-01")

# Plot ARIMA for ATM1
atm1_plot <- atm1_ts|>
  model(ARIMA(Cash))

atm1_plot %>% 
  forecast(h="30 days") %>%
  autoplot(atm1_ts_plot) +
  labs(title = "ATM1: ARIMA(0,0,1)(0,1,2)")

Excel spreadsheet for forecast

atm1_excel <- atm1_plot %>% 
  forecast(h="30 days")

ATM2

I once again did an STL model to look at the trends and seasonality of the time series to figure out how to proceed. The results of the STL decomposition graph tell us that there is no obvious or apparent trend for ATM2. This trend component also shows upward and downward movement throughout. In terms of seasonality, the figure showing the seasonal component from the decomposition clearly tells us that throughout these two years, there are certain times of the month and year where the amount of cash increases and decreases - indicating seasonality. In terms of outliers, we see an outlier occurring a few times throughout, sometime in the early summer of 2009, late July of 2009, early Autumn of 2009, and a few others in late 2009 to early 2010 - where there are some drastic decreases.

#STL
atm2_ts |>
  model(stl = STL(Cash)) |>
  components() |>
  autoplot()

I proceeded to do a box-cox transformation to make the data more normal and stabilize the variance at lambda = 0.7408956. The transformation did not seem to help very much in this time series, but I still incorporated in the ETS and ARIMA, just for comparison.

#Box cox tranformation

lambda_atm2 <- atm2_ts |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda_atm2

## [1] 0.7408956

atm2_ts |>
  autoplot(box_cox(Cash, lambda_atm2)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM 1 Cash with $\\lambda$ = ",
         round(lambda_atm2,2))))

For ATM2, we see there is seasonality but no significant trend. I decided to use SNAIVE() since it accounts for seasonality, but not trend. The forecast plot using SNAVIE() shows the last season’s pattern in the time series repeated for the next month (May 2010).

The histogram of residuals looks somewhat unimodal and normally distributed. The ACF of the residuals shows a few lags out of the the bounds, one of them being far out the lower limit, but most of the lags are somewhat close to zero. The residuals do not appear to be white noise since there does appear to be autocorrelation in the residual series.

From the Box-Pierce test and the Ljung-Box test, we see that the results are significant since the p-values are 5.551115e-16 and 2.220446e-16 3, respectively. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise. This also suggest that the SNAIVE() model is probably not the best fit for the data.

The accuracy of the training data shows that the RMSE value is 30.64785 , which will be used to compare with the proceeding models.

#SNAIVE model

atm2_ts %>% 
  model(SNAIVE(Cash)) %>%
  forecast(h = '30 days') %>%
  autoplot(atm2_ts)

# Fit the model
fit2 <- atm2_ts |> model(SNAIVE(Cash))

# Look at the residuals
fit2 |> gg_tsresiduals()

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

fit2 |> accuracy()

## # A tibble: 1 × 11
##   ATM   .model       .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>        <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM2  SNAIVE(Cash) Training 0.0223  30.6  21.1  -Inf   Inf     1     1 0.0367

# Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(fit2) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 4
##   ATM   .model       bp_stat bp_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM2  SNAIVE(Cash)    95.0  5.55e-16

augment(fit2) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 4
##   ATM   .model       lb_stat lb_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM2  SNAIVE(Cash)    97.4  2.22e-16

Comparing the SNAIVE() model with the ETS(A,N,A) model, I prefer the ETS(A,N,A) model since it has the smaller RMSE value, which means it may have best captured the seasonality better and predicted with better accuracy. I also opted to look at the AIC value for model comparison - the box-cox transformed ETS(A,N,A) model had an AIC of 2519.843, while the ETS(A,N,A) model had an AIC of 4534.620. However, since I wanted the better prediction, I went with the RMSE value and it looks like the ETS(A,N,A) model also preformed better than the box-cox transformed ETS(A,N,A) model with an RMSE value of 25.39273 compared to 27.06020.

#ETS model
atm2_ets <- atm2_ts |>
  model(fit = ETS(Cash))
report(atm2_ets)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000428 
##     gamma = 0.353788 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]     s[-3]    s[-4]    s[-5]    s[-6]
##  71.58026 -40.90334 -29.14207 19.32742 -20.26629 24.12002 20.43491 26.42935
## 
##   sigma^2:  661.0916
## 
##      AIC     AICc      BIC 
## 4534.620 4535.242 4573.619

# ETS of ANA and box-cox ANA
atm2_ets2 <- atm2_ts |>
  model(ets = ETS(Cash ~ error("A") + trend("N") + season("A")),
        ets_box = ETS(box_cox(Cash, lambda_atm1) ~ error("A") + trend("N") + season("A")),)

# Check RMSE
accuracy(atm2_ets2)

## # A tibble: 2 × 11
##   ATM   .model  .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM2  ets     Training -0.641  25.4  18.0  -Inf   Inf 0.856 0.829 0.0167
## 2 ATM2  ets_box Training  3.27   27.1  18.8  -Inf   Inf 0.891 0.883 0.0119

# Check AIC
glance(atm2_ets2) |> arrange(AICc) |> select(.model:BIC)

## # A tibble: 2 × 6
##   .model  sigma2 log_lik   AIC  AICc   BIC
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ets_box   2.65  -1250. 2520. 2520. 2559.
## 2 ets     661.    -2257. 4535. 4535. 4574.

# Forecast ets model
atm2_fc <- atm2_ets |> forecast(h = '30 days')

# Plot ETS model of ATM2
atm2_fc |>
  autoplot(atm2_ts, level = NULL) +
  labs(y="Cash", title="ATM2 Cash: ETS(A,N,A) Model")

I then decided to try ARIMA, to see if it is a better model for forecasting ATM2. The ARIMA model recommended is ARIMA(2,0,2) which indicated 2 autoregressive (AR) terms, no differencing, and 2 MA term and the seasonal part (0,1,1) indicates no seasonal AR terms, 1 seasonal differencing, and 1 seasonal MA terms.

The ARIMA(2,0,2)(0,1,1) model without a Box-Cox transformation, had an AIC of 3330.258 and an RMSE of 24.50838, being the smallest value of all of the models.

The histogram of the residuals looks unimodal and somewhat normally distributed. The ACF of the residuals shows that most of the lags are in bound and all the lags are somewhat close to zero and there is no obvious patterns in the model. The residuals do look like its white noise since there appears to be some autocorrelation in the residual series.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.5791602 and 0.5620855, respectively. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This means the ARIMA model was good at fitting and predicting ATM2.

#ARIMA
atm2_ts|>
  model(ARIMA(Cash))

## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM              `ARIMA(Cash)`
##   <chr>                  <model>
## 1 ATM2  <ARIMA(2,0,2)(0,1,1)[7]>

# ACF anf PCF of ATM2
atm2_ts |> gg_tsdisplay(Cash)

#ARIMA of ATM2
atm2_arm  <- atm2_ts|>
  model(arima = ARIMA(Cash ~ pdq(2,0,2) + PDQ(0,1,1)),
        arima_box = ARIMA(box_cox(Cash, lambda_atm1)~ pdq(2,0,2) + PDQ(0,1,1))) 

#RMSE and AIC of ATM 2
report(atm2_arm)

## Warning in report.mdl_df(atm2_arm): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.

## # A tibble: 2 × 9
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 ATM2  arima     621.    -1659. 3330. 3330. 3354. <cpl [2]> <cpl [9]>
## 2 ATM2  arima_box   2.56   -675. 1363. 1363. 1386. <cpl [2]> <cpl [9]>

accuracy(atm2_arm)

## # A tibble: 2 × 11
##   ATM   .model    .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>     <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM2  arima     Training -0.910  24.5  17.3  -Inf   Inf 0.819 0.800 -0.00452
## 2 ATM2  arima_box Training  2.56   26.4  18.3  -Inf   Inf 0.870 0.863 -0.0107

#Residuals plot
atm2_arm  |> select(arima)|> gg_tsresiduals(lag=36)

# Residuals: Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(atm2_arm) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 2 × 4
##   ATM   .model    bp_stat bp_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM2  arima        8.51     0.579
## 2 ATM2  arima_box   10.5      0.395

augment(atm2_arm) |> features(.innov, ljung_box, lag=10)

## # A tibble: 2 × 4
##   ATM   .model    lb_stat lb_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM2  arima        8.69     0.562
## 2 ATM2  arima_box   10.8      0.377

# Plot ARIMA
atm2_arm%>% 
  forecast(h="30 days") %>%
  autoplot(atm2_ts) +
  labs(title = "ATM1: ARIMA(2,0,2)(0,1,1)")

Conclusion for ATM2:

The best model to predict the next month (May 2010) of the time series of ATM2 is ARIMA(2,0,2)(0,1,1), with an RMSE value of 24.50838 - the smallest RMSE value of all of the models. I filtered for April 2010 and onward to get a better look at the prediction. The forecast looks reasonable to me, especially since it followed the same pattern in the dataset.

#Filter ATM1
atm2_ts_plot <- atm2_ts |>
  filter(DATE >= "2010-04-01")

# Plot ARIMA for ATM1
atm2_plot <- atm2_ts|>
  model(ARIMA(Cash))

atm2_plot %>% 
  forecast(h="30 days") %>%
  autoplot(atm2_ts_plot) +
  labs(title = "ATM2: ARIMA(2,0,2)(0,1,1)")

Excel spreadsheet for forecast

atm2_excel <- atm2_plot %>% 
  forecast(h="30 days")

ATM3

I did an STL model to look at the trends and seasonality of the time series to figure out how to proceed. The results of the STL decomposition graph tell us that there isn’t enough data to determine trends or seasonality. In fact, there were no withdrawls until the last 3 dates in the time series. It looks like there could be a seasonality at some point, but there is too little data to draw conclusions.

#STL
atm3_ts |>
  model(stl = STL(Cash)) |>
  components() |>
  autoplot()

For ATM3, we cannot fully determine seasonality or trend. I decided to use NAIVE() since it takes the value of the last data point to create the forecast, for the next 30 days (May 2010) and does not account for trend or seasonality.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.5876272 and 0.580792, respectively. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This also suggest that the SNAIVE() model is a good fit for the data.

The accuracy of the training data shows that the RMSE value is 5.087423.

#NAIVE model

atm3_ts %>% 
  model(NAIVE(Cash)) %>%
  forecast(h = '30 days') %>%
  autoplot(atm3_ts)

# Fit the model
fit3 <- atm3_ts |> model(NAIVE(Cash))

# Look at the residuals
fit3 |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

fit3 |> accuracy()

## # A tibble: 1 × 11
##   ATM   .model      .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>       <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM3  NAIVE(Cash) Training 0.234  5.09 0.310  28.8  40.2 0.423 0.632 -0.149

# Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(fit3) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 4
##   ATM   .model      bp_stat bp_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM3  NAIVE(Cash)    8.42     0.588

augment(fit3) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM3  NAIVE(Cash)    8.49     0.581

Conclusion for ATM3:

Since we do not have a lot of data to work with here, it is difficult to really predict the forecast, so I stuck with the NAIVE() model. I filtered for April 2010 and onward to get a better look at the prediction. The forecast picks up from the last data point, which seems like the best option given our dataset without many observations.

#Filter ATM3
atm3_ts_plot <- atm3_ts |>
  filter(DATE >= "2010-04-01")

# Plot NAIVE for ATM3
atm3_plot <- atm3_ts|>
  model(NAIVE(Cash))

atm3_plot %>% 
  forecast(h="30 days") %>%
  autoplot(atm3_ts_plot) +
  labs(title = "ATM3: NAIVE())")

Excel spreadsheet for forecast

atm3_excel <- atm3_plot %>% 
  forecast(h="30 days")

ATM4

I did an STL model to look at the trends and seasonality of the time series to figure out how to proceed. The results of the STL decomposition graph tell us that there is no obvious or apparent trend for ATM4. The trend component shows upward and downward movement throughout. In terms of seasonality, the figure showing the seasonal component from the decomposition clearly tells us that throughout these two years, there are certain times of the month and year where the amount of cash increases and decreases - indicating seasonality. In terms of outliers, we see a large outlier with a very dramatic increase in the early months of 2010.

#STL
atm4_ts |>
  model(stl = STL(Cash)) |>
  components() |>
  autoplot()

I proceeded to do a box-cox transformation to make the data more normal and stabilize the variance at lambda = -0.0737252. The transformation did not seem to help very much in this time series, but I still incorporated in the ETS and ARIMA, just for comparison.

#Box cox transformation

lambda_atm4 <- atm4_ts |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda_atm4

## [1] -0.0737252

atm4_ts |>
  autoplot(box_cox(Cash, lambda_atm4)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM 1 Cash with $\\lambda$ = ",
         round(lambda_atm4,4))))

For ATM4, we see there is seasonality but no significant trend. I decided to use SNAIVE() since it accounts for seasonality, but not trend. The forecast plot using SNAVIE() shows the last season’s pattern in the time series repeated for the next month (May 2010).

From the Box-Pierce test and the Ljung-Box test, we see that the results are significant since the p-values are 1.09357e-13 and 4.152234e-14, respectively. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise. This also suggest that the SNAIVE() model is probably not the best fit for the data.

The accuracy of the training data shows that the RMSE value is 896.033, which will be used to compare with the proceeding models.

#SNAIVE model

atm4_ts %>% 
  model(SNAIVE(Cash)) %>%
  forecast(h = "30 days") %>%
  autoplot(atm4_ts)

# Fit the model
fit4 <- atm4_ts |> model(SNAIVE(Cash))

# Look at the residuals
fit4 |> gg_tsresiduals()

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

fit4 |> accuracy()

## # A tibble: 1 × 11
##   ATM   .model       .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>        <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM4  SNAIVE(Cash) Training -3.70  896.  402. -395.  450.     1     1 -0.0151

# Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(fit4) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 4
##   ATM   .model       bp_stat bp_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM4  SNAIVE(Cash)    83.4  1.09e-13

augment(fit4) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 4
##   ATM   .model       lb_stat lb_pvalue
##   <chr> <chr>          <dbl>     <dbl>
## 1 ATM4  SNAIVE(Cash)    85.5  4.15e-14

I then decided to use the ETS() function to estimate the equivalent model for simple exponential smoothing. The ETS model recommended is the ETS(M,N,A) which is error(“M”) + trend(“N”) + season(“A”) - multiplicative error, no trend, and additive seasonality.

Comparing the SNAIVE() model with the ETS(M,N,A) model, I prefer the the ETS(M,N,A) model since it has the smaller RMSE value, which means it may have best captured the seasonality better and predicted with better accuracy. I also opted to look at the AIC value for model comparison - the box-cox transformed ETS(M,N,A) model had an AIC of 3212.870, while the ETS(M,N,A) model had an AIC of 6690.624. However, since I wanted the better prediction, I went with the RMSE value and it looks like the ETS(M,N,A) model also preformed better than the box-cox transformed ETS(M,N,A) model with an RMSE value of 645.1182 compared to 653.9265.

#ETS model
atm4_ets <- atm4_ts |>
  model(fit = ETS(Cash))
report(atm4_ets)

## Series: Cash 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.001323909 
##     gamma = 0.003004107 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]     s[-3]    s[-4]    s[-5]     s[-6]
##  372.4108 -146.6545 -57.04306 217.4408 -6.243662 37.49176 5.354976 -50.34634
## 
##   sigma^2:  1.6575
## 
##      AIC     AICc      BIC 
## 6690.624 6691.246 6729.623

# ETS of ANA and box-cox ANA
atm4_ets2 <- atm4_ts |>
  model(ets = ETS(Cash ~ error("M") + trend("N") + season("A")),
        ets_box = ETS(box_cox(Cash, lambda_atm1) ~ error("M") + trend("N") + season("A")),)

# Check RMSE
accuracy(atm4_ets2)

## # A tibble: 2 × 11
##   ATM   .model  .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM4  ets     Training  77.0  645.  312. -510.  552. 0.777 0.720 -0.0106 
## 2 ATM4  ets_box Training 123.   654.  297. -346.  397. 0.739 0.730 -0.00564

# Check AIC
glance(atm4_ets2) |> arrange(AICc) |> select(.model:BIC)

## # A tibble: 2 × 6
##   .model  sigma2 log_lik   AIC  AICc   BIC
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ets_box  0.126  -1596. 3213. 3213. 3252.
## 2 ets      1.66   -3335. 6691. 6691. 6730.

# Forecast ETS model
atm4_fc <- atm4_ets2 |> forecast(h = '30 days')

# Plot ETS model of ATM4
atm4_fc |>
  autoplot(atm4_ts, level = NULL) +
  labs(y="Cash", title="ATM1 Cash: ETS(M,N,A) Model")

I then decided to try ARIMA, to see if it is a better model for forecasting ATM4. The ARIMA model recommended is ARIMA(0,0,0) w/ mean which indicated no autoregressive (AR) terms, no differencing, and no MA term.

The ARIMA(0,0,0) without a Box-Cox transformation, has an AIC of 5768.0640 and an RMSE of 650.0437, being the second smallest value of RMSE of all of the models.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.9721250 and 0.9696305, respectively. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This means the ARIMA model was good at fitting and predicting ATM4.

#ARIMA
atm4_ts|>
  model(ARIMA(Cash))

## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM            `ARIMA(Cash)`
##   <chr>                <model>
## 1 ATM4  <ARIMA(0,0,0) w/ mean>

# ACF anf PCF of ATM4
atm4_ts |> gg_tsdisplay(Cash)

#ARIMA of ATM4
atm4_arm2 <- atm4_ts|>
  model(arima = ARIMA(Cash ~ pdq(0,0,0)),
        arima_box = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,0)))

#RMSE and AIC of ATM 4
report(atm4_arm2)

## Warning in report.mdl_df(atm4_arm2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.

## # A tibble: 2 × 9
##   ATM   .model        sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr> <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 ATM4  arima     423718.     -2882. 5768. 5768. 5776. <cpl [0]>  <cpl [0]>
## 2 ATM4  arima_box      0.842   -486.  979.  979.  995. <cpl [14]> <cpl [0]>

accuracy(atm4_arm2)

## # A tibble: 2 × 11
##   ATM   .model    .type          ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>     <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM4  arima     Traini… -1.51e-10  650.  324. -617.  647. 0.805 0.725 -0.00936
## 2 ATM4  arima_box Traini…  2.10e+ 2  679.  317. -240.  304. 0.789 0.758 -0.00510

#Residuals plot
atm4_arm2 |> select(arima)|> gg_tsresiduals(lag=36)

# Residuals: Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(atm4_arm2) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 2 × 4
##   ATM   .model    bp_stat bp_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM4  arima        3.34     0.972
## 2 ATM4  arima_box   14.5      0.152

augment(atm4_arm2) |> features(.innov, ljung_box, lag=10)

## # A tibble: 2 × 4
##   ATM   .model    lb_stat lb_pvalue
##   <chr> <chr>       <dbl>     <dbl>
## 1 ATM4  arima        3.42     0.970
## 2 ATM4  arima_box   14.9      0.137

# Plot ARIMA
atm4_arm2 %>% 
  forecast(h="30 days") %>%
  autoplot(atm4_ts) +
  labs(title = "ATM4: ARIMA(0,0,0) w/ mean")

Conclusion for ATM4:

The best model to predict the next month (May 2010) of the time series of ATM4 is ETS(M,N,A), with an RMSE value of 645.1182 - the smallest RMSE value of all of the models for ATM4. I filtered for April 2010 and onward to get a better look at the prediction. The forecast looks reasonable to me, the pattern is not exactly the same but very close to it.

#Filter ATM4
atm4_ts_plot <- atm4_ts |>
  filter(DATE >= "2010-04-01")

# Plot ETS for ATM4
atm4_plot <- atm4_ts |>
  model(ETS(Cash))

atm4_plot %>% 
  forecast(h="30 days") %>%
  autoplot(atm4_ts_plot) +
  labs(title = "ATM4: ETS(M.N,A)")

Excel spreadsheet for forecast

atm4_excel <- atm4_plot %>% 
  forecast(h="30 days")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

For part B, I read in the atm dataset and then changed the date name from YYYY-MMM to Date. I then formatted the date variable to be ‘mth’ instead of a character. Then I looked at missing data - the total NAs/missing data was 1 row with a missing KWH, which I decided to impute using k-Nearest Neighbors (kNN) which takes an average of the nearest points depending on the k you choose. This dataset has 192 observations.

I changed the dataframe to a tsibble to be able to plot and forecast the time series. We see that there is some cyclic behavior and seasonality, but not much of a trend.

power <- read_excel("C:/Users/Kesha/Desktop/Spring 2025/DATA 624/ResidentialCustomerForecastLoad-624.xlsx")
head(power)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

# Rename date
colnames(power)[colnames(power) == "YYYY-MMM"] <- "Date"

# Fit Date to be month and year
power$DATE <- as.yearmon(power$Date, "%Y-%b")
power$DATE <- as.Date(power$DATE)
power$DATE <- yearmonth(power$DATE)

# Look at Na
which(is.na(power), arr.ind = TRUE)

##      row col
## [1,] 129   3

# Impute with KNN
power_imput <- kNN(power, k = 3)

# Change to tsibble
power_ts <- power_imput |>
  select(DATE, KWH) |>
  as_tsibble(index = DATE)

# Plot residential power usage
autoplot(power_ts, KWH)

gg_season(power_ts, KWH)

ACF(power_ts, KWH)|> autoplot()

I did an STL model to look at the trends and seasonality of the time series to figure out how to proceed. The results of the STL decomposition graph tell us that there is no obvious or apparent trend for residential power usage for January 1998 until December 2013. In terms of seasonality, the figure showing the seasonal component from the decomposition clearly tells us that throughout these two years, there are certain times of the month and year where the amount of cash increases and decreases - the beginning of the year and the summer months (June-Sept) show an increase in power usage, which indicates seasonality. In terms of outliers, we see a large outlier with a very dramatic decrease in early 2010.

#STL
power_ts |>
  model(stl = STL(KWH)) |>
  components() |>
  autoplot()

I proceeded to do a box-cox transformation to make the data more normal and stabilize the variance at lambda = 0.1060303. The transformation helped us see the overall plot more clearly, especially in early 2010 where there is a large decrease in residential power usage.

#Box cox transformation

lambda_kwh <- power_ts |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)
lambda_kwh

## [1] 0.1060303

power_ts |>
  autoplot(box_cox(KWH, lambda_kwh)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM 1 Cash with $\\lambda$ = ",
         round(lambda_kwh,4))))

For the residential power usage data, we see there is seasonality but no significant trend. I decided to use SNAIVE() since it accounts for seasonality, but not trend. The forecast plot using SNAVIE() shows the last season’s pattern in the time series repeated for the next year (2014).

The histogram of residuals looks somewhat unimodal and normally distributed. The ACF of the residuals shows a few lags out of the the bounds, one of them being far out the lower limit, but most of the lags are somewhat close to zero. There does seem to still be a pattern in the ACF plot. The residuals do not appear to be white noise since there does appear to be autocorrelation in the residual series.

From the Box-Pierce test and the Ljung-Box test, we see that the results are somewhat significant since the p-values are 0.06164224 and 0.05154666, respectively. We can conclude that the residuals are distinguishable from white noise. This means that there is some autocorrelation in the residuals and the residuals are not white noise. This also suggest that the SNAIVE() model is probably not the best fit for the data.

The accuracy of the training data shows that the RMSE value is very large at 1202619, which will be used to compare with the proceeding models.

#SNAIVE model

power_ts %>% 
  model(SNAIVE(KWH)) %>%
  forecast(h = '12 months') %>%
  autoplot(power_ts)

# Fit the model
power_fit <- power_ts |> model(SNAIVE(KWH))

# Look at the residuals
power_fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

power_fit |> accuracy()

## # A tibble: 1 × 10
##   .model      .type        ME     RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>       <chr>     <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(KWH) Training 94245. 1202619. 713016. -4.01  14.9     1     1 0.227

power_fit |> report()

## Series: KWH 
## Model: SNAIVE 
## 
## sigma^2: 1445439621945.48

# Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(power_fit) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model      bp_stat bp_pvalue
##   <chr>         <dbl>     <dbl>
## 1 SNAIVE(KWH)    17.6    0.0616

augment(power_fit) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model      lb_stat lb_pvalue
##   <chr>         <dbl>     <dbl>
## 1 SNAIVE(KWH)    18.2    0.0515

I then decided to use the ETS() function to estimate the equivalent model for simple exponential smoothing. The ETS model recommended is the ETS(A,N,A) which is error(“M”) + trend(“N”) + season(“M”) - multiplicative error, no trend, and multiplicative seasonality.

Comparing the SNAIVE() model with the ETS(M,N,M) model, I prefer the the ETS(M,N,M) model since it has the smaller RMSE value, which means it may have best captured the seasonality better and predicted with better accuracy. I also opted to look at the AIC value for model comparison -the box-cox transformed ETS(M,N,M) model had an AIC of 1017.757 , while the ETS(M,N,M) model had an AIC of 6231.102. However, since I wanted the better prediction, I went with the RMSE value and it looks like the ETS(M,N,M) model also preformed better than the box-cox transformed ETS(M,N,M) model with an RMSE value of 844364.3 compared to 894102.3.

From the Box-Pierce test and the Ljung-Box test, we see that the results are significant since the p-values are 0.02307188 and 0.01816262, respectively, for the ETS(M,N,M) model. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise. This also suggest that the ETS() model is probably not the best fit for the data.

#ETS model
kwh_ets <- power_ts |>
  model(fit = ETS(KWH))
report(kwh_ets)

## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1198872 
##     gamma = 0.000103396 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  6128269 0.9493335 0.7537713 0.8718683 1.167763 1.267554 1.177798 1.004366
##      s[-7]     s[-8]    s[-9]   s[-10]   s[-11]
##  0.7693067 0.8149458 0.917753 1.081722 1.223818
## 
##   sigma^2:  0.0149
## 
##      AIC     AICc      BIC 
## 6231.102 6233.830 6279.965

# ETS of MNM and box-cox MNM
kwh_ets2 <- power_ts  |>
  model(ets = ETS(KWH ~ error("M") + trend("N") + season("M")),
        ets_box = ETS(box_cox(KWH, lambda_kwh) ~ error("M") + trend("N") + season("M")),)

# Check RMSE
accuracy(kwh_ets2)

## # A tibble: 2 × 10
##   .model  .type         ME    RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr>      <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets     Training  60488. 844364. 516689. -4.38  12.2 0.725 0.702 0.176
## 2 ets_box Training 201985. 894102. 571564. -2.01  12.2 0.802 0.743 0.266

# Check AIC
glance(kwh_ets2) |> arrange(AICc) |> select(.model:BIC)

## # A tibble: 2 × 6
##   .model    sigma2 log_lik   AIC  AICc   BIC
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ets_box 0.000600   -494. 1018. 1020. 1067.
## 2 ets     0.0149    -3101. 6231. 6234. 6280.

# Forecast ETS model
kwh_fc <- kwh_ets2 |> forecast(h = '12 months')

# Plot ETS model of Power Usage
kwh_fc |>
  autoplot(power_ts , level = NULL) +
  labs(y="KWH", title="Residential Power Usage: ETS(M,N,M) Model")

I then decided to try ARIMA, to see if it is a better model for forecasting power usage. The ARIMA model recommended is ARIMA(0,0,1) which indicated no autoregressive (AR) terms, no differencing, and 1 MA term and the seasonal part (0,1,2) indicates no seasonal AR terms, 1 seasonal differencing, and 2 seasonal MA terms.

The ARIMA(0,0,1)(0,1,2) w/ drift model without a Box-Cox transformation, had an AIC of 5457.5337 and an RMSE of 845350.0, being the smallest value of all of the models.

#ARIMA
power_ts|>
  model(ARIMA(KWH))

## # A mable: 1 x 1
##                         `ARIMA(KWH)`
##                              <model>
## 1 <ARIMA(0,0,1)(1,1,1)[12] w/ drift>

# ACF anf PCF of Power Usage
power_ts |> gg_tsdisplay(KWH)

#ARIMA of Power Usage
pow_arm2  <- power_ts|>
  model(arima = ARIMA(KWH ~ pdq(0,0,1) + PDQ(1,1,1)),
        arima_box = ARIMA(box_cox(KWH, lambda_kwh)~ pdq(0,0,1) + PDQ(1,1,1))) 

#RMSE and AIC of Power Usage
report(pow_arm2)

## Warning in report.mdl_df(pow_arm2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.

## # A tibble: 2 × 8
##   .model     sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots  
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>    
## 1 arima     7.80e11  -2724. 5458. 5458. 5473. <cpl [12]> <cpl [13]>
## 2 arima_box 1.03e 0   -264.  538.  538.  554. <cpl [12]> <cpl [13]>

accuracy(pow_arm2)

## # A tibble: 2 × 10
##   .model    .type         ME    RMSE     MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>     <chr>      <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 arima     Training -25001. 845350. 500437. -5.58  11.8 0.702 0.703 0.0122
## 2 arima_box Training  31198. 889172. 522922. -4.75  12.1 0.733 0.739 0.114

#Residuals plot
pow_arm2 |> select(arima)|> gg_tsresiduals(lag=36)

# Residuals: Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(pow_arm2) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 2 × 3
##   .model    bp_stat bp_pvalue
##   <chr>       <dbl>     <dbl>
## 1 arima        7.39     0.688
## 2 arima_box    3.24     0.975

augment(pow_arm2) |> features(.innov, ljung_box, lag=10)

## # A tibble: 2 × 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 arima        7.70     0.658
## 2 arima_box    3.38     0.971

# Plot ARIMA
pow_arm2 %>% 
  forecast(h='12 months') %>%
  autoplot(power_ts) +
  labs(title = "ATM1: ARIMA(0,0,1)(1,1,1)")

Conclusion for Residential Power Usage:

The best model to predict the next year (2014) of the time series of Residential Power Usage is ETS(M,N,M), with an RMSE value of 844364.3 - the smallest RMSE value of all of the models. The forecast looks reasonable to me, especially since it followed the same pattern in the dataset.

# Plot ETS
pow_plot <- power_ts |>
  model(ETS(KWH))

pow_plot %>% 
  forecast(h='12 months') %>%
  autoplot(power_ts) +
  labs(title = "Residential Power Usage: ETS(M.N,M)")

Excel spreadsheet for power usage

power_excel <- pow_plot %>% 
  forecast(h='12 months')

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

The first step was reading in both datasets and taking a look at them. I noticed that the datetime needed formatting.

water1 <- read_excel("C:/Users/Kesha/Desktop/Spring 2025/DATA 624/Waterflow_Pipe1.xlsx") |>
  clean_names()
water2 <- read_excel("C:/Users/Kesha/Desktop/Spring 2025/DATA 624/Waterflow_Pipe2.xlsx") |>
  clean_names()
head(water1)

## # A tibble: 6 × 2
##   date_time water_flow
##       <dbl>      <dbl>
## 1    42300.      23.4 
## 2    42300.      28.0 
## 3    42300.      23.1 
## 4    42300.      30.0 
## 5    42300.       6.00
## 6    42300.      15.9

head(water2)

## # A tibble: 6 × 2
##   date_time water_flow
##       <dbl>      <dbl>
## 1    42300.       18.8
## 2    42300.       43.1
## 3    42300.       38.0
## 4    42300.       36.1
## 5    42300.       31.9
## 6    42300.       28.2

Formatting and changing the excel datetime for water1 and water2: I formatted both datetimes to look like a datetime variable instead of a numeric variable.

water1$datetime <- as.POSIXct((water1$date_time - 2) * 86400, origin = "1900-01-01", tz = "UTC")
water1$date_time <- format(water1$datetime, "%m/%d/%y %I:%M %p")

water2$datetime <- as.POSIXct((water2$date_time - 2) * 86400, origin = "1900-01-01", tz = "UTC")
water2$date_time <- format(water2$datetime, "%m/%d/%y %I:%M %p")

First I stacked both datasets, then created an hour variable for each hour in the datetime variable. Then I found the mean of each hour and stored it as waterflow_mean. I change the dataframe to a tsibble, then used fill_gaps. I did this because while there was no missing data, there were time jumps in the tsibble and forecasting does not permit this. Using fill_gaps creating new rows where there were originially time jumps - with the new rows due to fill_gaps, I had NAs in my dataset which I decided to impute. However, the kNN imputation only works on a dataframe, so I changed the tsibble back to a dataframe and changed the hour variable to numeric for the kNN imputations. Then I changed the dataframe back to a tsibble for forecasting.

#Stack water 1 and 2
combined_df <- dplyr::bind_rows(water1, water2)

#Create an hour variable
combined_df$hour <- floor_date(combined_df$datetime, unit = "hour")

#Create mean variable
water_total <- combined_df  %>%
  select(hour, water_flow) %>%
  group_by(hour) %>%
  summarise(waterflow_mean = mean(water_flow, na.rm = TRUE))

#change to tsibble
water_ts <- water_total |>
  as_tsibble(index = hour)

# Fill gaps in hour variable
water_ts_gap <- fill_gaps(water_ts)

# Change tsibble back to a dataframe for imputations
water_ts_df <- as.data.frame(water_ts_gap)

# Change hour to numeric to do imputations
water_ts_df$hour_numeric <- as.numeric(water_ts_df$hour)

# Impute using kNN
wt_imput <- kNN(water_ts_df, k = 2)

# Change back to tsibble
water_ts1 <- wt_imput |>
  select(hour, waterflow_mean) |>
  as_tsibble(index = hour)

Plot Water

The time series of water appears to be somewhat non-stationary. We can see from the first plot of the closing stock prices that there is a very slight upward trend with some fluctuation in between - time series with trends or with seasonality are not stationary. We cannot determine seasonality with the data that we have because we only have data from Oct 2015 to Dec 2015. We can also see from the ACF plot that the first lag is very large and while we do not see decaying behavior, there appears to be a increasing and decreasing pattern here only on the positive side. Lastly, from the PACF plot we see that the first lag is very large and outside of the bounds with other lags in the plot also outside the bounds, indicating significant correlation, which also show that this time series is non-stationary. It seems like the time series could benefit from some differencing.

The KPSS test p-value reported is shown as 0.01 (and therefore it may be smaller than that), indicating that the data is not stationary. We can difference the data, and apply the test again.

#water plot
water_ts1 |> 
  autoplot(waterflow_mean)

#ACF plot
water_ts1 |> ACF(waterflow_mean) |>
  autoplot() + labs(subtitle = "ACF of Waterflow Pipe Mean")

#PACF plot
water_ts1 |> PACF(waterflow_mean) |>
  autoplot() + labs(subtitle = "PACF of Waterflow Pipe Mean")

#KPSS test
water_ts1 |>
  features(waterflow_mean, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      3.95        0.01

Box cox transformation:

I redid the KPSS test with differencing and this time, the p-value is reported as 0.1 (and so it could be larger than that). We can conclude that the differenced data is now stationary. I proceeded to do a box-cox transformation to make the data more normal and stabilize the variance at lambda = -0.2331026. Looking at the residuals we can we can see that the data is differenced and stationary.

# Redo the KPSS test with a difference
water_ts1 |>
  mutate(diff_wt= difference(waterflow_mean)) |>
  features(diff_wt, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1   0.00986         0.1

# Get lambda
lambda_wt <- water_ts1 |>
  features(waterflow_mean, features = guerrero) |>
  pull(lambda_guerrero)
lambda_wt

## [1] -0.2331026

#Unitroot test to find out differences to be taken
water_ts1 |>
  mutate(wt_box = box_cox(waterflow_mean, lambda_wt)) |>
  features(wt_box, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

# Plot with transformation and difference
water_ts1 |>
  autoplot(box_cox(waterflow_mean, lambda_wt) |>
  difference(1))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

# Look at residuals with transformation and differencing
water_ts1 |>
  gg_tsdisplay(difference(box_cox(waterflow_mean,lambda_wt),1), plot_type='partial') +
  labs(y = "",
    title = latex2exp::TeX(paste0("Box-cox and ordinary difference of Waterflow $\\lambda$ = ", round(lambda_wt,4))))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

I then decided to try ARIMA since it does the differencing for us. The ARIMA model recommended is ARIMA(0,1,2)(0,0,1) which indicated no autoregressive (AR) terms, 1 order differencing, and 2 MA terms and the seasonal part (0,1,2) indicates no seasonal AR terms, no seasonal differencing, and 2 seasonal MA terms.

The ARIMA(0,0,1)(0,1,2) model without a Box-Cox transformation, had an RMSE of 10.65604.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.8986530 and 0.8963472, respectively. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This means the ARIMA model was good at fitting and predicting the mean water flow.

#ARIMA
wt_arm <- water_ts1|>
  model(ARIMA(waterflow_mean))
wt_arm

## # A mable: 1 x 1
##     `ARIMA(waterflow_mean)`
##                     <model>
## 1 <ARIMA(0,1,2)(0,0,1)[24]>

# ACF and PCF of Power Usage
water_ts1 |> gg_tsdisplay(waterflow_mean)

#ARIMA of Power Usage
wt_arm2  <- water_ts1|>
  model(arima = ARIMA(waterflow_mean ~ pdq(0,1,2) + PDQ(0,0,1)),
        arima_box = ARIMA(box_cox(waterflow_mean, lambda_wt)~ pdq(0,1,2) + PDQ(0,1,1))) 

#RMSE and AIC of Power Usage
report(wt_arm2)

## Warning in report.mdl_df(wt_arm2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.

## # A tibble: 2 × 8
##   .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots  
##   <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>    
## 1 arima     114.      -3787. 7581. 7581. 7601. <cpl [0]> <cpl [26]>
## 2 arima_box   0.0299    304. -600. -600. -580. <cpl [0]> <cpl [26]>

accuracy(wt_arm2)

## # A tibble: 2 × 10
##   .model    .type       ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>     <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima     Training 0.362  10.7  8.20 -12.0   30.0 0.687 0.682 -0.00932
## 2 arima_box Training 1.89   11.5  8.76  -7.16  29.9 0.734 0.737  0.0877

#Residuals plot
wt_arm2 |> select(arima)|> gg_tsresiduals(lag=36)

# Residuals: Check autocorrelation with Box-Pierce Test and Ljung-Box Test
augment(wt_arm2) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 2 × 3
##   .model    bp_stat bp_pvalue
##   <chr>       <dbl>     <dbl>
## 1 arima        4.89     0.899
## 2 arima_box    2.45     0.992

augment(wt_arm2) |> features(.innov, ljung_box, lag=10)

## # A tibble: 2 × 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 arima        4.92     0.896
## 2 arima_box    2.47     0.991

# Plot ARIMA
wt_arm2 %>% 
  forecast(h='7 days') %>%
  autoplot(water_ts1) +
  labs(title = "Waterflow: ARIMA(0,1,2)(0,0,1)")

Conclusion for Waterflow :

I decided to forecast using ARIMA with differencing included of the waterflow dataset, and it produced a forecast for the next 7 days. The forecast did looked a little different from the previous data points.

# Plot ARIMA for Water flow
wt_plot <- water_ts1|>
  model(ARIMA(waterflow_mean))

wt_plot %>% 
  forecast(h="7 days") %>%
  autoplot(water_ts1) +
  labs(title = "Waterflow: ARIMA(0,1,2)(0,0,1)")

Excel spreadsheet for waterflow for extra credit

wt_excel <- wt_plot %>% 
  forecast(h="7 days")

Project 1 excel file with ATM, Power Usage, and Extra Credit WaterFlow

wb <- createWorkbook()

addWorksheet(wb, "atm1_excel")
addWorksheet(wb, "atm2_excel")
addWorksheet(wb, "atm3_excel")
addWorksheet(wb, "atm4_excel")
addWorksheet(wb, "power_excel")
addWorksheet(wb, "wt_excel")

writeData(wb, "atm1_excel", atm1_excel)
writeData(wb, "atm2_excel", atm2_excel)
writeData(wb, "atm3_excel", atm3_excel)
writeData(wb, "atm4_excel", atm4_excel)
writeData(wb, "power_excel", power_excel)
writeData(wb, "wt_excel", wt_excel)

saveWorkbook(wb, "C:/Users/Kesha/Desktop/Spring 2025/DATA 624/Data24_project1_nf.xlsx", overwrite = TRUE)

Data 624: Project 1

Nakesha Fray

2025-03-10

Part A - ATM Forecast, ATM624Data.xlsx**

ATM1

ATM2

ATM3

ATM4