Libraries

library(dplyr)
library(tidyverse)
library(forecast)
library(writexl)

Overview

This project consists of 3 parts:

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Part A: ATM Forecast

Load and Clean Data

To ensure reproducibility, I converted the excel file into a CSV file and then uploaded to GitHub.

atm_raw <- "https://raw.githubusercontent.com/JDO-MSDS/Data-624/refs/heads/main/Project%201/data/ATM624Data%20-%20ATM%20Data.csv"

atm <- read.csv(atm_raw)

glimpse(atm)
## Rows: 1,474
## Columns: 3
## $ DATE <int> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <int> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
summary(atm)
##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10920.0  
##                                     NA's   :19
# fix date format since it was num
atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")

head(atm)
# checking how many different ATM
unique(atm$ATM)
## [1] "ATM1" "ATM2" ""     "ATM3" "ATM4"
# removing the blank/missing ATM
atm <- atm |>
  filter(ATM != "")

unique(atm$ATM)
## [1] "ATM1" "ATM2" "ATM3" "ATM4"
# number of observations per ATM
atm |> count(ATM)
# split by atm
atm_list <- split(atm, atm$ATM)

# clean and sort
atm_list <- lapply(atm_list, function(df) {
  df |>
    filter(!is.na(Cash)) |>
    arrange(DATE)
})

# convert to time series
atm_ts <- lapply(atm_list, function(df) {
  ts_data <- ts(df$Cash, frequency = 7)
  na.interp(ts_data)
})

ATM Plots

par(mfrow = c(2,2)) 

for(name in names(atm_ts)) {
  plot(atm_ts[[name]],
       main = paste("Cash Withdrawals -", name),
       ylab = "Cash",
       xlab = "Time",
       col = "red")
}

par(mfrow = c(2,2))

for(name in names(atm_ts)) {
  acf(atm_ts[[name]],
      main = paste("ACF -", name))
}

par(mfrow = c(2,2))

for(name in names(atm_ts)) {
  pacf(atm_ts[[name]],
       main = paste("PACF -", name))
}

By looking at the plots, I can see that ATM1 and ATM2 show stable patterns over time, with somewhat regular ups and downs. The ACF plots suggest a repeating pattern around every 7 days, which makes sense as a weekly pattern. The PACF plots also has some structure in the early lags, so it should be predictable. The weekly seasonality and structured autocorrelation make ARIMA and ETS good model options. ATM3 is quite different. Basically the only activity is near the end. This probably means that this ATM was only installed by the end of the period. I will use a simple average-based approach due to the lack of data. ATM4 has a few extreme spikes. It also has fluctuations. It is not easy to identify clear patterns, which can be confirmed in the ACF and PACF plots that look more scattered. So, I will treat each ATM separately since they each show different behaviors. In the following section, I will implement these models and compare their forecasting performance.

ATM1 Forecast

Fit ARIMA model for ATM1

As explained above, the plot for ATM1 showed a weekly pattern, so I will use auto.arima() to select the best model based on the data since seasonality makes sense.

atm1_arima <- auto.arima(atm_ts$ATM1, seasonal = TRUE)
summary(atm1_arima)
## Series: atm_ts$ATM1 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.1674  -0.5813
## s.e.  0.0565   0.0472
## 
## sigma^2 = 666.6:  log likelihood = -1658.32
## AIC=3322.63   AICc=3322.7   BIC=3334.25
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
##                     ACF1
## Training set -0.01116544

The model selected by auto.arima() is ARIMA(0,0,1)(0,1,1)[7]. The weekly seasonality is confirmed.

Check ARIMA diagnosis for ATM1

checkresiduals(atm1_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14

The residuals from the ARIMA(0,0,1)(0,1,1)[7] model fluctuate around zero with no obvious pattern, so that means that the model has captured the main structure of the data. The ACF plot shows that correlations are within the confidence limits, so no autocorrelation left. The Ljung-Box test produced a p-value of 0.5062, so we fail to reject the null hypothesis of no autocorrelation. The residuals behave like white noise, telling us the model is a good fit.

Fit ETS for ATM1

atm1_ets <- ets(atm_ts$ATM1)
summary(atm1_ets)
## ETS(A,N,A) 
## 
## Call:
## ets(y = atm_ts$ATM1)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.415 
## 
##   Initial states:
##     l = 87.7888 
##     s = -54.0598 16.2223 21.048 -26.068 8.3991 35.2058
##            -0.7474
## 
##   sigma:  26.384
## 
##      AIC     AICc      BIC 
## 4513.138 4513.765 4552.055 
## 
## Training set error measures:
##                      ME     RMSE     MAE       MPE     MAPE      MASE      ACF1
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793 0.1374786

The model shows a very small smoothing parameter for the level and some seasonal adjustment, indicating that most of the variation comes from the weekly patterns. The error measures (RMSE around 26.05, MAE 16.96) suggest a reasonable fit to the data.

Check ETS diagnosis for ATM1

checkresiduals(atm1_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 17.703, df = 14, p-value = 0.2206
## 
## Model df: 0.   Total lags used: 14

The residuals from the ETS(A,N,A) model fluctuate around zero with no obvious pattern, meaning we have a reasonable fit. The ACF plot shows that most correlations lie within the confidence limits, so very little remaining autocorrelation. The Ljung-Box test gave a p-value of 0.2206, so we fail to reject the null hypothesis of no autocorrelation. Once again, the residuals behave close to white noise, meaning the model is appropriate.

Since both models make sense, I will forecast with both ARIMA and ETS.

Forecast ATM1 with both models

Forecast the next 31 days (end of May)

atm1_arima_fc <- forecast(atm1_arima, h = 31)
atm1_ets_fc   <- forecast(atm1_ets, h = 31)

# plots
plot(atm1_arima_fc, main = "ATM1 ARIMA Forecast")

plot(atm1_ets_fc, main = "ATM1 ETS Forecast")

ATM2 Forecast

Fit ARIMA model for ATM2

atm2_arima <- auto.arima(atm_ts$ATM2, seasonal = TRUE)
summary(atm2_arima)
## Series: atm_ts$ATM2 
## ARIMA(2,0,2)(0,1,2)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1    sma2
##       -0.4148  -0.8922  0.4008  0.7534  -0.7220  0.0801
## s.e.   0.0674   0.0510  0.1027  0.0688   0.0567  0.0550
## 
## sigma^2 = 708.1:  log likelihood = -1671.98
## AIC=3357.96   AICc=3358.28   BIC=3385.08
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.4583759 26.12942 18.09675 -Inf  Inf 0.8421495 0.009277035

The model select by auto.arima() for ATM2 is ARIMA(2,0,2)(0,1,2)[7]

Check ARIMA diagnosis for ATM2

checkresiduals(atm2_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
## 
## Model df: 6.   Total lags used: 14

The residuals from the ARIMA(2,0,2)(0,1,2)[7] model fluctuate around zero with no obvious pattern, so that means that the model has captured the main structure of the data. The ACF plot shows that most correlations are within the confidence limits, so barely any autocorrelation left. The Ljung-Box test produced a p-value of 0.1459, so we fail to reject the null hypothesis of no autocorrelation. The residuals behave like white noise, telling us the model is a good fit.

Fit ETS for ATM2

atm2_ets <- ets(atm_ts$ATM2)
summary(atm2_ets)
## ETS(A,N,A) 
## 
## Call:
## ets(y = atm_ts$ATM2)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.4647 
## 
##   Initial states:
##     l = 76.1082 
##     s = 11.3136 -47.8461 -13.3718 -1.8486 18.6895 4.6056
##            28.4579
## 
##   sigma:  27.8902
## 
##      AIC     AICc      BIC 
## 4566.882 4567.507 4605.826 
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.6349959 27.54227 19.09383 -Inf  Inf 0.8885496 -0.02291393

Check ETS diagnosis for ATM2

checkresiduals(atm2_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 38.485, df = 14, p-value = 0.0004379
## 
## Model df: 0.   Total lags used: 14

We can see that residuals fluctuate around zero and most without a pattern, so decent fit. The ACF has 3 correlations outside the limits, however, most correlations are within limits. The Ljung-Box test gave a p-value of 0.0004379, so we fail to reject the null hypothesis of no autocorrelation.

Forecast ATM2 with both models

atm2_arima_fc <- forecast(atm2_arima, h = 31)
atm2_ets_fc   <- forecast(atm2_ets, h = 31)

plot(atm2_arima_fc, main = "ATM2 ARIMA Forecast")

plot(atm2_ets_fc, main = "ATM2 ETS Forecast")

## Accuracy Testing Since both models seem to be appropriate for both ATM1 and ATM2, I will compare the accuracy on a test set using the 7 last observations as my test data.

Split data into train and test set

ATM1

atm1_train <- head(atm_ts$ATM1, length(atm_ts$ATM1) - 7)
atm1_test  <- tail(atm_ts$ATM1, 7)

ATM2

atm2_train <- head(atm_ts$ATM2, length(atm_ts$ATM2) - 7)
atm2_test  <- tail(atm_ts$ATM2, 7)

Fit models again on training data

ATM1

atm1_arima_train <- auto.arima(atm1_train, seasonal = TRUE)
atm1_ets_train   <- ets(atm1_train)

atm1_arima_pred <- forecast(atm1_arima_train, h = 31)
atm1_ets_pred   <- forecast(atm1_ets_train, h = 31)

ATM2

atm2_arima_train <- auto.arima(atm2_train, seasonal = TRUE)
atm2_ets_train   <- ets(atm2_train)

atm2_arima_pred <- forecast(atm2_arima_train, h = 31)
atm2_ets_pred   <- forecast(atm2_ets_train, h = 31)

Forecast

ATM1

accuracy(atm1_arima_pred, atm1_test)
##                      ME      RMSE       MAE         MPE       MAPE      MASE
## Training set -0.1103623 25.728299 16.387807 -108.243128 124.591413 0.8614739
## Test set      0.7294579  6.514846  4.719218   -3.870229   9.265912 0.2480797
##                     ACF1  Theil's U
## Training set -0.01129296         NA
## Test set     -0.35807933 0.07617492
accuracy(atm1_ets_pred, atm1_test)
##                      ME      RMSE       MAE        MPE      MAPE      MASE
## Training set -0.9108809 26.475973 17.197222 -116.09993 131.57068 0.9040231
## Test set      0.8572582  6.544522  4.677537   -3.32732   8.81235 0.2458887
##                    ACF1  Theil's U
## Training set  0.1433669         NA
## Test set     -0.3542978 0.07534497

ATM2

accuracy(atm2_arima_pred, atm2_test)
##                      ME     RMSE      MAE      MPE   MAPE      MASE        ACF1
## Training set  0.4509992 26.19483 18.63038     -Inf    Inf 0.8645129 -0.04844147
## Test set     13.5822461 20.59120 15.32357 96.28139 98.628 0.7110660 -0.54126825
##              Theil's U
## Training set        NA
## Test set     0.1238692
accuracy(atm2_ets_pred, atm2_test)
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.2861317 27.53729 19.17555     -Inf      Inf 0.8898110
## Test set      6.2767598 17.33770 12.92923 10.42447 20.43657 0.5999602
##                    ACF1  Theil's U
## Training set -0.0198363         NA
## Test set     -0.5018867 0.07499634

By analyzing the accuracy measures, we can see that for ATM1, both models performed similarly, but there are some differences since ARIMA had a slightly lower RMSE (6.51 vs 6.54) and ETS had slightly better MAE and MAPE. The performance is very close, but I think ETS provides a marginally better fit because of its slightly lower error in MAE and MAPE.

For ATM2, it’s a little different since ETS outperforms ARIMA across all key metrics. The RMSE is lower (17.34 vs 20.59), the MAE is lower (12.93 vs 15.32), and MAPE is significantly lower (20.44 vs 98.63). So, for ATM2, the ETS is better suited, probably because it handles variability and noise more effectively than ARIMA in this case.

So, I selected ETS for both, although both models performed similarly for ATM1.

ATM3 Forecast

As explained above, ATM3 is kind of a unique case, as the time series is mainly zeros with only a few observations by the end of the period. This probably means that the ATM was inactive for most of the time frame. Because of this, standard time series models like ARIMA or ETS don’t make sense since they rely on consistent patterns to generate forecasts. I will instead use an average-based forecast. I could have also excluded ATM3 because of the lack of data but I decided to do it in case a full ATM analysis is preferred.

atm3_mean <- meanf(atm_ts$ATM3, h = 31)
plot(atm3_mean, main = "ATM3 Mean Forecast")

The forecast is pretty close to zero since the great majority of the values are zero.

ATM4 Forecast

As explained above, ATM4 has high variability and extreme spikes, telling us we have outliers and unstable variance, which can negatively impact standard time series models such as ARIMA and ETS. So, in this case, I will perform a log transformation to stabilize the variance and reduce the influence of outliers, to allow the models to properly capture the actual structure of the data. After the log transformation, I will use ARIMA and ETS and then use exponential to rever to the original scale.

Log Transform ATM4

atm4_log <- log(atm_ts$ATM4 + 1)
plot(atm4_log, main = "ATM4 Log-Transformed")

The log transformation clearly reduced the impact of extremes (spikes).

Fit ARIMA and ETS models in transformed data

ARIMA

atm4_arima <- auto.arima(atm4_log, seasonal = TRUE)
summary(atm4_arima)
## Series: atm4_log 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##         sar1    sar2    mean
##       0.2517  0.1963  5.5355
## s.e.  0.0520  0.0524  0.1170
## 
## sigma^2 = 1.603:  log likelihood = -603.16
## AIC=1214.32   AICc=1214.43   BIC=1229.92
## 
## Training set error measures:
##                         ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.0001690912 1.260881 0.9683324 -9.220668 23.89673 0.8320663
##                    ACF1
## Training set 0.04532681
# residuals
checkresiduals(atm4_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.796, df = 12, p-value = 0.2007
## 
## Model df: 2.   Total lags used: 14

ETS

atm4_ets <- ets(atm4_log)
summary(atm4_ets)
## ETS(M,N,A) 
## 
## Call:
## ets(y = atm4_log)
## 
##   Smoothing parameters:
##     alpha = 0.0082 
##     gamma = 0.151 
## 
##   Initial states:
##     l = 5.3957 
##     s = -1.932 -0.4035 0.1322 0.3953 0.5574 0.6734
##            0.5774
## 
##   sigma:  0.2253
## 
##      AIC     AICc      BIC 
## 2319.008 2319.630 2358.007 
## 
## Training set error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0009740136 1.224111 0.9152034 -8.203619 22.63724 0.7864137
##                    ACF1
## Training set 0.06799998
#residuals
checkresiduals(atm4_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,A)
## Q* = 16.033, df = 14, p-value = 0.3114
## 
## Model df: 0.   Total lags used: 14

Forecast

atm4_arima_fc <- forecast(atm4_arima, h = 31)
atm4_ets_fc   <- forecast(atm4_ets, h = 31)

Convert back using exponential function

atm4_arima_fc$mean  <- exp(atm4_arima_fc$mean) - 1
atm4_arima_fc$lower <- exp(atm4_arima_fc$lower) - 1
atm4_arima_fc$upper <- exp(atm4_arima_fc$upper) - 1
atm4_ets_fc$mean   <- exp(atm4_ets_fc$mean) - 1
atm4_ets_fc$lower <- exp(atm4_ets_fc$lower) - 1
atm4_ets_fc$upper <- exp(atm4_ets_fc$upper) - 1

Plot

plot(atm4_arima_fc, main = "ATM4 ARIMA Forecast")

plot(atm4_ets_fc, main = "ATM4 ETS Forecast")

I tried to forecast using manually transformed data (log transformation) but it produced inconsistent visualizations due to scaling issues. Instead, a log transformation was applied directly within the functions, to make sure the forecasts could be properly interpreted in the original scale.

Fit Models

atm4_arima <- auto.arima(atm_ts$ATM4, lambda = 0)
atm4_ets   <- ets(atm_ts$ATM4, lambda = 0)

Forecast

atm4_arima_fc <- forecast(atm4_arima, h = 31)
atm4_ets_fc   <- forecast(atm4_ets, h = 31)

Plots

plot(atm4_arima_fc, main = "ATM4 ARIMA Forecast")

plot(atm4_ets_fc, main = "ATM4 ETS Forecast")

The plots for both ARIMA and ETS have very wide uncertainty intervals.

Accuracy Test ATM4

I will perform accuracy outputs tests just like I did for ATM1 and ATM2.

Split train and test sets

atm4_train <- head(atm_ts$ATM4, length(atm_ts$ATM4) - 7)
atm4_test  <- tail(atm_ts$ATM4, 7)

Fit models again on training data

atm4_arima_train <- auto.arima(atm4_train, lambda = 0)
atm4_ets_train   <- ets(atm4_train, lambda = 0)

Forecast

atm4_arima_pred <- forecast(atm4_arima_train, h = 31)
atm4_ets_pred   <- forecast(atm4_ets_train, h = 31)

Accuracy

accuracy(atm4_arima_pred, atm4_test)
##                      ME     RMSE      MAE       MPE      MAPE      MASE
## Training set 200.165766 682.1659 320.1630 -227.0513  288.5162 0.7916564
## Test set      -1.785581 213.0636 183.6245 -989.3747 1026.1850 0.4540422
##                     ACF1 Theil's U
## Training set -0.01991079        NA
## Test set     -0.46589144 0.5557771
accuracy(atm4_ets_pred, atm4_test)
##                    ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 171.6512 667.8388 307.5688  -257.2406  317.6044 0.7605152
## Test set     -57.8322 203.0103 154.7923 -1154.3208 1173.3328 0.3827497
##                      ACF1 Theil's U
## Training set -0.005027664        NA
## Test set     -0.272918617 0.4828563

For ATM4, the forecast accuracy is way worse compared to the other ATMs, which was expected because of the high variability in the data. Both ARIMA and ETS models produce considerably large errors, with very high RMSE and MAPE values.

Between the two models, ETS performs a little better, with lower RMSE (203 vs 213) and MAE (155 vs 184), so it has more accurate predictions overall.

It is important to note that a major factor impacting performance is the extreme spike in the data, which significantly increases variability and makes it difficult for both models to produce stable forecasts. This explaines the higher uncertainty and lower predictive accuracy.

Final Forecasts Table

atm1_final <- forecast(atm1_ets, h = 31)
atm2_final <- forecast(atm2_ets, h = 31)
atm3_final <- meanf(atm_ts$ATM3, h = 31)
atm4_final <- forecast(atm4_ets, h = 31)

may_dates <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")

forecast_df <- data.frame(
  DATE = may_dates,
  ATM1 = as.numeric(atm1_final$mean),
  ATM2 = as.numeric(atm2_final$mean),
  ATM3 = as.numeric(atm3_final$mean),
  ATM4 = as.numeric(atm4_final$mean)
)

Export to Excel

write_xlsx(forecast_df, "ATM_May_2010_Forecast.xlsx")

Part A Conclusion

Since the four ATM datasets have different behaviors, I evaluated each individually.

For ATM1 and ATM2, both ARIMA and ETS models performed pretty well, capturing the underlying structure and weekly seasonality in the data. The performance for ARIMA was good but ETS provided a slightly better fit overall and outperformed ARIMA for ATM2 based on forecast accuracy.

ATM3 contained mostly zero values with small activity toward the end of the time period. Due to the lack of meaningful (or any) historical patterns, I used a simple average-based forecast.

ATM4 has high variability and extreme outliers, making forecasting more difficult. I applied a log transformation to stabilize the variance, but both ARIMA and ETS models have reduced accuracy. Among the two, ETS performed slightly better, although results need to be interpreted with caution.

Part B: Forecasting Power

Load Data

power_raw <- "https://raw.githubusercontent.com/JDO-MSDS/Data-624/refs/heads/main/Project%201/data/ResidentialCustomerForecastLoad-624%20-%20ResidentialCustomerForecastLoad.csv"

power <- read.csv(power_raw)

str(power)
## 'data.frame':    192 obs. of  3 variables:
##  $ CaseSequence: int  733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY.MMM    : chr  "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : int  6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...
head(power)

The dataset consists of monthly residential power consumption from January 1998 to December 2013. I will convert the data into a time series object with a frequency of 12 to reflect monthly observations and capture yearly seasonality.

Convert to time series

power_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)

# plot
plot(power_ts, main = "Monthly Power Consumption", ylab = "KWH")

# clean
power_ts_clean <- na.interp(power_ts)

power_ts <- power_ts_clean

The plot shows a repeating seasonal pattern, with yearly clear peaks and dips. So, we can say that residential electricity use is highly seasonal. Most of the series stays within a fairly consistent range, despite some increase in variability towards the end of the time period. There is a sharp drop right after 2010.

Decomposition

power_decomp <- decompose(power_ts)
plot(power_decomp)

I used decomposition to separate the series into trend, seasonal, and irregular components. The seasonal component has a consistent yearly pattern, confirming strong seasonality. The trend component indicates gradual changes in power consumption over time.

ACF and PACF

par(mfrow = c(1,2))

acf(power_ts, main = "ACF - Power")
pacf(power_ts, main = "PACF - Power")

The ACF plot shows strong and repeating spikes at regular intervals, particularly around lag 1.0 years (12 months), so obvious yearly seasonality in the data. The slow decay in autocorrelation suggests the presence of a trend.

The PACF plot has significant spikes at the first few lags, then a gradual decline, indicating that autoregressive terms probably be useful in modeling the series.

Train and test split sets

train <- window(power_ts, end = c(2012, 12))
test  <- window(power_ts, start = c(2013, 1))

Models

ARIMA

power_arima <- auto.arima(train, seasonal = TRUE)
summary(power_arima)
## Series: train 
## ARIMA(1,0,1)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ar1      ma1     sar1     sar2     drift
##       0.6404  -0.4683  -0.8496  -0.6435  7074.385
## s.e.  0.2466   0.2843   0.0670   0.0757  3272.177
## 
## sigma^2 = 6.726e+11:  log likelihood = -2531.84
## AIC=5075.67   AICc=5076.19   BIC=5094.42
## 
## Training set error measures:
##                     ME     RMSE    MAE      MPE    MAPE      MASE        ACF1
## Training set -7708.974 780411.9 476471 -5.00245 11.4285 0.6885211 0.005195015
checkresiduals(power_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(2,1,0)[12] with drift
## Q* = 10.159, df = 20, p-value = 0.9652
## 
## Model df: 4.   Total lags used: 24

auto.arima() selected ARIMA(1,0,1)(2,1,0)[12] model with drift. The inclusion of seasonal terms reflects the strong yearly seasonality mentioned above.

The residual diagnostics indicate a good model fit. The Ljung-Box test produced a p-value of 0.9652, so no significant autocorrelation remains in the residuals. The model successfully captures the temporal structure of the data.

ETS

power_ets <- ets(train)
summary(power_ets)
## ETS(M,N,M) 
## 
## Call:
## ets(y = train)
## 
##   Smoothing parameters:
##     alpha = 0.1653 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 6138783.213 
##     s = 0.9372 0.7484 0.8762 1.1895 1.2708 1.2056
##            0.996 0.7612 0.8077 0.917 1.0773 1.213
## 
##   sigma:  0.1172
## 
##      AIC     AICc      BIC 
## 5813.949 5816.875 5861.843 
## 
## Training set error measures:
##                    ME     RMSE    MAE       MPE     MAPE    MASE      ACF1
## Training set 26550.07 819660.3 493314 -5.169436 12.48746 0.71286 0.1504382
checkresiduals(power_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,M)
## Q* = 26.989, df = 24, p-value = 0.305
## 
## Model df: 0.   Total lags used: 24

The selected ETS model is ETS(M,N,M), which captures multiplicative seasonality without a trend. The model reflects the strong seasonal variation mentioned above.

Residual diagnostics show that the model gives an adequate fit, with a Ljung-Box test p-value of 0.305. This means that the residuals behave close to white noise, despite the fit is less strong compared to the ARIMA model.

Forecast

arima_fc <- forecast(power_arima, h = 12)
ets_fc   <- forecast(power_ets, h = 12)

Accuracy

accuracy(arima_fc, test)
##                      ME      RMSE     MAE       MPE     MAPE      MASE
## Training set  -7708.974  780411.9  476471 -5.002450 11.42850 0.6885211
## Test set     692816.151 1585012.3 1028865  7.698308 12.46676 1.4867540
##                     ACF1 Theil's U
## Training set 0.005195015        NA
## Test set     0.016928790 0.8562476
accuracy(ets_fc, test)
##                     ME      RMSE      MAE       MPE      MAPE     MASE
## Training set  26550.07  819660.3 493314.0 -5.169436 12.487462 0.712860
## Test set     414152.94 1063538.6 692049.5  4.371255  8.300801 1.000041
##                    ACF1 Theil's U
## Training set 0.15043825        NA
## Test set     0.09665769 0.6098413

I compared the forecasts using a 12 month test set (2013) to evaluate which one is the best model. Both models capture the general seasonal pattern, but there are differences in accuracy. The ETS model outperforms ARIMA across all key metrics on the test set since RMSE, MAE, and MAPE are lower.

So, despite the ARIMA model showed great residual diagnostics, the forecast accuracy is noticeably worse than ETS. So, ETS is a better choice to capture the multiplicative seasonal structure of the data.

Therefore, I selected ETS(M,N,M).

2014 Forecast

final_forecast <- forecast(power_ets, h = 12)
plot(final_forecast)

# export to excel
forecast_2014 <- data.frame(
  Month = seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month"),
  Forecast = as.numeric(final_forecast$mean)
)

write_xlsx(forecast_2014, "Power_2014_Forecast.xlsx")

Part B Conclucsion

The power consumption data set has strong yearly seasonality, which was effectively captured by both ARIMA and ETS models. The ARIMA provided a good statistical fit based on the residual diagnostics, however, ETS ended up having a superior forecasting performance on the test set. Because of this, I selected ETS as the final model for predicting power consumption in 2014.

Part C: Waterflow

Load Data

pipe1_raw <- "https://raw.githubusercontent.com/JDO-MSDS/Data-624/refs/heads/main/Project%201/data/Waterflow_Pipe1%20-%20Waterflow_Pipe1.csv"
pipe1 <- read.csv(pipe1_raw)


pipe2_raw <- "https://raw.githubusercontent.com/JDO-MSDS/Data-624/refs/heads/main/Project%201/data/Waterflow_Pipe2%20-%20Waterflow_Pipe2.csv"
pipe2 <- read.csv(pipe2_raw)

str(pipe1)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date.Time: num  42300 42300 42300 42300 42300 ...
##  $ WaterFlow: num  23.4 28 23.1 30 6 ...
head(pipe1)
str(pipe2)
## 'data.frame':    1000 obs. of  2 variables:
##  $ Date.Time: num  42300 42300 42300 42300 42300 ...
##  $ WaterFlow: num  18.8 43.1 38 36.1 31.9 ...
head(pipe2)

Convert datetime

pipe1 <- pipe1 |>
  mutate(
    Date.Time = as.POSIXct(Date.Time * 86400, origin = "1899-12-30", tz = "UTC"),
    hour = as.POSIXct(format(Date.Time, "%Y-%m-%d %H:00:00"), tz = "UTC")
  )

pipe2 <- pipe2 |>
  mutate(
    Date.Time = as.POSIXct(Date.Time * 86400, origin = "1899-12-30", tz = "UTC"),
    hour = as.POSIXct(format(Date.Time, "%Y-%m-%d %H:00:00"), tz = "UTC")
  )

Aggregate

Using the mean as suggested

pipe1_hourly <- pipe1 |>
  group_by(hour) |>
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))

pipe2_hourly <- pipe2 |>
  group_by(hour) |>
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))

Plot

plot(pipe1_hourly$hour, pipe1_hourly$WaterFlow,
     type = "l", main = "Pipe 1 Hourly Flow")

plot(pipe2_hourly$hour, pipe2_hourly$WaterFlow,
     type = "l", main = "Pipe 2 Hourly Flow")

The water flow data sets have timestamped observations recorded at irregular intervals. I converted the timestamp from Excel numeric format to standard datetime format. To align the data, I aggregated the observations hourly. For hours with various readings, I used the average water flow.

Convert to a time series with 24 hours

pipe1_ts <- ts(pipe1_hourly$WaterFlow, frequency = 24)
pipe2_ts <- ts(pipe2_hourly$WaterFlow, frequency = 24)

Plot and ACF for both pipes time series

plot(pipe1_ts, main = "Pipe 1 Time Series")

acf(pipe1_ts)

plot(pipe2_ts, main = "Pipe 2 Time Series")

acf(pipe2_ts)

The hourly series for both pipes 1 and 2 seem to fluctuate around a relatively constant level with no obvious long-term trend. The variability is fairly stable over time. The ACF plots show that most autocorrelation values are within the confidence bounds after the first lag. So, both series behave close to white noise. They are kind of stationary but they do not have strong patterns/structure, so forecasting doesn’t make much sense.