ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
Load data and cleanup
suppressWarnings({
atm_data <- read_excel("ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
mutate(DATE = as_date(DATE)) %>%
as_tsibble(index = DATE, key = ATM) %>%
filter(DATE < "2010-05-01")
head(atm_data)
})
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
Data Analysis
str(atm_data)
## tbl_ts [1,460 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ DATE: Date[1:1460], format: "2009-05-01" "2009-05-02" ...
## $ ATM : chr [1:1460] "ATM1" "ATM1" "ATM1" "ATM1" ...
## $ Cash: num [1:1460] 96 82 85 90 99 88 8 104 87 93 ...
## - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ ATM : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
## ..$ .rows: list<int> [1:4]
## .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:365] 366 367 368 369 370 371 372 373 374 375 ...
## .. ..$ : int [1:365] 731 732 733 734 735 736 737 738 739 740 ...
## .. ..$ : int [1:365] 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
## - attr(*, "index")= chr "DATE"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "DATE"
## - attr(*, "interval")= interval [1:1] 1D
## ..@ .regular: logi TRUE
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
Handling missing values
missing_ATMs <- atm_data %>%
filter(is.na(Cash)) %>%
select(DATE, ATM)
missing_ATMs
## # A tsibble: 5 x 2 [1D]
## # Key: ATM [2]
## DATE ATM
## <date> <chr>
## 1 2009-06-13 ATM1
## 2 2009-06-16 ATM1
## 3 2009-06-22 ATM1
## 4 2009-06-18 ATM2
## 5 2009-06-24 ATM2
# Filter data for ATM1 and ATM2
ATM1_data <- atm_data %>% filter(ATM == "ATM1")
ATM2_data <- atm_data %>% filter(ATM == "ATM2")
# Fit ARIMA model to ATM1
ATM1_arima <- auto.arima(ATM1_data$Cash, seasonal = TRUE)
# Forecast missing values for ATM1 using the fitted ARIMA model
ATM1_data$Cash_filled <- ifelse(is.na(ATM1_data$Cash),
forecast(ATM1_arima, h = sum(is.na(ATM1_data$Cash)))$mean,
ATM1_data$Cash)
# Fit ARIMA model to ATM2
ATM2_arima <- auto.arima(ATM2_data$Cash, seasonal = TRUE)
# Forecast missing values for ATM2 using the fitted ARIMA model
ATM2_data$Cash_filled <- ifelse(is.na(ATM2_data$Cash),
forecast(ATM2_arima, h = sum(is.na(ATM2_data$Cash)))$mean,
ATM2_data$Cash)
ATM_filled <- atm_data %>%
filter(!(ATM %in% c("ATM1", "ATM2"))) %>%
bind_rows(ATM1_data %>% select(ATM, Cash = Cash_filled),
ATM2_data %>% select(ATM, Cash = Cash_filled))
# Checking if it worked
final_missing <- sum(is.na(ATM_filled$Cash))
cat("Final combined data - Missing values:", final_missing, "\n")
## Final combined data - Missing values: 0
According to Section 13.9 of Hyndman and Athanasopoulos’s “Forecasting: Principles and Practice,” ARIMA models are effective for filling gaps in time series data. So, I used ARIMA models to impute the missing values in the cash series for ATM1 and ATM2. This produced complete datasets for both machines and improves the overall accuracy of the analysis.
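For reference, the fpp3 framework also offers a more compact route to the same idea: fit the model with fable’s ARIMA() and let interpolate() replace the NAs with model-based estimates. A minimal sketch (the names atm12 and atm12_interp are illustrative), assuming the fable/fabletools packages are loaded:
# Sketch: interpolate() fills the NA positions with estimates from the fitted model
atm12 <- atm_data %>% filter(ATM %in% c("ATM1", "ATM2"))
atm12_interp <- atm12 %>%
model(ARIMA(Cash)) %>%
interpolate(atm12)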
head(ATM_filled)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
Data Exploration
ATM_filled %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = " ATM Withdrawals")
It looks like ATM4 has outliers that we should address. We use the interquartile range (IQR) method, which flags as outliers any points that fall below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.
ATM_4 <- ATM_filled %>%
filter(ATM == "ATM4")
Q1 <- quantile(ATM_4$Cash, 0.25, na.rm = TRUE)
Q3 <- quantile(ATM_4$Cash, 0.75, na.rm = TRUE)
IQR_value <- IQR(ATM_4$Cash, na.rm = TRUE)
# Determine the outlier thresholds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Identify outliers
outliers <- ATM_4 %>%
filter(Cash < lower_bound | Cash > upper_bound)
# Display outliers
print(outliers)
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-09-22 ATM4 1712.
## 2 2010-01-29 ATM4 1575.
## 3 2010-02-09 ATM4 10920.
The value 10919.762 is much greater than the others, so we use tsoutliers() to replace this outlier in ATM4.
ATM4_ts <- ts(ATM_4$Cash, frequency = 365, start = c(2009, 1)) # Adjust start as necessary
# Identify outliers using tsoutliers
outlier_results <- tsoutliers(ATM4_ts)
# Print outlier information
print(outlier_results)
## $index
## [1] 285
##
## $replacements
## [1] 230.1752
ATM_4$Cash[outlier_results$index] <- outlier_results$replacements
print(ATM_4$Cash[285])
## [1] 230.1752
autoplot(ATM_4) +
labs(title="ATM4", subtitle="Cash withdrawals daily", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
ATM_filled <- ATM_filled %>%
rename(date=DATE, atm=ATM, cash=Cash) %>%
mutate(date = as.Date(date)) %>%
pivot_wider(names_from=atm, values_from = cash)
ATM_1 <- ATM_filled %>%
select(date, ATM1)
ATM_1%>%
autoplot(ATM1) +
ggtitle("Original ATM1")
# Create the second plot (STL Decomposition)
ATM_1 %>%
model(STL(ATM1 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
The first plot (the original, untransformed series) shows the raw data over time, revealing substantial variability and pronounced weekly seasonality; the clear peaks and troughs indicate periodic fluctuations in cash withdrawals. The STL decomposition uncovers three main insights. The trend component indicates a general increase in cash withdrawals starting in mid-2009, reaching a peak before declining slightly in early 2010.
The seasonal component captures a stable weekly pattern, with consistent fluctuations that recur each week, reflecting predictable withdrawal behavior tied to the weekly cycle. Finally, the remainder component shows short-term irregularities (noise), with notable spikes around early 2010 suggesting unexpected events affecting withdrawals during that period.
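The strength of these components can also be quantified rather than judged by eye; a short sketch using feasts’ feat_stl feature set (values near 1 indicate a strong component; exact column names, e.g. seasonal_strength_week, depend on the seasonal periods detected):
# Trend and seasonal strength measures from an STL-based feature set
ATM_1 %>%
features(ATM1, feat_stl)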
ATM_1 %>%
ACF(ATM1, lag_max = 28) %>%
autoplot()
The ACF plot shows significant spikes at lags 2, 5, and 7, with lag 7 being the strongest. The drop-off in the ACF could mean the data are non-stationary and may need differencing.
ndiffs(ATM_1$ATM1)
## [1] 0
No differencing is required.
Rationale for choosing the following models: The additive ETS model is appropriate because the seasonality is additive, as indicated by the STL seasonal component, and there is no distinct trend present. The SNAIVE model is a sensible choice due to the strong and consistent weekly seasonality observed in the data.
Lastly, the ARIMA model is included because, although the primary focus is on seasonality, it may also account for underlying autocorrelations in the remainder component.
lambda <- ATM_1 %>%
features(ATM1, features = guerrero) %>%
pull(lambda_guerrero)
atm1_fit <- ATM_1 %>%
model(
additive = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(ATM1),
ARIMA = ARIMA(box_cox(ATM1, lambda))
)
forecast_ATM1 <- atm1_fit %>%
forecast(h = 30)
# Plot the forecasts
forecast_ATM1 %>%
autoplot(ATM_1, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM1 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
accuracy(atm1_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 3 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 23.8 15.1 -106. 121.
## 2 snaive 27.9 17.8 -96.6 116.
## 3 ARIMA 25.1 16.2 -87.8 107.
Based on the RMSE (23.81) and MAE (15.11), the additive ETS model appears to be the best overall fit for our data, as it has the lowest values for both metrics.
Forecasting
ATM1_bestfit <- ATM_1 %>%
model(
ETS = ETS(ATM1))
forecast_ATM1 <- ATM1_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ETS')
forecast_ATM1 %>%
autoplot(ATM_1) +
labs(title = "Forecast for ATM1 using ETS",
y = "Cash Values in Hundreds")
ATM1_bestfit%>%
gg_tsresiduals() +
labs(title="Residuals for ETS")
ATM2
ATM_2 <- ATM_filled %>%
select(date, ATM2)
ATM_2 %>%
autoplot(ATM2) +
ggtitle("Original ATM2")
# STL decomposition
ATM_2 %>%
model(STL(ATM2 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
ATM_2 %>%
ACF(ATM2, lag_max = 28) %>%
autoplot()
We see again that the data are non-stationary and could potentially benefit from differencing. We can check whether that is the case:
ndiffs(ATM_2$ATM2)
## [1] 1
ATM_2_diff <- ATM_2 %>%
mutate(ATM2_diff = ATM2 - lag(ATM2, 1)) %>%
filter(!is.na(ATM2_diff))
# Perform KPSS test
kpss_test <- kpss.test(ATM_2_diff$ATM2_diff)
## Warning in kpss.test(ATM_2_diff$ATM2_diff): p-value greater than printed
## p-value
print(kpss_test)
##
## KPSS Test for Level Stationarity
##
## data: ATM_2_diff$ATM2_diff
## KPSS Level = 0.014714, Truncation lag parameter = 5, p-value = 0.1
mean_value <- mean(ATM_2_diff$ATM2_diff[ATM_2_diff$ATM2_diff >= 0], na.rm = TRUE)
ATM_2_diff$ATM2_diff[ATM_2_diff$ATM2_diff < 0] <- mean_value
Since the p-value (0.1) exceeds the typical significance level (0.05), we do not reject the KPSS null hypothesis of stationarity. This indicates that the differenced series is stationary.
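The same test is available inside the feature framework used elsewhere in this report; a sketch with feasts’ unitroot_kpss (same null hypothesis of stationarity):
# KPSS statistic and p-value via features(); should agree with kpss.test() above
ATM_2_diff %>%
features(ATM2_diff, unitroot_kpss)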
Additionally, while examining ATM_2_diff, I noticed a significant number of negative values. This occurred because the differencing operation calculates the change in withdrawals between consecutive periods. If ATM2 declines from one period to the next, ATM2_diff will be negative.
For instance, if withdrawals were $100 one day and dropped to $80 the next, the difference would be -$20. To address this issue, I opted to impute the negative values by replacing them with the mean of the non-negative differences.
# Calculate lambda for Box-Cox transformation on the differenced data
lambda <- ATM_2_diff %>%
features(ATM2_diff, features = guerrero) %>%
pull(lambda_guerrero)
# Fit the models using the differenced data
atm2_fit <- ATM_2_diff %>%
model(
additive = ETS(ATM2_diff ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(ATM2_diff),
ARIMA = ARIMA(box_cox(ATM2_diff, lambda))
)
# Forecast for the next 30 periods
forecast_ATM2 <- atm2_fit %>%
forecast(h = 30)
# Plot the forecasts
forecast_ATM2 %>%
autoplot(ATM_2_diff, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM2 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
# Get stats for the models
model_stats <- left_join(
glance(atm2_fit) %>% select(.model:BIC),
accuracy(atm2_fit) %>% select(.model, RMSE)
) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
# Display accuracy metrics
accuracy(atm2_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 3 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 20.8 13.9 -Inf Inf
## 2 snaive 27.6 15.9 -Inf Inf
## 3 ARIMA 22.1 13.6 -Inf Inf
# Display the model statistics
model_stats
## # A tibble: 3 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 444. -2178. 4376. 4377. 4415. 20.8
## 2 ARIMA 1622. -1862. 3732. 3733. 3748. 22.1
## 3 snaive 765. NA NA NA NA 27.6
The ARIMA model appears to be the best fit for the differenced ATM2 data: it has the lowest AIC, BIC, and MAE, and an RMSE close to that of the additive ETS model.
Forecast
ATM2_bestfit <- ATM_2_diff %>%
model(
ARIMA = ARIMA(ATM2_diff))
ATM2_bestfit %>% select(.model = "ARIMA") %>% report()
## Series: ATM2_diff
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## sar1 sar2 constant
## 0.3051 0.3525 17.6802
## s.e. 0.0489 0.0502 1.1016
##
## sigma^2 estimated as 495.1: log likelihood=-1646.04
## AIC=3300.09 AICc=3300.2 BIC=3315.68
forecast_ATM2 <- ATM2_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forecast plot
forecast_ATM2 %>%
autoplot(ATM_2_diff) +
ggtitle(TeX("ATM 2 Forecast with $ARIMA(0,0,0)(2,0,0)_7$"))
# residuals
ATM2_bestfit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX("Residuals for $ARIMA(0,0,0)(2,0,0)_7$"))
ATM_filled %>%
autoplot(ATM3) +
ggtitle("Original ATM3")
Since ATM3 has only three non-zero observations (every other day records zero withdrawals), it is not feasible to forecast the month of May using traditional methods, as there is insufficient data to discern trends or seasonal patterns.
In this situation, calculating the mean of these three days serves as an effective way to estimate cash withdrawals for the month. This approach provides a simple and practical solution, representing the average withdrawal behavior observed during those recorded days.
ATM3_fit <- ATM_filled %>%
filter(ATM3 != 0) %>%
model(MEAN(ATM3))
Forecast
forecast_atm3 <- ATM3_fit %>%
forecast(h = 31)
# forecast plot
forecast_atm3 %>%
autoplot(ATM_filled) +
ggtitle("ATM 3 Forecast with the MEAN() Model")
ATM4
ATM_4%>%
autoplot() +
labs(title = "Original ATM4")
## Plot variable not specified, automatically selected `.vars = Cash`
ATM_4%>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4 ")
ATM_4 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
The ACF plot displays a significant spike at lag 7, along with a smaller spike at lag 28. Following lag 7, the ACF shows a gradual decline, with another increase around lag 28.
This slow decline, coupled with the rise at lag 28, indicates that the series may exhibit some autocorrelation over time, potentially signaling non-stationarity. Therefore, it would be appropriate to consider a test for differencing.
ndiffs(ATM_4$Cash)
## [1] 0
No differencing is needed. I will evaluate the following models: ARIMA, SNAIVE, Additive ETS, and Multiplicative ETS (transformed). ARIMA is included because it can accommodate both trend and seasonal patterns, and the Box-Cox transformation helps stabilize variance while capturing autocorrelation.
The Additive ETS and Multiplicative ETS (transformed) models are also considered since the transformation aids in variance stabilization, and ETS is particularly effective for modeling seasonality.
lambda <- ATM_4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_fit <- ATM_4 %>%
model(
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(Cash),
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
forecast_ATM4 <- atm4_fit %>%
forecast(h = 30)
forecast_ATM4 %>%
autoplot(ATM_4, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM4 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
# stats for the models
left_join(glance(atm4_fit) %>% select(.model:BIC),
accuracy(atm4_fit) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 4 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 110850. -3192. 6404. 6405. 6443. 329.
## 2 multiplicative_ts 0.219 -2003. 4026. 4027. 4065. 343.
## 3 ARIMA 176. -1461. 2931. 2931. 2951. 352.
## 4 snaive 205805. NA NA NA NA 453.
accuracy(atm4_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 4 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 329. 265. -515. 545.
## 2 snaive 453. 346. -392. 447.
## 3 multiplicative_ts 343. 262. -374. 417.
## 4 ARIMA 352. 274. -368. 413.
ARIMA appears to be the best overall model when considering both model fit (AIC, BIC) and error metrics like RMSE and MAPE. It has the lowest AIC/BIC, indicating it captures the underlying structure with minimal complexity.
ATM4_bestfit <- ATM_4 %>%
model(
ARIMA = ARIMA(Cash))
ATM4_bestfit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.8778 -0.6769 0.1700 0.9634 0.8065 0.2484 795.0206
## s.e. 0.1004 0.1186 0.0564 0.0906 0.1079 0.0556 48.8937
##
## sigma^2 estimated as 117528: log likelihood=-2645.21
## AIC=5306.42 AICc=5306.82 BIC=5337.62
forecast_ATM4 <- ATM4_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forecast plot
forecast_ATM4 %>%
autoplot(ATM_4) +
ggtitle(TeX("ATM 4 Forecast with $ARIMA(3,0,2)(1,0,0)_7$"))
# residuals
ATM4_bestfit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX("Residuals for $ARIMA(3,0,2)(1,0,0)_7$"))
forecast_ATM1 <- forecast_ATM1 %>%
rename(atm1 = .mean)
forecast_ATM2 <- forecast_ATM2 %>%
rename(atm2 = .mean)
forecast_atm3 <- forecast_atm3 %>%
rename(atm3 = .mean)
forecast_ATM4 <- forecast_ATM4 %>%
rename(atm4 = .mean)
combined_forecasts <- bind_cols(forecast_ATM1, forecast_ATM2, forecast_atm3, forecast_ATM4)
## New names:
## • `.model` -> `.model...1`
## • `date` -> `date...2`
## • `.model` -> `.model...5`
## • `date` -> `date...6`
## • `.model` -> `.model...9`
## • `date` -> `date...10`
## • `.model` -> `.model...14`
combined_forecasts <- combined_forecasts %>%
rename(date = date...2)
final_forecast <- combined_forecasts %>%
select(date, atm1, atm2, atm3, atm4)
final_forecast %>% write.csv("final_forecasts.csv")
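The CSV above is Excel-readable, which satisfies the deliverable; if a native .xlsx file is preferred, one option is the writexl package (a sketch, assuming writexl is installed):
# Optional: export the same forecasts to an Excel workbook
library(writexl)
write_xlsx(as.data.frame(final_forecast), "final_forecasts.xlsx")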
Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
res<-read.csv("https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv")
head(res)
## CaseSequence YYYY.MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
res_ts <- res %>%
rename(Date = 'YYYY.MMM') %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date)
head(res_ts)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH
## <int> <mth> <int>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
We renamed the ‘YYYY.MMM’ column to ‘Date’ for clarity, converted it to a proper year-month format so that R treats it as dates, and converted the data into a tsibble.
res_ts %>%
autoplot(KWH)
From the visualization above we can clearly see a possible outlier. We’ll move on to investigating the data for missing values and outliers.
Data Exploration
summary(res_ts$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
There’s one missing value in our data. As before, we’ll fit an ARIMA model to handle the missing observation.
missing_KWH<- res_ts %>% filter(is.na(KWH)) %>% select(Date, KWH)
missing_KWH
## # A tsibble: 1 x 2 [1M]
## Date KWH
## <mth> <int>
## 1 2008 Sep NA
res_arima <- auto.arima(res_ts$KWH, seasonal = TRUE)
# Count the number of missing values in 'KWH'
num_missing <- sum(is.na(res_ts$KWH))
# If there are missing values, forecast them using the fitted ARIMA model
if (num_missing > 0) {
forecast_values <- forecast(res_arima, h = num_missing)$mean
# Create a new column 'KWH_filled' by filling NA with forecast values
res_filled <- res_ts %>%
mutate(KWH_filled = ifelse(is.na(KWH), forecast_values, KWH))
} else {
res_filled <- res_ts %>%
mutate(KWH_filled = KWH)
}
res_filled <- res_filled %>%
select(CaseSequence,Date, KWH_filled)
summary(res_filled$KWH_filled) ## Check if the missing value is handled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6314472 6516367 7649733 10655730
Outliers
We can manually identify values that deviate significantly from the rest of the series. We’ll treat as outliers any values that lie more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
Q1 <- quantile(res_filled$KWH_filled, 0.25)
Q3 <- quantile(res_filled$KWH_filled, 0.75)
IQR <- Q3 - Q1
# Define upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
potential_outliers <- res_filled$KWH_filled[res_filled$KWH_filled < lower_bound | res_filled$KWH_filled > upper_bound]
print(potential_outliers)
## [1] 770523
KWH_filled_with_na <- res_filled$KWH_filled
KWH_filled_with_na[KWH_filled_with_na < lower_bound | KWH_filled_with_na > upper_bound] <- NA
# Step 2: Fit an ARIMA model (including NA values)
arima_model <- auto.arima(KWH_filled_with_na)
# Step 3: Forecast missing values
forecasts <- forecast(arima_model, h = sum(is.na(KWH_filled_with_na)))
# Impute the NA values with the forecasted values
KWH_imputed <- KWH_filled_with_na
KWH_imputed[is.na(KWH_imputed)] <- forecasts$mean
res_filled$KWH_filled <- KWH_imputed
head(res_filled)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH_filled
## <int> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
In the previous steps, we began by identifying outliers in the KWH_filled values by comparing them to established lower and upper bounds, marking any outlier values as NA. We then fitted an ARIMA model to the KWH_filled data, including the NA entries, to capture the underlying patterns in the time series.
Using this model, we forecasted the missing values and replaced the NA entries with the predicted values from the ARIMA forecasts.
Finally, we updated the original res_filled tsibble with the imputed KWH_filled values and printed it to verify the changes. Next, we will examine the plot.
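As a cross-check on this two-step approach (flag outliers, then impute with ARIMA forecasts), the forecast package’s tsclean() performs outlier replacement and missing-value interpolation in a single call; a sketch on a monthly ts object built from the raw series:
# One-step alternative: tsclean() replaces outliers and fills NAs
kwh_ts <- ts(res_ts$KWH, start = c(1998, 1), frequency = 12)
kwh_clean <- tsclean(kwh_ts)
summary(kwh_clean)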
res_filled <- res_filled %>%
mutate(KWH_filled = KWH_filled / 1000)
head(res_filled)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH_filled
## <int> <mth> <dbl>
## 1 733 1998 Jan 6863.
## 2 734 1998 Feb 5838.
## 3 735 1998 Mar 5421.
## 4 736 1998 Apr 5010.
## 5 737 1998 May 4665.
## 6 738 1998 Jun 6467.
This conversion from kilowatt hours to thousands of kilowatt hours makes the data easier to interpret, especially when dealing with larger quantities of energy.
res_filled%>%
autoplot(KWH_filled) +
ggtitle("Residential Power Usage cleaned plot")
res_filled %>%
model(STL(KWH_filled ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition cleaned plot")
The data reveals a distinct upward trend, although it contributes less variation than other factors. Additionally, there is an annual pattern, with power consumption reaching its highest levels in the summer and winter months. Notably, random fluctuations include a significant spike in December 2013. A seasonal plot reinforces the observed rise in power usage during peak months, while the off-peak months likely vary each year due to differences in weather conditions.
res_filled$KWH_filled <- as.numeric(res_filled$KWH_filled)
# Create the time series object
plot_season <- ts(res_filled$KWH_filled, start = c(1998, 1), frequency = 12)
ggseasonplot(plot_season) +
ggtitle("Residential Power Usage Seasonally by Year")
The chart above illustrates the seasonal trend in residential power consumption from 1998 to 2013, indicating peaks during the summer months (June to August) and noticeable declines in spring (March to May) and fall (September to November). This pattern suggests that power usage is higher in warmer months, likely driven by increased air conditioning needs.
power_lambda <- res_filled %>%
features(KWH_filled, features = guerrero) %>%
pull(lambda_guerrero)
res_filled <- res_filled %>%
mutate(KWH_bc = box_cox(KWH_filled, power_lambda))
summary(res_filled$KWH_bc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.040 3.058 3.069 3.069 3.082 3.103
res_filled %>%
autoplot(KWH_bc) +
labs(y = "KWH in Thousands")
We initially employed the Guerrero method to determine the optimal Box-Cox transformation parameter (lambda) for the KWH_filled data, aiming to stabilize its variance. Next, we applied the Box-Cox transformation to the KWH_filled values, resulting in a new variable (KWH_bc). Finally, we summarized the transformed data and generated a plot to visualize the transformed values (KWH_bc). This approach enhances the normality and homoscedasticity of the dataset.
res_filled %>%
ACF(KWH_bc, lag_max = 36) %>%
autoplot()
The ACF plot reveals significant autocorrelations at lags that are multiples of 12, highlighting strong seasonal patterns in the data. The gradual decrease in the ACF indicates that the series may be non-stationary, while the PACF (not shown) displays a sharp decline after lag 1, further suggesting potential non-stationarity.
Given these observations, it may be necessary to apply differencing, especially seasonal differencing, to achieve stationarity prior to modeling.
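Whether a seasonal difference is also warranted can be checked directly; a short sketch using feasts’ unitroot_nsdiffs feature (assuming feasts is loaded, as it already provides the guerrero feature used above; nsdiffs() from the forecast package is an equivalent check):
# Suggested number of seasonal differences for the Box-Cox transformed series
res_filled %>%
features(KWH_bc, unitroot_nsdiffs)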
ndiffs(res_filled$KWH_bc)
## [1] 1
The output of 1 indicates that we need to difference our KWH_bc data once to achieve stationarity.
diff_res <- res_filled %>%
mutate(diff_KWH= difference(KWH_filled), diff_KWH_bc = difference(KWH_bc)) %>%
select(-KWH_filled, -KWH_bc) %>%
slice(-1)
print(kpss.test(diff_res$diff_KWH_bc))
## Warning in kpss.test(diff_res$diff_KWH_bc): p-value greater than printed
## p-value
##
## KPSS Test for Level Stationarity
##
## data: diff_res$diff_KWH_bc
## KPSS Level = 0.039109, Truncation lag parameter = 4, p-value = 0.1
ndiffs(diff_res$diff_KWH_bc)
## [1] 0
We can conclude that the differenced Box-Cox transformed data (diff_KWH_bc) appears to be stationary.
Modeling
diff_train <- diff_res %>%
filter(year(Date) < 2013)
diff_fit <- diff_train %>%
model(
ARIMA = ARIMA(diff_KWH),
`Auto ARIMA` = ARIMA(diff_KWH, stepwise = FALSE, approx = FALSE)
)
diff_fc <- diff_fit %>%
forecast(h = "1 year")
#plot
diff_fc %>%
autoplot(diff_res, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts")+
xlab("Date") +
ylab("KWH in Thousands")
In the code above, we first filtered the differenced data (diff_res) to retain only the observations prior to 2013, then fitted both ARIMA and Auto ARIMA models to this dataset. Next, we generated forecasts for the year 2013 using these models. Finally, we created a plot to visualize the forecasts from each model, illustrating their predictions for KWH usage over time.
train <- res_filled %>%
filter(year(Date) < 2013)
#models
fit <- train %>%
model(
ETS = ETS(KWH_filled),
`Additive ETS` = ETS(KWH_filled ~ error("A") + trend("A") + season("A")),
SNAIVE = SNAIVE(KWH_filled)
)
#forecast of 2013
fc <- fit %>%
forecast(h = "1 year")
#plot
fc %>%
autoplot(res_filled, level = NULL)+
facet_wrap( ~ .model, scales = "free_y",ncol=2) +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts",
subtitle = "January 2013 - December 2013")+
xlab("Month") +
ylab("KWH in Thousands")
In the first analysis, we used differencing to make the time series data stationary, which is important for ARIMA models to accurately capture the underlying patterns by removing trends and seasonality. This was done with the differenced data (diff_res). In the second analysis, we used the original filled data (res_filled) for ETS and SNAIVE models, which can directly handle trends and seasonality without needing to difference the data. By comparing both approaches, we can see how ARIMA focuses on changes in the data while ETS and SNAIVE use the original data’s structure. This helps us identify the best forecasting method based on model performance.
accuracy(diff_fc, diff_res) %>%
select(.model, RMSE:MAE)
## # A tibble: 2 × 3
## .model RMSE MAE
## <chr> <dbl> <dbl>
## 1 ARIMA 1252. 904.
## 2 Auto ARIMA 1187. 823.
accuracy(fc, res_filled) %>%
select(.model, RMSE:MAE)
## # A tibble: 3 × 3
## .model RMSE MAE
## <chr> <dbl> <dbl>
## 1 Additive ETS 1047. 647.
## 2 ETS 1011. 681.
## 3 SNAIVE 1036. 619.
ETS shows the lowest RMSE (1011.443) and a competitive MAE (680.7843), indicating it’s the best-performing model among those listed.
Results
ETS_fit <- res_filled %>%
model(ETS = ETS(KWH_filled))
ETS_forecast <- ETS_fit %>%
forecast(h=12)
ETS_forecast %>%
autoplot(res_filled) +
labs(title = "Residential Power Usage Forecast using ETS",
y = "KWH in Thousands")
The generated forecast seems reliable, as it effectively reflects the seasonal patterns observed, indicating higher demand during the summer and winter months. It also accurately falls between the peaks and valleys recorded in the more recent years.
final_forec <- ETS_forecast %>%
rename(KWH_2014 = .mean) %>%
select(Date, KWH_2014)
final_forec %>% write.csv("KWH_forecasts.csv")