Part A- ATM data

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward.

Load Libraries

Prepare Data

file_path <- ("~/Desktop/Data 624 fall 2024/Data624-Project1/ATM624Data.xlsx")
atm_data <- read_excel(file_path)

The DATE column is represented in Excel serial date format, which counts days from January 1, 1900. So in order to convert it to a YY-MM-DD format we have add the number of days to 1899-12-30.

atm_data <- atm_data %>%
  mutate(
    Date = as.Date(DATE, origin = "1899-12-30"),  # Convert Excel serial date format to Date
    Cash = as.numeric(Cash)  # Ensure Cash column is numeric
  )

# Create a tsibble for time series analysis
atm_tsibble <- atm_data %>%
  as_tsibble(index = Date, key = ATM)

glimpse(atm_tsibble)

## Rows: 1,474
## Columns: 4
## Key: ATM [5]
## $ DATE <dbl> 39934, 39935, 39936, 39937, 39938, 39939, 39940, 39941, 39942, 39…
## $ ATM  <chr> "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "…
## $ Cash <dbl> 96, 82, 85, 90, 99, 88, 8, 104, 87, 93, 86, 111, 75, 6, 102, 73, …
## $ Date <date> 2009-05-01, 2009-05-02, 2009-05-03, 2009-05-04, 2009-05-05, 2009…

I noticed there are NA values for Cash and ATM. Since we’re working with ATMs, I’m going to assume the ATMs were out of service that day and I’m going to impute the missing values with zeros for each ATM. The zeros will represent when no withdrawals occurred. As for the NA values for ATM, it would be best to remove them.

# Remove rows with missing or blank ATM labels
atm_tsibble <- atm_tsibble %>%
  filter(!is.na(ATM) & ATM != "")

# impute with zeros
atm_tsibble <- atm_tsibble %>%
  mutate(Cash = replace_na(Cash, 0))

Begin Modeling

# Autoplot the time series for all ATMs 
atm_tsibble %>%
  autoplot(Cash) +
  labs(
    title = "Cash Withdrawals for All ATMs (Original Series)",
    y = "Cash (in hundreds of dollars)"
  ) +
  facet_wrap(~ ATM, scales = "free_y")

For ATM1 and ATM2: The data appears consistent, showing regular fluctuations around similar ranges. There seems to be a normal pattern of daily cash withdrawals.

For ATM3: Most of the series shows very low or no withdrawals, except for a sudden spike toward the end. This ATM could have been out of commission until 2010. There is not enough data for atm 3 to forecast.

For ATM4: There’s a significant outlier with withdrawals exceeding 9000. This spike may need to be handled separately. I don’t see how an ATM could carry this much money.

The outlier in atm 4 can be removed

# Remove outliers from ATM4
atm_tsibble <- atm_tsibble %>%
  mutate(Cash = if_else(ATM == "ATM4" & Cash > 5000, NA_real_, Cash))  


atm_tsibble <- atm_tsibble %>%
  fill_gaps(Cash = 0)  # Impute remaining NA values with 0

# KPSS test for each ATM
atm1_kpss <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  features(Cash, unitroot_kpss)

atm2_kpss <- atm_tsibble %>%
  filter(ATM == "ATM2") %>%
  features(Cash, unitroot_kpss)

atm3_kpss <- atm_tsibble %>%
  filter(ATM == "ATM3") %>%
  features(Cash, unitroot_kpss)

atm4_kpss <- atm_tsibble %>%
  filter(ATM == "ATM4") %>%
  features(Cash, unitroot_kpss)

# Display the results
atm1_kpss

## # A tibble: 1 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1      0.300         0.1

atm2_kpss

## # A tibble: 1 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM2       1.96        0.01

atm3_kpss

## # A tibble: 1 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM3      0.389      0.0818

atm4_kpss

## # A tibble: 1 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM4      0.117         0.1

A low p-value (< 0.05) indicates that the series is non-stationary. A high p-value (> 0.05) suggests that the series is stationary.

Based on the KPSS values: ATM1 and ATM 2: Apply box-cox and first-order differencing to make it stationary. ATM4: Able to proceed with ARIMA modeling

There is not enough data for ATM3, so it will be excluded in the forecasting.

# Find lambda for ATM1
lambda_atm1 <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

# Apply Box-Cox transformation to ATM1
atm1_transformed <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  mutate(Cash_transformed = box_cox(Cash, lambda_atm1))

# Find lambda for ATM2
lambda_atm2 <- atm_tsibble %>%
  filter(ATM == "ATM2") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

# Apply Box-Cox transformation to ATM2
atm2_transformed <- atm_tsibble %>%
  filter(ATM == "ATM2") %>%
  mutate(Cash_transformed = box_cox(Cash, lambda_atm2))

# Apply first-order differencing to box-coxed ATM1 and ATM2
atm1_diff <- atm1_transformed %>%
  filter(ATM == "ATM1") %>%
  mutate(Cash_diff = difference(Cash))

atm2_diff <- atm2_transformed %>%
  filter(ATM == "ATM2") %>%
  mutate(Cash_diff = difference(Cash))

# Plot the differenced series for ATM1 and ATM2
p1 <- autoplot(atm1_diff, Cash_diff) + labs(title = "ATM1 Cash Withdrawals (First-Order Differenced Box-cox)", y = "Differenced Cash")
p2 <- autoplot(atm2_diff, Cash_diff) + labs(title = "ATM2 Cash Withdrawals (First-Order Differenced Box-cox)", y = "Differenced Cash")

# combine with patchwork
p1 / p2

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

This data looks stationary and we can move forward with ARIMA modeling

# Fit ARIMA models to ATM1 and ATM2 differenced series with automatic selection
atm1_arima <- atm1_diff %>% model(ARIMA(Cash_diff))
atm2_arima <- atm2_diff %>% model(ARIMA(Cash_diff))

# Fit ARIMA models to ATM3 and ATM4 without differencing, using automatic selection
atm4_arima <- atm_tsibble %>% filter(ATM == "ATM4") %>% model(ARIMA(Cash))

report(atm1_arima)

## Series: Cash_diff 
## Model: ARIMA(4,0,0)(2,1,0)[7] 
## 
## Coefficients:
##           ar1     ar2      ar3      ar4     sar1     sar2
##       -0.7248  -0.565  -0.3000  -0.1394  -0.5182  -0.2530
## s.e.   0.0529   0.063   0.0627   0.0524   0.0521   0.0506
## 
## sigma^2 estimated as 923.1:  log likelihood=-1724.31
## AIC=3462.61   AICc=3462.93   BIC=3489.78

report(atm2_arima)

## Series: Cash_diff 
## Model: ARIMA(3,0,0)(2,1,0)[7] 
## 
## Coefficients:
##           ar1      ar2      ar3     sar1     sar2
##       -0.7442  -0.5687  -0.3386  -0.5940  -0.2019
## s.e.   0.0501   0.0569   0.0496   0.0526   0.0521
## 
## sigma^2 estimated as 790.3:  log likelihood=-1697.27
## AIC=3406.55   AICc=3406.79   BIC=3429.83

report(atm4_arima)

## Series: Cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.1509  0.1056  330.1213
## s.e.  0.0522  0.0525   17.8500
## 
## sigma^2 estimated as 118902:  log likelihood=-2642.55
## AIC=5293.1   AICc=5293.21   BIC=5308.7

# Print all residual plots 
atm1_arima %>% gg_tsresiduals() + labs(title = "Residuals of ARIMA Model (ATM1)")

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

atm2_arima %>% gg_tsresiduals() + labs(title = "Residuals of ARIMA Model (ATM2)")

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

atm4_arima %>% gg_tsresiduals() + labs(title = "Residuals of ARIMA Model (ATM4)")

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing non-finite outside the scale range (`stat_bin()`).

# ATM1 forecast using the fitted model
atm1_forecast <- atm1_arima %>%
  forecast(h = "10 days")

# ATM2 forecast using the fitted model
atm2_forecast <- atm2_arima %>%
  forecast(h = "10 days")

# ATM4 forecast using the fitted model
atm4_forecast <- atm4_arima %>%
  forecast(h = "10 days")

atm1_tsibble <- atm_tsibble %>% filter(ATM == "ATM1")

autoplot(atm1_tsibble, Cash) +
  autolayer(atm1_forecast, series = "Forecast") +
  labs(
    title = "ATM1 Cash Withdrawals Forecast",
    y = "Cash (in hundreds of dollars)"
  )

## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`

## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`

atm2_tsibble <- atm_tsibble %>% filter(ATM == "ATM2")

autoplot(atm2_tsibble, Cash) +
  autolayer(atm2_forecast, series = "Forecast") +
  labs(
    title = "ATM2 Cash Withdrawals Forecast",
    y = "Cash (in hundreds of dollars)"
  )

## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`

## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`

atm4_tsibble <- atm_tsibble %>% filter(ATM == "ATM4")

autoplot(atm4_tsibble, Cash) +
  autolayer(atm4_forecast, series = "Forecast") +
  labs(
    title = "ATM4 Cash Withdrawals Forecast",
    y = "Cash (in hundreds of dollars)"
  )

## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`

## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`

combined_forecasts <- bind_rows(
  atm1_forecast %>% as_tibble() %>% mutate(ATM = "ATM1"),
  atm2_forecast %>% as_tibble() %>% mutate(ATM = "ATM2"),
  atm4_forecast %>% as_tibble() %>% mutate(ATM = "ATM4")
)

# Export the combined forecasts to an Excel file
write.xlsx(combined_forecasts, "ATM_Cash_Forecasts_May_2010.xlsx")

The ARIMA models developed for the ATM cash withdrawal data produced forecasts for May 2010 based on historical patterns observed in the data provided. For ATM1 and ATM2, the models were fit using Box-Cox transformations and first-order differencing to achieve stationarity, as indicated by the KPSS tests. This approach helped stabilize variance and mean, improving the overall model fit. Further exploration of the data for ATM1 and ATM2 may provide additional insights to enhance the accuracy of future forecasts. For ATM4, the model was fit without differencing, as the data appeared stationary after removing outliers. The forecast generated for ATM4 aligns closely with the historical cash withdrawal levels. Overall, the forecasts for all ATMs show reasonable expected values, with ATM1 and ATM2 showing a higher level of uncertainty due to the nature of cash withdrawals, which can fluctuate significantly.

Part B- Residential Customer Forecast

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.

Data Prep

raw_data_path <- ("~/Desktop/Data 624 fall 2024/Data624-Project1/ResidentialCustomerForecast-624.xlsx")

res_cust <- read_excel(raw_data_path)

head(res_cust)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

res_cust <- res_cust %>%
  rename(Year_Month = `YYYY-MMM`)

We can remove the case sequence because it doesn’t provide meaningful time-based information for this specific analysis.

res_tsibble <- res_cust %>%
  mutate(Date = ym(`Year_Month`)) %>%  # Use the exact column name with hyphen
  select(-CaseSequence, -`Year_Month`) %>%  # Drop unnecessary columns
  as_tsibble(index = Date)

res_tsibble <- res_tsibble %>%
  mutate(Date = yearmonth(Date))

head(res_tsibble)

## # A tsibble: 6 x 2 [1M]
##       KWH     Date
##     <dbl>    <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun

Begin Modeling

res_tsibble %>%
  autoplot(KWH) +
  labs(title = "Residential Power Usage (1998-2013)", 
       x = "Year", y = "Kilowatt Hours")

This data is displaying a gap around 2008-2009. In order to proceed with modeling, we have to fill the gaps. Upon further inspection, there is just one missing value, in 2008. We can just impute this. I’m not really sure what caused this, so we’re filling in the missing value with the average of the KWH values for August and October 2008.

# Impute the missing value for September 2008 with the average of August and October 2008
res_tsibble_f <- res_tsibble %>%
  mutate(KWH = ifelse(Date == yearmonth("2008 Sep"), 
                      (KWH[Date == yearmonth("2008 Aug")] + KWH[Date == yearmonth("2008 Oct")]) / 2, 
                      KWH))

# Plot the data to double check
res_tsibble_f %>%
  autoplot(KWH) +
  labs(title = "Residential Power Usage (Imputed for September 2008)", 
       x = "Year", y = "Kilowatt Hours")

This is displaying a slight seasonal pattern. To confirm, let’s perform an ADF test.

adf_test <- res_tsibble_f %>%
  features(KWH, unitroot_kpss)

print(adf_test)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.28        0.01

The kpss value is 0.01, indicating non-stationarity in this series. We can proceed with seasonal differencing.

res_seasonal_diff <- res_tsibble_f %>%
  mutate(SD_KWH = difference(KWH, lag = 12))

res_seasonal_diff %>%
  autoplot(SD_KWH) +
  labs(title = "Seasonally Differenced Residential Power Usage (Lag = 12)",
       x = "Year", y = "Differenced Kilowatt Hours")

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

The seasonal differencing seems to suffice with the exception of two spikes. We can proceed with ARIMA modeling and allow the model to apply the seasonal differencing.

# Fit the seasonal ARIMA model on the seasonally differenced series
fit_seasonal_arima <- res_tsibble_f %>%
  model(ARIMA(KWH ~ pdq(0,1,0) + PDQ(0,1,0)))

report(fit_seasonal_arima)

## Series: KWH 
## Model: ARIMA(0,1,0)(0,1,0)[12] 
## 
## sigma^2 estimated as 2.084e+12:  log likelihood=-2792.7
## AIC=5587.4   AICc=5587.42   BIC=5590.59

# Generate forecasts for 2014 (12 months ahead)
forecast_2014 <- fit_seasonal_arima %>%
  forecast(h = "12 months")


forecast_2014 %>%
  autoplot(res_tsibble_f) +
  labs(title = "Seasonal ARIMA Model Forecast for 2014",
       x = "Year", y = "Kilowatt Hours") +
  theme_minimal()

# Convert forecast data to a data frame
forecast_df <- as.data.frame(forecast_2014)

# Write the forecast to an Excel file
write.xlsx(forecast_df, "Res_forecast_2014.xlsx")

The forecast plot from the seasonal ARIMA model provides a projection of residential power usage for 2014, capturing the historical seasonal patterns observed in the data from 1998 to 2013. The model effectively incorporates both regular and seasonal components, resulting in a forecast that follows the typical peaks and troughs of electricity consumption. The shaded areas represent prediction intervals, with the darker region indicating an 80% confidence interval and the lighter region showing a 95% confidence interval.

Data 624 Project 1

Adriana Medina

2024-10-21

Part A- ATM data

Load Libraries

Prepare Data

Begin Modeling

Part B- Residential Customer Forecast

Data Prep

Begin Modeling