In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward.
file_path <- ("~/Desktop/Data 624 fall 2024/Data624-Project1/ATM624Data.xlsx")
atm_data <- read_excel(file_path)The DATE column is represented in Excel serial date format, which counts days from January 1, 1900. So in order to convert it to a YY-MM-DD format we have add the number of days to 1899-12-30.
atm_data <- atm_data %>%
mutate(
Date = as.Date(DATE, origin = "1899-12-30"), # Convert Excel serial date format to Date
Cash = as.numeric(Cash) # Ensure Cash column is numeric
)
# Create a tsibble for time series analysis
atm_tsibble <- atm_data %>%
as_tsibble(index = Date, key = ATM)
glimpse(atm_tsibble)## Rows: 1,474
## Columns: 4
## Key: ATM [5]
## $ DATE <dbl> 39934, 39935, 39936, 39937, 39938, 39939, 39940, 39941, 39942, 39…
## $ ATM <chr> "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "…
## $ Cash <dbl> 96, 82, 85, 90, 99, 88, 8, 104, 87, 93, 86, 111, 75, 6, 102, 73, …
## $ Date <date> 2009-05-01, 2009-05-02, 2009-05-03, 2009-05-04, 2009-05-05, 2009…
I noticed there are NA values for Cash and ATM. Since we’re working with ATMs, I’m going to assume the ATMs were out of service that day and I’m going to impute the missing values with zeros for each ATM. The zeros will represent when no withdrawals occurred. As for the NA values for ATM, it would be best to remove them.
# Autoplot the time series for all ATMs
atm_tsibble %>%
autoplot(Cash) +
labs(
title = "Cash Withdrawals for All ATMs (Original Series)",
y = "Cash (in hundreds of dollars)"
) +
facet_wrap(~ ATM, scales = "free_y") For ATM1 and ATM2: The data appears consistent, showing regular fluctuations around similar ranges. There seems to be a normal pattern of daily cash withdrawals.
For ATM3: Most of the series shows very low or no withdrawals, except for a sudden spike toward the end. This ATM could have been out of commission until 2010. There is not enough data for atm 3 to forecast.
For ATM4: There’s a significant outlier with withdrawals exceeding 9000. This spike may need to be handled separately. I don’t see how an ATM could carry this much money.
The outlier in atm 4 can be removed
# Remove outliers from ATM4
atm_tsibble <- atm_tsibble %>%
mutate(Cash = if_else(ATM == "ATM4" & Cash > 5000, NA_real_, Cash))
atm_tsibble <- atm_tsibble %>%
fill_gaps(Cash = 0) # Impute remaining NA values with 0# KPSS test for each ATM
atm1_kpss <- atm_tsibble %>%
filter(ATM == "ATM1") %>%
features(Cash, unitroot_kpss)
atm2_kpss <- atm_tsibble %>%
filter(ATM == "ATM2") %>%
features(Cash, unitroot_kpss)
atm3_kpss <- atm_tsibble %>%
filter(ATM == "ATM3") %>%
features(Cash, unitroot_kpss)
atm4_kpss <- atm_tsibble %>%
filter(ATM == "ATM4") %>%
features(Cash, unitroot_kpss)
# Display the results
atm1_kpss## # A tibble: 1 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM1 0.300 0.1
## # A tibble: 1 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM2 1.96 0.01
## # A tibble: 1 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM3 0.389 0.0818
## # A tibble: 1 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM4 0.117 0.1
A low p-value (< 0.05) indicates that the series is non-stationary. A high p-value (> 0.05) suggests that the series is stationary.
Based on the KPSS values: ATM1 and ATM 2: Apply box-cox and first-order differencing to make it stationary. ATM4: Able to proceed with ARIMA modeling
There is not enough data for ATM3, so it will be excluded in the forecasting.
# Find lambda for ATM1
lambda_atm1 <- atm_tsibble %>%
filter(ATM == "ATM1") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
# Apply Box-Cox transformation to ATM1
atm1_transformed <- atm_tsibble %>%
filter(ATM == "ATM1") %>%
mutate(Cash_transformed = box_cox(Cash, lambda_atm1))
# Find lambda for ATM2
lambda_atm2 <- atm_tsibble %>%
filter(ATM == "ATM2") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
# Apply Box-Cox transformation to ATM2
atm2_transformed <- atm_tsibble %>%
filter(ATM == "ATM2") %>%
mutate(Cash_transformed = box_cox(Cash, lambda_atm2))# Apply first-order differencing to box-coxed ATM1 and ATM2
atm1_diff <- atm1_transformed %>%
filter(ATM == "ATM1") %>%
mutate(Cash_diff = difference(Cash))
atm2_diff <- atm2_transformed %>%
filter(ATM == "ATM2") %>%
mutate(Cash_diff = difference(Cash))
# Plot the differenced series for ATM1 and ATM2
p1 <- autoplot(atm1_diff, Cash_diff) + labs(title = "ATM1 Cash Withdrawals (First-Order Differenced Box-cox)", y = "Differenced Cash")
p2 <- autoplot(atm2_diff, Cash_diff) + labs(title = "ATM2 Cash Withdrawals (First-Order Differenced Box-cox)", y = "Differenced Cash")
# combine with patchwork
p1 / p2## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
This data looks stationary and we can move forward with ARIMA
modeling
# Fit ARIMA models to ATM1 and ATM2 differenced series with automatic selection
atm1_arima <- atm1_diff %>% model(ARIMA(Cash_diff))
atm2_arima <- atm2_diff %>% model(ARIMA(Cash_diff))
# Fit ARIMA models to ATM3 and ATM4 without differencing, using automatic selection
atm4_arima <- atm_tsibble %>% filter(ATM == "ATM4") %>% model(ARIMA(Cash))
report(atm1_arima)## Series: Cash_diff
## Model: ARIMA(4,0,0)(2,1,0)[7]
##
## Coefficients:
## ar1 ar2 ar3 ar4 sar1 sar2
## -0.7248 -0.565 -0.3000 -0.1394 -0.5182 -0.2530
## s.e. 0.0529 0.063 0.0627 0.0524 0.0521 0.0506
##
## sigma^2 estimated as 923.1: log likelihood=-1724.31
## AIC=3462.61 AICc=3462.93 BIC=3489.78
## Series: Cash_diff
## Model: ARIMA(3,0,0)(2,1,0)[7]
##
## Coefficients:
## ar1 ar2 ar3 sar1 sar2
## -0.7442 -0.5687 -0.3386 -0.5940 -0.2019
## s.e. 0.0501 0.0569 0.0496 0.0526 0.0521
##
## sigma^2 estimated as 790.3: log likelihood=-1697.27
## AIC=3406.55 AICc=3406.79 BIC=3429.83
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## sar1 sar2 constant
## 0.1509 0.1056 330.1213
## s.e. 0.0522 0.0525 17.8500
##
## sigma^2 estimated as 118902: log likelihood=-2642.55
## AIC=5293.1 AICc=5293.21 BIC=5308.7
# Print all residual plots
atm1_arima %>% gg_tsresiduals() + labs(title = "Residuals of ARIMA Model (ATM1)")## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing non-finite outside the scale range (`stat_bin()`).
# ATM1 forecast using the fitted model
atm1_forecast <- atm1_arima %>%
forecast(h = "10 days")
# ATM2 forecast using the fitted model
atm2_forecast <- atm2_arima %>%
forecast(h = "10 days")
# ATM4 forecast using the fitted model
atm4_forecast <- atm4_arima %>%
forecast(h = "10 days")atm1_tsibble <- atm_tsibble %>% filter(ATM == "ATM1")
autoplot(atm1_tsibble, Cash) +
autolayer(atm1_forecast, series = "Forecast") +
labs(
title = "ATM1 Cash Withdrawals Forecast",
y = "Cash (in hundreds of dollars)"
)## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
atm2_tsibble <- atm_tsibble %>% filter(ATM == "ATM2")
autoplot(atm2_tsibble, Cash) +
autolayer(atm2_forecast, series = "Forecast") +
labs(
title = "ATM2 Cash Withdrawals Forecast",
y = "Cash (in hundreds of dollars)"
)## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
atm4_tsibble <- atm_tsibble %>% filter(ATM == "ATM4")
autoplot(atm4_tsibble, Cash) +
autolayer(atm4_forecast, series = "Forecast") +
labs(
title = "ATM4 Cash Withdrawals Forecast",
y = "Cash (in hundreds of dollars)"
)## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
combined_forecasts <- bind_rows(
atm1_forecast %>% as_tibble() %>% mutate(ATM = "ATM1"),
atm2_forecast %>% as_tibble() %>% mutate(ATM = "ATM2"),
atm4_forecast %>% as_tibble() %>% mutate(ATM = "ATM4")
)
# Export the combined forecasts to an Excel file
write.xlsx(combined_forecasts, "ATM_Cash_Forecasts_May_2010.xlsx")The ARIMA models developed for the ATM cash withdrawal data produced forecasts for May 2010 based on historical patterns observed in the data provided. For ATM1 and ATM2, the models were fit using Box-Cox transformations and first-order differencing to achieve stationarity, as indicated by the KPSS tests. This approach helped stabilize variance and mean, improving the overall model fit. Further exploration of the data for ATM1 and ATM2 may provide additional insights to enhance the accuracy of future forecasts. For ATM4, the model was fit without differencing, as the data appeared stationary after removing outliers. The forecast generated for ATM4 aligns closely with the historical cash withdrawal levels. Overall, the forecasts for all ATMs show reasonable expected values, with ATM1 and ATM2 showing a higher level of uncertainty due to the nature of cash withdrawals, which can fluctuate significantly.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.
raw_data_path <- ("~/Desktop/Data 624 fall 2024/Data624-Project1/ResidentialCustomerForecast-624.xlsx")
res_cust <- read_excel(raw_data_path)
head(res_cust)## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
We can remove the case sequence because it doesn’t provide meaningful time-based information for this specific analysis.
res_tsibble <- res_cust %>%
mutate(Date = ym(`Year_Month`)) %>% # Use the exact column name with hyphen
select(-CaseSequence, -`Year_Month`) %>% # Drop unnecessary columns
as_tsibble(index = Date)
res_tsibble <- res_tsibble %>%
mutate(Date = yearmonth(Date))
head(res_tsibble)## # A tsibble: 6 x 2 [1M]
## KWH Date
## <dbl> <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun
res_tsibble %>%
autoplot(KWH) +
labs(title = "Residential Power Usage (1998-2013)",
x = "Year", y = "Kilowatt Hours")
This data is displaying a gap around 2008-2009. In order to proceed with
modeling, we have to fill the gaps. Upon further inspection, there is
just one missing value, in 2008. We can just impute this. I’m not really
sure what caused this, so we’re filling in the missing value with the
average of the KWH values for August and October 2008.
# Impute the missing value for September 2008 with the average of August and October 2008
res_tsibble_f <- res_tsibble %>%
mutate(KWH = ifelse(Date == yearmonth("2008 Sep"),
(KWH[Date == yearmonth("2008 Aug")] + KWH[Date == yearmonth("2008 Oct")]) / 2,
KWH))
# Plot the data to double check
res_tsibble_f %>%
autoplot(KWH) +
labs(title = "Residential Power Usage (Imputed for September 2008)",
x = "Year", y = "Kilowatt Hours")This is displaying a slight seasonal pattern. To confirm, let’s perform an ADF test.
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.28 0.01
The kpss value is 0.01, indicating non-stationarity in this series. We can proceed with seasonal differencing.
res_seasonal_diff <- res_tsibble_f %>%
mutate(SD_KWH = difference(KWH, lag = 12))
res_seasonal_diff %>%
autoplot(SD_KWH) +
labs(title = "Seasonally Differenced Residential Power Usage (Lag = 12)",
x = "Year", y = "Differenced Kilowatt Hours")## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
The seasonal differencing seems to suffice with the exception of two
spikes. We can proceed with ARIMA modeling and allow the model to apply
the seasonal differencing.
# Fit the seasonal ARIMA model on the seasonally differenced series
fit_seasonal_arima <- res_tsibble_f %>%
model(ARIMA(KWH ~ pdq(0,1,0) + PDQ(0,1,0)))
report(fit_seasonal_arima)## Series: KWH
## Model: ARIMA(0,1,0)(0,1,0)[12]
##
## sigma^2 estimated as 2.084e+12: log likelihood=-2792.7
## AIC=5587.4 AICc=5587.42 BIC=5590.59
# Generate forecasts for 2014 (12 months ahead)
forecast_2014 <- fit_seasonal_arima %>%
forecast(h = "12 months")
forecast_2014 %>%
autoplot(res_tsibble_f) +
labs(title = "Seasonal ARIMA Model Forecast for 2014",
x = "Year", y = "Kilowatt Hours") +
theme_minimal()# Convert forecast data to a data frame
forecast_df <- as.data.frame(forecast_2014)
# Write the forecast to an Excel file
write.xlsx(forecast_df, "Res_forecast_2014.xlsx")The forecast plot from the seasonal ARIMA model provides a projection of residential power usage for 2014, capturing the historical seasonal patterns observed in the data from 1998 to 2013. The model effectively incorporates both regular and seasonal components, resulting in a forecast that follows the typical peaks and troughs of electricity consumption. The shaded areas represent prediction intervals, with the darker region indicating an 80% confidence interval and the lighter region showing a 95% confidence interval.