ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
Load data and cleanup
suppressWarnings({
atm_data <- read_excel("ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
mutate(DATE = as_date(DATE)) %>%
as_tsibble(index = DATE, key = ATM) %>%
filter(DATE < "2010-05-01")
head(atm_data)
})
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
Data Analysis
str(atm_data)
## tbl_ts [1,460 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ DATE: Date[1:1460], format: "2009-05-01" "2009-05-02" ...
## $ ATM : chr [1:1460] "ATM1" "ATM1" "ATM1" "ATM1" ...
## $ Cash: num [1:1460] 96 82 85 90 99 88 8 104 87 93 ...
## - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ ATM : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
## ..$ .rows: list<int> [1:4]
## .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:365] 366 367 368 369 370 371 372 373 374 375 ...
## .. ..$ : int [1:365] 731 732 733 734 735 736 737 738 739 740 ...
## .. ..$ : int [1:365] 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
## - attr(*, "index")= chr "DATE"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "DATE"
## - attr(*, "interval")= interval [1:1] 1D
## ..@ .regular: logi TRUE
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
Handling missing values
missing_ATMs <- atm_data %>%
filter(is.na(Cash)) %>%
select(DATE, ATM)
missing_ATMs
## # A tsibble: 5 x 2 [1D]
## # Key: ATM [2]
## DATE ATM
## <date> <chr>
## 1 2009-06-13 ATM1
## 2 2009-06-16 ATM1
## 3 2009-06-22 ATM1
## 4 2009-06-18 ATM2
## 5 2009-06-24 ATM2
# Filter data for ATM1 and ATM2
ATM1_data <- atm_data %>% filter(ATM == "ATM1")
ATM2_data <- atm_data %>% filter(ATM == "ATM2")
# Fit ARIMA model to ATM1
ATM1_arima <- auto.arima(ATM1_data$Cash, seasonal = TRUE)
# Forecast missing values for ATM1 using the fitted ARIMA model
ATM1_data$Cash_filled <- ifelse(is.na(ATM1_data$Cash),
forecast(ATM1_arima, h = sum(is.na(ATM1_data$Cash)))$mean,
ATM1_data$Cash)
# Fit ARIMA model to ATM2
ATM2_arima <- auto.arima(ATM2_data$Cash, seasonal = TRUE)
# Forecast missing values for ATM2 using the fitted ARIMA model
ATM2_data$Cash_filled <- ifelse(is.na(ATM2_data$Cash),
forecast(ATM2_arima, h = sum(is.na(ATM2_data$Cash)))$mean,
ATM2_data$Cash)
ATM_filled <- atm_data %>%
filter(!(ATM %in% c("ATM1", "ATM2"))) %>%
bind_rows(ATM1_data %>% select(ATM, Cash = Cash_filled),
ATM2_data %>% select(ATM, Cash = Cash_filled))
# Checking if it worked
final_missing <- sum(is.na(ATM_filled$Cash))
cat("Final combined data - Missing values:", final_missing, "\n")
## Final combined data - Missing values: 0
According to Section 13.9 of Hyndman and Athanasopoulos’s “Forecasting: Principles and Practice,” ARIMA models are effective for filling gaps in time series data. So, I used ARIMA models to impute the missing values in the cash series for ATM1 and ATM2. This produced complete datasets for both machines and improves the overall accuracy of the analysis.
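For reference, the fpp3 framework also offers a more compact route to the same idea: fit the model with fable’s ARIMA() and let interpolate() replace the NAs with model-based estimates. A minimal sketch (the names atm12 and atm12_interp are illustrative), assuming the fable/fabletools packages are loaded:
# Sketch: interpolate() fills the NA positions with estimates from the fitted model
atm12 <- atm_data %>% filter(ATM %in% c("ATM1", "ATM2"))
atm12_interp <- atm12 %>%
model(ARIMA(Cash)) %>%
interpolate(atm12)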
head(ATM_filled)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
Data Exploration
ATM_filled %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = " ATM Withdrawals")
It looks like ATM4 has outliers that we should address. We use the interquartile range (IQR) method, which flags as outliers any points that fall below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.
ATM_4 <- ATM_filled %>%
filter(ATM == "ATM4")
Q1 <- quantile(ATM_4$Cash, 0.25, na.rm = TRUE)
Q3 <- quantile(ATM_4$Cash, 0.75, na.rm = TRUE)
IQR_value <- IQR(ATM_4$Cash, na.rm = TRUE)
# Determine the outlier thresholds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Identify outliers
outliers <- ATM_4 %>%
filter(Cash < lower_bound | Cash > upper_bound)
# Display outliers
print(outliers)
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-09-22 ATM4 1712.
## 2 2010-01-29 ATM4 1575.
## 3 2010-02-09 ATM4 10920.
The value 10919.762 is much greater than the others, so we use tsoutliers() to replace this outlier in ATM4.
ATM4_ts <- ts(ATM_4$Cash, frequency = 365, start = c(2009, 1)) # Adjust start as necessary
# Identify outliers using tsoutliers
outlier_results <- tsoutliers(ATM4_ts)
# Print outlier information
print(outlier_results)
## $index
## [1] 285
##
## $replacements
## [1] 230.1752
ATM_4$Cash[outlier_results$index] <- outlier_results$replacements
print(ATM_4$Cash[285])
## [1] 230.1752
autoplot(ATM_4) +
labs(title="ATM4", subtitle="Cash withdrawals daily", y="Hundreds USD")
## Plot variable not specified, automatically selected `.vars = Cash`
ATM_filled <- ATM_filled %>%
rename(date=DATE, atm=ATM, cash=Cash) %>%
mutate(date = as.Date(date)) %>%
pivot_wider(names_from=atm, values_from = cash)
ATM_1 <- ATM_filled %>%
select(date, ATM1)
ATM_1%>%
autoplot(ATM1) +
ggtitle("Original ATM1")
# Create the second plot (STL Decomposition)
ATM_1 %>%
model(STL(ATM1 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
The first plot (the original, untransformed series) shows the raw data over time, revealing substantial variability and pronounced weekly seasonality; the clear peaks and troughs indicate periodic fluctuations in cash withdrawals. The STL decomposition uncovers three main insights. The trend component indicates a general increase in cash withdrawals starting in mid-2009, reaching a peak before declining slightly in early 2010.
The seasonal component captures a stable weekly pattern, with consistent fluctuations that recur each week, reflecting predictable withdrawal behavior tied to the weekly cycle. Finally, the remainder component shows short-term irregularities (noise), with notable spikes around early 2010 suggesting unexpected events affecting withdrawals during that period.
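The strength of these components can also be quantified rather than judged by eye; a short sketch using feasts’ feat_stl feature set (values near 1 indicate a strong component; exact column names, e.g. seasonal_strength_week, depend on the seasonal periods detected):
# Trend and seasonal strength measures from an STL-based feature set
ATM_1 %>%
features(ATM1, feat_stl)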
ATM_1 %>%
ACF(ATM1, lag_max = 28) %>%
autoplot()
The ACF plot shows significant spikes at lags 2, 5, and 7, with lag 7 being the strongest. The drop-off in the ACF could mean the data are non-stationary and may need differencing.
ndiffs(ATM_1$ATM1)
## [1] 0
No differencing is required.
Rationale for choosing the following models: The additive ETS model is appropriate because the seasonality is additive, as indicated by the STL seasonal component, and there is no distinct trend present. The SNAIVE model is a sensible choice due to the strong and consistent weekly seasonality observed in the data.
Lastly, the ARIMA model is included because, although the primary focus is on seasonality, it may also account for underlying autocorrelations in the remainder component.
lambda <- ATM_1 %>%
features(ATM1, features = guerrero) %>%
pull(lambda_guerrero)
atm1_fit <- ATM_1 %>%
model(
additive = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(ATM1),
ARIMA = ARIMA(box_cox(ATM1, lambda))
)
forecast_ATM1 <- atm1_fit %>%
forecast(h = 30)
# Plot the forecasts
forecast_ATM1 %>%
autoplot(ATM_1, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM1 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
accuracy(atm1_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 3 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 23.8 15.1 -106. 121.
## 2 snaive 27.9 17.8 -96.6 116.
## 3 ARIMA 25.1 16.2 -87.8 107.
Based on the RMSE (23.81) and MAE (15.11), the additive ETS model appears to be the best overall fit for our data, as it has the lowest values for both metrics.
Forecasting
ATM1_bestfit <- ATM_1 %>%
model(
ETS = ETS(ATM1))
forecast_ATM1 <- ATM1_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ETS')
forecast_ATM1 %>%
autoplot(ATM_1) +
labs(title = "Forecast for ATM1 using ETS",
y = "Cash Values in Hundreds")
ATM1_bestfit%>%
gg_tsresiduals() +
labs(title="Residuals for ETS")
ATM2
ATM_2 <- ATM_filled %>%
select(date, ATM2)
ATM_2 %>%
autoplot(ATM2) +
ggtitle("Original ATM2")
# STL decomposition
ATM_2 %>%
model(STL(ATM2 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
ATM_2 %>%
ACF(ATM2, lag_max = 28) %>%
autoplot()
We see again that the data are non-stationary and could potentially benefit from differencing. We can check whether that is the case:
ndiffs(ATM_2$ATM2)
## [1] 1
ATM_2_diff <- ATM_2 %>%
mutate(ATM2_diff = ATM2 - lag(ATM2, 1)) %>%
filter(!is.na(ATM2_diff))
# Perform KPSS test
kpss_test <- kpss.test(ATM_2_diff$ATM2_diff)
## Warning in kpss.test(ATM_2_diff$ATM2_diff): p-value greater than printed
## p-value
print(kpss_test)
##
## KPSS Test for Level Stationarity
##
## data: ATM_2_diff$ATM2_diff
## KPSS Level = 0.014714, Truncation lag parameter = 5, p-value = 0.1
mean_value <- mean(ATM_2_diff$ATM2_diff[ATM_2_diff$ATM2_diff >= 0], na.rm = TRUE)
ATM_2_diff$ATM2_diff[ATM_2_diff$ATM2_diff < 0] <- mean_value
Since the p-value (0.1) exceeds the typical significance level (0.05), we do not reject the KPSS null hypothesis of stationarity. This indicates that the differenced series is stationary.
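The same test is available inside the feature framework used elsewhere in this report; a sketch with feasts’ unitroot_kpss (same null hypothesis of stationarity):
# KPSS statistic and p-value via features(); should agree with kpss.test() above
ATM_2_diff %>%
features(ATM2_diff, unitroot_kpss)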
Additionally, while examining ATM_2_diff, I noticed a significant number of negative values. This occurred because the differencing operation calculates the change in withdrawals between consecutive periods. If ATM2 declines from one period to the next, ATM2_diff will be negative.
For instance, if withdrawals were $100 one day and dropped to $80 the next, the difference would be -$20. To address this issue, I opted to impute the negative values by replacing them with the mean of the non-negative differences.
# Calculate lambda for Box-Cox transformation on the differenced data
lambda <- ATM_2_diff %>%
features(ATM2_diff, features = guerrero) %>%
pull(lambda_guerrero)
# Fit the models using the differenced data
atm2_fit <- ATM_2_diff %>%
model(
additive = ETS(ATM2_diff ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(ATM2_diff),
ARIMA = ARIMA(box_cox(ATM2_diff, lambda))
)
# Forecast for the next 30 periods
forecast_ATM2 <- atm2_fit %>%
forecast(h = 30)
# Plot the forecasts
forecast_ATM2 %>%
autoplot(ATM_2_diff, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM2 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
# Get stats for the models
model_stats <- left_join(
glance(atm2_fit) %>% select(.model:BIC),
accuracy(atm2_fit) %>% select(.model, RMSE)
) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
# Display accuracy metrics
accuracy(atm2_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 3 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 20.8 13.9 -Inf Inf
## 2 snaive 27.6 15.9 -Inf Inf
## 3 ARIMA 22.1 13.6 -Inf Inf
# Display the model statistics
model_stats
## # A tibble: 3 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 444. -2178. 4376. 4377. 4415. 20.8
## 2 ARIMA 1622. -1862. 3732. 3733. 3748. 22.1
## 3 snaive 765. NA NA NA NA 27.6
The ARIMA model appears to be the best fit for the differenced ATM2 data: it has the lowest AIC, BIC, and MAE, and an RMSE close to that of the additive ETS model.
Forecast
ATM2_bestfit <- ATM_2_diff %>%
model(
ARIMA = ARIMA(ATM2_diff))
ATM2_bestfit %>% select(.model = "ARIMA") %>% report()
## Series: ATM2_diff
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## sar1 sar2 constant
## 0.3051 0.3525 17.6802
## s.e. 0.0489 0.0502 1.1016
##
## sigma^2 estimated as 495.1: log likelihood=-1646.04
## AIC=3300.09 AICc=3300.2 BIC=3315.68
forecast_ATM2 <- ATM2_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forecast plot
forecast_ATM2 %>%
autoplot(ATM_2_diff) +
ggtitle(TeX("ATM 2 Forecast with $ARIMA(0,0,0)(2,0,0)_7$"))
# residuals
ATM2_bestfit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX("Residuals for $ARIMA(0,0,0)(2,0,0)_7$"))
ATM_filled %>%
autoplot(ATM3) +
ggtitle("Original ATM3")
Since ATM3 has only three non-zero observations (every other day records zero withdrawals), it is not feasible to forecast the month of May using traditional methods, as there is insufficient data to discern trends or seasonal patterns.
In this situation, calculating the mean of these three days serves as an effective way to estimate cash withdrawals for the month. This approach provides a simple and practical solution, representing the average withdrawal behavior observed during those recorded days.
ATM3_fit <- ATM_filled %>%
filter(ATM3 != 0) %>%
model(MEAN(ATM3))
Forecast
forecast_atm3 <- ATM3_fit %>%
forecast(h = 31)
# forecast plot
forecast_atm3 %>%
autoplot(ATM_filled) +
ggtitle("ATM 3 Forecast with the MEAN() Model")
ATM4
ATM_4%>%
autoplot() +
labs(title = "Original ATM4")
## Plot variable not specified, automatically selected `.vars = Cash`
ATM_4%>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4 ")
ATM_4 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
The ACF plot displays a significant spike at lag 7, along with a smaller spike at lag 28. Following lag 7, the ACF shows a gradual decline, with another increase around lag 28.
This slow decline, coupled with the rise at lag 28, indicates that the series may exhibit some autocorrelation over time, potentially signaling non-stationarity. Therefore, it would be appropriate to consider a test for differencing.
ndiffs(ATM_4$Cash)
## [1] 0
No differencing is needed. I will evaluate the following models: ARIMA, SNAIVE, Additive ETS, and Multiplicative ETS (transformed). ARIMA is included because it can accommodate both trend and seasonal patterns, and the Box-Cox transformation helps stabilize variance while capturing autocorrelation.
The Additive ETS and Multiplicative ETS (transformed) models are also considered since the transformation aids in variance stabilization, and ETS is particularly effective for modeling seasonality.
lambda <- ATM_4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_fit <- ATM_4 %>%
model(
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
snaive = SNAIVE(Cash),
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
forecast_ATM4 <- atm4_fit %>%
forecast(h = 30)
forecast_ATM4 %>%
autoplot(ATM_4, level = NULL) +
facet_wrap(~ .model, scales = "free", ncol = 2) + # Adjust ncol to control number of columns
guides(colour = guide_legend(title = "Forecast")) +
labs(title = "ATM4 Forecasts") +
xlab("Date") +
ylab("Cash values in Hundreds")
# stats for the models
left_join(glance(atm4_fit) %>% select(.model:BIC),
accuracy(atm4_fit) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 4 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 110850. -3192. 6404. 6405. 6443. 329.
## 2 multiplicative_ts 0.219 -2003. 4026. 4027. 4065. 343.
## 3 ARIMA 176. -1461. 2931. 2931. 2951. 352.
## 4 snaive 205805. NA NA NA NA 453.
accuracy(atm4_fit) %>%
select(.model, RMSE:MAPE)
## # A tibble: 4 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 additive 329. 265. -515. 545.
## 2 snaive 453. 346. -392. 447.
## 3 multiplicative_ts 343. 262. -374. 417.
## 4 ARIMA 352. 274. -368. 413.
ARIMA appears to be the best overall model when considering both model fit (AIC, BIC) and error metrics like RMSE and MAPE. It has the lowest AIC/BIC, indicating it captures the underlying structure with minimal complexity.
ATM4_bestfit <- ATM_4 %>%
model(
ARIMA = ARIMA(Cash))
ATM4_bestfit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.8778 -0.6769 0.1700 0.9634 0.8065 0.2484 795.0206
## s.e. 0.1004 0.1186 0.0564 0.0906 0.1079 0.0556 48.8937
##
## sigma^2 estimated as 117528: log likelihood=-2645.21
## AIC=5306.42 AICc=5306.82 BIC=5337.62
forecast_ATM4 <- ATM4_bestfit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forecast plot
forecast_ATM4 %>%
autoplot(ATM_4) +
ggtitle(TeX("ATM 4 Forecast with $ARIMA(3,0,2)(1,0,0)_7$"))
# residuals
ATM4_bestfit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX("Residuals for $ARIMA(3,0,2)(1,0,0)_7$"))
forecast_ATM1 <- forecast_ATM1 %>%
rename(atm1 = .mean)
forecast_ATM2 <- forecast_ATM2 %>%
rename(atm2 = .mean)
forecast_atm3 <- forecast_atm3 %>%
rename(atm3 = .mean)
forecast_ATM4 <- forecast_ATM4 %>%
rename(atm4 = .mean)
combined_forecasts <- bind_cols(forecast_ATM1, forecast_ATM2, forecast_atm3, forecast_ATM4)
## New names:
## • `.model` -> `.model...1`
## • `date` -> `date...2`
## • `.model` -> `.model...5`
## • `date` -> `date...6`
## • `.model` -> `.model...9`
## • `date` -> `date...10`
## • `.model` -> `.model...14`
combined_forecasts <- combined_forecasts %>%
rename(date = date...2)
final_forecast <- combined_forecasts %>%
select(date, atm1, atm2, atm3, atm4)
final_forecast %>% write.csv("final_forecasts.csv")
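The CSV above is Excel-readable, which satisfies the deliverable; if a native .xlsx file is preferred, one option is the writexl package (a sketch, assuming writexl is installed):
# Optional: export the same forecasts to an Excel workbook
library(writexl)
write_xlsx(as.data.frame(final_forecast), "final_forecasts.xlsx")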
Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
res<-read.csv("https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv")
head(res)
## CaseSequence YYYY.MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
res_ts <- res %>%
rename(Date = 'YYYY.MMM') %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date)
head(res_ts)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH
## <int> <mth> <int>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
We renamed the ‘YYYY.MMM’ column to ‘Date’ for clarity, converted it to a proper year-month format so that R treats it as dates, and converted the data into a tsibble.
res_ts %>%
autoplot(KWH)
From the visualization above we can clearly see a possible outlier. We’ll move on to investigating the data for missing values and outliers.
Data Exploration
summary(res_ts$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
There’s one missing value in our data. As before, we’ll fit an ARIMA model to handle the missing observation.
missing_KWH<- res_ts %>% filter(is.na(KWH)) %>% select(Date, KWH)
missing_KWH
## # A tsibble: 1 x 2 [1M]
## Date KWH
## <mth> <int>
## 1 2008 Sep NA
res_arima <- auto.arima(res_ts$KWH, seasonal = TRUE)
# Count the number of missing values in 'KWH'
num_missing <- sum(is.na(res_ts$KWH))
# If there are missing values, forecast them using the fitted ARIMA model
if (num_missing > 0) {
forecast_values <- forecast(res_arima, h = num_missing)$mean
# Create a new column 'KWH_filled' by filling NA with forecast values
res_filled <- res_ts %>%
mutate(KWH_filled = ifelse(is.na(KWH), forecast_values, KWH))
} else {
res_filled <- res_ts %>%
mutate(KWH_filled = KWH)
}
res_filled <- res_filled %>%
select(CaseSequence,Date, KWH_filled)
summary(res_filled$KWH_filled) ## Check if the missing value is handled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6314472 6516367 7649733 10655730
Outliers
We can manually identify values that deviate significantly from the rest of the series. We’ll treat as outliers any values that lie more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
Q1 <- quantile(res_filled$KWH_filled, 0.25)
Q3 <- quantile(res_filled$KWH_filled, 0.75)
IQR <- Q3 - Q1
# Define upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
potential_outliers <- res_filled$KWH_filled[res_filled$KWH_filled < lower_bound | res_filled$KWH_filled > upper_bound]
print(potential_outliers)
## [1] 770523
KWH_filled_with_na <- res_filled$KWH_filled
KWH_filled_with_na[KWH_filled_with_na < lower_bound | KWH_filled_with_na > upper_bound] <- NA
# Step 2: Fit an ARIMA model (including NA values)
arima_model <- auto.arima(KWH_filled_with_na)
# Step 3: Forecast missing values
forecasts <- forecast(arima_model, h = sum(is.na(KWH_filled_with_na)))
# Impute the NA values with the forecasted values
KWH_imputed <- KWH_filled_with_na
KWH_imputed[is.na(KWH_imputed)] <- forecasts$mean
res_filled$KWH_filled <- KWH_imputed
head(res_filled)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH_filled
## <int> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
In the previous steps, we began by identifying outliers in the KWH_filled values by comparing them to established lower and upper bounds, marking any outlier values as NA. We then fitted an ARIMA model to the KWH_filled data, including the NA entries, to capture the underlying patterns in the time series.
Using this model, we forecasted the missing values and replaced the NA entries with the predicted values from the ARIMA forecasts.
Finally, we updated the original res_filled tsibble with the imputed KWH_filled values and printed it to verify the changes. Next, we will examine the plot.
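As a cross-check on this two-step approach (flag outliers, then impute with ARIMA forecasts), the forecast package’s tsclean() performs outlier replacement and missing-value interpolation in a single call; a sketch on a monthly ts object built from the raw series:
# One-step alternative: tsclean() replaces outliers and fills NAs
kwh_ts <- ts(res_ts$KWH, start = c(1998, 1), frequency = 12)
kwh_clean <- tsclean(kwh_ts)
summary(kwh_clean)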
res_filled <- res_filled %>%
mutate(KWH_filled = KWH_filled / 1000)
head(res_filled)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Date KWH_filled
## <int> <mth> <dbl>
## 1 733 1998 Jan 6863.
## 2 734 1998 Feb 5838.
## 3 735 1998 Mar 5421.
## 4 736 1998 Apr 5010.
## 5 737 1998 May 4665.
## 6 738 1998 Jun 6467.
This conversion from kilowatt hours to thousands of kilowatt hours makes the data easier to interpret, especially when dealing with larger quantities of energy.
res_filled%>%
autoplot(KWH_filled) +
ggtitle("Residential Power Usage cleaned plot")
res_filled %>%
model(STL(KWH_filled ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition cleaned plot")
The data reveals a distinct upward trend, although it contributes less variation than other factors. Additionally, there is an annual pattern, with power consumption reaching its highest levels in the summer and winter months. Notably, random fluctuations include a significant spike in December 2013. A seasonal plot reinforces the observed rise in power usage during peak months, while the off-peak months likely vary each year due to differences in weather conditions.
res_filled$KWH_filled <- as.numeric(res_filled$KWH_filled)
# Create the time series object
plot_season <- ts(res_filled$KWH_filled, start = c(1998, 1), frequency = 12)
ggseasonplot(plot_season) +
ggtitle("Residential Power Usage Seasonally by Year")
The chart above illustrates the seasonal trend in residential power consumption from 1998 to 2013, indicating peaks during the summer months (June to August) and noticeable declines in spring (March to May) and fall (September to November). This pattern suggests that power usage is higher in warmer months, likely driven by increased air conditioning needs.
power_lambda <- res_filled %>%
features(KWH_filled, features = guerrero) %>%
pull(lambda_guerrero)
res_filled <- res_filled %>%
mutate(KWH_bc = box_cox(KWH_filled, power_lambda))
summary(res_filled$KWH_bc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.040 3.058 3.069 3.069 3.082 3.103
res_filled %>%
autoplot(KWH_bc) +
labs(y = "KWH in Thousands")
We initially employed the Guerrero method to determine the optimal Box-Cox transformation parameter (lambda) for the KWH_filled data, aiming to stabilize its variance. Next, we applied the Box-Cox transformation to the KWH_filled values, resulting in a new variable (KWH_bc). Finally, we summarized the transformed data and generated a plot to visualize the transformed values (KWH_bc). This approach enhances the normality and homoscedasticity of the dataset.
res_filled %>%
ACF(KWH_bc, lag_max = 36) %>%
autoplot()
The ACF plot reveals significant autocorrelations at lags that are multiples of 12, highlighting strong seasonal patterns in the data. The gradual decrease in the ACF indicates that the series may be non-stationary, while the PACF (not shown) displays a sharp decline after lag 1, further suggesting potential non-stationarity.
Given these observations, it may be necessary to apply differencing, especially seasonal differencing, to achieve stationarity prior to modeling.
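Whether a seasonal difference is also warranted can be checked directly; a short sketch using feasts’ unitroot_nsdiffs feature (assuming feasts is loaded, as it already provides the guerrero feature used above; nsdiffs() from the forecast package is an equivalent check):
# Suggested number of seasonal differences for the Box-Cox transformed series
res_filled %>%
features(KWH_bc, unitroot_nsdiffs)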
ndiffs(res_filled$KWH_bc)
## [1] 1
The output of 1 indicates that we need to difference our KWH_bc data once to achieve stationarity.
diff_res <- res_filled %>%
mutate(diff_KWH= difference(KWH_filled), diff_KWH_bc = difference(KWH_bc)) %>%
select(-KWH_filled, -KWH_bc) %>%
slice(-1)
print(kpss.test(diff_res$diff_KWH_bc))
## Warning in kpss.test(diff_res$diff_KWH_bc): p-value greater than printed
## p-value
##
## KPSS Test for Level Stationarity
##
## data: diff_res$diff_KWH_bc
## KPSS Level = 0.039109, Truncation lag parameter = 4, p-value = 0.1
ndiffs(diff_res$diff_KWH_bc)
## [1] 0
We can conclude that the differenced Box-Cox transformed data (diff_KWH_bc) appears to be stationary.
Modeling
diff_train <- diff_res %>%
filter(year(Date) < 2013)
diff_fit <- diff_train %>%
model(
ARIMA = ARIMA(diff_KWH),
`Auto ARIMA` = ARIMA(diff_KWH, stepwise = FALSE, approx = FALSE)
)
diff_fc <- diff_fit %>%
forecast(h = "1 year")
#plot
diff_fc %>%
autoplot(diff_res, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts")+
xlab("Date") +
ylab("KWH in Thousands")
In the code above, we first filtered the differenced data (diff_res) to retain only the observations prior to 2013, then fitted both ARIMA and Auto ARIMA models to this dataset. Next, we generated forecasts for the year 2013 using these models. Finally, we created a plot to visualize the forecasts from each model, illustrating their predictions for KWH usage over time.
train <- res_filled %>%
filter(year(Date) < 2013)
#models
fit <- train %>%
model(
ETS = ETS(KWH_filled),
`Additive ETS` = ETS(KWH_filled ~ error("A") + trend("A") + season("A")),
SNAIVE = SNAIVE(KWH_filled)
)
#forecast of 2013
fc <- fit %>%
forecast(h = "1 year")
#plot
fc %>%
autoplot(res_filled, level = NULL)+
facet_wrap( ~ .model, scales = "free_y",ncol=2) +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts",
subtitle = "January 2013 - December 2013")+
xlab("Month") +
ylab("KWH in Thousands")
In the first analysis, we used differencing to make the time series data stationary, which is important for ARIMA models to accurately capture the underlying patterns by removing trends and seasonality. This was done with the differenced data (diff_res). In the second analysis, we used the original filled data (res_filled) for ETS and SNAIVE models, which can directly handle trends and seasonality without needing to difference the data. By comparing both approaches, we can see how ARIMA focuses on changes in the data while ETS and SNAIVE use the original data’s structure. This helps us identify the best forecasting method based on model performance.
accuracy(diff_fc, diff_res) %>%
select(.model, RMSE:MAE)
## # A tibble: 2 × 3
## .model RMSE MAE
## <chr> <dbl> <dbl>
## 1 ARIMA 1252. 904.
## 2 Auto ARIMA 1187. 823.
accuracy(fc, res_filled) %>%
select(.model, RMSE:MAE)
## # A tibble: 3 × 3
## .model RMSE MAE
## <chr> <dbl> <dbl>
## 1 Additive ETS 1047. 647.
## 2 ETS 1011. 681.
## 3 SNAIVE 1036. 619.
ETS shows the lowest RMSE (1011.443) and a competitive MAE (680.7843), indicating it’s the best-performing model among those listed.
Results
ETS_fit <- res_filled %>%
model(ETS = ETS(KWH_filled))
ETS_forecast <- ETS_fit %>%
forecast(h=12)
ETS_forecast %>%
autoplot(res_filled) +
labs(title = "Residential Power Usage Forecast using ETS",
y = "KWH in Thousands")
The generated forecast seems reliable, as it effectively reflects the seasonal patterns observed, indicating higher demand during the summer and winter months. It also accurately falls between the peaks and valleys recorded in the more recent years.
final_forec <- ETS_forecast %>%
rename(KWH_2014 = .mean) %>%
select(Date, KWH_2014)
final_forec %>% write.csv("KWH_forecasts.csv")