library(tidyverse)
library(readxl)
library(fpp3)
library(fable)
library(purrr)
library(zoo)
library(openxlsx)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file
ATM624Data <- read_excel("C:/Users/puddi/Downloads/ATM624Data.xlsx")
# Converting Date from Numeric to YYYY-MM-DD Format
ATM624Data$DATE <- as.Date(ATM624Data$DATE, origin = "1899-12-30")
# Information
summary(ATM624Data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
As seen below, when looking at the ATM data set we can see that there is 19 missing data points in the file. With that being said, 14 of 19 NA’s don’t have an ATM or a Cash so I will be just getting rid of them completely since there isn’t enough data to help support adding data. The other 5 missing NA’s have an ATM class associating with them and since we are doing time series forecasting, for their imputations, we are going with an ARIMA model for interpolation. The other reason why this method was preferred is since we are utilizing ARIMA, it will considered different dependencies and patterns to impute the missing values that are correlated to neighboring data points.
# Finding the Rows that have NA's
ATM624Data[which(is.na(ATM624Data$Cash)), ]
## # A tibble: 19 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
# Removing any NA's that don't have an ATM
ATM624Data <- ATM624Data[!is.na(ATM624Data$ATM), ]
ATM624Data <- ATM624Data %>%
arrange(ATM)
ATM624Data[which(is.na(ATM624Data$Cash)), ]
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
missing_rows <- which(is.na(ATM624Data$Cash))
missing_rows
## [1] 44 47 53 414 420
# Converting ATM Dataframe to Tsibble
atm_ts <- as_tsibble(ATM624Data, index = DATE, key = ATM)
atm_ts <- atm_ts %>%
# Fit ARIMA model to the data containing missing values
model(ARIMA(Cash)) %>%
# Estimate Cash for the missing values.
interpolate(atm_ts)
atm_ts[missing_rows, ]
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## ATM DATE Cash
## <chr> <date> <dbl>
## 1 ATM1 2009-06-13 111.
## 2 ATM1 2009-06-16 124.
## 3 ATM1 2009-06-22 100.
## 4 ATM2 2009-06-18 9.42
## 5 ATM2 2009-06-24 29.4
For the ATMS below we are splitting them up and running a few models on them for forecast, the main ones are Random Walking, Naive, Naive-Drift, ETS and ARIMA. As we can see that for ATM’s 1,2,4 we are going with the ARIMA method since there is a good amount of historic data as well as the models preform the most. For ATM 3, ETS had preformed the best since for the longest amount of time there wasn’t any withdrawals coming from the ATM so using ARIMA would heavily Skewed the amount of Cash needed to be very low.
# In order to Avoid Scientific Notation
options(scipen = 999)
aggregate(Cash ~ ATM, data = atm_ts, summary)
## ATM Cash.Min. Cash.1st Qu. Cash.Median Cash.Mean Cash.3rd Qu.
## 1 ATM1 1.0000000 73.0000000 91.0000000 84.1156368 108.0000000
## 2 ATM2 0.0000000 25.0000000 66.0000000 62.3420282 93.0000000
## 3 ATM3 0.0000000 0.0000000 0.0000000 0.7205479 0.0000000
## 4 ATM4 1.5632595 124.3344146 403.8393360 474.0433449 704.5070416
## Cash.Max.
## 1 180.0000000
## 2 147.0000000
## 3 96.0000000
## 4 10919.7616377
# Time Series Graph by ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "ATM Cash Withdrawals", x = "Date", y = "Cash") +
theme_minimal()
atm_ts %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Cash Withdrawals")
# ATM 1
atm_ts_1 <- atm_ts %>%
filter(ATM == 'ATM1')
atm_ts_1 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_1 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_1) +
labs(title = "ATM 1 Cash Withdrawals Forecast Using Different Predictive Models")
arima_model <- atm_ts_1 %>% model(ARIMA(Cash))
ets_model <- atm_ts_1 %>% model(ETS(Cash))
naive_model <- atm_ts_1 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM1 23.2 14.4 -102. 117. 0.822 0.840 -0.00838
## 2 ets ATM1 23.7 15.0 -107. 122. 0.853 0.859 0.142
## 3 naive ATM1 50.0 37.4 -132. 167. 2.13 1.81 -0.358
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a1 = c("Atm1", sum(forecast_results$.mean) * 100)
# ATM 2
atm_ts_2 <- atm_ts %>%
filter(ATM == 'ATM2')
atm_ts_2 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_2 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_2) +
labs(title = "ATM 2 Cash Withdrawals Forecast Using Different Predictive Models")
arima_model <- atm_ts_2 %>% model(ARIMA(Cash))
ets_model <- atm_ts_2 %>% model(ETS(Cash))
naive_model <- atm_ts_2 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM2 23.8 16.7 -Inf Inf 0.823 0.803 -0.00268
## 2 ets ATM2 24.6 17.2 -Inf Inf 0.847 0.831 0.0119
## 3 naive ATM2 54.4 42.6 -Inf Inf 2.10 1.84 -0.306
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a2 = c("Atm2", sum(forecast_results$.mean) * 100)
# ATM 3
atm_ts_3 <- atm_ts %>%
filter(ATM == 'ATM3')
atm_ts_3 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_3 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_3) +
labs(title = "ATM 3 Cash Withdrawals Forecast Using Different Predictive Models")
arima_model <- atm_ts_3 %>% model(ARIMA(Cash))
ets_model <- atm_ts_3 %>% model(ETS(Cash))
naive_model <- atm_ts_3 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM3 5.03 0.271 34.6 34.6 0.370 0.625 0.0124
## 2 ets ATM3 5.03 0.273 Inf Inf 0.371 0.625 -0.00736
## 3 naive ATM3 5.09 0.310 28.8 40.2 0.423 0.632 -0.149
# Forecast for 30 days
forecast_results <- ets_model %>% forecast(h = 30)
a3 = c("Atm3", sum(forecast_results$.mean) * 100)
The most amount of money that an ATM can hold is $ 200,000 or 2000 100-Dollar Bills according to the resource which means any ATM Withdraws that has over that amount can be taken out as a freak incident or as a error in reporting. Remember since each of the number represent the amount of $ 100’s taken out each day, we can assume that anything over 2000 bills taken in one day to be an error. From a realistic point of view, it is very unlikely for an ATM to be restocked in 1 day let alone multiple times in a singular day. So for the predicting the forecasted values of ATM4, I decided to filter out the outlier and then run the different models.
Resource: link
# ATM 4
atm_ts %>%
filter(ATM == 'ATM4') %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_4 <- atm_ts %>%
filter(ATM == 'ATM4') %>%
filter(Cash <= 2000 )
atm_ts_4 <- atm_ts_4 %>% fill_gaps()
atm_ts_4 %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM 4 Cash Withdrawals(Outlier Removed)")
atm_ts_4 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_4) +
labs(title = "ATM 4 Cash Withdrawals Forecast Using Different Predictive Models")
arima_model <- atm_ts_4 %>% model(ARIMA(Cash))
ets_model <- atm_ts_4 %>% model(ETS(Cash))
naive_model <- atm_ts_4 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM4 344. 284. -535. 567. 0.822 0.760 0.0818
## 2 ets ATM4 NaN NaN NaN NaN NaN NaN NA
## 3 naive ATM4 478. 387. -489. 549. 1.12 1.06 -0.465
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a4 = c("Atm4", sum(forecast_results$.mean) * 100)
df1 <- data.frame(matrix(c(a1, a2, a3, a4), nrow = 4, byrow = TRUE))
colnames(df1) <- c("ATM", "Total Dollars")
print(df1)
## ATM Total Dollars
## 1 Atm1 235330.63538969
## 2 Atm2 177724.633390371
## 3 Atm3 253752.392779698
## 4 Atm4 1280071.35746589
#write.xlsx(df1, file = "Project_1_Data_624_PartA.xlsx", sheetName = "Part A")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Going from the Excel file for the Residential Power Usage, there was 2 things that done for data manipulation. The first thing was converting the date into a month/year index since it was a string when imported. As seen below I converted them into a yearmonth() date format and then converted the dataframe into a tsibble for time series analysis as being shown later.
# Importing Excel to df
ResidentialCustomerForecastLoad_624 <- read_excel("C:/Users/puddi/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
# Formatting Time to Month/Year Index
ResidentialCustomerForecastLoad_624 <- rename(ResidentialCustomerForecastLoad_624, date = `YYYY-MMM`)
power_ts <- ResidentialCustomerForecastLoad_624 %>%
mutate(Month = yearmonth(date))%>%
select(-CaseSequence, -date) %>%
tsibble(index= Month)
The 2nd thing that was done with the data set is data cleaning with handling the missing data. As seen we only have 1 missing data point that is only missing KWH. For this imputation, I decided to go with mean data imputation since with the seasonality and the consistency of the data, it felt the best way to limit any changes to the data.
power_ts[which(is.na(power_ts$KWH)), ]
## # A tsibble: 1 x 2 [1M]
## KWH Month
## <dbl> <mth>
## 1 NA 2008 Sep
# Calculating Mean
mean_KWH <- mean(power_ts$KWH, na.rm = TRUE)
# Impute missing values with Mean
power_ts$KWH[which(is.na(power_ts$KWH))] <- mean_KWH
summary(power_ts$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6314472 6502475 7608792 10655730
power_ts %>%
autoplot(KWH) +
labs(title = "Residential Power Usage in Kilowatt Hours")
power_ts %>%
gg_tsdisplay(KWH, plot_type='partial') +
labs(title = "Before Transformation, Residential Power Usage")
# Calculating Lambda
lambda <- power_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
# Unit Root Test KPSS
power_ts %>%
features(box_cox(KWH,lambda), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# Box-Cox Transformation
power_ts %>%
gg_tsdisplay(difference(box_cox(KWH, lambda)), plot_type = 'partial') +
labs(title = paste("Residential Power Usage with lambda = ", round(lambda, 2)))
As I decided to compare different ARIMA models, we can see that the ARIMA(1,1,1) model consistently exhibits the best performance based on multiple criteria. With the (1,1,1) Model showcasing lower values across the Akaike Information Criterion (AIC), corrected Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC), we can make the inference that the ARIMA(1,1,1) model is a superior fit. This makes sense since we don’t have many parameters, there would be a higher preference in a simpler model since they are less prone to overfitting and have less noise.
# Comparison of different ARIMA models
power_fit <- power_ts %>%
model(arima110 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,0)),
arima120 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,2,0)),
arima210 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,0)),
arima212 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,2)),
arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1)))
# Comparison Results
glance(power_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 5 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima111 1.40 -303. 617. 617. 633.
## 2 arima212 1.46 -307. 624. 624. 640.
## 3 arima110 1.78 -327. 663. 663. 676.
## 4 arima210 1.93 -333. 676. 676. 692.
## 5 arima120 3.63 -393. 793. 793. 806.
When we are forecasting we can see there are some residual outliers as well as on the ACF graph that there are peaks past the bounded area indicating the significant threshold. We can also see that the big outlier indicated on the graph is the one past the -10 on the residual graph. Overall we do see the distribution around the 0 showcasing a normality in the data.
# Using Best Fit ARIMA Model
power_fit %>%
select(arima111) %>%
gg_tsresiduals() +
ggtitle("Residuals for ARIMA(1,1,0)")
# Forecasting 2014 Monthly Utilizing ARIMA(1,1,1)
df3 <- power_ts %>%
model(arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1))) %>%
forecast(h=12)
# Time Series of Forecast
df3 %>%
autoplot() +
labs(title="Residential Power Usage Forecast for 2014 Monthly")
#write.xlsx(df3, file = "Project_1_Data_624_PartB.xlsx", sheetName = "Part B", append = TRUE)
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Waterflow_Pipe1 <- read_excel("C:/Users/puddi/Downloads/Waterflow_Pipe1.xlsx")
Waterflow_Pipe2 <- read_excel("C:/Users/puddi/Downloads/Waterflow_Pipe2.xlsx")