Project 1 Data 624

Libraries Used:

library(tidyverse)
library(readxl)
library(fpp3)
library(fable)
library(purrr)
library(zoo)
library(openxlsx)

Part A

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file

ATM624Data <- read_excel("C:/Users/puddi/Downloads/ATM624Data.xlsx")

# Converting Date from Numeric to YYYY-MM-DD Format
ATM624Data$DATE <- as.Date(ATM624Data$DATE, origin = "1899-12-30") 

# Information
summary(ATM624Data)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

Handling Missing Data

As seen below, when looking at the ATM data set we can see that there is 19 missing data points in the file. With that being said, 14 of 19 NA’s don’t have an ATM or a Cash so I will be just getting rid of them completely since there isn’t enough data to help support adding data. The other 5 missing NA’s have an ATM class associating with them and since we are doing time series forecasting, for their imputations, we are going with an ARIMA model for interpolation. The other reason why this method was preferred is since we are utilizing ARIMA, it will considered different dependencies and patterns to impute the missing values that are correlated to neighboring data points.

# Finding the Rows that have NA's
ATM624Data[which(is.na(ATM624Data$Cash)), ]

## # A tibble: 19 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-18 ATM2     NA
##  4 2009-06-22 ATM1     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA

# Removing any NA's that don't have an ATM
ATM624Data <- ATM624Data[!is.na(ATM624Data$ATM), ]

ATM624Data <- ATM624Data %>%
  arrange(ATM)
ATM624Data[which(is.na(ATM624Data$Cash)), ]

## # A tibble: 5 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-22 ATM1     NA
## 4 2009-06-18 ATM2     NA
## 5 2009-06-24 ATM2     NA

missing_rows <- which(is.na(ATM624Data$Cash))
missing_rows

## [1]  44  47  53 414 420

# Converting ATM Dataframe to Tsibble 
atm_ts <- as_tsibble(ATM624Data, index = DATE, key = ATM)

atm_ts <- atm_ts %>%
  # Fit ARIMA model to the data containing missing values
  model(ARIMA(Cash)) %>%
  # Estimate Cash for the missing values.
  interpolate(atm_ts)



atm_ts[missing_rows, ]

## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   ATM   DATE         Cash
##   <chr> <date>      <dbl>
## 1 ATM1  2009-06-13 111.  
## 2 ATM1  2009-06-16 124.  
## 3 ATM1  2009-06-22 100.  
## 4 ATM2  2009-06-18   9.42
## 5 ATM2  2009-06-24  29.4

Time Series Graph

For the ATMS below we are splitting them up and running a few models on them for forecast, the main ones are Random Walking, Naive, Naive-Drift, ETS and ARIMA. As we can see that for ATM’s 1,2,4 we are going with the ARIMA method since there is a good amount of historic data as well as the models preform the most. For ATM 3, ETS had preformed the best since for the longest amount of time there wasn’t any withdrawals coming from the ATM so using ARIMA would heavily Skewed the amount of Cash needed to be very low.

# In order to Avoid Scientific Notation
options(scipen = 999)
aggregate(Cash ~ ATM, data = atm_ts, summary)

##    ATM     Cash.Min.  Cash.1st Qu.   Cash.Median     Cash.Mean  Cash.3rd Qu.
## 1 ATM1     1.0000000    73.0000000    91.0000000    84.1156368   108.0000000
## 2 ATM2     0.0000000    25.0000000    66.0000000    62.3420282    93.0000000
## 3 ATM3     0.0000000     0.0000000     0.0000000     0.7205479     0.0000000
## 4 ATM4     1.5632595   124.3344146   403.8393360   474.0433449   704.5070416
##       Cash.Max.
## 1   180.0000000
## 2   147.0000000
## 3    96.0000000
## 4 10919.7616377

# Time Series Graph by ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Withdrawals", x = "Date", y = "Cash") +
  theme_minimal()

atm_ts %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Cash Withdrawals")

ATM 1

# ATM 1
atm_ts_1 <- atm_ts %>%
  filter(ATM == 'ATM1') 
atm_ts_1 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_1 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_1) +
  labs(title = "ATM 1 Cash Withdrawals Forecast Using Different Predictive Models")

arima_model <- atm_ts_1 %>% model(ARIMA(Cash))
ets_model <- atm_ts_1 %>% model(ETS(Cash))
naive_model <- atm_ts_1 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM1   23.2  14.4 -102.  117. 0.822 0.840 -0.00838
## 2 ets   ATM1   23.7  15.0 -107.  122. 0.853 0.859  0.142  
## 3 naive ATM1   50.0  37.4 -132.  167. 2.13  1.81  -0.358

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a1 = c("Atm1", sum(forecast_results$.mean) * 100)

ATM 2

# ATM 2
atm_ts_2 <- atm_ts %>%
  filter(ATM == 'ATM2') 
atm_ts_2 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_2 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_2) +
  labs(title = "ATM 2 Cash Withdrawals Forecast Using Different Predictive Models")

arima_model <- atm_ts_2 %>% model(ARIMA(Cash))
ets_model <- atm_ts_2 %>% model(ETS(Cash))
naive_model <- atm_ts_2 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM2   23.8  16.7  -Inf   Inf 0.823 0.803 -0.00268
## 2 ets   ATM2   24.6  17.2  -Inf   Inf 0.847 0.831  0.0119 
## 3 naive ATM2   54.4  42.6  -Inf   Inf 2.10  1.84  -0.306

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a2 = c("Atm2", sum(forecast_results$.mean) * 100)

ATM 3

# ATM 3
atm_ts_3 <- atm_ts %>%
  filter(ATM == 'ATM3') 
atm_ts_3 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_3 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_3) +
  labs(title = "ATM 3 Cash Withdrawals Forecast Using Different Predictive Models")

arima_model <- atm_ts_3 %>% model(ARIMA(Cash))
ets_model <- atm_ts_3 %>% model(ETS(Cash))
naive_model <- atm_ts_3 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM3   5.03 0.271  34.6  34.6 0.370 0.625  0.0124 
## 2 ets   ATM3   5.03 0.273 Inf   Inf   0.371 0.625 -0.00736
## 3 naive ATM3   5.09 0.310  28.8  40.2 0.423 0.632 -0.149

# Forecast for 30 days
forecast_results <- ets_model %>% forecast(h = 30)

a3 = c("Atm3", sum(forecast_results$.mean) * 100)

ATM 4

The most amount of money that an ATM can hold is $ 200,000 or 2000 100-Dollar Bills according to the resource which means any ATM Withdraws that has over that amount can be taken out as a freak incident or as a error in reporting. Remember since each of the number represent the amount of $ 100’s taken out each day, we can assume that anything over 2000 bills taken in one day to be an error. From a realistic point of view, it is very unlikely for an ATM to be restocked in 1 day let alone multiple times in a singular day. So for the predicting the forecasted values of ATM4, I decided to filter out the outlier and then run the different models.

Resource: link

# ATM 4
atm_ts %>%
  filter(ATM == 'ATM4') %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_4 <- atm_ts %>%
  filter(ATM == 'ATM4') %>%
  filter(Cash <= 2000 ) 


atm_ts_4 <- atm_ts_4 %>% fill_gaps()

atm_ts_4 %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM 4 Cash Withdrawals(Outlier Removed)")

atm_ts_4 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_4) +
  labs(title = "ATM 4 Cash Withdrawals Forecast Using Different Predictive Models")

arima_model <- atm_ts_4 %>% model(ARIMA(Cash))
ets_model <- atm_ts_4 %>% model(ETS(Cash))
naive_model <- atm_ts_4 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE    MASE   RMSSE    ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 arima ATM4   344.  284. -535.  567.   0.822   0.760  0.0818
## 2 ets   ATM4   NaN   NaN   NaN   NaN  NaN     NaN     NA     
## 3 naive ATM4   478.  387. -489.  549.   1.12    1.06  -0.465

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a4 = c("Atm4", sum(forecast_results$.mean) * 100)

Writing Excel:

df1 <- data.frame(matrix(c(a1, a2, a3, a4), nrow = 4, byrow = TRUE))
colnames(df1) <- c("ATM", "Total Dollars")
print(df1)

##    ATM    Total Dollars
## 1 Atm1  235330.63538969
## 2 Atm2 177724.633390371
## 3 Atm3 253752.392779698
## 4 Atm4 1280071.35746589

#write.xlsx(df1, file = "Project_1_Data_624_PartA.xlsx", sheetName = "Part A")

Part B

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Data Manipulation

Going from the Excel file for the Residential Power Usage, there was 2 things that done for data manipulation. The first thing was converting the date into a month/year index since it was a string when imported. As seen below I converted them into a yearmonth() date format and then converted the dataframe into a tsibble for time series analysis as being shown later.

# Importing Excel to df
ResidentialCustomerForecastLoad_624 <- read_excel("C:/Users/puddi/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
# Formatting Time to Month/Year Index
ResidentialCustomerForecastLoad_624 <- rename(ResidentialCustomerForecastLoad_624, date = `YYYY-MMM`)
power_ts <- ResidentialCustomerForecastLoad_624 %>% 
  mutate(Month = yearmonth(date))%>%
  select(-CaseSequence, -date) %>% 
  tsibble(index= Month)

Missing Data

The 2nd thing that was done with the data set is data cleaning with handling the missing data. As seen we only have 1 missing data point that is only missing KWH. For this imputation, I decided to go with mean data imputation since with the seasonality and the consistency of the data, it felt the best way to limit any changes to the data.

power_ts[which(is.na(power_ts$KWH)), ]

## # A tsibble: 1 x 2 [1M]
##     KWH    Month
##   <dbl>    <mth>
## 1    NA 2008 Sep

# Calculating Mean
mean_KWH <- mean(power_ts$KWH, na.rm = TRUE)

# Impute missing values with Mean
power_ts$KWH[which(is.na(power_ts$KWH))] <- mean_KWH

summary(power_ts$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   770523  5434539  6314472  6502475  7608792 10655730

Time Series Graph

power_ts %>%
  autoplot(KWH) +
  labs(title = "Residential Power Usage in Kilowatt Hours")

power_ts %>%
  gg_tsdisplay(KWH, plot_type='partial') +
  labs(title = "Before Transformation, Residential Power Usage")

# Calculating Lambda
lambda <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)


# Unit Root Test KPSS
power_ts %>%
  features(box_cox(KWH,lambda), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

# Box-Cox Transformation
power_ts %>%
  gg_tsdisplay(difference(box_cox(KWH, lambda)), plot_type = 'partial') +
  labs(title = paste("Residential Power Usage with lambda = ", round(lambda, 2)))

Model Comparison

As I decided to compare different ARIMA models, we can see that the ARIMA(1,1,1) model consistently exhibits the best performance based on multiple criteria. With the (1,1,1) Model showcasing lower values across the Akaike Information Criterion (AIC), corrected Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC), we can make the inference that the ARIMA(1,1,1) model is a superior fit. This makes sense since we don’t have many parameters, there would be a higher preference in a simpler model since they are less prone to overfitting and have less noise.

# Comparison of different ARIMA models
power_fit <- power_ts %>%
  model(arima110 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,0)),
        arima120 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,2,0)),
        arima210 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,0)),
        arima212 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,2)),
        arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1)))

# Comparison Results
glance(power_fit) %>% arrange(AICc) %>% select(.model:BIC)

## # A tibble: 5 × 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima111   1.40   -303.  617.  617.  633.
## 2 arima212   1.46   -307.  624.  624.  640.
## 3 arima110   1.78   -327.  663.  663.  676.
## 4 arima210   1.93   -333.  676.  676.  692.
## 5 arima120   3.63   -393.  793.  793.  806.

Forecasting 2014

When we are forecasting we can see there are some residual outliers as well as on the ACF graph that there are peaks past the bounded area indicating the significant threshold. We can also see that the big outlier indicated on the graph is the one past the -10 on the residual graph. Overall we do see the distribution around the 0 showcasing a normality in the data.

# Using Best Fit ARIMA Model
power_fit %>%
  select(arima111) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for ARIMA(1,1,0)")

# Forecasting 2014 Monthly Utilizing ARIMA(1,1,1)
df3 <- power_ts %>%
  model(arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1))) %>%
  forecast(h=12)

# Time Series of Forecast
df3 %>%
  autoplot() +
  labs(title="Residential Power Usage Forecast for 2014 Monthly")

Writing Excel:

#write.xlsx(df3, file = "Project_1_Data_624_PartB.xlsx", sheetName = "Part B", append = TRUE)

Part C Bonus

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Waterflow_Pipe1 <- read_excel("C:/Users/puddi/Downloads/Waterflow_Pipe1.xlsx")
Waterflow_Pipe2 <- read_excel("C:/Users/puddi/Downloads/Waterflow_Pipe2.xlsx")

Project 1 Data 624

Jlok17

2024-03-22

Libraries Used:

Part A

Handling Missing Data

Time Series Graph

ATM 1

ATM 2

ATM 3

ATM 4

Writing Excel:

Part B

Data Manipulation

Missing Data

Time Series Graph

Model Comparison

Forecasting 2014

Writing Excel:

Part C Bonus