Part 1

Question:

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

Response:

First, the date column should be converted to a lubridate time, rather than an excel decimal based time.

Perhaps counterintuitively, the 1900 date system is converted to the correct date by using an origin of December 30, 1899. Considering that ATM is the identifier, it is not sensible to include any observations that have an empty ATM entry.

atm <- read_excel("ATM624Data.xlsx")
#which(is.na(atm))

atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")

atm <- atm[!is.na(atm$ATM),]
#which(is.na(atm$DATE))
which(is.na(atm$Cash))
## [1]  87  93  98 105 110
summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5
atm <- as_tsibble(atm,key = ATM, index = DATE)
autoplot(atm)
## Plot variable not specified, automatically selected `.vars = Cash`

There is an egregious outlier in the dataset that needs to be smoothed. Further, the missing values need to be imputed.

max(atm$Cash,na.rm=TRUE)
## [1] 10919.76
#atm$Cash[which(is.na(atm$Cash))] <- median(atm$Cash,na.rm=TRUE)

atm$Cash[which(atm$Cash == max(atm$Cash,na.rm=TRUE))] <- median(atm$Cash[which(atm$ATM == atm$ATM[which(atm$Cash == max(atm$Cash,na.rm=TRUE))])],na.rm=TRUE)

autoplot(atm)
## Plot variable not specified, automatically selected `.vars = Cash`

ATM4 is much more actively used than the other three ATMs. The missing values should be imputed with the median value of the corresponding ATM’s cash value.

atm$ATM[which(is.na(atm$Cash))]
## [1] "ATM1" "ATM1" "ATM1" "ATM2" "ATM2"
which(is.na(atm$Cash))
## [1]  44  47  53 414 420
idx <- c(which(is.na(atm$Cash)))

for(i in idx){
  atm$Cash[i] <- median(atm$Cash[which(atm$ATM == atm$ATM[i])],na.rm=TRUE)
}

autoplot(atm)
## Plot variable not specified, automatically selected `.vars = Cash`

atm1 <- subset(atm,ATM == "ATM1")
atm2 <- subset(atm,ATM == "ATM2")
atm3 <- subset(atm,ATM == "ATM3")
atm4 <- subset(atm,ATM == "ATM4")


ggarrange(autoplot(atm2)+labs(title="atm2")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm2)+labs(title="ATM2")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm3)+labs(title="ATM3")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm4)+labs(title="ATM4")+theme(plot.title = element_text(hjust = 0.5)))
## Plot variable not specified, automatically selected `.vars = Cash`
## Plot variable not specified, automatically selected `.vars = Cash`
## Plot variable not specified, automatically selected `.vars = Cash`
## Plot variable not specified, automatically selected `.vars = Cash`

atm2_ts <- ts(atm2$Cash,frequency = 7)
forecast::ggseasonplot(atm2_ts)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Registered S3 methods overwritten by 'forecast':
##   method                 from     
##   autoplot.Arima         ggfortify
##   autoplot.acf           ggfortify
##   autoplot.ar            ggfortify
##   autoplot.bats          ggfortify
##   autoplot.decomposed.ts ggfortify
##   autoplot.ets           ggfortify
##   autoplot.forecast      ggfortify
##   autoplot.stl           ggfortify
##   autoplot.ts            ggfortify
##   fitted.ar              ggfortify
##   fortify.ts             ggfortify
##   residuals.ar           ggfortify

The forecast package can be used to visualize each week in the dataset to check for seasonality. Interestingly, later in the year, Thursday is the least utilized day for ATM1, while Saturday is a heavy usage day for ATM1 late in the year. Early in the year, the opposite was true. Thursday was a heavy usage day and Saturday was a low usage day. Tuesday and Wednesday appear to be the most consistent days.

ATMs 1, 2, and 4 are all stationary time series. ATM3 appears to have been opened late in the series. It may be difficult to forecast ATM3 with only 3 entries.

atm2_ts <- ts(atm2$Cash,frequency = 7)
forecast::ggseasonplot(atm2_ts)

atm3_ts <- ts(atm3$Cash,frequency = 7)
forecast::ggseasonplot(atm3_ts)

atm4_ts <- ts(atm4$Cash,frequency = 7)
forecast::ggseasonplot(atm4_ts)

ATM4 appears to be randomly used, with no particular preference on the day. It does not appear to be seasonal. ATM2 follows a sporadic pattern, but there is an element of seasonality to it. Early in the year, Thursday is a heavy usage day, and Friday/Saturday are very low usage days. Later in the year, the opposite was true. Wednesday and Thursday had little to no utilization, while Friday and Saturday were heavily used.

ATM4

I will begin with ATM4 because it seems the most straightforward of the four series.

ATM4 is a stationary, non-seasonal time series. There may be some seasonality late in the year on Thursdays, however.

gg_tsdisplay(atm4)
## Plot variable not specified, automatically selected `y = Cash`

The 7th lag (in a cycle: 7th,14th barely the 21st) is the lag that is outside the confidence interval in the ACF. The data is largely stationary, with few outliers now that the outlier has been smoothed.

ARMA Forecast

ARIMA is not needed here, because the data is already stationary, so ARMA will be used. AR is also ARIMA(1,0,0), and MA is ARIMA(0,0,1).

Differencing does not make sense in this case because there are no negative entries for Cash. I will use a lambda of 0 for the ARIMA model to force a log transform on the data.

transform <- box_cox(atm4$Cash,0)

atm4$tf <- transform

ggarrange(autoplot(atm4,Cash)+labs(title="Time Series of ATM4 Cash Withdrawals")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm4,Cash)+labs(title="Box-Cox (log) Transformed ATM4 Series")+theme(plot.title = element_text(hjust = 0.5)),ncol=1)

atm4 %>% 
  gg_tsdisplay(Cash,plot_type='partial',lag = 36)+
  labs(title = "Unmanipulated ATM4 Series",y="")

atm4 %>% 
  gg_tsdisplay(difference(Cash,7),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced ATM4 Series (Note NEGATIVE VALUES)",y="")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).

atm4 %>% 
  gg_tsdisplay(difference(tf,7),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced Box-Cox Transformed ATM4 Series",y="")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Removed 7 rows containing missing values (geom_point).

fit <- atm4 %>%
  model(ar100 = ARIMA(Cash ~ pdq(1,0,0)),
        ma001 = ARIMA(Cash ~ pdq(0,0,1)),
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(1,0,1)))

fit_summary <- data.frame(rbind(c("ar100",0,1,2,2,0,0),c("ma001",0,1,0,2,0,0),c("stepwise",3,0,2,1,0,0),c("search",3,0,2,1,0,0),c("seasonal",3,1,2,1,0,1)))

colnames(fit_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
ar100 0 1 2 2 0 0
ma001 0 1 0 2 0 0
stepwise 3 0 2 1 0 0
search 3 0 2 1 0 0
seasonal 3 1 2 1 0 1
glance(fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)

The seasonal model that I selected provided the optimal outcome, likely because it did not predict negative values. The parameters from the automatically assigned models will be used to re-fit.

fit2 <- atm4 %>%
  model(ar100 = ARIMA(log(Cash) ~ 0 + pdq(1,0,0) + PDQ(2,0,0)),
        ma001 = ARIMA(log(Cash) ~ 0 + pdq(0,0,1) + PDQ(2,0,0)),
        stepwise = ARIMA(log(Cash) ~ 0 + pdq(3,0,2) + PDQ(1,0,0)),
        modified_search_nonseasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(0,0,0)),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(1,0,1)))

fit2_summary <- data.frame(rbind(c("ar100",0,1,2,2,0,0),c("ma001",0,1,0,2,0,0),c("stepwise and search",3,0,2,1,0,0),c("modified search",3,1,2,1,0,0),c("seasonal",3,1,2,1,0,1)))

colnames(fit2_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit2_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
ar100 0 1 2 2 0 0
ma001 0 1 0 2 0 0
stepwise and search 3 0 2 1 0 0
modified search 3 1 2 1 0 0
seasonal 3 1 2 1 0 1
glance(fit2) %>%
  arrange(AICc) %>%
  select(.model:BIC)

Prediction

set.seed(34)

atm4 %>%
  model(#ar100 = ARIMA(log(Cash) ~ 0 + pdq(1,0,0) + PDQ(2,0,0)),
        #ma001 = ARIMA(log(Cash) ~ 0 + pdq(0,0,1) + PDQ(2,0,0)),
        stepwise = ARIMA(log(Cash) ~ 0 + pdq(3,0,2) + PDQ(1,0,0)),
        modified_search_nonseasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(0,0,0)),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(1,0,1))) %>%
  forecast(h=31) %>%
  autoplot(atm4)

bestfit_atm4 <- atm4 %>%
  model(seasonal = ARIMA(log(Cash) ~ 0 + pdq(3,1,2) + PDQ(1,0,1)))
  
atm4_predict <- forecast(bestfit_atm4,h=31)
as.character(atm4_predict$DATE[1])
## [1] "2010-05-01"
atm4_predictions <- data.frame(cbind(c(as.character(atm4_predict$DATE)),atm4_predict$ATM,atm4_predict$.mean))

colnames(atm4_predictions) <- c("DATE","ATM","Cash")

atm4_predictions$DATE <- as.Date(atm4_predictions$DATE)

ATM1

ATM1 may need a round of differencing, but it is a largely stationary time series.

gg_tsdisplay(atm1)
## Plot variable not specified, automatically selected `y = Cash`

The ACF follows a bit of a pattern, and there are three lags outside of the significance interval. I will use a lambda of 0 for the ARIMA model to force a log transform on the data.

transform <- box_cox(atm1$Cash,0)

atm1$tf <- transform

ggarrange(autoplot(atm1,Cash)+labs(title="Time Series of ATM1 Cash Withdrawals")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm1,Cash)+labs(title="Box-Cox (log) Transformed ATM1 Series")+theme(plot.title = element_text(hjust = 0.5)),ncol=1)

atm1 %>% 
  gg_tsdisplay(Cash,plot_type='partial',lag = 36)+
  labs(title = "Unmanipulated ATM1 Series",y="")

atm1 %>% 
  gg_tsdisplay(difference(Cash,7),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced ATM1 Series (Note NEGATIVE VALUES)",y="")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).

atm1 %>% 
  gg_tsdisplay(difference(tf,7),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced Box-Cox Transformed ATM1 Series",y="")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Removed 7 rows containing missing values (geom_point).

fit <- atm1 %>%
  model(ar100 = ARIMA(Cash ~ pdq(1,0,0)),
        ma001 = ARIMA(Cash ~ pdq(0,0,1)),
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(1,0,1)))

fit_summary <- data.frame(rbind(c("ar100",1,0,0,0,1,2),c("ma001",0,0,1,0,1,2),c("stepwise",0,0,1,0,1,2),c("search",0,0,1,0,1,2),c("seasonal",1,1,4,1,0,1)))

colnames(fit_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
ar100 1 0 0 0 1 2
ma001 0 0 1 0 1 2
stepwise 0 0 1 0 1 2
search 0 0 1 0 1 2
seasonal 1 1 4 1 0 1
glance(fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)

The seasonal model that I selected provided the optimal outcome, likely because it did not predict negative values. The seasonal parameters from the automatically assigned models will be used to re-fit.

fit2 <- atm1 %>%
  model(ar100 = ARIMA(log(Cash) ~ 0 + pdq(1,0,0) + PDQ(2,0,0)),
        ma001 = ARIMA(log(Cash) ~ 0 + pdq(0,0,1) + PDQ(2,0,0)),
        modified_stepwise = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(0,1,2)),
        modified_search_nonseasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(0,0,0)),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(1,0,1)))

fit2_summary <- data.frame(rbind(c("ar100",1,0,0,2,0,0),c("ma001",0,0,1,2,0,0),c("modified_stepwise",1,1,4,0,1,2),c("modified search",1,1,4,0,0,0),c("seasonal",1,1,4,1,0,1)))

colnames(fit2_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit2_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
ar100 1 0 0 2 0 0
ma001 0 0 1 2 0 0
modified_stepwise 1 1 4 0 1 2
modified search 1 1 4 0 0 0
seasonal 1 1 4 1 0 1
glance(fit2) %>%
  arrange(AICc) %>%
  select(.model:BIC)

Combining the seasonal parameters that were automatically selected, in addition to the parameters that I chose manually, an optimal model of ARIMA(1,1,4,0,1,2) was found. The AICc is the lowest of the model options at 659.6.

Prediction

set.seed(34)

atm1 %>%
  model(#ar100 = ARIMA(log(Cash) ~ 0 + pdq(1,0,0) + PDQ(2,0,0)),
        #ma001 = ARIMA(log(Cash) ~ 0 + pdq(0,0,1) + PDQ(2,0,0)),
        modified_stepwise = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(0,1,2)),
        #modified_search_nonseasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(0,0,0)),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(1,0,1))) %>%
  forecast(h=31) %>%
  autoplot(atm1)

bestfit_atm1 <- atm1 %>%
  model(modified_stepwise_114_012 = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(0,1,2)))
  
atm1_predict <- forecast(bestfit_atm1,h=31)
as.character(atm1_predict$DATE[1])
## [1] "2010-05-01"
atm1_predictions <- data.frame(cbind(c(as.character(atm1_predict$DATE)),atm1_predict$ATM,atm1_predict$.mean))

colnames(atm1_predictions) <- c("DATE","ATM","Cash")

atm1_predictions$DATE <- as.Date(atm1_predictions$DATE)

ATM2

gg_tsdisplay(atm2)
## Plot variable not specified, automatically selected `y = Cash`

The ACF follows a bit of a pattern, and there are three lags outside of the significance interval. I will use a lambda of 0 for the ARIMA model to force a log transform on the data.

set.seed(34)
atm2$Cash[atm2$Cash == 0] <- 1
transform <- box_cox(atm2$Cash,0)

atm2$tf <- transform

ggarrange(autoplot(atm2,Cash)+labs(title="Time Series of ATM2 Cash Withdrawals")+theme(plot.title = element_text(hjust = 0.5)),autoplot(atm2,Cash)+labs(title="Box-Cox (log) Transformed ATM2 Series")+theme(plot.title = element_text(hjust = 0.5)),ncol=1)

atm2 %>% 
  gg_tsdisplay(Cash,plot_type='partial',lag = 36)+
  labs(title = "Unmanipulated ATM2 Series",y="")

atm2 %>% 
  gg_tsdisplay(difference(Cash,7),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced ATM2 Series (Note NEGATIVE VALUES)",y="")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).

atm2 %>% 
 gg_tsdisplay(difference(tf),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced Box-Cox Transformed ATM2 Series",y="")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

fit <- atm2 %>%
  model(ar100 = ARIMA(Cash ~ pdq(1,0,0)),
        ma001 = ARIMA(Cash ~ pdq(0,0,1)),
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(1,0,1)),
        arima514 = ARIMA(log(Cash) ~ 0 + pdq(5,1,4) + PDQ(0,1,1)))
## Warning in sqrt(diag(best$var.coef)): NaNs produced
fit_summary <- data.frame(rbind(c("ar100",1,0,0,0,1,1),c("ma001",0,0,1,0,1,1),c("stepwise",2,0,2,0,1,1),c("search",1,1,4,0,1,1),c("seasonal",1,1,4,1,0,1),c("arima514",5,1,4,0,1,1)))

colnames(fit_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
ar100 1 0 0 0 1 1
ma001 0 0 1 0 1 1
stepwise 2 0 2 0 1 1
search 1 1 4 0 1 1
seasonal 1 1 4 1 0 1
arima514 5 1 4 0 1 1
glance(fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)

The optimal AICc comes from an ARIMA(5,1,4,0,1,1) model (AICc 806.61)

Prediction

set.seed(34)

atm2 %>%
  model(#ar100 = ARIMA(Cash ~ pdq(1,0,0)),
        #ma001 = ARIMA(Cash ~ pdq(0,0,1)),
        #stepwise = ARIMA(Cash),
        #search = ARIMA(Cash, stepwise=FALSE),
        seasonal = ARIMA(log(Cash) ~ 0 + pdq(1,1,4) + PDQ(1,0,1)),
        arima514 = ARIMA(log(Cash) ~ 0 + pdq(5,1,4) + PDQ(0,1,1))) %>%
  forecast(h=31) %>%
  autoplot(atm2)
## Warning in sqrt(diag(best$var.coef)): NaNs produced

bestfit_atm2 <- atm2 %>%
  model(arima514 = ARIMA(log(Cash) ~ 0 + pdq(5,1,4) + PDQ(0,1,1)))
## Warning in sqrt(diag(best$var.coef)): NaNs produced
atm2_predict <- forecast(bestfit_atm2,h=31)
as.character(atm2_predict$DATE[1])
## [1] "2010-05-01"
atm2_predictions <- data.frame(cbind(c(as.character(atm2_predict$DATE)),atm2_predict$ATM,atm2_predict$.mean))

colnames(atm2_predictions) <- c("DATE","ATM","Cash")

atm2_predictions$DATE <- as.Date(atm2_predictions$DATE)

ATM3

ATM3 only has 3 entries, so the forecast will be highly unreliable. I will use an ARIMA model still, though it is very difficult to draw any reliable conclusions with a small sample size.

gg_tsdisplay(atm3)
## Plot variable not specified, automatically selected `y = Cash`

atm3 <- atm3[atm3$Cash > 0,]

gg_tsdisplay(atm3)
## Plot variable not specified, automatically selected `y = Cash`

Prediction

fit <- atm3 %>%
  model(stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE))
## Warning in sqrt(diag(best$var.coef)): NaNs produced

## Warning in sqrt(diag(best$var.coef)): NaNs produced
fit_summary <- data.frame(rbind(c("stepwise",0,2,1),c("search",0,2,1)))

colnames(fit_summary) <- c("Model","p","d","q")

kbl(fit_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q
stepwise 0 2 1
search 0 2 1
glance(fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)
set.seed(34)

atm3 %>%
  model(stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE)) %>%
  forecast(h=31) %>%
  autoplot(atm2)
## Warning in sqrt(diag(best$var.coef)): NaNs produced

## Warning in sqrt(diag(best$var.coef)): NaNs produced

bestfit_atm3 <- atm3 %>%
  model(search = ARIMA(Cash, stepwise=FALSE))
## Warning in sqrt(diag(best$var.coef)): NaNs produced
atm3_predict <- forecast(bestfit_atm3,h=31)
as.character(atm3_predict$DATE[1])
## [1] "2010-05-01"
atm3_predictions <- data.frame(cbind(c(as.character(atm3_predict$DATE)),atm3_predict$ATM,atm3_predict$.mean))

colnames(atm3_predictions) <- c("DATE","ATM","Cash")

atm3_predictions$DATE <- as.Date(atm3_predictions$DATE)
predictions <- rbind(atm1_predictions,atm2_predictions,atm3_predictions,atm4_predictions)

write.table(predictions,"atm_predictions.csv",row.names = FALSE, col.names = FALSE, quote = FALSE, sep = ",")

Overall, most of the optimal models had AICc values of under 1000, and the confidence intervals were spread widely, but realistic predictions were made, with few unexpected forecasted values.

Question 2

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Response

resident <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")

resident <- resident[,c(2,3)]

colnames(resident) <- c("DATE","KWH")

resident$DATE <- yearmonth(resident$DATE)
resident$KWH <- as.numeric(resident$KWH)
resi <- as_tsibble(resident,index = DATE)


autoplot(resi)
## Plot variable not specified, automatically selected `.vars = KWH`

With the exception of one substantial decrease in 2010, the series is largely stationary. It appears that there is a missing entry in 2008.

Instead of transforming the dataset, I will take a less invasive approach and replace the outlier with the median value of the KWH data. The same will be done with the missing value.

resi$KWH[which(is.na(resi$KWH))] <- median(resi$KWH,na.rm=TRUE)
resi$KWH[which(resi$KWH == min(resi$KWH))] <- median(resi$KWH,na.rm=TRUE)

autoplot(resi)
## Plot variable not specified, automatically selected `.vars = KWH`

This dataset may not need to be differenced, but there is a slight lack of stationarity.

resi_szn <- ts(resi$KWH,frequency = 12)
forecast::ggseasonplot(resi_szn)

This data is highly seasonal, with consistently low power consumption in November, April, and May, and peak consumption in January, July, and August.

gg_tsdisplay(resi)
## Plot variable not specified, automatically selected `y = KWH`

I will model this using ARIMA, and a seasonal adjustment will be made as well. With seasonal differencing, the sinusoidal pattern in the ACF is removed.

set.seed(34)

resi %>% 
  gg_tsdisplay(KWH,plot_type='partial',lag = 36)+
  labs(title = "Unmanipulated KWH Series",y="")

resi %>% 
  gg_tsdisplay(difference(KWH,12),plot_type='partial',lag = 36)+
  labs(title = "Seasonally Differenced KWH Series",y="")
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).

fit <- resi %>%
  model(stepwise = ARIMA(KWH),
        search = ARIMA(KWH, stepwise=FALSE),
        seasonal = ARIMA(KWH ~ 0 + pdq(2,2,3) + PDQ(1,0,1)),
        arima213 = ARIMA(KWH ~ 0 + pdq(2,1,3) + PDQ(2,1,0)))

fit_summary <- data.frame(rbind(c("stepwise",1,0,1,2,1,0),c("search",0,0,4,2,1,0),c("seasonal",2,2,3,1,0,1),c("arima514",2,1,3,2,1,0)))

colnames(fit_summary) <- c("Model","p","d","q","P","D","Q")

kbl(fit_summary, longtable = T, booktabs = T, caption = "Parameters of Each Model") %>%
  kable_styling(latex_options = c("repeat_header"))
Parameters of Each Model
Model p d q P D Q
stepwise 1 0 1 2 1 0
search 0 0 4 2 1 0
seasonal 2 2 3 1 0 1
arima514 2 1 3 2 1 0
glance(fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)

Prediction

set.seed(34)

resi %>%
  model(stepwise = ARIMA(KWH),
        search = ARIMA(KWH, stepwise=FALSE),
        seasonal = ARIMA(KWH ~ 0 + pdq(2,2,3) + PDQ(1,0,1)),
        arima213 = ARIMA(KWH ~ 0 + pdq(2,1,3) + PDQ(2,1,0))) %>%
  forecast(h=12) %>%
  autoplot(resi)

bestfit_resi <- resi %>%
  model(arima213 = ARIMA(KWH ~ 0 + pdq(2,1,3) + PDQ(2,1,0)))
  

forecast(bestfit_resi,h=12)%>%
  autoplot(resi)+labs(title="ARIMA(2,1,3,2,1,0) Forecast")+theme(plot.title = element_text(hjust = 0.5))

resi_predict <- forecast(bestfit_resi,h=12)
#as.character(resi_predict$DATE[1])

resi_predictions <- data.frame(cbind(c(as.character(resi_predict$DATE)),resi_predict$.mean))

colnames(resi_predictions) <- c("DATE","KWH")

resi_predictions$DATE <- yearmonth(resi_predictions$DATE)

write.table(resi_predictions,"resi_predictions.csv",row.names = FALSE, col.names = FALSE, quote = FALSE, sep = ",")

Part 3

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows).Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Response

p1 <- read_excel("Waterflow_Pipe1.xlsx")
p2 <- read_excel("Waterflow_Pipe2.xlsx")

colnames(p1) <- c("DateTime","WaterFlow")
colnames(p2) <- c("DateTime","WaterFlow")

#p1$DateTime <- 34

p1$Date <- as.Date(p1$DateTime, origin = "1899-12-30")
p1$Time <- hms::hms(24*60*60*(p1$DateTime - floor(p1$DateTime)))
p1$hour <- hms::hms(60*60*floor(24*(p1$DateTime - floor(p1$DateTime))))

p1$Date_Time <- ymd_hms(ymd(p1$Date) + hms(p1$hour))

p1 <- p1[,c(2,6)]

p2$Date <- as.Date(p2$DateTime, origin = "1899-12-30")
p2$Time <- hms::hms(24*60*(p2$DateTime - round(p2$DateTime,0)))
p2$hour <- hms::hms(60*60*floor(24*(p2$DateTime - floor(p2$DateTime))))

p2$Date_Time <- ymd_hms(ymd(p2$Date) + hms(p2$hour))

p2 <- p2[,c(2,6)]
p1 <- p1 %>%
  group_by(Date_Time) %>%
  summarize(WaterFlow = mean(WaterFlow))

p1_ts <- as_tsibble(p1,index = Date_Time)


autoplot(p1_ts)+labs(title="Time Series Plot of P1")+theme(plot.title = element_text(hjust = 0.5))
## Plot variable not specified, automatically selected `.vars = WaterFlow`

p2 <- p2 %>%
  group_by(Date_Time) %>%
  summarize(WaterFlow = mean(WaterFlow))

p2_ts <- as_tsibble(p2,index = Date_Time)


autoplot(p2_ts)+labs(title="Time Series Plot of P2")+theme(plot.title = element_text(hjust = 0.5))
## Plot variable not specified, automatically selected `.vars = WaterFlow`

P1

Because P1 and P2 are stationary, ARIMA will likely yield an optimal value of 0,0,0. ARIMA search will also be checked, and ETS will be used.

fit1 <- p1_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model("Simple ARIMA (0,0,0)" = ARIMA(WaterFlow),
        "ARIMA Search" = ARIMA(WaterFlow,stepwise=FALSE),
        "ETS(A,A,N) (Holt's)" = ETS(WaterFlow ~ error("A") + trend("A") + season("N")))

glance(fit1) %>%
  arrange(AICc) %>%
  select(.model:BIC)
p1_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model("Simple ARIMA (0,0,0)" = ARIMA(WaterFlow),
        "ARIMA Search" = ARIMA(WaterFlow,stepwise=FALSE),
        "ETS(A,A,N) (Holt's)" = ETS(WaterFlow ~ error("A") + trend("A") + season("N"))) %>%
  forecast(h=24) %>%
  autoplot(p1_ts)+
  labs(title="Forecasting P1 Water Flow")+
  theme(plot.title = element_text(hjust = 0.5))

bestfit_p1 <- p1_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model(arima000 = ARIMA(WaterFlow))
  

forecast(bestfit_p1,h=168)%>%
  autoplot(p1_ts)+labs(title="ARIMA(0,0,0) Forecast")+theme(plot.title = element_text(hjust = 0.5))

p1_predict <- forecast(bestfit_p1,h=168)


p1_predictions <- data.frame(cbind(c(as.character(p1_predict$Date_Time)),p1_predict$.mean))

colnames(p1_predictions) <- c("DateTime","WaterFlow")

write.table(p1_predictions,"p1_predictions.csv",row.names = FALSE, col.names = FALSE, quote = FALSE, sep = ",")

P2

P2 appears to have an element of seasonality to it, as the automatic ARIMA forecast added a PDQ of (0,0,1). In both P1 and P2, the ETS (Holt’s) performed poorer than the automatic ARIMA models.

fit2 <- p2_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model("ARIMA (0,0,1) (0,0,1)" = ARIMA(WaterFlow),
        "ARIMA Search (0,0,1) (2,0,0)" = ARIMA(WaterFlow,stepwise=FALSE),
        "ETS(A,A,N) (Holt's)" = ETS(WaterFlow ~ error("A") + trend("A") + season("N")))

glance(fit2) %>%
  arrange(AICc) %>%
  select(.model:BIC)
p2_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model("ARIMA (0,0,1) (0,0,1)" = ARIMA(WaterFlow),
        "ARIMA Search (0,0,1) (2,0,0)" = ARIMA(WaterFlow,stepwise=FALSE),
        "ETS(A,A,N) (Holt's)" = ETS(WaterFlow ~ error("A") + trend("A") + season("N"))) %>%
  forecast(h=24) %>%
  autoplot(p2_ts)+
  labs(title="Forecasting P2 Water Flow")+
  theme(plot.title = element_text(hjust = 0.5))

bestfit_p2 <- p2_ts %>%
  fill_gaps(WaterFlow = median(WaterFlow)) %>%
  model(arima001 = ARIMA(WaterFlow))
  

forecast(bestfit_p2,h=168)%>%
  autoplot(p2_ts)+labs(title="ARIMA(0,0,1) Forecast")+theme(plot.title = element_text(hjust = 0.5))

p2_predict <- forecast(bestfit_p2,h=168)


p2_predictions <- data.frame(cbind(c(as.character(p2_predict$Date_Time)),p2_predict$.mean))

colnames(p2_predictions) <- c("DateTime","WaterFlow")

write.table(p2_predictions,"p2_predictions.csv",row.names = FALSE, col.names = FALSE, quote = FALSE, sep = ",")