DATA 624 Project 1

Part A - ATM Forecast

I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Data Import and Exploration

After importing the ATM data we will split them up by ATM for future use. We will hten splot them, splitting them out by ATM to get a general sense of the data. By doing this we an see ATM1 and ATM2 have a relative consistent withdrawal pattern. ATM3 only has a few with withdrawals of over 0 suggesting it may not have offered cash but some other service until the end of this time frame. Finally ATM4 seems to have one extreme outlier of over 10,000 on one withdrawal. We will remove this datapoint and replot ATM4. After making this adjustment for ATM4 we see it has a similar distribution as to ATM1 and ATM2. Additionally, moving forward I will not be forecasting how much cash will be taken out of ATM3. with only 3 datapoints > 0 it would be disingenous to predict the withdrawals for that ATM. It should also be noted that ATM4 has higher withdrawal amounts then the other 2 atms. We can also see that ATM 1 and ATM4 seem to have a cyclical aspect to them while ATM2 seems to gradually decrease before picking back up again.

atm<- read_excel("ATM624Data.xlsx")
atm<-atm%>%as_tsibble(index=DATE,key=ATM)
atm<- atm%>%mutate(DATE = as.Date(DATE, origin='1899-12-30'))


atm1<-atm%>%filter(ATM=="ATM1")
atm2<-atm%>%filter(ATM=="ATM2")
atm3<-atm%>%filter(ATM=="ATM3")
atm4<-atm%>%filter(ATM=="ATM4")


atm %>%
  ggplot(aes(DATE, Cash, color=ATM)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ATM, scales='free')

atm4$Cash[atm4$Cash > 9000] <- mean(atm4$Cash)
atm$Cash[atm$Cash > 9000] <- mean(atm4$Cash)

atm<-atm%>%filter(ATM!="ATM3")

atm %>%
  ggplot(aes(DATE, Cash, color=ATM)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ATM, scales='free')

### ATM1

We can easily see there is strong weekly (every 7 days) seasonality form the ACF.The plot of ATM1 also appears to be stable. This suggest we will want to use a seasonal ARIMA model. Additionally, the plot appears to be stable so we will not need to difference to address this. We will let the ARIMA() function select a function as well as manually running a ARIMA with a basic seasonal differencing. We can see with the searched for ARIMA model beat our seasonal model and will be what we choose for prediction. Ass expected the select model produced a model of ARIMA(0,0,1)(1,1,1)[7] with a seasonal difference but no first difference.

ggtsdisplay(atm1$Cash)

fit <- atm1 %>%
  model(
    Seasonal = ARIMA(Cash~pdq(0,0,0)+PDQ(0,1,0))
    ,
    auto = ARIMA(Cash)
  )


r<-fit%>%report()
fit <- atm1 %>%
  model(
    auto = ARIMA(Cash)
  )

f<-fit%>%forecast(h=30)
r<-fit%>%report()

## Series: Cash 
## Model: ARIMA(1,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          ar1     sar1     sar2
##       0.1511  -0.4698  -0.2048
## s.e.  0.0535   0.0517   0.0514
## 
## sigma^2 estimated as 601.5:  log likelihood=-1641.05
## AIC=3290.11   AICc=3290.22   BIC=3305.63

atm1_pred<-as.data.frame(f)
atm1_pred<-atm1_pred%>%dplyr::select(DATE,.mean)
atm1_pred<-atm1_pred%>%rename("prediction"=".mean")
write_xlsx(atm1_pred,"ATM1 Predictions.xlsx")
f%>%autoplot(atm1)+
  labs(title = "Forecast of atm1 withdrawals")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of atm1 withdrawals")

### ATM2

We will repeat the process we used above. similar to ATM1 the ARIMA model selected was ARIMA(2,0,2)(0,1,1)[7] with a lag of 7 and no first differencing.

ggtsdisplay(atm2$Cash)

fit <- atm2 %>%
  model(
    auto = ARIMA(Cash)
  )


r<-fit%>%report()

## Series: Cash 
## Model: ARIMA(0,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5747  -0.1766
## s.e.   0.0521   0.0518
## 
## sigma^2 estimated as 649.2:  log likelihood=-1659.31
## AIC=3324.61   AICc=3324.68   BIC=3336.26

f<-fit%>%forecast(h=30)

atm2_pred<-as.data.frame(f)
atm2_pred<-atm2_pred%>%dplyr::select(DATE,.mean)
atm2_pred<-atm2_pred%>%rename("prediction"=".mean")
write_xlsx(atm2_pred,"ATM2 Predictions.xlsx")
f%>%autoplot(atm2)+
  labs(title = "Forecast of atm2 withdrawals")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of atm2 withdrawals")

ATM4

We will repeat the process we used above. similar to ATM1 the ARIMA model selected was ARIMA(1,0,0)(2,0,0)[7] w/ mean with a lag of 7 and no first differencing or seasonal differencing. Based on what we know with the first 2 ATM’s we will input a seasonal difference into our own ARIMA and plot that. Th chosen prediction seems to predict much lower variance than the recorded data and our inputted seasonal model appears to have a much lower AIC so we will use this model for out predictions.

ggtsdisplay(atm4$Cash)

fit <- atm4 %>%
  model(
    auto = ARIMA(Cash)
  )


r<-fit%>%report()

## Series: Cash 
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2    sar1  constant
##       -0.8884  -0.6735  0.1661  0.9716  0.7966  0.2481  800.3641
## s.e.   0.1050   0.1239  0.0564  0.0959  0.1150  0.0555   48.8692
## 
## sigma^2 estimated as 117552:  log likelihood=-2645.24
## AIC=5306.48   AICc=5306.88   BIC=5337.68

f<-fit%>%forecast(h=30)
f%>%autoplot(atm4)+
  labs(title = "Forecast of atm4 withdrawals")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of atm4 withdrawals")

fit <- atm1 %>%
  model(
    Seasonal = ARIMA(Cash~pdq(1,0,0)+PDQ(2,1,0))
    
  )

r<-fit%>%report()

## Series: Cash 
## Model: ARIMA(1,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          ar1     sar1     sar2
##       0.1511  -0.4698  -0.2048
## s.e.  0.0535   0.0517   0.0514
## 
## sigma^2 estimated as 601.5:  log likelihood=-1641.05
## AIC=3290.11   AICc=3290.22   BIC=3305.63

f<-fit%>%forecast(h=30)
atm4_pred<-as.data.frame(f)
atm4_pred<-atm4_pred%>%dplyr::select(DATE,.mean)
atm4_pred<-atm4_pred%>%rename("prediction"=".mean")
write_xlsx(atm4_pred,"ATM4 Predictions.xlsx")
f%>%autoplot(atm4)+
  labs(title = "Forecast of atm4 withdrawals")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of atm4 withdrawals")

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

We can import the data and plot it simply to see it has a slight trend upward with very strong seasonality and an outlier in early 2010 or late 2011. We will alter this outlier to be the mean for timeseries.

We can also see there is a strong monthly seasonality so we will likely end up using a seasonal ARIMA to predict this data. This produces a ARIMA(0,0,3)(0,1,1)[12] w/ drift that produces acceptable residuals and forecast graph so we will go with this model for our predictions.

kw<- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kw<-ts(kw[, "KWH"], start = c(1998, 1), frequency = 12)
kw<-kw%>%as_tsibble()


autoplot(kw)

ggtsdisplay(kw$value, main="Monthly Power Usage (KWH)")

kw$value[kw$value < 3000000] <- mean(kw$value)
autoplot(kw)

ggtsdisplay(kw$value, main="Monthly Power Usage (KWH)")

fit <- kw %>%
  model(
    auto = ARIMA(value)
  )


r<-fit%>%report()

## Series: value 
## Model: ARIMA(2,0,2)(2,1,0)[12] 
## 
## Coefficients:
##          ar1     ar2      ma1      ma2     sar1     sar2
##       0.3922  0.5837  -0.0876  -0.7626  -0.7509  -0.4375
## s.e.  0.1595  0.1476   0.1335   0.1007   0.0800   0.0801
## 
## sigma^2 estimated as 3.826e+11:  log likelihood=-2629.34
## AIC=5272.68   AICc=5273.33   BIC=5295.03

f<-fit%>%forecast(h=12)
kwh_pred<-as.data.frame(f)
kwh_pred<-kwh_pred%>%dplyr::select(index,.mean)
kwh_pred<-kwh_pred%>%rename("prediction"=".mean")
write_xlsx(kwh_pred,"KWH Predictions.xlsx")
f%>%autoplot(kw)+
  labs(title = "Forecast of KWH Usage")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of KWH Usage")

Part C – BONUS, optional (part or all)

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

After reading the files in we need to do some data cleansing to convert them to a tsibble with an hourly format. By will also add in 4 missing hourly values for pipe 1. Pipe 1 appears to be stable with no obvious seasonality and pipe 2 has slight seasonality and appears to be stable. We wil use the ARIMA() function.

wp1<- read_excel("Waterflow_Pipe1.xlsx")
wp2<- read_excel("Waterflow_Pipe2.xlsx")

NoOfHours <- as.numeric(ymd_hms("2015-12-3 16:00:00") - ymd_hms("2015-10-23 01:00:00"))*24 
wp2$Datetime<-ymd_hms("2015-10-23 01:00:00") + hours(0:NoOfHours)

wp2<-wp2%>%as_tsibble()
wp1<-separate(wp1, `Date Time`, into = c('Day', 'min'), sep="\\.")
wp1$min<- round(as.numeric(sub("^", ".", wp1$min ))*1440,2)
wp1$hour<-floor(wp1$min/60)


wp<-aggregate(wp1$WaterFlow, list(wp1$hour,wp1$Day), FUN=mean) 
wp<-wp%>%rename("Hour"="Group.1")%>%rename("WaterFlow"="x")%>%rename("Day"="Group.2")
wp$behind<- lag(wp$Hour)
wp$ahead<- lead(wp$Hour)
wp$miss <- ifelse(wp$Hour!=0 & wp$Hour-1==wp$behind, 'good', 'bad')
wp_miss<-wp%>%filter(miss=='bad' & Hour != 0)
wp_miss$Hour<-wp_miss$Hour-1
wp_miss$WaterFlow<-mean(wp$WaterFlow)
wp1<- wp[, c("Hour","Day","WaterFlow")]
wp_miss<-wp_miss[, c("Hour","Day","WaterFlow")]
wp1<-rbind(wp1,wp_miss)
wp1<-wp1%>%arrange(Day,Hour)



NoOfHours <- as.numeric(ymd_hms("2015-11-1 23:00:00") - ymd_hms("2015-10-23 00:00:00"))*24 
wp1$Datetime<-ymd_hms("2015-10-23 00:00:00") + hours(0:NoOfHours)

wp1<-wp1%>%as_tsibble()

autoplot(wp1,WaterFlow)+labs(title = "Hourly Waterflow for Pipe1")

autoplot(wp2,WaterFlow)+labs(title = "Hourly Waterflow for Pipe2")

ggtsdisplay(wp1$WaterFlow, main="Hourly Waterflow for Pipe1")

ggtsdisplay(wp2$WaterFlow, main="Hourly Waterflow for Pipe2")

fit <- wp1 %>%
  model(
    auto = ARIMA(WaterFlow)
  )


r<-fit%>%report()

## Series: WaterFlow 
## Model: ARIMA(0,0,0) w/ mean 
## 
## Coefficients:
##       constant
##        19.8927
## s.e.    0.2708
## 
## sigma^2 estimated as 17.68:  log likelihood=-684.72
## AIC=1373.44   AICc=1373.49   BIC=1380.4

f<-fit%>%forecast(h=7)
wp1_pred<-as.data.frame(f)
wp1_pred<-wp1_pred%>%dplyr::select(Datetime,.mean)
wp1_pred<-wp1_pred%>%rename("prediction"=".mean")
write_xlsx(wp1_pred,"Water Pipe 1.xlsx")
f%>%autoplot(wp1)+
  labs(title = "Forecast of Water Pipe 1 Waterflow")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of Water Pipe 1 Waterflow")

fit <- wp2 %>%
  model(
    auto = ARIMA(WaterFlow)
  )


r<-fit%>%report()

## Series: WaterFlow 
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean 
## 
## Coefficients:
##         sma1  constant
##       0.0846   39.5732
## s.e.  0.0314    0.5472
## 
## sigma^2 estimated as 256:  log likelihood=-4190.54
## AIC=8387.08   AICc=8387.1   BIC=8401.8

f<-fit%>%forecast(h=7)
wp2_pred<-as.data.frame(f)
wp2_pred<-wp2_pred%>%dplyr::select(Datetime,.mean)
wp2_pred<-wp2_pred%>%rename("prediction"=".mean")
write_xlsx(wp2_pred,"Water Pipe 2.xlsx")
f%>%autoplot(wp2)+
  labs(title = "Forecast of Water Pipe 2 Waterflow")

fit%>% gg_tsresiduals()+labs(title = "Residuals of Forecast of Water Pipe 2 Waterflow")