library(tidyverse)
library(fpp2)
library(readxl)
library(lubridate)
library(urca)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
#reading in the excel doc
atmraw <- read_xlsx("ATM624Data.xlsx")
#changing date from numeric to date format
atmraw$DATE <- openxlsx::convertToDate(atmraw$DATE)
#adding year, month and day breakdown columns for later aggregation
#dropping rows with zero or missing Cash, reshaping to one column per ATM,
#and filling the remaining gaps with 0
atm <- atmraw %>%
  mutate(year = as.character(lubridate::year(DATE)),
         month = as.character(lubridate::month(DATE)),
         day = as.character(lubridate::day(DATE)),
         dayofyear = format(DATE, "%j"),
         yearmonth = format(DATE, "%Y-%m"),
         yearweek = format(DATE, "%Y-%W"),
         dayofweek = format(DATE, "%w"),
         dayname = weekdays(DATE)) %>%
  filter(Cash != 0) %>%
  spread(ATM, Cash) %>%
  replace_na(list(ATM1 = 0, ATM2 = 0, ATM3 = 0, ATM4 = 0))
#convert to a daily ts object starting 1 May 2009 (day 121 of the year)
tsatm <- ts(atm %>% select(ATM1, ATM2, ATM3, ATM4),
            start = c(2009, 121),
            frequency = 365)
#plot out time series objects
autoplot(tsatm, facets = TRUE)
ATM 1 and 2 look like they have weekly seasonality and possibly monthly seasonality. ATM 3 only has 3 dates with non-zero values. ATM4 looks like it has weekly seasonality with one large outlier.
atm %>%
  group_by(dayname) %>%
  summarize(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
## # A tibble: 7 x 5
## dayname `sum(ATM1)` `sum(ATM2)` `sum(ATM3)` `sum(ATM4)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 5228 4877 85 30397.
## 2 Monday 4386 3054 0 25027.
## 3 Saturday 4927 3951 0 25560.
## 4 Sunday 5338 3492 0 28032.
## 5 Thursday 1648 1302 82 8838.
## 6 Tuesday 4568 3809 0 33658.
## 7 Wednesday 4272 2231 96 21514.
It appears that Thursday is a low day for ATMs 1, 2, and 4. ATM3 only has usage on Wednesday, Thursday, and Friday. Let’s check for weekly seasonality.
atmw1 <- ts(tsatm[,1], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw1)
atmw2 <- ts(tsatm[,2], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw2)
atmw3 <- ts(tsatm[,3], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw3)
atmw4 <- ts(tsatm[,4], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw4)
There appears to be weekly seasonality with a number of high and low outliers. The subseries labels are also off by one day, because the ts() start does not line up with the actual weekday of the first observation: the low day for ATMs 1, 2, and 4 should be Thursday, the three active days for ATM3 are Wednesday/Thursday/Friday, and the outlier in ATM4 falls on a Tuesday.
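One way to line up the labels (a sketch, assuming the panels run Monday through Sunday; if they start on Sunday, use "%w" plus 1 instead) is to set the ts start from the weekday of the first observation:
#align the weekly cycle with the calendar: 1 May 2009 is a Friday, so it sits in
#position 5 of a Monday-first week ("%u" gives Monday = 1 ... Sunday = 7)
startdow <- as.numeric(format(as.Date("2009-05-01"), "%u"))
atmw1aligned <- ts(tsatm[, 1], frequency = 7, start = c(1, startdow))
ggsubseriesplot(atmw1aligned)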
#sum of weekly cash
atmweek <- atm %>%
  group_by(yearweek) %>%
  summarise(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
colnames(atmweek) <- c("yearweek", "ATM1", "ATM2", "ATM3", "ATM4")
tsweek <- ts(atmweek %>% select(-yearweek), frequency = 52, start = c(2009,17))
autoplot(tsweek, facets = TRUE)
ATM 1 and 2 have a dip around the weeks of Christmas and New Years at the end of 2009 and the outlier in ATM4 is in mid February.
atm %>%
  group_by(month) %>%
  summarize(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
## # A tibble: 12 x 5
## month `sum(ATM1)` `sum(ATM2)` `sum(ATM3)` `sum(ATM4)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 2677 1811 0 14903.
## 2 10 2247 1906 0 9514.
## 3 11 2289 1692 0 12908.
## 4 12 2422 1688 0 12966.
## 5 2 2531 1517 0 23084.
## 6 3 2433 1602 0 13315.
## 7 4 2300 1842 263 11858.
## 8 5 2480 2338 0 12622.
## 9 6 2412 2129 0 13744.
## 10 7 2621 2094 0 14477.
## 11 8 3129 2230 0 15903.
## 12 9 2826 1867 0 17733.
#sum of monthly cash by ATM
atmmonth <- atm %>%
  group_by(yearmonth) %>%
  summarise(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
colnames(atmmonth) <- c("yearmonth", "ATM1", "ATM2", "ATM3", "ATM4")
#start the monthly series in May 2009 so monthdays() matches the correct calendar months
tsmonth <- ts(atmmonth %>% select(-yearmonth), frequency = 12, start = c(2009, 5))
#average daily cash per month: divide each monthly total by the number of days in that month
tsmonthday <- cbind(tsmonth/monthdays(tsmonth))
autoplot(tsmonthday, facets = TRUE)
There appear to be some monthly trends, but aggregating to monthly averages smooths away a lot of the detail.
Next we’re going to make the data stationary. Given the weekly pattern above, a seasonal difference at lag 7 should work for ATMs 1, 2, and 4.
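As a quick check on that claim (a sketch; results not shown here), nsdiffs() and ndiffs() from the forecast package estimate how many seasonal and ordinary differences a series needs:
#how many lag-7 (seasonal) differences does ATM1 need, and are any further
#ordinary differences needed after taking one seasonal difference?
nsdiffs(ts(tsatm[, 1], frequency = 7))
ndiffs(diff(ts(tsatm[, 1], frequency = 7), lag = 7))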
atm1 <- tsatm[,1] %>% diff(lag=7)
atm1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0203
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atm1)
ATM1 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test (the test statistic of 0.02 is well below every critical value). There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atmw1 <- ts(tsatm[,1], frequency = 7) %>% diff(lag=7)
atmw1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0203
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw1)
ATM1 with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atm1model1 <- auto.arima(atm1)
atm1model2 <- auto.arima(atmw1)
summary(atm1model1)
## Series: atm1
## ARIMA(0,0,0) with zero mean
##
## sigma^2 estimated as 969.6: log likelihood=-1738.95
## AIC=3479.89 AICc=3479.9 BIC=3483.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.03631285 31.13868 19.23184 100 100 NaN 0.03135272
summary(atm1model2)
## Series: atmw1
## ARIMA(0,0,0)(0,0,1)[7] with zero mean
##
## Coefficients:
## sma1
## -0.6868
## s.e. 0.0429
##
## sigma^2 estimated as 683.9: log likelihood=-1678.2
## AIC=3360.39 AICc=3360.43 BIC=3368.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.1068561 26.11536 16.60482 NaN Inf 0.4838766 0.06496742
The model fitted to the frequency-7 series performed better, with a lower AICc (3360 vs. 3480) than the model fitted to the frequency-365 series. Let’s check the residuals now.
atm1model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 157.8, df = 72, p-value = 2.32e-08
##
## Model df: 0. Total lags used: 72
atm1model2 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(0,0,1)[7] with zero mean
## Q* = 22.748, df = 13, p-value = 0.0448
##
## Model df: 1. Total lags used: 14
Neither model passes the Ljung-Box test: both p-values are below 0.05, so we reject the null hypothesis of independently distributed residuals and conclude that both models show a lack of fit. Let’s try adding a Box-Cox transformation to see if that helps.
atm1l <- BoxCox.lambda(atm1)
atmw1l <- BoxCox.lambda(atmw1)
atm1model1a <- auto.arima(atm1, lambda = atm1l)
atm1model2a <- auto.arima(atmw1, lambda = atmw1l)
atm1model1a %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 157.8, df = 72, p-value = 2.32e-08
##
## Model df: 0. Total lags used: 72
atm1model2a %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,0,1)[7] with non-zero mean
## Q* = 18.257, df = 11, p-value = 0.0758
##
## Model df: 3. Total lags used: 14
This time, the frequency-7 ATM1 model passes the Ljung-Box test (p = 0.076) and its residuals appear to be white noise. This is the model we’ll use for forecasting ATM1.
Now we’ll use the same process for ATM2.
atmw2 <- ts(tsatm[,2], frequency = 7) %>% diff(lag=7)
atmw2 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0161
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw2)
ATM2 with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by lower spikes at lags 14 and 21 in the PACF.
atmw2l <- BoxCox.lambda(atmw2)
atm2model1 <- auto.arima(atmw2, lambda = atmw2l)
summary(atm2model1)
## Series: atmw2
## ARIMA(2,0,2)(1,0,2)[7] with non-zero mean
## Box Cox transformation: lambda= 0.9215252
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2 mean
## -0.4181 -0.9054 0.4511 0.7823 -0.6805 -0.1267 -0.4427 -1.4188
## s.e. 0.0577 0.0448 0.0887 0.0610 0.1990 0.2126 0.1676 0.2622
##
## sigma^2 estimated as 368.1: log likelihood=-1564.01
## AIC=3146.01 AICc=3146.53 BIC=3180.94
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.4263577 23.88372 16.84544 NaN Inf 0.4617531 -0.004845533
atm2model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(1,0,2)[7] with non-zero mean
## Q* = 5.7537, df = 6, p-value = 0.4513
##
## Model df: 8. Total lags used: 14
The frequency-7 ATM2 model has an AICc of about 3147 and passes the Ljung-Box test (p = 0.45), so its residuals are consistent with white noise. This is the model we’ll use for forecasting ATM2.
For ATM3, there’s not enough history to build a forecast; it has only three non-zero observations. We could carry the mean of those values forward, but I would tell the business that modelling this ATM is not worth their time. If a placeholder number is still required, a flat mean forecast is sketched below.
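This is a sketch only, not a fitted model: it simply repeats the average of ATM3’s three non-zero days.
#placeholder only: repeat the average of ATM3's three non-zero observations for 31 days
atm3nonzero <- atm$ATM3[atm$ATM3 > 0]
atm3f <- meanf(ts(atm3nonzero), h = 31)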
For ATM4, we’ll use the same method as for ATMs 1 and 2.
atmw4 <- ts(tsatm[,4], frequency = 7) %>% diff(lag=7)
atmw4 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.014
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw4)
The large outlier needs more explanation before deciding how to handle it: an annual event and a data-entry error would call for different treatment. For now, I’m going to remove it and replace it with the series median.
#replace the extreme observation (the series maximum) with the series median
tsatm[,4][which.max(tsatm[,4])] <- median(tsatm[,4])
atmw4 <- ts(tsatm[,4], frequency = 7) %>% diff(lag=7)
atmw4 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0154
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw4)
Much better. The median-imputed ATM4 series with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atmw4l <- BoxCox.lambda(atmw4)
atm4model1 <- auto.arima(atmw4, lambda = atmw4l)
summary(atm4model1)
## Series: atmw4
## ARIMA(2,0,2)(1,0,1)[7] with zero mean
## Box Cox transformation: lambda= 0.97073
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## 0.7126 -0.9357 -0.678 0.9792 -0.0443 -0.9073
## s.e. 0.0292 0.0372 0.020 0.0314 0.0625 0.0296
##
## sigma^2 estimated as 79994: log likelihood=-2532.55
## AIC=5079.11 AICc=5079.43 BIC=5106.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -9.684603 329.6829 256.8833 1399.553 1671.569 0.4236931
## ACF1
## Training set 0.05770441
atm4model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(1,0,1)[7] with zero mean
## Q* = 10.387, df = 8, p-value = 0.2389
##
## Model df: 6. Total lags used: 14
The ATM4 model passes the Ljung-Box test (p = 0.24) and its residuals appear to be white noise. This is the model we’ll use for forecasting ATM4.
#31-day (May 2010) forecasts from the chosen models
atm1f <- forecast(atm1model2a, 31)
atm2f <- forecast(atm2model1, 31)
atm4f <- forecast(atm4model1, 31)
autoplot(atm1f)
autoplot(atm2f)
autoplot(atm4f)
Well, I’m no data scientist, but I have a funny feeling people aren’t depositing money into the ATMs, so these plots need a caveat: the models were fitted to the lag-7 differenced series, so the forecasts are week-over-week changes in cash rather than cash levels, which is why they dip below zero.
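One way to get forecasts in dollar levels (a sketch, not rerun here) is to fit on the undifferenced frequency-7 series and let auto.arima() choose the seasonal difference internally, so forecast() returns values on the original cash scale. Shown for ATM1; the same idea applies to ATM2 and ATM4.
#fit on the level series: auto.arima() applies the lag-7 difference itself and
#forecast() undoes it, so the output is cash in hundreds of dollars, not weekly changes
atm1level <- ts(tsatm[, 1], frequency = 7)
atm1levelfit <- auto.arima(atm1level)
atm1levelf <- forecast(atm1levelfit, 31)
autoplot(atm1levelf)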
#export the May 2010 point forecasts (the $mean component, not the historical $x series) to an Excel-readable CSV
cbind(ATM1 = atm1f$mean, ATM2 = atm2f$mean, ATM4 = atm4f$mean) %>% write.csv("Part1_Forecast.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
eraw <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
tse <- ts(eraw$KWH, start = c(1998,1), frequency = 12)
autoplot(tse)
There is seasonality and trend. There is an outlier just after 2010.
ggsubseriesplot(tse)
## Warning: Removed 16 rows containing missing values (geom_path).
There is increased energy usage in the summer and winter months.
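The outlier noted above, and the missing values behind the warning, could be cleaned before modelling; forecast::tsclean() is one option (a sketch, not applied in the analysis below):
#replace missing values and any flagged outliers via interpolation; shown only as an alternative
tseclean <- tsclean(tse)
autoplot(cbind(original = tse, cleaned = tseclean), facets = TRUE)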
Let’s apply a Box-Cox transformation, take a seasonal difference at lag 12, and see what we get.
tsebox <- BoxCox.lambda(tse)
tsefit <- tse %>% BoxCox(tsebox) %>% diff(lag=12)
tsefit %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0696
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(tsefit)
## Warning: Removed 2 rows containing missing values (geom_point).
We see a number of significant spikes at lags 1, 4, and 8, but the main peak in the ACF is at lag 12. We see a similar pattern in the PACF, with an additional peak at 24 months.
emodel <- auto.arima(tsefit)
summary(emodel)
## Series: tsefit
## ARIMA(0,0,1)(2,0,0)[12] with non-zero mean
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0724 -0.7045 -0.3585 0.0911
## s.e. 0.0723 0.0681 0.0660 0.0492
##
## sigma^2 estimated as 1.471: log likelihood=-288.68
## AIC=587.37 AICc=587.71 BIC=603.33
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.009863523 1.198977 0.5552135 -2432.114 2642.291 0.4414396
## ACF1
## Training set 0.005494605
checkresiduals(emodel)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,0,0)[12] with non-zero mean
## Q* = 9.0463, df = 20, p-value = 0.9824
##
## Model df: 4. Total lags used: 24
The residuals don’t look perfectly like white noise: there are three spikes toward the end of the ACF plot, but the model passes the Ljung-Box test (p = 0.98). There might be additional seasonal structure left in the series; we will use this model anyway.
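One quick way to look for leftover seasonal structure (a sketch; an optional diagnostic) is an STL decomposition of the original series, after interpolating any missing values:
#a structured remainder panel would hint at patterns the ARIMA model is missing
tse %>% na.interp() %>% mstl() %>% autoplot()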
eforecast <- forecast(emodel, 12)
autoplot(eforecast)
Again, I’m getting negative values in the forecast. The reason is the same as in Part A: the model was fitted to the Box-Cox transformed, lag-12 differenced series, so these are forecasts of year-over-year changes rather than KWH levels.
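A level-scale alternative (a sketch, not rerun for this report) is to fit directly on the original KWH series and let auto.arima() handle the seasonal differencing, so the forecasts come back in kilowatt hours:
#fit on the original series; auto.arima() applies any needed differencing internally
#and forecast() reverses it, returning forecasts in KWH
efitlevel <- auto.arima(tse)
eforecastlevel <- forecast(efitlevel, 12)
autoplot(eforecastlevel)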
#export the 2014 monthly point forecasts (the $mean component, not the historical $x series)
write.csv(eforecast$mean, "Part2_Forecast.csv")