library(tidyverse)
library(fpp2)
library(readxl)
library(lubridate)
library(urca)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
#reading in the excel doc
atmraw <- read_xlsx("ATM624Data.xlsx")
#changing date from numeric to date format
atmraw$DATE <- openxlsx::convertToDate(atmraw$DATE)
#adding year, month and day breakdown columns for later aggregation
#dropping rows with zero or missing Cash, reshaping to one column per ATM,
#and filling the remaining gaps with 0
atm <- atmraw %>%
  mutate(year = as.character(lubridate::year(DATE)),
         month = as.character(lubridate::month(DATE)),
         day = as.character(lubridate::day(DATE)),
         dayofyear = format(DATE, "%j"),
         yearmonth = format(DATE, "%Y-%m"),
         yearweek = format(DATE, "%Y-%W"),
         dayofweek = format(DATE, "%w"),
         dayname = weekdays(DATE)) %>%
  filter(Cash != 0) %>%
  spread(ATM, Cash) %>%
  replace_na(list(ATM1 = 0, ATM2 = 0, ATM3 = 0, ATM4 = 0))
#convert to a daily ts object starting 1 May 2009 (day 121 of the year)
tsatm <- ts(atm %>% select(ATM1, ATM2, ATM3, ATM4),
            start = c(2009, 121),
            frequency = 365)
#plot out time series objects
autoplot(tsatm, facets = TRUE)
ATM 1 and 2 look like they have weekly seasonality and possibly monthly seasonality. ATM 3 only has 3 dates with non-zero values. ATM4 looks like it has weekly seasonality with one large outlier.
atm %>%
  group_by(dayname) %>%
  summarize(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
## # A tibble: 7 x 5
## dayname `sum(ATM1)` `sum(ATM2)` `sum(ATM3)` `sum(ATM4)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 5228 4877 85 30397.
## 2 Monday 4386 3054 0 25027.
## 3 Saturday 4927 3951 0 25560.
## 4 Sunday 5338 3492 0 28032.
## 5 Thursday 1648 1302 82 8838.
## 6 Tuesday 4568 3809 0 33658.
## 7 Wednesday 4272 2231 96 21514.
It appears that Thursday is a low day for ATMs 1, 2, and 4. ATM3 only has usage on Wednesday, Thursday, and Friday. Let’s check for weekly seasonality.
atmw1 <- ts(tsatm[,1], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw1)
atmw2 <- ts(tsatm[,2], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw2)
atmw3 <- ts(tsatm[,3], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw3)
atmw4 <- ts(tsatm[,4], frequency = 7, start = c(2009,5,1))
ggsubseriesplot(atmw4)
There appears to be weekly seasonality with a number of high and low outliers. The subseries labels are also off by one day, because the ts() start does not line up with the actual weekday of the first observation: the low day for ATMs 1, 2, and 4 should be Thursday, the three active days for ATM3 are Wednesday/Thursday/Friday, and the outlier in ATM4 falls on a Tuesday.
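One way to line up the labels (a sketch, assuming the panels run Monday through Sunday; if they start on Sunday, use "%w" plus 1 instead) is to set the ts start from the weekday of the first observation:
#align the weekly cycle with the calendar: 1 May 2009 is a Friday, so it sits in
#position 5 of a Monday-first week ("%u" gives Monday = 1 ... Sunday = 7)
startdow <- as.numeric(format(as.Date("2009-05-01"), "%u"))
atmw1aligned <- ts(tsatm[, 1], frequency = 7, start = c(1, startdow))
ggsubseriesplot(atmw1aligned)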
#sum of weekly cash
atmweek <- atm %>%
  group_by(yearweek) %>%
  summarise(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
colnames(atmweek) <- c("yearweek", "ATM1", "ATM2", "ATM3", "ATM4")
tsweek <- ts(atmweek %>% select(-yearweek), frequency = 52, start = c(2009,17))
autoplot(tsweek, facets = TRUE)
ATM 1 and 2 have a dip around the weeks of Christmas and New Years at the end of 2009 and the outlier in ATM4 is in mid February.
atm %>%
  group_by(month) %>%
  summarize(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
## # A tibble: 12 x 5
## month `sum(ATM1)` `sum(ATM2)` `sum(ATM3)` `sum(ATM4)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 2677 1811 0 14903.
## 2 10 2247 1906 0 9514.
## 3 11 2289 1692 0 12908.
## 4 12 2422 1688 0 12966.
## 5 2 2531 1517 0 23084.
## 6 3 2433 1602 0 13315.
## 7 4 2300 1842 263 11858.
## 8 5 2480 2338 0 12622.
## 9 6 2412 2129 0 13744.
## 10 7 2621 2094 0 14477.
## 11 8 3129 2230 0 15903.
## 12 9 2826 1867 0 17733.
#sum of monthly cash by ATM
atmmonth <- atm %>%
  group_by(yearmonth) %>%
  summarise(sum(ATM1), sum(ATM2), sum(ATM3), sum(ATM4))
colnames(atmmonth) <- c("yearmonth", "ATM1", "ATM2", "ATM3", "ATM4")
#start the monthly series in May 2009 so monthdays() matches the correct calendar months
tsmonth <- ts(atmmonth %>% select(-yearmonth), frequency = 12, start = c(2009, 5))
#average daily cash per month: divide each monthly total by the number of days in that month
tsmonthday <- cbind(tsmonth/monthdays(tsmonth))
autoplot(tsmonthday, facets = TRUE)
There appear to be some monthly trends, but aggregating to monthly averages smooths away a lot of the detail.
Next we’re going to make the data stationary. Given the weekly pattern above, a seasonal difference at lag 7 should work for ATMs 1, 2, and 4.
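As a quick check on that claim (a sketch; results not shown here), nsdiffs() and ndiffs() from the forecast package estimate how many seasonal and ordinary differences a series needs:
#how many lag-7 (seasonal) differences does ATM1 need, and are any further
#ordinary differences needed after taking one seasonal difference?
nsdiffs(ts(tsatm[, 1], frequency = 7))
ndiffs(diff(ts(tsatm[, 1], frequency = 7), lag = 7))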
atm1 <- tsatm[,1] %>% diff(lag=7)
atm1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0203
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atm1)
ATM1 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test (the test statistic of 0.02 is well below every critical value). There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atmw1 <- ts(tsatm[,1], frequency = 7) %>% diff(lag=7)
atmw1 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0203
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw1)
ATM1 with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atm1model1 <- auto.arima(atm1)
atm1model2 <- auto.arima(atmw1)
summary(atm1model1)
## Series: atm1
## ARIMA(0,0,0) with zero mean
##
## sigma^2 estimated as 969.6: log likelihood=-1738.95
## AIC=3479.89 AICc=3479.9 BIC=3483.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.03631285 31.13868 19.23184 100 100 NaN 0.03135272
summary(atm1model2)
## Series: atmw1
## ARIMA(0,0,0)(0,0,1)[7] with zero mean
##
## Coefficients:
## sma1
## -0.6868
## s.e. 0.0429
##
## sigma^2 estimated as 683.9: log likelihood=-1678.2
## AIC=3360.39 AICc=3360.43 BIC=3368.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.1068561 26.11536 16.60482 NaN Inf 0.4838766 0.06496742
The model fitted to the frequency-7 series performed better, with a lower AICc (3360 vs. 3480) than the model fitted to the frequency-365 series. Let’s check the residuals now.
atm1model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 157.8, df = 72, p-value = 2.32e-08
##
## Model df: 0. Total lags used: 72
atm1model2 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(0,0,1)[7] with zero mean
## Q* = 22.748, df = 13, p-value = 0.0448
##
## Model df: 1. Total lags used: 14
Neither model passes the Ljung-Box test: both p-values are below 0.05, so we reject the null hypothesis of independently distributed residuals and conclude that both models show a lack of fit. Let’s try adding a Box-Cox transformation to see if that helps.
atm1l <- BoxCox.lambda(atm1)
atmw1l <- BoxCox.lambda(atmw1)
atm1model1a <- auto.arima(atm1, lambda = atm1l)
atm1model2a <- auto.arima(atmw1, lambda = atmw1l)
atm1model1a %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 157.8, df = 72, p-value = 2.32e-08
##
## Model df: 0. Total lags used: 72
atm1model2a %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,0,1)[7] with non-zero mean
## Q* = 18.257, df = 11, p-value = 0.0758
##
## Model df: 3. Total lags used: 14
This time, the frequency-7 ATM1 model passes the Ljung-Box test (p = 0.076) and its residuals appear to be white noise. This is the model we’ll use for forecasting ATM1.
Now we’ll use the same process for ATM2.
atmw2 <- ts(tsatm[,2], frequency = 7) %>% diff(lag=7)
atmw2 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0161
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw2)
ATM2 with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by lower spikes at lags 14 and 21 in the PACF.
atmw2l <- BoxCox.lambda(atmw2)
atm2model1 <- auto.arima(atmw2, lambda = atmw2l)
summary(atm2model1)
## Series: atmw2
## ARIMA(2,0,2)(1,0,2)[7] with non-zero mean
## Box Cox transformation: lambda= 0.9215252
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2 mean
## -0.4181 -0.9054 0.4511 0.7823 -0.6805 -0.1267 -0.4427 -1.4188
## s.e. 0.0577 0.0448 0.0887 0.0610 0.1990 0.2126 0.1676 0.2622
##
## sigma^2 estimated as 368.1: log likelihood=-1564.01
## AIC=3146.01 AICc=3146.53 BIC=3180.94
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.4263577 23.88372 16.84544 NaN Inf 0.4617531 -0.004845533
atm2model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(1,0,2)[7] with non-zero mean
## Q* = 5.7537, df = 6, p-value = 0.4513
##
## Model df: 8. Total lags used: 14
The frequency-7 ATM2 model has an AICc of about 3147 and passes the Ljung-Box test (p = 0.45), so its residuals are consistent with white noise. This is the model we’ll use for forecasting ATM2.
For ATM3, there’s not enough history to build a forecast; it has only three non-zero observations. We could carry the mean of those values forward, but I would tell the business that modelling this ATM is not worth their time. If a placeholder number is still required, a flat mean forecast is sketched below.
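This is a sketch only, not a fitted model: it simply repeats the average of ATM3’s three non-zero days.
#placeholder only: repeat the average of ATM3's three non-zero observations for 31 days
atm3nonzero <- atm$ATM3[atm$ATM3 > 0]
atm3f <- meanf(ts(atm3nonzero), h = 31)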
For ATM4, we’ll use the same method as for ATMs 1 and 2.
atmw4 <- ts(tsatm[,4], frequency = 7) %>% diff(lag=7)
atmw4 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.014
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw4)
The large outlier needs more explanation before deciding how to handle it: an annual event and a data-entry error would call for different treatment. For now, I’m going to remove it and replace it with the series median.
#replace the extreme observation (the series maximum) with the series median
tsatm[,4][which.max(tsatm[,4])] <- median(tsatm[,4])
atmw4 <- ts(tsatm[,4], frequency = 7) %>% diff(lag=7)
atmw4 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0154
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(atmw4)
Much better. The median-imputed ATM4 series with a frequency of 7 is stationary after a seasonal difference at lag 7, confirmed by the KPSS unit root test. There are significant spikes at lag 7 in the ACF and PACF, followed by decreasing spikes at lags 14 and 21 in the PACF.
atmw4l <- BoxCox.lambda(atmw4)
atm4model1 <- auto.arima(atmw4, lambda = atmw4l)
summary(atm4model1)
## Series: atmw4
## ARIMA(2,0,2)(1,0,1)[7] with zero mean
## Box Cox transformation: lambda= 0.97073
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## 0.7126 -0.9357 -0.678 0.9792 -0.0443 -0.9073
## s.e. 0.0292 0.0372 0.020 0.0314 0.0625 0.0296
##
## sigma^2 estimated as 79994: log likelihood=-2532.55
## AIC=5079.11 AICc=5079.43 BIC=5106.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -9.684603 329.6829 256.8833 1399.553 1671.569 0.4236931
## ACF1
## Training set 0.05770441
atm4model1 %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(1,0,1)[7] with zero mean
## Q* = 10.387, df = 8, p-value = 0.2389
##
## Model df: 6. Total lags used: 14
The ATM4 model passes the Ljung-Box test (p = 0.24) and its residuals appear to be white noise. This is the model we’ll use for forecasting ATM4.
#31-day (May 2010) forecasts from the chosen models
atm1f <- forecast(atm1model2a, 31)
atm2f <- forecast(atm2model1, 31)
atm4f <- forecast(atm4model1, 31)
autoplot(atm1f)
autoplot(atm2f)
autoplot(atm4f)
Well, I’m no data scientist, but I have a funny feeling people aren’t depositing money into the ATMs, so these plots need a caveat: the models were fitted to the lag-7 differenced series, so the forecasts are week-over-week changes in cash rather than cash levels, which is why they dip below zero.
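One way to get forecasts in dollar levels (a sketch, not rerun here) is to fit on the undifferenced frequency-7 series and let auto.arima() choose the seasonal difference internally, so forecast() returns values on the original cash scale. Shown for ATM1; the same idea applies to ATM2 and ATM4.
#fit on the level series: auto.arima() applies the lag-7 difference itself and
#forecast() undoes it, so the output is cash in hundreds of dollars, not weekly changes
atm1level <- ts(tsatm[, 1], frequency = 7)
atm1levelfit <- auto.arima(atm1level)
atm1levelf <- forecast(atm1levelfit, 31)
autoplot(atm1levelf)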
#export the May 2010 point forecasts (the $mean component, not the historical $x series) to an Excel-readable CSV
cbind(ATM1 = atm1f$mean, ATM2 = atm2f$mean, ATM4 = atm4f$mean) %>% write.csv("Part1_Forecast.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
eraw <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
tse <- ts(eraw$KWH, start = c(1998,1), frequency = 12)
autoplot(tse)
There is seasonality and trend. There is an outlier just after 2010.
ggsubseriesplot(tse)
## Warning: Removed 16 rows containing missing values (geom_path).
There is increased energy usage in the summer and winter months.
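The outlier noted above, and the missing values behind the warning, could be cleaned before modelling; forecast::tsclean() is one option (a sketch, not applied in the analysis below):
#replace missing values and any flagged outliers via interpolation; shown only as an alternative
tseclean <- tsclean(tse)
autoplot(cbind(original = tse, cleaned = tseclean), facets = TRUE)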
Let’s apply a Box-Cox transformation, take a seasonal difference at lag 12, and see what we get.
tsebox <- BoxCox.lambda(tse)
tsefit <- tse %>% BoxCox(tsebox) %>% diff(lag=12)
tsefit %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0696
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(tsefit)
## Warning: Removed 2 rows containing missing values (geom_point).
We see a number of significant spikes at lags 1, 4, and 8, but the main peak in the ACF is at lag 12. We see a similar pattern in the PACF, with an additional peak at 24 months.
emodel <- auto.arima(tsefit)
summary(emodel)
## Series: tsefit
## ARIMA(0,0,1)(2,0,0)[12] with non-zero mean
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0724 -0.7045 -0.3585 0.0911
## s.e. 0.0723 0.0681 0.0660 0.0492
##
## sigma^2 estimated as 1.471: log likelihood=-288.68
## AIC=587.37 AICc=587.71 BIC=603.33
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.009863523 1.198977 0.5552135 -2432.114 2642.291 0.4414396
## ACF1
## Training set 0.005494605
checkresiduals(emodel)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,0,0)[12] with non-zero mean
## Q* = 9.0463, df = 20, p-value = 0.9824
##
## Model df: 4. Total lags used: 24
The residuals don’t look perfectly like white noise: there are three spikes toward the end of the ACF plot, but the model passes the Ljung-Box test (p = 0.98). There might be additional seasonal structure left in the series; we will use this model anyway.
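One quick way to look for leftover seasonal structure (a sketch; an optional diagnostic) is an STL decomposition of the original series, after interpolating any missing values:
#a structured remainder panel would hint at patterns the ARIMA model is missing
tse %>% na.interp() %>% mstl() %>% autoplot()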
eforecast <- forecast(emodel, 12)
autoplot(eforecast)
Again, I’m getting negative values in the forecast. The reason is the same as in Part A: the model was fitted to the Box-Cox transformed, lag-12 differenced series, so these are forecasts of year-over-year changes rather than KWH levels.
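A level-scale alternative (a sketch, not rerun for this report) is to fit directly on the original KWH series and let auto.arima() handle the seasonal differencing, so the forecasts come back in kilowatt hours:
#fit on the original series; auto.arima() applies any needed differencing internally
#and forecast() reverses it, returning forecasts in KWH
efitlevel <- auto.arima(tse)
eforecastlevel <- forecast(efitlevel, 12)
autoplot(eforecastlevel)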
#export the 2014 monthly point forecasts (the $mean component, not the historical $x series)
write.csv(eforecast$mean, "Part2_Forecast.csv")