library(fpp2)
library(tidyverse)
library(readxl)
library(e1071)
library(caret)
library(car)
library(kableExtra)
library(GGally)
library(gridExtra)
library(mice)
Forecast how much cash is taken out of four different ATM machines for May 2010. The data are given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that, the data are straightforward. Explain and demonstrate the processes and techniques used (and not used), and present the actual forecast.
Preview
atm_data <- read_xlsx("ATM624Data.xlsx")
#DATE is read in as an Excel serial number, so convert it using Excel's 1899-12-30 origin
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
kable(head(atm_data,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE | ATM | Cash |
---|---|---|
2009-05-01 | ATM1 | 96 |
2009-05-01 | ATM2 | 107 |
2009-05-02 | ATM1 | 82 |
2009-05-02 | ATM2 | 89 |
2009-05-03 | ATM1 | 85 |
2009-05-03 | ATM2 | 90 |
2009-05-04 | ATM1 | 90 |
2009-05-04 | ATM2 | 55 |
2009-05-05 | ATM1 | 99 |
2009-05-05 | ATM2 | 79 |
2009-05-06 | ATM1 | 88 |
2009-05-06 | ATM2 | 19 |
2009-05-07 | ATM1 | 8 |
2009-05-07 | ATM2 | 2 |
2009-05-08 | ATM1 | 104 |
2009-05-08 | ATM2 | 103 |
2009-05-09 | ATM1 | 87 |
2009-05-09 | ATM2 | 107 |
2009-05-10 | ATM1 | 93 |
2009-05-10 | ATM2 | 118 |
2009-05-11 | ATM1 | 86 |
2009-05-11 | ATM2 | 75 |
2009-05-12 | ATM1 | 111 |
2009-05-12 | ATM2 | 111 |
2009-05-13 | ATM1 | 75 |
2009-05-13 | ATM2 | 25 |
2009-05-14 | ATM1 | 6 |
2009-05-14 | ATM2 | 16 |
2009-05-15 | ATM1 | 102 |
2009-05-15 | ATM2 | 137 |
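The dimensions printed below come from a call that is not echoed, presumably along the lines of dim(); the same applies to the ## [1] 365 5 line further down.
dim(atm_data) #assumed call: rows and columns of the dataset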
## [1] 1474 3
The dataset contains 1474 observations of 3 variables.
After removing missing values and reshaping the dataset, there are 365 observations of 5 variables: the date plus one cash column per ATM.
atm_data <- atm_data[complete.cases(atm_data),] %>% spread(ATM, Cash) #drop incomplete rows, then give each ATM its own column
#missing values reappear after the spread, so we impute them with predictive mean matching (pmm)
temp <- mice(atm_data, m=5, maxit=5, meth='pmm', seed=500, printFlag = F)
atm_data <- complete(temp)
kable(head(atm_data, 30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE | ATM1 | ATM2 | ATM3 | ATM4 |
---|---|---|---|---|
2009-05-01 | 96 | 107 | 0 | 776.993423 |
2009-05-02 | 82 | 89 | 0 | 524.417959 |
2009-05-03 | 85 | 90 | 0 | 792.811362 |
2009-05-04 | 90 | 55 | 0 | 908.238457 |
2009-05-05 | 99 | 79 | 0 | 52.832103 |
2009-05-06 | 88 | 19 | 0 | 52.208454 |
2009-05-07 | 8 | 2 | 0 | 55.473609 |
2009-05-08 | 104 | 103 | 0 | 558.503251 |
2009-05-09 | 87 | 107 | 0 | 904.341359 |
2009-05-10 | 93 | 118 | 0 | 879.493588 |
2009-05-11 | 86 | 75 | 0 | 274.022340 |
2009-05-12 | 111 | 111 | 0 | 396.108347 |
2009-05-13 | 75 | 25 | 0 | 274.547188 |
2009-05-14 | 6 | 16 | 0 | 16.321159 |
2009-05-15 | 102 | 137 | 0 | 852.307037 |
2009-05-16 | 73 | 95 | 0 | 379.561703 |
2009-05-17 | 92 | 103 | 0 | 31.284953 |
2009-05-18 | 82 | 80 | 0 | 491.850577 |
2009-05-19 | 86 | 118 | 0 | 83.705480 |
2009-05-20 | 73 | 30 | 0 | 128.653781 |
2009-05-21 | 20 | 7 | 0 | 14.357590 |
2009-05-22 | 100 | 118 | 0 | 815.358321 |
2009-05-23 | 93 | 104 | 0 | 758.218587 |
2009-05-24 | 90 | 59 | 0 | 601.421108 |
2009-05-25 | 94 | 40 | 0 | 906.796873 |
2009-05-26 | 98 | 106 | 0 | 502.907894 |
2009-05-27 | 73 | 18 | 0 | 88.273138 |
2009-05-28 | 10 | 9 | 0 | 35.438336 |
2009-05-29 | 97 | 136 | 0 | 338.459425 |
2009-05-30 | 102 | 118 | 0 | 4.547996 |
## [1] 365 5
We will now convert the data into a time series. Looking at the date column in the table above, we can see that the data form a daily series, so we will use a frequency of 365.25.
atm_ts <- ts(atm_data[, c(2:5)], start = c(2009, 121), frequency = 365.25) #May 1 is the 121st day of 2009, hence start = c(2009, 121) rather than c(2009, 5)
autoplot(atm_ts, facets = T) + ggtitle("Daily Withdrawals from ATMs") + labs(x = "Year", y = "Cash")
ATM1 and ATM2 have similar levels of activity, although at some points ATM2 shows a slow decline relative to ATM1.
ATM3 has no activity until 2010, specifically the last three days of April. Possible external explanations include the machine being out of service, its location, or a recent installation. As a result, this series displays no seasonality or trend.
ATM4 appears stationary. At one point more than $900,000 was withdrawn from that particular ATM, but afterwards the series returns to its regular peaks and troughs.
Before we move forward, let's split the data into training and test sets. According to Hyndman, when a time series spans more than a year it may be necessary to allow for annual seasonality as well as weekly seasonality. In this case we will split the data and use a frequency of 7, just to capture the weekly seasonality.
ts_split <- floor(0.8*nrow(atm_ts)) #index marking the first 80% of the data
atm_ts.train <- ts(atm_ts[1:ts_split,], frequency = 7, start=1) #assign the first 80% of the data to the training set
atm_ts.test <- ts(atm_ts[(ts_split + 1):nrow(atm_ts),], frequency = 7, start=43) #assign the most recent 20% to the test set
In order to see the relationships between these four time series, we can plot each one against the others. Arranging these plots in a scatterplot matrix lets us see the relationships among the four ATMs at a glance.
ggpairs(as.data.frame(atm_ts.train[,1:4]), progress = F, lower = list(continuous = wrap("smooth", size=0.1)))
We can see a strong positive relationship between ATM1 and ATM2, with a correlation of 0.725. Both ATM1 and ATM2 have bimodal distributions. Outliers are present, especially for ATM4, which has the largest single cash withdrawal. ATM3 shows no relationship with the other ATMs. ATM4 has positive but rather weak relationships with ATM1 and ATM2, and its distribution is highly right-skewed, which confirms the presence of outliers in the data.
Autocorrelation Plots
g1 <- ggAcf(atm_ts.train[,1])
g2 <- ggAcf(atm_ts.train[,2])
g3 <- ggAcf(atm_ts.train[,3])
g4 <- ggAcf(atm_ts.train[,4])
grid.arrange(g1, g2, g3, g4, nrow=2)
The ACF plots show that ATM1 and ATM2 are not white noise, but ATM4 appears to be. A white-noise series is essentially unforecastable beyond its mean, because its values at different times are statistically independent. Only 3 of ATM3's values are non-zero, which is an issue for forecasting; there are methods such as croston() that handle time series with many zero elements, but that is beyond the scope of this project (for reference, a sketch appears below). Going forward we will focus only on ATM1, ATM2 and ATM4.
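For completeness, an intermittent-demand forecast with croston() from the forecast package would look something like the sketch below; it is illustrative only and is not used in the rest of this project:
atm3_full <- ts(atm_data$ATM3, frequency = 7) #the full, mostly-zero ATM3 series
atm3.croston <- croston(atm3_full, h = 30) #Croston's method for intermittent demand
autoplot(atm3.croston)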
A slight, slowly decreasing trend is visible in the ATM1 series.
atm1.ets.fit <- ets(atm_ts.train[,1], lambda = BoxCox.lambda(atm_ts.train[,1])) #transform data
autoplot(forecast(atm1.ets.fit, h = 30))+
xlab("Year") + ylab("Cash")
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 6.0637, df = 5, p-value = 0.3001
##
## Model df: 9. Total lags used: 14
A portmanteau test returns a large p-value (0.30), also suggesting that the residuals are white noise.
## [1] 0
## [1] 1
The series appears to be stationary, so no non-seasonal differencing is required. Looking at the ACF plots above, you can see large spikes at lags 7, 14 and 21, which suggests a seasonal ARIMA model. The results also show that seasonal differencing is required.
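The two values printed above ([1] 0 and [1] 1) match the output of ndiffs() and nsdiffs() from the forecast package; the chunk is not echoed, so the following is an assumed reconstruction:
ndiffs(atm_ts.train[,1]) #0: no non-seasonal differencing needed
nsdiffs(atm_ts.train[,1]) #1: one seasonal difference needed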
atm1bc <- BoxCox.lambda(atm_ts.train[,1])
atm_ts.train[,1] %>% BoxCox(lambda = atm1bc) %>% diff(lag=7) %>% ggtsdisplay()
ATM 1 was transformed and seasonally differenced at lag 7.
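The ARIMA fitting code is not echoed. The diagnostics below are consistent with an automatic fit via auto.arima() on the Box-Cox scale, checked with checkresiduals(); the object name atm1.arima.fit is an assumption:
atm1.arima.fit <- auto.arima(atm_ts.train[,1], lambda = atm1bc) #assumed fitting call
checkresiduals(atm1.arima.fit) #produces the Ljung-Box output below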
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 5.2559, df = 11, p-value = 0.9181
##
## Model df: 3. Total lags used: 14
The ACF plot of the residuals from the ARIMA model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise. A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.
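The two accuracy tables that follow compare the models on the held-out data. The calls are not echoed; something like the sketch below would produce them, with the ETS table presumably printed first and the ARIMA table second:
h <- nrow(atm_ts.test) #forecast horizon equal to the test-set length
accuracy(forecast(atm1.ets.fit, h = h), atm_ts.test[,1]) #first table (ETS)
accuracy(forecast(atm1.arima.fit, h = h), atm_ts.test[,1]) #second table (ARIMA)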
## ME RMSE MAE MPE MAPE MASE
## Training set 2.287547 21.20777 13.99114 -67.16378 83.40243 0.7792603
## Test set 21.494709 33.67780 25.24902 -87.95632 123.51223 1.4062870
## ACF1 Theil's U
## Training set 0.1741151 NA
## Test set 0.1370690 0.1913796
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1429377 20.86448 13.51170 -70.52406 87.24602 0.7525573
## Test set 22.2468741 32.33712 25.16351 -82.09762 119.47749 1.4015246
## ACF1 Theil's U
## Training set 0.06567733 NA
## Test set 0.13397094 0.09647175
Overall, the ARIMA model performs best, with the lowest RMSE on the test set (32.34 versus 33.68 for ETS).
We’ll follow the same steps from ATM 1 for ATM 2.
The same decreasing trend is present here, but it is more visible.
atm2.ets.fit <- ets(atm_ts.train[,2], lambda = BoxCox.lambda(atm_ts.train[,2])) #transform data
autoplot(forecast(atm2.ets.fit, h = 30), PI = T)+
xlab("Year") + ylab("Cash") +
guides(colour=guide_legend(title="Forecast"))
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 6.1281, df = 5, p-value = 0.294
##
## Model df: 9. Total lags used: 14
## [1] 1
## [1] 1
The results are somewhat similar to ATM1's, with seasonal spikes at lags 7, 14 and 21. The only exception is that we may need to take both a first difference and a seasonal difference to obtain stationary data. In the pipeline below, the first diff() takes the first difference and the second diff(lag=7) takes the seasonal difference.
atm2bc <- BoxCox.lambda(atm_ts.train[,2])
atm_ts.train[,2] %>% BoxCox(lambda = atm2bc) %>% diff() %>% diff(lag=7) %>% ggtsdisplay()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,1)[7] with drift
## Q* = 9.069, df = 11, p-value = 0.6155
##
## Model df: 3. Total lags used: 14
The residuals behave like white noise here as well.
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1622341 21.67710 14.87736 -Inf Inf 0.7186524
## Test set 9.8171019 23.69851 20.18985 -21.51108 53.65024 0.9752726
## ACF1 Theil's U
## Training set 0.01588618 NA
## Test set -0.36931178 0.1984766
## ME RMSE MAE MPE MAPE MASE
## Training set 0.6377714 21.40199 14.57929 -Inf Inf 0.7042540
## Test set 9.9489607 21.20234 17.68974 -3.252306 55.52204 0.8545045
## ACF1 Theil's U
## Training set 0.001405669 NA
## Test set -0.389767168 0.1056466
ARIMA does better here as well, with a test-set RMSE of 21.20 versus 23.70 for ETS.
No trend is visible in the ATM4 series.
atm4.ets.fit <- ets(atm_ts.train[,4], lambda = BoxCox.lambda(atm_ts.train[,4])) #transform data
autoplot(forecast(atm4.ets.fit, h = 30)) +
xlab("Year") + ylab("Cash") +
guides(colour=guide_legend(title="Forecast"))
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 26.306, df = 5, p-value = 7.782e-05
##
## Model df: 9. Total lags used: 14
## [1] 0
## [1] 0
Note that, unlike ATM1 and ATM2, the portmanteau test above returns a very small p-value (7.8e-05), suggesting the ETS residuals are not white noise. The differencing tests both return 0: no seasonal or first differencing is required, and the series is stationary.
atm4bc <- BoxCox.lambda(atm_ts.train[,4])
atm_ts.train[,4] %>% BoxCox(lambda = atm4bc) %>% ggtsdisplay()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.2431, df = 13, p-value = 0.9883
##
## Model df: 1. Total lags used: 14
## ME RMSE MAE MPE MAPE MASE
## Training set 215.4884 725.234 316.5507 -174.64822 241.8772 0.7623691
## Test set 195.5207 440.006 346.1958 -48.72414 104.9477 0.8337653
## ACF1 Theil's U
## Training set -0.02904169 NA
## Test set -0.12234133 0.6723007
## ME RMSE MAE MPE MAPE MASE
## Training set -1.476213e-10 709.3400 339.3748 -584.4829 614.4673 0.8173381
## Test set 7.907664e+01 413.2063 350.1456 -435.6746 477.3959 0.8432779
## ACF1 Theil's U
## Training set -0.01204745 NA
## Test set -0.01326018 0.9031432
The error scores from ATM4's ARIMA forecast are lower, which indicates that this model performed better on these data. Recall, however, that we flagged ATM4 as white noise earlier: the selected model, ARIMA(0,0,0) with a non-zero mean, simply forecasts the series mean, so the predictions are all essentially the same value. Note also how much higher the scores are compared to the other two ATMs.
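The actual deliverable is the May 2010 forecast for each usable ATM. That final step is not shown above; a minimal sketch, refitting on the full series (the object names are assumptions):
atm1_full <- ts(atm_data$ATM1, frequency = 7) #full daily series with weekly seasonality
atm1.fc <- forecast(auto.arima(atm1_full, lambda = BoxCox.lambda(atm1_full)), h = 31) #May 2010 has 31 days
autoplot(atm1.fc) #repeat analogously for ATM2 and ATM4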
The Residential Power Usage dataset consists of residential power usage from January 1998 through December 2013. The objective is to model these data and produce a monthly forecast for 2014. The data are given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward.
power <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
power <- power[,-1] #drop the first column, an index that is not needed for modeling
power <- power[complete.cases(power),] #remove rows with missing KWH values
kable(head(power,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
YYYY-MMM | KWH |
---|---|
1998-Jan | 6862583 |
1998-Feb | 5838198 |
1998-Mar | 5420658 |
1998-Apr | 5010364 |
1998-May | 4665377 |
1998-Jun | 6467147 |
1998-Jul | 8914755 |
1998-Aug | 8607428 |
1998-Sep | 6989888 |
1998-Oct | 6345620 |
1998-Nov | 4640410 |
1998-Dec | 4693479 |
1999-Jan | 7183759 |
1999-Feb | 5759262 |
1999-Mar | 4847656 |
1999-Apr | 5306592 |
1999-May | 4426794 |
1999-Jun | 5500901 |
1999-Jul | 7444416 |
1999-Aug | 7564391 |
1999-Sep | 7899368 |
1999-Oct | 5358314 |
1999-Nov | 4436269 |
1999-Dec | 4419229 |
2000-Jan | 7068296 |
2000-Feb | 5876083 |
2000-Mar | 4807961 |
2000-Apr | 4873080 |
2000-May | 5050891 |
2000-Jun | 7092865 |
This dataset has 192 monthly observations of residential power usage spanning the years 1998 to 2013.
Convert to a time series. The data are monthly, so a frequency of 12 will be used.
Split data into train and test sets.
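Neither the conversion nor the split chunk is echoed; a plausible reconstruction follows, where the object names power_ts, power_ts.train and power_ts.test are inferred from the later chunks:
power_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12) #monthly series beginning January 1998
split <- floor(0.8*length(power_ts)) #first 80% for training
power_ts.train <- window(power_ts, end = time(power_ts)[split])
power_ts.test <- window(power_ts, start = time(power_ts)[split + 1])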
The data show both a trend and seasonality.
power.ets.fit <- ets(power_ts.train, lambda = BoxCox.lambda(power_ts.train)) #transform data
autoplot(forecast(power.ets.fit, h = 24)) + ggtitle("Forecasts") +
xlab("Year") + ylab("KWH")
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 6.6664, df = 10, p-value = 0.7565
##
## Model df: 14. Total lags used: 24
## [1] 1
## [1] 1
Both values are 1, indicating that the series is not stationary: it requires both first and seasonal differencing.
powerbc <- BoxCox.lambda(power_ts.train)
power_ts.train %>% BoxCox(lambda = powerbc) %>% diff() %>% diff(lag = 12) %>% ggtsdisplay()
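The model summary below is consistent with an auto.arima() fit whose call is not echoed; judging from the drift and sigma^2 values, the fit appears to be on the original KWH scale, but that is an inference. An assumed reconstruction:
power.arima.fit <- auto.arima(power_ts.train) #assumed fitting call
summary(power.arima.fit) #prints the model summary below
checkresiduals(power.arima.fit) #produces the subsequent Ljung-Box output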
## Series: power_ts.train
## ARIMA(0,0,1)(0,1,2)[12] with drift
##
## Coefficients:
## ma1 sma1 sma2 drift
## 0.2689 -0.7448 0.1803 7928.444
## s.e. 0.0792 0.0860 0.0968 3686.555
##
## sigma^2 estimated as 9.513e+11: log likelihood=-2556.65
## AIC=5123.31 AICc=5123.68 BIC=5138.93
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.84865 0.7630838
## ACF1
## Training set -0.003046509
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 14.263, df = 20, p-value = 0.8169
##
## Model df: 4. Total lags used: 24
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 263142.0 1077377 726801.1 -0.9845716 13.92056 0.9581577 0.2948704
## Test set 610895.5 1653352 1157694.1 6.0634506 14.50179 1.5262134 0.2311912
## Theil's U
## Training set NA
## Test set 1.060516
## ME RMSE MAE MPE MAPE MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.848655 0.7630838
## Test set 136077.29 874809.6 605124.8 1.277760 7.435522 0.7977492
## ACF1 Theil's U
## Training set -0.003046509 NA
## Test set 0.059044478 0.5089892
Again the ARIMA model performs better (test-set RMSE of 874,810 versus 1,653,352 for ETS), so we'll use this model to predict outcomes for 2014.
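A minimal sketch of that final step, refitting the chosen model on the full series and forecasting the twelve months of 2014 (object names are assumptions):
power.arima.final <- auto.arima(power_ts) #refit on all of the data
autoplot(forecast(power.arima.final, h = 12)) + xlab("Year") + ylab("KWH") #Jan-Dec 2014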