In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
#laods as list
atm <- read_excel('ATM624Data.xlsx', col_names=TRUE, col_types=c('date', 'text', 'numeric'))
As you can see, the ‘ATM’ column identifies the ATM machine: ATM1, ATM2, ATM3, ATM4. ‘Cash’ is expressed in hundreds of dollars.
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA "ATM3" "ATM4"
kable(head(atm))
DATE | ATM | Cash |
---|---|---|
2009-05-01 | ATM1 | 96 |
2009-05-01 | ATM2 | 107 |
2009-05-02 | ATM1 | 82 |
2009-05-02 | ATM2 | 89 |
2009-05-03 | ATM1 | 85 |
2009-05-03 | ATM2 | 90 |
The ‘DATE’ column does not have any missing values.
any(is.na(atm$DATE))
## [1] FALSE
The ‘ATM’ column has some missing values. It would be difficult to determine which ATM machine to assign rows with missing values. The approach I am going to take is to drop rows with missing ATM values.
any(is.na(atm$ATM))
## [1] TRUE
Upon further investigation, rows with missing ‘ATM’ values are not significant because these are blank rows for the month of May 2010, the month that we want to forecast. We can drop these.
kable(atm[is.na(atm$ATM),])
DATE | ATM | Cash |
---|---|---|
2010-05-01 | NA | NA |
2010-05-02 | NA | NA |
2010-05-03 | NA | NA |
2010-05-04 | NA | NA |
2010-05-05 | NA | NA |
2010-05-06 | NA | NA |
2010-05-07 | NA | NA |
2010-05-08 | NA | NA |
2010-05-09 | NA | NA |
2010-05-10 | NA | NA |
2010-05-11 | NA | NA |
2010-05-12 | NA | NA |
2010-05-13 | NA | NA |
2010-05-14 | NA | NA |
ATM
valuesThe code below drops rows the rows in the table above with missing ‘ATM’ values. These are not relevant as these are blank rows for the month of May 2010.
atm <- atm[!is.na(atm$ATM),]
Are there any rows with missing ‘Cash’ values? The answer is YES.
any(is.na(atm$Cash))
## [1] TRUE
We have 5 rows with missing Cash values. They all happen to be for the month of June 2009. We can fill these missing values with the average withdrawals of the previous and next day. For example, for missing 6/13/2009 cash for ATM1, take the average of the day before and the day after. This average value will be used for the missing value.
kable(atm[is.na(atm$Cash),])
DATE | ATM | Cash |
---|---|---|
2009-06-13 | ATM1 | NA |
2009-06-16 | ATM1 | NA |
2009-06-18 | ATM2 | NA |
2009-06-22 | ATM1 | NA |
2009-06-24 | ATM2 | NA |
Before assigning values to missing Cash
, we’re going to sort the data by ATM
first and then by DATE
.
atm <- atm[with(atm, order(ATM, DATE)),]
#Convert atm$DATE to 'Date' class. Originally, it is "POSIXct" "POSIXt" class.
atm$DATE = as.Date(atm$DATE)
#assign missing values for ATM1 with following dates by taking average of previous and next day
atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-13"), ]$Cash <-
(atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-12"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-14"), ]$Cash)/2
atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-16"), ]$Cash <-
(atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-15"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-17"), ]$Cash)/2
atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-22"), ]$Cash <-
(atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-21"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-23"), ]$Cash)/2
#assign missing values for ATM2 with following dates by taking average of pervious and next day
atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-18"), ]$Cash <-
(atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-17"), ]$Cash + atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-19"), ]$Cash)/2
atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-24"), ]$Cash <-
(atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-23"), ]$Cash + atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-25"), ]$Cash)/2
DATE | ATM | Cash |
---|---|---|
2009-06-13 | ATM1 | 131.0 |
2009-06-16 | ATM1 | 107.0 |
2009-06-22 | ATM1 | 111.5 |
2009-06-18 | ATM2 | 79.0 |
2009-06-24 | ATM2 | 51.0 |
Create lists for each ATM machine.
atm1 <- atm[atm$ATM == 'ATM1',]
atm2 <- atm[atm$ATM == 'ATM2',]
atm3 <- atm[atm$ATM == 'ATM3',]
atm4 <- atm[atm$ATM == 'ATM4',]
Create time series objects for each ATM.
atm1_ts <- ts(atm1["Cash"], start=c(2009, 05, 01), frequency=7)
atm2_ts <- ts(atm2["Cash"], start=c(2009, 05, 01), frequency=7)
atm3_ts <- ts(atm3["Cash"], start=c(2009, 05, 01), frequency=7)
atm4_ts <- ts(atm4["Cash"], start=c(2009, 05, 01), frequency=7)
I don’t see any obvious trend in the data. ACF and PACF shows significant correlation at lag 7.
tsdisplay(atm1_ts)
The KPSS test for has p-value of 0.07863. We reject the null hypothesis. ATM 1 series is not stationary.
kpss.test(atm1_ts, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: atm1_ts
## KPSS Trend = 0.13054, Truncation lag parameter = 5, p-value =
## 0.07863
(atm1_lambda = BoxCox.lambda(atm1_ts))
## [1] 0.2615708
Below is plot with Box-Cox transformation and differencing of 1. There is correlation spike at lag 7, 14, 21 for ACF.
atm1_ts %>% BoxCox(atm1_lambda) %>% diff(1) %>% tsdisplay()
auto.arima
): ARIMA(1,1,0)(0,1,1)[7], AICc=1384.94atm1_model1 <- auto.arima(atm1_ts, d=1, max.p=10, max.q=10, lambda = atm1_lambda)
summary(atm1_model1)
## Series: atm1_ts
## ARIMA(1,1,0)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2615708
##
## Coefficients:
## ar1 sma1
## -0.3908 -0.5994
## s.e. 0.0486 0.0452
##
## sigma^2 estimated as 2.776: log likelihood=-689.43
## AIC=1384.87 AICc=1384.94 BIC=1396.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3365687 34.89839 21.47221 -106.0333 128.8101 1.213234
## ACF1
## Training set -0.1152877
After a few trial and errors, ARIMA(7,1,8)(0,1,1)[7] appears to have a lower AICc.
atm1_model2 <- Arima(atm1_ts, order=c(7,1,8), seasonal=c(0,1,1), lambda = atm1_lambda)
summary(atm1_model2)
## Series: atm1_ts
## ARIMA(7,1,8)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2615708
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -0.2389 0.3859 -0.2668 0.1050 -0.1147 0.1169 0.7039 -0.6283
## s.e. 0.1642 0.1825 0.1572 0.1673 0.1686 0.1440 0.1519 0.1735
## ma2 ma3 ma4 ma5 ma6 ma7 ma8 sma1
## -0.8359 0.7028 -0.2262 -0.0460 -0.1756 -0.3951 0.6043 -0.7315
## s.e. 0.1607 0.2708 0.2370 0.2722 0.2455 0.1629 0.1625 0.0624
##
## sigma^2 estimated as 1.728: log likelihood=-604.04
## AIC=1242.08 AICc=1243.89 BIC=1308
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.4596644 24.11322 15.66045 -78.80768 95.82487 0.8848551
## ACF1
## Training set 0.04340309
Residuals appear to be approximately normal. ACF of residuals are within the limits of threshold. The Ljung-Box test has a p-value of 0.5562. Accept null hypothesis that residuals is white noise.
checkresiduals(atm1_model2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,1,8)(0,1,1)[7]
## Q* = 2.0787, df = 3, p-value = 0.5562
##
## Model df: 16. Total lags used: 19
autoplot(forecast(atm1_model2, h=31))
I don’t see any obvious trend in the data. ACF and PACF shows multiple correlation spikes.
tsdisplay(atm2_ts)
The KPSS test for has p-value of 0.06702. We reject the null hypothesis. ATM 2 series is not stationary.
kpss.test(atm2_ts, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: atm2_ts
## KPSS Trend = 0.13681, Truncation lag parameter = 5, p-value =
## 0.06702
(atm2_lambda = BoxCox.lambda(atm2_ts))
## [1] 0.7242585
Below is plot with Box-Cox transformation and differencing of 1.
atm2_ts %>% BoxCox(atm2_lambda) %>% diff(1) %>% tsdisplay()
auto.arima
): ARIMA(1,1,0)(0,1,1)[7], AICc=2707.4atm2_model1 <- auto.arima(atm2_ts, d=1, max.p=10, max.q=10, lambda = atm2_lambda)
summary(atm2_model1)
## Series: atm2_ts
## ARIMA(1,1,0)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7242585
##
## Coefficients:
## ar1 sma1
## -0.4391 -0.5305
## s.e. 0.0479 0.0469
##
## sigma^2 estimated as 113: log likelihood=-1350.66
## AIC=2707.33 AICc=2707.4 BIC=2718.96
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2513631 32.42686 21.68245 NaN Inf 1.043042 -0.1725429
After a few trial and errors, ARIMA(7,1,5)(0,1,1)[7] appears to have a lower AICc.
atm2_model2 <- Arima(atm2_ts, order=c(7,1,5), seasonal=c(0,1,1), lambda = atm2_lambda)
summary(atm2_model2)
## Series: atm2_ts
## ARIMA(7,1,5)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7242585
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7
## -1.0555 -1.6784 -0.6883 -0.5785 -0.2066 -0.1997 -0.2077
## s.e. 0.1516 0.2750 0.2892 0.2432 0.1282 0.0966 0.0696
## ma1 ma2 ma3 ma4 ma5 sma1
## 0.1325 0.5394 -1.0202 -0.1707 -0.4809 -0.7574
## s.e. 0.1523 0.1917 0.0309 0.1448 0.1967 0.0457
##
## sigma^2 estimated as 65.79: log likelihood=-1256.19
## AIC=2540.39 AICc=2541.61 BIC=2594.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.6039325 23.84162 16.68036 -Inf Inf 0.8024147 -0.01275306
Residuals appear to be approximately normal. ACF of residuals are within the limits of threshold. The Ljung-Box test has a p-value of 0.3098. Accept null hypothesis that residuals is white noise.
checkresiduals(atm2_model2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,1,5)(0,1,1)[7]
## Q* = 3.5855, df = 3, p-value = 0.3098
##
## Model df: 13. Total lags used: 16
autoplot(forecast(atm2_model2, h=31))
It appears that ATM 3 just recently became operational. We don’t have much data to use for forecasting.
tsdisplay(atm3_ts)
For this, a simple naive forecast is done where the forecast is the value of the last observation.
plot(naive(atm3_ts, h=31))
ATM 4 series appears to have an outlier. I’m going to use the tsoutliers
function to get more insight on this.
tsdisplay(atm4_ts)
Below is replacement recommendation from the function tsoutliers
.
tsoutliers(atm4_ts)
## $index
## [1] 285
##
## $replacements
## [1] 190.9351
Replace outlier with recommended value.
atm4_ts[285] <- 190.9351
I don’t see any obvious trend. I see multiple spikes on ACF and PACF. They have very similar patterns.
tsdisplay(atm4_ts)
The test-statistic 0.1215 is less than the critical values. We accept the null hypothesis that the series is stationary. There is no need for differencing.
#https://stats.stackexchange.com/questions/160810/interpretation-of-critical-values-of-kpss-test
ur.kpss(atm4_ts) %>% summary
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.1223
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
(atm4_lambda = BoxCox.lambda(atm4_ts))
## [1] 0.4492823
Below is plot with Box-Cox transformation. ACF and PACF look very similar.
atm4_ts %>% BoxCox(atm4_lambda) %>% tsdisplay()
auto.arima
): ARIMA(0,0,1)(2,0,0)[7], AICc=2929.39atm4_model1 <- auto.arima(atm4_ts, max.p=10, max.q=10, lambda = atm4_lambda)
summary(atm4_model1)
## Series: atm4_ts
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4492823
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0795 0.2074 0.2030 28.5700
## s.e. 0.0527 0.0516 0.0524 1.2384
##
## sigma^2 estimated as 175.6: log likelihood=-1459.61
## AIC=2929.23 AICc=2929.39 BIC=2948.73
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 84.92431 352.1777 274.4794 -367.8509 412.3995 0.7923398
## ACF1
## Training set 0.01963427
This appears to have a lower AIC.
atm4_model2 <- Arima(atm4_ts, order=c(0,0,0), seasonal = c(0,1,1), lambda = atm4_lambda)
summary(atm4_model2)
## Series: atm4_ts
## ARIMA(0,0,0)(0,1,1)[7]
## Box Cox transformation: lambda= 0.4492823
##
## Coefficients:
## sma1
## -0.8698
## s.e. 0.0317
##
## sigma^2 estimated as 165.9: log likelihood=-1427.32
## AIC=2858.65 AICc=2858.68 BIC=2866.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 63.29268 339.3257 256.9349 -371.0955 414.7838 0.7416942
## ACF1
## Training set 0.09999483
Residuals appear to to be somewhat skewed but roughly normal in shapre. ACF of residuals are within the limits of threshold. The Ljung-Box test has a p-value of 0.06942. Residuals may not be white noise.
checkresiduals(atm4_model2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 21.181, df = 13, p-value = 0.06942
##
## Model df: 1. Total lags used: 14
autoplot(forecast(atm2_model2, h=31))
The forecasts below for the four ATM machines for the next 31 days (May 2010) seem reasonable.
grid.arrange(ncol=2, nrow=2,
forecast(atm1_model2, h=31) %>% autoplot,
forecast(atm2_model2, h=31) %>% autoplot,
naive(atm3_ts, h=31) %>% autoplot,
forecast(atm4_model2, h=31) %>% autoplot)
Dump CSV file of forecast.
write.csv(forecast(atm1_model2, h=31), "ATM1_forecast.csv")
write.csv(forecast(atm2_model2, h=31), "ATM2_forecast.csv")
write.csv(naive(atm3_ts, h=31), "ATM3_forecast.csv")
write.csv(forecast(atm4_model2, h=31), "ATM4_forecast.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.
Read data set file.
power_data <- read_excel('ResidentialCustomerForecastLoad-624.xlsx', col_names=TRUE)
Preview file.
kable(head(power_data))
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
summary(power_data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
power_ts <- ts(power_data[, "KWH"], start = c(1998, 1), frequency = 12)
The tsoutliers
function shows 2 outlier points in the series with recommended replacement values.
tsoutliers(power_ts)
## $index
## [1] 151 192
##
## $replacements
## [1] 7696306 7028762
Below are the current values at the indicated indices above (no replacement done yet)
(power_ts[151])
## [1] 770523
(power_ts[192])
## [1] 9606304
The red and blue horizontal lines below show the outlier values. The outlier indicated by red line does seem high but it doesn’t seem like it is too high relative to the surrounding values. The outlier indicated by blue line is way too low. I will replace this.
autoplot(power_ts) + geom_hline(yintercept=770523, color='blue') + geom_hline(yintercept=9606304, color='red')
Replacing outlier value indicated by blue line.
power_ts[151] <- 7696306
autoplot(power_ts)
There is a slight upward trend towards the right. Plot shows a somewhat stationary series.
tsdisplay(power_ts)
The KPSS test for has p-value of 0.02475. We reject the null hypothesis. Series is not stationary.
kpss.test(power_ts, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: power_ts
## KPSS Trend = 0.17667, Truncation lag parameter = 4, p-value =
## 0.02475
(power_lambda = BoxCox.lambda(power_ts))
## [1] -0.1977787
ndiffs
function recommends differencing by 1.
ndiffs(power_ts)
## [1] 1
Below is plot with Box-Cox transformation with differencing of 1.
The slight upward trend is no longer present.
power_ts %>% BoxCox(power_lambda) %>% diff(1) %>% tsdisplay()
auto.arima
): ARIMA(0,1,1)(2,1,0)[12] , AICc=-1429.06power_model1 <- auto.arima(power_ts, d=1, max.p=10, max.q=10, lambda = power_lambda)
summary(power_model1)
## Series: power_ts
## ARIMA(0,1,1)(2,1,0)[12]
## Box Cox transformation: lambda= -0.1977787
##
## Coefficients:
## ma1 sar1 sar2
## -0.9129 -0.7900 -0.3998
## s.e. 0.0502 0.0732 0.0759
##
## sigma^2 estimated as 1.911e-05: log likelihood=718.64
## AIC=-1429.29 AICc=-1429.06 BIC=-1416.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 22291.25 665989.7 475942.3 -0.1436716 7.108555 0.774068
## ACF1
## Training set 0.1625114
This appears to have a lower AIC after trying out different combinations.
power_model2 <- Arima(power_ts, order=c(3,1,2), seasonal = c(0,1,2), lambda = power_lambda)
summary(power_model2)
## Series: power_ts
## ARIMA(3,1,2)(0,1,2)[12]
## Box Cox transformation: lambda= -0.1977787
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1 sma2
## -0.2699 0.0670 0.1789 -0.4507 -0.4874 -0.8051 0.1559
## s.e. 0.2141 0.0953 0.0830 0.2117 0.1980 0.0847 0.0922
##
## sigma^2 estimated as 1.818e-05: log likelihood=725.53
## AIC=-1435.05 AICc=-1434.21 BIC=-1409.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 36494.63 645086.4 453332.2 0.05866348 6.787337 0.7372951
## ACF1
## Training set 0.06634145
Residuals appear to be approximately normal. ACF of residuals are within the limits of threshold. The Ljung-Box test has a p-value of p-value = 0.8325. Accept residuals as white noise.
checkresiduals(power_model2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,2)(0,1,2)[12]
## Q* = 11.445, df = 17, p-value = 0.8325
##
## Model df: 7. Total lags used: 24
The forecast for 2014 seem reasonable.
autoplot(forecast(power_model2, h=12))