In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
We will read in the data and then check the missing values.
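The chunks below call functions from several packages. The package-to-function mapping listed here is my inference from the calls used in this report rather than something stated in the original, so treat it as a best guess:
# Packages assumed to be needed (my inference from the function calls below)
library(readxl)    # read_excel()
library(dplyr)     # %>%, filter(), mutate(), group_by(), summarise()
library(forecast)  # hw(), stlm(), auto.arima(), naive(), holt(), tsoutliers(), accuracy(), autoplot()
library(ggplot2)   # ggtitle(), plotting of autoplot()/autolayer() output
library(VIM)       # aggr(), kNN()
library(Amelia)    # missmap()
library(readr)     # write_csv()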
atmData<-read_excel("ATM624Data.xlsx")
summary(atmData)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
atmData$DATE<-as.Date(atmData$DATE,origin = "1899-12-30")
missmap(atmData)
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
aggr(atmData, bars=T, numbers=T, sortVars=T)
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## Cash 0.012890095
## ATM 0.009497965
## DATE 0.000000000
Checking the variables, DATE is numeric and needs to be converted to Date format. There are also some missing values in the dataset, most of them in the Cash column. Because we need to forecast how much cash is taken out of the four ATMs for May 2010, and May 2010 appears in the dataset with blank cash amounts, I first filter out May 2010 and then apply kNN imputation to fill the remaining missing values.
atmData<-atmData %>%
filter(DATE<"2010-05-01")
atmData<-atmData%>%
kNN()%>%
subset(select = DATE:Cash)
summary(atmData)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 1.0
## Median :2009-10-30 Mode :character Median : 74.0
## Mean :2009-10-30 Mean : 155.4
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
After imputation, DATE covers May 2009 through April 2010 and no missing values remain.
To forecast the cash withdrawals of each machine separately, we split the data by ATM and convert each series to a ts object with frequency 7, since the data are daily with a weekly pattern.
atm1Data<-atmData%>%
filter(ATM=="ATM1")
atm1<- ts(atm1Data[,"Cash"],frequency=7)
atm2Data<-atmData%>%
filter(ATM=="ATM2")
atm2<- ts(atm2Data[,"Cash"],frequency=7)
atm3Data<-atmData%>%
filter(ATM=="ATM3")
atm3<- ts(atm3Data[,"Cash"],frequency=7)
atm4Data<-atmData%>%
filter(ATM=="ATM4")
atm4<- ts(atm4Data[,"Cash"],frequency=7)
We plot the four ATMs' withdrawal series together.
autoplot(atm1, series = "ATM1") +
  autolayer(atm2, series = "ATM2") +
  autolayer(atm3, series = "ATM3") +
  autolayer(atm4, series = "ATM4")
Comparing the withdrawal amounts, ATM4 has the highest values and a noticeable outlier.
atm1 %>%
  decompose(type = "multiplicative") %>%
  autoplot() + ggtitle("Classical multiplicative decomposition of daily ATM1 withdrawals")
ggseasonplot(atm1)
ggsubseriesplot(atm1)
gglagplot(atm1)
ggAcf(atm1)
The decomposition of the daily ATM1 withdrawals indicates no upward or downward trend, but clear weekly seasonality. The seasonal plot shows which day has the highest and lowest withdrawals: Tuesday is the highest and Saturday the lowest. The ACF plot has spikes at lags 7, 14, and 21, and the lag plot shows relatively strong positive autocorrelation at the weekly (Sunday-to-Sunday) lag. We also detect some abnormal points/outliers in the later part of the series, from around week 42 onward.
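As a quick numerical check of the weekday pattern described above, we can average the withdrawals by day of the week. This is a small sketch added here, using the atm1Data frame created earlier; it was not part of the original output.
atm1Data %>%
  mutate(weekday = weekdays(DATE)) %>%   # day-of-week label for each observation
  group_by(weekday) %>%
  summarise(mean_cash = mean(Cash)) %>%  # average withdrawal per weekday
  arrange(desc(mean_cash))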
To evaluate the models, I split the series into a training set and a test set.
atm1.train <- window(atm1, end=c(41,1))
atm1.test <- window(atm1, start=c(42,1))
We will fit Holt-Winters, ETS, and auto ARIMA models to the training data. Because the test set contains 78 daily observations, we forecast 78 days of withdrawals and compare them with the actual amounts.
lambda1<-BoxCox.lambda(atm1.train)
# Holt-Winters' additive seasonal method
atm1Pre_holt<-hw(atm1.train,seasonal="additive",h=78)
#Forecasting with ETS models
atm1Pre_ets <- atm1.train%>%
stlm( s.window = 13, robust = TRUE, method = "ets" , lambda = lambda1,biasadj=TRUE) %>%
forecast( h = 78,lambda = lambda1)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj
## information not found, defaulting to FALSE.
#Auto ARIMA model
atm1Pre_arima<-auto.arima(atm1.train)%>%
forecast( h = 78)
checkresiduals(atm1Pre_holt)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 8.4181, df = 3, p-value = 0.03812
##
## Model df: 11. Total lags used: 14
checkresiduals(atm1Pre_ets)
## Warning in checkresiduals(atm1Pre_ets): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 4.4882, df = 12, p-value = 0.9729
##
## Model df: 2. Total lags used: 14
checkresiduals(atm1Pre_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,0)(0,1,1)[7]
## Q* = 11.534, df = 11, p-value = 0.3996
##
## Model df: 3. Total lags used: 14
The Holt-Winters model has the lowest Ljung-Box p-value (0.038), which suggests some remaining autocorrelation at the 5% level, while the ETS and ARIMA residuals are clearly consistent with white noise (p = 0.97 and 0.40). The residual ACF plots and roughly normal histograms support all three models, though the Holt-Winters result is the weakest.
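For reference, the three p-values can also be pulled out programmatically. The sketch below (not part of the original analysis) reruns Box.test() on each model's residuals with lag 14; note it uses fitdf = 0, so the p-values will differ slightly from the checkresiduals() output above.
sapply(list(HoltWinters = atm1Pre_holt, ETS = atm1Pre_ets, ARIMA = atm1Pre_arima),
       function(fc) Box.test(residuals(fc), lag = 14, type = "Ljung-Box")$p.value)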
To compare the three models' forecasts, I plot them together.
autoplot(atm1) +
  autolayer(atm1Pre_holt, series = "Holt-Winters", PI = FALSE) +
  autolayer(atm1Pre_ets, series = "ets", PI = FALSE) +
  autolayer(atm1Pre_arima, series = "arima", PI = FALSE)
All three models' predictions are very similar. The green line, representing the ETS model, has a wider range than the other two and appears closest to the black line, which represents the actual data.
Which model has the best test accuracy?
accuracy(atm1Pre_holt,atm1.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01140586 20.07449 13.05958 -75.04634 91.69145 0.7437797
## Test set -7.98413173 53.71096 39.47767 -478.19750 508.62074 2.2483646
## ACF1 Theil's U
## Training set 0.08327157 NA
## Test set 0.07507848 0.2725146
accuracy(atm1Pre_ets,atm1.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 2.554395 20.59417 12.51123 -74.86264 91.00704 0.7125497
## Test set -1.955039 51.69240 37.23724 -425.99817 460.31693 2.1207657
## ACF1 Theil's U
## Training set 0.11131608 NA
## Test set 0.09146545 0.3222798
accuracy(atm1Pre_arima,atm1.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2674077 20.55161 13.56616 -70.0377 86.49244 0.7726311
## Test set -0.7349643 52.32977 39.00573 -424.3575 461.22115 2.2214861
## ACF1 Theil's U
## Training set -0.01172374 NA
## Test set 0.03078549 0.4181184
The test-set RMSE for the Holt-Winters, ETS, and ARIMA models is 53.71, 51.69, and 52.33, respectively, so the ETS model performs best. I will choose the ETS model for ATM1.
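The same comparison can be tabulated in one line; a sketch that simply extracts the test-set RMSE from each accuracy() matrix shown above.
sapply(list(HoltWinters = atm1Pre_holt, ETS = atm1Pre_ets, ARIMA = atm1Pre_arima),
       function(fc) accuracy(fc, atm1.test)["Test set", "RMSE"])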
I refit the ETS model on the full ATM1 series, forecast the 31 days of May 2010, and write the result to a CSV file.
atm1_pre <- atm1%>%
stlm( s.window = 13, robust = TRUE, method = "ets" , lambda = lambda1,biasadj=TRUE) %>%
forecast( h = 31,lambda = lambda1)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj
## information not found, defaulting to FALSE.
autoplot(atm1_pre)
atm1_final<-data.frame(atm1_pre)
write_csv(atm1_final, "ATM1_Forecast.csv")
We will use decomposition to check the trend and seasonality of the ATM2 data.
atm2 %>%
  decompose(type = "multiplicative") %>%
  autoplot() + ggtitle("Classical multiplicative decomposition of daily ATM2 withdrawals")
ggseasonplot(atm2)
ggsubseriesplot(atm2)
ggAcf(atm2)
As with ATM1, the plots show weekly seasonality but no upward or downward trend. Saturday has the lowest withdrawal amount and Sunday the highest. Also as with ATM1, the ACF has spikes at lags 7, 14, and 21, and some outliers are detected in the later part of the series, from around week 42 onward.
I split the data into training and test sets and fit Holt-Winters, ETS, and auto ARIMA models.
atm2.train <- window(atm2, end=c(43,1))
atm2.test <- window(atm2, start=c(44,1))
lambda2<-BoxCox.lambda(atm2.train)
# Holt-Winters' additive seasonal method
atm2Pre_holt<-hw(atm2.train,seasonal="additive",h=78)
#Forecasting with ETS models
atm2Pre_ets <- atm2.train%>%
stlm( s.window = 13, robust = TRUE, method = "ets" , lambda = lambda2,biasadj=TRUE) %>%
forecast( h = 78,lambda = lambda2)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj
## information not found, defaulting to FALSE.
#Auto ARIMA model
atm2Pre_arima<-auto.arima(atm2.train)%>%
forecast( h = 78)
We overlay the three forecasts on one plot and compare the shapes of the lines.
autoplot(atm2) +
  autolayer(atm2Pre_holt, series = "Holt-Winters", PI = FALSE) +
  autolayer(atm2Pre_ets, series = "ets", PI = FALSE) +
  autolayer(atm2Pre_arima, series = "arima", PI = FALSE)
Auto ARIMA appears to fit the data better, because the ETS and Holt-Winters models predict negative withdrawal amounts, which is impossible in reality.
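If we did want to keep an exponential smoothing model here, one crude workaround (a sketch only, not applied in this analysis) is to truncate the negative point forecasts at zero after the fact:
holt_nonneg <- pmax(atm2Pre_holt$mean, 0)  # truncated copy; atm2Pre_holt itself is left unchanged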
checkresiduals(atm2Pre_holt)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 7.7107, df = 3, p-value = 0.05238
##
## Model df: 11. Total lags used: 14
checkresiduals(atm2Pre_ets)
## Warning in checkresiduals(atm2Pre_ets): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 9.9646, df = 12, p-value = 0.6191
##
## Model df: 2. Total lags used: 14
checkresiduals(atm2Pre_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(0,1,1)[7] with drift
## Q* = 10.918, df = 12, p-value = 0.536
##
## Model df: 2. Total lags used: 14
The residual ACF plots of all three models are consistent with white noise (the Holt-Winters Ljung-Box p-value of 0.052 is borderline), and the residuals are approximately normally distributed.
accuracy(atm2Pre_holt,atm2.test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.022302 24.24467 16.81797 -Inf Inf 0.7644531 0.05898475
## Test set 10.247015 63.70296 55.82170 -Inf Inf 2.5373498 0.01007337
## Theil's U
## Training set NA
## Test set 0
accuracy(atm2Pre_ets,atm2.test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.371413 23.15324 14.26618 -Inf Inf 0.6484628 0.03969808
## Test set 2.569678 65.17373 59.81651 -Inf Inf 2.7189321 -0.00426089
## Theil's U
## Training set NA
## Test set 0
accuracy(atm2Pre_arima,atm2.test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.662133 23.59368 15.70117 -Inf Inf 0.7136893 0.040790797
## Test set 8.752954 59.08689 54.19218 -Inf Inf 2.4632808 -0.006858483
## Theil's U
## Training set NA
## Test set 0
The test-set RMSE for the Holt-Winters, ETS, and ARIMA models is 63.70, 65.17, and 59.09, respectively. The ARIMA model has the lowest RMSE and predicts only positive values, so it is selected.
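A one-line sanity check, added here as a sketch, confirms that the ARIMA point forecasts stay non-negative:
min(atm2Pre_arima$mean)  # should be >= 0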
I refit the auto ARIMA model on the full ATM2 series and forecast the 31 days of May 2010.
atm2_pre<-auto.arima(atm2)%>%
forecast( h = 31)
atm2_final<-data.frame(atm2_pre)
write_csv(atm2_final, "ATM2_Forecast.csv")
tsdisplay(atm3)
The ATM3 plot shows withdrawals of zero from week 1 through week 51, followed by a sudden increase in week 52. The ACF/PACF display indicates positive correlation at lags 1 and 2, and any forecast effectively has to rely on the few non-zero observations at the end of the series.
atm3Pre_holt<-holt(atm3,h=31,damped = TRUE)
atm3_naive <- naive(atm3,h=31)
autoplot(atm3)+autolayer(atm3Pre_holt,series = "Holt")+autolayer(atm3_naive,series = "naive",PI=FALSE)
All values before week 52 are zero, so we have to forecast the withdrawal amount based on the last three observations. We use a naive forecast and an exponential smoothing model with a (damped) trend to predict next month's withdrawals. The naive method simply sets all forecasts equal to the last observed value. Holt's method allows forecasting data with a trend, and its forecast looks more reasonable than the flat naive line.
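As a small illustration of the naive method's definition (all forecasts equal the last observed value), the sketch below compares the last observation with the first few naive forecasts; it is not part of the original output.
tail(atm3, 1)             # last observed withdrawal
head(atm3_naive$mean, 3)  # naive point forecasts -- all equal to the value above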
We decide to use Holt's method and write the forecast to a CSV file.
atm3_final<-data.frame(atm3Pre_holt)
write_csv(atm3_final, "ATM3_Forecast.csv")
atm4 %>%
  decompose(type = "multiplicative") %>%
  autoplot() + ggtitle("Classical multiplicative decomposition of daily ATM4 withdrawals")
As with ATM1 and ATM2, ATM4 shows strong weekly seasonality and no trend. The plot also reveals an outlier that distorts the distribution of the data. To improve the forecast, we use tsoutliers() to locate the outlier and replace it with the suggested value.
boxplot(atm4)
tsoutliers(atm4)
## $index
## [1] 285
##
## $replacements
## [1] 190.9351
atm4[285]<-190.9351
autoplot(atm4)
The outlier is located at index 285. Following the suggested replacement, we set it to 190.9351. After the replacement, the plot looks much better, without the extreme value.
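As an aside, forecast::tsclean() bundles the same kind of outlier replacement (plus missing-value interpolation) into one call; a sketch of the alternative, not used in the analysis above.
atm4_auto_clean <- tsclean(atm4)  # note: atm4 was already cleaned above, so this is effectively a no-op here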
ggseasonplot(atm4)
ggsubseriesplot(atm4)
ggAcf(atm4)
From the plots, Saturday has the lowest withdrawal amount and Sunday the highest. The ACF plot shows positive correlation at lags 7, 14, and 21.
I split the data into training and test sets and apply Holt-Winters, ETS, and auto ARIMA models to predict the withdrawal amount.
atm4.train <- window(atm4, end=c(43,1))
atm4.test <- window(atm4, start=c(44,1))
lambda4<-BoxCox.lambda(atm4.train)
# Holt-Winters' additive seasonal method
atm4Pre_holt<-hw(atm4.train,seasonal="additive",h=78)
#Forecasting with ETS models
atm4Pre_ets <- atm4.train%>%
stlm( s.window = 13, robust = TRUE, method = "ets" , lambda = lambda4,biasadj=TRUE) %>%
forecast( h = 78,lambda = lambda4)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj
## information not found, defaulting to FALSE.
#Auto ARIMA model
atm4Pre_arim<-auto.arima(atm4.train)%>%
forecast( h = 78)
checkresiduals(atm4Pre_holt)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 27.421, df = 3, p-value = 4.804e-06
##
## Model df: 11. Total lags used: 14
checkresiduals(atm4Pre_ets)
## Warning in checkresiduals(atm4Pre_ets): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 36.024, df = 12, p-value = 0.0003211
##
## Model df: 2. Total lags used: 14
checkresiduals(atm4Pre_arim)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(1,0,0)[7] with non-zero mean
## Q* = 16.113, df = 11, p-value = 0.137
##
## Model df: 3. Total lags used: 14
The Ljung-Box tests reject white-noise residuals for both the Holt-Winters (p ≈ 5e-06) and ETS (p ≈ 3e-04) models; in the ETS residual ACF, lags 1, 5, 7, and 24 all exceed the blue dashed lines. Only the auto ARIMA residuals pass the test (p ≈ 0.14), although the ARIMA residual histogram is right-skewed.
autoplot(atm4) +
  autolayer(atm4Pre_holt, series = "Holt-Winters", PI = FALSE) +
  autolayer(atm4Pre_ets, series = "ets", PI = FALSE) +
  autolayer(atm4Pre_arim, series = "arima", PI = FALSE)
Compared with the Holt-Winters and ETS forecasts, the ARIMA forecast looks almost flat and does not visually track the test data, but we can use accuracy() to compare the models' RMSE.
accuracy(atm4Pre_holt,atm4.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -2.254334 325.6849 257.7313 -401.3045 439.6419 0.7402866
## Test set -64.348344 359.3135 304.1611 -990.2452 1026.7148 0.8736477
## ACF1 Theil's U
## Training set 0.09810224 NA
## Test set 0.07404776 0.2521717
accuracy(atm4Pre_ets,atm4.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 63.49130 326.1645 243.7161 -306.5187 348.2833 0.7000302
## Test set 48.82744 350.9931 298.5294 -734.9086 788.2869 0.8574715
## ACF1 Theil's U
## Training set 0.13232581 NA
## Test set 0.07268141 0.5316337
accuracy(atm4Pre_arim,atm4.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.07427274 351.8279 294.7598 -493.5134 526.8487 0.8466442
## Test set -31.32423186 303.2796 254.4130 -788.8555 817.0348 0.7307551
## ACF1 Theil's U
## Training set 0.004037442 NA
## Test set -0.011839829 0.2516428
Although its forecast looks inconsistent with the test data in the plot, auto ARIMA has the best test-set accuracy (lowest RMSE). We will use this model to forecast the withdrawal amounts for ATM4.
atm4_pre<-auto.arima(atm4)%>%
forecast( h = 31)
atm4_final<-data.frame(atm4_pre)
write_csv(atm4_final, "ATM4_Forecast.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
First we need to check the missing values after we read in the data.
powerUse<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")
aggr(powerUse, bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## KWH 0.005208333
## CaseSequence 0.000000000
## YYYY-MMM 0.000000000
KWH is the only variable with missing data (about 0.52% of its values). We can use kNN imputation to replace the missing values identified above. With the kNN method, a missing value is imputed by looking at other records with similar features; once the k most similar records are found, they are used to infer the missing value.
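For context on the chunk that follows: VIM::kNN() defaults to k = 5 neighbours and appends logical *_imp indicator columns, which is why the result is trimmed with subset() afterwards. A sketch with the parameters written out explicitly (same practical effect here, since KWH is the only variable with missing values):
powerImputed <- kNN(powerUse, variable = "KWH", k = 5)           # impute KWH from the 5 nearest records
powerImputed <- subset(powerImputed, select = CaseSequence:KWH)  # drop the appended *_imp columns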
#use KNN to impute the data
powerUse<-powerUse%>%
kNN()%>%
subset(select = CaseSequence:KWH)
missmap(powerUse)
After the imputation there are no missing values. Next, we turn the data into a ts object using the ts() function, and then use decomposition to analyze the trend, seasonality, and remainder.
tsPower <- ts(powerUse$KWH, start = c(1998,1), frequency = 12)
tsPower %>%
decompose(type="multiplicative")%>%
autoplot()+ggtitle('Classical multiplicative decomposition of residential power usage')
The plot shows stable power usage from 1998 until 2010, followed by a sudden drop. We do not know what caused the drop, but we detect an outlier in the remainder panel. The data also show strong seasonality in the seasonal panel.
We use tsoutliers() to handle the outliers.
boxplot(tsPower)
tsoutliers(tsPower)
## $index
## [1] 129 151 192
##
## $replacements
## [1] 7374737 7696306 7028762
The boxplot confirms the extreme values, and tsoutliers() identifies three observations (indices 129, 151, and 192) to replace with the suggested values.
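The three manual assignments below can also be written generically from the tsoutliers() output; a sketch with the same effect.
out <- tsoutliers(tsPower)
tsPower[out$index] <- out$replacements  # identical to the hard-coded replacements below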
tsPower[129]<-7374737
tsPower[151]<-7696306
tsPower[192]<-7028762
After the outliers are handled, we re-plot the series and check the seasonal and ACF plots.
autoplot(tsPower)
ggseasonplot(tsPower)
ggsubseriesplot(tsPower)
ggAcf(tsPower)
The sudden drop in power usage was caused by an outlier; after it is replaced with the suggested value, the series looks normal, without the abrupt decrease. The seasonal plot and seasonal subseries plot clearly show that power usage peaks in August and January, when people rely heavily on electricity for cooling in summer and heating in winter. May and November have lower usage because outdoor temperatures are still comfortable then.
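A quick numerical confirmation of the seasonal pattern described above (a sketch, not part of the original output) is to average the usage by calendar month:
round(tapply(tsPower, factor(month.abb[cycle(tsPower)], levels = month.abb), mean))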
We use Holt-Winters, ETS, and auto ARIMA models to produce the monthly forecast for 2014 (a 12-month horizon).
lambda_p<-BoxCox.lambda(tsPower)
# Holt-Winters' additive seasonal method
power_holt <- hw(tsPower, seasonal = "additive", h = 12)  # 12 monthly forecasts covering 2014
#Forecasting with ETS models
power_ets <- tsPower %>%
  stlm(s.window = 13, robust = TRUE, method = "ets", lambda = lambda_p, biasadj = TRUE) %>%
  forecast(h = 12, lambda = lambda_p)  # 12 monthly forecasts covering 2014
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj
## information not found, defaulting to FALSE.
#Auto ARIMA model
power_arima <- auto.arima(tsPower) %>%
  forecast(h = 12)  # 12 monthly forecasts covering 2014
checkresiduals(power_holt)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 61.748, df = 8, p-value = 2.114e-10
##
## Model df: 16. Total lags used: 24
checkresiduals(power_ets)
## Warning in checkresiduals(power_ets): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 33.874, df = 22, p-value = 0.05059
##
## Model df: 2. Total lags used: 24
checkresiduals(power_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
## Q* = 11.539, df = 17, p-value = 0.8272
##
## Model df: 7. Total lags used: 24
The Ljung-Box test rejects white-noise residuals for the Holt-Winters model (p ≈ 2e-10), is borderline for the STL + ETS model (p ≈ 0.051), and clearly passes for the ARIMA model (p ≈ 0.83); the residual histograms are roughly normal.
Compare the predictions of the three models.
autoplot(tsPower) +
  autolayer(power_holt, series = "holt", PI = FALSE) +
  autolayer(power_ets, series = "ets", PI = FALSE) +
  autolayer(power_arima, series = "arima", PI = FALSE)
Compare the accuracy of the three models. Since Part B has no separate test split, accuracy() reports training-set errors only.
accuracy(power_holt)
## ME RMSE MAE MPE MAPE MASE
## Training set 11056.28 589808.7 446723.2 -0.5549773 6.817696 0.7489219
## ACF1
## Training set 0.2907402
accuracy(power_ets)
## ME RMSE MAE MPE MAPE MASE
## Training set 69078.46 528137.2 400239.6 0.4253346 6.068165 0.670993
## ACF1
## Training set 0.1805739
accuracy(power_arima)
## ME RMSE MAE MPE MAPE MASE
## Training set -7313.458 552028.7 409562.5 -0.7072208 6.308841 0.6866227
## ACF1
## Training set 0.006463511
For this dataset, the ETS model has the lowest training RMSE (528,137.2) and is chosen as the forecasting model.
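The training RMSEs quoted above can be tabulated directly; a small sketch, not part of the original output.
sapply(list(HoltWinters = power_holt, ETS = power_ets, ARIMA = power_arima),
       function(fc) accuracy(fc)["Training set", "RMSE"])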
Write the data into a CSV file.
power_final <- data.frame(power_ets)  # use the selected ETS forecast
write_csv(power_final, "PowerUsage_Forecast.csv")