This project consists of three parts, two required and one bonus, and is worth 15% of your grade. I will accept late submissions with a penalty until the meetup after that, when we review some projects.
In part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
First, we read in our data from the Excel file and drop any NA values. The data spans 5/1/2009 to 4/30/2010, and we will forecast ATM withdrawal amounts for May 2010 for each ATM. We convert each of the four ATMs' series from the imported dataframe into individual time series and visualize them all on the same plot. ATM #4 has greater withdrawals overall than the other three ATMs, and it contains one considerable spike that we will want to take a closer look at. We explore each ATM individually and forecast accordingly.
library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(imputeTS)
atm <- readxl::read_excel("ATM624Data.xlsx")
atm <- drop_na(atm)
atm1 <- atm[atm$ATM == "ATM1",]
atm2 <- atm[atm$ATM == "ATM2",]
atm3 <- atm[atm$ATM == "ATM3",]
atm4 <- atm[atm$ATM == "ATM4",]
atm1 <- subset(atm1, select = -c(ATM, DATE))
atm2 <- subset(atm2, select = -c(ATM, DATE))
atm3 <- subset(atm3, select = -c(ATM, DATE))
atm4 <- subset(atm4, select = -c(ATM, DATE))
atm1ts <- ts(atm1)
atm2ts <- ts(atm2)
atm3ts <- ts(atm3)
atm4ts <- ts(atm4)
autoplot(atm1ts, series = "ATM 1") + autolayer(atm2ts, series = "ATM 2", PI=FALSE) + autolayer(atm3ts, series = "ATM 3", PI=FALSE) + autolayer(atm4ts, series = "ATM 4", PI=FALSE) + ggtitle("ATM Withdrawals (5/1/2009 - 4/30/2010)")
## Warning: Ignoring unknown parameters: PI
## Warning: Ignoring unknown parameters: PI
## Warning: Ignoring unknown parameters: PI
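As an optional supplement (not part of the original output), a quick per-ATM summary makes the scale differences between the machines concrete; the summary column names below are only illustrative.
atm %>%
  group_by(ATM) %>%
  summarise(mean_cash = mean(Cash), max_cash = max(Cash), days = n())  # daily cash is in hundreds of dollars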
We’ll start with ATM #1:
autoplot(atm1ts) + ggtitle("ATM #1 Withdrawals")
First, we explore an ARIMA model; auto.arima selects an ARIMA(0,0,2) model with non-zero mean, which has an AICc of 3622.64 and an RMSE of 35.63. When we plot the forecast, it quickly settles into an essentially flat line for the month of May (as expected, since an MA(2) forecast reverts to the mean after two steps). That is probably not realistic given the day-to-day variation in the series, so it makes sense to look into other models as well.
fit <- auto.arima(atm1ts)
summary(fit)
## Series: atm1ts
## ARIMA(0,0,2) with non-zero mean
##
## Coefficients:
## ma1 ma2 mean
## 0.0585 -0.2423 83.8912
## s.e. 0.0524 0.0577 1.5309
##
## sigma^2 estimated as 1280: log likelihood=-1807.26
## AIC=3622.53 AICc=3622.64 BIC=3638.09
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.007055245 35.63488 27.15933 -159.1922 182.3459 0.7239547
## ACF1
## Training set 0.008210899
fc <- forecast(fit, h=31)
autoplot(fc)
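As an optional diagnostic (not shown in the original report), we could also check whether the ARIMA residuals still contain autocorrelation before comparing models.
checkresiduals(fit)  # residual plot, ACF, histogram and Ljung-Box test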
We also considered Holt-Winters’ method for this series, in regular, damped and Box Cox-transformed variations. Because the daily data show a day-of-week pattern, we first re-specify the series with a weekly frequency (7) so that the seasonal component can be estimated. All three HW models have smaller RMSE values than the ARIMA model, and their forecasts also look more realistic in how they behave. The damped HW model has the smallest RMSE at 26.55, so we choose it to forecast withdrawals for ATM #1.
atm1ts2 <- ts(atm1ts, frequency = 7)
atm1.hw <- hw(atm1ts2, h = 31)
autoplot(atm1.hw)
atm1.hw.damped <- hw(atm1ts2, damped = TRUE, h = 31)
autoplot(atm1.hw.damped)
atm1.hw.boxcox <- hw(atm1ts2, lambda = BoxCox.lambda(atm1ts2), h=31)
autoplot(atm1.hw.boxcox)
accuracy(atm1.hw)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3248614 26.69971 17.28292 -113.4744 129.995 0.9183413
## ACF1
## Training set 0.1421671
accuracy(atm1.hw.damped)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3171818 26.54717 17.24541 -115.2295 131.7731 0.9163482
## ACF1
## Training set 0.130437
accuracy(atm1.hw.boxcox)
## ME RMSE MAE MPE MAPE MASE
## Training set 3.235501 28.77077 18.706 -97.38168 119.4301 0.9939574
## ACF1
## Training set 0.1478558
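The choice of a weekly frequency can also be confirmed visually with a day-of-week subseries plot (an optional check, not part of the original report).
ggsubseriesplot(atm1ts2) + ggtitle("ATM #1 withdrawals by day of week")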
atm1may <- data.frame(Y=as.matrix(round(atm1.hw.damped$mean)))
ATM #2:
autoplot(atm2ts) + ggtitle("ATM #2 Withdrawals")
Again, we explore an ARIMA model first; auto.arima selects an ARIMA(5,1,4) model, which has an AICc of 3474.93 and an RMSE of 28.27. The forecast seems to continue much of the variation that we see in the series, but its swings are more conservative, i.e. smaller, than the original series suggests.
fit2 <- auto.arima(atm2ts)
summary(fit2)
## Series: atm2ts
## ARIMA(5,1,4)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.0351 -0.7962 -0.0623 -0.2303 -0.4691 -1.0444 0.7030 -0.7622
## s.e. 0.0777 0.0559 0.0952 0.0512 0.0557 0.0819 0.0925 0.0932
## ma4
## 0.1729
## s.e. 0.0738
##
## sigma^2 estimated as 822: log likelihood=-1727.15
## AIC=3474.3 AICc=3474.93 BIC=3513.22
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.509031 28.27253 22.54609 -Inf Inf 0.5270023 -0.01925641
fc2 <- forecast(fit2, h=31)
autoplot(fc2)
Again, we try the three variations of the HW model (regular, damped and Box Cox-transformed), and all except the Box Cox version have smaller RMSEs than the ARIMA(5,1,4) model. Visually, the regular and damped forecasts also look more plausible for ATM #2. The damped HW model performs best, so we choose it for our forecast.
atm2ts2 <- ts(atm2ts, frequency = 7)
atm2.hw <- hw(atm2ts2, h = 31)
autoplot(atm2.hw)
atm2.hw.damped <- hw(atm2ts2, damped = TRUE, h = 31)
autoplot(atm2.hw.damped)
atm2.hw.boxcox <- hw(atm2ts2, lambda = BoxCox.lambda(atm2ts2), h=31)
autoplot(atm2.hw.boxcox)
accuracy(atm2.hw)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.004498074 28.03076 19.38286 -Inf Inf 0.9019996
## ACF1
## Training set -0.007054699
accuracy(atm2.hw.damped)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.3382394 28.00894 19.51643 -Inf Inf 0.9082155 -0.008575387
accuracy(atm2.hw.boxcox)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.487496 28.36527 19.55013 -Inf Inf 0.9097837 -0.003205164
atm2may <- data.frame(Y=as.matrix(round(atm2.hw.damped$mean)))
ATM #3:
autoplot(atm3ts) + ggtitle("ATM #3 Withdrawals")
This ATM is more difficult to forecast: according to the data, it was withdrawn from on only three days. That leaves us with limited options. We can either hold off on forecasting for this ATM, or produce a short-term forecast to use until more data are collected. These three observations fall at the end of the series, so we focus on just those points and, as is common with so few data points, use a mean forecast.
head(atm3[atm3$Cash != 0,])
## # A tibble: 3 x 1
## Cash
## <dbl>
## 1 96
## 2 82
## 3 85
atm3.2 <- atm3[atm3$Cash != 0,]
atm3ts2 <- ts(atm3.2)
fc3 <- meanf(ts(atm3ts2), h =31)
autoplot(fc3)
summary(fc3)
atm3may <- data.frame(Y = as.matrix(round(fc3$mean)))
##
## Forecast method: Mean
##
## Model Information:
## $mu
## [1] 87.66667
##
## $mu.se
## [1] 4.255715
##
## $sd
## [1] 7.371115
##
## $bootstrap
## [1] FALSE
##
## $call
## meanf(y = ts(atm3ts2), h = 31)
##
## attr(,"class")
## [1] "meanf"
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -4.737024e-15 6.01849 5.555556 -0.4557562 6.242793 0.6535948
## ACF1
## Training set -0.295501
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 4 87.66667 71.61736 103.716 51.04494 124.2884
## 5 87.66667 71.61736 103.716 51.04494 124.2884
## 6 87.66667 71.61736 103.716 51.04494 124.2884
## 7 87.66667 71.61736 103.716 51.04494 124.2884
## 8 87.66667 71.61736 103.716 51.04494 124.2884
## 9 87.66667 71.61736 103.716 51.04494 124.2884
## 10 87.66667 71.61736 103.716 51.04494 124.2884
## 11 87.66667 71.61736 103.716 51.04494 124.2884
## 12 87.66667 71.61736 103.716 51.04494 124.2884
## 13 87.66667 71.61736 103.716 51.04494 124.2884
## 14 87.66667 71.61736 103.716 51.04494 124.2884
## 15 87.66667 71.61736 103.716 51.04494 124.2884
## 16 87.66667 71.61736 103.716 51.04494 124.2884
## 17 87.66667 71.61736 103.716 51.04494 124.2884
## 18 87.66667 71.61736 103.716 51.04494 124.2884
## 19 87.66667 71.61736 103.716 51.04494 124.2884
## 20 87.66667 71.61736 103.716 51.04494 124.2884
## 21 87.66667 71.61736 103.716 51.04494 124.2884
## 22 87.66667 71.61736 103.716 51.04494 124.2884
## 23 87.66667 71.61736 103.716 51.04494 124.2884
## 24 87.66667 71.61736 103.716 51.04494 124.2884
## 25 87.66667 71.61736 103.716 51.04494 124.2884
## 26 87.66667 71.61736 103.716 51.04494 124.2884
## 27 87.66667 71.61736 103.716 51.04494 124.2884
## 28 87.66667 71.61736 103.716 51.04494 124.2884
## 29 87.66667 71.61736 103.716 51.04494 124.2884
## 30 87.66667 71.61736 103.716 51.04494 124.2884
## 31 87.66667 71.61736 103.716 51.04494 124.2884
## 32 87.66667 71.61736 103.716 51.04494 124.2884
## 33 87.66667 71.61736 103.716 51.04494 124.2884
## 34 87.66667 71.61736 103.716 51.04494 124.2884
ATM #4:
autoplot(atm4ts) + ggtitle("ATM #4 Withdrawals")
It seems like this ATM has one considerably large outlier that could distort our forecasts, so we mark that observation as missing and impute it with Kalman smoothing.
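Before imputing, it is worth confirming just how extreme the spike is relative to typical withdrawals (an optional check, not in the original output).
max(atm4$Cash)     # the flagged spike
median(atm4$Cash)  # typical daily withdrawal for ATM #4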
outlier <- which.max(atm4$Cash)  # index of the extreme spike
atm4$Cash[outlier] <- NA         # mark it as missing rather than dropping the day
atm4ts <- ts(atm4$Cash)
atm4ts <- na_kalman(atm4ts)      # impute the missing value with Kalman smoothing
autoplot(atm4ts) + ggtitle("ATM #4 Withdrawals (after imputing for outlier)")
We then move on to forecasting withdrawals from this ATM. auto.arima selects an ARIMA(0,0,1) model with non-zero mean, which has an AICc of 5303.47 and an RMSE of 349.92. The forecast is essentially a flat line, and we can likely find an alternative that captures more of the daily variation in withdrawals.
fit4 <- auto.arima(atm4ts)
summary(fit4)
## Series: atm4ts
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.0745 445.4059
## s.e. 0.0522 19.7034
##
## sigma^2 estimated as 123121: log likelihood=-2648.7
## AIC=5303.4 AICc=5303.47 BIC=5315.1
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04494946 349.9213 291.6334 -568.7919 602.1878 0.7530638
## ACF1
## Training set 0.0003911351
fc4 <- forecast(fit4, h=31)
autoplot(fc4)
All three variations of the Holt-Winters’ method have smaller RMSE values than the ARIMA(0,0,1) model. Of the three, the damped HW forecast has the smallest RMSE value, so it is the best fit here.
atm4ts2 <- ts(atm4ts, frequency = 7)
atm4.hw <- hw(atm4ts2, h = 31)
autoplot(atm4.hw)
atm4.hw.damped <- hw(atm4ts2, damped = TRUE, h = 31)
autoplot(atm4.hw.damped)
atm4.hw.boxcox <- hw(atm4ts2, lambda = BoxCox.lambda(atm4ts2), h=31)
autoplot(atm4.hw.boxcox)
accuracy(atm4.hw)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.430172 332.8095 268.3569 -505.708 536.4508 0.7734634
## ACF1
## Training set 0.09716164
accuracy(atm4.hw.damped)
## ME RMSE MAE MPE MAPE MASE
## Training set -11.79613 330.3002 266.8285 -523.0685 552.2923 0.7690583
## ACF1
## Training set 0.1164429
accuracy(atm4.hw.boxcox)
## ME RMSE MAE MPE MAPE MASE
## Training set 69.85466 344.0054 263.9311 -352.2249 397.1189 0.7607075
## ACF1
## Training set 0.08602466
atm4may <- data.frame(Y=as.matrix(round(atm4.hw.damped$mean)))
These forecasts should help the ATM operators better estimate utilization and ensure that each machine has enough cash available to meet its own demand, something many operators, at least the largest ones, already do. I have provided daily ATM withdrawal forecasts for May 2010 in a separate file.
atmfc <- cbind(atm1may, atm2may, atm3may, atm4may)
names(atmfc) <- c("ATM1", "ATM2", "ATM3", "ATM4")
write.csv(atmfc,"atm_forecasts_OP.csv")
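The CSV above is Excel-readable, which satisfies the assignment; if a native Excel workbook is preferred, the writexl package (an additional dependency, not used elsewhere in this report) could write one directly.
writexl::write_xlsx(atmfc, "atm_forecasts_OP.xlsx")  # optional alternative to the CSV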
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
We start by loading our dataset, dropping any NA values and converting it into a ts object.
pow <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
pow <- drop_na(pow)
powts <- ts(pow[,"KWH"], frequency = 12)
autoplot(powts) + ggtitle("Residential power usage (January 1998-December 2013)")
It seems that we have one extreme low outlier here that would drag our forecasts down, so we mark it as missing and impute it using Kalman smoothing.
There is an upward trend in the series, with power usage increasing over the years, possibly due to growth in population or in dependence on electricity. There also appears to be a seasonal component, with more power used in the summer and winter months than in the fall and spring.
outlier2 <- which.min(pow$KWH)  # index of the unusually low reading
pow$KWH[outlier2] <- NA         # mark it as missing rather than dropping the month
powts <- ts(pow$KWH, frequency = 12, start = c(1998, 1))
powts <- na_kalman(powts)       # impute the missing value with Kalman smoothing
autoplot(powts) + ggtitle("Residential power usage with imputation for outlier removed (Jan 1998-Dec 2013)")
ggseasonplot(powts, polar = TRUE)
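An STL decomposition (an optional check, not part of the original analysis) makes the upward trend and monthly seasonal pattern described above explicit.
autoplot(stl(powts, s.window = "periodic")) + ggtitle("STL decomposition of residential power usage")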
We use auto.arima, which produces an ARIMA(4,1,1)(2,0,1)[12] model, and visually its forecast appears to follow both the trend and the seasonal pattern of the series over the next 12 months. This forecast has an RMSE of 782,578.5.
fit5 <- auto.arima(powts)
summary(fit5)
## Series: powts
## ARIMA(4,1,1)(2,0,1)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 sar1 sar2 sma1
## -0.0238 -0.5035 -0.2825 -0.3955 -0.3983 0.7910 0.0420 -0.4065
## s.e. 0.1370 0.0710 0.0803 0.0897 0.1531 0.2123 0.1498 0.2058
##
## sigma^2 estimated as 6.429e+11: log likelihood=-2837.83
## AIC=5693.67 AICc=5694.67 BIC=5722.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 16190.18 782578.5 592494.5 -0.9313444 9.091479 0.8352336
## ACF1
## Training set 0.002877042
fc5 <- forecast(fit5, h=12)
autoplot(fc5)
We also considered HW forecasts for power usage (multiplicative seasonal, damped, and Box Cox-transformed). Among these, the multiplicative seasonal model has the lowest RMSE, but none of the HW variants improve on the ARIMA model’s RMSE, so we keep the ARIMA(4,1,1)(2,0,1)[12] forecast for 2014.
pow.hw <- hw(powts, h = 12, seasonal = "multiplicative")
autoplot(pow.hw)
pow.hw.damped <- hw(powts, damped = TRUE, h = 12)
autoplot(pow.hw.damped)
pow.hw.boxcox <- hw(powts, lambda = BoxCox.lambda(powts), h=12)
autoplot(pow.hw.boxcox)
accuracy(pow.hw)
## ME RMSE MAE MPE MAPE MASE
## Training set 3081.901 884147.7 635263.5 -1.095551 9.61109 0.8955248
## ACF1
## Training set 0.3200341
accuracy(pow.hw.damped)
## ME RMSE MAE MPE MAPE MASE
## Training set 75597.96 907937.3 665113.7 -0.175879 10.07382 0.9376043
## ACF1
## Training set 0.3349624
accuracy(pow.hw.boxcox)
## ME RMSE MAE MPE MAPE MASE
## Training set 83855.45 906575.9 650289.6 -0.01635677 9.662432 0.9167068
## ACF1
## Training set 0.3447324
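Training-set RMSE can favour the more flexible model, so an out-of-sample comparison would strengthen this choice. The sketch below (an assumption, not part of the original workflow, and slow to run because the models are refit at every forecast origin) uses time series cross-validation to compare one-step errors for the chosen ARIMA specification and the multiplicative HW model.
farima <- function(y, h) forecast(Arima(y, order = c(4, 1, 1), seasonal = c(2, 0, 1)), h = h)
fhw <- function(y, h) hw(y, h = h, seasonal = "multiplicative")
e_arima <- tsCV(powts, farima, h = 1)  # one-step cross-validated errors, ARIMA
e_hw <- tsCV(powts, fhw, h = 1)        # one-step cross-validated errors, Holt-Winters
sqrt(mean(e_arima^2, na.rm = TRUE))    # cross-validated RMSE, ARIMA
sqrt(mean(e_hw^2, na.rm = TRUE))       # cross-validated RMSE, Holt-Winters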
pow2014 <- data.frame(Y=as.matrix(round(fc5$mean)))
We provide these monthly forecasts for 2014 in a .csv file. This should help project demand and plan capacity accordingly.
write.csv(pow2014,"power_forecasts_OP.csv")