In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
library(readxl)    # read_excel()
library(tidyverse) # dplyr, tidyr (spread/gather/drop_na), ggplot2, readr (write_csv)
library(DT)        # datatable()
library(psych)     # describe()
library(forecast)  # snaive(), hw(), auto.arima(), Arima(), meanf(), autoplot()
library(imputeTS)  # na_seadec(), plotNA.distribution()
library(outliers)  # outlier(), rm.outlier()
library(Metrics)   # rmse()
df <- read_excel("/Users/albina/Downloads/ATM624Data.xlsx")
head(df)
## # A tibble: 6 x 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1474 obs. of 3 variables:
## $ DATE: num 39934 39934 39935 39935 39936 ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num 96 107 82 89 85 90 90 55 99 79 ...
datatable(describe(df))
A first look at the data shows that it needs cleaning: we must convert the date format, investigate missing values and outliers, and replace or impute those values as necessary.
df$DATE <- as.Date(df$DATE, origin = "1899-12-30")
df %>% ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y")
Missing values in ATM1 and ATM2 will be imputed with the na_seadec() function from the imputeTS package, which removes the seasonal component from the time series, performs imputation on the deseasonalized series, and afterwards adds the seasonal component back. The extreme value in ATM4 will be handled with the outliers package.
df <- df %>%
# remove last two weeks of NAs
drop_na() %>%
# create a column for each atm
spread(ATM,Cash)
plotNA.distribution(df$ATM1)
plotNA.distribution(df$ATM2)
plotNA.distribution(df$ATM3)
plotNA.distribution(df$ATM4)
# Impute missing values in ATM1 and ATM2 while preserving the weekly seasonal pattern
df$ATM1 <- na_seadec(df$ATM1)
df$ATM2 <- na_seadec(df$ATM2)
# ATM3 has only three real (non-zero) observations, all at the very end of the series; treat the zeroes as missing
df$ATM3[df$ATM3 == 0] <- NA
# Find the value with the largest difference from the sample mean (a likely outlier) and replace it with the mean of the series.
outlier(df$ATM4, opposite = FALSE, logical = FALSE)
## [1] 10919.76
df$ATM4 <- rm.outlier(df$ATM4, fill = TRUE, median = FALSE, opposite = FALSE)
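An alternative worth knowing about is forecast::tsoutliers(), which identifies outliers from a seasonal decomposition and suggests replacement values; a minimal sketch (illustrative only; applied to the raw ATM4 column it would be expected to flag the same extreme observation):
# tsoutliers() returns suspected outlier indices and suggested replacement
# values, taking the weekly seasonality into account.
tsoutliers(ts(df$ATM4, frequency = 7))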
datatable(describe(df))
Now that the data have been cleaned, let’s take a closer look at individual series.
df %>% gather(atm, Cash, -DATE) %>%
ggplot(aes(x = DATE, y = Cash, col = atm)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ atm, ncol = 1, scales = "free_y")
atm1_ts <- ts(df$ATM1,frequency=7)
ggtsdisplay(atm1_ts)
ggseasonplot(atm1_ts, polar=TRUE)
autoplot(decompose(atm1_ts))
The data show a clear weekly pattern. The decomposition charts confirm that there is no apparent trend. Interestingly, the remainder starts oscillating towards the end of the series. At the same time, the seasonal plot shows changing behavior: Thursdays go from being the day with the highest withdrawals in the middle of the series to the lowest at the end. The two effects appear to be connected.
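To make the Thursday shift concrete, average withdrawals by weekday can be compared between the two halves of the series; a rough sketch (the 50/50 split point is an arbitrary illustration):
# Mean cash by weekday (cycle position 1-7) for each half of the series;
# a drop in the Thursday column of the second half would confirm the shift.
n <- length(atm1_ts)
half <- rep(c("first", "second"), c(ceiling(n / 2), floor(n / 2)))
tapply(as.numeric(atm1_ts), list(weekday = cycle(atm1_ts), half = half), mean)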
atm2_ts <- ts(df[,"ATM2"],frequency=7)
ggtsdisplay(atm2_ts)
ggseasonplot(atm2_ts, polar=TRUE)
autoplot(decompose(atm2_ts))
Similarly to ATM1, the ATM2 data show a clear weekly pattern, but with a slight downward trend. As with ATM1, the remainder starts oscillating towards the end of the series, and the seasonal plot shows a similar shift in pattern.
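The slight downward trend can be quantified with a simple linear time-trend regression; a minimal sketch using tslm() from the forecast package:
# Regress ATM2 on a linear trend; a negative trend coefficient supports
# the downward drift seen in the decomposition.
trend_fit <- tslm(atm2_ts ~ trend)
coef(trend_fit)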
atm4_ts <- ts(df[,"ATM4"],frequency=7)
ggtsdisplay(atm4_ts)
ggseasonplot(atm4_ts, polar=TRUE)
autoplot(decompose(atm4_ts))
ATM4 shows the same weekly seasonal pattern and no apparent trend, and the remainder appears to be white noise.
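The white-noise impression for the ATM4 remainder can be checked formally with a Ljung-Box test; a quick sketch (the choice of 14 lags, two weekly cycles, is a judgment call):
# Test the decomposition remainder for residual autocorrelation; a large
# p-value is consistent with white noise. na.omit() drops the NA edges
# that decompose() leaves at both ends of the random component.
Box.test(na.omit(decompose(atm4_ts)$random), lag = 14, type = "Ljung-Box")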
It is good practice to fit several models first so their performance can be compared before choosing a final candidate for forecasting.
Given seasonality in ATM1, ATM2, and ATM4, we will fit and test the following models:
- Seasonal naive, in which each forecast is set equal to the last observed value from the same season (for daily data with weekly seasonality, the same day of the previous week)
- Holt-Winters seasonal method, which captures the seasonal component. The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations change in proportion to the level of the series. Since our data appear to maintain constant variance, we will use the additive method (see the comparison sketch just after this list).
- ARIMA, with the model order selected automatically by auto.arima()
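To support the additive choice, both Holt-Winters variants can be fit and their information criteria compared; a minimal sketch on ATM1 (illustrative only, and assuming all values are strictly positive, as the multiplicative form requires):
# Fit both seasonal variants and compare corrected AIC; a lower AICc for
# the additive model is consistent with roughly constant seasonal variation.
hw_add  <- hw(atm1_ts, seasonal = "additive")
hw_mult <- hw(atm1_ts, seasonal = "multiplicative")
c(additive = hw_add$model$aicc, multiplicative = hw_mult$model$aicc)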
Given the lack of data for ATM3, we will use a simple mean forecast of its three observed values.
For ATM1, ATM2, and ATM4 we will train the above models on 90% of the available data, select the best performer based on test-set RMSE, and then retrain that model on all of the data to predict the 31 daily values for May 2010.
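Since the same train/test/RMSE procedure repeats for each ATM, it can be captured in a small helper; a sketch (the eval_rmse name and the wrapper functions are hypothetical; the sections below spell the steps out in full for clarity):
# Split a series 90/10 in time order, forecast the held-out window with the
# supplied model-fitting function, and return the test-set RMSE.
eval_rmse <- function(y, fit_fn, train_perc = 0.9) {
  train <- head(y, round(length(y) * train_perc))
  test <- tail(y, length(y) - length(train))
  fc <- forecast(fit_fn(train), h = length(test))
  rmse(test, fc$mean)
}
# Example: eval_rmse(atm1_ts, function(x) auto.arima(x, seasonal = TRUE))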
train_perc = 0.9
train <- head(atm1_ts, round(length(atm1_ts) * train_perc))
h <- length(atm1_ts) - length(train)
test <- tail(atm1_ts, h)
# Seasonal Naive
fit <- snaive(train, length(test))
fc <- forecast(fit, h = length(test))
snaive_rmse <- rmse(test,fc$mean)
autoplot(train,main='Seasonal Naive') +
autolayer(fc,series='Naive') +
autolayer(test,series='Actual')
fit[['model']]
## Call: snaive(y = train, h = length(test))
##
## Residual sd: 28.995
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 75.609, df = 14, p-value = 1.83e-10
##
## Model df: 0. Total lags used: 14
# Holt-Winters seasonal method
fit <- hw(train,seasonal="additive",length(test))
fc <- forecast(fit, h = length(test))
hws_rmse <- rmse(test,fc$mean)
autoplot(train,main="Holt-Winters seasonal") +
autolayer(fc,series='HW seasonal') +
autolayer(test,series='Actual')
fit[['model']]
## Holt-Winters' additive method
##
## Call:
## hw(y = train, h = length(test), seasonal = "additive")
##
## Smoothing parameters:
## alpha = 0.0227
## beta = 1e-04
## gamma = 0.3248
##
## Initial states:
## l = 78.0384
## b = 0.0058
## s = -60.074 -3.6633 18.9093 4.2836 17.4204 13.6339
## 9.4902
##
## sigma: 25.263
##
## AIC AICc BIC
## 4031.368 4032.359 4076.884
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 27.892, df = 3, p-value = 3.826e-06
##
## Model df: 11. Total lags used: 14
# ARIMA
fit <- auto.arima(train, seasonal=T, stepwise=F, approximation=F)
fc <- forecast(fit, h = length(test))
arima_rmse <- rmse(test,fc$mean)
autoplot(train,main='ARIMA') +
autolayer(fc,series='ARIMA') +
autolayer(test,series='Actual')
summary(fit)
## Series: train
## ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.2075 0.2316 -0.8030
## s.e. 0.0582 0.0893 0.0666
##
## sigma^2 estimated as 601.3: log likelihood=-1483.43
## AIC=2974.86 AICc=2974.99 BIC=2989.95
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1966475 24.14531 15.2034 -108.9955 124.9788 0.8154204
## ACF1
## Training set -0.01123355
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(1,1,1)[7]
## Q* = 15.6, df = 11, p-value = 0.1566
##
## Model df: 3. Total lags used: 14
Based on the Ljung-Box tests, only the ARIMA residuals resemble white noise; the seasonal naive and Holt-Winters residuals still show significant autocorrelation. Comparing the RMSE results on the test set side by side, however, Holt-Winters gives the lowest error and is our pick for ATM1.
res <- data.frame(rbind(snaive_rmse,hws_rmse, arima_rmse))
colnames(res) <- "RMSE"
rownames(res) <- c('Seasonal Naive','Holt-Winters Seasonal', 'ARIMA')
datatable(res) %>% formatRound('RMSE',3)
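As an aside, the forecast package's accuracy() function produces the same comparison along with additional metrics (MAE, MAPE, MASE) in a single call; a quick sketch for any forecast object fc from above:
# Passing the held-out data as the second argument adds a "Test set" row
# to the usual training-set error measures.
accuracy(fc, test)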
train_perc = 0.9
train <- head(atm2_ts, round(length(atm2_ts) * train_perc))
h <- length(atm2_ts) - length(train)
test <- tail(atm2_ts, h)
# Seasonal Naive
fit <- snaive(train, length(test))
fc <- forecast(fit, h = length(test))
snaive_rmse <- rmse(test,fc$mean)
autoplot(train,main='Seasonal Naive') +
autolayer(fc,series='Naive') +
autolayer(test,series='Actual')
fit[['model']]
## Call: snaive(y = train, h = length(test))
##
## Residual sd: 30.9242
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 95.121, df = 14, p-value = 4.052e-14
##
## Model df: 0. Total lags used: 14
# Holt-Winters seasonal method
fit <- hw(train,seasonal="additive",length(test))
fc <- forecast(fit, h = length(test))
hws_rmse <- rmse(test,fc$mean)
autoplot(train,main='Holt-Winters Seasonal') +
autolayer(fc,series='HW seasonal') +
autolayer(test,series='Actual')
fit[['model']]
## Holt-Winters' additive method
##
## Call:
## hw(y = train, h = length(test), seasonal = "additive")
##
## Smoothing parameters:
## alpha = 1e-04
## beta = 1e-04
## gamma = 0.3588
##
## Initial states:
## l = 74.186
## b = -0.0665
## s = -45.0356 -25.9466 19.9782 1.8337 4.0716 13.9378
## 31.161
##
## sigma: 26.1721
##
## AIC AICc BIC
## 4054.559 4055.550 4100.076
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 38.198, df = 3, p-value = 2.566e-08
##
## Model df: 11. Total lags used: 14
# ARIMA
fit <- auto.arima(train, seasonal=T, stepwise=F, approximation=F)
fc <- forecast(fit, h = length(test))
arima_rmse <- rmse(test,fc$mean)
autoplot(train,main='ARIMA') +
autolayer(fc,series='ARIMA') +
autolayer(test,series='Actual')
summary(fit)
## Series: train
## ARIMA(2,0,2)(0,1,1)[7] with drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 drift
## -0.4482 -0.9516 0.4972 0.8257 -0.8231 -0.0730
## s.e. 0.0372 0.0344 0.0669 0.0491 0.0438 0.0378
##
## sigma^2 estimated as 617.4: log likelihood=-1486.87
## AIC=2987.75 AICc=2988.1 BIC=3014.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.4557674 24.35081 17.00471 -Inf Inf 0.7981446 -0.005601147
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7] with drift
## Q* = 9.0769, df = 8, p-value = 0.3359
##
## Model df: 6. Total lags used: 14
Once again, Holt-Winters gives the lowest test-set RMSE and is our pick for ATM2, although, as with ATM1, only the ARIMA residuals pass the Ljung-Box test.
res <- data.frame(rbind(snaive_rmse,hws_rmse, arima_rmse))
colnames(res) <- "RMSE"
rownames(res) <- c('Seasonal Naive','Holt-Winters Seasonal', 'ARIMA')
datatable(res) %>% formatRound('RMSE',3)
train_perc = 0.9
train <- head(atm4_ts, round(length(atm4_ts) * train_perc))
h <- length(atm4_ts) - length(train)
test <- tail(atm4_ts, h)
# Seasonal Naive
fit <- snaive(train, length(test))
fc <- forecast(fit, h = length(test))
snaive_rmse <- rmse(test,fc$mean)
autoplot(train,main='Seasonal Naive') +
autolayer(fc,series='Naive') +
autolayer(test,series='Actual')
fit[['model']]
## Call: snaive(y = train, h = length(test))
##
## Residual sd: 463.3882
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 97.738, df = 14, p-value = 1.288e-14
##
## Model df: 0. Total lags used: 14
# Holt-Winters seasonal method
fit <- hw(train,seasonal="additive",length(test))
fc <- forecast(fit, h = length(test))
hws_rmse <- rmse(test,fc$mean)
autoplot(train,main='Holt-Winters seasonal') +
autolayer(fc,series='HW seasonal') +
autolayer(test,series='Actual')
fit[['model']]
## Holt-Winters' additive method
##
## Call:
## hw(y = train, h = length(test), seasonal = "additive")
##
## Smoothing parameters:
## alpha = 0.0385
## beta = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 581.6897
## b = -0.1469
## s = -311.4446 -54.6314 51.8001 41.8296 96.1871 57.9326
## 118.3266
##
## sigma: 340.5622
##
## AIC AICc BIC
## 5737.792 5738.782 5783.308
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 14.026, df = 3, p-value = 0.00287
##
## Model df: 11. Total lags used: 14
# ARIMA
fit <- auto.arima(train, seasonal=T, stepwise=F, approximation=F)
fc <- forecast(fit, h = length(test))
arima_rmse <- rmse(test,fc$mean)
autoplot(train,main='ARIMA') +
autolayer(fc,series='ARIMA') +
autolayer(test,series='Actual')
summary(fit)
## Series: train
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
##
## Coefficients:
## sar1 sar2 mean
## 0.1506 0.0888 447.6496
## s.e. 0.0552 0.0555 25.2657
##
## sigma^2 estimated as 124501: log likelihood=-2388.12
## AIC=4784.23 AICc=4784.36 BIC=4799.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0746234 351.2298 290.8324 -508.0236 540.5876 0.8223528
## ACF1
## Training set 0.07818136
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.022, df = 11, p-value = 0.1815
##
## Model df: 3. Total lags used: 14
This time, auto.arima() generates the best model for ATM4, with the lowest test-set RMSE and residuals that pass the Ljung-Box test.
res <- data.frame(rbind(snaive_rmse,hws_rmse, arima_rmse))
colnames(res) <- "RMSE"
rownames(res) <- c('Seasonal Naive','Holt-Winters Seasonal', 'ARIMA')
datatable(res) %>% formatRound('RMSE',3)
# Retrain the winning models on the full series and forecast the 31 days of May 2010
atm1_pred <- hw(atm1_ts,seasonal="additive",h=31,level=95)
atm2_pred <- hw(atm2_ts,seasonal="additive",h=31,level=95)
# ATM3 has only three real observations, so forecast with their mean
atm3_pred <- meanf(ts(tail(df$ATM3,3)),h=31,level=95)
# Refit the ARIMA(0,0,0)(2,0,0)[7] specification selected above on the full series
atm4_fit <- Arima(atm4_ts, order = c(0,0,0), seasonal = list(order=c(2,0,0),period=7))
atm4_pred <- forecast(atm4_fit, h=31,level=95)
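Before assembling the output file, it is worth eyeballing the four forecasts together; a minimal sketch (assuming the gridExtra package is available; any faceting approach would do):
# Plot the four May 2010 forecasts side by side as a visual sanity check.
library(gridExtra)
grid.arrange(autoplot(atm1_pred), autoplot(atm2_pred),
             autoplot(atm3_pred), autoplot(atm4_pred), ncol = 2)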
values = seq(from = as.Date("2010-05-01"), to = as.Date("2010-05-31"), by = 'day')
part_a_pred = tibble("Date"=values,
"ATM1"=as.numeric(atm1_pred$mean),
"ATM2"=as.numeric(atm2_pred$mean),
"ATM3"=as.numeric(atm3_pred$mean),
"ATM4"=as.numeric(atm4_pred$mean))
write_csv(part_a_pred,'Part A predictions.csv')
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
df<-read_excel("/Users/albina/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
head(df)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 192 obs. of 3 variables:
## $ CaseSequence: num 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num 6862583 5838198 5420658 5010364 4665377 ...
datatable(describe(df))
myts <- ts(df$KWH,frequency=12, start=c(1998,1))
plotNA.distribution(myts)
ggtsdisplay(myts)
ggseasonplot(myts, polar=TRUE)
ggsubseriesplot(myts)
The data show a distinctive seasonal pattern and a slight upward trend towards the end of the series. Usage peaks twice a year, which is logical given that people tend to use more electricity in colder months to heat their homes and in hotter months to cool them.
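The twin peaks can be confirmed numerically by averaging usage within each calendar month; a quick sketch:
# Mean KWH by calendar month (1 = January); high values in both winter and
# summer months would confirm the heating/cooling interpretation.
round(tapply(as.numeric(myts), cycle(myts), mean, na.rm = TRUE))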
There is one missing value that we will impute with the imputeTS package, which takes seasonality into account. There is also what appears to be an outlier (the July 2010 drop) that is hard to explain without additional context about the data. Although it could be a genuine observation, say due to an outage, we will assume it is erroneous and replace it with the mean of the series.
myts <- na_seadec(myts)
# Find the value with the largest difference from the sample mean (a likely outlier) and replace it with the mean of the series.
outlier(myts, opposite = FALSE, logical = FALSE)
## [1] 770523
myts <- rm.outlier(myts, fill = TRUE, median = FALSE, opposite = FALSE)
autoplot(decompose(myts))
As in Part A, we will train the seasonal naive, Holt-Winters, and ARIMA models on 90% of the data and select the best performer based on test-set RMSE.
train_perc = 0.9
train <- head(myts, round(length(myts) * train_perc))
h <- length(myts) - length(train)
test <- tail(myts, h)
autoplot(myts) +
autolayer(train, series="Training") +
autolayer(test, series="Test") +
theme(legend.title=element_blank())
# Seasonal Naive
fit <- snaive(train, length(test))
fc <- forecast(fit, h = length(test))
snaive_test_rmse <- rmse(test,fc$mean)
autoplot(train,main='Seasonal Naive') +
autolayer(fc,series='Naive') +
autolayer(test,series='Test')
fit[['model']]
## Call: snaive(y = train, h = length(test))
##
## Residual sd: 810850.2641
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 84.596, df = 24, p-value = 1.109e-08
##
## Model df: 0. Total lags used: 24
# Holt-Winters seasonal method
fit <- hw(train,seasonal="additive",damped=TRUE, length(test))
fc <- forecast(fit, h = length(test))
hws_test_rmse <- rmse(test,fc$mean)
autoplot(train,main='Holt-Winters seasonal') +
autolayer(fc,series='HW seasonal') +
autolayer(test,series='Test')
fit[['model']]
## Damped Holt-Winters' additive method
##
## Call:
## hw(y = train, h = length(test), seasonal = "additive", damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.2779
## beta = 1e-04
## gamma = 1e-04
## phi = 0.9775
##
## Initial states:
## l = 6185694.4435
## b = 3949.0468
## s = -480375.8 -1614156 -774013 1219060 1713068 1221644
## -92928.94 -1463345 -1162787 -455988.5 511465.3 1378358
##
## sigma: 628778.8
##
## AIC AICc BIC
## 5529.256 5533.698 5586.015
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 40.038, df = 7, p-value = 1.238e-06
##
## Model df: 17. Total lags used: 24
# ARIMA
fit <- auto.arima(train, seasonal=T, stepwise=F, approximation=F)
fc <- forecast(fit, h = length(test))
arima_test_rmse <- rmse(test,fc$mean)
autoplot(train,main='ARIMA') +
autolayer(fc,series='ARIMA') +
autolayer(test,series='Test')
summary(fc)
##
## Forecast method: ARIMA(0,0,3)(2,1,0)[12] with drift
##
## Model Information:
## Series: train
## ARIMA(0,0,3)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.3571 0.0459 0.2284 -0.8077 -0.4973 7737.812
## s.e. 0.0793 0.0827 0.0691 0.0832 0.0828 2814.255
##
## sigma^2 estimated as 3.393e+11: log likelihood=-2368.33
## AIC=4750.66 AICc=4751.39 BIC=4772.23
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -8651.303 551363.7 415191.4 -0.7681296 6.514558 0.6712225
## ACF1
## Training set 0.009119741
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jun 2012 6825450 6078947 7571953 5683772 7967128
## Jul 2012 8009982 7217302 8802663 6797682 9222283
## Aug 2012 8923027 8129607 9716446 7709596 10136457
## Sep 2012 8132151 7320615 8943688 6891013 9373290
## Oct 2012 5883549 5072013 6695086 4642411 7124688
## Nov 2012 5543163 4731626 6354700 4302025 6784301
## Dec 2012 7240544 6429007 8052081 5999406 8481682
## Jan 2013 9222103 8410567 10033640 7980965 10463242
## Feb 2013 8677864 7866327 9489401 7436726 9919003
## Mar 2013 7063787 6252250 7875324 5822649 8304925
## Apr 2013 5922271 5110734 6733808 4681133 7163409
## May 2013 5481729 4670192 6293266 4240590 6722867
## Jun 2013 7351526 6527389 8175662 6091118 8611934
## Jul 2013 8137796 7312066 8963526 6874951 9400641
## Aug 2013 9069467 8243711 9895223 7806582 10332352
## Sep 2013 8442535 7616128 9268942 7178655 9706415
## Oct 2013 6006991 5180585 6833398 4743111 7270872
## Nov 2013 5577603 4751196 6404010 4313723 6841484
## Dec 2013 7234009 6407602 8060416 5970129 8497889
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 13.305, df = 18, p-value = 0.7731
##
## Model df: 6. Total lags used: 24
This time, auto.arima() generates the best model; it has the lowest test-set RMSE and is the only model whose residuals pass the Ljung-Box test (p = 0.7731).
res <- data.frame(rbind(snaive_test_rmse,hws_test_rmse, arima_test_rmse))
colnames(res) <- "RMSE"
rownames(res) <- c('Seasonal Naive','Holt-Winters Seasonal', 'ARIMA')
datatable(res) %>% formatRound('RMSE',3)
# Refit the ARIMA(0,0,3)(2,1,0)[12] with drift specification selected above on the full series
myts_fit <- Arima(myts, order = c(0,0,3), seasonal = list(order=c(2,1,0),period=12), include.drift = TRUE)
myts_pred <- forecast(myts_fit, h=12,level=95)
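A final plot of the 2014 forecast against the historical series serves as a sanity check before export; a minimal sketch:
# Overlay the 12-month forecast and its 95% interval on the full history.
autoplot(myts_pred) + labs(y = "KWH")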
values = seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-31"), by = 'month')
part_b_pred = tibble("Date"=values,
"power-usage"=as.numeric(myts_pred$mean))
write_csv(part_b_pred,'Part B predictions.csv')