Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4
different ATM machines for May 2010. The data is given in a single file.
The variable ‘Cash’ is provided in hundreds of dollars, other than that
it is straight forward. I am being somewhat ambiguous on purpose to make
this have a little more business feeling.
Explain and demonstrate your process, techniques used and not used, and
your actual forecast. I am giving you data via an excel file, please
provide your written report on your findings, visuals, discussion and
your R code via an RPubs link along with the actual .rmd file. Also please
submit the forecast, which you will put in an Excel-readable file.
Import and Clean Data
First I need to import the data from Excel. By opening the file in Excel, formatting the DATE column as a date (5/1/2009), removing the time format, and saving the file, I am able to import the dates correctly.
We have 4 ATMs, and each ATM needs to be moved to a separate column. I see a few issues in the data that need to be dealt with:
- there are NA's in the ATM field that need to be removed
- ATM3 has almost no values; it is mostly zeros
- there is an outlier in ATM4's data that is far above the normal range
- there are 3 NA's in the Cash field for ATM1 and 2 for ATM2
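The read step itself is not shown in the chunks below; a minimal sketch, assuming the reformatted workbook sits in the working directory and the readxl/tidyverse/forecast stack is used throughout this report:
library(tidyverse)   # dplyr, tidyr, ggplot2, readr
library(lubridate)   # date handling
library(readxl)      # Excel import (assumed reader)
library(forecast)    # hw(), ets(), auto.arima(), ndiffs(), etc.
library(knitr)       # kable()
# Read the reformatted workbook (filename assumed)
atm <- read_excel("ATM624Data.xlsx")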
# Format DATE column
atm$DATE <- lubridate::ymd(atm$DATE)
# Remove empty rows with no data
atm <- atm[complete.cases(atm), ]
# Plot data
ggplot(atm, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales="free") +
labs(title="ATM Withdrawals", y="Hundreds of dollars", x="") +
theme(panel.background = element_blank())
# Pivot each ATM into its own column
atm <- atm %>% pivot_wider(names_from = ATM, values_from = Cash)
atm %>% select(-DATE) %>% summary()
ATM1 ATM2 ATM3 ATM4
Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 1.563
1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.: 124.334
Median : 91.00 Median : 67.00 Median : 0.0000 Median : 403.839
Mean : 83.89 Mean : 62.58 Mean : 0.7206 Mean : 474.043
3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 704.507
Max. :180.00 Max. :147.00 Max. :96.0000 Max. :10919.762
NA's :3 NA's :2
# Replace NA's and Outlier
atm$ATM1 <- atm$ATM1 %>% replace(is.na(.), median(atm$ATM1, na.rm = TRUE))
atm$ATM2 <- atm$ATM2 %>% replace(is.na(.), median(atm$ATM2, na.rm = TRUE))
atm$ATM4[atm$ATM4==max(atm$ATM4)] <- median(atm$ATM4, na.rm = TRUE)
The NA’s in the ATM field turned out to be the dates that will be forecast with no data in the other columns, so they were removed.
The NA’s in ATM1 and ATM2’s data were imputed using the median since the number of missing values was so small.
ATM4's outlier was also replaced with the median: it was far above the norm and the only abnormal point in the entire ATM4 series, so it seemed likely to be an error.
Cleaned up data
# Plot and summarize again after cleaning up
atm2 <- atm %>% pivot_longer(cols = starts_with("ATM"),
names_to = "ATM", values_to = "Cash")
ggplot(atm2, aes(DATE, Cash)) + geom_line() + facet_grid(ATM~., scales="free") +
labs(title="ATM Withdrawals", y="Hundreds of dollars", x="") +
theme(panel.background = element_blank())
ATM1 ATM2 ATM3 ATM4
Min. : 1.00 Min. : 0.0 Min. : 0.0000 Min. : 1.563
1st Qu.: 73.00 1st Qu.: 26.0 1st Qu.: 0.0000 1st Qu.: 124.334
Median : 91.00 Median : 67.0 Median : 0.0000 Median : 403.839
Mean : 83.95 Mean : 62.6 Mean : 0.7206 Mean : 445.233
3rd Qu.:108.00 3rd Qu.: 93.0 3rd Qu.: 0.0000 3rd Qu.: 704.192
Max. :180.00 Max. :147.0 Max. :96.0000 Max. :1712.075
ATMs 1 and 2 appear to have some weekly seasonality. ATM3, with only 3 data points, does not have enough data for a proper forecast, so a simple mean will be used. ATM4 appears to be white noise, but there may still be some seasonality; we will investigate further to determine this. There is no apparent trend in any of the series.
Convert to Time Series
Time series objects were created for each ATM since there seemed to be some weekly and possibly monthly seasonality in the plots of the data.
atm1.ts <- atm2 %>% filter(ATM=="ATM1") %>% select(Cash) %>%
ts(start = 1, frequency = 7)
atm2.ts <- atm2 %>% filter(ATM=="ATM2") %>% select(Cash) %>%
ts(start = 1, frequency = 7)
atm3.ts <- atm2 %>% filter(ATM=="ATM3") %>% select(Cash) %>%
ts(start = 1, frequency = 2)
atm4.ts <- atm2 %>% filter(ATM=="ATM4") %>% select(Cash) %>%
ts(start = 1, frequency = 7)
ATM1
For each of ATM1, ATM2, and ATM4, I want to try at least a Holt-Winters' additive model with damped trend, since the seasonal variations are roughly constant through the series, as well as ETS and ARIMA models, to see if either results in better performance than the Holt-Winters' model.
Before running any models I will check the ACF and PACF plots, and the ndiffs, nsdiffs, and BoxCox.lambda functions to see what they recommend for differencing and what type of model they suggest might be most appropriate.
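The chunk behind the three values below is hidden in the original; a sketch of what was presumably run (the outputs appear in the same order):
ggtsdisplay(atm1.ts)                   # ACF/PACF inspection
ndiffs(atm1.ts)                        # recommended first-order differences
nsdiffs(atm1.ts)                       # recommended seasonal differences
atm1.lambda <- BoxCox.lambda(atm1.ts)  # used by the models below
atm1.lambda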
[1] 0
[1] 1
[1] 0.2446101
For ATM1 no first-order differencing is recommended, only a first-order seasonal difference and a Box-Cox transformation with λ = 0.24461. Let's plot the data again after these transformations are performed to see what impact they have.
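The transformation chunk is also hidden; presumably something like:
atm1.ts %>% BoxCox(atm1.lambda) %>% diff(lag = 7) %>% ggtsdisplay()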
The plot above shows that most of the seasonality has been eliminated and we are left with an almost stationary time series, although there are still spikes in the ACF plot at lag 7 and in the PACF plot at lags 7, 14, and 21. Adding a first-order difference could remove this remaining autocorrelation.
Holt-Winters
atm1.fit1 <- atm1.ts %>% hw(h=31, seasonal="additive",
damped=TRUE, lambda = atm1.lambda)
autoplot(atm1.fit1) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 2.100449 | 25.21299 | 16.10613 | -92.92647 | 111.5487 | 0.9096064 | 0.1149874 |
Ljung-Box test
data: Residuals from Damped Holt-Winters' additive method
Q* = 19.083, df = 14, p-value = 0.1618
Model df: 0. Total lags used: 14
The residuals plot looks reasonable, and with a Ljung-Box p-value of 0.1618 we cannot reject the null hypothesis that the residuals are white noise, although some autocorrelation was still visible in the plot of the transformed data. The forecast plot also looks plausible, although the prediction intervals extend well past anything seen historically in the data.
ETS
atm1.fit2 <- atm1.ts %>% ets(model="ZZZ", lambda = atm1.lambda)
autoplot(atm1.fit2) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 2.337381 | 25.04745 | 16.1103 | -91.70147 | 110.6442 | 0.9098418 | 0.1213482 |
Ljung-Box test
data: Residuals from ETS(A,N,A)
Q* = 19.696, df = 14, p-value = 0.14
Model df: 0. Total lags used: 14
The ETS model gave almost exactly the same results, with a slightly better RMSE and a very similar Ljung-Box result.
ARIMA
atm1.fit3 <- auto.arima(atm1.ts)
# as.character(atm1.fit3)
autoplot(forecast(atm1.fit3, h=31)) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | -0.0736394 | 23.3332 | 14.60281 | -102.7359 | 117.6027 | 0.8247052 | -0.0081341 |
Ljung-Box test
data: Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 16.789, df = 11, p-value = 0.1143
Model df: 3. Total lags used: 14
The ARIMA model resulted in the best fit, with the lowest RMSE and a Ljung-Box p-value that means we cannot reject the null hypothesis that the residuals are consistent with white noise. The forecast plot also looks like a more reasonable estimate of what we can expect based on the historical data.
results <- data.frame(rbind(accuracy(atm1.fit1), accuracy(atm1.fit2), accuracy(atm1.fit3)))
rownames(results) <- c("Holt-Winter's", "ETS", "ARIMA(0,0,1)(0,1,2)[7]")
kable(results)
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Holt-Winter's | 2.1004493 | 25.21299 | 16.10613 | -92.92647 | 111.5487 | 0.9096064 | 0.1149874 |
| ETS | 2.3373807 | 25.04745 | 16.11030 | -91.70147 | 110.6442 | 0.9098418 | 0.1213482 |
| ARIMA(0,0,1)(0,1,2)[7] | -0.0736394 | 23.33320 | 14.60281 | -102.73591 | 117.6027 | 0.8247052 | -0.0081341 |
atm1.fc <- data.frame(forecast(atm1.fit3, h=31)) %>% remove_rownames()
write_csv(atm1.fc, "ATM1_Forecast.csv")
ATM2
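As with ATM1, the diagnostic chunk is hidden; presumably:
ndiffs(atm2.ts)
nsdiffs(atm2.ts)
atm2.lambda <- BoxCox.lambda(atm2.ts)  # used by the models below
atm2.lambda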
[1] 1
[1] 1
[1] 0.7278107
For ATM2 both a first-order difference and a seasonal difference are recommended, along with a Box-Cox transformation with λ = 0.7278107. Let's plot the data again after these transformations to see what impact they have.
Clear seasonality is visible in the ACF and PACF plots before transformation (above) and after (below). Note that the first-order difference was not applied here because it resulted in a PACF with many spikes outside the critical values.
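A sketch of the presumed transformation plot (Box-Cox plus the seasonal difference only):
atm2.ts %>% BoxCox(atm2.lambda) %>% diff(lag = 7) %>% ggtsdisplay()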
Holt-Winters
atm2.fit1 <- atm2.ts %>% hw(h=31, seasonal="additive", damped=TRUE,
lambda = atm2.lambda)
autoplot(atm2.fit1) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.6284115 | 25.40858 | 17.92711 | -Inf | Inf | 0.8614639 | 0.0199519 |
Ljung-Box test
data: Residuals from Damped Holt-Winters' additive method
Q* = 33.37, df = 14, p-value = 0.002547
Model df: 0. Total lags used: 14
ETS
atm2.fit2 <- atm2.ts %>% ets(model="ZZZ", lambda =atm2.lambda)
autoplot(atm2.fit2) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 0.5166014 | 25.36411 | 17.84184 | -Inf | Inf | 0.8573662 | 0.019386 |
Ljung-Box test
data: Residuals from ETS(A,N,A)
Q* = 33.355, df = 14, p-value = 0.00256
Model df: 0. Total lags used: 14
ARIMA
atm2.fit3 <- auto.arima(atm2.ts, lambda = atm2.lambda)
autoplot(forecast(atm2.fit3, h=31)) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 1.44981 | 24.25163 | 17.12271 | -Inf | Inf | 0.8228094 | 0.0058678 |
Ljung-Box test
data: Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
Q* = 8.859, df = 7, p-value = 0.2629
Model df: 7. Total lags used: 14
ARIMA with first order differencing
Since the ndiffs function recommended first-order differencing but auto.arima did not include it in the model, let's add it manually to see if the model improves.
atm2.fit4 <- Arima(atm2.ts, order=c(3,1,3), seasonal=c(0,1,1), lambda = atm2.lambda)
autoplot(forecast(atm2.fit4, h=31)) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 2.332289 | 28.9457 | 20.52568 | NaN | Inf | 0.7141976 | -0.0447085 |
Ljung-Box test
data: Residuals from ARIMA(3,1,3)(0,1,1)[7]
Q* = 24.274, df = 7, p-value = 0.001019
Model df: 7. Total lags used: 14
results <- data.frame(rbind(accuracy(atm2.fit1), accuracy(atm2.fit2),
accuracy(atm2.fit3),accuracy(atm2.fit4)))
rownames(results) <- c("Holt-Winter's", "ETS", "ARIMA(3,0,3)(0,1,1)[7]",
"ARIMA(3,1,3)(0,1,1)[7]")
kable(results)
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Holt-Winter's | 0.6284115 | 25.40858 | 17.92711 | -Inf | Inf | 0.8614639 | 0.0199519 |
| ETS | 0.5166014 | 25.36411 | 17.84184 | -Inf | Inf | 0.8573662 | 0.0193860 |
| ARIMA(3,0,3)(0,1,1)[7] | 1.4498098 | 24.25163 | 17.12271 | -Inf | Inf | 0.8228094 | 0.0058678 |
| ARIMA(3,1,3)(0,1,1)[7] | 2.3322890 | 28.94570 | 20.52568 | NaN | Inf | 0.7141976 | -0.0447085 |
The auto.arima model gave the best overall results, with the lowest RMSE and residuals consistent with white noise (the Holt-Winters, ETS, and manually differenced ARIMA models all fail the Ljung-Box test), so that model will be used for predictions.
atm2.fc <- data.frame(forecast(atm2.fit3, h=31)) %>% remove_rownames()
write_csv(atm2.fc, "ATM2_Forecast.csv")
ATM3
As mentioned earlier, there just is not enough data to make a good forecast for ATM3; a simple mean will be used until more data can be collected.
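The chunk defining atm3.fit is not shown; a minimal sketch, assuming the mean is taken over only the three non-zero observations and fit as an intercept-only ARIMA(0,0,0), which is just the sample mean and keeps the forecast() call below working:
# Mean-only model on the three genuine observations (assumption)
atm3.fit <- Arima(ts(atm3.ts[atm3.ts > 0]), order = c(0, 0, 0))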
atm3.fc <- data.frame(forecast(atm3.fit, h=31)) %>% remove_rownames()
write_csv(atm3.fc, "ATM3_Forecast.csv")
ATM4
The same procedure is followed for ATM4 as for ATM1 and ATM2.
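Again the diagnostic chunk is hidden; presumably:
ndiffs(atm4.ts)
nsdiffs(atm4.ts)
atm4.lambda <- BoxCox.lambda(atm4.ts)  # used by the models below
atm4.lambda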
[1] 0
[1] 0
[1] 0.4525697
Here no first-order or seasonal differencing was recommended, but a Box-Cox transformation with λ = 0.4525697 was. Looking at the plots, however, it is not clear that the Box-Cox transformation improved the stationarity of the data; seasonal spikes are still apparent.
Holt-Winters
atm4.fit1 <- atm4.ts %>% hw(h=31, seasonal="additive", damped=TRUE,
lambda = atm4.lambda)
autoplot(atm4.fit1) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 72.46451 | 340.8815 | 261.8729 | -369.0055 | 413.0543 | 0.7559486 | 0.1006656 |
Ljung-Box test
data: Residuals from Damped Holt-Winters' additive method
Q* = 18.947, df = 14, p-value = 0.167
Model df: 0. Total lags used: 14
ETS
atm4.fit2 <- atm4.ts %>% ets(model="ZZZ", lambda = atm4.lambda)
autoplot(atm4.fit2) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 67.23932 | 338.2241 | 259.7873 | -376.9491 | 420.7907 | 0.7499281 | 0.0973636 |
Ljung-Box test
data: Residuals from ETS(A,N,A)
Q* = 19.488, df = 14, p-value = 0.1471
Model df: 0. Total lags used: 14
ARIMA
atm4.fit3 <- auto.arima(atm4.ts, seasonal = TRUE, lambda = atm4.lambda)
autoplot(forecast(atm4.fit3, h=31)) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 84.4142 | 351.8346 | 274.3411 | -370.6976 | 415.1683 | 0.7919408 | 0.019962 |
Ljung-Box test
data: Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
Q* = 15.872, df = 11, p-value = 0.1459
Model df: 3. Total lags used: 14
Since seasonal spikes remained, let's also try manually adding a seasonal difference, with a much higher seasonal AR order:
atm4.fit4 <- Arima(atm4.ts, order=c(0,0,1),seasonal=c(14,1,0),
lambda = atm4.lambda)
autoplot(forecast(atm4.fit4, h=31)) + theme(panel.background = element_blank())
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 58.30159 | 334.008 | 251.7076 | -339.9261 | 382.9995 | 0.7266044 | 0.0143578 |
Ljung-Box test
data: Residuals from ARIMA(0,0,1)(14,1,0)[7]
Q* = 16.453, df = 3, p-value = 0.0009157
Model df: 15. Total lags used: 18
results <- data.frame(rbind(accuracy(atm4.fit1), accuracy(atm4.fit2),
accuracy(atm4.fit3),accuracy(atm4.fit4)))
rownames(results) <- c("Holt-Winter's", "ETS", "ARIMA(0,0,1)(2,0,0)[7]",
"ARIMA(0,0,1)(14,1,0)[7]")
kable(results)
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Holt-Winter's | 72.46451 | 340.8815 | 261.8729 | -369.0055 | 413.0543 | 0.7559486 | 0.1006656 |
| ETS | 67.23932 | 338.2241 | 259.7873 | -376.9491 | 420.7907 | 0.7499281 | 0.0973636 |
| ARIMA(0,0,1)(2,0,0)[7] | 84.41420 | 351.8346 | 274.3411 | -370.6976 | 415.1683 | 0.7919408 | 0.0199620 |
| ARIMA(0,0,1)(14,1,0)[7] | 58.30159 | 334.0080 | 251.7076 | -339.9261 | 382.9995 | 0.7266044 | 0.0143578 |
Although the manually specified ARIMA model eventually reached slightly better error measures, its residuals fail the Ljung-Box test, and the much less complex ETS model is preferred since the improvement in performance is minimal.
atm4.fc <- data.frame(forecast(atm4.fit2, h=31)) %>% remove_rownames()
write_csv(atm4.fc, "ATM4_Forecast.csv")
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for
January 1998 until December 2013. Your assignment is to model these data
and create a monthly forecast for 2014. The data is given in a single
file. The variable ‘KWH’ is power consumption in Kilowatt hours, the
rest is straight forward.
Download and Load Excel files
Load the Excel file into R and clean it using the function 'tsclean()', which can handle both outliers and NA values.
kwh <- import("c:/Users/Bikash_Bhowmik/downloads/ResidentialCustomerForecastLoad-624.xlsx")
head(kwh)
CaseSequence YYYY-MMM KWH
1 733 1998-Jan 6862583
2 734 1998-Feb 5838198
3 735 1998-Mar 5420658
4 736 1998-Apr 5010364
5 737 1998-May 4665377
6 738 1998-Jun 6467147
Jan Feb Mar Apr May Jun Jul Aug
1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428
1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416 7564391
2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662 7517830
2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129 8450717
2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702 8058748
2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676 8476499
2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931 7309774
2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998 7786659
2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564 8241110
2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220 7447146
2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987 8037137
2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260 8350517
2010 9397357 8390677 7347915 5776131 4919289 6696292 7686742 7922701
2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343 10308076
2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851 9612423
2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321 9080226
Sep Oct Nov Dec
1998 6989888 6345620 4640410 4693479
1999 7899368 5358314 4436269 4419229
2000 8912169 5844352 5041769 6220334
2001 7112069 5242535 4461979 5240995
2002 8245227 5865014 4908979 5779958
2003 7791791 5344613 4913707 5756193
2004 6690366 5444948 4824940 5791208
2005 7057213 6694523 4313019 6181548
2006 7296355 5104799 4458429 6226214
2007 7666970 5785964 4907057 6047292
2008 7436335 5101803 4555602 6442746
2009 7583146 5566075 5339890 7089880
2010 7819472 5875917 4800733 6152583
2011 8943599 5603920 6154138 8273142
2012 7559148 5576852 5731899 6609694
2013 7968220 5759367 5769083 6987874
summary() of the raw KWH series:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
770523 5429912 6283324 6502475 7620524 10655730 1
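The cleaning chunk is hidden; a sketch, assuming a monthly series starting January 1998 and forecast::tsclean() to handle the NA and the low outlier:
kwh_ts <- ts(kwh$KWH, start = c(1998, 1), frequency = 12) %>% tsclean()
summary(kwh_ts)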
Min. 1st Qu. Median Mean 3rd Qu. Max.
4313019 5443502 6351262 6529723 7608792 10655730
Timeseries
Plotting the monthly series:
autoplot(kwh_ts) +
ggtitle("Monthly Power Usage", subtitle = "From Jan 1998 to Dec 2013 (KWH)") +
xlab("Month") +
ylab("Kilowatt hours (KWH)")
Seasonality is present in this time series, with peaks roughly every 6 months. The underlying pattern is likely annual, driven by high power usage during both winter and summer.
ARIMA MODEL
The plot below highlights the annual seasonality.
ggtsdisplay(kwh_ts, points = FALSE, main="Monthly Residential Power Usage (From Jan 1998 to Dec 2013) - (KWH)", xlab = "Years", ylab = "Power Usage (kwh)")
Apply the BoxCox transformation:
kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")
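The unit-root chunk is hidden; presumably urca::ur.kpss applied to the seasonally differenced, transformed series:
library(urca)
kwh_bc %>% diff(lag = 12) %>% ur.kpss() %>% summary()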
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 4 lags.
Value of test-statistic is: 0.1159
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
The time series now appears to be stationary. In an ARIMA(p,d,q)(P,D,Q)[m] model, p/P are the orders of the autoregressive parts, d/D the degrees of differencing, q/Q the orders of the moving-average parts, and m is the number of observations per seasonal period (here m = 12). The significant non-seasonal spike at lag 1 in both the ACF and PACF suggests non-seasonal p = q = 1.
The seasonal spikes at lag 12 in the ACF and at lags 12 and 24 in the PACF suggest seasonal AR and MA behaviour; since the seasonal ACF decays gradually, I choose P = 0 and Q = 2. I will try ARIMA(1,0,1)(0,1,2)[12].
kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
Series: kwh_ts
ARIMA(1,0,1)(0,1,2)[12]
Box Cox transformation: lambda= -0.1454175
Coefficients:
ar1 ma1 sma1 sma2
0.9475 -0.8014 -0.8016 0.1738
s.e. 0.0838 0.1802 0.0933 0.0912
sigma^2 = 8.786e-05: log likelihood = 585.93
AIC=-1161.86 AICc=-1161.51 BIC=-1145.89
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 86012.58 597771.6 463632.2 0.8088745 7.00649 0.7783202 0.1518374
Ljung-Box test
data: Residuals from ARIMA(1,0,1)(0,1,2)[12]
Q* = 24.322, df = 20, p-value = 0.2286
Model df: 4. Total lags used: 24
Let's use auto.arima to find the best ARIMA model:
kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)
Ljung-Box test
data: Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
Q* = 25.402, df = 22, p-value = 0.2782
Model df: 2. Total lags used: 24
Series: kwh_ts
ARIMA(1,0,0)(0,1,1)[12] with drift
Box Cox transformation: lambda= -0.1454175
Coefficients:
ar1 sma1 drift
0.2895 -0.7353 1e-04
s.e. 0.0724 0.0700 1e-04
sigma^2 = 8.429e-05: log likelihood = 588.5
AIC=-1169 AICc=-1168.77 BIC=-1156.23
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 20961.31 580774.8 445952.2 -0.2249688 6.812064 0.7486401
ACF1
Training set 0.05258332
The ARIMA model found by 'auto.arima' is ARIMA(1,0,0)(0,1,1)[12]; the model I suggested is ARIMA(1,0,1)(0,1,2)[12]. According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and similar Ljung-Box p-values. Either could be applied, but ARIMA(1,0,0)(0,1,1)[12] is preferred because it has the smaller AICc.
Forecast
kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts))
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
Point Forecast Lo 95 Hi 95
Jan 2014 9396883 7762540 11437614
Feb 2014 7887475 6475039 9664189
Mar 2014 6435175 5306104 7848082
Apr 2014 5756155 4760036 6998419
May 2014 5570691 4610683 6766688
Jun 2014 7479517 6140217 9164205
Jul 2014 8521791 6969852 10482565
Aug 2014 9080655 7413299 11191835
Sep 2014 7911892 6484811 9710387
Oct 2014 5661786 4684004 6880566
Nov 2014 5568723 4609087 6764249
Dec 2014 6904666 5681062 8439739
ggtsdisplay(resid(kwh_model), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1)[12] of power consumption", xlab = "Year", ylab = "Residual")
autoplot(kwh_f) +
labs(title = "Energy Forecast for 2014", x = "Month", y = "kWh") +
theme(legend.position = "none")
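The export chunk is not shown; presumably the same pattern as in Part A:
kwh_fc <- data.frame(kwh_f) %>% remove_rownames()
write_csv(kwh_fc, "kwh_Forecast.csv")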
The forecast data is saved in the file 'kwh_Forecast.csv'.
The forecast looks reasonable: it captures the seasonality of increased demand in summer and winter, while staying between the peaks and troughs of more recent years.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Load the excel file:
wfp1 <- import("c:/Users/Bikash_Bhowmik/downloads/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("c:/Users/Bikash_Bhowmik/downloads/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow")
colnames(wfp2) <- c("DateTime", "WaterFlow")
# wfp1 has many readings per hour; label each reading with the end of the
# hour that contains it (hour + 1), then average within each hour
wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>%
group_by(Date, Time) %>%
summarise(Water=mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime=ymd_h(paste(Date,Time))) %>%
select(DateTime,Water)
wfp1
# A tibble: 236 × 2
DateTime Water
<dttm> <dbl>
1 2015-10-23 01:00:00 26.1
2 2015-10-23 02:00:00 18.9
3 2015-10-23 03:00:00 15.2
4 2015-10-23 04:00:00 23.1
5 2015-10-23 05:00:00 15.5
6 2015-10-23 06:00:00 22.7
7 2015-10-23 07:00:00 20.6
8 2015-10-23 08:00:00 18.4
9 2015-10-23 09:00:00 20.8
10 2015-10-23 10:00:00 21.2
# ℹ 226 more rows
# wfp2 already has one reading per hour, but aggregate the same way for safety
wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
group_by(Date, Time) %>%
summarise(Water=mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime=ymd_h(paste(Date,Time))) %>%
select(DateTime, Water)
wfp2
# A tibble: 1,000 × 2
DateTime Water
<dttm> <dbl>
1 2015-10-23 01:00:00 18.8
2 2015-10-23 02:00:00 43.1
3 2015-10-23 03:00:00 38.0
4 2015-10-23 04:00:00 36.1
5 2015-10-23 05:00:00 31.9
6 2015-10-23 06:00:00 28.2
7 2015-10-23 07:00:00 9.86
8 2015-10-23 08:00:00 26.7
9 2015-10-23 09:00:00 55.8
10 2015-10-23 10:00:00 54.2
# ℹ 990 more rows
Timeseries
Combining the two waterflows into one:
water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
mutate(Water = Water_1 + Water_2) %>%
select(DateTime, Water)
water
# A tibble: 1,000 × 2
DateTime Water
<dttm> <dbl>
1 2015-10-23 01:00:00 44.9
2 2015-10-23 02:00:00 61.9
3 2015-10-23 03:00:00 53.1
4 2015-10-23 04:00:00 59.2
5 2015-10-23 05:00:00 47.3
6 2015-10-23 06:00:00 51.0
7 2015-10-23 07:00:00 30.5
8 2015-10-23 08:00:00 45.0
9 2015-10-23 09:00:00 76.6
10 2015-10-23 10:00:00 75.4
# ℹ 990 more rows
water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())
We cannot see significant seasonality in 'water_ts'; however, there is a slight decreasing trend, so the series is non-stationary.
ARIMA Model
Box-Cox transformation:
water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="Water_ts with BoxCox Transformation")
Trend differencing appears to be needed; ndiffs and nsdiffs below confirm this, and a KPSS test on the differenced series checks stationarity.
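The diagnostic chunk is hidden; a presumed sketch:
ndiffs(water_bc)
nsdiffs(water_bc)
water_bc %>% diff() %>% ur.kpss() %>% summary()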
[1] 1
[1] 0
#######################
# KPSS Unit Root Test #
#######################
Test is of type: mu with 7 lags.
Value of test-statistic is: 0.0091
Critical value for a significance level of:
10pct 5pct 2.5pct 1pct
critical values 0.347 0.463 0.574 0.739
The time series now appears to be stationary. No seasonal differencing was needed, so D = 0, while one first difference was applied, so d = 1. There is one seasonal spike in the ACF, suggesting Q = 1, and one non-seasonal spike at lag 1 in the ACF, suggesting q = 1. I will try ARIMA(0,1,1)(0,0,1)[24].
water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
Series: water_ts
ARIMA(0,1,1)(0,0,1)[24]
Box Cox transformation: lambda= 0.8713037
Coefficients:
ma1 sma1
-0.9578 0.0712
s.e. 0.0101 0.0319
sigma^2 = 103.3: log likelihood = -3734.4
AIC=7474.8 AICc=7474.83 BIC=7489.52
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
ACF1
Training set -0.04444759
Ljung-Box test
data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
Q* = 57.881, df = 46, p-value = 0.1124
Model df: 2. Total lags used: 48
Now, let’s use auto.arima to find out which is the best Arima model:
water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
Series: water_ts
ARIMA(0,1,1)(1,0,0)[24]
Box Cox transformation: lambda= 0.8713037
Coefficients:
ma1 sar1
-0.9578 0.0714
s.e. 0.0101 0.0322
sigma^2 = 103.3: log likelihood = -3734.4
AIC=7474.8 AICc=7474.82 BIC=7489.52
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
ACF1
Training set -0.04431537
Ljung-Box test
data: Residuals from ARIMA(0,1,1)(1,0,0)[24]
Q* = 57.858, df = 46, p-value = 0.1128
Model df: 2. Total lags used: 48
Forecast
I forecast one week of water use: 7 days × 24 hours = 168 hourly values.
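water_model is not defined in any chunk shown above; given the plot subtitle, it is presumably the auto.arima fit:
water_model <- water_auto  # ARIMA(0,1,1)(1,0,0)[24]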
water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
labs(title="Water Usage Forecast", subtitle = "ARIMA(0,1,1)(1,0,0)[24]", x="Day")
export <- water_f$mean
write.csv(export, "Waterflow_Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df
Point.Forecast
1 44.13862
2 45.16903
3 46.07068
4 44.27144
5 45.44000
6 44.89509
7 42.95309
8 44.67901
9 46.80433
10 44.92962
11 46.40268
12 45.53528
13 45.54120
14 43.38716
15 44.16368
16 42.46275
17 43.26947
18 44.53708
19 46.52640
20 43.60654
21 46.10964
22 44.44325
23 43.74894
24 46.10043
(rows 25 through 168 omitted: with no strong seasonal signal, the point forecasts converge to a constant 44.55945; all 168 hourly values are in 'Waterflow_Forecast.csv')
The forecast data is saved in the file 'Waterflow_Forecast.csv'.