In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
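The report relies on a handful of packages; the setup chunk below is reconstructed from the calls used throughout (the original did not show one):
library(data.table) # fread
library(dplyr)      # pipes and data-wrangling verbs
library(lubridate)  # date parsing: mdy_hms, mdy_hm, ymd_h
library(ggplot2)    # autoplot, ggtitle, ylab
library(forecast)   # ts models, forecasts, accuracy, checkresiduals
library(VIM)        # aggr() missing-data plot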
atm <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/atm_data.csv")
table(atm$ATM)
##
##       ATM1 ATM2 ATM3 ATM4
##   14  365  365  365  365
## 14 rows have a blank ATM label, not assigned to any ATM
summary(atm$Cash)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.5 73.0 155.6 114.0 10920.0 19
## zeros indicate no money was withdrawn; there might be various reasons for this (transaction cancelled, machine error). We will retain these records for prediction. Let's check which ATM has the most zeros. Is there a trend?
table(atm$Cash==0, atm$ATM)
##
##             ATM1 ATM2 ATM3 ATM4
##   FALSE   0  362  361    3  365
##   TRUE    0    0    2  362    0
## ATM3 is the culprit: only three days with any cash dispensed (the unlabeled first column is the 14 blank-ATM rows, whose Cash is NA)
max(atm$DATE)
## [1] "9/9/2009 12:00:00 AM"
min(atm$DATE)
## [1] "1/1/2010 12:00:00 AM"
atm %>% filter(!complete.cases(atm))
## DATE ATM Cash
## 1 6/13/2009 12:00:00 AM ATM1 NA
## 2 6/16/2009 12:00:00 AM ATM1 NA
## 3 6/18/2009 12:00:00 AM ATM2 NA
## 4 6/22/2009 12:00:00 AM ATM1 NA
## 5 6/24/2009 12:00:00 AM ATM2 NA
## 6 5/1/2010 12:00:00 AM NA
## 7 5/2/2010 12:00:00 AM NA
## 8 5/3/2010 12:00:00 AM NA
## 9 5/4/2010 12:00:00 AM NA
## 10 5/5/2010 12:00:00 AM NA
## 11 5/6/2010 12:00:00 AM NA
## 12 5/7/2010 12:00:00 AM NA
## 13 5/8/2010 12:00:00 AM NA
## 14 5/9/2010 12:00:00 AM NA
## 15 5/10/2010 12:00:00 AM NA
## 16 5/11/2010 12:00:00 AM NA
## 17 5/12/2010 12:00:00 AM NA
## 18 5/13/2010 12:00:00 AM NA
## 19 5/14/2010 12:00:00 AM NA
# 19 records have missing values in the ATM or Cash column; the last 14 are the blank-ATM placeholder rows for May 1-14, 2010, the period we are asked to forecast
aggr_plot <- aggr(atm, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
                  labels = names(atm), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Cash 0.01289009
## DATE 0.00000000
## ATM 0.00000000
atm <- atm %>% mutate(DATE = as.Date(mdy_hms(DATE))) %>% filter(complete.cases(.))
## missing values dropped; DATE parsed from a character datetime to a Date
## (this leaves 3 missing days in ATM1 and 2 in ATM2, which we accept)
## convert the data frame into a ts for each of the 4 ATMs
## As per Hyndman: "If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25)."
## Also, if the series is not too long, we might want to set the frequency to 7.
## https://robjhyndman.com/hyndsight/dailydata/
# build one weekly-frequency time series per ATM: ts_atm1 ... ts_atm4
for (i in unique(atm$ATM)) {
  temp <- atm %>% filter(ATM == i) %>% select(-ATM, -DATE) %>% ts(frequency = 7)
  assign(paste0("ts_", tolower(i)), temp)
  rm(temp)
}
boxplot(ts_atm1)
## only a few outliers at the upper end; we will keep them in the analysis.
autoplot(ts_atm1)
ggtsdisplay(ts_atm1)
ggseasonplot(ts_atm1, year.labels=TRUE, year.labels.left=TRUE) +
ylab("Cash") +
ggtitle("ATM1")
The variance seems constant throughout the series, so no transformation is required. In the season plot, aside from a few spikes, we can see the weekly seasonal pattern. In addition, the significant lags at 7, 14, and 21 in the ACF and PACF plots reveal the presence of seasonality.
Since seasonality is present in ts_atm1, we will use seasonal differencing and explore the ACF and PACF plots to specify a seasonal ARIMA model.
ts_atm1 %>% diff(lag=7) %>% ggtsdisplay()
The significant spike at lag 1 in both the PACF and ACF suggests a non-seasonal AR(1) or MA(1) component. On the seasonal side, the PACF shows significant lags at 7, 14, and 21, with the spikes at 14 and 21 decreasing and lag 7 the most significant, suggesting a seasonal AR(1); by the same logic on the ACF side, a seasonal MA(1) is plausible. We will therefore try two seasonal ARIMA models:
ARIMA(1,0,0)(1,1,0)[7] and ARIMA(0,0,1)(0,1,1)[7]
a_ar1 <- ts_atm1%>% Arima(order=c(1,0,0), seasonal=c(1,1,0))
a_ar2 <- ts_atm1%>% Arima(order=c(0,0,1), seasonal=c(0,1,1))
a_ar1$aic
## [1] 3354.58
a_ar2$aic
## [1] 3322.631
## additive decomp assumed by stl
decomp <- stl(ts_atm1[, "Cash"], t.window=13,s.window="periodic")
deseasonal <- seasadj(decomp)
autoplot(deseasonal)
We will apply a naive method to the seasonally adjusted data.
stl_fc <- stlf(deseasonal,method='naive', h=31)
accuracy(a_ar1)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.06716054 26.73159 17.05704 -100.19 119.2043 0.9063389
## ACF1
## Training set 0.0106164
accuracy(a_ar2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
## ACF1
## Training set -0.01116544
accuracy(stl_fc)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02165497 26.71567 17.82508 -1038.474 1214.219 0.9471491
## ACF1
## Training set -0.3853073
checkresiduals(a_ar1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,1,0)[7]
## Q* = 38.845, df = 12, p-value = 0.0001116
##
## Model df: 2. Total lags used: 14
checkresiduals(a_ar2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
##
## Model df: 2. Total lags used: 14
checkresiduals(stl_fc)
## Warning in checkresiduals(stl_fc): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 86.331, df = 14, p-value = 1.863e-12
##
## Model df: 0. Total lags used: 14
Model 2, ARIMA(0,0,1)(0,1,1), has the better AIC and RMSE values, and it is the only candidate whose residuals pass the Ljung-Box test; the residuals also appear roughly normally distributed. We will forecast with it.
unloadNamespace("aTSA")  # aTSA masks forecast(); unload it so forecast::forecast() is dispatched
atm1_fc <- forecast(a_ar2, 31)
autoplot(atm1_fc)
write.csv(atm1_fc, "ATM1_forecast.csv")
We will perform similar checks on the time series for ATM2.
boxplot(ts_atm2)
autoplot(ts_atm2)
ggtsdisplay(ts_atm2)
ggseasonplot(ts_atm2, year.labels=TRUE, year.labels.left=TRUE) +
ylab("Cash") +
ggtitle("ATM2")
No outliers are present, and the variance is constant throughout the series, so no transformation is required. Significant lags at 7, 14, and 21 in the PACF reflect seasonality, and the ACF plot reveals non-stationary data.
ndiffs(ts_atm2)
## [1] 1
nsdiffs(ts_atm2)
## [1] 1
## We will apply the seasonal differencing first to see if the ordinary differencing is still required
ts_atm2%>%diff(lag=7)%>%ggtsdisplay()
ts_atm2%>%diff(lag=7)%>%ndiffs()
## [1] 0
##ordinary differencing is not required
##We will select our seasonal arima model from here
We see significant spikes at lag 5 in the ACF and PACF plots, so for the non-seasonal part we can select AR(5) or MA(5). For the seasonal part we select AR(1), due to the significant spike at lag 7 and the decreasing spikes at 14 and 21 (or, by the same logic on the ACF side, a seasonal MA(1)). The two candidates:
ARIMA(5,0,0)(1,1,0)[7]
ARIMA(0,0,5)(0,1,1)[7]
b_ar1 <- ts_atm2%>% Arima(order=c(5,0,0), seasonal=c(1,1,0))
b_ar2 <- ts_atm2%>% Arima(order=c(0,0,5), seasonal=c(0,1,1))
b_ar1$aic
## [1] 3378.869
b_ar2$aic
## [1] 3353.789
Seasonal decomposition, and forecasting with the seasonally adjusted series
## additive decomp assumed by stl
decomp2 <- stl(ts_atm2[, "Cash"], t.window=13,s.window="periodic")
deseasonal2 <- seasadj(decomp2)
autoplot(decomp2)
autoplot(deseasonal2)
We will apply a naive method to the seasonally adjusted data.
stl_fc2 <- stlf(deseasonal2,method='naive', h=31)
accuracy(b_ar1)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.1352166 26.96241 19.07778 -Inf Inf 0.8878028 0.004138287
accuracy(b_ar2)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.6299813 25.97515 18.59492 -Inf Inf 0.8653322 0.0004582112
accuracy(stl_fc2)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.03366952 29.93007 21.3727 1583.463 1756.777 0.9945987
## ACF1
## Training set -0.4833479
checkresiduals(b_ar1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,0,0)(1,1,0)[7]
## Q* = 25.775, df = 8, p-value = 0.001147
##
## Model df: 6. Total lags used: 14
checkresiduals(b_ar2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,5)(0,1,1)[7]
## Q* = 6.4129, df = 8, p-value = 0.6011
##
## Model df: 6. Total lags used: 14
checkresiduals(stl_fc2)
## Warning in checkresiduals(stl_fc2): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 125.09, df = 14, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 14
Based on accuracy, AIC, residuals, and the Ljung-Box test, model 2, ARIMA(0,0,5)(0,1,1), is the best model, and we will forecast the time series with it.
Forecast
atm2_fc <- forecast(b_ar2, 31)
write.csv(atm2_fc, "ATM2_forecast.csv")
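A quick visual check of the 31-day ATM2 forecast (a small addition, following the autoplot convention used elsewhere in this report):
autoplot(atm2_fc)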
ATM3 is degenerate, with only three non-zero observations, so building a forecasting model on it is questionable; mean or median imputation is the realistic option. We first try the average method on the ATM3 series.
atm3_fc<-meanf(ts_atm3, 31)
checkresiduals(atm3_fc)
##
## Ljung-Box test
##
## data: Residuals from Mean
## Q* = 196.67, df = 13, p-value < 2.2e-16
##
## Model df: 1. Total lags used: 14
The mean model is not a good approach here: meanf() averages the entire series, so the 362 zeros drag the forecast toward zero. Instead, we can take the forecast to be the mean of the three non-zero values for ATM3, as sketched below.
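A minimal sketch of that fallback (the May 2010 date sequence and the output file name are our additions):
# keep only the three days on which ATM3 actually dispensed cash
nonzero <- ts_atm3[ts_atm3 > 0]
atm3_flat <- rep(mean(nonzero), 31)  # flat 31-day point forecast for May 2010
write.csv(data.frame(DATE = seq(as.Date("2010-05-01"), by = "day", length.out = 31),
                     Cash = atm3_flat),
          "ATM3_forecast.csv", row.names = FALSE)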
boxplot(ts_atm4)
ts_atm4 <- ts_atm4[-285]  # drop the single extreme outlier (observation 285)
ts_atm4 <- ts(ts_atm4)    # rebuild the series; note this resets the frequency to 1 (non-seasonal)
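An alternative worth noting (a sketch only, not applied here, so the non-seasonal analysis below is unchanged) is to impute the outlier instead of dropping it, keeping frequency 7 so any weekly pattern stays detectable:
# ts_atm4[285] <- median(ts_atm4, na.rm = TRUE)  # replace rather than remove
# ts_atm4 <- ts(ts_atm4, frequency = 7)          # preserve the weekly frequency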
autoplot(ts_atm4)
ggtsdisplay(ts_atm4)
The time series is stationary, so no ordinary differencing is required; the variance is constant, so no transformation is required; and no seasonality is present. There is a significant spike at lag 8, so an AR(8) or MA(8) non-seasonal model could be used, but we will let auto.arima() select the model.
auto.arima(ts_atm4)
## Series: ts_atm4
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.0746 445.3755
## s.e. 0.0522 19.7044
##
## sigma^2 estimated as 123119: log likelihood=-2648.7
## AIC=5303.4 AICc=5303.46 BIC=5315.09
accuracy(auto.arima(ts_atm4))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04498806 349.918 291.6344 -543.2407 576.6388 0.7530519
## ACF1
## Training set 0.0003956452
Since no seasonality is present and there seems to be no specific trend, we can also try a simple exponential smoothing (SES) model.
fc_ses <- ses(ts_atm4, h=31)
accuracy(fc_ses)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1742947 350.9191 292.5632 -550.0211 583.4599 0.7554501
## ACF1
## Training set 0.0751692
checkresiduals(fc_ses)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 24.468, df = 8, p-value = 0.001912
##
## Model df: 2. Total lags used: 10
checkresiduals(auto.arima(ts_atm4))
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 21.413, df = 8, p-value = 0.006129
##
## Model df: 2. Total lags used: 10
Neither model's residuals fully pass the Ljung-Box test, and the two are nearly identical in accuracy; with alpha of roughly 1e-04, the fitted SES effectively reduces to a mean forecast, much like the ARIMA(0,0,1) with non-zero mean. We will use the SES model for the forecast.
Forecast
atm4_fc <- fc_ses  # ses() already returned the 31-step forecast, so no refit is needed
head(fc_ses)
## $model
## Simple exponential smoothing
##
## Call:
## ses(y = ts_atm4, h = 31)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 445.3416
##
## sigma: 351.8871
##
## AIC AICc BIC
## 6419.048 6419.115 6430.740
##
## $mean
## Time Series:
## Start = 365
## End = 395
## Frequency = 1
## [1] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [8] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [15] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [22] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [29] 445.3352 445.3352 445.3352
##
## $level
## [1] 80 95
##
## $x
## Time Series:
## Start = 1
## End = 364
## Frequency = 1
## [1] 777 524 793 908 53 52 55 559 904 879 274 396 275 16
## [15] 852 380 31 492 84 129 14 815 758 601 907 503 88 35
## [29] 338 5 123 150 721 443 17 15 741 1058 576 1484 194 27
## [43] 1191 746 1221 1022 373 321 92 117 202 524 81 64 91 48
## [57] 1026 424 61 540 174 393 42 310 110 682 55 214 738 16
## [71] 16 1050 438 547 858 447 95 644 569 705 572 480 419 28
## [85] 835 911 468 768 1089 267 7 704 495 143 429 895 610 71
## [99] 594 342 735 463 1156 454 283 572 772 260 358 16 334 26
## [113] 255 357 1246 917 592 412 83 996 104 1117 817 914 648 141
## [127] 1495 1301 780 744 200 854 51 7 1061 715 35 492 343 21
## [141] 506 97 474 900 1712 281 26 329 761 629 236 1195 782 108
## [155] 847 576 442 319 154 543 124 449 615 219 946 10 696 8
## [169] 845 213 9 47 30 400 61 428 2 9 50 313 627 6
## [183] 78 338 212 690 596 65 77 44 964 835 637 927 76 44
## [197] 621 313 826 414 194 346 32 655 638 15 300 627 601 64
## [211] 563 317 1167 47 994 687 71 1047 1009 289 592 231 578 70
## [225] 581 149 404 125 230 19 129 328 532 877 662 301 668 15
## [239] 660 511 164 748 174 20 101 986 597 468 857 685 382 152
## [253] 272 135 1105 292 1141 141 9 85 67 710 568 487 17 49
## [267] 357 180 729 261 629 277 41 1575 670 980 426 153 275 136
## [281] 454 458 112 418 42 280 412 853 180 226 989 825 967 734
## [295] 121 288 503 10 258 1170 193 403 1276 820 26 894 361 860
## [309] 381 10 601 32 553 572 219 828 631 339 32 487 335 340
## [323] 291 46 201 10 878 778 708 351 711 503 23 493 405 818
## [337] 152 281 470 3 415 719 812 890 616 61 27 768 326 825
## [351] 384 195 711 30 557 386 165 5 542 404 14 348 44 482
##
## $upper
## Time Series:
## Start = 365
## End = 395
## Frequency = 1
## 80% 95%
## 365 896.2967 1135.021
## 366 896.2967 1135.021
## 367 896.2967 1135.021
## 368 896.2967 1135.021
## 369 896.2967 1135.021
## 370 896.2967 1135.021
## 371 896.2967 1135.021
## 372 896.2967 1135.021
## 373 896.2967 1135.021
## 374 896.2967 1135.021
## 375 896.2967 1135.021
## 376 896.2967 1135.021
## 377 896.2967 1135.021
## 378 896.2967 1135.021
## 379 896.2967 1135.021
## 380 896.2967 1135.021
## 381 896.2967 1135.021
## 382 896.2967 1135.021
## 383 896.2967 1135.021
## 384 896.2967 1135.021
## 385 896.2967 1135.021
## 386 896.2967 1135.021
## 387 896.2967 1135.021
## 388 896.2967 1135.021
## 389 896.2968 1135.021
## 390 896.2968 1135.021
## 391 896.2968 1135.021
## 392 896.2968 1135.021
## 393 896.2968 1135.021
## 394 896.2968 1135.021
## 395 896.2968 1135.021
##
## $lower
## Time Series:
## Start = 365
## End = 395
## Frequency = 1
## 80% 95%
## 365 -5.626271 -244.3509
## 366 -5.626273 -244.3509
## 367 -5.626275 -244.3509
## 368 -5.626277 -244.3509
## 369 -5.626280 -244.3509
## 370 -5.626282 -244.3509
## 371 -5.626284 -244.3509
## 372 -5.626286 -244.3509
## 373 -5.626289 -244.3509
## 374 -5.626291 -244.3509
## 375 -5.626293 -244.3509
## 376 -5.626295 -244.3509
## 377 -5.626298 -244.3509
## 378 -5.626300 -244.3509
## 379 -5.626302 -244.3509
## 380 -5.626304 -244.3509
## 381 -5.626307 -244.3509
## 382 -5.626309 -244.3509
## 383 -5.626311 -244.3509
## 384 -5.626313 -244.3509
## 385 -5.626316 -244.3509
## 386 -5.626318 -244.3509
## 387 -5.626320 -244.3509
## 388 -5.626322 -244.3509
## 389 -5.626325 -244.3509
## 390 -5.626327 -244.3509
## 391 -5.626329 -244.3509
## 392 -5.626332 -244.3510
## 393 -5.626334 -244.3510
## 394 -5.626336 -244.3510
## 395 -5.626338 -244.3510
write.csv(atm4_fc, "ATM4_forecast.csv")
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
pwr <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/power_data.csv")
dim(pwr)
## [1] 192 3
str(pwr)
## Classes 'data.table' and 'data.frame': 192 obs. of 3 variables:
## $ CaseSequence: int 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : int 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...
## - attr(*, ".internal.selfref")=<externalptr>
min(pwr$`YYYY-MMM`)
## [1] "1998-Apr"
## (`YYYY-MMM` is a character column, so min() sorts alphabetically, not chronologically)
summary(pwr)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
pwr%>%filter(!complete.cases(pwr))
## CaseSequence YYYY-MMM KWH
## 1 861 2008-Sep NA
pwr <- pwr %>% filter(complete.cases(pwr))  # drop the single missing month (2008-Sep)
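Note that dropping a row from a monthly series shifts every later observation back one month once the data are coerced with ts(). A sketch of an alternative (not used below): impute the one missing KWH value with forecast::na.interp so the monthly index stays aligned.
# pwr$KWH <- as.numeric(na.interp(ts(pwr$KWH, start = c(1998, 1), frequency = 12)))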
pwr_ts <- ts(pwr %>% select(-`YYYY-MMM`, -CaseSequence), start = c(1998, 1), frequency = 12)
boxplot(pwr_ts)
autoplot(pwr_ts)
ggseasonplot(pwr_ts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("KwH") +
ggtitle("Residential Power")
We can detect seasonality in the plot. In addition, the variance increases over time, so a transformation is a good idea to stabilise the variance of the series. Since the data are seasonal, we will first take a seasonal difference.
lambda1=BoxCox.lambda(pwr_ts)
lambda1
## [1] 0.06361748
pwr_bc <- BoxCox(pwr_ts, lambda=lambda1)
autoplot(pwr_bc)
pwr_bc %>% diff(lag=12) %>% ggtsdisplay()
#autoplot(pwr_bc_sd)
After the seasonal difference, the data look stationary in the time plot, so no ordinary differencing is required.
The significant spike at lag 1 in the ACF and PACF suggests a non-seasonal MA(1) or AR(1) component, while the significant seasonal spike in the ACF suggests a seasonal MA(1) and the one in the PACF a seasonal AR(1). Alternatively, emphasizing the PACF spikes at lags 12 and 24, a seasonal AR(2) is another option; we can test that model in case our selected model is not a good fit. The candidates:
ARIMA(0,0,1)(0,1,1)[12], ARIMA(1,0,0)(1,1,0)[12], ARIMA(1,0,0)(2,1,0)[12]
ar1 <- pwr_bc%>% Arima(order=c(0,0,1), seasonal=c(0,1,1))
ar2 <- pwr_bc%>% Arima(order=c(1,0,0), seasonal=c(1,1,0))
ar3 <- pwr_bc%>% Arima(order=c(1,0,0), seasonal=c(2,1,0))
ar1$aic
## [1] 327.7296
ar2$aic
## [1] 354.5315
ar3$aic
## [1] 331.7403
accuracy(ar1)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.07237788 0.5618735 0.28481 0.2287857 1.07925 0.8390499
## ACF1
## Training set -0.010759
accuracy(ar2)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.04776387 0.614326 0.2867503 0.1372028 1.087465 0.844766
## ACF1
## Training set -0.006151287
accuracy(ar3)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06559122 0.5679673 0.2774767 0.2047543 1.051591 0.8174461
## ACF1
## Training set -0.01227044
checkresiduals(ar1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,1)[12]
## Q* = 8.7718, df = 22, p-value = 0.9944
##
## Model df: 2. Total lags used: 24
checkresiduals(ar2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,1,0)[12]
## Q* = 32.896, df = 22, p-value = 0.06335
##
## Model df: 2. Total lags used: 24
checkresiduals(ar3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,1,0)[12]
## Q* = 9.7066, df = 21, p-value = 0.9825
##
## Model df: 3. Total lags used: 24
From the Ljung-Box test on ARIMA(0,0,1)(0,1,1), we fail to reject the null hypothesis: the residuals are uncorrelated, which validates the model. The residuals also have constant variance and are roughly normally distributed. Since this model also has the lowest AIC, we will forecast with it.
pwr_fc <- forecast(ar1,12)
head(pwr_fc)
## $method
## [1] "ARIMA(0,0,1)(0,1,1)[12]"
##
## $model
## Series: .
## ARIMA(0,0,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## 0.1337 -0.7130
## s.e. 0.0735 0.0568
##
## sigma^2 estimated as 0.3407: log likelihood=-160.86
## AIC=327.73 AICc=327.87 BIC=337.29
##
## $level
## [1] 80 95
##
## $mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2013
## 2014 27.47164 26.94520 26.63214 26.48856 27.10573 26.92037 27.76407
##           Aug      Sep      Oct      Nov      Dec
## 2013                                     27.84462
## 2014 27.47573 26.64076 26.51737 27.22215
##
## $lower
## 80% 95%
## Dec 2013 27.09660 26.70063
## Jan 2014 26.71698 26.31748
## Feb 2014 26.19054 25.79104
## Mar 2014 25.87748 25.47799
## Apr 2014 25.73390 25.33440
## May 2014 26.35107 25.95158
## Jun 2014 26.16570 25.76621
## Jul 2014 27.00940 26.60991
## Aug 2014 26.72107 26.32157
## Sep 2014 25.88610 25.48660
## Oct 2014 25.76271 25.36321
## Nov 2014 26.46748 26.06799
##
## $upper
## 80% 95%
## Dec 2013 28.59263 28.98860
## Jan 2014 28.22630 28.62579
## Feb 2014 27.69986 28.09936
## Mar 2014 27.38681 27.78630
## Apr 2014 27.24322 27.64272
## May 2014 27.86039 28.25989
## Jun 2014 27.67503 28.07452
## Jul 2014 28.51873 28.91822
## Aug 2014 28.23039 28.62988
## Sep 2014 27.39542 27.79492
## Oct 2014 27.27203 27.67153
## Nov 2014 27.97681 28.37630
write.csv(pwr_fc, "forecast_power.csv")
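Note that ar1 was fitted to the Box-Cox transformed series, so the point forecasts above (roughly 26-28) are on the transformed scale rather than in KWH. A sketch of one way to report KWH directly is to refit with Arima()'s lambda argument, which back-transforms the forecasts automatically:
# refit on the original series, letting Arima() apply and invert the Box-Cox transform
ar1_kwh <- Arima(pwr_ts, order = c(0, 0, 1), seasonal = c(0, 1, 1), lambda = lambda1)
pwr_fc_kwh <- forecast(ar1_kwh, 12)
write.csv(pwr_fc_kwh, "forecast_power.csv")  # overwrite with KWH-scale forecasts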
Waterflow file 1 has readings at multiple, irregular timestamps within each hour, while waterflow file 2 has hourly timestamps. We will bind the two data frames together and replace the readings within each hour with their hourly mean.
w1 <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/waterflow_pipe1.csv")
w2 <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/waterflow_pipe2.csv")
wp1 <- w1 %>% setNames(gsub("[ ]", "_", tolower(colnames(w1)))) %>% mutate(date_time = mdy_hm(date_time))
wp2 <- w2 %>% setNames(gsub("[ ]", "_", tolower(colnames(w2)))) %>% mutate(date_time = mdy_hm(date_time))
# average all readings within each clock hour, then keep one row per hour
wp_merge <- rbind(wp1, wp2) %>%
  mutate(hr = hour(date_time), dt = date(date_time), key = ymd_h(paste0(dt, " ", hr))) %>%
  group_by(key) %>% mutate(wf = mean(waterflow)) %>%
  select(date_time = key, wf) %>% distinct(., .keep_all = TRUE)
There were no outliers present, indicating that the water flow was consistent. We applied seasonal and ordinary differencing and reviewed the ACF and PACF plots, but the results were not promising: the resulting forecasts were heading negative, which is impossible for water flow. We will let auto.arima() decide the model.
ts_wp<-ts(wp_merge$wf, frequency = 24)
boxplot(ts_wp)
autoplot(ts_wp)
ggseasonplot(ts_wp, year.labels=TRUE, year.labels.left=TRUE) +
ylab("waterflow") +
ggtitle("Waterpipeline")
ggtsdisplay(ts_wp)
ts_wp %>% diff(lag=24) %>% ggtsdisplay()
a1 <-auto.arima(ts_wp)
accuracy(a1)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.4974017 14.4913 11.08593 -26.26872 47.7588 0.7333519
## ACF1
## Training set -0.0160318
checkresiduals(a1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 66.378, df = 46, p-value = 0.02616
##
## Model df: 2. Total lags used: 48
Although we expected only weak forecasting models for the water-flow problem, we could not justify a hand-built model, as the differenced flow went negative in the attempt to make the series stationary. The auto.arima() choice, ARIMA(0,1,1)(0,0,1)[24], is our final model, although its residuals still show mild autocorrelation (Ljung-Box p = 0.026).
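To complete the deliverable, a sketch of producing and exporting a forecast from the fitted model (the one-week horizon of 168 hours is our assumption, since no horizon is stated above):
wp_fc <- forecast(a1, h = 168)  # 168 hours = one week ahead (assumed horizon)
autoplot(wp_fc)
write.csv(wp_fc, "waterflow_forecast.csv")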
References:
https://robjhyndman.com/hyndsight/dailydata/
https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials
https://github.com/cran/aTSA/blob/master/R/forecast.R
https://towardsdatascience.com/trend-seasonality-moving-average-auto-regressive-model-my-journey-to-time-series-data-with-edc4c0c8284b