Data 624 Project 1
1 Project #1
This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade.
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present the results via RPubs and .rmd, with the forecast in an Excel-readable file.
2 Libraries

library(tidyverse)
library(fpp2)
library(urca)
library(rio)
library(gridExtra)
library(forecast)
library(lubridate)
3 Part A
Part A – ATM Forecast, ATM624Data.xlsx
3.1 Load Data
First, load the Excel data, clean it by dropping the NA values, and rearrange it into a wider format with one column per ATM.
atm <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
atm # original data

atm_daily <- atm %>% drop_na() %>% spread(ATM, Cash)
atm_daily # daily totals after dropping NA values

From the table above, we can see that the cash withdrawals from ATM4 are generally larger than those of the other 3 ATMs, and that ATM3's are mostly zero. Let's look more closely at their summary statistics and boxplots.
There are 3 NA values in ATM1 and 2 in ATM2.
There are only 3 non-zero datapoints in ATM3.
There are many outliers in ATM1 and one extremely high outlier in ATM4.
par(mfrow=c(4,1))
for (i in 2:5) {
  print(summary(atm_daily[i]))
  boxplot(atm_daily[i], horizontal = TRUE)
}
## ATM1
## Min. : 1.00
## 1st Qu.: 73.00
## Median : 91.00
## Mean : 83.89
## 3rd Qu.:108.00
## Max. :180.00
## NA's :3
## ATM2
## Min. : 0.00
## 1st Qu.: 25.50
## Median : 67.00
## Mean : 62.58
## 3rd Qu.: 93.00
## Max. :147.00
## NA's :2
## ATM3
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.7206
## 3rd Qu.: 0.0000
## Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
3.2 Timeseries
Next, convert the data into a timeseries and plot it.
As the spike from ATM4 in the first plot makes it hard to see the details in ATM1, ATM2, and ATM3, individual plots will be shown below.
atm_ts <- ts(atm_daily %>% select(-DATE))
autoplot(atm_ts) +
  ggtitle("Daily Cash Withdrawal from 4 ATM Machines") +
  xlab("Day") +
  ylab("Hundreds of Dollars ($100)")

To handle the data better, the extremely high outlier in ATM4 will be replaced by the ATM4 median for better forecasting.
atm[1394,3] <- 403.839 # replace the extreme ATM4 value with the ATM4 median
atm %>% drop_na() %>% ggplot(aes(x=DATE, y=Cash, col=ATM)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ATM, ncol=1, scales="free_y") +
  labs(title="Daily Cash Withdrawal from 4 ATM Machines (May 2009 to April 2010)") +
  xlab("Date") + ylab("Hundreds of Dollars ($100)")

From the above plots:
Seasonality is seen in ATM1 and ATM2; a Box-Cox transformation will be applied.
ATM3 has only 3 non-zero datapoints, with all values before them equal to 0, so we may not have enough information for prediction. This could happen if ATM3 launched only 3 days before the end of April 2010.
Replacing the outlier with the median makes ATM4 easier to forecast.
3.3 ARIMA Models
3.3.1 ATM1
From the timeseries plot, ATM1 clearly shows weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm1 <- atm_daily[2]
atm1 <- ts(atm1, frequency=7) # weekly seasonality
atm2 <- atm_daily[3]
atm2 <- ts(atm2, frequency=7)
atm3 <- atm_daily[4]
atm3 <- ts(atm3, start=363) # only the last 3 observations are non-zero
atm3[which(atm3==0)] <- NA # treat the leading zeros as missing
atm4 <- atm_daily[5]
atm4[285,] <- 403.839 # outlier replaced by the ATM4 median
atm4 <- ts(atm4, frequency=7)

atm1_bc <- BoxCox(atm1, lambda = BoxCox.lambda(atm1))
ggtsdisplay(atm1_bc, main="ATM1 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
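The chunk that produced the KPSS output below is hidden in the knitted report. A minimal sketch of what it likely contains, assuming one seasonal difference at lag 7 and urca's ur.kpss() test:

atm1_diff <- diff(atm1_bc, lag = 7) # one seasonal difference (D = 1)
ggtsdisplay(atm1_diff, main = "ATM1: Box-Cox transformed and seasonally differenced")
atm1_diff %>% ur.kpss() %>% summary() # prints the KPSS table below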
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.016
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
To recall, an ARIMA(p,d,q)(P,D,Q)[m] model has p/P as the order of the autoregressive part, d/D as the degree of differencing involved, q/Q as the order of the moving average part, and m as the number of observations per seasonal cycle (here m=7 for weekly seasonality).
The significant non-seasonal spike at lag 1 in the ACF and PACF suggests non-seasonal p=q=1.
The seasonal spike at lag 7 suggests seasonal AR(1) and/or MA(1) components; as the seasonal ACF decays gradually, this suggests seasonal AR(0) and MA(1), i.e. P=0, Q=1.
Thus, the model I suggest is ARIMA(1,0,1)(0,1,1)[7].
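In backshift notation, with \(w_t\) denoting the Box-Cox transformed series, this model can be written as
\[(1 - \phi_1 B)(1 - B^7)\, w_t = (1 + \theta_1 B)(1 + \Theta_1 B^7)\, \varepsilon_t\]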
atm1_arima <- Arima(atm1, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))
summary(atm1_arima)
## Series: atm1
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ar1 ma1 sma1
## -0.5040 0.6201 -0.6438
## s.e. 0.2417 0.2183 0.0427
##
## sigma^2 estimated as 1.252: log likelihood=-544.29
## AIC=1096.58 AICc=1096.7 BIC=1112.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.514774 25.09103 15.83582 -89.65246 108.6821 0.8958869
## ACF1
## Training set -0.007347018
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 9.4438, df = 11, p-value = 0.581
##
## Model df: 3. Total lags used: 14
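The auto.arima fit whose summary and residual check follow is not echoed in the knit; it was presumably produced by a chunk like the one below (the approximation = FALSE argument is an assumption, mirroring the Part B and Part C chunks).

atm1_auto <- auto.arima(atm1, approximation = FALSE, lambda = BoxCox.lambda(atm1))
summary(atm1_auto) # prints the ARIMA(0,0,2)(0,1,1)[7] summary below
checkresiduals(atm1_auto) # residual plots and the Ljung-Box test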
## Series: atm1
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ma1 ma2 sma1
## 0.0989 -0.1107 -0.6482
## s.e. 0.0532 0.0527 0.0425
##
## sigma^2 estimated as 1.247: log likelihood=-543.68
## AIC=1095.37 AICc=1095.48 BIC=1110.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.493867 25.19797 16.12913 -88.12706 107.3975 0.9124807
## ACF1
## Training set 0.01895515
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 8.5811, df = 11, p-value = 0.6605
##
## Model df: 3. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(0,0,2)(0,1,1)[7].
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].
According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and Ljung-Box p-values well above 0.05.
Both models work well, but ARIMA(0,0,2)(0,1,1)[7] has the smaller AICc, so it is preferred over the model I suggested.
3.3.2 ATM2
From the timeseries plot, ATM2 also clearly shows weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm2_bc <- BoxCox(atm2, lambda = BoxCox.lambda(atm2))
ggtsdisplay(atm2_bc, main="ATM2 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0161
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
One seasonal difference was applied, so D=1, while the non-seasonal part needs no differencing, so d=0.
The seasonal spikes in the ACF drop off abruptly while those in the PACF decrease gradually, suggesting seasonal AR(0) and MA(1), so P=0, Q=1.
The non-differenced ACF and PACF plots have spikes at lags 2 and 5, suggesting p=2 and q=2.
Thus, the model I suggest is ARIMA(2,0,2)(0,1,1)[7].
atm2_arima <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))
summary(atm2_arima)
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 estimated as 62.19: log likelihood=-1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14

The hidden auto.arima chunk arrives at the same model, so its summary below is identical:
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 estimated as 62.19: log likelihood=-1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(2,0,2)(0,1,1)[7].
The ARIMA model I suggested is ARIMA(2,0,2)(0,1,1)[7].
The two models are the same.
According to the error measures and the residual plots, the model represents the data well: the Ljung-Box p-value (0.3665) is well above 0.05, so the residuals resemble white noise.
3.3.3 ATM3
ATM3 has only 3 non-zero datapoints, and all values before them are 0. This could happen if ATM3 launched only 3 days before the end of April 2010.
Given only 3 datapoints, there is not enough information to fit a timeseries model.
Therefore, I will use a simple mean forecast.
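Since the leading zeros in atm3 were recoded as NA in the data preparation above, the mean is taken over just the three remaining observations; meanf() appears to average only the non-missing values, so this is the flat point forecast used later.

mean(atm3, na.rm = TRUE) # should match the meanf() point forecast repeated for all 31 days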
3.3.4 ATM4
Let’s repeat the steps from ATM1 and ATM2 on ATM4, which also has weekly seasonality.
The ACF and PACF plots have significant spikes at lags 7, 14, and 21.
Thus, this is a non-stationary timeseries.
atm4_bc <- BoxCox(atm4, lambda = BoxCox.lambda(atm4))
ggtsdisplay(atm4_bc, main="ATM4 with BoxCox Transformation")

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0124
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
One seasonal difference was applied, so D=1, while the non-seasonal part needs no differencing, so d=0.
The seasonal spikes in the ACF drop off abruptly while those in the PACF decrease gradually, suggesting seasonal AR(0) and MA(1), so P=0, Q=1.
There is a single non-seasonal spike (at lag 3) in the ACF and PACF plots, suggesting low-order terms; I will try p=q=1.
Thus, I will try ARIMA(1,0,1)(0,1,1)[7].
atm4_arima <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))
summary(atm4_arima)
## Series: atm4
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 ma1 sma1
## 0.6851 -0.6059 -0.8677
## s.e. 0.3040 0.3315 0.0315
##
## sigma^2 estimated as 171.4: log likelihood=-1432.08
## AIC=2872.16 AICc=2872.27 BIC=2887.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 61.63608 337.5467 254.8181 -360.8151 404.2179 0.7355835
## ACF1
## Training set 0.0185852
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 16.519, df = 11, p-value = 0.1229
##
## Model df: 3. Total lags used: 14

For comparison, auto.arima (run in a hidden chunk) selects a different model:
## Series: atm4
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 sar1 sar2 mean
## 0.0771 0.2090 0.1996 29.0082
## s.e. 0.0526 0.0516 0.0525 1.2627
##
## sigma^2 estimated as 182.2: log likelihood=-1466.34
## AIC=2942.69 AICc=2942.85 BIC=2962.19
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 84.38613 351.7695 274.2406 -370.6751 415.1343 0.7916504
## ACF1
## Training set 0.01960666
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.92, df = 10, p-value = 0.1019
##
## Model df: 4. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(1,0,0)(2,0,0)[7] with non-zero mean.
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].
The model I suggested has the smaller AICc and RMSE.
Based on the error measures and the residual plots, I will stick with the model I suggested.
3.4 Forecast
# The model objects are defined in a hidden chunk in the original knit:
atm1_model <- Arima(atm1, order=c(0,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1)) # auto.arima's pick
atm2_model <- atm2_arima
atm4_model <- atm4_arima
atm1_f <- forecast(atm1_model, 31, level=95)
atm2_f <- forecast(atm2_model, 31, level=95)
atm3_f <- meanf(atm3, 31, level=95)
atm4_f <- forecast(atm4_model, 31, level=95)
gridExtra::grid.arrange(
  autoplot(atm1_f) +
    labs(title="ATM1: ARIMA(0,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm2_f) +
    labs(title="ATM2: ARIMA(2,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm3_f) +
    labs(title="ATM3: meanf", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm4_f) +
    labs(title="ATM4: ARIMA(1,0,1)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  top = grid::textGrob("Forecast on Cash Withdrawal for May 2010")
)

export <- rbind(atm1_f$mean, atm2_f$mean, atm3_f$mean, atm4_f$mean)
write.csv(export, "ATM624Forecast.csv")
data.frame(export) %>% cbind(ATM = c('ATM1', 'ATM2', 'ATM3', 'ATM4')) %>%
  select(ATM, everything())

4 Part B
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
4.1 Load Data
First, load the Excel data into R and clean it using the tsclean() function, which handles outliers and NA values.
kwh <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/ResidentialCustomerForecastLoad-624.xlsx")
kwh # original data

kwh_ts <- ts(kwh[,"KWH"], start=c(1998,1), frequency=12) %>%
  tsclean() # handle the outliers and NA value
kwh_ts # cleaned data
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 7696306
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 7381789 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 7028762
Summary statistics before cleaning (note the single NA and the implausibly low minimum of 770523) and after cleaning:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443502 6351262 6529701 7608792 10655730
4.2 Timeseries
Let’s study the cleaned timeseries below.
Seasonality is present, with a peak roughly every 6 months; the underlying cycle is likely annual, driven by high usage during winter and summer.
autoplot(kwh_ts) +
  ggtitle("Monthly Residential Power Usage for Jan 1998 to Dec 2013 (KWH)") +
  xlab("Month") +
  ylab("Kilowatt hours (KWH)")

4.3 ARIMA Model
From the plots below, we see annual seasonality.
A Box-Cox transformation was tried on the timeseries but made little visible difference; differencing is still needed.
kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")

After one seasonal difference, the timeseries appears to be stationary.
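As in Part A, the differencing and KPSS chunk is hidden; a sketch, assuming one seasonal difference at the annual lag:

kwh_diff <- diff(kwh_bc, lag = 12) # one seasonal difference (D = 1)
kwh_diff %>% ur.kpss() %>% summary() # prints the KPSS table below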
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.1168
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
To recall, an ARIMA(p,d,q)(P,D,Q)[m] model has p/P as the order of the autoregressive part, d/D as the degree of differencing involved, q/Q as the order of the moving average part, and m as the number of observations per seasonal cycle (here m=12 for monthly data).
The significant non-seasonal spike at lag 1 in the ACF and PACF suggests non-seasonal p=q=1.
The seasonal spike at lag 12 in the ACF and the spikes at lags 12 and 24 in the PACF suggest seasonal AR and/or MA components; as the ACF decays gradually, this suggests seasonal AR(0) and MA(2), i.e. P=0, Q=2.
Thus, the model I suggest is ARIMA(1,0,1)(0,1,2)[12].
kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
## Series: kwh_ts
## ARIMA(1,0,1)(0,1,2)[12]
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.9480 -0.8020 -0.8027 0.1754
## s.e. 0.0822 0.1772 0.0928 0.0910
##
## sigma^2 estimated as 9.097e-05: log likelihood=582.73
## AIC=-1155.47 AICc=-1155.12 BIC=-1139.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 85650.36 597401.9 463229.3 0.8034675 6.999746 0.7766959
## ACF1
## Training set 0.1519546
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,2)[12]
## Q* = 24.449, df = 20, p-value = 0.2233
##
## Model df: 4. Total lags used: 24
kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 25.496, df = 21, p-value = 0.2263
##
## Model df: 3. Total lags used: 24
## Series: kwh_ts
## ARIMA(1,0,0)(0,1,1)[12] with drift
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ar1 sma1 drift
## 0.2903 -0.7349 1e-04
## s.e. 0.0724 0.0698 1e-04
##
## sigma^2 estimated as 8.731e-05: log likelihood=585.27
## AIC=-1162.55 AICc=-1162.32 BIC=-1149.78
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 20692.11 580573.3 445696.4 -0.2292905 6.807673 0.7472985
## ACF1
## Training set 0.05192474
The ARIMA model found by auto.arima is ARIMA(1,0,0)(0,1,1)[12] with drift.
The ARIMA model I suggested is ARIMA(1,0,1)(0,1,2)[12].
According to the error measures and the residual plots, the auto model represents the data better, with a smaller AICc and smaller error measures.
Therefore ARIMA(1,0,0)(0,1,1)[12] is preferred over the model I suggested.
4.4 Forecast
kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts)) # auto.arima's choice (the tiny drift term is omitted)
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
## Point Forecast Lo 95 Hi 95
## Jan 2014 9417011 7779842 11460551
## Feb 2014 7893327 6480120 9670462
## Mar 2014 6436805 5307451 7849709
## Apr 2014 5756614 4760318 6998820
## May 2014 5570785 4610642 6766694
## Jun 2014 7479484 6140427 9163354
## Jul 2014 8522770 6971107 10482526
## Aug 2014 9080593 7413869 11190200
## Sep 2014 7909745 6483433 9706676
## Oct 2014 5661778 4683894 6880415
## Nov 2014 5568654 4608914 6764051
## Dec 2014 6919608 5693132 8457963
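The export step for Part B is not shown in the knit; a sketch that writes the 2014 point forecasts to an Excel-readable CSV (the layout and file name are assumptions):

export_kwh <- data.frame(Month = month.abb, Year = 2014, KWH = as.numeric(kwh_f$mean)) # hypothetical layout
write.csv(export_kwh, "KWH624Forecast.csv", row.names = FALSE) # hypothetical file name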
5 Part C
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
5.1 Load Data
wfp1 <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project1/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow")
colnames(wfp2) <- c("DateTime", "WaterFlow")

wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>% # match the hour convention of wfp2
  group_by(Date, Time) %>%
  summarise(Water=mean(WaterFlow)) %>% # take the mean of multiple recordings within an hour
  ungroup() %>%
  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
  select(DateTime,Water)
wfp1

wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
  group_by(Date, Time) %>%
  summarise(Water=mean(WaterFlow)) %>%
  ungroup() %>%
  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
  select(DateTime, Water)
wfp2

5.2 Timeseries
Combining the two waterflows into one:
water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
  mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
  mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
  mutate(Water = Water_1 + Water_2) %>%
  select(DateTime, Water)
water

water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())

5.3 ARIMA Model
We cannot see significant seasonality in water_ts; however, there is a slight decreasing trend.
It is a non-stationary timeseries.
water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="water_ts with BoxCox Transformation")

Trend differencing is needed.
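The two bare numbers printed below are presumably the outputs of ndiffs() and nsdiffs(); a sketch of the hidden chunk, including the KPSS test on the differenced series:

ndiffs(water_bc) # expected output 1: one first difference needed
nsdiffs(water_bc) # expected output 0: no seasonal difference needed
water_bc %>% diff() %>% ur.kpss() %>% summary() # prints the KPSS table below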
## [1] 1
## [1] 0
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0091
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The timeseries now appears to be stationary.
No seasonal difference was needed, so D=0, while one first difference was applied, so d=1.
There is one significant seasonal spike in the ACF, suggesting Q=1.
There is one non-seasonal spike at lag 1 in the ACF, suggesting q=1.
Thus, I will try ARIMA(0,1,1)(0,0,1)[24].
water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
## Series: water_ts
## ARIMA(0,1,1)(0,0,1)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sma1
## -0.9578 0.0712
## s.e. 0.0101 0.0319
##
## sigma^2 estimated as 103.3: log likelihood=-3734.4
## AIC=7474.8 AICc=7474.83 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
## ACF1
## Training set -0.04444759
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 57.881, df = 46, p-value = 0.1124
##
## Model df: 2. Total lags used: 48
water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
## Series: water_ts
## ARIMA(0,1,1)(1,0,0)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sar1
## -0.9578 0.0714
## s.e. 0.0101 0.0322
##
## sigma^2 estimated as 103.3: log likelihood=-3734.4
## AIC=7474.8 AICc=7474.82 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
## ACF1
## Training set -0.04431537
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[24]
## Q* = 57.858, df = 46, p-value = 0.1128
##
## Model df: 2. Total lags used: 48
The ARIMA model found by auto.arima is ARIMA(0,1,1)(1,0,0)[24].
The ARIMA model I suggested is ARIMA(0,1,1)(0,0,1)[24].
The two models have nearly identical statistics, with an AICc difference of only about 0.01.
I will choose the auto.arima model for this timeseries.
5.4 Forecast
Forecast one week ahead of water usage, which is \(7 \times 24 = 168\) hours.
The timeseries lacks strong seasonality, so the long-horizon point forecast quickly flattens to a nearly constant level.
water_model <- water_auto # the chosen ARIMA(0,1,1)(1,0,0)[24] model
water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
  labs(title="Water Usage Forecast: ARIMA(0,1,1)(1,0,0)[24]", x="Day")

export <- water_f$mean
write.csv(export, "Waterflow624Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df