library(tidyverse)
library(scales)
library(lubridate)
library(knitr)
library(fpp2)
library(readxl)
library(forecast)
atm.df<-read_excel("ATM624Data.xlsx")
kable(head(atm.df))
DATE | ATM | Cash |
---|---|---|
39934 | ATM1 | 96 |
39934 | ATM2 | 107 |
39935 | ATM1 | 82 |
39935 | ATM2 | 89 |
39936 | ATM1 | 85 |
39936 | ATM2 | 90 |
kable(summary(atm.df))
DATE | ATM | Cash | |
---|---|---|---|
Min. :39934 | Length:1474 | Min. : 0.0 | |
1st Qu.:40026 | Class :character | 1st Qu.: 0.5 | |
Median :40118 | Mode :character | Median : 73.0 | |
Mean :40118 | NA | Mean : 155.6 | |
3rd Qu.:40210 | NA | 3rd Qu.: 114.0 | |
Max. :40312 | NA | Max. :10919.8 | |
NA | NA | NA’s :19 |
#check Date type
class(atm.df$DATE)
## [1] "numeric"
Clearly, NAs do exist, and Date is numeric. So let’s drop the NAs and do the clean up while we are at it, and not forgetting to convert Date column to Date
type that time series can understand.
tidy.atm.df<-atm.df %>%
drop_na() %>%
spread(ATM, Cash) %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
kable(head(tidy.atm.df, 10))
DATE | ATM1 | ATM2 | ATM3 | ATM4 |
---|---|---|---|---|
2009-05-01 | 96 | 107 | 0 | 776.99342 |
2009-05-02 | 82 | 89 | 0 | 524.41796 |
2009-05-03 | 85 | 90 | 0 | 792.81136 |
2009-05-04 | 90 | 55 | 0 | 908.23846 |
2009-05-05 | 99 | 79 | 0 | 52.83210 |
2009-05-06 | 88 | 19 | 0 | 52.20845 |
2009-05-07 | 8 | 2 | 0 | 55.47361 |
2009-05-08 | 104 | 103 | 0 | 558.50325 |
2009-05-09 | 87 | 107 | 0 | 904.34136 |
2009-05-10 | 93 | 118 | 0 | 879.49359 |
#Check Date type now:
class(tidy.atm.df$DATE)
## [1] "Date"
We see all data for all ATMs. Now, let’s create our Time Series from this data. While at it, let’s produce the summary as well.
atm.ts<-ts(tidy.atm.df %>%
select(-DATE))
head(atm.ts)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## ATM1 ATM2 ATM3 ATM4
## 1 96 107 0 776.99342
## 2 82 89 0 524.41796
## 3 85 90 0 792.81136
## 4 90 55 0 908.23846
## 5 99 79 0 52.83210
## 6 88 19 0 52.20845
Plotting our Time Series for each ATM:
autoplot(atm.ts) +
labs(title = "Cash Withdrawn From the 4 ATMS", x = "Day") +
scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
scale_color_discrete(NULL) +
theme(legend.position = c(0.1, 0.8))
Let’s bring back the Date and tidy up the data more:
tidy.atm.df %>% gather(atm, Cash, -DATE) %>%
ggplot(aes(x = DATE, y = Cash, col = atm)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ atm, ncol = 1, scales = "free_y") +
labs(title = "Cash Withdrawn From the 4 ATMs", x = "Date") +
scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)
Clearly, While ATM1, ATM2 and ATM3 ha ve a far amount of transactions, ATM3 has the least number of withdrawals.
autoplot(ts(atm.ts[1:61, ])) +
labs(title = "Cash Withdrawn From 4 ATMs", x = "Day") +
scale_y_continuous("Cash Withdrawn in hundreds", labels = dollar)
From above, clearly weekly
seasonality is present in data.
Therefore we set our frequency to 7.
Set the frequency to 7 and plot the ACF/PACF
atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)
ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
From the above plot, we notice:
weekly seasonality is clearly presented in The ACF and spectrum plots
Large spikes in the ACF lags 7, 14, and 21
Large spikes in the spectrum plot at frequencies 1, 2, and 3
Seasonal ARIMA model is suggested by the plot
let's display the differencing with lag 7 for autocorrelation
ggtsdisplay(diff(atm1.ts,7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
Clearly, there is no non-seasonal differencing.
Let’s create model with Box-Cox, specify ARIMA parameters and extract AIC
atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda
## [1] 0.2081476
\(\lambda\) = 0.2081476
function to create models & return AIC values for time series
atm.aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}
Different combinations of p, q, P, Q:
expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 1 1 0 1 1096.583
The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).
The model used is ARIMA(1,0,1)(0,1,1)
atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)
atm1.fit
## Series: atm1.ts
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ar1 ma1 sma1
## -0.5040 0.6201 -0.6438
## s.e. 0.2417 0.2183 0.0427
##
## sigma^2 estimated as 1.252: log likelihood=-544.29
## AIC=1096.58 AICc=1096.7 BIC=1112.11
Let’s check on residuals using the Ljung-Box test
Box.test(resid(atm1.fit), type = "L", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")
p-value (=0.09) is greater than 0.05, therefore the residuals may have white noise.
The residuals are normally distrubuted around a mean of 0 (zero)
Therefore we accept this model and can be used in forecasting.
Let’s do the same oparations for ATM2
atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE)
ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
We also notice presence of seasonality in ATM2 which is differenced with lag = 7
ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
The spikes in ACF & PACF in the non-differenced series at k=2 & k=5
Large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0
Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC
atm2.lambda<-BoxCox.lambda(atm2.ts)
atm2.lambda
## [1] 0.7278107
atm2.aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 5 5 0 1 2544.791
The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).
ARIMA(5,0,5)(0,1,1) is the model used
atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)
Box.test(resid(atm2.fit), type = "L", fitdf = 11, lag = 14)
##
## Box-Ljung test
##
## data: resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")
p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.
The residuals are normally distrubuted around a mean of 0 (zero)
Therefore we accept this model and can be used in forecasting.
again, let’s do the same operations as before.
atm3<-atm.ts[(nrow(atm.ts) - 2):nrow(atm.ts), "ATM3"]
atm3.ts<-ts(atm3, start = 363)
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA
atm3.ts
## Time Series:
## Start = 1
## End = 365
## Frequency = 1
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [276] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [326] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85
#let's summarize to see the mean
summary(atm3.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 82.00 83.50 85.00 87.67 90.50 96.00 362
Most of the values are NA. ATM 3 has only 3 transactions and we’ll use simple mean function to forecast for ATM 3
again, let’s do the same operations.
atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4.ts<-ts(atm4, frequency = 7)
summary(atm4.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.563 124.334 403.839 445.233 704.192 1712.075
ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
Notice the weekly seasonality and and differenced with lag=7
ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
we notice stationary time series with a large spike at k=7 suggesting D=1 and d=0
Will do the same thing as before; that is create model with Box-Cox, specify ARIMA parameters and extract AIC
atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda
## [1] 0.4525697
atm4.aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}
#creating possible combinations of p, q, P, Q except all zeros
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 5 2 0 1 2863.061
The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).
ARIMA(0,0,2)(0,1,1) will be the model used here
atm4.fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)
Box.test(resid(atm4.fit), type = "L", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05
ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")
p-value is greater than 0.05, therefore the residuals may be white noise as suggested by Ljung-Box test.
The residuals are normally distrubuted around a mean of 0 (zero)
Therefore we accept this model and can be used in forecasting.
atm1_forecast<-forecast(atm1.fit, 31, level = 95)
atm2_forecast<-forecast(atm2.fit, 31, level = 95)
atm3_forecast<-meanf(atm3.ts, 31, level = 95)
atm4_forecast<-forecast(atm4.fit, 31, level = 95)
gridExtra::grid.arrange(
autoplot(atm1_forecast) +
labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm2_forecast) +
labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm3_forecast) +
labs(title = "ATM3: mean", x = "Day", y = NULL) +
theme(legend.position = "none"),
autoplot(atm4_forecast) +
labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for May 2010\n")
)
The forecast shows ATM 3 has a single mean value forecast
The forecast shows ATMs 1, 2, and 4 have seasonality
The forcasted values are saved in a sv file like so:
data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4), ATM = rep(names(tidy.atm.df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>%
write_csv("Forecast_Project1_Abdelmalek_ATM624Data_PartA_Output.csv")
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
power.df<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kable(head(power.df))
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
summary(power.df)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
according to the summary above, there is only 1 missing value. So I’ll just remove that row.
power.df <- power.df %>% filter(!is.na(KWH))
dim(power.df)
## [1] 191 3
At this point, the data is good and we can start creating the time series
power.ts<-ts(power.df[, "KWH"], start = c(1998, 1), frequency = 12)
head(power.ts)
## Jan Feb Mar Apr May Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
summary(power.ts)
## KWH
## Min. : 770523
## 1st Qu.: 5429912
## Median : 6283324
## Mean : 6502475
## 3rd Qu.: 7620524
## Max. :10655730
autoplot(power.ts, main = "Monthly KWH Readings") + xlab("Month") + ylab("Kilowatts used per hour (KWH)")
We notice:
Presence of seasonality in the time series data
July 2010 shows an unusual very low reading; probably due to power outage during summer time.I will assume this is a correct reading and I will not chnage the value for that month.
A slight upward trend after July 2010
Let’s use ggseasonal to explore any hidden details
ggseasonplot(power.ts) + ggtitle("Killowatt hour usage by year/month") + xlab("Month") + ylab("Kilowatts used per hour (KWH)")
It looks like we almost have a sin (sinusoidal) function with a period of 1 year, revealing the pattern of the following years.
ndiffs(power.ts)
## [1] 1
ggtsdisplay(power.ts, points = FALSE, plot.type = "histogram")
Clearly seasonality and histogram are not normal. Therefore we may need to use Box-Cox for transformation.
power.ts.lambda <- BoxCox.lambda(power.ts)
power.ts.lambda
## [1] 0.06361748
First, let’s see using Box-Cox to normalize the the histogram, what will happen
power.ts.box <- power.ts %>% BoxCox(lambda = 0.1)
ggtsdisplay(power.ts.box, points = FALSE, plot.type = "histogram")
Yes the histogram is slightly normalized, but it is not quite significant. Therefore, Box-Cox transformation may not be appropriate here.
power.ts.fit <- auto.arima(power.ts)
checkresiduals(power.ts.fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 17.494, df = 20, p-value = 0.6207
##
## Model df: 4. Total lags used: 24
Well Well Well!! It looks like the Residual histogram of the non-Box-Coxed series is more normal. So we will not consider Box-Box. We just discard it.
Instead, we will use auti.arima for our prediction.
power.ts.forecast <- forecast(power.ts.fit, 12, level = 95)
autoplot(power.ts.forecast) + labs(title = "KWH: ARIMA(0, 0, 1)(0, 1, 2)", x = "Week", y = NULL) + theme(legend.position = "none")
Let’s write the monthly forecasts of 2014 to excel
mSeq <- max(power.df$CaseSequence)
power.dates <- format(seq(ymd("2014-01-01"), ymd("2014-12-31"), by = "1 month"), "%Y-%b")
power.ts.forecast.df <- data.frame("CaseSequence" = c(mSeq + 1, mSeq + 2, mSeq + 3, mSeq + 4, mSeq + 5, mSeq + 6, mSeq + 7, mSeq + 8, mSeq + 9, mSeq + 10, mSeq + 11, mSeq + 12),`YYYY-MMM` = power.dates, "KWH" = c(round(power.ts.forecast$mean)))
write_csv(power.ts.forecast.df, "Forecast_Project1_Abdelmalek_ResidentialCustomerData_PartB_Output.csv")
waterflow1 <- read_excel('Waterflow_Pipe1.xlsx', col_types =c("date", "numeric"))
waterflow2 <- read_excel('Waterflow_Pipe2.xlsx', col_types =c("date", "numeric"))
kable(head(waterflow1))
Date Time | WaterFlow |
---|---|
2015-10-23 00:24:06 | 23.369599 |
2015-10-23 00:40:02 | 28.002881 |
2015-10-23 00:53:51 | 23.065895 |
2015-10-23 00:55:40 | 29.972809 |
2015-10-23 01:19:17 | 5.997953 |
2015-10-23 01:23:58 | 15.935223 |
kable(head(waterflow2))
Date Time | WaterFlow |
---|---|
2015-10-23 01:00:00 | 18.81079 |
2015-10-23 02:00:00 | 43.08703 |
2015-10-23 03:00:00 | 37.98770 |
2015-10-23 04:00:00 | 36.12038 |
2015-10-23 05:00:00 | 31.85126 |
2015-10-23 06:00:00 | 28.23809 |
Let’s fix the column names for consistency
colnames(waterflow1)= c("DateTime","WaterFlow")
colnames(waterflow2)= c("DateTime","WaterFlow")
Looking at the 2 readings, we notice that the first pipe is reading more than once within the hour, while the second pipe is reading once every hour. We need to normalize both pipes by converting the first pipe into Hourly.
wpc1 <-waterflow1 %>%
mutate(Date = date(DateTime),
Hour = hour(DateTime) + 1) %>%
group_by(Date, Hour) %>%
summarize(WaterFlow = mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime = ymd_h(paste(Date, Hour))) %>%
select(DateTime, WaterFlow)
kable(head(wpc1))
DateTime | WaterFlow |
---|---|
2015-10-23 01:00:00 | 26.10280 |
2015-10-23 02:00:00 | 18.85202 |
2015-10-23 03:00:00 | 15.15857 |
2015-10-23 04:00:00 | 23.07886 |
2015-10-23 05:00:00 | 15.48219 |
2015-10-23 06:00:00 | 22.72539 |
We now use full join by adding waterflows through both of the pipes, so each data in each pipe is an hour of waterflow.
water.data<-full_join(wpc1, waterflow2, by = "DateTime", suffix = c("_1", "_2")) %>%
mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>%
mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>%
select(DateTime, WaterFlow)
kable(head(water.data, 10))
DateTime | WaterFlow |
---|---|
2015-10-23 01:00:00 | 44.91359 |
2015-10-23 02:00:00 | 61.93904 |
2015-10-23 03:00:00 | 53.14628 |
2015-10-23 04:00:00 | 59.19923 |
2015-10-23 05:00:00 | 47.33345 |
2015-10-23 06:00:00 | 50.96348 |
2015-10-23 07:00:00 | 30.45346 |
2015-10-23 08:00:00 | 45.04664 |
2015-10-23 09:00:00 | 76.56548 |
2015-10-23 10:00:00 | 75.37564 |
Let’s create our time series
waterflow.ts<-ts(water.data$WaterFlow, frequency = 24)
summary(waterflow.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.885 31.253 43.558 44.250 56.950 106.499
dim(water.data)
## [1] 1000 2
autoplot(waterflow.ts) +
labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")
We see variability over the period of time. So because of Variance, we will be using BoxCox.
waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")
Our time series looks stationary, but we do see spikes in the ACF and PACF at k=1, which means non-seasonal AR(1) and MA(1) components.
waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)
Box.test(resid(waterflow.fit), type = "L")
##
## Box-Ljung test
##
## data: resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964
ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")
Forecasting using ARIMA(1,1,1):
waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)
autoplot(waterflow.forecast) +
labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
theme(legend.position = "none")
Seems the modal forecast lacks seasonality.
Let’s write the forecast to a file
data_frame(DateTime = max(water.data$DateTime) +
hours(1:168), WaterFlow = waterflow.forecast$mean) %>%
write_csv("Forcast_Project1_Abdelmalek_WaterFlow_PartC_Output.csv")