In part A, forecast how much cash is taken out of 4 different ATM machines for May 2010.
The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. Explain and demonstrate your process, techniques used and not used, and your actual forecast. Data is given via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.
Date was transformed as it was a 5-digit numeric type that can be converted since 00000 would 1/1/1900. The data was then transformed into a tsibble object. The dates were filtered to select data from May 2009 to April 2010, as the last 14 rows for May 2010 are all NA values. There were 19 missing values in the Cash column prior to filtering the dates and now there are only 5 NA values.
# reading in excel file
ATM <- read_excel("ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
# converting into date format
mutate(DATE = as_date(DATE)) %>%
# renaming column name
rename(Date = DATE) %>%
# converting to tsibble
as_tsibble(index = Date, key = ATM) %>%
# selecting from May 2009 to April 2010
filter(Date < "2010-05-01")
head(ATM)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
summary(ATM)
## Date ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
There are 3 missing values for ATM1 and 2 missing values for ATM2. As mentioned in Section 13.9, ARIMA models will work in handling missing values. To fix those 5 missing values, an ARIMA model was fitted on the data containing missing values and then used to interpolate the missing observations.
# summarize the missing data by ATM
ATM %>%
as.data.frame() %>%
group_by(ATM) %>%
summarise(`Missing Values` = sum(is.na(Cash)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## ATM `Missing Values`
## <chr> <int>
## 1 ATM1 3
## 2 ATM2 2
## 3 ATM3 0
## 4 ATM4 0
ATM <- ATM %>%
# Fit ARIMA model to the data containing missing values
model(ARIMA(Cash)) %>%
# Estimate Cash for the missing values.
interpolate(ATM)
# check to see if any missing values again
ATM %>%
as.data.frame() %>%
group_by(ATM) %>%
summarise(`Missing Values` = sum(is.na(Cash)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## ATM `Missing Values`
## <chr> <int>
## 1 ATM1 0
## 2 ATM2 0
## 3 ATM3 0
## 4 ATM4 0
At first glance, it appears that ATM1 and ATM2 have some seasonality to it while ATM3 seems to have only been active at the end of April 2010. There also seems to be a large outlier in ATM4. It would be best to remove this outlier and explore the data further.
ATM %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = " ATM Before Outlier Removal")
There is no trend for ATM1 and there is a weekly seasonality. The seasonality shows that there is a sudden drop in amount of cash withdrawn once a week. The remainder seems random up until March 2010, where there is a pattern in the increased variation.
#plot
ATM %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
ggtitle("Non-tranformed ATM1")
# STL decomposition
ATM %>%
filter(ATM == "ATM1") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
# lambda for box cox transformation
lambda <- ATM %>%
filter(ATM == "ATM1") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm1_fit <- ATM %>%
filter(ATM == "ATM1") %>%
model(
# additive ETS model
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,lambda))
)
# stats for the models
left_join(glance(atm1_fit) %>% select(.model:BIC),
accuracy(atm1_fit) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## # A tibble: 7 x 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative_ts 0.0455 -1342. 2704. 2704. 2743. 23.0
## 2 additive 577. -2232. 4485. 4486. 4524. 23.7
## 3 ARIMA 3.41 -728. 1464. 1464. 1480. 24.5
## 4 additive_ts 3.51 -1301. 2623. 2624. 2662. 24.6
## 5 multiplicative 0.134 -2273. 4566. 4566. 4605. 26.2
## 6 snaive 765. NA NA NA NA 27.6
## 7 snaive_ts 4.76 NA NA NA NA 27.6
When it comes to choosing the best model, RMSE, AIC, AICc, or BIC can be used as the comparative statistic. The transformed multiplicative ETS model has the lowest RMSE but the ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.
# report of the arima model
atm1_fit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1249 -0.1046 -0.6359
## s.e. 0.0523 0.0516 0.0441
##
## sigma^2 estimated as 3.414: log likelihood=-728.12
## AIC=1464.23 AICc=1464.34 BIC=1479.75
\(ARIMA(0,0,2)(0,1,1)_7\) was modeled on ATM1.
# forecasting
fc_atm1 <- atm1_fit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forcasted plot
fc_atm1 %>%
autoplot(ATM) +
ggtitle(TeX(paste0("ATM 1 Forcasted with $ARIMA(0,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda,2))))
# residuals
atm1_fit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,2)(0,1,1)_7$ with $\\lambda$ = ",
round(lambda,2))))
#Box-Pierce test, â„“=2m for seasonal data, m=7
atm1_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
## # A tibble: 1 x 4
## .model bp_stat bp_pvalue .name_repair
## <chr> <dbl> <dbl> <chr>
## 1 .model 10.4 0.730 minimal
The residuals seem to be white noise with a few data points in the left tail.
There is a weekly seasonality. The seasonality shows that there is a sudden drop followed by a sharp increase in amount of cash withdrawn once a week. The remainder seems random up until February/March 2010, where there is an increased variation. The data seems to have a slightly decreasing trend with a sudden increase and decrease around February/March 2010.
#plot
ATM %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
ggtitle("Non-tranformed ATM2")
# STL decomposition
ATM %>%
filter(ATM == "ATM2") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
# lambda for box cox transformation
lambda <- ATM %>%
filter(ATM == "ATM2") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm2_fit <- ATM %>%
filter(ATM == "ATM2") %>%
model(
# additive ETS model
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,lambda))
)
# stats for the models
left_join(glance(atm2_fit) %>% select(.model:BIC),
accuracy(atm2_fit) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## # A tibble: 7 x 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 33.5 -1136. 2284. 2284. 2307. 24.3
## 2 additive 622. -2246. 4512. 4513. 4551. 24.6
## 3 additive_ts 35.7 -1725. 3469. 3470. 3508. 25.3
## 4 snaive_ts 48.4 NA NA NA NA 29.6
## 5 snaive 881. NA NA NA NA 29.6
## 6 multiplicative_ts 0.270 -1891. 3801. 3802. 3840. 33.6
## 7 multiplicative 0.738 -2438. 4897. 4898. 4936. 76.5
When it comes to choosing the best model, RMSE, AIC, AICc, or BIC can be used as the comparative statistic. The ARIMA model has the lowest RMSE, AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.
# report of the arima model
atm2_fit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4174 -0.8997 0.4575 0.7879 -0.7083
## s.e. 0.0599 0.0485 0.0893 0.0628 0.0423
##
## sigma^2 estimated as 33.5: log likelihood=-1136.02
## AIC=2284.04 AICc=2284.28 BIC=2307.32
\((2,0,2)(0,1,1)_7\) was modeled on ATM2.
# forecasting
fc_atm2 <- atm2_fit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forcasted plot
fc_atm2 %>%
autoplot(ATM) +
ggtitle(TeX(paste0("ATM 2 Forcasted with $ARIMA(2,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda,2))))
# residuals
atm2_fit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(2,0,2)(0,1,1)_7$ with $\\lambda$ = ",
round(lambda,2))))
#Box-Pierce test, â„“=2m for seasonal data, m=7
atm2_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
## # A tibble: 1 x 4
## .model bp_stat bp_pvalue .name_repair
## <chr> <dbl> <dbl> <chr>
## 1 .model 9.95 0.766 minimal
The residuals seem to be white noise with a few residuals in both tails.
Unlike the other ATMs, ATM3 only has three data points which would not be enough to forecast an entire month, nor find an trend or seasonality. It would be best to use the mean of these three days to forecast the month of May.
#plot
ATM %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
ggtitle("Non-tranformed ATM3")
# mean model
atm3_fit <- ATM %>%
filter(ATM == "ATM3",
Cash != 0) %>%
model(MEAN(Cash))
# forecasting
fc_atm3 <- atm3_fit %>%
forecast(h = 31)
# forcasted plot
fc_atm3 %>%
autoplot(ATM) +
ggtitle("ATM 3 Forecasted with the MEAN() Model")
There is a clear outlier in ATM4 and it would be best to remove it in order to forecast the data. We then can then interpolate it for imputation. If we define outliers as those that are greater than 3 interquartile ranges, there are 2 outliers that are identified: September 22 and February 9.
#plot
ATM %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
ggtitle("Non-tranformed ATM4")
# STL decomposition
dcmp_4 <- ATM %>%
filter(ATM == "ATM4") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components()
dcmp_4 %>%
autoplot() +
labs(title = "STL Decomposition for ATM4")
# identifying the outliers
outliers <- dcmp_4 %>%
filter(remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder))
outliers
## # A dable: 2 x 8 [1D]
## # Key: ATM, .model [1]
## # : Cash = trend + season_week + remainder
## ATM .model Date Cash trend season_week remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 "STL(Cash ~~ 2009-09-22 1712. 373. -71.5 1411. 1784.
## 2 ATM4 "STL(Cash ~~ 2010-02-09 10920. 279. -71.5 10712. 10991.
ATM_miss <- ATM %>%
#replace outliers
mutate(Cash = replace(Cash, ATM == "ATM4" & Date == "2010-02-09", NA))
ATM <- ATM_miss %>%
# Fit ARIMA model to the data with missing values
model(ARIMA(Cash)) %>%
# Estimate Cash for the missing values / outliers
interpolate(ATM_miss)
There is a weekly seasonality. The seasonality shows that there is a sudden drop followed by a sharp increase in amount of cash withdrawn once a week. However, the gray bar is large in the seasonality, meaning the variation is the smallest in the seasonality compared to the variation in the data. The remainder seems random for most of the data and it picks up the outlier on September 22, which was not removed.
#plot
ATM %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
ggtitle("ATM4 with No Outlier")
# STL decomposition
ATM %>%
filter(ATM == "ATM4") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4 with No Outlier")
# lambda for box cox transformation
lambda <- ATM %>%
filter(ATM == "ATM4") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_fit <- ATM %>%
filter(ATM == "ATM4") %>%
model(
# additive ETS model
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,lambda)),
# transformed additive ETS model, no seasonality
additive_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("N")),
# transformed multiplicative ETS model, no seasonality
multiplicative_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("N"))
)
# stats for the models
left_join(glance(atm4_fit) %>% select(.model:BIC),
accuracy(atm4_fit) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## # A tibble: 9 x 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive 110883. -3192. 6404. 6405. 6443. 329.
## 2 additive_ts 95.3 -1904. 3828. 3828. 3867. 340.
## 3 multiplicative_ts 0.192 -1900. 3820. 3821. 3859. 346.
## 4 multiplicative 0.728 -3192. 6404. 6405. 6443. 354.
## 5 ARIMA 101. -1358. 2726. 2726. 2746. 354.
## 6 multiplicative_ts_no_s 0.211 -1938. 3883. 3883. 3894. 366.
## 7 additive_ts_no_s 113. -1938. 3883. 3883. 3894. 366.
## 8 snaive 205867. NA NA NA NA 453.
## 9 snaive_ts 164. NA NA NA NA 453.
The ETS() models with no seasonality ranked worse than those with seasonality. The Additive ETS() models has the highest RMSE but the other statistics are worse compared to models with transformed data. Yet again, the ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.
# report of the arima model
atm4_fit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0784 0.2131 0.2086 13.3387
## s.e. 0.0528 0.0516 0.0525 0.5518
##
## sigma^2 estimated as 100.6: log likelihood=-1358.09
## AIC=2726.18 AICc=2726.34 BIC=2745.68
\((0,0,1)(2,0,0)_7\) was modeled on ATM4.
# forecasting
fc_atm4 <- atm4_fit %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
# forcasted plot
fc_atm4 %>%
autoplot(ATM) +
ggtitle(TeX(paste0("ATM 4 Forcasted with $ARIMA(0,0,1)(2,0,0)_7$ and $\\lambda$ = ",
round(lambda,2))))
# residuals
atm4_fit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,1)(2,0,0)_7$ with $\\lambda$ = ",
round(lambda,2))))
#Box-Pierce test, â„“=2m for seasonal data, m=7
atm2_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
## # A tibble: 1 x 4
## .model bp_stat bp_pvalue .name_repair
## <chr> <dbl> <dbl> <chr>
## 1 .model 9.95 0.766 minimal
The residuals seem to be white noise with a few residuals in both tails. The forecasts may not seem like they model well as it can be seen that they taper off to the mean. However, the prediction levels may capture the true values for May 2010.
Here are the final forecasts on all four ATMs. ATM1 and ATM2 keep up with the seasonality and its variation. ATM3 is just the average of the non-zero data. Lastly, ATM4 seems to diminish to the mean the further into May we go.
# save as data frame
fc <- bind_rows(fc_atm1, fc_atm2, fc_atm3, fc_atm4) %>%
as.data.frame() %>%
select(Date, ATM, .mean) %>%
rename(Cash = .mean)
# export file
fc %>% write.csv("ATM_forecasts.csv")
# plot of may 2010
fc %>%
as_tsibble(index = Date, key = ATM) %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "Forecasted ATM Withdrawls in May 2010")
# altogether plot
fc %>%
as_tsibble(index = Date, key = ATM) %>%
autoplot(Cash) +
autolayer(ATM, Cash, colour = "black") +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawls")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Month was converted from text to a a monthly time object. The data was then transformed into a tsibble object.
# reading in excel file
Res_Load <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
# renaming column name
rename(Month = 'YYYY-MMM') %>%
# converting into date format
mutate(Month = yearmonth(Month)) %>%
# converting to tsibble
as_tsibble(index = Month)
head(Res_Load)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
There is 1 missing values in the KWH column. As mentioned in Section 13.9, ARIMA models will work in handling missing values. To fix the missing value, an ARIMA model was fitted on the data containing the missing value and then used to interpolate the missing observation.
summary(Res_Load$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
Res_Load <- Res_Load %>%
# Fit ARIMA model to the data containing missing values
model(ARIMA(KWH)) %>%
# Estimate Cash for the missing values.
interpolate(Res_Load)
# check to see if any missing values again
Res_Load %>%
as.data.frame() %>%
summarise(`Missing Values` = sum(is.na(KWH)))
## Missing Values
## 1 0
At first glance, it seems that the data is seasonal and has a slight increasing trend. There is also an outlier in July 2010 with a value of 770523. This is most likely due to an error in the data and is missing an additional digit. To fix the outlier, it will be best to remove it and interpolate it.
Res_Load %>%
autoplot(KWH) +
labs(title = "Residential Power Usage")
Res_miss <- Res_Load %>%
#replace outliers with NA
mutate(KWH = replace(KWH, KWH == 770523, NA))
Res_Load <- Res_miss %>%
# Fit ARIMA model to the data with missing values
model(ARIMA(KWH)) %>%
# Estimate Cash for the missing values / outliers
interpolate(Res_miss)
With the outlier removed, the graph looks more readable and sensible. There appears to be an increasing trend which accounts for the smallest variation in the data compared to the others. There is an annual seasonality, with peaks occurring in the summer and winter. The remained looks random, with a spike in Dec 2013.
#plot
Res_Load %>%
autoplot(KWH) +
ggtitle("Residential Power Usage with no Outlier")
# STL decomposition
Res_Load %>%
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition with no Outlier")
The seasonal plot further shows the increase in power every summer and winter. There is a clear increasing trend in the power usage during the peak months. The off-peak months probably vary every year due to the weather.
Res_Load %>%
gg_season(KWH, labels = "both") +
labs(title = "Seasonal plot: Residential Power Usage")
Res_Load %>%
gg_subseries(KWH) +
labs(title = "Residential Power Usage")
A box-cox transformation was applied to some of the models and compared to other models where there was no transformation.
# lambda for box cox transformation
lambda <- Res_Load %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
res_fit <- Res_Load %>%
model(
# additive ETS model
additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
# multiplicative ETS model
multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
# additive damped model
damped = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
# SNAIVE model
snaive = SNAIVE(KWH),
# transformed additive ETS model
additive_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
# transformed multiplicative ETS model
multiplicative_bc = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
# transformed additive damped model
damped_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("Ad") + season("A")),
# transformed SNAIVE model
snaive_bc = SNAIVE(box_cox(KWH,lambda)),
# arima model
ARIMA = ARIMA(box_cox(KWH,lambda))
)
# stats for the models
left_join(glance(res_fit) %>% select(.model:BIC),
accuracy(res_fit) %>% select(.model, RMSE)) %>%
arrange(AICc)
## # A tibble: 9 x 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1.51e- 5 752. -1493. -1493. -1478. 627673.
## 2 additive_bc 1.40e- 5 577. -1120. -1116. -1064. 610956.
## 3 damped_bc 1.41e- 5 576. -1116. -1112. -1058. 618831.
## 4 multiplicative_bc 6.51e- 7 575. -1115. -1112. -1060. 625349.
## 5 multiplicative 9.00e- 3 -3053. 6140. 6144. 6196. 613581.
## 6 additive 4.19e+11 -3065. 6165. 6168. 6220. 619703.
## 7 damped 4.25e+11 -3066. 6169. 6173. 6227. 622538.
## 8 snaive 6.51e+11 NA NA NA NA 809926.
## 9 snaive_bc 2.25e- 5 NA NA NA NA 809926.
Unlike the other data, this one produced negative values for AICc, AIC, and BIC. The additive exponential smoothing with a box cox transformation had the lowest RMSE, but the ARIMA had the best AIC, AICc, and BIC. Therefore, the ARIMA model is the best model to choose.
# report of the arima model
res_fit %>% select(.model = "ARIMA") %>% report()
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.2462 -0.7202 -0.3546 0.0013
## s.e. 0.0833 0.0751 0.0769 0.0004
##
## sigma^2 estimated as 1.505e-05: log likelihood=751.74
## AIC=-1493.47 AICc=-1493.13 BIC=-1477.51
\(ARIMA(0,0,1)(2,1,0)_{12}\) with drift was fitted on KWH.
# forecasting
fc_res <- res_fit %>%
forecast(h = 12) %>%
filter(.model=='ARIMA')
# forcasted plot
fc_res %>%
autoplot(Res_Load) +
ggtitle(TeX(paste0("ATM 1 Forcasted with $(0,0,1)(2,1,0)_{12}$ and $\\lambda$ = ",
round(lambda,2))))
# residuals
res_fit %>%
select(ARIMA) %>%
gg_tsresiduals(lag = 24) +
ggtitle(TeX(paste0("Residuals for $(0,0,1)(2,1,0)_{12}$ with $\\lambda$ = ",
round(lambda,2))))
#Box-Pierce test, â„“=2m for seasonal data, m=12
res_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 24, dof = 0)
## # A tibble: 1 x 4
## .model bp_stat bp_pvalue .name_repair
## <chr> <dbl> <dbl> <chr>
## 1 .model 29.9 0.187 minimal
The residuals seem to be white noise as shown by the graphs and the Box-Pierce test. It is interesting to note that the first few residuals seem to be constant.
Here are the final forecasted data using the ARIMA model:
fc_res <- fc_res %>%
as.data.frame() %>%
select(Month, .mean) %>%
rename(KWH = .mean) %>%
# adding back the case sequence
mutate(CaseSequence = 925:936) %>%
# rearranging order
relocate(CaseSequence)
fc_res
## CaseSequence Month KWH
## 1 925 2014 Jan 10297282
## 2 926 2014 Feb 8531331
## 3 927 2014 Mar 6650696
## 4 928 2014 Apr 5974089
## 5 929 2014 May 5942940
## 6 930 2014 Jun 8291370
## 7 931 2014 Jul 9539978
## 8 932 2014 Aug 10113059
## 9 933 2014 Sep 8474487
## 10 934 2014 Oct 5849784
## 11 935 2014 Nov 6112937
## 12 936 2014 Dec 8253088
# export file
fc_res %>% write.csv("Res_forecasts.csv")
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Date Time was transformed as it was a 5-digit numeric type that can be converted since 00000 would 1/1/1900. Pipe1 has multiple recordings within an hour, so the average of each hour is taken. The data was then transformed into a tsibble object.
# reading in excel file
temp <- read_excel("Waterflow_Pipe1.xlsx", col_types = c('date', 'numeric')) %>%
# converting into date format
mutate(`Date Time` = as_datetime(`Date Time`)) %>%
# renaming column name
rename(DateTime = `Date Time`) %>%
#separate the date and hour
mutate(date = as.Date(DateTime),
# get the hour
hour = paste(format(DateTime, format = "%H"),":00:00"))
Pipe1 <- temp %>%
# replace the date time with the rounded hours
mutate(DateTime = ymd(date) + hms(hour)) %>%
# grouping
group_by(DateTime) %>%
# taking average for each hour
mutate(WaterFlow = mean(WaterFlow)) %>%
# deleting duplicate rows
distinct(DateTime, WaterFlow) %>%
# converting to tsibble
as_tsibble(index = DateTime)
head(Pipe1)
## # A tsibble: 6 x 2 [1h] <UTC>
## # Groups: @ DateTime [6]
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.9
## 3 2015-10-23 02:00:00 15.2
## 4 2015-10-23 03:00:00 23.1
## 5 2015-10-23 04:00:00 15.5
## 6 2015-10-23 05:00:00 22.7
summary(Pipe1)
## DateTime WaterFlow
## Min. :2015-10-23 00:00:00 Min. : 8.923
## 1st Qu.:2015-10-25 10:45:00 1st Qu.:17.033
## Median :2015-10-27 22:30:00 Median :19.784
## Mean :2015-10-27 22:38:53 Mean :19.893
## 3rd Qu.:2015-10-30 10:15:00 3rd Qu.:22.789
## Max. :2015-11-01 23:00:00 Max. :31.730
There appears to be missing data for 4 hours in Pipe1. To fix this, we would have to interpolate the data using imputeTS. The interpolation method from Parts A and B did not work here, so a diferent method was used.
# find the missing hours and fill in NA
Pipe1 <- fill_gaps(Pipe1)
# identifies the hours
miss <- Pipe1 %>%
filter(is.na(WaterFlow))
miss
## # A tsibble: 4 x 2 [1h] <UTC>
## # Groups: @ DateTime [4]
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-27 17:00:00 NA
## 2 2015-10-28 00:00:00 NA
## 3 2015-11-01 05:00:00 NA
## 4 2015-11-01 09:00:00 NA
# interpolates the missing variables
temp <- temp %>%
select(DateTime, WaterFlow) %>%
rbind(.,miss) %>%
as_tsibble(index = DateTime) %>%
na_interpolation(.)
# combining the data, and replacing the NA values
Pipe1 <-left_join(Pipe1, temp, by = "DateTime") %>%
mutate(WaterFlow = coalesce(WaterFlow.x, WaterFlow.y)) %>%
select(DateTime, WaterFlow)
The data is stationary as we fail to reject the null hypothesis.
Pipe1 %>%
gg_tsdisplay(WaterFlow, plot_type='partial') +
labs(title = "Water Flow of Pipe 1")
Pipe1 %>%
features(WaterFlow, unitroot_kpss)
## # A tibble: 1 x 3
## kpss_stat kpss_pvalue .name_repair
## <dbl> <dbl> <chr>
## 1 0.269 0.1 minimal
Looking at the STL decomposition, there seems to be a daily seasonality.
Pipe1 %>%
model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition")
The data is transformed using box-cox. The ARIMA model with the box-cox transformation has lower statistics, meaning it is the better model.
# lambda for box cox transformation
lambda <- Pipe1 %>%
features(WaterFlow, features = guerrero) %>%
pull(lambda_guerrero)
p1_fit <- Pipe1 %>%
model(
# arima model
ARIMA_bc = ARIMA(box_cox(WaterFlow,lambda)),
ARIMA = ARIMA(WaterFlow)
)
glance(p1_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 2 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 18.2 -688. 1380. 1380. 1387.
## 2 ARIMA_bc 57.1 -825. 1655. 1655. 1662.
p1_fit %>% select(.model = "ARIMA_bc") %>% report()
## Series: WaterFlow
## Model: ARIMA(0,0,0) w/ mean
## Transformation: box_cox(WaterFlow, lambda)
##
## Coefficients:
## constant
## 29.0075
## s.e. 0.4866
##
## sigma^2 estimated as 57.07: log likelihood=-825.36
## AIC=1654.71 AICc=1654.76 BIC=1661.67
The ARIMA model is just the mean of the data. The residuals appear to be white noise.
# forecasting the data
p1_fc <-p1_fit %>%
forecast(h = 168) %>%
filter(.model=='ARIMA_bc')
# forecasted plot
p1_fc %>%
autoplot(Pipe1) +
ggtitle(TeX(paste0("Pipe 1 Forecasted with ARIMA $(0,0,0)$ with mean and $\\lambda$ = ",
round(lambda,2))))
# residual plot
p1_fit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle("Residuals for Pipe 1 | ARIMA(0,0,0) with mean")
The dates were converted for Pipe2 and the data was converted to a tsibble. There are no missing data nor hours.
# reading in excel file
Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c('date', 'numeric')) %>%
# converting into date format
mutate(`Date Time` = as_datetime(`Date Time`)) %>%
# renaming column name
rename(DateTime = `Date Time`) %>%
# converting to tsibble
as_tsibble(index = DateTime)
head(Pipe2)
## # A tsibble: 6 x 2 [1h] <UTC>
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
summary(Pipe2)
## DateTime WaterFlow
## Min. :2015-10-23 01:00:00 Min. : 1.885
## 1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
## Median :2015-11-12 20:30:00 Median :39.682
## Mean :2015-11-12 20:30:00 Mean :39.556
## 3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
## Max. :2015-12-03 16:00:00 Max. :78.303
The data is stationary as we fail to reject the null hypothesis.
Pipe2 %>%
gg_tsdisplay(WaterFlow, plot_type='partial') +
labs(title = "Water Flow of Pipe 2")
Pipe2 %>%
features(WaterFlow, unitroot_kpss)
## # A tibble: 1 x 3
## kpss_stat kpss_pvalue .name_repair
## <dbl> <dbl> <chr>
## 1 0.105 0.1 minimal
Looking at the STL decomposition, there seems to be some seasonality on a daily basis, as well as weekly.
Pipe2 %>%
model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition")
The data is transformed using box-cox. The ARIMA model without the box-cox transformation has lower statistics, meaning it is the better model.
# lambda for box cox transformation
lambda <- Pipe2 %>%
features(WaterFlow, features = guerrero) %>%
pull(lambda_guerrero)
p2_fit <- Pipe2 %>%
model(
# arima model
ARIMA_bc = ARIMA(box_cox(WaterFlow,lambda)),
ARIMA = ARIMA(WaterFlow)
)
glance(p2_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 2 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_bc 195. -4054. 8114. 8114. 8128.
## 2 ARIMA 256. -4191. 8387. 8387. 8402.
p2_fit %>%
select(.model = "ARIMA_bc") %>% report()
## Series: WaterFlow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
## Transformation: box_cox(WaterFlow, lambda)
##
## Coefficients:
## sma1 constant
## 0.0845 34.6140
## s.e. 0.0314 0.4772
##
## sigma^2 estimated as 194.7: log likelihood=-4053.87
## AIC=8113.75 AICc=8113.77 BIC=8128.47
With a large h in our forecast, the forecasts eventually become mean of the data. The residuals appear to be white noise.
# forecasting the data
p2_fc <-p2_fit %>%
forecast(h = 168) %>%
filter(.model=='ARIMA')
# forecasted plot
p2_fc %>%
autoplot(Pipe2) +
ggtitle(TeX(paste0("Pipe 2 Forecasted with ARIMA $(0,0,0)$ with mean and $\\lambda$ = ",
round(lambda,2))))
# residual plot
p2_fit %>%
select(ARIMA) %>%
gg_tsresiduals() +
ggtitle("Residuals for Pipe 1 | ARIMA(0,0,0) with mean")
Here are the final forecasted data using the ARIMA model. Since h was large, the forecasts resemble the mean for the most part. Although it can be forecasted, it might best not to forecast the data as they do not predict future values well.
p1_fc <- p1_fc %>%
as.data.frame() %>%
select(DateTime, .mean) %>%
rename(WaterFlow = .mean)
p2_fc <- p2_fc %>%
as.data.frame() %>%
select(DateTime, .mean) %>%
rename(WaterFlow = .mean)
# export file
write.xlsx(list('Pipe1' = p1_fc, 'Pipe2' = p2_fc), file = 'pipes.xlsx')