ATM and Residential Power Forecasting
library(tidyverse)
library(readxl)
library(fpp3)
library(forecast)
Objective
Part A – ATM Forecast, ATM624Data.xlsx
Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Model the data and create a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours.
Part A - ATM Forecast
Read in Data
The data is stored in an excel file and needs to be loaded into R to perform analysis.
Raw Data
There are missing values that need to be imputed prior to analysis to maintain continuity in the time series.
<- read_excel("ATM624Data.xlsx", col_types = c('date', 'text', 'numeric'))
a
head(a)
## # A tibble: 6 x 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
summary(a)
## DATE ATM Cash
## Min. :2009-05-01 00:00:00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00 Max. :10919.8
## NA's :19
Cleaned and Imputed Data
Upon further inspection, most of the missing data values occur from May 1st, 2010 to May 14th, 2010. The objective of the project is to forecast the month of May 2010, therefore I will only use the data provided before May, 1st, 2010 to create my forecast. There are still 5 missing values that need to be imputed prior to advancing the analysis. ATM1 is missing 3 values and ATM2 is missing 2 values. I chose to impute the missing values with the Last Observation Carried Forward technique which simply fills the missing value with the previous non-NA value.
$ATM <- as.factor(a$ATM)
a$DATE <- as.Date(a$DATE)
ahead(a)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
summary(a)
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-08-01 ATM2:365 1st Qu.: 0.5
## Median :2009-11-01 ATM3:365 Median : 73.0
## Mean :2009-10-31 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-02-01 NA's: 14 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
which(is.na(a$ATM))
## [1] 731 732 733 734 735 736 737 738 739 740 741 742 743 744
731:744,] a[
## # A tibble: 14 x 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
<- a[-c(731:744),] a
<- spread(data = a, key = ATM, value = Cash)
a.data paste0("Number of Cash NA values for ATM1: " , sum(is.na(a.data$ATM1)))
## [1] "Number of Cash NA values for ATM1: 3"
paste0("Number of Cash NA values for ATM2: " , sum(is.na(a.data$ATM2)))
## [1] "Number of Cash NA values for ATM2: 2"
paste0("Number of Cash NA values for ATM3: " , sum(is.na(a.data$ATM3)))
## [1] "Number of Cash NA values for ATM3: 0"
paste0("Number of Cash NA values for ATM4: " , sum(is.na(a.data$ATM4)))
## [1] "Number of Cash NA values for ATM4: 0"
<- a.data %>%
a.data fill(ATM1, .direction = "down") %>%
fill(ATM2, .direction = "down")
paste0("Number of NA values in dataset: " , sum(is.na(a.data)))
## [1] "Number of NA values in dataset: 0"
summary(a.data)
## DATE ATM1 ATM2 ATM3
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.00 1st Qu.: 0.0000
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
## Mean :2009-10-30 Mean : 84.19 Mean : 62.57 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
<- a.data %>%
a.data.ts gather(ATM, Cash, ATM1:ATM4)
<- as_tsibble(a.data.ts, index = DATE, key = ATM) a.data.ts
Data Exploration
There looks to be seasonality in ATM1 and ATM2. ATM3 appears to have been inactive for most of the time series with a spike of activity with the last few days. ATM4 has an outlier that is making it difficult to understand the series. To minimize the bias of the outlier, I will replace the outlier with the median.
Notes on ATMS:
- ATM1 does not have a clear trend but has seasonality.
- ATM2 does not have a clear trend but has seasonality.
- ATM3 has a Cash gain of 0 for most of the series. This likely means that the ATM was not in service until the end of the time series.
- ATM4, after the outlier was replaced, appears to have some seasonality and no clear trend.
%>%
a.data.ts autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
ggtitle("Prior to Outlier Adjustment")
paste0("Row number of outlier: ", which.max(a.data.ts$Cash))
## [1] "Row number of outlier: 1380"
1380,3] <- 403.839 a.data.ts[
%>%
a.data.ts autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
ggtitle("After Outlier Adjustment")
Data Modeling
ATM1
Check if Data is Stationary
In order to use ARIMA models, the data must be stationary. The data appears to have strong seasonal autocorrelation every week. After taking 1 seasonal difference with the lag equal to 7 days, the data appears to be stationary.
<- a.data.ts %>%
atm1 filter(ATM == 'ATM1')
gg_tsdisplay(atm1, plot_type='partial') +
ggtitle("ATM1 Before Differencing")
## Plot variable not specified, automatically selected `y = Cash`
%>%
atm1 features(Cash, unitroot_kpss, lag = 7)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM1 0.517 0.0379
<- atm1 %>%
nsd features(Cash, unitroot_nsdiffs, lag = 7)
paste0("Number of Seasonal differences needed for stationarity: ", nsd$nsdiffs)
## [1] "Number of Seasonal differences needed for stationarity: 1"
%>%
atm1 features(difference(Cash, lag = 7), unitroot_kpss)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM1 0.0220 0.1
gg_tsdisplay(atm1, difference(Cash, lag = 7), plot_type='partial') +
ggtitle("ATM1 After Differencing")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
<- atm1 %>%
atm1 mutate(diff_Cash = difference(Cash, lag = 7))
Model Creation
To find the optimal model for ATM1, 5 separate models comprised of different forecasting methods were created. The models will be compared via accuracy metrics to find the optimal model. The following models were created:
- MEAN model, using the mean as the forecast
- SNAIVE model, using the seasonal naive value as the prediction
- ETS(A,N,A), using the additive exponential smoothing prediction as forecast
- ETS(M, N, M) using the multiplicative exponential smoothing predication as forecast
- ARIMA(0,0,1)(0,1,2)[7]
<- atm1 %>%
lambda features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
<- atm1 %>%
fit1 model(
MEAN = MEAN(Cash),
SNAIVE = SNAIVE(Cash),
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
#ETS_Additive_Boxcox = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
#ETS_Multiplicative_Boxcox = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M"))
ARIMA = ARIMA(Cash)
)
<- fit1 %>%
fc1 forecast(h = 31)
#fit1.arima <- atm1 %>%
# model(
# ARIMA_Differenced = ARIMA(diff_Cash)
# )
#
#fc1.arima <- fit1.arima %>%
# forecast(h = 31)
Model Evaluation
The best model according to the accuracy metrics is the ARIMA(0,0,1)(0,1,2)[7]. The ARIMA model had the best RMSE and MAE indicating that it had the best fit for the time series.
#acc.met <- fit1 %>%
# glance() %>%
# select(.model:BIC)
#acc.met2 <- fit1.arima %>%
# glance() %>%
# select(.model:BIC)
#acc <- rbind(acc.met, acc.met2)
<- fit1 %>%
acc.met2 accuracy() %>%
select(.model, RMSE, MAE)
#RMSE2 <- fit1.arima %>%
# accuracy() %>%
# select(RMSE)
#rmse <- rbind(RMSE, RMSE2)
#kableExtra::kable(cbind(acc.met, acc.met2))
::kable(acc.met2) kableExtra
.model | RMSE | MAE |
---|---|---|
MEAN | 36.63506 | 27.29360 |
SNAIVE | 27.83195 | 17.78492 |
ETS_Additive | 23.84001 | 15.11082 |
ETS_Multiplicative | 26.29736 | 16.22126 |
ARIMA | 23.30822 | 14.59878 |
Model Diagnostics
The ARIMA(0,0,1)(0,1,2)[7] residuals meet the key expectations for model forecasting. The innovation residuals are likely from white noise as the lag values are within the blue bands in the ACF plot. The mean of the innovation residuals is very close to zero (mean = -0.07), meaning the model is unbiased. The properties of constant variance and normality are not met but are not necessary for forecasting.
<- atm1 %>%
fit1.forecast model(
ARIMA = ARIMA(Cash)
)
%>%
fit1.forecast gg_tsresiduals()
<- residuals(fit1.forecast, type = "innovation")
resid mean(resid$.resid)
## [1] -0.07437729
Forecast
The forecast for ATM1 is provided below. The forecasts have been stored in a .csv file that will be up on my Github and attached to the Project submission.
<- atm1 %>%
fit1.forecast model(
ARIMA = ARIMA(Cash)
)
<- fit1.forecast %>%
fc1.final forecast(h = 31)
%>%
fc1.final autoplot(atm1) +
ggtitle("ATM1: ARIMA(0,0,1)(0,1,2)[7] Forecast")
<- as.data.frame(fc1.final) %>%
DATA624_Project1_PartA_ATM1_Forecast select(DATE, .mean)
write.csv(DATA624_Project1_PartA_ATM1_Forecast, "DATA624_Project1_PartA_ATM1_Forecast.csv")
ATM2
Check if Data is Stationary
Similar to ATM1, ATM2 data appears to have strong seasonal autocorrelation every week. After taking 1 seasonal difference with the lag equal to 7 days, the data appears to be stationary.
<- a.data.ts %>%
atm2 filter(ATM == 'ATM2')
gg_tsdisplay(atm1, plot_type='partial') +
ggtitle("ATM2 Before Differencing")
## Plot variable not specified, automatically selected `y = Cash`
%>%
atm2 features(Cash, unitroot_kpss, lag = 7)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM2 2.15 0.01
<- atm2 %>%
nsd features(Cash, unitroot_nsdiffs, lag = 7)
paste0("Number of Seasonal differences needed for stationarity: ", nsd$nsdiffs)
## [1] "Number of Seasonal differences needed for stationarity: 1"
%>%
atm2 features(difference(Cash, lag = 7), unitroot_kpss)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM2 0.0167 0.1
gg_tsdisplay(atm2, difference(Cash, lag = 7), plot_type='partial') +
ggtitle("ATM2 After Differencing")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
<- atm2 %>%
atm2 mutate(diff_Cash = difference(Cash, lag = 7))
Model Creation
A handful of models were created in the process of identifying the ideal model for ATM2. The models will be compared via accuracy metrics to find the optimal model. The following models were created:
- MEAN model, using the mean as the forecast
- SNAIVE model, using the seasonal naive value as the prediction
- ETS(A,N,A), using the additive exponential smoothing prediction as forecast
- ETS(M, N, M) using the multiplicative exponential smoothing predication as forecast
- ARIMA(2,0,2)(0,1,1)[7]
<- atm2 %>%
lambda features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
<- atm2 %>%
fit2 model(
MEAN = MEAN(Cash),
SNAIVE = SNAIVE(Cash),
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
#ETS_Additive_Boxcox = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
#ETS_Multiplicative_Boxcox = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M"))
ARIMA = ARIMA(Cash)
)
<- fit2 %>%
fc2 forecast(h = 31)
#fit1.arima <- atm1 %>%
# model(
# ARIMA_Differenced = ARIMA(diff_Cash)
# )
#
#fc1.arima <- fit1.arima %>%
# forecast(h = 31)
Model Evaluation
The best model according to the accuracy metrics is the ARIMA(2,0,2)(0,1,1)[7]. The ARIMA model had the best RMSE and MAE indicating that it had the best fit for the time series. The ETS_Additive model is a close second choice model in terms of model fit.
#acc.met <- fit1 %>%
# glance() %>%
# select(.model:BIC)
#acc.met2 <- fit1.arima %>%
# glance() %>%
# select(.model:BIC)
#acc <- rbind(acc.met, acc.met2)
<- fit2 %>%
acc.met4 accuracy() %>%
select(.model, RMSE, MAE)
#RMSE2 <- fit1.arima %>%
# accuracy() %>%
# select(RMSE)
#rmse <- rbind(RMSE, RMSE2)
#kableExtra::kable(cbind(acc.met, acc.met2))
::kable(acc.met4) kableExtra
.model | RMSE | MAE |
---|---|---|
MEAN | 38.83768 | 33.28018 |
SNAIVE | 30.12358 | 20.74860 |
ETS_Additive | 25.04569 | 17.66281 |
ETS_Multiplicative | 74.08215 | 29.24670 |
ARIMA | 24.19476 | 17.01608 |
Model Diagnostics
The ARIMA(2,0,2)(0,1,1)[7] met the key expectations for model forecasting. The ACF plot shows that innovation residuals are likely from white noise as the lag values are below / close to the bands. The mean of the innovation residuals is extremely close to zero (mean = -0.87). The homoscedascity property is not met, however, as the variance changes throughout the series. Finally the data is not exactly from a normal distribution. The homoscedascity and normality properties are helpful but not necessary for forecasting.
<- atm2 %>%
fit2.forecast model(
ARIMA = ARIMA(Cash)
)
%>%
fit2.forecast gg_tsresiduals()
<- residuals(fit2.forecast, type = "innovation")
resid mean(resid$.resid)
## [1] -0.8725862
Forecast
The forecast for ATM2 is provided below. The forecasts have been stored in a .csv file that will be up on my Github and attached to the Project submission.
<- atm2 %>%
fit2.forecast model(
ARIMA = ARIMA(Cash)
)
<- fit2.forecast %>%
fc2.final forecast(h = 31)
%>%
fc2.final autoplot(atm2) +
ggtitle("ATM2: ARIMA(2,0,2)(0,1,1)[7] Forecast")
<- as.data.frame(fc2.final) %>%
DATA624_Project1_PartA_ATM2_Forecast select(DATE, .mean)
write.csv(DATA624_Project1_PartA_ATM2_Forecast, "DATA624_Project1_PartA_ATM2_Forecast.csv")
ATM 3
Check if Data is Stationary
ATM3 does not need to be differenced prior to ARIMA model creation. There is no seasonality present like the previous two ATMs and the unitroot kpss test showed that the series is stationary as is.
<- a.data.ts %>%
atm3 filter(ATM == 'ATM3')
gg_tsdisplay(atm3, plot_type='partial') +
ggtitle("ATM3 Before Differencing")
## Plot variable not specified, automatically selected `y = Cash`
%>%
atm3 features(Cash, unitroot_kpss, lag = 7)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM3 0.389 0.0818
<- atm3 %>%
nsd features(Cash, unitroot_nsdiffs, lag = 7)
paste0("Number of Seasonal differences needed for stationarity: ", nsd$nsdiffs)
## [1] "Number of Seasonal differences needed for stationarity: 0"
Model Creation
Many models were fit to try to identify the ideal model for ATM3. The models will be compared via accuracy metrics to find the optimal model. The following models were created:
- MEAN model, using the mean as the forecast
- NAIVE model, using the naive value as the prediction
- ETS(A,N,A), using the additive exponential smoothing prediction as forecast
- ETS(M, N, M) using the multiplicative exponential smoothing predication as forecast
- ARIMA(0,0,2)
#lambda <- atm3 %>%
# features(Cash, features = guerrero) %>%
# pull(lambda_guerrero)
<- atm3 %>%
fit3 model(
MEAN = MEAN(Cash),
NAIVE = NAIVE(Cash),
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
#ETS_Additive_Boxcox = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
#ETS_Multiplicative_Boxcox = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M"))
ARIMA = ARIMA(Cash)
)
## Warning: 1 error encountered for ETS_Multiplicative
## [1] missing value where TRUE/FALSE needed
<- fit3 %>%
fc3 forecast(h = 31)
#fit1.arima <- atm1 %>%
# model(
# ARIMA_Differenced = ARIMA(diff_Cash)
# )
#
#fc1.arima <- fit1.arima %>%
# forecast(h = 31)
Model Evaluation
The best model due to the combination of simplicity and accuracy is the NAIVE model. Though the ARIMA model has slight superior accuracy metrics, the NAIVE model is much simpler and is almost as effective (0.06 difference in RMSE and 0.04 in MAE).
#acc.met <- fit1 %>%
# glance() %>%
# select(.model:BIC)
#acc.met2 <- fit1.arima %>%
# glance() %>%
# select(.model:BIC)
#acc <- rbind(acc.met, acc.met2)
<- fit3 %>%
acc.met6 accuracy() %>%
select(.model, RMSE, MAE)
#RMSE2 <- fit1.arima %>%
# accuracy() %>%
# select(RMSE)
#rmse <- rbind(RMSE, RMSE2)
#kableExtra::kable(cbind(acc.met, acc.met2))
::kable(acc.met6) kableExtra
.model | RMSE | MAE |
---|---|---|
MEAN | 7.933887 | 1.4292513 |
NAIVE | 5.087422 | 0.3104396 |
ETS_Additive | 4.994687 | 0.7623657 |
ETS_Multiplicative | NaN | NaN |
ARIMA | 5.026171 | 0.2714591 |
Forecast
The forecast for ATM3 is provided below. The forecasts have been stored in a .csv file that will be up on my Github and attached to the Project submission.
<- atm3 %>%
fit3.forecast model(
NAIVE = NAIVE(Cash)
)
<- fit3.forecast %>%
fc3.final forecast(h = 31)
%>%
fc3.final autoplot(atm3) +
ggtitle("ATM3: NAIVE Forecast")
<- as.data.frame(fc3.final) %>%
DATA624_Project1_PartA_ATM3_Forecast select(DATE, .mean)
write.csv(DATA624_Project1_PartA_ATM3_Forecast, "DATA624_Project1_PartA_ATM3_Forecast.csv")
ATM4
Check if Data is Stationary
ATM4 appears to be stationary without having to difference. The ACF and PACF show that differencing is not necessary.
<- a.data.ts %>%
atm4 filter(ATM == 'ATM4')
gg_tsdisplay(atm4, plot_type='partial') +
ggtitle("ATM4 Before Differencing")
## Plot variable not specified, automatically selected `y = Cash`
%>%
atm4 features(Cash, unitroot_kpss, lag = 7)
## # A tibble: 1 x 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM4 0.118 0.1
<- atm4 %>%
nsd features(Cash, unitroot_ndiffs)
paste0("Number of differences needed for stationarity: ", nsd$ndiffs)
## [1] "Number of differences needed for stationarity: 0"
Model Creation
Many models were fit to try to identify the ideal model for ATM4. The following models were created:
- MEAN model, using the mean as the forecast
- SNAIVE model, using the seasonal naive value as the prediction
- ETS(A,N,A), using the additive exponential smoothing prediction as forecast
- ETS(M, N, M) using the multiplicative exponential smoothing predication as forecast
- ARIMA(3,0,2)(1,0,0)[7]
#lambda <- atm3 %>%
# features(Cash, features = guerrero) %>%
# pull(lambda_guerrero)
<- atm4 %>%
fit4 model(
MEAN = MEAN(Cash),
SNAIVE = SNAIVE(Cash),
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
#ETS_Additive_Boxcox = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
#ETS_Multiplicative_Boxcox = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M"))
ARIMA = ARIMA(Cash)
)
<- fit4 %>%
fc4 forecast(h = 31)
#fit1.arima <- atm1 %>%
# model(
# ARIMA_Differenced = ARIMA(diff_Cash)
# )
#
#fc1.arima <- fit1.arima %>%
# forecast(h = 31)
Model Evaluation
The best fitting model for ATM4 is the ETS_Additive model. This model has the lowest RMSE and MAE suggesting that it is the best fit for future predictions.
#acc.met <- fit1 %>%
# glance() %>%
# select(.model:BIC)
#acc.met2 <- fit1.arima %>%
# glance() %>%
# select(.model:BIC)
#acc <- rbind(acc.met, acc.met2)
<- fit4 %>%
acc.met8 accuracy() %>%
select(.model, RMSE, MAE)
#RMSE2 <- fit1.arima %>%
# accuracy() %>%
# select(RMSE)
#rmse <- rbind(RMSE, RMSE2)
#kableExtra::kable(cbind(acc.met, acc.met2))
::kable(acc.met8) kableExtra
.model | RMSE | MAE |
---|---|---|
MEAN | 350.4289 | 291.8324 |
SNAIVE | 452.4937 | 346.4162 |
ETS_Additive | 328.7012 | 264.9232 |
ETS_Multiplicative | 354.0933 | 276.2913 |
ARIMA | 339.4947 | 279.9456 |
Model Diagnostics
The Additive Exponential Smoothing model meets some of the criteria to yield a good forecast. The ACF plot conveys that the innovation residuals are from white noise. The mean of the innovation residuals are also relatively close to 0. The model does not meet the properties of constant variance and normal distribution. This model should be fairly effective in forecasting ATM4 cash flow. The homoscedascity and normality properties are helpful but not necessary for forecasting.
<- atm4 %>%
fit4.forecast model(
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A"))
)
%>%
fit4.forecast gg_tsresiduals()
<- residuals(fit4.forecast, type = "innovation")
resid mean(resid$.resid)
## [1] -8.288658
Forecast
The forecast for ATM4 is provided below. The forecasts have been stored in a .csv file that will be up on my Github and attached to the Project submission.
<- atm4 %>%
fit4.forecast model(
ETS_Additive = ETS(Cash ~ error("A") + trend("N") + season("A"))
)
<- fit4.forecast %>%
fc4.final forecast(h = 31)
%>%
fc4.final autoplot(atm4) +
ggtitle("ATM4: Additive Exponential Smoothing Forecast")
<- as.data.frame(fc4.final) %>%
DATA624_Project1_PartA_ATM4_Forecast select(DATE, .mean)
write.csv(DATA624_Project1_PartA_ATM4_Forecast, "DATA624_Project1_PartA_ATM4_Forecast.csv")
Part B - Forecasting Power
Read in Data
The date column needs to be adjusted to a yearmonth date rather than
a character. There is also a missing value in the KWH
column. I will replace the missing value with the prior year value of
the same month. For example, the missing value for September, 2008 will
be filled with the September, 2007 KWH value.
<- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
b
head(b)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
<- b %>%
b rename(YearMonth = "YYYY-MMM") %>%
mutate(YearMonth = yearmonth(YearMonth))
<- as_tsibble(b, index = YearMonth)
b.data
head(b.data)
## # A tsibble: 6 x 3 [1M]
## CaseSequence YearMonth KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
summary(b.data$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
paste0("Num of missing values: ", sum(is.na(b.data)))
## [1] "Num of missing values: 1"
paste0("Row index of NA value: " ,which(is.na(b.data$KWH)))
## [1] "Row index of NA value: 129"
129,] b.data[
## # A tsibble: 1 x 3 [1M]
## CaseSequence YearMonth KWH
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
129,3] <- b.data[117,3] b.data[
paste0("Num of missing values: ", sum(is.na(b.data)))
## [1] "Num of missing values: 0"
Data Exploration
The data is seasonal and contains and outlier for July 2010 with a much lower power usage than any other month. There also appears to be a slight upward trend in energy usage which could be due to a rise / fall in local temperature month over month especially during summer and winter months. The outlier will be replaced with the previous year’s value for that month (Ex July 2010 will be replaced with July 2009).
%>%
b.data autoplot(KWH) +
ggtitle("Residential Customer Power Consumption w/ Outlier")
paste0("Row index of outlier: " ,which.min(b.data$KWH))
## [1] "Row index of outlier: 151"
151,] b.data[
## # A tsibble: 1 x 3 [1M]
## CaseSequence YearMonth KWH
## <dbl> <mth> <dbl>
## 1 883 2010 Jul 770523
151,3] <- b.data[139,3] b.data[
%>%
b.data autoplot(KWH) +
ggtitle("Residential Customer Power Consumption w/ Adjusted Outlier")
Data Modeling
Check if Data is Stationary
Like in Part A, checking the stationarity of the time series is key to using ARIMA models. The data appears to have strong seasonal autocorrelation every 12 months. After taking 1 seasonal difference the data appears to be stationary.
gg_tsdisplay(b.data, y = KWH, plot_type='partial') +
ggtitle("Before Differencing")
%>%
b.data features(KWH, unitroot_kpss, lag = 12)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.67 0.01
<- b.data %>%
nsd features(KWH, unitroot_nsdiffs, lag = 12)
paste0("Number of Seasonal differences needed for stationarity: ", nsd$nsdiffs)
## [1] "Number of Seasonal differences needed for stationarity: 1"
%>%
b.data features(difference(KWH, lag = 12), unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.162 0.1
gg_tsdisplay(b.data, difference(KWH, lag = 12), plot_type='partial') +
ggtitle("ATM1 After Differencing")
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).
<- b.data %>%
b.data mutate(diff_KWH = difference(KWH, lag = 12))
Model Creation
A few different models with varying degrees of complexity were created to attempt to find a model that could be effective in forecasting future consumer power usage. Basic models like MEAN and SNAIVE were used as well as more complicated ETS and ARIMA models. Each model will be compared to find the best fitting model to use for forecasting.
<- b.data %>%
fit5 model(
MEAN = MEAN(KWH),
SNAIVE = SNAIVE(KWH),
ETS_Additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
ETS_Multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
#ETS_Additive_Boxcox = ETS(box_cox(KWH, lambda) ~ error("A") + trend("N") + season("A")),
#ETS_Multiplicative_Boxcox = ETS(box_cox(KWH, lambda) ~ error("M") + trend("N") + season("M"))
ARIMA = ARIMA(KWH)
)
Model Evaluation
The ARIMA(0,0,4)(2,1,0)[12] w/ drift model outperformed all of the other models in terms of RMSE and MAE. This model will be used for forecasting.
#acc.met <- fit1 %>%
# glance() %>%
# select(.model:BIC)
#acc.met2 <- fit1.arima %>%
# glance() %>%
# select(.model:BIC)
#acc <- rbind(acc.met, acc.met2)
<- fit5 %>%
acc.met10 accuracy() %>%
select(.model, RMSE, MAE)
#RMSE2 <- fit1.arima %>%
# accuracy() %>%
# select(RMSE)
#rmse <- rbind(RMSE, RMSE2)
#kableExtra::kable(cbind(acc.met, acc.met2))
::kable(acc.met10) kableExtra
.model | RMSE | MAE |
---|---|---|
MEAN | 1383989.8 | 1170541.1 |
SNAIVE | 810180.3 | 608304.1 |
ETS_Additive | 619779.7 | 457833.0 |
ETS_Multiplicative | 613720.8 | 455271.5 |
ARIMA | 584710.2 | 422768.5 |
Model Diagnostics
The ARIMA(0,0,4)(2,1,0)[12] w/ drift meet most of the expectations for model forecasting. The AFC plot shows that innovation residuals are likely from white noise. The mean of the innovation residuals are close to zero given the scale of the innovation residuals is in the millions (mean = -8653). The homoscedascity property is not met, however, as the variance changes throughout the series. Finally the data is not exactly from a normal distribution. The homoscedascity and normality properties are helpful but not necessary for forecasting.
<- b.data %>%
fit5.forecast model(
ARIMA = ARIMA(KWH)
)
%>%
fit5.forecast gg_tsresiduals()
<- residuals(fit5.forecast, type = "innovation")
resid mean(resid$.resid)
## [1] -8653.906
Forecast
The forecast for the Electricity Consumption is provided below. The forecasts have been stored in a .csv file that will be up on my Github and attached to the Project submission.
<- fit5.forecast %>%
fc5.final forecast(h = "1 year")
%>%
fc5.final autoplot(b.data) +
ggtitle("Residential Power Consumption: ARIMA(0,0,4)(2,1,0)[12] w/ drift Forecast")
<- as.data.frame(fc5.final) %>%
DATA624_Project1_PartB_Forecast select(YearMonth, .mean)
write.csv(DATA624_Project1_PartB_Forecast, "DATA624_Project1_PartB_Forecast.csv")