DATA 624 Project One - Part B - Forecasting Power Consumption
Read in the Data
Read in the Residential Energy Consumption Data
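The chunk that reads the file is not echoed in this output; a minimal sketch, assuming the data arrives as an Excel file named ResidentialCustomerForecastLoad-624.xlsx with a KWH consumption column (the file name and use of readxl are assumptions), would be:

library(readxl)
# Read the residential power consumption data; KWH is the monthly consumption column (assumed)
ResidentialCustomerForecastLoad_624 <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(ResidentialCustomerForecastLoad_624)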
Impute Nulls with the Mean
This will impute the mean for any missing values in the time series.
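A minimal sketch of the mean imputation, assuming the consumption column is named KWH:

# Replace any missing KWH values with the mean of the observed values
ResidentialCustomerForecastLoad_624$KWH[is.na(ResidentialCustomerForecastLoad_624$KWH)] <-
  mean(ResidentialCustomerForecastLoad_624$KWH, na.rm = TRUE)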
Create the Time Series Object with Frequency = 12
This is a monthly time series running from 1998 to 2013.
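A sketch of how the time series object could be constructed, assuming the series starts in January 1998:

# Build a monthly time series (frequency = 12) starting in January 1998
ResidentialCustomerForecastLoad_624_ts <- ts(ResidentialCustomerForecastLoad_624$KWH,
                                             start = c(1998, 1), frequency = 12)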
Plot the Time Series
Just looking at the time series, there appears to be definite seasonality in the data and a slight upward trend.
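The plotting chunk is not echoed; a sketch using autoplot() from the forecast and ggplot2 packages used elsewhere in this report:

library(forecast)
library(ggplot2)
# Plot the full monthly series to inspect trend and seasonality
autoplot(ResidentialCustomerForecastLoad_624_ts) +
  ggtitle("Monthly Residential Power Consumption") +
  ylab("KWH")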
Create some Seasonal Plots to Look at Potential Seasonality
It looks like much more energy is used in the middle of the summer and the middle of the winter, which makes sense because these are the periods when households use more electricity to cool and heat their homes. There are a few outliers in the dataset: one occurred in 2008 during the financial crisis and another in 2010. I think both were periods of economic turmoil, and that this is the primary reason these observations are such outliers.
## Warning: Removed 16 rows containing missing values (geom_path).
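The seasonal plot chunks are not echoed; a sketch of how they could be produced with the forecast package:

# Seasonal plot: one line per year, months on the x-axis
ggseasonplot(ResidentialCustomerForecastLoad_624_ts, year.labels = TRUE)
# Subseries plot: one panel per month, showing each month's behavior across years
ggsubseriesplot(ResidentialCustomerForecastLoad_624_ts)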
Residential Customer Forecast - Auto ARIMA
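The fitting chunk is not echoed; a sketch of the call that could produce a summary like the one below (the object name auto_arima_fit is a placeholder):

# Let auto.arima() search for a seasonal ARIMA model on the monthly series
auto_arima_fit <- auto.arima(ResidentialCustomerForecastLoad_624_ts)
summary(auto_arima_fit)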
## Series: ResidentialCustomerForecastLoad_624_ts
## ARIMA(0,0,2)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 sar1 sar2 drift
## 0.1739 0.0505 -0.7591 -0.4124 8750.907
## s.e. 0.0766 0.0844 0.0697 0.0682 3214.839
##
## sigma^2 estimated as 7.841e+11: log likelihood=-2707.12
## AIC=5426.25 AICc=5426.73 BIC=5445.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -9730.731 845160.6 506279.9 -5.055329 11.57781 0.730904
## ACF1
## Training set 0.008533477
Look at the Autocorrelation Plots
As expected, lag 12 has the highest correlation, because we would expect energy consumption in, say, January of this year to be similar to that of January of the previous year.
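A sketch of the autocorrelation plots, using the ggAcf() and ggPacf() helpers from the forecast package:

# Autocorrelation and partial autocorrelation of the monthly series
ggAcf(ResidentialCustomerForecastLoad_624_ts)
ggPacf(ResidentialCustomerForecastLoad_624_ts)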
Run a Cross-Validation Loop to Check a Number of Different Models and See Which Performs Best Out of Sample
Write a loop to run time series cross-validation for each forecasting method. The STLF model performed best out of sample; this was with its default method, which I believe is "ETS", a form of exponential smoothing. The auto ARIMA model built in the step above was a close second, and the seasonal naive model was the third-best performer. The seasonal naive model simply takes the value from the same season of the previous year and uses it as the prediction for the current season.
method <- c("auto_arima_forecast_fun","hw","snaive","rwf","croston","holt","ses","forecast","stlf")
for (i in method){
  # Compute cross-validated forecast errors for up to 31 steps ahead (h = 31)
  e <- tsCV(ResidentialCustomerForecastLoad_624_ts, forecastfunction = get(i), h = 31)
  # Per-horizon MSE values with missing values removed (not used further below)
  mse <- colMeans(e^2, na.rm = TRUE)
  # Report the overall cross-validated RMSE for this method
  print(paste("The residual mean squared error of the", i, "model is:", sqrt(mean(e^2, na.rm = TRUE))))
}
## [1] "The residual mean squared error of the auto_arima_forecast_fun model is: 1014511.75822615"
## [1] "The residual mean squared error of the hw model is: 1816109.28874367"
## [1] "The residual mean squared error of the snaive model is: 1202817.31119995"
## [1] "The residual mean squared error of the rwf model is: 1953698.56308183"
## [1] "The residual mean squared error of the croston model is: 1265455.77408895"
## [1] "The residual mean squared error of the holt model is: 4880726.22842738"
## [1] "The residual mean squared error of the ses model is: 1719938.66327158"
## [1] "The residual mean squared error of the forecast model is: 1717452.3069133"
## [1] "The residual mean squared error of the stlf model is: 963062.096475599"
Build out the STL Model
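The model-building chunk is not echoed; a sketch, assuming the forecast object is stored as stlf (the name used in the output step below) with a 24-month horizon to match the forecasts printed here:

# STL decomposition with an ETS forecast of the seasonally adjusted series
# Note: assigning to the name stlf shadows forecast::stlf() from this point on
stlf <- stlf(ResidentialCustomerForecastLoad_624_ts, h = 24)
summary(stlf)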
##
## Forecast method: STL + ETS(M,N,N)
##
## Model Information:
## ETS(M,N,N)
##
## Call:
## ets(y = x, model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.1344
##
## Initial states:
## l = 6290586.7981
##
## sigma: 0.1138
##
## AIC AICc BIC
## 6199.441 6199.568 6209.213
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 52502.77 756322.5 470050.6 -3.93305 10.9949 0.6834311 0.1415413
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9600572 8485694 10715449 7895514 11305629
## Feb 2014 8559481 7434444 9684518 6838885 10280077
## Mar 2014 7200023 6064915 8335132 5464025 8936022
## Apr 2014 6396041 5250947 7541134 4644771 8147310
## May 2014 6060720 4905725 7215715 4294308 7827132
## Jun 2014 7724342 6559528 8889156 5942912 9505771
## Jul 2014 8098512 6923958 9273065 6302187 9894837
## Aug 2014 9434641 8250426 10618856 7623541 11245742
## Sep 2014 8531914 7338113 9725714 6706153 10357674
## Oct 2014 6375171 5171859 7578483 4534864 8215478
## Nov 2014 5968190 4755439 7180941 4113447 7822933
## Dec 2014 7794043 6571923 9016163 5924972 9663114
## Jan 2015 9600572 8369153 10831990 7717279 11483864
## Feb 2015 8559481 7318830 9800132 6662070 10456892
## Mar 2015 7200023 5950207 8449840 5288594 9111452
## Apr 2015 6396041 5137123 7654958 4470693 8321389
## May 2015 6060720 4792765 7328675 4121550 7999890
## Jun 2015 7724342 6447411 9001273 5771444 9677239
## Jul 2015 8098512 6812666 9384358 6131980 10065044
## Aug 2015 9434641 8139939 10729344 7454565 11414718
## Sep 2015 8531914 7228413 9835414 6538382 10525446
## Oct 2015 6375171 5062930 7687413 4368271 8382071
## Nov 2015 5968190 4647264 7289117 3948007 7988373
## Dec 2015 7794043 6464486 9123600 5760661 9827425
Produce the Output Dataset
# The forecast printed above covers 24 months; keep only the 2014 point forecasts for the output file
Predictions <- as.data.frame(window(stlf$mean, end = c(2014, 12)))
colnames(Predictions) <- c("KWH")
# Label each forecast with its month in YYYY-MM format
Predictions$Dates <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-01"), by = 'month')
Predictions$Dates <- format(Predictions$Dates, "%Y-%m")
# Write the 2014 residential power consumption forecast to CSV
write.csv(Predictions, file = "Residential_Power_Consumption_2014.csv")