DATA 624 Project One - Part B - Forecasting Power Consumption
Read in the Data
Read in the Residential Energy Consumption Data
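The chunk that reads the file is not echoed in this output; a minimal sketch, assuming the data arrives as an Excel file named ResidentialCustomerForecastLoad-624.xlsx with a KWH consumption column (the file name and use of readxl are assumptions), would be:

library(readxl)
# Read the residential power consumption data; KWH is the monthly consumption column (assumed)
ResidentialCustomerForecastLoad_624 <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(ResidentialCustomerForecastLoad_624)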
Impute Nulls with the Mean
This will impute the mean for any missing values in the time series.
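A minimal sketch of the mean imputation, assuming the consumption column is named KWH:

# Replace any missing KWH values with the mean of the observed values
ResidentialCustomerForecastLoad_624$KWH[is.na(ResidentialCustomerForecastLoad_624$KWH)] <-
  mean(ResidentialCustomerForecastLoad_624$KWH, na.rm = TRUE)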
Create the Time Series Object with Frequency = 12
This is a monthly time series running from 1998 to 2013.
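A sketch of how the time series object could be constructed, assuming the series starts in January 1998:

# Build a monthly time series (frequency = 12) starting in January 1998
ResidentialCustomerForecastLoad_624_ts <- ts(ResidentialCustomerForecastLoad_624$KWH,
                                             start = c(1998, 1), frequency = 12)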
Plot the Time Series
Just looking at the time series, there appears to be definite seasonality in the data and a slight upward trend.
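The plotting chunk is not echoed; a sketch using autoplot() from the forecast and ggplot2 packages used elsewhere in this report:

library(forecast)
library(ggplot2)
# Plot the full monthly series to inspect trend and seasonality
autoplot(ResidentialCustomerForecastLoad_624_ts) +
  ggtitle("Monthly Residential Power Consumption") +
  ylab("KWH")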
Create some Seasonal Plots to Look at Potential Seasonality
It looks like much more energy is used in the middle of the summer and the middle of the winter, which makes sense because these are the periods when households use more electricity to cool and heat their homes. There are a few outliers in the dataset: one occurred in 2008 during the financial crisis and another in 2010. I think both were periods of economic turmoil, and that this is the primary reason these observations are such outliers.
## Warning: Removed 16 rows containing missing values (geom_path).
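The seasonal plot chunks are not echoed; a sketch of how they could be produced with the forecast package:

# Seasonal plot: one line per year, months on the x-axis
ggseasonplot(ResidentialCustomerForecastLoad_624_ts, year.labels = TRUE)
# Subseries plot: one panel per month, showing each month's behavior across years
ggsubseriesplot(ResidentialCustomerForecastLoad_624_ts)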
Residential Customer Forecast - Auto ARIMA
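The fitting chunk is not echoed; a sketch of the call that could produce a summary like the one below (the object name auto_arima_fit is a placeholder):

# Let auto.arima() search for a seasonal ARIMA model on the monthly series
auto_arima_fit <- auto.arima(ResidentialCustomerForecastLoad_624_ts)
summary(auto_arima_fit)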
## Series: ResidentialCustomerForecastLoad_624_ts
## ARIMA(0,0,2)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 sar1 sar2 drift
## 0.1739 0.0505 -0.7591 -0.4124 8750.907
## s.e. 0.0766 0.0844 0.0697 0.0682 3214.839
##
## sigma^2 estimated as 7.841e+11: log likelihood=-2707.12
## AIC=5426.25 AICc=5426.73 BIC=5445.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -9730.731 845160.6 506279.9 -5.055329 11.57781 0.730904
## ACF1
## Training set 0.008533477
Look at the Autocorrelation Plots
As expected, lag 12 has the highest correlation, because we would expect energy consumption in, say, January of this year to be similar to that of January of the previous year.
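A sketch of the autocorrelation plots, using the ggAcf() and ggPacf() helpers from the forecast package:

# Autocorrelation and partial autocorrelation of the monthly series
ggAcf(ResidentialCustomerForecastLoad_624_ts)
ggPacf(ResidentialCustomerForecastLoad_624_ts)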
Run a Cross-Validation Loop to Check a Number of Different Models and See Which Performs Best Out of Sample
Write a loop to run time series cross-validation for each forecasting method. The STLF model performed best out of sample; this was with its default method, which I believe is "ETS", a form of exponential smoothing. The auto ARIMA model built in the step above was a close second, and the seasonal naive model was the third-best performer. The seasonal naive model simply takes the value from the same season of the previous year and uses it as the prediction for the current season.
method <- c("auto_arima_forecast_fun","hw","snaive","rwf","croston","holt","ses","forecast","stlf")
for (i in method){
  # Compute cross-validated forecast errors for up to 31 steps ahead (h = 31)
  e <- tsCV(ResidentialCustomerForecastLoad_624_ts, forecastfunction = get(i), h = 31)
  # Per-horizon MSE values with missing values removed (not used further below)
  mse <- colMeans(e^2, na.rm = TRUE)
  # Report the overall cross-validated RMSE for this method
  print(paste("The residual mean squared error of the", i, "model is:", sqrt(mean(e^2, na.rm = TRUE))))
}
## [1] "The residual mean squared error of the auto_arima_forecast_fun model is: 1014511.75822615"
## [1] "The residual mean squared error of the hw model is: 1816109.28874367"
## [1] "The residual mean squared error of the snaive model is: 1202817.31119995"
## [1] "The residual mean squared error of the rwf model is: 1953698.56308183"
## [1] "The residual mean squared error of the croston model is: 1265455.77408895"
## [1] "The residual mean squared error of the holt model is: 4880726.22842738"
## [1] "The residual mean squared error of the ses model is: 1719938.66327158"
## [1] "The residual mean squared error of the forecast model is: 1717452.3069133"
## [1] "The residual mean squared error of the stlf model is: 963062.096475599"
Build out the STL Model
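The model-building chunk is not echoed; a sketch, assuming the forecast object is stored as stlf (the name used in the output step below) with a 24-month horizon to match the forecasts printed here:

# STL decomposition with an ETS forecast of the seasonally adjusted series
# Note: assigning to the name stlf shadows forecast::stlf() from this point on
stlf <- stlf(ResidentialCustomerForecastLoad_624_ts, h = 24)
summary(stlf)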
##
## Forecast method: STL + ETS(M,N,N)
##
## Model Information:
## ETS(M,N,N)
##
## Call:
## ets(y = x, model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.1344
##
## Initial states:
## l = 6290586.7981
##
## sigma: 0.1138
##
## AIC AICc BIC
## 6199.441 6199.568 6209.213
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 52502.77 756322.5 470050.6 -3.93305 10.9949 0.6834311 0.1415413
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9600572 8485694 10715449 7895514 11305629
## Feb 2014 8559481 7434444 9684518 6838885 10280077
## Mar 2014 7200023 6064915 8335132 5464025 8936022
## Apr 2014 6396041 5250947 7541134 4644771 8147310
## May 2014 6060720 4905725 7215715 4294308 7827132
## Jun 2014 7724342 6559528 8889156 5942912 9505771
## Jul 2014 8098512 6923958 9273065 6302187 9894837
## Aug 2014 9434641 8250426 10618856 7623541 11245742
## Sep 2014 8531914 7338113 9725714 6706153 10357674
## Oct 2014 6375171 5171859 7578483 4534864 8215478
## Nov 2014 5968190 4755439 7180941 4113447 7822933
## Dec 2014 7794043 6571923 9016163 5924972 9663114
## Jan 2015 9600572 8369153 10831990 7717279 11483864
## Feb 2015 8559481 7318830 9800132 6662070 10456892
## Mar 2015 7200023 5950207 8449840 5288594 9111452
## Apr 2015 6396041 5137123 7654958 4470693 8321389
## May 2015 6060720 4792765 7328675 4121550 7999890
## Jun 2015 7724342 6447411 9001273 5771444 9677239
## Jul 2015 8098512 6812666 9384358 6131980 10065044
## Aug 2015 9434641 8139939 10729344 7454565 11414718
## Sep 2015 8531914 7228413 9835414 6538382 10525446
## Oct 2015 6375171 5062930 7687413 4368271 8382071
## Nov 2015 5968190 4647264 7289117 3948007 7988373
## Dec 2015 7794043 6464486 9123600 5760661 9827425
Produce the Output Dataset
# The forecast printed above covers 24 months; keep only the 2014 point forecasts for the output file
Predictions <- as.data.frame(window(stlf$mean, end = c(2014, 12)))
colnames(Predictions) <- c("KWH")
# Label each forecast with its month in YYYY-MM format
Predictions$Dates <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-01"), by = 'month')
Predictions$Dates <- format(Predictions$Dates, "%Y-%m")
# Write the 2014 residential power consumption forecast to CSV
write.csv(Predictions, file = "Residential_Power_Consumption_2014.csv")