Introduction

I am to forecast how much power will be consumed by residential customers in kilowatt hours for the next 12 months. Data was provided for this project spaning January 1998 until December 2013.

Data Exploration

I will begin by reading in the data and ploting the series.

KWH
Min. : 770523
1st Qu.: 5429912
Median : 6283324
Mean : 6502475
3rd Qu.: 7620524
Max. :10655730
NA’s :1

We have outliers and missing data. We will clean them up with tsclean from the forcasts package, and replot the data with the ACF and PACF:

This looks much better. There is a seasonal component that looks to be additive. I will need to use a model that captures seasonality.

Residual Analysis

All models do a good job picking up the seasonality. Let’s look at the residuals to see if any of the models should not be used.

Holt-Winters Model


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 61.741, df = 8, p-value = 0.0000000002121

Model df: 16.   Total lags used: 24

There are some spikes in the ACF that indicate this is not the best model.

Adjusted Holt-Winters Model


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 48.973, df = 8, p-value = 0.00000006433

Model df: 16.   Total lags used: 24

This is a bit better than the previous model but there are still ACF spikes outside of the bands.

Damped & Adjusted Holt-Winters Model


    Ljung-Box test

data:  Residuals from Damped Holt-Winters' additive method
Q* = 47.391, df = 7, p-value = 0.00000004682

Model df: 17.   Total lags used: 24

The preformance of this model is about the same as above.

ARIMA Model


    Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
Q* = 11.549, df = 17, p-value = 0.8266

Model df: 7.   Total lags used: 24

This model seems to be the best. There is only one ACF spike outside of the threshold and it only by a small amount. My working theory is this is the model I should use, but I will check this using cross validation.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. My goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

Model RMSE
ARIMA 659529.8
Holt-Winters 1010721.9
Adjusted Holt-Winters 1373617.3
Damped & Adjusted Holt-Winters 1547751.3

The ARIMA model had the best cross validation results.

Model Selection

Given that the ARIMA model minimized the RMSE in the cross validation, I will use it to forecast residential power consumption.

Summary

I set out to forecast residential power consumption. Four modeling techniques were tested for accuracy using cross-validation. In the end an ARIMA model was selected bercause it minimized the RMSE. Here’s my forcast for the next 12 months of power usage:

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2014 9433336 8688058 10178615 8293531 10573141
Feb 2014 8570219 7785677 9354761 7370366 9770072
Mar 2014 6615312 5830157 7400467 5414521 7816102
Apr 2014 6005538 5196071 6815004 4767565 7243510
May 2014 5917569 5108101 6727036 4679595 7155543
Jun 2014 8187387 7377114 8997660 6948182 9426592
Jul 2014 9528779 8717809 10339750 8288507 10769052
Aug 2014 9991953 9180963 10802943 8751650 11232255
Sep 2014 8546843 7735715 9357971 7306330 9787356
Oct 2014 5856327 5045196 6667458 4615809 7096845
Nov 2014 6153245 5342114 6964376 4912727 7393763
Dec 2014 7732110 6920971 8543250 6491580 8972641