DATA 624 Project 1 Part B

Introduction

I am to forecast how much power will be consumed by residential customers in kilowatt hours for the next 12 months. Data was provided for this project spaning January 1998 until December 2013.

Data Exploration

I will begin by reading in the data and ploting the series.

if (!file.exists("ResidentialCustomerForecastLoad-624.xlsx")) {
  download.file("https://github.com/mikeasilva/CUNY-SPS/raw/master/DATA624/ResidentialCustomerForecastLoad-624.xlsx", "ResidentialCustomerForecastLoad-624.xlsx")
}
raw_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh <- raw_data %>%
  select(KWH) %>%
  ts(start = decimal_date(date("1998-01-01")), frequency = 12)

autoplot(kwh, main = "Residential Power Usage (KWH)") +
  scale_y_continuous(labels = comma) +
  theme(axis.title = element_blank())

summary(kwh) %>% kable() %>% kable_styling()

	KWH
	Min. : 770523
	1st Qu.: 5429912
	Median : 6283324
	Mean : 6502475
	3rd Qu.: 7620524
	Max. :10655730
	NA’s :1

We have outliers and missing data. We will clean them up with tsclean from the forcasts package, and replot the data with the ACF and PACF:

kwh <- tsclean(kwh)

ggtsplot <- function(ts, title) {
  # A ggplot2 version of tsdisplay(df)
  # Args:
  #    ts (Time-Series): The time series we want to plot
  #    title (str): The title of the graph
  grid.arrange(
    autoplot(ts) +
      scale_y_continuous(labels = comma) +
      ggtitle(title) +
      theme(axis.title = element_blank()),
    grid.arrange(
      ggAcf(ts) + ggtitle(element_blank()),
      ggPacf(ts) + ggtitle(element_blank()), ncol = 2)
    , nrow = 2)
}


ggtsplot(kwh, "Cleaned Residential Power Usage (KWH)")

This looks much better. There is a seasonal component that looks to be additive. I will need to use a model that captures seasonality.

Candidate Models

Since the data is highly seasonal, I we will need to use an ARIMA or Holt-Winters model. Let’s fit an models using auto.arima() and hw() and visualize their projections. I will use a couple of different parameters with the Holt-Winters model.

h <- 12
lambda <- BoxCox.lambda(kwh)
autoplot(kwh) +
  autolayer(hw(kwh, h = h), series = "Holt-Winters") +
  autolayer(hw(kwh, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") +
  autolayer(hw(kwh, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") +
  autolayer(forecast(auto.arima(kwh), h = h), series = "ARIMA") +
  facet_wrap(. ~ series, ncol = 2) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "none", axis.title = element_blank())

Residual Analysis

All models do a good job picking up the seasonality. Let’s look at the residuals to see if any of the models should not be used.

Holt-Winters Model

hw_fit <- hw(kwh, h = h)
checkresiduals(hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 61.741, df = 8, p-value = 0.0000000002121

Model df: 16.   Total lags used: 24

There are some spikes in the ACF that indicate this is not the best model.

Adjusted Holt-Winters Model

adj_hw_fit <- hw(kwh, h = h, lambda = lambda)
checkresiduals(adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 48.973, df = 8, p-value = 0.00000006433

Model df: 16.   Total lags used: 24

This is a bit better than the previous model but there are still ACF spikes outside of the bands.

Damped & Adjusted Holt-Winters Model

damp_adj_hw_fit <- hw(kwh, h = h, lambda = lambda, damped = TRUE)
checkresiduals(damp_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Damped Holt-Winters' additive method
Q* = 47.391, df = 7, p-value = 0.00000004682

Model df: 17.   Total lags used: 24

The preformance of this model is about the same as above.

ARIMA Model

arima_fit <- auto.arima(kwh)
checkresiduals(arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
Q* = 11.549, df = 17, p-value = 0.8266

Model df: 7.   Total lags used: 24

This model seems to be the best. There is only one ACF spike outside of the threshold and it only by a small amount. My working theory is this is the model I should use, but I will check this using cross validation.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. My goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

get_rmse <- function(e) {
  sqrt(mean(e^2, na.rm = TRUE))
}

arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(3, 0, 1), seasonal = c(2, 1, 0), include.drift = TRUE), h = h)
}

hw_error <- tsCV(kwh, hw, h = h)
adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda)
damped_adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda, damped = TRUE)
arima_errors <- tsCV(kwh, arima_forecast, h = h)

data.frame(Model = c("ARIMA", "Holt-Winters", "Adjusted Holt-Winters", "Damped & Adjusted Holt-Winters"),
           RMSE = c(get_rmse(arima_errors[, h]), get_rmse(hw_error[, h]), get_rmse(adj_hw_error[, h]), get_rmse(damped_adj_hw_error[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE
ARIMA	659529.8
Holt-Winters	1010721.9
Adjusted Holt-Winters	1373617.3
Damped & Adjusted Holt-Winters	1547751.3

The ARIMA model had the best cross validation results.

Model Selection

Given that the ARIMA model minimized the RMSE in the cross validation, I will use it to forecast residential power consumption.

Summary

I set out to forecast residential power consumption. Four modeling techniques were tested for accuracy using cross-validation. In the end an ARIMA model was selected bercause it minimized the RMSE. Here’s my forcast for the next 12 months of power usage:

my_forecast <- forecast(arima_fit, h = h)
autoplot(my_forecast) +
  ggtitle("Forecasted Residential Power Consumption (KWH)") +
  theme(axis.title = element_blank())

my_forecast %>%
  kable() %>%
  kable_styling()

	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
Jan 2014	9433336	8688058	10178615	8293531	10573141
Feb 2014	8570219	7785677	9354761	7370366	9770072
Mar 2014	6615312	5830157	7400467	5414521	7816102
Apr 2014	6005538	5196071	6815004	4767565	7243510
May 2014	5917569	5108101	6727036	4679595	7155543
Jun 2014	8187387	7377114	8997660	6948182	9426592
Jul 2014	9528779	8717809	10339750	8288507	10769052
Aug 2014	9991953	9180963	10802943	8751650	11232255
Sep 2014	8546843	7735715	9357971	7306330	9787356
Oct 2014	5856327	5045196	6667458	4615809	7096845
Nov 2014	6153245	5342114	6964376	4912727	7393763
Dec 2014	7732110	6920971	8543250	6491580	8972641

DATA 624 Project 1 Part B - Power