Part B consists of a simple dataset of residential power usage for January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data are given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add these to your existing files above, clearly labeled.
getwd()
## [1] "T:/00-624 HH Predictive Analytics"
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v forecast 8.13 v expsmooth 2.3
## v fma 2.4
##
library(magrittr)
library(dplyr)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v purrr 0.3.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::set_names() masks magrittr::set_names()
## x purrr::transpose() masks data.table::transpose()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
theme_set(theme_light())
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(forecast)
getwd()
## [1] "T:/00-624 HH Predictive Analytics"
df.power = fread("ResidentialCustomerForecastLoad-624.txt", header=TRUE, sep = '\t', stringsAsFactors = FALSE)
head(df.power)
## CaseSequence YYYY-MMM KWH
## 1: 733 1998-Jan 6862583
## 2: 734 1998-Feb 5838198
## 3: 735 1998-Mar 5420658
## 4: 736 1998-Apr 5010364
## 5: 737 1998-May 4665377
## 6: 738 1998-Jun 6467147
summary(df.power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
print(paste("Number of missing values:", sum(is.na(df.power))))
## [1] "Number of missing values: 1"
First, I load the data for residential power usage. There are three columns in this dataset: case sequence, date, and power usage measured in KWH.
There are 192 observations in this dataset. Power usage has a median of about 6.28 million KWH, ranging from a minimum of roughly 770 thousand to a maximum of 10.6 million KWH. The data are nearly complete, with only one missing KWH value.
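Before imputing, it can be useful to see which month the missing value falls in. A quick data.table subset, not part of the original output, would show the row:
# show the row(s) with a missing KWH value; df.power is a data.table from fread()
df.power[is.na(KWH)]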
df.power$KWH[is.na(df.power$KWH)] = median(df.power$KWH, na.rm = TRUE)
summary(df.power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6501333
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
power.ts = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 12)
autoplot(power.ts) +
xlab('Every Year') + ylab('KWH') +
labs(title = "Residential power usage, in KWH",
subtitle = "January 1998 - December 2013")
For this missing value, I impute the median level of power usage (about 6.28 million KWH). Now the data are complete and can be used for exploration and manipulation.
power.ts.halfyear = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 6)
autoplot(power.ts.halfyear) +
xlab('Every 6-Month') + ylab('KWH') +
labs(title = "Residential power usage, in KWH, every 6 months",
subtitle = "January 1998 - December 2013")
From the first figure there appears to be a seasonal effect roughly every six months, which I suspect reflects the difference in power usage between winter and summer. So I re-plot the power usage using a six-month frequency instead. This new figure confirms my suspicion.
power.ts.quarter = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 3)
autoplot(power.ts.quarter) +
xlab('Every Quarter') + ylab('KWH') +
labs(title = "Residential power usage, in KWH, every Quarter",
subtitle = "January 1998 - December 2013")
To take the visualization one step further, I plotted the power usage at a quarterly frequency. There does seem to be a seasonal effect in power usage by quarter, but overall usage is steady from year to year.
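Another way to inspect the monthly pattern, not included in the original output, is a subseries plot, which shows each month's values and mean across the 16 years:
# subseries plot: one panel per calendar month, horizontal line = monthly mean
ggsubseriesplot(power.ts) + ylab('KWH') + ggtitle('Residential power usage by month')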
ggtsdisplay(power.ts, points=FALSE, plot.type="histogram")
The ggtsdisplay function with the histogram option shows, in the ACF plot, high autocorrelation among neighboring measurements. This autocorrelation is pronounced at every lag shown, short or long (up to 36 lags), in both directions (positive and negative), and most of the spikes fall outside the 95% confidence bounds. The histogram of the power usage values indicates that the series is not normally distributed but right-skewed, and there is one outlier far to the left, which I suspect is the July 2011 value. The Ljung-Box test reported later for the fitted model (Q* of about 30.7, p-value << 0.05) likewise indicates that the residuals are correlated. So we need to do some transformation to make the time series more useful for modeling; we will perform a Box-Cox transformation later on.
ggseasonplot(power.ts)
Here is the seasonal plot with months on the x-axis and each year indicated by the color of the line. This figure tells us that the downward spike we saw earlier occurred in July of 2011; it is not clear what happened during that month. For the rest of the year, each year shows an uptick in power usage during the summer months, from June to September. This finding is intuitive and makes sense.
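As a quick check, one could locate the month with the lowest usage directly; this short snippet, not part of the original output, returns its position and date:
# index and date of the minimum KWH value, to confirm the suspected outlier month
which.min(power.ts)
time(power.ts)[which.min(power.ts)]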
boxplot(power.ts~cycle(power.ts), xlab='Month', ylab='KWH')
We will apply a Box-Cox transformation later to improve the model, as the Ljung-Box result suggests. The boxplot confirms the findings from the figures above. It also shows that the spread of usage within each month, all years combined, is small, indicating that power usage is very steady year by year. What dispersion there is occurs during December and January, which can be explained by the holiday season. Overall there are no surprising findings across these years.
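Before fitting any models, we can preview the planned Box-Cox transformation. This is a minimal sketch, not part of the original output; the variable name lambda_power is illustrative:
# estimate lambda from the data (about 0.10, consistent with the ETS fit below)
lambda_power = BoxCox.lambda(power.ts)
autoplot(BoxCox(power.ts, lambda_power)) +
  ylab('Box-Cox transformed KWH') +
  ggtitle(paste('Box-Cox transformed power usage, lambda =', round(lambda_power, 2)))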
res_power = stl(power.ts, s.window = 12)
plot(res_power)
print(paste('Order of seasonal difference required = ', nsdiffs(power.ts)))
## [1] "Order of seasonal difference required = 1"
print(paste('Order of difference required = ', ndiffs(power.ts)))
## [1] "Order of difference required = 1"
calcRMSE_power = function(modelName, modelFunc)
{
# time series cross-validation errors for the supplied forecast function
res_power = tsCV(power.ts, modelFunc)
print(paste('RMSE for model', modelName, '=', sqrt(mean(res_power^2, na.rm=TRUE))))
}
calcRMSE_power('Seasonal Naive', snaive)
## [1] "RMSE for model Seasonal Naive = 1210609.88804598"
calcRMSE_power('Random Walk', rwf)
## [1] "RMSE for model Random Walk = 1539633.85951456"
We use the STL decomposition to confirm our findings. Recall that the data frame was converted into a time series with the ts function, spanning January 1998 to December 2013 at a yearly frequency (12 months), which gives a better visualization than the raw monthly index. The raw plot of power usage by year indicates that, except for the downward spike around 2011, US household power usage is very steady, fluctuating within a set of boundaries over the 16-year period.
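A possible follow-up, not included in the original output, is to seasonally adjust the series using the STL fit above and overlay it on the original data:
# remove the STL seasonal component and compare with the raw series
autoplot(power.ts, series = 'Original') +
  autolayer(seasadj(res_power), series = 'Seasonally adjusted') +
  ylab('KWH')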
The nsdiffs function recommends one order of seasonal differencing, and the ndiffs function recommends one order of first differencing.
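As a quick sanity check, not part of the original output, one could inspect the series after a single seasonal difference to see whether it looks closer to stationary:
# series after one seasonal (lag-12) difference, with its ACF and PACF
ggtsdisplay(diff(power.ts, lag = 12), points = FALSE)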
Using the seasonal naive model, we calculate the RMSE from the time series cross-validation errors, i.e. the square root of the mean of the squared errors. The RMSE for the seasonal naive model is about 1,210,610.
Next we calculate the cross-validated RMSE for the random walk forecast of this series, which is about 1,539,634. The seasonal naive model has a smaller RMSE than the random walk model, indicating that the random walk is not a good model for this time series.
func_powerarima <- function(x, h)
{ forecast(auto.arima(x, seasonal = TRUE), h = h) }
calcRMSE_power('ARIMA', func_powerarima)
## [1] "RMSE for model ARIMA = 1115479.10938553"
The cross-validated RMSE for the ARIMA model, about 1,115,479, is lower than the two benchmarks but still not very satisfying, so we will not use the auto.arima model for this dataset's forecast.
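Out of interest, one could also fit auto.arima on the full series to see which seasonal ARIMA specification it selects; this sketch is not part of the original output and the name fit_arima is illustrative:
# fit a seasonal ARIMA (seasonal = TRUE is the default when frequency = 12)
fit_arima = auto.arima(power.ts)
summary(fit_arima)
checkresiduals(fit_arima)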
func_power <- function(x, h)
{forecast(ets(x, model="ZZZ"), h=h)}
calcRMSE_power('ETS', func_power)
## [1] "RMSE for model ETS = 1037684.72933356"
fit_power = ets(power.ts, model = 'ZZZ')
summary(fit_power)
## ETS(M,N,M)
##
## Call:
## ets(y = power.ts, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.1167
## gamma = 1e-04
##
## Initial states:
## l = 6133758.2586
## s = 0.9568 0.7524 0.8702 1.1849 1.2617 1.1857
## 1.0001 0.7743 0.8154 0.9116 1.0651 1.222
##
## sigma: 0.1201
##
## AIC AICc BIC
## 6224.787 6227.514 6273.649
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 61969.44 833321.3 505622.3 -4.356712 12.01361 0.722433 0.1757877
checkresiduals(fit_power)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,M)
## Q* = 30.705, df = 10, p-value = 0.0006562
##
## Model df: 14. Total lags used: 24
The ETS model has a cross-validated RMSE of about 1,037,685, smaller than the seasonal naive, random walk, and ARIMA models, suggesting it is a better fit than the others for the power usage data.
Choosing the ETS family, we let the ets function select the specific form for us (model = 'ZZZ'). It suggests ETS(M,N,M): multiplicative errors, no trend, and multiplicative seasonality.
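As an optional visual check, not part of the original output, the fitted ETS states can be plotted to see the level and seasonal components behind the (M,N,M) structure:
# plot the observed series together with the fitted level and seasonal states
autoplot(fit_power)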
Next we analyze the residuals of the fitted model. The residual plots are not satisfactory: the Ljung-Box statistic is Q* = 30.7 with a p-value << 0.05. The ACF plot is somewhat better than with the naive method, since most of the autocorrelations now fall within the 95% confidence bounds, but the model is still not satisfactory. Next we apply a Box-Cox transformation to improve the fit.
fit_power = ets(power.ts, model = 'ZZZ', lambda = BoxCox.lambda(power.ts))
summary(fit_power)
## ETS(A,N,A)
##
## Call:
## ets(y = power.ts, model = "ZZZ", lambda = BoxCox.lambda(power.ts))
##
## Box-Cox transformation: lambda= 0.1042
##
## Smoothing parameters:
## alpha = 0.0538
## gamma = 1e-04
##
## Initial states:
## l = 39.2123
## s = -0.0972 -1.2756 -0.575 0.9225 1.3681 0.4026
## 0.1313 -1.2026 -0.9124 -0.3817 0.389 1.231
##
## sigma: 0.9553
##
## AIC AICc BIC
## 1007.327 1010.054 1056.189
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 181549.3 876724.3 565471.4 -2.128111 12.02496 0.8079454 0.2409174
checkresiduals(fit_power)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 13.696, df = 10, p-value = 0.1873
##
## Model df: 14. Total lags used: 24
fcst_power = forecast(fit_power, h=12, lambda = BoxCox.lambda(power.ts))
## Warning in forecast.ets(fit_power, h = 12, lambda = BoxCox.lambda(power.ts)):
## biasadj information not found, defaulting to FALSE.
print(fcst_power)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9072736 7183518 11395443 6333493 12829171
## Feb 2014 7731631 6095283 9751051 5361280 11000973
## Mar 2014 6662622 5230951 8435807 4590628 9536172
## Apr 2014 6005531 4700833 7625719 4118547 8633056
## May 2014 5671328 4431064 7213978 3878258 8174218
## Jun 2014 7358020 5785457 9303198 5081382 10509173
## Jul 2014 7751644 6100889 9791808 5361297 11055897
## Aug 2014 9309851 7358433 11712684 6481503 13197502
## Sep 2014 8558822 6748361 10792837 5936177 12175460
## Oct 2014 6416036 5020103 8150139 4397280 9228578
## Nov 2014 5589925 4356484 7127459 3807699 8086030
## Dec 2014 7040548 5518105 8929009 4838018 10102189
autoplot(fcst_power, pi=TRUE) + ggtitle("ETS Model Forecast - Power consumption in KWH")
It looks like the Box-Cox transformation succeeds in reducing the residual autocorrelation. The estimated lambda is about 0.10; although the training-set RMSE is slightly higher than for the untransformed fit, the residual diagnostics improve considerably. Interestingly, the Box-Cox transformation changes the recommended ETS model to ETS(A,N,A): additive errors, no trend, additive seasonality. Next we examine the residuals of this model fit.
The Ljung-Box test has a non-significant p-value (0.19), indicating that this model adequately captures the seasonal structure of the data. So we choose it as our final model to forecast power usage for 2014.
The corresponding forecast for each month of 2014 is printed above, with both 80% and 95% prediction intervals. We then plot the forecast alongside the historical power usage.
In summary, the plot shows US residential power usage from 1998 to 2013 together with the 2014 forecast and its 80% and 95% prediction intervals. The final model is ETS(A,N,A) with a Box-Cox transformation, and we are satisfied with this model selection.
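Per the Part B instructions, the 2014 forecast should be added to the submission files, clearly labeled. A minimal sketch for exporting it, with an assumed file name, could look like this:
# collect the monthly point forecasts into a data frame and write them out
power_2014 = data.frame(
  Year = as.integer(floor(time(fcst_power$mean))),
  Month = month.abb[cycle(fcst_power$mean)],
  KWH_Forecast = round(as.numeric(fcst_power$mean))
)
write.csv(power_2014, 'ResidentialPowerForecast2014.csv', row.names = FALSE)   # file name is an assumption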