Part B consists of a simple dataset of residential power usage for January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data are given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add these to your existing files above, clearly labeled.
getwd()
## [1] "T:/00-624 HH Predictive Analytics"
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v forecast 8.13 v expsmooth 2.3
## v fma 2.4
##
library(magrittr)
library(dplyr)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v purrr 0.3.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::set_names() masks magrittr::set_names()
## x purrr::transpose() masks data.table::transpose()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
theme_set(theme_light())
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(forecast)
getwd()
## [1] "T:/00-624 HH Predictive Analytics"
df.power = fread("ResidentialCustomerForecastLoad-624.txt", header=TRUE, sep = '\t', stringsAsFactors = FALSE)
head(df.power)
## CaseSequence YYYY-MMM KWH
## 1: 733 1998-Jan 6862583
## 2: 734 1998-Feb 5838198
## 3: 735 1998-Mar 5420658
## 4: 736 1998-Apr 5010364
## 5: 737 1998-May 4665377
## 6: 738 1998-Jun 6467147
summary(df.power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
print(paste("Number of missing values:", sum(is.na(df.power))))
## [1] "Number of missing values: 1"
First, I load the data for residential power usage. There are three columns in this dataset: case sequence, date, and power usage measured in KWH.
There are 192 observations in this dataset. Power usage has a median of about 6.28 million KWH, ranging from a minimum of roughly 770 thousand to a maximum of 10.6 million KWH. The data are nearly complete, with only one missing KWH value.
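Before imputing, it can be useful to see which month the missing value falls in. A quick data.table subset, not part of the original output, would show the row:
# show the row(s) with a missing KWH value; df.power is a data.table from fread()
df.power[is.na(KWH)]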
df.power$KWH[is.na(df.power$KWH)] = median(df.power$KWH, na.rm = TRUE)
summary(df.power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6501333
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
power.ts = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 12)
autoplot(power.ts) +
xlab('Every Year') + ylab('KWH') +
labs(title = "Residential power usage, in KWH",
subtitle = "January 1998 - December 2013")
For this missing value, I impute the median level of power usage (about 6.28 million KWH). Now the data are complete and can be used for exploration and manipulation.
power.ts.halfyear = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 6)
autoplot(power.ts.halfyear) +
xlab('Every 6-Month') + ylab('KWH') +
labs(title = "Residential power usage, in KWH, every 6 months",
subtitle = "January 1998 - December 2013")
From the first figure there appears to be a seasonal effect roughly every six months, which I suspect reflects the difference in power usage between winter and summer. So I re-plot the power usage using a six-month frequency instead. This new figure confirms my suspicion.
power.ts.quarter = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 3)
autoplot(power.ts.quarter) +
xlab('Every Quarter') + ylab('KWH') +
labs(title = "Residential power usage, in KWH, every Quarter",
subtitle = "January 1998 - December 2013")
To take the visualization one step further, I plotted the power usage at a quarterly frequency. There does seem to be a seasonal effect in power usage by quarter, but overall usage is steady from year to year.
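Another way to inspect the monthly pattern, not included in the original output, is a subseries plot, which shows each month's values and mean across the 16 years:
# subseries plot: one panel per calendar month, horizontal line = monthly mean
ggsubseriesplot(power.ts) + ylab('KWH') + ggtitle('Residential power usage by month')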
ggtsdisplay(power.ts, points=FALSE, plot.type="histogram")
The ggtsdisplay function with the histogram option shows, in the ACF plot, high autocorrelation among neighboring measurements. This autocorrelation is pronounced at every lag shown, short or long (up to 36 lags), in both directions (positive and negative), and most of the spikes fall outside the 95% confidence bounds. The histogram of the power usage values indicates that the series is not normally distributed but right-skewed, and there is one outlier far to the left, which I suspect is the July 2011 value. The Ljung-Box test reported later for the fitted model (Q* of about 30.7, p-value << 0.05) likewise indicates that the residuals are correlated. So we need to do some transformation to make the time series more useful for modeling; we will perform a Box-Cox transformation later on.
ggseasonplot(power.ts)
Here is the seasonal plot with months on the x-axis and each year indicated by the color of the line. This figure tells us that the downward spike we saw earlier occurred in July of 2011; it is not clear what happened during that month. For the rest of the year, each year shows an uptick in power usage during the summer months, from June to September. This finding is intuitive and makes sense.
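As a quick check, one could locate the month with the lowest usage directly; this short snippet, not part of the original output, returns its position and date:
# index and date of the minimum KWH value, to confirm the suspected outlier month
which.min(power.ts)
time(power.ts)[which.min(power.ts)]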
boxplot(power.ts~cycle(power.ts), xlab='Month', ylab='KWH')
We will apply a Box-Cox transformation later to improve the model, as the Ljung-Box result suggests. The boxplot confirms the findings from the figures above. It also shows that the spread of usage within each month, all years combined, is small, indicating that power usage is very steady year by year. What dispersion there is occurs during December and January, which can be explained by the holiday season. Overall there are no surprising findings across these years.
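Before fitting any models, we can preview the planned Box-Cox transformation. This is a minimal sketch, not part of the original output; the variable name lambda_power is illustrative:
# estimate lambda from the data (about 0.10, consistent with the ETS fit below)
lambda_power = BoxCox.lambda(power.ts)
autoplot(BoxCox(power.ts, lambda_power)) +
  ylab('Box-Cox transformed KWH') +
  ggtitle(paste('Box-Cox transformed power usage, lambda =', round(lambda_power, 2)))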
res_power = stl(power.ts, s.window = 12)
plot(res_power)
print(paste('Order of seasonal difference required = ', nsdiffs(power.ts)))
## [1] "Order of seasonal difference required = 1"
print(paste('Order of difference required = ', ndiffs(power.ts)))
## [1] "Order of difference required = 1"
calcRMSE_power = function(modelName, modelFunc)
{
# time series cross-validation errors for the supplied forecast function
res_power = tsCV(power.ts, modelFunc)
print(paste('RMSE for model', modelName, '=', sqrt(mean(res_power^2, na.rm=TRUE))))
}
calcRMSE_power('Seasonal Naive', snaive)
## [1] "RMSE for model Seasonal Naive = 1210609.88804598"
calcRMSE_power('Random Walk', rwf)
## [1] "RMSE for model Random Walk = 1539633.85951456"
We use the STL decomposition to confirm our findings. Recall that the data frame was converted into a time series with the ts function, spanning January 1998 to December 2013 at a yearly frequency (12 months), which gives a better visualization than the raw monthly index. The raw plot of power usage by year indicates that, except for the downward spike around 2011, US household power usage is very steady, fluctuating within a set of boundaries over the 16-year period.
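A possible follow-up, not included in the original output, is to seasonally adjust the series using the STL fit above and overlay it on the original data:
# remove the STL seasonal component and compare with the raw series
autoplot(power.ts, series = 'Original') +
  autolayer(seasadj(res_power), series = 'Seasonally adjusted') +
  ylab('KWH')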
The nsdiffs function recommends one order of seasonal differencing, and the ndiffs function recommends one order of first differencing.
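As a quick sanity check, not part of the original output, one could inspect the series after a single seasonal difference to see whether it looks closer to stationary:
# series after one seasonal (lag-12) difference, with its ACF and PACF
ggtsdisplay(diff(power.ts, lag = 12), points = FALSE)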
Using the seasonal naive model, we calculate the RMSE from the time series cross-validation errors, i.e. the square root of the mean of the squared errors. The RMSE for the seasonal naive model is about 1,210,610.
Next we calculate the cross-validated RMSE for the random walk forecast of this series, which is about 1,539,634. The seasonal naive model has a smaller RMSE than the random walk model, indicating that the random walk is not a good model for this time series.
func_powerarima <- function(x, h)
{ forecast(auto.arima(x, seasonal = TRUE), h = h) }
calcRMSE_power('ARIMA', func_powerarima)
## [1] "RMSE for model ARIMA = 1115479.10938553"
The cross-validated RMSE for the ARIMA model, about 1,115,479, is lower than the two benchmarks but still not very satisfying, so we will not use the auto.arima model for this dataset's forecast.
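Out of interest, one could also fit auto.arima on the full series to see which seasonal ARIMA specification it selects; this sketch is not part of the original output and the name fit_arima is illustrative:
# fit a seasonal ARIMA (seasonal = TRUE is the default when frequency = 12)
fit_arima = auto.arima(power.ts)
summary(fit_arima)
checkresiduals(fit_arima)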
func_power <- function(x, h)
{forecast(ets(x, model="ZZZ"), h=h)}
calcRMSE_power('ETS', func_power)
## [1] "RMSE for model ETS = 1037684.72933356"
fit_power = ets(power.ts, model = 'ZZZ')
summary(fit_power)
## ETS(M,N,M)
##
## Call:
## ets(y = power.ts, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.1167
## gamma = 1e-04
##
## Initial states:
## l = 6133758.2586
## s = 0.9568 0.7524 0.8702 1.1849 1.2617 1.1857
## 1.0001 0.7743 0.8154 0.9116 1.0651 1.222
##
## sigma: 0.1201
##
## AIC AICc BIC
## 6224.787 6227.514 6273.649
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 61969.44 833321.3 505622.3 -4.356712 12.01361 0.722433 0.1757877
checkresiduals(fit_power)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,M)
## Q* = 30.705, df = 10, p-value = 0.0006562
##
## Model df: 14. Total lags used: 24
The ETS model has a cross-validated RMSE of about 1,037,685, smaller than the seasonal naive, random walk, and ARIMA models, suggesting it is a better fit than the others for the power usage data.
Choosing the ETS family, we let the ets function select the specific form for us (model = 'ZZZ'). It suggests ETS(M,N,M): multiplicative errors, no trend, and multiplicative seasonality.
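As an optional visual check, not part of the original output, the fitted ETS states can be plotted to see the level and seasonal components behind the (M,N,M) structure:
# plot the observed series together with the fitted level and seasonal states
autoplot(fit_power)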
Next we analyze the residuals of the fitted model. The residual plots are not satisfactory: the Ljung-Box statistic is Q* = 30.7 with a p-value << 0.05. The ACF plot is somewhat better than with the naive method, since most of the autocorrelations now fall within the 95% confidence bounds, but the model is still not satisfactory. Next we apply a Box-Cox transformation to improve the fit.
fit_power = ets(power.ts, model = 'ZZZ', lambda = BoxCox.lambda(power.ts))
summary(fit_power)
## ETS(A,N,A)
##
## Call:
## ets(y = power.ts, model = "ZZZ", lambda = BoxCox.lambda(power.ts))
##
## Box-Cox transformation: lambda= 0.1042
##
## Smoothing parameters:
## alpha = 0.0538
## gamma = 1e-04
##
## Initial states:
## l = 39.2123
## s = -0.0972 -1.2756 -0.575 0.9225 1.3681 0.4026
## 0.1313 -1.2026 -0.9124 -0.3817 0.389 1.231
##
## sigma: 0.9553
##
## AIC AICc BIC
## 1007.327 1010.054 1056.189
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 181549.3 876724.3 565471.4 -2.128111 12.02496 0.8079454 0.2409174
checkresiduals(fit_power)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 13.696, df = 10, p-value = 0.1873
##
## Model df: 14. Total lags used: 24
fcst_power = forecast(fit_power, h=12, lambda = BoxCox.lambda(power.ts))
## Warning in forecast.ets(fit_power, h = 12, lambda = BoxCox.lambda(power.ts)):
## biasadj information not found, defaulting to FALSE.
print(fcst_power)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9072736 7183518 11395443 6333493 12829171
## Feb 2014 7731631 6095283 9751051 5361280 11000973
## Mar 2014 6662622 5230951 8435807 4590628 9536172
## Apr 2014 6005531 4700833 7625719 4118547 8633056
## May 2014 5671328 4431064 7213978 3878258 8174218
## Jun 2014 7358020 5785457 9303198 5081382 10509173
## Jul 2014 7751644 6100889 9791808 5361297 11055897
## Aug 2014 9309851 7358433 11712684 6481503 13197502
## Sep 2014 8558822 6748361 10792837 5936177 12175460
## Oct 2014 6416036 5020103 8150139 4397280 9228578
## Nov 2014 5589925 4356484 7127459 3807699 8086030
## Dec 2014 7040548 5518105 8929009 4838018 10102189
autoplot(fcst_power, pi=TRUE) + ggtitle("ETS Model Forecast - Power consumption in KWH")
It looks like the Box-Cox transformation succeeds in reducing the residual autocorrelation. The estimated lambda is about 0.10; although the training-set RMSE is slightly higher than for the untransformed fit, the residual diagnostics improve considerably. Interestingly, the Box-Cox transformation changes the recommended ETS model to ETS(A,N,A): additive errors, no trend, additive seasonality. Next we examine the residuals of this model fit.
The Ljung-Box test has a non-significant p-value (0.19), indicating that this model adequately captures the seasonal structure of the data. So we choose it as our final model to forecast power usage for 2014.
The corresponding forecast for each month of 2014 is printed above, with both 80% and 95% prediction intervals. We then plot the forecast alongside the historical power usage.
In summary, the plot shows US residential power usage from 1998 to 2013 together with the 2014 forecast and its 80% and 95% prediction intervals. The final model is ETS(A,N,A) with a Box-Cox transformation, and we are satisfied with this model selection.
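Per the Part B instructions, the 2014 forecast should be added to the submission files, clearly labeled. A minimal sketch for exporting it, with an assumed file name, could look like this:
# collect the monthly point forecasts into a data frame and write them out
power_2014 = data.frame(
  Year = as.integer(floor(time(fcst_power$mean))),
  Month = month.abb[cycle(fcst_power$mean)],
  KWH_Forecast = round(as.numeric(fcst_power$mean))
)
write.csv(power_2014, 'ResidentialPowerForecast2014.csv', row.names = FALSE)   # file name is an assumption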