—————————————————————————

Student Name : Sachid Deshmukh

—————————————————————————

Part-A

Part-B

Part B consists of a simple data set of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Load data

df.power  = fread("ResidentialCustomerForecastLoad-624.txt", header=TRUE, sep = '\t', stringsAsFactors = FALSE)

View dataframe preview

head(df.power)
##    CaseSequence YYYY-MMM     KWH
## 1:          733 1998-Jan 6862583
## 2:          734 1998-Feb 5838198
## 3:          735 1998-Mar 5420658
## 4:          736 1998-Apr 5010364
## 5:          737 1998-May 4665377
## 6:          738 1998-Jun 6467147
summary(df.power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

Note that KWH column has one NA value. Lets fill that value with median value of KWH

df.power$KWH[is.na(df.power$KWH)] = median(df.power$KWH, na.rm = TRUE)
summary(df.power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6501333  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

Let’s remove Year and Month string column and convert the time series into monthly frequency

power.ts = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 12)

Let’s plot the time series

autoplot(power.ts) +
  xlab('Month') +
  ylab('KWH') +
  ggtitle('Monthly power consumption in KWH')

From the above time series we can see an outlier in year 2010. The Time series exhibits seasonal pattern indicating high power consumption in winter and summar time and low power consumption in rest of the period. There is no upward ot downward trend exhibited by the time series indicating the demand for power is relatively stable from 2000 to 2013. We can see the surge in peaks has increased after 2010 indicating high usage of power during winter and summer season post 2010

Let’s use STL decomposition to confirm our findings

res = stl(power.ts, s.window = 12)
plot(res)

We can see that STL plot above shows strong seasonal pattern. There is no underlying trend however there is a surge in power consumption post 2010. The unusually low power consumption in the year 2010 is appearent in the error plot above

Above is also a non-stationary time series. Let’s see what order differencing is required to make it stationary

print(paste('Order of seasonal difference required = ', nsdiffs(power.ts)))
## [1] "Order of seasonal difference required =  1"
print(paste('Order of difference required = ', ndiffs(power.ts)))
## [1] "Order of difference required =  1"

We can see that we need first order seasonal difference followed by first order regular difference to make this time series stationary

Let’s try following forecating models on this time series and assess which one is better in terms of RMSE. We will use time series cross validation function to calculate RMSE for eash model

  • Seasonal Naive Model
  • Random walk model
  • ETS model
  • ARIMA model
calcRMSE = function(modelName, modelFunc)
{
  res = tsCV(power.ts,modelFunc)
  print(paste('RMSE for model', modelName, '=', sqrt(mean(res^2, na.rm=TRUE))))
}

Fit seasonal naive model and calc RMSE

calcRMSE('Seasonal Naive', snaive)
## [1] "RMSE for model Seasonal Naive = 1210609.88804598"

Fit Random Walk model and calc RMSE

calcRMSE('Random Walk', rwf)
## [1] "RMSE for model Random Walk = 1539633.85951456"

Fit Seasonal ARIMA model and calc RMSE

func <- function(x, h){forecast(auto.arima(x), h=h, seasonal=TRUE)}
calcRMSE('ARIMA', func)
## [1] "RMSE for model ARIMA = 1115003.87958045"

Fit ETS model and calc RMSE

func <- function(x, h){forecast(ets(x, model="ZZZ"), h=h)}
calcRMSE('ETS', func)
## [1] "RMSE for model ETS = 1037684.72933356"

We can see that ETS model gives us best RMSE. Let’s analyze model details

fit = ets(power.ts, model = 'ZZZ')
summary(fit)
## ETS(M,N,M) 
## 
## Call:
##  ets(y = power.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.1167 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 6133758.2586 
##     s = 0.9568 0.7524 0.8702 1.1849 1.2617 1.1857
##            1.0001 0.7743 0.8154 0.9116 1.0651 1.222
## 
##   sigma:  0.1201
## 
##      AIC     AICc      BIC 
## 6224.787 6227.514 6273.649 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 61969.44 833321.3 505622.3 -4.356712 12.01361 0.722433
##                   ACF1
## Training set 0.1757877

We can see that auto selection method of ETS has chosen ETS (M,N,M) model indicating multiplicative error terms and No trend and multiplicative seasonality. This make sense based on time series and STL dcomposition results

Let’s analyze residuals

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,M)
## Q* = 30.705, df = 10, p-value = 0.0006562
## 
## Model df: 14.   Total lags used: 24

Residual plots are not satisfactory. small value of Ljung-Box test shows that residuals are correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed. It shows huge left skew in the residual distribution. The Residual plot indicates a huge outlier presense in year 2010, which is in line with our finding in the time series

Let’s apply BoxCox transformation on the time series and see the results

fit = ets(power.ts, model = 'ZZZ', lambda = BoxCox.lambda(power.ts))
summary(fit)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = power.ts, model = "ZZZ", lambda = BoxCox.lambda(power.ts)) 
## 
##   Box-Cox transformation: lambda= 0.1042 
## 
##   Smoothing parameters:
##     alpha = 0.0538 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 39.2123 
##     s = -0.0972 -1.2756 -0.575 0.9225 1.3681 0.4026
##            0.1313 -1.2026 -0.9124 -0.3817 0.389 1.231
## 
##   sigma:  0.9553
## 
##      AIC     AICc      BIC 
## 1007.327 1010.054 1056.189 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 181549.3 876724.3 565471.4 -2.128111 12.02496 0.8079454
##                   ACF1
## Training set 0.2409174

We can see that RMSE is greatly improved using BoxCox transformation. We can also see that BoxCox transformation changes ETS model to ETS(A,N,A) where error and seasonality are additive and there is no trend

Let’s visualize residuals

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 13.696, df = 10, p-value = 0.1873
## 
## Model df: 14.   Total lags used: 24

Residual plots are satisfactory. Large value of Ljung-Box test shows that residuals are not correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed however it appears to be acceptable

Let’s use ETS(A,N,A) model to generate forecast for 2014

fcst = forecast(fit, h=12, lambda = BoxCox.lambda(power.ts))
## Warning in forecast.ets(fit, h = 12, lambda = BoxCox.lambda(power.ts)):
## biasadj information not found, defaulting to FALSE.
print(fcst)
##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014        9072736 7183518 11395443 6333493 12829171
## Feb 2014        7731631 6095283  9751051 5361280 11000973
## Mar 2014        6662622 5230951  8435807 4590628  9536172
## Apr 2014        6005531 4700833  7625719 4118547  8633056
## May 2014        5671328 4431064  7213978 3878258  8174218
## Jun 2014        7358020 5785457  9303198 5081382 10509173
## Jul 2014        7751644 6100889  9791808 5361297 11055897
## Aug 2014        9309851 7358433 11712684 6481503 13197502
## Sep 2014        8558822 6748361 10792837 5936177 12175460
## Oct 2014        6416036 5020103  8150139 4397280  9228578
## Nov 2014        5589925 4356484  7127459 3807699  8086030
## Dec 2014        7040548 5518105  8929009 4838018 10102189
fwrite(data.frame(fcst), "PowerForecast.csv", sep=",", col.names = TRUE, row.names = FALSE)

Let’s plot the forecast

autoplot(fcst, pi=TRUE) +
  ggtitle("Forecast - Power consumption in KWH")

Autoplot above shows point forecast along with 95% confidence interval for 2014 energy consumption in KWH using ETS(A,N,A) model with BoxCox transformation