Part B

The dataset consists of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
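The code in this section assumes readxl, dplyr, ggplot2, forecast, and tseries are already loaded in the existing files mentioned above; a minimal sketch of that setup would be:

#load the packages this section relies on (assumed loaded in the earlier files)
library(readxl)   #read_excel
library(dplyr)    #rename, select, mutate, slice
library(ggplot2)  #ggplot
library(forecast) #ggseasonplot, ggtsdisplay, gglagplot, auto.arima, forecast
library(tseries)  #kpss.test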

pu <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(pu)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

Observe The Data

#View Summary
summary(pu)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

There appears to be 1 NA value, which means that we may be missing a whole month's data.

#locate NA value
which(is.na(pu), arr.ind=TRUE)
##      row col
## [1,] 129   3
#Observe the rows around the NA
slice(pu,c(127:132))
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          859 2008-Jul   7643987
## 2          860 2008-Aug   8037137
## 3          861 2008-Sep        NA
## 4          862 2008-Oct   5101803
## 5          863 2008-Nov   4555602
## 6          864 2008-Dec   6442746

It appears we are missing the value for September 2008. Let's visualize the data to further evaluate the dataset.

#rename the date column and drop the sequence index
pu <- pu %>% rename(Date = 'YYYY-MMM')
pu_data <- pu %>% select(-CaseSequence) 
#parse the "1998-Jan" strings into Date objects (first of each month)
pu_data <- pu_data %>% mutate(Date = as.Date(paste0('01-', Date), '%d-%Y-%b'))
 
ggplot(pu_data, aes(Date, KWH)) +geom_line() + ggtitle('Residential Power Usage')

There appears to be an extremely low power usage value sometime after 2010 that we'll need to explore.

min(pu_data$KWH,na.rm = TRUE)
## [1] 770523

Let's convert our data set into a monthly time series to check the seasonality of our dataset.

#convert data to monthly time series
pu2 <- ts(pu[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(pu2)+ggtitle('Residential Power Usage by Year')

When we break the plot up by year, there is an increase in power usage in the summer that peaks in August and another spike in the winter that peaks in January. The lowest power usage is in May and November.

It appears our minimum value falls in July, and it may be a typo since July is typically a month with high power usage.

Other than the missing value for September 2008 and the apparent typo in July 2010, the remaining data appears correct.
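A quick one-line check (a sketch) confirms which month that minimum belongs to; it returns the July 2010 row, consistent with the typo suspicion.

#confirm which month the minimum value belongs to
pu_data[which.min(pu_data$KWH), ]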

Clean The Data

Since the data is very seasonal, I will use the mean values of the months of July and September to replace the two months with missing or incorrect data.

#Remove the missing and incorrect rows so they don't skew the averages
pu_data <- pu_data[-c(129,151),]

#Get average by month
pu_data$Month <- months(pu_data$Date)
aggregate(KWH ~ Month, pu_data, mean)
##        Month     KWH
## 1      April 5299734
## 2     August 8298211
## 3   December 6283175
## 4   February 6946556
## 5    January 8007422
## 6       July 7852521
## 7       June 6536092
## 8      March 5971450
## 9        May 5039034
## 10  November 4953619
## 11   October 5657164
## 12 September 7702333

The average for July is 7852521 and the average for September is 7702333.
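The replacement below hardcodes those two numbers. As a sketch of an alternative, the values could instead be pulled from the aggregated table (monthly_means is a name introduced here for illustration):

#a sketch: pull the replacement values from the monthly means rather than hardcoding
monthly_means <- aggregate(KWH ~ Month, pu_data, mean)
sep_mean <- round(monthly_means$KWH[monthly_means$Month == "September"])
jul_mean <- round(monthly_means$KWH[monthly_means$Month == "July"])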

#replace September 2008 (missing) and July 2010 (typo) with the monthly means
pu$KWH[129] <- 7702333 #September mean
pu$KWH[151] <- 7852521 #July mean
pu[129,]
## # A tibble: 1 x 3
##   CaseSequence Date         KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep 7702333
pu[151,]
## # A tibble: 1 x 3
##   CaseSequence Date         KWH
##          <dbl> <chr>      <dbl>
## 1          883 2010-Jul 7852521
summary(pu)
##   CaseSequence       Date                KWH          
##  Min.   :733.0   Length:192         Min.   : 4313019  
##  1st Qu.:780.8   Class :character   1st Qu.: 5443502  
##  Median :828.5   Mode  :character   Median : 6351262  
##  Mean   :828.5                      Mean   : 6545609  
##  3rd Qu.:876.2                      3rd Qu.: 7670677  
##  Max.   :924.0                      Max.   :10655730

Analyze the Data

Let’s visualize the updated time series.

#convert data to monthly time series
pu2 <- ts(pu[, "KWH"], start = c(1998, 1), frequency = 12)
ggtsdisplay(pu2, points = FALSE, plot.type = "histogram") 

Let's further explore the seasonality in our data.

ggseasonplot(pu2,polar = TRUE)+ggtitle('Residential Power Usage by Year')

The polar seasonal chart shows the highest values in July through September and December through February, and the lowest power usage in October and April.

When we break the data up into lag plots, there appear to be correlations at lags 6, 7, and 12, with lag 12 showing the strongest correlation; a quick numeric check follows the plot below. This further confirms the seasonality in the time series.

gglagplot(pu2)
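To put numbers on those lags, the sample autocorrelations can be inspected directly (a sketch; acf() stores lag 0 in the first position):

#inspect the autocorrelations at lags 6, 7, and 12
acf_vals <- acf(pu2[, 1], lag.max = 24, plot = FALSE)
acf_vals$acf[c(6, 7, 12) + 1] #+1 because the first entry is lag 0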

Extract the trend, seasonality, and error

pud <- decompose(pu2, type="additive") 
plot(pud)

When we decompose the time series, there appears to be a slight increasing trend over the years. The seasonal component also shows a strong, stable pattern over the years.
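One way to quantify that seasonal strength is the measure 1 - Var(remainder)/Var(seasonal + remainder) from Hyndman and Athanasopoulos, where values near 1 indicate that seasonality dominates the remainder (a sketch using the components extracted above):

#seasonal strength: values near 1 mean the seasonal pattern dominates the noise
1 - var(pud$random, na.rm = TRUE) / var(pud$seasonal + pud$random, na.rm = TRUE)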

Forecasts

STL Decomposition

pu.stl <- stl(pu2[,1], s.window='periodic', robust=TRUE)
autoplot(forecast(pu.stl, h = 12)) + ylab('Power Usage')

checkresiduals(forecast(pu.stl))
## Warning in checkresiduals(forecast(pu.stl)): The fitted degrees of freedom
## is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 67.414, df = 22, p-value = 1.677e-06
## 
## Model df: 2.   Total lags used: 24

ARIMA Model

The ndiffs and nsdiffs functions recommend that the time series requires 1 non-seasonal difference and 1 seasonal difference in order to achieve stationarity.

ndiffs(pu2)
## [1] 1
nsdiffs(pu2)
## [1] 1

Since our time series has volatile spikes, it requires a Box-Cox transformation as well.

#Box-Cox transform, then take a seasonal (lag-12) difference and a first difference
pu.lambda <- BoxCox.lambda(pu2)
pu.arima <- diff(diff(BoxCox(pu2,pu.lambda),12))
ggtsdisplay(pu.arima)

After taking the differences, there are significant spikes in both the ACF and PACF charts, indicating that an ARMA model may be the best fit.

Box.test(pu.arima, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  pu.arima
## X-squared = 21.816, df = 1, p-value = 3e-06
kpss.test(pu.arima)
## Warning in kpss.test(pu.arima): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  pu.arima
## KPSS Level = 0.03924, Truncation lag parameter = 3, p-value = 0.1

The Box-Ljung test indicates that we can reject the null hypothesis that the data are white noise, so we can assume the values show dependence on each other. The KPSS test indicates that we cannot reject the null hypothesis of level stationarity, so we can treat the differenced series as stationary.

Since it looks like the data requires both seasonal and non-seasonal terms, I decided to use auto.arima to find the best fit.

pu.arima_fit <- auto.arima(pu.arima, stepwise =FALSE)

The auto.arima function suggests a non-seasonal MA(2) and a seasonal AR(2) model. The residuals appear to be normally distributed with a mean around zero, and the ACF shows no significant autocorrelation spikes. The model appears acceptable for forecasting.
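To inspect the coefficients and information criteria behind that selection, the fitted model can be printed (output omitted here):

#view the coefficients, sigma^2, and AICc of the selected model
summary(pu.arima_fit)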

Below we see a 12-month forecast based on the ARIMA(0,0,2)(2,0,0)[12] model.

autoplot(forecast(pu.arima_fit, h = 12)) + ylab('Power Usage')

Conclusion

checkresiduals(pu.arima_fit )

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(2,0,0)[12] with zero mean
## Q* = 20.037, df = 20, p-value = 0.4556
## 
## Model df: 4.   Total lags used: 24
checkresiduals(forecast(pu.stl))
## Warning in checkresiduals(forecast(pu.stl)): The fitted degrees of freedom
## is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 67.414, df = 22, p-value = 1.677e-06
## 
## Model df: 2.   Total lags used: 24
autoplot(forecast(pu.stl, h = 12)) + ylab('Power Usage')

autoplot(forecast(pu.arima_fit, h = 12)) + ylab('Power Usage')

Based on the residual analysis, the ARIMA(0,0,2)(2,0,0)[12] model and the STL + ETS(M,N,N) model produce comparable forecasts. The ARIMA residuals are closer to normally distributed, pass the Ljung-Box test (p = 0.4556), and show fewer lags spiking beyond the critical values, while the STL + ETS residuals fail the Ljung-Box test (p = 1.677e-06), so the ARIMA model may be a slightly better fit. Since the ARIMA model was fit to the Box-Cox transformed, differenced series while the STL + ETS model was fit to the original series, we cannot directly compare them; overall, both models are reasonable choices for forecasting power usage for 2014.
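If a direct comparison were wanted, one option (not done here; a sketch that holds out 2013 as a test set and fits auto.arima on the raw series with a Box-Cox lambda, so both forecasts come back on the original scale) would be to compare out-of-sample accuracy:

#a sketch: hold out 2013 and compare out-of-sample accuracy on the original scale
train <- window(pu2[, 1], end = c(2012, 12))
test  <- window(pu2[, 1], start = c(2013, 1))
stl_fc   <- forecast(stl(train, s.window = 'periodic', robust = TRUE), h = 12)
arima_fc <- forecast(auto.arima(train, lambda = BoxCox.lambda(train)), h = 12)
accuracy(stl_fc, test)   #test-set error measures for STL + ETS
accuracy(arima_fc, test) #test-set error measures for ARIMA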