We load the data and then utilize the summary and skim functions to assess the quality of the data. Here are some observations:
We will use the tsclean function to clean the data and address the NA value. we show the same summary and skim function output for the cleaned series
power <- readxl::read_excel('ResidentialCustomerForecastLoad-624.xlsx')
power <- power$KWH %>% ts(start=c(1998, 1), frequency = 12)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
| Name | power |
| Number of rows | 192 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| ts | 1 |
| ________________________ | |
| Group variables | None |
Variable type: ts
| skim_variable | n_missing | complete_rate | start | end | frequency | deltat | mean | sd | min | max | median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | 1 | 0.99 | 1998 | 2013 | 12 | 0.08 | 6502475 | 1447571 | 770523 | 10655730 | 6283324 |
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443502 6351262 6529701 7608792 10655730
| Name | power |
| Number of rows | 192 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| ts | 1 |
| ________________________ | |
| Group variables | None |
Variable type: ts
| skim_variable | n_missing | complete_rate | start | end | frequency | deltat | mean | sd | min | max | median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | 0 | 1 | 1998 | 2013 | 12 | 0.08 | 6529701 | 1369032 | 4313019 | 10655730 | 6351262 |
Next we perform some basic EDA to better understand the series and to inform our data modeling. Here are some observations:
options(scipen=999)
ggtsdisplay(power) + labs(title = "Power Usage", subtitle = 'In KWH') + theme_fivethirtyeight()options(scipen=999)
ggsubseriesplot(power) + theme_fivethirtyeight() +
labs(title = "Power Useage", subtitle = "Seasonal Plot", y = "Useage (kWh)", x = "Mth")lambda = BoxCox.lambda(power)
p1 <- autoplot(power) +
theme_fivethirtyeight() + labs(title = "Power Time Series")
p2 <- autoplot(BoxCox(power, lambda)) +
theme_fivethirtyeight() + labs(title = "Box Cox- Power Time Series")
p1Similar to the ATM analysis, we will utilize several different techniques (STL, ETS and Arima) to model the the power series.
Modeling output is set forth below:
power %>%
stl(s.window = 'periodic', robust = TRUE) %>%
autoplot() +
labs(title = "STL Decomposition", x = "Yr")## ETS(A,Ad,A)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= -0.1443
##
## Smoothing parameters:
## alpha = 0.118
## beta = 0.0001
## gamma = 0.0001
## phi = 0.979
##
## Initial states:
## l = 6.1998
## b = 0.0001
## s = -0.006 -0.0285 -0.0132 0.019 0.0263 0.0212
## 0.0014 -0.0255 -0.0192 -0.0077 0.0081 0.024
##
## sigma: 0.0094
##
## AIC AICc BIC
## -765.9795 -762.0258 -707.3446
## Series: .
## ARIMA(0,0,1)(2,1,0)[12] with drift
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2563 -0.7036 -0.3817 0.0001
## s.e. 0.0809 0.0734 0.0748 0.0001
##
## sigma^2 estimated as 0.00008869: log likelihood=585.32
## AIC=-1160.65 AICc=-1160.3 BIC=-1144.68
We have plotted the forecast from our three modeling techniques below. Additionally we have calculated RMSEs for each modeling approach to determine the best model:
Once again the Arima model has produced the best fit. You can see the table output below. The Arima point forecast was also written to a csv file.
| x | |
|---|---|
| STL | 914242.9 |
| ETS | 967678.0 |
| ARIMA | 837116.8 |