1. Project setup

1.1. Install and load required libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(zoo)
## Warning: package 'zoo' was built under R version 4.2.2
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✔ tsibbledata 0.4.1     ✔ fable       0.3.2
## ✔ feasts      0.3.0     
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::index()     masks zoo::index()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

1.2. Import dataset

setwd("C:/Users/Peter Nguyen/Downloads/Forecasting Project")
data <- read.csv("Energy_consumption2.csv")
data <- data %>% filter(Year >= 2010)
data$Date <- as.Date(data$Date,"%Y-%m-%d")
str(data)
## 'data.frame':    152 obs. of  9 variables:
##  $ MSN               : chr  "TNRCBUS" "TNRCBUS" "TNRCBUS" "TNRCBUS" ...
##  $ YYYYMM            : int  201001 201002 201003 201004 201005 201006 201007 201008 201009 201010 ...
##  $ Year              : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Month             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date              : Date, format: "2010-01-01" "2010-02-01" ...
##  $ Energy_Consumption: num  1677 1433 1133 743 644 ...
##  $ Column_Order      : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ Description       : chr  "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" ...
##  $ Unit              : chr  "Trillion Btu" "Trillion Btu" "Trillion Btu" "Trillion Btu" ...

2. Introduction

2.1. Describe the dataset, where it’s from, who created it, the index, keys and variables (what are the forecast variable and predictor variables (if applicable)).

The data is created by U.S. Energy Information Administration (EIA) and can be found at https://www.eia.gov/totalenergy/data/monthly/index.php#consumption, file 2.2 - Residential sector energy consumption.

The forecast variable is Energy_Consumption, and the time series we will use is Date (which will be aggregate to Year-Month later).

2.2. Why did you choose this dataset?

We use energy everyday, and therefore curious about how is the consumption habit look like nationwide. Besides, there is a common trend in energy consumption (more in winter and less in summer). We want to see if this is true in the U.S., and if yes, can we leverage the data to forecast future energy consumption.

2.3. How can forecast on this data be leveraged to make smarter decisions?

By knowing future energy consumption, manufacturers can better prepare its resources to produce energy. This may help avoiding the situation where a number of households got their energy shut off occasionally because of lack of energy supply.

3. Data wrangling

3.1. Convert to tsibble

data_tsibble <- data %>% as_tsibble(index=Date)
data_tsibble
## # A tsibble: 152 x 9 [1D]
##    MSN     YYYYMM  Year Month Date       Energy_Consumpt…¹ Colum…² Descr…³ Unit 
##    <chr>    <int> <int> <int> <date>                 <dbl>   <int> <chr>   <chr>
##  1 TNRCBUS 201001  2010     1 2010-01-01             1677.      11 End-Us… Tril…
##  2 TNRCBUS 201002  2010     2 2010-02-01             1433.      11 End-Us… Tril…
##  3 TNRCBUS 201003  2010     3 2010-03-01             1133.      11 End-Us… Tril…
##  4 TNRCBUS 201004  2010     4 2010-04-01              743.      11 End-Us… Tril…
##  5 TNRCBUS 201005  2010     5 2010-05-01              644.      11 End-Us… Tril…
##  6 TNRCBUS 201006  2010     6 2010-06-01              686.      11 End-Us… Tril…
##  7 TNRCBUS 201007  2010     7 2010-07-01              749.      11 End-Us… Tril…
##  8 TNRCBUS 201008  2010     8 2010-08-01              738.      11 End-Us… Tril…
##  9 TNRCBUS 201009  2010     9 2010-09-01              645.      11 End-Us… Tril…
## 10 TNRCBUS 201010  2010    10 2010-10-01              666.      11 End-Us… Tril…
## # … with 142 more rows, and abbreviated variable names ¹​Energy_Consumption,
## #   ²​Column_Order, ³​Description

3.2. Deal with missing data

There are no missing data.

3.3. Create new variables to aid in forecasting

There is no need to create new variable to aid in forecasting.

3.4. Aggregate time series to desired format for forecasting

data_tsibble_aggregate <- data_tsibble %>% index_by(yearmonth=yearmonth(Date)) %>% summarise(Energy_Consumption = sum(Energy_Consumption))
data_tsibble_aggregate
## # A tsibble: 152 x 2 [1M]
##    yearmonth Energy_Consumption
##        <mth>              <dbl>
##  1  2010 Jan              1677.
##  2  2010 Feb              1433.
##  3  2010 Mar              1133.
##  4  2010 Apr               743.
##  5  2010 May               644.
##  6  2010 Jun               686.
##  7  2010 Jul               749.
##  8  2010 Aug               738.
##  9  2010 Sep               645.
## 10  2010 Oct               666.
## # … with 142 more rows

4. Exploratory analysis and visualization for the dataset

4.1. Visualize the dataset and comment on characteristics of time series

data_tsibble_aggregate %>% autoplot(Energy_Consumption)

4.2. Comment on the any anomalies in the data

There is no anomaly in the data, they seems to have distinct trend and cycle.

4.3. Describe trend/seasonality/cycles with supporting charts

There seems to be seasonality in this data set. There is no clear trend or cycle.

5. Model fitting

5.1. Split dataset into training and test sets

train <- data_tsibble_aggregate %>% filter(year(yearmonth) <  2020)
test  <- data_tsibble_aggregate %>% filter(year(yearmonth) >= 2020)

We will use 2010-2019 data to train the models and 2020-2022 data to test them

5.2. Fit TSLM, ETS and ARIMA model(s)

fit <- train %>% 
  model(
    # TSLM
    autoLM = TSLM(Energy_Consumption),
    LMwithTrend = TSLM(Energy_Consumption ~ trend()),
    # ETS
    autoETS = ETS(Energy_Consumption),
    # ARIMA
    autoArima = ARIMA(Energy_Consumption)
  )

5.2.1. Use AICc to select within model families

glance(fit)
## # A tibble: 4 × 20
##   .model    r_squa…¹ adj_r_…²  sigma2 statis…³ p_value    df log_lik   AIC  AICc
##   <chr>        <dbl>    <dbl>   <dbl>    <dbl>   <dbl> <int>   <dbl> <dbl> <dbl>
## 1 autoLM     0        0       1.15e+5 NA        NA         1   -869. 1402. 1402.
## 2 LMwithTr…  3.17e-6 -0.00847 1.16e+5  3.74e-4   0.985     2   -869. 1404. 1404.
## 3 autoETS   NA       NA       4.50e-3 NA        NA        NA   -771. 1572. 1576.
## 4 autoArima NA       NA       6.90e+3 NA        NA        NA   -632. 1272. 1273.
## # … with 10 more variables: BIC <dbl>, CV <dbl>, deviance <dbl>,
## #   df.residual <int>, rank <int>, MSE <dbl>, AMSE <dbl>, MAE <dbl>,
## #   ar_roots <list>, ma_roots <list>, and abbreviated variable names
## #   ¹​r_squared, ²​adj_r_squared, ³​statistic

5.2.2. Evaluate the residuals

#Fit ARIMA

arima_fit <- train %>% 
  model(auto = ARIMA(Energy_Consumption))
glance(arima_fit)
## # A tibble: 1 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 auto    6903.   -632. 1272. 1273. 1283. <cpl [25]> <cpl [0]>
#report
arima_fit %>% select(auto) %>% report()
## Series: Energy_Consumption 
## Model: ARIMA(1,0,0)(2,1,0)[12] 
## 
## Coefficients:
##          ar1     sar1     sar2
##       0.4618  -0.3659  -0.4403
## s.e.  0.0941   0.0945   0.0938
## 
## sigma^2 estimated as 6903:  log likelihood=-632.17
## AIC=1272.34   AICc=1272.73   BIC=1283.07
#ARIMA(1,0,0)(2,1,0)[12]

arima_fit %>% gg_tsresiduals(lag = 50)

The ACF function shows non-autocorrelation as all lines is under the lines of significance, and thus the variable is stationary. According to the histogram, the residuals appear to be mostly white noise, and it pretty normally distributed.

5.3. If there are predictor variables – fit a TSLM with predictor variables, Regression with ARIMA errors

There are no predictor variables.

5.4. Fit benchmark methods

#Fit ETS
ets_fit <- train %>% 
  model(auto = ETS(Energy_Consumption))
glance(ets_fit)
## # A tibble: 1 × 9
##   .model  sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 auto   0.00450   -771. 1572. 1576. 1613. 4954. 5530. 0.0486
#Fit TSLM
tslm_fit <- train %>% 
  model(auto = TSLM(Energy_Consumption))
glance(tslm_fit)
## # A tibble: 1 × 15
##   .model r_squa…¹ adj_r…² sigma2 stati…³ p_value    df log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl>  <dbl> <lgl>   <lgl>   <int>   <dbl> <dbl> <dbl> <dbl>
## 1 auto          0       0 1.15e5 NA      NA          1   -869. 1402. 1402. 1407.
## # … with 4 more variables: CV <dbl>, deviance <dbl>, df.residual <int>,
## #   rank <int>, and abbreviated variable names ¹​r_squared, ²​adj_r_squared,
## #   ³​statistic

5.5. Extra credit – a NNETAR model or Prophet model (FPP3 Ch 12)

6. Accuracy

6.1. Compare model performance using accuracy measures of your choice on the test dataset

comparision <- fit %>% forecast(h=12)
comparision %>% accuracy(data_tsibble_aggregate)
## # A tibble: 4 × 10
##   .model      .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 autoArima   Test    5.57  80.9  68.7  1.62  7.38 0.947 0.742 -0.197
## 2 autoETS     Test  -44.5  101.   66.9 -3.24  6.10 0.922 0.925  0.332
## 3 autoLM      Test   15.0  283.  238.  -6.03 24.3  3.28  2.59   0.495
## 4 LMwithTrend Test   13.9  283.  238.  -6.16 24.4  3.29  2.59   0.495

6.2. Why did you select those metrics for assessing accuracy?

We use RMSE because RMSE give the better picture of the quality of the regression model

6.3. Select a final model – why did you select this model?

ARIMA because ARIMA have the lowest RMSE with 80.89

7. Forecast

7.1. Produce a forecast for the future

fc <- fit %>% forecast(h="60 months") 
fc %>% autoplot(data_tsibble_aggregate,level = NULL)

7.2. How many steps into the future are you forecasting? Why?

60 months in total. After using the models for the Jan 2020-Aug 2022 period to compare their performance with actual data, we forecast the energy consumption for the next 28 months to see how the predicted energy consumption look like until the end of 2024.

7.3. Considerations or watchouts when putting forecast into production?

Although there exist a general trend throughout the year, the total consumption varies years by years (noticeable by the difference in peak value). One must keep this fact in mind if he/she/they decide to implement the models into use.