library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
## Warning: package 'zoo' was built under R version 4.2.2
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✔ tsibbledata 0.4.1 ✔ fable 0.3.2
## ✔ feasts 0.3.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::index() masks zoo::index()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
setwd("C:/Users/Peter Nguyen/Downloads/Forecasting Project")
data <- read.csv("Energy_consumption2.csv")
data <- data %>% filter(Year >= 2010)
data$Date <- as.Date(data$Date,"%Y-%m-%d")
str(data)
## 'data.frame': 152 obs. of 9 variables:
## $ MSN : chr "TNRCBUS" "TNRCBUS" "TNRCBUS" "TNRCBUS" ...
## $ YYYYMM : int 201001 201002 201003 201004 201005 201006 201007 201008 201009 201010 ...
## $ Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Date, format: "2010-01-01" "2010-02-01" ...
## $ Energy_Consumption: num 1677 1433 1133 743 644 ...
## $ Column_Order : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Description : chr "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" "End-Use Energy Consumed by the Residential Sector" ...
## $ Unit : chr "Trillion Btu" "Trillion Btu" "Trillion Btu" "Trillion Btu" ...
The data is created by U.S. Energy Information Administration (EIA) and can be found at https://www.eia.gov/totalenergy/data/monthly/index.php#consumption, file 2.2 - Residential sector energy consumption.
The forecast variable is Energy_Consumption, and the time series we will use is Date (which will be aggregate to Year-Month later).
We use energy everyday, and therefore curious about how is the consumption habit look like nationwide. Besides, there is a common trend in energy consumption (more in winter and less in summer). We want to see if this is true in the U.S., and if yes, can we leverage the data to forecast future energy consumption.
By knowing future energy consumption, manufacturers can better prepare its resources to produce energy. This may help avoiding the situation where a number of households got their energy shut off occasionally because of lack of energy supply.
data_tsibble <- data %>% as_tsibble(index=Date)
data_tsibble
## # A tsibble: 152 x 9 [1D]
## MSN YYYYMM Year Month Date Energy_Consumpt…¹ Colum…² Descr…³ Unit
## <chr> <int> <int> <int> <date> <dbl> <int> <chr> <chr>
## 1 TNRCBUS 201001 2010 1 2010-01-01 1677. 11 End-Us… Tril…
## 2 TNRCBUS 201002 2010 2 2010-02-01 1433. 11 End-Us… Tril…
## 3 TNRCBUS 201003 2010 3 2010-03-01 1133. 11 End-Us… Tril…
## 4 TNRCBUS 201004 2010 4 2010-04-01 743. 11 End-Us… Tril…
## 5 TNRCBUS 201005 2010 5 2010-05-01 644. 11 End-Us… Tril…
## 6 TNRCBUS 201006 2010 6 2010-06-01 686. 11 End-Us… Tril…
## 7 TNRCBUS 201007 2010 7 2010-07-01 749. 11 End-Us… Tril…
## 8 TNRCBUS 201008 2010 8 2010-08-01 738. 11 End-Us… Tril…
## 9 TNRCBUS 201009 2010 9 2010-09-01 645. 11 End-Us… Tril…
## 10 TNRCBUS 201010 2010 10 2010-10-01 666. 11 End-Us… Tril…
## # … with 142 more rows, and abbreviated variable names ¹Energy_Consumption,
## # ²Column_Order, ³Description
There are no missing data.
There is no need to create new variable to aid in forecasting.
data_tsibble_aggregate <- data_tsibble %>% index_by(yearmonth=yearmonth(Date)) %>% summarise(Energy_Consumption = sum(Energy_Consumption))
data_tsibble_aggregate
## # A tsibble: 152 x 2 [1M]
## yearmonth Energy_Consumption
## <mth> <dbl>
## 1 2010 Jan 1677.
## 2 2010 Feb 1433.
## 3 2010 Mar 1133.
## 4 2010 Apr 743.
## 5 2010 May 644.
## 6 2010 Jun 686.
## 7 2010 Jul 749.
## 8 2010 Aug 738.
## 9 2010 Sep 645.
## 10 2010 Oct 666.
## # … with 142 more rows
data_tsibble_aggregate %>% autoplot(Energy_Consumption)
There seems to be seasonality in this data set. There is no clear trend or cycle.
train <- data_tsibble_aggregate %>% filter(year(yearmonth) < 2020)
test <- data_tsibble_aggregate %>% filter(year(yearmonth) >= 2020)
We will use 2010-2019 data to train the models and 2020-2022 data to test them
fit <- train %>%
model(
# TSLM
autoLM = TSLM(Energy_Consumption),
LMwithTrend = TSLM(Energy_Consumption ~ trend()),
# ETS
autoETS = ETS(Energy_Consumption),
# ARIMA
autoArima = ARIMA(Energy_Consumption)
)
glance(fit)
## # A tibble: 4 × 20
## .model r_squa…¹ adj_r_…² sigma2 statis…³ p_value df log_lik AIC AICc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 autoLM 0 0 1.15e+5 NA NA 1 -869. 1402. 1402.
## 2 LMwithTr… 3.17e-6 -0.00847 1.16e+5 3.74e-4 0.985 2 -869. 1404. 1404.
## 3 autoETS NA NA 4.50e-3 NA NA NA -771. 1572. 1576.
## 4 autoArima NA NA 6.90e+3 NA NA NA -632. 1272. 1273.
## # … with 10 more variables: BIC <dbl>, CV <dbl>, deviance <dbl>,
## # df.residual <int>, rank <int>, MSE <dbl>, AMSE <dbl>, MAE <dbl>,
## # ar_roots <list>, ma_roots <list>, and abbreviated variable names
## # ¹r_squared, ²adj_r_squared, ³statistic
#Fit ARIMA
arima_fit <- train %>%
model(auto = ARIMA(Energy_Consumption))
glance(arima_fit)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 auto 6903. -632. 1272. 1273. 1283. <cpl [25]> <cpl [0]>
#report
arima_fit %>% select(auto) %>% report()
## Series: Energy_Consumption
## Model: ARIMA(1,0,0)(2,1,0)[12]
##
## Coefficients:
## ar1 sar1 sar2
## 0.4618 -0.3659 -0.4403
## s.e. 0.0941 0.0945 0.0938
##
## sigma^2 estimated as 6903: log likelihood=-632.17
## AIC=1272.34 AICc=1272.73 BIC=1283.07
#ARIMA(1,0,0)(2,1,0)[12]
arima_fit %>% gg_tsresiduals(lag = 50)
The ACF function shows non-autocorrelation as all lines is under the lines of significance, and thus the variable is stationary. According to the histogram, the residuals appear to be mostly white noise, and it pretty normally distributed.
There are no predictor variables.
#Fit ETS
ets_fit <- train %>%
model(auto = ETS(Energy_Consumption))
glance(ets_fit)
## # A tibble: 1 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 0.00450 -771. 1572. 1576. 1613. 4954. 5530. 0.0486
#Fit TSLM
tslm_fit <- train %>%
model(auto = TSLM(Energy_Consumption))
glance(tslm_fit)
## # A tibble: 1 × 15
## .model r_squa…¹ adj_r…² sigma2 stati…³ p_value df log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <lgl> <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 auto 0 0 1.15e5 NA NA 1 -869. 1402. 1402. 1407.
## # … with 4 more variables: CV <dbl>, deviance <dbl>, df.residual <int>,
## # rank <int>, and abbreviated variable names ¹r_squared, ²adj_r_squared,
## # ³statistic
comparision <- fit %>% forecast(h=12)
comparision %>% accuracy(data_tsibble_aggregate)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 autoArima Test 5.57 80.9 68.7 1.62 7.38 0.947 0.742 -0.197
## 2 autoETS Test -44.5 101. 66.9 -3.24 6.10 0.922 0.925 0.332
## 3 autoLM Test 15.0 283. 238. -6.03 24.3 3.28 2.59 0.495
## 4 LMwithTrend Test 13.9 283. 238. -6.16 24.4 3.29 2.59 0.495
We use RMSE because RMSE give the better picture of the quality of the regression model
ARIMA because ARIMA have the lowest RMSE with 80.89
fc <- fit %>% forecast(h="60 months")
fc %>% autoplot(data_tsibble_aggregate,level = NULL)
60 months in total. After using the models for the Jan 2020-Aug 2022 period to compare their performance with actual data, we forecast the energy consumption for the next 28 months to see how the predicted energy consumption look like until the end of 2024.
Although there exist a general trend throughout the year, the total consumption varies years by years (noticeable by the difference in peak value). One must keep this fact in mind if he/she/they decide to implement the models into use.
4.2. Comment on the any anomalies in the data
There is no anomaly in the data, they seems to have distinct trend and cycle.