This is an attempt to analyse the contribution of different fuel sources to the carbon emissions of the U.S. electric power sector. The data used is obtained from https://www.eia.gov/electricity/data.php#elecenv
This dataset is provided by the Energy Information Administration (EIA) of the United States and contains monthly carbon emissions from the different fuel sources used by the American electric power sector from 1973 to 2018.
A SARIMA model and a Holt-Winters model were fitted to forecast the carbon emissions of natural gas specifically.
library(fpp2)
library(dplyr)
library(ggplot2)
library(zoo)
library(urca)
library(caret)
options(dplyr.summarise.inform = FALSE)
co2_data <- read.csv('emission.csv')
head(co2_data)
## MSN YYYYMM Value Column_Order Description
## 1 CLEIEUS 197301 72.076 1 Coal Electric Power Sector CO2 Emissions
## 2 CLEIEUS 197302 64.442 1 Coal Electric Power Sector CO2 Emissions
## 3 CLEIEUS 197303 64.084 1 Coal Electric Power Sector CO2 Emissions
## 4 CLEIEUS 197304 60.842 1 Coal Electric Power Sector CO2 Emissions
## 5 CLEIEUS 197305 61.798 1 Coal Electric Power Sector CO2 Emissions
## 6 CLEIEUS 197306 66.538 1 Coal Electric Power Sector CO2 Emissions
## Unit
## 1 Million Metric Tons of Carbon Dioxide
## 2 Million Metric Tons of Carbon Dioxide
## 3 Million Metric Tons of Carbon Dioxide
## 4 Million Metric Tons of Carbon Dioxide
## 5 Million Metric Tons of Carbon Dioxide
## 6 Million Metric Tons of Carbon Dioxide
Some data manipulation and aggregation
co2_data$date <- as.yearmon(as.character(co2_data$YYYYMM), "%Y%m")
co2 <- co2_data %>% group_by(date, Description) %>% summarize(total = sum(as.numeric(Value), na.rm=TRUE))
any(is.na(co2))
## [1] TRUE
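The NA values most likely come from the date column: EIA's monthly files also contain annual-total rows whose YYYYMM month code is 13, which as.yearmon cannot parse. A quick check of this assumption (a sketch, not part of the original workflow):
#Month codes of the rows whose dates failed to parse (expected to be 13 if the assumption holds)
unique(co2_data$YYYYMM[is.na(co2_data$date)] %% 100)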
#String manipulation on the column Description
sub_regex <- '.*(Coal|Distillate|Geothermal|Natural Gas|Non-Biomass|Petroleum Coke|Petroleum Electric|Residual Fuel Oil|Total).*'
co2$source <- sub(sub_regex, '\\1', co2$Description)
#Drop the Description column now that the fuel source has been extracted
co2 <- co2 %>%
select(-Description)
co2$total <- as.numeric(co2$total)
head(co2)
## # A tibble: 6 x 3
## # Groups: date [1]
## date total source
## <yearmon> <dbl> <chr>
## 1 Jan 1973 72.1 Coal
## 2 Jan 1973 2.38 Distillate
## 3 Jan 1973 0 Geothermal
## 4 Jan 1973 12.2 Natural Gas
## 5 Jan 1973 0 Non-Biomass
## 6 Jan 1973 0.128 Petroleum Coke
overall <- co2 %>% group_by(date) %>% summarize(grand_total = sum(total))
#convert to ts
total_emission <- ts(overall$grand_total, start=c(1973,1), end=c(2018,9), frequency=12)
autoplot(total_emission) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Carbon emission by the US electric power sector started to decline after around 2009; by 2018 it was around the same level as in the late 1980s.
coal <- ts(co2[co2$source == "Coal",]$total, start=c(1973,1), end=c(2018,9), frequency=12)
autoplot(coal) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Coal Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
It looks like coal carbon emission has decreased significantly. Coal is the biggest contributor to carbon emission in the US electric power sector.
If we take the carbon emission of a fuel as an indicator of its usage, the overall decline in the sector's carbon emission was probably driven largely by the reduced use of coal in power generation in the United States.
distillate <- ts(co2[co2$source=="Distillate",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(distillate) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Distillate Fuel Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Distillate was used much less than coal, which explains its significantly lower emission. Compared to the period before the early 1980s, distillate usage has dropped.
#### Geothermal energy carbon emission
geothermal <- ts(co2[co2$source=="Geothermal",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(geothermal) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Geothermal Fuel Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Geothermal energy was not used prior to the late 1980s. Geothermal usage appears to stay relatively constant throughout the seasons of a year, possibly because geothermal heat pumps can be used to regulate indoor temperature during both summer and winter months.
natgas <- ts(co2[co2$source=="Natural Gas",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(natgas) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Natural Gas Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Usage of natural gas has increased rapidly since around the end of the last century. The shift from coal to natural gas in US electricity generation from around 2005 was a major contributor to this upward trend.
nonbio <- ts(co2[co2$source=="Non-Biomass",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(nonbio) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Non-Biomass Waste Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Like geothermal, non-biomass waste increased rapidly in usage during the 1990s but has since plateaued.
petcok <- ts(co2[co2$source=="Petroleum Coke",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(petcok) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Petroleum Coke Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Usage of petroleum coke has been limited compared to other fuels. It also started to decrease in the mid-2000s.
petrol <- ts(co2[co2$source=="Petroleum Electric",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(petrol) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Petroleum Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Petroleum usage has decreased significantly over the years; compared to its peak in the 1970s, petroleum usage has been minimal in recent years.
residfuel <- ts(co2[co2$source=="Residual Fuel Oil",]$total, start = c(1973,1), end=c(2018,9), frequency=12)
autoplot(residfuel) + ylab("Carbon emission (million metric tons)") +
xlab('Year') +
ggtitle("US Electric Power Sector Residual Fuel Oil Carbon Emission (million metric tons)") +
theme(plot.title = element_text(hjust=0.5))
Residual fuel oil usage has also dropped significantly over the years; its contribution to carbon emission is now minimal.
Overall, we observe that in the American electric power sector, sources like non-biomass waste and geothermal (not strictly renewable) have seen increases in usage compared to a few decades ago, while usage of non-renewable sources has declined, with the exception of natural gas.
Comparing carbon emission contribution of different fuels from January 2015 to September 2018:
co2 %>% filter(date >= 2015) %>%
ggplot(aes(x=reorder(source,total), y=total)) + geom_col() +
ylab("Carbon emission (million metric tons)") + xlab('Fuel') +
ggtitle("Carbon Emission of Various Energy Sources Used in US Electric Power Sector") +
theme(axis.text.x = element_text(angle=30, vjust = 0.7))
Natural gas is a fuel source that has been contributing increasingly to carbon emission in recent years, and its usage is becoming more widespread. Below is an attempt to forecast natural gas carbon emission in the U.S. electric power sector.
autoplot(window(natgas, start=2010))
Zooming in, we can see that the carbon emission of natural gas is higher in summer than in winter. This aligns with the fact that American households use more energy in summer: unlike in Europe, where many households do not have air-conditioning, air-conditioning is common in America.
Seasonal plot:
ggseasonplot(natgas, year.labels = FALSE, year.labels.left=FALSE, continuous=TRUE) +
ylab("Carbon emission (million metric tons)") +
ggtitle("Natural Gas Carbon Emission From US Electric Power Sector") +
theme(plot.title = element_text(hjust=0.5))
ggseasonplot(natgas, polar = TRUE, year.labels.left=FALSE, continuous=TRUE) +
ylab("Carbon emission (million metric tons)") +
ggtitle("Natural Gas Carbon Emission From US Electric Power Sector")
#Colour scale made continuous otherwise too many levels
#Just another way to show that during summer months, carbon emission is higher
The colour scale was made continuous because there are too many levels (years). These seasonal plots provide another way to see that carbon emission from natural gas electricity generation in the U.S. is higher during the summer months.
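As a quick numeric check of this pattern (a sketch, not part of the original analysis), the mean natural gas emission for each calendar month can be computed directly:
#Average natural gas carbon emission by calendar month across the whole sample
round(tapply(natgas, cycle(natgas), mean), 2)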
train <- window(natgas, end = c(2013,12))
test <- window(natgas, start = c(2014,1))
#Time series STL decomposition
train %>%
mstl(t.window=13, s.window="periodic", robust=TRUE) %>%
autoplot()
Again, there is an upward trend and clear seasonality. Eyeballing the series, natural gas emitted almost five times more carbon in 2018 than in any year from the 1970s to the 1990s.
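One way to put a rough number on that eyeball estimate (a sketch; 2017 is used here as the last full calendar year in the sample):
#Ratio of 2017 to 1990 annual natural gas carbon emission
sum(window(natgas, start = c(2017, 1), end = c(2017, 12))) /
sum(window(natgas, start = c(1990, 1), end = c(1990, 12)))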
Since the data is seasonal, a SARIMA model is a good choice. Before fitting, the data needs to be made stationary. On top of that, a Box-Cox transformation is applied, since the variance changes over time.
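For reference, the Box-Cox transformation applied by BoxCox() is
\[ w_t = \begin{cases} \ln(X_t) & \text{if } \lambda = 0 \\ (X_t^{\lambda} - 1)/\lambda & \text{otherwise} \end{cases} \]
with \(\lambda\) chosen here by BoxCox.lambda().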
natgas_s <- diff(BoxCox(train,BoxCox.lambda(train)), 12)
#lambda value is 0.05070343
autoplot(natgas_s) + ylab(' ') + xlab('Year') +
ggtitle("Natural gas carbon emission made stationary") +
theme(axis.title.y = element_text(angle = 0,vjust = 0.5),
plot.title = element_text(hjust=0.5))
The series still does not look entirely stationary, so another order of differencing is taken.
#take a lag-1 difference
natgas_s <- diff(natgas_s, 1)
autoplot(natgas_s) + ylab(' ') + xlab('Year') +
ggtitle("Natural gas carbon emission made stationary") +
theme(axis.title.y = element_text(angle = 0,vjust = 0.5),
plot.title = element_text(hjust=0.5))
It seems the data has now been made stationary.
ur.kpss(natgas_s) %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0075
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The null hypothesis of this test is that the data is stationary. Since the test statistic is well below the critical values, we cannot reject the null hypothesis, i.e. the data can be treated as stationary.
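As a cross-check (a sketch, not part of the original analysis), the forecast package's helpers can suggest how many seasonal and first differences the transformed series needs:
#Suggested seasonal differences for the Box-Cox transformed series
nsdiffs(BoxCox(train, BoxCox.lambda(train)))
#Suggested first differences after one seasonal difference
ndiffs(diff(BoxCox(train, BoxCox.lambda(train)), 12))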
acf(natgas_s)
There is a significant spike at lag 12, so a seasonal MA order of 1 is likely (0 is also tried). There are also significant spikes at the first couple of lags, so the non-seasonal MA order is set to 2 (1 is also tried).
pacf(natgas_s)
There are significant spikes at lags 12 and 24, so the seasonal AR order is set to 2 (1 is also tried). The remaining spikes tail off, so the non-seasonal AR order is set to 0.
Looking at the ACF and PACF of the stationary data, a possible model is ARIMA(0,1,2)(2,1,1)[12].
A few other models are also fitted.
lambda <- BoxCox.lambda(train)
arima_model1 <- arima(BoxCox(train,lambda), order=c(0,1,2), seasonal = list(order = c(2,1,1), period=12))
arima_model2 <- arima(BoxCox(train,lambda), order=c(0,1,1), seasonal = list(order = c(2,1,1), period=12))
arima_model3 <- arima(BoxCox(train,lambda), order=c(0,1,2), seasonal = list(order = c(1,1,1), period=12))
arima_model4 <- arima(BoxCox(train,lambda), order=c(0,1,1), seasonal = list(order = c(1,1,1), period=12))
arima_model5 <- arima(BoxCox(train,lambda), order=c(0,1,2), seasonal = list(order = c(2,1,0), period=12))
arima_model6 <- arima(BoxCox(train,lambda), order=c(0,1,1), seasonal = list(order = c(1,1,1), period=12))
arima_model7 <- arima(BoxCox(train,lambda), order=c(0,1,2), seasonal = list(order = c(2,1,0), period=12))
arima_model8 <- arima(BoxCox(train,lambda), order=c(0,1,1), seasonal = list(order = c(2,1,0), period=12))
ss1 <- sum(arima_model1$residuals)
ss2 <- sum(arima_model2$residuals)
ss3 <- sum(arima_model3$residuals)
ss4 <- sum(arima_model4$residuals)
ss5 <- sum(arima_model5$residuals)
ss6 <- sum(arima_model6$residuals)
ss7 <- sum(arima_model7$residuals)
ss8 <- sum(arima_model8$residuals)
df = data.frame(row.names = c('AIC', 'SSE'),
c(arima_model1$aic, ss1),
c(arima_model2$aic, ss2),
c(arima_model3$aic, ss3),
c(arima_model4$aic, ss4),
c(arima_model5$aic, ss5),
c(arima_model6$aic, ss6),
c(arima_model7$aic, ss7),
c(arima_model8$aic, ss8))
colnames(df) = c('ARIMA(0,1,2)(2,1,1)[12]', 'ARIMA(0,1,1)(2,1,1)[12]', 'ARIMA(0,1,2)(1,1,1)[12]',
'ARIMA(0,1,1)(1,1,1)[12]', 'ARIMA(0,1,2)(2,1,0)[12]', 'ARIMA(0,1,1)(1,1,1)[12]',
'ARIMA(0,1,2)(2,1,0)[12]', 'ARIMA(0,1,1)(2,1,0)[12]')
df
## ARIMA(0,1,2)(2,1,1)[12] ARIMA(0,1,1)(2,1,1)[12] ARIMA(0,1,2)(1,1,1)[12]
## AIC -1408.422586 -1380.2117061 -1407.0562311
## SSE 0.696029 0.3210073 0.5166495
## ARIMA(0,1,1)(1,1,1)[12] ARIMA(0,1,2)(2,1,0)[12] ARIMA(0,1,1)(1,1,1)[12]
## AIC -1379.1267740 -1370.8862028 -1379.1267740
## SSE 0.2737182 0.2190542 0.2737182
## ARIMA(0,1,2)(2,1,0)[12] ARIMA(0,1,1)(2,1,0)[12]
## AIC -1370.8862028 -1338.7046395
## SSE 0.2190542 0.1375532
ARIMA(0,1,2)(2,1,1)[12] has the lowest AIC and ARIMA(0,1,1)(2,1,0)[12] has the smallest SSE. Since selection by the smallest AIC and by the smallest SSE points to different candidates, the simpler of the two, ARIMA(0,1,1)(2,1,0)[12] (arima_model8), is chosen. It has three estimated coefficients.
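As an aside, the same kind of comparison can be written more compactly with forecast::Arima(), which also reports AICc and BIC (a sketch under the same lambda, not how the models above were fitted):
candidate_orders <- list(c(0,1,2, 2,1,1), c(0,1,1, 2,1,1), c(0,1,2, 1,1,1),
                         c(0,1,1, 1,1,1), c(0,1,2, 2,1,0), c(0,1,1, 2,1,0))
sapply(candidate_orders, function(x) {
  #Fit each candidate on the training set with the Box-Cox lambda chosen earlier
  fit <- Arima(train, order = x[1:3], seasonal = x[4:6], lambda = lambda)
  c(AIC = fit$aic, AICc = fit$aicc, BIC = fit$bic)
})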
checkresiduals(arima_model8)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(2,1,0)[12]
## Q* = 84.702, df = 21, p-value = 1.303e-09
##
## Model df: 3. Total lags used: 24
The residuals have roughly zero mean, but their variance is not constant throughout, and there is some autocorrelation as well.
The null hypothesis of the Ljung-Box test is that the residuals have no autocorrelation. Since the p-value is very small, the null hypothesis has to be rejected.
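The same test can be run directly with Box.test(), using the lag and model degrees of freedom reported above (a quick sketch):
#Ljung-Box test on the SARIMA residuals: 24 lags, 3 estimated coefficients
Box.test(arima_model8$residuals, lag = 24, type = "Ljung-Box", fitdf = 3)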
We also try the auto.arima() function in R, which runs a stepwise search over many possible models and returns the best one.
#lambda = "auto" -> perform BoxCox transformation
auto_arima <- auto.arima(train, seasonal = TRUE, lambda = "auto")
auto_arima
## Series: train
## ARIMA(4,1,2)(0,1,1)[12]
## Box Cox transformation: lambda= -0.1381425
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 sma1
## 1.3231 -0.6197 0.1076 -0.1840 -1.5701 0.7120 -0.8238
## s.e. 0.0922 0.0965 0.0760 0.0518 0.0815 0.0808 0.0330
##
## sigma^2 estimated as 0.002756: log likelihood=729.01
## AIC=-1442.01 AICc=-1441.71 BIC=-1408.64
The result is different: ARIMA(4,1,2)(0,1,1)[12] is selected as the best model.
checkresiduals(auto_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(4,1,2)(0,1,1)[12]
## Q* = 31.496, df = 17, p-value = 0.01737
##
## Model df: 7. Total lags used: 24
This model also fails the Ljung-Box test: there is still significant autocorrelation in the residuals. Even so, the model can still be used for forecasting, though the prediction intervals may not be accurate.
The model given by auto.arima() is chosen because auto.arima() has tried more possible models. This model can be written as an equation:
\[ (1 - 1.3231B + 0.6197B^{2} - 0.1076B^{3} + 0.1840B^{4})(1-B)(1-B^{12})w_t = (1 - 1.5701B + 0.7120B^{2})(1 - 0.8238B^{12})e_t \] where \(w_t\) is the Box-Cox transformed series (lambda = -0.1381). The coefficients are obtained from the output of auto.arima().
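The coefficients used in the equation can be read directly off the fitted object:
round(auto_arima$coef, 4)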
#Forecast for next 5 years
future <- auto_arima %>% forecast(h = 60)
autoplot(future) + autolayer(future, series = "Forecast") + autolayer(test, series = "Actual") +
ylab("Carbon emission (million metric tons)") + xlab("Year") +
ggtitle("US Electric Power Sector Natural Gas Carbon \nEmission As Forecast by ARIMA(1,1,1)(0,1,1)[12]") +
theme(plot.title = element_text(hjust=0.5)) +
guides(colour = guide_legend(title = ""))
Natural gas carbon emission is forecast to continue increasing slowly in the future, which aligns with the actual trend in the test set.
accuracy(future, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02915978 1.702108 1.208890 -0.08287894 5.920923 0.6061702
## Test set 0.29029930 3.647264 2.898512 0.83615686 6.697886 1.4533923
## ACF1 Theil's U
## Training set -0.06051512 NA
## Test set 0.70394881 0.6499045
The test set RMSE is 3.647264 and the MAE is 2.898512.
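These figures can be reproduced by hand from the point forecasts (a sketch; future$mean holds the point forecasts from the chunk above):
#Align the point forecasts with the test window and recompute RMSE and MAE
fc <- window(future$mean, start = c(2014, 1), end = c(2018, 9))
sqrt(mean((test - fc)^2)) #RMSE
mean(abs(test - fc))      #MAE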
We can also fit a Holt-Winters model (also known as triple exponential smoothing) with multiplicative seasonality, since the size of the seasonal fluctuations in carbon emission has increased over the years.
hw_model <- hw(train, seasonal="multiplicative", h=60)
hw_forecast<-forecast(hw_model)
autoplot(hw_forecast) + autolayer(hw_forecast, series = "Holt-Winters Forecast") +
autolayer(test, series = "Actual") + ylab("Carbon emission (million metric tons)") +
xlab("Year") +
ggtitle("US Electric Power Sector Natural Gas Carbon \nEmission As Forecast by Holt-Winters") +
theme(plot.title = element_text(hjust=0.5)) +
guides(colour = guide_legend(title = ""))
print(paste('Alpha coefficient is', as.character(hw_model$model$par['alpha'])))
## [1] "Alpha coefficient is 0.337685697066209"
print(paste('Beta coefficient is', as.character(hw_model$model$par['beta'])))
## [1] "Beta coefficient is 0.000100466504357162"
print(paste('Gamma coefficient is', as.character(hw_model$model$par['gamma'])))
## [1] "Gamma coefficient is 0.360203432186517"
print(paste('Period of a season is', as.character(hw_model$model$par['m'])))
## [1] "Period of a season is NA"
Above are the smoothing coefficients of the Holt-Winters model. The forecast of X, h time steps ahead of time t, can be written as:
\[ \hat{X}_{t+h|t} = (level_t + h \cdot trend_t)\,seasonal_{t+h-12k'} \\ level_t = 0.3377\left(\frac{X_t}{seasonal_{t-12}}\right) + (1-0.3377)(level_{t-1} + trend_{t-1}) \\ trend_t = 0.0001(level_t - level_{t-1}) + (1-0.0001)trend_{t-1} \\ seasonal_t = 0.3602\left(\frac{X_t}{level_{t-1}+trend_{t-1}}\right) + (1-0.3602)seasonal_{t-12} \\ k' = \lfloor (h-1)/12 \rfloor + 1 \]
The initial states estimated are:
hw_model$model$initstate
## l b s1 s2 s3 s4
## 16.937219792 -0.005084263 0.754253336 0.899912591 1.097186236 1.199096762
## s5 s6 s7 s8 s9 s10
## 1.293569392 1.355371928 1.223934211 1.026785331 0.871456604 0.850724140
## s11 s12
## 0.644933508 0.782775961
accuracy(hw_forecast, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06221125 1.969161 1.395566 -0.1338955 6.882893 0.6997743
## Test set 4.48588459 6.293576 5.310967 9.5731391 11.684716 2.6630620
## ACF1 Theil's U
## Training set 0.3460062 NA
## Test set 0.8228134 1.06495
The test set RMSE is 6.293576 and the MAE is 5.310967; Holt-Winters is less accurate than the SARIMA model when predicting carbon emission from natural gas over 2014 to 2018.
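A compact side-by-side of the two models' test set errors, using the accuracy objects already computed above (a sketch):
#Test set RMSE and MAE for the SARIMA and Holt-Winters forecasts
rbind(SARIMA = accuracy(future, test)["Test set", c("RMSE", "MAE")],
      HoltWinters = accuracy(hw_forecast, test)["Test set", c("RMSE", "MAE")])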
We can use data beyond 2018 to test the actual predictive power of this SARIMA model.