library(fpp3)    # tidy forecasting tools (tsibble, fable, feasts)
library(readxl)  # read Excel files
co2emission <- read_excel("co2emissions.xlsx")  # load the CO2 emissions data
Forecasting emissions in the U.S. can be a very useful tool for tracking and managing reduction strategies. In this paper I apply several forecasting models, including benchmark, ETS, and ARIMA models, to carbon emissions in the United States from 1970 to 2021. The data show a downward trend with little to no seasonality, and the Ljung-Box test demonstrated that the series is not white noise. I selected the drift method as my benchmark in the initial stage of my analysis and then applied ETS models, including Holt's linear trend method, and ARIMA models to see whether I could improve the forecasts further. After implementing the forecasting models, I found that Holt's linear trend method performed best, producing the lowest RMSE and MAPE values. The residual diagnostics for this model showed that the residuals are normally distributed around zero and behave as white noise, which led me to conclude that no important information was left out. Based on my forecasts, emissions will reach about 13 tons per person in the U.S. by 2030, a decrease from the last observed value of 15 tons per person in 2021.
Carbon emissions have become one of the most prominent topics of discussion not only in politics but among corporations as well. This is because CO2 emissions have been the biggest contributor to climate change since the Industrial Revolution began around 1750. Human activities such as burning fossil fuels and clearing land for agriculture and industry have increased greenhouse gas concentrations in our atmosphere by over 50%. Climate change poses a major threat to life on Earth, as it is linked to natural disasters such as hurricanes, droughts, and wildfires. In response, the Paris Agreement, the first-ever legally binding global climate change agreement, was adopted in 2015, setting out a global framework to limit warming to 1.5 degrees Celsius. Although the United States withdrew from this agreement in 2020 under the Trump Administration, citing economic concerns, it rejoined in 2021 under the Biden Administration with an ambitious goal of reducing greenhouse gas emissions by 50-52% by 2030.
Forecasting carbon emissions is crucial to understanding and measuring reduction efforts and their results, not only for the United States but for corporations as well. Under the SEC's new Climate Disclosure Rule, public companies will have to disclose their climate risks and operational carbon emissions just as they do their financial statements. Before this rule, climate disclosures were voluntary and not audited by a third party, which led many public companies to make ambitious CO2 reduction pledges that they have not lived up to. That said, predicting future carbon emissions cannot be perfectly accurate, given factors such as administration changes, policy shifts, pandemics, and economic developments. Nevertheless, in this paper I aim to forecast carbon emissions in the United States using various forecasting methods, to help stakeholders understand the trajectory of CO2 emissions and make informed decisions about climate change policy.
The data used in this report come from Our World in Data and can be found at https://ourworldindata.org/explorers/co2. It is a yearly time series dataset covering many countries' production-based carbon emissions, excluding land-use change, and spanning 1800 to 2021. The main metric, CO2 emissions, is sourced from the Global Carbon Project. For this paper, I use the United States' production-based carbon emissions from 1970 to 2021 because, as the time series graph below shows, this is where the data begin a downward trend. Including the years before 1970 could skew the data and yield a less accurate forecast, given that the global shift toward reducing GHG emissions began after this period.
# Convert dataset into time series data
USco2 <- co2emission %>%
  as_tsibble(index = year)

# Plot the dataset
USco2 %>%
  autoplot(tons) +
  labs(title = "United States CO2 Emissions", # add title of the graph
       y = "CO2 Emissions",                   # add y-axis label
       x = "Years")                           # add x-axis label

# ACF plot
USco2 %>%
  ACF(tons, lag_max = 10) %>%
  autoplot() +
  labs(title = "United States CO2 Emissions")
The dataset contains the United States' CO2 emissions per year, measured as average annual emissions per person: the country's total annual emissions divided by its total population. From the time plot and ACF plot we can see a downward trend that, as stated above, starts in 1970 and can plausibly be attributed to policy, economic factors, and the emergence of renewable energy. The sharp decrease from 2019 to 2020 in the time plot can be assumed to reflect COVID-19, and an upward movement begins in 2021. There is no apparent seasonality in the time plot, which is backed by the ACF showing no significant seasonal spikes at the lagged periods.
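As an illustration, if the raw data contained total-emissions and population columns, the per-capita metric could be derived with a single mutate(); the column names total_co2 and population below are hypothetical, since the downloaded file already supplies the per-capita value as tons.
# Hypothetical sketch: deriving per-capita emissions from assumed
# total_co2 and population columns (not present in the actual file)
co2emission %>%
  mutate(tons_per_person = total_co2 / population)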
To start my statistical analysis, I split the dataset into training and test sets. With only 52 observations, I sliced the data to use the first 41 (1970-2010) for training, leaving the final 11 years for testing. I then fit the benchmark methods, including the naive, mean, and drift methods; I excluded the seasonal naive method because the data show no seasonality.
# Create training set (first 41 observations, 1970-2010)
USco2_train <- USco2 %>%
  slice(1:41)

# Seasonal naive method omitted because there is no seasonality
fit <- USco2_train %>%
  model(
    mean = MEAN(tons),
    naive = NAIVE(tons),
    drift = RW(tons ~ drift()),
    ses = ETS(tons ~ error("A") + trend("N") + season("N")),             # simple exponential smoothing
    `Holt's method` = ETS(tons ~ error("A") + trend("A") + season("N")), # Holt's linear trend
    `Damped Holt's method` = ETS(tons ~ error("A") + trend("Ad", phi = 0.9) + season("N")), # damped trend
    ETS = ETS(tons),    # automatically selected ETS model
    arima = ARIMA(tons) # automatically selected ARIMA model
  )
co2_fc <- fit %>%
  forecast(h = "11 years")
After running these benchmark methods, I chose the drift method because it produced the lowest MAPE (9.96) and RMSE (1.74). To try to improve the forecasts further, I applied ETS models, including simple exponential smoothing, Holt's linear trend method, and the damped trend method, as well as ARIMA models. Before fitting the ARIMA models, I checked the series with the Ljung-Box test; the very small p-value indicated that the series is not white noise. Re-applying the Ljung-Box test to the differenced values of tons gave a high p-value (.681), indicating that the differenced series resembles white noise. To test stationarity directly, I then applied the KPSS test, which returned a p-value of .03, indicating that the series is non-stationary and needs to be differenced. Re-applying the KPSS test to the differenced series gave a p-value of .1, indicating that the differenced data appear stationary. I therefore determined that the series requires one difference to become stationary, and with this I began fitting my ARIMA models.
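For reference, the comparison behind the choice of the drift benchmark can be reproduced by ranking the test-set accuracy measures of the forecasts generated above; this is a sketch of the call rather than output from the original run.
# Rank the fitted models on the held-out test years by MAPE
accuracy(co2_fc, USco2) %>%
  select(.model, RMSE, MAPE) %>%
  arrange(MAPE)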
############### Test for Stationarity ###############
USco2_train |>
features(tons, ljung_box, lag = 10)
USco2_train |>
mutate(diff_tons = difference(tons)) |>
features(diff_tons, ljung_box, lag = 10)
# KPSS on tons
USco2_train |>
features(tons, unitroot_kpss)
# difference the data, and apply the test again.
USco2_train |>
mutate(diff_tons = difference(tons)) |>
features(diff_tons, unitroot_kpss)
# determine the appropriate number of first differences required to make a series stationary
USco2_train |>
features(tons, unitroot_ndiffs)
# Store the differenced series for ARIMA modeling
dif <- USco2_train |>
  mutate(diff_tons = difference(tons))
After running the initial automatic ARIMA fit, ARIMA() selected an ARIMA(0,1,2) model. However, I wanted to test other models in this neighborhood to ensure this was the best choice, so I examined the ACF and PACF plots: the ACF decays exponentially while the PACF has a significant spike at lag 1 and none beyond it, suggesting an ARIMA(1,1,0) model.
############### ARIMA Models ###############
ARfit <- dif |>
  model(ARIMA(tons))
report(ARfit)
## Series: tons
## Model: ARIMA(0,1,2)
##
## Coefficients:
## ma1 ma2
## 0.2971 -0.2740
## s.e. 0.1594 0.1517
##
## sigma^2 estimated as 0.3224: log likelihood=-33.26
## AIC=72.52 AICc=73.19 BIC=77.59
# ACF and PACF plots of the differenced series
dif |>
  gg_tsdisplay(diff_tons, plot_type = "partial")
I then fit this model along with other neighbors of the original specification. Among these candidates, ARIMA(2,1,0) performed best, with the lowest AICc of 74.05. However, comparing this with the original ARIMA(0,1,2) model's AICc of 73.19 shows that the original is still the better fit, so I retained it.
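A compact way to make this AICc comparison in a single step is to fit the candidate specifications in one model() call and inspect them with glance(); a sketch under the same data:
# Fit the candidate ARIMA specifications side by side and rank by AICc
dif |>
  model(
    auto   = ARIMA(tons),                               # selects ARIMA(0,1,2)
    ar1    = ARIMA(tons ~ pdq(1, 1, 0)),
    search = ARIMA(tons ~ pdq(p = 1:3, d = 1, q = 0:2)) # selects ARIMA(2,1,0)
  ) |>
  glance() |>
  select(.model, AICc) |>
  arrange(AICc)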
# Test neighboring models
ARfit2 <- dif |>
model(ARIMA(tons ~ pdq(1,1,0))) # fit the ARIMA(1,1,0) model
report(ARfit2)
## Series: tons
## Model: ARIMA(1,1,0)
##
## Coefficients:
## ar1
## 0.1855
## s.e. 0.1546
##
## sigma^2 estimated as 0.3601: log likelihood=-35.84
## AIC=75.68 AICc=76 BIC=79.06
# find the best ARIMA model with p = {1, 2, 3}, d = 1, q = {0, 1, 2}
ARfit3 <- dif |>
model(ARIMA(tons ~ pdq(p = 1:3, d = 1, q = 0:2)))
report(ARfit3)
## Series: tons
## Model: ARIMA(2,1,0)
##
## Coefficients:
## ar1 ar2
## 0.2775 -0.3599
## s.e. 0.1530 0.1673
##
## sigma^2 estimated as 0.3296: log likelihood=-33.69
## AIC=73.38 AICc=74.05 BIC=78.45
After running all the models, the results showed that Holt's linear trend method performed best based on the RMSE and MAPE values calculated. The forecast generated from the training set using Holt's linear trend model is shown in the figure below; it follows the downward trend we saw previously in the series.
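For reference, Holt's linear trend method, the ETS(A,A,N) specification fit below, extends simple exponential smoothing with a trend component. Its standard component form is:

$$
\begin{aligned}
\hat{y}_{t+h|t} &= \ell_t + h b_t \\
\ell_t &= \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta^{*}(\ell_t - \ell_{t-1}) + (1-\beta^{*}) b_{t-1}
\end{aligned}
$$

where $\ell_t$ is the estimated level, $b_t$ the estimated trend (slope), and $\alpha$ and $\beta^{*}$ smoothing parameters estimated by ETS().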
# Plot Holt's linear trend method
Holtfit <- USco2_train %>%
  model(`Holt's method` = ETS(tons ~ error("A") + trend("A") + season("N")))
Holtfc <- Holtfit %>%
  forecast(h = "11 years")
Holtfc |>
  autoplot(USco2_train) +
  labs(title = "US Carbon Emissions", y = "Tons")
As stated previously, I used RMSE and MAPE values to assess the accuracy of each model. The table below shows the results for all models tested.
# Test the accuracy of the models
accuracy(co2_fc, USco2)
The drift benchmark method gave me a good sense of where RMSE and MAPE values should fall, while the remaining methods, including the other benchmarks, the automatically selected ETS, damped Holt's method, simple exponential smoothing, and ARIMA, demonstrated far poorer forecasting accuracy. Holt's linear trend method produced the lowest RMSE and MAPE values, 1.69 and 9.66 respectively, so this is the model I chose for future forecasting. Following this decision, I conducted residual diagnostics on the model.
# Residual diagnostics for Holt's method on the full series
USco2 |>
  model(ETS(tons ~ error("A") + trend("A") + season("N"))) |>
  gg_tsresiduals()

HL <- USco2 |>
  model(ETS(tons ~ error("A") + trend("A") + season("N"))) |>
  augment()
HL |> features(.innov, ljung_box, lag = 10)
The residual diagnostics show that the residuals are approximately normally distributed around zero. The ACF shows only one significant spike, which is consistent with white noise given the number of lags displayed. The time plot shows outliers in 2010 and 2020. The Ljung-Box test on the residuals returned a p-value of .49, greater than the .05 significance level, so we fail to reject the null hypothesis and conclude that the residuals are indeed white noise. As a result, we can say with reasonable confidence that the selected model does not leave out any pertinent information.
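For context, the ljung_box feature computed above is the standard portmanteau statistic

$$
Q^{*} = T(T+2) \sum_{k=1}^{\ell} \frac{r_k^2}{T-k}
$$

where $T$ is the number of observations, $\ell$ the number of lags tested (10 here), and $r_k$ the lag-$k$ autocorrelation of the residuals; large values of $Q^{*}$ indicate that the residual autocorrelations are too large to have come from white noise.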
Using the Holt’s Linear Trend method, I forecasted with a forecast horizon of h = 10 that you can see in the plot below.
HLfit <- USco2 |>
  model(AAN = ETS(tons ~ error("A") + trend("A") + season("N")))

# Forecast emissions for the next 10 years
fc <- HLfit |>
  forecast(h = 10)

# Graph the data, fitted values, and forecasts
fc |>
  autoplot(USco2) +
  geom_line(aes(y = .fitted), col = "#D55E00",
            data = augment(HLfit)) + # overlay fitted values from the Holt's fit
  labs(y = "Tons", title = "US CO2 Emissions") +
  guides(colour = "none")
From the graph we can see that the forecast for United States carbon emissions continues the downward trend, following the spike from 2020 to 2021. By 2030, average annual emissions appear set to reach roughly 13 tons per person. While the figures show a decreasing trend, it is important to note the sharp drop in emissions from 2019 to 2020, which can be assumed to reflect COVID-19. In 2020 the world was put on pause: industries shut down, people stayed home, and ecosystems went uninterrupted for the first time in a while, all of which contributed to the decrease in global emissions. However, the data show the start of a rise in emissions in 2021, which may indicate that, because of the unprecedented nature of COVID, we will see an abnormal increase in emissions over the next few years rather than the historical decrease seen in the past. Moving forward, it will be important for the U.S. to continue implementing climate change and emissions reduction strategies in the wake of the global rise in temperatures. Although we will undoubtedly continue to see emissions rebound from 2020, the hope is that the country will return to the gradual decrease in emissions seen historically.
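To read the headline 2030 figure directly off the forecast object, the point forecast and a 95% prediction interval can be extracted from fc; a quick sketch:
# Extract the 2030 point forecast and its 95% prediction interval
fc |>
  hilo(level = 95) |>
  filter(year == 2030) |>
  as_tibble() |>
  select(year, .mean, `95%`)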