An analysis of the Carbon Emissions dataset available on Kaggle is conducted. The idea is to apply time series methods to gain some understanding of the data and to make predictions. Additionally, it is shown how simple it can be to run a basic time series analysis in R.
Part of the data wrangling process follows kernels on Kaggle.
Load the required libraries.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(fpp)
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: tseries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(astsa)
##
## Attaching package: 'astsa'
## The following object is masked from 'package:fpp':
##
## oil
## The following objects are masked from 'package:fma':
##
## chicken, sales
## The following object is masked from 'package:forecast':
##
## gas
Load the data; some entries are coded as “Not Available”.
myData <- read.csv("MER_T12_06.csv", stringsAsFactors = FALSE, na.strings = c("", "NA", "Not Available"))
Basic glimpse into the data and missing data removal.
str(myData)
## 'data.frame': 5094 obs. of 6 variables:
## $ MSN : chr "CLEIEUS" "CLEIEUS" "CLEIEUS" "CLEIEUS" ...
## $ YYYYMM : int 197301 197302 197303 197304 197305 197306 197307 197308 197309 197310 ...
## $ Value : num 72.1 64.4 64.1 60.8 61.8 ...
## $ Column_Order: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Description : chr "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" ...
## $ Unit : chr "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" ...
sum(is.na(myData))
## [1] 416
myData <- myData %>% na.omit()
The data encodes the total emissions for each year as a 13th month; that will be removed now. First we split YYYYMM into year and month.
myData <- myData %>% mutate(
  dummy = as.character(YYYYMM),
  year = as.factor(substr(dummy, 1, 4)),
  month = as.factor(substr(dummy, 5, 6)),
  value = as.numeric(Value)
) %>% select(year, month, value, Description)
# Now we remove the 13th month
myData <- myData %>% filter(month != "13")
The Description column is far too long; we’ll shorten it for easier reading.
myData <-
  myData %>%
  mutate(
    description2 = recode(
      Description,
      "Coal Electric Power Sector CO2 Emissions" = "coal",
      "Natural Gas Electric Power Sector CO2 Emissions" = "natural gas",
      "Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions" = "distillate fuel",
      "Petroleum Coke Electric Power Sector CO2 Emissions" = "petroleum coke",
      "Residual Fuel Oil Electric Power Sector CO2 Emissions" = "residual fuel",
      "Petroleum Electric Power Sector CO2 Emissions" = "petroleum electric power",
      "Geothermal Energy Electric Power Sector CO2 Emissions" = "geothermal energy",
      "Non-Biomass Waste Electric Power Sector CO2 Emissions" = "non biomass waste",
      "Total Energy Electric Power Sector CO2 Emissions" = "total energy"
    )
  )
myData <-
myData %>%
select(-Description) %>%
rename(description = description2) %>%
mutate(description = as.factor(description))
Let’s look at the new descriptions and at the distribution of observations in each category.
table(myData$description)
##
## coal distillate fuel geothermal energy
## 523 523 331
## natural gas non biomass waste petroleum coke
## 523 331 523
## petroleum electric power residual fuel total energy
## 523 523 523
We’re interested in the value column. We’ll create a time series for each emission source.
ts_coal <- ts(myData$value[which(myData$description == "coal")], start = c(1973, 01), frequency = 12)
ts_distillate_fuel <- ts(myData$value[which(myData$description == "distillate fuel")], start = c(1973, 01), frequency = 12)
ts_natural_gas <- ts(myData$value[which(myData$description == "natural gas")], start = c(1973, 01), frequency = 12)
ts_petroleum_coke <- ts(myData$value[which(myData$description == "petroleum coke")], start = c(1973, 01), frequency = 12)
ts_petroleum_elec <- ts(myData$value[which(myData$description == "petroleum electric power")], start = c(1973, 01), frequency = 12)
ts_residual <- ts(myData$value[which(myData$description == "residual fuel")], start = c(1973, 01), frequency = 12)
ts_total <- ts(myData$value[which(myData$description == "total energy")], start = c(1973, 01), frequency = 12)
ts_geothermal <- ts(myData$value[which(myData$description == "geothermal energy")], start = c(1989, 01), frequency = 12)
ts_waste <- ts(myData$value[which(myData$description == "non biomass waste")], start = c(1989, 01), frequency = 12)
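The nine nearly identical calls above could also be wrapped in a small helper. A minimal sketch (the make_ts name is ours, and it assumes the rows for each source are already in chronological order, as they are here):
# Hypothetical helper: monthly ts for a single emission source
make_ts <- function(df, source, start_year = 1973) {
  ts(df$value[df$description == source], start = c(start_year, 1), frequency = 12)
}
# e.g. ts_coal <- make_ts(myData, "coal")
#      ts_geothermal <- make_ts(myData, "geothermal energy", start_year = 1989)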
At this point we are ready to start with a classic time series analysis. We begin with some plots to gain some insight into the data.
autoplot(ts_coal) + ggtitle("Coal Emissions")
autoplot(ts_geothermal) + ggtitle("Geothermal Emissions")
autoplot(ts_total) + ggtitle("Total Emissions")
We can see there’s an overall decline in total emissions and in coal emissions after 2005. The pattern for geothermal emissions is completely different.
The following plot allows us to get an understanding of the seasonal dynamics.
ggsubseriesplot(ts_total) + ggtitle("Total Emissions")
As we can see, the months with the highest emissions are July and August, the peak of summer, followed by December and January, the peak of winter. Seasonality seems to be largely driven by shifts in temperature. We can also see that in recent years emissions are decreasing in every month, and a closer look shows that for the cold months they have reverted to their historical means.
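A seasonal plot offers a complementary view of the same dynamics, overlaying one line per year; a quick sketch using ggseasonplot from the forecast package (which fpp loads):
ggseasonplot(ts_total) + ggtitle("Total Emissions - Seasonal Plot")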
We can generate the same subseries plot for the coal and natural gas emissions.
ggsubseriesplot(ts_coal) + ggtitle("Coal")
ggsubseriesplot(ts_natural_gas) + ggtitle("Natural Gas")
We can see that coal emissions have been falling in every month in recent years. On the other hand, natural gas emissions are up, especially in the summer months.
Total emissions seem to be driven, to a large extent, by the coal emissions. Let’s plot the share of coal in total emissions over time.
autoplot(ts_coal/ts_total) + ggtitle("Coal/Total Emissions")
We can see that coal’s share of total emissions peaked between 1990 and 2000. The subsequent reduction could come from better coal technology or from the substitution of coal with other sources. Let’s see how the shares of pollution from alternative sources of energy behave over time, first for geothermal, then for waste.
autoplot(ts_geothermal/ts_total) + ggtitle("Geothermal/Total Emissions")
autoplot(ts_waste/ts_total) + ggtitle("Waste/Total Emissions")
Their shares are increasing over time, indicating an increase in their usage (assuming technology is not getting worse).
Next, let’s decompose the total emissions series with the STL function (Seasonal and Trend decomposition using LOESS).
ts_total %>% stl(t.window = 13, s.window = "periodic") %>% autoplot()
It might be useful to inspect the seasonally adjusted series.
autoplot(seasadj(stl(ts_total, s.window = "periodic"))) +ggtitle("Total Emissions - Seasonally Adjusted Series")
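If we keep the fitted decomposition around, its individual components can be extracted directly; a short sketch using forecast’s trendcycle accessor (seasonal and remainder work the same way):
# Store the STL fit, then pull out the trend-cycle component
fit_stl <- stl(ts_total, t.window = 13, s.window = "periodic")
autoplot(trendcycle(fit_stl)) + ggtitle("Total Emissions - Trend-Cycle Component")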
In this part we focus only on total emissions.
We can fit an ARIMA model to this data using the auto.arima function.
# Stepwise fitting
fit1 <- auto.arima(ts_total)
summary(fit1)
## Series: ts_total
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.5175 -0.0061 -0.8978 -0.7802
## s.e. 0.0560 0.0515 0.0352 0.0280
##
## sigma^2 estimated as 37.66: log likelihood=-1653.14
## AIC=3316.27 AICc=3316.39 BIC=3337.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2251275 6.035837 4.567068 -0.2425231 2.869473 0.6091823
## ACF1
## Training set -0.002243674
# And trying all models (stepwise = FALSE)
fit2 <- auto.arima(ts_total, stepwise = FALSE)
summary(fit2)
## Series: ts_total
## ARIMA(1,1,1)(2,1,1)[12]
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.4963 -0.8937 -0.0419 -0.1561 -0.7230
## s.e. 0.0562 0.0301 0.0599 0.0534 0.0448
##
## sigma^2 estimated as 37.07: log likelihood=-1648.99
## AIC=3309.97 AICc=3310.14 BIC=3335.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2277564 5.98252 4.542354 -0.2426657 2.857439 0.6058858
## ACF1
## Training set 0.003368634
The models are different. The first fit is an ARIMA(2,1,1)(0,1,1)[12], while the second is an ARIMA(1,1,1)(2,1,1)[12]: the second model trades the second non-seasonal AR term for two seasonal AR terms. The second model provides a slightly better AICc score.
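The AICc values can also be read directly off the fitted objects (auto.arima stores them in the aicc field):
fit1$aicc # 3316.39, per the summary above
fit2$aicc # 3310.14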
Let’s do a quick residual analysis.
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: residuals
## Q* = 36.177, df = 20, p-value = 0.01466
##
## Model df: 4. Total lags used: 24
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: residuals
## Q* = 29.872, df = 19, p-value = 0.05345
##
## Model df: 5. Total lags used: 24
In both cases the residuals look roughly like white noise, though the Ljung-Box test rejects independence at the 5% level for fit1 (p = 0.015) while it does not for fit2 (p = 0.053). Residuals also look normally distributed.
We can also compare the accuracy of both models.
accuracy(fit1)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2251275 6.035837 4.567068 -0.2425231 2.869473 0.6091823
## ACF1
## Training set -0.002243674
accuracy(fit2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2277564 5.98252 4.542354 -0.2426657 2.857439 0.6058858
## ACF1
## Training set 0.003368634
Again fit2 is the better choice, although the difference is small. However, this is not the best way to measure accuracy, since it is computed on the training set. We need some form of cross-validation, or separate training and test sets.
The following code implements a cross-validation procedure for our models.
k <- 60 # minimum data length for fitting a model
n <- length(ts_total)
mae1 <- mae2 <- matrix(NA, n - k, 12)
st <- tsp(ts_total)[1] + (k - 2) / 12
for (i in 1:(n - k)) {
  xshort <- window(ts_total, end = st + i/12)
  xnext <- window(ts_total, start = st + (i + 1)/12, end = st + (i + 12)/12)
  # Refit the two models identified above on the shortened series
  fit1_cv <- Arima(xshort, order = c(2, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
  fcast1 <- forecast(fit1_cv, h = 12)
  fit2_cv <- Arima(xshort, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
  fcast2 <- forecast(fit2_cv, h = 12)
  mae1[i, 1:length(xnext)] <- abs(fcast1[["mean"]] - xnext)
  mae2[i, 1:length(xnext)] <- abs(fcast2[["mean"]] - xnext)
}
And we can plot the mean absolute error (MAE) by forecast horizon.
plot(1:12, colMeans(mae1, na.rm = TRUE), type = "l", col = 2, xlab = "horizon", ylab = "MAE",
     ylim = c(4, 8))
lines(1:12, colMeans(mae2, na.rm = TRUE), type = "l", col = 3)
legend("topleft", legend = c("fit 1", "fit 2"), col = 2:3, lty = 1)
As we can see model 2 performs slightly better.
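For reference, recent versions of the forecast package automate this rolling-origin evaluation with tsCV; a sketch of the same MAE-by-horizon computation for model 2 (slow, since it refits at every origin):
# Forecast function required by tsCV: takes a series and a horizon
fc_model2 <- function(y, h) {
  forecast(Arima(y, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12)), h = h)
}
e2 <- tsCV(ts_total, fc_model2, h = 12) # matrix of errors, one column per horizon
colMeans(abs(e2), na.rm = TRUE) # MAE by horizon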
Let’s use the second model for a 12-month forecast.
sarima.for(ts_total, 1,1,1,2,1,1,12, n.ahead = 12)
## $pred
## Jan Feb Mar Apr May Jun Jul
## 2016
## 2017 160.0177 137.3683 130.6933 116.5934 132.7238 161.4440 187.9824
## Aug Sep Oct Nov Dec
## 2016 193.5437 159.9105 140.3797 134.5421 149.9530
## 2017
##
## $se
## Jan Feb Mar Apr May Jun Jul
## 2016
## 2017 8.001156 8.114284 8.220028 8.321630 8.420647 8.517849 8.613630
## Aug Sep Oct Nov Dec
## 2016 6.058261 7.073240 7.487495 7.715791 7.873653
## 2017
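An equivalent forecast can also be produced from the fit2 object we already have via the forecast package, which works with autoplot; a quick sketch:
autoplot(forecast(fit2, h = 12)) + ggtitle("Total Emissions - 12-Month Forecast (fit2)")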
If we go back to the plot of the total emissions time series, we can see that the variance is not constant. We can repeat the same exercise using the Box-Cox transformation.
lam <- BoxCox.lambda(ts_total)
bc_ts_total <- BoxCox(ts_total, lam)
The next step is to run the same models on the transformed series and get the outputs in the original scale.
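As an aside, the forecast package provides InvBoxCox for mapping values back to the original scale; it could stand in for the manual back-transform used in the loop below:
# InvBoxCox(x, lam) computes (lam * x + 1)^(1 / lam) when lam != 0
head(InvBoxCox(bc_ts_total, lam)) # recovers the original values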
Cross-validation:
k <- 60 # minimum data length for fitting a model
n <- length(bc_ts_total)
mae1 <- mae2 <- matrix(NA, n - k, 12)
st <- tsp(bc_ts_total)[1] + (k - 2) / 12
for (i in 1:(n - k)) {
  xshort <- window(bc_ts_total, end = st + i/12)
  xnext <- window(bc_ts_total, start = st + (i + 1)/12, end = st + (i + 12)/12)
  fit1_bc <- Arima(xshort, order = c(2, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
  fcast1 <- forecast(fit1_bc, h = 12)
  fit2_bc <- Arima(xshort, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
  fcast2 <- forecast(fit2_bc, h = 12)
  # Back-transform forecasts and actuals to the original scale before computing errors
  mae1[i, 1:length(xnext)] <- abs((lam * fcast1[["mean"]] + 1)^(1 / lam) - (lam * xnext + 1)^(1 / lam))
  mae2[i, 1:length(xnext)] <- abs((lam * fcast2[["mean"]] + 1)^(1 / lam) - (lam * xnext + 1)^(1 / lam))
}
plot(1:12, colMeans(mae1, na.rm = TRUE), type = "l", col = 2, xlab = "horizon", ylab = "MAE", ylim = c(4, 8))
lines(1:12, colMeans(mae2, na.rm = TRUE), type = "l", col = 3)
legend("topleft", legend = c("fit 1 BC", "fit 2 BC"), col = 2:3, lty = 1)
Again model 2 provides the best overall fit, although for a few horizons model 1 has a slightly lower MAE. Comparing the transformed and original fits, the increase in MAE with the horizon is slightly steeper for the original series, but the results are very similar. In this case there might be little to gain from the transformation.
Since the series is quite long, we could also have set aside a fixed part of it as a test set (using the window function) and compared the predictions to the actual observed values.
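A minimal sketch of that approach (the split date here is arbitrary, chosen only for illustration):
train <- window(ts_total, end = c(2012, 12))
test <- window(ts_total, start = c(2013, 1))
fit_train <- Arima(train, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
accuracy(forecast(fit_train, h = length(test)), test) # training and test set measures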
Simple time series analysis allowed us to gain some insight into the data, and we were able to make a simple 12-month forecast. The takeaway is how easy it is to manipulate time series and produce forecasts in R using the fpp package. A lot more can be done with this data in the future; there is room for trying different models and approaches, not only time series.