An analysis of the Carbon Emissions dataset available on Kaggle is conducted. The idea is to apply time series methods to gain some understanding of the data and to make predictions. Additionally, it is shown how simple it can be to run a basic time series analysis in R.
Part of the data wrangling process follows kernels on Kaggle.
Load the required libraries.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(fpp)
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: tseries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(astsa)
##
## Attaching package: 'astsa'
## The following object is masked from 'package:fpp':
##
## oil
## The following objects are masked from 'package:fma':
##
## chicken, sales
## The following object is masked from 'package:forecast':
##
## gas
Load the data; some entries are coded as “Not Available”.
myData <- read.csv("MER_T12_06.csv", stringsAsFactors = FALSE, na.strings = c("", "NA", "Not Available"))
Basic glimpse into the data and missing data removal.
str(myData)
## 'data.frame': 5094 obs. of 6 variables:
## $ MSN : chr "CLEIEUS" "CLEIEUS" "CLEIEUS" "CLEIEUS" ...
## $ YYYYMM : int 197301 197302 197303 197304 197305 197306 197307 197308 197309 197310 ...
## $ Value : num 72.1 64.4 64.1 60.8 61.8 ...
## $ Column_Order: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Description : chr "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" "Coal Electric Power Sector CO2 Emissions" ...
## $ Unit : chr "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" "Million Metric Tons of Carbon Dioxide" ...
sum(is.na(myData))
## [1] 416
myData <- myData %>% na.omit()
The data encodes the total emissions for each year as a 13th month; that will be removed now. First we split YYYYMM into year and month.
myData <- myData %>% mutate(
  dummy = as.character(YYYYMM),
  year = as.factor(substr(dummy, 1, 4)),
  month = as.factor(substr(dummy, 5, 6)),
  value = as.numeric(Value)
) %>% select(year, month, value, Description)
# Now we remove the 13th month
myData <- myData %>% filter(month != "13")
The Description column is far too long; we’ll shorten it for easier reading.
myData <-
  myData %>%
  mutate(
    description2 = recode(
      Description,
      "Coal Electric Power Sector CO2 Emissions" = "coal",
      "Natural Gas Electric Power Sector CO2 Emissions" = "natural gas",
      "Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions" = "distillate fuel",
      "Petroleum Coke Electric Power Sector CO2 Emissions" = "petroleum coke",
      "Residual Fuel Oil Electric Power Sector CO2 Emissions" = "residual fuel",
      "Petroleum Electric Power Sector CO2 Emissions" = "petroleum electric power",
      "Geothermal Energy Electric Power Sector CO2 Emissions" = "geothermal energy",
      "Non-Biomass Waste Electric Power Sector CO2 Emissions" = "non biomass waste",
      "Total Energy Electric Power Sector CO2 Emissions" = "total energy"
    )
  )
myData <-
myData %>%
select(-Description) %>%
rename(description = description2) %>%
mutate(description = as.factor(description))
Let’s look at the new descriptions and at the distribution of observations in each category.
table(myData$description)
##
## coal distillate fuel geothermal energy
## 523 523 331
## natural gas non biomass waste petroleum coke
## 523 331 523
## petroleum electric power residual fuel total energy
## 523 523 523
We’re interested in the value column. We’ll create a time series for each emission source.
ts_coal <- ts(myData$value[which(myData$description == "coal")], start = c(1973, 01), frequency = 12)
ts_distillate_fuel <- ts(myData$value[which(myData$description == "distillate fuel")], start = c(1973, 01), frequency = 12)
ts_natural_gas <- ts(myData$value[which(myData$description == "natural gas")], start = c(1973, 01), frequency = 12)
ts_petroleum_coke <- ts(myData$value[which(myData$description == "petroleum coke")], start = c(1973, 01), frequency = 12)
ts_petroleum_elec <- ts(myData$value[which(myData$description == "petroleum electric power")], start = c(1973, 01), frequency = 12)
ts_residual <- ts(myData$value[which(myData$description == "residual fuel")], start = c(1973, 01), frequency = 12)
ts_total <- ts(myData$value[which(myData$description == "total energy")], start = c(1973, 01), frequency = 12)
ts_geothermal <- ts(myData$value[which(myData$description == "geothermal energy")], start = c(1989, 01), frequency = 12)
ts_waste <- ts(myData$value[which(myData$description == "non biomass waste")], start = c(1989, 01), frequency = 12)
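The nine nearly identical calls above could also be wrapped in a small helper. A minimal sketch (the make_ts name is ours, and it assumes the rows for each source are already in chronological order, as they are here):
# Hypothetical helper: monthly ts for a single emission source
make_ts <- function(df, source, start_year = 1973) {
  ts(df$value[df$description == source], start = c(start_year, 1), frequency = 12)
}
# e.g. ts_coal <- make_ts(myData, "coal")
#      ts_geothermal <- make_ts(myData, "geothermal energy", start_year = 1989)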
At this point we are ready to start with a classic time series analysis. We begin with some plots to gain some insight into the data.
autoplot(ts_coal) + ggtitle("Coal Emissions")
autoplot(ts_geothermal) + ggtitle("Geothermal Emissions")
autoplot(ts_total) + ggtitle("Total Emissions")
We can see there’s an overall decline in total emissions and in coal emissions after 2005. The pattern for geothermal emissions is completely different.
The following plot allows us to get an understanding of the seasonal dynamics.
ggsubseriesplot(ts_total) + ggtitle("Total Emissions")
As we can see, the months with the highest emissions are July and August, the peak of summer, followed by December and January, the peak of winter. Seasonality seems to be largely driven by shifts in temperature. We can also see that in recent years emissions are decreasing in every month, and a closer look shows that for the cold months they have reverted to their historical means.
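A seasonal plot offers a complementary view of the same dynamics, overlaying one line per year; a quick sketch using ggseasonplot from the forecast package (which fpp loads):
ggseasonplot(ts_total) + ggtitle("Total Emissions - Seasonal Plot")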
We can generate the same subseries plot for the coal and natural gas emissions.
ggsubseriesplot(ts_coal) + ggtitle("Coal")
ggsubseriesplot(ts_natural_gas) + ggtitle("Natural Gas")
We can see that coal emissions have been falling in every month in recent years. On the other hand, natural gas emissions are up, especially in the summer months.
Total emissions seem to be driven, to a large extent, by the coal emissions. Let’s plot the share of coal in total emissions over time.
autoplot(ts_coal/ts_total) + ggtitle("Coal/Total Emissions")
We can see that coal’s share of total emissions peaked between 1990 and 2000. The subsequent reduction could come from better coal technology or from the substitution of coal with other sources. Let’s see how the shares of pollution from alternative sources of energy behave over time, first for geothermal, then for waste.
autoplot(ts_geothermal/ts_total) + ggtitle("Geothermal/Total Emissions")
autoplot(ts_waste/ts_total) + ggtitle("Waste/Total Emissions")
Their shares are increasing over time, indicating an increase in their usage (assuming technology is not getting worse).
Next, let’s decompose the total emissions series with the STL function (Seasonal and Trend decomposition using LOESS).
ts_total %>% stl(t.window = 13, s.window = "periodic") %>% autoplot()
It might be useful to inspect the seasonally adjusted series.
autoplot(seasadj(stl(ts_total, s.window = "periodic"))) +ggtitle("Total Emissions - Seasonally Adjusted Series")
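If we keep the fitted decomposition around, its individual components can be extracted directly; a short sketch using forecast’s trendcycle accessor (seasonal and remainder work the same way):
# Store the STL fit, then pull out the trend-cycle component
fit_stl <- stl(ts_total, t.window = 13, s.window = "periodic")
autoplot(trendcycle(fit_stl)) + ggtitle("Total Emissions - Trend-Cycle Component")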
In this part we focus only on total emissions.
We can fit an ARIMA model to this data using the auto.arima function.
# Stepwise fitting
fit1 <- auto.arima(ts_total)
summary(fit1)
## Series: ts_total
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.5175 -0.0061 -0.8978 -0.7802
## s.e. 0.0560 0.0515 0.0352 0.0280
##
## sigma^2 estimated as 37.66: log likelihood=-1653.14
## AIC=3316.27 AICc=3316.39 BIC=3337.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2251275 6.035837 4.567068 -0.2425231 2.869473 0.6091823
## ACF1
## Training set -0.002243674
# And trying all models (stepwise = FALSE)
fit2 <- auto.arima(ts_total, stepwise = FALSE)
summary(fit2)
## Series: ts_total
## ARIMA(1,1,1)(2,1,1)[12]
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.4963 -0.8937 -0.0419 -0.1561 -0.7230
## s.e. 0.0562 0.0301 0.0599 0.0534 0.0448
##
## sigma^2 estimated as 37.07: log likelihood=-1648.99
## AIC=3309.97 AICc=3310.14 BIC=3335.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2277564 5.98252 4.542354 -0.2426657 2.857439 0.6058858
## ACF1
## Training set 0.003368634
The models are different. The first fit is an ARIMA(2,1,1)(0,1,1)[12], while the second is an ARIMA(1,1,1)(2,1,1)[12]: the second model trades the second non-seasonal AR term for two seasonal AR terms. The second model provides a slightly better AICc score.
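The AICc values can also be read directly off the fitted objects (auto.arima stores them in the aicc field):
fit1$aicc # 3316.39, per the summary above
fit2$aicc # 3310.14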
Let’s do a quick residual analysis.
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: residuals
## Q* = 36.177, df = 20, p-value = 0.01466
##
## Model df: 4. Total lags used: 24
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: residuals
## Q* = 29.872, df = 19, p-value = 0.05345
##
## Model df: 5. Total lags used: 24
In both cases the residuals look roughly like white noise, though the Ljung-Box test rejects independence at the 5% level for fit1 (p = 0.015) while it does not for fit2 (p = 0.053). Residuals also look normally distributed.
We can also compare the accuracy of both models.
accuracy(fit1)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2251275 6.035837 4.567068 -0.2425231 2.869473 0.6091823
## ACF1
## Training set -0.002243674
accuracy(fit2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2277564 5.98252 4.542354 -0.2426657 2.857439 0.6058858
## ACF1
## Training set 0.003368634
Again fit2 is the better choice, although the difference is small. However, this is not the best way to measure accuracy, since it is computed on the training set. We need some form of cross-validation, or separate training and test sets.
The following code implements a cross-validation procedure for our models.
k <- 60 # minimum data length for fitting a model
n <- length(ts_total)
mae1 <- mae2 <- matrix(NA, n - k, 12)
st <- tsp(ts_total)[1] + (k - 2) / 12
for (i in 1:(n - k)) {
  xshort <- window(ts_total, end = st + i/12)
  xnext <- window(ts_total, start = st + (i + 1)/12, end = st + (i + 12)/12)
  # Refit the two models identified above on the shortened series
  fit1_cv <- Arima(xshort, order = c(2, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
  fcast1 <- forecast(fit1_cv, h = 12)
  fit2_cv <- Arima(xshort, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
  fcast2 <- forecast(fit2_cv, h = 12)
  mae1[i, 1:length(xnext)] <- abs(fcast1[["mean"]] - xnext)
  mae2[i, 1:length(xnext)] <- abs(fcast2[["mean"]] - xnext)
}
And we can plot the mean absolute error (MAE) by forecast horizon.
plot(1:12, colMeans(mae1, na.rm = TRUE), type = "l", col = 2, xlab = "horizon", ylab = "MAE",
     ylim = c(4, 8))
lines(1:12, colMeans(mae2, na.rm = TRUE), type = "l", col = 3)
legend("topleft", legend = c("fit 1", "fit 2"), col = 2:3, lty = 1)
As we can see model 2 performs slightly better.
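For reference, recent versions of the forecast package automate this rolling-origin evaluation with tsCV; a sketch of the same MAE-by-horizon computation for model 2 (slow, since it refits at every origin):
# Forecast function required by tsCV: takes a series and a horizon
fc_model2 <- function(y, h) {
  forecast(Arima(y, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12)), h = h)
}
e2 <- tsCV(ts_total, fc_model2, h = 12) # matrix of errors, one column per horizon
colMeans(abs(e2), na.rm = TRUE) # MAE by horizon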
Let’s use the second model for a 12-month forecast.
sarima.for(ts_total, 1,1,1,2,1,1,12, n.ahead = 12)
## $pred
## Jan Feb Mar Apr May Jun Jul
## 2016
## 2017 160.0177 137.3683 130.6933 116.5934 132.7238 161.4440 187.9824
## Aug Sep Oct Nov Dec
## 2016 193.5437 159.9105 140.3797 134.5421 149.9530
## 2017
##
## $se
## Jan Feb Mar Apr May Jun Jul
## 2016
## 2017 8.001156 8.114284 8.220028 8.321630 8.420647 8.517849 8.613630
## Aug Sep Oct Nov Dec
## 2016 6.058261 7.073240 7.487495 7.715791 7.873653
## 2017
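An equivalent forecast can also be produced from the fit2 object we already have via the forecast package, which works with autoplot; a quick sketch:
autoplot(forecast(fit2, h = 12)) + ggtitle("Total Emissions - 12-Month Forecast (fit2)")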
If we go back to the plot of the total emissions time series, we can see that the variance is not constant. We can repeat the same exercise using the Box-Cox transformation.
lam <- BoxCox.lambda(ts_total)
bc_ts_total <- BoxCox(ts_total, lam)
The next step is to run the same models on the transformed series and get the outputs in the original scale.
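As an aside, the forecast package provides InvBoxCox for mapping values back to the original scale; it could stand in for the manual back-transform used in the loop below:
# InvBoxCox(x, lam) computes (lam * x + 1)^(1 / lam) when lam != 0
head(InvBoxCox(bc_ts_total, lam)) # recovers the original values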
Cross-validation:
k <- 60 # minimum data length for fitting a model
n <- length(bc_ts_total)
mae1 <- mae2 <- matrix(NA, n - k, 12)
st <- tsp(bc_ts_total)[1] + (k - 2) / 12
for (i in 1:(n - k)) {
  xshort <- window(bc_ts_total, end = st + i/12)
  xnext <- window(bc_ts_total, start = st + (i + 1)/12, end = st + (i + 12)/12)
  fit1_bc <- Arima(xshort, order = c(2, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
  fcast1 <- forecast(fit1_bc, h = 12)
  fit2_bc <- Arima(xshort, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
  fcast2 <- forecast(fit2_bc, h = 12)
  # Back-transform forecasts and actuals to the original scale before computing errors
  mae1[i, 1:length(xnext)] <- abs((lam * fcast1[["mean"]] + 1)^(1 / lam) - (lam * xnext + 1)^(1 / lam))
  mae2[i, 1:length(xnext)] <- abs((lam * fcast2[["mean"]] + 1)^(1 / lam) - (lam * xnext + 1)^(1 / lam))
}
plot(1:12, colMeans(mae1, na.rm = TRUE), type = "l", col = 2, xlab = "horizon", ylab = "MAE", ylim = c(4, 8))
lines(1:12, colMeans(mae2, na.rm = TRUE), type = "l", col = 3)
legend("topleft", legend = c("fit 1 BC", "fit 2 BC"), col = 2:3, lty = 1)
Again model 2 provides the best overall fit, although for a few horizons model 1 has a slightly lower MAE. Comparing the transformed and original fits, the increase in MAE with the horizon is slightly steeper for the original series, but the results are very similar. In this case there might be little to gain from the transformation.
Since the series is quite long, we could also have set aside a fixed part of it as a test set (using the window function) and compared the predictions to the actual observed values.
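A minimal sketch of that approach (the split date here is arbitrary, chosen only for illustration):
train <- window(ts_total, end = c(2012, 12))
test <- window(ts_total, start = c(2013, 1))
fit_train <- Arima(train, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12))
accuracy(forecast(fit_train, h = length(test)), test) # training and test set measures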
Simple time series analysis allowed us to gain some insight into the data, and we were able to make a simple 12-month forecast. The takeaway is how easy it is to manipulate time series and produce forecasts in R using the fpp package. A lot more can be done with this data in the future; there is room for trying different models and approaches, not only time series.