“Steam and exhaust rise from a chemical company on a cold winter day on January 6, 2017 in Oberhausen, Germany (CNN, 2019).”

Load Required Libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggpubr)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

Load in Data

emissions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-05-21/emissions.csv')

## Rows: 12551 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): parent_entity, parent_type, commodity, production_unit
## dbl (3): year, production_value, total_emissions_MtCO2e
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

Human activities, especially the burning of fossil fuels, have significantly increased carbon dioxide (CO₂) emissions, contributing to climate change and global warming. Different fuel sources release different amounts of CO₂ when burned, and understanding these differences is essential for creating effective environmental policies. In this project, we investigate whether carbon emissions significantly differ by fuel type using statistical methods. We focus on analyzing emissions data by commodity type, such as coal, oil, natural gas, and cement, to determine if certain fuels contribute more to overall emissions than others. This analysis can provide valuable insights for prioritizing emissions reduction strategies based on fuel type.

Biostats Project Question:

The question we want to answer using this data set is: does carbon emissions statistically differ between the different fuel types listed under “Commodity”?

Definitions of Variables:

Total Emissions (MtCO₂e) - the dependent variable that is measured in million metric tons of CO₂ equivalent.

Fuel Type (Commodity) - The independent variable that includes oil and natural gas liquids, various types of coals, and cement. For the purposes of our project, commodity is the only variable that will be looked at to examine variation in CO₂ emissions.

Data Collection

The data used for this project comes from the Global Carbon Project, a reputable source that tracks carbon emissions worldwide. The dataset was accessed through the TidyTuesday project on GitHub, which curates public datasets for educational and research purposes. The dataset includes total carbon emissions (in million metric tons of CO₂ equivalent) categorized by fuel type or commodity, such as oil, natural gas, and various types of coal. Each observation represents a country’s emissions from a specific fuel source in a given year. The dataset was already cleaned and organized, making it ready for analysis without additional preprocessing. By focusing on emissions by fuel type, we aim to understand which commodities contribute most to global CO₂ levels.

Exploring the Data set

Below is a summary of the data that includes the mean, median, standard deviation, and standard error of the total emissions based on fuel type.

data_table_summary = emissions %>%
  group_by(commodity) %>%
  summarise(
    Mean = mean(total_emissions_MtCO2e),
    Median = median(total_emissions_MtCO2e),
    SD = sd(total_emissions_MtCO2e),
    SE = sd(total_emissions_MtCO2e)/sqrt(length((total_emissions_MtCO2e)))
  )
data_table_summary

## # A tibble: 9 × 5
##   commodity            Mean Median    SD    SE
##   <chr>               <dbl>  <dbl> <dbl> <dbl>
## 1 Anthracite Coal     129.    40.1 273.  14.2 
## 2 Bituminous Coal     241.    38.8 805.  21.7 
## 3 Cement              113.    24.5 260.  16.1 
## 4 Lignite Coal         42.4   17.7  55.5  1.75
## 5 Metallurgical Coal   81.6   21.2 229.   6.99
## 6 Natural Gas          75.1   25.9 174.   2.96
## 7 Oil & NGL           141.    51.7 237.   3.87
## 8 Sub-Bituminous Coal  74.0   17.4 162.   6.23
## 9 Thermal Coal         74.9   42.6  89.6  3.62

Histogram of Total Emissions

ggplot(emissions, aes(x = total_emissions_MtCO2e)) + geom_histogram( color = "indianred4", fill = "indianred2") + theme_light() + ylab("Frequecy") + xlab("Total Carbon Emissions (MtCO2e)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the graph above, the total carbon emissions data is not normally distributed because the graph is skewed to the right. Repeated attempts have been made to fix the skew by changing the bin size, but the default size of 30 displayed a legible graph.

The decision was made to not make any transformations to the data because of the heavy skew, changing it might lead to more complications down the line. Instead, we hypothesized that the graph represents an abnormal data set. A qq-plot will proved more information about data distribution.

QQ-Plot of Total Emissions

qqnorm(emissions$total_emissions_MtCO2e)
qqline(emissions$total_emissions_MtCO2e)

The qqplot also reveals that the data set is not normally distributed, while the majority of the data follows the balck line, the end takes on the appearance of an exponential curve around 2 on the x axis. This led to us to conclude that the total emissions data is not normal, and so we will need to use a Kruskal-Wallis test for our statistical test.

Running the Statistical Test

Even though we stated above that the data is not normal based on the graphs presented, we still need to check the distribution of total carbon emissions based on the fuel type because our question focuses on emissions based on type rather than just the total carbon emissions.

Histogram of Total Emissions based on Fuel Type

ggplot(emissions, aes(x = total_emissions_MtCO2e, fill = commodity)) + geom_histogram() + theme_light() + facet_wrap(~ commodity) + labs(x = "Total Carbon Emissions (MtCO2e)", y = "Frequency") + scale_fill_manual(values = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram has a similar appearance to the histogram of just the total carbon emissions. They both share the same right skew which means that the distribution based on fuel type is also not normal. A qq-plot was also done to proved more information.

QQ-Plot of Total Emissions based on Fuel Type

qqplot_com <- ggqqplot(emissions, x = "total_emissions_MtCO2e", color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"))
facet(qqplot_com, facet.by = "commodity")

Fuel types like Lignite coal and thermal coal appear to have a normal distribution because their data points follow the line, but for the other fuel types, they do not. This was further proof that the data did not have a normal distribution and that we would need to perform a Kruskal-Wallis test in order to answer our question.

Before performing the Kruskal-Wallis test, we need to run a Shapiro-Wilk test in order to determine if our data is normal. You might be asking “Why, after concluding the data was not normal from the graphs?” Well, because a graph visually tells you if your data is normal or abnormal, but not statistically. A Shapiro test will tell us if our data is statically normal.

Shapiro-Wilk Test For Normality

Anthracite_Coal = emissions[emissions$commodity == "Anthracite Coal", ]$total_emissions_MtCO2e
Bituminous_Coal = emissions[emissions$commodity == "Bituminous Coal", ]$total_emissions_MtCO2e
Cement = emissions[emissions$commodity == "Cement", ]$total_emissions_MtCO2e
Lignite_Coal = emissions[emissions$commodity == "Lignite Coal", ]$total_emissions_MtCO2e
Metallurgical_Coal = emissions[emissions$commodity == "Metallurgical Coal", ]$total_emissions_MtCO2e
Natural_Gas = emissions[emissions$commodity == "Natural Gas", ]$total_emissions_MtCO2e
Oil_NGL = emissions[emissions$commodity == "Oil & NGL", ]$total_emissions_MtCO2e
Sub_Bituminous_Coal = emissions[emissions$commodity == "Sub-Bituminous Coal", ]$total_emissions_MtCO2e
Thermal_Coal = emissions[emissions$commodity == "Thermal Coal", ]$total_emissions_MtCO2e


shapiro.test(Anthracite_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Anthracite_Coal
## W = 0.47986, p-value < 2.2e-16

shapiro.test(Bituminous_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Bituminous_Coal
## W = 0.28262, p-value < 2.2e-16

shapiro.test(Cement)

## 
##  Shapiro-Wilk normality test
## 
## data:  Cement
## W = 0.43898, p-value < 2.2e-16

shapiro.test(Lignite_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Lignite_Coal
## W = 0.76317, p-value < 2.2e-16

shapiro.test(Metallurgical_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Metallurgical_Coal
## W = 0.33551, p-value < 2.2e-16

shapiro.test(Natural_Gas)

## 
##  Shapiro-Wilk normality test
## 
## data:  Natural_Gas
## W = 0.38476, p-value < 2.2e-16

shapiro.test(Oil_NGL)

## 
##  Shapiro-Wilk normality test
## 
## data:  Oil_NGL
## W = 0.58422, p-value < 2.2e-16

shapiro.test(Sub_Bituminous_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Sub_Bituminous_Coal
## W = 0.45668, p-value < 2.2e-16

shapiro.test(Thermal_Coal)

## 
##  Shapiro-Wilk normality test
## 
## data:  Thermal_Coal
## W = 0.70899, p-value < 2.2e-16

The p-value for each of the different fuel types is below 0.05 which means our data is not normally distributed visually and statistically. Next, we need to run a Levene’s test to compare the variance of the groups because one of the assumptions of a Kruskal-Wallis test is non-equal variances.

Levene’s Test

leveneTest(data = emissions, total_emissions_MtCO2e ~ commodity, center = mean)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = mean)
##          Df F value    Pr(>F)    
## group     8  101.57 < 2.2e-16 ***
##       12542                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The variance of the fuel types are stistically different from each other, satisfiying the test’s assumptions ( t(8) = 101.57, p-value = < 2.2 x 10^-16).

Now that we’ve established that the data is not normally distributed and there’s unequal variances, we can move on to running the Kruskal-Wallis test. For this test, our null hypothesis = there is no significant difference in total carbon emissions between the different fuel types, while our alternative hypothesis = there is a significant difference in total carbon emissions between the different fuel types.

Kruskal-Wallis Test

kruskal.test(total_emissions_MtCO2e ~ commodity, data = emissions)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  total_emissions_MtCO2e by commodity
## Kruskal-Wallis chi-squared = 599.77, df = 8, p-value < 2.2e-16

There is a significant difference in total carbon emissions between the different fuel types. ( X²(8) = 599.77, p-value = < 2.2 x 10^-16).

Lastly, we need do a pair-wise test to identify which pair of fuel types had the most significant differences with our data.

Dunn’s Test

emissions %>%
  dunn_test(total_emissions_MtCO2e ~ commodity, p.adjust.method = "bonferroni")

## # A tibble: 36 × 9
##    .y.        group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
##  * <chr>      <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
##  1 total_emi… Anthr… Bitum…   368  1370      1.68 9.36e- 2 1   e+ 0 ns          
##  2 total_emi… Anthr… Cement   368   263     -1.58 1.14e- 1 1   e+ 0 ns          
##  3 total_emi… Anthr… Ligni…   368  1008     -6.63 3.36e-11 1.21e- 9 ****        
##  4 total_emi… Anthr… Metal…   368  1073     -3.95 7.73e- 5 2.78e- 3 **          
##  5 total_emi… Anthr… Natur…   368  3452     -2.79 5.32e- 3 1.92e- 1 ns          
##  6 total_emi… Anthr… Oil &…   368  3733      4.49 7.29e- 6 2.62e- 4 ***         
##  7 total_emi… Anthr… Sub-B…   368   673     -3.67 2.38e- 4 8.59e- 3 **          
##  8 total_emi… Anthr… Therm…   368   611      2.60 9.46e- 3 3.40e- 1 ns          
##  9 total_emi… Bitum… Cement  1370   263     -3.36 7.79e- 4 2.80e- 2 *           
## 10 total_emi… Bitum… Ligni…  1370  1008    -12.1  1.02e-33 3.68e-32 ****        
## # ℹ 26 more rows

The pairs that have ns under “p.adj.signif” are not statistically significant which means that there’s no difference between their total carbon emissions. The pairs that have a * are statistically significant which means that there is a difference between their total carbon emissions. The pairs that have more * are more significant than those with one star.

Visual Representation

The first graph is our data in a box-plot showing abnormality. The second graph shows the same box-plot, but with all the data points. The third graph is the same graph, but re-textured to make it easier to identity the fuel type when we add their p-values.

emissions.boxplot <- ggplot(emissions, aes(x = commodity, y = total_emissions_MtCO2e, fill = commodity)) + geom_boxplot( fill = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A")) + theme_light() + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)")
emissions.boxplot

emissions.boxplot + geom_jitter(shape=16, position=position_jitter(0.2)) + theme_light() +
  scale_fill_manual(values = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A")) + scale_x_discrete(labels = c('Anthra.', 'Bit,', 'Cement', 'Lig.', 'Met.', 'NG', 'O & NGL', 'Sub-bit.', 'Thermal'))

ggboxplot(emissions, x = "commodity", y = "total_emissions_MtCO2e", color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"), add = "jitter") + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)") + scale_x_discrete(labels = c('O & NGL', 'NG', 'Sub-bit.', 'Met.', 'Bit.', 'Thermal', 'Anthra.', 'Cement', 'Lig.'))

This is our final graph that visually shows the majority of what we talked about through out this article:

the specific subjects our project focused on
data abnormality in the form of a box-plot
the Kruskal-Wallis test p-value
the Dunn’s test p-values

base.graph = ggboxplot(emissions, x = "commodity", y = "total_emissions_MtCO2e",
              color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"),
              add = "jitter") + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)")

base.graph + geom_pwc(
  aes(group = commodity), tip.length = 0, 
  method = "dunn_test", label = "p.adj.format", 
  hide.ns = FALSE, p.adjust.method = "bonferroni" 
) + stat_compare_means(method = "kruskal.test", label.y = 40000, label.x = 1) + scale_x_discrete(labels = c('O & NGL', 'NG', 'Sub-bit.', 'Met.', 'Bit.', 'Thermal', 'Anthra.', 'Cement', 'Lig.'))

Conclusion

Our analysis reveals that carbon emissions do significantly differ by fuel type. The total carbon emissions data were found to be abnormally distributed, both visually and statistically, and have unequal variances across groups. Therefore, a Kruskal-Wallis test was conducted to assess differences between fuel type. The test results showed a statistically significant difference in emissions between most of the fuel types ( X²(8) = 599.77, p-value = < 2.2 x 10^-16). These findings emphasize the need to consider fuel-specific impacts when designing carbon reduction policies. By identifying which fuels contribute most to emissions, policymakers and industries can better target efforts to reduce global CO₂ output.

References

CNN, H. R. (2019, December 4). Global Emissions to Hit Another Record High This Year. CNN. https://edition.cnn.com/2019/12/04/health/emissions-global-carbon-budget-2019-intl-hnk/index.html. - Picture of the chemical plant.

Global Carbon Project. (2024). Emissions Dataset. Retrieved from: https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/emissions.csv

Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621. R Documentation. (n.d.). shapiro.test function. Retrieved from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html

Biostats Final Project - Carbon Emissions

Vega Gil Turumbay and Brianna Demota-Diaz

April 22, 2025