library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggpubr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
emissions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-05-21/emissions.csv')
## Rows: 12551 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): parent_entity, parent_type, commodity, production_unit
## dbl (3): year, production_value, total_emissions_MtCO2e
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Human activities, especially the burning of fossil fuels, have significantly increased carbon dioxide (CO₂) emissions, contributing to climate change and global warming. Different fuel sources release different amounts of CO₂ when burned, and understanding these differences is essential for creating effective environmental policies. In this project, we investigate whether carbon emissions significantly differ by fuel type using statistical methods. We focus on analyzing emissions data by commodity type, such as coal, oil, natural gas, and cement, to determine if certain fuels contribute more to overall emissions than others. This analysis can provide valuable insights for prioritizing emissions reduction strategies based on fuel type.
The question we want to answer using this data set is: does carbon emissions statistically differ between the different fuel types listed under “Commodity”?
Total Emissions (MtCO2e) - the dependent variable that is measured in million metric tons of CO2 equivalent.
Fuel Type (Commodity) - The independent variable that includes oil and natural gas liquids, various types of coals, and cement. For the purposes of our project, commodity is the only variable that will be looked at to examine variation in CO2 emissions.
The data used for this project comes from the Global Carbon Project, a reputable source that tracks carbon emissions worldwide. The dataset was accessed through the TidyTuesday project on GitHub, which curates public datasets for educational and research purposes. The dataset includes total carbon emissions (in million metric tons of CO₂ equivalent) categorized by fuel type or commodity, such as oil, natural gas, and various types of coal. Each observation represents a country’s emissions from a specific fuel source in a given year. The dataset was already cleaned and organized, making it ready for analysis without additional preprocessing. By focusing on emissions by fuel type, we aim to understand which commodities contribute most to global CO₂ levels.
Below is a summary of the data that includes the mean, median, standard deviation, and standard error of the total emissions based on fuel type.
data_table_summary = emissions %>%
group_by(commodity) %>%
summarise(
Mean = mean(total_emissions_MtCO2e),
Median = median(total_emissions_MtCO2e),
SD = sd(total_emissions_MtCO2e),
SE = sd(total_emissions_MtCO2e)/sqrt(length((total_emissions_MtCO2e)))
)
data_table_summary
## # A tibble: 9 × 5
## commodity Mean Median SD SE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Anthracite Coal 129. 40.1 273. 14.2
## 2 Bituminous Coal 241. 38.8 805. 21.7
## 3 Cement 113. 24.5 260. 16.1
## 4 Lignite Coal 42.4 17.7 55.5 1.75
## 5 Metallurgical Coal 81.6 21.2 229. 6.99
## 6 Natural Gas 75.1 25.9 174. 2.96
## 7 Oil & NGL 141. 51.7 237. 3.87
## 8 Sub-Bituminous Coal 74.0 17.4 162. 6.23
## 9 Thermal Coal 74.9 42.6 89.6 3.62
ggplot(emissions, aes(x = total_emissions_MtCO2e)) + geom_histogram( color = "indianred4", fill = "indianred2") + theme_light() + ylab("Frequecy") + xlab("Total Carbon Emissions (MtCO2e)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on the graph above, the total carbon emissions data is not normally distributed because the graph is skewed to the right. Repeated attempts have been made to fix the skew by changing the bin size, but the default size of 30 displayed a legible graph.
The decision was made to not make any transformations to the data because of the heavy skew, changing it might lead to more complications down the line. Instead, we hypothesized that the graph represents an abnormal data set. A qq-plot will proved more information about data distribution.
qqnorm(emissions$total_emissions_MtCO2e)
qqline(emissions$total_emissions_MtCO2e)
The qqplot also reveals that the data set is not normally distributed, while the majority of the data follows the balck line, the end takes on the appearance of an exponential curve around 2 on the x axis. This led to us to conclude that the total emissions data is not normal, and so we will need to use a Kruskal-Wallis test for our statistical test.
Even though we stated above that the data is not normal based on the graphs presented, we still need to check the distribution of total carbon emissions based on the fuel type because our question focuses on emissions based on type rather than just the total carbon emissions.
ggplot(emissions, aes(x = total_emissions_MtCO2e, fill = commodity)) + geom_histogram() + theme_light() + facet_wrap(~ commodity) + labs(x = "Total Carbon Emissions (MtCO2e)", y = "Frequency") + scale_fill_manual(values = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram has a similar appearance to the histogram of just the total carbon emissions. They both share the same right skew which means that the distribution based on fuel type is also not normal. A qq-plot was also done to proved more information.
qqplot_com <- ggqqplot(emissions, x = "total_emissions_MtCO2e", color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"))
facet(qqplot_com, facet.by = "commodity")
Fuel types like Lignite coal and thermal coal appear to have a normal distribution because their data points follow the line, but for the other fuel types, they do not. This was further proof that the data did not have a normal distribution and that we would need to perform a Kruskal-Wallis test in order to answer our question.
Before performing the Kruskal-Wallis test, we need to run a Shapiro-Wilk test in order to determine if our data is normal. You might be asking “Why, after concluding the data was not normal from the graphs?” Well, because a graph visually tells you if your data is normal or abnormal, but not statistically. A Shapiro test will tell us if our data is statically normal.
Anthracite_Coal = emissions[emissions$commodity == "Anthracite Coal", ]$total_emissions_MtCO2e
Bituminous_Coal = emissions[emissions$commodity == "Bituminous Coal", ]$total_emissions_MtCO2e
Cement = emissions[emissions$commodity == "Cement", ]$total_emissions_MtCO2e
Lignite_Coal = emissions[emissions$commodity == "Lignite Coal", ]$total_emissions_MtCO2e
Metallurgical_Coal = emissions[emissions$commodity == "Metallurgical Coal", ]$total_emissions_MtCO2e
Natural_Gas = emissions[emissions$commodity == "Natural Gas", ]$total_emissions_MtCO2e
Oil_NGL = emissions[emissions$commodity == "Oil & NGL", ]$total_emissions_MtCO2e
Sub_Bituminous_Coal = emissions[emissions$commodity == "Sub-Bituminous Coal", ]$total_emissions_MtCO2e
Thermal_Coal = emissions[emissions$commodity == "Thermal Coal", ]$total_emissions_MtCO2e
shapiro.test(Anthracite_Coal)
##
## Shapiro-Wilk normality test
##
## data: Anthracite_Coal
## W = 0.47986, p-value < 2.2e-16
shapiro.test(Bituminous_Coal)
##
## Shapiro-Wilk normality test
##
## data: Bituminous_Coal
## W = 0.28262, p-value < 2.2e-16
shapiro.test(Cement)
##
## Shapiro-Wilk normality test
##
## data: Cement
## W = 0.43898, p-value < 2.2e-16
shapiro.test(Lignite_Coal)
##
## Shapiro-Wilk normality test
##
## data: Lignite_Coal
## W = 0.76317, p-value < 2.2e-16
shapiro.test(Metallurgical_Coal)
##
## Shapiro-Wilk normality test
##
## data: Metallurgical_Coal
## W = 0.33551, p-value < 2.2e-16
shapiro.test(Natural_Gas)
##
## Shapiro-Wilk normality test
##
## data: Natural_Gas
## W = 0.38476, p-value < 2.2e-16
shapiro.test(Oil_NGL)
##
## Shapiro-Wilk normality test
##
## data: Oil_NGL
## W = 0.58422, p-value < 2.2e-16
shapiro.test(Sub_Bituminous_Coal)
##
## Shapiro-Wilk normality test
##
## data: Sub_Bituminous_Coal
## W = 0.45668, p-value < 2.2e-16
shapiro.test(Thermal_Coal)
##
## Shapiro-Wilk normality test
##
## data: Thermal_Coal
## W = 0.70899, p-value < 2.2e-16
The p-value for each of the different fuel types is below 0.05 which means our data is not normally distributed visually and statistically. Next, we need to run a Levene’s test to compare the variance of the groups because one of the assumptions of a Kruskal-Wallis test is non-equal variances.
leveneTest(data = emissions, total_emissions_MtCO2e ~ commodity, center = mean)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 8 101.57 < 2.2e-16 ***
## 12542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variance of the fuel types are stistically different from each other, satisfiying the test’s assumptions ( t(8) = 101.57, p-value = < 2.2 x 10-16).
Now that we’ve established that the data is not normally distributed and there’s unequal variances, we can move on to running the Kruskal-Wallis test. For this test, our null hypothesis = there is no significant difference in total carbon emissions between the different fuel types, while our alternative hypothesis = there is a significant difference in total carbon emissions between the different fuel types.
kruskal.test(total_emissions_MtCO2e ~ commodity, data = emissions)
##
## Kruskal-Wallis rank sum test
##
## data: total_emissions_MtCO2e by commodity
## Kruskal-Wallis chi-squared = 599.77, df = 8, p-value < 2.2e-16
There is a significant difference in total carbon emissions between the different fuel types. ( X2(8) = 599.77, p-value = < 2.2 x 10-16).
Lastly, we need do a pair-wise test to identify which pair of fuel types had the most significant differences with our data.
emissions %>%
dunn_test(total_emissions_MtCO2e ~ commodity, p.adjust.method = "bonferroni")
## # A tibble: 36 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 total_emi… Anthr… Bitum… 368 1370 1.68 9.36e- 2 1 e+ 0 ns
## 2 total_emi… Anthr… Cement 368 263 -1.58 1.14e- 1 1 e+ 0 ns
## 3 total_emi… Anthr… Ligni… 368 1008 -6.63 3.36e-11 1.21e- 9 ****
## 4 total_emi… Anthr… Metal… 368 1073 -3.95 7.73e- 5 2.78e- 3 **
## 5 total_emi… Anthr… Natur… 368 3452 -2.79 5.32e- 3 1.92e- 1 ns
## 6 total_emi… Anthr… Oil &… 368 3733 4.49 7.29e- 6 2.62e- 4 ***
## 7 total_emi… Anthr… Sub-B… 368 673 -3.67 2.38e- 4 8.59e- 3 **
## 8 total_emi… Anthr… Therm… 368 611 2.60 9.46e- 3 3.40e- 1 ns
## 9 total_emi… Bitum… Cement 1370 263 -3.36 7.79e- 4 2.80e- 2 *
## 10 total_emi… Bitum… Ligni… 1370 1008 -12.1 1.02e-33 3.68e-32 ****
## # ℹ 26 more rows
The pairs that have ns under “p.adj.signif” are not statistically significant which means that there’s no difference between their total carbon emissions. The pairs that have a * are statistically significant which means that there is a difference between their total carbon emissions. The pairs that have more * are more significant than those with one star.
The first graph is our data in a box-plot showing abnormality. The second graph shows the same box-plot, but with all the data points. The third graph is the same graph, but re-textured to make it easier to identity the fuel type when we add their p-values.
emissions.boxplot <- ggplot(emissions, aes(x = commodity, y = total_emissions_MtCO2e, fill = commodity)) + geom_boxplot( fill = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A")) + theme_light() + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)")
emissions.boxplot
emissions.boxplot + geom_jitter(shape=16, position=position_jitter(0.2)) + theme_light() +
scale_fill_manual(values = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A")) + scale_x_discrete(labels = c('Anthra.', 'Bit,', 'Cement', 'Lig.', 'Met.', 'NG', 'O & NGL', 'Sub-bit.', 'Thermal'))
ggboxplot(emissions, x = "commodity", y = "total_emissions_MtCO2e", color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"), add = "jitter") + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)") + scale_x_discrete(labels = c('O & NGL', 'NG', 'Sub-bit.', 'Met.', 'Bit.', 'Thermal', 'Anthra.', 'Cement', 'Lig.'))
This is our final graph that visually shows the majority of what we talked about through out this article:
the specific subjects our project focused on
data abnormality in the form of a box-plot
the Kruskal-Wallis test p-value
the Dunn’s test p-values
base.graph = ggboxplot(emissions, x = "commodity", y = "total_emissions_MtCO2e",
color = "commodity", palette = c("#882255", "#B5149A", "#CC6677", "#DDCC77", "#88CCEE", "#44AA99", "#117733", "#332288", "#AF074A"),
add = "jitter") + labs( x = "Fuel Type", y = "Total Carbon Emissions (MtCO2e)")
base.graph + geom_pwc(
aes(group = commodity), tip.length = 0,
method = "dunn_test", label = "p.adj.format",
hide.ns = FALSE, p.adjust.method = "bonferroni"
) + stat_compare_means(method = "kruskal.test", label.y = 40000, label.x = 1) + scale_x_discrete(labels = c('O & NGL', 'NG', 'Sub-bit.', 'Met.', 'Bit.', 'Thermal', 'Anthra.', 'Cement', 'Lig.'))
Our analysis reveals that carbon emissions do significantly differ by fuel type. The total carbon emissions data were found to be abnormally distributed, both visually and statistically, and have unequal variances across groups. Therefore, a Kruskal-Wallis test was conducted to assess differences between fuel type. The test results showed a statistically significant difference in emissions between most of the fuel types ( X2(8) = 599.77, p-value = < 2.2 x 10-16). These findings emphasize the need to consider fuel-specific impacts when designing carbon reduction policies. By identifying which fuels contribute most to emissions, policymakers and industries can better target efforts to reduce global CO₂ output.
CNN, H. R. (2019, December 4). Global Emissions to Hit Another Record High This Year. CNN. https://edition.cnn.com/2019/12/04/health/emissions-global-carbon-budget-2019-intl-hnk/index.html. - Picture of the chemical plant.
Global Carbon Project. (2024). Emissions Dataset. Retrieved from: https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/emissions.csv
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621. R Documentation. (n.d.). shapiro.test function. Retrieved from: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html