About the Data

From the World Bank (worldbank.org): The World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional, and global estimates.

The data provided here is not very clean and covers far more topics than a single report can address. To narrow the scope of inquiry, we will focus only on the indicator for kilotons of CO2 emitted by each country.

# Loading libraries and initial data load to focus in on CO2 emissions
setwd("/Users/steven/School/Fall_24/WDI_CSV_2024_06_28")
library(ggplot2)
library(magrittr)
library(dplyr)
library(tidyr)
library(scales)
library(reshape2)
#I don't use all of these but did at one time or another while cleaning

# Country metadata and country-series notes
countries <- read.csv("WDICountry.csv")

countriesSeries <- read.csv("WDIcountry-series.csv")
# Codes of officially recognized countries (the code column in this file is named ABW)
official_recognized_countries <- read.csv("Official-Countries.csv")

countries_combined <- merge(x = countries, y = countriesSeries, by.x = "Country.Code", by.y = "CountryCode")

# Main indicator table; all.y = TRUE keeps every WDI row even when no metadata matches
wdi <- read.csv("WDICSV.csv")
all_countries_data <- merge(x = countries_combined, y = wdi, by.x = c("Country.Code","SeriesCode"), by.y = c("Country.Code","Indicator.Code"), all.y = TRUE)
# Drop columns that are NA in every row
all_countries_data <- all_countries_data[, colSums(is.na(all_countries_data)) < nrow(all_countries_data)]

# Split out the two indicators used in this report
gdp <- all_countries_data[all_countries_data$Indicator.Name %in% c("GDP (current US$)"),]
all_countries_data <- all_countries_data[all_countries_data$Indicator.Name %in% c("CO2 emissions (kt)"),]
gdp <- filter(gdp, Country.Code %in% official_recognized_countries$ABW)
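
As a minimal sketch of the column-trimming step above, the toy data frame below (made-up values, not WDI data) shows how comparing colSums(is.na(...)) against nrow(...) drops only the columns that are missing in every row. The names toy, keep, and toy_trimmed are illustrative and do not appear in the analysis.

```r
# Toy data frame standing in for the merged WDI table (values are made up).
toy <- data.frame(
  Country.Code = c("AAA", "BBB", "CCC"),
  X1990 = c(1.5, NA, 3.0),   # partly missing: kept
  X1991 = c(NA, NA, NA),     # missing in every row: dropped
  stringsAsFactors = FALSE
)

# colSums(is.na(toy)) counts the missing values in each column;
# a column is kept only if that count is smaller than the number of rows.
keep <- colSums(is.na(toy)) < nrow(toy)
toy_trimmed <- toy[, keep, drop = FALSE]

print(colnames(toy_trimmed))
```

A column with some missing values survives this step, which is why the later cleaning still removes individual NA values after reshaping.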

Initial Questions to Answer

Thanks to the World Bank, we have a wealth of data at our disposal detailing the past 30 years of carbon emissions for every country. This data requires some cleaning to remove null values and to make sure that the countries are officially recognized, but with it we can answer a few questions.

Our first question is simple: how equally are the countries of the world contributing to climate change via CO2 emissions, and how much, if at all, have the responsible countries changed within the scope of inquiry?

# Data Cleaning
# Keep the identifying columns plus the year columns (read.csv names them X followed by the four-digit year)
metadata_headers <- c("Country.Code","Country.Name","Indicator.Name")
pattern <- "X\\d{4}"
year_headers <- colnames(all_countries_data)[grep(pattern, colnames(all_countries_data))]
all_headers <- c(metadata_headers,year_headers)

all_countries_data_clean <- subset(all_countries_data, select=c(metadata_headers,year_headers))
# Restrict to officially recognized countries
all_countries_data_clean <- all_countries_data_clean %>%
  filter(Country.Code %in% official_recognized_countries$ABW)
# Strip the leading X so the year columns are named by the year alone
new_colnames <- sub("^X", "", year_headers)
colnames(all_countries_data_clean)[grep(pattern, colnames(all_countries_data_clean))] <- new_colnames

# Reshape to one row per country and year, then drop missing values
all_countries_data_clean <- all_countries_data_clean %>%
  pivot_longer(cols = all_of(new_colnames), names_to = "Year", values_to = "Value")
all_countries_data_clean <- all_countries_data_clean[!is.na(all_countries_data_clean$Value),]

# Emissions for 2020, largest first (the object name says 2024, but the rows are filtered to 2020)
emissions_data_sorted_2024 <- all_countries_data_clean[all_countries_data_clean$Year == 2020,]
emissions_data_sorted_2024 <- emissions_data_sorted_2024[order(-emissions_data_sorted_2024$Value), ]


# Reshape GDP the same way, reusing metadata_headers, pattern, year_headers and new_colnames from above
gdp <- subset(gdp, select=c(metadata_headers,year_headers))
colnames(gdp)[grep(pattern, colnames(gdp))] <- new_colnames
gdp <- gdp %>%
  pivot_longer(cols = all_of(new_colnames), names_to = "Year", values_to = "Value")
gdp <- gdp[!is.na(gdp$Value),]
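
The wide-to-long reshape applied to both tables can be sketched on a small made-up table. This version assumes a reasonably recent tidyr and folds the prefix stripping and NA removal into the pivot_longer call through names_prefix and values_drop_na. It is an alternative illustration, not the code that produced the figures here, and it stores Year as an integer where the code above keeps it as text.

```r
library(tidyr)

# Made-up wide table: one row per country, one column per year.
wide <- data.frame(
  Country.Code = c("AAA", "BBB"),
  X1990 = c(10, NA),
  X1991 = c(12, 7)
)

# Select only the columns that are exactly X plus four digits.
year_cols <- grep("^X\\d{4}$", colnames(wide), value = TRUE)

long <- pivot_longer(
  wide,
  cols = all_of(year_cols),
  names_to = "Year",
  values_to = "Value",
  names_prefix = "X",                         # strip the leading X
  names_transform = list(Year = as.integer),  # store the year as a number
  values_drop_na = TRUE                       # drop country-years with no value
)

print(long)
```

Keeping Year numeric can make later comparisons such as Year == 2020 and time-ordered plots less dependent on implicit type conversion.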

Emissions Heatmap

There are many possible ways to answer this question, but one of the easiest is a heat map. A heat map shows not only who is contributing the most right now, but also how the top contributors have differed in past years.

all_countries_data_clean <- filter(all_countries_data_clean, Country.Code %in% official_recognized_countries$ABW)

heatmap_data <- dcast(all_countries_data_clean, Country.Name ~ Year, value.var = "Value")

melted_data <- melt(heatmap_data, id.vars = "Country.Name")
ggplot(melted_data, aes(x = variable, y = Country.Name, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white",high = "red", labels = function(x) paste(comma(x / 1e6), "Million")) +
  labs(title = "CO2 Emissions Heatmap (kilotons)", x = "Year", y = "Country") +
  theme_minimal() +
  # theme() goes after theme_minimal(), which would otherwise reset the axis text margin
  theme(axis.text.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))

What’s interesting here is that only a few countries seem to be factors at all when it comes to emissions, with China and the US standing out in particular in recent years. We will look at these two countries more closely later, but first we want to know why this is. One naive guess is that emissions are correlated with GDP.

GDP Scatterplot

Here, we test the hypothesis that higher-GDP countries have higher emissions by plotting the 2020 GDP of the 50 largest economies in a scatterplot.

gdp_2020 <- gdp[gdp$Year == 2020,]
gdp_2020_top50 <- gdp_2020 %>%
  arrange(desc(Value)) %>%
  head(50)
# The indicator is total GDP (current US$), not GDP per capita; countries are ordered by GDP
ggplot(gdp_2020_top50, aes(x = Value, y = reorder(Country.Name, Value))) +
  scale_x_continuous(labels = scales::label_number(scale = 1e-9, prefix = "$", suffix = " Billion")) +
  geom_point() +
  labs(title = "GDP (current US$) in 2020, Top 50 Countries",
       x = "GDP (current US$)",
       y = "Country") +
  theme_minimal()

Interestingly, our hypothesis seems at the very least plausible based on this visualization. The US and China have the two highest GDPs, with Japan coming in third.
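
One way to put a number on this hypothesis, sketched here under assumptions, is to join GDP and emissions by country for a single year and compute a rank correlation. The two small tables below use made-up values for hypothetical countries, and the names gdp_toy, co2_toy, joined, and rho are illustrative only.

```r
# Hypothetical single-year values for made-up countries (not WDI data).
gdp_toy <- data.frame(
  Country.Code = c("AAA", "BBB", "CCC", "DDD"),
  GDP = c(2.0e13, 1.5e13, 5.0e12, 1.0e12)
)
co2_toy <- data.frame(
  Country.Code = c("AAA", "BBB", "CCC", "EEE"),
  CO2 = c(4.5e6, 1.0e7, 1.0e6, 3.0e5)
)

# Inner join on country code: only countries present in both tables are kept.
joined <- merge(gdp_toy, co2_toy, by = "Country.Code")

# Spearman correlation compares ranks, so a few very large countries
# influence it less than they would influence a Pearson correlation.
rho <- cor(joined$GDP, joined$CO2, method = "spearman")
print(rho)
```

With the real tables, the same join could be applied to the 2020 rows of gdp and all_countries_data_clean. A correlation computed this way describes association only and says nothing about cause.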

World’s Largest CO2 Producers at present (data from 2020)

Let’s take an even closer look, zooming in on the top 10 countries for carbon emissions in 2020 to see how much each country is contributing.

# Pie chart of the top 10 emitters in 2020 (the source object is named 2024 but holds 2020 rows)
top_10 <- head(emissions_data_sorted_2024, 10)
top_10 <- top_10 %>% arrange(desc(Value))
top_10$Country.Name <- reorder(top_10$Country.Name, -top_10$Value)
# Legend label, e.g. "Country (1,234 kt CO2)"
top_10$label <- paste0(top_10$Country.Name, " (", comma(top_10$Value), " kt CO2)")
ggplot(top_10, aes(x = "", y = Value, fill = factor(Value))) +
  geom_col(width = 1, color = "black") +  # a single bar layer; geom_bar(stat = "identity") drew the same bars twice
  coord_polar(theta = "y") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(x=1.8,label = Country.Name), color = "black", position = position_stack(vjust = 0.5)) +  # Country labels outside the slices
  labs(title = "Top 10 CO2-Emitting Countries in 2020", fill = "Emissions by Country") +
  # factor(Value) orders its levels from smallest to largest, so the labels are reversed to match
  scale_fill_manual(values = c('pink','lightgreen','lightblue','lightyellow','brown','hotpink','orange','purple','forestgreen','aquamarine'), labels = rev(top_10$label))
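
A pie chart of the top 10 shows each country's share of the top-10 subtotal, which is not the same as its share of all emissions. The sketch below makes the difference explicit on made-up numbers. The table emissions_toy and the columns share_of_top and share_of_all are hypothetical names introduced for illustration.

```r
# Made-up emissions in kt for five hypothetical countries (not WDI data).
emissions_toy <- data.frame(
  Country.Name = c("A", "B", "C", "D", "E"),
  Value = c(500, 250, 150, 60, 40)
)

n_top <- 3
ranked <- emissions_toy[order(-emissions_toy$Value), ]
top <- head(ranked, n_top)

# Share of the top-n subtotal: this is what a pie chart of the top n displays.
top$share_of_top <- top$Value / sum(top$Value)
# Share of the full total: the country's contribution to all emissions in the table.
top$share_of_all <- top$Value / sum(emissions_toy$Value)

print(top)
```

In this toy table, country A takes up more than half of the three-country pie while contributing exactly half of the full total, so slice sizes should be read with the chosen subset in mind.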

Comparing this chart with the GDP plot, it seems that GDP may be weakly correlated with CO2 emissions, although neither visualization establishes a causal relationship. Still, we see China and the US at #1 and #2. This raises two questions. First, were they always in these positions? Second, how have their CO2 contributions changed over time? Conventional wisdom about climate change would suggest that both countries should be contributing more CO2 every year.

Comparing #1 and #2 CO2 producers (USA and China) over time

Let’s test this idea with a line graph containing one line per country. The purple dot highlights the point in time at which the two countries switch positions (2005).

usa_china_data <- all_countries_data_clean[all_countries_data_clean$Country.Code %in% c("CHN","USA"),]
usa_data <- usa_china_data[usa_china_data$Country.Code == "USA",]
china_data <- usa_china_data[usa_china_data$Country.Code == "CHN",]
# Align the two series by year before comparing them, in case one country is missing some years
both <- merge(usa_data, china_data, by = "Year", suffixes = c(".usa", ".chn"))
# Year in which the two series are closest, used as the approximate crossover point
intersect_year <- both$Year[which.min(abs(both$Value.usa - both$Value.chn))]

ggplot(usa_china_data, aes(x = Year, y = Value, color = Country.Name, group = Country.Name)) +
  geom_line() +
  geom_point(data = usa_china_data[usa_china_data$Year == intersect_year & usa_china_data$Country.Code == "USA",], color = "purple", size=5) +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = " Million")) +
  labs(title = "Kilotons of CO2 Emissions per Year by Country (USA vs China)",
       x = "Year",
       y = "CO2 (kt)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))  # after theme_minimal() so the centered title is kept

Here, we see China follow the trajectory we predicted, emitting (with a few exceptions) more CO2 every year. The US, however, has a more interesting trajectory, seemingly falling from its peak in 2000. Again, conventional wisdom suggests that this number would go up every year. After all, global climate change is getting worse.
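
The year in which one series overtakes another can also be located directly instead of being approximated by the closest pair of values. The sketch below uses two made-up series (usa_toy and chn_toy are hypothetical names and the numbers are not WDI data) and reports both the year of closest approach and the first year in which the second series is larger.

```r
# Two made-up yearly series (arbitrary units, not WDI data).
usa_toy <- data.frame(Year = 2001:2005, Value = c(58, 59, 60, 59, 58))
chn_toy <- data.frame(Year = 2001:2005, Value = c(45, 52, 58, 64, 70))

# Align the series by year so that rows are compared like for like.
both <- merge(usa_toy, chn_toy, by = "Year", suffixes = c(".usa", ".chn"))
both <- both[order(both$Year), ]

# Year in which the two values are closest to each other.
closest_year <- both$Year[which.min(abs(both$Value.usa - both$Value.chn))]

# First year in which the second series exceeds the first.
# This is NA if the second series never exceeds the first.
crossover_year <- both$Year[which(both$Value.chn > both$Value.usa)[1]]

print(c(closest = closest_year, crossover = crossover_year))
```

The two answers can differ by a year, as they do in this toy data, so it is worth stating which definition a highlighted point uses.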

Sum of CO2 emissions globally by Year

Let’s investigate this finding across all countries. Essentially, we want to know whether total global CO2 emissions are going up every year or not. We will try to answer this using a histogram of the yearly totals.

aggregated_df <- all_countries_data_clean %>%
  group_by(Year) %>%
  summarise(Value = sum(Value, na.rm = TRUE))

ggplot(aggregated_df, aes(x = Value, fill = after_stat(count))) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = " Million")) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Histogram of Total Global CO2 Emissions Aggregated by Year", x = "Sum of CO2 emissions by year (kilotons)", y = "Frequency") +
  theme_minimal()

Very interestingly, the yearly totals seem to be spread fairly evenly across the range of values. This suggests that the trend isn’t as predictable as our hypothesis led us to believe.
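
A histogram shows how the yearly totals are distributed but not the order in which they occurred, so a complementary check on "going up every year" is to sort the totals by year and look at the year-over-year change. The sketch below does this on made-up totals. totals_toy is a hypothetical stand-in for aggregated_df with a numeric Year column.

```r
# Made-up yearly totals (arbitrary units, not WDI data).
totals_toy <- data.frame(
  Year = 2001:2006,
  Value = c(100, 104, 103, 108, 107, 111)
)

# Sort by year, then take the difference between consecutive years.
totals_toy <- totals_toy[order(totals_toy$Year), ]
change <- diff(totals_toy$Value)

rises_every_year <- all(change > 0)   # TRUE only if no year falls or stays flat
n_decreases <- sum(change < 0)        # number of years with a drop
net_change <- totals_toy$Value[nrow(totals_toy)] - totals_toy$Value[1]

print(list(rises_every_year = rises_every_year,
           n_decreases = n_decreases,
           net_change = net_change))
```

In this toy series the total does not rise every year, yet it still ends higher than it started, which shows why the two questions should be kept separate.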

Conclusion

We answered a number of questions with our 5 visualizations:

  1. GDP seems weakly correlated with the CO2 emissions of a country.

  2. Very few countries seem to be “big players” when it comes to global CO2 emissions.

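  3. China is currently the largest emitter, with the US in second place. In the 2020 pie chart, China's slice is nearly half of the top-10 total, which is a share of those ten countries and not of total global CO2.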

  4. China has been steadily increasing its CO2 emissions since 1990, officially passing the United States in 2005.

  5. The USA, like many other countries, has not been increasing its CO2 production continuously. This has major ramifications for explaining the causes of worsening climate change.