Project 1-SLK

Author

Senay LK

Global Emissions and GDP

Introduction

For my project I chose the “Global Emissions” data set. Each row is one country in one year, and the columns hold that country’s GDP, population, and greenhouse gas emissions. It has 2484 observations (27 years of data, from 1992 to 2018, for each of the 92 countries) and 20 variables. More specifically, it includes variables such as a country’s GDP, population, and emissions of methane (CH4), nitrous oxide (N2O), and carbon dioxide (CO2) in kilotons. The source for this data set is ‘Our World in Data’, and I found it through the CORGIS repository. I plan to visualize the countries with the highest greenhouse gas emissions and investigate how GDP correlates with emissions.

Loading in the Libraries and the data set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly) # for interactivity

Warning: package 'plotly' was built under R version 4.4.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("C:/Users/senay/OneDrive/Desktop/Scoo/Spring 2025/DATA 110/Datasets")
globalemmissions <- read_csv("global_emissions.csv")

Rows: 2484 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country.Name, Country.Code
dbl (18): Year, Country.GDP, Country.Population, Emissions.Production.CH4, E...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Analysis

Examining the data set

# looking at the structure
str(globalemmissions)

spc_tbl_ [2,484 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Year                              : num [1:2484] 1992 1993 1994 1995 1996 ...
 $ Country.Name                      : chr [1:2484] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Country.Code                      : chr [1:2484] "AFG" "AFG" "AFG" "AFG" ...
 $ Country.GDP                       : num [1:2484] 1.27e+10 9.83e+09 7.92e+09 1.23e+10 1.21e+10 ...
 $ Country.Population                : num [1:2484] 14485543 15816601 17075728 18110662 18853444 ...
 $ Emissions.Production.CH4          : num [1:2484] 7.13 7.21 7.47 7.83 8.67 ...
 $ Emissions.Production.N2O          : num [1:2484] 2.89 2.93 2.76 2.88 3.12 3.43 3.72 4 3.48 3.07 ...
 $ Emissions.Production.CO2.Cement   : num [1:2484] 0.046 0.047 0.047 0.047 0.047 0.047 0.047 0.047 0.01 0.007 ...
 $ Emissions.Production.CO2.Coal     : num [1:2484] 0.022 0.018 0.015 0.015 0.007 0.004 0.004 0.004 0.004 0.07 ...
 $ Emissions.Production.CO2.Gas      : num [1:2484] 0.363 0.352 0.338 0.322 0.308 0.283 0.265 0.242 0.224 0.209 ...
 $ Emissions.Production.CO2.Oil      : num [1:2484] 0.927 0.894 0.86 0.824 0.78 0.728 0.691 0.495 0.498 0.491 ...
 $ Emissions.Production.CO2.Flaring  : num [1:2484] 0.022 0.022 0.022 0.022 0.022 0.022 0.022 0.022 0.022 0.022 ...
 $ Emissions.Production.CO2.Other    : num [1:2484] 0.00 0.00 2.22e-16 2.22e-16 1.00e-03 ...
 $ Emissions.Production.CO2.Total    : num [1:2484] 1.38 1.33 1.28 1.23 1.16 ...
 $ Emissions.Global Share.CO2.Cement : num [1:2484] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0 0 ...
 $ Emissions.Global Share.CO2.Coal   : num [1:2484] 0 0 0 0 0 0 0 0 0 0 ...
 $ Emissions.Global Share.CO2.Gas    : num [1:2484] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0 0 ...
 $ Emissions.Global Share.CO2.Oil    : num [1:2484] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0 0 0 ...
 $ Emissions.Global Share.CO2.Flaring: num [1:2484] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
 $ Emissions.Global Share.CO2.Total  : num [1:2484] 0.01 0.01 0.01 0.01 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Year = col_double(),
  ..   Country.Name = col_character(),
  ..   Country.Code = col_character(),
  ..   Country.GDP = col_double(),
  ..   Country.Population = col_double(),
  ..   Emissions.Production.CH4 = col_double(),
  ..   Emissions.Production.N2O = col_double(),
  ..   Emissions.Production.CO2.Cement = col_double(),
  ..   Emissions.Production.CO2.Coal = col_double(),
  ..   Emissions.Production.CO2.Gas = col_double(),
  ..   Emissions.Production.CO2.Oil = col_double(),
  ..   Emissions.Production.CO2.Flaring = col_double(),
  ..   Emissions.Production.CO2.Other = col_double(),
  ..   Emissions.Production.CO2.Total = col_double(),
  ..   `Emissions.Global Share.CO2.Cement` = col_double(),
  ..   `Emissions.Global Share.CO2.Coal` = col_double(),
  ..   `Emissions.Global Share.CO2.Gas` = col_double(),
  ..   `Emissions.Global Share.CO2.Oil` = col_double(),
  ..   `Emissions.Global Share.CO2.Flaring` = col_double(),
  ..   `Emissions.Global Share.CO2.Total` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

# checking the dimensions 
dim(globalemmissions)

[1] 2484   20

# looking at the first few rows
head(globalemmissions)

# A tibble: 6 × 20
   Year Country.Name Country.Code Country.GDP Country.Population
  <dbl> <chr>        <chr>              <dbl>              <dbl>
1  1992 Afghanistan  AFG          12677538816           14485543
2  1993 Afghanistan  AFG           9834580992           15816601
3  1994 Afghanistan  AFG           7919857152           17075728
4  1995 Afghanistan  AFG          12307525632           18110662
5  1996 Afghanistan  AFG          12070125568           18853444
6  1997 Afghanistan  AFG          11850752000           19357126
# ℹ 15 more variables: Emissions.Production.CH4 <dbl>,
#   Emissions.Production.N2O <dbl>, Emissions.Production.CO2.Cement <dbl>,
#   Emissions.Production.CO2.Coal <dbl>, Emissions.Production.CO2.Gas <dbl>,
#   Emissions.Production.CO2.Oil <dbl>, Emissions.Production.CO2.Flaring <dbl>,
#   Emissions.Production.CO2.Other <dbl>, Emissions.Production.CO2.Total <dbl>,
#   `Emissions.Global Share.CO2.Cement` <dbl>,
#   `Emissions.Global Share.CO2.Coal` <dbl>, …

Cleaning the data set

# making everything lowercase
names(globalemmissions) <- tolower(names(globalemmissions)) 
# substituting spaces with underscores
names(globalemmissions) <- gsub(" ", "_", names(globalemmissions))
# substituting periods with underscores
names(globalemmissions) <- gsub("[.]", "_", names(globalemmissions)) 

head(globalemmissions)

# A tibble: 6 × 20
   year country_name country_code country_gdp country_population
  <dbl> <chr>        <chr>              <dbl>              <dbl>
1  1992 Afghanistan  AFG          12677538816           14485543
2  1993 Afghanistan  AFG           9834580992           15816601
3  1994 Afghanistan  AFG           7919857152           17075728
4  1995 Afghanistan  AFG          12307525632           18110662
5  1996 Afghanistan  AFG          12070125568           18853444
6  1997 Afghanistan  AFG          11850752000           19357126
# ℹ 15 more variables: emissions_production_ch4 <dbl>,
#   emissions_production_n2o <dbl>, emissions_production_co2_cement <dbl>,
#   emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
#   emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
#   emissions_production_co2_other <dbl>, emissions_production_co2_total <dbl>,
#   emissions_global_share_co2_cement <dbl>,
#   emissions_global_share_co2_coal <dbl>, …
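
The three renaming steps above can also be collapsed into a single call. The following is only a minimal sketch on two example names written in the style of the raw CSV header; `raw_names` and `clean_names` are hypothetical variables introduced for illustration.

```r
# Two example column names in the style of the raw CSV header
raw_names <- c("Country.Name", "Emissions.Global Share.CO2.Total")

# Lowercase first, then replace every period or space with an underscore.
# The character class "[. ]" matches a literal period or a space.
clean_names <- gsub("[. ]", "_", tolower(raw_names))

print(clean_names)
# [1] "country_name"                     "emissions_global_share_co2_total"
```

Applied to the real data, the same expression would be assigned back to `names(globalemmissions)`.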

Filtering for the 5 countries with the most GDP

# arranging the data set by decreasing GDP to find out which countries have the highest
globalemmissions |> arrange(desc(country_gdp))

# A tibble: 2,484 × 20
    year country_name  country_code country_gdp country_population
   <dbl> <chr>         <chr>              <dbl>              <dbl>
 1  2018 China         CHN              1.82e13         1427647744
 2  2018 United States USA              1.81e13          327096256
 3  2017 United States USA              1.76e13          325084736
 4  2017 China         CHN              1.76e13         1421021696
 5  2016 China         CHN              1.73e13         1414049408
 6  2016 United States USA              1.72e13          323016000
 7  2015 United States USA              1.69e13          320878304
 8  2015 China         CHN              1.67e13         1406847872
 9  2014 United States USA              1.65e13          318673440
10  2014 China         CHN              1.62e13         1399453952
# ℹ 2,474 more rows
# ℹ 15 more variables: emissions_production_ch4 <dbl>,
#   emissions_production_n2o <dbl>, emissions_production_co2_cement <dbl>,
#   emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
#   emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
#   emissions_production_co2_other <dbl>, emissions_production_co2_total <dbl>,
#   emissions_global_share_co2_cement <dbl>, …

The countries with consistently high GDP are China, the United States, India, Japan, and Germany.

# creating a subset data set of the countries with the highest GDP
top5gdp <- globalemmissions |> filter(country_name %in% c("China", "United States", "India", "Japan", "Germany"))

Filtering for the 5 countries with the least GDP

# arranging by increasing GDP to identify which countries have the lowest
globalemmissions |> arrange(country_gdp)

# A tibble: 2,484 × 20
    year country_name         country_code country_gdp country_population
   <dbl> <chr>                <chr>              <dbl>              <dbl>
 1  1992 United Arab Emirates ARE                    0            2052892
 2  1992 Qatar                QAT           6980115968             495403
 3  1992 Iceland              ISL           7008859648             260155
 4  1993 Iceland              ISL           7073786368             262650
 5  1995 Iceland              ISL           7281669120             267627
 6  1994 Iceland              ISL           7301300736             265140
 7  1996 Iceland              ISL           7601171456             270144
 8  1993 Qatar                QAT           7619622912             501479
 9  1994 Afghanistan          AFG           7919857152           17075728
10  1992 Gabon                GAB           7946244608            1002573
# ℹ 2,474 more rows
# ℹ 15 more variables: emissions_production_ch4 <dbl>,
#   emissions_production_n2o <dbl>, emissions_production_co2_cement <dbl>,
#   emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
#   emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
#   emissions_production_co2_other <dbl>, emissions_production_co2_total <dbl>,
#   emissions_global_share_co2_cement <dbl>, …

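According to the data set, the countries with consistently low GDP are Qatar, Iceland, Tajikistan, Gabon, and Albania. I excluded the UAE and Afghanistan because their GDP is not consistently low: the UAE appears at the top of this list only because of a single GDP value of 0 in 1992, which is most likely a missing value rather than a real measurement, and Afghanistan appears only for 1994.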

# creating a subset data set of the countries with the least GDP
bottom5gdp <- globalemmissions |> filter(country_name %in% c("Qatar", "Iceland", "Tajikistan", "Gabon", "Albania"))
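
Picking the countries by reading the sorted table works, but the choice can be made reproducible by ranking countries on their average GDP over all years. The sketch below is one possible way to do that and is not the method used in this project. It runs on a small made-up data frame (`toy`, with countries "A", "B", and "C"), and it assumes that a GDP of 0 means "not reported".

```r
library(dplyr)

# Toy stand-in for globalemmissions (hypothetical values, not real data)
toy <- data.frame(
  country_name = rep(c("A", "B", "C"), each = 3),
  year         = rep(1992:1994, times = 3),
  country_gdp  = c(0, 50, 60,       # A: the 0 is treated as missing
                   10, 12, 11,      # B: consistently low
                   900, 950, 1000)  # C: consistently high
)

gdp_rank <- toy |>
  mutate(country_gdp = na_if(country_gdp, 0)) |>  # assumption: 0 means missing
  group_by(country_name) |>
  summarise(mean_gdp = mean(country_gdp, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(mean_gdp))

# First row has the highest mean GDP, last row the lowest
top_country    <- slice_head(gdp_rank, n = 1)$country_name
bottom_country <- slice_tail(gdp_rank, n = 1)$country_name
```

On the real data, `n = 5` in `slice_head()` and `slice_tail()` would give the two groups of five, which may or may not match the lists chosen by hand above.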

Visualization

# using ggplot to make a graph of CO2 emissions of countries with high GDP 
p1 <- top5gdp |> 
  ggplot(aes(x = year, y = emissions_production_co2_total, color = country_name)) +
  geom_line(linewidth=1.5) +  # making the graph a line chart 
  scale_color_brewer(palette = "Set1") + # using Set1 Color Palette
  theme_minimal() + # a minimal background
  labs(x = "Years",  # labels 
       y = "CO2 Emissions in Kilotons", 
       color = "Country",
       title = "CO2 Emissions of Countries With High GDP from 1992 to 2018", 
       caption = "Source: Our World In Data")
p1

# using ggplot to make a graph of CO2 emissions of countries with low GDP 
p2 <- bottom5gdp |> 
  ggplot(aes(x = year, y = emissions_production_co2_total, color = country_name)) +
  geom_line() +  # making the graph a line chart 
  scale_color_brewer(palette = "Dark2") + # using Dark2 Color Palette
  theme_minimal() + # a minimal background
  labs(x = "Years",  # labels 
       y = "CO2 Emissions in Kilotons", 
       color = "Country",
       title = "CO2 Emissions of Countries With Low GDP from 1992 to 2018", 
       caption = "Source: Our World In Data") 

p2  <- ggplotly(p2) # for interactivity 
p2

We can see that countries with high GDP have much higher CO2 emissions than countries with low GDP. Next, I want to compare the two groups directly and include the other greenhouse gases, using 2018 because it is the most recent year in the data set. Additionally, I am going to add 5 more countries to each of the two categories so that the comparison of greenhouse gas emissions between high and low GDP countries is more general.

# adding 5 more countries to countries with high GDP and filtering for only 2018
top10gdp <- globalemmissions |> 
  filter(country_name %in% c("China", "United States", "India", "Japan", "Germany", "Russia", "Brazil", "Indonesia", "France", "United Kingdom"), year == 2018) |>
  mutate(gdp = "High GDP Countries") # This is to separate the high and low GDP countries when I merge the two tables and to make a facet grid later on

# adding 5 more countries to countries with low GDP and filtering for only 2018
bottom10gdp <- globalemmissions |> filter(country_name %in% c("Qatar", "Iceland", "Tajikistan", "Gabon", "Albania", "North Macedonia", "Cyprus", "Bosnia and Herzegovina", "Kyrgyzstan", "Moldova"), year == 2018) |>
  mutate(gdp = "Low GDP Countries") # This is to separate the low and high GDP countries when I merge the two tables and to make a facet grid later on

Now I am going to merge these two subsets together and pivot longer. I was inspired by code from previous class materials.

#merging the two subset data sets 
newdf <- rbind(top10gdp, bottom10gdp)
# making a longer version of the merged data set so it is suitable to plot
long_newdf <- newdf |> 
  pivot_longer(cols = c(6,7,14), # the columns with green house gases
               names_to = "greenhouse_gas", 
               values_to = "emissions_in_kiltons") 
head(long_newdf)

# A tibble: 6 × 20
   year country_name country_code country_gdp country_population
  <dbl> <chr>        <chr>              <dbl>              <dbl>
1  2018 Brazil       BRA              2.97e12          209469312
2  2018 Brazil       BRA              2.97e12          209469312
3  2018 Brazil       BRA              2.97e12          209469312
4  2018 China        CHN              1.82e13         1427647744
5  2018 China        CHN              1.82e13         1427647744
6  2018 China        CHN              1.82e13         1427647744
# ℹ 15 more variables: emissions_production_co2_cement <dbl>,
#   emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
#   emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
#   emissions_production_co2_other <dbl>,
#   emissions_global_share_co2_cement <dbl>,
#   emissions_global_share_co2_coal <dbl>,
#   emissions_global_share_co2_gas <dbl>, …
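
Selecting columns by position, as in `c(6,7,14)`, silently picks the wrong variables if the column order ever changes. A more robust variant selects them by name. The sketch below shows this on a small made-up data frame (`toy_wide`, hypothetical values); the `names_prefix` argument is an optional extra that shortens the gas labels and is not part of the original code.

```r
library(tidyr)

# Toy stand-in for newdf with only the columns needed here (hypothetical values)
toy_wide <- data.frame(
  country_name                   = c("A", "B"),
  emissions_production_ch4       = c(1.0, 2.0),
  emissions_production_n2o       = c(0.5, 0.7),
  emissions_production_co2_total = c(10, 20)
)

toy_long <- toy_wide |>
  pivot_longer(
    cols = c(emissions_production_ch4,
             emissions_production_n2o,
             emissions_production_co2_total),  # selected by name, not position
    names_to = "greenhouse_gas",
    names_prefix = "emissions_production_",    # strips the shared prefix from labels
    values_to = "emissions_in_kiltons"         # same column name as the plotting code uses
  )

print(toy_long)
```

Each country now has one row per gas, with `greenhouse_gas` taking the values `ch4`, `n2o`, and `co2_total`.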

Making a heat map of the comparison of greenhouse gas emissions between High and Low GDP countries

p3 <-  long_newdf |> 
  ggplot(aes(x = country_name, y = greenhouse_gas, fill = emissions_in_kiltons)) + 
  geom_tile() + # heat map
  scale_fill_distiller(palette="PuOr") + # Purple and orange fill color 
  facet_grid(~gdp) + # makes a side by side heat map for the two categories 
  theme_dark() + # dark background 
  labs(x = "Countries", 
       y = "Greenhouse Gases", 
       fill = "Emissions in Kilotons", 
       title = "Comparison of Greenhouse Gas Emissions for 2018", 
       caption = "Source: Our World In Data") +
  theme(axis.text.x = element_blank()) # removes the name of the countries because it is overcrowded and I want to show general comparison of greenhouse gas emissions between low and high GDP countries 
  
p3 <- ggplotly(p3) # interactivity 
p3

Correlation and Linear Regression Analysis

cor(globalemmissions$country_gdp, globalemmissions$emissions_production_co2_total)

[1] 0.941533

There is a very strong positive correlation (r = 0.94) between GDP and CO2 emissions across all observations in the data set.
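
`cor()` returns only the coefficient. If a p-value and confidence interval are also wanted, base R's `cor.test()` provides them. The sketch below uses made-up vectors (`gdp` and `co2` are hypothetical toy values, not columns from the data set) just to show the shape of the result.

```r
# Hypothetical toy values standing in for the GDP and CO2 columns
gdp <- c(1, 2, 3, 4, 5)
co2 <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Pearson correlation test (the default method)
res <- cor.test(gdp, co2)

res$estimate  # the correlation coefficient, same value cor(gdp, co2) returns
res$p.value   # p-value for the null hypothesis of zero correlation
res$conf.int  # 95% confidence interval for the coefficient
```

On the real data the two arguments would be `globalemmissions$country_gdp` and `globalemmissions$emissions_production_co2_total`.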

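Next, I perform a linear regression with CO2 emissions as the response variable and GDP as the predictor. Note that the model is fitted on newdf, the 20 selected countries in 2018, rather than on the full data set.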

lm_model <- lm(emissions_production_co2_total ~ country_gdp, data = newdf)
summary(lm_model)


Call:
lm(formula = emissions_production_co2_total ~ country_gdp, data = newdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2003.4  -496.6   259.4   265.6  2906.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.696e+02  2.483e+02  -1.086    0.292    
country_gdp  4.216e-10  3.877e-11  10.874 2.42e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 933.5 on 18 degrees of freedom
Multiple R-squared:  0.8679,    Adjusted R-squared:  0.8605 
F-statistic: 118.2 on 1 and 18 DF,  p-value: 2.424e-09

The y-intercept is -269.6 and the slope is 0.0000000004216 (4.216e-10).
So the equation for the linear model is CO2 emissions = 0.0000000004216(GDP) - 269.6.
The slope can be interpreted as follows: for each one-dollar increase in GDP, CO2 emissions are predicted to increase by 0.0000000004216 kilotons, or about 0.42 kilotons for each additional billion dollars of GDP. The intercept is not statistically significant (p = 0.292).
The R^2 value is 0.8679, which means that 86.79% of the variation in CO2 emissions among these 20 countries in 2018 is explained by GDP.
The p-value for the slope (2.42e-09) is far below 0.05, so there is very strong evidence that GDP contributes to the model. Because the model was fitted on only 20 observations, these estimates describe that subset rather than the full data set.
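
As a worked example of the equation, the sketch below plugs a GDP value into the fitted line by hand. The coefficients are copied from the summary output above; `predict_co2` is a hypothetical helper name, and the input GDP values are chosen only for illustration.

```r
# Coefficients copied from the summary(lm_model) output
intercept <- -269.6
slope     <- 4.216e-10

# Predicted CO2 emissions for a given GDP in dollars
predict_co2 <- function(gdp) intercept + slope * gdp

predict_co2(1e12)  # GDP of one trillion dollars: about 152
predict_co2(1e10)  # GDP of ten billion dollars: negative, about -265.4

# GDP at which the fitted line crosses zero (about 6.4e11 dollars)
zero_crossing <- -intercept / slope
```

The negative prediction for small economies shows a limit of the straight-line fit: below roughly 640 billion dollars of GDP the model predicts emissions that cannot occur, so it should not be used for the low GDP group.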

# making a scatter plot of x value as GDP and CO2 emissions as y values
globalemmissions |> 
  ggplot(aes(x = country_gdp , y = emissions_production_co2_total)) +
  geom_point() + 
  geom_smooth(method = 'lm', formula = y~x) + 
  labs(x = "GDP", 
       y = "CO2 emission in Kilotons", 
       title = "GDP versus CO2 emissions", 
       caption = "Source: Our World In Data"
  )

Summary

First, I performed exploratory data analysis by examining the structure and dimensions of the data set. Then I cleaned the data by making the column names fully lowercase with the tolower() function and replacing spaces and periods with underscores using the gsub() function. Next, I created two subsets by filtering for the 5 countries with the highest and lowest GDP and made a line chart for each group showing its CO2 emissions over the years. After that, I made a third graph, a heat map comparing greenhouse gas emissions in 2018 between the 10 countries with the highest GDP and the 10 countries with the lowest GDP. What surprised me in this visualization was that China emitted much more CO2 than the US. Finally, I computed the correlation between GDP and CO2 emissions, fitted a linear regression with GDP as the predictor, and made a scatter plot of GDP versus CO2 emissions.