For my project I chose the “Global Emissions” data set. Each row is one country in one year, and the columns hold that country's greenhouse gas emissions and related measures. It has 2,484 observations (92 countries × 27 years, from 1992 to 2018) and 20 variables. More specifically, it includes variables such as a country’s GDP, population, and emissions of methane (CH4), nitrous oxide (N2O), and carbon dioxide (CO2) in kilotons. The source for this data set is ‘Our World in Data’, and I found it through the CORGIS repository. I plan to visualize the countries with the highest greenhouse gas emissions and investigate the correlation between GDP and emissions.
Loading in the Libraries and the data set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly) # for interactivity
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 2484 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country.Name, Country.Code
dbl (18): Year, Country.GDP, Country.Population, Emissions.Production.CH4, E...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
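The raw column names (for example `Country.Name`) differ from the lowercase, underscore-separated names used in the rest of this report (for example `country_name`). The cleaning code is not shown in this output, so the following is only a minimal sketch of how such a step might look with `tolower()` and `gsub()`; the vector `raw_names` is a hypothetical stand-in for `names()` of the loaded data.

```r
# Hypothetical raw column names, in the style of the column specification above
raw_names <- c("Year", "Country.Name", "Country.GDP",
               "Emissions.Production.CO2.Total")

# Replace periods and spaces with underscores, then lowercase everything
clean_names <- tolower(gsub("[. ]", "_", raw_names))
clean_names
```

On a real data frame the same expression could be assigned back with `names(df) <- tolower(gsub("[. ]", "_", names(df)))`, where `df` is whatever object holds the imported data.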
# arranging the data set by decreasing GDP to find out which countries have the highest GDP
globalemmissions |>
  arrange(desc(country_gdp))
# A tibble: 2,484 × 20
year country_name country_code country_gdp country_population
<dbl> <chr> <chr> <dbl> <dbl>
1 2018 China CHN 1.82e13 1427647744
2 2018 United States USA 1.81e13 327096256
3 2017 United States USA 1.76e13 325084736
4 2017 China CHN 1.76e13 1421021696
5 2016 China CHN 1.73e13 1414049408
6 2016 United States USA 1.72e13 323016000
7 2015 United States USA 1.69e13 320878304
8 2015 China CHN 1.67e13 1406847872
9 2014 United States USA 1.65e13 318673440
10 2014 China CHN 1.62e13 1399453952
# ℹ 2,474 more rows
# ℹ 15 more variables: emissions_production_ch4 <dbl>,
# emissions_production_n2o <dbl>, emissions_production_co2_cement <dbl>,
# emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
# emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
# emissions_production_co2_other <dbl>, emissions_production_co2_total <dbl>,
# emissions_global_share_co2_cement <dbl>, …
The countries with consistently high GDP are China, the United States, India, Japan, and Germany.
# creating a subset data set of the countries with the highest GDP
top5gdp <- globalemmissions |>
  filter(country_name %in% c("China", "United States", "India", "Japan", "Germany"))
Filtering for the 5 countries with the lowest GDP
# arranging by increasing GDP value to identify which countries have the lowest GDP
globalemmissions |>
  arrange(country_gdp)
According to the data set, the countries with consistently low GDP are Qatar, Iceland, Tajikistan, Gabon, and Albania. I excluded the UAE and Afghanistan because their GDP is not consistently low across the years.
# creating a subset data set of the countries with the lowest GDP
bottom5gdp <- globalemmissions |>
  filter(country_name %in% c("Qatar", "Iceland", "Tajikistan", "Gabon", "Albania"))
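The two groups above were picked by reading the sorted output. One way to make "consistently high" or "consistently low" GDP reproducible is to rank countries by their mean GDP over all years. This is a sketch of that alternative, not the method used in this report; the data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the cleaned data set (values are made up)
toy <- data.frame(
  country_name = rep(c("A", "B", "C"), each = 2),
  year         = rep(c(2017, 2018), times = 3),
  country_gdp  = c(10, 12, 100, 110, 1, 2)
)

# Average each country's GDP over all of its years
gdp_rank <- toy |>
  group_by(country_name) |>
  summarise(mean_gdp = mean(country_gdp), .groups = "drop")

# Countries with the highest and lowest mean GDP
top2    <- gdp_rank |> slice_max(mean_gdp, n = 2) |> pull(country_name)
bottom2 <- gdp_rank |> slice_min(mean_gdp, n = 2) |> pull(country_name)

top2
bottom2
```

With the real data, `n = 5` would give five countries per group. The result may not match the hand-picked lists above, since a mean is only one possible definition of "consistent".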
Visualization
# using ggplot to make a graph of CO2 emissions of countries with high GDP
p1 <- top5gdp |>
  ggplot(aes(x = year, y = emissions_production_co2_total, color = country_name)) +
  geom_line(linewidth = 1.5) + # making the graph a line chart
  scale_color_brewer(palette = "Set1") + # using the Set1 color palette
  theme_minimal() + # a minimal background
  labs(x = "Years", # labels
       y = "CO2 Emissions in Kilotons",
       color = "Country",
       title = "CO2 Emissions of Countries With High GDP from 1992 to 2018",
       caption = "Source: Our World In Data")
p1
# using ggplot to make a graph of CO2 emissions of countries with low GDP
p2 <- bottom5gdp |>
  ggplot(aes(x = year, y = emissions_production_co2_total, color = country_name)) +
  geom_line() + # making the graph a line chart
  scale_color_brewer(palette = "Dark2") + # using the Dark2 color palette
  theme_minimal() + # a minimal background
  labs(x = "Years", # labels
       y = "CO2 Emissions in Kilotons",
       color = "Country",
       title = "CO2 Emissions of Countries With Low GDP from 1992 to 2018",
       caption = "Source: Our World In Data")
p2 <- ggplotly(p2) # for interactivity
p2
We can see that countries with high GDP have much higher CO2 emissions than countries with low GDP. Next, I want to compare the two groups directly and include other greenhouse gases, using 2018 because it is the most recent year in the data set. I am also going to add 5 more countries to each group so that the comparison of greenhouse gas emissions between high-GDP and low-GDP countries is more general.
# adding 5 more countries to the high-GDP group and filtering for only 2018
# the gdp column separates the high and low GDP countries when I merge the two tables and is used for the facet grid later on
top10gdp <- globalemmissions |>
  filter(country_name %in% c("China", "United States", "India", "Japan", "Germany",
                             "Russia", "Brazil", "Indonesia", "France", "United Kingdom"),
         year == 2018) |>
  mutate(gdp = "High GDP Countries")
# adding 5 more countries to the low-GDP group and filtering for only 2018
# the gdp column separates the low and high GDP countries when I merge the two tables and is used for the facet grid later on
bottom10gdp <- globalemmissions |>
  filter(country_name %in% c("Qatar", "Iceland", "Tajikistan", "Gabon", "Albania",
                             "North Macedonia", "Cyprus", "Bosnia and Herzegovina",
                             "Kyrgyzstan", "Moldova"),
         year == 2018) |>
  mutate(gdp = "Low GDP Countries")
Now I am going to merge these two subsets together and pivot longer. I was inspired by code from previous class materials.
# merging the two subset data sets
newdf <- rbind(top10gdp, bottom10gdp)
# making a longer version of the merged data set so it is suitable to plot
long_newdf <- newdf |>
  pivot_longer(cols = c(6, 7, 14), # the columns with greenhouse gases (CH4, N2O, total CO2)
               names_to = "greenhouse_gas",
               values_to = "emissions_in_kiltons")
head(long_newdf)
# A tibble: 6 × 20
year country_name country_code country_gdp country_population
<dbl> <chr> <chr> <dbl> <dbl>
1 2018 Brazil BRA 2.97e12 209469312
2 2018 Brazil BRA 2.97e12 209469312
3 2018 Brazil BRA 2.97e12 209469312
4 2018 China CHN 1.82e13 1427647744
5 2018 China CHN 1.82e13 1427647744
6 2018 China CHN 1.82e13 1427647744
# ℹ 15 more variables: emissions_production_co2_cement <dbl>,
# emissions_production_co2_coal <dbl>, emissions_production_co2_gas <dbl>,
# emissions_production_co2_oil <dbl>, emissions_production_co2_flaring <dbl>,
# emissions_production_co2_other <dbl>,
# emissions_global_share_co2_cement <dbl>,
# emissions_global_share_co2_coal <dbl>,
# emissions_global_share_co2_gas <dbl>, …
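The pivot above selects the gas columns by position (6, 7, and 14), which silently picks different columns if the column order ever changes. A minimal sketch of the same reshape with columns selected by name follows; the data frame `toy` and the value column name `emissions` are made up for illustration, while the three gas column names come from the tibble output earlier in this report.

```r
library(tidyr)

# Toy stand-in with the three gas columns that are pivoted in this report
toy <- data.frame(
  country_name                   = c("A", "B"),
  emissions_production_ch4       = c(1, 2),
  emissions_production_n2o       = c(3, 4),
  emissions_production_co2_total = c(5, 6)
)

# Select the columns by name instead of by position
long_toy <- toy |>
  pivot_longer(
    cols = c(emissions_production_ch4,
             emissions_production_n2o,
             emissions_production_co2_total),
    names_to  = "greenhouse_gas",
    values_to = "emissions"
  )

long_toy
```

Each country now has one row per gas, which is the shape `geom_tile()` needs for a heat map.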
Making a heat map of the comparison of greenhouse gas emissions between High and Low GDP countries
p3 <- long_newdf |>
  ggplot(aes(x = country_name, y = greenhouse_gas, fill = emissions_in_kiltons)) +
  geom_tile() + # heat map
  scale_fill_distiller(palette = "PuOr") + # purple and orange fill color
  facet_grid(~gdp) + # makes a side-by-side heat map for the two categories
  theme_dark() + # dark background
  labs(x = "Countries",
       y = "Greenhouse Gases",
       fill = "Emissions in Kilotons",
       title = "Comparison of Greenhouse Gas Emissions for 2018",
       caption = "Source: Our World In Data") +
  # removes the country names because the axis is overcrowded and I want to show a general comparison of greenhouse gas emissions between low and high GDP countries
  theme(axis.text.x = element_blank())
p3 <- ggplotly(p3) # interactivity
p3
There is a very strong positive correlation between GDP and CO2 emissions.
Performing a linear regression with CO2 emissions as the response variable and GDP as the predictor
lm_model <- lm(emissions_production_co2_total ~ country_gdp, data = newdf)
summary(lm_model)
Call:
lm(formula = emissions_production_co2_total ~ country_gdp, data = newdf)
Residuals:
Min 1Q Median 3Q Max
-2003.4 -496.6 259.4 265.6 2906.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.696e+02 2.483e+02 -1.086 0.292
country_gdp 4.216e-10 3.877e-11 10.874 2.42e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 933.5 on 18 degrees of freedom
Multiple R-squared: 0.8679, Adjusted R-squared: 0.8605
F-statistic: 118.2 on 1 and 18 DF, p-value: 2.424e-09
The y-intercept is -269.6 and the slope is 0.0000000004216 (4.216e-10).
So the equation for the linear model is: CO2 emissions = 0.0000000004216 × GDP - 269.6
The slope can be interpreted as follows: for each one-dollar increase in GDP, CO2 emissions are predicted to increase by 0.0000000004216 kilotons.
The R^2 value is 0.8679, which means that 86.79% of the variation in CO2 emissions among the 20 countries in the model (2018 only) is explained by GDP.
The p-value for GDP (2.42e-09) is far below 0.05, so the relationship is statistically significant: there is very strong evidence that GDP contributes to the model. The intercept, with a p-value of 0.292, is not significantly different from zero.
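Because the per-dollar slope is so small, it can be easier to read on a larger scale. The sketch below plugs in the coefficients reported in the summary output to rescale the slope to one trillion dollars of GDP, predict emissions for a GDP of 1.82e13 (the rounded 2018 value shown for China in the table earlier), and recover the correlation coefficient from R^2. The variable and function names are made up for illustration.

```r
# Coefficients copied from the summary(lm_model) output above
intercept <- -269.6
slope     <- 4.216e-10   # change in CO2 emissions per one dollar of GDP
r_squared <- 0.8679

# Slope per one trillion dollars of GDP (about 421.6)
slope_per_trillion <- slope * 1e12

# Predicted CO2 emissions for a given GDP under the fitted line
predict_co2 <- function(gdp) intercept + slope * gdp
predicted <- predict_co2(1.82e13)   # about 7404

# In simple linear regression, r is sqrt(R^2) with the sign of the slope
r <- sign(slope) * sqrt(r_squared)  # about 0.93

slope_per_trillion
predicted
r
```

All values are in the same emissions units as the model output. The prediction uses a rounded GDP, so it is approximate.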
# making a scatter plot with GDP as the x value and CO2 emissions as the y value
# note: this plots every observation in globalemmissions, whereas lm_model above was fit on the 20 rows of newdf
globalemmissions |>
  ggplot(aes(x = country_gdp, y = emissions_production_co2_total)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x) +
  labs(x = "GDP",
       y = "CO2 emission in Kilotons",
       title = "GDP versus CO2 emissions",
       caption = "Source: Our World In Data")
Summary
First, I performed exploratory data analysis by examining the structure and dimensions of the data set. Then, I cleaned the data by making the column names fully lowercase with the tolower() function and replacing spaces and periods with the gsub() function. Next, I created subsets by filtering for the 5 countries with the highest and the lowest GDP and made a line chart for each group showing its CO2 emissions over the years. After that, I made a third graph, a heat map comparing the 2018 greenhouse gas emissions of the 10 countries with the highest GDP and the 10 countries with the lowest GDP. What surprised me in this visualization was that China emitted much more CO2 than the US. Finally, I fit a linear regression of CO2 emissions on GDP and made a scatter plot of GDP versus CO2 emissions.