607_Project2C_DylanGold

Codebase 3

Importing data

For my 3rd file I picked a dataset containing emission data based on sector and country. This was obtained from a classmates post from dicussion 5A. Muhammad Khan posted https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions#Fossil_carbon_dioxide_emissions_by_country
From the link I got this csv file containing the green house gas emissions of countries by sector.
I uploaded the file to github then imported this file.

library(tidyverse)
library(dplyr)
library(ggplot2)
library(gt)

url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Project-2/refs/heads/main/FileC/EDGAR_2024_GHG_booklet_2024.csv"

df <- read_csv(
  file = url,
  col_names = TRUE
)

head(df, 8)

# A tibble: 8 × 58
  Substance Sector  `EDGAR Country Code` Country  `1970`  `1971`  `1972`  `1973`
  <chr>     <chr>   <chr>                <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 CO2       Agricu… AFG                  Afghan… 0.0292  0.0292  0.0292  0.0292 
2 CO2       Agricu… ALB                  Albania 0.113   0.113   0.113   0.113  
3 CO2       Agricu… ARG                  Argent… 0.104   0.104   0.104   0.104  
4 CO2       Agricu… ARM                  Armenia 0.0553  0.0553  0.0553  0.0553 
5 CO2       Agricu… AUS                  Austra… 0.311   0.311   0.311   0.311  
6 CO2       Agricu… AUT                  Austria 0.00644 0.00644 0.00644 0.00644
7 CO2       Agricu… AZE                  Azerba… 0.0415  0.0415  0.0415  0.0415 
8 CO2       Agricu… BEL                  Belgium 0.0417  0.0417  0.0417  0.0417 
# ℹ 50 more variables: `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
#   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
#   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, …

I will use glimpse to see the columns of this data set

glimpse(df)

Rows: 4,853
Columns: 58
$ Substance            <chr> "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", …
$ Sector               <chr> "Agriculture", "Agriculture", "Agriculture", "Agr…
$ `EDGAR Country Code` <chr> "AFG", "ALB", "ARG", "ARM", "AUS", "AUT", "AZE", …
$ Country              <chr> "Afghanistan", "Albania", "Argentina", "Armenia",…
$ `1970`               <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1971`               <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1972`               <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1973`               <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1974`               <dbl> 0.039966661, 0.113614286, 0.087214278, 0.05996643…
$ `1975`               <dbl> 0.045309517, 0.112514286, 0.077314278, 0.05996643…
$ `1976`               <dbl> 0.050599993, 0.127757140, 0.070085708, 0.06343258…
$ `1977`               <dbl> 0.061442848, 0.135142853, 0.111099990, 0.07115166…
$ `1978`               <dbl> 0.058928563, 0.169557133, 0.118014276, 0.07289536…
$ `1979`               <dbl> 0.055471421, 0.167985704, 0.137814274, 0.07497930…
$ `1980`               <dbl> 0.061338085, 0.162957134, 0.137342845, 0.07695692…
$ `1981`               <dbl> 0.047299993, 0.170499990, 0.094338086, 0.07719083…
$ `1982`               <dbl> 0.062804752, 0.167828561, 0.098161897, 0.07710577…
$ `1983`               <dbl> 0.060395229, 0.172857133, 0.115604751, 0.08206045…
$ `1984`               <dbl> 0.087004749, 0.177571418, 0.178461884, 0.08212424…
$ `1985`               <dbl> 0.087738082, 0.177571418, 0.179195215, 0.08401680…
$ `1986`               <dbl> 0.081347606, 0.177571418, 0.159028554, 0.08718524…
$ `1987`               <dbl> 0.088680940, 0.160599991, 0.160704744, 0.08931171…
$ `1988`               <dbl> 0.079566653, 0.169085705, 0.174480932, 0.08293230…
$ `1989`               <dbl> 0.079776177, 0.192971418, 0.170080931, 0.06760046…
$ `1990`               <dbl> 0.06940475, 0.16531427, 0.17505712, 0.06344168, 0…
$ `1991`               <dbl> 0.06846189, 0.03504285, 0.16678093, 0.05928290, 0…
$ `1992`               <dbl> 0.062542847, 0.023257139, 0.257190449, 0.05512411…
$ `1993`               <dbl> 0.06285713, 0.02734285, 0.28191425, 0.05096534, 0…
$ `1994`               <dbl> NA, 0.03153333, 0.46865232, 0.04680655, 0.7025856…
$ `1995`               <dbl> NA, 0.02828571, 0.54339994, 0.04264777, 0.8184523…
$ `1996`               <dbl> 0.007857141, 0.015400000, 0.823218949, 0.03848899…
$ `1997`               <dbl> 0.007857141, 0.010633333, 0.786971343, 0.02514285…
$ `1998`               <dbl> 0.008642856, 0.038761902, 0.677442783, 0.03017142…
$ `1999`               <dbl> 0.007857141, 0.011838094, 0.733542780, 0.01917142…
$ `2000`               <dbl> 0.007857141, 0.016552381, 0.762142773, 0.02200000…
$ `2001`               <dbl> 0.03802857, 0.02241905, 0.86899989, 0.01571429, 1…
$ `2002`               <dbl> 0.032685708, 0.051699997, 0.845323705, 0.01382857…
$ `2003`               <dbl> 0.036142851, 0.038604761, 1.162542746, 0.00471428…
$ `2004`               <dbl> 0.04085714, 0.06002857, 1.36017605, 0.04965714, 1…
$ `2005`               <dbl> 0.050285708, 0.010319047, 1.034942762, 0.04462857…
$ `2006`               <dbl> NA, 0.042166665, 1.244414158, NA, 1.500923675, 0.…
$ `2007`               <dbl> NA, 0.041799998, 1.363790346, 0.022628567, 1.4407…
$ `2008`               <dbl> NA, 0.030171428, 0.917242772, 0.011314284, 1.4446…
$ `2009`               <dbl> NA, 0.026714283, 0.807085630, 0.013985712, 1.4902…
$ `2010`               <dbl> NA, 0.02262857, 1.03452369, 0.01100000, 1.6955712…
$ `2011`               <dbl> NA, 0.030171426, 1.115399856, 0.010999998, 1.8322…
$ `2012`               <dbl> 0.062857131, 0.054161900, 0.924680837, 0.01885714…
$ `2013`               <dbl> 0.055157133, 0.033995236, 1.000004634, 0.03284285…
$ `2014`               <dbl> 0.084490461, 0.053480948, 1.146618895, 0.03520000…
$ `2015`               <dbl> 0.116966646, 0.056519042, 0.892257036, 0.03394285…
$ `2016`               <dbl> 0.163009495, 0.047404756, 1.385790295, 0.03944285…
$ `2017`               <dbl> 0.311299945, 0.054161900, 1.278409352, 0.05248571…
$ `2018`               <dbl> 0.161333305, 0.046461900, 1.592118833, 0.05107142…
$ `2019`               <dbl> 0.151276165, 0.066419041, 1.703061665, 0.06102380…
$ `2020`               <dbl> 0.064795227, 0.060552376, 1.925471156, 0.05583809…
$ `2021`               <dbl> 0.046692374, 0.048347615, 2.124204458, 0.04462857…
$ `2022`               <dbl> 0.023617115, 0.048536186, 2.270615839, 0.04491142…
$ `2023`               <dbl> 0.015142367, 0.048954648, 2.405894067, 0.04523671…

We can see that some aspects of the data are already long, such as the sector and the substance. But we can see each year takes a different column we can use pivot longer to convert from a wide format to a long formatted data set.
GHG_emissions is in units of MtCO2eq/yr, metric tons of co2 per year

emissions <- df %>%
  pivot_longer(
    cols = !c("Substance", "Sector", "EDGAR Country Code", "Country"),
    names_to = "year",
    values_to = "GHG_emissions"
  )
head(emissions)

# A tibble: 6 × 6
  Substance Sector      `EDGAR Country Code` Country     year  GHG_emissions
  <chr>     <chr>       <chr>                <chr>       <chr>         <dbl>
1 CO2       Agriculture AFG                  Afghanistan 1970         0.0292
2 CO2       Agriculture AFG                  Afghanistan 1971         0.0292
3 CO2       Agriculture AFG                  Afghanistan 1972         0.0292
4 CO2       Agriculture AFG                  Afghanistan 1973         0.0292
5 CO2       Agriculture AFG                  Afghanistan 1974         0.0400
6 CO2       Agriculture AFG                  Afghanistan 1975         0.0453

We also need to convert the years from characters to numeric types.

emissions$year <- as.numeric(emissions$year)
head(emissions, 8)

# A tibble: 8 × 6
  Substance Sector      `EDGAR Country Code` Country      year GHG_emissions
  <chr>     <chr>       <chr>                <chr>       <dbl>         <dbl>
1 CO2       Agriculture AFG                  Afghanistan  1970        0.0292
2 CO2       Agriculture AFG                  Afghanistan  1971        0.0292
3 CO2       Agriculture AFG                  Afghanistan  1972        0.0292
4 CO2       Agriculture AFG                  Afghanistan  1973        0.0292
5 CO2       Agriculture AFG                  Afghanistan  1974        0.0400
6 CO2       Agriculture AFG                  Afghanistan  1975        0.0453
7 CO2       Agriculture AFG                  Afghanistan  1976        0.0506
8 CO2       Agriculture AFG                  Afghanistan  1977        0.0614

Our data is now in a longer format with good typed columns We can check NA values, from glimpse we saw we have several na values.

colSums(is.na(emissions))

         Substance             Sector EDGAR Country Code            Country 
               108                108                108                108 
              year      GHG_emissions 
                 0               5781

We can filter to see which of these values are NA, which we should remove or just leave. We can first look at the 108 NA

na_emissions <- emissions %>%
  filter(is.na(Substance))
head(na_emissions)

# A tibble: 6 × 6
  Substance Sector `EDGAR Country Code` Country  year GHG_emissions
  <chr>     <chr>  <chr>                <chr>   <dbl>         <dbl>
1 <NA>      <NA>   <NA>                 <NA>     1970            NA
2 <NA>      <NA>   <NA>                 <NA>     1971            NA
3 <NA>      <NA>   <NA>                 <NA>     1972            NA
4 <NA>      <NA>   <NA>                 <NA>     1973            NA
5 <NA>      <NA>   <NA>                 <NA>     1974            NA
6 <NA>      <NA>   <NA>                 <NA>     1975            NA

The 108 values can be dropped. These are just empty rows.

emissions <- emissions %>% drop_na(Substance)
colSums(is.na(emissions))

         Substance             Sector EDGAR Country Code            Country 
                 0                  0                  0                  0 
              year      GHG_emissions 
                 0               5673

We can now look at our GHG values.

na_emissions <- emissions %>%
  filter(is.na(GHG_emissions))
head(na_emissions)

# A tibble: 6 × 6
  Substance Sector      `EDGAR Country Code` Country      year GHG_emissions
  <chr>     <chr>       <chr>                <chr>       <dbl>         <dbl>
1 CO2       Agriculture AFG                  Afghanistan  1994            NA
2 CO2       Agriculture AFG                  Afghanistan  1995            NA
3 CO2       Agriculture AFG                  Afghanistan  2006            NA
4 CO2       Agriculture AFG                  Afghanistan  2007            NA
5 CO2       Agriculture AFG                  Afghanistan  2008            NA
6 CO2       Agriculture AFG                  Afghanistan  2009            NA

It looks like the GHG_emissions value can be NA pretty often. I don’t have a good way to replace these values so I will also drop them. It could range from the data not being reported for particular years or other reasons as to why this occurs. We wont be able to graph these points anyway, we can keep this removal in mind for the analysis.

emissions <- emissions %>% drop_na()
colSums(is.na(emissions))

         Substance             Sector EDGAR Country Code            Country 
                 0                  0                  0                  0 
              year      GHG_emissions 
                 0                  0

We now have a table

Analysis

The main analysis asked for by our classmate is to show the increase and decrease of emissions of countries over time. To start with I will create a table showing the total emissions of each substance. We will be able to see which substances have the most (quantity) emissions.

emissions_by_substance <- emissions %>% 
  group_by(Substance) %>%
    summarise(emissions = sum(GHG_emissions)) %>%
      arrange(desc(emissions))

emissions_by_substance %>%
  gt() %>%
    cols_label(
      `Substance` = "Substance",
      emissions = "Emissions Metric Tons per Year"
    ) %>%
    tab_header(title = md("Emissions Global Total by Substance"))

Substance	Emissions Metric Tons per Year
Emissions Global Total by Substance
CO2	3040805.90
GWP_100_AR5_CH4	904760.02
GWP_100_AR5_N2O	229499.36
GWP_100_AR5_F-gases	67922.29

We can see that CO2 Is by far the highest in terms of metric tons per year. This makes sense because CO2 is by far the most widely known green house gas.

I will also show a table with the global emissions based on the sector. We can see that different sectors have different effects on the global emissions, in particular the Power Industry has the greatest proportion, almost doubling the next two sectors at around 100,000 metrics tons per year.

emissions_by_sector <- emissions %>% 
  group_by(Sector) %>%
    summarise(emissions = sum(GHG_emissions)) %>%
      arrange(desc(emissions))

emissions_by_sector %>%
  gt() %>%
    cols_label(
      `Sector` = "Sector",
      emissions = "Emissions Metric Tons per Year"
    ) %>%
    tab_header(title = md("Emissions Global Total by Substance"))

Sector	Emissions Metric Tons per Year
Emissions Global Total by Substance
Power Industry	1058497.6
Transport	634946.8
Agriculture	617019.0
Industrial Combustion	538102.7
Fuel Exploitation	489624.2
Buildings	427911.5
Processes	317300.9
Waste	159584.9

I will create a data frame with the combined emissions from all sectors. I will keep the substances separate, I dont know much about hte sustances, if different ones have different levels of danger in different quantities.

emissions_yearly <- emissions %>%
  group_by(Country,Substance, year) %>%
    summarize(emissions = sum(GHG_emissions), .groups = "keep")

head(emissions_yearly)

# A tibble: 6 × 4
# Groups:   Country, Substance, year [6]
  Country     Substance  year emissions
  <chr>       <chr>     <dbl>     <dbl>
1 Afghanistan CO2        1970      1.73
2 Afghanistan CO2        1971      1.73
3 Afghanistan CO2        1972      1.69
4 Afghanistan CO2        1973      1.73
5 Afghanistan CO2        1974      2.19
6 Afghanistan CO2        1975      2.03

Our classmate mentioned looking at emissions starting at the year 2000.
We now have the data grouped by country, substance and year. Lets create a plot showing the world wide total emissions by substance from 2000 onwards.

emissions_global_2000_onwards <- emissions_yearly %>%
  filter(year >= 2000) %>%
  group_by(Substance, year)%>%
    summarize(emissions = sum(emissions), .groups = "keep")

ggplot(data = emissions_global_2000_onwards, aes(x = year, y = emissions, color = `Substance`)
       ) +
      geom_line() +
      ggtitle("Global Emissions Since 2000 by substance") +
      xlab("Year") +
      ylab("Emissions (Metric Tons per Year)") + 
      labs(color = "Substance")

Here we have a bit more context to our previous table. We can see that while all of the substances have increased in volume CO2 has increased significantly. There have been some years where CO2 usage has decreased but as of 2023 it was back at an all time high.

I am interested in seeing specific countries like the United States and China’s CO2 Emissions since 2000.

countries = c("United States", "China")
china_vs_us_CO2 <- emissions %>%
  filter(year >= 2000) %>%
    filter(Country %in% countries) %>%
      filter(Substance == "CO2") %>% 
        group_by(Country, year) %>%
          summarise(CO2_Emissions = sum(GHG_emissions), .groups = "keep")

ggplot(data = china_vs_us_CO2, aes(x = year, y = CO2_Emissions, color = Country)) +
    geom_line() +
    ggtitle("CO2 Emissions of China and US") +
    xlab("Year") +
    ylab("CO2 Emissions (Metric Tons per Year)") + 
    labs(color = "Country")

We can add more countries to the list to compare more, I selected China and United States because I believe they are percieved to be some of the largest culprits of green house gas emissions. We can see here that the United States has a fair amount of CO2 Emissions that have more or less decreased within the last 20 years. China on the other hand has had a large increase in emissions to more than double the US’s emissions. Keep in mind the graph here did not start y at 0 so it looks worse that it is. These were some things we could analyze using the tidied data set.

Summarize

To summarize, we were able to clean and tidy the data set suggested by our classmate about different emissions in different countries over time. After tiding this data set we were able to look at different types of emissions globally as well as how different sectors affected the global emissions total. We were also able to look at the increase in emissions through a timeplot for several substances and finally compared specific countries, in particular the US and China.