Project 2.2

Emissions Data

Loading the data

This dataset tracks emissions data by country from the years 2000 to 2020. The information was compiled from the United Nations Food and Agriculture Organization and uploaded to Kaggle for use. Emissions from CH4, N20, CO2, and CO2 equivalents. source: https://www.kaggle.com/datasets/justin2028/total-emissions-per-country-2000-2020?resource=download

For additional analysis, I have included population by country which I downloaded from World Bank Group. source: https://data.worldbank.org/indicator/SP.POP.TOTL?end=2020&start=2000

To begin we will download the csv file, and take a look at its current structure.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

emiss_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Total%20Emissions%20Per%20Country%20(2000-2020).csv", check.names = FALSE)

head(emiss_raw)

##          Area             Item                          Element       Unit
## 1 Afghanistan    Crop Residues           Direct emissions (N2O) kilotonnes
## 2 Afghanistan    Crop Residues         Indirect emissions (N2O) kilotonnes
## 3 Afghanistan    Crop Residues                  Emissions (N2O) kilotonnes
## 4 Afghanistan    Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan    Crop Residues          Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation                  Emissions (CH4) kilotonnes
##      2000     2001     2002     2003     2004     2005     2006     2007
## 1   0.520   0.5267   0.8200   0.9988   0.8225   1.1821   1.0277   1.2426
## 2   0.117   0.1185   0.1845   0.2247   0.1851   0.2660   0.2312   0.2796
## 3   0.637   0.6452   1.0045   1.2235   1.0075   1.4481   1.2589   1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6  18.200  16.9400  18.9000  20.3000  27.3000  22.4000  22.4000  23.8000
##       2008     2009     2010     2011     2012     2013     2014     2015
## 1   0.8869   1.3920   1.2742   1.0321   1.3726   1.4018   1.4584   1.2424
## 2   0.1996   0.3132   0.2867   0.2322   0.3088   0.3154   0.3281   0.2795
## 3   1.0865   1.7051   1.5609   1.2643   1.6815   1.7173   1.7865   1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6  26.6000  28.0000  29.1200  29.4000  28.7000  28.7000  30.8000  22.9600
##       2016     2017     2018     2019     2020
## 1   1.1940   1.0617   0.8988   1.2176   1.3170
## 2   0.2687   0.2389   0.2022   0.2740   0.2963
## 3   1.4627   1.3005   1.1011   1.4916   1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6  16.6600  15.3233  16.4555  17.8542  20.6577

We can see that the different years recorded have been put into different columns. The dataset is wide because of it. The dataset is untidy in structure because when new data is recorded you will need to add a column for each new year.

Re-arranging the data

We want to re-structure the data, as such we will:

Pivot Longer to create a longer dataframe from teh side format moving the Year into a rows by Year and their values into a new column for emissions
I want to split the Element into emissions type and the gas emitted
Convert the Year to numeric

emiss_tidy <- emiss_raw |>
  pivot_longer(cols = 5:last_col(),
               names_to = "Year",
               values_to = "Emissions",
               values_drop_na = TRUE) |>
  # Here I am splitting element into the emission and gas types
  separate(col = Element,
           into = c("Emission_Type", "Gas_Type"),
           sep = "\\(",
           extra = "merge",
           fill = "right") |>
  # Re-ordering the columns and renaming some of them
  select(Country = Area, 
         Year, 
         Emission_Source = Item, 
         Emission_Type,
         Gas_Type, 
         Unit, 
         Emissions) |>
  # Cast the year as a number and remove ()
  mutate(Year = as.numeric(Year),
         Gas_Type = str_replace_all(Gas_Type, "[()]", ""))

head(emiss_tidy)

## # A tibble: 6 × 7
##   Country      Year Emission_Source Emission_Type       Gas_Type Unit  Emissions
##   <chr>       <dbl> <chr>           <chr>               <chr>    <chr>     <dbl>
## 1 Afghanistan  2000 Crop Residues   "Direct emissions " N2O      kilo…     0.52 
## 2 Afghanistan  2001 Crop Residues   "Direct emissions " N2O      kilo…     0.527
## 3 Afghanistan  2002 Crop Residues   "Direct emissions " N2O      kilo…     0.82 
## 4 Afghanistan  2003 Crop Residues   "Direct emissions " N2O      kilo…     0.999
## 5 Afghanistan  2004 Crop Residues   "Direct emissions " N2O      kilo…     0.822
## 6 Afghanistan  2005 Crop Residues   "Direct emissions " N2O      kilo…     1.18

Let us now import the data on population totals by country.

pop_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/API_SP.POP.TOTL_DS2_en_csv_v2_76253.csv")
head(pop_raw)

##                  Country.Name Country.Code    Indicator.Name Indicator.Code
## 1                       Aruba          ABW Population, total    SP.POP.TOTL
## 2 Africa Eastern and Southern          AFE Population, total    SP.POP.TOTL
## 3                 Afghanistan          AFG Population, total    SP.POP.TOTL
## 4  Africa Western and Central          AFW Population, total    SP.POP.TOTL
## 5                      Angola          AGO Population, total    SP.POP.TOTL
## 6                     Albania          ALB Population, total    SP.POP.TOTL
##       X1960     X1961     X1962     X1963     X1964     X1965     X1966
## 1     54922     55578     56320     57002     57619     58190     58694
## 2 130072080 133534923 137171659 140945536 144904094 149033472 153281203
## 3   9035043   9214083   9404406   9604487   9814318  10036008  10266395
## 4  97630925  99706674 101854756 104089175 106388440 108772632 111246953
## 5   5231654   5301583   5354310   5408320   5464187   5521981   5581386
## 6   1608800   1659800   1711319   1762621   1814135   1864791   1914573
##       X1967     X1968     X1969     X1970     X1971     X1972     X1973
## 1     58990     59069     59052     58950     58781     58047     58299
## 2 157704381 162329396 167088245 171984985 177022314 182126556 187524135
## 3  10505959  10756922  11017409  11290128  11567667  11853696  12157999
## 4 113795019 116444636 119203521 122086536 125072948 128176494 131449942
## 5   5641807   5702699   5763685   5852788   5991102   6174262   6388528
## 6   1965598   2022272   2081695   2135479   2187853   2243126   2296752
##       X1974     X1975     X1976     X1977     X1978     X1979     X1980
## 1     58349     58295     58368     58580     58776     59191     59909
## 2 193186642 198914573 204802976 210680842 217074286 223974122 230792729
## 3  12469127  12773954  13059851  13340756  13611441  13655567  13169311
## 4 134911581 138569918 142337272 146258576 150402616 154721711 159166518
## 5   6613367   6842947   7074664   7317829   7576734   7847207   8133872
## 6   2350124   2404831   2458526   2513546   2566266   2617832   2671997
##       X1981     X1982     X1983     X1984     X1985     X1986     X1987
## 1     60563     61276     62228     62901     61728     59931     59159
## 2 238043099 245822010 253644643 261458202 269450407 277621771 286067346
## 3  11937581  10991378  10917982  11190221  11426852  11420074  11387818
## 4 163762473 168585118 173255157 177880746 182811038 187889141 193104347
## 5   8435607   8751648   9082983   9425917   9779120  10139450  10497858
## 6   2726056   2784278   2843960   2904429   2964762   3022635   3083605
##       X1988     X1989     X1990     X1991     X1992     X1993     X1994
## 1     59331     60443     62753     65896     69005     73685     77595
## 2 294498625 302939121 311748681 320442961 329082707 338324002 347441809
## 3  11523298  11874088  12045660  12238879  13278974  14943172  16250794
## 4 198485027 204062274 209566031 215178709 221191375 227246778 233360104
## 5  10861291  11238562  11626360  12023529  12423712  12827135  13249764
## 6   3142336   3227943   3286542   3266790   3247039   3227287   3207536
##       X1995     X1996     X1997     X1998     X1999     X2000     X2001
## 1     79805     83021     86301     88451     89659     90588     91439
## 2 356580375 366138524 375646235 385505757 395750933 406156661 416807868
## 3  17065836  17763266  18452091  19159996  19887785  20130327  20284307
## 4 239801875 246415446 253207584 260297834 267506298 274968446 282780717
## 5  13699778  14170973  14660413  15159370  15667235  16194869  16747208
## 6   3187784   3168033   3148281   3128530   3108778   3089027   3060173
##       X2002     X2003     X2004     X2005     X2006     X2007     X2008
## 1     92074     93128     95138     97635     99405    100150    100917
## 2 427820358 439173286 450928044 463076637 475606210 488580707 502070763
## 3  21378117  22733049  23560654  24404567  25424094  25909852  26482622
## 4 290841795 299142845 307725100 316588476 325663158 334984176 344586109
## 5  17327699  17943712  18600423  19291161  20015279  20778561  21578655
## 6   3051010   3039616   3026939   3011487   2992547   2970017   2947314
##       X2009     X2010     X2011     X2012     X2013     X2014     X2015
## 1    101604    101838    102591    104110    105675    106807    107906
## 2 516003448 530308387 544737983 559609961 575202699 590968990 607123269
## 3  27466101  28284089  29347708  30560034  31622704  32792523  33831764
## 4 354343844 364358270 374790143 385360349 396030207 406992047 418127845
## 5  22414773  23294825  24218352  25177394  26165620  27160769  28157798
## 6   2927519   2913021   2905195   2900401   2895092   2889104   2880703
##       X2016     X2017     X2018     X2019     X2020     X2021     X2022
## 1    108727    108735    108908    109203    108587    107700    107310
## 2 623369401 640058741 657801085 675950189 694446100 713090928 731821393
## 3  34700612  35688935  36743039  37856121  39068979  40000412  40578842
## 4 429454743 440882906 452195915 463365429 474569351 485920997 497387180
## 5  29183070  30234839  31297155  32375632  33451132  34532429  35635029
## 6   2876101   2873457   2866376   2854191   2837849   2811666   2777689
##       X2023
## 1    107359
## 2 750503764
## 3  41454761
## 4 509398589
## 5  36749906
## 6   2745972

The dataset is also untidy and wide in structure with columns for each year. Lets clean that up

pop_tidy <- pop_raw |>
  pivot_longer(cols = starts_with("X"),
               names_to = "Year",
               values_to = "Population") |>
  mutate(Year = as.numeric(str_remove(Year,"^X"))) |>
  rename(Country = Country.Name) |>
  filter(Year >= 2000, Year <= 2020) |>
  select(Country, Year, Population)

head(pop_tidy)

## # A tibble: 6 × 3
##   Country  Year Population
##   <chr>   <dbl>      <dbl>
## 1 Aruba    2000      90588
## 2 Aruba    2001      91439
## 3 Aruba    2002      92074
## 4 Aruba    2003      93128
## 5 Aruba    2004      95138
## 6 Aruba    2005      97635

Now lets merge the data from the emissions and population dataframes. For the purposes of this analysis we will only keep the rows where a match can be made, so we will be using an inner join.

emiss_pop_data <- emiss_tidy |>
  inner_join(pop_tidy, by = c("Country", "Year"))

head(emiss_pop_data)

## # A tibble: 6 × 8
##   Country      Year Emission_Source Emission_Type       Gas_Type Unit  Emissions
##   <chr>       <dbl> <chr>           <chr>               <chr>    <chr>     <dbl>
## 1 Afghanistan  2000 Crop Residues   "Direct emissions " N2O      kilo…     0.52 
## 2 Afghanistan  2001 Crop Residues   "Direct emissions " N2O      kilo…     0.527
## 3 Afghanistan  2002 Crop Residues   "Direct emissions " N2O      kilo…     0.82 
## 4 Afghanistan  2003 Crop Residues   "Direct emissions " N2O      kilo…     0.999
## 5 Afghanistan  2004 Crop Residues   "Direct emissions " N2O      kilo…     0.822
## 6 Afghanistan  2005 Crop Residues   "Direct emissions " N2O      kilo…     1.18 
## # ℹ 1 more variable: Population <dbl>

While reviewing the results, there are some rows that if groubed by everything except the last emissions column seem to be duplicates. This may have been from how the data was aggregated from multiple sources at the time the CSV was created. Below we can see some examples of the duplication.

# Examples of duplicates
emiss_dup <- emiss_pop_data |>
  group_by(Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit) |>
  filter(n() > 1)

head(emiss_dup)

## # A tibble: 6 × 8
## # Groups:   Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit [3]
##   Country      Year Emission_Source       Emission_Type Gas_Type Unit  Emissions
##   <chr>       <dbl> <chr>                 <chr>         <chr>    <chr>     <dbl>
## 1 Afghanistan  2005 Burning - Crop resid… "Emissions "  CH4      kilo…    3.48  
## 2 Afghanistan  2013 Burning - Crop resid… "Emissions "  CH4      kilo…    3.45  
## 3 Afghanistan  2005 Burning - Crop resid… "Emissions "  CH4      kilo…   10.1   
## 4 Afghanistan  2013 Burning - Crop resid… "Emissions "  CH4      kilo…    5     
## 5 Afghanistan  2005 Burning - Crop resid… "Emissions "  N2O      kilo…    0.0901
## 6 Afghanistan  2005 Burning - Crop resid… "Emissions "  N2O      kilo…    0.021 
## # ℹ 1 more variable: Population <dbl>

We will combine the emissions amounts for those duplicate rows for the purpose of this assignment.

emiss_pop_data <- emiss_pop_data |>
  group_by(Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit, Population) |>
  summarize(Emissions = sum(Emissions, na.rm = TRUE), .groups = "drop")

head(emiss_pop_data)

## # A tibble: 6 × 8
##   Country      Year Emission_Source Emission_Type Gas_Type      Unit  Population
##   <chr>       <dbl> <chr>           <chr>         <chr>         <chr>      <dbl>
## 1 Afghanistan  2000 AFOLU           "Emissions "  CH4           kilo…   20130327
## 2 Afghanistan  2000 AFOLU           "Emissions "  CO2           kilo…   20130327
## 3 Afghanistan  2000 AFOLU           "Emissions "  CO2eq AR5     kilo…   20130327
## 4 Afghanistan  2000 AFOLU           "Emissions "  CO2eq from C… kilo…   20130327
## 5 Afghanistan  2000 AFOLU           "Emissions "  CO2eq from N… kilo…   20130327
## 6 Afghanistan  2000 AFOLU           "Emissions "  N2O           kilo…   20130327
## # ℹ 1 more variable: Emissions <dbl>

Now we will create an emissions per capita column

emiss_pop_data <- emiss_pop_data |>
  mutate(Emissions_Per_Capita = round((Emissions) / Population, 2))

head(emiss_pop_data)

## # A tibble: 6 × 9
##   Country      Year Emission_Source Emission_Type Gas_Type      Unit  Population
##   <chr>       <dbl> <chr>           <chr>         <chr>         <chr>      <dbl>
## 1 Afghanistan  2000 AFOLU           "Emissions "  CH4           kilo…   20130327
## 2 Afghanistan  2000 AFOLU           "Emissions "  CO2           kilo…   20130327
## 3 Afghanistan  2000 AFOLU           "Emissions "  CO2eq AR5     kilo…   20130327
## 4 Afghanistan  2000 AFOLU           "Emissions "  CO2eq from C… kilo…   20130327
## 5 Afghanistan  2000 AFOLU           "Emissions "  CO2eq from N… kilo…   20130327
## 6 Afghanistan  2000 AFOLU           "Emissions "  N2O           kilo…   20130327
## # ℹ 2 more variables: Emissions <dbl>, Emissions_Per_Capita <dbl>

Data Analysis

With such robust data we can look to answer many questions regarding Emissions. Here are some we seek to answer:

Lets start with which single source of emissions contributed to the total emissions and which contributes the least?

# Finding sources of the highest emissions
emiss_pop_data |> 
  group_by(Emission_Source) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            .groups = "drop") |>
  arrange(desc(Total_emiss_cap))

## # A tibble: 42 × 2
##    Emission_Source                Total_emiss_cap
##    <chr>                                    <dbl>
##  1 All sectors with LULUCF                  53.2 
##  2 All sectors without LULUCF               47.4 
##  3 Energy                                   36.8 
##  4 Agri-food systems                        12.5 
##  5 Emissions on agricultural land            9.33
##  6 AFOLU                                     7.14
##  7 IPCC Agriculture                          5.75
##  8 Net Forest conversion                     4.54
##  9 Farm-gate emissions                       3.78
## 10 Land Use change                           3.72
## # ℹ 32 more rows

# Finding sources of the lowest emissions
emiss_pop_data |> 
  group_by(Emission_Source) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            .groups = "drop") |>
  arrange(Total_emiss_cap)

## # A tibble: 42 × 2
##    Emission_Source             Total_emiss_cap
##    <chr>                                 <dbl>
##  1 Forestland                            -6.42
##  2 LULUCF                                -5.33
##  3 Burning - Crop residues                0   
##  4 Crop Residues                          0   
##  5 Drained organic soils (N2O)            0   
##  6 Food Household Consumption             0   
##  7 Food Retail                            0   
##  8 International bunkers                  0   
##  9 Manure Management                      0   
## 10 Manure applied to Soils                0   
## # ℹ 32 more rows

Which years had the highest overall emissions? and which had the lowest?

# Finding year of the highest emissions
emiss_pop_data |> 
  group_by(Year) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            .groups = "drop") |>
  arrange(desc(Total_emiss_cap))

## # A tibble: 21 × 2
##     Year Total_emiss_cap
##    <dbl>           <dbl>
##  1  2001            9.29
##  2  2000            9.12
##  3  2017            9.03
##  4  2016            8.96
##  5  2019            8.96
##  6  2002            8.92
##  7  2018            8.9 
##  8  2011            8.84
##  9  2012            8.8 
## 10  2003            8.43
## # ℹ 11 more rows

# Finding years of the lowest emissions
emiss_pop_data |> 
  group_by(Year) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            .groups = "drop") |>
  arrange((Total_emiss_cap))

## # A tibble: 21 × 2
##     Year Total_emiss_cap
##    <dbl>           <dbl>
##  1  2010            7.77
##  2  2009            7.83
##  3  2006            8.03
##  4  2007            8.05
##  5  2008            8.05
##  6  2015            8.14
##  7  2020            8.22
##  8  2013            8.35
##  9  2014            8.35
## 10  2004            8.41
## # ℹ 11 more rows

Lets see which countries contribute the most overall emissions per capita and what their average and highest emissions are:

emiss_pop_data |> 
  group_by(Country) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            Avg_emissions_per_capita = round(mean(Emissions_Per_Capita),2),
            Highest_emission_per_capita = round(max(Emissions_Per_Capita),2),
            Lowest_emission_per_capita = round(min(Emissions_Per_Capita),2),
            .groups = "drop") |>
  arrange(desc(Total_emiss_cap))

## # A tibble: 186 × 5
##    Country         Total_emiss_cap Avg_emissions_per_ca…¹ Highest_emission_per…²
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 Qatar                     12.6                       0                   0.12
##  2 Trinidad and T…            7.71                      0                   0.06
##  3 Botswana                   6.79                      0                   0.03
##  4 Mongolia                   6.02                      0                   0.03
##  5 Bahrain                    5.79                      0                   0.06
##  6 Australia                  5.56                      0                   0.04
##  7 Kuwait                     5                         0                   0.05
##  8 Paraguay                   4.87                      0                   0.03
##  9 Brunei Darussa…            4.65                      0                   0.05
## 10 Belize                     4.47                      0                   0.06
## # ℹ 176 more rows
## # ℹ abbreviated names: ¹Avg_emissions_per_capita, ²Highest_emission_per_capita
## # ℹ 1 more variable: Lowest_emission_per_capita <dbl>

which gas type is emitted most?

emiss_pop_data |> 
  group_by(Gas_Type) |>
  summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
            Avg_emissions_per_capita = round(mean(Emissions_Per_Capita),2),
            Highest_emission_per_capita = round(max(Emissions_Per_Capita),2),
            Lowest_emission_per_capita = round(min(Emissions_Per_Capita),2),
            .groups = "drop") |>
  arrange(desc(Total_emiss_cap))

## # A tibble: 7 × 5
##   Gas_Type         Total_emiss_cap Avg_emissions_per_ca…¹ Highest_emission_per…²
##   <chr>                      <dbl>                  <dbl>                  <dbl>
## 1 CO2eq AR5                 102.                        0                   0.21
## 2 CO2                        58.3                       0                   0.07
## 3 CO2eq from CH4 …           17.9                       0                   0.21
## 4 CO2eq from N2O …            0.2                       0                   0.04
## 5 CH4                         0.05                      0                   0.01
## 6 CO2eq from F-ga…            0                         0                   0   
## 7 N2O                         0                         0                   0   
## # ℹ abbreviated names: ¹Avg_emissions_per_capita, ²Highest_emission_per_capita
## # ℹ 1 more variable: Lowest_emission_per_capita <dbl>

Lets view the global emissions over the 20 years.

emiss_pop_data |>
  group_by(Year,Gas_Type) |>
  summarise(Total_emissions = sum(Emissions, na.rm = TRUE), 
            .groups = "drop") |>
  ggplot(aes(x = Year, 
             y = Total_emissions,
             color = Gas_Type,
             group = Gas_Type)) +
  geom_line() +
  geom_point() +
  labs(title = "Global Emissions 2000 - 2020",
       x = "Year",
       y = "Total Emissions",
       color = "Gas_Type")

Further considerations: For future enhancements it would be helpful to know how the different emissions datasets were aggregated to create the kaggle dataset. By identifying the true cause of duplication we can better handle those rows.

The data is very nuanced, are there any ways that we can further group the data to draw some big picture takeaways?

Are there datasets we can use as a supplement in determining what may be the cause in increased or decreased emissions for the different gas types?

Project 2.2

Cindy Lin & William Forero

Emissions Data

Loading the data

Re-arranging the data

Data Analysis