This dataset tracks emissions data by country from the years 2000 to 2020. The information was compiled from the United Nations Food and Agriculture Organization and uploaded to Kaggle for use. Emissions from CH4, N20, CO2, and CO2 equivalents. source: https://www.kaggle.com/datasets/justin2028/total-emissions-per-country-2000-2020?resource=download
For additional analysis, I have included population by country which I downloaded from World Bank Group. source: https://data.worldbank.org/indicator/SP.POP.TOTL?end=2020&start=2000
To begin we will download the csv file, and take a look at its current structure.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
emiss_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Total%20Emissions%20Per%20Country%20(2000-2020).csv", check.names = FALSE)
head(emiss_raw)
## Area Item Element Unit
## 1 Afghanistan Crop Residues Direct emissions (N2O) kilotonnes
## 2 Afghanistan Crop Residues Indirect emissions (N2O) kilotonnes
## 3 Afghanistan Crop Residues Emissions (N2O) kilotonnes
## 4 Afghanistan Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan Crop Residues Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation Emissions (CH4) kilotonnes
## 2000 2001 2002 2003 2004 2005 2006 2007
## 1 0.520 0.5267 0.8200 0.9988 0.8225 1.1821 1.0277 1.2426
## 2 0.117 0.1185 0.1845 0.2247 0.1851 0.2660 0.2312 0.2796
## 3 0.637 0.6452 1.0045 1.2235 1.0075 1.4481 1.2589 1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6 18.200 16.9400 18.9000 20.3000 27.3000 22.4000 22.4000 23.8000
## 2008 2009 2010 2011 2012 2013 2014 2015
## 1 0.8869 1.3920 1.2742 1.0321 1.3726 1.4018 1.4584 1.2424
## 2 0.1996 0.3132 0.2867 0.2322 0.3088 0.3154 0.3281 0.2795
## 3 1.0865 1.7051 1.5609 1.2643 1.6815 1.7173 1.7865 1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6 26.6000 28.0000 29.1200 29.4000 28.7000 28.7000 30.8000 22.9600
## 2016 2017 2018 2019 2020
## 1 1.1940 1.0617 0.8988 1.2176 1.3170
## 2 0.2687 0.2389 0.2022 0.2740 0.2963
## 3 1.4627 1.3005 1.1011 1.4916 1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6 16.6600 15.3233 16.4555 17.8542 20.6577
We can see that the different years recorded have been put into different columns. The dataset is wide because of it. The dataset is untidy in structure because when new data is recorded you will need to add a column for each new year.
We want to re-structure the data, as such we will:
emiss_tidy <- emiss_raw |>
pivot_longer(cols = 5:last_col(),
names_to = "Year",
values_to = "Emissions",
values_drop_na = TRUE) |>
# Here I am splitting element into the emission and gas types
separate(col = Element,
into = c("Emission_Type", "Gas_Type"),
sep = "\\(",
extra = "merge",
fill = "right") |>
# Re-ordering the columns and renaming some of them
select(Country = Area,
Year,
Emission_Source = Item,
Emission_Type,
Gas_Type,
Unit,
Emissions) |>
# Cast the year as a number and remove ()
mutate(Year = as.numeric(Year),
Gas_Type = str_replace_all(Gas_Type, "[()]", ""))
head(emiss_tidy)
## # A tibble: 6 × 7
## Country Year Emission_Source Emission_Type Gas_Type Unit Emissions
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2000 Crop Residues "Direct emissions " N2O kilo… 0.52
## 2 Afghanistan 2001 Crop Residues "Direct emissions " N2O kilo… 0.527
## 3 Afghanistan 2002 Crop Residues "Direct emissions " N2O kilo… 0.82
## 4 Afghanistan 2003 Crop Residues "Direct emissions " N2O kilo… 0.999
## 5 Afghanistan 2004 Crop Residues "Direct emissions " N2O kilo… 0.822
## 6 Afghanistan 2005 Crop Residues "Direct emissions " N2O kilo… 1.18
Let us now import the data on population totals by country.
pop_raw <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/API_SP.POP.TOTL_DS2_en_csv_v2_76253.csv")
head(pop_raw)
## Country.Name Country.Code Indicator.Name Indicator.Code
## 1 Aruba ABW Population, total SP.POP.TOTL
## 2 Africa Eastern and Southern AFE Population, total SP.POP.TOTL
## 3 Afghanistan AFG Population, total SP.POP.TOTL
## 4 Africa Western and Central AFW Population, total SP.POP.TOTL
## 5 Angola AGO Population, total SP.POP.TOTL
## 6 Albania ALB Population, total SP.POP.TOTL
## X1960 X1961 X1962 X1963 X1964 X1965 X1966
## 1 54922 55578 56320 57002 57619 58190 58694
## 2 130072080 133534923 137171659 140945536 144904094 149033472 153281203
## 3 9035043 9214083 9404406 9604487 9814318 10036008 10266395
## 4 97630925 99706674 101854756 104089175 106388440 108772632 111246953
## 5 5231654 5301583 5354310 5408320 5464187 5521981 5581386
## 6 1608800 1659800 1711319 1762621 1814135 1864791 1914573
## X1967 X1968 X1969 X1970 X1971 X1972 X1973
## 1 58990 59069 59052 58950 58781 58047 58299
## 2 157704381 162329396 167088245 171984985 177022314 182126556 187524135
## 3 10505959 10756922 11017409 11290128 11567667 11853696 12157999
## 4 113795019 116444636 119203521 122086536 125072948 128176494 131449942
## 5 5641807 5702699 5763685 5852788 5991102 6174262 6388528
## 6 1965598 2022272 2081695 2135479 2187853 2243126 2296752
## X1974 X1975 X1976 X1977 X1978 X1979 X1980
## 1 58349 58295 58368 58580 58776 59191 59909
## 2 193186642 198914573 204802976 210680842 217074286 223974122 230792729
## 3 12469127 12773954 13059851 13340756 13611441 13655567 13169311
## 4 134911581 138569918 142337272 146258576 150402616 154721711 159166518
## 5 6613367 6842947 7074664 7317829 7576734 7847207 8133872
## 6 2350124 2404831 2458526 2513546 2566266 2617832 2671997
## X1981 X1982 X1983 X1984 X1985 X1986 X1987
## 1 60563 61276 62228 62901 61728 59931 59159
## 2 238043099 245822010 253644643 261458202 269450407 277621771 286067346
## 3 11937581 10991378 10917982 11190221 11426852 11420074 11387818
## 4 163762473 168585118 173255157 177880746 182811038 187889141 193104347
## 5 8435607 8751648 9082983 9425917 9779120 10139450 10497858
## 6 2726056 2784278 2843960 2904429 2964762 3022635 3083605
## X1988 X1989 X1990 X1991 X1992 X1993 X1994
## 1 59331 60443 62753 65896 69005 73685 77595
## 2 294498625 302939121 311748681 320442961 329082707 338324002 347441809
## 3 11523298 11874088 12045660 12238879 13278974 14943172 16250794
## 4 198485027 204062274 209566031 215178709 221191375 227246778 233360104
## 5 10861291 11238562 11626360 12023529 12423712 12827135 13249764
## 6 3142336 3227943 3286542 3266790 3247039 3227287 3207536
## X1995 X1996 X1997 X1998 X1999 X2000 X2001
## 1 79805 83021 86301 88451 89659 90588 91439
## 2 356580375 366138524 375646235 385505757 395750933 406156661 416807868
## 3 17065836 17763266 18452091 19159996 19887785 20130327 20284307
## 4 239801875 246415446 253207584 260297834 267506298 274968446 282780717
## 5 13699778 14170973 14660413 15159370 15667235 16194869 16747208
## 6 3187784 3168033 3148281 3128530 3108778 3089027 3060173
## X2002 X2003 X2004 X2005 X2006 X2007 X2008
## 1 92074 93128 95138 97635 99405 100150 100917
## 2 427820358 439173286 450928044 463076637 475606210 488580707 502070763
## 3 21378117 22733049 23560654 24404567 25424094 25909852 26482622
## 4 290841795 299142845 307725100 316588476 325663158 334984176 344586109
## 5 17327699 17943712 18600423 19291161 20015279 20778561 21578655
## 6 3051010 3039616 3026939 3011487 2992547 2970017 2947314
## X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 101604 101838 102591 104110 105675 106807 107906
## 2 516003448 530308387 544737983 559609961 575202699 590968990 607123269
## 3 27466101 28284089 29347708 30560034 31622704 32792523 33831764
## 4 354343844 364358270 374790143 385360349 396030207 406992047 418127845
## 5 22414773 23294825 24218352 25177394 26165620 27160769 28157798
## 6 2927519 2913021 2905195 2900401 2895092 2889104 2880703
## X2016 X2017 X2018 X2019 X2020 X2021 X2022
## 1 108727 108735 108908 109203 108587 107700 107310
## 2 623369401 640058741 657801085 675950189 694446100 713090928 731821393
## 3 34700612 35688935 36743039 37856121 39068979 40000412 40578842
## 4 429454743 440882906 452195915 463365429 474569351 485920997 497387180
## 5 29183070 30234839 31297155 32375632 33451132 34532429 35635029
## 6 2876101 2873457 2866376 2854191 2837849 2811666 2777689
## X2023
## 1 107359
## 2 750503764
## 3 41454761
## 4 509398589
## 5 36749906
## 6 2745972
The dataset is also untidy and wide in structure with columns for each year. Lets clean that up
pop_tidy <- pop_raw |>
pivot_longer(cols = starts_with("X"),
names_to = "Year",
values_to = "Population") |>
mutate(Year = as.numeric(str_remove(Year,"^X"))) |>
rename(Country = Country.Name) |>
filter(Year >= 2000, Year <= 2020) |>
select(Country, Year, Population)
head(pop_tidy)
## # A tibble: 6 × 3
## Country Year Population
## <chr> <dbl> <dbl>
## 1 Aruba 2000 90588
## 2 Aruba 2001 91439
## 3 Aruba 2002 92074
## 4 Aruba 2003 93128
## 5 Aruba 2004 95138
## 6 Aruba 2005 97635
Now lets merge the data from the emissions and population dataframes. For the purposes of this analysis we will only keep the rows where a match can be made, so we will be using an inner join.
emiss_pop_data <- emiss_tidy |>
inner_join(pop_tidy, by = c("Country", "Year"))
head(emiss_pop_data)
## # A tibble: 6 × 8
## Country Year Emission_Source Emission_Type Gas_Type Unit Emissions
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2000 Crop Residues "Direct emissions " N2O kilo… 0.52
## 2 Afghanistan 2001 Crop Residues "Direct emissions " N2O kilo… 0.527
## 3 Afghanistan 2002 Crop Residues "Direct emissions " N2O kilo… 0.82
## 4 Afghanistan 2003 Crop Residues "Direct emissions " N2O kilo… 0.999
## 5 Afghanistan 2004 Crop Residues "Direct emissions " N2O kilo… 0.822
## 6 Afghanistan 2005 Crop Residues "Direct emissions " N2O kilo… 1.18
## # ℹ 1 more variable: Population <dbl>
While reviewing the results, there are some rows that if groubed by everything except the last emissions column seem to be duplicates. This may have been from how the data was aggregated from multiple sources at the time the CSV was created. Below we can see some examples of the duplication.
# Examples of duplicates
emiss_dup <- emiss_pop_data |>
group_by(Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit) |>
filter(n() > 1)
head(emiss_dup)
## # A tibble: 6 × 8
## # Groups: Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit [3]
## Country Year Emission_Source Emission_Type Gas_Type Unit Emissions
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2005 Burning - Crop resid… "Emissions " CH4 kilo… 3.48
## 2 Afghanistan 2013 Burning - Crop resid… "Emissions " CH4 kilo… 3.45
## 3 Afghanistan 2005 Burning - Crop resid… "Emissions " CH4 kilo… 10.1
## 4 Afghanistan 2013 Burning - Crop resid… "Emissions " CH4 kilo… 5
## 5 Afghanistan 2005 Burning - Crop resid… "Emissions " N2O kilo… 0.0901
## 6 Afghanistan 2005 Burning - Crop resid… "Emissions " N2O kilo… 0.021
## # ℹ 1 more variable: Population <dbl>
We will combine the emissions amounts for those duplicate rows for the purpose of this assignment.
emiss_pop_data <- emiss_pop_data |>
group_by(Country, Year, Emission_Source, Emission_Type, Gas_Type, Unit, Population) |>
summarize(Emissions = sum(Emissions, na.rm = TRUE), .groups = "drop")
head(emiss_pop_data)
## # A tibble: 6 × 8
## Country Year Emission_Source Emission_Type Gas_Type Unit Population
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2000 AFOLU "Emissions " CH4 kilo… 20130327
## 2 Afghanistan 2000 AFOLU "Emissions " CO2 kilo… 20130327
## 3 Afghanistan 2000 AFOLU "Emissions " CO2eq AR5 kilo… 20130327
## 4 Afghanistan 2000 AFOLU "Emissions " CO2eq from C… kilo… 20130327
## 5 Afghanistan 2000 AFOLU "Emissions " CO2eq from N… kilo… 20130327
## 6 Afghanistan 2000 AFOLU "Emissions " N2O kilo… 20130327
## # ℹ 1 more variable: Emissions <dbl>
Now we will create an emissions per capita column
emiss_pop_data <- emiss_pop_data |>
mutate(Emissions_Per_Capita = round((Emissions) / Population, 2))
head(emiss_pop_data)
## # A tibble: 6 × 9
## Country Year Emission_Source Emission_Type Gas_Type Unit Population
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 2000 AFOLU "Emissions " CH4 kilo… 20130327
## 2 Afghanistan 2000 AFOLU "Emissions " CO2 kilo… 20130327
## 3 Afghanistan 2000 AFOLU "Emissions " CO2eq AR5 kilo… 20130327
## 4 Afghanistan 2000 AFOLU "Emissions " CO2eq from C… kilo… 20130327
## 5 Afghanistan 2000 AFOLU "Emissions " CO2eq from N… kilo… 20130327
## 6 Afghanistan 2000 AFOLU "Emissions " N2O kilo… 20130327
## # ℹ 2 more variables: Emissions <dbl>, Emissions_Per_Capita <dbl>
With such robust data we can look to answer many questions regarding Emissions. Here are some we seek to answer:
Lets start with which single source of emissions contributed to the total emissions and which contributes the least?
# Finding sources of the highest emissions
emiss_pop_data |>
group_by(Emission_Source) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
.groups = "drop") |>
arrange(desc(Total_emiss_cap))
## # A tibble: 42 × 2
## Emission_Source Total_emiss_cap
## <chr> <dbl>
## 1 All sectors with LULUCF 53.2
## 2 All sectors without LULUCF 47.4
## 3 Energy 36.8
## 4 Agri-food systems 12.5
## 5 Emissions on agricultural land 9.33
## 6 AFOLU 7.14
## 7 IPCC Agriculture 5.75
## 8 Net Forest conversion 4.54
## 9 Farm-gate emissions 3.78
## 10 Land Use change 3.72
## # ℹ 32 more rows
# Finding sources of the lowest emissions
emiss_pop_data |>
group_by(Emission_Source) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
.groups = "drop") |>
arrange(Total_emiss_cap)
## # A tibble: 42 × 2
## Emission_Source Total_emiss_cap
## <chr> <dbl>
## 1 Forestland -6.42
## 2 LULUCF -5.33
## 3 Burning - Crop residues 0
## 4 Crop Residues 0
## 5 Drained organic soils (N2O) 0
## 6 Food Household Consumption 0
## 7 Food Retail 0
## 8 International bunkers 0
## 9 Manure Management 0
## 10 Manure applied to Soils 0
## # ℹ 32 more rows
Which years had the highest overall emissions? and which had the lowest?
# Finding year of the highest emissions
emiss_pop_data |>
group_by(Year) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
.groups = "drop") |>
arrange(desc(Total_emiss_cap))
## # A tibble: 21 × 2
## Year Total_emiss_cap
## <dbl> <dbl>
## 1 2001 9.29
## 2 2000 9.12
## 3 2017 9.03
## 4 2016 8.96
## 5 2019 8.96
## 6 2002 8.92
## 7 2018 8.9
## 8 2011 8.84
## 9 2012 8.8
## 10 2003 8.43
## # ℹ 11 more rows
# Finding years of the lowest emissions
emiss_pop_data |>
group_by(Year) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
.groups = "drop") |>
arrange((Total_emiss_cap))
## # A tibble: 21 × 2
## Year Total_emiss_cap
## <dbl> <dbl>
## 1 2010 7.77
## 2 2009 7.83
## 3 2006 8.03
## 4 2007 8.05
## 5 2008 8.05
## 6 2015 8.14
## 7 2020 8.22
## 8 2013 8.35
## 9 2014 8.35
## 10 2004 8.41
## # ℹ 11 more rows
Lets see which countries contribute the most overall emissions per capita and what their average and highest emissions are:
emiss_pop_data |>
group_by(Country) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
Avg_emissions_per_capita = round(mean(Emissions_Per_Capita),2),
Highest_emission_per_capita = round(max(Emissions_Per_Capita),2),
Lowest_emission_per_capita = round(min(Emissions_Per_Capita),2),
.groups = "drop") |>
arrange(desc(Total_emiss_cap))
## # A tibble: 186 × 5
## Country Total_emiss_cap Avg_emissions_per_ca…¹ Highest_emission_per…²
## <chr> <dbl> <dbl> <dbl>
## 1 Qatar 12.6 0 0.12
## 2 Trinidad and T… 7.71 0 0.06
## 3 Botswana 6.79 0 0.03
## 4 Mongolia 6.02 0 0.03
## 5 Bahrain 5.79 0 0.06
## 6 Australia 5.56 0 0.04
## 7 Kuwait 5 0 0.05
## 8 Paraguay 4.87 0 0.03
## 9 Brunei Darussa… 4.65 0 0.05
## 10 Belize 4.47 0 0.06
## # ℹ 176 more rows
## # ℹ abbreviated names: ¹​Avg_emissions_per_capita, ²​Highest_emission_per_capita
## # ℹ 1 more variable: Lowest_emission_per_capita <dbl>
which gas type is emitted most?
emiss_pop_data |>
group_by(Gas_Type) |>
summarise(Total_emiss_cap = round(sum(Emissions_Per_Capita, na.rm = TRUE),2),
Avg_emissions_per_capita = round(mean(Emissions_Per_Capita),2),
Highest_emission_per_capita = round(max(Emissions_Per_Capita),2),
Lowest_emission_per_capita = round(min(Emissions_Per_Capita),2),
.groups = "drop") |>
arrange(desc(Total_emiss_cap))
## # A tibble: 7 × 5
## Gas_Type Total_emiss_cap Avg_emissions_per_ca…¹ Highest_emission_per…²
## <chr> <dbl> <dbl> <dbl>
## 1 CO2eq AR5 102. 0 0.21
## 2 CO2 58.3 0 0.07
## 3 CO2eq from CH4 … 17.9 0 0.21
## 4 CO2eq from N2O … 0.2 0 0.04
## 5 CH4 0.05 0 0.01
## 6 CO2eq from F-ga… 0 0 0
## 7 N2O 0 0 0
## # ℹ abbreviated names: ¹​Avg_emissions_per_capita, ²​Highest_emission_per_capita
## # ℹ 1 more variable: Lowest_emission_per_capita <dbl>
Lets view the global emissions over the 20 years.
emiss_pop_data |>
group_by(Year,Gas_Type) |>
summarise(Total_emissions = sum(Emissions, na.rm = TRUE),
.groups = "drop") |>
ggplot(aes(x = Year,
y = Total_emissions,
color = Gas_Type,
group = Gas_Type)) +
geom_line() +
geom_point() +
labs(title = "Global Emissions 2000 - 2020",
x = "Year",
y = "Total Emissions",
color = "Gas_Type")
Further considerations: For future enhancements it would be helpful to know how the different emissions datasets were aggregated to create the kaggle dataset. By identifying the true cause of duplication we can better handle those rows.
The data is very nuanced, are there any ways that we can further group the data to draw some big picture takeaways?
Are there datasets we can use as a supplement in determining what may be the cause in increased or decreased emissions for the different gas types?