library(tidyverse)
library(dplyr)
library(ggplot2)
library(gt)607_Project2C_DylanGold
Codebase 3
Importing data
For my 3rd file I picked a dataset containing emission data based on sector and country. This was obtained from a classmates post from dicussion 5A. Muhammad Khan posted https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions#Fossil_carbon_dioxide_emissions_by_country
From the link I got this csv file containing the green house gas emissions of countries by sector.
I uploaded the file to github then imported this file.
url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Project-2/refs/heads/main/FileC/EDGAR_2024_GHG_booklet_2024.csv"
df <- read_csv(
file = url,
col_names = TRUE
)
head(df, 8)# A tibble: 8 × 58
Substance Sector `EDGAR Country Code` Country `1970` `1971` `1972` `1973`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 CO2 Agricu… AFG Afghan… 0.0292 0.0292 0.0292 0.0292
2 CO2 Agricu… ALB Albania 0.113 0.113 0.113 0.113
3 CO2 Agricu… ARG Argent… 0.104 0.104 0.104 0.104
4 CO2 Agricu… ARM Armenia 0.0553 0.0553 0.0553 0.0553
5 CO2 Agricu… AUS Austra… 0.311 0.311 0.311 0.311
6 CO2 Agricu… AUT Austria 0.00644 0.00644 0.00644 0.00644
7 CO2 Agricu… AZE Azerba… 0.0415 0.0415 0.0415 0.0415
8 CO2 Agricu… BEL Belgium 0.0417 0.0417 0.0417 0.0417
# ℹ 50 more variables: `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
# `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
# `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
# `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
# `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
# `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
# `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, …
I will use glimpse to see the columns of this data set
glimpse(df)Rows: 4,853
Columns: 58
$ Substance <chr> "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", …
$ Sector <chr> "Agriculture", "Agriculture", "Agriculture", "Agr…
$ `EDGAR Country Code` <chr> "AFG", "ALB", "ARG", "ARM", "AUS", "AUT", "AZE", …
$ Country <chr> "Afghanistan", "Albania", "Argentina", "Armenia",…
$ `1970` <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1971` <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1972` <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1973` <dbl> 0.029228567, 0.113300000, 0.104342850, 0.05528820…
$ `1974` <dbl> 0.039966661, 0.113614286, 0.087214278, 0.05996643…
$ `1975` <dbl> 0.045309517, 0.112514286, 0.077314278, 0.05996643…
$ `1976` <dbl> 0.050599993, 0.127757140, 0.070085708, 0.06343258…
$ `1977` <dbl> 0.061442848, 0.135142853, 0.111099990, 0.07115166…
$ `1978` <dbl> 0.058928563, 0.169557133, 0.118014276, 0.07289536…
$ `1979` <dbl> 0.055471421, 0.167985704, 0.137814274, 0.07497930…
$ `1980` <dbl> 0.061338085, 0.162957134, 0.137342845, 0.07695692…
$ `1981` <dbl> 0.047299993, 0.170499990, 0.094338086, 0.07719083…
$ `1982` <dbl> 0.062804752, 0.167828561, 0.098161897, 0.07710577…
$ `1983` <dbl> 0.060395229, 0.172857133, 0.115604751, 0.08206045…
$ `1984` <dbl> 0.087004749, 0.177571418, 0.178461884, 0.08212424…
$ `1985` <dbl> 0.087738082, 0.177571418, 0.179195215, 0.08401680…
$ `1986` <dbl> 0.081347606, 0.177571418, 0.159028554, 0.08718524…
$ `1987` <dbl> 0.088680940, 0.160599991, 0.160704744, 0.08931171…
$ `1988` <dbl> 0.079566653, 0.169085705, 0.174480932, 0.08293230…
$ `1989` <dbl> 0.079776177, 0.192971418, 0.170080931, 0.06760046…
$ `1990` <dbl> 0.06940475, 0.16531427, 0.17505712, 0.06344168, 0…
$ `1991` <dbl> 0.06846189, 0.03504285, 0.16678093, 0.05928290, 0…
$ `1992` <dbl> 0.062542847, 0.023257139, 0.257190449, 0.05512411…
$ `1993` <dbl> 0.06285713, 0.02734285, 0.28191425, 0.05096534, 0…
$ `1994` <dbl> NA, 0.03153333, 0.46865232, 0.04680655, 0.7025856…
$ `1995` <dbl> NA, 0.02828571, 0.54339994, 0.04264777, 0.8184523…
$ `1996` <dbl> 0.007857141, 0.015400000, 0.823218949, 0.03848899…
$ `1997` <dbl> 0.007857141, 0.010633333, 0.786971343, 0.02514285…
$ `1998` <dbl> 0.008642856, 0.038761902, 0.677442783, 0.03017142…
$ `1999` <dbl> 0.007857141, 0.011838094, 0.733542780, 0.01917142…
$ `2000` <dbl> 0.007857141, 0.016552381, 0.762142773, 0.02200000…
$ `2001` <dbl> 0.03802857, 0.02241905, 0.86899989, 0.01571429, 1…
$ `2002` <dbl> 0.032685708, 0.051699997, 0.845323705, 0.01382857…
$ `2003` <dbl> 0.036142851, 0.038604761, 1.162542746, 0.00471428…
$ `2004` <dbl> 0.04085714, 0.06002857, 1.36017605, 0.04965714, 1…
$ `2005` <dbl> 0.050285708, 0.010319047, 1.034942762, 0.04462857…
$ `2006` <dbl> NA, 0.042166665, 1.244414158, NA, 1.500923675, 0.…
$ `2007` <dbl> NA, 0.041799998, 1.363790346, 0.022628567, 1.4407…
$ `2008` <dbl> NA, 0.030171428, 0.917242772, 0.011314284, 1.4446…
$ `2009` <dbl> NA, 0.026714283, 0.807085630, 0.013985712, 1.4902…
$ `2010` <dbl> NA, 0.02262857, 1.03452369, 0.01100000, 1.6955712…
$ `2011` <dbl> NA, 0.030171426, 1.115399856, 0.010999998, 1.8322…
$ `2012` <dbl> 0.062857131, 0.054161900, 0.924680837, 0.01885714…
$ `2013` <dbl> 0.055157133, 0.033995236, 1.000004634, 0.03284285…
$ `2014` <dbl> 0.084490461, 0.053480948, 1.146618895, 0.03520000…
$ `2015` <dbl> 0.116966646, 0.056519042, 0.892257036, 0.03394285…
$ `2016` <dbl> 0.163009495, 0.047404756, 1.385790295, 0.03944285…
$ `2017` <dbl> 0.311299945, 0.054161900, 1.278409352, 0.05248571…
$ `2018` <dbl> 0.161333305, 0.046461900, 1.592118833, 0.05107142…
$ `2019` <dbl> 0.151276165, 0.066419041, 1.703061665, 0.06102380…
$ `2020` <dbl> 0.064795227, 0.060552376, 1.925471156, 0.05583809…
$ `2021` <dbl> 0.046692374, 0.048347615, 2.124204458, 0.04462857…
$ `2022` <dbl> 0.023617115, 0.048536186, 2.270615839, 0.04491142…
$ `2023` <dbl> 0.015142367, 0.048954648, 2.405894067, 0.04523671…
We can see that some aspects of the data are already long, such as the sector and the substance. But we can see each year takes a different column we can use pivot longer to convert from a wide format to a long formatted data set.
GHG_emissions is in units of MtCO2eq/yr, metric tons of co2 per year
emissions <- df %>%
pivot_longer(
cols = !c("Substance", "Sector", "EDGAR Country Code", "Country"),
names_to = "year",
values_to = "GHG_emissions"
)
head(emissions)# A tibble: 6 × 6
Substance Sector `EDGAR Country Code` Country year GHG_emissions
<chr> <chr> <chr> <chr> <chr> <dbl>
1 CO2 Agriculture AFG Afghanistan 1970 0.0292
2 CO2 Agriculture AFG Afghanistan 1971 0.0292
3 CO2 Agriculture AFG Afghanistan 1972 0.0292
4 CO2 Agriculture AFG Afghanistan 1973 0.0292
5 CO2 Agriculture AFG Afghanistan 1974 0.0400
6 CO2 Agriculture AFG Afghanistan 1975 0.0453
We also need to convert the years from characters to numeric types.
emissions$year <- as.numeric(emissions$year)
head(emissions, 8)# A tibble: 8 × 6
Substance Sector `EDGAR Country Code` Country year GHG_emissions
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 CO2 Agriculture AFG Afghanistan 1970 0.0292
2 CO2 Agriculture AFG Afghanistan 1971 0.0292
3 CO2 Agriculture AFG Afghanistan 1972 0.0292
4 CO2 Agriculture AFG Afghanistan 1973 0.0292
5 CO2 Agriculture AFG Afghanistan 1974 0.0400
6 CO2 Agriculture AFG Afghanistan 1975 0.0453
7 CO2 Agriculture AFG Afghanistan 1976 0.0506
8 CO2 Agriculture AFG Afghanistan 1977 0.0614
Our data is now in a longer format with good typed columns We can check NA values, from glimpse we saw we have several na values.
colSums(is.na(emissions)) Substance Sector EDGAR Country Code Country
108 108 108 108
year GHG_emissions
0 5781
We can filter to see which of these values are NA, which we should remove or just leave. We can first look at the 108 NA
na_emissions <- emissions %>%
filter(is.na(Substance))
head(na_emissions)# A tibble: 6 × 6
Substance Sector `EDGAR Country Code` Country year GHG_emissions
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 <NA> <NA> <NA> <NA> 1970 NA
2 <NA> <NA> <NA> <NA> 1971 NA
3 <NA> <NA> <NA> <NA> 1972 NA
4 <NA> <NA> <NA> <NA> 1973 NA
5 <NA> <NA> <NA> <NA> 1974 NA
6 <NA> <NA> <NA> <NA> 1975 NA
The 108 values can be dropped. These are just empty rows.
emissions <- emissions %>% drop_na(Substance)
colSums(is.na(emissions)) Substance Sector EDGAR Country Code Country
0 0 0 0
year GHG_emissions
0 5673
We can now look at our GHG values.
na_emissions <- emissions %>%
filter(is.na(GHG_emissions))
head(na_emissions)# A tibble: 6 × 6
Substance Sector `EDGAR Country Code` Country year GHG_emissions
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 CO2 Agriculture AFG Afghanistan 1994 NA
2 CO2 Agriculture AFG Afghanistan 1995 NA
3 CO2 Agriculture AFG Afghanistan 2006 NA
4 CO2 Agriculture AFG Afghanistan 2007 NA
5 CO2 Agriculture AFG Afghanistan 2008 NA
6 CO2 Agriculture AFG Afghanistan 2009 NA
It looks like the GHG_emissions value can be NA pretty often. I don’t have a good way to replace these values so I will also drop them. It could range from the data not being reported for particular years or other reasons as to why this occurs. We wont be able to graph these points anyway, we can keep this removal in mind for the analysis.
emissions <- emissions %>% drop_na()
colSums(is.na(emissions)) Substance Sector EDGAR Country Code Country
0 0 0 0
year GHG_emissions
0 0
We now have a table
Analysis
The main analysis asked for by our classmate is to show the increase and decrease of emissions of countries over time. To start with I will create a table showing the total emissions of each substance. We will be able to see which substances have the most (quantity) emissions.
emissions_by_substance <- emissions %>%
group_by(Substance) %>%
summarise(emissions = sum(GHG_emissions)) %>%
arrange(desc(emissions))
emissions_by_substance %>%
gt() %>%
cols_label(
`Substance` = "Substance",
emissions = "Emissions Metric Tons per Year"
) %>%
tab_header(title = md("Emissions Global Total by Substance"))| Emissions Global Total by Substance | |
| Substance | Emissions Metric Tons per Year |
|---|---|
| CO2 | 3040805.90 |
| GWP_100_AR5_CH4 | 904760.02 |
| GWP_100_AR5_N2O | 229499.36 |
| GWP_100_AR5_F-gases | 67922.29 |
We can see that CO2 Is by far the highest in terms of metric tons per year. This makes sense because CO2 is by far the most widely known green house gas.
I will also show a table with the global emissions based on the sector. We can see that different sectors have different effects on the global emissions, in particular the Power Industry has the greatest proportion, almost doubling the next two sectors at around 100,000 metrics tons per year.
emissions_by_sector <- emissions %>%
group_by(Sector) %>%
summarise(emissions = sum(GHG_emissions)) %>%
arrange(desc(emissions))
emissions_by_sector %>%
gt() %>%
cols_label(
`Sector` = "Sector",
emissions = "Emissions Metric Tons per Year"
) %>%
tab_header(title = md("Emissions Global Total by Substance"))| Emissions Global Total by Substance | |
| Sector | Emissions Metric Tons per Year |
|---|---|
| Power Industry | 1058497.6 |
| Transport | 634946.8 |
| Agriculture | 617019.0 |
| Industrial Combustion | 538102.7 |
| Fuel Exploitation | 489624.2 |
| Buildings | 427911.5 |
| Processes | 317300.9 |
| Waste | 159584.9 |
I will create a data frame with the combined emissions from all sectors. I will keep the substances separate, I dont know much about hte sustances, if different ones have different levels of danger in different quantities.
emissions_yearly <- emissions %>%
group_by(Country,Substance, year) %>%
summarize(emissions = sum(GHG_emissions), .groups = "keep")
head(emissions_yearly)# A tibble: 6 × 4
# Groups: Country, Substance, year [6]
Country Substance year emissions
<chr> <chr> <dbl> <dbl>
1 Afghanistan CO2 1970 1.73
2 Afghanistan CO2 1971 1.73
3 Afghanistan CO2 1972 1.69
4 Afghanistan CO2 1973 1.73
5 Afghanistan CO2 1974 2.19
6 Afghanistan CO2 1975 2.03
Our classmate mentioned looking at emissions starting at the year 2000.
We now have the data grouped by country, substance and year. Lets create a plot showing the world wide total emissions by substance from 2000 onwards.
emissions_global_2000_onwards <- emissions_yearly %>%
filter(year >= 2000) %>%
group_by(Substance, year)%>%
summarize(emissions = sum(emissions), .groups = "keep")
ggplot(data = emissions_global_2000_onwards, aes(x = year, y = emissions, color = `Substance`)
) +
geom_line() +
ggtitle("Global Emissions Since 2000 by substance") +
xlab("Year") +
ylab("Emissions (Metric Tons per Year)") +
labs(color = "Substance")Here we have a bit more context to our previous table. We can see that while all of the substances have increased in volume CO2 has increased significantly. There have been some years where CO2 usage has decreased but as of 2023 it was back at an all time high.
I am interested in seeing specific countries like the United States and China’s CO2 Emissions since 2000.
countries = c("United States", "China")
china_vs_us_CO2 <- emissions %>%
filter(year >= 2000) %>%
filter(Country %in% countries) %>%
filter(Substance == "CO2") %>%
group_by(Country, year) %>%
summarise(CO2_Emissions = sum(GHG_emissions), .groups = "keep")
ggplot(data = china_vs_us_CO2, aes(x = year, y = CO2_Emissions, color = Country)) +
geom_line() +
ggtitle("CO2 Emissions of China and US") +
xlab("Year") +
ylab("CO2 Emissions (Metric Tons per Year)") +
labs(color = "Country")We can add more countries to the list to compare more, I selected China and United States because I believe they are percieved to be some of the largest culprits of green house gas emissions. We can see here that the United States has a fair amount of CO2 Emissions that have more or less decreased within the last 20 years. China on the other hand has had a large increase in emissions to more than double the US’s emissions. Keep in mind the graph here did not start y at 0 so it looks worse that it is. These were some things we could analyze using the tidied data set.
Summarize
To summarize, we were able to clean and tidy the data set suggested by our classmate about different emissions in different countries over time. After tiding this data set we were able to look at different types of emissions globally as well as how different sectors affected the global emissions total. We were also able to look at the increase in emissions through a timeplot for several substances and finally compared specific countries, in particular the US and China.