Project 1

Author

Ayan Elmi

Introduction

My data set

My data set covers CO2 emissions and total greenhouse gas emissions for different countries from 1950 to 2022. I am investigating the CO2 emissions of each country over time. I was interested in this data set because I am an environmentally conscious person and I wanted to explore how CO2 emissions have changed over time using a visualization and a linear regression model.

My variables

I plan to explore the correlation between my variables:
Country - the name of the country being mentioned in the data set.
Measurement type - the type of emissions or greenhouse gases.
Quantity - the quantity of the measurement in kilotons.
Year - the year of measurement (1950-2022).
GDP (Gross Domestic Product) - GDP measures the monetary value of final goods and services. https://www.imf.org/en/Publications/fandd/issues/Series/Back-to-Basics/gross-domestic-product-GDP
Population - the total number of people inhabiting a country.

The source of my data set was Our World in Data. (I learnt how to make my words bold in Data 101.)

Loading libraries

library(tidyverse)
library(plotly)

Loading the dataset

setwd("~/Desktop/Data 110") #set working directory
emissions<- read_csv("emissions_co2.csv") 

Cleaning the data

names(emissions) <- tolower(names(emissions))  # make all column names lowercase, e.g. "Measurement Type" becomes "measurement type"
names(emissions) <- gsub("[(). \\-]", "_", names(emissions))  # replace spaces, dots, brackets and hyphens with underscores
head(emissions)  # verifying how it looks
# A tibble: 6 × 7
  country      year iso_code population        gdp measurement_type     quantity
  <chr>       <dbl> <chr>         <dbl>      <dbl> <chr>                   <dbl>
1 Afghanistan  1950 AFG         7776182 9421400064 cement_co2              0    
2 Afghanistan  1950 AFG         7776182 9421400064 cement_co2_per_capi…    0    
3 Afghanistan  1950 AFG         7776182 9421400064 co2                     0.084
4 Afghanistan  1950 AFG         7776182 9421400064 co2_per_capita          0.011
5 Afghanistan  1950 AFG         7776182 9421400064 co2_per_gdp             0.009
6 Afghanistan  1950 AFG         7776182 9421400064 co2_per_unit_energy    NA    
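
To see what the two renaming steps do on their own, here is a minimal sketch. The column names in it are made up for illustration and are not the exact headers of the raw file.

```r
# Hypothetical raw column names, invented for this illustration
raw_names <- c("Measurement Type", "ISO.Code", "GDP (USD)", "Per-Capita")

# Step 1: lowercase everything
lower_names <- tolower(raw_names)

# Step 2: replace spaces, dots, brackets and hyphens with underscores
clean_names <- gsub("[(). \\-]", "_", lower_names)

print(clean_names)
```

Each matched character becomes its own underscore, so a name such as "GDP (USD)" ends up with a double underscore in the middle and one at the end.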

Filtering for my predicted highest contributors to CO2 emissions

filtered_emissions<- emissions |>
  filter(measurement_type %in% c("total_ghg","cumulative_gas_co2",  
"cumulative_coal_co2", "cumulative_cement_co2")) |> 
  filter(quantity >=0)
head(filtered_emissions)
# A tibble: 6 × 7
  country      year iso_code population        gdp measurement_type     quantity
  <chr>       <dbl> <chr>         <dbl>      <dbl> <chr>                   <dbl>
1 Afghanistan  1950 AFG         7776182 9421400064 cumulative_cement_c…    0    
2 Afghanistan  1950 AFG         7776182 9421400064 cumulative_coal_co2     0.036
3 Afghanistan  1950 AFG         7776182 9421400064 cumulative_gas_co2      0    
4 Afghanistan  1950 AFG         7776182 9421400064 total_ghg              19.4  
5 Afghanistan  1951 AFG         7879343 9692279808 cumulative_cement_c…    0    
6 Afghanistan  1951 AFG         7879343 9692279808 cumulative_coal_co2     0.061
# I checked the separate measurement types that I thought contributed the most to the total greenhouse gas emissions and CO2 of each country. These were my initial predictions.
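
As a side note on how this filter treats missing values, the sketch below uses a small invented data frame (not the real emissions data). It shows that `filter(quantity >= 0)` drops rows where `quantity` is `NA` as well as negative ones, because `filter()` only keeps rows where the condition is `TRUE`.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  measurement_type = c("total_ghg", "total_ghg", "co2", "total_ghg"),
  quantity         = c(19.4, NA, 0.084, -1)
)

kept <- toy |>
  filter(measurement_type %in% c("total_ghg")) |>  # keep the chosen type
  filter(quantity >= 0)                            # drops negative and NA rows

print(kept)
```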

Widening my data so that specific measurement types become columns

wider_df <- filtered_emissions |>
  filter(!is.na(measurement_type) & !is.na(quantity)) |> # remove the NAs, then make each measurement type a separate column
  pivot_wider(names_from = measurement_type, values_from = quantity) 
head(wider_df)
# A tibble: 6 × 9
  country      year iso_code population         gdp cumulative_cement_co2
  <chr>       <dbl> <chr>         <dbl>       <dbl>                 <dbl>
1 Afghanistan  1950 AFG         7776182  9421400064                     0
2 Afghanistan  1951 AFG         7879343  9692279808                     0
3 Afghanistan  1952 AFG         7987783 10017325056                     0
4 Afghanistan  1953 AFG         8096703 10630519808                     0
5 Afghanistan  1954 AFG         8207953 10866360320                     0
6 Afghanistan  1955 AFG         8326981 11078185984                     0
# ℹ 3 more variables: cumulative_coal_co2 <dbl>, cumulative_gas_co2 <dbl>,
#   total_ghg <dbl>
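
The reshaping step can be sketched on a much smaller table. The data below is invented for illustration, and only `pivot_wider()` with `names_from` and `values_from` is taken from the code above.

```r
library(tidyr)

# Toy long-format data, invented for illustration only
long_df <- data.frame(
  country          = c("A", "A", "B", "B"),
  year             = c(1950, 1950, 1950, 1950),
  measurement_type = c("total_ghg", "cumulative_coal_co2",
                       "total_ghg", "cumulative_coal_co2"),
  quantity         = c(19.4, 0.036, 50.0, 1.2)
)

# One row per country-year; each measurement type becomes its own column
wide_df <- pivot_wider(long_df,
                       names_from  = measurement_type,
                       values_from = quantity)

print(wide_df)
```

The four long rows collapse into two wide rows, one per country and year, which is why the widened table has fewer rows and more columns than the filtered one.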

Linear regression equation

fit1 <- lm(data = emissions, quantity ~ measurement_type + population + gdp + year )
summary(fit1)

Call:
lm(formula = quantity ~ measurement_type + population + gdp + 
    year, data = emissions)

Residuals:
    Min      1Q  Median      3Q     Max 
 -87583      45     230     268 1684468 

Coefficients:
                                                            Estimate Std. Error
(Intercept)                                               -1.063e+03  1.421e+03
measurement_typecement_co2_per_capita                     -5.520e+00  1.423e+02
measurement_typeco2                                        1.705e+02  1.414e+02
measurement_typeco2_per_capita                             5.242e+00  1.414e+02
measurement_typeco2_per_gdp                                1.280e+00  1.414e+02
measurement_typeco2_per_unit_energy                       -1.073e+02  1.569e+02
measurement_typecoal_co2                                   1.462e+01  1.516e+02
measurement_typecoal_co2_per_capita                       -7.480e+01  1.516e+02
measurement_typeconsumption_co2                           -1.273e+01  1.962e+02
measurement_typeconsumption_co2_per_capita                -3.389e+02  1.962e+02
measurement_typeconsumption_co2_per_gdp                   -3.449e+02  1.962e+02
measurement_typecumulative_cement_co2                      1.276e+02  1.423e+02
measurement_typecumulative_co2                             6.748e+03  1.414e+02
measurement_typecumulative_coal_co2                        4.631e+03  1.516e+02
measurement_typecumulative_flaring_co2                    -3.323e+01  1.615e+02
measurement_typecumulative_gas_co2                         1.143e+03  1.635e+02
measurement_typecumulative_luc_co2                         6.141e+03  1.412e+02
measurement_typecumulative_oil_co2                         2.131e+03  1.415e+02
measurement_typecumulative_other_co2                      -9.138e+02  2.830e+02
measurement_typeenergy_per_capita                          2.539e+04  1.568e+02
measurement_typeenergy_per_gdp                            -1.052e+02  1.568e+02
measurement_typeflaring_co2                               -1.386e+02  1.615e+02
measurement_typeflaring_co2_per_capita                    -1.416e+02  1.615e+02
measurement_typegas_co2                                   -1.226e+02  1.635e+02
measurement_typegas_co2_per_capita                        -1.713e+02  1.635e+02
measurement_typeghg_per_capita                             1.419e+01  1.409e+02
measurement_typeland_use_change_co2                        5.492e+01  1.412e+02
measurement_typeland_use_change_co2_per_capita             7.624e+00  1.412e+02
measurement_typemethane                                    6.197e+01  1.409e+02
measurement_typemethane_per_capita                         7.256e+00  1.409e+02
measurement_typenitrous_oxide                              2.069e+01  1.409e+02
measurement_typenitrous_oxide_per_capita                   5.172e+00  1.409e+02
measurement_typeoil_co2                                    6.105e+01  1.415e+02
measurement_typeoil_co2_per_capita                         1.677e+00  1.415e+02
measurement_typeother_co2_per_capita                      -1.102e+03  2.830e+02
measurement_typeother_industry_co2                        -1.094e+03  2.830e+02
measurement_typeprimary_energy_consumption                 8.735e+02  1.568e+02
measurement_typeshare_global_cement_co2                   -4.841e+00  1.423e+02
measurement_typeshare_global_co2                           1.657e+00  1.414e+02
measurement_typeshare_global_co2_including_luc             5.209e-01  1.419e+02
measurement_typeshare_global_coal_co2                     -7.521e+01  1.516e+02
measurement_typeshare_global_cumulative_cement_co2        -4.836e+00  1.423e+02
measurement_typeshare_global_cumulative_co2                1.673e+00  1.414e+02
measurement_typeshare_global_cumulative_co2_including_luc  5.296e-01  1.419e+02
measurement_typeshare_global_cumulative_coal_co2          -7.519e+01  1.516e+02
measurement_typeshare_global_cumulative_flaring_co2       -1.408e+02  1.615e+02
measurement_typeshare_global_cumulative_gas_co2           -1.716e+02  1.635e+02
measurement_typeshare_global_cumulative_luc_co2            6.036e+00  1.412e+02
measurement_typeshare_global_cumulative_oil_co2            5.305e-01  1.415e+02
measurement_typeshare_global_cumulative_other_co2         -1.046e+03  2.900e+02
measurement_typeshare_global_flaring_co2                  -1.408e+02  1.615e+02
measurement_typeshare_global_gas_co2                      -1.717e+02  1.635e+02
measurement_typeshare_global_luc_co2                       6.003e+00  1.412e+02
measurement_typeshare_global_oil_co2                       5.154e-01  1.415e+02
measurement_typeshare_global_other_co2                    -1.046e+03  2.900e+02
measurement_typeshare_of_temperature_change_from_ghg       6.938e+00  1.407e+02
measurement_typetemperature_change_from_ch4                4.546e+00  1.409e+02
measurement_typetemperature_change_from_co2                6.189e+00  1.407e+02
measurement_typetemperature_change_from_ghg                6.191e+00  1.407e+02
measurement_typetemperature_change_from_n2o                4.544e+00  1.409e+02
measurement_typetotal_ghg                                  2.847e+02  1.409e+02
measurement_typetrade_co2                                 -3.454e+02  1.962e+02
measurement_typetrade_co2_share                           -3.205e+02  1.962e+02
population                                                -5.142e-07  1.244e-07
gdp                                                        6.742e-10  9.143e-12
year                                                       4.011e-01  7.130e-01
                                                          t value Pr(>|t|)    
(Intercept)                                                -0.748 0.454377    
measurement_typecement_co2_per_capita                      -0.039 0.969051    
measurement_typeco2                                         1.206 0.228005    
measurement_typeco2_per_capita                              0.037 0.970426    
measurement_typeco2_per_gdp                                 0.009 0.992775    
measurement_typeco2_per_unit_energy                        -0.684 0.494120    
measurement_typecoal_co2                                    0.096 0.923130    
measurement_typecoal_co2_per_capita                        -0.494 0.621648    
measurement_typeconsumption_co2                            -0.065 0.948250    
measurement_typeconsumption_co2_per_capita                 -1.728 0.084046 .  
measurement_typeconsumption_co2_per_gdp                    -1.758 0.078722 .  
measurement_typecumulative_cement_co2                       0.897 0.369718    
measurement_typecumulative_co2                             47.720  < 2e-16 ***
measurement_typecumulative_coal_co2                        30.557  < 2e-16 ***
measurement_typecumulative_flaring_co2                     -0.206 0.836997    
measurement_typecumulative_gas_co2                          6.992 2.72e-12 ***
measurement_typecumulative_luc_co2                         43.498  < 2e-16 ***
measurement_typecumulative_oil_co2                         15.052  < 2e-16 ***
measurement_typecumulative_other_co2                       -3.229 0.001244 ** 
measurement_typeenergy_per_capita                         161.894  < 2e-16 ***
measurement_typeenergy_per_gdp                             -0.671 0.502363    
measurement_typeflaring_co2                                -0.858 0.390793    
measurement_typeflaring_co2_per_capita                     -0.877 0.380692    
measurement_typegas_co2                                    -0.750 0.453176    
measurement_typegas_co2_per_capita                         -1.048 0.294570    
measurement_typeghg_per_capita                              0.101 0.919798    
measurement_typeland_use_change_co2                         0.389 0.697233    
measurement_typeland_use_change_co2_per_capita              0.054 0.956931    
measurement_typemethane                                     0.440 0.660171    
measurement_typemethane_per_capita                          0.051 0.958942    
measurement_typenitrous_oxide                               0.147 0.883323    
measurement_typenitrous_oxide_per_capita                    0.037 0.970727    
measurement_typeoil_co2                                     0.431 0.666239    
measurement_typeoil_co2_per_capita                          0.012 0.990547    
measurement_typeother_co2_per_capita                       -3.895 9.83e-05 ***
measurement_typeother_industry_co2                         -3.866 0.000111 ***
measurement_typeprimary_energy_consumption                  5.570 2.55e-08 ***
measurement_typeshare_global_cement_co2                    -0.034 0.972859    
measurement_typeshare_global_co2                            0.012 0.990653    
measurement_typeshare_global_co2_including_luc              0.004 0.997070    
measurement_typeshare_global_coal_co2                      -0.496 0.619738    
measurement_typeshare_global_cumulative_cement_co2         -0.034 0.972886    
measurement_typeshare_global_cumulative_co2                 0.012 0.990562    
measurement_typeshare_global_cumulative_co2_including_luc   0.004 0.997021    
measurement_typeshare_global_cumulative_coal_co2           -0.496 0.619832    
measurement_typeshare_global_cumulative_flaring_co2        -0.872 0.383255    
measurement_typeshare_global_cumulative_gas_co2            -1.050 0.293728    
measurement_typeshare_global_cumulative_luc_co2             0.043 0.965898    
measurement_typeshare_global_cumulative_oil_co2             0.004 0.997010    
measurement_typeshare_global_cumulative_other_co2          -3.607 0.000309 ***
measurement_typeshare_global_flaring_co2                   -0.872 0.383278    
measurement_typeshare_global_gas_co2                       -1.050 0.293700    
measurement_typeshare_global_luc_co2                        0.043 0.966081    
measurement_typeshare_global_oil_co2                        0.004 0.997095    
measurement_typeshare_global_other_co2                     -3.607 0.000309 ***
measurement_typeshare_of_temperature_change_from_ghg        0.049 0.960680    
measurement_typetemperature_change_from_ch4                 0.032 0.974271    
measurement_typetemperature_change_from_co2                 0.044 0.964924    
measurement_typetemperature_change_from_ghg                 0.044 0.964911    
measurement_typetemperature_change_from_n2o                 0.032 0.974280    
measurement_typetotal_ghg                                   2.020 0.043425 *  
measurement_typetrade_co2                                  -1.760 0.078352 .  
measurement_typetrade_co2_share                            -1.634 0.102289    
population                                                 -4.133 3.58e-05 ***
gdp                                                        73.739  < 2e-16 ***
year                                                        0.563 0.573747    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10510 on 549157 degrees of freedom
  (476417 observations deleted due to missingness)
Multiple R-squared:  0.1358,    Adjusted R-squared:  0.1357 
F-statistic:  1328 on 65 and 549157 DF,  p-value: < 2.2e-16

My linear regression model

The linear regression model I made aims to predict the emission quantity from the measurement type, GDP, the country's population and the year (1950-2022). Its general form is:

y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4

a = intercept
x1 = measurement type (this is a categorical variable, so R fits a separate coefficient for each type relative to a baseline type)
x2 = population
x3 = GDP
x4 = year
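
Because measurement type is categorical, `lm()` turns it into one indicator column per level and leaves one level out as the baseline. The sketch below shows this on invented toy data, assuming R's default treatment contrasts. In the summary above, `cement_co2` has no coefficient row of its own, which is consistent with it being the baseline there.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  measurement_type = rep(c("cement_co2", "co2", "total_ghg"), each = 4),
  gdp              = c(1, 2, 3, 4, 2, 3, 5, 6, 1, 4, 5, 7),
  quantity         = c(0.1, 0.3, 0.2, 0.5, 2.0, 2.9, 5.2, 6.1, 9.5, 12.8, 14.1, 16.0)
)

toy_fit <- lm(quantity ~ measurement_type + gdp, data = toy)

# "cement_co2" comes first alphabetically, so it is the baseline and has
# no coefficient of its own; the other levels are offsets from it.
coef_names <- names(coef(toy_fit))
print(coef_names)
```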

Applying this formula to my model: Quantity in kilotons = -1063 + b1(measurement_type) - 5.142e-07(population) + 6.742e-10(GDP) + 0.4011(year), where b1 is the coefficient listed for the relevant measurement type in the summary above.
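
As a worked example of plugging numbers into this equation, the sketch below uses the coefficients from `summary(fit1)` and the 2022 `total_ghg` row for China as printed in the `top_countries` table in the exploration section. The GDP value is rounded in that printout, so the result is only approximate.

```r
# Coefficients copied from summary(fit1)
b_intercept  <- -1063
b_total_ghg  <- 284.7        # offset for measurement_type == "total_ghg"
b_population <- -5.142e-07
b_gdp        <- 6.742e-10
b_year       <- 0.4011

# China, 2022, as printed in top_countries (GDP is rounded there)
population <- 1425179562
gdp        <- 2.70e13
year       <- 2022
observed   <- 13404

predicted <- b_intercept + b_total_ghg +
  b_population * population + b_gdp * gdp + b_year * year

print(c(predicted = predicted, observed = observed))
```

Comparing the two printed numbers gives a feel for the size of the residual for a single row.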

Analysis

I used the p-values to decide which variables are statistically significant (p-value < 0.05). Two of the numeric variables were highly significant: GDP (p-value < 2e-16) and population (p-value = 3.58e-05). GDP is therefore a strong predictor of emissions in this model; for example, countries with a larger GDP, such as China, tend to have higher emissions.

Population is also highly significant, but its coefficient is negative. To my surprise, this means that in this model an increase in population is associated with a decrease in emission quantity once GDP, year and measurement type are accounted for. I assume this has to do with how much CO2 factories were producing in the 1950s.

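Lastly, the measurement types that were significant predictors included cumulative_co2, cumulative_coal_co2, cumulative_gas_co2 and cumulative_luc_co2 (all with p-values below 0.001) and total greenhouse gases (total_ghg, p-value = 0.043). The largest t value in the summary belongs to energy_per_capita (161.894).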

The p-value for year is 0.573747, which is not statistically significant.

The adjusted R^2 is 0.1357 (13.57%), which means that the model I created explains only 13.57 percent of the variation in emission quantity. This is extremely low: although several variables are statistically significant, there may be other factors that affect the emission quantity, or the factors may need to be investigated independently.
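
Both R^2 values can be read straight from the fitted model instead of being copied from the printout. The sketch below uses an invented toy model; for the model in this report, `toy_fit` would be replaced by `fit1`.

```r
# Toy data, invented for illustration only
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50, sd = 20)

toy_fit <- lm(y ~ x, data = toy)
s <- summary(toy_fit)

r2     <- s$r.squared      # "Multiple R-squared" in the printout
adj_r2 <- s$adj.r.squared  # "Adjusted R-squared" in the printout

print(c(r2 = r2, adj_r2 = adj_r2))
```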

My visualization

EXPLORATION (don't grade this)

Filtering for top 10 countries

top_countries <- emissions |>
  filter(year == 2022, measurement_type == "total_ghg") |>  # keep the 2022 total greenhouse gas rows
  filter(country != "World") |>  # filter out the world total
  filter(!is.na(gdp) & !is.na(quantity)) |>  # filter out NAs again
  arrange(desc(quantity)) |>
  slice_head(n = 10)  # learnt both arrange() and slice_head() in Data 101

head(top_countries)
# A tibble: 6 × 7
  country        year iso_code population     gdp measurement_type quantity
  <chr>         <dbl> <chr>         <dbl>   <dbl> <chr>               <dbl>
1 China          2022 CHN      1425179562 2.70e13 total_ghg          13404.
2 United States  2022 USA       341534041 1.95e13 total_ghg           6067.
3 India          2022 IND      1425423212 1.05e13 total_ghg           3953.
4 Russia         2022 RUS       145579890 3.73e12 total_ghg           2668.
5 Brazil         2022 BRA       210306411 3.19e12 total_ghg           2438.
6 Indonesia      2022 IDN       278830529 3.50e12 total_ghg           1817.

To check the top 10 country names

top_10_names<-top_countries$country 
head(top_10_names)
[1] "China"         "United States" "India"         "Russia"       
[5] "Brazil"        "Indonesia"    
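
The top-n pattern used above can be sketched on a tiny invented table, with `n = 2` instead of 10 so the result is easy to check by eye.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  country  = c("A", "B", "C", "World"),
  quantity = c(10, 300, 45, 1000)
)

top_2 <- toy |>
  filter(country != "World") |>   # drop the global aggregate
  arrange(desc(quantity)) |>      # largest values first
  slice_head(n = 2)               # keep the first two rows

print(top_2$country)
```

Note that `head()` prints only the first six elements by default, so `print(top_10_names)` would be needed to see all ten names.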

Filtering the top 10 names and measurement types

filtered_emissions <- emissions |>
  filter(country %in% top_10_names) |>
  filter(measurement_type %in% c("cumulative_gas_co2",
                                 "cumulative_coal_co2", "cumulative_cement_co2"))

TRIAL GRAPH (please don't grade this)

# Colours set with scale_color_manual(); see https://ggplot2.tidyverse.org/reference/scale_manual.html
# and https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/ for the colours.
P1 <- ggplot(data = filtered_emissions, aes(x = year, y = gdp, color = measurement_type, size = population)) +
  geom_point(alpha = 0.5) +
  labs(title = "GDP over time (EXPLORATION GRAPH)", caption = "Source: Our World in Data",
       x = "Year", y = "GDP", color = "Measurement Type", size = "Population") +
  scale_color_manual(values = c("#FF0000", "#0000FF", "#00FF00")) +
  theme_minimal()
P1
Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).

Filtering for final graph (P2)

filtered_emissions_final <- emissions |>
  filter(measurement_type %in% c("total_ghg", "cumulative_gas_co2",
                                 "cumulative_coal_co2", "cumulative_cement_co2")) |>  # keep the measurement types I chose earlier
  filter(!is.na(measurement_type) & !is.na(quantity)) |>  # remove NAs from measurement type and quantity
  filter(quantity >= 0)  # keep quantities of at least 0
head(filtered_emissions_final)  # verifying
# A tibble: 6 × 7
  country      year iso_code population        gdp measurement_type     quantity
  <chr>       <dbl> <chr>         <dbl>      <dbl> <chr>                   <dbl>
1 Afghanistan  1950 AFG         7776182 9421400064 cumulative_cement_c…    0    
2 Afghanistan  1950 AFG         7776182 9421400064 cumulative_coal_co2     0.036
3 Afghanistan  1950 AFG         7776182 9421400064 cumulative_gas_co2      0    
4 Afghanistan  1950 AFG         7776182 9421400064 total_ghg              19.4  
5 Afghanistan  1951 AFG         7879343 9692279808 cumulative_cement_c…    0    
6 Afghanistan  1951 AFG         7879343 9692279808 cumulative_coal_co2     0.061

Final graph

# Colours set with scale_color_manual(); see https://ggplot2.tidyverse.org/reference/scale_manual.html
P2 <- ggplot(data = filtered_emissions_final, aes(x = population, y = quantity, color = measurement_type)) +
  geom_point(size = 0.6, alpha = 0.7) +
  labs(title = "CO2 and Greenhouse Gas Emissions by Population", caption = "Source: Our World in Data",
       x = "Population", y = "Emission Quantity (kilotons)", color = "Measurement Type") +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3", "#e7298a")) +
  theme_minimal()
P2
Warning: Removed 88 rows containing missing values or values outside the scale range
(`geom_point()`).

Conclusion

How I cleaned the dataset:

I cleaned the data set by first changing all the variable names to lowercase with ‘tolower’: “names(emissions) <- tolower(names(emissions))”. Next, I replaced all the spaces and full stops with underscores using the function ‘gsub’. I also widened my data to make the measurement types separate columns using ‘pivot_wider’. Additionally, I filtered out the NAs in quantity and measurement type and only included quantities of at least 0. I achieved this by using filter(quantity >= 0) and filter(!is.na(measurement_type) & !is.na(quantity)).

What the visualization illustrates:

My final visualization illustrates CO2 and greenhouse gas emissions by population. I made population the x axis because it was a statistically significant variable. I also understand that the low R^2 limits how much can be read into the graph, since the model I created explains only 13.57 percent of the variation in emission quantity. The first (exploration) graph highlighted a positive correlation between year and GDP and an increase in measurement types.

Additional improvements I thought about including but couldn’t:

I initially tried to filter for the top 10 countries and then create a graph of GDP over the years. However, when I did that the colours all blended into one and I wasn't able to differentiate the measurement types. If I had been able to differentiate the colours, I would have used that graph.
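
One possible way to keep the measurement types apart, offered only as a suggestion and not as something this report did, is to give each type its own panel with `facet_wrap()`. The sketch below uses invented toy data. In the trial graph each country-year has a single GDP value repeated for every measurement type, so the points sit exactly on top of each other, which is likely why the colours blended.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  year             = rep(c(2000, 2010, 2020), times = 3),
  gdp              = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  measurement_type = rep(c("cumulative_gas_co2", "cumulative_coal_co2",
                           "cumulative_cement_co2"), each = 3)
)

# One panel per measurement type, so the points no longer overlap
p <- ggplot(toy, aes(x = year, y = gdp, color = measurement_type)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ measurement_type) +
  theme_minimal()
```

With GDP on the y axis the panels would all show the same pattern, so putting `quantity` on the y axis instead may make the split by measurement type more informative.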