library(tidyverse)
library(plotly)
Project 1
Introduction
My data set
My data set covers CO2 emissions and total greenhouse gas emissions for different countries from 1950 to 2022. I am investigating the CO2 emissions of each country over time. I was interested in this data set because I am an environmentally conscious person, and I wanted to explore the changes in CO2 emissions over time with a visualization and a linear regression model.
My variables
I plan to explore the correlation between my variables:
Country - the name of the country in the data set.
Measurement type - the type of emissions or greenhouse gases.
Quantity - the quantity of the measurement in kilotons.
Year - the year of measurement (1950-2022).
GDP (Gross Domestic Product) - the monetary value of final goods and services. https://www.imf.org/en/Publications/fandd/issues/Series/Back-to-Basics/gross-domestic-product-GDP
Population - the total number of people inhabiting a country.
The source of my data set was Our World in Data. (I learnt how to make my words bold in Data 101.)
Loading libraries
Loading the dataset
setwd("~/Desktop/Data 110") #set working directory
emissions <- read_csv("emissions_co2.csv")
Cleaning the data
# Make all column names lowercase, e.g. "Measurement Type" and "Quantity" become "measurement type" and "quantity".
names(emissions) <- tolower(names(emissions))
# Replace brackets, full stops, spaces and hyphens with underscores.
names(emissions) <- gsub("[(). \\-]", "_", names(emissions))
head(emissions) # verifying how it looks
# A tibble: 6 × 7
country year iso_code population gdp measurement_type quantity
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 1950 AFG 7776182 9421400064 cement_co2 0
2 Afghanistan 1950 AFG 7776182 9421400064 cement_co2_per_capi… 0
3 Afghanistan 1950 AFG 7776182 9421400064 co2 0.084
4 Afghanistan 1950 AFG 7776182 9421400064 co2_per_capita 0.011
5 Afghanistan 1950 AFG 7776182 9421400064 co2_per_gdp 0.009
6 Afghanistan 1950 AFG 7776182 9421400064 co2_per_unit_energy NA
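To make the two cleaning steps concrete, here is a minimal sketch that applies the same `tolower` and `gsub` calls to a small vector of column names. The raw names below are hypothetical and chosen only for illustration; they are not necessarily the original headers of the CSV file.

```r
# Hypothetical raw column names, invented for this illustration.
raw_names <- c("Country", "ISO.Code", "Measurement Type", "Quantity")

# Step 1: make every name lowercase.
lower_names <- tolower(raw_names)

# Step 2: replace brackets, full stops, spaces and hyphens with underscores,
# using the same pattern as in the cleaning code above.
clean_names <- gsub("[(). \\-]", "_", lower_names)

print(clean_names)
```

With these inputs the result is "country", "iso_code", "measurement_type" and "quantity", which matches the style of the column names shown in the output above.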
Filtering for my predicted highest contributors to CO2 emissions
filtered_emissions <- emissions |>
  filter(measurement_type %in% c("total_ghg", "cumulative_gas_co2",
                                 "cumulative_coal_co2", "cumulative_cement_co2")) |>
  filter(quantity >= 0)
head(filtered_emissions)
# A tibble: 6 × 7
country year iso_code population gdp measurement_type quantity
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 1950 AFG 7776182 9421400064 cumulative_cement_c… 0
2 Afghanistan 1950 AFG 7776182 9421400064 cumulative_coal_co2 0.036
3 Afghanistan 1950 AFG 7776182 9421400064 cumulative_gas_co2 0
4 Afghanistan 1950 AFG 7776182 9421400064 total_ghg 19.4
5 Afghanistan 1951 AFG 7879343 9692279808 cumulative_cement_c… 0
6 Afghanistan 1951 AFG 7879343 9692279808 cumulative_coal_co2 0.061
# I checked the separate measurement types that I thought contributed the most to the total greenhouse gas emissions and CO2 of each country. These were my initial predictions.
Widening my data and adding specific measurement types as columns
wider_df <- filtered_emissions |>
  filter(!is.na(measurement_type) & !is.na(quantity)) |> # remove the NAs before making the measurement types separate columns
  pivot_wider(names_from = measurement_type, values_from = quantity)
head(wider_df)
# A tibble: 6 × 9
country year iso_code population gdp cumulative_cement_co2
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1950 AFG 7776182 9421400064 0
2 Afghanistan 1951 AFG 7879343 9692279808 0
3 Afghanistan 1952 AFG 7987783 10017325056 0
4 Afghanistan 1953 AFG 8096703 10630519808 0
5 Afghanistan 1954 AFG 8207953 10866360320 0
6 Afghanistan 1955 AFG 8326981 11078185984 0
# ℹ 3 more variables: cumulative_coal_co2 <dbl>, cumulative_gas_co2 <dbl>,
# total_ghg <dbl>
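The widening step can be easier to follow on a tiny table. The sketch below uses made-up countries and quantities (not values from the emissions data) to show how `pivot_wider` turns each measurement type into its own column.

```r
library(tidyr)

# Toy long-format data, invented for illustration only.
long_df <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2000, 2000, 2000),
  measurement_type = c("total_ghg", "cumulative_coal_co2",
                       "total_ghg", "cumulative_coal_co2"),
  quantity = c(10, 2, 30, 5)
)

# One row per country-year, one column per measurement type.
wide_df <- pivot_wider(long_df,
                       names_from = measurement_type,
                       values_from = quantity)

print(wide_df)
```

The four long rows collapse into two wide rows (one per country and year), with `total_ghg` and `cumulative_coal_co2` as separate columns.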
Linear regression equation
fit1 <- lm(data = emissions, quantity ~ measurement_type + population + gdp + year )
summary(fit1)
Call:
lm(formula = quantity ~ measurement_type + population + gdp +
year, data = emissions)
Residuals:
Min 1Q Median 3Q Max
-87583 45 230 268 1684468
Coefficients:
Estimate Std. Error
(Intercept) -1.063e+03 1.421e+03
measurement_typecement_co2_per_capita -5.520e+00 1.423e+02
measurement_typeco2 1.705e+02 1.414e+02
measurement_typeco2_per_capita 5.242e+00 1.414e+02
measurement_typeco2_per_gdp 1.280e+00 1.414e+02
measurement_typeco2_per_unit_energy -1.073e+02 1.569e+02
measurement_typecoal_co2 1.462e+01 1.516e+02
measurement_typecoal_co2_per_capita -7.480e+01 1.516e+02
measurement_typeconsumption_co2 -1.273e+01 1.962e+02
measurement_typeconsumption_co2_per_capita -3.389e+02 1.962e+02
measurement_typeconsumption_co2_per_gdp -3.449e+02 1.962e+02
measurement_typecumulative_cement_co2 1.276e+02 1.423e+02
measurement_typecumulative_co2 6.748e+03 1.414e+02
measurement_typecumulative_coal_co2 4.631e+03 1.516e+02
measurement_typecumulative_flaring_co2 -3.323e+01 1.615e+02
measurement_typecumulative_gas_co2 1.143e+03 1.635e+02
measurement_typecumulative_luc_co2 6.141e+03 1.412e+02
measurement_typecumulative_oil_co2 2.131e+03 1.415e+02
measurement_typecumulative_other_co2 -9.138e+02 2.830e+02
measurement_typeenergy_per_capita 2.539e+04 1.568e+02
measurement_typeenergy_per_gdp -1.052e+02 1.568e+02
measurement_typeflaring_co2 -1.386e+02 1.615e+02
measurement_typeflaring_co2_per_capita -1.416e+02 1.615e+02
measurement_typegas_co2 -1.226e+02 1.635e+02
measurement_typegas_co2_per_capita -1.713e+02 1.635e+02
measurement_typeghg_per_capita 1.419e+01 1.409e+02
measurement_typeland_use_change_co2 5.492e+01 1.412e+02
measurement_typeland_use_change_co2_per_capita 7.624e+00 1.412e+02
measurement_typemethane 6.197e+01 1.409e+02
measurement_typemethane_per_capita 7.256e+00 1.409e+02
measurement_typenitrous_oxide 2.069e+01 1.409e+02
measurement_typenitrous_oxide_per_capita 5.172e+00 1.409e+02
measurement_typeoil_co2 6.105e+01 1.415e+02
measurement_typeoil_co2_per_capita 1.677e+00 1.415e+02
measurement_typeother_co2_per_capita -1.102e+03 2.830e+02
measurement_typeother_industry_co2 -1.094e+03 2.830e+02
measurement_typeprimary_energy_consumption 8.735e+02 1.568e+02
measurement_typeshare_global_cement_co2 -4.841e+00 1.423e+02
measurement_typeshare_global_co2 1.657e+00 1.414e+02
measurement_typeshare_global_co2_including_luc 5.209e-01 1.419e+02
measurement_typeshare_global_coal_co2 -7.521e+01 1.516e+02
measurement_typeshare_global_cumulative_cement_co2 -4.836e+00 1.423e+02
measurement_typeshare_global_cumulative_co2 1.673e+00 1.414e+02
measurement_typeshare_global_cumulative_co2_including_luc 5.296e-01 1.419e+02
measurement_typeshare_global_cumulative_coal_co2 -7.519e+01 1.516e+02
measurement_typeshare_global_cumulative_flaring_co2 -1.408e+02 1.615e+02
measurement_typeshare_global_cumulative_gas_co2 -1.716e+02 1.635e+02
measurement_typeshare_global_cumulative_luc_co2 6.036e+00 1.412e+02
measurement_typeshare_global_cumulative_oil_co2 5.305e-01 1.415e+02
measurement_typeshare_global_cumulative_other_co2 -1.046e+03 2.900e+02
measurement_typeshare_global_flaring_co2 -1.408e+02 1.615e+02
measurement_typeshare_global_gas_co2 -1.717e+02 1.635e+02
measurement_typeshare_global_luc_co2 6.003e+00 1.412e+02
measurement_typeshare_global_oil_co2 5.154e-01 1.415e+02
measurement_typeshare_global_other_co2 -1.046e+03 2.900e+02
measurement_typeshare_of_temperature_change_from_ghg 6.938e+00 1.407e+02
measurement_typetemperature_change_from_ch4 4.546e+00 1.409e+02
measurement_typetemperature_change_from_co2 6.189e+00 1.407e+02
measurement_typetemperature_change_from_ghg 6.191e+00 1.407e+02
measurement_typetemperature_change_from_n2o 4.544e+00 1.409e+02
measurement_typetotal_ghg 2.847e+02 1.409e+02
measurement_typetrade_co2 -3.454e+02 1.962e+02
measurement_typetrade_co2_share -3.205e+02 1.962e+02
population -5.142e-07 1.244e-07
gdp 6.742e-10 9.143e-12
year 4.011e-01 7.130e-01
t value Pr(>|t|)
(Intercept) -0.748 0.454377
measurement_typecement_co2_per_capita -0.039 0.969051
measurement_typeco2 1.206 0.228005
measurement_typeco2_per_capita 0.037 0.970426
measurement_typeco2_per_gdp 0.009 0.992775
measurement_typeco2_per_unit_energy -0.684 0.494120
measurement_typecoal_co2 0.096 0.923130
measurement_typecoal_co2_per_capita -0.494 0.621648
measurement_typeconsumption_co2 -0.065 0.948250
measurement_typeconsumption_co2_per_capita -1.728 0.084046 .
measurement_typeconsumption_co2_per_gdp -1.758 0.078722 .
measurement_typecumulative_cement_co2 0.897 0.369718
measurement_typecumulative_co2 47.720 < 2e-16 ***
measurement_typecumulative_coal_co2 30.557 < 2e-16 ***
measurement_typecumulative_flaring_co2 -0.206 0.836997
measurement_typecumulative_gas_co2 6.992 2.72e-12 ***
measurement_typecumulative_luc_co2 43.498 < 2e-16 ***
measurement_typecumulative_oil_co2 15.052 < 2e-16 ***
measurement_typecumulative_other_co2 -3.229 0.001244 **
measurement_typeenergy_per_capita 161.894 < 2e-16 ***
measurement_typeenergy_per_gdp -0.671 0.502363
measurement_typeflaring_co2 -0.858 0.390793
measurement_typeflaring_co2_per_capita -0.877 0.380692
measurement_typegas_co2 -0.750 0.453176
measurement_typegas_co2_per_capita -1.048 0.294570
measurement_typeghg_per_capita 0.101 0.919798
measurement_typeland_use_change_co2 0.389 0.697233
measurement_typeland_use_change_co2_per_capita 0.054 0.956931
measurement_typemethane 0.440 0.660171
measurement_typemethane_per_capita 0.051 0.958942
measurement_typenitrous_oxide 0.147 0.883323
measurement_typenitrous_oxide_per_capita 0.037 0.970727
measurement_typeoil_co2 0.431 0.666239
measurement_typeoil_co2_per_capita 0.012 0.990547
measurement_typeother_co2_per_capita -3.895 9.83e-05 ***
measurement_typeother_industry_co2 -3.866 0.000111 ***
measurement_typeprimary_energy_consumption 5.570 2.55e-08 ***
measurement_typeshare_global_cement_co2 -0.034 0.972859
measurement_typeshare_global_co2 0.012 0.990653
measurement_typeshare_global_co2_including_luc 0.004 0.997070
measurement_typeshare_global_coal_co2 -0.496 0.619738
measurement_typeshare_global_cumulative_cement_co2 -0.034 0.972886
measurement_typeshare_global_cumulative_co2 0.012 0.990562
measurement_typeshare_global_cumulative_co2_including_luc 0.004 0.997021
measurement_typeshare_global_cumulative_coal_co2 -0.496 0.619832
measurement_typeshare_global_cumulative_flaring_co2 -0.872 0.383255
measurement_typeshare_global_cumulative_gas_co2 -1.050 0.293728
measurement_typeshare_global_cumulative_luc_co2 0.043 0.965898
measurement_typeshare_global_cumulative_oil_co2 0.004 0.997010
measurement_typeshare_global_cumulative_other_co2 -3.607 0.000309 ***
measurement_typeshare_global_flaring_co2 -0.872 0.383278
measurement_typeshare_global_gas_co2 -1.050 0.293700
measurement_typeshare_global_luc_co2 0.043 0.966081
measurement_typeshare_global_oil_co2 0.004 0.997095
measurement_typeshare_global_other_co2 -3.607 0.000309 ***
measurement_typeshare_of_temperature_change_from_ghg 0.049 0.960680
measurement_typetemperature_change_from_ch4 0.032 0.974271
measurement_typetemperature_change_from_co2 0.044 0.964924
measurement_typetemperature_change_from_ghg 0.044 0.964911
measurement_typetemperature_change_from_n2o 0.032 0.974280
measurement_typetotal_ghg 2.020 0.043425 *
measurement_typetrade_co2 -1.760 0.078352 .
measurement_typetrade_co2_share -1.634 0.102289
population -4.133 3.58e-05 ***
gdp 73.739 < 2e-16 ***
year 0.563 0.573747
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10510 on 549157 degrees of freedom
(476417 observations deleted due to missingness)
Multiple R-squared: 0.1358, Adjusted R-squared: 0.1357
F-statistic: 1328 on 65 and 549157 DF, p-value: < 2.2e-16
My linear regression model
The linear regression model I made aims to predict the emission quantity using measurement type, population, GDP and the year (1950-2022).
y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4
a = intercept
x1 = measurement type (this is a categorical variable, so R fits a separate coefficient for each measurement type relative to a baseline type)
x2 = population
x3 = GDP
x4 = year
Applying this formula to my model: Quantity in kilotons = -1063 + b1(measurement_type) - 5.142e-07(population) + 6.742e-10(GDP) + 0.4011(year), where b1 is the coefficient for the chosen measurement type.
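As a worked example of plugging numbers into the fitted equation, the sketch below predicts the `total_ghg` quantity for China in 2022. The coefficients are copied from the `summary(fit1)` output and the inputs from the top-countries table later in this report. Two assumptions to note: GDP is used as printed (2.70e13, which is rounded), and `cement_co2` is taken to be the baseline measurement type because it has no coefficient of its own in the output.

```r
# Coefficients copied from the summary(fit1) output.
intercept    <- -1063
b_total_ghg  <- 284.7       # measurement_typetotal_ghg
b_population <- -5.142e-07
b_gdp        <- 6.742e-10
b_year       <- 0.4011

# Inputs: China's 2022 row; gdp is the rounded value as printed.
population <- 1425179562
gdp        <- 2.70e13
year       <- 2022

# Fitted value for measurement type "total_ghg".
predicted <- intercept + b_total_ghg +
  b_population * population +
  b_gdp * gdp +
  b_year * year

print(round(predicted))
```

This gives a prediction of roughly 17,500, compared with the observed value of about 13,404 in the table, so the model overestimates this particular row.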
Analysis
The p-values are used to inform my decision on which variables are statistically significant. My highly statistically significant variables (p-value < 0.05) were GDP (p-value < 2e-16) and population (p-value = 3.58e-05). This means that GDP is highly significant and can be used as a strong predictor when determining emissions; for example, countries with greater GDPs, such as China, tend to have higher emissions.
Furthermore, population is also highly significant; however, it has a negative coefficient, meaning that, to my surprise, in this data set an increase in population tends to go with a decrease in the emission quantity once the other variables are accounted for. I assume this has to do with the context of how much CO2 was produced by factories in the 1950s.
Lastly, the statistically significant measurement types included total greenhouse gases (total_ghg, p-value = 0.043) and cumulative_co2, cumulative_gas_co2, cumulative_coal_co2 and cumulative_luc_co2 (all with p-values below 0.001).
The p-value for year is 0.573747, which is not statistically significant.
The adjusted R^2 = 0.1357 (13.57%), which means that the model I created explains only 13.57 percent of the variation in emission quantity. This is extremely low, meaning that although some variables are statistically significant, there may be other factors that affect the emission quantity, or the factors may need to be investigated independently.
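To show where an R^2 value comes from, here is a small sketch on toy data (the numbers are invented and unrelated to the emissions data). It computes R^2 by hand as 1 minus the residual sum of squares divided by the total sum of squares, and compares it with the value reported by `summary()`.

```r
# Toy data, invented purely for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

toy_fit <- lm(y ~ x)

# Variation left unexplained by the model.
ss_res <- sum(residuals(toy_fit)^2)
# Total variation of y around its mean.
ss_tot <- sum((y - mean(y))^2)

# Share of the variation that the model explains.
r_squared <- 1 - ss_res / ss_tot

print(r_squared)
print(summary(toy_fit)$r.squared)
```

The two printed values agree. In the emissions model the same ratio is only about 0.136, so most of the variation in quantity is left unexplained.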
My visualization
EXPLORATION (don't grade this)
Filtering for top 10 countries
top_countries <- emissions |>
  filter(year == 2022, measurement_type == "total_ghg") |> # keep the 2022 total greenhouse gas rows
  filter(country != "World") |> # filter out the world aggregate
  filter(!is.na(gdp) & !is.na(quantity)) |> # filter out NAs again
  arrange(desc(quantity)) |>
  slice_head(n = 10) # learnt both arrange and slice_head in Data 101
head(top_countries)
# A tibble: 6 × 7
country year iso_code population gdp measurement_type quantity
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 China 2022 CHN 1425179562 2.70e13 total_ghg 13404.
2 United States 2022 USA 341534041 1.95e13 total_ghg 6067.
3 India 2022 IND 1425423212 1.05e13 total_ghg 3953.
4 Russia 2022 RUS 145579890 3.73e12 total_ghg 2668.
5 Brazil 2022 BRA 210306411 3.19e12 total_ghg 2438.
6 Indonesia 2022 IDN 278830529 3.50e12 total_ghg 1817.
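The arrange-then-slice pattern used for the top 10 can be seen on a tiny example. The countries and quantities below are made up for illustration.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  quantity = c(5, 20, 1, 12)
)

# Sort from largest to smallest quantity, then keep the first two rows.
top_2 <- toy |>
  arrange(desc(quantity)) |>
  slice_head(n = 2)

print(top_2$country)
```

Here the two largest quantities belong to "B" and "D". Note that `head()` prints only the first six rows by default, which is why the output above shows six of the ten selected countries.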
To check the top 10 country names
top_10_names <- top_countries$country
head(top_10_names)
[1] "China" "United States" "India" "Russia"
[5] "Brazil" "Indonesia"
Filtering the top 10 names and measurement types
filtered_emissions <- emissions |>
  filter(country %in% top_10_names) |>
  filter(measurement_type %in% c("cumulative_gas_co2",
                                 "cumulative_coal_co2", "cumulative_cement_co2"))
TRIAL GRAPH - PLEASE DON'T GRADE THIS
# I used scale_color_manual, which I got from https://ggplot2.tidyverse.org/reference/scale_manual.html, and I used https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/ to find the colours.
P1 <- ggplot(data = filtered_emissions, aes(x = year, y = gdp, color = measurement_type, size = population)) +
  geom_point(alpha = 0.5) +
  labs(title = "GDP over time (EXPLORATION GRAPH)", caption = "Source: Our World in Data",
       x = "Year", y = "GDP", color = "Measurement Type", size = "Population") +
  scale_color_manual(values = c("#FF0000", "#0000FF", "#00FF00")) +
  theme_minimal()
P1
Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).
Filtering for Final graph(P2)
filtered_emissions_final <- emissions |>
  filter(measurement_type %in% c("total_ghg", "cumulative_gas_co2",
                                 "cumulative_coal_co2", "cumulative_cement_co2")) |> # keep the measurement types of interest
  filter(!is.na(measurement_type) & !is.na(quantity)) |> # remove NAs from measurement type and quantity
  filter(quantity >= 0) # keep quantities of at least 0
head(filtered_emissions_final) # verifying
# A tibble: 6 × 7
country year iso_code population gdp measurement_type quantity
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 1950 AFG 7776182 9421400064 cumulative_cement_c… 0
2 Afghanistan 1950 AFG 7776182 9421400064 cumulative_coal_co2 0.036
3 Afghanistan 1950 AFG 7776182 9421400064 cumulative_gas_co2 0
4 Afghanistan 1950 AFG 7776182 9421400064 total_ghg 19.4
5 Afghanistan 1951 AFG 7879343 9692279808 cumulative_cement_c… 0
6 Afghanistan 1951 AFG 7879343 9692279808 cumulative_coal_co2 0.061
Final graph
# I used scale_color_manual, which I got from https://ggplot2.tidyverse.org/reference/scale_manual.html.
P2 <- ggplot(data = filtered_emissions_final, aes(x = population, y = quantity, color = measurement_type)) +
  geom_point(size = 0.6, alpha = 0.7) +
  labs(title = "CO2 and Greenhouse Gas Emissions by Population", caption = "Source: Our World in Data",
       x = "Population", y = "Emission Quantity (kilotons)", color = "Measurement Type") +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3", "#e7298a")) +
  theme_minimal()
P2
Warning: Removed 88 rows containing missing values or values outside the scale range
(`geom_point()`).
Conclusion:
How I cleaned the dataset:
I cleaned the dataset by first changing all the variable names to lowercase with ‘tolower’: “names(emissions) <- tolower(names(emissions))”. Next I replaced all the spaces and full stops with underscores using the function ‘gsub’. I also widened my data to make the measurement types separate columns using ‘pivot_wider’. Additionally, I filtered out the NAs in quantity and measurement type and only included quantities of at least 0. I achieved this by using filter(quantity >=0) and filter(!is.na(measurement_type) & !is.na(quantity)).
What the visualization illustrates:
My final visualization illustrates CO2 and greenhouse gas emissions by population. I made population the x axis because it was a statistically significant variable. I also understand that my graph may be affected by the R^2 being so low, which means that the model I created explains only 13.57 percent of the variation in emission quantity. The first graph highlighted a positive correlation between year and GDP and an increase in measurement types.
Additional improvements I thought about including but couldn’t:
I initially tried to filter for the top 10 countries and then create a graph of GDP over the years; however, when I did that, the colours all blended into one and I wasn't able to differentiate the different measurement types. In conclusion, if I had been able to differentiate the colours, I would've used that graph.
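One possible reason the colours blended is that, in the long format, GDP is repeated on every row for a given country and year, so points for different measurement types land on exactly the same spot. The sketch below is only one option and was not part of the original analysis: it plots quantity on the y axis and gives each measurement type its own panel with `facet_wrap`. The data are invented toy values.

```r
library(ggplot2)

# Toy data, invented for illustration only.
toy <- data.frame(
  year = rep(c(2000, 2010, 2020), times = 3),
  measurement_type = rep(c("cumulative_gas_co2",
                           "cumulative_coal_co2",
                           "cumulative_cement_co2"), each = 3),
  quantity = c(1, 2, 4, 3, 6, 9, 0.5, 1, 1.5)
)

# One panel per measurement type, so points no longer overlap across types.
p <- ggplot(toy, aes(x = year, y = quantity, color = measurement_type)) +
  geom_point() +
  facet_wrap(~ measurement_type) +
  theme_minimal()

print(p)
```

Faceting keeps the colour legend but separates the series, so each measurement type can be read even when the values are close together.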