Introduction

As climate change becomes an ever-looming threat to life on Earth, it is of utmost importance that the relationships between carbon dioxide emissions and various global factors be understood. As greenhouse gases accumulate in the atmosphere, global surface temperatures rise, and the cascading repercussions harm the ecosystems and resources that the human-built world relies on. The economic and sociological development of countries and societies around the world often contributes to greenhouse gas emissions through certain practices and activities. In order to curb emissions and mitigate the negative impacts that climate change will have on life on Earth, it is important that we understand how our sociological and economic development affects the natural climate system. This study aims to understand the relationship between global carbon dioxide emissions and various global sociological and economic factors.

We conducted a statistical analysis to understand the relationship between carbon dioxide emissions per person (metric tons per person) and three variables: GDP per capita, population growth (annual population growth rate, percent per year), and high technology exports (percent of exports with high R&D intensity). To stay within the scope of the prompt and avoid issues with the non-independence of observations over time, we filtered our data to the most recent year available for all variables, 2014. Variables were chosen based on our research interests in economic and social development, and on the availability of data for all or almost all countries. Our overall research interest is to understand how the chosen variables relate to, and potentially affect, modern (2014) carbon dioxide emissions globally. We grouped countries by region in order to better organize the data and to create visualizations that depict clear relationships. Our overarching research question is: what are the relationships between the identified global variables and CO\(_2\) emissions per person? Understanding these relationships will provide leverage when identifying where changes must be made in order to reduce CO\(_2\) emissions and mitigate climate change.

This statistical report first outlines the data collection and cleaning that preceded the analysis. Next, we describe our methods of analysis and how they relate to the variables, data structure, and overarching scientific question. We include figures for each graph, plot, and image that supports our research and conclusions. Overall, we find that the residuals of our models may not be normally distributed. We continue with a multiple linear regression analysis in which we identify the variables that have a statistically significant relationship with CO\(_2\) emissions per person (CO2pp). From there, we attempt to explain our findings. We draw conclusions about why GDP and region are associated with CO2pp, and we try to further explain the trends that we see between these variables. We acknowledge the limitations of this study and suggest where and how further research can be conducted.

Data and Application

We used RStudio to analyze the relationship between CO\(_2\) emissions per person and the global socioeconomic factors. Our data was sourced from Gapminder, whose data sets cover various years from 1700 to the present. Each data set was organized by country, although not all countries had entries for all years. However, all of the chosen variables had values for all or almost all countries in 2014, the year we chose to analyze.

| Variable Name | Type | Description and unit |
|---|---|---|
| "country" | character | Name of each country |
| "region" | categorical | Geographic grouping of countries. Levels: "Africa", "Americas", "Asia", "Europe", "Oceania" |
| "CO2share" | numeric | Each country's share of total global CO\(_2\) emissions (percent) |
| "popGrowthFactor" | numeric | Annual population growth rate for each country (percent) |
| "GDP" | numeric | Gross domestic product per person in each country |
| "highTechExports" | numeric | High R&D intensity exports for each country (percent of exports) |
| "CO2pp" | numeric | CO\(_2\) emissions per person in each country (metric tons per person) |

Figure 1. Table with each variable name, type, and units (with levels) used in analysis.

First, each data set was filtered for the year 2014. To make merging easier, the columns in each data set were renamed, and a few individual countries were renamed to match the naming used in the other data sets. After this, all of the data frames were merged into a final data set. Countries with missing entries were dropped. This process produced a list of 151 countries with complete data for the year 2014.
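The cleaning steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the two small tables, their values, and the names `gdp_wide`, `co2_wide`, `merged`, and `complete` are made up for the example and are not our actual data or code (the code we actually ran is in the appendix).

```r
# Toy stand-ins for two wide-format source tables (all values are made up).
gdp_wide <- data.frame(country = c("A", "B", "UK"),
                       X2013 = c(900, 4800, 38000),
                       X2014 = c(1000, 5000, 39000),
                       stringsAsFactors = FALSE)
co2_wide <- data.frame(country = c("A", "B", "United Kingdom"),
                       X2014 = c(0.2, NA, 6.5),
                       stringsAsFactors = FALSE)

# Keep only the 2014 column and give it a descriptive name.
gdp2014  <- setNames(gdp_wide[, c("country", "X2014")], c("country", "GDP"))
co2_2014 <- setNames(co2_wide[, c("country", "X2014")], c("country", "CO2pp"))

# Harmonize a country name so the join key matches across tables.
gdp2014$country[gdp2014$country == "UK"] <- "United Kingdom"

# Inner join on the shared "country" column, then drop incomplete rows.
merged   <- merge(gdp2014, co2_2014, by = "country")
complete <- merged[complete.cases(merged), ]
print(complete)
```

In this toy example, country "B" is dropped because its CO\(_2\) value is missing, which mirrors how countries with incomplete entries were removed from our final data set.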

Below is a bubble plot that shows the relationship between a country’s GDP and its CO\(_2\) emissions per person. Each point is color coded by region, and the relative size of each point indicates each country’s share of global CO\(_2\) emissions.

Figure 2. Bubble plot depicting the relationship between GDP, CO\(_2\) per person, region, and CO\(_2\) share.

Overall, there seems to be a strong positive linear relationship between CO\(_2\) per person and a country’s GDP. In addition, the bubble plot depicts a few outliers in both the Americas and Asia regions that have both high CO\(_2\) per person and high GDP. African countries tend to have relatively small GDP, small CO\(_2\) emissions per person, and a small share of total CO\(_2\) emissions. Lastly, this bubble plot highlights that China, indicated by the largest bubble, has the largest share of total CO\(_2\), yet does not have the highest CO\(_2\) emissions per person. This is due to China having the largest population.

Here is a bubble plot that shows the relationship between a country’s GDP and its high technology exports. Each point is color coded by region, and the relative size of each point indicates each country’s share of global CO\(_2\) emissions.

Figure 3. Bubble plot depicting the relationship between GDP, tech exports, region, and CO\(_2\) share.

Overall, there seems to be a weak positive relationship between a country’s high technology exports and its GDP. Additionally, the bubble plot illustrates the tendency of African countries to have a small share of CO\(_2\) emissions, and shows several Asian countries as outliers with both large GDPs and large technology exports. The bubble plot clearly displays the largest modern CO\(_2\) emitters, China and the United States, as the largest bubbles.

Maps

Because we are working with geographic entities, it makes sense to plot some variables on a map. Below is a global map which visualizes the population growth of each country for 2014.


Figure 4. Global map visualizing the population growth (% annually)

Next is a global map which visualizes the CO\(_2\) emissions per person for 2014.

Figure 5. Global map visualizing CO2 emissions per person for 2014

These maps make it easy, with a little bit of geographic knowledge, to see which countries and regions have high population growth and high CO\(_2\) emissions per person.

Models

In order to understand the relationships between the identified global variables and CO\(_2\) emissions per person, we constructed a multiple linear regression model. The goal was to find a function that relates our chosen variables to the dependent variable, CO\(_2\) per person. We first fit a simple linear regression model for each of the identified variables and assessed whether that variable had a statistically significant relationship with CO\(_2\) per person. For example, we found that there was not a statistically significant relationship between the population growth factor and CO\(_2\) emissions per person. We also found that there was not a statistically significant relationship between technology exports and CO\(_2\) per person. Once we had identified that GDP and geographic region both had a statistically significant relationship with CO\(_2\) per person, we created a multiple linear regression model with these two variables.
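The screening step described above can be sketched as follows. This is a minimal illustration on simulated data, not our actual analysis: the data frame `toy`, its values, and the helper vector `slope_p` are assumptions made up for the example, and only the column names mirror our variables.

```r
set.seed(1)
n <- 100

# Simulated data: only GDP is built to be related to CO2pp.
toy <- data.frame(GDP = runif(n, 1000, 50000),
                  popGrowthFactor = rnorm(n, mean = 1.5, sd = 1),
                  highTechExports = runif(n, 0, 40))
toy$CO2pp <- 0.5 + 0.0002 * toy$GDP + rnorm(n, sd = 1.5)

# Fit one simple linear regression per predictor and keep the slope p-value.
predictors <- c("GDP", "popGrowthFactor", "highTechExports")
slope_p <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "CO2pp"), data = toy)
  summary(fit)$coefficients[v, "Pr(>|t|)"]
})
print(signif(slope_p, 3))
```

For a categorical predictor such as region, there is one coefficient per non-reference level, so the overall F-test from `anova()` on the fitted model is a more natural single summary than an individual slope p-value.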

| Multiple Linear Regression Model | R code | Description |
|---|---|---|
| Model 1 | lm(CO2pp ~ GDP + region, data = GlobalFactors) | Main effects of GDP and region on CO2pp, without the interaction between GDP and region |
| Model 2 | lm(CO2pp ~ GDP * region, data = GlobalFactors) | Main effects of GDP and region on CO2pp, with the interaction between GDP and region |

Figure 6. Table with multiple linear regression models and descriptions

We created diagnostic plots to assess whether the residuals are approximately normally distributed with constant variance, and whether a linear relationship seems reasonable.

Figure 7. Residual plot for Model 1.

Figure 8. Residual plot for Model 2.

The residual plots were used to assess the homoscedasticity (constant variance) of the residuals. For both models, the residuals were not evenly spread: the points clump on the left side of the plot. This suggests that the constant-variance assumption may not hold for either model.

Figure 9. Normal Q-Q plot for Model 1.

Figure 10. Normal Q-Q plot for Model 2.

We also created Normal Q-Q plots to assess whether the residuals are normally distributed. There appears to be a significant departure from normality in the tails for both models. These heavier tails suggest that the residuals are not normally distributed in either model.

These diagnostic plots suggest that the assumptions of linear regression, namely constant variance and normally distributed residuals, may not be met. However, we decided to carry out a linear regression analysis to further assess whether there was a relationship between the identified variables. We recognize that this may not have been the most appropriate analysis given these departures from the assumptions.
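Unequal spread and non-normality are two different departures, and each can be checked numerically as well as visually. The sketch below is an illustration on simulated data, not our actual analysis: the variables are generated so that the noise grows with GDP, and the half-split comparison of residual spread (`spread_ratio`) is a crude check we made up for this example rather than a formal test.

```r
set.seed(3)
n <- 150
GDP <- runif(n, 1, 50)  # made-up GDP values, in thousands

# Noise whose spread grows with GDP, to mimic a fan-shaped residual plot.
CO2pp <- 0.5 + 0.2 * GDP + rnorm(n, sd = 0.1 * GDP)
fit <- lm(CO2pp ~ GDP)

# Standardized residuals, as used in the residual plots.
res <- rstandard(fit)

# Normality check: Shapiro-Wilk test on the residuals
# (a small p-value is evidence against normality).
sw <- shapiro.test(res)
print(sw$p.value)

# Crude constant-variance check: compare the residual spread in the
# lower and upper halves of the fitted values. A ratio far from 1
# suggests heteroscedasticity.
lower <- fitted(fit) <= median(fitted(fit))
spread_ratio <- sd(res[!lower]) / sd(res[lower])
print(round(spread_ratio, 2))
```

Both checks are applied to the model residuals rather than to the raw variables, because the regression assumptions concern the errors, not the distribution of GDP or CO\(_2\) per person themselves.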

Multiple Linear Regression Analysis

As mentioned above, we decided to carry out multiple linear regression with both Model 1 and Model 2. The analysis of variance (ANOVA) table suggested that both GDP and region have a statistically significant relationship with CO\(_2\) per person; that is, GDP and region are associated with CO\(_2\) per person. Model 2 includes both the main effects and the interaction between GDP and region, which allows the relationship between GDP and CO\(_2\) per person to differ by region.

Figure 11. Plot depicting Model 1 - only the main effects

Figure 12. Plot depicting Model 2 - main effects and interactions

Compare the slopes in Figure 11 (Model 1) and Figure 12 (Model 2). In Model 1, the regression line for each region has the same slope, because this model does not account for an interaction between GDP and geographic region. If there were no interaction by region, we would expect the regression lines in Model 2 to be parallel as well. However, we can see that the slopes differ noticeably between regions, and so there is evidence of an interaction by region.
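The visual comparison of slopes can be complemented by a formal test, since Model 1 is nested within Model 2. The sketch below is an illustration on simulated data with three made-up regions (`R1`, `R2`, `R3`) and made-up slopes; it is not our actual analysis. It uses `anova()` to compare the two nested models with an F-test and then recovers the per-region slopes implied by the interaction model.

```r
set.seed(2)
n_per <- 40

# Simulated data: three hypothetical regions with different true slopes.
toy <- data.frame(region = factor(rep(c("R1", "R2", "R3"), each = n_per)),
                  GDP = runif(3 * n_per, 1, 50))  # made-up GDP, in thousands
true_slope <- c(R1 = 0.05, R2 = 0.20, R3 = 0.40)
toy$CO2pp <- 1 + unname(true_slope[as.character(toy$region)]) * toy$GDP +
  rnorm(nrow(toy), sd = 1)

# Model 1: common slope, region-specific intercepts.
m1 <- lm(CO2pp ~ GDP + region, data = toy)
# Model 2: region-specific slopes and intercepts.
m2 <- lm(CO2pp ~ GDP * region, data = toy)

# F-test of the nested models: does adding the interaction improve the fit?
cmp <- anova(m1, m2)
print(cmp)

# Per-region slopes implied by Model 2 (R1 is the reference level).
b <- coef(m2)
slopes <- c(R1 = b[["GDP"]],
            R2 = b[["GDP"]] + b[["GDP:regionR2"]],
            R3 = b[["GDP"]] + b[["GDP:regionR3"]])
print(round(slopes, 2))
```

A small p-value in the second row of the comparison table indicates that allowing the slope to vary by region explains significantly more variation than a single common slope.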

Figure 13 shows the regression lines broken out by region, including the confidence intervals. These confidence intervals are quite wide, especially for the Americas, Asia, and Oceania. This model attempts to estimate the true relationship between CO\(_2\) per person and GDP, broken out by region. Because these confidence intervals are wide, there is considerable uncertainty in the estimated regression lines.


Figure 13. Plots depicting each region and the regression of Model 2 with confidence intervals.

Results

Our analysis suggests that there is a relationship between GDP, region, and the dependent variable CO2pp. The p-values for both Model 1 and Model 2 were statistically significant, suggesting that these variables have a relationship with CO2pp. Further analysis of the interaction between GDP and region shows that, for some levels of the variable region, there are significant interactions with GDP. For example, the interaction between the region Americas and GDP is statistically significant. This may be due to the variety of GDP in the region, which spans from the US, with a very high GDP, to Haiti, with a very low GDP. In other words, the relationship between GDP and CO\(_2\) per person in the Americas differs significantly from that in the reference region. Because there is more market activity and production of economic goods in the United States, there may also be more emission of CO\(_2\) to support these processes. On the other hand, there seems to be no statistically significant interaction between GDP and the region Europe. This may be because most European countries are at a similar level of economic development. Therefore, their GDPs are similar, and so too are their CO\(_2\) emissions into the atmosphere.

The confidence intervals in Figure 13 provide a visual representation of the uncertainty of the models. Both of our models were developed to estimate the true relationship between GDP, region, and the dependent variable CO\(_2\) per person. The confidence intervals depict the idea that, over repeated samples, about 95% of the intervals constructed in this way would capture the true relationship between these variables. Overall, the results of our analysis suggest that GDP and region are useful variables for making inferences about CO\(_2\) emissions per person globally. In general, as GDP increases, so do CO\(_2\) emissions per person. The bubble plots highlight countries that have high GDP and CO\(_2\) emissions per person (United States and UK).
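The confidence bands drawn around a fitted regression line can also be computed directly. The following is a minimal sketch on simulated data, not our actual analysis: the data frame `toy`, its values, and the GDP values chosen for prediction are assumptions made up for the example. It uses `predict()` with `interval = "confidence"` for the band around the fitted line and `confint()` for the interval on the slope.

```r
set.seed(4)
n <- 60

# Simulated single-region data (values are made up).
toy <- data.frame(GDP = runif(n, 1, 50))  # GDP in thousands
toy$CO2pp <- 1 + 0.2 * toy$GDP + rnorm(n, sd = 2)
fit <- lm(CO2pp ~ GDP, data = toy)

# 95% confidence band for the mean CO2pp at a few GDP values.
new_gdp <- data.frame(GDP = c(10, 25, 40))
band <- predict(fit, newdata = new_gdp, interval = "confidence", level = 0.95)
print(cbind(new_gdp, round(band, 2)))

# 95% confidence interval for the slope itself.
print(confint(fit, "GDP", level = 0.95))
```

The band is narrowest near the mean of the observed GDP values and widens toward the extremes, and it widens overall when a region contains few countries, which is consistent with the wide intervals we see for regions with little data.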

Conclusion and Discussion

Sociological and economic factors may lead to an increase in global CO\(_2\) emissions into the atmosphere. Increased emissions result in global warming and negative cascading effects that impact human-built and natural systems. It is vital that we understand the relationships between the climate system and the human developments that increase global warming. Our analysis suggests that Model 2 can be used as a tool to make inferences about the relationship between CO\(_2\) emissions per person and global socio-economic factors. Our analysis shows a statistically significant relationship between GDP, geographic region, and CO\(_2\) emissions per person. Modern economic processes often rely on non-renewable energy sources (which produce CO\(_2\) emissions) to generate market activity and revenue. This is consistent with our finding that there is a positive correlation between GDP and CO\(_2\) emissions per person. Our graphics help depict other regional trends, such as relatively low GDP and CO\(_2\) emissions per person in Africa and medium to high GDP and CO\(_2\) emissions per person in Europe. Our analysis focused on only a few global sociological and economic variables. While we were able to gain some interesting insights and better understand the relationships between these variables, an analysis that is finer geographically (not grouped by region) and more thorough (more variables and more countries) would give a better understanding of the global factors that should be leveraged to decrease CO\(_2\) emissions and mitigate climate change.

Appendix: R Code

The following is the R code which we used to compile our datasets and produce our results.

# Imports ####
library(tidyverse)
library(dplyr)
library("ggplot2")
theme_set(theme_bw())
library("sf")
library("rgeos")
library("rnaturalearth")
library("rnaturalearthdata")
library("mapview")
library("tidycensus")
library("sp")
library("spatialEco")
library("tigris")
library("maps")
library("leaflet")
library("tmap")

# Read Geographical Regions data
regions <- read.csv("Country_Regions.csv")
regions <- regions[,c(1, 6)]
colnames(regions)[1]<- "country"

# Read annual share of CO2 emissions data
annualShareCO2 <- read.csv("annual-share-of-co2-emissions.csv", header=TRUE)
colnames(annualShareCO2) <- c('Entity', 'Code', 'Year', 'Share')
share2014<-annualShareCO2 %>%
  filter(Year==2014) %>%
  select(c("Entity", "Share"))
colnames(share2014)[1]<-"country"
colnames(share2014)[2]<-"CO2share"

# Read population growth factor data
popGrowth <- read.csv("population_growth_annual_percent.csv", header=TRUE)
pop2014<-popGrowth %>%
  select(c(country, X2014))
colnames(pop2014)[1]<-"country"
colnames(pop2014)[2] <- "popGrowthFactor"

# Read GDP data
GDPperCap <- read.csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
GDP2014<-GDPperCap %>%
  select(c(country, X2014))
colnames(GDP2014)[1]<- "country"
colnames(GDP2014)[2]<- "GDP"

# Read Tech Exports Data
highTechExports <- read.csv("high_technology_exports_percent_of_manufactured_exports.csv")
highTech2014<-highTechExports %>%
  select(c(country, X2014))
colnames(highTech2014)[1]<- "country"
colnames(highTech2014)[2]<- "highTechExports"

# Read CO2 per person Data
CO2perPerson <- read.csv("co2_emissions_tonnes_per_person.csv")
CO22014 <-CO2perPerson %>%
  select(c(country, X2014))
colnames(CO22014)[1]<- "country"
colnames(CO22014)[2]<- "CO2pp"

# Clean country names in regions (there may be others we didn't catch)
regions$country <- as.character(regions$country)
regions$country[235] <- "United Kingdom"
regions$country[236] <- "United States"

# Merge Datasets
GlobalFactors <- merge(regions, share2014)
GlobalFactors <- merge(GlobalFactors, pop2014)
GlobalFactors <- merge(GlobalFactors, GDP2014)
GlobalFactors <- merge(GlobalFactors, highTech2014)
GlobalFactors <- merge(GlobalFactors, CO22014)

GlobalFactors <- drop_na(GlobalFactors)
#Now 151 countries


# Exploratory scatter plots and box plots ####

ggplot(GlobalFactors, aes(x = region, y = CO2share)) + 
  geom_boxplot()

ggplot(GlobalFactors, aes(x = highTechExports, y = popGrowthFactor)) + 
  geom_point()

ggplot(GlobalFactors, aes(x = CO2pp, y = CO2share)) + 
  geom_point()

ggplot(GlobalFactors, aes(x = GDP, y = CO2pp)) + 
  geom_point()

ggplot(GlobalFactors, aes(x = GDP, y = highTechExports)) + 
  geom_point()

ggplot(GlobalFactors, aes(x = region, y = popGrowthFactor))+ 
  geom_boxplot()

ggplot(GlobalFactors, aes(x = region, y = CO2pp)) +
  geom_boxplot()


# bubble plots ####

ggplot(GlobalFactors, aes(x = GDP, y = highTechExports)) + 
  geom_point(aes(size = CO2share, color = region), alpha = 0.5) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#CC0099", "#66CC00")) +
  scale_size(range = c(0.5, 12))  # Adjust the range of points size

# Maps ####
world <- ne_countries(scale = "medium", returnclass = "sf")
joinFactors<-geo_join(world, GlobalFactors, 
                 by_sp="name", by_df="country")

ggplot(data = joinFactors) +
  geom_sf(aes(fill = popGrowthFactor)) +
  scale_fill_viridis_c(option = "plasma")  # no sqrt transform: negative growth rates would become NaN


# MLR model without interaction (Model 1) ####
GlobalFactorsM <- lm(CO2pp ~ GDP+region, data=GlobalFactors)
summary(GlobalFactorsM)
anova(GlobalFactorsM)

# Plot with one regression line for all regions
ggplot()+
  geom_point(data = GlobalFactors, aes(x=GDP, y=CO2pp, color=region))+
  stat_smooth(data = GlobalFactors, aes(x = GDP, y = CO2pp), method = 'lm', se = F)

# Plot with a regression line for each region (parallel lines from Model 1)
ggplot(GlobalFactors, aes(x=GDP, y=CO2pp, color=region))+
  geom_point()+
  geom_abline(intercept = GlobalFactorsM$coefficients[1], slope=GlobalFactorsM$coefficients[2],
              color=rgb(.95, .47, .42), lwd=1)+
  geom_abline(intercept = GlobalFactorsM$coefficients[1]+GlobalFactorsM$coefficients[3], slope=GlobalFactorsM$coefficients[2],
              color=rgb(.63, .65, .0), lwd=1)+
  geom_abline(intercept = GlobalFactorsM$coefficients[1]+GlobalFactorsM$coefficients[4], slope=GlobalFactorsM$coefficients[2],
              color=rgb(.18, .75, .47), lwd=1)+
  geom_abline(intercept = GlobalFactorsM$coefficients[1]+GlobalFactorsM$coefficients[5], slope=GlobalFactorsM$coefficients[2],
              color=rgb(.27, .69, .98), lwd=1)+
  geom_abline(intercept = GlobalFactorsM$coefficients[1]+GlobalFactorsM$coefficients[6], slope=GlobalFactorsM$coefficients[2],
              color=rgb(.90, .41, .97), lwd=1)

# MLR model with GDP x region interaction (Model 2) ####
GlobalFactorsMI <- lm(CO2pp~GDP*region, data=GlobalFactors)
summary(GlobalFactorsMI)
confint(GlobalFactorsMI)

# Plot faceted by region, with regression line and confidence intervals for each 
ggplot(GlobalFactors, aes(x=GDP, y=CO2pp, color=region))+
  geom_point()+
  facet_wrap(.~region)+
  stat_smooth(method='lm')


# Residuals for model 1
ggplot(GlobalFactorsM, aes(x=.fitted, y=.stdresid))+
  geom_point()+
  geom_abline(slope=0, intercept=0, col="red")

# qqplot for model 1
qqnorm(GlobalFactorsM$residuals)
qqline(GlobalFactorsM$residuals)

# Residuals for model 2
ggplot(GlobalFactorsMI, aes(x=.fitted, y=.stdresid))+
  geom_point()+
  geom_abline(slope=0, intercept=0, col="red")

# qq plot for model 2
qqnorm(GlobalFactorsMI$residuals)
qqline(GlobalFactorsMI$residuals)

The following are some plots that we produced and did not use, but may still be interesting.

Citations and Data Sources

“Annual Share of Global CO₂ Emissions.” Our World in Data, ourworldindata.org/grapher/annual-share-of-co2-emissions.

“GDP per Capita, Constant PPP Dollars.” Gapminder, www.gapminder.org/data/documentation/gd001/.

“High-Technology Exports (Current US$).” High-Technology Exports (Current US$) | Data Catalog, datacatalog.worldbank.org/high-technology-exports-current-us-0.

Oak Ridge National Laboratory. “Carbon Dioxide Information Analysis Center.” cdiac.ess-dive.lbl.gov/.

Lukes. “Countries-with-Regional-Codes.” GitHub, github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv.

“Population Growth (Annual %).” Data, data.worldbank.org/indicator/sp.pop.grow.