Exercise 3 - Data Visualisation

library("ggplot2")
library("Publish")

## Loading required package: prodlim

library("knitr")
install.packages("readxl", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/mneve/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'readxl' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'readxl'

## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\mneve\Documents\R\win-
## library\4.0\00LOCK\readxl\libs\x64\readxl.dll to C:
## \Users\mneve\Documents\R\win-library\4.0\readxl\libs\x64\readxl.dll: Permission
## denied

## Warning: restored 'readxl'

## 
## The downloaded binary packages are in
##  C:\Users\mneve\AppData\Local\Temp\RtmpEp83c6\downloaded_packages

library("readxl")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")
install.packages("stringr", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/mneve/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'stringr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\mneve\AppData\Local\Temp\RtmpEp83c6\downloaded_packages

library("stringr")

# Introduction

In this document, I will attempt to run a regression of CO2 emissions per capita per country on GDP per capita.

My data was obtained from the World Bank, with time series that range from 1960 to 2019 for GDP and from 1960 to 2016 for CO2 emissions.

# Preparing R for my regression

## Downloading the datasets onto R

data_GDP = read_excel("C:\\Users\\mneve\\Documents\\Etudes\\Masters\\1- Fall Semester\\Quantitative Methods\\Exercise 3 - Data visualisation\\GDP per country time series.xls", col_names = TRUE)
data_CO2E = read_excel("C:\\Users\\mneve\\Documents\\Etudes\\Masters\\1- Fall Semester\\Quantitative Methods\\Exercise 3 - Data visualisation\\CO2E per country time series.xls", col_names = TRUE)

## Combining the two data tables

data_complete = full_join(data_GDP, data_CO2E, by = "Country_Name")
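A `full_join()` keeps every country that appears in either table and fills the missing side with `NA`, which is one possible source of the missing values reported further down. The sketch below illustrates this on two small made-up tables (the values are invented, not World Bank data) and uses `anti_join()` to list the countries that appear in only one table, for example because of spelling differences in `Country_Name`.

```r
library(dplyr)

# Toy tables with invented values (NOT World Bank data)
gdp <- data.frame(
  Country_Name = c("A", "B", "C"),
  GDP_2016     = c(1000, 20000, 50000)
)
co2 <- data.frame(
  Country_Name = c("A", "B", "D"),
  CO2E_2016    = c(0.5, 6.0, 12.0)
)

# Full join: keeps A, B, C and D; unmatched cells become NA
joined <- full_join(gdp, co2, by = "Country_Name")

# Countries present in only one of the two tables
only_gdp <- anti_join(gdp, co2, by = "Country_Name")
only_co2 <- anti_join(co2, gdp, by = "Country_Name")

print(joined)
print(only_gdp$Country_Name)
print(only_co2$Country_Name)
```

Running the two `anti_join()` calls on the real tables would show whether any country names fail to match between the GDP and CO2 files.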

# Running my first regression

## Running a regression of 2016 CO2 emissions per country on 2016 GDP

#I have to select the GDP and CO2 emissions of 2016 first
data_sample = data_complete %>% select("Country_Name", "GDP_2016", "CO2E_2016")

#Creating a scatterplot
ggplot(data_sample, aes(GDP_2016,CO2E_2016)) +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("GDP per capita in 2016") +
  ylab("CO2 emissions per capita in 2016") +
  ggtitle("CO2 emissions per capita and GDP per capita in 2016")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 27 rows containing non-finite values (stat_smooth).

## Warning: Removed 27 rows containing missing values (geom_point).

DISCUSSION

It looks like there is a positive relationship between GDP per capita and emissions per capita.
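The line drawn by `geom_smooth(method = "lm")` is an ordinary least squares fit, so the same relationship can be quantified with `lm()`. The sketch below uses a small made-up data frame (invented values, not the World Bank data) with the same column names as `data_sample`. On the real data, the same call with `data = data_sample` should work, since `lm()` drops rows with missing values by default.

```r
# Toy data with invented values (NOT World Bank data),
# using the same column names as data_sample
toy <- data.frame(
  GDP_2016  = c(1000, 5000, 10000, 20000, 40000, 60000),
  CO2E_2016 = c(0.3, 1.5, 3.0, 5.5, 9.0, 14.0)
)

# Ordinary least squares: emissions per capita on GDP per capita
fit <- lm(CO2E_2016 ~ GDP_2016, data = toy)

# Intercept and slope of the fitted line
print(coef(fit))

# Share of the variance in emissions explained by GDP
print(summary(fit)$r.squared)
```

A positive slope coefficient corresponds to the upward-sloping line in the scatterplot.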

Many more countries have low GDP per capita and low emissions per capita than high values of either. This makes sense as we know that most of the emissions come from developed countries, which are less numerous than developing countries (I include large polluters like China in developed countries here).

There also seems to be less variation in emissions per capita among lower-income countries, which makes the results of my regression less meaningful. Indeed, we see that the data is very concentrated close to the intercept but spreads wider the further you go from it (i.e. larger GDP per capita or larger CO2 emissions per capita).
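A spread that widens with GDP is usually called heteroskedasticity. One simple way to check for it, sketched here on simulated data (the numbers are generated, not World Bank data, and the column names `GDP` and `CO2E` are placeholders), is to compare the standard deviation of the residuals in the lower and upper halves of the GDP range.

```r
set.seed(1)

# Simulated data whose noise grows with GDP (NOT World Bank data)
gdp <- seq(1000, 60000, length.out = 200)
co2 <- 0.0002 * gdp + rnorm(200, sd = 0.00005 * gdp)
sim <- data.frame(GDP = gdp, CO2E = co2)

# Fit the linear model and store the residuals
fit <- lm(CO2E ~ GDP, data = sim)
sim$resid <- resid(fit)

# Residual spread below and above the median GDP
sd_low  <- sd(sim$resid[sim$GDP <= median(sim$GDP)])
sd_high <- sd(sim$resid[sim$GDP >  median(sim$GDP)])

print(c(low = sd_low, high = sd_high))
```

If the two values differ a lot, as they do in this simulation by construction, the standard errors reported by `lm()` should be read with caution.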

There is a strong outlier, Liechtenstein, with a GDP per capita of 165,629.1905 and emissions per capita of 1.36326942, more similar to those of less developed countries. I will now attempt a second regression analysis where I exclude Liechtenstein.

## Running the same regression by removing Liechtenstein

data_sample_clean = data_sample %>% filter(Country_Name != "Liechtenstein")

#Creating a scatterplot
ggplot(data_sample_clean, aes(GDP_2016,CO2E_2016)) +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("GDP per capita in 2016") +
  ylab("CO2 emissions per capita in 2016") +
  ggtitle("CO2 emissions per capita and GDP per capita in 2016 - Liechtenstein excluded") +
  xlim(0,110000)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 27 rows containing non-finite values (stat_smooth).

## Warning: Removed 27 rows containing missing values (geom_point).

DISCUSSION

When looking at the new scatterplot, the regression line seems to fit the data better than it did when Liechtenstein was included. The relationship between GDP per capita and CO2 emissions per capita appears even more positive than before. However, the residuals are still larger for larger GDP values.

I wonder if the three points with emissions per capita above 30 should be excluded, as they diverge quite significantly from the rest of the data. These three observations are Curacao, Qatar and Zambia. Curacao has fewer than 200,000 inhabitants, making its observation less important. Similarly, Qatar has fewer than 3 million inhabitants. Zambia seems more important, with its 17 million inhabitants.

I am wondering whether I should include them, or whether I can take the size of the country into account as another factor, whereby smaller countries are more likely to have either:

- big polluting industries that require a small workforce, such as Qatar with its oil extraction industry
- very low-polluting industries, contributing to global emissions through imports instead, such as Liechtenstein.
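One way to take country size into account, sketched here as an assumption rather than a tested approach, is to add population as a second explanatory variable. The column `Population_2016` is hypothetical: it is not part of the two datasets loaded above and would have to be obtained and joined separately. All values below are invented.

```r
# Toy data with invented values (NOT World Bank data).
# Population_2016 is a hypothetical column, not present in data_complete.
toy <- data.frame(
  GDP_2016        = c(1000, 5000, 20000, 40000, 60000, 80000, 30000, 15000),
  CO2E_2016       = c(0.3, 1.2, 5.0, 9.5, 32.0, 4.0, 7.0, 3.5),
  Population_2016 = c(5e7, 2e8, 6e7, 8e7, 2.5e6, 4e4, 1e7, 3e7)
)

# Multiple regression: GDP per capita plus log population as a size control
fit_size <- lm(CO2E_2016 ~ GDP_2016 + log(Population_2016), data = toy)

print(coef(fit_size))
```

The log of population is used because country sizes span several orders of magnitude. Whether a size effect shows up in the real data is an open question that this toy example cannot answer.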

## Running the same regression by removing Liechtenstein, Qatar, Curacao and Zambia

data_sample_clean1 = data_sample %>% filter(Country_Name != "Liechtenstein", CO2E_2016 <= 30)

#Creating a scatterplot
ggplot(data_sample_clean1, aes(GDP_2016,CO2E_2016)) +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("GDP per capita in 2016") +
  ylab("CO2 emissions per capita in 2016") +
  ggtitle("CO2 emissions per capita and GDP per capita in 2016 - Liechtenstein, Qatar, Curacao and Zambia excluded") +
  xlim(0,110000) +
  ylim(0,30)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 10 rows containing non-finite values (stat_smooth).

## Warning: Removed 10 rows containing missing values (geom_point).
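The number of removed rows falls from 27 to 10 in the warnings above. A likely explanation is that `filter()` silently drops rows where the condition evaluates to `NA`, so countries with missing emissions data are removed before plotting, along with the three outliers. The sketch below shows this behaviour on a small made-up table (invented values).

```r
library(dplyr)

# Toy data with invented values (NOT World Bank data)
toy <- data.frame(
  Country_Name = c("A", "B", "C", "D"),
  CO2E_2016    = c(2, 35, NA, 10)
)

# Drops B (above 30) AND C (NA <= 30 evaluates to NA)
kept <- toy %>% filter(CO2E_2016 <= 30)

# Keeps the NA row explicitly, dropping only B
kept_with_na <- toy %>% filter(is.na(CO2E_2016) | CO2E_2016 <= 30)

print(kept$Country_Name)
print(kept_with_na$Country_Name)
```

For the regression itself this makes no difference, since rows with missing values cannot be used in the fit either way.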

DISCUSSION

The scatterplot does not seem to have improved much after removing Qatar, Curacao and Zambia.

As a further step, it would be interesting to figure out a way to improve on the problem with residuals.
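One common option for this kind of residual pattern, offered here only as a possible next step, is to regress the logarithm of emissions on the logarithm of GDP. The sketch below uses simulated data (generated values, not World Bank data, with placeholder column names `GDP` and `CO2E`) in which the noise is multiplicative, which is the situation a log-log model is designed for.

```r
set.seed(42)
n <- 200

# Simulated data with multiplicative noise (NOT World Bank data)
gdp <- exp(runif(n, log(500), log(80000)))
co2 <- exp(-8 + 0.9 * log(gdp) + rnorm(n, sd = 0.3))
sim <- data.frame(GDP = gdp, CO2E = co2)

# Log-log regression: the slope is an elasticity, i.e. the percentage
# change in emissions associated with a 1% change in GDP
fit_log <- lm(log(CO2E) ~ log(GDP), data = sim)
slope   <- coef(fit_log)[["log(GDP)"]]

print(slope)
```

On the real data, `log()` requires strictly positive values, so any zero entries would need to be handled first. In ggplot2, adding `scale_x_log10()` and `scale_y_log10()` to the scatterplot gives the matching view of the data.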

Exercise 3 - Data Visualisation - Marie Neveu
