Abstract

For the final project in our Introduction to Data Analytics class (DA210), we analyzed two datasets that we predicted would be correlated: CO2 emissions and wood removal. After merging these two datasets in Excel, we used RStudio to analyze the yearly averages of CO2 emissions and wood removal for over 170 countries around the world. We found that, over time (1950-2014), the worldwide average of CO2 emissions increased. However, in individual countries such as the United States, CO2 emissions decreased over the most recent decade of data (2000-2011). We also analyzed the correlation between the world’s average CO2 emissions and wood removal and found no significant relationship between the two variables. Throughout this project, we used the tidyr, ggplot2, dplyr, and olsrr packages to analyze our data.

Introduction

We will be analyzing multiple datasets that were extracted from GapMinder. The compiled dataset (merged.csv) contains the CO2 emission and the wood removed in each listed country between 1990 and 2011, along with the yearly averages across countries. First, we will focus on the change in CO2 emission over time. Then, we will focus on the relationship between CO2 and wood removal.

The main questions we will answer are: Have CO2 emissions increased over time? Has wood removal increased over time? And is there a correlation between the two?

Before doing any analysis, we expect that, as wood removal increases, CO2 emission will increase as well. We believe there will be a direct relationship between the two variables because trees naturally take in CO2 and produce oxygen. If trees are being cut down and removed for wood, then less CO2 is being taken out of the atmosphere, which should raise CO2 levels.

Data

As mentioned before, we downloaded the two original datasets from GapMinder:

We tidied the data up and used multiple .csv files throughout our analysis:

To make analyzing the data easier in RStudio, we cleaned the data and merged the wood and co2.2 dataframes in Excel, creating our merged.csv file. This file consists of data from 172 countries over 22 years (1990-2011), and we created two additional columns for the average CO2 emission and average wood removal for each year. The merged file is restricted to 1990-2011 because the wood removal data only covers those years. For the CO2-only analysis, we start at 1950 because too much data is missing before then.
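The same merge and yearly averages could also be produced in R rather than Excel. The sketch below is only an illustration on made-up numbers, not the workflow we used: the data frames `co2_toy` and `wood_toy` and their values are hypothetical, and it assumes one row per year and one column per country, as in our tidied files.

```r
# Hypothetical toy data: one row per year, one column per country.
co2_toy <- data.frame(Year = 1990:1992,
                      United.States = c(19.0, 19.1, 19.2),
                      United.Kingdom = c(10.0, 10.1, 9.9))
wood_toy <- data.frame(Year = 1990:1992,
                       United.States = c(500, 510, 520),
                       United.Kingdom = c(6, 7, 8))

# Join on Year; the suffixes keep the two measurements apart,
# e.g. United.States_CO2 and United.States_wood.
merged_toy <- merge(co2_toy, wood_toy, by = "Year",
                    suffixes = c("_CO2", "_wood"))

# Yearly averages across countries, ignoring missing values.
co2_cols  <- grep("_CO2$",  names(merged_toy), value = TRUE)
wood_cols <- grep("_wood$", names(merged_toy), value = TRUE)
merged_toy$Total_Avg_CO2  <- rowMeans(merged_toy[, co2_cols],  na.rm = TRUE)
merged_toy$Total_Avg_wood <- rowMeans(merged_toy[, wood_cols], na.rm = TRUE)

print(merged_toy)
```

Doing this step in code keeps the averaging rule (here, a plain mean that skips missing values) visible and repeatable.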

This is how we read the .csv files into RStudio:

co2 <- read.csv("tidyCO2.csv")
wood <- read.csv("tidywood.csv")
co2.2 <- read.csv("tidyCO2.2.csv")
merged <- read.csv("merged.csv")

Results

World’s Average CO2 Emission (1950-2014)

The graph below shows the change in the world’s average CO2 emission over the decades. From 1950 to about 1975, CO2 emission grew quite rapidly. From about 1975 to 2014, it seems to decline and then level off.

ggplot(co2, aes(x=Year, y=Total_Avg)) +
  geom_point() +  
  ggtitle("CO2 Emission Worldwide") +
  xlab("Year") + ylab("CO2 Emission")

U.S. Average CO2 Emission Compared to the World’s Average CO2 Emission (1950-2014)

When comparing the US average CO2 emission to the world’s average (1950-2014), it is clear that the graphs are similar.

ggplot(co2, aes(x=Year, y=Total_Avg)) +
  geom_point() +
  ggtitle("World Average CO2 Emission") +
  xlab("Year") + ylab("CO2 Emission")

ggplot(co2, aes(x=Year, y=United.States)) +
  geom_point() +
  ggtitle("US Average CO2 Emission") +
  xlab("Year") + ylab("CO2 Emission")

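When looking at the numbers (see below), we found a relationship between the world’s CO2 average and the US CO2 average. Note that the model below (ModelWUS) uses Year as the response and the US and world averages as predictors, so it measures how well the two series together track time rather than the correlation between them directly. The fit is fairly strong (multiple R^2 = 0.6638) and significant (p-value (2.103e-15) < α (0.001)).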

ModelWUS <- lm(Year ~ United.States + Total_Avg, co2, model = TRUE)
summary(ModelWUS)
## 
## Call:
## lm(formula = Year ~ United.States + Total_Avg, data = co2, model = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7913  -8.4493  -0.3025   6.4942  21.8679 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2022.880     17.297 116.950  < 2e-16 ***
## United.States   -8.275      1.415  -5.847 2.02e-07 ***
## Total_Avg       27.558      2.816   9.788 3.40e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.14 on 62 degrees of freedom
## Multiple R-squared:  0.6638, Adjusted R-squared:  0.653 
## F-statistic: 61.22 on 2 and 62 DF,  p-value: 2.103e-15
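Because ModelWUS has Year as its response, a more direct check of whether the two series move together is a correlation test between them. The sketch below uses simulated stand-in series (the values are hypothetical, not our data). On the real data, the corresponding call would be `cor.test(co2$United.States, co2$Total_Avg)`.

```r
set.seed(1)
# Hypothetical stand-in series that share a common upward trend.
year  <- 1950:2014
world <- 2 + 0.04 * (year - 1950) + rnorm(length(year), sd = 0.2)
us    <- 15 + 1.5 * world + rnorm(length(year), sd = 0.5)

# Pearson correlation with a test of H0: correlation = 0.
ct <- cor.test(us, world)
print(ct$estimate)  # sample correlation r
print(ct$p.value)   # p-value for the test

# For a one-predictor regression, R-squared equals r squared.
fit <- lm(us ~ world)
print(all.equal(summary(fit)$r.squared, unname(ct$estimate)^2))
```

The last line is a sanity check: with a single predictor, the regression R^2 and the squared correlation are the same quantity.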

Is CO2 in the United States increasing or decreasing?

CO2 in the United States seemed to decrease over the past two decades of data (1990-2011), as seen in the graph below. However, the relationship with time is weak (adjusted R^2 = 0.2296), and it is significant at α = 0.05 but not at our stricter α = 0.001 (p-value = 0.01395).

ggplot(merged, aes(x = Year, y = United.States_CO2)) +
  geom_point() +
  stat_smooth(method = "lm", col = "blue") +
  ggtitle("CO2 in the United States") +
  xlab("Year") + ylab("CO2 (tonnes per person)")

modelUS20 <- lm(United.States_CO2~Year,merged,model = TRUE)
summary(modelUS20)
## 
## Call:
## lm(formula = United.States_CO2 ~ Year, data = merged, model = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4111 -0.5157  0.2022  0.5700  1.0529 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 152.96979   49.67877   3.079  0.00592 **
## Year         -0.06691    0.02483  -2.694  0.01395 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.739 on 20 degrees of freedom
## Multiple R-squared:  0.2663, Adjusted R-squared:  0.2296 
## F-statistic:  7.26 on 1 and 20 DF,  p-value: 0.01395

However, when looking at CO2 in the United States over the most recent decade (2000-2011), the decline is steeper: the fitted slope is about -0.27 tonnes per person per year, compared with about -0.07 for 1990-2011. The relationship with time is strong (adjusted R^2 = 0.7948) and significant (p-value (6.052e-05) < α (0.001)).

ggplot(merged, aes(x = Year, y = United.States_CO2)) +
  geom_point() +
  xlim(1998, 2012) +
  stat_smooth(method = "lm", col = "blue") +
  ggtitle("CO2 in the United States") +
  xlab("Year") + ylab("CO2 (tonnes per person)")

US10 <- merged[11:22, ]  # rows 11-22 are the years 2000-2011

modelUS10 <- lm(United.States_CO2~Year,US10,model = TRUE)
summary(modelUS10)
## 
## Call:
## lm(formula = United.States_CO2 ~ Year, data = US10, model = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75256 -0.29968 -0.08526  0.34167  0.80128 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 566.56410   82.93546   6.831 4.56e-05 ***
## Year         -0.27308    0.04135  -6.603 6.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4945 on 10 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.7948 
## F-statistic:  43.6 on 1 and 10 DF,  p-value: 6.052e-05
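Selecting the decade by row position only works if the rows are sorted by year with 1990 in row 1. A less fragile alternative is to filter on the Year column itself. The sketch below shows this on a hypothetical stand-in for merged.csv (the data frame `merged_toy` and its values are made up for illustration).

```r
# Hypothetical stand-in for merged.csv: one row per year, 1990-2011.
merged_toy <- data.frame(Year = 1990:2011)
merged_toy$United.States_CO2 <- 20 - 0.05 * (merged_toy$Year - 1990)

# Select the rows by year rather than by position.
US10_toy <- subset(merged_toy, Year >= 2000 & Year <= 2011)
print(nrow(US10_toy))        # 12 rows: 10 residual df + 2 coefficients
print(range(US10_toy$Year))

# Fit the same kind of model on the subset.
model_toy <- lm(United.States_CO2 ~ Year, data = US10_toy)
print(coef(model_toy)["Year"])
```

On the real data, `subset(merged, Year >= 2000 & Year <= 2011)` should return the same 12 rows as the index-based selection, provided the file is ordered as described.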

Is there a correlation between the world’s average CO2 emission and average wood removal?

ggplot(merged, aes(x = Total_Avg_wood, y = Total_Avg_CO2)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  ggtitle("Correlation between Wood Removal & CO2 Emission") +
  xlab("Wood Removal (sqr meters)") + ylab("CO2 (tonnes per person)")

modelwco <- lm(Total_Avg_CO2~Total_Avg_wood,merged,model = TRUE)
summary(modelwco)
## 
## Call:
## lm(formula = Total_Avg_CO2 ~ Total_Avg_wood, data = merged, model = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.247743 -0.108061 -0.007886  0.113414  0.250133 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.446e+00  7.692e-01   5.780 1.18e-05 ***
## Total_Avg_wood -2.063e-10  4.528e-08  -0.005    0.996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1394 on 20 degrees of freedom
## Multiple R-squared:  1.038e-06,  Adjusted R-squared:  -0.05 
## F-statistic: 2.075e-05 on 1 and 20 DF,  p-value: 0.9964

The graph above shows an essentially flat regression line of the world’s average CO2 emission on the world’s average wood removal. The statistics show practically no linear relationship: wood removal explains almost none of the variance in CO2 (multiple R^2 = 1.038e-06; the adjusted R^2 of -0.05 is negative only because of the penalty for the number of predictors). The relationship is also not significant (p-value (0.9964) > α (0.001)). The equation for this regression line is Y = 4.446 - 2.063e-10 X + e, where Y is the average CO2 emission and X is the average wood removal.
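The negative adjusted R^2 in the summary output above can be reproduced by hand from the multiple R^2. The sketch below uses the standard adjustment formula with n = 22 yearly observations and p = 1 predictor; the helper name `adj_r2` is our own, introduced only for this illustration.

```r
# Adjusted R-squared from multiple R-squared, n observations, p predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Multiple R-squared taken from the summary(modelwco) output.
print(adj_r2(1.038e-06, n = 22, p = 1))  # roughly -0.05
```

Because the multiple R^2 is almost exactly zero, the adjustment pushes the value slightly below zero, so it should be read as "no variance explained" rather than as a percentage.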

Individual countries:

While there is practically no relationship between the world’s average CO2 emission and average wood removal, some individual countries seem to show a correlation. However, these relationships are not always direct.

The United Kingdom is one example:

ggplot(co2.2, aes(x=Year, y=United.Kingdom)) +
  geom_point() +
  ggtitle("CO2 in the United Kingdom") +
  xlab("Year") + ylab("CO2 Emission")

ggplot(wood, aes(x=Year, y=United.Kingdom)) +
  geom_point() +
  ggtitle("Wood Removal in the United Kingdom") +
  xlab("Year") + ylab("Wood Removal")

modelUK <- lm(United.Kingdom_CO2~United.Kingdom_wood,merged,model = TRUE)
summary(modelUK)
## 
## Call:
## lm(formula = United.Kingdom_CO2 ~ United.Kingdom_wood, data = merged, 
##     model = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00352 -0.23071 -0.00289  0.25326  0.48764 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.405e+01  6.768e-01  20.762 5.29e-15 ***
## United.Kingdom_wood -6.412e-07  8.440e-08  -7.597 2.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3706 on 20 degrees of freedom
## Multiple R-squared:  0.7427, Adjusted R-squared:  0.7298 
## F-statistic: 57.72 on 1 and 20 DF,  p-value: 2.564e-07

The graphs and numbers above show that there is a correlation between wood removal and CO2 emission in the United Kingdom. It is strong (adjusted R^2 = 0.7298) and significant (p-value (2.564e-07) < α (0.001)). This is an inverse relationship: as wood removal increases in the UK, CO2 emission actually decreases.
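To see which countries show a direct or an inverse relationship without fitting each model by hand, the per-country correlations could be computed in one pass. The sketch below is a hypothetical illustration: the data frame `toy`, the country names A and B, and all values are made up, and it assumes the `<Country>_CO2` / `<Country>_wood` column naming used in merged.csv.

```r
# Hypothetical toy data in the same wide layout as merged.csv.
toy <- data.frame(
  Year   = 2000:2005,
  A_CO2  = c(10, 9.8, 9.5, 9.1, 9.0, 8.6),
  A_wood = c(100, 110, 125, 140, 150, 170),
  B_CO2  = c(2.0, 2.1, 2.3, 2.2, 2.5, 2.6),
  B_wood = c(50, 52, 57, 55, 60, 64)
)

# Country names are whatever precedes the "_CO2" suffix.
countries <- sub("_CO2$", "", grep("_CO2$", names(toy), value = TRUE))

# Pearson correlation between CO2 and wood removal for each country.
r_by_country <- sapply(countries, function(cn) {
  cor(toy[[paste0(cn, "_CO2")]], toy[[paste0(cn, "_wood")]],
      use = "complete.obs")
})
print(round(r_by_country, 2))
```

A negative value marks an inverse relationship (as in the UK) and a positive value a direct one. With 172 countries, some strong correlations are expected by chance alone, so any country picked out this way would still need its own significance check.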

Conclusion

We originally predicted a strong, positive correlation between CO2 emissions and wood removal over time. However, the statistics do not support this and instead indicate essentially no correlation between the world’s average CO2 emissions and wood removal (although some individual countries do show a relationship, as seen in the UK).

Overall, we cannot claim that, as wood removal increases, CO2 increases. Even if there were a correlation between the two, correlation does not mean causation! Many other factors contribute to CO2 levels, and while wood removal might have some impact, we do not have enough data or analysis to confirm or rule this out.