Initialization:

Load the tidyverse and pwr packages and read in the climate change dataset, with the package startup output suppressed.

library(tidyverse)
library(pwr)
df_main <- read.csv("climate_change_dataset.csv")

df_main |> head()

df_main |> str()

## 'data.frame':    1000 obs. of  10 variables:
##  $ Year                       : int  2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
##  $ Country                    : chr  "UK" "USA" "France" "Argentina" ...
##  $ Avg.Temperature...C.       : num  8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
##  $ CO2.Emissions..Tons.Capita.: num  9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
##  $ Sea.Level.Rise..mm.        : num  3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
##  $ Rainfall..mm.              : int  1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
##  $ Population                 : int  530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
##  $ Renewable.Energy....       : num  20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
##  $ Extreme.Weather.Events     : int  14 8 9 7 4 12 10 1 4 5 ...
##  $ Forest.Area....            : num  59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...

#debug
names(df_main)

##  [1] "Year"                        "Country"                    
##  [3] "Avg.Temperature...C."        "CO2.Emissions..Tons.Capita."
##  [5] "Sea.Level.Rise..mm."         "Rainfall..mm."              
##  [7] "Population"                  "Renewable.Energy...."       
##  [9] "Extreme.Weather.Events"      "Forest.Area...."
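
The dotted column names above are a side effect of read.csv(): with its default check.names = TRUE, every header is passed through make.names(). Below is a minimal sketch of that conversion. The two raw headers are assumptions inferred from the labels used later in this notebook, not confirmed from the CSV file itself.

```r
# Hypothetical raw headers; assumed for illustration, not read from the CSV.
raw_headers <- c("CO2 Emissions (Tons/Capita)", "Renewable Energy (%)")

# make.names() replaces each character that is not valid in an R name with "."
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing check.names = FALSE to read.csv() would keep the original headers, at the cost of needing backticks around every column name.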

Continuous and categorical column selection

We select CO2 Emissions (Tons/Capita) as the continuous response variable and Population as the explanatory variable. Population is continuous in the raw data, so we convert it to a categorical variable before running the ANOVA.

We choose CO2 Emissions because it measures the amount of carbon dioxide released into the atmosphere per person. CO2 emissions are a driving factor behind climate change, so this is a fitting response variable.

Population is currently a continuous variable, which will not work as the grouping factor in an ANOVA test. However, how population relates to CO2 emissions is a very reasonable real-world question, so this hurdle is a small one. We will convert population to a categorical variable by creating 4 groups: low, medium, high, and extrm_high population sizes. The reclassified variable can then be used in the ANOVA.

From this we will be able to test whether mean CO2 emissions differ between the population size groups.

Population partitioning:

Let's start by creating the 4 population groups (low, medium, high, and extrm_high), using the population quartiles as cut points.

#create population quartile groups
df_main <- df_main |>
  mutate(Population_Group =
           cut(Population,
               breaks = quantile(Population, probs = seq(0,1,0.25), na.rm = TRUE),
               include.lowest = TRUE,
               labels = c("Low","Medium","High","Extrm_High")))

#check group counts
df_main |>
  count(Population_Group)
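
To make the binning step easier to follow, here is a minimal sketch of the same cut() and quantile() pattern on a toy vector. The values in toy_pop are made up for illustration and are not taken from the dataset.

```r
# Toy population values (made up for illustration; not from the dataset)
toy_pop <- c(10, 20, 30, 40, 50, 60, 70, 80)

# Quartile cut points: minimum, 25%, 50%, 75%, maximum
toy_breaks <- quantile(toy_pop, probs = seq(0, 1, 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bin
toy_groups <- cut(toy_pop,
                  breaks = toy_breaks,
                  include.lowest = TRUE,
                  labels = c("Low", "Medium", "High", "Extrm_High"))

# Each group should hold a quarter of the values
print(table(toy_groups))
```

Without include.lowest = TRUE, the smallest value would fall outside the first interval and come back as NA. Quartile breaks also assume the cut points are distinct; if they coincide, cut() stops with an error about non-unique breaks.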

We can visualize the distribution of CO2 emissions within each of the population categories we have just created with a box plot:

#visualize co2 emissions across population groups

df_main |>
  ggplot(aes(x = Population_Group,
             y = `CO2.Emissions..Tons.Capita.`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "CO2 Emissions by Population Group",
    x = "Population Buckets",
    y = "CO2 Emissions (Tons/Capita)"
  )

ANOVA Hypotheses

Null Hypothesis (H0) = “The mean CO2 emissions are equal across all population categories”.

Alternative Hypothesis (H1) = “At least one population group has a mean CO2 emission level that differs from the others”.

With the null and alternative hypothesis set, we will now run the actual ANOVA test and see what happens.

#run ANOVA test

anova_model <- aov(`CO2.Emissions..Tons.Capita.` ~ Population_Group,
                   data = df_main)

#display results
summary(anova_model)

##                   Df Sum Sq Mean Sq F value Pr(>F)
## Population_Group   3     72   23.95   0.759  0.517
## Residuals        996  31421   31.55

Analysis/Insight:

We can see from the ANOVA test that our p-value (labeled Pr(>F) in the output) is 0.517. Recall that we reject the null hypothesis H0 when p < 0.05 and fail to reject it when p >= 0.05. Because 0.517 >= 0.05, we FAIL to reject the null hypothesis.

This indicates that there is not sufficient statistical evidence to conclude that mean CO2 emissions differ across the population categories we constructed. In other words, when the observations are grouped by population size, their average per-capita CO2 emissions do not differ in a statistically significant way.
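
As a sanity check, the F value and p-value can be rebuilt by hand from the numbers printed in the ANOVA table. The sketch below uses the rounded table values, so the results may differ slightly in the last digit. The eta squared line is an extra effect-size measure that is not part of the original output and is included here only as an illustration.

```r
# Values copied from the printed ANOVA table above (rounded)
ss_between <- 72      # Sum Sq, Population_Group
ss_within  <- 31421   # Sum Sq, Residuals
ms_between <- 23.95   # Mean Sq, Population_Group
ms_within  <- 31.55   # Mean Sq, Residuals
df_between <- 3
df_within  <- 996

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# p-value: upper tail of the F distribution with (3, 996) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared (not in the original output): share of total variation
# attributable to the population grouping
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_value, p = p_value, eta_sq = eta_sq), 4))
```

The grouping accounts for well under 1% of the total variation in CO2 emissions, which is consistent with the large p-value.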

We started with a somewhat benign and intuitive assumption that as population size increases, CO2 emissions also increase. This makes sense from an abstract perspective, as a larger concentration of people probably means more industrial activity and thus greater regional CO2 emissions. The ANOVA test gives us no evidence for this. Failing to reject H0 does not prove that population has no effect, but the data do not support a difference in mean emissions between the groups. On a more granular look, CO2 emissions cannot be so easily explained by population size alone. Other factors include regional energy policies, pockets of heavily industrialized areas, renewable energy use, and data collection quality. It is not uncommon for an area of a country to be very heavily industrialized and account for a large share of CO2 emissions while its population consists only of workers, engineers, and other production staff. This can come down to the level of city and urban planning, that is, where governments choose to build up industry, which can in turn depend on geography such as proximity to rare earth metal mines or bodies of water.

Regression Model

Let’s begin by finding a variable that has a linear relationship with our response variable, CO2 emissions. We can’t use population: its fitted line is essentially flat, and as discussed earlier the ANOVA found no significant relationship between the two. After plotting a few different variables, I found that renewable energy shows a slight linear relationship with CO2 emissions, and it is a very applicable pairing, so we will use it.

#scatter plot to examine relationship between renewable energy percentage and co2 emissions
df_main |>
  ggplot(aes(x = Renewable.Energy....,
             y = `CO2.Emissions..Tons.Capita.`)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "CO2 Emissions vs Renewable Energy Share",
    x = "Renewable Energy (%)",
    y = "CO2 Emissions (Tons/Capita)"
  )

## `geom_smooth()` using formula = 'y ~ x'

As we can see above, there is a slight negative linear relationship between the renewable energy percentage and the amount of CO2 released. Let’s now build the regression model and see what happens.

#construct linear regression model for the two variables

linRgrs_model <- lm(`CO2.Emissions..Tons.Capita.` ~ `Renewable.Energy....`,
               data = df_main)

#output results to terminal
summary(linRgrs_model)

## 
## Call:
## lm(formula = CO2.Emissions..Tons.Capita. ~ Renewable.Energy...., 
##     data = df_main)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1280  -4.8633   0.2287   4.9185   9.7632 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.70180    0.41400  25.850   <2e-16 ***
## Renewable.Energy.... -0.01011    0.01370  -0.738    0.461    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.616 on 998 degrees of freedom
## Multiple R-squared:  0.0005455,  Adjusted R-squared:  -0.000456 
## F-statistic: 0.5447 on 1 and 998 DF,  p-value: 0.4607

Analysis/Insight:

The linear regression model above describes the relationship between CO2 emissions and renewable energy percentage. The regression equation can be written as: CO2 Emissions = 10.7018 - 0.01011 * Renewable Energy (%). We can interpret this as follows: when the percentage of renewables is 0%, the predicted CO2 output is approximately 10.70 tons per capita. For every 1 percentage point increase in renewable energy, the predicted CO2 output drops by approximately 0.010 tons per capita. The estimated slope is therefore negative, though very small.
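
To make the equation concrete, the sketch below evaluates the fitted line at a few renewable energy percentages. The coefficients are copied from the summary output; predict_co2 is a helper name introduced here for illustration and is not part of the original notebook.

```r
# Coefficients copied from the regression summary above
intercept <- 10.70180   # predicted CO2 (tons per capita) at 0% renewables
slope     <- -0.01011   # change in CO2 per 1 percentage point of renewables

# Hypothetical helper: apply the fitted line to a renewable energy percentage
predict_co2 <- function(renewable_pct) intercept + slope * renewable_pct

renewable_pct <- c(0, 25, 50)
print(data.frame(renewable_pct, predicted_co2 = predict_co2(renewable_pct)))
```

Going from 0% to 50% renewables lowers the predicted value by only about 0.5 tons per capita, which shows how shallow the fitted line is.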

We also see that the p-value for the slope is 0.461, which is >= 0.05, so we fail to reject the null hypothesis that the slope is zero. We did not state a formal hypothesis for this question, but this is the test that the regression summary reports.
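
The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. This is a sketch that uses the rounded values printed in the summary, so small differences in the last digit are possible.

```r
# Values copied from the coefficient table above
slope    <- -0.01011   # Estimate for Renewable.Energy....
slope_se <- 0.01370    # Std. Error for Renewable.Energy....
df_resid <- 998        # residual degrees of freedom

# t value: the estimate measured in units of its standard error
t_value <- slope / slope_se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(c(t = t_value, p = p_value), 3))
```

The slope is less than one standard error away from zero, which is why the p-value is so large.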

We also get a multiple R-squared value of 0.0005455. Converted to a percentage, about 0.05455% of the variation in CO2 emissions can be explained by renewable energy percentage. This is very small, and it is in line with the original scatterplot, where the linear relationship was visibly very weak. From this we conclude that the data provide little to no evidence that an increase in renewable energy percentage leads to reduced CO2 emissions.
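
For a regression with a single predictor, R-squared can be recovered from the F-statistic and the residual degrees of freedom. The sketch below does that with the values printed in the summary. The observation count of 1000 comes from the str() output at the top of the notebook.

```r
# Values copied from the regression summary above
f_stat   <- 0.5447   # F-statistic on 1 and 998 DF
df_resid <- 998      # residual degrees of freedom
n_obs    <- 1000     # number of observations

# With one predictor, R-squared = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)

# Adjusted R-squared applies a penalty for the one predictor
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Implied correlation coefficient; the sign follows the negative slope
r <- -sqrt(r_squared)

print(signif(c(r_squared = r_squared, adj_r_squared = adj_r_squared, r = r), 4))
```

The negative adjusted value simply reflects that the penalty for the one predictor outweighs the tiny amount of variation explained.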

Intuitively we expect that an increase in renewable energy use should lead to a reduction in CO2 emissions. The fitted slope does point in that direction, but the relationship is very weak and not statistically significant. This leads us to a greater truth about life in general: intuition should guide our investigation but should never serve as evidence in itself. Basically, things are often more complicated than they appear on the surface. CO2 emissions are the culmination of many different factors, and just because a country has a high renewable energy share does not mean that its carbon output is lower. For instance, our model does not follow each country over time; it treats every observation as a single snapshot. If a country is in the transitional phase of switching from fossil to renewable energy, there will be a period where renewable energy use is high and CO2 output is also high, because energy must continue to flow at all times and renewables can only slowly and systematically replace the existing fossil fuel infrastructure and reliance.
