Prepare the tidyverse library, and load up the dataset.
Now we will prepare the two pairs of variables used in this assignment. The first pair consists of CO2 Emissions and Sea Level Rise. The second pair is Renewable Energy and Forest Area. We will start by creating two new columns: Adj_Sea_Level and Eco_Balance.
df_main_2 <-
  df_main |>
  #mutate adds new columns without altering the existing columns of the dataframe
  mutate(
    #centered sea level column: each value minus the overall mean
    Adj_Sea_Level =
      `Sea.Level.Rise..mm.` - mean(`Sea.Level.Rise..mm.`),
    #combined index column: forest area plus renewable energy
    Eco_Balance =
      `Forest.Area....` + `Renewable.Energy....`
  )
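As a quick sanity check on the centering step, the sketch below uses a small made-up vector in base R (hypothetical values, not the assignment data) to show that subtracting the mean leaves a column whose mean is zero up to floating-point rounding, while its spread is unchanged.

```r
# Hypothetical sea level values in mm; illustrative only, not from the dataset
sea_level <- c(2.1, 3.4, 1.8, 4.0, 2.7)

# Center the values the same way Adj_Sea_Level is built
adj_sea_level <- sea_level - mean(sea_level)

# The centered mean is zero up to floating-point rounding
centered_mean <- mean(adj_sea_level)
print(centered_mean)

# Centering shifts the values but leaves their standard deviation unchanged
same_spread <- isTRUE(all.equal(sd(adj_sea_level), sd(sea_level)))
print(same_spread)
```

If the real column contained missing values, mean() would return NA unless na.rm = TRUE were passed, so that is worth checking before relying on the centered column.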
Next we will draw a visualization for the CO2 vs Adjusted Sea Level Rise relationship. After that we will calculate the correlation coefficient.
#visualization
df_main_2 |>
  #map emissions to the x axis and adjusted sea level to y axis
  ggplot(aes(x = `CO2.Emissions..Tons.Capita.`, y = Adj_Sea_Level)) +
  #plot scatter points with transparency (or it will be super hard to see!)
  geom_point(alpha = 0.4) +
  #linear best fit line
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "CO2 Emissions vs Adjusted Sea Level Rise",
    x = "CO2 Emissions (Tons per Capita)",
    y = "Sea Level Rise Deviation (mm)"
  )
## `geom_smooth()` using formula = 'y ~ x'
The visualization above shows CO2 emissions per capita on the x-axis and sea level rise deviation on the y-axis. The points are widely scattered and neither cluster into groups nor form a clear upward or downward pattern. The regression line is almost completely flat, which indicates little to no linear relationship between CO2 emissions and sea level rise in this dataset. This is a surprising finding, since one would expect higher CO2 levels to lead to higher temperatures and, in turn, higher sea levels. It suggests that countries with higher CO2 emissions do not record higher sea level rise than countries that emit less.
Two caveats are important here. First, the lack of a linear relationship does not rule out other forms of relationship, such as logarithmic or exponential ones. Second, just because a country does not itself record higher sea levels does not mean that CO2 has no effect on global average sea levels.
Thus, even though we see no linear correlation between the variables, it is premature to draw an overarching conclusion about the nature of the data from this alone.
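One way to probe for a monotonic but non-linear relationship is a rank-based correlation. The sketch below uses synthetic data (an assumption for illustration, not the assignment dataset) to show how Spearman's coefficient, selected through the method argument of cor(), picks up an exponential pattern that Pearson's coefficient understates.

```r
# Synthetic, perfectly monotonic but strongly non-linear relationship
x <- 1:50
y <- exp(x / 5)

# Pearson measures linear association only
pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman <- cor(x, y, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Running the same two calls on the emissions and adjusted sea level columns would give a rank-based check alongside the Pearson value.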
#compute pearson correlation between the two target variables (here it is emissions and created sea column)
df_main_2 |>
  summarise(
    correlation =
      cor(`CO2.Emissions..Tons.Capita.`, Adj_Sea_Level)
  )
We see that the correlation coefficient is -0.03881475 for these two variables. This is very close to 0 and thus indicates essentially no linear relationship between CO2 emissions and sea level rise within the dataset. As discussed earlier, this seems surprising, but there are alternative explanations other than simply concluding that CO2 emissions do not cause sea levels to rise. For example, the dataset is organized by country and does not take global change over time into account. Sea level rise is a global measurement but CO2 emissions are a local measurement, so a country producing more CO2 will not necessarily see a larger local increase in sea level.
The correlation coefficient reflects an almost nonexistent linear correlation between the two variables, but a global relationship cannot be ruled out from this alone.
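To attach some uncertainty to a correlation estimate, base R offers cor.test(), which reports a confidence interval and p-value for Pearson's r. The sketch below runs it on simulated, independent variables (hypothetical stand-ins named emissions and sea_dev, not the real columns), so the specific numbers carry no meaning for the assignment.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Two independent simulated variables standing in for emissions and sea level
emissions <- rnorm(1000, mean = 10, sd = 3)
sea_dev   <- rnorm(1000, mean = 0, sd = 1)

# Pearson correlation test: estimate, 95% confidence interval, p-value
result <- cor.test(emissions, sea_dev, method = "pearson")

r_hat <- unname(result$estimate)
ci    <- as.numeric(result$conf.int)

print(r_hat)
print(ci)
print(result$p.value)
```

A confidence interval that straddles zero would be in line with reading a coefficient as showing no detectable linear relationship.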
Now we will draw a visualization for the Renewable Energy vs Eco Balance relationship. As before, we will then calculate the correlation coefficient.
df_main_2 |>
  #renewable energy vs the created eco balance column
  ggplot(aes(x = `Renewable.Energy....`, y = Eco_Balance)) +
  #create scatterplot with green points, which should be easier to see than the last one
  geom_point(alpha = 0.4, color = "darkgreen") +
  #again a linear regression line
  geom_smooth(method = "lm", color = "black") +
  labs(
    title = "Renewable Energy vs Eco-Balance Index",
    x = "Renewable Energy %",
    y = "Eco-Balance Index"
  )
## `geom_smooth()` using formula = 'y ~ x'
This graph shows renewable energy percentage on the x-axis and our Eco-Balance Index on the y-axis. For this pair of variables there is a visibly positive, upward trend: as renewable energy usage increases, the Eco-Balance Index generally increases along with it. Because Eco_Balance is defined as forest area plus renewable energy, part of this trend is built in by construction, so it is only suggestive of higher forest coverage.
That said, the points are still spread quite far from the line. They are not tightly packed around it, which tells us that renewable energy is not the only factor driving the index; forest area varies considerably on its own. Overall, though, higher renewable usage does tend to go with a higher Eco-Balance Index.
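One caveat is worth checking when reading this plot: Eco_Balance is defined as forest area plus renewable energy, so the x variable is also a component of the y variable. The sketch below uses simulated, independent columns (hypothetical values, not the dataset) to show that such a sum correlates positively with its own component even when the two underlying quantities are unrelated.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated percentages, drawn independently of each other
renewable <- runif(1000, min = 0, max = 50)
forest    <- runif(1000, min = 0, max = 50)

# Same construction as Eco_Balance in the analysis
eco_balance <- forest + renewable

# The underlying columns are essentially uncorrelated...
r_components <- cor(renewable, forest)

# ...yet the sum is clearly correlated with one of its own terms
r_sum <- cor(renewable, eco_balance)

print(c(components = r_components, with_sum = r_sum))
```

A more direct check of the forest claim would be to correlate the renewable energy column with the forest area column itself.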
#same thing: calculate pearson correlation, but for renewable energy and eco balance this time
df_main_2 |>
  summarise(
    correlation =
      cor(`Renewable.Energy....`, Eco_Balance)
  )
For this second pair the correlation coefficient is 0.5867103, which indicates a moderately strong positive linear relationship. Taken at face value, this suggests that countries that use more renewable energy also tend to have higher forest coverage, although some of the correlation is expected simply because renewable energy is one of the two terms in Eco_Balance. The interpretation makes sense conceptually, as countries that invest in renewable energy often implement broader environmental protections, usually in the form of conservation efforts that preserve existing forest areas.
This result can also be used to push back against a hypothetical claim that investing in renewable resources is pointless because building renewable infrastructure consumes green resources and ends up expanding into existing forests and disturbing the environment anyway. The correlation is at least consistent with the view that we can invest heavily in renewable energy without harming the existing environment, although a correlation alone cannot establish cause and effect.
Our two response variables are Adj_Sea_Level and Eco_Balance. We will now construct a confidence interval for each variable and then analyze their meaning.
#confidence interval for sea levels
#assemble the pieces for the CI
df_main_2 |>
  summarise(
    #sample mean of created sea column
    mean = mean(Adj_Sea_Level),
    #standard dev of the sample
    sd = sd(Adj_Sea_Level),
    #number of observations
    n = n(),
    #standard error of the mean
    se = sd/sqrt(n),
    #lower bound of the 95% t-confidence interval
    lower = mean - qt(0.975, n-1) * se,
    #upper bound of the 95% t-confidence interval
    upper = mean + qt(0.975, n-1) * se
  )
We can find the confidence interval by taking the critical t-value, which comes from the qt() function in R, and multiplying it by the standard error to get the margin of error. Subtracting the margin of error from the mean gives the lower value, and adding it to the mean gives the upper value. The mean is displayed in scientific notation but is effectively 0, which is a result of centering the variable at zero.
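The manual steps can be cross-checked against base R's t.test(), which returns the same t-based interval for a mean. The sketch below uses simulated values (an assumption for illustration, not the assignment data).

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated sample standing in for a numeric column
x <- rnorm(200, mean = 5, sd = 2)

n      <- length(x)
se     <- sd(x) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # critical value for a 95% interval
margin <- t_crit * se            # margin of error

manual_ci <- c(lower = mean(x) - margin, upper = mean(x) + margin)

# t.test() stores the same interval in its conf.int element
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Building the interval by hand, as the summarise() call does, keeps each piece visible, while t.test() is a convenient way to confirm the arithmetic.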
The 95% confidence interval for the Adjusted Sea Level Rise ranges from -0.07111969 to 0.07111969. The variable was centered by subtracting the overall sample mean, so its sample mean is zero by construction and the interval is symmetric around zero. The interval is tight, which indicates that the sample average estimates the average deviation precisely. Its narrow width is due to the large number of observations (n = 1000), which reduces sampling uncertainty.
Importantly, the interval contains zero, which means there is no consistent positive or negative deviation across countries in the sea level measurements. Because the variable was centered, however, this is expected by construction, so the interval cannot by itself confirm that a country's sea level is unrelated to its CO2 emissions; the evidence for that comes from the near-zero correlation coefficient found earlier.
#confidence interval for eco balance
#same thing put the pieces together
df_main_2 |>
  summarise(
    #mean of the eco column sample
    mean = mean(Eco_Balance),
    #std dev of the column
    sd = sd(Eco_Balance),
    #number of obs
    n = n(),
    #std error of column mean
    se = sd/sqrt(n),
    #lower and upper bounds of the 95% interval
    lower = mean - qt(0.975, n-1) * se,
    upper = mean + qt(0.975, n-1) * se
  )
The 95% confidence interval for the Eco-Balance Index ranges from 66.53951 to 69.20549. Unlike the previous sea level interval, this interval is not centered at 0 and is entirely positive. This means that, on average, countries maintain a substantial combined level of renewable energy usage and forest coverage.
The standard error is quite small relative to the standard deviation, which shows that even though individual countries vary widely in environmental performance, the average sustainability level across all countries is estimated with very good precision.
We know from earlier that the correlation between renewable energy and the Eco-Balance Index is positive. The confidence interval found here complements that result: it shows that the average index is well above zero, but because it describes an average rather than a relationship, it does not by itself show that forest area tends to increase with renewable energy.