Load the tidyverse dataset, and disable the output status.
library(tidyverse)
library(pwr)
#for multicollinearity check sense the car package is not working
library(performance)
df_main <- read.csv("climate_change_dataset.csv")
df_main |> head()
df_main |> str()
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
#debug
names(df_main)
## [1] "Year" "Country"
## [3] "Avg.Temperature...C." "CO2.Emissions..Tons.Capita."
## [5] "Sea.Level.Rise..mm." "Rainfall..mm."
## [7] "Population" "Renewable.Energy...."
## [9] "Extreme.Weather.Events" "Forest.Area...."
We are instructed to refer to the model we made for the last data dive and then add onto that. So first let us reconstruct the model from the last data dive below.
#model from last week
base_model <- df_main |>
lm(`CO2.Emissions..Tons.Capita.` ~ `Renewable.Energy....`, data = _)
#output
summary(base_model)
##
## Call:
## lm(formula = CO2.Emissions..Tons.Capita. ~ Renewable.Energy....,
## data = df_main)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1280 -4.8633 0.2287 4.9185 9.7632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.70180 0.41400 25.850 <2e-16 ***
## Renewable.Energy.... -0.01011 0.01370 -0.738 0.461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.616 on 998 degrees of freedom
## Multiple R-squared: 0.0005455, Adjusted R-squared: -0.000456
## F-statistic: 0.5447 on 1 and 998 DF, p-value: 0.4607
In the last regression model above, we used renewable energy percentage as a predictor of CO2 emission rate. We found that there was a slight linear correlation between the two as we can see from the following graph:
#scatter plot to examine relationship between renewable resoureces and co2 emissions
df_main |>
ggplot(aes(x = Renewable.Energy....,
y = `CO2.Emissions..Tons.Capita.`)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "CO2 Emissions vs Renewable Energy Sources",
x = "Renewable Energy Sources",
y = "CO2 Emissions (Tons/Capita)"
)
## `geom_smooth()` using formula = 'y ~ x'
To improve the earlier model (and in accordance with this week’s data dive) we will introduce 3 more variables: forest area percentage, sea level rise in mm, and rainfall amount in mm. While technically, these variables shouldn’t directly impact the amount of CO2 released, it would be interesting to see if there are any knock on effects that could encourage countries to release less CO2.
Intuitively we know that a larger forest area percentage would likely increase the amount of CO2 that is absorbed an converted to oxygen, so it might be interesting to see what correlation might be present here. Sea level rise indicates that ice is melting. Ice generally reflects a lot of the heat from the sun back into space and as it melts the sea level rises and the warmer water absorbs more CO2 and heat and accelerates the melting process. Warmer water also doesn’t absorb CO2 as efficiently as cold water. From this we can see that there is a complex relationship between sea levels and amount of CO2 absorbed. We will now test to see if there is any correlation between these seemingly unrelated variables (at least in terms of the amount of CO2 released) and the CO2 release levels.
#run the regression model but this time we will add 3 more variables to improve the modle
improved_model <- df_main |>
lm(`CO2.Emissions..Tons.Capita.` ~
`Renewable.Energy....` +
`Forest.Area....` +
`Sea.Level.Rise..mm.` +
Rainfall..mm.,
data = _)
#output model data
summary(improved_model)
##
## Call:
## lm(formula = CO2.Emissions..Tons.Capita. ~ Renewable.Energy.... +
## Forest.Area.... + Sea.Level.Rise..mm. + Rainfall..mm., data = df_main)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3538 -4.8915 0.1864 4.9212 10.0051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6822007 0.8683426 12.302 <2e-16 ***
## Renewable.Energy.... -0.0097459 0.0137048 -0.711 0.477
## Forest.Area.... 0.0095411 0.0102227 0.933 0.351
## Sea.Level.Rise..mm. -0.1870571 0.1551714 -1.205 0.228
## Rainfall..mm. 0.0001067 0.0002508 0.425 0.671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.617 on 995 degrees of freedom
## Multiple R-squared: 0.003115, Adjusted R-squared: -0.0008924
## F-statistic: 0.7773 on 4 and 995 DF, p-value: 0.54
Insight:
As we can see from the output above, each of the 4 variables has its own intercepts, error, t, and p values. We can see across the board that the p values are greater than 0.05 for each of the 4 variables and thus we would fail to reject the null hypothesis if we had one. In the section below we further analyze the implications of this model via 5 diagnostic plots.
#run the check
improved_model |>
check_collinearity()
Insight:
We can see from the output, that the Variance Inflation Factor (VIF) value for each of the 4 variables that we use is 1.00. The rule is that if VIF is less than 5, we can say that there is low correlation and that we can move forward with the model. In our case we are good with all 4 variables.
We can use the check_model() function and pipe our improved model into it such that the output consists of 5 diagnostic plots that we can use to evaluate the quality of our new model. These diagnostic plots are included in R by default and allow us to quickly and visually search for anomalies in our model without having to tediously create each plot individually.
#output the diaggnostic plots for the new improved lin model
improved_model |>
check_model()
Insight:
For the first plot we can see that the model-predicted distribution closely follows the observed data distribution, with only minor deviations in the tails. This indicates strongly that the the model is capturing the overall structure of the data pretty well. There are slight differences at extreme values, indicating a small amount of model variation, but the issue is not severe. Generally, we can have high confidence that our model provides a good general fit to the data.
For the second plot we check linearity. The residuals appear mostly randomly scattered around the horizontal reference line, with only a slight curve. That indicates that the linearity assumption is largely satisfied. However, there may be a minor nonlinear relationship not fully captured by the model that is hidden. The deviation is small enough, so the severity is low, and we can have moderateish to high confidence that the linearity assumption holds.
In the third plot, we test homogenitey of variance. We can see from the spread of residuals appears relatively constant across fitted values, with no strong funnel shape present. So we an assume that the variance of the errors is approximately constant. There is some natural scatter, but no clear pattern indicating heteroscedasticity. Any anomalies present are undetectabley small, and thus we can say that there is relatively high confidence that the homoscedasticity assumption is satisfied.
We can see for this plot that most of data points fall within the Cook’s distance boundaries, indicating that there are no highly influential observations significantly affecting the model. A few points lie slightly further from the center, but they remain within acceptable limits. This suggests that the model is stable and not overly sensitive to individual data points. Thus there is a descent confidence that influential outliers are not a concern.
As we saw form our earlier multicollinearity check all of our predictor variables have VIF values below the commonly accepted threshold of 5, indicating low multicollinearity. This means the predictors are not strongly correlated with each other, and coefficient estimates are reliable, thus there is no indication of instability due to collinearity, and we can have high confidence that this assumption is satisfied.
The final plot shows some deviation from the reference line, particularly in the tails areas, indicating that the residuals are not perfectly normally distributed. The central portion of the data follows the line reasonably well tho suggesting a mild violation of normality, but not severe enough to invalidate the model. Due to this we can say that we have low to moderate confidence that the normality assumption is acceptable for this analysis.