Load the tidyverse dataset, and disable the output status.
library(tidyverse)
library(pwr)
library(broom)
library(readr)
#for multicollinearity check sense the car package is not working
library(performance)
df_main <- read.csv("climate_change_dataset.csv")
df_main |> head()
df_main |> str()
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
#debug
names(df_main)
## [1] "Year" "Country"
## [3] "Avg.Temperature...C." "CO2.Emissions..Tons.Capita."
## [5] "Sea.Level.Rise..mm." "Rainfall..mm."
## [7] "Population" "Renewable.Energy...."
## [9] "Extreme.Weather.Events" "Forest.Area...."
We will start by first visualize the relationship between sea level rise and average temperature to determine if a linear relationship exists and can be visualized:
#scatterplot with regression line
df_main |>
ggplot(aes(x = Avg.Temperature...C., y = Sea.Level.Rise..mm.)) +
geom_point() + #plot data points
geom_smooth(method = "lm") + #addd linear trend line
labs(
title = "Sea Level Rise vs Average Temperature",
x = "Average Temperature (°C)",
y = "Sea Level Rise (mm)"
)
## `geom_smooth()` using formula = 'y ~ x'
We can see from the above scatterplot that a relationship does exist between average temperature and sea level rise. This means we may be able to conduct a fruitful analysis via a linear model.
#build linear model
model <- df_main |>
lm(Sea.Level.Rise..mm. ~ Avg.Temperature...C., data = _)
#view model summary
summary(model)
##
## Call:
## lm(formula = Sea.Level.Rise..mm. ~ Avg.Temperature...C., data = df_main)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0802 -0.9874 0.0251 0.9830 2.1082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.852208 0.091733 31.093 <2e-16 ***
## Avg.Temperature...C. 0.007916 0.004239 1.867 0.0622 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 998 degrees of freedom
## Multiple R-squared: 0.003482, Adjusted R-squared: 0.002483
## F-statistic: 3.487 on 1 and 998 DF, p-value: 0.06215
Insight: Ok so with this linear model we attempt to relate the rise in sea level as a function of the change in average temperature. We have an intercept of 2.852208 (in mm) which means that the average sea level rise when the average global temperature is 0 is 2.852208mm. This obviously makes no sense as the average temperature of the earth is unlikely to ever drop that low. Due to this, the intercept gives us not real data.
The temperature coefficient is 0.007916. This means for every 1 degrees celcius increase in average temperature, the sea level is expected to rise by about 0.007916mm. This indicates a clear and positive relationship between the temperature and the rise in sea level.
The p-value is 0.0622 which is > then 0.05, we know that the relationship does not have statistical significance however. This marks a weak relationship between average temperature and sea level rise. However, the value is close to 0.05 and if we were to set our alpha value a bit higher, perhaps there would be a slightly stronger correlation.
The R-squared value is 0.003482 which indicaates that only about 0.30% of the variability in the sea levels is linked to the increase in average temperature. This is a very low amount and thus we may need to use more then just average temperature to get a stronger predictor for sea level rise (this makes sense intuitively).
We will now proceed with model diagnostics. We can assess the integrity of the model and that the assumptions of the linear regression are satisfied. We can analyze linearity, normality of residuals, and constant variance.
#diagnotics plots
par(mfrow = c(2,2))
plot(model)
Insight: In the “Residuals vs Fitted” graaph. We can see that the residuals appear randomly scattered around the zero line center. This suggests that the linearity assumption is not violated. That being said, the fitted line values are concentrated in the a very narrow range, indicating that the model produces a almost entirely constant prediction curve. This indicates that the predictor itself (avg temp) has little influence on the actual target vairable (sea lvl rise).
The Q-Q plot shows that most residuals follow the reference line, indicating approximate normality. However, there are slight deviations at the tails which indicates minor departures from normality. Generally speaking though, the normality assumption is reasonably satisfied.
Scale location can be seen in the bottom left. It shows a fairly constant spread of residuals across fitted values. This suggests that the assumption of constant variance (aka homoscedasticity) is satisfied.
Finally on the bottom right we can see the last Residuals vs Leverage plot. This plot does not show any extreme points or influential observations. This indicates that no single data point is disproportionately affecting the model.
#manual residual plot
df_main |>
mutate(residuals = resid(model)) |>
ggplot(aes(x = Avg.Temperature...C., y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residual Plot",
x = "Average Temperature",
y = "Residuals"
)
Insight: From the residuals chart above, we can see that the residuals appear to be centered around zero, with a median close to 0.0251 (from the earlier model output). We can also gather from the earlier model output that the approximate spread of the residuals is -2.10 to 2.11 which is an indicator of moderate variability in prediction errors. We also know that there is no immediate indication of severe bias.
We will look at the coefficient associated with average temperature.
#extract coefficients
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.852207648 0.091732752 31.092577 5.778265e-149
## Avg.Temperature...C. 0.007915886 0.004239235 1.867291 6.215392e-02
Insight: Here we take a closer look at the average temperature coefficient. The intercept is 2.852207648. As we measured before, this is the predicted rise in sea level when the average temperature is 0. This a temperature of 0 holds no meaning on Earth, this intercept is more of a baseline rather then a datapoint we can learn something from. We also see from before the coefficient for avg temp is 0.0079 which measures the expected sea level rise for every 1 degrees of celsius temperature change.
We are given that the standard error is 0.00425 and this measures the uncertainity of the coefficient. This value suggests that there is some variability in the link between avg temp and sea level rise.
Observe that there is also a p-value and t-value. The t-value is 1.867 which is very small. This means that the effect of temperature is not very far off from zero. This aligns with our previous finds. The p-value is also given and 0.06215 > 0.05 meaning that if we had a null hypothesis, we would FAIL to reject it (we don’t in this case). This p-value indicates no statistically significant link between rising sea levels and average temperature.