Load the tidyverse and the other packages used below, with startup messages suppressed.
library(tidyverse)
library(pwr)
library(broom)
library(readr)
#this broke again for some reason :sob:
library(zoo)
#for multicollinearity check since the car package is not working
library(performance)
df_main <- read.csv("climate_change_dataset.csv")
df_main |> head()
df_main |> str()
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
#convert the Year column into a proper date column
#use January 1st of each year as a placeholder full date
df_main <- df_main |>
  mutate(date = ymd(paste(Year, "01", "01", sep = "-")))
#debug
names(df_main)
## [1] "Year" "Country"
## [3] "Avg.Temperature...C." "CO2.Emissions..Tons.Capita."
## [5] "Sea.Level.Rise..mm." "Rainfall..mm."
## [7] "Population" "Renewable.Energy...."
## [9] "Extreme.Weather.Events" "Forest.Area...."
## [11] "date"
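The dotted column names arise because `read.csv()` replaces characters that are not valid in R names with dots, which is why backticks are used throughout this analysis. As an optional sketch, the columns could be renamed once up front. The short names below are this sketch's own choice, not part of the dataset, and the later code in this document would need updating to match if the renaming were adopted.

```r
library(dplyr)

# Hypothetical short names (left) mapped to the names printed by names(df_main) (right)
short_names <- c(
  avg_temp_c             = "Avg.Temperature...C.",
  co2_tons_per_capita    = "CO2.Emissions..Tons.Capita.",
  sea_level_rise_mm      = "Sea.Level.Rise..mm.",
  rainfall_mm            = "Rainfall..mm.",
  renewable_energy       = "Renewable.Energy....",
  extreme_weather_events = "Extreme.Weather.Events",
  forest_area            = "Forest.Area...."
)

# any_of() renames only the columns that are actually present
rename_climate_cols <- function(df) {
  rename(df, any_of(short_names))
}

# One-row example using the first Year and temperature shown by str(df_main)
toy <- data.frame(Year = 2006L, `Avg.Temperature...C.` = 8.9, check.names = FALSE)
renamed <- rename_climate_cols(toy)
names(renamed)
```

On the one-row example this yields the names `Year` and `avg_temp_c`.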
We select average temperature in degrees Celsius as the response variable for this analysis. Tracking how average temperature changes over time may support some tentative conclusions about the dataset, or point us toward a more specific style of analysis that furthers our understanding of the data and its implications.
#select only the date and temperature columns and remove rows with missing values
ts_data <- df_main |>
  select(date, `Avg.Temperature...C.`) |>
  drop_na()
We begin by plotting the average temperature over our time variable via a line chart.
#plot temperature over time as a line graph
ts_data |>
  ggplot(aes(x = date, y = `Avg.Temperature...C.`)) +
  #draw the line for temperature changes
  geom_line() +
  #add labels and title
  labs(
    title = "Average Temperature Over Time",
    x = "Year",
    y = "Temperature (°C)"
  )
INSIGHT: The line is very jagged, with values jumping up and down sharply. There is no clear, smooth upward or downward trend, and the data appear quite noisy. This suggests that the average temperature fluctuates substantially rather than following a clear linear pattern.
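One possible contributor to the jaggedness is that the dataset has 1000 rows spread over a limited number of years, so each date carries several observations and `geom_line()` connects all of them. A minimal sketch of one way to check this, assuming a single mean per year is an acceptable summary, is to aggregate before plotting. The helper name `yearly_mean_temp` is hypothetical; the toy rows reuse three values from the `str(df_main)` output.

```r
library(dplyr)

# Hypothetical helper: collapse to one mean temperature per date
yearly_mean_temp <- function(df) {
  df |>
    group_by(date) |>
    summarise(
      mean_temp = mean(`Avg.Temperature...C.`),
      n_obs = n(),
      .groups = "drop"
    ) |>
    arrange(date)
}

# Toy rows: two 2006 observations (8.9, 30.7) and one 2019 observation (31.0)
toy <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-01", "2019-01-01")),
  `Avg.Temperature...C.` = c(8.9, 30.7, 31.0),
  check.names = FALSE
)
yearly_toy <- yearly_mean_temp(toy)
yearly_toy
```

Here the two 2006 rows collapse to a single mean of 19.8. On the real data, `yearly_mean_temp(ts_data)` could be piped into the same `ggplot()` call with `y = mean_temp`.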
Next we will conduct a linear regression analysis to detect any upwards or downwards trends over time.
#fit a linear regression model to detect trend over time
model <- ts_data |>
  #convert date into numeric format (days since 1970-01-01) needed for regression
  mutate(time_index = as.numeric(date)) |>
  #fit temperature as a function of time
  lm(`Avg.Temperature...C.` ~ time_index, data = _)
#output summary of regression results
summary(model)
##
## Call:
## lm(formula = Avg.Temperature...C. ~ time_index, data = mutate(ts_data,
## time_index = as.numeric(date)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9977 -7.6785 0.2654 7.3774 15.2033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.885e+01 1.590e+00 11.850 <2e-16 ***
## time_index 6.858e-05 1.036e-04 0.662 0.508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.545 on 998 degrees of freedom
## Multiple R-squared: 0.0004391, Adjusted R-squared: -0.0005624
## F-statistic: 0.4385 on 1 and 998 DF, p-value: 0.508
INSIGHT: The regression output reports a p-value of 0.508 and an \(R^2\) of 0.0004391. The null hypothesis implicit in this test is that the slope on time is zero. Because the p-value (0.508) is greater than \(\alpha = 0.05\), we fail to reject it: there is no statistically significant evidence of a linear relationship between time and temperature. The extremely small \(R^2\) points the same way; only about 0.044% of the variance in temperature is explained by time.
Currently our model pools all countries together. If different countries follow different trends, pooling can let those trends cancel out, so that from our perspective no trend appears to exist.
The pooled trend is extremely weak and not statistically significant (as shown by the \(R^2\) and p-value above).
All of this suggests that the data may contain multiple underlying patterns that are being obscured. Because the dataset includes multiple countries, different regions may well exhibit different trends, so subsetting the data by country would be needed to reveal them.
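The figures quoted from `summary(model)` can also be pulled out programmatically with the broom package loaded earlier. The sketch below fits a stand-in model on R's built-in `cars` dataset purely so that it runs on its own; in this analysis `glance(model)` would be used instead. In a regression with a single predictor, the F-test p-value and the slope's t-test p-value coincide, which is why 0.508 appears twice in the summary.

```r
library(broom)

# Stand-in model on a built-in dataset; in this analysis, use `model` instead
toy_fit <- lm(dist ~ speed, data = cars)

# One-row summary of model-level statistics
fit_stats <- glance(toy_fit)
fit_stats[, c("r.squared", "adj.r.squared", "p.value")]

# Share of variance explained, as a percentage
pct_explained <- 100 * fit_stats$r.squared

# For the temperature model: 100 * 0.0004391 = 0.04391, i.e. about 0.044%
temp_pct_explained <- 100 * 0.0004391
```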
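The per-country idea can be sketched as one regression per `Country`, collecting the slope from each fit. This is only an outline under assumptions: the helper name `slope_by_country` is hypothetical, the toy values are made up for illustration (they are not from the dataset), and the input is assumed to still contain the `Country` column, which `ts_data` drops.

```r
library(dplyr)
library(broom)

# Hypothetical helper: fit temperature ~ time separately for each country
slope_by_country <- function(df) {
  df |>
    mutate(time_index = as.numeric(date)) |>
    group_by(Country) |>
    group_modify(\(d, key) tidy(lm(`Avg.Temperature...C.` ~ time_index, data = d))) |>
    ungroup() |>
    # keep only the slope row of each fit
    filter(term == "time_index")
}

# Made-up illustrative values: country "A" warms, country "B" cools
toy <- data.frame(
  Country = rep(c("A", "B"), each = 4),
  date = rep(as.Date(c("2000-01-01", "2001-01-01", "2002-01-01", "2003-01-01")), 2),
  `Avg.Temperature...C.` = c(10, 11, 12.5, 13, 20, 19, 18.5, 17),
  check.names = FALSE
)
country_slopes <- slope_by_country(toy)
country_slopes[, c("Country", "estimate", "p.value")]
```

In the toy data the two slopes have opposite signs, which is the kind of cancellation a pooled model could hide. With many countries, several p-values are examined at once, so some correction for multiple comparisons would probably be advisable.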
Here we add the linear regression line to the original temperature-over-time plot to see how the fitted trend compares with the raw series.
#plot temperature data along with regression trend line
ts_data |>
  ggplot(aes(x = date, y = `Avg.Temperature...C.`)) +
  #original data line
  geom_line() +
  #add linear regression line fitted on the plotted x and y
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
INSIGHT: The regression line has a very small positive slope, so the fitted temperature rises slightly as the years pass. However, the regression above showed that this slope is not statistically distinguishable from zero (p = 0.508), so it should not be read as evidence of warming in this dataset. The underlying series is still very erratic, with no clear pattern.
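To put the slope in more familiar units: `as.numeric()` on a `Date` gives days since 1970-01-01, so the `time_index` coefficient is in °C per day. The arithmetic below converts the estimate and standard error reported by `summary(model)` to °C per year and builds an approximate 95% interval. The 365.25 days-per-year factor is an assumption of this sketch.

```r
# Values copied from the summary(model) output above
slope_per_day <- 6.858e-05
se_per_day    <- 1.036e-04
resid_df      <- 998

# Assumed conversion factor from days to years
days_per_year <- 365.25

# Slope in degrees Celsius per year (roughly 0.025)
slope_per_year <- slope_per_day * days_per_year

# Approximate 95% confidence interval, also per year
t_crit <- qt(0.975, df = resid_df)
ci_per_year <- (slope_per_day + c(-1, 1) * t_crit * se_per_day) * days_per_year

slope_per_year
ci_per_year
```

The interval runs from a small negative to a small positive value, so it includes zero, consistent with the non-significant p-value.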
We will now apply moving-average smoothing to look for seasonal or cyclical patterns in the data.
#apply moving average smoothing to highlight trends
ts_data |>
  #sort by date so the rolling window covers neighboring time points
  arrange(date) |>
  #compute centered rolling mean with window size of 5
  mutate(smoothed = rollmean(`Avg.Temperature...C.`, k = 5, fill = NA)) |>
  ggplot(aes(x = date)) +
  #plot original data with the lighter line
  geom_line(aes(y = `Avg.Temperature...C.`), alpha = 0.5) +
  #plot smoothed data
  geom_line(aes(y = smoothed))
INSIGHT: The smoothed line is still quite noisy and shows no clear repeating wave pattern or periodic cycle. This suggests that no strong seasonality is present.
The smoothing does, however, reduce some of the short-term fluctuations, making the overall pattern easier to see than before. Even so, there is still no clear evidence of cycles or seasons.
As an extension of this analysis, we can use the autocorrelation function (ACF) to look for repeating patterns or seasonality:
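To make the smoothing step concrete, here is what `rollmean()` with `k = 5` and `fill = NA` does on a short vector. The input is the first seven temperatures listed in the `str(df_main)` output, used here only as a worked example. The window is centered by default, so the first two and last two positions have no complete window and are filled with `NA`.

```r
library(zoo)

# First seven temperature values from str(df_main)
x <- c(8.9, 31.0, 33.9, 5.9, 26.9, 32.3, 30.7)

# Centered 5-point moving average; incomplete windows become NA
smoothed <- rollmean(x, k = 5, fill = NA)
smoothed
# NA NA 21.32 26.00 25.94 NA NA

# The third value is the mean of positions 1 to 5
mean(x[1:5])
# 21.32
```

A wider window would smooth more aggressively but leave more `NA` values at the ends.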
#autocorrelation function used here to detect repeating patterns or seasonality
acf(ts_data$`Avg.Temperature...C.`)
INSIGHT: The two blue dashed lines above and below 0.0 are the approximate 95% significance bounds. Apart from the bar at lag 0, which is always 1, the bars along the x-axis reach toward the upper and lower bounds but never exceed them. An autocorrelation is significant only if it exceeds the bounds in either direction, which is not observed here. This indicates that there is no strong correlation between temperature values at different lags. There are also no clear spikes or drops at regular intervals, which further supports the absence of a seasonal pattern and agrees with the conclusion from the smoothing section.
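The visual check against the dashed lines can also be done numerically. The sketch below wraps `acf()` in a hypothetical helper, `acf_with_bounds`, that returns each lag's autocorrelation together with a flag for whether it falls outside the default 95% bounds, computed as `qnorm(0.975) / sqrt(n)`. It is demonstrated on simulated white noise so that it runs on its own. Note that `acf()` treats the order of the vector as the time order, and the rows shown by `str(df_main)` are not in chronological order, so sorting or aggregating by date first may be worth considering.

```r
# Hypothetical helper: autocorrelations plus an out-of-bounds flag
acf_with_bounds <- function(x, lag.max = NULL) {
  res <- acf(x, lag.max = lag.max, plot = FALSE)
  # same bound as the default 95% dashed lines in the acf() plot
  bound <- qnorm(0.975) / sqrt(res$n.used)
  data.frame(
    lag = as.vector(res$lag),
    acf = as.vector(res$acf),
    outside = abs(as.vector(res$acf)) > bound
  )
}

# Simulated white noise, only so the block runs on its own
set.seed(1)
noise_acf <- acf_with_bounds(rnorm(200))
head(noise_acf, 3)

# On the real data one might aggregate to one value per date first, e.g.
# yearly <- aggregate(`Avg.Temperature...C.` ~ date, data = ts_data, FUN = mean)
# acf_with_bounds(yearly$`Avg.Temperature...C.`)
```

The lag 0 row always has an autocorrelation of 1 and is therefore always flagged; only the rows for lag 1 and beyond are informative.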