Climate change is a massive area of study, with implications for ocean temperatures, carbon content, and weather patterns. Here I will look at just one slice of it: trends in global land temperatures, to see how they may have changed since the Industrial Revolution. First, I'll load a data set from Berkeley Earth (posted on Kaggle).
file_path <- "/Users/micah.isser/Downloads/GlobalTemperatures.csv"
global_temperatures <- read.csv(file_path)
head(global_temperatures)
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## 1 1750-01-01 3.034 3.574
## 2 1750-02-01 3.083 3.702
## 3 1750-03-01 5.626 3.076
## 4 1750-04-01 8.490 2.451
## 5 1750-05-01 11.573 2.072
## 6 1750-06-01 12.937 1.724
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## LandAndOceanAverageTemperatureUncertainty
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
summary(global_temperatures)
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## Length:3192 Min. :-2.080 Min. :0.0340
## Class :character 1st Qu.: 4.312 1st Qu.:0.1867
## Mode :character Median : 8.611 Median :0.3920
## Mean : 8.375 Mean :0.9385
## 3rd Qu.:12.548 3rd Qu.:1.4192
## Max. :19.021 Max. :7.8800
## NA's :12 NA's :12
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## Min. : 5.90 Min. :0.0440 Min. :-5.407
## 1st Qu.:10.21 1st Qu.:0.1420 1st Qu.:-1.335
## Median :14.76 Median :0.2520 Median : 2.950
## Mean :14.35 Mean :0.4798 Mean : 2.744
## 3rd Qu.:18.45 3rd Qu.:0.5390 3rd Qu.: 6.779
## Max. :21.32 Max. :4.3730 Max. : 9.715
## NA's :1200 NA's :1200 NA's :1200
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## Min. :0.0450 Min. :12.47
## 1st Qu.:0.1550 1st Qu.:14.05
## Median :0.2790 Median :15.25
## Mean :0.4318 Mean :15.21
## 3rd Qu.:0.4582 3rd Qu.:16.40
## Max. :3.4980 Max. :17.61
## NA's :1200 NA's :1200
## LandAndOceanAverageTemperatureUncertainty
## Min. :0.0420
## 1st Qu.:0.0630
## Median :0.1220
## Mean :0.1285
## 3rd Qu.:0.1510
## Max. :0.4570
## NA's :1200
Now let's identify where there are NA values, whether there are duplicate rows, and whether there are extreme outliers. To take these steps, let's load the tidyverse and janitor libraries.
options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages("janitor")
##
## The downloaded binary packages are in
## /var/folders/zr/bh_xqrjs3w19n2l94454nczr0000gp/T//Rtmpq5TXXP/downloaded_packages
library(tidyverse)
library(janitor)
sum(is.na(global_temperatures))
## [1] 7224
sapply(global_temperatures, function(x) sum(is.na(x)))
## dt
## 0
## LandAverageTemperature
## 12
## LandAverageTemperatureUncertainty
## 12
## LandMaxTemperature
## 1200
## LandMaxTemperatureUncertainty
## 1200
## LandMinTemperature
## 1200
## LandMinTemperatureUncertainty
## 1200
## LandAndOceanAverageTemperature
## 1200
## LandAndOceanAverageTemperatureUncertainty
## 1200
So of the 3,192 entries, there are 1,200 for which several columns are missing. I'm also curious about the 12 entries for which LandAverageTemperature is NA. Are these rows entirely blank?
na_rows <- which(is.na(global_temperatures$LandAverageTemperature))
print(na_rows)
## [1] 11 17 19 22 23 24 26 29 30 31 32 33
global_temperatures[na_rows, ]
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## 11 1750-11-01 NA NA
## 17 1751-05-01 NA NA
## 19 1751-07-01 NA NA
## 22 1751-10-01 NA NA
## 23 1751-11-01 NA NA
## 24 1751-12-01 NA NA
## 26 1752-02-01 NA NA
## 29 1752-05-01 NA NA
## 30 1752-06-01 NA NA
## 31 1752-07-01 NA NA
## 32 1752-08-01 NA NA
## 33 1752-09-01 NA NA
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 11 NA NA NA
## 17 NA NA NA
## 19 NA NA NA
## 22 NA NA NA
## 23 NA NA NA
## 24 NA NA NA
## 26 NA NA NA
## 29 NA NA NA
## 30 NA NA NA
## 31 NA NA NA
## 32 NA NA NA
## 33 NA NA NA
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 11 NA NA
## 17 NA NA
## 19 NA NA
## 22 NA NA
## 23 NA NA
## 24 NA NA
## 26 NA NA
## 29 NA NA
## 30 NA NA
## 31 NA NA
## 32 NA NA
## 33 NA NA
## LandAndOceanAverageTemperatureUncertainty
## 11 NA
## 17 NA
## 19 NA
## 22 NA
## 23 NA
## 24 NA
## 26 NA
## 29 NA
## 30 NA
## 31 NA
## 32 NA
## 33 NA
Aside from the date, they are entirely blank, so I'm going to remove those rows.
global_temperatures_clean <- global_temperatures[-na_rows, ]
sum(is.na(global_temperatures_clean))
## [1] 7128
sapply(global_temperatures_clean, function(x) sum(is.na(x)))
## dt
## 0
## LandAverageTemperature
## 0
## LandAverageTemperatureUncertainty
## 0
## LandMaxTemperature
## 1188
## LandMaxTemperatureUncertainty
## 1188
## LandMinTemperature
## 1188
## LandMinTemperatureUncertainty
## 1188
## LandAndOceanAverageTemperature
## 1188
## LandAndOceanAverageTemperatureUncertainty
## 1188
So while there are still 1,188 rows that are missing values in some columns, none are blank across the board. It is worth a quick look at where those incomplete rows fall in time (see the sketch below), since the summary suggested the max/min and ocean columns only cover part of the record.
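A minimal sketch of that check - my own aside, not part of the original cleaning steps: find the first date with a non-missing LandAndOceanAverageTemperature. Since the original 1,200 missing months is exactly 100 years of data, I would expect the complete records to begin around 1850.
# dt is still a character column in "YYYY-MM-DD" format, so the
# lexicographic minimum is also the chronological minimum.
complete_dates <- global_temperatures_clean$dt[
  !is.na(global_temperatures_clean$LandAndOceanAverageTemperature)
]
min(complete_dates)
Now let's look for duplicates.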
duplicate_rows <- duplicated(global_temperatures)
global_temperatures[duplicate_rows, ]
## [1] dt
## [2] LandAverageTemperature
## [3] LandAverageTemperatureUncertainty
## [4] LandMaxTemperature
## [5] LandMaxTemperatureUncertainty
## [6] LandMinTemperature
## [7] LandMinTemperatureUncertainty
## [8] LandAndOceanAverageTemperature
## [9] LandAndOceanAverageTemperatureUncertainty
## <0 rows> (or 0-length row.names)
Great! There are no duplicated rows. Now let's look for outliers in the data - values that deviate so far from the bulk of the data that they would skew the analysis and are likely errors. First let's compute the interquartile range (IQR) of the land average temperature, then look for values that are more than 1.5 IQRs below the first quartile or 1.5 IQRs above the third quartile.
Q1 <- quantile(global_temperatures_clean$LandAverageTemperature, 0.25, na.rm = TRUE)
Q3 <- quantile(global_temperatures_clean$LandAverageTemperature, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- global_temperatures_clean$LandAverageTemperature < lower_bound | global_temperatures_clean$LandAverageTemperature > upper_bound
print(global_temperatures_clean[outliers, ])
## [1] dt
## [2] LandAverageTemperature
## [3] LandAverageTemperatureUncertainty
## [4] LandMaxTemperature
## [5] LandMaxTemperatureUncertainty
## [6] LandMinTemperature
## [7] LandMinTemperatureUncertainty
## [8] LandAndOceanAverageTemperature
## [9] LandAndOceanAverageTemperatureUncertainty
## <0 rows> (or 0-length row.names)
It looks like there are no significant outliers, but let’s graph to make sure:
ggplot(global_temperatures_clean, aes(x = dt, y = LandAverageTemperature)) +
geom_point() +
theme_minimal() +
ggtitle("Scatter Plot of Land Average Temperature Over Time")
While there definitely is more variation in the temperature in earlier years, there do not seem to be any significant outliers.
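One plausible explanation for the wider spread in the early years is measurement uncertainty: each series in this dataset comes with an uncertainty column, and the land-average uncertainty in the 1750 rows shown earlier (around 3.5) dwarfs the overall median of 0.39. Here is a quick sketch of that column over time - my own aside, not part of the original analysis:
# Plot the reported uncertainty of the land average over time.
# dt has not been converted to Date yet, so convert it inline for the x-axis.
ggplot(global_temperatures_clean,
       aes(x = as.Date(dt), y = LandAverageTemperatureUncertainty)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Reported Uncertainty in Land Average Temperature",
       x = "Date", y = "Uncertainty (degrees Celsius)")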
It also seems that there is an upward trend in the data, and that the data falls into distinct 'bands' - presumably the twelve calendar months, since the record is monthly and follows a strong seasonal cycle (the quick check below bears this out). Let's try to capture how the data changes with a line of best fit, using a linear regression. For this, I'll use the ggplot2 library in R.
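Before fitting anything, here is a quick check (again my own aside) that the bands really are months: averaging LandAverageTemperature by calendar month should yield twelve distinct means that line up with the bands in the scatter plot.
# Average land temperature by calendar month; dt is still a character
# column here, so parse it with as.Date() before extracting the month.
global_temperatures_clean %>%
  mutate(Month = format(as.Date(dt), "%m")) %>%
  group_by(Month) %>%
  summarise(MeanTemp = mean(LandAverageTemperature))
Assuming that checks out, on to the regression.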
global_temperatures_clean$dt <- as.Date(global_temperatures_clean$dt)
global_temperatures_clean$Year <- as.numeric(format(global_temperatures_clean$dt, "%Y"))
model <- lm(LandAverageTemperature ~ Year, data = global_temperatures_clean)
library(ggplot2)
ggplot(global_temperatures_clean, aes(x = Year, y = LandAverageTemperature)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
theme_minimal() +
labs(title = "Land Average Temperature Over Time",
x = "Year", y = "Land Average Temperature")
## `geom_smooth()` using formula = 'y ~ x'
How steep is the slope of this line? In other words, on average, how quickly has the temperature been rising since 1750?
slope <- coef(model)["Year"]
print(slope)
## Year
## 0.004663524
Interesting - so according to this model, for each one-year increase, the average land temperature is predicted to increase by about 0.0047 degrees Celsius, or roughly 1.2 degrees across the full 1750-2015 record. Yet I would guess that the rate at which the Earth is warming is speeding up as time goes on. I wonder whether the parabolic curve of a quadratic function would more effectively capture the rate of temperature change.
model_poly <- lm(LandAverageTemperature ~ poly(Year, 2, raw=TRUE), data = global_temperatures_clean)
global_temperatures_clean$predicted_temp <- predict(model_poly, global_temperatures_clean)
ggplot(global_temperatures_clean, aes(x = Year, y = LandAverageTemperature)) +
geom_point() +
geom_line(aes(y = predicted_temp), color = "blue") +
theme_minimal() +
labs(title = "Land Average Temperature Over Time with Polynomial Fit",
x = "Year", y = "Land Average Temperature")
This looks slightly better than the linear fit. But where a single number - the slope - can express the trajectory of a line, we'll need three numbers to express the quadratic function. If we write the quadratic as A + Bx + Cx^2, the coefficients are A, B, and C.
coeffs <- coef(model_poly)
print(coeffs)
## (Intercept) poly(Year, 2, raw = TRUE)1
## 1.551258e+02 -1.608079e-01
## poly(Year, 2, raw = TRUE)2
## 4.393875e-05
So according to this model, the equation to calculate land temperature would be LandAverageTemperature = 155.1258 - (0.1608079 × Year) + (0.00004393875 × Year^2)
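As a sanity check on those coefficients (a small sketch, not in the original analysis), we can ask the fitted model for predictions at a few specific years; predict() applies the intercept and both polynomial terms for us.
# Evaluate the quadratic fit at a handful of years.
check_years <- data.frame(Year = c(1750, 1900, 2015))
predict(model_poly, newdata = check_years)
One detail worth noting: the vertex of the parabola sits at Year = -B / (2C), which works out to roughly 1830 - the low point of the fitted curve.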
I am curious about how the slope of this curve changes - how is global warming accelerating over time? Let's take the derivative of the quadratic (the derivative of A + Bx + Cx^2 is B + 2Cx) to find the instantaneous slope in different years.
beta_0 <- coeffs[1]
beta_1 <- coeffs[2]
beta_2 <- coeffs[3]
slope_function <- function(year) {
  # Derivative of the fitted quadratic: B + 2 * C * year
  beta_1 + 2 * beta_2 * year
}
years <- seq(from = min(global_temperatures_clean$Year), to = max(global_temperatures_clean$Year), by = 1)
slopes <- sapply(years, slope_function)
interval_years <- seq(from = min(global_temperatures_clean$Year), to = max(global_temperatures_clean$Year), by = 50)
interval_slopes <- sapply(interval_years, slope_function)
plot(years, slopes, type = 'l', col = 'blue', xlab = 'Year', ylab = 'Rate of Change of Temperature',
main = 'Rate of Change of Land Average Temperature Over Time')
labels <- paste(interval_years, "\n", round(interval_slopes, 5), sep="")
text(interval_years, interval_slopes, labels = labels, pos = 3, cex = 0.7, offset = 0.5)
This graph does a good job of showing how climate change is speeding up. According to this model, the year-over-year change in average land temperature was about -0.007 degrees Celsius in 1750 - slightly negative, though fairly close to zero. By 2000, the global land temperature was rising at roughly 0.015 degrees Celsius per year.
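To tie those figures back to the fitted coefficients, slope_function can be evaluated directly at those two years (a quick check; it should reproduce roughly -0.007 and 0.015).
# slope_function is vectorized, so both years can be checked at once.
slope_function(c(1750, 2000))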