Introduction

Climate change is a massive area of study, with implications in ocean temperatures, carbon content, and weather patterns. I will try to just look at some trends in global temperatures, to see how weather may have changed since the Industrial Revolution. First, I’ll load a data set from Berkley Earth (posted on Kaggle).

file_path <- "/Users/micah.isser/Downloads/GlobalTemperatures.csv"
global_temperatures <- read.csv(file_path)
head(global_temperatures)
##           dt LandAverageTemperature LandAverageTemperatureUncertainty
## 1 1750-01-01                  3.034                             3.574
## 2 1750-02-01                  3.083                             3.702
## 3 1750-03-01                  5.626                             3.076
## 4 1750-04-01                  8.490                             2.451
## 5 1750-05-01                 11.573                             2.072
## 6 1750-06-01                 12.937                             1.724
##   LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 1                 NA                            NA                 NA
## 2                 NA                            NA                 NA
## 3                 NA                            NA                 NA
## 4                 NA                            NA                 NA
## 5                 NA                            NA                 NA
## 6                 NA                            NA                 NA
##   LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 1                            NA                             NA
## 2                            NA                             NA
## 3                            NA                             NA
## 4                            NA                             NA
## 5                            NA                             NA
## 6                            NA                             NA
##   LandAndOceanAverageTemperatureUncertainty
## 1                                        NA
## 2                                        NA
## 3                                        NA
## 4                                        NA
## 5                                        NA
## 6                                        NA
summary(global_temperatures)
##       dt            LandAverageTemperature LandAverageTemperatureUncertainty
##  Length:3192        Min.   :-2.080         Min.   :0.0340                   
##  Class :character   1st Qu.: 4.312         1st Qu.:0.1867                   
##  Mode  :character   Median : 8.611         Median :0.3920                   
##                     Mean   : 8.375         Mean   :0.9385                   
##                     3rd Qu.:12.548         3rd Qu.:1.4192                   
##                     Max.   :19.021         Max.   :7.8800                   
##                     NA's   :12             NA's   :12                       
##  LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
##  Min.   : 5.90      Min.   :0.0440                Min.   :-5.407    
##  1st Qu.:10.21      1st Qu.:0.1420                1st Qu.:-1.335    
##  Median :14.76      Median :0.2520                Median : 2.950    
##  Mean   :14.35      Mean   :0.4798                Mean   : 2.744    
##  3rd Qu.:18.45      3rd Qu.:0.5390                3rd Qu.: 6.779    
##  Max.   :21.32      Max.   :4.3730                Max.   : 9.715    
##  NA's   :1200       NA's   :1200                  NA's   :1200      
##  LandMinTemperatureUncertainty LandAndOceanAverageTemperature
##  Min.   :0.0450                Min.   :12.47                 
##  1st Qu.:0.1550                1st Qu.:14.05                 
##  Median :0.2790                Median :15.25                 
##  Mean   :0.4318                Mean   :15.21                 
##  3rd Qu.:0.4582                3rd Qu.:16.40                 
##  Max.   :3.4980                Max.   :17.61                 
##  NA's   :1200                  NA's   :1200                  
##  LandAndOceanAverageTemperatureUncertainty
##  Min.   :0.0420                           
##  1st Qu.:0.0630                           
##  Median :0.1220                           
##  Mean   :0.1285                           
##  3rd Qu.:0.1510                           
##  Max.   :0.4570                           
##  NA's   :1200

Data Cleaning

Now let’s identify where there are NA values, whether there are duplicates, and whether there are extreme outliers. To take these steps, let’s load the libraries of tidyverse and janitor.

options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages("janitor")
## 
## The downloaded binary packages are in
##  /var/folders/zr/bh_xqrjs3w19n2l94454nczr0000gp/T//Rtmpq5TXXP/downloaded_packages
library(tidyverse)
library(janitor)
sum(is.na(global_temperatures))
## [1] 7224
sapply(global_temperatures, function(x) sum(is.na(x))) 
##                                        dt 
##                                         0 
##                    LandAverageTemperature 
##                                        12 
##         LandAverageTemperatureUncertainty 
##                                        12 
##                        LandMaxTemperature 
##                                      1200 
##             LandMaxTemperatureUncertainty 
##                                      1200 
##                        LandMinTemperature 
##                                      1200 
##             LandMinTemperatureUncertainty 
##                                      1200 
##            LandAndOceanAverageTemperature 
##                                      1200 
## LandAndOceanAverageTemperatureUncertainty 
##                                      1200

So of the 3192 entries, there are 1200 for which there are several columns missing. I’m curious about the 12 entries for which there is NA for LandAverageTemperature is NA. Are these rows entirely blank?

na_rows <- which(is.na(global_temperatures$LandAverageTemperature))

print(na_rows)
##  [1] 11 17 19 22 23 24 26 29 30 31 32 33
global_temperatures[na_rows, ]
##            dt LandAverageTemperature LandAverageTemperatureUncertainty
## 11 1750-11-01                     NA                                NA
## 17 1751-05-01                     NA                                NA
## 19 1751-07-01                     NA                                NA
## 22 1751-10-01                     NA                                NA
## 23 1751-11-01                     NA                                NA
## 24 1751-12-01                     NA                                NA
## 26 1752-02-01                     NA                                NA
## 29 1752-05-01                     NA                                NA
## 30 1752-06-01                     NA                                NA
## 31 1752-07-01                     NA                                NA
## 32 1752-08-01                     NA                                NA
## 33 1752-09-01                     NA                                NA
##    LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 11                 NA                            NA                 NA
## 17                 NA                            NA                 NA
## 19                 NA                            NA                 NA
## 22                 NA                            NA                 NA
## 23                 NA                            NA                 NA
## 24                 NA                            NA                 NA
## 26                 NA                            NA                 NA
## 29                 NA                            NA                 NA
## 30                 NA                            NA                 NA
## 31                 NA                            NA                 NA
## 32                 NA                            NA                 NA
## 33                 NA                            NA                 NA
##    LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 11                            NA                             NA
## 17                            NA                             NA
## 19                            NA                             NA
## 22                            NA                             NA
## 23                            NA                             NA
## 24                            NA                             NA
## 26                            NA                             NA
## 29                            NA                             NA
## 30                            NA                             NA
## 31                            NA                             NA
## 32                            NA                             NA
## 33                            NA                             NA
##    LandAndOceanAverageTemperatureUncertainty
## 11                                        NA
## 17                                        NA
## 19                                        NA
## 22                                        NA
## 23                                        NA
## 24                                        NA
## 26                                        NA
## 29                                        NA
## 30                                        NA
## 31                                        NA
## 32                                        NA
## 33                                        NA

They are blank, so I’m going to remove those rows

global_temperatures_clean <- global_temperatures[-na_rows, ]
sum(is.na(global_temperatures_clean))
## [1] 7128
sapply(global_temperatures_clean, function(x) sum(is.na(x))) 
##                                        dt 
##                                         0 
##                    LandAverageTemperature 
##                                         0 
##         LandAverageTemperatureUncertainty 
##                                         0 
##                        LandMaxTemperature 
##                                      1188 
##             LandMaxTemperatureUncertainty 
##                                      1188 
##                        LandMinTemperature 
##                                      1188 
##             LandMinTemperatureUncertainty 
##                                      1188 
##            LandAndOceanAverageTemperature 
##                                      1188 
## LandAndOceanAverageTemperatureUncertainty 
##                                      1188

So while there are still 1,188 rows that are missing some elements, there are none with NA for all values. Now let’s look for duplicates.

duplicate_rows <- duplicated(global_temperatures)
global_temperatures[duplicate_rows, ]
## [1] dt                                       
## [2] LandAverageTemperature                   
## [3] LandAverageTemperatureUncertainty        
## [4] LandMaxTemperature                       
## [5] LandMaxTemperatureUncertainty            
## [6] LandMinTemperature                       
## [7] LandMinTemperatureUncertainty            
## [8] LandAndOceanAverageTemperature           
## [9] LandAndOceanAverageTemperatureUncertainty
## <0 rows> (or 0-length row.names)

Great! There are no duplicated rows. Now let’s look for outliers in the data - values that deviate so wildly from the median that they will skew the data, and are likely errors. First let’s define interquartile ranges for the ranges, then look for values that are more than 1.5 IQRs lower than the first quartile, or 1.5 IQRs higher than the third quartile.

Q1 <- quantile(global_temperatures_clean$LandAverageTemperature, 0.25, na.rm = TRUE)
Q3 <- quantile(global_temperatures_clean$LandAverageTemperature, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- global_temperatures_clean$LandAverageTemperature < lower_bound | global_temperatures_clean$LandAverageTemperature > upper_bound

print(global_temperatures[outliers, ])
## [1] dt                                       
## [2] LandAverageTemperature                   
## [3] LandAverageTemperatureUncertainty        
## [4] LandMaxTemperature                       
## [5] LandMaxTemperatureUncertainty            
## [6] LandMinTemperature                       
## [7] LandMinTemperatureUncertainty            
## [8] LandAndOceanAverageTemperature           
## [9] LandAndOceanAverageTemperatureUncertainty
## <0 rows> (or 0-length row.names)

It looks like there are no significant outliers, but let’s graph to make sure:

ggplot(global_temperatures_clean, aes(x = dt, y = LandAverageTemperature)) +
  geom_point() +
  theme_minimal() +
  ggtitle("Scatter Plot of Land Average Temperature Over Time")

While there definitely is more variation in the temperature in earlier years, there do not seem to be any significant outliers.

Data Analysis and Graphing

It is also seems now that there there is an upward trend in the data, and also that the data falls into particular ‘bands’. Let’s try to find how the data changes through a line of best fit, by using a linear regression. For this, I’ll use the ggplot library in R.

global_temperatures_clean$dt <- as.Date(global_temperatures_clean$dt)

global_temperatures_clean$Year <- as.numeric(format(global_temperatures_clean$dt, "%Y"))

model <- lm(LandAverageTemperature ~ Year, data = global_temperatures_clean)

library(ggplot2)
ggplot(global_temperatures_clean, aes(x = Year, y = LandAverageTemperature)) +
  geom_point() +  
  geom_smooth(method = "lm", col = "blue") + 
  theme_minimal() +
  labs(title = "Land Average Temperature Over Time",
       x = "Year", y = "Land Average Temperature")
## `geom_smooth()` using formula = 'y ~ x'

How steep is the slope of this line? Or in other words, on average, how quickly has the temperature been getting hotter since 1700?

slope <- coef(model)["Year"]
print(slope)
##        Year 
## 0.004663524

Interesting - so according to this graph, for each one-year increase, the average land temperature is predicted to increase by about 0.0047 degrees Celsius. Yet, I would guess that the rate at which the Earth is warming is speeding up as time goes on. I wonder if the parabolic curve of a quadratic function would more effectively predict the rate of temperature change.

model_poly <- lm(LandAverageTemperature ~ poly(Year, 2, raw=TRUE), data = global_temperatures_clean)

global_temperatures_clean$predicted_temp <- predict(model_poly, global_temperatures_clean)

ggplot(global_temperatures_clean, aes(x = Year, y = LandAverageTemperature)) +
  geom_point() +  
  geom_line(aes(y = predicted_temp), color = "blue") +  
  theme_minimal() +
  labs(title = "Land Average Temperature Over Time with Polynomial Fit",
       x = "Year", y = "Land Average Temperature")

This looks slightly better than the linear function. But where one number - the slope - can accurately express the trajectory of a line - we’ll need three numbers to express the quadratic function. If we think of a quadratic equation as A + Bx + Cx^2, the coefficients will be A, B, and C.

coeffs <- coef(model_poly)
print(coeffs)
##                (Intercept) poly(Year, 2, raw = TRUE)1 
##               1.551258e+02              -1.608079e-01 
## poly(Year, 2, raw = TRUE)2 
##               4.393875e-05

So according to this model, the equation to calculate land temperature would be LandAverageTemperature = 155.1258 − (0.1608079 x Year) + (0.00004393875 × Year^2)

I am curious about how the slope of the graph changes - how is global warming accelerating over time? Let’s look at the derivative of this quadratic function to find the instantaneous slope in different years.

beta_0 <- coeffs[1]  
beta_1 <- coeffs[2]  
beta_2 <- coeffs[3]  

slope_function <- function(year) {
  beta_1 + 2 * beta_2 * year
}

years <- seq(from = min(global_temperatures_clean$Year), to = max(global_temperatures_clean$Year), by = 1)

slopes <- sapply(years, slope_function)

interval_years <- seq(from = min(global_temperatures_clean$Year), to = max(global_temperatures_clean$Year), by = 50)

interval_slopes <- sapply(interval_years, slope_function)

plot(years, slopes, type = 'l', col = 'blue', xlab = 'Year', ylab = 'Rate of Change of Temperature',
     main = 'Rate of Change of Land Average Temperature Over Time')

labels <- paste(interval_years, "\n", round(interval_slopes, 5), sep="")

text(interval_years, interval_slopes, labels = labels, pos = 3, cex = 0.7, offset = 0.5) 

This graph does a good job of showing how climate change is speeding up. According to this model, the average temperature difference from one year to the next was -.007 in 1750 - slightly negative, although fairly close to zero. By 2000, the global land temperature is rising at a rate of roughly .015 degrees per year.