This document outlines the steps taken in the Exploratory Data Analysis (EDA) of a global temperature dataset using R. The analysis involves data loading, cleaning, visualization, and statistical examination to uncover insights about global temperature trends.
The dataset was retrieved from Kaggle.
This dataset, sourced from the Berkeley Earth Surface Temperature Study and affiliated with the Lawrence Berkeley National Laboratory, offers an in-depth look into global temperature trends. It compiles an extensive array of data, with 1.6 billion temperature reports drawn from 16 different archives. The collection underscores the evolution of temperature measurement, from early mercury thermometers to modern electronic devices, and reflects significant efforts in data cleaning and preparation. The dataset is crucial for understanding both historical and contemporary climate patterns, accommodating analyses that range from broad global trends to specific regional details.
The primary file, Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv), traces global average temperatures dating back to 1750 for land and 1850 for combined land and ocean readings, including average, maximum, and minimum temperatures along with their 95% confidence intervals. Complementing this are files detailing temperature trends by country, state, major city, and city, each offering unique insights into localized climate behaviors. This dataset not only serves as a vital tool for historical climate trend analysis but also as a means for engaging in the ongoing, critical discourse on climate change.
In this project, I simplify the scope by using only the general global temperature data in “GlobalTemperatures.csv” from the downloaded dataset and running the EDA on that file.
The following are my steps:
Required R packages for the analysis include dplyr, tidyr, ggplot2, rcompanion, readxl, lubridate, corrplot, and patchwork.
library(dplyr)
library(tidyr)
library(ggplot2)
library(rcompanion)
library(readxl)
library(lubridate)
library(corrplot)
library(patchwork)
Setting the working directory and checking it with getwd().
# Setup your working directory if necessary
setwd("your/working/directory")
getwd() # Confirm the current working directory
Loading the global temperature dataset from a CSV file.
file_name <- "./data/GlobalTemperatures.csv"
data_init <- read.csv(file_name, header = TRUE)
Creating a function explore_data to examine the dataset’s structure, and then applying it to data_init.
explore_data <- function(data_name, n = 5){
cat(paste("THE FIRST", n, "ROWS OF THE DATASET ARE\n"))
print(head(data_name, n))
cat(paste("\n\nTHE LAST", n, "ROWS OF THE DATASET ARE\n"))
print(tail(data_name, n))
cat("\n\nTHE COLUMNS' NAME OF THE DATA ARE\n")
print(colnames(data_name))
cat("\n\nTHE STRUCTURE OF THE DATASET ARE\n")
print(str(data_name)) # str() prints on its own and returns NULL, hence the trailing NULL in the output
}
explore_data(data_init)
## THE FIRST 5 ROWS OF THE DATASET ARE
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## 1 1750-01-01 3.034 3.574
## 2 1750-02-01 3.083 3.702
## 3 1750-03-01 5.626 3.076
## 4 1750-04-01 8.490 2.451
## 5 1750-05-01 11.573 2.072
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## LandAndOceanAverageTemperatureUncertainty
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
##
##
## THE LAST 5 ROWS OF THE DATASET ARE
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## 3188 2015-08-01 14.755 0.072
## 3189 2015-09-01 12.999 0.079
## 3190 2015-10-01 10.801 0.102
## 3191 2015-11-01 7.433 0.119
## 3192 2015-12-01 5.518 0.100
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 3188 20.699 0.110 9.005
## 3189 18.845 0.088 7.199
## 3190 16.450 0.059 5.232
## 3191 12.892 0.093 2.157
## 3192 10.725 0.154 0.287
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 3188 0.170 17.589
## 3189 0.229 17.049
## 3190 0.115 16.290
## 3191 0.106 15.252
## 3192 0.099 14.774
## LandAndOceanAverageTemperatureUncertainty
## 3188 0.057
## 3189 0.058
## 3190 0.062
## 3191 0.063
## 3192 0.062
##
##
## THE COLUMNS' NAME OF THE DATA ARE
## [1] "dt"
## [2] "LandAverageTemperature"
## [3] "LandAverageTemperatureUncertainty"
## [4] "LandMaxTemperature"
## [5] "LandMaxTemperatureUncertainty"
## [6] "LandMinTemperature"
## [7] "LandMinTemperatureUncertainty"
## [8] "LandAndOceanAverageTemperature"
## [9] "LandAndOceanAverageTemperatureUncertainty"
##
##
## THE STRUCTURE OF THE DATASET ARE
## 'data.frame': 3192 obs. of 9 variables:
## $ dt : chr "1750-01-01" "1750-02-01" "1750-03-01" "1750-04-01" ...
## $ LandAverageTemperature : num 3.03 3.08 5.63 8.49 11.57 ...
## $ LandAverageTemperatureUncertainty : num 3.57 3.7 3.08 2.45 2.07 ...
## $ LandMaxTemperature : num NA NA NA NA NA NA NA NA NA NA ...
## $ LandMaxTemperatureUncertainty : num NA NA NA NA NA NA NA NA NA NA ...
## $ LandMinTemperature : num NA NA NA NA NA NA NA NA NA NA ...
## $ LandMinTemperatureUncertainty : num NA NA NA NA NA NA NA NA NA NA ...
## $ LandAndOceanAverageTemperature : num NA NA NA NA NA NA NA NA NA NA ...
## $ LandAndOceanAverageTemperatureUncertainty: num NA NA NA NA NA NA NA NA NA NA ...
## NULL
sample_n(data_init, 5) # Sample 5 random rows
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## 1 1907-11-01 5.207 0.207
## 2 1907-12-01 2.956 0.251
## 3 1887-10-01 8.738 0.263
## 4 1961-01-01 2.926 0.089
## 5 1872-07-01 14.285 0.478
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## 1 10.710 0.310 -0.557
## 2 8.831 0.270 -2.585
## 3 14.716 0.690 2.531
## 4 8.380 0.176 -2.383
## 5 19.705 1.105 7.884
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## 1 0.325 13.872
## 2 0.288 13.225
## 3 0.413 14.983
## 4 0.129 13.664
## 5 1.045 16.843
## LandAndOceanAverageTemperatureUncertainty
## 1 0.115
## 2 0.121
## 3 0.126
## 4 0.064
## 5 0.205
summary(data_init)
## dt LandAverageTemperature LandAverageTemperatureUncertainty
## Length:3192 Min. :-2.080 Min. :0.0340
## Class :character 1st Qu.: 4.312 1st Qu.:0.1867
## Mode :character Median : 8.611 Median :0.3920
## Mean : 8.375 Mean :0.9385
## 3rd Qu.:12.548 3rd Qu.:1.4192
## Max. :19.021 Max. :7.8800
## NA's :12 NA's :12
## LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature
## Min. : 5.90 Min. :0.0440 Min. :-5.407
## 1st Qu.:10.21 1st Qu.:0.1420 1st Qu.:-1.335
## Median :14.76 Median :0.2520 Median : 2.950
## Mean :14.35 Mean :0.4798 Mean : 2.744
## 3rd Qu.:18.45 3rd Qu.:0.5390 3rd Qu.: 6.779
## Max. :21.32 Max. :4.3730 Max. : 9.715
## NA's :1200 NA's :1200 NA's :1200
## LandMinTemperatureUncertainty LandAndOceanAverageTemperature
## Min. :0.0450 Min. :12.47
## 1st Qu.:0.1550 1st Qu.:14.05
## Median :0.2790 Median :15.25
## Mean :0.4318 Mean :15.21
## 3rd Qu.:0.4582 3rd Qu.:16.40
## Max. :3.4980 Max. :17.61
## NA's :1200 NA's :1200
## LandAndOceanAverageTemperatureUncertainty
## Min. :0.0420
## 1st Qu.:0.0630
## Median :0.1220
## Mean :0.1285
## 3rd Qu.:0.1510
## Max. :0.4570
## NA's :1200
## Step 3.1: Correct any misspelled or badly written column names ----
global_temp_data <- data_init %>%
rename("date" = "dt",
"LandAvgTemp"= "LandAverageTemperature",
"LandAvgTempUncer" = "LandAverageTemperatureUncertainty",
"LandMaxTemp" = "LandMaxTemperature",
"LandMaxTempUncer" = "LandMaxTemperatureUncertainty",
"LandMinTemp" = "LandMinTemperature",
"LandMinTempUncer" = "LandMinTemperatureUncertainty",
"LandOceanAvgTemp" = "LandAndOceanAverageTemperature",
"LandOceanAvgTempUncer" = "LandAndOceanAverageTemperatureUncertainty") # LHS = new, RHS = old
global_temp_data$date <- as.Date(global_temp_data$date, format = "%Y-%m-%d")
find_na_and_clean_column <- function(column, delete_na = FALSE) {
# Extract the parent data frame from the column
dataset_name <- deparse(substitute(column))
dataset <- eval(parse(text = gsub("\\$.*", "", dataset_name)), envir = parent.frame())
# Extract column name from the column
column_name <- gsub(".*\\$", "", dataset_name)
# Check if column exists in the dataset
if (!column_name %in% names(dataset)) {
stop("Column not found in the dataset.")
}
# Calculate total missing values in the specified column
missing_values <- sum(is.na(column))
total_values <- nrow(dataset)
missing_percentage <- (missing_values / total_values) * 100
# Print summary of missing data in the specified column
print(paste("Total number of NA in", column_name, ":", missing_values))
print(paste("Percentage of NA values in", column_name, ":", missing_percentage, "%"))
# Find indices of missing data in the specified column
missing_indices <- which(is.na(column))
print(paste("Indices of missing values in", column_name, ":"))
print(missing_indices)
# Clean the specified column if delete_na is TRUE
if (delete_na) {
cleaned_dataset <- dataset[!is.na(column), ]
cat(paste("Data cleaned in column", column_name, ". Removed rows with missing values.\n"))
return(cleaned_dataset)
} else {
return(dataset)
}
}
global_temp_data <- find_na_and_clean_column(global_temp_data$LandAvgTemp, delete_na = TRUE)
## [1] "Total number of NA in LandAvgTemp : 12"
## [1] "Percentage of NA values in LandAvgTemp : 0.37593984962406 %"
## [1] "Indices of missing values in LandAvgTemp :"
## [1] 11 17 19 22 23 24 26 29 30 31 32 33
## Data cleaned in column LandAvgTemp . Removed rows with missing values.
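As a quick cross-check on the helper above, missing values can also be counted for every column at once. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and stand in for the real table), not the method used in this analysis.

```r
# Hypothetical toy data standing in for the temperature table
toy <- data.frame(
  date        = as.Date(c("1750-01-01", "1750-02-01", "1750-03-01", "1750-04-01")),
  LandAvgTemp = c(3.034, NA, 5.626, 8.490),
  LandMaxTemp = c(NA, NA, NA, 12.1)
)

# Count and percentage of missing values per column
na_count   <- colSums(is.na(toy))
na_percent <- round(100 * na_count / nrow(toy), 2)
na_summary <- data.frame(
  column      = names(na_count),
  n_missing   = as.integer(na_count),
  pct_missing = as.numeric(na_percent)
)
print(na_summary)

# Drop rows with a missing LandAvgTemp, keeping all other columns
toy_clean <- toy[!is.na(toy$LandAvgTemp), ]
```

Counting per column first makes it easier to see that the land-only average has few gaps while the other columns are missing for a much longer stretch, before deciding which rows to drop.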
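Next, we derive additional columns: year, month, season (months grouped into Northern Hemisphere meteorological seasons), century, and ocean temperature columns computed from the land and land-and-ocean averages.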
global_temp_data <- global_temp_data %>%
mutate(
Year = year(as.Date(date)), # Extracting year from the date
Month = month(as.Date(date), label = TRUE, abbr = FALSE), # Extracting month names
Season = case_when(
Month %in% c("December", "January", "February") ~ "Winter",
Month %in% c("March", "April", "May") ~ "Spring",
Month %in% c("June", "July", "August") ~ "Summer",
Month %in% c("September", "October", "November") ~ "Autumn",
TRUE ~ NA_character_ # For any other cases, which should not exist
),
Century = ceiling(Year / 100), # Calculating the century
OceanAvgTemp = LandOceanAvgTemp - LandAvgTemp, # Rough ocean proxy: combined average minus land average (not an area-weighted ocean mean)
OceanAvgTempUncer = LandOceanAvgTempUncer - LandAvgTempUncer # Difference of the two uncertainties (can be negative; a proxy, not a propagated error)
) %>%
select(date, Century, Year, Month, Season, LandAvgTemp, OceanAvgTemp, everything()) # Reordering Columns
global_temp_data$Century <- as.factor(global_temp_data$Century)
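The season mapping above matches on English month names, which lubridate produces according to the session locale. As a hedged alternative, the sketch below maps month numbers to seasons in base R so the result does not depend on the locale. The helper name `season_from_month` and the sample dates are assumptions made for illustration.

```r
# Map a month number (1-12) to a Northern Hemisphere meteorological season.
# Working with numbers avoids depending on the locale of month labels.
season_from_month <- function(month_num) {
  seasons <- c("Winter", "Winter",
               "Spring", "Spring", "Spring",
               "Summer", "Summer", "Summer",
               "Autumn", "Autumn", "Autumn",
               "Winter")
  seasons[month_num]
}

# Hypothetical sample dates, one per season plus December
dates     <- as.Date(c("1850-01-01", "1850-04-01", "1850-07-01",
                       "1850-10-01", "1850-12-01"))
month_num <- as.integer(format(dates, "%m"))
year_num  <- as.integer(format(dates, "%Y"))

features <- data.frame(
  date    = dates,
  Season  = season_from_month(month_num),
  Century = ceiling(year_num / 100)  # Same century rule as in the analysis
)
print(features)
```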
Save the cleaned data to a CSV file as a backup before starting the EDA.
file_name = "./data/cleaned_global_temperature_data.csv"
write.csv(global_temp_data, file_name, row.names = FALSE)
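Select the columns to compare and compute summary statistics for each:
global_temp_data %>%
select(LandAvgTemp, LandAvgTempUncer, LandMaxTempUncer, LandMinTempUncer) %>% # Columns 6, 8, 10, and 12, selected by name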
summarise(across(everything(), list(
Mean = ~mean(., na.rm = TRUE),
Median = ~median(., na.rm = TRUE),
Min = ~min(., na.rm = TRUE),
Max = ~max(., na.rm = TRUE),
StdDev = ~sd(., na.rm = TRUE)
)))
## LandAvgTemp_Mean LandAvgTemp_Median LandAvgTemp_Min LandAvgTemp_Max
## 1 8.374731 8.6105 -2.08 19.021
## LandAvgTemp_StdDev LandAvgTempUncer_Mean LandAvgTempUncer_Median
## 1 4.38131 0.9384679 0.392
## LandAvgTempUncer_Min LandAvgTempUncer_Max LandAvgTempUncer_StdDev
## 1 0.034 7.88 1.09644
## LandMaxTempUncer_Mean LandMaxTempUncer_Median LandMaxTempUncer_Min
## 1 0.4797816 0.252 0.044
## LandMaxTempUncer_Max LandMaxTempUncer_StdDev LandMinTempUncer_Mean
## 1 4.373 0.583203 0.4318489
## LandMinTempUncer_Median LandMinTempUncer_Min LandMinTempUncer_Max
## 1 0.279 0.045 3.498
## LandMinTempUncer_StdDev
## 1 0.4458378
draw_boxplots <- function(data, variable_name) {
# Ensure that the variable name is one of the columns in the dataset
if (!variable_name %in% names(data)) {
stop("Variable not found in the dataset.")
}
# Define categories for which boxplots will be made
categories <- c("Century", "Month", "Season")
# Create boxplots for each category
for (category in categories) {
# Create a formula for the boxplot function
formula <- as.formula(paste(variable_name, "~", category))
# Create the boxplot
boxplot(formula, data = data,
main = paste("Boxplot of", variable_name, "by", category),
xlab = category,
ylab = variable_name,
col = rainbow(length(unique(data[[category]]))))
}
}
# Adjust the categories as needed
draw_boxplots(global_temp_data, "LandAvgTemp")
draw_boxplots(global_temp_data, "OceanAvgTemp")
draw_boxplots(global_temp_data, "LandMaxTemp")
split_data <- split(global_temp_data$LandAvgTemp, global_temp_data$Century)
for (century in unique(global_temp_data$Century)) {
hist(split_data[[century]],
main = paste("Histogram of Land Average Temperature for Century", century),
xlab = "Land Average Temperature",
ylab = "Frequency",
col = "lightblue", # Histogram bar color
border = "black", # Border color of bars
breaks = 20) # Number of bins (adjust as needed)
} # Change the grouping variable if needed (e.g., Century to Month)
temperature_plot <- function(data, variable) {
ggplot(data, aes(x = date)) +
geom_line(aes(y = .data[[variable]], color = Season)) +
geom_smooth(aes(y = .data[[variable]], color = Season), method = "auto") +
theme_minimal() +
labs(
title = paste(variable, "Over Time with Uncertainty"),
x = "Year",
y = variable
) +
scale_color_manual(
values = c("Winter" = "blue", "Spring" = "green", "Summer" = "red", "Autumn" = "orange")
) +
guides(color = guide_legend(title = "Season"))
}
temperature_plot(global_temp_data, "LandAvgTemp")
temperature_plot(global_temp_data, "OceanAvgTemp")
Alternate representation
alt_temp_plot <- function(data, variable) {
plot1 <- ggplot(data, aes(x = date)) +
geom_line(aes(y = .data[[variable]], color = Season)) +
theme_minimal() +
labs(
title = paste("Temperature Over Time -", variable),
x = "Year",
y = variable
) +
scale_color_manual(
values = c("Winter" = "blue", "Spring" = "green", "Summer" = "red", "Autumn" = "orange")
) +
guides(color = guide_legend(title = "Season")) +
theme(plot.title = element_text(hjust = 0.5))
plot2 <- ggplot(data, aes(x = date)) +
geom_line(aes(y = .data[[paste0(variable, "Uncer")]], color = Season)) +
theme_minimal() +
labs(
title = paste("Temperature Uncertainty Over Time -", variable),
x = "Year",
y = paste(variable, "Uncertainty")
) +
scale_color_manual(
values = c("Winter" = "blue", "Spring" = "green", "Summer" = "red", "Autumn" = "orange")
) +
guides(color = guide_legend(title = "Season")) +
theme(plot.title = element_text(hjust = 0.5))
combined_plot <- plot1 / plot2
return(combined_plot)
}
alt_temp_plot(global_temp_data, "LandAvgTemp")
alt_temp_plot(global_temp_data, "OceanAvgTemp")
# Remove rows where 'LandAvgTemp' or 'OceanAvgTemp' is NA
global_temp_data <- global_temp_data %>%
filter(!is.na(LandAvgTemp) & !is.na(OceanAvgTemp))
# Group by Year and summarize to get the average temperature and uncertainty for each year
avg_temp_data <- global_temp_data %>%
group_by(Year) %>%
summarise(
LandAvgTemp = mean(LandAvgTemp, na.rm = TRUE),
LandAvgTempUncer = mean(LandAvgTempUncer, na.rm = TRUE),
OceanAvgTemp = mean(OceanAvgTemp, na.rm = TRUE),
OceanAvgTempUncer = mean(OceanAvgTempUncer, na.rm = TRUE)
) %>%
ungroup()
# Create the plot
ggplot(avg_temp_data, aes(x = Year)) +
geom_ribbon(aes(ymin = LandAvgTemp - LandAvgTempUncer, ymax = LandAvgTemp + LandAvgTempUncer), fill = "orange", alpha = 0.2) +
geom_ribbon(aes(ymin = OceanAvgTemp - OceanAvgTempUncer, ymax = OceanAvgTemp + OceanAvgTempUncer), fill = "blue", alpha = 0.2) +
geom_line(aes(y = LandAvgTemp, color = "Land Average Temperature"), size = 1) +
geom_line(aes(y = OceanAvgTemp, color = "Ocean Average Temperature"), size = 1) +
geom_smooth(aes(y = LandAvgTemp, color = "Land Average Temperature"), method = "lm", se = FALSE, size = 1) +
geom_smooth(aes(y = OceanAvgTemp, color = "Ocean Average Temperature"), method = "lm", se = FALSE, size = 1) +
scale_color_manual(values = c("Land Average Temperature" = "orange", "Ocean Average Temperature" = "blue")) +
labs(title = "Comparison between Ocean and Land Average Temperature (1850 - 2015)",
x = "Year",
y = "Temperature (°C)") +
theme_minimal() +
theme(legend.title = element_blank())
Using statistical methods such as the Z-score and the IQR rule to detect outliers in LandAvgTemp.
global_temp_data <- global_temp_data %>%
mutate(LandAvgTemp_ZScore = (LandAvgTemp - mean(LandAvgTemp, na.rm = TRUE)) / sd(LandAvgTemp, na.rm = TRUE))
outliers_zscore <- filter(global_temp_data, abs(LandAvgTemp_ZScore) > 2) # Identify outliers (using 2 as the threshold)
print(outliers_zscore)
## [1] date Century Year
## [4] Month Season LandAvgTemp
## [7] OceanAvgTemp LandAvgTempUncer LandMaxTemp
## [10] LandMaxTempUncer LandMinTemp LandMinTempUncer
## [13] LandOceanAvgTemp LandOceanAvgTempUncer OceanAvgTempUncer
## [16] LandAvgTemp_ZScore
## <0 rows> (or 0-length row.names)
Calculate the IQR and flag values more than 1.5 × IQR below Q1 or above Q3:
Q1 <- quantile(global_temp_data$LandAvgTemp, 0.25, na.rm = TRUE)
Q3 <- quantile(global_temp_data$LandAvgTemp, 0.75, na.rm = TRUE)
iqr_value <- Q3 - Q1 # Named iqr_value to avoid masking the built-in IQR() function
outliers_iqr <- filter(global_temp_data, LandAvgTemp < (Q1 - 1.5 * iqr_value) | LandAvgTemp > (Q3 + 1.5 * iqr_value))
print(outliers_iqr)
## [1] date Century Year
## [4] Month Season LandAvgTemp
## [7] OceanAvgTemp LandAvgTempUncer LandMaxTemp
## [10] LandMaxTempUncer LandMinTemp LandMinTempUncer
## [13] LandOceanAvgTemp LandOceanAvgTempUncer OceanAvgTempUncer
## [16] LandAvgTemp_ZScore
## <0 rows> (or 0-length row.names)
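Both checks can be wrapped in small reusable helpers that return a logical flag per value. This is a sketch under assumptions: the function names `zscore_outliers` and `iqr_outliers` and the `temps` vector are made up for illustration, and the thresholds mirror the ones used above (2 for the Z-score, 1.5 for the IQR rule).

```r
# Flag values whose absolute Z-score exceeds a threshold
zscore_outliers <- function(x, threshold = 2) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  !is.na(z) & abs(z) > threshold
}

# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * spread | x > q[2] + k * spread)
}

# Hypothetical values with one obvious extreme and one missing entry
temps <- c(8.1, 8.4, 7.9, 8.3, 8.0, 8.2, 15.0, NA)
temps[zscore_outliers(temps)]
temps[iqr_outliers(temps)]
```

Returning a logical vector rather than filtered rows lets the same flag be used to inspect, drop, or highlight the affected observations.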
The EDA conducted on the land average temperature data reveals several key insights into the temporal dynamics of climate patterns:
The time series plots with uncertainty (Figure 1) suggest a clear seasonal pattern in temperature fluctuations, with peaks and troughs corresponding to summer and winter seasons, respectively. This seasonal variation appears consistent over the years. Notably, there is a visible trend of increasing temperatures over time across all seasons, with the most pronounced rise observed in recent decades, indicating a potential long-term warming trend.
The trend of increasing temperatures is further corroborated by the boxplot of average land temperature by century (Figure 2, top). Each successive century shows a median temperature higher than the previous, with the 21st century displaying the highest median temperature. Additionally, the spread of temperatures (interquartile range) in the 21st century is narrower than in the 19th century, suggesting less variability and a more consistent higher temperature regime.
Seasonal distributions of temperatures are explored through the boxplot of average land temperature by month (Figure 2, bottom). This plot highlights the highest median temperatures occurring in the traditional summer months (June, July, August) and the lowest in winter months (December, January, February). The presence of outliers, particularly in the transitional months, suggests occasional temperature extremes outside the expected norms.
From these observations, we can conclude that there is a clear and consistent pattern of seasonal temperature variation, as well as a long-term trend of rising average land temperatures. The increasing median temperatures across centuries, especially the marked rise into the 21st century, align with broader concerns about global warming and climate change. These insights can inform further research into the causes of these trends and aid in the development of climate models and policy decisions.
Perform a more detailed statistical analysis to quantify the warming trend. This could involve fitting a linear regression model to the time series data to estimate the rate of temperature increase. Decompose the time series into trend, seasonal, and residual components to better understand the underlying patterns.
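A minimal sketch of this idea, using only base R (`stl()` and `lm()` from the stats package), might look as follows. The series is synthetic, with an assumed built-in trend of 0.02 degrees per year, so the numbers say nothing about the real data. All variable names are illustrative.

```r
set.seed(42)  # Reproducible synthetic data, not real measurements

# Assumed example: 30 years of monthly values with a seasonal cycle,
# a built-in warming trend of 0.02 degrees per year, and random noise
n_months <- 360
time_yrs <- (seq_len(n_months) - 1) / 12
temp <- 8 + 0.02 * time_yrs +
  5 * sin(2 * pi * time_yrs) +
  rnorm(n_months, sd = 0.3)

temp_ts <- ts(temp, start = c(1986, 1), frequency = 12)

# Decompose into seasonal, trend, and remainder components
decomp <- stl(temp_ts, s.window = "periodic")
head(decomp$time.series)

# Estimate the warming rate (degrees per year) from the trend component
trend_values <- as.numeric(decomp$time.series[, "trend"])
trend_fit <- lm(trend_values ~ time_yrs)
coef(trend_fit)["time_yrs"]
```

Fitting the regression to the trend component rather than the raw series keeps the seasonal cycle from distorting the slope estimate.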
Develop predictive models to forecast future temperature changes. Techniques could include ARIMA models, seasonal-trend decomposition using LOESS (STL), and machine learning methods like random forests or neural networks.
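As one possible starting point, the sketch below fits a seasonal ARIMA model with base R's `arima()` and forecasts 12 months ahead with `predict()`. The data are synthetic and the model orders are an arbitrary assumption, not a tuned choice for the temperature series.

```r
set.seed(1)  # Synthetic series for illustration only

# Assumed example: 20 years of monthly values with a seasonal cycle and noise
n_months <- 240
t_idx <- seq_len(n_months)
series <- ts(10 + 4 * sin(2 * pi * t_idx / 12) + rnorm(n_months, sd = 0.5),
             start = c(1996, 1), frequency = 12)

# Fit a seasonal ARIMA model; the orders are an arbitrary starting point
fit <- arima(series,
             order = c(1, 0, 0),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months; fc$pred holds point forecasts, fc$se standard errors
fc <- predict(fit, n.ahead = 12)
round(fc$pred, 2)
```

In practice the orders would be chosen by comparing candidate models, for example by AIC, and by checking the residuals for remaining autocorrelation.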
Analyze the correlation between land temperatures and various potential drivers such as greenhouse gas concentrations, solar cycles, ocean temperatures, and land use changes. Use Granger causality tests to investigate whether changes in these factors precede changes in temperature, suggesting a potential causal relationship.
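The logic of a Granger causality test can be sketched in base R by comparing two nested regressions. The version below uses a single lag and synthetic data in which the driver is constructed to lead the response. The series, coefficients, and variable names are all assumptions for illustration; dedicated packages offer more complete implementations.

```r
set.seed(7)  # Synthetic data; no real measurements are used

# Assumed example: a driver series and a response that follows it with a one-step lag
n <- 200
driver <- as.numeric(arima.sim(list(ar = 0.6), n = n))
response <- numeric(n)
for (i in 2:n) {
  response[i] <- 0.3 * response[i - 1] + 0.8 * driver[i - 1] + rnorm(1, sd = 0.5)
}

# Align current values with their first lags
lagged <- data.frame(
  y      = response[-1],
  y_lag1 = response[-n],
  x_lag1 = driver[-n]
)

# Granger-style comparison with one lag: does the lagged driver improve
# on a model that uses only the response's own history?
restricted   <- lm(y ~ y_lag1, data = lagged)
unrestricted <- lm(y ~ y_lag1 + x_lag1, data = lagged)
granger_test <- anova(restricted, unrestricted)
print(granger_test)
```

A small p-value in the F-test indicates that past values of the driver carry predictive information about the response. This is evidence of temporal precedence, not proof of causation.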
Compare the observed temperature data with projections from climate models. This can help validate the models and improve our understanding of their predictive capabilities.
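A simple way to start such a comparison is to compute a few error metrics between observed and projected values. The numbers below are made up for illustration and do not come from any climate model.

```r
# Hypothetical observed and model-projected annual means (made-up numbers)
observed  <- c(8.5, 8.7, 8.6, 9.0, 9.2)
projected <- c(8.4, 8.8, 8.9, 9.1, 9.0)

errors <- projected - observed
bias <- mean(errors)           # Average signed error
rmse <- sqrt(mean(errors^2))   # Root mean squared error
mae  <- mean(abs(errors))      # Mean absolute error

round(c(bias = bias, rmse = rmse, mae = mae), 3)
```

Bias shows whether the projections run systematically warm or cold, while RMSE and MAE summarize the typical size of the mismatch.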
Include additional climate-related variables such as precipitation, humidity, and atmospheric pressure to perform a multivariate time series analysis. This can provide a more holistic understanding of climate dynamics.