Climate change and global warming have emerged as some of the most significant challenges facing humanity today. Among various environmental indicators, long-term temperature records serve as a direct and quantifiable measure of these changes. By systematically analyzing national temperature data, researchers can detect unusual patterns—such as sudden spikes or abrupt drops—that may signal climate anomalies, data irregularities, or noteworthy environmental events.
By applying machine learning and statistical techniques in R, this project aims to identify years or months with unusual temperature patterns that merit further investigation by climate scientists or policymakers. The project outcomes are expected to support early detection of climate anomalies, inform climate-related decision making, and contribute to the broader understanding of global temperature dynamics.
This study utilizes the “Average Monthly Surface Temperature (1940–2024)” dataset from Kaggle (https://www.kaggle.com/datasets/samithsachidanandan/average-monthly-surface-temperature-1940-2024), restricted here to the years 2000–2024. The dataset contains monthly average surface temperature records for multiple countries, enabling comprehensive analysis of national temperature trends and anomalies during the selected period.
Data Structure:
Entity : The country name.
Code : The country code.
year : The year of the record.
Day : The date of the record.
Daily Average surface temperature : The daily average surface temperature (°C).
Monthly Average surface temperature : The monthly average surface temperature (°C).
Example Record:
| Entity | Year | Day | Daily Average temp (°C) | Monthly Average temp (°C) |
|---|---|---|---|---|
| Canada | 1940 | 15/1/1940 | -2.0324936 | 11.327695 |
This section explains the purpose and process of data cleansing to provide high-quality input data for subsequent analysis.
library(tidyverse)  # loads readr, dplyr, and stringr used below

temperature_data <- read_csv("average-monthly-surface-temperature.csv")

# View data structure and the first few rows
head(temperature_data)
str(temperature_data)
summary(temperature_data)

# Check for missing values
colSums(is.na(temperature_data))

# Delete rows containing NA
temperature_data <- na.omit(temperature_data)

# Remove duplicates
# Check for duplicate rows
duplicated_rows <- duplicated(temperature_data)
sum(duplicated_rows)

# Remove duplicate rows
temperature_data <- temperature_data[!duplicated_rows, ]

# Transform data types
# Insert a Date column marking the first day of each month, derived from Day
temperature_data <- temperature_data %>%
  mutate(Date = as.Date(format(as.Date(Day), "%Y-%m-01")))
head(temperature_data$Date)

# Keep records from the year 2000 onwards
temperature_data <- temperature_data %>%
  filter(year >= 2000)

write_csv(temperature_data, "cleaned-temperature-data.csv")
This chapter performed the following data cleaning operations on the national temperature records: handling missing values, removing duplicates, and converting data types. Subsequent chapters will conduct analysis based on this cleaned dataset.
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(corrplot)
library(naniar)
library(psych)
library(GGally)
data <- read_excel("cleaned_data1.xlsx")
colnames(data)
## [1] "Entity" "Code"
## [3] "year" "Day"
## [5] "Daily Average surface temperature" "Monthly Average surface temperature"
#3.1 Total missing values
total_na <- sum(is.na(data))
total_na
## [1] 0
#3.2 Missing per column
col_na <- colSums(is.na(data))
col_na
## Entity Code
## 0 0
## year Day
## 0 0
## Daily Average surface temperature Monthly Average surface temperature
## 0 0
Compute and plot the annual mean temperature for Malaysia.
# 4.1 Filter Malaysia and calculate yearly mean
Malaysia_data <- data %>%
filter(Entity == "Malaysia") %>%
group_by(year) %>%
summarise(yearly_avg = mean(`Daily Average surface temperature`, na.rm = TRUE))
# 4.2 Plot yearly trend
ggplot(Malaysia_data, aes(x = year, y = yearly_avg)) +
geom_line(color = "blue") +
geom_point(color = "darkblue") +
labs(
title = "Malaysia Yearly Average Temperature Trend",
subtitle = "Year vs Yearly Average Temperature (°C)",
x = "Year",
y = "Temperature (°C)"
) +
theme_minimal()
Plot the monthly average temperature over time, pooling the monthly series across all countries as a proxy for a global trend.
# 5.1 Select global temperature and date
global_data <- data %>%
select(Day, global_avg = `Monthly Average surface temperature`)
# 5.2 Convert Day to Date
global_data$Day <- as.Date(global_data$Day)
# 5.3 Plot time series
ggplot(global_data, aes(x = Day, y = global_avg)) +
geom_line(color = "darkgreen") +
labs(
title = "Global Monthly Average Temperature Trend",
subtitle = "Date vs Global Temperature (°C)",
x = "Date",
y = "Global Avg Temp (°C)"
) +
theme_minimal()
Calculate the difference between Malaysia’s temperature and the pooled (global-proxy) monthly series, then plot its distribution.
# 6.1 Join Malaysia and global temp by Day
# (joining by Day pairs each Malaysia reading with every entity's monthly value for that date)
Malaysia_global <- data %>%
filter(Entity == "Malaysia") %>%
select(Day, Malaysia_temp = `Daily Average surface temperature`) %>%
inner_join(
data %>% select(Day, global_temp = `Monthly Average surface temperature`),
by = "Day"
) %>%
mutate(delta = Malaysia_temp - global_temp)
# 6.2 Histogram of delta
ggplot(Malaysia_global, aes(x = delta)) +
geom_histogram(binwidth = 0.5, fill = "forestgreen", color = "white") +
labs(
title = "Distribution of Temperature Difference (Malaysia vs Global)",
x = "Δ = Malaysia Temp – Global Temp (°C)",
y = "Count"
) +
theme_minimal()
Plot boxplots of Malaysia’s monthly temperatures to observe seasonality.
# 7.1 Extract Malaysia and add month
Malaysia_monthly <- data %>%
filter(Entity == "Malaysia") %>%
mutate(month = as.integer(format(as.Date(Day), "%m")))
# 7.2 Plot boxplot by month
ggplot(Malaysia_monthly, aes(x = factor(month), y = `Daily Average surface temperature`)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Malaysia Monthly Temperature Distribution",
x = "Month",
y = "Avg Temperature (°C)"
) +
theme_minimal()
Compare monthly temperature distributions for China, United States, Malaysia.
# 8.1 Filter three countries and extract month
selected_countries <- data %>%
filter(Entity %in% c("China", "United States", "Malaysia")) %>%
mutate(month = as.integer(format(as.Date(Day), "%m")))
# 8.2 Overlaid density plots
ggplot(selected_countries, aes(x = `Daily Average surface temperature`, color = Entity)) +
geom_density() +
labs(
title = "Temperature Distribution: China vs US vs Malaysia",
x = "Avg Temperature (°C)",
y = "Density"
) +
theme_minimal()
The regression model predicted global monthly average temperatures over time using a simple linear trend based on the date.
# Load packages
library(readxl)
library(dplyr)
library(lubridate)
library(caret)
library(pROC)
library(ggplot2)
library(scales)
# Read data
data <- read_excel("cleaned_data.xlsx")
# Prepare data for regression
# Extract global monthly average
global_df <- data %>%
select(Day, Global_temp = `Monthly Average surface temperature`)
# Ensure Day is Date type and create a numeric time variable
global_df2 <- global_df %>%
mutate(
Day = as_date(Day)
) %>%
arrange(Day) %>%
mutate(
time_ordinal = as.integer(Day)
) %>%
filter(!is.na(Global_temp) & !is.na(time_ordinal))
To train and evaluate the regression model, we split the dataset into a training set and a testing set using a 70/30 proportion. The createDataPartition() function from the caret package was used to perform stratified sampling based on the distribution of the target variable (Global_temp), ensuring that the training and testing sets maintain similar statistical characteristics. A random seed (set.seed(42)) was set for reproducibility.
set.seed(42)
trainIndex_reg <- createDataPartition(global_df2$Global_temp, p = 0.70, list = FALSE)
train_reg <- global_df2[trainIndex_reg, ]
test_reg <- global_df2[-trainIndex_reg, ]
We fit a linear regression in which the independent variable is the numeric time variable and the dependent variable is the global monthly average temperature. Calling summary(model_reg) outputs the regression coefficients, R², and p-values.
model_reg <- lm(Global_temp ~ time_ordinal, data = train_reg)
print(summary(model_reg))
##
## Call:
## lm(formula = Global_temp ~ time_ordinal, data = train_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.481 -7.445 3.796 6.954 11.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.715e+01 2.542e-01 67.455 < 2e-16 ***
## time_ordinal 1.047e-04 1.613e-05 6.493 8.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.603 on 40950 degrees of freedom
## Multiple R-squared: 0.001028, Adjusted R-squared: 0.001004
## F-statistic: 42.15 on 1 and 40950 DF, p-value: 8.539e-11
The regression model was evaluated on the test set using Root Mean Squared Error and R-squared as performance metrics. Predictions were generated using the predict() function on the test set. The RMSE was computed as the square root of the mean squared error between actual and predicted global temperatures, reflecting the average prediction error in °C. The postResample() function from the caret package was used to extract RMSE and R-squared values.
pred_reg <- predict(model_reg, newdata = test_reg)
mse_reg <- mean((test_reg$Global_temp - pred_reg)^2)
rmse_reg <- sqrt(mse_reg)
res_reg <- postResample(pred = pred_reg, obs = test_reg$Global_temp)
cat("Regression RMSE:", rmse_reg, "\n")
## Regression RMSE: 8.542801
cat("Regression R-squared:", res_reg["Rsquared"], "\n")
## Regression R-squared: 0.001120708
A time series line chart was created to visualize the global monthly average temperature trend along with the linear regression fit. The actual temperature values are plotted as light grey points, while the fitted values from the linear regression model are shown as a red line. The x-axis represents the date, and the y-axis shows the monthly average temperature in °C. The plot clearly demonstrates how closely (or not) the linear model captures the long-term temperature trend.
The use of geom_point() and geom_line() in ggplot2 enabled the comparison between observed data and the fitted trend line over time.
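The plotting code is not reproduced in this section; the following is a minimal sketch of how such a chart could be built, assuming the global_df2 data frame and the model_reg fit from the chunks above:

# Sketch: observed values as light grey points, fitted linear trend as a red line
global_df2$fit <- predict(model_reg, newdata = global_df2)
ggplot(global_df2, aes(x = Day)) +
geom_point(aes(y = Global_temp), color = "grey80", size = 0.3) +
geom_line(aes(y = fit), color = "red") +
labs(
title = "Global Monthly Average Temperature with Linear Fit",
x = "Date",
y = "Monthly Avg Temp (°C)"
) +
theme_minimal()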
A scatter plot was generated to compare the predicted and actual global monthly average temperatures on the test set. Each point represents a test observation: the x-axis shows the actual value, the y-axis shows the model’s predicted value, and the colour of each point encodes the magnitude of the prediction error.
This visualization provides an intuitive understanding of the model’s bias and prediction accuracy across the data range.
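Again as a sketch rather than the original code, such a plot could be produced from the test-set objects defined earlier (test_reg and pred_reg):

# Sketch: predicted vs actual on the test set, coloured by absolute error
plot_df <- data.frame(
actual = test_reg$Global_temp,
predicted = pred_reg,
abs_error = abs(test_reg$Global_temp - pred_reg)
)
ggplot(plot_df, aes(x = actual, y = predicted, color = abs_error)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_color_gradient(low = "steelblue", high = "red", name = "Abs. error (°C)") +
labs(
title = "Predicted vs Actual Global Temperature (Test Set)",
x = "Actual (°C)",
y = "Predicted (°C)"
) +
theme_minimal()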
We fit a simple linear model of global monthly temperature against date. The test-set R² is extremely low (~0.001) and the RMSE is ~8.5 °C; the visualizations show the linear fit as a barely sloped line against large seasonal oscillations.
The Random Forest model classifies temperature readings as normal or anomalous by learning patterns in the historical data.
library(readxl)
library(dplyr)
library(caret)
library(randomForest)
library(pROC)
library(ggplot2)
data_path <- "cleaned_data.xlsx"
data <- read_excel(data_path)
# Standardise column names
colnames(data) <- trimws(colnames(data))
colnames(data)[colnames(data) == "Daily Average surface temperature"] <- "temp_daily"
colnames(data)[colnames(data) == "Monthly Average surface temperature"] <- "temp_monthly"
We label a daily reading as anomalous when it exceeds the sample mean by more than two standard deviations. This criterion roughly corresponds to the 97.5th percentile under a normal distribution, striking a balance between sensitivity to extremes and robustness against ordinary variability. The resulting binary factor anomaly_label becomes the modelling target (Y).
threshold <- mean(data$temp_daily, na.rm = TRUE) + 2 * sd(data$temp_daily, na.rm = TRUE)
data <- data %>%
mutate(
anomaly_label = factor(
ifelse(temp_daily > threshold, "Anomaly", "Normal"),
levels = c("Normal", "Anomaly")
)
)
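As a quick aside (not part of the original pipeline), base R’s normal-distribution functions show that a two-standard-deviation cut-off in fact sits near the 97.7th percentile, slightly above the 97.5th:

# Illustrative check of the percentile claim
pnorm(2)       # ≈ 0.9772: proportion of a normal distribution below mean + 2*SD
qnorm(0.975)   # ≈ 1.96: the exact multiplier for the 97.5th percentile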
To gauge how well our model generalises, we withhold 20% of the dataset as a test set. Stratified sampling (via createDataPartition) preserves the original anomaly-to-normal ratio in both subsets, preventing evaluation bias due to class imbalance.
set.seed(2025)
train_idx <- createDataPartition(data$anomaly_label, p = 0.8, list = FALSE)
train <- data[train_idx, ]
test <- data[-train_idx, ]
We fit a forest of 100 trees, which is sufficient for stable estimates on a dataset of this size while keeping training time modest.
# Note: the predictor set includes temp_daily, the variable from which the label was derived
rf_model <- randomForest(
anomaly_label ~ .,
data = train,
ntree = 100,
importance = TRUE
)
By sweeping the decision threshold from 0 to 1 we plot the ROC curve, summarised by the Area Under the Curve (AUC): a value of 0.5 corresponds to random guessing, while values near 1 indicate near-perfect separation.
# Generate probabilities on held‑out data
test$prob_rf <- predict(rf_model, newdata = test, type = "prob")[, "Anomaly"]
# Compute ROC metrics
roc_rf <- roc(test$anomaly_label, test$prob_rf, levels = c("Normal", "Anomaly"))
## Setting direction: controls < cases
# Visualise
plot(
roc_rf,
col = "darkgreen",
lwd = 2,
main = "ROC Curve – Random Forest Anomaly Detection"
)
abline(a = 0, b = 1, lty = 2, col = "grey")
legend(
"bottomright",
legend = sprintf("Random Forest (AUC = %.3f)", auc(roc_rf)),
col = "darkgreen",
lwd = 2
)
From the confusion matrix on the test set we derive accuracy, precision, recall, and the F1-score:
pred_labels <- predict(rf_model, newdata = test)
cm <- confusionMatrix(pred_labels, test$anomaly_label, positive = "Anomaly")
accuracy <- cm$overall["Accuracy"]
precision <- cm$byClass["Precision"]
recall <- cm$byClass["Recall"]
f1_score <- 2 * precision * recall / (precision + recall)
metrics_df <- data.frame(
Metric = c("Accuracy", "Precision (Anomaly)", "Recall (Anomaly)", "F1‑Score (Anomaly)"),
Value = c(accuracy, precision, recall, f1_score)
)
print(metrics_df)
## Metric Value
## 1 Accuracy 1
## 2 Precision (Anomaly) 1
## 3 Recall (Anomaly) 1
## 4 F1‑Score (Anomaly) 1
p1 <- ggplot(metrics_df, aes(x = Metric, y = Value)) +
geom_col(fill = "steelblue") +
ylim(0, 1) +
labs(title = "Random Forest Performance Metrics", y = "Score (0–1)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p1)
We plot variables ranked by their mean decrease in Gini.
imp <- as.data.frame(importance(rf_model))
imp$Feature <- rownames(imp)
imp <- imp %>% arrange(MeanDecreaseGini)
p2 <- ggplot(imp, aes(x = MeanDecreaseGini, y = reorder(Feature, MeanDecreaseGini))) +
geom_col(fill = "darkorange") +
labs(title = "Variable Importance – Random Forest",
x = "Mean Decrease in Gini", y = "Feature") +
theme_minimal()
print(p2)
The Random Forest classifier achieved an AUC exceeding 0.99 and perfect precision, recall, and F1 under the default threshold. Because the anomaly label is itself derived from temp_daily, which is also available to the model as a predictor, this near-perfect separability is expected and partly reflects the label construction rather than generalisable skill. Feature-importance analysis accordingly indicates that daily temperature extremes dominate the decision process, while monthly aggregates fine-tune boundary cases.
This project applied machine learning and statistical methods to national temperature records from 2000 to 2024. Our EDA revealed a persistent warming trend globally and in Malaysia, with global monthly mean temperatures showing clear seasonality and Malaysia showing weak seasonality.
The linear regression model captured the long-term warming trend but struggled with the strong short-term fluctuations, as indicated by the low R² score (~0.001) and high RMSE (~8.5°C).
The Random Forest classification model achieved near-perfect performance (AUC > 0.99, precision, recall, and F1-score = 1) in distinguishing anomalous temperature readings, demonstrating the strong separability of anomalies in this dataset.
For limitations, the time span (2000–2024) restricts the ability to study longer-term climate cycles. The regression model is relatively simple and does not account for nonlinear or multi-factor relationships. The anomaly labeling is based on a statistical threshold, which may not always align with real-world climate events.
Future research could extend the time range and incorporate more meteorological factors (such as precipitation and humidity) to enrich the analysis.
| Member | Matric Number | Responsibility |
|---|---|---|
| XU HAIWEN | 24087887 | Background & Structure |
| BAO YISONG | 24076843 | Data Cleaning |
| GUO CHUWEI (Project Leader) | 24078603 | EDA, Conclusion, Integration of R Markdown Report and Submission to RPubs |
| CHEN YANKE | 24067967 | Regression Modeling |
| HAN BING | 24085575 | Classification Modeling |
(Week 10) Background & Structure
(Week 10) Data Cleaning
(Week 10) EDA
(Week 11) Regression Modeling
(Week 11) Classification Modeling
(Week 12) Conclusion
(Week 12) Integration of R Markdown Report and Submission to RPubs
(Week 13) Recording