Climate change and global warming have emerged as some of the most significant challenges facing humanity today. Among various environmental indicators, long-term temperature records serve as a direct and quantifiable measure of these changes. By systematically analyzing national temperature data, researchers can detect unusual patterns—such as sudden spikes or abrupt drops—that may signal climate anomalies, data irregularities, or noteworthy environmental events.
By applying machine learning and statistical techniques in R, this project aims to identify years or months with unusual temperature patterns that merit further investigation by climate scientists or policymakers. The project outcomes are expected to support early detection of climate anomalies, inform climate-related decision making, and contribute to the broader understanding of global temperature dynamics.
This study utilizes the “Average Monthly Surface Temperature (1940–2024)” dataset from Kaggle (https://www.kaggle.com/datasets/samithsachidanandan/average-monthly-surface-temperature-1940-2024), restricted here to the years 2000–2024. The dataset contains monthly average surface temperature records for multiple countries, enabling comprehensive analysis of national temperature trends and anomalies during the selected period.
Data Structure:
Entity : The country name.
Code : The country code.
year : The year of the record.
Day : The date of the record.
Daily Average surface temperature : The daily average surface temperature (°C).
Monthly Average surface temperature : The monthly average surface temperature (°C).
Example Record:
| Entity | Year | Day | Daily Average temp (°C) | Monthly Average temp (°C) |
|---|---|---|---|---|
| Canada | 1940 | 15/1/1940 | -2.0324936 | 11.327695 |
This section explains the purpose and process of data cleansing to provide high-quality input data for subsequent analysis.
library(tidyverse)  # loads readr, dplyr, and stringr used below

temperature_data <- read_csv("average-monthly-surface-temperature.csv")

# View data structure and the first few rows
head(temperature_data)
str(temperature_data)
summary(temperature_data)

# Check for missing values
colSums(is.na(temperature_data))

# Delete rows containing NA
temperature_data <- na.omit(temperature_data)

# Remove duplicates
# Check for duplicate rows
duplicated_rows <- duplicated(temperature_data)
sum(duplicated_rows)

# Remove duplicate rows
temperature_data <- temperature_data[!duplicated_rows, ]

# Transform data types
# Insert a Date column marking the first day of each month, derived from Day
temperature_data <- temperature_data %>%
  mutate(Date = as.Date(format(as.Date(Day), "%Y-%m-01")))
head(temperature_data$Date)

# Keep records from the year 2000 onwards
temperature_data <- temperature_data %>%
  filter(year >= 2000)

write_csv(temperature_data, "cleaned-temperature-data.csv")
This chapter performed the following data cleaning operations on the national temperature records: handling missing values, removing duplicates, and converting data types. Subsequent chapters will conduct analysis based on this cleaned dataset.
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(corrplot)
library(naniar)
library(psych)
library(GGally)
data <- read_excel("cleaned_data1.xlsx")
colnames(data)
## [1] "Entity" "Code"
## [3] "year" "Day"
## [5] "Daily Average surface temperature" "Monthly Average surface temperature"
#3.1 Total missing values
total_na <- sum(is.na(data))
total_na
## [1] 0
#3.2 Missing per column
col_na <- colSums(is.na(data))
col_na
## Entity Code
## 0 0
## year Day
## 0 0
## Daily Average surface temperature Monthly Average surface temperature
## 0 0
Compute and plot the annual mean temperature for Malaysia.
# 4.1 Filter Malaysia and calculate yearly mean
Malaysia_data <- data %>%
filter(Entity == "Malaysia") %>%
group_by(year) %>%
summarise(yearly_avg = mean(`Daily Average surface temperature`, na.rm = TRUE))
# 4.2 Plot yearly trend
ggplot(Malaysia_data, aes(x = year, y = yearly_avg)) +
geom_line(color = "blue") +
geom_point(color = "darkblue") +
labs(
title = "Malaysia Yearly Average Temperature Trend",
subtitle = "Year vs Yearly Average Temperature (°C)",
x = "Year",
y = "Temperature (°C)"
) +
theme_minimal()
Plot the monthly average temperature over time, pooling the monthly series across all countries as a proxy for a global trend.
# 5.1 Select global temperature and date
global_data <- data %>%
select(Day, global_avg = `Monthly Average surface temperature`)
# 5.2 Convert Day to Date
global_data$Day <- as.Date(global_data$Day)
# 5.3 Plot time series
ggplot(global_data, aes(x = Day, y = global_avg)) +
geom_line(color = "darkgreen") +
labs(
title = "Global Monthly Average Temperature Trend",
subtitle = "Date vs Global Temperature (°C)",
x = "Date",
y = "Global Avg Temp (°C)"
) +
theme_minimal()
Calculate the difference between Malaysia’s temperature and the pooled (global-proxy) monthly series, then plot its distribution.
# 6.1 Join Malaysia and global temp by Day
# (joining by Day pairs each Malaysia reading with every entity's monthly value for that date)
Malaysia_global <- data %>%
filter(Entity == "Malaysia") %>%
select(Day, Malaysia_temp = `Daily Average surface temperature`) %>%
inner_join(
data %>% select(Day, global_temp = `Monthly Average surface temperature`),
by = "Day"
) %>%
mutate(delta = Malaysia_temp - global_temp)
# 6.2 Histogram of delta
ggplot(Malaysia_global, aes(x = delta)) +
geom_histogram(binwidth = 0.5, fill = "forestgreen", color = "white") +
labs(
title = "Distribution of Temperature Difference (Malaysia vs Global)",
x = "Δ = Malaysia Temp – Global Temp (°C)",
y = "Count"
) +
theme_minimal()
Plot boxplots of Malaysia’s monthly temperatures to observe seasonality.
# 7.1 Extract Malaysia and add month
Malaysia_monthly <- data %>%
filter(Entity == "Malaysia") %>%
mutate(month = as.integer(format(as.Date(Day), "%m")))
# 7.2 Plot boxplot by month
ggplot(Malaysia_monthly, aes(x = factor(month), y = `Daily Average surface temperature`)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Malaysia Monthly Temperature Distribution",
x = "Month",
y = "Avg Temperature (°C)"
) +
theme_minimal()
Compare monthly temperature distributions for China, United States, Malaysia.
# 8.1 Filter three countries and extract month
selected_countries <- data %>%
filter(Entity %in% c("China", "United States", "Malaysia")) %>%
mutate(month = as.integer(format(as.Date(Day), "%m")))
# 8.2 Overlaid density plots
ggplot(selected_countries, aes(x = `Daily Average surface temperature`, color = Entity)) +
geom_density() +
labs(
title = "Temperature Distribution: China vs US vs Malaysia",
x = "Avg Temperature (°C)",
y = "Density"
) +
theme_minimal()
The regression model predicted global monthly average temperatures over time using a simple linear trend based on the date.
# Load packages
library(readxl)
library(dplyr)
library(lubridate)
library(caret)
library(pROC)
library(ggplot2)
library(scales)
# Read data
data <- read_excel("cleaned_data.xlsx")
# Prepare data for regression
# Extract global monthly average
global_df <- data %>%
select(Day, Global_temp = `Monthly Average surface temperature`)
# Ensure Day is Date type and create a numeric time variable
global_df2 <- global_df %>%
mutate(
Day = as_date(Day)
) %>%
arrange(Day) %>%
mutate(
time_ordinal = as.integer(Day)
) %>%
filter(!is.na(Global_temp) & !is.na(time_ordinal))
To train and evaluate the regression model, we split the dataset into a training set and a testing set using a 70/30 proportion. The createDataPartition() function from the caret package was used to perform stratified sampling based on the distribution of the target variable (Global_temp), ensuring that the training and testing sets maintain similar statistical characteristics. A random seed (set.seed(42)) was set for reproducibility.
set.seed(42)
trainIndex_reg <- createDataPartition(global_df2$Global_temp, p = 0.70, list = FALSE)
train_reg <- global_df2[trainIndex_reg, ]
test_reg <- global_df2[-trainIndex_reg, ]
We fit a linear regression in which the independent variable is the numeric time variable and the dependent variable is the global monthly average temperature. Calling summary(model_reg) outputs the regression coefficients, R², and p-values.
model_reg <- lm(Global_temp ~ time_ordinal, data = train_reg)
print(summary(model_reg))
##
## Call:
## lm(formula = Global_temp ~ time_ordinal, data = train_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.481 -7.445 3.796 6.954 11.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.715e+01 2.542e-01 67.455 < 2e-16 ***
## time_ordinal 1.047e-04 1.613e-05 6.493 8.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.603 on 40950 degrees of freedom
## Multiple R-squared: 0.001028, Adjusted R-squared: 0.001004
## F-statistic: 42.15 on 1 and 40950 DF, p-value: 8.539e-11
The regression model was evaluated on the test set using Root Mean Squared Error and R-squared as performance metrics. Predictions were generated using the predict() function on the test set. The RMSE was computed as the square root of the mean squared error between actual and predicted global temperatures, reflecting the average prediction error in °C. The postResample() function from the caret package was used to extract RMSE and R-squared values.
pred_reg <- predict(model_reg, newdata = test_reg)
mse_reg <- mean((test_reg$Global_temp - pred_reg)^2)
rmse_reg <- sqrt(mse_reg)
res_reg <- postResample(pred = pred_reg, obs = test_reg$Global_temp)
cat("Regression RMSE:", rmse_reg, "\n")
## Regression RMSE: 8.542801
cat("Regression R-squared:", res_reg["Rsquared"], "\n")
## Regression R-squared: 0.001120708
A time series line chart was created to visualize the global monthly average temperature trend along with the linear regression fit. The actual temperature values are plotted as light grey points, while the fitted values from the linear regression model are shown as a red line. The x-axis represents the date, and the y-axis shows the monthly average temperature in °C. The plot clearly demonstrates how closely (or not) the linear model captures the long-term temperature trend.
The use of geom_point() and geom_line() in ggplot2 enabled the comparison between observed data and the fitted trend line over time.
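The plotting code is not reproduced in this section; the following is a minimal sketch of how such a chart could be built, assuming the global_df2 data frame and the model_reg fit from the chunks above:

# Sketch: observed values as light grey points, fitted linear trend as a red line
global_df2$fit <- predict(model_reg, newdata = global_df2)
ggplot(global_df2, aes(x = Day)) +
geom_point(aes(y = Global_temp), color = "grey80", size = 0.3) +
geom_line(aes(y = fit), color = "red") +
labs(
title = "Global Monthly Average Temperature with Linear Fit",
x = "Date",
y = "Monthly Avg Temp (°C)"
) +
theme_minimal()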
A scatter plot was generated to compare the predicted and actual global monthly average temperatures on the test set. Each point represents a test observation: the x-axis shows the actual value, the y-axis shows the model’s predicted value, and the colour of each point encodes the magnitude of the prediction error.
This visualization provides an intuitive understanding of the model’s bias and prediction accuracy across the data range.
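Again as a sketch rather than the original code, such a plot could be produced from the test-set objects defined earlier (test_reg and pred_reg):

# Sketch: predicted vs actual on the test set, coloured by absolute error
plot_df <- data.frame(
actual = test_reg$Global_temp,
predicted = pred_reg,
abs_error = abs(test_reg$Global_temp - pred_reg)
)
ggplot(plot_df, aes(x = actual, y = predicted, color = abs_error)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_color_gradient(low = "steelblue", high = "red", name = "Abs. error (°C)") +
labs(
title = "Predicted vs Actual Global Temperature (Test Set)",
x = "Actual (°C)",
y = "Predicted (°C)"
) +
theme_minimal()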
We fit a simple linear model of global monthly temperature against date. The test-set R² is extremely low (~0.001) and the RMSE is ~8.5 °C; the visualizations show the linear fit as a barely sloped line against large seasonal oscillations.
The Random Forest model classifies temperature readings as normal or anomalous by learning patterns in the historical data.
library(readxl)
library(dplyr)
library(caret)
library(randomForest)
library(pROC)
library(ggplot2)
data_path <- "cleaned_data.xlsx"
data <- read_excel(data_path)
# Standardise column names
colnames(data) <- trimws(colnames(data))
colnames(data)[colnames(data) == "Daily Average surface temperature"] <- "temp_daily"
colnames(data)[colnames(data) == "Monthly Average surface temperature"] <- "temp_monthly"
We label a daily reading as anomalous when it exceeds the sample mean by more than two standard deviations. This criterion roughly corresponds to the 97.5th percentile under a normal distribution, striking a balance between sensitivity to extremes and robustness against ordinary variability. The resulting binary factor anomaly_label becomes the modelling target (Y).
threshold <- mean(data$temp_daily, na.rm = TRUE) + 2 * sd(data$temp_daily, na.rm = TRUE)
data <- data %>%
mutate(
anomaly_label = factor(
ifelse(temp_daily > threshold, "Anomaly", "Normal"),
levels = c("Normal", "Anomaly")
)
)
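As a quick aside (not part of the original pipeline), base R’s normal-distribution functions show that a two-standard-deviation cut-off in fact sits near the 97.7th percentile, slightly above the 97.5th:

# Illustrative check of the percentile claim
pnorm(2)       # ≈ 0.9772: proportion of a normal distribution below mean + 2*SD
qnorm(0.975)   # ≈ 1.96: the exact multiplier for the 97.5th percentile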
To gauge how well our model generalises, we withhold 20% of the dataset as a test set. Stratified sampling (via createDataPartition) preserves the original anomaly-to-normal ratio in both subsets, preventing evaluation bias due to class imbalance.
set.seed(2025)
train_idx <- createDataPartition(data$anomaly_label, p = 0.8, list = FALSE)
train <- data[train_idx, ]
test <- data[-train_idx, ]
We fit a forest of 100 trees, which is sufficient for stable estimates on a dataset of this size while keeping training time modest.
# Note: the predictor set includes temp_daily, the variable from which the label was derived
rf_model <- randomForest(
anomaly_label ~ .,
data = train,
ntree = 100,
importance = TRUE
)
By sweeping the decision threshold from 0 to 1 we plot the ROC curve, summarised by the Area Under the Curve (AUC): a value of 0.5 corresponds to random guessing, while values near 1 indicate near-perfect separation.
# Generate probabilities on held‑out data
test$prob_rf <- predict(rf_model, newdata = test, type = "prob")[, "Anomaly"]
# Compute ROC metrics
roc_rf <- roc(test$anomaly_label, test$prob_rf, levels = c("Normal", "Anomaly"))
## Setting direction: controls < cases
# Visualise
plot(
roc_rf,
col = "darkgreen",
lwd = 2,
main = "ROC Curve – Random Forest Anomaly Detection"
)
abline(a = 0, b = 1, lty = 2, col = "grey")
legend(
"bottomright",
legend = sprintf("Random Forest (AUC = %.3f)", auc(roc_rf)),
col = "darkgreen",
lwd = 2
)
From the confusion matrix on the test set we derive accuracy, precision, recall, and the F1-score:
pred_labels <- predict(rf_model, newdata = test)
cm <- confusionMatrix(pred_labels, test$anomaly_label, positive = "Anomaly")
accuracy <- cm$overall["Accuracy"]
precision <- cm$byClass["Precision"]
recall <- cm$byClass["Recall"]
f1_score <- 2 * precision * recall / (precision + recall)
metrics_df <- data.frame(
Metric = c("Accuracy", "Precision (Anomaly)", "Recall (Anomaly)", "F1‑Score (Anomaly)"),
Value = c(accuracy, precision, recall, f1_score)
)
print(metrics_df)
## Metric Value
## 1 Accuracy 1
## 2 Precision (Anomaly) 1
## 3 Recall (Anomaly) 1
## 4 F1‑Score (Anomaly) 1
p1 <- ggplot(metrics_df, aes(x = Metric, y = Value)) +
geom_col(fill = "steelblue") +
ylim(0, 1) +
labs(title = "Random Forest Performance Metrics", y = "Score (0–1)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p1)
We plot variables ranked by their mean decrease in Gini.
imp <- as.data.frame(importance(rf_model))
imp$Feature <- rownames(imp)
imp <- imp %>% arrange(MeanDecreaseGini)
p2 <- ggplot(imp, aes(x = MeanDecreaseGini, y = reorder(Feature, MeanDecreaseGini))) +
geom_col(fill = "darkorange") +
labs(title = "Variable Importance – Random Forest",
x = "Mean Decrease in Gini", y = "Feature") +
theme_minimal()
print(p2)
The Random Forest classifier achieved an AUC exceeding 0.99 and perfect precision, recall, and F1 under the default threshold. Because the anomaly label is itself derived from temp_daily, which is also available to the model as a predictor, this near-perfect separability is expected and partly reflects the label construction rather than generalisable skill. Feature-importance analysis accordingly indicates that daily temperature extremes dominate the decision process, while monthly aggregates fine-tune boundary cases.
This project applied machine learning and statistical methods to national temperature records from 2000 to 2024. Our EDA revealed a persistent warming trend globally and in Malaysia, with global monthly mean temperatures showing clear seasonality and Malaysia showing weak seasonality.
The linear regression model captured the long-term warming trend but struggled with the strong short-term fluctuations, as indicated by the low R² score (~0.001) and high RMSE (~8.5°C).
The Random Forest classification model achieved near-perfect performance (AUC > 0.99, precision, recall, and F1-score = 1) in distinguishing anomalous temperature readings, demonstrating the strong separability of anomalies in this dataset.
For limitations, the time span (2000–2024) restricts the ability to study longer-term climate cycles. The regression model is relatively simple and does not account for nonlinear or multi-factor relationships. The anomaly labeling is based on a statistical threshold, which may not always align with real-world climate events.
Future research could extend the time range and incorporate more meteorological factors (such as precipitation and humidity) to enrich the analysis.
| Member | Matric Number | Responsibility |
|---|---|---|
| XU HAIWEN | 24087887 | Background & Structure |
| BAO YISONG | 24076843 | Data Cleaning |
| GUO CHUWEI (Project Leader) | 24078603 | EDA, Conclusion, Integration of R Markdown Report and Submission to RPubs |
| CHEN YANKE | 24067967 | Regression Modeling |
| HAN BING | 24085575 | Classification Modeling |
(Week 10) Background & Structure
(Week 10) Data Cleaning
(Week 10) EDA
(Week 11) Regression Modeling
(Week 11) Classification Modeling
(Week 12) Conclusion
(Week 12) Integration of R Markdown Report and Submission to RPubs
(Week 13) Recording