Global Air Quality Analysis and Prediction (1999–2025)

WQD7004 – Programming for Data Science · Group Project

Author: Group 4
University of Malaya · Faculty of Computer Science
Date: 2025-12-23

1 1.0 Introduction

Air pollution remains a major global public health concern, yet many air quality studies focus on short time periods or individual cities, limiting their ability to capture long-term and cross-regional pollution dynamics. This project conducts a spatio-temporal analysis of monthly PM2.5 and NO₂ data from 20 major global cities covering the period 1999–2025 to examine long-term trends, seasonal patterns, and geographic variation in urban air quality.

In addition to exploratory analysis, the study evaluates regression and classification models to assess their effectiveness in predicting PM2.5 concentrations and categorising air quality levels. By comparing traditional linear models with machine learning approaches, the project highlights both the predictive potential and practical limitations of data-driven air quality modelling in large-scale datasets with imbalanced class distributions.

2 2.0 Research Questions and Objectives

Research Questions

What long-term trends, seasonal patterns, and city-level differences are observed in PM2.5 and NO₂ concentrations across global urban areas?
How accurately can regression models predict PM2.5 concentrations using pollution and geographic variables?
How well can classification models categorise air quality levels into Good, Moderate, and Poor classes?

Research Objectives

To examine long-term temporal, seasonal, and spatial patterns of PM2.5 and NO₂ concentrations across major global cities from 1999 to 2025.
To predict PM2.5 concentrations using regression models based on pollution, geographic, and seasonal factors.
To classify air quality levels using supervised classification models.

3 3.0 Dataset Description

Title and Purpose
The dataset used in this project is a global air quality dataset compiled from the U.S. Environmental Protection Agency (EPA) and the World Health Organization (WHO) databases. It is designed to monitor and analyse urban air pollution, with a focus on PM2.5 and NO₂ concentrations across major global cities.

Time Coverage
- PM2.5: 1999–2025
- NO₂: 1970–2025

This extended time span enables the analysis of long-term pollution trends, seasonal patterns, and temporal variability in air quality.

Dimension and Structure
The dataset contains 6,480 observations representing monthly air quality records from 20 major global cities. It is organised in a structured tabular format, where each row corresponds to a city–month–year observation and each column represents a specific pollution, geographic, temporal, or metadata variable.

Dataset Content
- Location identifiers: city, country
- Geographic coordinates: latitude, longitude
- Time indicators: year, month
- Pollution measures: pm25_ugm3 (PM2.5), no2_ugm3 (NO₂), measured in µg/m³
- Air quality label: data_quality (Good / Moderate / Poor)
- Metadata: measurement_method, data_source (EPA / WHO)

4 4.0 Setup and Required Packages

The analysis was conducted in R using the following packages:

tidyverse: Data cleaning, transformation, and visualisation.
lubridate: Date construction and temporal feature engineering.
skimr: Structured data inspection and summary reporting.
caret: Train/test splitting and model evaluation metrics.
randomForest: Random forest regression and classification.
rpart and rpart.plot: Decision tree modelling and visualisation.
corrplot: Correlation matrix visualisation.
maps and scales: Base map rendering and scale formatting for spatial visualisation.

5 5.0 Load and Inspect Data

df <- read.csv("air_quality_global.csv", stringsAsFactors = FALSE)

cat("Rows:", nrow(df), " Columns:", ncol(df), "\n")

## Rows: 6480  Columns: 11

glimpse(df)

## Rows: 6,480
## Columns: 11
## $ city               <chr> "New York", "New York", "New York", "New York", "Ne…
## $ country            <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "U…
## $ latitude           <dbl> 40.7128, 40.7128, 40.7128, 40.7128, 40.7128, 40.712…
## $ longitude          <dbl> -74.006, -74.006, -74.006, -74.006, -74.006, -74.00…
## $ year               <int> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 199…
## $ month              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, …
## $ pm25_ugm3          <dbl> 18.11, 27.79, 12.05, 35.25, 38.39, 14.89, 19.66, 10…
## $ no2_ugm3           <dbl> 35.98, 17.71, 40.99, 17.18, 25.07, 28.95, 27.85, 26…
## $ data_quality       <chr> "Moderate", "Good", "Moderate", "Poor", "Good", "Go…
## $ measurement_method <chr> "Reference/Equivalent Method", "Reference/Equivalen…
## $ data_source        <chr> "EPA_AQS", "EPA_AQS", "EPA_AQS", "EPA_AQS", "EPA_AQ…

skim(df)

Data summary
Name	df
Number of rows	6480
Number of columns	11
_______________________
Column type frequency:
character	5
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
city	1	5	12	20
country	1	2	7	10
data_quality	1	4	8	3
measurement_method	1	24	27	2
data_source	1	7	12	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
latitude	1	31.54	16.60	-23.55	29.24	33.75	40.14	52.52	▁▁▂▇▇
longitude	1	-35.88	81.51	-121.89	-98.65	-74.59	5.89	139.65	▇▁▃▂▂
year	1	2012.00	7.79	1999.00	2005.00	2012.00	2019.00	2025.00	▇▇▇▇▇
month	1	6.50	3.45	1.00	3.75	6.50	9.25	12.00	▇▅▅▅▇
pm25_ugm3	1	40.97	36.30	5.10	19.34	29.23	46.08	274.18	▇▁▁▁▁
no2_ugm3	1	39.62	16.71	10.25	27.08	36.84	48.92	110.27	▆▇▃▁▁

Interpretation
The dataset contains 6,480 observations and multiple numeric and categorical variables. Preliminary inspection confirms that key pollution and location variables are present and appropriately formatted, providing a suitable basis for subsequent cleaning, exploratory analysis, and modelling.

6 6.0 Data Cleaning and Feature Engineering

clean_df <- df %>%
  mutate(
    date = lubridate::make_date(year, month, 1),
    city = factor(city),
    country = factor(country),
    measurement_method = factor(measurement_method),
    data_source = factor(data_source),
    season = case_when(
      month %in% c(12,1,2) ~ "Winter",
      month %in% c(3,4,5) ~ "Spring",
      month %in% c(6,7,8) ~ "Summer",
      TRUE ~ "Autumn"
    ) %>% factor(levels = c("Winter","Spring","Summer","Autumn")),
    data_quality = factor(data_quality, levels = c("Good","Moderate","Poor"), ordered = TRUE)
  ) %>%
  group_by(city, month) %>%
  mutate(
    pm25_ugm3 = ifelse(is.na(pm25_ugm3), mean(pm25_ugm3, na.rm = TRUE), pm25_ugm3),
    no2_ugm3  = ifelse(is.na(no2_ugm3),  mean(no2_ugm3,  na.rm = TRUE), no2_ugm3)
  ) %>%
  ungroup()

# Quick checks
cat("Remaining NA in PM2.5:", sum(is.na(clean_df$pm25_ugm3)), "\n")

## Remaining NA in PM2.5: 0

cat("Remaining NA in NO2:", sum(is.na(clean_df$no2_ugm3)), "\n")

## Remaining NA in NO2: 0

table(clean_df$data_quality, useNA = "ifany")

## 
##     Good Moderate     Poor 
##     4917     1169      394

Main Cleaning Procedures
- Constructed a date variable from year and month for time-series analysis.
- Converted categorical variables into factors.
- Created a seasonal variable (Winter, Spring, Summer, Autumn).
- Defined data_quality as an ordered factor (Good < Moderate < Poor).
- Imputed missing PM2.5 and NO₂ values using city–month group means.
- Performed post-cleaning checks to confirm no remaining missing values.

7 7.0 Exploratory Data Analysis

7.1 7.1 Global & Yearly Trends

clean_df %>%
  group_by(date) %>%
  summarize(pm25_mean = mean(pm25_ugm3, na.rm = TRUE)) %>%
  ggplot(aes(date, pm25_mean)) +
  geom_line(size = 1.1, color = "#1A73E8") +
  theme_minimal() +
  labs(
    title = "Global Monthly Mean PM2.5 (1999–2025)",
    x = "Year",
    y = "Mean PM2.5 (µg/m³)"
  )

Interpretation
The global monthly mean PM2.5 series exhibits pronounced seasonal fluctuations with recurring peaks and troughs over time. The repeated cyclical pattern highlights the importance of accounting for seasonality in subsequent modelling. A gradual decline is also observed in later years, suggesting long-term improvement in average PM2.5 levels.

7.2 7.2 City-level Distribution

ggplot(clean_df, aes(reorder(city, pm25_ugm3, FUN = median), pm25_ugm3, fill = city)) +
  geom_boxplot(outlier.size = 0.8, alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "PM2.5 Distribution by City (ordered by median)",
    x = "",
    y = "PM2.5 (µg/m³)"
  )

Interpretation
PM2.5 concentrations vary substantially across cities, indicating strong spatial heterogeneity in urban air pollution. Cities such as Delhi, Mumbai, and Beijing exhibit higher median PM2.5 levels and wider variability, while cities such as San Diego, New York, and Phoenix show consistently lower concentrations. The presence of outliers suggests episodic high-pollution events and differing pollution dynamics across urban areas.

7.3 7.3 PM2.5 vs NO₂ Relationship

ggplot(clean_df, aes(no2_ugm3, pm25_ugm3)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  theme_minimal() +
  labs(
    title = "PM2.5 vs NO₂ (monthly observations)",
    x = "NO₂ (µg/m³)",
    y = "PM2.5 (µg/m³)"
  )

Interpretation
The scatter plot shows a positive association between NO₂ and PM2.5 concentrations across monthly observations. Higher NO₂ levels are generally associated with higher PM2.5 values, as indicated by the upward-sloping fitted regression line. The wide dispersion suggests that, while NO₂ is informative, additional factors also influence PM2.5 concentrations.

7.4 7.4 Seasonality

ggplot(clean_df, aes(season, pm25_ugm3, fill = season)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Seasonal Variation of PM2.5",
    x = "Season",
    y = "PM2.5 (µg/m³)"
  )

Interpretation
PM2.5 concentrations exhibit clear seasonal variation. Higher median PM2.5 levels are typically observed during Winter and Spring, while Autumn often shows lower concentrations. Seasonal factors therefore play an important role in PM2.5 variability and should be incorporated into predictive models.

7.5 7.5 Correlation Matrix

cor_mat <- clean_df %>%
  select(pm25_ugm3, no2_ugm3, latitude, longitude) %>%
  cor(use = "pairwise.complete.obs")

corrplot(cor_mat, method = "color", addCoef.col = "black", tl.col = "black")

Interpretation
PM2.5 exhibits a positive correlation with NO₂, supporting the view that pollutant interactions are informative for explaining PM2.5 variation. Correlations between pollution measures and geographic coordinates are weak, suggesting that latitude and longitude alone do not strongly explain pollution variability.

7.6 7.6 Spatial (Map) — City mean PM2.5 on world map

city_mean <- clean_df %>%
  group_by(city, country, latitude, longitude) %>%
  summarize(pm25_avg = mean(pm25_ugm3, na.rm = TRUE), .groups = "drop")

map_df <- map_data("world")

ggplot() +
  geom_polygon(
    data = map_df,
    aes(x = long, y = lat, group = group),
    fill = "#f0f0f0",
    color = "#d9d9d9"
  ) +
  geom_point(
    data = city_mean,
    aes(x = longitude, y = latitude, color = pm25_avg, size = pm25_avg),
    alpha = 0.9
  ) +
  scale_color_viridis_c(option = "plasma", name = "PM2.5 avg") +
  scale_size(range = c(2, 8), guide = "none") +
  coord_quickmap() +
  theme_minimal() +
  labs(title = "City Average PM2.5 (1999–2025)", x = "", y = "")

Interpretation
The spatial map highlights substantial geographic variation in average PM2.5 concentrations. Cities in parts of South and East Asia exhibit higher PM2.5 levels, while cities in Europe and North America generally show lower concentrations. This reinforces the importance of geographic context in air quality analysis.

8 8.0 Modeling (Train/Test Split)

clean_df <- clean_df %>% filter(!is.na(data_quality))
set.seed(123)

train_idx <- createDataPartition(clean_df$data_quality, p = 0.8, list = FALSE)
train <- clean_df[train_idx, ] %>% droplevels()
test  <- clean_df[-train_idx, ] %>% droplevels()

cat("Train / Test sizes:", nrow(train), "/", nrow(test), "\n")

## Train / Test sizes: 5186 / 1294

table(train$data_quality)

## 
##     Good Moderate     Poor 
##     3934      936      316

table(test$data_quality)

## 
##     Good Moderate     Poor 
##      983      233       78

Procedures
- Records with missing data_quality labels were removed.
- An 80/20 stratified train–test split was applied to preserve class proportions.
- The split yields training and testing sets used consistently across regression and classification experiments.

8.1 8.1 Regression: Linear Model (baseline)

lm_fit <- lm(pm25_ugm3 ~ no2_ugm3 + latitude + season + city, data = train)
summary(lm_fit)

## 
## Call:
## lm(formula = pm25_ugm3 ~ no2_ugm3 + latitude + season + city, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.960  -9.206   0.406   9.023 129.196 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -329.27469   15.84574 -20.780  < 2e-16 ***
## no2_ugm3            0.10015    0.02287   4.378 1.22e-05 ***
## latitude            9.96572    0.42539  23.427  < 2e-16 ***
## seasonSpring       10.55571    0.79967  13.200  < 2e-16 ***
## seasonSummer       -0.24278    0.81926  -0.296    0.767    
## seasonAutumn      -11.63281    0.76453 -15.216  < 2e-16 ***
## cityBerlin       -172.25148    6.52073 -26.416  < 2e-16 ***
## cityChicago       -66.51658    2.36116 -28.171  < 2e-16 ***
## cityDallas         23.52521    2.53712   9.272  < 2e-16 ***
## cityDelhi         170.36328    4.27334  39.866  < 2e-16 ***
## cityHouston        57.23679    3.67380  15.580  < 2e-16 ***
## cityLagos         326.65356   13.36556  24.440  < 2e-16 ***
## cityLondon       -158.95266    6.03690 -26.330  < 2e-16 ***
## cityLos Angeles    17.18419    2.14176   8.023 1.26e-15 ***
## cityMexico City   183.26172    8.01249  22.872  < 2e-16 ***
## cityMumbai        232.99353    8.12917  28.661  < 2e-16 ***
## cityNew York      -57.43163    2.05359 -27.966  < 2e-16 ***
## cityParis        -133.87052    4.97687 -26.899  < 2e-16 ***
## cityPhiladelphia  -48.81827    1.83019 -26.674  < 2e-16 ***
## cityPhoenix        14.39198    2.30133   6.254 4.33e-10 ***
## citySan Antonio    57.98366    3.77160  15.374  < 2e-16 ***
## citySan Diego      20.43272    2.53275   8.067 8.86e-16 ***
## citySan Jose      -22.50785    1.50922 -14.914  < 2e-16 ***
## citySão Paulo     600.18093   26.14867  22.953  < 2e-16 ***
## cityTokyo                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.47 on 5162 degrees of freedom
## Multiple R-squared:  0.7091, Adjusted R-squared:  0.7078 
## F-statistic:   547 on 23 and 5162 DF,  p-value: < 2.2e-16

pred_lm <- predict(lm_fit, test)
lm_rmse <- RMSE(pred_lm, test$pm25_ugm3)
lm_mae  <- MAE(pred_lm, test$pm25_ugm3)
lm_r2   <- R2(pred_lm, test$pm25_ugm3)

data.frame(Model = "Linear Regression", RMSE = lm_rmse, MAE = lm_mae, R2 = lm_r2)

##               Model     RMSE      MAE        R2
## 1 Linear Regression 20.30573 13.66486 0.7062653

Interpretation
The linear regression model uses NO₂ concentration, geographic location, seasonality, and city effects to predict PM2.5 levels. It provides a baseline for comparison with more flexible models. Any performance limitations motivate the use of non-linear approaches.

8.2 8.2 Regression: Random Forest (improved)

set.seed(123)
rf_reg <- randomForest(
  pm25_ugm3 ~ no2_ugm3 + latitude + season + city,
  data = train,
  ntree = 300,
  importance = TRUE
)

print(rf_reg)

## 
## Call:
##  randomForest(formula = pm25_ugm3 ~ no2_ugm3 + latitude + season +      city, data = train, ntree = 300, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 364.5039
##                     % Var explained: 71.88

varImpPlot(rf_reg, main = "Random Forest: Regression Variable Importance")

pred_rf <- predict(rf_reg, test)
rf_rmse <- RMSE(pred_rf, test$pm25_ugm3)
rf_mae  <- MAE(pred_rf, test$pm25_ugm3)
rf_r2   <- R2(pred_rf, test$pm25_ugm3)

data.frame(
  Model = c("Linear Regression", "Random Forest Regression"),
  RMSE  = c(lm_rmse, rf_rmse),
  MAE   = c(lm_mae, rf_mae),
  R2    = c(lm_r2, rf_r2)
)

##                      Model     RMSE      MAE        R2
## 1        Linear Regression 20.30573 13.66486 0.7062653
## 2 Random Forest Regression 20.02125 13.24618 0.7182331

Interpretation
A Random Forest regression model is applied to capture non-linear relationships between PM2.5 and the explanatory variables. Variable importance typically highlights the role of city and season effects, consistent with spatial heterogeneity and seasonality observed in EDA. Performance improvements over the linear baseline indicate non-linear structure in the data.

8.3 8.3 Classification: Decision Tree (data_quality)

train_cl <- train %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>%
  droplevels()

test_cl <- test %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>%
  droplevels()

tree_fit <- rpart(
  data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city,
  data = train_cl,
  method = "class"
)

rpart.plot(
  tree_fit,
  main = "Decision Tree: Predicting data_quality",
  box.palette = "Greens",
  nn = TRUE,
  shadow.col = "gray"
)

pred_tree <- predict(tree_fit, test_cl, type = "class")
conf_tree <- confusionMatrix(pred_tree, test_cl$data_quality)
conf_tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Moderate Poor
##   Good      983      233   78
##   Moderate    0        0    0
##   Poor        0        0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7597          
##                  95% CI : (0.7354, 0.7827)
##     No Information Rate : 0.7597          
##     P-Value [Acc > NIR] : 0.5152          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Good Class: Moderate Class: Poor
## Sensitivity               1.0000          0.0000     0.00000
## Specificity               0.0000          1.0000     1.00000
## Pos Pred Value            0.7597             NaN         NaN
## Neg Pred Value               NaN          0.8199     0.93972
## Prevalence                0.7597          0.1801     0.06028
## Detection Rate            0.7597          0.0000     0.00000
## Detection Prevalence      1.0000          0.0000     0.00000
## Balanced Accuracy         0.5000          0.5000     0.50000

Interpretation
Decision trees provide interpretable split rules. However, when the target classes are imbalanced, a tree may default to the majority class, producing misleadingly high accuracy but poor discrimination of minority classes. Balanced accuracy and Kappa are therefore considered in evaluation.

8.4 8.4 Classification: Random Forest Classifier (improved)

set.seed(123)

rf_clf <- randomForest(
  data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city,
  data = train_cl,
  ntree = 300,
  importance = TRUE
)

print(rf_clf)

## 
## Call:
##  randomForest(formula = data_quality ~ pm25_ugm3 + no2_ugm3 +      latitude + season + city, data = train_cl, ntree = 300, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 25.18%
## Confusion matrix:
##          Good Moderate Poor class.error
## Good     3876       51    7  0.01474326
## Moderate  930        4    2  0.99572650
## Poor      311        5    0  1.00000000

varImpPlot(rf_clf, main = "Random Forest: Classification Variable Importance")

pred_rf_cl <- predict(rf_clf, test_cl)
conf_rf_cl <- confusionMatrix(pred_rf_cl, test_cl$data_quality)
conf_rf_cl

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Moderate Poor
##   Good      972      229   75
##   Moderate   10        4    3
##   Poor        1        0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7543          
##                  95% CI : (0.7298, 0.7775)
##     No Information Rate : 0.7597          
##     P-Value [Acc > NIR] : 0.6887          
##                                           
##                   Kappa : 0.011           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: Good Class: Moderate Class: Poor
## Sensitivity              0.98881        0.017167   0.0000000
## Specificity              0.02251        0.987747   0.9991776
## Pos Pred Value           0.76176        0.235294   0.0000000
## Neg Pred Value           0.38889        0.820673   0.9396752
## Prevalence               0.75966        0.180062   0.0602782
## Detection Rate           0.75116        0.003091   0.0000000
## Detection Prevalence     0.98609        0.013138   0.0007728
## Balanced Accuracy        0.50566        0.502457   0.4995888

Interpretation
The Random Forest classifier can improve robustness compared with a single tree, but overall performance may still be dominated by the majority class under severe imbalance. The confusion matrix, per-class sensitivity, Kappa, and balanced accuracy provide a more reliable assessment than accuracy alone.

9 9.0 Model Comparison Summary

reg_table <- data.frame(
  Model = c("Linear Regression","Random Forest (reg)"),
  RMSE  = c(lm_rmse, rf_rmse),
  MAE   = c(lm_mae, rf_mae),
  R2    = c(lm_r2, rf_r2)
)

knitr::kable(reg_table, digits = 3, caption = "Regression model comparison (test set)")

Regression model comparison (test set)
Model	RMSE	MAE	R2
Linear Regression	20.306	13.665	0.706
Random Forest (reg)	20.021	13.246	0.718

class_summary <- tibble(
  Model = c("Decision Tree","Random Forest (clf)"),
  Accuracy = c(conf_tree$overall["Accuracy"], conf_rf_cl$overall["Accuracy"]),
  Kappa = c(conf_tree$overall["Kappa"], conf_rf_cl$overall["Kappa"])
)

knitr::kable(class_summary, digits = 3, caption = "Classification model comparison (test set)")

Classification model comparison (test set)
Model	Accuracy	Kappa
Decision Tree	0.760	0.000
Random Forest (clf)	0.754	0.011

Interpretation
Regression results typically show that Random Forest slightly outperforms linear regression, indicating modest gains from modelling non-linearities. Classification results often appear similar in overall accuracy, but this metric can be inflated by the dominant class. Kappa and balanced accuracy help reveal limitations in minority-class detection.

10 10.0 Discussion

PM2.5 & NO₂: Exploratory analysis and correlation results indicate a consistent positive relationship, suggesting shared emission activities such as traffic and combustion-related sources.
Spatial patterns: City-level average PM2.5 concentrations exhibit substantial heterogeneity, indicating location-specific pollution dynamics.
Regression: Random Forest regression marginally outperforms linear regression, suggesting non-linear effects and interactions involving seasonality and city effects.
Classification: Overall classification performance is driven by the dominant Good class. Severe class imbalance limits the ability to detect Moderate and Poor categories reliably.
Feature importance: PM2.5 and NO₂ rank among the most influential predictors across models, aligning with domain knowledge.

11 11.0 Limitations & Future Work

Limitations

Monthly averages may mask daily peaks and episodic pollution events.
Meteorological variables (e.g., wind speed, temperature, humidity) are not included, although they strongly affect pollutant dispersion.
Measurement heterogeneity (e.g., data_source differences) may introduce bias.
Class imbalance limits classification effectiveness for Moderate and Poor categories.

Future Work

Incorporate meteorological and emissions-related variables to improve predictive accuracy and interpretability.
Apply time-series forecasting methods (e.g., Prophet or LSTM) for short-term prediction tasks.
Explore imbalance-handling strategies (e.g., class weighting, SMOTE) and alternative metrics for classification.
Expand spatial coverage using satellite-derived indicators or spatial interpolation methods.

12 12.0 Conclusion

This project presents a comprehensive analysis of global urban air quality using long-term PM2.5 and NO₂ data across major cities. Exploratory analysis identifies pronounced temporal, seasonal, and spatial heterogeneity in air quality patterns. Random Forest regression demonstrates modest improvements over linear models, indicating non-linear relationships in PM2.5 dynamics. Classification results highlight the challenges of categorising air quality under imbalanced class distributions. Overall, the study underscores the value of integrating exploratory analysis with machine learning techniques to enhance understanding and prediction of urban air pollution.

13 Appendix: Code & Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0         maps_3.4.3           randomForest_4.7-1.2
##  [4] rpart.plot_3.1.3     rpart_4.1.24         caret_7.0-1         
##  [7] lattice_0.22-7       corrplot_0.95        skimr_2.2.1         
## [10] lubridate_1.9.4      forcats_1.0.1        stringr_1.5.2       
## [13] dplyr_1.1.4          purrr_1.2.0          readr_2.1.6         
## [16] tidyr_1.3.1          tibble_3.3.0         ggplot2_4.0.1       
## [19] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    timeDate_4051.111   
##  [4] farver_2.1.2         S7_0.2.1             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.37        timechange_0.3.0    
## [10] lifecycle_1.0.4      survival_3.8-3       magrittr_2.0.4      
## [13] compiler_4.5.1       rlang_1.1.6          sass_0.4.10         
## [16] tools_4.5.1          yaml_2.3.10          data.table_1.17.8   
## [19] knitr_1.50           labeling_0.4.3       plyr_1.8.9          
## [22] repr_1.1.7           RColorBrewer_1.1-3   withr_3.0.2         
## [25] nnet_7.3-20          grid_4.5.1           stats4_4.5.1        
## [28] e1071_1.7-16         future_1.68.0        globals_0.18.0      
## [31] iterators_1.0.14     MASS_7.3-65          cli_3.6.5           
## [34] rmarkdown_2.30       generics_0.1.4       rstudioapi_0.17.1   
## [37] future.apply_1.20.0  reshape2_1.4.5       tzdb_0.5.0          
## [40] proxy_0.4-27         cachem_1.1.0         splines_4.5.1       
## [43] parallel_4.5.1       base64enc_0.1-3      vctrs_0.6.5         
## [46] hardhat_1.4.2        Matrix_1.7-3         jsonlite_2.0.0      
## [49] hms_1.1.4            listenv_0.10.0       foreach_1.5.2       
## [52] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.1       
## [55] glue_1.8.0           parallelly_1.45.1    codetools_0.2-20    
## [58] stringi_1.8.7        gtable_0.3.6         pillar_1.11.1       
## [61] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.2          
## [64] R6_2.6.1             evaluate_1.0.5       bslib_0.9.0         
## [67] class_7.3-23         Rcpp_1.1.0           nlme_3.1-168        
## [70] prodlim_2025.04.28   mgcv_1.9-3           xfun_0.53           
## [73] pkgconfig_2.0.3      ModelMetrics_1.2.2.2