WQD7004 – Programming for Data Science · Group Project
Air pollution remains a major global public health concern, yet many air quality studies focus on short time periods or individual cities, limiting their ability to capture long-term and cross-regional pollution dynamics. This project conducts a spatio-temporal analysis of monthly PM2.5 and NO₂ data from 20 major global cities covering the period 1999–2025 to examine long-term trends, seasonal patterns, and geographic variation in urban air quality.
In addition to exploratory analysis, the study evaluates regression and classification models to assess their effectiveness in predicting PM2.5 concentrations and categorising air quality levels. By comparing traditional linear models with machine learning approaches, the project highlights both the predictive potential and practical limitations of data-driven air quality modelling in large-scale datasets with imbalanced class distributions.
Research Questions
Research Objectives
Title and Purpose
The dataset used in this project is a global air quality dataset
compiled from the U.S. Environmental Protection Agency (EPA) and the
World Health Organization (WHO) databases. It is designed to monitor and
analyse urban air pollution, with a focus on PM2.5 and NO₂
concentrations across major global cities.
Time Coverage
- PM2.5: 1999–2025
- NO₂: 1970–2025
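Both ranges are those stated for the source dataset. The file analysed in this project contains records from 1999 to 2025 only (see the year range in the data summary below), which still supports the analysis of long-term pollution trends, seasonal patterns, and temporal variability in air quality.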
Dimension and Structure
The dataset contains 6,480 observations representing monthly air quality
records from 20 major global cities. It is organised in a structured
tabular format, where each row corresponds to a city–month–year
observation and each column represents a specific pollution, geographic,
temporal, or metadata variable.
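The row count is consistent with a complete city–month–year panel. A minimal sketch of that arithmetic, assuming every one of the 20 cities has a record for each month from 1999 to 2025 (`city_id` is a hypothetical placeholder, not a column of the dataset):

```r
# Hypothetical reconstruction of the panel shape: one row per city, year and month.
n_cities <- 20
years    <- 1999:2025
months   <- 1:12

grid <- expand.grid(city_id = seq_len(n_cities), year = years, month = months)

n_expected   <- nrow(grid)            # total rows in a complete panel
rows_by_city <- table(grid$city_id)   # rows contributed by each city

n_expected
unique(as.integer(rows_by_city))
```

If the real data had gaps for some city or month, `nrow(df)` would fall short of `n_expected`.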
Dataset Content
- Location identifiers: city, country
- Geographic coordinates: latitude, longitude
- Time indicators: year, month
- Pollution measures: pm25_ugm3 (PM2.5), no2_ugm3 (NO₂), measured in µg/m³
- Air quality label: data_quality (Good / Moderate / Poor)
- Metadata: measurement_method, data_source (EPA / WHO)
Summary of Raw Data
Preliminary inspection of the raw data indicates substantial variation
in PM2.5 and NO₂ concentrations across cities and time periods. The
dataset includes both numeric and categorical variables, providing a
suitable foundation for subsequent exploratory, regression, and
classification analyses. Detailed temporal trends and seasonal patterns
are examined in later sections through formal exploratory data
analysis.
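The analysis was conducted in R using the following packages:
# Package list reconstructed from the attached packages reported by sessionInfo() at the end of this report
library(tidyverse)    # dplyr, tidyr, ggplot2, lubridate and related packages
library(skimr)        # skim() summaries
library(corrplot)     # correlation matrix plot
library(caret)        # createDataPartition(), confusionMatrix(), RMSE(), MAE(), R2()
library(rpart)        # decision tree
library(rpart.plot)   # decision tree plot
library(randomForest) # random forest models
library(maps)         # world map polygons used by map_data()
library(scales)       # axis and legend scales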
df <- read.csv("air_quality_global.csv", stringsAsFactors = FALSE)
cat("Rows:", nrow(df), " Columns:", ncol(df), "\n")
## Rows: 6480 Columns: 11
glimpse(df)
## Rows: 6,480
## Columns: 11
## $ city <chr> "New York", "New York", "New York", "New York", "Ne…
## $ country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "U…
## $ latitude <dbl> 40.7128, 40.7128, 40.7128, 40.7128, 40.7128, 40.712…
## $ longitude <dbl> -74.006, -74.006, -74.006, -74.006, -74.006, -74.00…
## $ year <int> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 199…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, …
## $ pm25_ugm3 <dbl> 18.11, 27.79, 12.05, 35.25, 38.39, 14.89, 19.66, 10…
## $ no2_ugm3 <dbl> 35.98, 17.71, 40.99, 17.18, 25.07, 28.95, 27.85, 26…
## $ data_quality <chr> "Moderate", "Good", "Moderate", "Poor", "Good", "Go…
## $ measurement_method <chr> "Reference/Equivalent Method", "Reference/Equivalen…
## $ data_source <chr> "EPA_AQS", "EPA_AQS", "EPA_AQS", "EPA_AQS", "EPA_AQ…
skim(df)
| Name | df |
|---|---|
| Number of rows | 6480 |
| Number of columns | 11 |
| Column type frequency: | |
| character | 5 |
| numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 0 | 1 | 5 | 12 | 0 | 20 | 0 |
| country | 0 | 1 | 2 | 7 | 0 | 10 | 0 |
| data_quality | 0 | 1 | 4 | 8 | 0 | 3 | 0 |
| measurement_method | 0 | 1 | 24 | 27 | 0 | 2 | 0 |
| data_source | 0 | 1 | 7 | 12 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| latitude | 0 | 1 | 31.54 | 16.60 | -23.55 | 29.24 | 33.75 | 40.14 | 52.52 | ▁▁▂▇▇ |
| longitude | 0 | 1 | -35.88 | 81.51 | -121.89 | -98.65 | -74.59 | 5.89 | 139.65 | ▇▁▃▂▂ |
| year | 0 | 1 | 2012.00 | 7.79 | 1999.00 | 2005.00 | 2012.00 | 2019.00 | 2025.00 | ▇▇▇▇▇ |
| month | 0 | 1 | 6.50 | 3.45 | 1.00 | 3.75 | 6.50 | 9.25 | 12.00 | ▇▅▅▅▇ |
| pm25_ugm3 | 0 | 1 | 40.97 | 36.30 | 5.10 | 19.34 | 29.23 | 46.08 | 274.18 | ▇▁▁▁▁ |
| no2_ugm3 | 0 | 1 | 39.62 | 16.71 | 10.25 | 27.08 | 36.84 | 48.92 | 110.27 | ▆▇▃▁▁ |
Interpretation
The dataset contains 6,480 observations and multiple numeric and
categorical variables. Preliminary inspection confirms that key
pollution and location variables are present and appropriately
formatted, providing a suitable basis for subsequent cleaning,
exploratory analysis, and modelling.
clean_df <- df %>%
  mutate(
    # First day of each month, used as the time index for plots
    date = lubridate::make_date(year, month, 1),
    city = factor(city),
    country = factor(country),
    measurement_method = factor(measurement_method),
    data_source = factor(data_source),
    # Meteorological seasons by Northern Hemisphere months, applied to every city
    season = case_when(
      month %in% c(12, 1, 2) ~ "Winter",
      month %in% c(3, 4, 5) ~ "Spring",
      month %in% c(6, 7, 8) ~ "Summer",
      TRUE ~ "Autumn"
    ) %>% factor(levels = c("Winter", "Spring", "Summer", "Autumn")),
    data_quality = factor(data_quality, levels = c("Good", "Moderate", "Poor"), ordered = TRUE)
  ) %>%
  # Replace any missing pollutant value with the mean for the same city and calendar month
  group_by(city, month) %>%
  mutate(
    pm25_ugm3 = ifelse(is.na(pm25_ugm3), mean(pm25_ugm3, na.rm = TRUE), pm25_ugm3),
    no2_ugm3 = ifelse(is.na(no2_ugm3), mean(no2_ugm3, na.rm = TRUE), no2_ugm3)
  ) %>%
  ungroup()
# Quick checks
cat("Remaining NA in PM2.5:", sum(is.na(clean_df$pm25_ugm3)), "\n")
## Remaining NA in PM2.5: 0
cat("Remaining NA in NO2:", sum(is.na(clean_df$no2_ugm3)), "\n")
## Remaining NA in NO2: 0
table(clean_df$data_quality, useNA = "ifany")
##
## Good Moderate Poor
## 4917 1169 394
Main Cleaning Procedures
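- Constructed a date variable from year and month for time-series analysis.
- Converted categorical variables into factors.
- Created a seasonal variable (Winter, Spring, Summer, Autumn) from the calendar month, using Northern Hemisphere month groupings for all cities, including São Paulo in the Southern Hemisphere.
- Defined data_quality as an ordered factor (Good < Moderate < Poor).
- Imputed missing PM2.5 and NO₂ values using city–month group means. The raw data summary reports no missing values, so this step leaves the data unchanged and acts as a safeguard.
- Performed post-cleaning checks to confirm no remaining missing values.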
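The group-mean imputation step can be illustrated on a small example. The sketch below uses base R's `ave()` rather than the dplyr pipeline above, and the `toy` data frame is invented purely for illustration:

```r
# Toy data (hypothetical values): two cities, one calendar month, two gaps.
toy <- data.frame(
  city  = c("A", "A", "A", "B", "B", "B"),
  month = c(1, 1, 1, 1, 1, 1),
  pm25  = c(10, NA, 20, 40, 50, NA)
)

# Mean of the observed values within each city-month group,
# repeated for every row of that group.
group_mean <- ave(toy$pm25, toy$city, toy$month,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; fill gaps with the group mean.
toy$pm25_imputed <- ifelse(is.na(toy$pm25), group_mean, toy$pm25)
toy
```

Each gap is filled only from observations of the same city and calendar month, so a city's typical seasonal level is preserved.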
clean_df %>%
  group_by(date) %>%
  summarize(pm25_mean = mean(pm25_ugm3, na.rm = TRUE)) %>%
  ggplot(aes(date, pm25_mean)) +
  # linewidth replaces size for line geoms (size is deprecated for lines since ggplot2 3.4.0)
  geom_line(linewidth = 1.1, color = "#1A73E8") +
  theme_minimal() +
  labs(
    title = "Global Monthly Mean PM2.5 (1999–2025)",
    x = "Year",
    y = "Mean PM2.5 (µg/m³)"
  )
Interpretation
The global monthly mean PM2.5 series exhibits pronounced seasonal
fluctuations with recurring peaks and troughs over time. The repeated
cyclical pattern highlights the importance of accounting for seasonality
in subsequent modelling. A gradual decline is also observed in later
years, suggesting long-term improvement in average PM2.5 levels.
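One way to separate the seasonal cycle from the long-term trend described above is a seasonal-trend decomposition. The sketch below applies `stats::stl()` to a simulated monthly series. The series, its slope and its amplitude are assumptions made for illustration and are not derived from the project data:

```r
set.seed(1)

# Simulated monthly series, 1999-2025 (hypothetical values):
# a slow decline, a 12-month cycle and random noise.
n_months <- 27 * 12
t_idx    <- seq_len(n_months)
sim      <- 45 - 0.02 * t_idx + 8 * sin(2 * pi * t_idx / 12) + rnorm(n_months, sd = 2)
sim_ts   <- ts(sim, start = c(1999, 1), frequency = 12)

# Loess-based decomposition into seasonal, trend and remainder components.
fit        <- stl(sim_ts, s.window = "periodic")
components <- fit$time.series

# Change in the estimated trend between the first and last month.
trend_change <- components[n_months, "trend"] - components[1, "trend"]
trend_change
```

Applied to the monthly mean series plotted above, the trend component would show whether the decline in later years persists once the seasonal cycle is removed.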
ggplot(clean_df, aes(reorder(city, pm25_ugm3, FUN = median), pm25_ugm3, fill = city)) +
  geom_boxplot(outlier.size = 0.8, alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "PM2.5 Distribution by City (ordered by median)",
    x = "",
    y = "PM2.5 (µg/m³)"
  )
Interpretation
PM2.5 concentrations vary substantially across cities, indicating strong
spatial heterogeneity in urban air pollution. Cities such as Delhi,
Mumbai, and Beijing exhibit higher median PM2.5 levels and wider
variability, while cities such as San Diego, New York, and Phoenix show
consistently lower concentrations. The presence of outliers suggests
episodic high-pollution events and differing pollution dynamics across
urban areas.
ggplot(clean_df, aes(no2_ugm3, pm25_ugm3)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  theme_minimal() +
  labs(
    title = "PM2.5 vs NO₂ (monthly observations)",
    x = "NO₂ (µg/m³)",
    y = "PM2.5 (µg/m³)"
  )
Interpretation
The scatter plot shows a positive association between NO₂ and PM2.5
concentrations across monthly observations. Higher NO₂ levels are
generally associated with higher PM2.5 values, as indicated by the
upward-sloping fitted regression line. The wide dispersion suggests
that, while NO₂ is informative, additional factors also influence PM2.5
concentrations.
ggplot(clean_df, aes(season, pm25_ugm3, fill = season)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Seasonal Variation of PM2.5",
    x = "Season",
    y = "PM2.5 (µg/m³)"
  )
Interpretation
PM2.5 concentrations exhibit clear seasonal variation. Higher median
PM2.5 levels are typically observed during Winter and Spring, while
Autumn often shows lower concentrations. Seasonal factors therefore play
an important role in PM2.5 variability and should be incorporated into
predictive models.
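The season variable used here groups months by the Northern Hemisphere calendar for every city. A possible refinement, sketched below, shifts the months by six for cities south of the equator. The helper `season_for()` is hypothetical and is not part of the analysis above:

```r
# Hypothetical helper: meteorological season from month and latitude.
season_for <- function(month, latitude) {
  # For southern latitudes, shift the month by six so that
  # January maps to Summer and July maps to Winter.
  shifted <- ifelse(latitude < 0, (month + 5) %% 12 + 1, month)
  out <- ifelse(shifted %in% c(12, 1, 2), "Winter",
         ifelse(shifted %in% 3:5, "Spring",
         ifelse(shifted %in% 6:8, "Summer", "Autumn")))
  factor(out, levels = c("Winter", "Spring", "Summer", "Autumn"))
}

# January in New York (40.7128 N) and at a southern latitude such as São Paulo's.
season_for(c(1, 1), c(40.7128, -23.55))
```

Only one of the 20 cities has a negative latitude in the summary statistics (minimum latitude of -23.55), so the effect on the pooled seasonal boxplot is likely small.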
cor_mat <- clean_df %>%
  select(pm25_ugm3, no2_ugm3, latitude, longitude) %>%
  cor(use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", addCoef.col = "black", tl.col = "black")
Interpretation
PM2.5 exhibits a positive correlation with NO₂, supporting the view that
pollutant interactions are informative for explaining PM2.5 variation.
Correlations between pollution measures and geographic coordinates are
weak, suggesting that latitude and longitude alone do not strongly
explain pollution variability.
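The matrix above uses Pearson correlation, which measures linear association. Because PM2.5 is strongly right-skewed (mean 40.97, median 29.23, maximum 274.18 in the summary table), a rank-based coefficient can be a useful cross-check. The sketch below compares the two on simulated data. The variables and their relationship are assumptions for illustration only:

```r
set.seed(42)

# Simulated pollutant values (hypothetical): PM2.5 is right-skewed by construction.
no2  <- runif(500, 10, 110)
pm25 <- exp(2.5 + 0.015 * no2 + rnorm(500, sd = 0.6))

pearson      <- cor(no2, pm25, method = "pearson")
spearman     <- cor(no2, pm25, method = "spearman")

# Spearman depends only on ranks, so a monotone transform such as log()
# leaves it unchanged, whereas Pearson would change.
spearman_log <- cor(no2, log(pm25), method = "spearman")

c(pearson = pearson, spearman = spearman, spearman_log = spearman_log)
```

On the project data, the same comparison could be obtained by passing `method = "spearman"` to `cor()` when building `cor_mat`.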
city_mean <- clean_df %>%
  group_by(city, country, latitude, longitude) %>%
  summarize(pm25_avg = mean(pm25_ugm3, na.rm = TRUE), .groups = "drop")
map_df <- map_data("world")
ggplot() +
  geom_polygon(
    data = map_df,
    aes(x = long, y = lat, group = group),
    fill = "#f0f0f0",
    color = "#d9d9d9"
  ) +
  geom_point(
    data = city_mean,
    aes(x = longitude, y = latitude, color = pm25_avg, size = pm25_avg),
    alpha = 0.9
  ) +
  scale_color_viridis_c(option = "plasma", name = "PM2.5 avg") +
  scale_size(range = c(2, 8), guide = "none") +
  coord_quickmap() +
  theme_minimal() +
  labs(title = "City Average PM2.5 (1999–2025)", x = "", y = "")
Interpretation
The spatial map highlights substantial geographic variation in average
PM2.5 concentrations. Cities in parts of South and East Asia exhibit
higher PM2.5 levels, while cities in Europe and North America generally
show lower concentrations. This reinforces the importance of geographic
context in air quality analysis.
clean_df <- clean_df %>% filter(!is.na(data_quality))
set.seed(123)
train_idx <- createDataPartition(clean_df$data_quality, p = 0.8, list = FALSE)
train <- clean_df[train_idx, ] %>% droplevels()
test <- clean_df[-train_idx, ] %>% droplevels()
cat("Train / Test sizes:", nrow(train), "/", nrow(test), "\n")
## Train / Test sizes: 5186 / 1294
table(train$data_quality)
##
## Good Moderate Poor
## 3934 936 316
table(test$data_quality)
##
## Good Moderate Poor
## 983 233 78
Procedures
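- Rows with a missing data_quality label were filtered out. The class counts after cleaning sum to 6,480, so no rows were actually removed.
- An 80/20 train–test split, stratified on data_quality, was applied to preserve class proportions.
- The resulting training set (5,186 rows) and test set (1,294 rows) are used consistently across the regression and classification experiments.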
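The idea behind a stratified split can be sketched in base R by sampling 80% of the row indices within each class. This illustrates the principle only and does not reproduce the internals of `caret::createDataPartition()`. The label vector is rebuilt from the class counts reported after cleaning:

```r
set.seed(123)

# Label vector rebuilt from the class counts reported after cleaning.
labels <- factor(
  rep(c("Good", "Moderate", "Poor"), times = c(4917, 1169, 394)),
  levels = c("Good", "Moderate", "Poor")
)

# Sample 80% of the row indices within each class (rounded up).
train_idx <- unname(unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) sample(idx, size = ceiling(0.8 * length(idx)))
)))

train_counts <- table(labels[train_idx])
test_counts  <- table(labels[-train_idx])

train_counts
round(prop.table(train_counts), 3)
round(prop.table(test_counts), 3)
```

Because the sampling is done class by class, the share of Good, Moderate and Poor labels is nearly identical in both subsets, which a simple random split does not guarantee for the smallest class.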
lm_fit <- lm(pm25_ugm3 ~ no2_ugm3 + latitude + season + city, data = train)
summary(lm_fit)
##
## Call:
## lm(formula = pm25_ugm3 ~ no2_ugm3 + latitude + season + city,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.960 -9.206 0.406 9.023 129.196
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -329.27469 15.84574 -20.780 < 2e-16 ***
## no2_ugm3 0.10015 0.02287 4.378 1.22e-05 ***
## latitude 9.96572 0.42539 23.427 < 2e-16 ***
## seasonSpring 10.55571 0.79967 13.200 < 2e-16 ***
## seasonSummer -0.24278 0.81926 -0.296 0.767
## seasonAutumn -11.63281 0.76453 -15.216 < 2e-16 ***
## cityBerlin -172.25148 6.52073 -26.416 < 2e-16 ***
## cityChicago -66.51658 2.36116 -28.171 < 2e-16 ***
## cityDallas 23.52521 2.53712 9.272 < 2e-16 ***
## cityDelhi 170.36328 4.27334 39.866 < 2e-16 ***
## cityHouston 57.23679 3.67380 15.580 < 2e-16 ***
## cityLagos 326.65356 13.36556 24.440 < 2e-16 ***
## cityLondon -158.95266 6.03690 -26.330 < 2e-16 ***
## cityLos Angeles 17.18419 2.14176 8.023 1.26e-15 ***
## cityMexico City 183.26172 8.01249 22.872 < 2e-16 ***
## cityMumbai 232.99353 8.12917 28.661 < 2e-16 ***
## cityNew York -57.43163 2.05359 -27.966 < 2e-16 ***
## cityParis -133.87052 4.97687 -26.899 < 2e-16 ***
## cityPhiladelphia -48.81827 1.83019 -26.674 < 2e-16 ***
## cityPhoenix 14.39198 2.30133 6.254 4.33e-10 ***
## citySan Antonio 57.98366 3.77160 15.374 < 2e-16 ***
## citySan Diego 20.43272 2.53275 8.067 8.86e-16 ***
## citySan Jose -22.50785 1.50922 -14.914 < 2e-16 ***
## citySão Paulo 600.18093 26.14867 22.953 < 2e-16 ***
## cityTokyo NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.47 on 5162 degrees of freedom
## Multiple R-squared: 0.7091, Adjusted R-squared: 0.7078
## F-statistic: 547 on 23 and 5162 DF, p-value: < 2.2e-16
pred_lm <- predict(lm_fit, test)
lm_rmse <- RMSE(pred_lm, test$pm25_ugm3)
lm_mae <- MAE(pred_lm, test$pm25_ugm3)
lm_r2 <- R2(pred_lm, test$pm25_ugm3)
data.frame(Model = "Linear Regression", RMSE = lm_rmse, MAE = lm_mae, R2 = lm_r2)
## Model RMSE MAE R2
## 1 Linear Regression 20.30573 13.66486 0.7062653
Interpretation
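The linear regression model predicts PM2.5 from NO₂ concentration, latitude, season, and city.
NO₂, the Spring and Autumn indicators, and the city indicators are statistically significant. The Summer coefficient is not (p = 0.767), meaning that Summer does not differ detectably from the Winter reference level. Because each city has a single latitude, latitude is perfectly collinear with the city indicators: one coefficient (cityTokyo) is dropped as aliased, and the latitude and individual city coefficients cannot be interpreted separately.
The model explains approximately 71% of the variance in PM2.5 on the training data (R² ≈ 0.71) and performs similarly on the test set (R² ≈ 0.71, RMSE ≈ 20.3, MAE ≈ 13.7), serving as a baseline for comparison with more flexible models.
The remaining unexplained variance (about 29%) may reflect non-linear relationships or predictors that are not in the model, motivating the use of more flexible models.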
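The "1 not defined because of singularities" note in the model summary arises because latitude takes a single value per city. The sketch below reproduces the effect on a small simulated data set. The `toy` data and its coefficients are hypothetical:

```r
set.seed(7)

# Toy data (hypothetical): three cities, each with one fixed latitude.
toy <- data.frame(
  city     = factor(rep(c("A", "B", "C"), each = 30)),
  latitude = rep(c(10, 35, 50), each = 30),
  no2      = runif(90, 10, 60)
)
city_effect <- c(A = 0, B = 15, C = -5)
toy$pm25 <- 20 + 0.3 * toy$no2 +
  unname(city_effect[as.character(toy$city)]) + rnorm(90, sd = 3)

# latitude is a linear combination of the intercept and the city dummies,
# so lm() cannot estimate every coefficient and reports NA for one of them.
fit_both <- lm(pm25 ~ no2 + latitude + city, data = toy)
coef(fit_both)

# Dropping latitude leaves the fitted values unchanged.
fit_city <- lm(pm25 ~ no2 + city, data = toy)
all.equal(fitted(fit_both), fitted(fit_city))
```

Since the fitted values are the same with or without latitude, the variable adds no predictive information to a model that already contains city, and only the combined city-plus-latitude effect is identifiable.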
set.seed(123)
rf_reg <- randomForest(
  pm25_ugm3 ~ no2_ugm3 + latitude + season + city,
  data = train,
  ntree = 300,
  importance = TRUE
)
print(rf_reg)
##
## Call:
## randomForest(formula = pm25_ugm3 ~ no2_ugm3 + latitude + season + city, data = train, ntree = 300, importance = TRUE)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 364.5039
## % Var explained: 71.88
varImpPlot(rf_reg, main = "Random Forest: Regression Variable Importance")
pred_rf <- predict(rf_reg, test)
rf_rmse <- RMSE(pred_rf, test$pm25_ugm3)
rf_mae <- MAE(pred_rf, test$pm25_ugm3)
rf_r2 <- R2(pred_rf, test$pm25_ugm3)
data.frame(
  Model = c("Linear Regression", "Random Forest Regression"),
  RMSE = c(lm_rmse, rf_rmse),
  MAE = c(lm_mae, rf_mae),
  R2 = c(lm_r2, rf_r2)
)
## Model RMSE MAE R2
## 1 Linear Regression 20.30573 13.66486 0.7062653
## 2 Random Forest Regression 20.02125 13.24618 0.7182331
Interpretation
A Random Forest regression model was applied to capture non-linear relationships between PM2.5 and the explanatory variables.
The model explains approximately 72% of the variance in PM2.5 (71.9% out-of-bag, R² ≈ 0.72 on the test set), a slight improvement on the linear baseline.
Variable importance results indicate that city and season are the most influential predictors, followed by NO₂ and latitude.
Test-set evaluation shows slightly lower prediction errors (RMSE ≈ 20.0, MAE ≈ 13.2) than linear regression (RMSE ≈ 20.3, MAE ≈ 13.7). The improvement is small and is based on a single train–test split, so it should be read as a modest gain rather than a decisive one.
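For reference, the three test-set metrics can be written out in base R. The helpers below are illustrative re-implementations, not the caret source code. `r2_corr()` mirrors the default `formula = "corr"` behaviour of `caret::R2()`, which reports the squared correlation between predictions and observations, while `r2_trad()` is the 1 - SSE/SST form. The `obs` and `pred` vectors are made-up values:

```r
# Illustrative metric helpers (hypothetical names).
rmse    <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae     <- function(pred, obs) mean(abs(pred - obs))
r2_corr <- function(pred, obs) cor(pred, obs)^2
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up observed and predicted values.
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 37)

c(RMSE = rmse(pred, obs), MAE = mae(pred, obs),
  R2_corr = r2_corr(pred, obs), R2_trad = r2_trad(pred, obs))
```

RMSE penalises large errors more heavily than MAE, which matters for PM2.5 because of its long right tail. The two R² definitions can differ when predictions are biased, so it is worth stating which one a table reports.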
train_cl <- train %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>%
  droplevels()
test_cl <- test %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>%
  droplevels()
tree_fit <- rpart(
  data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city,
  data = train_cl,
  method = "class"
)
rpart.plot(
  tree_fit,
  main = "Decision Tree: Predicting data_quality",
  box.palette = "Greens",
  nn = TRUE,
  shadow.col = "gray"
)
pred_tree <- predict(tree_fit, test_cl, type = "class")
conf_tree <- confusionMatrix(pred_tree, test_cl$data_quality)
conf_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Moderate Poor
## Good 983 233 78
## Moderate 0 0 0
## Poor 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.7597
## 95% CI : (0.7354, 0.7827)
## No Information Rate : 0.7597
## P-Value [Acc > NIR] : 0.5152
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Good Class: Moderate Class: Poor
## Sensitivity 1.0000 0.0000 0.00000
## Specificity 0.0000 1.0000 1.00000
## Pos Pred Value 0.7597 NaN NaN
## Neg Pred Value NaN 0.8199 0.93972
## Prevalence 0.7597 0.1801 0.06028
## Detection Rate 0.7597 0.0000 0.00000
## Detection Prevalence 1.0000 0.0000 0.00000
## Balanced Accuracy 0.5000 0.5000 0.50000
Interpretation
The decision tree failed to generate meaningful splits and predicted all observations as the majority class (Good).
Although the overall accuracy appears high (≈ 76%), this performance is misleading and equivalent to always predicting the dominant class.
The Kappa of 0 and the balanced accuracy of 0.5 for every class indicate that the model does not distinguish Moderate or Poor air quality from Good at all.
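The accuracy and Kappa values in the output above can be recomputed by hand from the confusion matrix, which shows why Kappa is 0 when every prediction is the majority class. This is a minimal base-R sketch. The name `kappa_hat` is chosen to avoid masking `base::kappa()`:

```r
# Decision tree confusion matrix from the output above
# (rows = predicted class, columns = reference class).
classes <- c("Good", "Moderate", "Poor")
cm <- matrix(
  c(983, 233, 78,
      0,   0,  0,
      0,   0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, scaled by the maximum possible.
kappa_hat <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, expected = expected, kappa = kappa_hat)
```

When all predictions fall in one class, the chance agreement equals the observed accuracy, so the numerator of Kappa vanishes regardless of how high the accuracy is.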
set.seed(123)
rf_clf <- randomForest(
  data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city,
  data = train_cl,
  ntree = 300,
  importance = TRUE
)
print(rf_clf)
##
## Call:
## randomForest(formula = data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city, data = train_cl, ntree = 300, importance = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 25.18%
## Confusion matrix:
## Good Moderate Poor class.error
## Good 3876 51 7 0.01474326
## Moderate 930 4 2 0.99572650
## Poor 311 5 0 1.00000000
varImpPlot(rf_clf, main = "Random Forest: Classification Variable Importance")
pred_rf_cl <- predict(rf_clf, test_cl)
conf_rf_cl <- confusionMatrix(pred_rf_cl, test_cl$data_quality)
conf_rf_cl
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Moderate Poor
## Good 972 229 75
## Moderate 10 4 3
## Poor 1 0 0
##
## Overall Statistics
##
## Accuracy : 0.7543
## 95% CI : (0.7298, 0.7775)
## No Information Rate : 0.7597
## P-Value [Acc > NIR] : 0.6887
##
## Kappa : 0.011
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Good Class: Moderate Class: Poor
## Sensitivity 0.98881 0.017167 0.0000000
## Specificity 0.02251 0.987747 0.9991776
## Pos Pred Value 0.76176 0.235294 0.0000000
## Neg Pred Value 0.38889 0.820673 0.9396752
## Prevalence 0.75966 0.180062 0.0602782
## Detection Rate 0.75116 0.003091 0.0000000
## Detection Prevalence 0.98609 0.013138 0.0007728
## Balanced Accuracy 0.50566 0.502457 0.4995888
Interpretation
The Random Forest classifier achieves an overall test accuracy of 75.4%, slightly below both the Decision Tree (76.0%) and the no-information rate (76.0%).
Variable importance results indicate that PM2.5 and NO₂ concentrations are the most influential predictors of air quality categories, followed by city and seasonal effects.
Sensitivity is very high for the Good class (98.9%) but only 1.7% for Moderate and 0% for Poor, so almost every observation is still assigned to the majority class. The low Kappa (0.011) and per-class balanced accuracies close to 0.5 confirm that performance is near chance level.
Class imbalance contributes to this result, but it may not be the only cause. A label defined by PM2.5 thresholds would be recovered almost exactly by a model that has pm25_ugm3 as a predictor, so the near-chance performance suggests that data_quality is only weakly related to the available predictors.
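The per-class statistics above follow directly from the confusion matrix. The sketch below recomputes sensitivity, specificity and balanced accuracy in a one-vs-rest fashion using base R only:

```r
# Random Forest test-set confusion matrix from the output above
# (rows = predicted class, columns = reference class).
classes <- c("Good", "Moderate", "Poor")
cm <- matrix(
  c(972, 229, 75,
     10,   4,  3,
      1,   0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# One-vs-rest counts for each class.
tp <- diag(cm)
fn <- colSums(cm) - tp   # class members predicted as something else
fp <- rowSums(cm) - tp   # other classes predicted as this class
tn <- sum(cm) - tp - fn - fp

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

round(rbind(sensitivity, specificity, balanced), 4)
```

A balanced accuracy near 0.5 for each class means the classifier is doing little better than a constant prediction, even though overall accuracy is about 75%.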
reg_table <- data.frame(
  Model = c("Linear Regression", "Random Forest (reg)"),
  RMSE = c(lm_rmse, rf_rmse),
  MAE = c(lm_mae, rf_mae),
  R2 = c(lm_r2, rf_r2)
)
knitr::kable(reg_table, digits = 3, caption = "Regression model comparison (test set)")
| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Linear Regression | 20.306 | 13.665 | 0.706 |
| Random Forest (reg) | 20.021 | 13.246 | 0.718 |
class_summary <- tibble(
  Model = c("Decision Tree", "Random Forest (clf)"),
  Accuracy = c(conf_tree$overall["Accuracy"], conf_rf_cl$overall["Accuracy"]),
  Kappa = c(conf_tree$overall["Kappa"], conf_rf_cl$overall["Kappa"])
)
knitr::kable(class_summary, digits = 3, caption = "Classification model comparison (test set)")
| Model | Accuracy | Kappa |
|---|---|---|
| Decision Tree | 0.760 | 0.000 |
| Random Forest (clf) | 0.754 | 0.011 |
Interpretation
For regression tasks, the Random Forest model slightly outperforms linear regression, achieving a lower RMSE (20.02 vs 20.31) and MAE (13.25 vs 13.67), as well as a higher R² (0.718 vs 0.706). This indicates that non-linear models provide modest improvements in predicting PM2.5 concentrations.
For classification tasks, the Decision Tree and Random Forest models achieve similar overall accuracy (0.760 and 0.754). However, neither exceeds the no-information rate of 0.760, because this metric is dominated by the majority Good air quality class.
Despite its more flexible structure, the Random Forest classifier does not substantially improve classification performance over the Decision Tree due to severe class imbalance, limiting its ability to accurately identify Moderate and Poor air quality categories.
Limitations
Future Work
This project presents a comprehensive analysis of global urban air quality using long-term PM2.5 and NO₂ data across major cities. Through exploratory analysis and predictive modelling, the study identifies pronounced temporal, spatial, and pollutant-level heterogeneity in air quality patterns. Random Forest regression demonstrates modest improvements over linear models, indicating the presence of non-linear relationships in PM2.5 dynamics, while classification results highlight the challenges of categorising air quality under imbalanced class distributions. Overall, the findings underscore the value of integrating exploratory analysis with machine learning techniques to enhance understanding and prediction of urban air pollution.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 maps_3.4.3 randomForest_4.7-1.2
## [4] rpart.plot_3.1.3 rpart_4.1.24 caret_7.0-1
## [7] lattice_0.22-7 corrplot_0.95 skimr_2.2.1
## [10] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2
## [13] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6
## [16] tidyr_1.3.1 tibble_3.3.0 ggplot2_4.0.1
## [19] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 timeDate_4051.111
## [4] farver_2.1.2 S7_0.2.1 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.37 timechange_0.3.0
## [10] lifecycle_1.0.4 survival_3.8-3 magrittr_2.0.4
## [13] compiler_4.5.1 rlang_1.1.6 sass_0.4.10
## [16] tools_4.5.1 yaml_2.3.10 data.table_1.17.8
## [19] knitr_1.50 labeling_0.4.3 plyr_1.8.9
## [22] repr_1.1.7 RColorBrewer_1.1-3 withr_3.0.2
## [25] nnet_7.3-20 grid_4.5.1 stats4_4.5.1
## [28] e1071_1.7-16 future_1.68.0 globals_0.18.0
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [34] rmarkdown_2.30 generics_0.1.4 rstudioapi_0.17.1
## [37] future.apply_1.20.0 reshape2_1.4.5 tzdb_0.5.0
## [40] proxy_0.4-27 cachem_1.1.0 splines_4.5.1
## [43] parallel_4.5.1 base64enc_0.1-3 vctrs_0.6.5
## [46] hardhat_1.4.2 Matrix_1.7-3 jsonlite_2.0.0
## [49] hms_1.1.4 listenv_0.10.0 foreach_1.5.2
## [52] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.1
## [55] glue_1.8.0 parallelly_1.45.1 codetools_0.2-20
## [58] stringi_1.8.7 gtable_0.3.6 pillar_1.11.1
## [61] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.2
## [64] R6_2.6.1 evaluate_1.0.5 bslib_0.9.0
## [67] class_7.3-23 Rcpp_1.1.0 nlme_3.1-168
## [70] prodlim_2025.04.28 mgcv_1.9-3 xfun_0.53
## [73] pkgconfig_2.0.3 ModelMetrics_1.2.2.2