Global Air Quality Analysis & Prediction (1999–2025)

WQD7004 – Programming for Data Science

Group Project Report

Author: Group 4

Date: 2025-11-21

1 1. Introduction

This project analyzes monthly PM2.5 and NO₂ levels across
20 major global cities (1999–2025) using EPA & WHO datasets.

2 2. Load Packages

options(repos = c(CRAN = "https://cloud.r-project.org"))

required <- c(
  "tidyverse","lubridate","skimr","corrplot",
  "caret","rpart","rpart.plot","randomForest"
)

to_install <- required[!(required %in% installed.packages()[,"Package"])]
if(length(to_install) > 0){
  install.packages(to_install, dependencies = TRUE)
}

lapply(required, library, character.only = TRUE)

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "skimr"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "corrplot"  "skimr"     "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[5]]
##  [1] "caret"     "lattice"   "corrplot"  "skimr"     "lubridate" "forcats"  
##  [7] "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"     "tibble"   
## [13] "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices" "utils"    
## [19] "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "rpart"     "caret"     "lattice"   "corrplot"  "skimr"     "lubridate"
##  [7] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
## [13] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [19] "utils"     "datasets"  "methods"   "base"     
## 
## [[7]]
##  [1] "rpart.plot" "rpart"      "caret"      "lattice"    "corrplot"  
##  [6] "skimr"      "lubridate"  "forcats"    "stringr"    "dplyr"     
## [11] "purrr"      "readr"      "tidyr"      "tibble"     "ggplot2"   
## [16] "tidyverse"  "stats"      "graphics"   "grDevices"  "utils"     
## [21] "datasets"   "methods"    "base"      
## 
## [[8]]
##  [1] "randomForest" "rpart.plot"   "rpart"        "caret"        "lattice"     
##  [6] "corrplot"     "skimr"        "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"

3 3. Data Loading & Inspection

df <- read.csv("air_quality_global.csv")

cat("Rows:", nrow(df), "\nColumns:", ncol(df), "\n")

## Rows: 6480 
## Columns: 11

head(df)

##       city country latitude longitude year month pm25_ugm3 no2_ugm3
## 1 New York     USA  40.7128   -74.006 1999     1     18.11    35.98
## 2 New York     USA  40.7128   -74.006 1999     2     27.79    17.71
## 3 New York     USA  40.7128   -74.006 1999     3     12.05    40.99
## 4 New York     USA  40.7128   -74.006 1999     4     35.25    17.18
## 5 New York     USA  40.7128   -74.006 1999     5     38.39    25.07
## 6 New York     USA  40.7128   -74.006 1999     6     14.89    28.95
##   data_quality          measurement_method data_source
## 1     Moderate Reference/Equivalent Method     EPA_AQS
## 2         Good Reference/Equivalent Method     EPA_AQS
## 3     Moderate Reference/Equivalent Method     EPA_AQS
## 4         Poor Reference/Equivalent Method     EPA_AQS
## 5         Good Reference/Equivalent Method     EPA_AQS
## 6         Good Reference/Equivalent Method     EPA_AQS

str(df)

## 'data.frame':    6480 obs. of  11 variables:
##  $ city              : chr  "New York" "New York" "New York" "New York" ...
##  $ country           : chr  "USA" "USA" "USA" "USA" ...
##  $ latitude          : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude         : num  -74 -74 -74 -74 -74 ...
##  $ year              : int  1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
##  $ month             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pm25_ugm3         : num  18.1 27.8 12.1 35.2 38.4 ...
##  $ no2_ugm3          : num  36 17.7 41 17.2 25.1 ...
##  $ data_quality      : chr  "Moderate" "Good" "Moderate" "Poor" ...
##  $ measurement_method: chr  "Reference/Equivalent Method" "Reference/Equivalent Method" "Reference/Equivalent Method" "Reference/Equivalent Method" ...
##  $ data_source       : chr  "EPA_AQS" "EPA_AQS" "EPA_AQS" "EPA_AQS" ...

skim(df)

Data summary
Name	df
Number of rows	6480
Number of columns	11
_______________________
Column type frequency:
character	5
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
city	1	5	12	20
country	1	2	7	10
data_quality	1	4	8	3
measurement_method	1	24	27	2
data_source	1	7	12	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
latitude	1	31.54	16.60	-23.55	29.24	33.75	40.14	52.52	▁▁▂▇▇
longitude	1	-35.88	81.51	-121.89	-98.65	-74.59	5.89	139.65	▇▁▃▂▂
year	1	2012.00	7.79	1999.00	2005.00	2012.00	2019.00	2025.00	▇▇▇▇▇
month	1	6.50	3.45	1.00	3.75	6.50	9.25	12.00	▇▅▅▅▇
pm25_ugm3	1	40.97	36.30	5.10	19.34	29.23	46.08	274.18	▇▁▁▁▁
no2_ugm3	1	39.62	16.71	10.25	27.08	36.84	48.92	110.27	▆▇▃▁▁

4 4. Data Cleaning

clean_df <- df %>%
  mutate(
    date = make_date(year, month, 1),
    city = factor(city),
    country = factor(country),
    measurement_method = factor(measurement_method),
    data_source = factor(data_source),
    season = case_when(
      month %in% c(12,1,2) ~ "Winter",
      month %in% c(3,4,5) ~ "Spring",
      month %in% c(6,7,8) ~ "Summer",
      TRUE ~ "Autumn"
    ) |> factor(),
    data_quality = factor(data_quality,
                          levels = c("Good","Moderate","Poor"),
                          ordered = TRUE)
  ) %>%
  group_by(city, month) %>%
  mutate(
    pm25_ugm3 = ifelse(is.na(pm25_ugm3),
                       mean(pm25_ugm3, na.rm = TRUE),
                       pm25_ugm3),
    no2_ugm3 = ifelse(is.na(no2_ugm3),
                      mean(no2_ugm3, na.rm = TRUE),
                      no2_ugm3)
  ) %>%
  ungroup()

skim(clean_df)

Data summary
Name	clean_df
Number of rows	6480
Number of columns	13
_______________________
Column type frequency:
Date	1
factor	6
numeric	6
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	1999-01-01	2025-12-01	2012-06-16	324

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
city	1	FALSE	20	Bei: 324, Ber: 324, Chi: 324, Dal: 324
country	1	FALSE	10	USA: 3240, Ind: 648, Bra: 324, Chi: 324
data_quality	1	TRUE	3	Goo: 4917, Mod: 1169, Poo: 394
measurement_method	1	FALSE	2	Fed: 5040, Ref: 1440
data_source	1	FALSE	2	EPA: 3240, WHO: 3240
season	1	FALSE	4	Aut: 1620, Spr: 1620, Sum: 1620, Win: 1620

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
latitude	1	31.54	16.60	-23.55	29.24	33.75	40.14	52.52	▁▁▂▇▇
longitude	1	-35.88	81.51	-121.89	-98.65	-74.59	5.89	139.65	▇▁▃▂▂
year	1	2012.00	7.79	1999.00	2005.00	2012.00	2019.00	2025.00	▇▇▇▇▇
month	1	6.50	3.45	1.00	3.75	6.50	9.25	12.00	▇▅▅▅▇
pm25_ugm3	1	40.97	36.30	5.10	19.34	29.23	46.08	274.18	▇▁▁▁▁
no2_ugm3	1	39.62	16.71	10.25	27.08	36.84	48.92	110.27	▆▇▃▁▁

5 5. Exploratory Data Analysis

5.1 5.1 Global PM2.5 Trend

clean_df %>%
  group_by(date) %>%
  summarize(pm25 = mean(pm25_ugm3)) %>%
  ggplot(aes(date, pm25)) +
  geom_line(size=1.1, color="#1A73E8") +
  theme_minimal() +
  labs(title="Global PM2.5 Trend (1999–2025)",
       y="Mean PM2.5 (µg/m³)",
       x="Year")

5.2 5.2 PM2.5 Distribution by City

ggplot(clean_df, aes(city, pm25_ugm3, fill=city)) +
  geom_boxplot(alpha=.7) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title="PM2.5 Distribution Across Cities")

5.3 5.3 PM2.5 vs NO₂

ggplot(clean_df, aes(no2_ugm3, pm25_ugm3, color=city)) +
  geom_point(alpha=.45) +
  geom_smooth(method="lm", se=FALSE, color="black") +
  theme_minimal() +
  labs(title="PM2.5 vs NO₂ Levels")

5.4 5.4 Correlation Heatmap

cor_matrix <- clean_df %>%
  select(pm25_ugm3, no2_ugm3, latitude, longitude) %>%
  cor()

corrplot(cor_matrix, method="color", addCoef.col="black")

6 6. Modeling

6.1 6.1 Train–Test Split

set.seed(123)

clean_df <- clean_df %>% filter(!is.na(data_quality))

train_idx <- createDataPartition(clean_df$data_quality, p=0.8, list=FALSE)

train <- clean_df[train_idx, ] %>% droplevels()
test  <- clean_df[-train_idx, ] %>% droplevels()

6.2 6.2 Regression Model

model_reg <- lm(pm25_ugm3 ~ no2_ugm3 + latitude + season + city,
                data=train)

summary(model_reg)

## 
## Call:
## lm(formula = pm25_ugm3 ~ no2_ugm3 + latitude + season + city, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.960  -9.206   0.406   9.023 129.196 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -340.90751   15.83840 -21.524  < 2e-16 ***
## no2_ugm3            0.10015    0.02287   4.378 1.22e-05 ***
## latitude            9.96572    0.42539  23.427  < 2e-16 ***
## seasonSpring       22.18852    0.78926  28.113  < 2e-16 ***
## seasonSummer       11.39003    0.80466  14.155  < 2e-16 ***
## seasonWinter       11.63281    0.76453  15.216  < 2e-16 ***
## cityBerlin       -172.25148    6.52073 -26.416  < 2e-16 ***
## cityChicago       -66.51658    2.36116 -28.171  < 2e-16 ***
## cityDallas         23.52521    2.53712   9.272  < 2e-16 ***
## cityDelhi         170.36328    4.27334  39.866  < 2e-16 ***
## cityHouston        57.23679    3.67380  15.580  < 2e-16 ***
## cityLagos         326.65356   13.36556  24.440  < 2e-16 ***
## cityLondon       -158.95266    6.03690 -26.330  < 2e-16 ***
## cityLos Angeles    17.18419    2.14176   8.023 1.26e-15 ***
## cityMexico City   183.26172    8.01249  22.872  < 2e-16 ***
## cityMumbai        232.99353    8.12917  28.661  < 2e-16 ***
## cityNew York      -57.43163    2.05359 -27.966  < 2e-16 ***
## cityParis        -133.87052    4.97687 -26.899  < 2e-16 ***
## cityPhiladelphia  -48.81827    1.83019 -26.674  < 2e-16 ***
## cityPhoenix        14.39198    2.30133   6.254 4.33e-10 ***
## citySan Antonio    57.98366    3.77160  15.374  < 2e-16 ***
## citySan Diego      20.43272    2.53275   8.067 8.86e-16 ***
## citySan Jose      -22.50785    1.50922 -14.914  < 2e-16 ***
## citySão Paulo     600.18093   26.14867  22.953  < 2e-16 ***
## cityTokyo                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.47 on 5162 degrees of freedom
## Multiple R-squared:  0.7091, Adjusted R-squared:  0.7078 
## F-statistic:   547 on 23 and 5162 DF,  p-value: < 2.2e-16

6.2.1 Regression Evaluation

pred_reg <- predict(model_reg, test)

cat("RMSE:", RMSE(pred_reg, test$pm25_ugm3), "\n")

## RMSE: 20.30573

cat("MAE :", MAE(pred_reg, test$pm25_ugm3), "\n")

## MAE : 13.66486

cat("R²  :", R2(pred_reg, test$pm25_ugm3), "\n")

## R²  : 0.7062653

6.3 6.3 Decision Tree Classifier

train_mod <- train %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>% droplevels()

test_mod <- test %>%
  select(data_quality, pm25_ugm3, no2_ugm3, latitude, season, city) %>%
  drop_na() %>% droplevels()

model_tree <- rpart(
  data_quality ~ pm25_ugm3 + no2_ugm3 + latitude + season + city,
  data=train_mod,
  method="class"
)

rpart.plot(
  model_tree,
  main="Decision Tree for Air Quality",
  box.palette="Greens",
  nn=TRUE,
  shadow.col="gray"
)

6.4 6.4 Classification Evaluation

pred_class <- predict(model_tree, test_mod, type="class")
conf <- confusionMatrix(pred_class, test_mod$data_quality)
conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Moderate Poor
##   Good      983      233   78
##   Moderate    0        0    0
##   Poor        0        0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7597          
##                  95% CI : (0.7354, 0.7827)
##     No Information Rate : 0.7597          
##     P-Value [Acc > NIR] : 0.5152          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Good Class: Moderate Class: Poor
## Sensitivity               1.0000          0.0000     0.00000
## Specificity               0.0000          1.0000     1.00000
## Pos Pred Value            0.7597             NaN         NaN
## Neg Pred Value               NaN          0.8199     0.93972
## Prevalence                0.7597          0.1801     0.06028
## Detection Rate            0.7597          0.0000     0.00000
## Detection Prevalence      1.0000          0.0000     0.00000
## Balanced Accuracy         0.5000          0.5000     0.50000

7 7. Conclusion

PM2.5 varies significantly across cities and time
PM2.5 & NO₂ show moderate positive correlation
Regression: NO₂, season, city significantly influence PM2.5
Decision tree effectively predicts Good / Moderate / Poor