Air Quality Health

Author

Charanpreet Singh

Image

Source: The Guardian - https://www.theguardian.com/sustainable-business/2016/jul/05/how-air-pollution-affects-your-health-infographic

Topic: The Relationship Between Air Quality and Respiratory Hospital Admissions in Major Global Cities

In recent decades, urban air pollution has become one of the most pressing environmental health risks worldwide. This project explores how fluctuations in air quality, measured through metrics like the Air Quality Index (AQI) and fine particulate matter (PM2.5), correlate with respiratory-related hospital admissions in major cities around the world.

This dataset, titled “Global Air Quality and Respiratory Health Outcomes”, offers a detailed, multi-city view of daily air pollution levels and hospitalizations. It includes:

Cities covered: Beijing, Delhi, London, Los Angeles, Mexico City, Tokyo, São Paulo, Cairo

Key variables:

aqi, pm2_5, pm10, no2, o3: Air pollution indicators (quantitative)

hospital_admissions: Number of respiratory-related hospital visits per day (quantitative)

city, population_density: Geographic & contextual identifiers (categorical)

date: Date of observation

temperature, humidity: Environmental conditions (quantitative)

Variables I used:

city, hospital_admissions, date, aqi, pm2_5, and population_density

This topic holds personal significance for me because my family lives in Delhi, one of the most polluted cities in the world. Every time I visit, I can physically feel the difference in air quality: the heaviness, the smog, the way it affects breathing even after a short walk outside. Experiencing that firsthand made me curious to explore the issue from a data-driven perspective. I wanted to go beyond personal impressions and actually quantify the impact, to see how poor air quality translates into real public health consequences, especially in cities like Delhi where millions are exposed daily.

This data includes over 88,000 observations, making it robust enough for trend analysis and exploratory modeling.

Loading the Libraries

library(tidyverse)
library(leaflet)

# Load dataset
air_data <- read_csv("air_quality_health_dataset.csv")

Cleaning the Data

# Convert character columns to factors
air_data <- air_data |> 
  mutate(
    city = as.factor(city),
    population_density = as.factor(population_density)
  )

# Convert date
air_data <- air_data |> 
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

# Check for missing values in each column
colSums(is.na(air_data))
               city                date                 aqi               pm2_5 
                  0                   0                   0                   0 
               pm10                 no2                  o3         temperature 
                  0                   0                   0                   0 
           humidity hospital_admissions  population_density   hospital_capacity 
                  0                   0                   0                   0 
# Replace NAs in numeric columns with column means
# (a safeguard only: the check above found no missing values)
air_data <- air_data |> 
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
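
Since the real dataset has no missing values, the imputation step above never changes anything. To show what it would do, here is a minimal sketch on a small made-up tibble (the `toy` data and its values are assumptions for illustration, not rows from the dataset). It uses `if_else()` and an explicit lambda, which is one possible way to write the same idea.

```r
library(dplyr)

# Made-up toy data with one gap in each numeric column
toy <- tibble(
  city = c("Delhi", "Delhi", "London", "London"),
  aqi = c(300, NA, 60, 80),
  hospital_admissions = c(12, 9, NA, 3)
)

# Fill each numeric column's NAs with that column's mean
toy_filled <- toy |>
  mutate(across(
    where(is.numeric),
    \(x) if_else(is.na(x), mean(x, na.rm = TRUE), x)
  ))

toy_filled
```

Note that this fills gaps with the overall column mean, so a missing Delhi value would be pulled toward the other cities. Grouping by city before imputing may be a better choice if the cities differ a lot.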

Assigning new variables to cleaned data

# Create a new categorical variable based on AQI levels (EPA standard)
air_data <- air_data |> 
  mutate(
    pollution_level = case_when(
      aqi <= 50 ~ "Good",
      aqi <= 100 ~ "Moderate",
      aqi <= 150 ~ "Unhealthy for Sensitive Groups",
      aqi <= 200 ~ "Unhealthy",
      aqi <= 300 ~ "Very Unhealthy",
      TRUE ~ "Hazardous"
    )
  )

# Preview cleaned data
head(air_data)
# A tibble: 6 × 13
  city        date         aqi pm2_5  pm10   no2    o3 temperature humidity
  <fct>       <date>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl>
1 Los Angeles 2020-01-01    65  34    52.7   2.2  38.5        33.5       33
2 Beijing     2020-01-02   137  33.7  31.5  36.7  27.5        -1.6       32
3 London      2020-01-03   266  43    59.6  30.4  57.3        36.4       25
4 Mexico City 2020-01-04   293  33.7  37.9  12.3  42.7        -1         67
5 Delhi       2020-01-05   493  50.3  34.8  31.2  35.6        33.5       72
6 Cairo       2020-01-06    28  67.2  44.9  41.9  47.8         7.9       89
# ℹ 4 more variables: hospital_admissions <dbl>, population_density <fct>,
#   hospital_capacity <dbl>, pollution_level <chr>

Source: https://www.airnow.gov/aqi/aqi-basics/ (defines the AQI ranges and the label used for each category)
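
The preview above shows pollution_level stored as plain text (`<chr>`), which sorts alphabetically rather than by severity. A possible alternative, sketched here on a handful of illustrative AQI values, is to build an ordered factor with base R's `cut()`. The vector names `aqi_values` and `aqi_labels` are assumptions for this sketch.

```r
# Category labels in order of severity, matching the case_when() above
aqi_labels <- c(
  "Good", "Moderate", "Unhealthy for Sensitive Groups",
  "Unhealthy", "Very Unhealthy", "Hazardous"
)

# Illustrative AQI values, one per category
aqi_values <- c(28, 65, 137, 180, 266, 493)

# cut() uses right-closed intervals by default, so (50, 100] matches aqi <= 100
pollution_level <- cut(
  aqi_values,
  breaks = c(-Inf, 50, 100, 150, 200, 300, Inf),
  labels = aqi_labels,
  ordered_result = TRUE
)

table(pollution_level)
```

With an ordered factor, tables, plot legends, and comparisons such as `pollution_level >= "Unhealthy"` follow the severity order instead of the alphabet.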

# Keep only observations from 2020 (date was already converted to Date above)
air_data <- air_data |> 
  filter(date >= as.Date("2020-01-01") & date < as.Date("2021-01-01"))

The dataset contained observations dated in the year 2260, which is not possible, so I went back and filtered it down to 2020 only.
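
A quick way to spot impossible dates like these is to look at the date range and the number of rows per year before filtering. The sketch below uses a small made-up tibble (`toy` and its dates are assumptions, not values from the dataset).

```r
library(dplyr)

# Made-up dates, including one impossible far-future value
toy <- tibble(
  date = as.Date(c("2020-01-01", "2020-06-15", "2021-03-02", "2260-11-30"))
)

# The overall range exposes implausible dates immediately
range(toy$date)

# Counting rows per year shows how much data each year contributes
rows_per_year <- toy |>
  count(year = as.integer(format(date, "%Y")))

# Anything dated after today is suspect
suspect <- toy |>
  filter(date > Sys.Date())

rows_per_year
suspect
```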

# Convert dates
air_data <- air_data |> 
  mutate(date = as.Date(date))

# AQI Facet Plot
ggplot(air_data, aes(x = date, y = aqi)) +
  geom_line(color = "#F8766D", alpha = 0.8) +
  facet_wrap(~ city, scales = "free_y", ncol = 2) +
  labs(
    title = "Air Quality Index Over 2020 by City",
    x = "Date",
    y = "AQI",
    caption = "Source: Global Air Quality Dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1)
  )
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

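The São Paulo panel is broken here. The warning above points to the likely cause: a city with only one observation left after filtering to 2020 gives geom_line() nothing to connect.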
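
The geom_line() warning says that some group has only one observation, so counting rows per city is a reasonable first check. This is a minimal sketch on made-up data (the `toy` tibble and the `n_obs` column name are assumptions for illustration).

```r
library(dplyr)

# Made-up stand-in for the filtered 2020 data
toy <- tibble(
  city = c("Delhi", "Delhi", "Delhi", "London", "London", "Sao Paulo"),
  date = as.Date(c(
    "2020-01-05", "2020-02-10", "2020-03-15",
    "2020-01-03", "2020-04-20", "2020-07-01"
  )),
  aqi = c(493, 310, 280, 266, 90, 120)
)

# Cities with fewer than two rows cannot form a line
single_obs <- toy |>
  count(city, name = "n_obs") |>
  filter(n_obs < 2)

single_obs
```

If a city does turn out to have a single row, adding a `geom_point()` layer to the plot would at least make that observation visible in its panel.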

Inspiration for the facet line plot: Stack Overflow

Adding Coordinates for Five Major Cities

# Standardize city names to lowercase and remove extra spaces
air_data <- air_data |> 
  mutate(city = str_trim(str_to_lower(as.character(city))))

city_coords <- tibble(
  city = c("los angeles", "beijing", "london", "delhi", "mexico city"),
  lat = c(34.0522, 39.9042, 51.5074, 28.6139, 19.4326),
  long = c(-118.2437, 116.4074, -0.1278, 77.2090, -99.1332)
)

# Join coordinates onto the cleaned data (cities without coordinates get NA)
air_data <- left_join(air_data, city_coords, by = "city")
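
Because only five cities have coordinates, a left join leaves lat and long as NA for the rest. A small sketch of how one might list the unmatched cities with `anti_join()`, using made-up stand-ins (`toy_data` and `toy_coords` are assumptions for illustration):

```r
library(dplyr)

# Made-up stand-ins for the data and the coordinate lookup table
toy_data <- tibble(city = c("delhi", "london", "tokyo", "cairo", "delhi"))
toy_coords <- tibble(
  city = c("delhi", "london"),
  lat = c(28.6139, 51.5074),
  long = c(77.2090, -0.1278)
)

# Cities present in the data but missing from the lookup table
missing_coords <- toy_data |>
  distinct(city) |>
  anti_join(toy_coords, by = "city")

missing_coords
```

Running the same check on the real data would also catch spelling or accent mismatches in city names, which silently fail to join.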

Adding Colors in Preparation for the Map Visualization

custom_colors <- c(
  "Good" = "#FFFF99",          # soft yellow
  "Moderate" = "#FFB347",      # soft orange
  "Unhealthy" = "#FF6961",     # coral red
  "Hazardous" = "#990000"      # deep red
)
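# Pass the levels explicitly so each color stays paired with its category;
# a character domain is sorted alphabetically before colors are assigned
pal <- colorFactor(
  palette = unname(custom_colors),
  domain = NULL,
  levels = names(custom_colors)
)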

# Collapse the AQI scale into four broader categories for the map
air_data <- air_data |> 
  mutate(
    pollution_category = case_when(
      aqi <= 50 ~ "Good",
      aqi <= 100 ~ "Moderate",
      aqi <= 200 ~ "Unhealthy",
      TRUE ~ "Hazardous"
    )
  )

air_data_map <- air_data |> 
  filter(!is.na(lat) & !is.na(long))

Map Visualization utilizing leaflet

# Use leaflet to map the cities, sizing each marker by hospital admissions
leaflet(data = air_data_map) |> 
  addTiles() |> 
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    radius = ~hospital_admissions * 1.5,
    color = ~pal(pollution_category),
    stroke = FALSE,
    fillOpacity = 0.6,
    # Hover labels are shown as plain text, so HTML tags are kept for the popup only
    label = ~paste0(city, " | AQI: ", aqi, " | Admissions: ", hospital_admissions),
    popup = ~paste("City:", city,
                   "<br> AQI:", aqi,
                   "<br> PM2.5:", pm2_5,
                   "<br> Pollution Level:", pollution_category)
  ) |> 
  # Insert legend
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~pollution_category,
    title = "Pollution Level",
    opacity = 1
  )
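
Every row for a given city shares the same coordinates, so the map above stacks many markers on top of each other at each location. One possible refinement, sketched here on made-up data, is to summarise to a single row per city before mapping. The `toy_map` tibble, the summary column names, and the 20-pixel maximum radius are all assumptions for illustration.

```r
library(dplyr)

# Made-up stand-in for air_data_map
toy_map <- tibble(
  city = c("delhi", "delhi", "london", "london"),
  lat = c(28.6139, 28.6139, 51.5074, 51.5074),
  long = c(77.2090, 77.2090, -0.1278, -0.1278),
  aqi = c(400, 200, 60, 100),
  hospital_admissions = c(16, 20, 3, 5)
)

# One row per city with average AQI and average daily admissions
city_summary <- toy_map |>
  group_by(city, lat, long) |>
  summarise(
    mean_aqi = mean(aqi),
    mean_admissions = mean(hospital_admissions),
    .groups = "drop"
  ) |>
  # Square-root scaling so marker area, not radius, tracks admissions;
  # the largest marker gets a radius of 20 pixels
  mutate(radius = 20 * sqrt(mean_admissions / max(mean_admissions)))

city_summary
```

A table like `city_summary` could then be passed to `leaflet()` in place of the row-level data, with `radius = ~radius` in `addCircleMarkers()`.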

Even a simple map visualization shows which cities are more polluted than others.

Sources

Copied from the Kaggle dataset page:

[1] https://www.stateofglobalair.org/data

[2] https://www.who.int/data/gho/data/themes/air-pollution/who-air-quality-database

[3] https://www.stateofglobalair.org/resources/report/state-global-air-report-2024

[4] https://www.iqair.com/in-en/world-air-quality-report

[5] https://www.aqi.in/in/world-air-quality-report

[6] https://www.healthdata.org/research-analysis/library/state-global-air-2024

[7] https://www.who.int/data/gho/data/themes/air-pollution

[8] https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data