Air Quality Health
Image source: The Guardian - https://www.theguardian.com/sustainable-business/2016/jul/05/how-air-pollution-affects-your-health-infographic
Topic: The Relationship Between Air Quality and Respiratory Hospital Admissions in Major Global Cities
In recent decades, urban air pollution has become one of the most pressing environmental health risks worldwide. This project explores how fluctuations in air quality, measured through metrics such as the Air Quality Index (AQI) and fine particulate matter (PM2.5), correlate with respiratory-related hospital admissions in major cities around the world.
This dataset, titled “Global Air Quality and Respiratory Health Outcomes”, offers a detailed, multi-city view of daily air pollution levels and hospitalizations. It includes:
Cities covered: Beijing, Delhi, London, Los Angeles, Mexico City, Tokyo, São Paulo, Cairo
Key variables:
aqi, pm2_5, pm10, no2, o3: Air pollution indicators (quantitative)
hospital_admissions: Number of respiratory-related hospital visits per day (quantitative)
city, population_density: Geographic & contextual identifiers (categorical)
date: Date of observation
temperature, humidity: Environmental conditions (quantitative)
Variables I used:
city, hospital_admissions, date, aqi, pm2_5, and population_density
This topic holds personal significance for me because my family lives in Delhi, one of the most polluted cities in the world. Every time I visit, I can physically feel the difference in air quality: the heaviness, the smog, the way it affects breathing even after a short walk outside. Experiencing that firsthand made me curious to explore the issue from a data-driven perspective. I wanted to go beyond personal impressions and actually quantify the impact, to see how poor air quality translates into real public health consequences, especially in cities like Delhi where millions are exposed daily.
The dataset includes over 88,000 observations, enough for trend analysis and exploratory modeling.
Loading the Libraries
library(tidyverse)
library(leaflet)
# Load dataset
air_data <- read_csv("air_quality_health_dataset.csv")
Cleaning the Data
# Convert character columns to factors
air_data <- air_data |>
mutate(
city = as.factor(city),
population_density = as.factor(population_density)
)
# Convert date
air_data <- air_data |>
mutate(date = as.Date(date, format = "%Y-%m-%d"))
# Check for missing values in each column
colSums(is.na(air_data))
               city                date                 aqi               pm2_5
                  0                   0                   0                   0
               pm10                 no2                  o3         temperature
                  0                   0                   0                   0
           humidity hospital_admissions  population_density   hospital_capacity
                  0                   0                   0                   0
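The check above reports zero missing values in every column, so mean imputation has nothing to change on this dataset. To show what the check and the replacement would do on data that does have gaps, here is a minimal sketch on a toy tibble. The toy values are made up for illustration and are not taken from the real file.

```r
library(dplyr)

# Toy data with hypothetical values; NOT from the real dataset
toy <- tibble(
  city  = c("Delhi", "Delhi", "London", "London"),
  aqi   = c(300, NA, 40, 60),
  pm2_5 = c(120, 110, NA, 10)
)

# Count missing values per column (same idea as colSums(is.na(air_data)))
na_counts <- colSums(is.na(toy))

# Replace each numeric NA with the mean of its own column
toy_filled <- toy |>
  mutate(across(where(is.numeric),
                ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))

na_counts
toy_filled
```

A column mean pools all cities together, so for data with real gaps a per-city mean (adding group_by(city) before the mutate) might be a more reasonable choice.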
# Replace NAs in numeric columns with column means
# (a safeguard only: the check above found no missing values)
air_data <- air_data |>
mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
Assigning new variables to cleaned data
# Create a new categorical variable based on AQI levels (EPA standard)
air_data <- air_data |>
mutate(
pollution_level = case_when(
aqi <= 50 ~ "Good",
aqi <= 100 ~ "Moderate",
aqi <= 150 ~ "Unhealthy for Sensitive Groups",
aqi <= 200 ~ "Unhealthy",
aqi <= 300 ~ "Very Unhealthy",
TRUE ~ "Hazardous"
)
)
# Preview cleaned data
head(air_data)
# A tibble: 6 × 13
  city        date         aqi pm2_5  pm10   no2    o3 temperature humidity
  <fct>       <date>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl>
1 Los Angeles 2020-01-01    65  34    52.7   2.2  38.5        33.5       33
2 Beijing     2020-01-02   137  33.7  31.5  36.7  27.5        -1.6       32
3 London      2020-01-03   266  43    59.6  30.4  57.3        36.4       25
4 Mexico City 2020-01-04   293  33.7  37.9  12.3  42.7        -1         67
5 Delhi       2020-01-05   493  50.3  34.8  31.2  35.6        33.5       72
6 Cairo       2020-01-06    28  67.2  44.9  41.9  47.8         7.9       89
# ℹ 4 more variables: hospital_admissions <dbl>, population_density <fct>,
#   hospital_capacity <dbl>, pollution_level <chr>
Source for the AQI category thresholds: https://www.airnow.gov/aqi/aqi-basics/
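The case_when() labelling above stores pollution_level as plain text (<chr>), so tables and plot legends would sort the categories alphabetically rather than by severity. One possible alternative, sketched here on a toy tibble, is base R's cut() with ordered_result = TRUE. The names aqi_labels and toy are illustrative only, and the AQI value 180 is made up to cover the "Unhealthy" band.

```r
library(dplyr)

# Category labels in order of increasing severity
aqi_labels <- c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                "Unhealthy", "Very Unhealthy", "Hazardous")

# Toy AQI values, one per band
toy <- tibble(aqi = c(28, 65, 137, 180, 266, 493))

toy <- toy |>
  mutate(
    pollution_level = cut(
      aqi,
      breaks = c(-Inf, 50, 100, 150, 200, 300, Inf),
      labels = aqi_labels,
      right = TRUE,          # intervals are (lower, upper], same as aqi <= upper
      ordered_result = TRUE  # ordered factor: Good < Moderate < ... < Hazardous
    )
  )

toy
```

With an ordered factor, comparisons such as pollution_level >= "Unhealthy" also work directly.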
# Keep only observations from 2020
air_data <- air_data |>
mutate(date = as.Date(date)) |>
filter(date >= as.Date("2020-01-01") & date < as.Date("2021-01-01"))
The dataset contained observations dated in the year 2260, which is not possible, so I went back and re-filtered to keep only 2020.
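The preview shows one row per consecutive calendar day. If that pattern holds for the whole file, it would explain the impossible dates, since 88,000 days is roughly 241 years past 2020. A quick way to spot this kind of problem is to look at the date range and the number of rows per year before filtering. The sketch below uses a toy date sequence, not the real data.

```r
library(dplyr)

# Toy data: one row per day, long enough to spill past a single year
toy <- tibble(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 800)
)

# Earliest and latest date present
date_range <- range(toy$date)

# Number of rows in each calendar year
rows_per_year <- toy |>
  count(year = as.integer(format(date, "%Y")))

# Same half-open filter as above: keeps all of 2020, including 31 December
toy_2020 <- toy |>
  filter(date >= as.Date("2020-01-01"), date < as.Date("2021-01-01"))

date_range
rows_per_year
```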
# Convert dates
air_data <- air_data |>
mutate(date = as.Date(date))
# AQI Facet Plot
ggplot(air_data, aes(x = date, y = aqi)) +
geom_line(color = "#F8766D", alpha = 0.8) +
facet_wrap(~ city, scales = "free_y", ncol = 2) +
labs(
title = "Air Quality Index Over 2020 by City",
x = "Date",
y = "AQI",
caption = "Source: Global Air Quality Dataset"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 30, hjust = 1)
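)
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
The São Paulo panel does not draw properly here, and I have not pinned down why. The warning suggests that this panel has only one observation, which geom_line() cannot connect into a line.
Inspiration for facet line plot: Stack Overflow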
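One way to investigate the geom_line() warning is to count how many rows each city has after filtering. Any city with fewer than two rows cannot be drawn as a line. This is a sketch on toy data with made-up values; on the real data the same count() call would be run on air_data.

```r
library(dplyr)

# Toy data with hypothetical values: one city has a single observation
toy <- tibble(
  city = c("Delhi", "Delhi", "Delhi", "Sao Paulo", "London", "London"),
  date = as.Date(c("2020-01-05", "2020-02-10", "2020-03-15",
                   "2020-04-01", "2020-01-03", "2020-05-20")),
  aqi  = c(493, 310, 280, 90, 266, 45)
)

# Observations per city, smallest first
obs_per_city <- toy |>
  count(city, name = "n_obs") |>
  arrange(n_obs)

# Cities that geom_line() cannot draw (fewer than two points)
single_obs <- obs_per_city |>
  filter(n_obs < 2)

obs_per_city
single_obs
```

If a city really does have only one row, adding geom_point() to the plot would at least make that single observation visible.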
Adding Coordinates for 5 Major Cities
# Standardize city names to lowercase and remove extra spaces
air_data <- air_data |>
mutate(city = str_trim(str_to_lower(as.character(city))))
city_coords <- tibble(
city = c("los angeles", "beijing", "london", "delhi", "mexico city"),
lat = c(34.0522, 39.9042, 51.5074, 28.6139, 19.4326),
long = c(-118.2437, 116.4074, -0.1278, 77.2090, -99.1332)
)
# Join with your cleaned data
air_data <- left_join(air_data, city_coords, by = "city")
Adding Colors and Preparing for Map Visualization
custom_colors <- c(
"Good" = "#FFFF99", # soft yellow
"Moderate" = "#FFB347", # soft orange
"Unhealthy" = "#FF6961", # coral red
"Hazardous" = "#990000" # deep red
)
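# colorFactor() does not use the names on the palette vector, and it sorts a
# character domain alphabetically; passing levels explicitly keeps each
# category paired with its intended color
pal <- colorFactor(
palette = unname(custom_colors),
levels = names(custom_colors)
)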
# Collapse AQI into four broader categories for the map legend
air_data <- air_data |>
mutate(
pollution_category = case_when(
aqi <= 50 ~ "Good",
aqi <= 100 ~ "Moderate",
aqi <= 200 ~ "Unhealthy",
TRUE ~ "Hazardous"
)
)
air_data_map <- air_data |>
filter(!is.na(lat) & !is.na(long))
Map Visualization Using leaflet
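Coordinates were supplied for only five of the eight cities, so the filter on lat and long drops Tokyo, São Paulo, and Cairo from the map data. A small sketch of how to list the cities that have no match in the coordinates table, using anti_join() on toy tibbles (toy_data and toy_coords are illustrative names, not objects from the analysis):

```r
library(dplyr)

# Toy stand-ins for air_data and city_coords
toy_data <- tibble(city = c("delhi", "tokyo", "cairo", "london", "delhi"))

toy_coords <- tibble(
  city = c("delhi", "london"),
  lat  = c(28.6139, 51.5074),
  long = c(77.2090, -0.1278)
)

# Cities present in the data but absent from the coordinates table
missing_coords <- toy_data |>
  distinct(city) |>
  anti_join(toy_coords, by = "city")

missing_coords
```

Running the same check on the real tables before the join makes it clear which cities will be missing from the map.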
# Use leaflet to map hospital admission counts by city
leaflet(data = air_data_map) |>
addTiles() |>
addCircleMarkers(
lng = ~long,
lat = ~lat,
radius = ~hospital_admissions * 1.5,
color = ~pal(pollution_category),
stroke = FALSE,
fillOpacity = 0.6,
# Labels are shown as plain text, so use separators instead of <br> tags
label = ~paste0(city, " | AQI: ", aqi, " | Admissions: ", hospital_admissions),
popup = ~paste("City:", city,
"<br> AQI:", aqi,
"<br> PM2.5:", pm2_5,
"<br> Pollution Level:", pollution_category)
) |>
# Insert legend
addLegend(
position = "bottomright",
pal = pal,
values = ~pollution_category,
title = "Pollution Level",
opacity = 1
)
Even this simple map visualization makes it easy to see which cities are more polluted than others.
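Because every daily row is drawn as its own marker at the same city coordinates, the markers stack on top of each other. One possible refinement, which is not what the map above does, is to summarise the data to one row per city first and map that summary instead. The sketch below uses toy values and hypothetical column names mean_aqi, mean_admissions, and n_days.

```r
library(dplyr)

# Toy map data with hypothetical values; NOT from the real dataset
toy_map <- tibble(
  city = c("delhi", "delhi", "london", "london"),
  lat  = c(28.6139, 28.6139, 51.5074, 51.5074),
  long = c(77.2090, 77.2090, -0.1278, -0.1278),
  aqi  = c(400, 300, 60, 40),
  hospital_admissions = c(12, 8, 3, 5)
)

# One row per city: average AQI, average daily admissions, days observed
city_summary <- toy_map |>
  group_by(city, lat, long) |>
  summarise(
    mean_aqi        = mean(aqi),
    mean_admissions = mean(hospital_admissions),
    n_days          = n(),
    .groups = "drop"
  )

city_summary
```

The summary table could then be passed to leaflet() in place of air_data_map, with radius and color based on the averaged columns.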
Sources
Copied from Kaggle:
[1] https://www.stateofglobalair.org/data
[2] https://www.who.int/data/gho/data/themes/air-pollution/who-air-quality-database
[3] https://www.stateofglobalair.org/resources/report/state-global-air-report-2024
[4] https://www.iqair.com/in-en/world-air-quality-report
[5] https://www.aqi.in/in/world-air-quality-report
[6] https://www.healthdata.org/research-analysis/library/state-global-air-2024
[7] https://www.who.int/data/gho/data/themes/air-pollution
[8] https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data