Air pollution is one of the most pressing public-health and environmental challenges facing India. This report presents an exploratory data analysis (EDA) of daily air-quality measurements across major Indian cities between 2015 and 2020. The goal of this initial analysis is to understand the structure and quality of the data, surface the dominant patterns, and frame the business and policy questions worth pursuing in deeper work.
I approach the data with two complementary mindsets. As a data scientist, I check integrity, quantify missingness, examine distributions and relationships, and stay honest about limitations. As a business/policy analyst, I translate numbers into action: which cities need help first, when risk is highest, and whether policy levers actually move the needle.
| # | Question |
|---|---|
| 1 | Where is the problem worst? Which cities carry the highest burden? |
| 2 | When is risk highest? Is there a seasonal pattern to drive the timing of interventions? |
| 3 | What drives the AQI? Which pollutants most strongly track the overall index? |
| 4 | Do interventions work? (beyond brief) Did the 2020 lockdown measurably improve air quality? |
Before any analysis, a data scientist must understand what is missing. Sensor coverage in India expanded over the study period, so earlier years and smaller cities have sparser records.
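As a rough illustration of the missingness check, here is a minimal Python sketch (the report itself is built from an R Markdown file). The sample rows and the column names `PM2.5`, `PM10`, `Xylene` and `AQI` are assumptions made for the example, not values taken from city_day.csv, and `missing_share` is a hypothetical helper.

```python
import csv
import io

# Hypothetical rows shaped like city_day.csv; column names and values are assumed.
SAMPLE_CSV = """City,Date,PM2.5,PM10,Xylene,AQI
Delhi,2015-01-01,313.2,607.9,,472
Delhi,2015-01-02,186.1,,,454
Patna,2015-01-01,,,,
Patna,2015-01-02,120.5,,1.2,310
"""


def missing_share(rows, columns):
    """Return {column: fraction of rows whose value is empty or 'NA'}."""
    rows = list(rows)
    if not rows:
        return {col: 0.0 for col in columns}
    shares = {}
    for col in columns:
        # Missing values stay missing; nothing is imputed here.
        missing = sum(
            1 for row in rows if (row.get(col) or "").strip() in ("", "NA")
        )
        shares[col] = missing / len(rows)
    return shares


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = missing_share(rows, ["PM2.5", "PM10", "Xylene", "AQI"])

# List columns from most to least incomplete.
for col, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<8}{share:.0%}")
```

To run this against the real file, replace the `io.StringIO(SAMPLE_CSV)` source with an opened file handle and pass the columns actually present in it.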
Xylene is missing in over 60% of records, and PM10 and NH3 in roughly a third; these should be treated cautiously or dropped in modelling. Reassuringly, the headline AQI is present for ~84% of rows, giving a solid base. At this exploratory stage we keep missing values as NA rather than imputing.

| City | Mean AQI | Median | Max | Days |
|---|---|---|---|---|
| Ahmedabad | 452.1 | 384.5 | 2049 | 1,334 |
| Delhi | 259.5 | 257.0 | 716 | 1,999 |
| Patna | 240.8 | 215.0 | 619 | 1,459 |
| Gurugram | 225.1 | 208.0 | 891 | 1,453 |
| Lucknow | 218.0 | 198.0 | 707 | 1,893 |
Top 5 most polluted cities by mean AQI.
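A ranking like the one above can be sketched in a few lines of Python. The observations below are hypothetical, `summarise_by_city` is an illustrative helper, and the sketch assumes that "Days" counts days with a recorded AQI, which the table does not state.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (city, AQI) observations; None marks a day with no AQI.
observations = [
    ("Delhi", 300.0), ("Delhi", 250.0), ("Delhi", None),
    ("Patna", 200.0), ("Patna", 260.0), ("Patna", 230.0),
]


def summarise_by_city(obs):
    """Summarise AQI per city, ranked by mean AQI (highest first)."""
    by_city = defaultdict(list)
    for city, aqi in obs:
        if aqi is not None:  # missing days are skipped, not imputed
            by_city[city].append(aqi)
    summary = [
        {
            "city": city,
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
            "days": len(values),
        }
        for city, values in by_city.items()
    ]
    return sorted(summary, key=lambda row: row["mean"], reverse=True)


ranking = summarise_by_city(observations)
for row in ranking:
    print(row)
```

Reporting the median next to the mean is useful because a few extreme days can pull a city's mean well above its typical level.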
Several major cities show a gentle downward drift over the period, with a notable dip in 2020 (partly the COVID lockdown, explored below). The 2020 values cover only Jan–Jul, so they are partial.
| # | Finding | Recommended action |
|---|---|---|
| 1 | Burden is highly uneven (~13× gap; worst: Ahmedabad / Delhi cluster) | Prioritise the worst city-cluster for targeted intervention |
| 2 | Strong winter peak in PM2.5 (Nov–Jan); cleanest in monsoon | Schedule advisories & health-system readiness ahead of winter |
| 3 | AQI driven mainly by particulates (PM10, PM2.5) and CO | Focus monitoring & regulation on particulate sources |
| 4 | ~40% of days Good/Satisfactory, but thousands of severe days remain | Build an AQI early-warning & alert system |
| 5 | Delhi PM2.5 fell by ~59% during the 2020 lockdown (not yet separated from seasonal change) | Treat as preliminary evidence for traffic/industry controls |
Limitations: coverage is uneven across cities/years (Xylene, PM10, NH3 heavily incomplete); the lockdown analysis overlaps with seasonal change, so a difference-in-differences or interrupted time-series design is needed to isolate the effect; and city-level aggregation hides neighbourhood hotspots.
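To make the difference-in-differences idea concrete, here is a minimal sketch. It assumes the comparison series is the same pair of calendar windows in an earlier year, which is one possible design and not necessarily the one a full analysis would choose. Every number is hypothetical and `diff_in_diff` is an illustrative helper.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated series minus the change in the comparison series."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical mean PM2.5 (ug/m3) for matching calendar windows.
pm25_2020_before = 100.0  # weeks before lockdown, 2020
pm25_2020_during = 45.0   # lockdown weeks, 2020
pm25_2019_before = 105.0  # same weeks in 2019 (comparison year, assumed)
pm25_2019_during = 80.0   # same weeks in 2019

raw_change = pm25_2020_during - pm25_2020_before
seasonal_change = pm25_2019_during - pm25_2019_before
effect = diff_in_diff(
    pm25_2020_before, pm25_2020_during, pm25_2019_before, pm25_2019_during
)

print(f"raw change: {raw_change:+.1f}")
print(f"seasonal change in comparison year: {seasonal_change:+.1f}")
print(f"difference-in-differences estimate: {effect:+.1f}")
```

In this made-up case the raw drop is 55 but 25 of it also appears in the comparison year, so only 30 would be attributed to the lockdown. The estimate is only as good as the assumption that both years would otherwise have followed the same seasonal path.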
Next steps: (1) build a predictive model for next-day AQI from lagged pollutants and weather; (2) cluster cities by pollution profile; (3) join external data (meteorology, crop-burning, traffic) to move from correlation toward causal explanation.
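For step (1), the lagged inputs could be prepared roughly as follows. This is a sketch under assumptions: the series is one city's consecutive daily AQI values, the numbers are hypothetical, and `make_lagged_rows` is an illustrative helper that covers only AQI lags, not other pollutants or weather.

```python
def make_lagged_rows(series, n_lags=2):
    """Pair each day's AQI (target) with the previous n_lags values (features).

    Windows that contain a missing value (None) are skipped, not imputed.
    """
    rows = []
    for t in range(n_lags, len(series)):
        window = series[t - n_lags:t]  # oldest value first
        target = series[t]
        if target is None or any(value is None for value in window):
            continue
        rows.append((window, target))
    return rows


# Hypothetical consecutive daily AQI values for one city; None is a missing day.
aqi = [210.0, 220.0, None, 240.0, 250.0, 245.0]

for features, target in make_lagged_rows(aqi, n_lags=2):
    print(features, "->", target)
```

One missing day removes several training rows, because it falls inside several overlapping windows. That is worth weighing when deciding how to handle the sparser columns.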
The analysis code is in air_quality_analysis.Rmd. Knitting that file in RStudio regenerates every figure and table from city_day.csv (included with the submission). All statistics shown were computed directly from the dataset.