Air Quality in Indian Cities: An Exploratory Data Analysis

Initial Analysis Submission — Module 1
Author: Pooja Goswami
Date: June 16, 2026
Course: [Add course name & code]
Instructor: [Add instructor name]
Dataset: city_day.csv (included)
Records: 29,531 city-days · 26 cities
Contents
  1. Introduction & Business Questions
  2. Data Quality & Missingness
  3. Q1 — Where is the problem worst?
  4. AQI Category Mix
  5. Q2 — When is risk highest? (Seasonality)
  6. Q3 — What drives the AQI?
  7. Q4 — Do interventions work? (COVID natural experiment)
  8. Key Findings & Recommendations
  9. Limitations & Next Steps

1. Introduction

Air pollution is one of the most pressing public-health and environmental challenges facing India. This report presents an exploratory data analysis (EDA) of daily air-quality measurements across major Indian cities between 2015 and 2020. The goal of this initial analysis is to understand the structure and quality of the data, surface the dominant patterns, and frame the business and policy questions worth pursuing in deeper work.

I approach the data with two complementary mindsets: as a data scientist — checking integrity, quantifying missingness, examining distributions and relationships, and being honest about limitations; and as a business / policy analyst — translating numbers into action: which cities need help first, when risk is highest, and whether policy levers actually move the needle.

Business questions

# | Question
1 | Where is the problem worst? Which cities carry the highest burden?
2 | When is risk highest? Is there a seasonal pattern to drive the timing of interventions?
3 | What drives the AQI? Which pollutants most strongly track the overall index?
4 | Do interventions work? (beyond brief) Did the 2020 lockdown measurably improve air quality?

2. Data Quality & Missingness

Before any analysis, a data scientist must understand what is missing. Sensor coverage in India expanded over the study period, so earlier years and smaller cities have sparser records.

Figure 1 — Share of records missing each measurement.
Insight: Xylene is missing in over 60% of records, and PM10 and NH3 in roughly a third — these should be treated cautiously or dropped in modelling. Reassuringly, the headline AQI is present for ~84% of rows, giving a solid base. At this exploratory stage we keep missing values as NA rather than imputing.
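The missingness shares behind Figure 1 can be computed with a simple per-column count. The sketch below is a minimal Python illustration, not the code in the accompanying .Rmd file. The column names and the four sample rows are assumptions made up for the example; empty cells and the string "NA" are both treated as missing.

```python
import csv
import io

# Hypothetical four-row sample. Column names and values are made up for
# illustration and are not taken from city_day.csv.
SAMPLE = """City,Date,PM2.5,PM10,Xylene,AQI
CityA,2015-01-01,300.0,600.0,,470
CityA,2015-01-02,180.0,,,450
CityB,2020-03-11,,,0.1,
CityC,2019-06-01,45.0,80.0,,110
"""

def missing_share(rows, columns):
    """Return {column: fraction of rows whose value is empty or 'NA'}."""
    rows = list(rows)
    if not rows:
        return {col: 0.0 for col in columns}
    share = {}
    for col in columns:
        n_missing = sum(
            1 for row in rows if (row.get(col) or "").strip() in ("", "NA")
        )
        share[col] = n_missing / len(rows)
    return share

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = missing_share(rows, ["PM2.5", "PM10", "Xylene", "AQI"])

# Print columns from most to least missing, as in a missingness bar chart.
for col, frac in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:8s} {frac:.0%}")
```

Counting missing values rather than filling them in matches the choice stated above to keep NA as NA at the exploratory stage.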

3. Q1 — Where is the Problem Worst?

Figure 2 — Mean AQI by city; dashed line marks the national mean (~166).
Insight: The burden is highly uneven. Ahmedabad records by far the worst average AQI (~452), followed by the north-Indian cluster of Delhi, Patna, Gurugram and Lucknow on the Indo-Gangetic plain. At the clean end, hill and coastal cities such as Aizawl, Shillong and Coimbatore stay near "Satisfactory". This ~13× gap argues for geographically targeted intervention rather than uniform national policy.
City      | Mean AQI | Median | Max  | Days
Ahmedabad | 452.1    | 384.5  | 2049 | 1,334
Delhi     | 259.5    | 257.0  | 716  | 1,999
Patna     | 240.8    | 215.0  | 619  | 1,459
Gurugram  | 225.1    | 208.0  | 891  | 1,453
Lucknow   | 218.0    | 198.0  | 707  | 1,893

Top 5 most polluted cities by mean AQI.
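A per-city summary of this shape can be produced by grouping the daily records by city and skipping days with no AQI. The following is a minimal Python sketch under stated assumptions: the city names and values are hypothetical, and "Days" is assumed to count only days with a recorded AQI. It is not the code used to build the table above.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (city, AQI) pairs; None marks a day with no AQI reading.
records = [
    ("CityA", 400.0), ("CityA", 500.0), ("CityA", None),
    ("CityB", 40.0), ("CityB", 30.0), ("CityB", 50.0),
]

def summarise_by_city(records):
    """Mean, median, max and count of non-missing AQI per city, worst first."""
    by_city = defaultdict(list)
    for city, aqi in records:
        if aqi is not None:  # missing days are skipped, not imputed
            by_city[city].append(aqi)
    summary = [
        {
            "city": city,
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
            "days": len(values),
        }
        for city, values in by_city.items()
    ]
    return sorted(summary, key=lambda row: row["mean"], reverse=True)

table = summarise_by_city(records)

# Ratio of the worst city's mean to the cleanest city's mean.
gap = table[0]["mean"] / table[-1]["mean"]
```

The `gap` ratio is the same kind of worst-to-cleanest comparison as the ~13× figure quoted in the insight, here computed on toy numbers only.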

4. AQI Category Mix

Figure 3 — Number of city-days in each AQI category.
Insight: Roughly 40% of recorded days fall in the Good or Satisfactory range. But the combined Poor / Very Poor / Severe days (~6,500 city-days) represent serious public-health events that justify an early-warning system.
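For reference, an AQI value can be mapped to a category with a short lookup. The sketch below assumes the standard CPCB National AQI breakpoints (Good up to 50, Satisfactory up to 100, Moderate up to 200, Poor up to 300, Very Poor up to 400, Severe above 400) and assumes these are the categories counted in Figure 3. The sample values are hypothetical apart from the two city means taken from the Top 5 table.

```python
from collections import Counter

# Inclusive upper bounds of the CPCB National AQI categories (assumed to
# match the categories used in the dataset).
BREAKPOINTS = [
    (50, "Good"),
    (100, "Satisfactory"),
    (200, "Moderate"),
    (300, "Poor"),
    (400, "Very Poor"),
]

def aqi_category(aqi):
    """Return the category label for an AQI value, or None if it is missing."""
    if aqi is None:
        return None
    for upper, label in BREAKPOINTS:
        if aqi <= upper:
            return label
    return "Severe"

# Hypothetical daily values, plus two city means from the Top 5 table.
values = [42, 95, 150, 259.5, 350, 452.1, None]
counts = Counter(aqi_category(v) for v in values if v is not None)

# Days that an early-warning system would flag (Poor or worse).
alert_days = sum(counts[label] for label in ("Poor", "Very Poor", "Severe"))
```

Applied to the city means above, this places Ahmedabad's average day (452.1) in Severe and Delhi's (259.5) in Poor.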

5. Q2 — When is Risk Highest? (Seasonality)

Figure 4 — National mean PM2.5 by month of year.
Insight: A strong, repeatable seasonal cycle. PM2.5 peaks sharply in November–January (crop-residue burning, festivals, cold stagnant air) and bottoms out in the monsoon (July–August) when rain scavenges particulates. Advisories and health-system readiness should therefore be scheduled ahead of the winter window, not applied uniformly year-round.
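A month-of-year profile like the one in Figure 4 pools observations from all years into twelve calendar-month buckets. The sketch below is a minimal Python illustration with hypothetical readings. It pools all city-days equally, which is an assumption: the report does not state how the national mean was weighted.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def monthly_mean(observations):
    """Average value per calendar month (1-12), pooling all years together."""
    buckets = defaultdict(list)
    for day, value in observations:
        if value is not None:  # skip missing readings
            buckets[day.month].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}

# Hypothetical (date, PM2.5) readings, for illustration only.
observations = [
    (date(2018, 1, 5), 120.0),
    (date(2019, 1, 9), 100.0),
    (date(2018, 7, 14), 30.0),
    (date(2019, 7, 2), None),
    (date(2019, 11, 20), 140.0),
]

profile = monthly_mean(observations)
peak_month = max(profile, key=profile.get)    # month with the highest mean
trough_month = min(profile, key=profile.get)  # month with the lowest mean
```

Because cities with more records contribute more city-days, an equal-weight pool leans toward well-monitored cities; averaging per city first and then across cities is an alternative design.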

Annual trend by major city

Figure 5 — Annual mean AQI for six major cities, 2015–2020.

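Several major cities show a gentle downward drift over the period, with a notable dip in 2020 (partly the COVID lockdown, explored below). The 2020 values cover only Jan–Jul, so they omit the November–December part of the winter peak and understate the full-year mean; they are not directly comparable with earlier years.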

6. Q3 — What Drives the AQI? (Correlations)

Figure 6 — Pairwise correlations among pollutants and the AQI.
Insight: The AQI is most strongly associated with PM10 (r≈0.80), CO (≈0.68) and PM2.5 (≈0.66) — particulate matter and combustion products dominate the index. Ozone relates weakly (and even negatively to some primary pollutants), consistent with its secondary, photochemical formation. Particulates are the highest-leverage target for monitoring and regulation.
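Because many pollutant columns have gaps, each pairwise correlation has to be computed on the days where both series are present. The sketch below shows a Pearson r with pairwise-complete observations in plain Python. It is an illustration of the technique, with hypothetical inputs, and the report does not state which missing-value rule was used for Figure 6.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation using only positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)

# Hypothetical series; the last pair is dropped because x is missing.
pm10 = [1.0, 2.0, 3.0, None]
aqi = [2.0, 4.0, 6.0, 100.0]
r = pearson_r(pm10, aqi)
```

One consequence of pairwise deletion is that each coefficient rests on a different subset of days, so a value for a sparse column such as Xylene is based on far fewer observations than one for PM2.5.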

7. Q4 — Do Interventions Work? COVID-19 as a Natural Experiment

Beyond the brief: This section treats the 2020 lockdown as a quasi-experiment to test whether reduced human activity measurably improves air quality — the kind of causal question a business/policy analyst would want answered.
Figure 7 — Delhi PM2.5 (7-day average); dashed line marks the 25 Mar 2020 lockdown.
Insight: In the weeks after the 25 March 2020 national lockdown, Delhi's mean PM2.5 fell by roughly 59% (≈118 → ≈49 µg/m³). The drop is confounded by the onset of the cleaner pre-monsoon season, so not all of it can be attributed to the lockdown. Even so, the magnitude and timing are consistent with curbs on traffic, construction and industry translating rapidly into cleaner air, an encouraging signal for the feasibility of policy intervention.
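The two calculations behind this section are a 7-day rolling average (Figure 7) and a before/after percentage change. The sketch below is a minimal Python illustration. The trailing-window definition and the handling of missing days are assumptions, since the report does not specify them; the two inputs to `percent_change` are the rounded means quoted above.

```python
def rolling_mean(values, window=7):
    """Trailing mean over the last `window` days, ignoring missing values.

    Returns None until a full window of days is available, and None for
    any window in which every value is missing.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
            continue
        chunk = [v for v in values[i + 1 - window : i + 1] if v is not None]
        out.append(sum(chunk) / len(chunk) if chunk else None)
    return out

def percent_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

# Rounded pre- and post-lockdown means quoted in the text (µg/m³).
drop = percent_change(118.0, 49.0)

# Small hypothetical series to show the smoothing.
smoothed = rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
```

With the rounded inputs the change comes out at about -58.5%, in line with the roughly 59% reported from the unrounded data.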

8. Key Findings & Recommendations

# | Finding | Recommended action
1 | Burden is highly uneven (~13× gap; worst: Ahmedabad, then the north-Indian cluster led by Delhi) | Prioritise the worst city-cluster for targeted intervention
2 | Strong winter peak in PM2.5 (Nov–Jan); cleanest in monsoon | Schedule advisories & health-system readiness ahead of winter
3 | AQI tracks particulates (PM10, PM2.5) and CO most closely | Focus monitoring & regulation on particulate sources
4 | ~40% of days Good/Satisfactory, but ~6,500 city-days are Poor, Very Poor or Severe | Build an AQI early-warning & alert system
5 | 2020 lockdown coincided with a ~59% fall in Delhi PM2.5 (seasonally confounded) | Use this evidence, with that caveat, to support traffic/industry controls

9. Limitations & Next Steps

Limitations: coverage is uneven across cities/years (Xylene, PM10, NH3 heavily incomplete); the lockdown analysis overlaps with seasonal change, so a difference-in-differences or interrupted time-series design is needed to isolate the effect; and city-level aggregation hides neighbourhood hotspots.
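To make the difference-in-differences idea concrete, one possible design compares the change across the lockdown date in 2020 with the change across the same calendar dates in an earlier year, so that the ordinary seasonal decline is subtracted out. The sketch below shows only the arithmetic. The choice of an earlier year as the comparison group is an assumption, and every number in it is hypothetical rather than a result from the dataset.

```python
def diff_in_diff(pre_treated, post_treated, pre_control, post_control):
    """Change in the treated group minus change in the comparison group."""
    return (post_treated - pre_treated) - (post_control - pre_control)

# Hypothetical mean PM2.5 values (µg/m³), for illustration only.
# "Treated" = lockdown year; "control" = same calendar windows in a prior year.
pre_treated, post_treated = 100.0, 60.0   # raw change: -40
pre_control, post_control = 100.0, 80.0   # seasonal change alone: -20

effect = diff_in_diff(pre_treated, post_treated, pre_control, post_control)
```

In this toy case half of the raw 40-unit drop is seasonal, leaving a 20-unit reduction attributable to the intervention. The estimate is only as good as the assumption that, absent the lockdown, 2020 would have followed the same seasonal path as the comparison year.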

Next steps: (1) build a predictive model for next-day AQI from lagged pollutants and weather; (2) cluster cities by pollution profile; (3) join external data (meteorology, crop-burning, traffic) to move from correlation toward causal explanation.

This HTML is a rendered preview of the analysis produced by the accompanying air_quality_analysis.Rmd. Knitting that file in RStudio regenerates every figure and table from city_day.csv (included with the submission). All statistics shown were computed directly from the dataset.