Introduction

Air Quality Index (AQI) is a standardized indicator used to communicate how polluted the air currently is or how polluted it is forecast to become. It aggregates multiple pollutants such as PM2.5, PM10, NO2, and SO2 into a single value to represent overall air quality. Monitoring AQI is crucial for public health as poor air quality can lead to respiratory and cardiovascular diseases.

This analysis explores AQI trends across major Indian cities from 2015 to 2023, aiming to identify pollution hotspots, temporal patterns, and pollutant contributions. Understanding these patterns can inform policy decisions and public awareness efforts to improve air quality.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'tidyr' was built under R version 4.5.3
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'ggplot2' was built under R version 4.5.3
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ purrr     1.2.1
## ✔ ggplot2   4.0.2     ✔ stringr   1.6.0
## ✔ lubridate 1.9.5     ✔ tibble    3.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
AQI <- read.csv("india_city_aqi_2015_2023.csv")

View(AQI)       # Spreadsheet view
colnames(AQI)   # Column names
##  [1] "city"         "date"         "pm25"         "pm10"         "no2"         
##  [6] "so2"          "co"           "o3"           "aqi"          "aqi_category"
head(AQI)       # First few rows
##    city       date      pm25      pm10      no2       so2        co       o3
## 1 Delhi 2015-01-01  99.86857 147.10328 49.71533 19.615149 0.7297540 46.48795
## 2 Delhi 2015-01-02 143.16851 208.51721 32.95788 14.712800 0.6609747 43.01405
## 3 Delhi 2015-01-03  89.67849 101.41289 14.12623  9.188562 0.4961507 54.71371
## 4 Delhi 2015-01-04  43.67904  65.43296 61.98473 10.871118 0.8202585 28.62878
## 5 Delhi 2015-01-05  58.22469 110.44314 22.73510 13.878490 0.6198084 45.62459
## 6 Delhi 2015-01-06  55.93174 142.97730 39.79754  6.711445 1.0467635 31.68735
##   aqi aqi_category
## 1 103     Moderate
## 2 141     Moderate
## 3  82 Satisfactory
## 4  50         Good
## 5  69 Satisfactory
## 6  78 Satisfactory

Level 1: Understanding the Data (Basic Exploration)


Q1.1 What is the structure of the dataset(number of rows,columns, and data types)?

str(AQI)
## 'data.frame':    32870 obs. of  10 variables:
##  $ city        : chr  "Delhi" "Delhi" "Delhi" "Delhi" ...
##  $ date        : chr  "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
##  $ pm25        : num  99.9 143.2 89.7 43.7 58.2 ...
##  $ pm10        : num  147.1 208.5 101.4 65.4 110.4 ...
##  $ no2         : num  49.7 33 14.1 62 22.7 ...
##  $ so2         : num  19.62 14.71 9.19 10.87 13.88 ...
##  $ co          : num  0.73 0.661 0.496 0.82 0.62 ...
##  $ o3          : num  46.5 43 54.7 28.6 45.6 ...
##  $ aqi         : int  103 141 82 50 69 78 81 81 89 126 ...
##  $ aqi_category: chr  "Moderate" "Moderate" "Satisfactory" "Good" ...

Interpretation: The structure of the dataset shows how many rows (observations) and columns (variables) it contains, along with the data types. This helps us understand the scope of the data and what kind of analysis is possible.

Q1.2: Are there missing values in the dataset?

colSums(is.na(AQI))
##         city         date         pm25         pm10          no2          so2 
##            0            0            0            0            0            0 
##           co           o3          aqi aqi_category 
##            0            0            0            0

Interpretation: Checking for missing values ensures data quality. If many values are missing, results could be biased or misleading, so this step highlights whether cleaning or imputation is needed.

Q1.3: What is the average AQI per city across all years?

AQI %>%
  group_by(city) %>%
  summarise(avg_AQI = mean(aqi, na.rm = TRUE))
## # A tibble: 10 × 2
##    city      avg_AQI
##    <chr>       <dbl>
##  1 Ahmedabad    88.5
##  2 Bengaluru    88.4
##  3 Chennai      87.9
##  4 Delhi        88.3
##  5 Hyderabad    88.9
##  6 Jaipur       87.6
##  7 Kolkata      88.3
##  8 Lucknow      87.9
##  9 Mumbai       86.9
## 10 Pune         88.6

Interpretation: Calculating the average AQI per city across all years gives a baseline measure of air quality in each location, allowing us to compare which cities consistently face higher pollution levels.

Level 2: Data Extraction & Filtering


Q2.1:Which are the top 10 cities with the highest total AQI values across 2015–2023?

AQI %>%
  group_by(city) %>%
  summarise(total_AQI = sum(aqi, na.rm = TRUE)) %>%
  arrange(desc(total_AQI)) %>%
  head(10)
## # A tibble: 10 × 2
##    city      total_AQI
##    <chr>         <int>
##  1 Hyderabad    292161
##  2 Pune         291200
##  3 Ahmedabad    290982
##  4 Bengaluru    290528
##  5 Kolkata      290244
##  6 Delhi        290163
##  7 Chennai      288946
##  8 Lucknow      288876
##  9 Jaipur       288005
## 10 Mumbai       285487

Interpretation: Summing AQI values across 2015–2023 identifies the top 10 cities with the highest cumulative pollution burden, highlighting long‑term hotspots that need urgent policy attention.

Q2.2: Which 5 cities recorded the highest PM2.5 concentration in any single year?

AQI %>% arrange(desc(pm25)) %>% head(5)
##        city       date     pm25     pm10      no2       so2        co       o3
## 1 Hyderabad 2018-10-26 248.0810 299.7417 30.60253  9.843647 1.0992364 40.99654
## 2    Jaipur 2023-04-03 239.0747 244.6652 57.47341 18.656332 0.8441378 54.64714
## 3   Chennai 2015-01-01 224.2236 314.6235 43.73234  8.120949 0.6706274 52.10713
## 4    Mumbai 2020-12-23 224.0966 270.6069 51.07668 18.937492 0.7412002 48.27587
## 5 Hyderabad 2017-12-08 223.9925 275.1802 31.48040  2.958523 0.9077135 60.13636
##   aqi aqi_category
## 1 221         Poor
## 2 204         Poor
## 3 216         Poor
## 4 203         Poor
## 5 203         Poor

Interpretation: Finding the 5 cities with the highest PM2.5 concentration in a single year shows extreme pollution events and pinpoints where particulate matter spikes most severely.

#Q2.3: Find the top 10 cities with the largest number of “Severe” AQI days.

AQI %>%
  group_by(city) %>%
  summarise(avg_AQI = mean(aqi, na.rm = TRUE)) %>%
  arrange(desc(avg_AQI)) %>%
  head(10)
## # A tibble: 10 × 2
##    city      avg_AQI
##    <chr>       <dbl>
##  1 Hyderabad    88.9
##  2 Pune         88.6
##  3 Ahmedabad    88.5
##  4 Bengaluru    88.4
##  5 Kolkata      88.3
##  6 Delhi        88.3
##  7 Chennai      87.9
##  8 Lucknow      87.9
##  9 Jaipur       87.6
## 10 Mumbai       86.9

Interpretation: Ranking cities by the number of “Severe” AQI days reveals which places experience the worst air quality episodes most frequently, emphasizing chronic exposure risks.

Level 3: Grouping & Summarization


Q3.1: Determine the month with the most pollution spikes (AQI > 200) by counting records per month.

AQI <- AQI %>%
  mutate(month = month(date)) 

AQI %>%
  group_by(month) %>%
  summarise(spike_days = sum(aqi > 200, na.rm = TRUE)) %>%
  arrange(desc(spike_days))
## # A tibble: 12 × 2
##    month spike_days
##    <dbl>      <int>
##  1     1          3
##  2    10          2
##  3    12          2
##  4     3          1
##  5     4          1
##  6     8          1
##  7     2          0
##  8     5          0
##  9     6          0
## 10     7          0
## 11     9          0
## 12    11          0

Interpretation: Counting AQI spikes (>200) by month shows seasonal pollution patterns, such as winter smog or harvest‑related burning, helping explain when air quality deteriorates most.

Q3.2: Find the city with the most “Unhealthy” AQI days by analyzing AQI category occurrences.

AQI %>%
  filter(aqi_category == "Poor") %>%
  group_by(city) %>%
  summarise(unhealthy_days = sum(aqi > 200)) %>%
  arrange(desc(unhealthy_days))
## # A tibble: 7 × 2
##   city      unhealthy_days
##   <chr>              <int>
## 1 Hyderabad              3
## 2 Bengaluru              2
## 3 Chennai                1
## 4 Jaipur                 1
## 5 Kolkata                1
## 6 Lucknow                1
## 7 Mumbai                 1

Interpretation: Identifying the city with the most “Unhealthy” AQI days highlights where poor air quality is most persistent, guiding health risk assessments and interventions.

#Level 4: Sorting & Ranking Data


Q4.1: Rank cities based on average PM2.5 levels (highest to lowest).

AQI %>%
  group_by(city) %>%
  summarise(avg_PM25 = mean(pm25, na.rm = TRUE)) %>%
  arrange(desc(avg_PM25))
## # A tibble: 10 × 2
##    city      avg_PM25
##    <chr>        <dbl>
##  1 Hyderabad     81.6
##  2 Pune          81.4
##  3 Ahmedabad     81.2
##  4 Bengaluru     81.1
##  5 Kolkata       80.9
##  6 Delhi         80.8
##  7 Chennai       80.5
##  8 Lucknow       80.4
##  9 Jaipur        80.3
## 10 Mumbai        79.3

Interpretation: Ranking cities by average PM2.5 levels shows which locations suffer the highest fine particulate pollution, a major health hazard linked to respiratory and cardiovascular diseases.

Q4.2: Find the top 5 cities with the highest NO₂ concentration.

AQI %>%
  arrange(desc(no2)) %>%
  select(city, date, no2) %>%
  head(5)
##      city       date      no2
## 1 Chennai 2018-08-25 97.44673
## 2  Jaipur 2023-10-17 95.98160
## 3  Mumbai 2021-03-16 95.91750
## 4    Pune 2015-06-10 94.85450
## 5 Lucknow 2020-07-14 94.26193

Interpretation: Listing the top 5 cities with the highest NO₂ concentrations highlights areas with heavy traffic or industrial emissions, since NO₂ is a key indicator of combustion pollution.

Q4.3: Identify the year with the highest average AQI across all cities.

AQI %>%
  group_by(date) %>%
  summarise(mean_AQI = mean(aqi, na.rm = TRUE)) %>%
  arrange(desc(mean_AQI)) %>%
  head(1)
## # A tibble: 1 × 2
##   date       mean_AQI
##   <chr>         <dbl>
## 1 2016-10-25     128.

Interpretation: Finding the year with the highest average AQI across all cities reveals the worst overall pollution year, showing temporal trends and whether air quality is improving or worsening.

#Level 5: Feature Engineering (Creating New Insights)


Q5.1: Create a new column for “Total Pollutant Load” (PM2.5 + PM10 + NO₂ + SO₂ + CO + O₃).

AQI <- AQI %>%
  mutate(total_load = pm25 + pm10 + no2 + so2 + co + o3)

AQI %>%
  mutate(year = lubridate::year(date)) %>%
  group_by(city, year) %>%
  summarise(avg_total_load = mean(total_load, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(desc(avg_total_load))
## # A tibble: 90 × 3
##    city       year avg_total_load
##    <chr>     <dbl>          <dbl>
##  1 Hyderabad  2018           325.
##  2 Delhi      2015           323.
##  3 Lucknow    2022           322.
##  4 Ahmedabad  2015           322.
##  5 Bengaluru  2015           322.
##  6 Kolkata    2015           321.
##  7 Ahmedabad  2016           321.
##  8 Ahmedabad  2023           321.
##  9 Jaipur     2019           321.
## 10 Pune       2021           321.
## # ℹ 80 more rows

Interpretation: Creating a “Total Pollutant Load” combines all pollutants into one measure, giving a holistic view of the pollution burden per city and year rather than focusing on a single pollutant.

Q5.2: Calculate “Pollution Intensity Ratio” (Total Pollutant Load / AQI).

AQI <- AQI %>%
  mutate(
    pollution_intensity = total_load / aqi,
    year = year(date)   # make sure year is created here
  )

pollution_summary <- AQI %>%
  group_by(city, year) %>%
  summarise(
    avg_intensity = mean(pollution_intensity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_intensity))

pollution_summary
## # A tibble: 90 × 3
##    city       year avg_intensity
##    <chr>     <dbl>         <dbl>
##  1 Chennai    2021          3.85
##  2 Jaipur     2023          3.84
##  3 Delhi      2023          3.83
##  4 Hyderabad  2017          3.82
##  5 Ahmedabad  2019          3.82
##  6 Chennai    2015          3.82
##  7 Bengaluru  2022          3.82
##  8 Lucknow    2019          3.82
##  9 Mumbai     2022          3.81
## 10 Mumbai     2016          3.81
## # ℹ 80 more rows

Interpretation: The “Pollution Intensity Ratio” (Total Load / AQI) shows how much pollutant mass contributes to each AQI unit, helping compare whether AQI is driven by a dense mix of pollutants or dominated by fewer sources.

#Data Visualization


V1: Bar Chart — Compare average AQI across cities.

ggplot(AQI, aes(x = city, y = aqi, fill = city)) +
  stat_summary(fun = "mean", geom = "bar") +
  labs(title = "Average AQI Across Cities")

V2: Histogram — Visualize distribution of PM2.5 values across all records.

ggplot(AQI, aes(x = pm25)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of PM2.5")

V3: Pie Chart — Show pollutant contribution (PM2.5, PM10, NO₂, SO₂, CO, O₃) for Delhi.

delhi <- AQI %>% filter(city == "Delhi") %>%
  summarise(pm25 = sum(pm25), pm10 = sum(pm10), no2 = sum(no2),
            so2 = sum(so2), co = sum(co), o3 = sum(o3))
pie(as.numeric(delhi), labels = names(delhi), main = "Pollutant Contribution in Delhi")

V4: Line Chart — Show AQI trend over years (2015–2023).

yearly <- AQI %>% group_by(year) %>%
  summarise(mean_AQI = mean(aqi, na.rm = TRUE))
ggplot(yearly, aes(x = year, y = mean_AQI)) +
  geom_line(color = "red",  linewidth = 1.2) +
  labs(title = "AQI Trend (2015–2023)")