Air Quality Index (AQI) is a standardized indicator used to communicate how polluted the air currently is or how polluted it is forecast to become. It aggregates multiple pollutants such as PM2.5, PM10, NO2, and SO2 into a single value to represent overall air quality. Monitoring AQI is crucial for public health as poor air quality can lead to respiratory and cardiovascular diseases.
This analysis explores AQI trends across major Indian cities from 2015 to 2023, aiming to identify pollution hotspots, temporal patterns, and pollutant contributions. Understanding these patterns can inform policy decisions and public awareness efforts to improve air quality.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'tidyr' was built under R version 4.5.3
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'ggplot2' was built under R version 4.5.3
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ purrr 1.2.1
## ✔ ggplot2 4.0.2 ✔ stringr 1.6.0
## ✔ lubridate 1.9.5 ✔ tibble 3.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
AQI <- read.csv("india_city_aqi_2015_2023.csv")
View(AQI) # Spreadsheet view
colnames(AQI) # Column names
## [1] "city" "date" "pm25" "pm10" "no2"
## [6] "so2" "co" "o3" "aqi" "aqi_category"
head(AQI) # First few rows
## city date pm25 pm10 no2 so2 co o3
## 1 Delhi 2015-01-01 99.86857 147.10328 49.71533 19.615149 0.7297540 46.48795
## 2 Delhi 2015-01-02 143.16851 208.51721 32.95788 14.712800 0.6609747 43.01405
## 3 Delhi 2015-01-03 89.67849 101.41289 14.12623 9.188562 0.4961507 54.71371
## 4 Delhi 2015-01-04 43.67904 65.43296 61.98473 10.871118 0.8202585 28.62878
## 5 Delhi 2015-01-05 58.22469 110.44314 22.73510 13.878490 0.6198084 45.62459
## 6 Delhi 2015-01-06 55.93174 142.97730 39.79754 6.711445 1.0467635 31.68735
## aqi aqi_category
## 1 103 Moderate
## 2 141 Moderate
## 3 82 Satisfactory
## 4 50 Good
## 5 69 Satisfactory
## 6 78 Satisfactory
Level 1: Understanding the Data (Basic Exploration)
Q1.1 What is the structure of the dataset(number of rows,columns, and data types)?
str(AQI)
## 'data.frame': 32870 obs. of 10 variables:
## $ city : chr "Delhi" "Delhi" "Delhi" "Delhi" ...
## $ date : chr "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
## $ pm25 : num 99.9 143.2 89.7 43.7 58.2 ...
## $ pm10 : num 147.1 208.5 101.4 65.4 110.4 ...
## $ no2 : num 49.7 33 14.1 62 22.7 ...
## $ so2 : num 19.62 14.71 9.19 10.87 13.88 ...
## $ co : num 0.73 0.661 0.496 0.82 0.62 ...
## $ o3 : num 46.5 43 54.7 28.6 45.6 ...
## $ aqi : int 103 141 82 50 69 78 81 81 89 126 ...
## $ aqi_category: chr "Moderate" "Moderate" "Satisfactory" "Good" ...
Interpretation: The structure of the dataset shows how many rows (observations) and columns (variables) it contains, along with the data types. This helps us understand the scope of the data and what kind of analysis is possible.
colSums(is.na(AQI))
## city date pm25 pm10 no2 so2
## 0 0 0 0 0 0
## co o3 aqi aqi_category
## 0 0 0 0
Interpretation: Checking for missing values ensures data quality. If many values are missing, results could be biased or misleading, so this step highlights whether cleaning or imputation is needed.
AQI %>%
group_by(city) %>%
summarise(avg_AQI = mean(aqi, na.rm = TRUE))
## # A tibble: 10 × 2
## city avg_AQI
## <chr> <dbl>
## 1 Ahmedabad 88.5
## 2 Bengaluru 88.4
## 3 Chennai 87.9
## 4 Delhi 88.3
## 5 Hyderabad 88.9
## 6 Jaipur 87.6
## 7 Kolkata 88.3
## 8 Lucknow 87.9
## 9 Mumbai 86.9
## 10 Pune 88.6
Interpretation: Calculating the average AQI per city across all years gives a baseline measure of air quality in each location, allowing us to compare which cities consistently face higher pollution levels.
Level 2: Data Extraction & Filtering
Q2.1:Which are the top 10 cities with the highest total AQI values across 2015–2023?
AQI %>%
group_by(city) %>%
summarise(total_AQI = sum(aqi, na.rm = TRUE)) %>%
arrange(desc(total_AQI)) %>%
head(10)
## # A tibble: 10 × 2
## city total_AQI
## <chr> <int>
## 1 Hyderabad 292161
## 2 Pune 291200
## 3 Ahmedabad 290982
## 4 Bengaluru 290528
## 5 Kolkata 290244
## 6 Delhi 290163
## 7 Chennai 288946
## 8 Lucknow 288876
## 9 Jaipur 288005
## 10 Mumbai 285487
Interpretation: Summing AQI values across 2015–2023 identifies the top 10 cities with the highest cumulative pollution burden, highlighting long‑term hotspots that need urgent policy attention.
AQI %>% arrange(desc(pm25)) %>% head(5)
## city date pm25 pm10 no2 so2 co o3
## 1 Hyderabad 2018-10-26 248.0810 299.7417 30.60253 9.843647 1.0992364 40.99654
## 2 Jaipur 2023-04-03 239.0747 244.6652 57.47341 18.656332 0.8441378 54.64714
## 3 Chennai 2015-01-01 224.2236 314.6235 43.73234 8.120949 0.6706274 52.10713
## 4 Mumbai 2020-12-23 224.0966 270.6069 51.07668 18.937492 0.7412002 48.27587
## 5 Hyderabad 2017-12-08 223.9925 275.1802 31.48040 2.958523 0.9077135 60.13636
## aqi aqi_category
## 1 221 Poor
## 2 204 Poor
## 3 216 Poor
## 4 203 Poor
## 5 203 Poor
Interpretation: Finding the 5 cities with the highest PM2.5 concentration in a single year shows extreme pollution events and pinpoints where particulate matter spikes most severely.
#Q2.3: Find the top 10 cities with the largest number of “Severe” AQI days.
AQI %>%
group_by(city) %>%
summarise(avg_AQI = mean(aqi, na.rm = TRUE)) %>%
arrange(desc(avg_AQI)) %>%
head(10)
## # A tibble: 10 × 2
## city avg_AQI
## <chr> <dbl>
## 1 Hyderabad 88.9
## 2 Pune 88.6
## 3 Ahmedabad 88.5
## 4 Bengaluru 88.4
## 5 Kolkata 88.3
## 6 Delhi 88.3
## 7 Chennai 87.9
## 8 Lucknow 87.9
## 9 Jaipur 87.6
## 10 Mumbai 86.9
Interpretation: Ranking cities by the number of “Severe” AQI days reveals which places experience the worst air quality episodes most frequently, emphasizing chronic exposure risks.
Q3.1: Determine the month with the most pollution spikes (AQI > 200) by counting records per month.
AQI <- AQI %>%
mutate(month = month(date))
AQI %>%
group_by(month) %>%
summarise(spike_days = sum(aqi > 200, na.rm = TRUE)) %>%
arrange(desc(spike_days))
## # A tibble: 12 × 2
## month spike_days
## <dbl> <int>
## 1 1 3
## 2 10 2
## 3 12 2
## 4 3 1
## 5 4 1
## 6 8 1
## 7 2 0
## 8 5 0
## 9 6 0
## 10 7 0
## 11 9 0
## 12 11 0
Interpretation: Counting AQI spikes (>200) by month shows seasonal pollution patterns, such as winter smog or harvest‑related burning, helping explain when air quality deteriorates most.
Q3.2: Find the city with the most “Unhealthy” AQI days by analyzing AQI category occurrences.
AQI %>%
filter(aqi_category == "Poor") %>%
group_by(city) %>%
summarise(unhealthy_days = sum(aqi > 200)) %>%
arrange(desc(unhealthy_days))
## # A tibble: 7 × 2
## city unhealthy_days
## <chr> <int>
## 1 Hyderabad 3
## 2 Bengaluru 2
## 3 Chennai 1
## 4 Jaipur 1
## 5 Kolkata 1
## 6 Lucknow 1
## 7 Mumbai 1
Interpretation: Identifying the city with the most “Unhealthy” AQI days highlights where poor air quality is most persistent, guiding health risk assessments and interventions.
#Level 4: Sorting & Ranking Data
Q4.1: Rank cities based on average PM2.5 levels (highest to lowest).
AQI %>%
group_by(city) %>%
summarise(avg_PM25 = mean(pm25, na.rm = TRUE)) %>%
arrange(desc(avg_PM25))
## # A tibble: 10 × 2
## city avg_PM25
## <chr> <dbl>
## 1 Hyderabad 81.6
## 2 Pune 81.4
## 3 Ahmedabad 81.2
## 4 Bengaluru 81.1
## 5 Kolkata 80.9
## 6 Delhi 80.8
## 7 Chennai 80.5
## 8 Lucknow 80.4
## 9 Jaipur 80.3
## 10 Mumbai 79.3
Interpretation: Ranking cities by average PM2.5 levels shows which locations suffer the highest fine particulate pollution, a major health hazard linked to respiratory and cardiovascular diseases.
Q4.2: Find the top 5 cities with the highest NO₂ concentration.
AQI %>%
arrange(desc(no2)) %>%
select(city, date, no2) %>%
head(5)
## city date no2
## 1 Chennai 2018-08-25 97.44673
## 2 Jaipur 2023-10-17 95.98160
## 3 Mumbai 2021-03-16 95.91750
## 4 Pune 2015-06-10 94.85450
## 5 Lucknow 2020-07-14 94.26193
Interpretation: Listing the top 5 cities with the highest NO₂ concentrations highlights areas with heavy traffic or industrial emissions, since NO₂ is a key indicator of combustion pollution.
Q4.3: Identify the year with the highest average AQI across all cities.
AQI %>%
group_by(date) %>%
summarise(mean_AQI = mean(aqi, na.rm = TRUE)) %>%
arrange(desc(mean_AQI)) %>%
head(1)
## # A tibble: 1 × 2
## date mean_AQI
## <chr> <dbl>
## 1 2016-10-25 128.
Interpretation: Finding the year with the highest average AQI across all cities reveals the worst overall pollution year, showing temporal trends and whether air quality is improving or worsening.
#Level 5: Feature Engineering (Creating New Insights)
Q5.1: Create a new column for “Total Pollutant Load” (PM2.5 + PM10 + NO₂ + SO₂ + CO + O₃).
AQI <- AQI %>%
mutate(total_load = pm25 + pm10 + no2 + so2 + co + o3)
AQI %>%
mutate(year = lubridate::year(date)) %>%
group_by(city, year) %>%
summarise(avg_total_load = mean(total_load, na.rm = TRUE),
.groups = "drop") %>%
arrange(desc(avg_total_load))
## # A tibble: 90 × 3
## city year avg_total_load
## <chr> <dbl> <dbl>
## 1 Hyderabad 2018 325.
## 2 Delhi 2015 323.
## 3 Lucknow 2022 322.
## 4 Ahmedabad 2015 322.
## 5 Bengaluru 2015 322.
## 6 Kolkata 2015 321.
## 7 Ahmedabad 2016 321.
## 8 Ahmedabad 2023 321.
## 9 Jaipur 2019 321.
## 10 Pune 2021 321.
## # ℹ 80 more rows
Interpretation: Creating a “Total Pollutant Load” combines all pollutants into one measure, giving a holistic view of the pollution burden per city and year rather than focusing on a single pollutant.
Q5.2: Calculate “Pollution Intensity Ratio” (Total Pollutant Load / AQI).
AQI <- AQI %>%
mutate(
pollution_intensity = total_load / aqi,
year = year(date) # make sure year is created here
)
pollution_summary <- AQI %>%
group_by(city, year) %>%
summarise(
avg_intensity = mean(pollution_intensity, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(avg_intensity))
pollution_summary
## # A tibble: 90 × 3
## city year avg_intensity
## <chr> <dbl> <dbl>
## 1 Chennai 2021 3.85
## 2 Jaipur 2023 3.84
## 3 Delhi 2023 3.83
## 4 Hyderabad 2017 3.82
## 5 Ahmedabad 2019 3.82
## 6 Chennai 2015 3.82
## 7 Bengaluru 2022 3.82
## 8 Lucknow 2019 3.82
## 9 Mumbai 2022 3.81
## 10 Mumbai 2016 3.81
## # ℹ 80 more rows
Interpretation: The “Pollution Intensity Ratio” (Total Load / AQI) shows how much pollutant mass contributes to each AQI unit, helping compare whether AQI is driven by a dense mix of pollutants or dominated by fewer sources.
#Data Visualization
V1: Bar Chart — Compare average AQI across cities.
ggplot(AQI, aes(x = city, y = aqi, fill = city)) +
stat_summary(fun = "mean", geom = "bar") +
labs(title = "Average AQI Across Cities")
V2: Histogram — Visualize distribution of PM2.5 values across all records.
ggplot(AQI, aes(x = pm25)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of PM2.5")
V3: Pie Chart — Show pollutant contribution (PM2.5, PM10, NO₂, SO₂, CO, O₃) for Delhi.
delhi <- AQI %>% filter(city == "Delhi") %>%
summarise(pm25 = sum(pm25), pm10 = sum(pm10), no2 = sum(no2),
so2 = sum(so2), co = sum(co), o3 = sum(o3))
pie(as.numeric(delhi), labels = names(delhi), main = "Pollutant Contribution in Delhi")
V4: Line Chart — Show AQI trend over years (2015–2023).
yearly <- AQI %>% group_by(year) %>%
summarise(mean_AQI = mean(aqi, na.rm = TRUE))
ggplot(yearly, aes(x = year, y = mean_AQI)) +
geom_line(color = "red", linewidth = 1.2) +
labs(title = "AQI Trend (2015–2023)")