Air Quality

Author

Yalaguresh G and Bangali Viaks

Problem Statement

Implement an R function that generates a time-series line graph of air pollution (PM2.5 levels) over time, drawing one line per city with the ggplot2 group aesthetic.

Introduction

This document demonstrates how to create a time-series line graph using air quality data.

  • The dataset is obtained from IQAir
  • It contains PM2.5 values for multiple cities across years (2017–2025)
  • We convert the dataset into a structured dataframe suitable for visualization
  • We visualize pollution trends over time using ggplot2
  • We draw a separate line for each city using the group aesthetic
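
To illustrate what the group aesthetic does, here is a minimal sketch on a toy data frame. The name `toy` is hypothetical, and the values are copied (rounded as printed) from the first rows shown later in this document. Mapping `group = City` asks `geom_line()` to connect points within each city only, rather than joining all points into one path.

```r
library(ggplot2)

# Toy data: two cities, three years each (illustrative subset)
toy <- data.frame(
  City = rep(c("Loni", "Hotan"), each = 3),
  Year = rep(2017:2019, times = 2),
  PM25 = c(41.4, 108.4, 43.2, 58.3, 89.7, 111)
)

# group = City produces one line per city
p <- ggplot(toy, aes(x = Year, y = PM25, group = City)) +
  geom_line()

# Inspect the computed layer data: each city gets its own group id
n_groups <- length(unique(layer_data(p)$group))
```

Without the group mapping (and with no other discrete aesthetic such as color), all six points would be joined into a single zig-zag path.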

Step 1: Load Necessary Libraries

We load:

  • ggplot2 → for visualization
  • dplyr → for data manipulation
  • tidyr → for reshaping data
  • readxl → to import Excel dataset
library(ggplot2)
library(dplyr)

library(tidyr)
library(readxl)

Step 2: Load Dataset and Convert to DataFrame

We import the Excel dataset and prepare it for analysis.

data <- read_excel("C:/Users/Yalaguresh/Downloads/air_quality_100.xlsx")

head(data)
# A tibble: 6 × 12
   Rank City     Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
  <dbl> <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1 Loni     India     41.4  108.    43.2   61.1   84.2  116.   102.    48.4
2     2 Hotan    China     58.3   89.7  111.    93.9   72.7   76.5   96.5   66.4
3     3 Byrnihat India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
4     4 Delhi    India     89.5   52.9   53.1   90.3   61     52.9   68.9   81.5
5     5 Faisala… Pakist…   69     78    112.   118.    71.3   50.7  106.    47.1
6     6 Rahim Y… Pakist…   77.4   82.1   52.3   54.9   54.6   91.2   44.9   92.5
# ℹ 1 more variable: `2025` <dbl>
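
The absolute path above only works on the author's machine. A possible alternative, sketched here under the assumption that the workbook is kept in a `data/` folder next to the script (the folder and the helper `check_columns` are hypothetical), is to build a relative path and confirm the expected columns before continuing.

```r
library(readxl)

# Hypothetical location: adjust to wherever the workbook is stored
data_path <- file.path("data", "air_quality_100.xlsx")

# Columns the rest of the analysis relies on
required_cols <- c("Rank", "City", "Country", as.character(2017:2025))

# Hypothetical helper: returns the required columns missing from a data frame
check_columns <- function(df, required) {
  setdiff(required, names(df))
}

if (file.exists(data_path)) {
  data <- read_excel(data_path)
  missing_cols <- check_columns(data, required_cols)
  if (length(missing_cols) > 0) {
    warning("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
} else {
  message("File not found: ", data_path)
}
```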

Step 3: Data Preparation

The dataset is in wide format (years as columns). We convert it into long format for ggplot.

data_long <- data %>%
  pivot_longer(
    cols = `2017`:`2025`,
    names_to = "Year",
    values_to = "PM25"
  )

# Convert Year to numeric
data_long$Year <- as.numeric(data_long$Year)

head(data_long)
# A tibble: 6 × 5
   Rank City  Country  Year  PM25
  <dbl> <chr> <chr>   <dbl> <dbl>
1     1 Loni  India    2017  41.4
2     1 Loni  India    2018 108. 
3     1 Loni  India    2019  43.2
4     1 Loni  India    2020  61.1
5     1 Loni  India    2021  84.2
6     1 Loni  India    2022 116. 
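
As a possible variation, the reshape and the type conversion can be combined in one call using the `names_transform` argument of `pivot_longer()`. The sketch below uses a small hypothetical data frame `wide` with values taken from the rows printed above; it is an illustration, not a replacement for the code in this step.

```r
library(tidyr)

# Toy wide-format data: one row per city, one column per year
wide <- data.frame(
  City = c("Loni", "Hotan"),
  `2017` = c(41.4, 58.3),
  `2018` = c(108.4, 89.7),
  check.names = FALSE  # keep the year column names as-is
)

# names_transform converts the Year labels while pivoting
long <- pivot_longer(
  wide,
  cols = `2017`:`2018`,
  names_to = "Year",
  values_to = "PM25",
  names_transform = list(Year = as.integer)
)
```

The result has one row per city-year pair, with `Year` stored as an integer.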

Step 4: Understand Data Structure

str(data_long)
tibble [900 × 5] (S3: tbl_df/tbl/data.frame)
 $ Rank   : num [1:900] 1 1 1 1 1 1 1 1 1 2 ...
 $ City   : chr [1:900] "Loni" "Loni" "Loni" "Loni" ...
 $ Country: chr [1:900] "India" "India" "India" "India" ...
 $ Year   : num [1:900] 2017 2018 2019 2020 2021 ...
 $ PM25   : num [1:900] 41.4 108.4 43.2 61.1 84.2 ...
summary(data_long)
      Rank            City             Country               Year     
 Min.   :  1.00   Length:900         Length:900         Min.   :2017  
 1st Qu.: 25.75   Class :character   Class :character   1st Qu.:2019  
 Median : 50.50   Mode  :character   Mode  :character   Median :2021  
 Mean   : 50.50                                         Mean   :2021  
 3rd Qu.: 75.25                                         3rd Qu.:2023  
 Max.   :100.00                                         Max.   :2025  
      PM25       
 Min.   : 40.00  
 1st Qu.: 58.48  
 Median : 78.40  
 Mean   : 78.98  
 3rd Qu.: 99.33  
 Max.   :119.90  
# Check year range
range(data_long$Year, na.rm = TRUE)
[1] 2017 2025
# Count entries per year
table(data_long$Year)

2017 2018 2019 2020 2021 2022 2023 2024 2025 
 100  100  100  100  100  100  100  100  100 
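
Two further checks that may be worth running before plotting are a count of missing PM2.5 values and a search for duplicated city-year pairs. The sketch below runs on a hypothetical toy data frame `toy_long`, into which one `NA` has been inserted deliberately for illustration; the same calls would apply to `data_long`.

```r
library(dplyr)

# Toy long-format data with one deliberately missing value
toy_long <- data.frame(
  City = rep(c("Loni", "Hotan"), each = 3),
  Year = rep(2017:2019, times = 2),
  PM25 = c(41.4, 108.4, 43.2, 58.3, 89.7, NA)
)

# Number of missing PM2.5 readings
n_missing <- sum(is.na(toy_long$PM25))

# City-year combinations that occur more than once (ideally none)
dupes <- toy_long %>%
  count(City, Year) %>%
  filter(n > 1)
```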

Step 5: Define Function for Time-Series Plot

Function purpose:

  • Creates a reusable plotting function
  • Uses the group aesthetic to draw one line per city
  • Allows flexible column selection
plot_air_quality <- function(data, x_col, y_col, group_col,
                             title = "Air Quality Trends Over Time") {
  
  ggplot(
    data,
    aes(
      x = .data[[x_col]],
      y = .data[[y_col]],
      color = .data[[group_col]],
      group = .data[[group_col]]
    )
  ) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2) +
    labs(
      title = title,
      x = x_col,
      y = y_col,
      color = group_col
    ) +
    theme_minimal() +
    theme(legend.position = "top")
}
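
Because the function takes column names as strings, a misspelled name only fails once ggplot2 tries to draw the plot. One possible safeguard, sketched here with a hypothetical helper `check_plot_columns` that is not part of the original function, is to verify the names up front.

```r
# Hypothetical helper: stop early if any requested column is absent
check_plot_columns <- function(data, ...) {
  cols <- c(...)
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found: ",
         paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}
```

It could be called as the first line of `plot_air_quality()`, for example `check_plot_columns(data, x_col, y_col, group_col)`.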

Step 6: Generate the Plot

We call the function:

plot_air_quality(
  data_long,
  "Year",
  "PM25",
  "City",
  "Trend of Air Pollution Across Cities"
)
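
With 100 cities, the chart contains 100 lines and a very large legend. A more readable option may be to pass only a subset of cities to the function. The sketch below filters on the `Rank` column using a toy data frame; `toy_long` and the cutoff `top_n_cities` are hypothetical.

```r
library(dplyr)

# Toy long-format data: three cities, two years each
toy_long <- data.frame(
  Rank = rep(1:3, each = 2),
  City = rep(c("Loni", "Hotan", "Byrnihat"), each = 2),
  Year = rep(2017:2018, times = 3),
  PM25 = c(41.4, 108.4, 58.3, 89.7, 76.4, 42.8)
)

top_n_cities <- 2  # hypothetical cutoff

# Keep only the highest-ranked cities
subset_long <- toy_long %>%
  filter(Rank <= top_n_cities)

# The subset can then be plotted with the function defined in Step 5:
# plot_air_quality(subset_long, "Year", "PM25", "City")
```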

Step 7: Additional Visualizations

🔸 Top 10 Polluted Cities (2025)

top10 <- data %>%
  arrange(desc(`2025`)) %>%
  slice(1:10)

top10
# A tibble: 10 × 12
    Rank City    Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
   <dbl> <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1    12 Mullan… India    116     93.2  111.   118     43.4   83.7   68.9   91  
 2    99 City_99 Vietnam   82.7   62    106.    59.3   95.9   74    108.    46.2
 3     3 Byrnih… India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
 4    25 City_25 India     52.4   68.6   87.8   93.8   99.1   76.2   42.1   99  
 5    50 City_50 Turkey   113.    83.5  100.    69.4   73.8   65.5   61.6  102. 
 6    24 City_24 China    103.    67.7  109.    51.1   49    107.   112.    44.9
 7    92 City_92 Turkey    72    114.   107.   119.   115.   112.    91.6   53.2
 8    52 City_52 Vietnam   69.1   61.8   97.6   40.7  114.    76     90    114. 
 9    18 Noida   India     44.8   69.8   43.7   93.6   65.2   53.1   54     43.6
10    66 City_66 Turkey    55.8   67.8   83.8   48.3   68.2   44.4   59.3   57.1
# ℹ 1 more variable: `2025` <dbl>
# geom_col() is equivalent to geom_bar(stat = "identity")
ggplot(top10, aes(x = reorder(City, `2025`), y = `2025`, fill = Country)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Polluted Cities (2025)",
       x = "City",
       y = "PM2.5")

🔸 PM2.5 Distribution

ggplot(data, aes(x=`2025`)) +
  geom_histogram(bins=10, fill="red") +
  labs(title="PM2.5 Distribution (2025)",
       x="PM2.5",
       y="Frequency")

🔸 Country-wise Comparison

ggplot(data, aes(x = Country, y = `2025`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Country-wise Pollution Comparison (2025)",
       x = "Country",
       y = "PM2.5")

Step 8: Advanced Analysis

  • Year-to-Year Change
data <- data %>%
  mutate(change = `2025` - `2024`)

head(data)
# A tibble: 6 × 13
   Rank City     Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
  <dbl> <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1 Loni     India     41.4  108.    43.2   61.1   84.2  116.   102.    48.4
2     2 Hotan    China     58.3   89.7  111.    93.9   72.7   76.5   96.5   66.4
3     3 Byrnihat India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
4     4 Delhi    India     89.5   52.9   53.1   90.3   61     52.9   68.9   81.5
5     5 Faisala… Pakist…   69     78    112.   118.    71.3   50.7  106.    47.1
6     6 Rahim Y… Pakist…   77.4   82.1   52.3   54.9   54.6   91.2   44.9   92.5
# ℹ 2 more variables: `2025` <dbl>, change <dbl>
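
The calculation above covers only 2024 to 2025. To obtain the change for every consecutive pair of years, one option is to work on the long-format data and subtract the previous year's value within each city. The sketch below uses a hypothetical toy data frame `toy_long` with values from the rows printed earlier.

```r
library(dplyr)

# Toy long-format data: two cities, three years each
toy_long <- data.frame(
  City = rep(c("Loni", "Hotan"), each = 3),
  Year = rep(2017:2019, times = 2),
  PM25 = c(41.4, 108.4, 43.2, 58.3, 89.7, 111)
)

# Change relative to the previous year, computed separately per city
changes <- toy_long %>%
  group_by(City) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(change = PM25 - dplyr::lag(PM25)) %>%
  ungroup()
```

The first year of each city has no predecessor, so its `change` is `NA`.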
  • Correlation Analysis
cor(data$`2024`, data$`2025`, use="complete.obs")
[1] -0.0376808
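
A single pair of years gives a narrow view. As a rough sketch, the same correlation can be computed for every pair of adjacent years. The toy data frame `wide` below is hypothetical and holds only four cities, with values rounded as printed in the output above, so the resulting numbers are illustrative only.

```r
# Toy wide-format data: four cities, three years
wide <- data.frame(
  `2017` = c(41.4, 58.3, 76.4, 89.5),
  `2018` = c(108.4, 89.7, 42.8, 52.9),
  `2019` = c(43.2, 111, 57, 53.1),
  check.names = FALSE
)

years <- names(wide)

# Correlation between each year and the following year
pair_cor <- sapply(seq_len(length(years) - 1), function(i) {
  cor(wide[[years[i]]], wide[[years[i + 1]]], use = "complete.obs")
})
names(pair_cor) <- paste(years[-length(years)], years[-1], sep = "-")
```

Values near zero suggest that one year's level tells little about the next; values near 1 suggest persistence.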
  • Heatmap Visualization
ggplot(data_long, aes(x=Year, y=City, fill=PM25)) +
  geom_tile() +
  scale_fill_gradient(low="green", high="red") +
  labs(title="Heatmap of Pollution Levels")

Step 9: Interpretation

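  • Every PM2.5 value in the dataset is 40 or higher (see the summary in Step 4), far above the WHO annual guideline of 5 µg/m³, assuming the values are annual means in µg/m³
  • South Asian cities (India, Pakistan) hold five of the first six ranks shown in Step 2, but the 2025 top 10 is more mixed: India (4), Turkey (3), Vietnam (2), China (1)
  • Pollution levels stay high in every year, although individual cities fluctuate sharply (Loni: 41.4 in 2017, 108.4 in 2018, 43.2 in 2019)
  • Some cities improve between individual years, but the overall level remains concerning
  • The correlation between 2024 and 2025 values is close to zero (about -0.04), so a city's 2024 level says little about its 2025 level; it does not indicate year-to-year persistence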
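
The comparison with WHO limits can be made explicit by computing the share of readings above a threshold. The sketch below assumes the values are annual means in µg/m³ and uses the WHO 2021 annual guideline of 5 µg/m³; the vector `pm25` is a hypothetical sample taken from the Loni rows shown in Step 3.

```r
# WHO 2021 annual guideline for PM2.5, in µg/m³ (units of the data assumed)
who_annual_guideline <- 5

# Sample readings (Loni, 2017-2021)
pm25 <- c(41.4, 108.4, 43.2, 61.1, 84.2)

# Proportion of readings above the guideline
share_above <- mean(pm25 > who_annual_guideline)
```

On the full dataset the same expression would be `mean(data_long$PM25 > who_annual_guideline, na.rm = TRUE)`.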