Air Quality

Author

Yalaguresh G and Bangali Viaks

Problem Statement

Implement an R function that generates a time-series line graph of air pollution (PM2.5 levels) over time, drawing one line per city with the ggplot2 group aesthetic.

Introduction

This document demonstrates how to create a time-series line graph using air quality data.

The dataset is obtained from IQAir.
It contains PM2.5 values for multiple cities across the years 2017–2025.
We convert the dataset into a long-format data frame suitable for visualization.
We visualize pollution trends over time using ggplot2.
We draw a separate line for each city using the group aesthetic.

Step 1: Load Necessary Libraries

We load:

ggplot2 → for visualization
dplyr → for data manipulation
tidyr → for reshaping data
readxl → to import Excel dataset

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(readxl)

Step 2: Load Dataset and Convert to DataFrame

We import the Excel dataset and prepare it for analysis.

data <- read_excel("C:/Users/Yalaguresh/Downloads/air_quality_100.xlsx")

head(data)

# A tibble: 6 × 12
   Rank City     Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
  <dbl> <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1 Loni     India     41.4  108.    43.2   61.1   84.2  116.   102.    48.4
2     2 Hotan    China     58.3   89.7  111.    93.9   72.7   76.5   96.5   66.4
3     3 Byrnihat India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
4     4 Delhi    India     89.5   52.9   53.1   90.3   61     52.9   68.9   81.5
5     5 Faisala… Pakist…   69     78    112.   118.    71.3   50.7  106.    47.1
6     6 Rahim Y… Pakist…   77.4   82.1   52.3   54.9   54.6   91.2   44.9   92.5
# ℹ 1 more variable: `2025` <dbl>

Step 3: Data Preparation

The dataset is in wide format (years as columns). We convert it into long format for ggplot.

data_long <- data %>%
  pivot_longer(
    cols = `2017`:`2025`,
    names_to = "Year",
    values_to = "PM25"
  )

# Convert Year to numeric
data_long$Year <- as.numeric(data_long$Year)

head(data_long)

# A tibble: 6 × 5
   Rank City  Country  Year  PM25
  <dbl> <chr> <chr>   <dbl> <dbl>
1     1 Loni  India    2017  41.4
2     1 Loni  India    2018 108. 
3     1 Loni  India    2019  43.2
4     1 Loni  India    2020  61.1
5     1 Loni  India    2021  84.2
6     1 Loni  India    2022 116.
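The reshape and the numeric conversion can also be done in a single call. The sketch below is a minimal, self-contained illustration: the toy table `wide` is an assumption built from three rows of the output above, and it relies on the `names_transform` argument of `pivot_longer()` (available in tidyr 1.1.0 and later).

```r
library(tidyr)

# Toy wide table (illustrative subset of the values printed above)
wide <- data.frame(
  City   = c("Loni", "Byrnihat", "Delhi"),
  `2017` = c(41.4, 76.4, 89.5),
  `2018` = c(108.4, 42.8, 52.9),
  `2019` = c(43.2, 57, 53.1),
  check.names = FALSE  # keep the year column names as "2017", not "X2017"
)

# Reshape to long format and convert Year to integer in one step
long <- pivot_longer(
  wide,
  cols = -City,
  names_to = "Year",
  values_to = "PM25",
  names_transform = list(Year = as.integer)
)

print(long)
```

Selecting with `cols = -City` avoids hard-coding the first and last year, which may be convenient if more year columns are added later.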

Step 4: Understand Data Structure

str(data_long)

tibble [900 × 5] (S3: tbl_df/tbl/data.frame)
 $ Rank   : num [1:900] 1 1 1 1 1 1 1 1 1 2 ...
 $ City   : chr [1:900] "Loni" "Loni" "Loni" "Loni" ...
 $ Country: chr [1:900] "India" "India" "India" "India" ...
 $ Year   : num [1:900] 2017 2018 2019 2020 2021 ...
 $ PM25   : num [1:900] 41.4 108.4 43.2 61.1 84.2 ...

summary(data_long)

      Rank            City             Country               Year     
 Min.   :  1.00   Length:900         Length:900         Min.   :2017  
 1st Qu.: 25.75   Class :character   Class :character   1st Qu.:2019  
 Median : 50.50   Mode  :character   Mode  :character   Median :2021  
 Mean   : 50.50                                         Mean   :2021  
 3rd Qu.: 75.25                                         3rd Qu.:2023  
 Max.   :100.00                                         Max.   :2025  
      PM25       
 Min.   : 40.00  
 1st Qu.: 58.48  
 Median : 78.40  
 Mean   : 78.98  
 3rd Qu.: 99.33  
 Max.   :119.90

# Check year range
range(data_long$Year, na.rm = TRUE)

[1] 2017 2025

# Count entries per year
table(data_long$Year)


2017 2018 2019 2020 2021 2022 2023 2024 2025 
 100  100  100  100  100  100  100  100  100
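Beyond the counts per year, it can be useful to confirm that there are no missing values and that each city appears only once per year. The following is a sketch under the assumption of a long table with `City`, `Year`, and `PM25` columns; the toy data frame `long` stands in for `data_long`.

```r
library(dplyr)

# Toy long table (illustrative subset of the values printed earlier)
long <- data.frame(
  City = rep(c("Loni", "Byrnihat", "Delhi"), each = 3),
  Year = rep(2017:2019, times = 3),
  PM25 = c(41.4, 108.4, 43.2, 76.4, 42.8, 57, 89.5, 52.9, 53.1)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(long))

# City-Year combinations that occur more than once (should be empty)
duplicate_pairs <- long %>%
  count(City, Year) %>%
  filter(n > 1)

print(missing_per_column)
print(duplicate_pairs)
```

A duplicated City-Year pair would make `geom_line()` draw a vertical jump at that year, so an empty `duplicate_pairs` is a reasonable precondition for the line plot.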

Step 5: Define Function for Time-Series Plot

Function Purpose
Creates a reusable plotting function.
Uses the group aesthetic to draw one line per city.
Accepts column names as strings, so any x, y, and grouping columns can be selected.

plot_air_quality <- function(data, x_col, y_col, group_col,
                             title = "Air Quality Trends Over Time") {
  
  ggplot(
    data,
    aes(
      x = .data[[x_col]],
      y = .data[[y_col]],
      color = .data[[group_col]],
      group = .data[[group_col]]
    )
  ) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2) +
    labs(
      title = title,
      x = x_col,
      y = y_col,
      color = group_col
    ) +
    theme_minimal() +
    theme(legend.position = "top")
}
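Because the column names are passed as strings, a misspelled name only fails when the plot is rendered. One possible safeguard is to check the names first. The helper below, `check_columns`, is hypothetical and not part of the original function; it is a minimal sketch using base R only.

```r
# Hypothetical helper: stop early if any requested column is absent
check_columns <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "Column(s) not found in data: ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy data frame used only to demonstrate the check
toy <- data.frame(Year = 2017, PM25 = 41.4, City = "Loni")

check_columns(toy, c("Year", "PM25", "City"))   # passes silently

# A wrong name produces a clear error message
result <- tryCatch(
  check_columns(toy, "Country"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Such a check could be called on the first line of `plot_air_quality()` with `c(x_col, y_col, group_col)`.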

Step 6: Generate the Plot

We call the function:

plot_air_quality(
  data_long,
  "Year",
  "PM25",
  "City",
  "Trend of Air Pollution Across Cities"
)
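With 100 cities, the call above draws 100 lines and 100 legend entries, which can be hard to read. One option is to filter to a few cities before plotting. The sketch below is self-contained and uses a toy table; the vector `cities_of_interest` is an assumption chosen for illustration.

```r
library(ggplot2)
library(dplyr)

# Toy long table (illustrative subset of the values printed earlier)
long <- data.frame(
  City = rep(c("Loni", "Byrnihat", "Delhi"), each = 3),
  Year = rep(2017:2019, times = 3),
  PM25 = c(41.4, 108.4, 43.2, 76.4, 42.8, 57, 89.5, 52.9, 53.1)
)

# Keep only the cities to be compared
cities_of_interest <- c("Loni", "Delhi")
subset_long <- long %>%
  filter(City %in% cities_of_interest)

# One line per city via the group aesthetic
p <- ggplot(subset_long, aes(x = Year, y = PM25, color = City, group = City)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = 2017:2019) +  # whole-year tick marks
  labs(title = "PM2.5 Trend for Selected Cities", x = "Year", y = "PM25") +
  theme_minimal() +
  theme(legend.position = "top")

print(p)
```

The same filtered table could be passed to `plot_air_quality()` in place of `data_long`.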

Step 7: Additional Visualizations

🔸 Top 10 Polluted Cities (2025)

top10 <- data %>%
  arrange(desc(`2025`)) %>%
  slice(1:10)

top10

# A tibble: 10 × 12
    Rank City    Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
   <dbl> <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1    12 Mullan… India    116     93.2  111.   118     43.4   83.7   68.9   91  
 2    99 City_99 Vietnam   82.7   62    106.    59.3   95.9   74    108.    46.2
 3     3 Byrnih… India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
 4    25 City_25 India     52.4   68.6   87.8   93.8   99.1   76.2   42.1   99  
 5    50 City_50 Turkey   113.    83.5  100.    69.4   73.8   65.5   61.6  102. 
 6    24 City_24 China    103.    67.7  109.    51.1   49    107.   112.    44.9
 7    92 City_92 Turkey    72    114.   107.   119.   115.   112.    91.6   53.2
 8    52 City_52 Vietnam   69.1   61.8   97.6   40.7  114.    76     90    114. 
 9    18 Noida   India     44.8   69.8   43.7   93.6   65.2   53.1   54     43.6
10    66 City_66 Turkey    55.8   67.8   83.8   48.3   68.2   44.4   59.3   57.1
# ℹ 1 more variable: `2025` <dbl>

ggplot(top10, aes(x=reorder(City, `2025`), y=`2025`, fill=Country)) +
  geom_col() +
  coord_flip() +
  labs(title="Top 10 Polluted Cities (2025)",
       x="City",
       y="PM2.5")
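The `arrange()` plus `slice()` selection can also be written with `slice_max()` from dplyr. The sketch below uses a toy table with a `2024` column taken from the first rows printed in Step 2, because the `2025` values are not visible in that output; the column choice and `n = 2` are assumptions for illustration.

```r
library(dplyr)

# Toy wide table (2024 values from the head() output in Step 2)
wide <- data.frame(
  City   = c("Loni", "Hotan", "Byrnihat", "Delhi"),
  `2024` = c(48.4, 66.4, 47.7, 81.5),
  check.names = FALSE
)

# Rows with the two largest 2024 values, highest first
top2 <- wide %>%
  slice_max(order_by = `2024`, n = 2)

print(top2)
```

By default `slice_max()` keeps ties (`with_ties = TRUE`), so it may return more than `n` rows when values are equal, whereas `slice(1:10)` always returns exactly ten.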

🔸 PM2.5 Distribution

ggplot(data, aes(x=`2025`)) +
  geom_histogram(bins=10, fill="red") +
  labs(title="PM2.5 Distribution (2025)",
       x="PM2.5",
       y="Frequency")

🔸 Country-wise Comparison

ggplot(data, aes(x=Country, y=`2025`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=90)) +
  labs(title="Country-wise Pollution Comparison")

Step 8: Advanced Analysis

Year-to-Year Change

data <- data %>%
  mutate(change = `2025` - `2024`)

head(data)

# A tibble: 6 × 13
   Rank City     Country `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024`
  <dbl> <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1 Loni     India     41.4  108.    43.2   61.1   84.2  116.   102.    48.4
2     2 Hotan    China     58.3   89.7  111.    93.9   72.7   76.5   96.5   66.4
3     3 Byrnihat India     76.4   42.8   57     99.5   60.2  111.   107.    47.7
4     4 Delhi    India     89.5   52.9   53.1   90.3   61     52.9   68.9   81.5
5     5 Faisala… Pakist…   69     78    112.   118.    71.3   50.7  106.    47.1
6     6 Rahim Y… Pakist…   77.4   82.1   52.3   54.9   54.6   91.2   44.9   92.5
# ℹ 2 more variables: `2025` <dbl>, change <dbl>
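The `change` column above covers only 2024 to 2025. To obtain the change between every pair of consecutive years, one approach is to work on the long table and use `lag()` within each city. This is a sketch on a toy table standing in for `data_long`.

```r
library(dplyr)

# Toy long table (illustrative subset of the values printed earlier)
long <- data.frame(
  City = rep(c("Loni", "Byrnihat", "Delhi"), each = 3),
  Year = rep(2017:2019, times = 3),
  PM25 = c(41.4, 108.4, 43.2, 76.4, 42.8, 57, 89.5, 52.9, 53.1)
)

# Change relative to the previous year, computed separately per city
changes <- long %>%
  arrange(City, Year) %>%
  group_by(City) %>%
  mutate(change = PM25 - lag(PM25)) %>%
  ungroup()

print(changes)
```

The first year of each city has no predecessor, so its `change` is `NA`.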

Correlation Analysis

cor(data$`2024`, data$`2025`, use="complete.obs")

[1] -0.0376808
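A coefficient this close to zero suggests almost no linear relationship between the 2024 and 2025 values. To see whether that holds for other pairs of years, the full correlation matrix across the year columns can be computed in one call. The sketch below uses a toy wide table with three year columns as a stand-in for `data`.

```r
# Toy wide table (illustrative subset of the values printed earlier)
wide <- data.frame(
  City   = c("Loni", "Byrnihat", "Delhi"),
  `2017` = c(41.4, 76.4, 89.5),
  `2018` = c(108.4, 42.8, 52.9),
  `2019` = c(43.2, 57, 53.1),
  check.names = FALSE
)

# Pairwise Pearson correlations between all year columns
year_cols <- c("2017", "2018", "2019")
year_cor <- cor(wide[, year_cols], use = "pairwise.complete.obs")

print(round(year_cor, 2))
```

With only three rows the toy values are not meaningful; on the full table, `year_cols` would be the nine year columns.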

Heatmap Visualization

ggplot(data_long, aes(x=Year, y=City, fill=PM25)) +
  geom_tile() +
  scale_fill_gradient(low="green", high="red") +
  labs(title="Heatmap of Pollution Levels")

Step 9: Interpretation

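All cities show PM2.5 levels far above the WHO annual guideline of 5 µg/m³ (assuming the values are in µg/m³): the lowest value in the dataset is 40.
South Asian cities dominate the top of the ranking: five of the first six rows (ranks 1 to 6) are cities in South Asia.
Pollution levels remain high throughout 2017–2025, with a mean of 78.98 across all 900 city-year values.
Some cities improve in individual years, but values swing widely (Loni: 41.4 in 2017, 108.4 in 2018, 43.2 in 2019) and levels stay high overall.
The 2024–2025 correlation is close to zero (about -0.04), so a city's 2024 level says little about its 2025 level; pollution is high everywhere, but the relative ordering of cities is not stable from year to year.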
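The comparison with WHO limits can be made explicit by computing the share of observations above a threshold. The sketch below assumes the values are in µg/m³ and uses the WHO 2021 annual PM2.5 guideline of 5 µg/m³; the variable name `who_annual_guideline` and the toy table are assumptions for illustration.

```r
# Toy long table (illustrative subset of the values printed earlier)
long <- data.frame(
  City = rep(c("Loni", "Byrnihat", "Delhi"), each = 3),
  Year = rep(2017:2019, times = 3),
  PM25 = c(41.4, 108.4, 43.2, 76.4, 42.8, 57, 89.5, 52.9, 53.1)
)

# Assumed threshold: WHO 2021 annual PM2.5 guideline, in µg/m³
who_annual_guideline <- 5

# Share of city-year observations above the guideline
share_above <- mean(long$PM25 > who_annual_guideline, na.rm = TRUE)

# How many times the guideline each observation represents
long$times_guideline <- long$PM25 / who_annual_guideline

print(share_above)
print(range(long$times_guideline))
```

Applied to `data_long`, the same two lines would quantify the first point of the interpretation instead of leaving it qualitative.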