Correlation Network Graph - WHO Air Quality Dataset

Author

TEAM 16

Introduction

This project analyzes relationships between air pollution variables using a correlation network graph. The dataset contains city-level measurements of PM2.5, PM10, and NO2, together with a fourth column that the analysis labels Ozone.


Load Required Libraries

This step loads the libraries needed for data import, processing, and visualization.

library(readxl)
Warning: package 'readxl' was built under R version 4.5.3
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(igraph)
Warning: package 'igraph' was built under R version 4.5.3

Attaching package: 'igraph'
The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union
The following objects are masked from 'package:stats':

    decompose, spectrum
The following object is masked from 'package:base':

    union
library(ggraph)
Warning: package 'ggraph' was built under R version 4.5.3

Install Packages (Run Only Once)

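This step installs the required packages if they are not already installed. Run it once, before loading the libraries; calling install.packages() on a package that is already loaded only produces an "is in use and will not be installed" warning.

for (pkg in c("readxl", "ggraph")) {
  # Install only when the package is missing
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}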

Load Dataset

Here we load the dataset from a local CSV file and display the first few rows.

data <- read.csv("C:/Users/Prathee Gowda/Documents/Air_Quality.csv")
head(data)
                    WHO.Region ISO3 WHO.Country.Name City.or.Locality
1 Eastern Mediterranean Region  AFG      Afghanistan            Kabul
2              European Region  ALB          Albania           Durres
3              European Region  ALB          Albania           Durres
4              European Region  ALB          Albania          Elbasan
5              European Region  ALB          Albania          Elbasan
6              European Region  ALB          Albania          Elbasan
  Measurement.Year PM2.5...g.m3. PM10...g.m3. NO2...g.m3.
1             2019        119.77           NA          NA
2             2015            NA        17.65       26.63
3             2016         14.32        24.56       24.78
4             2015            NA           NA       23.96
5             2016            NA           NA       26.26
6             2017            NA           NA       24.70
  PM25.temporal.coverage.... PM10.temporal.coverage....
1                         18                         NA
2                         NA                         NA
3                         NA                         NA
4                         NA                         NA
5                         NA                         NA
6                         NA                         NA
  NO2.temporal.coverage....
1                        NA
2                  83.96119
3                  87.93260
4                  97.85388
5                  96.04964
6                  89.29224
                                                                Reference
1 U.S. Department of State, United States Environmental Protection Agency
2                        European Environment Agency (downloaded in 2021)
3                        European Environment Agency (downloaded in 2021)
4                        European Environment Agency (downloaded in 2021)
5                        European Environment Agency (downloaded in 2021)
6                        European Environment Agency (downloaded in 2021)
  Number.and.type.of.monitoring.stations Version.of.the.database Status
1                                   <NA>                    2022     NA
2                                   <NA>                    2022     NA
3                                   <NA>                    2022     NA
4                                   <NA>                    2022     NA
5                                   <NA>                    2022     NA
6                                   <NA>                    2022     NA

Check Column Names

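This step displays the column names to understand the dataset structure. The runs of dots in names such as "PM2.5...g.m3." are characters that read.csv() replaced to make the names syntactically valid.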

names(data)
 [1] "WHO.Region"                            
 [2] "ISO3"                                  
 [3] "WHO.Country.Name"                      
 [4] "City.or.Locality"                      
 [5] "Measurement.Year"                      
 [6] "PM2.5...g.m3."                         
 [7] "PM10...g.m3."                          
 [8] "NO2...g.m3."                           
 [9] "PM25.temporal.coverage...."            
[10] "PM10.temporal.coverage...."            
[11] "NO2.temporal.coverage...."             
[12] "Reference"                             
[13] "Number.and.type.of.monitoring.stations"
[14] "Version.of.the.database"               
[15] "Status"                                
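The dotted names above come from read.csv() sanitizing the header row. The sketch below shows the effect on a small made-up CSV string (the header text and values are assumptions for illustration, written with ASCII "ug" instead of the micro sign). The three dots in "PM2.5...g.m3." presumably stand for a space, an opening parenthesis, and a micro sign that was not recognized when the file was read; that is an inference from the output, not something the file itself confirms.

```r
# Hypothetical two-column CSV with unit labels in the header (illustration only)
csv_text <- "PM2.5 (ug/m3),NO2 (ug/m3)\n14.32,24.78"

# Default: check.names = TRUE replaces invalid characters with dots
sanitized <- read.csv(text = csv_text)

# check.names = FALSE keeps the header exactly as written
original <- read.csv(text = csv_text, check.names = FALSE)

print(names(sanitized))
print(names(original))
```

Either way, renaming the columns to short names right after loading avoids having to type the dotted forms later.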

Data Cleaning and Renaming

We rename the columns by position to shorter, consistent names.

Rename properly

colnames(data)[1:5] <- c("Region", "CountryCode", "Country", "City", "Year")
colnames(data)[6]   <- "PM25"
colnames(data)[7]   <- "PM10"
colnames(data)[8]   <- "NO2"
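# Caution: column 11 is "NO2.temporal.coverage....", the NO2 temporal coverage,
# not an ozone measurement; the column list above contains no ozone variable,
# so "Ozone" is only a label here.
colnames(data)[11]  <- "Ozone"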
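Renaming by position depends on the column order staying the same. A minimal sketch of renaming by name instead is shown below; it fails loudly if an expected column is missing. The one-row data frame `toy`, the lookup vector `rename_map`, and the label `NO2_coverage` are hypothetical names introduced for this illustration; only the old column names are taken from the names(data) output.

```r
# Hypothetical one-row data frame using column names printed by names(data)
toy <- data.frame(
  WHO.Region = "European Region",
  ISO3 = "ALB",
  WHO.Country.Name = "Albania",
  City.or.Locality = "Durres",
  Measurement.Year = 2016L,
  PM2.5...g.m3. = 14.32,
  PM10...g.m3. = 24.56,
  NO2...g.m3. = 24.78,
  NO2.temporal.coverage.... = 87.9326,
  check.names = FALSE
)

# Old name -> new name; the new names are illustrative choices
rename_map <- c(
  "WHO.Country.Name"          = "Country",
  "City.or.Locality"          = "City",
  "Measurement.Year"          = "Year",
  "PM2.5...g.m3."             = "PM25",
  "PM10...g.m3."              = "PM10",
  "NO2...g.m3."               = "NO2",
  "NO2.temporal.coverage...." = "NO2_coverage"
)

# Locate each old name and stop if any is absent
idx <- match(names(rename_map), names(toy))
stopifnot(!anyNA(idx))
names(toy)[idx] <- unname(rename_map)

print(names(toy))
```

With this approach a reordered or re-exported file cannot silently assign a pollutant label to the wrong column.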

Clean Column Names

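We standardize any remaining raw pollutant labels. After the positional renaming above, none of these patterns occurs in the column names, so the calls leave the names unchanged. Setting fixed = TRUE makes each pattern match literally rather than as a regular expression; the original PM10-to-PM10 substitution is dropped because it replaced a string with itself.

names(data) <- gsub("PM2.5", "PM25", names(data), fixed = TRUE)
names(data) <- gsub("NO_2", "NO2", names(data), fixed = TRUE)
names(data) <- gsub("O3", "Ozone", names(data), fixed = TRUE)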
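Two details of gsub() matter for name cleaning like this: the pattern is a regular expression unless fixed = TRUE is given, so "." matches any character, and a pattern matches anywhere inside a string, so "O3" also hits "ISO3". The sketch below demonstrates both on made-up strings; the variable names are illustrative only.

```r
# "." as a regex wildcard: "PM2x5" matches the pattern "PM2.5"
dot_as_regex <- gsub("PM2.5", "PM25", "PM2x5")

# fixed = TRUE treats "." literally, so "PM2x5" is left alone
dot_literal <- gsub("PM2.5", "PM25", "PM2x5", fixed = TRUE)

# Substring match: "O3" is found inside "ISO3"
partial <- gsub("O3", "Ozone", "ISO3")

# Anchoring with ^ and $ restricts the match to the whole string
anchored <- sub("^O3$", "Ozone", c("ISO3", "O3"))

print(dot_as_regex)  # "PM25"
print(dot_literal)   # "PM2x5"
print(partial)       # "ISOzone"
print(anchored)      # "ISO3" "Ozone"
```

In this notebook the substring match is harmless only because "ISO3" had already been renamed to "CountryCode" before the substitution ran.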

Additional Column Fixing

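An exact-match pass as a final check that the variable names are consistent. With the names already set above, these assignments change nothing.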

colnames(data)[colnames(data) == "PM2.5"] <- "PM25"
colnames(data)[colnames(data) == "PM10"]  <- "PM10"
colnames(data)[colnames(data) == "NO_2"]  <- "NO2"
colnames(data)[colnames(data) == "O3"]    <- "Ozone"

Select Relevant Columns

We select only the columns needed for the analysis.

clean_data <- data[, c("Country","City","Year","PM25","PM10","NO2","Ozone")]
head(clean_data)
      Country    City Year   PM25  PM10   NO2    Ozone
1 Afghanistan   Kabul 2019 119.77    NA    NA       NA
2     Albania  Durres 2015     NA 17.65 26.63 83.96119
3     Albania  Durres 2016  14.32 24.56 24.78 87.93260
4     Albania Elbasan 2015     NA    NA 23.96 97.85388
5     Albania Elbasan 2016     NA    NA 26.26 96.04964
6     Albania Elbasan 2017     NA    NA 24.70 89.29224

Adjust Column Names

This reassigns the names of columns 3 to 5. They already hold these names from the earlier rename, so the step has no effect.

colnames(data)[c(3:5)] <- c("Country","City","Year")

Prepare Clean Dataset

We rebuild the cleaned dataset for analysis; the result is identical to the selection above.

clean_data <- data[, c("Country","City","Year","PM25","PM10","NO2","Ozone")]
head(clean_data)
      Country    City Year   PM25  PM10   NO2    Ozone
1 Afghanistan   Kabul 2019 119.77    NA    NA       NA
2     Albania  Durres 2015     NA 17.65 26.63 83.96119
3     Albania  Durres 2016  14.32 24.56 24.78 87.93260
4     Albania Elbasan 2015     NA    NA 23.96 97.85388
5     Albania Elbasan 2016     NA    NA 26.26 96.04964
6     Albania Elbasan 2017     NA    NA 24.70 89.29224

Check Data Structure

We check the type of each variable. The four measurement columns are numeric and contain missing values (NA).

str(clean_data)
'data.frame':   32191 obs. of  7 variables:
 $ Country: chr  "Afghanistan" "Albania" "Albania" "Albania" ...
 $ City   : chr  "Kabul" "Durres" "Durres" "Elbasan" ...
 $ Year   : int  2019 2015 2016 2015 2016 2017 2015 2016 2014 2015 ...
 $ PM25   : num  119.8 NA 14.3 NA NA ...
 $ PM10   : num  NA 17.6 24.6 NA NA ...
 $ NO2    : num  NA 26.6 24.8 24 26.3 ...
 $ Ozone  : num  NA 84 87.9 97.9 96 ...

Select Numeric Data

We extract only the numeric measurement variables for the correlation analysis.

numeric_data <- clean_data[, c("PM25","PM10","NO2","Ozone")]
head(numeric_data)
    PM25  PM10   NO2    Ozone
1 119.77    NA    NA       NA
2     NA 17.65 26.63 83.96119
3  14.32 24.56 24.78 87.93260
4     NA    NA 23.96 97.85388
5     NA    NA 26.26 96.04964
6     NA    NA 24.70 89.29224

Create Correlation Matrix

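This computes the pairwise Pearson correlations between the variables. With use = "complete.obs", only rows in which all four variables are present contribute to the result.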

cor_matrix <- cor(numeric_data, use = "complete.obs")
cor_matrix
            PM25       PM10          NO2        Ozone
PM25   1.0000000  0.8902050  0.318011407 -0.274277593
PM10   0.8902050  1.0000000  0.257306174 -0.313350671
NO2    0.3180114  0.2573062  1.000000000 -0.004052908
Ozone -0.2742776 -0.3133507 -0.004052908  1.000000000
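Because the data contain many NA values, the choice of the `use` argument determines which rows enter each correlation. The sketch below contrasts "complete.obs" with "pairwise.complete.obs" on a small invented data frame (the columns a, b, c and their values are assumptions, not taken from the dataset). It is meant as an illustration of the two options, not as a claim about which one suits this analysis.

```r
# Invented data with scattered missing values
toy <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, 4, 5, 8, 10, NA),
  c = c(NA, 1, 0, 3, 2, 5)
)

# Rows with no NA in any column; only these are used by "complete.obs"
n_complete <- sum(complete.cases(toy))

# Listwise deletion: identical to dropping incomplete rows first
cor_complete <- cor(toy, use = "complete.obs")
cor_dropped  <- cor(na.omit(toy))

# Pairwise deletion: each pair uses every row where both values exist
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(n_complete)
print(cor_complete)
print(cor_pairwise)
```

Running sum(complete.cases(numeric_data)) on the real data would show how many of the 32191 rows actually contribute to the matrix above.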

Create Edge List (Strong Correlations)

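We keep the pairs whose absolute correlation exceeds 0.6, excluding the diagonal. Because the correlation matrix is symmetric, each pair appears twice in the edge list, once in each direction.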

edges <- which(abs(cor_matrix) > 0.6 & cor_matrix != 1, arr.ind = TRUE)

edge_list <- data.frame(
  from   = rownames(cor_matrix)[edges[,1]],
  to     = colnames(cor_matrix)[edges[,2]],
  weight = cor_matrix[edges]
)

head(edge_list)
  from   to   weight
1 PM10 PM25 0.890205
2 PM25 PM10 0.890205
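The duplicate rows arise because which() scans both triangles of the symmetric matrix. One way to keep each pair once is to restrict the search to the upper triangle, which also excludes the diagonal without comparing against 1. The sketch below rebuilds the matrix from the values printed above; the helper name `strong_edges` is hypothetical.

```r
# Correlation matrix re-entered from the printed output above
vars <- c("PM25", "PM10", "NO2", "Ozone")
cor_matrix <- matrix(
  c( 1.0000000,  0.8902050,  0.318011407, -0.274277593,
     0.8902050,  1.0000000,  0.257306174, -0.313350671,
     0.3180114,  0.2573062,  1.000000000, -0.004052908,
    -0.2742776, -0.3133507, -0.004052908,  1.000000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical helper: one row per variable pair above the threshold
strong_edges <- function(m, threshold) {
  keep <- upper.tri(m) & abs(m) > threshold   # upper triangle only
  idx  <- which(keep, arr.ind = TRUE)
  data.frame(
    from   = rownames(m)[idx[, "row"]],
    to     = colnames(m)[idx[, "col"]],
    weight = m[idx]
  )
}

edges_06 <- strong_edges(cor_matrix, 0.6)
edges_03 <- strong_edges(cor_matrix, 0.3)

print(edges_06)
print(edges_03)
```

At the 0.6 threshold this yields a single PM25 and PM10 pair; at 0.3 it yields three pairs instead of six rows.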

Create Graph Object

We convert the edge list into an undirected igraph object. The two rows of the edge list show up as two parallel PM10--PM25 edges.

g <- graph_from_data_frame(edge_list, directed = FALSE)
g
IGRAPH d64b7a6 UNW- 2 2 -- 
+ attr: name (v/c), weight (e/n)
+ edges from d64b7a6 (vertex names):
[1] PM10--PM25 PM10--PM25

Adjust Threshold for Better Visualization

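We lower the threshold to 0.3 to include more relationships. In addition to the PM25 and PM10 pair, this admits PM25 with NO2 (0.32) and PM10 with Ozone (-0.31).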

edges <- which(abs(cor_matrix) > 0.3 & cor_matrix != 1, arr.ind = TRUE)

edge_list <- data.frame(
  from   = rownames(cor_matrix)[edges[,1]],
  to     = colnames(cor_matrix)[edges[,2]],
  weight = cor_matrix[edges]
)

Plot Correlation Network Graph

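The final plot draws each variable as a node and each retained correlation as an edge. Edge width shows the absolute correlation, and edge color shows its sign and value.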

g <- graph_from_data_frame(edge_list, directed = FALSE)

ggraph(g, layout = "fr", weights = abs(E(g)$weight)) +
  geom_edge_link(aes(width = abs(weight), color = weight), alpha = 0.8) +
  geom_node_point(size = 8, color = "skyblue") +
  geom_node_text(aes(label = name), repel = TRUE, size = 5, fontface = "bold") +
  scale_edge_color_gradient(low = "red", high = "blue", transform = "identity") +
  labs(title = "Enriched Correlation Network of WHO Air Quality Variables",
       subtitle = "Edges represent absolute correlations above 0.3",
       edge_color = "Correlation") +
  theme_void()
Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
ℹ Please use the `transform` argument instead.

Insights

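PM2.5 and PM10 show a strong positive correlation (r ≈ 0.89); this is the only pair above the 0.6 threshold.

NO2 is only weakly correlated with PM2.5 (r ≈ 0.32) and PM10 (r ≈ 0.26), which indicates limited linear dependence.

The variable labeled Ozone is column 11 of the raw data, "NO2.temporal.coverage....", so its weak negative correlations with PM2.5 (r ≈ -0.27) and PM10 (r ≈ -0.31) describe NO2 monitoring coverage rather than ozone concentration.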

Conclusion

The correlation network graph makes the relationships between the air quality variables easy to read at a glance: PM2.5 and PM10 form the only strong link, while the remaining connections are weak.