Introduction

In this second paper, I perform Principal Component Analysis (PCA) to reduce the dimensionality of the air pollution dataset. My goal is to understand which pollutants contribute most to the total variance in air quality and to represent the data in a simpler way.

# Loading necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
library(FactoMineR)
# Loading the dataset
df <- read.csv("global air pollution dataset.csv")

# Selecting only numerical columns for PCA
# We take AQI Value, CO, Ozone, NO2, and PM2.5
df_numeric <- df %>% 
  select(AQI.Value, CO.AQI.Value, Ozone.AQI.Value, NO2.AQI.Value, PM2.5.AQI.Value) %>%
  drop_na()

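# Standardizing the variables (mean 0, sd 1) so that no pollutant dominates because of its range
# Note: PCA() from FactoMineR also standardizes by default (scale.unit = TRUE)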
df_scaled <- scale(df_numeric)
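
As a quick sanity check on the standardization step, here is a minimal sketch in base R. It uses a small simulated data frame (`toy`, with made-up columns `a`, `b`, `c`; these are assumptions for illustration, not the air pollution data) to show that `scale()` gives every column mean 0 and standard deviation 1, and that PCA on standardized data amounts to an eigen-decomposition of the correlation matrix.

```r
# Toy stand-in for the pollutant columns (hypothetical values, not the real data)
set.seed(42)
toy <- data.frame(
  a = rnorm(200, mean = 50, sd = 20),
  b = rnorm(200, mean = 2,  sd = 1)
)
toy$c <- 0.8 * toy$a + rnorm(200, sd = 5)  # deliberately correlated with a

toy_scaled <- scale(toy)

# After scaling, every column has mean ~0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)

# Eigenvalues of the correlation matrix vs. variances of the principal components
eig_from_cor <- eigen(cor(toy))$values
eig_from_pca <- prcomp(toy_scaled)$sdev^2

print(round(col_means, 10))
print(col_sds)
print(all.equal(eig_from_cor, eig_from_pca))
```

Because the two sets of eigenvalues agree, variables measured on very different ranges end up with equal weight in the analysis.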

Running PCA

I am using the PCA() function to analyze the dataset. I set graph = FALSE because I want to create custom, professional plots in the next steps.

# Computing PCA
res.pca <- PCA(df_scaled, graph = FALSE)

# Inspecting the results
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23463 individuals, described by 5 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

Variance Explained

To decide how many components are important, I look at the Scree Plot. This plot shows the percentage of variance explained by each principal component.

# Visualizing eigenvalues/variances
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 70))

Interpreting the Scree Plot

The Scree Plot shows how much information (variance) each principal component captures. In my analysis, the first two components (Dim 1 and Dim 2) together explain the majority of the total variance; the exact percentages are the labels printed above the bars.

Because these two components capture most of the variance, we can project the five-dimensional data onto two dimensions with limited loss of information. This makes it easier to visualize and analyze the air pollution patterns across different cities.
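
The percentages in a scree plot come directly from the eigenvalues: each component's share is its eigenvalue divided by the sum of all eigenvalues. The sketch below reproduces that arithmetic with base R `prcomp()` on simulated data (`toy` and its columns `v1` to `v5` are hypothetical stand-ins, and the 80% cutoff is an arbitrary assumption, not a rule from this analysis). For the FactoMineR object used here, the corresponding table is stored in `res.pca$eig`.

```r
# Simulated five-variable dataset (hypothetical; three columns share a common signal)
set.seed(1)
n <- 300
base <- rnorm(n)
toy <- data.frame(
  v1 = base + rnorm(n, sd = 0.3),
  v2 = base + rnorm(n, sd = 0.3),
  v3 = rnorm(n),
  v4 = rnorm(n),
  v5 = base + rnorm(n, sd = 0.6)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalues <- pca$sdev^2
pct <- 100 * eigenvalues / sum(eigenvalues)

variance_table <- data.frame(
  component  = paste0("Dim", seq_along(eigenvalues)),
  eigenvalue = eigenvalues,
  percent    = pct,
  cumulative = cumsum(pct)
)
print(variance_table)

# Smallest number of components whose cumulative share reaches the assumed 80% cutoff
n_keep <- which(cumsum(pct) >= 80)[1]
print(n_keep)
```

With standardized variables the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component carries more variance than a single original variable.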

# Graph of variables: Positive correlations are shown by arrows pointing in the same direction
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)     # Avoid text overlapping

Variable Correlation Circle

This plot shows how each pollutant relates to the principal components.

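Direction of Arrows: If two arrows point in the same direction, those pollutants are positively correlated (they increase together). Arrows pointing in opposite directions indicate a negative correlation, and arrows at right angles indicate little or no correlation.

Length of Arrows: The closer an arrow comes to the edge of the circle, the better that pollutant is represented by Dim 1 and Dim 2 together. A short arrow means the pollutant is mostly explained by other components.

In my results, we can see that AQI Value and PM2.5 are very close to each other, meaning they are strongly correlated, and they are the main drivers of the first dimension (Dim 1). Other pollutants like Ozone might contribute to the second dimension (Dim 2).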
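
To make the geometry of the circle concrete, here is a minimal base R sketch of the quantities behind each arrow, under the assumption of a standardized PCA. The arrow coordinates are the correlations between a variable and the components, cos2 is the squared coordinate, and the contribution is each variable's share of a component's eigenvalue. The data frame `toy` and its columns `p1` to `p4` are simulated and hypothetical, not the pollutant data.

```r
# Simulated data (hypothetical): p1 and p2 share a strong common signal
set.seed(7)
n <- 300
base <- rnorm(n)
toy <- data.frame(
  p1 = base + rnorm(n, sd = 0.3),
  p2 = base + rnorm(n, sd = 0.3),
  p3 = rnorm(n),
  p4 = 0.5 * base + rnorm(n)
)

pca <- prcomp(toy, scale. = TRUE)

# Arrow coordinates: loadings multiplied by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# The same numbers, obtained as correlations between variables and component scores
var_cor <- cor(toy, pca$x)

# cos2: quality of representation of each variable on each component
cos2 <- var_coord^2

# Contribution (%) of each variable to each component
contrib <- 100 * sweep(cos2, 2, colSums(cos2), "/")

# Arrow length on the Dim 1 / Dim 2 plane (at most 1, the radius of the circle)
arrow_length <- sqrt(rowSums(cos2[, 1:2]))

print(round(var_coord[, 1:2], 3))
print(round(contrib[, 1:2], 1))
print(round(arrow_length, 3))
```

In the FactoMineR result, the corresponding matrices are `res.pca$var$coord`, `res.pca$var$cos2`, and `res.pca$var$contrib`.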

PCA Individual Plot

This graph shows the distribution of all cities in the dataset based on the first two principal components. Cities that are close to each other have similar air pollution patterns.

# Graph of individuals
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             geom = "point",   # Show points only (to avoid text clutter)
             alpha.ind = 0.5) + # Transparency for better visibility
  labs(title = "PCA: Map of Cities",
       subtitle = "Colors represent the quality of representation on the map")
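
The cos2 value used for the colors can be computed by hand: for each point, it is the squared distance from the origin in the Dim 1 / Dim 2 plane divided by the squared distance over all components. The sketch below shows this with base R on simulated data (`toy` is a hypothetical stand-in for the city records); it illustrates the idea and is not FactoMineR's internal code.

```r
# Simulated observations (hypothetical stand-in for the city records)
set.seed(3)
toy <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))

pca <- prcomp(toy, scale. = TRUE)
scores <- pca$x  # coordinates of each observation on every component

# Squared distance of each observation from the origin, using all components
dist2_total <- rowSums(scores^2)

# Share of that squared distance captured by the first two components
cos2_plane <- rowSums(scores[, 1:2]^2) / dist2_total

print(summary(cos2_plane))
```

A value near 1 means the point's position on the two-dimensional map is faithful, while a value near 0 means the point differs from the average mainly along components that the map does not show.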

Conclusion

In this second paper, I applied Principal Component Analysis (PCA) to the global air pollution dataset. Here are my main findings:

Overall, PCA allowed me to simplify a complex dataset and identify the most important factors driving air pollution globally.

AI Usage Statement

This project was designed and managed by me. I used AI as a technical assistant for the following tasks:

I personally selected the variables for the analysis and interpreted all the graphical outputs to reach the final conclusions.