In this second paper, I perform Principal Component Analysis (PCA) to reduce the dimensionality of the air pollution dataset. My goal is to understand which pollutants contribute most to the overall variance in air quality and to represent the data in a simpler form.
# Loading necessary libraries
library(tidyverse)
library(factoextra)
library(FactoMineR)
# Loading the dataset
df <- read.csv("global air pollution dataset.csv")
# Selecting only numerical columns for PCA
# We take AQI Value, CO, Ozone, NO2, and PM2.5
df_numeric <- df %>%
  select(AQI.Value, CO.AQI.Value, Ozone.AQI.Value, NO2.AQI.Value, PM2.5.AQI.Value) %>%
  drop_na()
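# Standardizing puts all variables on the same scale before PCA
# (PCA() also standardizes by default via scale.unit = TRUE, so this step is a safeguard)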
df_scaled <- scale(df_numeric)
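To make the effect of standardization concrete, here is a minimal sketch in base R. The data frame `toy` and its column names are made up for illustration and are not the pollution dataset. It checks that `scale()` gives every column mean 0 and standard deviation 1, and that PCA on the scaled data has the same eigenvalues as the correlation matrix of the raw data.

```r
# Synthetic stand-in for the pollution table (values are invented for illustration)
set.seed(42)
toy <- data.frame(
  pm25  = rnorm(200, mean = 70, sd = 40),
  ozone = rnorm(200, mean = 35, sd = 15),
  co    = rnorm(200, mean = 1.5, sd = 0.8)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)
col_means  <- colMeans(toy_scaled)
col_sds    <- apply(toy_scaled, 2, sd)

# PCA on the scaled data: the component variances are sdev^2
pca_scaled <- prcomp(toy_scaled)

# Eigen-decomposition of the correlation matrix of the raw data
pca_corr <- eigen(cor(toy))

print(round(col_means, 10))
print(col_sds)
print(cbind(prcomp_variance = pca_scaled$sdev^2,
            cor_eigenvalue  = pca_corr$values))
```

The two columns printed at the end should agree, which is why a variable measured on a large scale no longer dominates the components once the data are standardized.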
I am using the PCA() function to analyze the dataset. I set graph = FALSE because I want to create custom, professional plots in the next steps.
# Computing PCA
res.pca <- PCA(df_scaled, graph = FALSE)
# Inspecting the results
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23463 individuals, described by 5 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
To decide how many components are important, I look at the Scree Plot. This plot shows the percentage of variance explained by each principal component.
# Visualizing eigenvalues/variances
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 70))
The Scree Plot shows how much of the total variance each principal component captures. In my analysis, the first two components (Dim 1 and Dim 2) together account for most of the total variance; the exact percentages are the bar labels on the plot.
Because these two components carry the majority of the information, we can project the 5-dimensional data onto 2 dimensions while keeping most of its structure. This makes it easier to visualize and analyze the air pollution patterns across different cities.
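The percentages on a scree plot come directly from the eigenvalues. The sketch below uses base R on synthetic data (the data frame `toy` and its columns `v1` to `v5` are hypothetical) to show how eigenvalues become per-component and cumulative percentages. For the FactoMineR object used in this paper, the corresponding table is stored in `res.pca$eig`.

```r
# Synthetic data: three variables share a common signal, two are pure noise
set.seed(1)
n <- 500
shared <- rnorm(n)
toy <- data.frame(
  v1 = shared + rnorm(n, sd = 0.3),
  v2 = shared + rnorm(n, sd = 0.3),
  v3 = shared + rnorm(n, sd = 0.6),
  v4 = rnorm(n),
  v5 = rnorm(n)
)

# Eigenvalues of the correlation matrix: one per principal component
eig <- eigen(cor(toy))$values

# Each eigenvalue's share of the total is the percentage of variance explained
pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

variance_table <- data.frame(
  component  = paste0("Dim.", seq_along(eig)),
  eigenvalue = eig,
  percent    = pct,
  cumulative = cum_pct
)
print(variance_table)
```

With standardized variables the eigenvalues sum to the number of variables (5 here), so an eigenvalue above 1 means that component explains more than a single original variable would. The cumulative column is the one to read when deciding how many components to keep.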
# Graph of variables: Positive correlations are shown by arrows pointing in the same direction
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE) # Avoid text overlapping
# Variable Correlation Circle
This plot shows how each pollutant relates to the principal components.
Direction of Arrows: If two arrows point in the same direction, those pollutants are positively correlated (they increase together).
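Length of Arrows: The longer an arrow (the closer its tip is to the circle), the better that pollutant is represented by the first two components. How strongly it influences a specific component is given by its coordinate along that axis.
In my results, the arrows for AQI Value and PM2.5 lie very close to each other, meaning the two variables are strongly positively correlated; they are also the main drivers of the first dimension (Dim 1). Other pollutants, such as Ozone, may contribute more to the second dimension (Dim 2).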
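The arrow coordinates on a correlation circle can be reproduced by hand. The following base R sketch uses synthetic data (the variables `a`, `b`, and `c` are hypothetical) and `prcomp()` rather than FactoMineR, so it illustrates the idea and does not reproduce the plot above. It shows that each arrow coordinate is the correlation between a variable and a component, and that the arrow length is the square root of the variable's cos2 on the plotted plane.

```r
# Synthetic data: a and b share a signal, c is independent noise
set.seed(7)
n <- 300
shared <- rnorm(n)
toy <- data.frame(
  a = shared + rnorm(n, sd = 0.2),
  b = shared + rnorm(n, sd = 0.2),
  c = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings multiplied by each component's standard deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same values are the correlations between variables and component scores
var_cor <- cor(toy, pca$x)

# cos2 on the first two dimensions and the resulting arrow length (at most 1)
cos2_plane   <- rowSums(var_coord[, 1:2]^2)
arrow_length <- sqrt(cos2_plane)

print(round(var_coord, 3))
print(round(arrow_length, 3))
```

In this toy example `a` and `b` get nearly identical coordinates, which is what two overlapping arrows on the circle mean. The sign of a whole component is arbitrary, so the arrows may all be mirrored without changing the interpretation.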
This graph shows the distribution of all cities in the dataset based on the first two principal components. Cities that are close to each other have similar air pollution patterns.
# Graph of individuals
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             geom = "point", # Show points only (to avoid text clutter)
             alpha.ind = 0.5) + # Transparency for better visibility
  labs(title = "PCA: Map of Cities",
       subtitle = "Colors represent the quality of representation on the map")
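The "quality of representation" used for the colors is the cos2 of each individual. As a rough sketch in base R on synthetic data (the data frame `toy` and its columns are hypothetical, and `prcomp()` stands in for `PCA()`), cos2 is the share of a point's squared distance from the origin that is captured by a given component.

```r
# Synthetic data with four unrelated variables
set.seed(3)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100),
  x4 = rnorm(100)
)

pca    <- prcomp(toy, center = TRUE, scale. = TRUE)
scores <- pca$x                      # coordinates of each individual on each component

# Squared distance of each individual from the origin in the full component space
dist2 <- rowSums(scores^2)

# cos2: fraction of that squared distance captured by each component
cos2 <- scores^2 / dist2

# Quality of representation on the 2D map = cos2 on Dim 1 + cos2 on Dim 2
quality_2d <- rowSums(cos2[, 1:2])

print(summary(quality_2d))
```

A point with quality close to 1 lies almost entirely in the plotted plane, so its position on the map can be trusted. A point with low quality is mostly described by the components that are not shown, and its apparent closeness to other points may be misleading.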
In this second paper, I applied Principal Component Analysis (PCA) to the global air pollution dataset. Here are my main findings:
Dimensionality Reduction: I reduced 5 pollution variables into 2 principal components. These two components explain most of the information (variance) in the dataset.
Variable Relationships: The variable plot showed that AQI Value and PM2.5 are strongly correlated, as their arrows point in the same direction. Since AQI Value is the overall index, this suggests that PM2.5 is the pollutant most closely tied to overall air quality in most cities.
City Distribution: The individual plot showed a dense cluster of cities with similar air profiles, while some outliers represent locations with extreme pollution levels.
Overall, PCA allowed me to simplify a complex dataset and identify the most important factors driving air pollution globally.
This project was designed and managed by me. I used AI as a technical assistant for the following tasks:
Mathematical Syntax: AI helped with the R code for the FactoMineR and factoextra libraries to compute and visualize the PCA results.
Interpretation Support: AI provided guidance on how to read the Scree Plot and the Variable Correlation Circle.
English Editing: AI assisted in proofreading the explanations to ensure a professional and academic tone.
I personally selected the variables for the analysis and interpreted all the graphical outputs to reach the final conclusions.