Dimensionality Reduction on WDI Environmental Dataset

Introduction

In recent years, environmental sustainability has become an increasingly pressing global issue. Governments, researchers, and policymakers rely on large-scale datasets to assess environmental trends and make informed decisions. However, environmental data often consists of multiple highly correlated variables, making it difficult to extract meaningful patterns directly. Dimensionality reduction techniques help simplify complex datasets, making it easier to identify key drivers of environmental variation across countries.

This study analyzes seven key environmental indicators for 25 countries: United States, China, India, Brazil, Russia, Germany, Japan, United Kingdom, France, Canada, Australia, Mexico, Indonesia, Saudi Arabia, South Africa, Italy, Spain, South Korea, Turkey, Argentina, Sweden, Norway, Denmark, Poland, and Belgium. These countries were chosen for their data completeness and their relevance to the European and world economies. The data come from the World Bank’s World Development Indicators (WDI) database. The year 2018 is used because it is the most recent year with complete, high-quality data across all selected countries.

The indicators include:

PM2.5 air pollution (concentration of fine particulate matter in the air)
Forest area percentage (proportion of land covered by forests)
CO₂ emissions excluding land-use change and forestry (LUCF)
Energy intensity (energy use per GDP unit)
Renewable energy consumption percentage
Agricultural land percentage
Protected areas percentage (proportion of land and sea area under protection)

To analyze patterns across countries, a combination of dimensionality reduction techniques is applied, including:

Principal Component Analysis (PCA) to reduce data complexity while preserving the variance structure.
t-SNE (t-distributed Stochastic Neighbor Embedding) to reveal local structures and similarities in the data.
Multidimensional Scaling (MDS) to visualize distances between countries in a meaningful way.

The Importance of Reducing Data Complexity in Environmental Analysis

Dimensionality reduction is important in environmental analysis because many indicators are complex and closely related. Principal Component Analysis (PCA) has been useful in environmental research by helping to find the most important variables while reducing unnecessary information (Ma et al., 2023). PCA is often used in environmental policy planning to better understand key factors like carbon emissions and energy use (Ma et al., 2023).

In sustainability research, PCA has helped identify key environmental indicators at both local and national levels. For example, Martins et al. (2021) used PCA to create sustainability indicators for neighborhoods, showing that it helps organize environmental, social, and economic factors in a clear way. Similarly, studies on environmental degradation show that PCA is valuable for analyzing linked environmental indicators, making it easier to evaluate patterns of degradation (Khatun, 2009).

Using PCA, t-SNE, and MDS makes it easier to interpret and analyze environmental trends. These methods help policymakers understand how different countries perform on environmental issues, supporting data-driven decisions for sustainability goals.

Data Collection and Preparation

The first step is to gather and preprocess the environmental data. This study uses data from the World Bank’s WDI database, focusing on 25 selected countries with strong environmental and economic influence and on the seven environmental indicators listed above.

To fetch this data, the WDI package in R is used, which allows direct access to World Bank datasets.

# Packages used throughout the analysis
library(WDI)         # World Bank data access
library(dplyr)       # data manipulation
library(ggplot2)     # plotting
library(FactoMineR)  # PCA()
library(factoextra)  # fviz_eig(), fviz_pca_biplot()
library(corrplot)    # correlation plot
library(GGally)      # ggpairs()
library(Rtsne)       # t-SNE
library(ggrepel)     # geom_text_repel()
library(plotly)      # 3D scatter plot

# 25 countries (including several in Europe)
countries_25 <- c(
  "USA", "CHN", "IND", "BRA", "RUS",
  "DEU", "JPN", "GBR", "FRA", "CAN",
  "AUS", "MEX", "IDN", "SAU", "ZAF",
  "ITA", "ESP", "KOR", "TUR", "ARG",
  "SWE", "NOR", "DNK", "POL", "BEL"
)


# 7 environment-related indicators:
indicators <- c(
  "EN.ATM.PM25.MC.M3",   # PM2.5 air pollution (µg/m^3)
  "AG.LND.FRST.ZS",      # Forest area (% of land area)
  "CC.CO2.EMSE.EL",      # CO2 emissions (excl. LUCF)
  "EG.EGY.PRIM.PP.KD",   # Energy intensity
  "EG.FEC.RNEW.ZS",      # Renewable energy consumption (% of total final)
  "AG.LND.AGRI.ZS",      # Agricultural land (% of land area)
  "ER.PTD.TOTL.ZS"       # Protected areas (% of total land and sea area)
)

# Fetching data for 2018
wdi_data_2018 <- WDI(
  country  = countries_25, 
  indicator = indicators,
  start     = 2018,
  end       = 2018,
  extra     = FALSE
)

# Rename indicator codes to readable names (new name = old name)
name_map <- c(
  "PM2.5 (µg/m^3)"         = "EN.ATM.PM25.MC.M3",
  "ForestArea (%)"         = "AG.LND.FRST.ZS",
  "CO2 ExclLUCF"           = "CC.CO2.EMSE.EL",
  "EnergyIntensity"        = "EG.EGY.PRIM.PP.KD",
  "RenewableEnergy (%)"    = "EG.FEC.RNEW.ZS",
  "AgriculturalLand (%)"   = "AG.LND.AGRI.ZS",
  "ProtectedAreas (%)"     = "ER.PTD.TOTL.ZS"
)


wdi_data_2018 <- wdi_data_2018 %>%
  rename(all_of(name_map))

# Collapse to one row per country and year
wdi_data_2018 <- wdi_data_2018 %>%
  group_by(country, year) %>%
  summarize(
    across(everything(), ~ {
      if (is.numeric(.x)) {
        mean(.x, na.rm = TRUE)  # or sum, median, etc.
      } else {
        first(.x)  # keep the first value for character/factor columns
      }
    }),
    .groups = "drop"
  )

# Drop 'iso2c', 'iso3c', and 'year', which are not needed for the analysis
wdi_2018 <- wdi_data_2018 %>%
  select(-iso2c, -iso3c,-year)


# Keep 'country' for labeling
country_names <- wdi_2018$country

# Drop 'country' from the numeric data & scale
wdi_scaled <- scale(wdi_2018 %>%
                      select(-country))
rownames(wdi_scaled) <- country_names
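
Because `scale()` propagates missing values, it can help to check completeness before standardizing. The sketch below uses a small made-up data frame (`toy_df` and its values are illustrative assumptions, not WDI data) to show such a check and what standardization does to each column.

```r
# Toy data: hypothetical values, not taken from the WDI
toy_df <- data.frame(
  forest = c(30, 45, 10, 68),
  agri   = c(50, 35, 70, 8),
  pm25   = c(12, 9, 40, 6)
)

# Count missing values per column before scaling
na_counts <- colSums(is.na(toy_df))

# Standardize: subtract the column mean, divide by the column SD
toy_scaled <- scale(toy_df)

# After scaling, each column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
```

Applied to the real data, the same check would be `colSums(is.na(wdi_2018))` before the call to `scale()`.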

Visualizing Results

Correlation Matrix

There is a strong negative correlation (-0.79) between forest area and agricultural land: countries with large forested areas tend to have less land dedicated to agriculture, which is consistent with a trade-off between forest cover and agricultural expansion. Renewable energy and agricultural land are also negatively correlated (-0.39), so countries with a higher share of agricultural land tend to rely less on renewable energy.

PM2.5 air pollution levels have a moderate positive correlation (0.46) with agricultural land, suggesting that intensive agricultural activity may contribute to air pollution. CO₂ emissions and energy intensity are strongly positively correlated (0.80), showing that energy-intensive economies tend to produce higher emissions.

By analyzing these correlations, it becomes clear that environmental indicators are interrelated, and reducing dimensionality using PCA can help summarize these complex relationships in a more interpretable way.
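
A correlation matrix like the one discussed here can be computed along the following lines. This is only a sketch on simulated data: `sim` and its three columns are placeholders, and the negative relationship between `forest` and `agri` is built in by construction. In the report's workflow the matrix would presumably be computed from `wdi_scaled` instead.

```r
set.seed(1)
n <- 25

# Simulated stand-ins for three indicators (not WDI data)
forest <- rnorm(n)
# Build a variable that is negatively related to 'forest' by construction
agri <- -0.8 * forest + rnorm(n, sd = 0.5)
pm25 <- rnorm(n)
sim <- cbind(forest = forest, agri = agri, pm25 = pm25)

# Pearson correlation matrix (symmetric, ones on the diagonal)
cor_mat <- cor(sim)
round(cor_mat, 2)

# With the corrplot package, the matrix could be drawn as, for example:
# corrplot::corrplot(cor_mat, method = "number")
```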

Scree Plot

The scree plot shows that the first two components explain 59.3% of the variance, with PC1 capturing 34% and PC2 accounting for 25.3%. The sharp decline in variance after the second component suggests that two principal components capture the dominant structure in the dataset. A two-dimensional view therefore retains more than half of the total variance, although the remaining 40.7% lies in higher components and is not visible in the 2D plots.

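# PCA (FactoMineR). The input is already standardized, so scale.unit = TRUE
# does not change the correlation structure; the first five components are
# kept by default (ncp = 5).
pca_result <- PCA(wdi_scaled, scale.unit = TRUE, graph = FALSE)

# Scree plot: percentage of variance explained by each component
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))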
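
The percentages on a scree plot are each eigenvalue's share of the total variance. The following minimal sketch reproduces that arithmetic with base R's `prcomp()` (used here instead of `FactoMineR::PCA()` to keep the example dependency-free) on simulated data; `X` is a random stand-in, not the WDI matrix.

```r
set.seed(42)
# Simulated stand-in for a 25 x 7 indicator matrix (not WDI data)
X <- matrix(rnorm(25 * 7), nrow = 25, ncol = 7)

pca_base <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared standard deviations
eigenvalues <- pca_base$sdev^2

# Share of variance per component, in percent, and the running total
pct_var <- 100 * eigenvalues / sum(eigenvalues)
cum_var <- cumsum(pct_var)

round(data.frame(PC = seq_along(pct_var), pct_var, cum_var), 1)
```

For a FactoMineR result, the same three quantities are stored in `pca_result$eig`.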

PCA Biplot

The PCA biplot provides a visual representation of how different countries relate to the environmental indicators. The arrows represent the variables, with longer arrows indicating stronger contributions to the variance. Several key observations emerge from this plot:

Energy intensity and CO₂ emissions are strongly aligned, indicating that countries with high energy intensity tend to have higher CO₂ emissions.
PM2.5 levels correlate with energy intensity, suggesting that air pollution is associated with high fossil fuel consumption.
Forest area and agricultural land are positioned in opposite directions, indicating that countries with extensive agricultural land tend to have less forest coverage.
Renewable energy and forest area are positioned together, suggesting that countries with significant forest cover also tend to have higher renewable energy usage.

# Biplot: label only the variable arrows
fviz_pca_biplot(
  pca_result,
  label = "var",  
  repel = TRUE
)

t-SNE on PCA-Reduced Data

t-SNE is useful for highlighting local similarities in complex datasets. Running it on the retained PCA coordinates rather than on the raw indicators removes some noise first, and t-SNE then emphasizes nonlinear, local relationships between countries.

Some countries do not form clear clusters and appear more spread out, which suggests that their indicator profiles do not strongly resemble those of other countries in the dataset. The t-SNE visualization shows both well-defined groupings and isolated countries. Because t-SNE focuses on local structure, distances between groups in the embedding should be interpreted cautiously, especially with only 25 observations.

# Extract the individual (country) coordinates from the PCA
pca_coords <- pca_result$ind$coord

# Set a seed for reproducibility
set.seed(123)

# t-SNE on the PCA coordinates
tsne_out <- Rtsne(
  pca_coords,          # input data (PCA coordinates)
  dims = 2,            # produce a 2D embedding
  perplexity = 5,      # kept small because there are only 25 countries
  check_duplicates = FALSE
)

# Build a data frame with t-SNE outputs
tsne_df <- data.frame(
  Dim1 = tsne_out$Y[,1],
  Dim2 = tsne_out$Y[,2],
  Country = rownames(pca_coords)
)

# Plot the t-SNE result
ggplot(tsne_df, aes(x = Dim1, y = Dim2, label = Country)) +
  geom_point() +
  geom_text_repel() +
  ggtitle("t-SNE on PCA Coordinates")
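
The perplexity has to be small relative to the number of observations: `Rtsne()` stops with an error when `3 * perplexity` exceeds `nrow - 1`. The helper below is a hypothetical convenience function (not part of the Rtsne package) that computes the largest value passing that check.

```r
# Largest perplexity that passes Rtsne's input check for n observations.
# max_perplexity() is a hypothetical helper, not part of the Rtsne package.
max_perplexity <- function(n) {
  floor((n - 1) / 3)
}

# Upper bound for the 25 countries analyzed here
upper <- max_perplexity(25)
upper
```

With 25 countries the bound is 8, so the value of 5 used above sits comfortably inside the allowed range.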

Multidimensional Scaling

MDS helps to understand how countries relate to each other in terms of environmental performance by keeping their relative distances as accurate as possible. While t-SNE focuses on local structures, MDS is better at preserving global relationships in the data.

Countries close together have similar environmental profiles, while those further apart differ more. Sweden and Norway are near each other, possibly because of their high renewable energy use and large forested area, while China and India sit apart from the others, possibly because of high CO₂ emissions. France, Germany, and Denmark form a group, suggesting similar indicator profiles. The United Kingdom is slightly separate. The MDS map is consistent with the patterns from PCA and t-SNE and makes global relationships clearer.

# Compute distance matrix from PCA-reduced data
mds_dist <- dist(pca_result$ind$coord) 

# Perform MDS
mds_coords <- cmdscale(mds_dist, k = 2)

# Convert to data frame
mds_df <- data.frame(
  Dim1 = mds_coords[, 1],
  Dim2 = mds_coords[, 2],
  Country = rownames(pca_result$ind$coord)
)

# Plot
ggplot(mds_df, aes(x = Dim1, y = Dim2, label = Country)) +
  geom_point() +
  geom_text_repel() +
  ggtitle("MDS on PCA-Reduced Data")
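
One point worth keeping in mind when reading this map: classical MDS (`cmdscale()`) applied to Euclidean distances recovers the principal component scores up to a sign flip per axis. The sketch below illustrates this on simulated data; `X` is a random stand-in for the standardized indicators, and `prcomp()` is used in place of `FactoMineR::PCA()` to keep the example self-contained.

```r
set.seed(7)
# Simulated stand-in for standardized indicators (not WDI data)
X <- scale(matrix(rnorm(25 * 7), nrow = 25))

# First two principal component scores
pca_scores <- prcomp(X)$x[, 1:2]

# Classical MDS on Euclidean distances between the same rows
mds_coords_sim <- cmdscale(dist(X), k = 2)

# Each MDS axis matches the corresponding PC up to sign
axis_cor <- abs(diag(cor(pca_scores, mds_coords_sim)))
round(axis_cor, 6)
```

Since the distances above are Euclidean distances between PCA coordinates, the MDS map should largely mirror the first two PCA dimensions by construction. A different distance measure or a non-metric MDS variant would be needed for a more independent view.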

K-Means and Pairwise Plot

K-means clusters countries based on their indicators, grouping nations such as Sweden and Norway together while placing China and India in a different group. The clusters highlight broad numerical similarities but do not confirm policy alignment; real-world conclusions require more context and in-depth research. The pairwise plot shows scatterplots and correlations for each pair of indicators, helping to show which variables are most strongly related.

# K-means clustering

# Cluster on the PCA coordinates
set.seed(123)
kmeans_k4 <- kmeans(pca_coords, centers = 4)

# Visualize clusters in the PCA space
pca_clusters_df <- data.frame(pca_coords) %>%
  mutate(Cluster = factor(kmeans_k4$cluster),
         Country = rownames(pca_coords))

ggplot(pca_clusters_df, aes(x = Dim.1, y = Dim.2, color = Cluster, label = Country)) +
  geom_point(size=2) +
  geom_text_repel() +
  ggtitle("K-means Clusters (k=4) on PCA Coordinates") +
  xlab("PC1") + ylab("PC2")
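
The choice of k = 4 can be checked with an elbow plot of the total within-cluster sum of squares. The sketch below is one common way to do this, shown on simulated data; `coords` is a random stand-in for `pca_coords`, and `nstart = 25` is an assumption added here to make each fit less sensitive to the random starting centers.

```r
set.seed(123)
# Simulated stand-in for the PCA coordinates (25 countries x 5 dimensions)
coords <- matrix(rnorm(25 * 5), nrow = 25)

# Total within-cluster sum of squares for k = 1..8;
# nstart = 25 reruns k-means from several random starts and keeps the best
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(coords, centers = k, nstart = 25)$tot.withinss
})

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

A visible bend in the curve suggests a reasonable k; on random data like this there is usually none. factoextra's `fviz_nbclust()` offers a similar diagnostic.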

# Pairwise plot (GGally)

# Convert to data frame
scaled_df <- as.data.frame(wdi_scaled)

ggpairs(
  scaled_df,
  upper = list(continuous = "cor"),
  diag  = list(continuous = "barDiag")
) +
  theme(
    axis.text = element_text(size = 5),                     
    axis.text.x = element_text(angle = 45, hjust = 2, vjust = 2), 
    axis.text.y = element_text(angle = 45, hjust = 2, vjust = 2), 
    strip.text = element_text(size = 5)                       
  )

t-SNE Density Plot and 3D PCA Plot

t-SNE Density Plot: This plot shows the t-SNE embedding of the scaled indicators, with points colored by k-means cluster, and adds a density layer to show where countries are tightly packed. The shaded contours mark regions of higher concentration, and each point is labeled with its country.
3D PCA Plot: This visualization represents the data on the first three principal components, helping us understand broad patterns. Each dot stands for a country, and different colors indicate different k-means clusters. Unlike t-SNE, PCA gives a bigger picture of how the data is spread out rather than focusing on local clusters.

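# K-means on the scaled indicators; the seed makes the clustering and the
# t-SNE embedding below reproducible
set.seed(123)
kmeans_result <- kmeans(wdi_scaled, centers = 4)

# t-SNE on the scaled 25-row data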
tsne_out <- Rtsne(
  wdi_scaled,
  dims = 2,
  perplexity = 5,
  check_duplicates = FALSE,
  pca = FALSE
)

# data frame for plotting
tsne_df <- data.frame(
  Dimension1 = tsne_out$Y[, 1],
  Dimension2 = tsne_out$Y[, 2],
  Cluster    = factor(kmeans_result$cluster),
  Country    = rownames(wdi_scaled)
)

# Plot, coloring by cluster
ggplot(tsne_df, aes(x = Dimension1, y = Dimension2)) +
  geom_point(aes(color = Cluster), size = 3) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.2) +  
  geom_text_repel(aes(label = Country))

# Data frame with the first three PCA components
pca_df <- data.frame(
  PC1 = pca_result$ind$coord[, 1],
  PC2 = pca_result$ind$coord[, 2],
  PC3 = pca_result$ind$coord[, 3],
  Cluster = factor(kmeans_result$cluster),
  Country = rownames(pca_result$ind$coord)
)

#Plot
plot_ly(
  data = pca_df,
  x = ~PC1,
  y = ~PC2,
  z = ~PC3,
  type = "scatter3d",
  mode = "markers",
  color = ~Cluster,
  marker = list(size = 5)
)
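
The analysis runs k-means twice: once on the PCA coordinates (`kmeans_k4`) and once on the scaled indicators (`kmeans_result`). A cross-tabulation is a simple way to see how far two such clusterings agree. The sketch below uses simulated data; `X`, `scores`, `km_full`, and `km_pca` are illustrative names, and `nstart = 25` is an added assumption.

```r
set.seed(123)
# Simulated stand-in for the standardized indicators (not WDI data)
X <- scale(matrix(rnorm(25 * 7), nrow = 25))
# First five principal component scores, mirroring the five retained dimensions
scores <- prcomp(X)$x[, 1:5]

km_full <- kmeans(X, centers = 4, nstart = 25)
km_pca  <- kmeans(scores, centers = 4, nstart = 25)

# Rows: clusters from the full data; columns: clusters from the PCA scores.
# Cluster labels are arbitrary, so look for one dominant cell per row.
agreement <- table(full = km_full$cluster, pca = km_pca$cluster)
agreement
```

For the real data, the equivalent call would be `table(kmeans_result$cluster, kmeans_k4$cluster)`.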

Conclusion

In this analysis, we explored how dimensionality reduction can help us understand patterns in the WDI environmental dataset. We used Principal Component Analysis (PCA) to get a broad overview of how countries differ across the selected indicators, t-SNE to reveal local groupings of similar countries, and MDS together with k-means clustering to compare countries' overall profiles.

The t-SNE density plot showed distinct clusters, making it easier to see which countries share environmental similarities. The 3D PCA plot, on the other hand, helped us grasp how much variation exists in the data but showed some overlapping groups, meaning certain countries have similarities across multiple features.

Future Considerations

While these methods gave useful insights, there is clearly room for improvement:

Exploring other techniques like UMAP for better clustering.
Checking feature importance to see which environmental factors influence clusters the most.
Analyzing outliers like Brazil or Norway to understand why they stand apart.
Adding more variables to include additional environmental or economic indicators that could improve the accuracy of clustering and give more insightful analysis.
Expanding to include more countries, giving broader global representation and more meaningful comparisons.

Overall, dimensionality reduction proved to be a powerful tool for simplifying complex data while still preserving key patterns. Future work could improve on this approach by combining different techniques, improving feature selection, and incorporating a more comprehensive dataset.

References

Chu, K., Liu, W., She, Y., Hua, Z., Tan, M., Liu, X., Gu, L., & Jia, Y. (2018). Modified Principal Component Analysis for Identifying Key Environmental Indicators and Application to a Large-Scale Tidal Flat Reclamation. Water, 10(1), 69. https://doi.org/10.3390/w10010069

Khatun, T. (2009). Measuring environmental degradation by using principal component analysis. Environment, Development and Sustainability, 11, 439-457. https://doi.org/10.1007/s10668-007-9123-2

Ma, S., Huang, Y., Liu, Y., Kong, X., Yin, L., & Chen, G. (2023). Analysis of principal component techniques in environmental policy planning. Journal of Environmental Studies, 45(3), 228-237.

Martins, M. S., Kalil, R. M. L., & Rosa, F. D. (2021). Sustainable neighbourhoods: Applicable indicators through principal component analysis. Proceedings of the Institution of Civil Engineers - Urban Design and Planning, 174, 1-27. https://doi.org/10.1680/jurdp.20.00058