Clustering and dimension reduction applied to economic activities of over 190 countries from 2017 to 2021

Overview of the Dataset

The World Development Indicators dataset, derived from the World Bank’s Data portal, evaluates various economic, social, and environmental metrics across more than 190 countries. These indicators assess critical dimensions such as health, education, poverty, and infrastructure, providing insights into the developmental progress and challenges faced by different nations. By applying clustering and dimensionality reduction techniques to this dataset, we can uncover patterns in how countries perform across these indicators, group them based on similar developmental characteristics, and identify the key factors that influence overall development. This analysis helps to compare and contrast countries’ progress and supports targeted policy-making and investment decisions aimed at improving global development outcomes.

Overview of the Indicators

Evaluating 7 most important of these economic indicators is crucial for assessing a country’s overall performance as they provide a multidimensional view of its economic health, resilience, and development trajector:

Economic Growth and Development Trends: Indicators like GDP growth and GDP per capita highlight the scale of economic expansion and the inclusiveness of prosperity. They are fundamental for measuring progress over time and comparing nations’ economic well-being.
Investment and Structural Shifts: Metrics such as gross capital formation and sectoral transitions reveal a country’s focus on industrialization, modernization, or service-oriented growth. They are vital for understanding long-term economic transformation and diversification.
Trade Dynamics: Export and import data, along with net trade figures, showcase integration into global markets and competitiveness. These metrics are key for evaluating a nation’s trade strategy and its impact on the domestic economy.
Inflation and Price Stability: Stability in consumer prices reflects sound economic governance and policy effectiveness. High inflation or deflation can signal underlying economic imbalances or policy failures.
Resilience to Global Shocks: Recovery from crises like the 2008 financial crash or COVID-19 highlights economic adaptability and the robustness of governance. It is essential for gauging the ability to withstand and rebound from external disruptions.
Socioeconomic Changes: Indicators like population growth, urbanization, and demographic trends provide insights into labor market dynamics, infrastructure needs, and potential future challenges such as aging populations.
Environmental and Sustainability Metrics: Energy use and CO2 emissions shed light on the environmental impact of economic activities, emphasizing the importance of sustainability in long-term growth.

Why I chose specially this Dataset

I chose this dataset because it provides valuable insights into key economic indicators that reflect the performance and development of countries. By analyzing metrics such as GDP growth, trade dynamics, investment trends, and sustainability factors, I can uncover patterns and trends crucial for evaluating global economic shifts. This dataset is particularly interesting as it captures a wide range of factors-economic growth, trade, demographics, and environmental metrics-that influence and shape the overall economic landscape of nations.

Additionally, the courses I have undertaken, such as Macroeconomics and Microeconomics, have equipped me with the theoretical knowledge needed to explore this dataset effectively. These courses have allowed me to delve deeper into understanding the dataset and uncover meaningful insights about global economic dynamics.

How Clustering helps to analyse this dataset further?

Clustering will enable the grouping of countries with similar economic characteristics, simplifying the comparison of economic performance across different clusters. By applying clustering algorithms, we can uncover relationships between key indicators like GDP growth, trade balance, and demographic trends. This will help identify patterns of development, such as clusters of emerging economies or countries with similar challenges. Additionally, clustering can aid in spotting outliers or unique cases that require deeper analysis, providing a foundation for identifying areas of improvement or further investigation into global economic trends.

How dimension reduction helps to simplify this dataset?

The dataset likely contains numerous indicators, leading to high-dimensional data. Dimensionality reduction techniques like PCA (Principal Component Analysis) reduce the number of dimensions while retaining most of the original information, making the data easier to visualize and interpret.By reducing dimensions, these techniques identify the most significant indicators that contribute to the variance in the dataset, helping focus on the most critical factors for analysis. Overall, dimensionality reduction makes the dataset more manageable and improves the effectiveness of further analysis and modeling.

General Information

What’s Data Loading, Preprocessing, and Visualization? And why do we need it before clustering?

Data Loading: This step involves importing data into a working environment. Preprocessing: Preprocessing cleans and transforms the data to make it suitable for analysis. This includes handling missing values, scaling features, encoding categorical variables, and removing outliers. Visualization: Visualizing data helps understand its distribution, identify patterns, detect outliers, and recognize relationships between variables. Visualization, through methods like histograms and scatter plots, helps understand data distribution, detect patterns, and identify outliers. These steps improve data quality, ensure proper feature scaling, and allow better selection and interpretation of clustering methods.

What’s K-means, PAM, Heirarchical and DBSCAN clustering methods?

K-means: K-means clustering partitions data into K clusters by minimizing the variance within clusters. It uses centroids to define cluster centers and is sensitive to outliers. The number of clusters must be predefined.

PAM (Partitioning Around Medoids): PAM is a robust clustering method that uses medoids (central points) instead of centroids, making it less sensitive to outliers. It is suitable for datasets with mixed types or where medoids are more interpretable.

Hierarchical Clustering: This method builds a tree-like structure of clusters, either agglomeratively (bottom-up) or divisively (top-down). It does not require specifying the number of clusters beforehand, making it flexible for exploring data.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on dense regions of points and can discover clusters of arbitrary shapes. It handles noise and outliers effectively, unlike K-means or hierarchical clustering.

What’s PCA, t-SNE, UMAP, Factor Analysis, MDS dimension reduction methods?

PCA (Principal Component Analysis): PCA reduces dimensionality by transforming the original variables into a smaller set of uncorrelated variables called principal components, which capture the most variance in the data. It’s useful for linear data structures.

t-SNE (t-Distributed Stochastic Neighbor Embedding): t-SNE reduces dimensionality by converting the distances between data points into probabilities, preserving local relationships and creating a lower-dimensional map.

UMAP (Uniform Manifold Approximation and Projection): UMAP is a nonlinear dimension reduction technique that focuses on preserving the global structure of the data while also maintaining local neighbor relations. It’s faster and scales better than t-SNE, making it suitable for large datasets.

Factor Analysis: Factor Analysis identifies latent variables (factors) that explain the correlations among observed variables. It assumes that underlying factors influence the observed data and is often used in psychometrics and social sciences.

MDS (Multidimensional Scaling): MDS reduces dimensionality by preserving the pairwise distances between data points. It aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible.

What is the Elbow Method and Why is it Used?

The Elbow Method is a technique that helps determine the best number of clusters when using algorithms like K-means. You plot the variance explained by each cluster against the number of clusters, and you look for a point where the curve starts to level off, forming an “elbow.” This elbow indicates the ideal number of clusters, as adding more clusters beyond this point doesn’t improve the model much. It’s a helpful way to balance between overfitting (too many clusters) and underfitting (too few clusters).

What is the Silhouette Score and Why is it Used?

The Silhouette Score is a way to measure how well the data points fit into their assigned clusters in clustering algorithms like K-means. It checks how similar each point is to others in the same cluster compared to points in nearby clusters.

+1 indicates that the points are well clustered, far from neighboring clusters.

0 indicates that the points are on or near the decision boundary between clusters.

-1 indicates that the points may have been assigned to the wrong cluster.

It’s really helpful for determining whether the clusters make sense and for comparing different clustering models to find the best fit for your data.

Plan of the Paper:

Data Loading, Preprocessing, and Visualization to Analyze Data Distribution
Clustering K-means, PAM, Heirarchical
Dimention Reduction PCA, t-SNE, UMAP, Factor Analysis, MDS
Re-clustering K-means, DBSCAN, Hierarchical

Data Loading, Preprocessing, and Visualization to Analyze Data Distribution

if (!require("pacman")) install.packages("pacman")

## Loading required package: pacman

## Warning: package 'pacman' was built under R version 4.4.2

pacman::p_load("yaml", "plotly", "ggplot2", "Rtsne", "htmlwidgets", "stats", "ggcorrplot", 
               "readxl", "knitr", "clustertend", "hopkins", "factoextra", "GGally", "mclust",
               "gridExtra", "dendextend", "cluster", "fpc", "clustMixType", "reshape2",
               "MASS", "FactoMineR", "BiocManager", "gplots", "pheatmap", "tidyr", "umap",
               "fpc", "flexclust", "cluster", "ClusterR", "NbClust", "DescTools", "pastecs",
               "corrplot", "psych", "dbscan", "cluster.stats")

## Installing package into 'C:/Users/khami/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## Warning: package 'cluster.stats' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

## Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.4:
##   cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.4/PACKAGES'

## Warning in p_install(package, character.only = TRUE, ...):

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'cluster.stats'

## Warning in pacman::p_load("yaml", "plotly", "ggplot2", "Rtsne", "htmlwidgets", : Failed to install/load:
## cluster.stats

Loading the required packages

setwd("C:\\Users\\khami\\Desktop\\University of Warsaw\\R\\Elbek aka")
Dataset <- read_excel("2011-2021_Data_Extract_From_World_Development_Indicators.xlsx", 
                      sheet = "2017-21")

Setting the working directory and below summarizing the dataset

dim(Dataset)

## [1] 1330   19

#head(Dataset)
str(Dataset)

## tibble [1,330 × 19] (S3: tbl_df/tbl/data.frame)
##  $ Time                                                                  : num [1:1330] 2017 2017 2017 2017 2017 ...
##  $ Time Code                                                             : chr [1:1330] "YR2017" "YR2017" "YR2017" "YR2017" ...
##  $ Country Name                                                          : chr [1:1330] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country Code                                                          : chr [1:1330] "AFG" "ALB" "DZA" "ASM" ...
##  $ GDP (current US$)                                                     : num [1:1330] 1.88e+10 1.30e+10 1.90e+11 6.12e+08 3.00e+09 ...
##  $ Gross capital formation (annual % growth)                             : num [1:1330] NA 4.37 -2.23 NA NA ...
##  $ Exports of goods and services (% of GDP)                              : num [1:1330] NA 31.6 20.3 59.2 NA ...
##  $ Inflation, consumer prices (annual %)                                 : num [1:1330] 4.98 1.99 5.59 NA NA ...
##  $ GNI per capita, Atlas method (current US$)                            : num [1:1330] 530 4290 4440 NA NA ...
##  $ Foreign direct investment, net inflows (% of GDP)                     : num [1:1330] 0.275 7.855 0.648 NA NA ...
##  $ Domestic credit provided by financial sector (% of GDP)               : num [1:1330] NA 63.6 NA NA NA ...
##  $ Trade (% of GDP)                                                      : num [1:1330] NA 78.2 49.8 161.4 NA ...
##  $ Net trade in goods and services (BoP, current US$)                    : num [1:1330] -6.80e+09 -1.98e+09 -2.24e+10 NA NA ...
##  $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 NA NA ...
##  $ Employment in agriculture (% of total employment)                     : num [1:1330] 44 38.1 10.2 NA NA ...
##  $ Age dependency ratio (% of working-age population)                    : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
##  $ Tax revenue (% of GDP)                                                : num [1:1330] 9.9 18.9 NA NA NA ...
##  $ Energy use (kg of oil equivalent per capita)                          : logi [1:1330] NA NA NA NA NA NA ...
##  $ Government expenditure on education, total (% of GDP)                 : num [1:1330] 4.34 3.61 5.73 NA 3.22 ...

summary(Dataset)

##       Time       Time Code         Country Name       Country Code      
##  Min.   :2017   Length:1330        Length:1330        Length:1330       
##  1st Qu.:2018   Class :character   Class :character   Class :character  
##  Median :2019   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2019                                                           
##  3rd Qu.:2020                                                           
##  Max.   :2021                                                           
##                                                                         
##  GDP (current US$)   Gross capital formation (annual % growth)
##  Min.   :4.528e+07   Min.   :-938.996                         
##  1st Qu.:9.617e+09   1st Qu.:  -3.395                         
##  Median :5.516e+10   Median :   3.564                         
##  Mean   :2.809e+12   Mean   :   2.054                         
##  3rd Qu.:5.621e+11   3rd Qu.:   9.444                         
##  Max.   :9.785e+13   Max.   : 182.284                         
##  NA's   :42          NA's   :352                              
##  Exports of goods and services (% of GDP) Inflation, consumer prices (annual %)
##  Min.   :  2.25                           Min.   : -3.233                      
##  1st Qu.: 22.29                           1st Qu.:  1.414                      
##  Median : 31.57                           Median :  2.558                      
##  Mean   : 39.74                           Mean   :  5.543                      
##  3rd Qu.: 47.86                           3rd Qu.:  4.283                      
##  Max.   :213.22                           Max.   :557.202                      
##  NA's   :199                              NA's   :199                          
##  GNI per capita, Atlas method (current US$)
##  Min.   :   210                            
##  1st Qu.:  2083                            
##  Median :  6110                            
##  Mean   : 14788                            
##  3rd Qu.: 18003                            
##  Max.   :120990                            
##  NA's   :94                                
##  Foreign direct investment, net inflows (% of GDP)
##  Min.   :-1303.108                                
##  1st Qu.:    1.067                                
##  Median :    2.225                                
##  Mean   :    4.850                                
##  3rd Qu.:    4.221                                
##  Max.   :  602.375                                
##  NA's   :153                                      
##  Domestic credit provided by financial sector (% of GDP) Trade (% of GDP) 
##  Min.   :  6.613                                         Min.   :  4.128  
##  1st Qu.: 42.711                                         1st Qu.: 51.204  
##  Median : 64.740                                         Median : 71.222  
##  Mean   : 85.079                                         Mean   : 84.632  
##  3rd Qu.: 98.346                                         3rd Qu.:101.218  
##  Max.   :393.451                                         Max.   :402.460  
##  NA's   :1073                                            NA's   :204      
##  Net trade in goods and services (BoP, current US$)
##  Min.   :-8.481e+11                                
##  1st Qu.:-2.281e+09                                
##  Median :-2.180e+08                                
##  Mean   : 5.234e+09                                
##  3rd Qu.: 2.463e+09                                
##  Max.   : 5.252e+11                                
##  NA's   :410                                       
##  Labor force participation rate, total (% of total population ages 15+)
##  Min.   :31.40                                                         
##  1st Qu.:55.55                                                         
##  Median :60.66                                                         
##  Mean   :60.24                                                         
##  3rd Qu.:65.76                                                         
##  Max.   :87.64                                                         
##  NA's   :155                                                           
##  Employment in agriculture (% of total employment)
##  Min.   : 0.1078                                  
##  1st Qu.: 5.8431                                  
##  Median :18.3271                                  
##  Mean   :24.5136                                  
##  3rd Qu.:38.9447                                  
##  Max.   :86.3098                                  
##  NA's   :155                                      
##  Age dependency ratio (% of working-age population) Tax revenue (% of GDP)
##  Min.   : 17.45                                     Min.   : 0.0001       
##  1st Qu.: 48.44                                     1st Qu.:11.6061       
##  Median : 54.54                                     Median :15.4216       
##  Mean   : 58.65                                     Mean   :16.1972       
##  3rd Qu.: 67.75                                     3rd Qu.:20.5240       
##  Max.   :105.29                                     Max.   :44.4023       
##  NA's   :5                                          NA's   :494           
##  Energy use (kg of oil equivalent per capita)
##  Mode:logical                                
##  NA's:1330                                   
##                                              
##                                              
##                                              
##                                              
##                                              
##  Government expenditure on education, total (% of GDP)
##  Min.   : 0.3585                                      
##  1st Qu.: 3.2962                                      
##  Median : 4.1319                                      
##  Mean   : 4.3082                                      
##  3rd Qu.: 5.0636                                      
##  Max.   :15.3770                                      
##  NA's   :277

stat.desc(Dataset)

##                      Time Time Code Country Name Country Code GDP (current US$)
## nbr.val      1.330000e+03        NA           NA           NA      1.288000e+03
## nbr.null     0.000000e+00        NA           NA           NA      0.000000e+00
## nbr.na       0.000000e+00        NA           NA           NA      4.200000e+01
## min          2.017000e+03        NA           NA           NA      4.527660e+07
## max          2.021000e+03        NA           NA           NA      9.784830e+13
## range        4.000000e+00        NA           NA           NA      9.784825e+13
## sum          2.685270e+06        NA           NA           NA      3.618306e+15
## median       2.019000e+03        NA           NA           NA      5.515636e+10
## mean         2.019000e+03        NA           NA           NA      2.809244e+12
## SE.mean      3.879292e-02        NA           NA           NA      2.673606e+11
## CI.mean.0.95 7.610204e-02        NA           NA           NA      5.245104e+11
## var          2.001505e+00        NA           NA           NA      9.206841e+25
## std.dev      1.414746e+00        NA           NA           NA      9.595228e+12
## coef.var     7.007160e-04        NA           NA           NA      3.415591e+00
##              Gross capital formation (annual % growth)
## nbr.val                                     978.000000
## nbr.null                                      0.000000
## nbr.na                                      352.000000
## min                                        -938.995687
## max                                         182.283579
## range                                      1121.279267
## sum                                        2008.484716
## median                                        3.564137
## mean                                          2.053665
## SE.mean                                       1.204690
## CI.mean.0.95                                  2.364078
## var                                        1419.349717
## std.dev                                      37.674258
## coef.var                                     18.344887
##              Exports of goods and services (% of GDP)
## nbr.val                                  1.131000e+03
## nbr.null                                 0.000000e+00
## nbr.na                                   1.990000e+02
## min                                      2.249870e+00
## max                                      2.132227e+02
## range                                    2.109728e+02
## sum                                      4.494378e+04
## median                                   3.156980e+01
## mean                                     3.973809e+01
## SE.mean                                  8.827692e-01
## CI.mean.0.95                             1.732051e+00
## var                                      8.813674e+02
## std.dev                                  2.968783e+01
## coef.var                                 7.470875e-01
##              Inflation, consumer prices (annual %)
## nbr.val                               1131.0000000
## nbr.null                                 0.0000000
## nbr.na                                 199.0000000
## min                                     -3.2333893
## max                                    557.2018174
## range                                  560.4352067
## sum                                   6269.5973731
## median                                   2.5575448
## mean                                     5.5434106
## SE.mean                                  0.7051755
## CI.mean.0.95                             1.3836005
## var                                    562.4151196
## std.dev                                 23.7152929
## coef.var                                 4.2781051
##              GNI per capita, Atlas method (current US$)
## nbr.val                                    1.236000e+03
## nbr.null                                   0.000000e+00
## nbr.na                                     9.400000e+01
## min                                        2.100000e+02
## max                                        1.209900e+05
## range                                      1.207800e+05
## sum                                        1.827822e+07
## median                                     6.110074e+03
## mean                                       1.478820e+04
## SE.mean                                    5.612330e+02
## CI.mean.0.95                               1.101076e+03
## var                                        3.893184e+08
## std.dev                                    1.973115e+04
## coef.var                                   1.334250e+00
##              Foreign direct investment, net inflows (% of GDP)
## nbr.val                                            1177.000000
## nbr.null                                              5.000000
## nbr.na                                              153.000000
## min                                               -1303.108267
## max                                                 602.374526
## range                                              1905.482793
## sum                                                5708.707365
## median                                                2.225386
## mean                                                  4.850219
## SE.mean                                               1.434614
## CI.mean.0.95                                          2.814689
## var                                                2422.404873
## std.dev                                              49.217932
## coef.var                                             10.147570
##              Domestic credit provided by financial sector (% of GDP)
## nbr.val                                                 2.570000e+02
## nbr.null                                                0.000000e+00
## nbr.na                                                  1.073000e+03
## min                                                     6.612871e+00
## max                                                     3.934513e+02
## range                                                   3.868384e+02
## sum                                                     2.186543e+04
## median                                                  6.473999e+01
## mean                                                    8.507951e+01
## SE.mean                                                 4.231875e+00
## CI.mean.0.95                                            8.333722e+00
## var                                                     4.602554e+03
## std.dev                                                 6.784212e+01
## coef.var                                                7.973968e-01
##              Trade (% of GDP)
## nbr.val          1.126000e+03
## nbr.null         0.000000e+00
## nbr.na           2.040000e+02
## min              4.127549e+00
## max              4.024597e+02
## range            3.983321e+02
## sum              9.529540e+04
## median           7.122244e+01
## mean             8.463179e+01
## SE.mean          1.608174e+00
## CI.mean.0.95     3.155357e+00
## var              2.912087e+03
## std.dev          5.396376e+01
## coef.var         6.376298e-01
##              Net trade in goods and services (BoP, current US$)
## nbr.val                                            9.200000e+02
## nbr.null                                           0.000000e+00
## nbr.na                                             4.100000e+02
## min                                               -8.480720e+11
## max                                                5.252276e+11
## range                                              1.373300e+12
## sum                                                4.815380e+12
## median                                            -2.179975e+08
## mean                                               5.234109e+09
## SE.mean                                            2.259529e+09
## CI.mean.0.95                                       4.434436e+09
## var                                                4.697034e+21
## std.dev                                            6.853491e+10
## coef.var                                           1.309390e+01
##              Labor force participation rate, total (% of total population ages 15+)
## nbr.val                                                                1.175000e+03
## nbr.null                                                               0.000000e+00
## nbr.na                                                                 1.550000e+02
## min                                                                    3.140200e+01
## max                                                                    8.764100e+01
## range                                                                  5.623900e+01
## sum                                                                    7.078724e+04
## median                                                                 6.065558e+01
## mean                                                                   6.024446e+01
## SE.mean                                                                2.784123e-01
## CI.mean.0.95                                                           5.462413e-01
## var                                                                    9.107827e+01
## std.dev                                                                9.543493e+00
## coef.var                                                               1.584128e-01
##              Employment in agriculture (% of total employment)
## nbr.val                                           1.175000e+03
## nbr.null                                          0.000000e+00
## nbr.na                                            1.550000e+02
## min                                               1.077741e-01
## max                                               8.630985e+01
## range                                             8.620208e+01
## sum                                               2.880347e+04
## median                                            1.832706e+01
## mean                                              2.451359e+01
## SE.mean                                           6.059417e-01
## CI.mean.0.95                                      1.188850e+00
## var                                               4.314193e+02
## std.dev                                           2.077064e+01
## coef.var                                          8.473111e-01
##              Age dependency ratio (% of working-age population)
## nbr.val                                            1.325000e+03
## nbr.null                                           0.000000e+00
## nbr.na                                             5.000000e+00
## min                                                1.745307e+01
## max                                                1.052923e+02
## range                                              8.783927e+01
## sum                                                7.771582e+04
## median                                             5.453799e+01
## mean                                               5.865345e+01
## SE.mean                                            4.430052e-01
## CI.mean.0.95                                       8.690687e-01
## var                                                2.600360e+02
## std.dev                                            1.612563e+01
## coef.var                                           2.749307e-01
##              Tax revenue (% of GDP)
## nbr.val                8.360000e+02
## nbr.null               0.000000e+00
## nbr.na                 4.940000e+02
## min                    6.279221e-05
## max                    4.440227e+01
## range                  4.440220e+01
## sum                    1.354087e+04
## median                 1.542158e+01
## mean                   1.619721e+01
## SE.mean                2.151279e-01
## CI.mean.0.95           4.222550e-01
## var                    3.869009e+01
## std.dev                6.220136e+00
## coef.var               3.840251e-01
##              Energy use (kg of oil equivalent per capita)
## nbr.val                                                NA
## nbr.null                                               NA
## nbr.na                                                 NA
## min                                                    NA
## max                                                    NA
## range                                                  NA
## sum                                                    NA
## median                                                 NA
## mean                                                   NA
## SE.mean                                                NA
## CI.mean.0.95                                           NA
## var                                                    NA
## std.dev                                                NA
## coef.var                                               NA
##              Government expenditure on education, total (% of GDP)
## nbr.val                                               1.053000e+03
## nbr.null                                              0.000000e+00
## nbr.na                                                2.770000e+02
## min                                                   3.584790e-01
## max                                                   1.537700e+01
## range                                                 1.501852e+01
## sum                                                   4.536514e+03
## median                                                4.131948e+00
## mean                                                  4.308180e+00
## SE.mean                                               5.141068e-02
## CI.mean.0.95                                          1.008792e-01
## var                                                   2.783141e+00
## std.dev                                               1.668275e+00
## coef.var                                              3.872342e-01

#View(Dataset)
options(scipen=999)
colSums(is.na(Dataset))

##                                                                   Time 
##                                                                      0 
##                                                              Time Code 
##                                                                      0 
##                                                           Country Name 
##                                                                      0 
##                                                           Country Code 
##                                                                      0 
##                                                      GDP (current US$) 
##                                                                     42 
##                              Gross capital formation (annual % growth) 
##                                                                    352 
##                               Exports of goods and services (% of GDP) 
##                                                                    199 
##                                  Inflation, consumer prices (annual %) 
##                                                                    199 
##                             GNI per capita, Atlas method (current US$) 
##                                                                     94 
##                      Foreign direct investment, net inflows (% of GDP) 
##                                                                    153 
##                Domestic credit provided by financial sector (% of GDP) 
##                                                                   1073 
##                                                       Trade (% of GDP) 
##                                                                    204 
##                     Net trade in goods and services (BoP, current US$) 
##                                                                    410 
## Labor force participation rate, total (% of total population ages 15+) 
##                                                                    155 
##                      Employment in agriculture (% of total employment) 
##                                                                    155 
##                     Age dependency ratio (% of working-age population) 
##                                                                      5 
##                                                 Tax revenue (% of GDP) 
##                                                                    494 
##                           Energy use (kg of oil equivalent per capita) 
##                                                                   1330 
##                  Government expenditure on education, total (% of GDP) 
##                                                                    277

Let’s examine what each of these dimensions corresponds to:

GDP (current US$) ~ Total monetary value of all goods and services produced within a country in a specific period, measured in current US dollars.

Gross capital formation (annual % growth) ~ Annual percentage growth rate of gross capital formation, indicating investments in physical assets like buildings, machinery, and infrastructure.

Exports of goods and services (% of GDP) ~ Value of all goods and services exported as a percentage of the country’s GDP.

Inflation, consumer prices (annual %) ~ Annual percentage change in the cost to the average consumer of acquiring a basket of goods and services.

GNI per capita, Atlas method (current US$) ~ Gross National Income divided by the population, adjusted using the Atlas method, expressed in current US dollars.

Foreign direct investment, net inflows (% of GDP) ~ The net inflows of investment to acquire a lasting management interest in a business operating in another economy, as a percentage of GDP.

Domestic credit provided by financial sector (% of GDP) ~ Total credit extended to various sectors of the economy by financial intermediaries, expressed as a percentage of GDP.

Trade (% of GDP) ~ Sum of exports and imports of goods and services measured as a share of GDP.

Net trade in goods and services (BoP, current US$) ~ The difference between exports and imports of goods and services, expressed in current US dollars.

Labor force participation rate, total (% of total population ages 15+) ~ The percentage of the population aged 15 and older that is economically active.

Employment in agriculture (% of total employment) ~ Percentage of total employment that is engaged in agriculture, forestry, and fishing.

Age dependency ratio (% of working-age population) ~ The ratio of dependents (people younger than 15 or older than 64) to the working-age population.

Tax revenue (% of GDP) ~ The total government tax revenue expressed as a percentage of GDP.

Government expenditure on education, total (% of GDP) ~ The total public expenditure on education as a percentage of GDP.

Visualizing the missing values per column in bar chart, histogram and scatter plot for better visibility of missing data

missing_counts <- colSums(is.na(Dataset))
missing_counts <- missing_counts[missing_counts > 0]

Bar chart for missing values per column

if (length(missing_counts) > 0) {
  bar_chart <- ggplot(data.frame(Columns = names(missing_counts), Missing = missing_counts), aes(x = reorder(Columns, Missing), y = Missing)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    coord_flip() +
    labs(title = "Missing Values per Column", x = "Columns", y = "Number of Missing Values") +
    theme_minimal()
} else {
  bar_chart <- ggplot() +
    annotate("text", x = 0.5, y = 0.5, label = "No Missing Values", size = 6, color = "green", hjust = 0.5, vjust = 0.5) +
    theme_void() +
    labs(title = "Missing Values per Column")
}

Histogram of missing values per row

gg_plot <- ggplot(data.frame(missing_per_row = rowSums(is.na(Dataset))), aes(x = missing_per_row)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "white") +
  labs(title = "Histogram of Missing Values per Row", x = "Number of Missing Values", y = "Frequency")

Arranging all visualizations in one line

grid.arrange(gg_plot, bar_chart)

Data Cleaning Part started with filling NA with the mean of each column

colSums(is.na(Dataset))

##                                                                   Time 
##                                                                      0 
##                                                              Time Code 
##                                                                      0 
##                                                           Country Name 
##                                                                      0 
##                                                           Country Code 
##                                                                      0 
##                                                      GDP (current US$) 
##                                                                     42 
##                              Gross capital formation (annual % growth) 
##                                                                    352 
##                               Exports of goods and services (% of GDP) 
##                                                                    199 
##                                  Inflation, consumer prices (annual %) 
##                                                                    199 
##                             GNI per capita, Atlas method (current US$) 
##                                                                     94 
##                      Foreign direct investment, net inflows (% of GDP) 
##                                                                    153 
##                Domestic credit provided by financial sector (% of GDP) 
##                                                                   1073 
##                                                       Trade (% of GDP) 
##                                                                    204 
##                     Net trade in goods and services (BoP, current US$) 
##                                                                    410 
## Labor force participation rate, total (% of total population ages 15+) 
##                                                                    155 
##                      Employment in agriculture (% of total employment) 
##                                                                    155 
##                     Age dependency ratio (% of working-age population) 
##                                                                      5 
##                                                 Tax revenue (% of GDP) 
##                                                                    494 
##                           Energy use (kg of oil equivalent per capita) 
##                                                                   1330 
##                  Government expenditure on education, total (% of GDP) 
##                                                                    277

dim(Dataset)

## [1] 1330   19

Dataset[, 5:19] <- lapply(Dataset[, 5:19], function(x) { ifelse(is.na(x), mean(x, na.rm = TRUE), x)})

double checking

colSums(is.na(Dataset))

##                                                                   Time 
##                                                                      0 
##                                                              Time Code 
##                                                                      0 
##                                                           Country Name 
##                                                                      0 
##                                                           Country Code 
##                                                                      0 
##                                                      GDP (current US$) 
##                                                                      0 
##                              Gross capital formation (annual % growth) 
##                                                                      0 
##                               Exports of goods and services (% of GDP) 
##                                                                      0 
##                                  Inflation, consumer prices (annual %) 
##                                                                      0 
##                             GNI per capita, Atlas method (current US$) 
##                                                                      0 
##                      Foreign direct investment, net inflows (% of GDP) 
##                                                                      0 
##                Domestic credit provided by financial sector (% of GDP) 
##                                                                      0 
##                                                       Trade (% of GDP) 
##                                                                      0 
##                     Net trade in goods and services (BoP, current US$) 
##                                                                      0 
## Labor force participation rate, total (% of total population ages 15+) 
##                                                                      0 
##                      Employment in agriculture (% of total employment) 
##                                                                      0 
##                     Age dependency ratio (% of working-age population) 
##                                                                      0 
##                                                 Tax revenue (% of GDP) 
##                                                                      0 
##                           Energy use (kg of oil equivalent per capita) 
##                                                                   1330 
##                  Government expenditure on education, total (% of GDP) 
##                                                                      0

clearly found a column without any values, so to drop it

blank_cols <- apply(Dataset, 2, function(col) all(is.na(col) | col == "sheet1"))
Dataset<-Dataset[,!blank_cols]
#colSums(is.na(Dataset))

last inspection of the dataset; Count of missing values per column

missing_counts <- colSums(is.na(Dataset))
missing_counts <- missing_counts[missing_counts > 0]

Bar Chart for Missing Values per Column

if (length(missing_counts) > 0) {
  bar_chart <- ggplot(data.frame(Columns = names(missing_counts), Missing = missing_counts), aes(x = reorder(Columns, Missing), y = Missing)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    coord_flip() +
    labs(title = "Missing Values per Column", x = "Columns", y = "Number of Missing Values") +
    theme_minimal()
} else {
  bar_chart <- ggplot() +
    annotate("text", x = 0.5, y = 0.5, label = "No Missing Values", size = 6, color = "black", hjust = 0.5, vjust = 0.5) +
    theme_void() +
    labs(title = "Missing Values per Column")
}

Scatter Plot of Missing Values per Column

missing_indices <- which(is.na(Dataset), arr.ind = TRUE)
scatter_data <- data.frame(
  Rows = missing_indices[, 1],  
  Columns = factor(missing_indices[, 2], levels = 1:ncol(Dataset)) 
)

scatter_plot <- ggplot(scatter_data, aes(x = Columns, y = Rows)) +
  geom_point(color = "red", size = 1.5) +
  labs(title = "Scatter Plot of Missing Values", x = "Columns", y = "Rows") +
  theme_minimal()

Histogram of Missing Values per Row

gg_plot <- ggplot(data.frame(missing_per_row = rowSums(is.na(Dataset))), aes(x = missing_per_row)) +
  geom_histogram(binwidth = 1, fill = "brown", color = "white") +
  labs(title = "Missing Values per Row", x = "Number of Missing Values", y = "Frequency")

Proportion of Missing Values in the Dataset (vizualizing it in Pie Chart)

missing_proportion <- sum(is.na(Dataset)) / (nrow(Dataset) * ncol(Dataset)) * 100
pie_data <- data.frame(
  Category = c("Missing", "Non-Missing"),
  Value = c(sum(is.na(Dataset)), (nrow(Dataset) * ncol(Dataset)) - sum(is.na(Dataset)))
)

pie_chart <- ggplot(pie_data, aes(x = "", y = Value, fill = Category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = paste("Proportion of Missing Values: ", round(missing_proportion, 2), "%")) +
  theme_void() +
  scale_fill_manual(values = c("black", "beige"))

Arranging all visualizations into a single view

grid.arrange(gg_plot, bar_chart, scatter_plot, pie_chart, ncol = 2)

Selecting only integer columns to start the procedure

summary(Dataset)

##       Time       Time Code         Country Name       Country Code      
##  Min.   :2017   Length:1330        Length:1330        Length:1330       
##  1st Qu.:2018   Class :character   Class :character   Class :character  
##  Median :2019   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2019                                                           
##  3rd Qu.:2020                                                           
##  Max.   :2021                                                           
##  GDP (current US$)        Gross capital formation (annual % growth)
##  Min.   :      45276595   Min.   :-938.9957                        
##  1st Qu.:   10293768251   1st Qu.:  -0.2633                        
##  Median :   61945227965   Median :   2.0537                        
##  Mean   : 2809243890590   Mean   :   2.0537                        
##  3rd Qu.:  870364123213   3rd Qu.:   6.6354                        
##  Max.   :97848299706500   Max.   : 182.2836                        
##  Exports of goods and services (% of GDP) Inflation, consumer prices (annual %)
##  Min.   :  2.25                           Min.   : -3.233                      
##  1st Qu.: 23.49                           1st Qu.:  1.627                      
##  Median : 36.81                           Median :  3.016                      
##  Mean   : 39.74                           Mean   :  5.543                      
##  3rd Qu.: 44.26                           3rd Qu.:  5.543                      
##  Max.   :213.22                           Max.   :557.202                      
##  GNI per capita, Atlas method (current US$)
##  Min.   :   210                            
##  1st Qu.:  2220                            
##  Median :  6946                            
##  Mean   : 14788                            
##  3rd Qu.: 16655                            
##  Max.   :120990                            
##  Foreign direct investment, net inflows (% of GDP)
##  Min.   :-1303.108                                
##  1st Qu.:    1.258                                
##  Median :    2.625                                
##  Mean   :    4.850                                
##  3rd Qu.:    4.850                                
##  Max.   :  602.375                                
##  Domestic credit provided by financial sector (% of GDP) Trade (% of GDP) 
##  Min.   :  6.613                                         Min.   :  4.128  
##  1st Qu.: 85.079                                         1st Qu.: 54.459  
##  Median : 85.079                                         Median : 79.906  
##  Mean   : 85.079                                         Mean   : 84.632  
##  3rd Qu.: 85.079                                         3rd Qu.: 93.338  
##  Max.   :393.451                                         Max.   :402.460  
##  Net trade in goods and services (BoP, current US$)
##  Min.   :-848072000000                             
##  1st Qu.:  -1138954934                             
##  Median :   1637745563                             
##  Mean   :   5234108909                             
##  3rd Qu.:   5234108909                             
##  Max.   : 525227577932                             
##  Labor force participation rate, total (% of total population ages 15+)
##  Min.   :31.40                                                         
##  1st Qu.:56.39                                                         
##  Median :60.24                                                         
##  Mean   :60.24                                                         
##  3rd Qu.:65.09                                                         
##  Max.   :87.64                                                         
##  Employment in agriculture (% of total employment)
##  Min.   : 0.1078                                  
##  1st Qu.: 7.3030                                  
##  Median :24.5136                                  
##  Mean   :24.5136                                  
##  3rd Qu.:36.0843                                  
##  Max.   :86.3098                                  
##  Age dependency ratio (% of working-age population) Tax revenue (% of GDP)
##  Min.   : 17.45                                     Min.   : 0.00006      
##  1st Qu.: 48.46                                     1st Qu.:13.90486      
##  Median : 54.58                                     Median :16.19721      
##  Mean   : 58.65                                     Mean   :16.19721      
##  3rd Qu.: 67.67                                     3rd Qu.:17.55884      
##  Max.   :105.29                                     Max.   :44.40227      
##  Government expenditure on education, total (% of GDP)
##  Min.   : 0.3585                                      
##  1st Qu.: 3.5262                                      
##  Median : 4.3082                                      
##  Mean   : 4.3082                                      
##  3rd Qu.: 4.8626                                      
##  Max.   :15.3770

str(Dataset)

## tibble [1,330 × 18] (S3: tbl_df/tbl/data.frame)
##  $ Time                                                                  : num [1:1330] 2017 2017 2017 2017 2017 ...
##  $ Time Code                                                             : chr [1:1330] "YR2017" "YR2017" "YR2017" "YR2017" ...
##  $ Country Name                                                          : chr [1:1330] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country Code                                                          : chr [1:1330] "AFG" "ALB" "DZA" "ASM" ...
##  $ GDP (current US$)                                                     : num [1:1330] 18753456498 13019726212 189880896903 612000000 3000162081 ...
##  $ Gross capital formation (annual % growth)                             : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
##  $ Exports of goods and services (% of GDP)                              : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
##  $ Inflation, consumer prices (annual %)                                 : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
##  $ GNI per capita, Atlas method (current US$)                            : num [1:1330] 530 4290 4440 14788 14788 ...
##  $ Foreign direct investment, net inflows (% of GDP)                     : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
##  $ Domestic credit provided by financial sector (% of GDP)               : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
##  $ Trade (% of GDP)                                                      : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
##  $ Net trade in goods and services (BoP, current US$)                    : num [1:1330] -6803560118 -1978658722 -22402635934 5234108909 5234108909 ...
##  $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
##  $ Employment in agriculture (% of total employment)                     : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
##  $ Age dependency ratio (% of working-age population)                    : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
##  $ Tax revenue (% of GDP)                                                : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
##  $ Government expenditure on education, total (% of GDP)                 : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...

dim(Dataset)

## [1] 1330   18

measures<-Dataset[5:18]

Observing distribution of all columns in the dataset; selecting numeric columns; melting

numeric_columns <- sapply(measures, is.numeric)
melted_data_before <- melt(measures[, numeric_columns], variable.name = "Column", value.name = "Value")

## No id variables; using all as measure variables

head(melted_data_before)

##              Column        Value
## 1 GDP (current US$)  18753456498
## 2 GDP (current US$)  13019726212
## 3 GDP (current US$) 189880896903
## 4 GDP (current US$)    612000000
## 5 GDP (current US$)   3000162081
## 6 GDP (current US$)  73690154991

Ploting the distributions before standardization

distribution_plot_before <- ggplot(melted_data_before, aes(x = Value)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
  facet_wrap(~ Column, scales = "free", ncol = 3) +
  labs(title = "Distribution of Numeric Columns (Before Standardization)", x = "Value", y = "Frequency") +
  theme_minimal()

Display the before standardization plot

grid.arrange(distribution_plot_before, ncol = 1)

## Warning: Computation failed in `stat_bin()`.
## Computation failed in `stat_bin()`.
## Caused by error in `bin_breaks_width()`:
## ! The number of histogram bins must be less than 1,000,000.
## ℹ Did you make `binwidth` too small?

View(measures)

Observing the outliers in Boxplot

numerical_columns <- sapply(measures, is.numeric)
numerical_data <- measures[, numerical_columns]
long_numerical_data <- gather(numerical_data, key = "variable", value = "value")
ggplot(long_numerical_data, aes(x = variable, y = value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Boxplot to detect outliers")

summary(measures$`GDP (current US$)`)

##           Min.        1st Qu.         Median           Mean        3rd Qu. 
##       45276595    10293768251    61945227965  2809243890592   870364123213 
##           Max. 
## 97848299706455

colSums(is.na(measures))

##                                                      GDP (current US$) 
##                                                                      0 
##                              Gross capital formation (annual % growth) 
##                                                                      0 
##                               Exports of goods and services (% of GDP) 
##                                                                      0 
##                                  Inflation, consumer prices (annual %) 
##                                                                      0 
##                             GNI per capita, Atlas method (current US$) 
##                                                                      0 
##                      Foreign direct investment, net inflows (% of GDP) 
##                                                                      0 
##                Domestic credit provided by financial sector (% of GDP) 
##                                                                      0 
##                                                       Trade (% of GDP) 
##                                                                      0 
##                     Net trade in goods and services (BoP, current US$) 
##                                                                      0 
## Labor force participation rate, total (% of total population ages 15+) 
##                                                                      0 
##                      Employment in agriculture (% of total employment) 
##                                                                      0 
##                     Age dependency ratio (% of working-age population) 
##                                                                      0 
##                                                 Tax revenue (% of GDP) 
##                                                                      0 
##                  Government expenditure on education, total (% of GDP) 
##                                                                      0

Since we detected outliers in GDP, GNI and Net trade; first log transform(there are non positive numbers converted to NaN’s); I’ll handle them by adding a constant value to make them positive)

measures$log_GDP <- log(measures$`GDP (current US$)`)
measures$log_Trade <- log(measures$`Net trade in goods and services (BoP, current US$)`)

## Warning in log(measures$`Net trade in goods and services (BoP, current US$)`):
## NaNs produced

summary(measures$`Net trade in goods and services (BoP, current US$)`)

##          Min.       1st Qu.        Median          Mean       3rd Qu. 
## -848072000000   -1138954934    1637745563    5234108909    5234108909 
##          Max. 
##  525227577932

sum(measures$`Net trade in goods and services (BoP, current US$)` <= 0)

## [1] 561

measures$log_Trade <- log(measures$`Net trade in goods and services (BoP, current US$)` + abs(min(measures$`Net trade in goods and services (BoP, current US$)`, na.rm = TRUE)) + 1)
measures$log_GNI <- log(measures$`GNI per capita, Atlas method (current US$)`)

Checking if i’ve got rid of NaN’s and all fine?

summary(measures$log_Trade)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   27.46   27.47   27.45   27.47   27.95

sum(is.na(measures$log_Trade))

## [1] 0

I will exclude the original GDP from the dataset; reshaped it for better visualization. The boxplots helped spot outliers, and I wrapped it up with a quick summary of log_GDP. Simple and quick to understand the data.

dim(measures)

## [1] 1330   17

str(measures)

## tibble [1,330 × 17] (S3: tbl_df/tbl/data.frame)
##  $ GDP (current US$)                                                     : num [1:1330] 18753456498 13019726212 189880896903 612000000 3000162081 ...
##  $ Gross capital formation (annual % growth)                             : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
##  $ Exports of goods and services (% of GDP)                              : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
##  $ Inflation, consumer prices (annual %)                                 : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
##  $ GNI per capita, Atlas method (current US$)                            : num [1:1330] 530 4290 4440 14788 14788 ...
##  $ Foreign direct investment, net inflows (% of GDP)                     : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
##  $ Domestic credit provided by financial sector (% of GDP)               : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
##  $ Trade (% of GDP)                                                      : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
##  $ Net trade in goods and services (BoP, current US$)                    : num [1:1330] -6803560118 -1978658722 -22402635934 5234108909 5234108909 ...
##  $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
##  $ Employment in agriculture (% of total employment)                     : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
##  $ Age dependency ratio (% of working-age population)                    : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
##  $ Tax revenue (% of GDP)                                                : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
##  $ Government expenditure on education, total (% of GDP)                 : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...
##  $ log_GDP                                                               : num [1:1330] 23.7 23.3 26 20.2 21.8 ...
##  $ log_Trade                                                             : num [1:1330] 27.5 27.5 27.4 27.5 27.5 ...
##  $ log_GNI                                                               : num [1:1330] 6.27 8.36 8.4 9.6 9.6 ...

measures<-measures[,2:17]
measures$`Net trade in goods and services (BoP, current US$)`<-NULL
measures$`GNI per capita, Atlas method (current US$)`<-NULL
View(measures)
numerical_columns <- sapply(measures, is.numeric)
numerical_data <- measures[, numerical_columns]
long_numerical_data <- gather(numerical_data, key = "variable", value = "value")
ggplot(long_numerical_data, aes(x = variable, y = value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Boxplot to detect outliers")

colSums(is.na(measures))

##                              Gross capital formation (annual % growth) 
##                                                                      0 
##                               Exports of goods and services (% of GDP) 
##                                                                      0 
##                                  Inflation, consumer prices (annual %) 
##                                                                      0 
##                      Foreign direct investment, net inflows (% of GDP) 
##                                                                      0 
##                Domestic credit provided by financial sector (% of GDP) 
##                                                                      0 
##                                                       Trade (% of GDP) 
##                                                                      0 
## Labor force participation rate, total (% of total population ages 15+) 
##                                                                      0 
##                      Employment in agriculture (% of total employment) 
##                                                                      0 
##                     Age dependency ratio (% of working-age population) 
##                                                                      0 
##                                                 Tax revenue (% of GDP) 
##                                                                      0 
##                  Government expenditure on education, total (% of GDP) 
##                                                                      0 
##                                                                log_GDP 
##                                                                      0 
##                                                              log_Trade 
##                                                                      0 
##                                                                log_GNI 
##                                                                      0

boxplot(measures$log_GDP, main = "Boxplot of Log-Transformed GDP")

summary(measures$log_GDP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.63   23.05   24.85   25.18   27.49   32.21

Calculate the correlation matrix

cor_matrix <- cor(numerical_data, use = "complete.obs")
long_cor_data <- melt(cor_matrix)
ggplot(data = long_cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap")

Each cell in the heatmap represents the Pearson correlation coefficient between two variables, with the color intensity indicating the strength and direction of the correlation: Red indicates a strong positive correlation, meaning that as one variable increases, the other tends to increase as well. Blue indicates a strong negative correlation, meaning that as one variable increases, the other tends to decrease. Light colors near white indicate weak or no correlation between the variables. Exports of goods and services (% of GDP) vs. GDP (current US$): There is a very strong positive correlation (0.97), indicating that higher exports are strongly associated with higher GDP. GNI per capita (Atlas method) vs. Domestic credit provided by the financial sector: There is a moderate positive correlation, suggesting that countries with higher GNI per capita tend to have higher domestic credit provided by the financial sector. Government expenditure on education vs. Tax revenue: These variables have a positive correlation, indicating that higher tax revenues are often associated with higher government spending on education. Inflation (consumer prices) vs. Net trade in goods and services: There is a negative correlation, indicating that higher inflation is generally associated with lower net trade in goods and services.

Now, I’ll create a pairplot using ggpairs

ggpairs(numerical_data, title = "Pairplot of Numerical Features")

Z-Standardizing the Dataset measures and naming it “measures_z” for further analysis

lapply(measures, class)

## $`Gross capital formation (annual % growth)`
## [1] "numeric"
## 
## $`Exports of goods and services (% of GDP)`
## [1] "numeric"
## 
## $`Inflation, consumer prices (annual %)`
## [1] "numeric"
## 
## $`Foreign direct investment, net inflows (% of GDP)`
## [1] "numeric"
## 
## $`Domestic credit provided by financial sector (% of GDP)`
## [1] "numeric"
## 
## $`Trade (% of GDP)`
## [1] "numeric"
## 
## $`Labor force participation rate, total (% of total population ages 15+)`
## [1] "numeric"
## 
## $`Employment in agriculture (% of total employment)`
## [1] "numeric"
## 
## $`Age dependency ratio (% of working-age population)`
## [1] "numeric"
## 
## $`Tax revenue (% of GDP)`
## [1] "numeric"
## 
## $`Government expenditure on education, total (% of GDP)`
## [1] "numeric"
## 
## $log_GDP
## [1] "numeric"
## 
## $log_Trade
## [1] "numeric"
## 
## $log_GNI
## [1] "numeric"

measures_z<-as.data.frame(lapply(measures, scale))

Clustering K-means, PAM, Heirarchical

K-Means

Let’s try with 5 clusters

kmeans_result <- kmeans(measures_z, centers = 5)

check sizes and centers of these clusters

kmeans_result$size

## [1] 298 200 539  35 258

kmeans_result$centers

##   Gross.capital.formation..annual...growth.
## 1                                0.01975934
## 2                                0.07999539
## 3                               -0.01209056
## 4                               -0.71583054
## 5                                0.03753303
##   Exports.of.goods.and.services....of.GDP.
## 1                               -0.2377363
## 2                               -0.5594463
## 3                                0.2459651
## 4                                4.4789645
## 5                               -0.4131949
##   Inflation..consumer.prices..annual...
## 1                            -0.1198305
## 2                             0.2006299
## 3                            -0.1040173
## 4                            -0.1738549
## 5                             0.2237742
##   Foreign.direct.investment..net.inflows....of.GDP.
## 1                                       -0.14985136
## 2                                       -0.05866582
## 3                                        0.03211127
## 4                                        1.47054232
## 5                                       -0.04801585
##   Domestic.credit.provided.by.financial.sector....of.GDP. Trade....of.GDP.
## 1                                              0.32282162       -0.3946953
## 2                                             -0.13548389       -0.5315882
## 3                                             -0.06450083        0.3181928
## 4                                              0.00000000        4.4317897
## 5                                             -0.13309348       -0.3979913
##   Labor.force.participation.rate..total....of.total.population.ages.15..
## 1                                                              0.3257064
## 2                                                             -1.5001958
## 3                                                              0.0340419
## 4                                                             -0.2459693
## 5                                                              0.7489884
##   Employment.in.agriculture....of.total.employment.
## 1                                        -0.4701392
## 2                                         0.1481540
## 3                                        -0.4428752
## 4                                        -1.0159575
## 5                                         1.4912362
##   Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1                                         -0.6074597             -0.6211124
## 2                                          0.3304216             -0.3371721
## 3                                         -0.4457363              0.6346148
## 4                                         -0.7747784              0.4330110
## 5                                          1.4818131             -0.4057629
##   Government.expenditure.on.education..total....of.GDP.     log_GDP   log_Trade
## 1                                            -0.1008557  1.11499483 -0.08565954
## 2                                            -0.2171585  0.03298209  0.01397512
## 3                                             0.4408338 -0.54989407  0.02649677
## 4                                            -0.3957379 -0.22161380  0.07611573
## 5                                            -0.5824492 -0.13455617  0.02242523
##      log_GNI
## 1  0.5628902
## 2 -0.7098658
## 3  0.4567612
## 4  1.2001553
## 5 -1.2169296

measures$cluster <- kmeans_result$cluster
head(measures)

## # A tibble: 6 × 15
##   Gross capital formation (annua…¹ Exports of goods and…² Inflation, consumer …³
##                              <dbl>                  <dbl>                  <dbl>
## 1                             2.05                   39.7                   4.98
## 2                             4.37                   31.6                   1.99
## 3                            -2.23                   20.3                   5.59
## 4                             2.05                   59.2                   5.54
## 5                             2.05                   39.7                   5.54
## 6                             3.00                   29.0                  29.8 
## # ℹ abbreviated names: ¹`Gross capital formation (annual % growth)`,
## #   ²`Exports of goods and services (% of GDP)`,
## #   ³`Inflation, consumer prices (annual %)`
## # ℹ 12 more variables:
## #   `Foreign direct investment, net inflows (% of GDP)` <dbl>,
## #   `Domestic credit provided by financial sector (% of GDP)` <dbl>,
## #   `Trade (% of GDP)` <dbl>, …

str(measures)

## tibble [1,330 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Gross capital formation (annual % growth)                             : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
##  $ Exports of goods and services (% of GDP)                              : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
##  $ Inflation, consumer prices (annual %)                                 : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
##  $ Foreign direct investment, net inflows (% of GDP)                     : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
##  $ Domestic credit provided by financial sector (% of GDP)               : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
##  $ Trade (% of GDP)                                                      : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
##  $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
##  $ Employment in agriculture (% of total employment)                     : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
##  $ Age dependency ratio (% of working-age population)                    : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
##  $ Tax revenue (% of GDP)                                                : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
##  $ Government expenditure on education, total (% of GDP)                 : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...
##  $ log_GDP                                                               : num [1:1330] 23.7 23.3 26 20.2 21.8 ...
##  $ log_Trade                                                             : num [1:1330] 27.5 27.5 27.4 27.5 27.5 ...
##  $ log_GNI                                                               : num [1:1330] 6.27 8.36 8.4 9.6 9.6 ...
##  $ cluster                                                               : int [1:1330] 2 3 2 3 3 5 3 1 3 3 ...

Mean of log_Trade/log_GDP/log_GNI

aggregate(data = measures, log_Trade ~ cluster, mean)

##   cluster log_Trade
## 1       1  27.38524
## 2       2  27.46068
## 3       3  27.47016
## 4       4  27.50773
## 5       5  27.46708

aggregate(data = measures, log_GDP ~ cluster, mean)

##   cluster  log_GDP
## 1       1 28.49906
## 2       2 25.27370
## 3       3 23.53621
## 4       4 24.51478
## 5       5 24.77429

aggregate(data = measures, log_GNI ~ cluster, mean)

##   cluster   log_GNI
## 1       1  9.570689
## 2       2  7.854246
## 3       3  9.427563
## 4       4 10.430107
## 5       5  7.170418

attributes(kmeans_result)

## $names
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"      
## 
## $class
## [1] "kmeans"

Changing the number of clusters to 4

measures_reduced <- measures[9:12]
measures_reduced_z <- as.data.frame(lapply(measures_reduced, scale))
measures_reduced_km3 <- kmeans(measures_reduced_z, 4)

plot(measures_reduced_z, col = measures_reduced_km3$cluster, pch=".", cex=4)
points(measures_reduced_km3$centers, col = 1:5, pch = 8, cex=2, lwd=2)

fviz_cluster(list(data=measures_reduced_z, cluster=measures_reduced_km3$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())

Parallel coordinate plot for cluster characteristics

ggparcoord(data = cbind(measures_reduced_z, cluster = as.factor(measures_reduced_km3$cluster)), 
           columns = 1:(ncol(measures_reduced_z)), groupColumn = "cluster", 
           scale = "uniminmax", showPoints = TRUE, 
           title = "Parallel Coordinate Plot of Clusters", 
           alphaLines = 0.5) + theme_minimal()

narrow down the scope to compute the silhouette width faster (run the broader dataset on your own)

measures_reduced_km4 <- kmeans(measures_reduced_z, 3)

sil<-silhouette(measures_reduced_km4$cluster, dist(measures_reduced_z))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1  312          0.30
## 2       2  418          0.37
## 3       3  600          0.20

alternative commands for k-means

k1 <- kmeans(measures_reduced_z, centers = 3, nstart = 25)

plot(measures_reduced_z, col = k1$cluster, pch = 16, main = "K-means Clustering")
points(k1$centers, col = 1:3, pch = 8, cex = 2, lwd = 2)

In observing the scatterplot matrix, there are instances where clusters overlap in certain dimensions. This overlap suggests that the separation between clusters is less distinct in these particular variable combinations. The spread of clusters also varies across the subplots, indicating that some variables play a more significant role in distinguishing the clusters than others.

summary(k1)

##              Length Class  Mode   
## cluster      1330   -none- numeric
## centers        12   -none- numeric
## totss           1   -none- numeric
## withinss        3   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            3   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

attributes(k1)

## $names
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"      
## 
## $class
## [1] "kmeans"

print(k1$centers)

##   Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1                                         -0.3602481              0.6504696
## 2                                         -0.5553496             -0.5698846
## 3                                          1.4355602             -0.4952265
##   Government.expenditure.on.education..total....of.GDP.    log_GDP
## 1                                             0.4841496 -0.5795626
## 2                                            -0.2623144  0.9565580
## 3                                            -0.5844079 -0.1571521

I’m using Elbow Method to identify optimal number of clusters so far.

elbow_clus <- fviz_nbclust(measures_reduced_z, kmeans, method = "wss") +
  ggtitle("Elbow Method for Optimal Clusters")
plot(elbow_clus)

Now Silhouette Method

sil_clus <- fviz_nbclust(measures_reduced_z, kmeans, method = "silhouette") +
  ggtitle("Silhouette Method for Optimal Clusters")
plot(sil_clus)

PAM clustering with 3 clusters

c1 <- pam(measures_reduced_z, 3)
print(c1)

## Medoids:
##        ID Age.dependency.ratio....of.working.age.population.
## [1,]  216                                          1.7045119
## [2,] 1210                                         -0.4511421
## [3,]  446                                         -0.4622888
##      Tax.revenue....of.GDP.
## [1,]             -0.2053302
## [2,]              0.2425268
## [3,]             -0.4198928
##      Government.expenditure.on.education..total....of.GDP.    log_GDP
## [1,]                                           -0.38978098 -0.4021940
## [2,]                                            0.00000000 -0.6082215
## [3,]                                           -0.08670267  0.9441671
## Clustering vector:
##    [1] 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2
##   [38] 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3
##   [75] 1 3 2 2 2 2 1 1 1 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 2
##  [112] 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 3 2 2 2 2 2 2 2 1 1 2 2 2
##  [149] 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1 3 2 2 2 2 1 2 1 3 3 2 2 3 2
##  [186] 1 2 2 3 1 2 1 3 1 1 2 2 2 3 2 2 2 1 2 3 3 3 2 3 1 3 3 2 1 1 1 1 1 1 3 2 3
##  [223] 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 2 2 3 1 2
##  [260] 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2 2 2 2 3 3 2 2
##  [297] 1 1 2 2 1 3 2 1 1 2 2 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2
##  [334] 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3
##  [371] 3 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 3 2 2 2 2 2
##  [408] 2 2 1 1 2 2 2 2 1 2 3 2 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1 3 2 2 2 2 1 2
##  [445] 1 3 3 2 2 3 2 1 2 2 3 1 2 1 3 2 1 2 2 2 3 2 2 2 1 2 3 3 3 2 2 2 3 3 2 1 1
##  [482] 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 3 3
##  [519] 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2
##  [556] 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2
##  [593] 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3 3 3 3 3 2 2 3
##  [630] 2 3 2 3 1 2 3 3 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2
##  [667] 1 3 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1
##  [704] 3 2 2 2 2 1 2 1 3 3 2 2 2 2 1 2 2 3 2 2 1 3 2 1 2 2 2 3 2 2 2 1 2 3 3 3 2
##  [741] 3 1 3 3 2 1 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3
##  [778] 1 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 2 2 2 1 2 3 2 2 3 2 2 2 3 3
##  [815] 2 3 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3
##  [852] 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3
##  [889] 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 3 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2
##  [926] 3 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1
##  [963] 2 1 3 1 2 2 2 3 2 2 2 2 1 2 1 3 3 2 2 3 2 1 2 2 3 2 2 1 3 1 1 2 2 2 3 2 2
## [1000] 2 1 2 3 3 3 2 2 1 3 3 2 1 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3
## [1037] 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2
## [1074] 2 3 2 2 2 2 3 2 3 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 2 3 3 1 1 1
## [1111] 2 1 2 3 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1
## [1148] 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 3 2 1 2 2 2 2 2 1 1
## [1185] 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 2 3 3 3 3
## [1222] 2 2 3 3 3 1 1 2 1 3 1 2 2 2 3 2 2 2 2 1 2 1 3 3 2 2 2 2 1 2 2 3 2 2 1 3 1
## [1259] 1 1 2 2 3 2 2 2 1 2 3 3 3 2 2 2 3 3 2 2 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3
## [1296] 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 2 3 1 3 3 3 1 1 1 3 3
## Objective function:
##    build     swap 
## 1.455119 1.372084 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

Displaying the medoids; clustering results;summary; info; plot and viszualization with factoextra

print(c1$medoids)

##      Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## [1,]                                          1.7045119             -0.2053302
## [2,]                                         -0.4511421              0.2425268
## [3,]                                         -0.4622888             -0.4198928
##      Government.expenditure.on.education..total....of.GDP.    log_GDP
## [1,]                                           -0.38978098 -0.4021940
## [2,]                                            0.00000000 -0.6082215
## [3,]                                           -0.08670267  0.9441671

print(head(c1$clustering))

## [1] 1 2 3 2 2 1

sil <- silhouette(c1)

fviz_silhouette(sil) + 
  ggtitle("Silhouette Plot for PAM Clustering") +
  theme_minimal()

##   cluster size ave.sil.width
## 1       1  305          0.30
## 2       2  592          0.20
## 3       3  433          0.34

fviz_cluster(c1, geom = "point", ellipse.type = "convex") + 
  ggtitle("Cluster Plot with PAM Clustering") +
  theme_minimal()

Better graphics with cluster visualization and also another simple clustering method

fviz_cluster(c1, geom = "point", ellipse.type = "norm")

fviz_cluster(c1, geom = "point", ellipse.type = "convex")

pam_cluster_result <- eclust(measures_reduced_z, "pam", k = 3)

fviz_silhouette(pam_cluster_result)

##   cluster size ave.sil.width
## 1       1  305          0.30
## 2       2  592          0.20
## 3       3  433          0.34

fviz_cluster(pam_cluster_result)

Experiment with different numbers of clusters and distance metrics

pam_manhattan <- eclust(measures_reduced_z, "pam", k = 2, hc_metric = "manhattan")

fviz_silhouette(pam_manhattan)

##   cluster size ave.sil.width
## 1       1  315          0.35
## 2       2 1015          0.23

fviz_cluster(pam_manhattan)

Determining the optimal # of clusters using the elbow and Silhouette method

opt_clusters_elbow <- Optimal_Clusters_Medoids(measures_reduced_z, max_clusters = 10, distance_metric = "euclidean", plot_clusters = TRUE)

##   
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?

## Warning: The plot can not be created for the specified number of clusters. This
## means the output data do not fit in the figure (plot) margins.

opt_clusters_silhouette <- Optimal_Clusters_Medoids(measures_reduced_z, max_clusters = 10, distance_metric = "euclidean", 
                                                    plot_clusters = TRUE, criterion = "silhouette")

##   
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?

## Warning: The plot can not be created for the specified number of clusters. This
## means the output data do not fit in the figure (plot) margins.

Dimention Reduction PCA, t-SNE, UMAP, Factor Analysis, MDS

I aim to retain 70-95% of the total variance in the dataset. First, I use PCA to understand the variance distribution. For clustering and visualization, I utilize t-SNE or UMAP to gain insights into cluster structures. When I need interpretable factors, I apply Factor Analysis. To determine a meaningful epsilon (eps) value, I experiment with dimensionality reduction techniques. Additionally, I use MDS, as my dataset is of moderate size, to explore relationships within the data.

Inspecting correlation and removing unnecessary cols

cor_matrix <- cor(measures_z)

corrplot(cor_matrix, method = "color", type = "upper", 
         tl.cex = 0.4, tl.col = "black", order = "hclust", 
         number.cex = 0.4)

Find highly correlated features (absolute value > 0.8)

threshold <- 0.8
high_corr <- which(abs(cor_matrix) > threshold & abs(cor_matrix) < 1, arr.ind = TRUE)

high_corr_pairs <- data.frame(
  Feature1 = rownames(cor_matrix)[high_corr[, 1]],
  Feature2 = colnames(cor_matrix)[high_corr[, 2]],
  Correlation = cor_matrix[high_corr]
)

high_corr_pairs

##                                   Feature1
## 1                         Trade....of.GDP.
## 2 Exports.of.goods.and.services....of.GDP.
##                                   Feature2 Correlation
## 1 Exports.of.goods.and.services....of.GDP.   0.9660158
## 2                         Trade....of.GDP.   0.9660158

drop_features <- unique(high_corr_pairs$Feature2)
reduced_data <- measures_reduced_z[, !(colnames(measures_reduced_z) %in% drop_features)]

Compute correlation matrix for reduced data

cor_matrix_reduced <- cor(reduced_data)
corrplot(cor_matrix_reduced, method = "color", type = "upper", 
         tl.cex = 0.7, tl.col = "black", order = "hclust")

Compute the correlation matrix

cor_matrix <- cor(measures_reduced_z, use = "pairwise.complete.obs")

high_cor_pairs <- which(abs(cor_matrix) > 0.8 & lower.tri(cor_matrix), arr.ind = TRUE)

high_cor_features <- data.frame(
  Feature1 = rownames(cor_matrix)[high_cor_pairs[, 1]],
  Feature2 = colnames(cor_matrix)[high_cor_pairs[, 2]],
  Correlation = cor_matrix[high_cor_pairs]
)
print(high_cor_features)

## [1] Feature1    Feature2    Correlation
## <0 rows> (or 0-length row.names)

The result provided is fine-it simply means that no features in the dataset

Let’s decide on features to drop (manual selection)

I focused on highly correlated dimensions because it can improve clustering results after dimensionality reduction, these dimensions capture more of the shared structure or patterns in the data. I would like to give more supporting points to prove why it’s effective to do so:

Enhanced signal capture highly correlated dimensions often represent underlying relationships or patterns in the data. By combining these dimensions, the dimensionality reduction process can better identify and preserve the meaningful variance in the dataset, which is essential to form distinct clusters.

Reduced noise Focusing on highly correlated dimensions before dimensionality reduction can make clustering work better because it helps highlight the most important patterns in your data. When you include less correlated or uncorrelated dimensions, they often introduce noise instead of useful information. This can dilute the impact of the meaningful dimensions, making clusters less clear. By narrowing it down to highly correlated dimensions, you ensure the reduced data reflects the most significant structure.

Improved Separation Highly correlated dimensions also improve how well clusters separate. Clustering algorithms work best when groups are naturally distinct, and these dimensions help create cleaner axes in the reduced space, making it easier to spot the clusters. Techniques like PCA work by retaining the maximum variance in fewer dimensions. When you start with dimensions that are highly correlated, the resulting reduced space tends to explain more of the data’s variability, keeping the important structure intact for clustering.

Avoiding redundancy And even though highly correlated dimensions might feel redundant, PCA or similar methods automatically merge them into components. This avoids the redundancy while still capturing the shared variability, leaving you with a simpler yet informative dataset that’s great for clustering.

measures_reduced_z_filtered <- measures_reduced_z[, !(names(measures_reduced_z) %in% drop_features)]

dim(measures_reduced_z)

## [1] 1330    4

summary(measures_reduced_z)

##  Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
##  Min.   :-2.5598                                    Min.   :-3.2852       
##  1st Qu.:-0.6332                                    1st Qu.:-0.4649       
##  Median :-0.2528                                    Median : 0.0000       
##  Mean   : 0.0000                                    Mean   : 0.0000       
##  3rd Qu.: 0.5605                                    3rd Qu.: 0.2762       
##  Max.   : 2.8977                                    Max.   : 5.7207       
##  Government.expenditure.on.education..total....of.GDP.    log_GDP       
##  Min.   :-2.6610                                       Min.   :-2.5318  
##  1st Qu.:-0.5268                                       1st Qu.:-0.7114  
##  Median : 0.0000                                       Median :-0.1093  
##  Mean   : 0.0000                                       Mean   : 0.0000  
##  3rd Qu.: 0.3736                                       3rd Qu.: 0.7772  
##  Max.   : 7.4574                                       Max.   : 2.3614

head(measures_reduced_z)

##   Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1                                         2.02793548             -1.2775403
## 2                                        -0.77530049              0.5456519
## 3                                        -0.23757802              0.0000000
## 4                                        -0.05010052              0.0000000
## 5                                        -1.23359774              0.0000000
## 6                                         2.07626866             -1.4143803
##   Government.expenditure.on.education..total....of.GDP.     log_GDP
## 1                                             0.0235873 -0.51016343
## 2                                            -0.4692274 -0.63258071
## 3                                             0.9611751  0.26645656
## 4                                             0.0000000 -1.65827719
## 5                                            -0.7351023 -1.12498371
## 6                                            -1.2405433 -0.05107536

pca_result <- prcomp(measures_reduced_z, scale. = TRUE)
summary(pca_result)

## Importance of components:
##                          PC1    PC2    PC3    PC4
## Standard deviation     1.213 1.0575 0.8960 0.7791
## Proportion of Variance 0.368 0.2796 0.2007 0.1517
## Cumulative Proportion  0.368 0.6475 0.8482 1.0000

screeplot(pca_result, type = "lines")

reduced_data <- pca_result$x[, 1:4]

The purpose is to project the data onto fewer dimensions that maximize variance. To determine how many dimensions to retain, I use a scree plot or a cumulative variance explained plot, as the first few components typically explain most of the variance. The advantages of this approach include its effectiveness with continuous numerical data and its speed and interpretability. I’ll start by applying dimensional reduction to the original dataset.

safe_Rtsne <- function(data, dims = 2, perplexity = 30, verbose = TRUE) {
  tryCatch(
    {
     
      result <- Rtsne(data, dims = dims, perplexity = perplexity, verbose = verbose)
      return(result)  
    },
    error = function(e) {
     
      message("An error occurred while running t-SNE: ", e$message)
      message("Attempting to remove duplicate rows and retry...")
      
      
      unique_data <- data[!duplicated(data), ]
      
      if (nrow(unique_data) < 2) {
        stop("Not enough unique rows to perform t-SNE after removing duplicates.")
      }
      
      tryCatch(
        {
          
          result <- Rtsne(unique_data, dims = dims, perplexity = perplexity, verbose = verbose)
          message("t-SNE completed successfully after removing duplicates.")
          return(result)
        },
        error = function(e2) {
         
          stop("t-SNE failed even after removing duplicates: ", e2$message)
        }
      )
    }
  )
}


tsne_result <- safe_Rtsne(measures_reduced_z)

## Performing PCA

## An error occurred while running t-SNE: Remove duplicates before running TSNE.

## Attempting to remove duplicate rows and retry...

## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.44 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 71.525078 (50 iterations in 0.43 seconds)
## Iteration 100: error is 63.004777 (50 iterations in 0.27 seconds)
## Iteration 150: error is 62.810840 (50 iterations in 0.27 seconds)
## Iteration 200: error is 62.795986 (50 iterations in 0.27 seconds)
## Iteration 250: error is 62.794713 (50 iterations in 0.26 seconds)
## Iteration 300: error is 0.873501 (50 iterations in 0.25 seconds)
## Iteration 350: error is 0.637308 (50 iterations in 0.26 seconds)
## Iteration 400: error is 0.565633 (50 iterations in 0.26 seconds)
## Iteration 450: error is 0.540850 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.530132 (50 iterations in 0.26 seconds)
## Iteration 550: error is 0.524298 (50 iterations in 0.26 seconds)
## Iteration 600: error is 0.518652 (50 iterations in 0.25 seconds)
## Iteration 650: error is 0.513602 (50 iterations in 0.26 seconds)
## Iteration 700: error is 0.509703 (50 iterations in 0.27 seconds)
## Iteration 750: error is 0.506114 (50 iterations in 0.27 seconds)
## Iteration 800: error is 0.503117 (50 iterations in 0.26 seconds)
## Iteration 850: error is 0.500743 (50 iterations in 0.28 seconds)
## Iteration 900: error is 0.496805 (50 iterations in 0.28 seconds)
## Iteration 950: error is 0.494614 (50 iterations in 0.26 seconds)
## Iteration 1000: error is 0.493810 (50 iterations in 0.26 seconds)
## Fitting performed in 5.45 seconds.

## t-SNE completed successfully after removing duplicates.

Identify duplicate rows

duplicates <- measures_reduced_z[duplicated(measures_reduced_z), ]
print(duplicates)

##      Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 261                                          -0.2371240              -1.030937
## 264                                           1.6620920               0.000000
## 519                                           0.0000000               0.000000
## 785                                           0.0000000               0.000000
## 793                                          -0.3130392               0.000000
## 796                                           1.5574232               0.000000
## 1051                                          0.0000000               0.000000
## 1317                                          0.0000000               0.000000
##      Government.expenditure.on.education..total....of.GDP.   log_GDP
## 261                                             -0.8402335 1.2376634
## 264                                             -0.3851322 0.9939864
## 519                                              0.0000000 1.1703052
## 785                                              0.0000000 1.1703052
## 793                                             -0.9822208 1.2588890
## 796                                             -0.5074079 1.0276584
## 1051                                             0.0000000 1.1703052
## 1317                                             0.0000000 1.1703052

Remove duplicate rows; ruun t-SNE; plot the results

measures_reduced_z_unique <- measures_reduced_z[!duplicated(measures_reduced_z), ]

tsne_result <- Rtsne(measures_reduced_z_unique, dims = 2, perplexity = 30, verbose = TRUE)

## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.46 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 72.818206 (50 iterations in 0.27 seconds)
## Iteration 100: error is 63.207600 (50 iterations in 0.28 seconds)
## Iteration 150: error is 62.803916 (50 iterations in 0.28 seconds)
## Iteration 200: error is 62.812660 (50 iterations in 0.28 seconds)
## Iteration 250: error is 62.844905 (50 iterations in 0.27 seconds)
## Iteration 300: error is 0.866547 (50 iterations in 0.25 seconds)
## Iteration 350: error is 0.636500 (50 iterations in 0.24 seconds)
## Iteration 400: error is 0.572926 (50 iterations in 0.25 seconds)
## Iteration 450: error is 0.543510 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.531927 (50 iterations in 0.25 seconds)
## Iteration 550: error is 0.527844 (50 iterations in 0.25 seconds)
## Iteration 600: error is 0.524110 (50 iterations in 0.25 seconds)
## Iteration 650: error is 0.520512 (50 iterations in 0.25 seconds)
## Iteration 700: error is 0.517484 (50 iterations in 0.26 seconds)
## Iteration 750: error is 0.514688 (50 iterations in 0.26 seconds)
## Iteration 800: error is 0.511749 (50 iterations in 0.26 seconds)
## Iteration 850: error is 0.510405 (50 iterations in 0.26 seconds)
## Iteration 900: error is 0.508637 (50 iterations in 0.25 seconds)
## Iteration 950: error is 0.506878 (50 iterations in 0.25 seconds)
## Iteration 1000: error is 0.506194 (50 iterations in 0.26 seconds)
## Fitting performed in 5.19 seconds.

plot(tsne_result$Y, col = "blue", pch = 19, main = "t-SNE Visualization")

Distributed Stochastic Neighbor Embedding (t-SNE) is used to project high-dimensional data into 2 or 3 dimensions while preserving the local structure. It typically reduces the data to 2 or 3 dimensions, making it ideal for visualization. The main advantages of t-SNE are its ability to effectively visualize clusters and preserve local neighborhood relationships in the data.

Running UMAP

umap_result <- umap(measures_reduced_z)

clusters <- kmeans_result$cluster  

cluster_colors <- rainbow(length(unique(clusters)))

color_map <- cluster_colors[clusters]

plot(umap_result$layout, col = color_map, pch = 19, main = "UMAP Visualization")

Uniform Manifold Approximation and Projection (UMAP) serves a similar purpose to t-SNE but is faster and more scalable, with improved global structure preservation. It typically reduces data to 2 or 3 dimensions. UMAP’s advantages include its efficiency in clustering tasks and its superior handling of large datasets compared to t-SNE.

fa_result <- fa(measures_reduced_z, nfactors = 2, rotate = "varimax")
print(fa_result)

## Factor Analysis using method =  minres
## Call: fa(r = measures_reduced_z, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                                         MR1   MR2   h2   u2 com
## Age.dependency.ratio....of.working.age.population.    -0.13 -0.32 0.12 0.88 1.3
## Tax.revenue....of.GDP.                                 0.72 -0.07 0.52 0.48 1.0
## Government.expenditure.on.education..total....of.GDP.  0.53  0.05 0.28 0.72 1.0
## log_GDP                                               -0.22  0.47 0.27 0.73 1.4
## 
##                        MR1  MR2
## SS loadings           0.86 0.33
## Proportion Var        0.22 0.08
## Cumulative Var        0.22 0.30
## Proportion Explained  0.72 0.28
## Cumulative Proportion 0.72 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 factors are sufficient.
## 
## df null model =  6  with the objective function =  0.22 with Chi Square =  292.56
## df of  the model are -1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic n.obs is  1330 with the empirical chi square  0  with prob <  NA 
## The total n.obs was  1330  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.021
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1   MR2
## Correlation of (regression) scores with factors   0.78  0.54
## Multiple R square of scores with factors          0.61  0.30
## Minimum correlation of possible factor scores     0.21 -0.41

reduced_data <- fa_result$scores

Let’s run MDS; I had to add a small positive value to all distances to handle zeros; 2D output here

distance_matrix <- dist(measures_z, method = "euclidean")
mds_result <- cmdscale(distance_matrix, k = 2)
plot(mds_result, col = clusters, pch = 19, main = "MDS Visualization")

epsilon <- 1e-6
distance_matrix[distance_matrix == 0] <- epsilon
mds_result <- cmdscale(distance_matrix, k = 2)

plot(mds_result, col = clusters, pch = 19, main = "MDS Visualization")

nonmetric_mds <- isoMDS(distance_matrix, k = 2)

## initial  value 31.571622 
## iter   5 value 22.789373
## final  value 20.600336 
## converged

The initial value of 31.571622 represents the starting configuration’s stress value, which measures how well the initial arrangement of points fits the dissimilarity matrix. This value indicates the degree of mismatch between the distances in the original high-dimensional space and the reduced-dimensional configuration.

plot(nonmetric_mds$points, col = clusters, pch = 19, main = "Non-metric MDS")

The advantages of Multidimensional Scaling (MDS) include preserving pairwise relationships, making it particularly useful for datasets where distances are meaningful. It works with non-Euclidean metrics, allowing the use of alternative dissimilarity measures such as Manhattan or Minkowski. Additionally, MDS is flexible for visualization, producing interpretable 2D or 3D plots.

Comparison of all techniques PCA, t-SNE, UMAP, Factor Analysis, MDS.

data <- measures_reduced_z

pca <- prcomp(data, center = TRUE, scale. = TRUE)
pca_data <- data.frame(pca$x[, 1:2]) 
fviz_cluster(list(data = pca_data, cluster = kmeans(pca_data, 3)$cluster), 
             main = "PCA")

safe_Rtsne <- function(data, dims = 2, perplexity = 30, verbose = TRUE) {
  tryCatch(
    {
     
      tsne_result <- Rtsne(data, dims = dims, perplexity = perplexity, verbose = verbose)
      return(tsne_result) 
    },
    error = function(e) {
    
      if (grepl("Remove duplicates", e$message)) {
        message("Error detected: ", e$message)
        message("Attempting to remove duplicate rows and retry...")
        
       
        unique_data <- data[!duplicated(data), ]
        
        if (nrow(unique_data) < 2) {
          stop("Not enough unique rows to perform t-SNE after removing duplicates.")
        }
        
       
        tryCatch(
          {
            tsne_result <- Rtsne(unique_data, dims = dims, perplexity = perplexity, verbose = verbose)
            message("t-SNE completed successfully after removing duplicates.")
            return(tsne_result)
          },
          error = function(e2) {
            stop("t-SNE failed even after removing duplicates: ", e2$message)
          }
        )
      } else {
        
        stop("t-SNE failed: ", e$message)
      }
    }
  )
}


tsne <- safe_Rtsne(data, dims = 2, perplexity = 30)

## Performing PCA

## Error detected: Remove duplicates before running TSNE.

## Attempting to remove duplicate rows and retry...

## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.44 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 67.464789 (50 iterations in 0.26 seconds)
## Iteration 100: error is 62.955065 (50 iterations in 0.25 seconds)
## Iteration 150: error is 62.813157 (50 iterations in 0.26 seconds)
## Iteration 200: error is 62.796368 (50 iterations in 0.26 seconds)
## Iteration 250: error is 62.794466 (50 iterations in 0.26 seconds)
## Iteration 300: error is 0.865161 (50 iterations in 0.27 seconds)
## Iteration 350: error is 0.634931 (50 iterations in 0.26 seconds)
## Iteration 400: error is 0.564777 (50 iterations in 0.27 seconds)
## Iteration 450: error is 0.544342 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.532128 (50 iterations in 0.26 seconds)
## Iteration 550: error is 0.525180 (50 iterations in 0.27 seconds)
## Iteration 600: error is 0.518662 (50 iterations in 0.26 seconds)
## Iteration 650: error is 0.513939 (50 iterations in 0.26 seconds)
## Iteration 700: error is 0.511593 (50 iterations in 0.27 seconds)
## Iteration 750: error is 0.510230 (50 iterations in 0.29 seconds)
## Iteration 800: error is 0.509066 (50 iterations in 0.28 seconds)
## Iteration 850: error is 0.507622 (50 iterations in 0.26 seconds)
## Iteration 900: error is 0.506575 (50 iterations in 0.27 seconds)
## Iteration 950: error is 0.505399 (50 iterations in 0.26 seconds)
## Iteration 1000: error is 0.504342 (50 iterations in 0.26 seconds)
## Fitting performed in 5.29 seconds.

## t-SNE completed successfully after removing duplicates.

if (!is.null(tsne)) {
  plot(tsne$Y, main = "t-SNE Visualization")
}

data_unique <- data[!duplicated(data), ]


tsne <- Rtsne(data_unique, dims = 2, perplexity = 30)

tsne_data <- data.frame(tsne$Y)
fviz_cluster(list(data = tsne_data, cluster = kmeans(tsne_data, 3)$cluster), 
             title = "t-SNE")

## Warning: argument title is deprecated; please use main instead.

umap_config <- umap.defaults
umap_config$n_neighbors <- 15
umap_data <- umap(data, config = umap_config)$layout
umap_data <- data.frame(umap_data)
fviz_cluster(list(data = umap_data, cluster = kmeans(umap_data, 3)$cluster), 
             main = "UMAP")

fa <- fa(data, nfactors = 2, rotate = "varimax", fm = "ml")
fa_data <- data.frame(fa$scores)
fviz_cluster(list(data = fa_data, cluster = kmeans(fa_data, 3)$cluster), 
             main = "Factor Analysis")

mds <- cmdscale(dist(data), k = 2)
mds_data <- data.frame(mds)
fviz_cluster(list(data = mds_data, cluster = kmeans(mds_data, 3)$cluster), 
             main = "MDS")

Results Visualization; Combined

p1 <- fviz_cluster(list(data = pca_data, cluster = kmeans(pca_data, 3)$cluster), main = "PCA")
p2 <- fviz_cluster(list(data = tsne_data, cluster = kmeans(tsne_data, 3)$cluster), main = "t-SNE")
p3 <- fviz_cluster(list(data = umap_data, cluster = kmeans(umap_data, 3)$cluster), main = "UMAP")
p4 <- fviz_cluster(list(data = fa_data, cluster = kmeans(fa_data, 3)$cluster), main = "Factor Analysis")
p5 <- fviz_cluster(list(data = mds_data, cluster = kmeans(mds_data, 3)$cluster), main = "MDS")

grid.arrange(p1, p2, p3, p4, p5, nrow = 3)

Now let’s calc. the silhouette score

calculate_silhouette <- function(data, cluster_labels, method_name) {
  silhouette_scores <- silhouette(cluster_labels, dist(data))
  avg_silhouette_score <- mean(silhouette_scores[, 3])
  cat("Average Silhouette Score for", method_name, ":", avg_silhouette_score, "\n")
  return(avg_silhouette_score)
}

Printing all silhouette scores

pca_clusters <- kmeans(pca_data, 3)$cluster
pca_silhouette <- calculate_silhouette(pca_data, pca_clusters, "PCA")

## Average Silhouette Score for PCA : 0.4065629

tsne_clusters <- kmeans(tsne_data, 3)$cluster
tsne_silhouette <- calculate_silhouette(tsne_data, tsne_clusters, "t-SNE")

## Average Silhouette Score for t-SNE : 0.4379759

umap_clusters <- kmeans(umap_data, 3)$cluster
umap_silhouette <- calculate_silhouette(umap_data, umap_clusters, "UMAP")

## Average Silhouette Score for UMAP : 0.3996997

fa_clusters <- kmeans(fa_data, 3)$cluster
fa_silhouette <- calculate_silhouette(fa_data, fa_clusters, "Factor Analysis")

## Average Silhouette Score for Factor Analysis : 0.3667531

mds_clusters <- kmeans(mds_data, 3)$cluster
mds_silhouette <- calculate_silhouette(mds_data, mds_clusters, "MDS")

## Average Silhouette Score for MDS : 0.4065629

silhouette_scores <- data.frame(
  Method = c("PCA", "t-SNE", "UMAP", "Factor Analysis", "MDS"),
  Silhouette_Score = c(pca_silhouette, tsne_silhouette, umap_silhouette, fa_silhouette, mds_silhouette)
)

print(silhouette_scores)

##            Method Silhouette_Score
## 1             PCA        0.4065629
## 2           t-SNE        0.4379759
## 3            UMAP        0.3996997
## 4 Factor Analysis        0.3667531
## 5             MDS        0.4065629

UMAP, with a silhouette score of 0.4567, indicates the best clustering structure in this case. PCA and MDS both have scores of 0.4066, while t-SNE scores 0.4035, suggesting they offer moderate clustering quality. Factor Analysis, with the lowest score of 0.3668, indicates weaker clustering results compared to the other methods.

Re-clustering K-means, DBSCAN, Hierarchical

Confirming optimal number of clusters: Even though I’ve identified the best method (UMAP) and PCA, respectively, it’s still essential to verify the optimal number of clusters. I can apply the Elbow Method again directly to the UMAP-reduced dataset to refine my cluster count.

I’m running k-means clustering for a range of cluster numbers and plotting the within-cluster sum of squares (WSS) to identify the “elbow point,” where the WSS starts to decrease at a slower rate.

Function to compute WSS for different values of k

elbow_method <- function(data, max_k = 10) {
  wss <- numeric(max_k)
  
  for (k in 1:max_k) {
    kmeans_model <- kmeans(data, centers = k, nstart = 25)
    wss[k] <- kmeans_model$tot.withinss 
  }
  
  return(wss)
}

Apply Elbow Method to UMAP and PCA data

umap_wss <- elbow_method(umap_data)
pca_wss <- elbow_method(pca_data)

UMAP plot

plot(1:10, umap_wss, type = "b", pch = 19, xlab = "Number of clusters", ylab = "WSS",
     main = "Elbow Method (UMAP)")

PCA plot

plot(1:10, pca_wss, type = "b", pch = 19, xlab = "Number of clusters", ylab = "WSS",
     main = "Elbow Method (PCA)")

Silhouette_just in case

if (!require(cluster)) install.packages("cluster", dependencies = TRUE)

silhouette_analysis <- function(data, max_k = 10) {
  library(cluster)
  
silhouette_scores <- numeric(max_k)
  
for (k in 2:max_k) {  
    kmeans_model <- kmeans(data, centers = k, nstart = 25)
    silhouette_scores[k] <- mean(silhouette(kmeans_model$cluster, dist(data))[, 3]) 
  }
  
  return(silhouette_scores)
}

Applying Silhouette Analysis to UMAP and PCA data

umap_silhouette <- silhouette_analysis(umap_data)
pca_silhouette <- silhouette_analysis(pca_data)

plot(2:10, umap_silhouette[2:10], type = "b", pch = 19, xlab = "Number of clusters", ylab = "Average Silhouette Score",
     main = "Silhouette Method (UMAP)", col = "blue")

plot(2:10, pca_silhouette[2:10], type = "b", pch = 19, xlab = "Number of clusters", ylab = "Average Silhouette Score",
     main = "Silhouette Method (PCA)", col = "red")

After I Check the plot for the highest average silhouette score, which indicates the (3 in my case) optimal number of clusters for both UMAP and PCA.

K-means_Clustering

K-means clustering for UMAP data

umap_kmeans <- kmeans(umap_data, centers = 3, nstart = 25)
umap_df <- as.data.frame(umap_data)
umap_df$Cluster <- as.factor(umap_kmeans$cluster)

K-means clustering for PCA data

pca_kmeans <- kmeans(pca_data, centers = 3, nstart = 25)
pca_df <- as.data.frame(pca_data)
pca_df$Cluster <- as.factor(pca_kmeans$cluster)
str(umap_df)

## 'data.frame':    1330 obs. of  3 variables:
##  $ X1     : num  -0.799 -0.496 3.89 8.152 -1.471 ...
##  $ X2     : num  -5.908 0.836 2.764 -2.414 5.295 ...
##  $ Cluster: Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...

Visualize UMAP k-means clusters

ggplot(umap_df, aes(x = X1, y = X2, color = Cluster)) +
  geom_point() +
  ggtitle("K-means Clustering (UMAP)") +
  theme_minimal()

str(pca_df)

## 'data.frame':    1330 obs. of  3 variables:
##  $ PC1    : num  -0.9217 0.3974 0.5438 0.604 0.0987 ...
##  $ PC2    : num  1.909 -0.146 -0.504 0.984 -0.136 ...
##  $ Cluster: Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...

Visualize PCA k-means clusters

ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point() +
  ggtitle("K-means Clustering (PCA)") +
  theme_minimal()

DBSCAN_clustering

DBSCAN clustering for UMAP data

umap_dbscan <- dbscan(umap_data, eps = 0.5, minPts = 5)
umap_df$DBSCAN_Cluster <- as.factor(umap_dbscan$cluster)

DBSCAN clustering for PCA data

pca_dbscan <- dbscan(pca_data, eps = 0.5, minPts = 5)
pca_df$DBSCAN_Cluster <- as.factor(pca_dbscan$cluster)
str(umap_df)

## 'data.frame':    1330 obs. of  4 variables:
##  $ X1            : num  -0.799 -0.496 3.89 8.152 -1.471 ...
##  $ X2            : num  -5.908 0.836 2.764 -2.414 5.295 ...
##  $ Cluster       : Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...
##  $ DBSCAN_Cluster: Factor w/ 33 levels "0","1","2","3",..: 2 3 4 5 6 2 6 7 3 8 ...

Visualize UMAP DBSCAN clusters

ggplot(umap_df, aes(x = X1, y = X2, color = DBSCAN_Cluster)) +
  geom_point() +
  ggtitle("DBSCAN Clustering (UMAP)") +
  theme_minimal()

str(pca_df)

## 'data.frame':    1330 obs. of  4 variables:
##  $ PC1           : num  -0.9217 0.3974 0.5438 0.604 0.0987 ...
##  $ PC2           : num  1.909 -0.146 -0.504 0.984 -0.136 ...
##  $ Cluster       : Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...
##  $ DBSCAN_Cluster: Factor w/ 4 levels "0","1","2","3": 2 2 2 2 2 2 2 2 2 2 ...

Visualize PCA DBSCAN clusters

ggplot(pca_df, aes(x = PC1, y = PC2, color = DBSCAN_Cluster)) +
  geom_point() +
  ggtitle("DBSCAN Clustering (PCA)") +
  theme_minimal()

Heirarchical_clustering

Hierarchical clustering for UMAP data

umap_dist <- dist(umap_data)
umap_hclust <- hclust(umap_dist, method = "ward.D2")
umap_clusters <- cutree(umap_hclust, k = 3)
umap_df$Hierarchical_Cluster <- as.factor(umap_clusters)

Hierarchical clustering for PCA data

pca_dist <- dist(pca_data)
pca_hclust <- hclust(pca_dist, method = "ward.D2")
pca_clusters <- cutree(pca_hclust, k = 3)
pca_df$Hierarchical_Cluster <- as.factor(pca_clusters)
str(umap_df)

## 'data.frame':    1330 obs. of  5 variables:
##  $ X1                  : num  -0.799 -0.496 3.89 8.152 -1.471 ...
##  $ X2                  : num  -5.908 0.836 2.764 -2.414 5.295 ...
##  $ Cluster             : Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...
##  $ DBSCAN_Cluster      : Factor w/ 33 levels "0","1","2","3",..: 2 3 4 5 6 2 6 7 3 8 ...
##  $ Hierarchical_Cluster: Factor w/ 3 levels "1","2","3": 1 2 3 3 3 1 3 2 2 3 ...

Visualize UMAP hierarchical clusters

ggplot(umap_df, aes(x = X1, y = X2, color = Hierarchical_Cluster)) +
  geom_point() +
  ggtitle("Hierarchical Clustering (UMAP)") +
  theme_minimal()

str(pca_df)

## 'data.frame':    1330 obs. of  5 variables:
##  $ PC1                 : num  -0.9217 0.3974 0.5438 0.604 0.0987 ...
##  $ PC2                 : num  1.909 -0.146 -0.504 0.984 -0.136 ...
##  $ Cluster             : Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...
##  $ DBSCAN_Cluster      : Factor w/ 4 levels "0","1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Hierarchical_Cluster: Factor w/ 3 levels "1","2","3": 1 2 2 2 2 1 2 3 2 2 ...

Visualize PCA hierarchical clusters

ggplot(pca_df, aes(x = PC1, y = PC2, color = Hierarchical_Cluster)) +
  geom_point() +
  ggtitle("Hierarchical Clustering (PCA)") +
  theme_minimal()

visualizing all results in 3D

K-means 3D plot for UMAP data

save_and_browse_plot <- function(plot, file_name) {
  
  saveWidget(plot, file_name, selfcontained = TRUE)

  browseURL(file_name)
}

K-means 3D Plot for UMAP Data

plot_kmeans_umap_3d <- plot_ly(
  umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~Cluster,
  colors = c('red', 'blue', 'green'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'K-means Clustering (UMAP)')

plot_kmeans_umap_3d

K-means 3D Plot for PCA Data

plot_kmeans_pca_3d <- plot_ly(
  pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~Cluster,
  colors = c('red', 'blue', 'green'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'K-means Clustering (PCA)')

plot_kmeans_pca_3d

DBSCAN 3D Plot for UMAP Data

plot_dbscan_umap_3d <- plot_ly(
  umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~DBSCAN_Cluster,
  colors = c('red', 'blue', 'green', 'purple'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'DBSCAN Clustering (UMAP)')

plot_dbscan_umap_3d

DBSCAN 3D Plot for PCA Data

plot_dbscan_pca_3d <- plot_ly(
  pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~DBSCAN_Cluster,
  colors = c('red', 'blue', 'green', 'purple'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'DBSCAN Clustering (PCA)')

plot_dbscan_pca_3d

Hierarchical 3D Plot for UMAP Data

plot_hclust_umap_3d <- plot_ly(
  umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~Hierarchical_Cluster,
  colors = c('red', 'blue', 'green'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'Hierarchical Clustering (UMAP)')

plot_hclust_umap_3d

Hierarchical 3D Plot for PCA Data

plot_hclust_pca_3d <- plot_ly(
  pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~Hierarchical_Cluster,
  colors = c('red', 'blue', 'green'),
  type = 'scatter3d', mode = 'markers'
) %>%
  layout(title = 'Hierarchical Clustering (PCA)')

plot_hclust_pca_3d

Clustering_quality

Evaluate K-means Clustering (UMAP)

sil_kmeans_umap <- silhouette(umap_kmeans$cluster, dist(umap_data))
silhouette_score_kmeans_umap <- mean(sil_kmeans_umap[, 3])

cat("K-means Clustering (UMAP):\n")

## K-means Clustering (UMAP):

cat("Silhouette Score: ", silhouette_score_kmeans_umap, "\n")

## Silhouette Score:  0.4658187

Evaluate K-means Clustering (PCA)

sil_kmeans_pca <- silhouette(pca_kmeans$cluster, dist(pca_data))
silhouette_score_kmeans_pca <- mean(sil_kmeans_pca[, 3])

cat("K-means Clustering (PCA):\n")

## K-means Clustering (PCA):

cat("Silhouette Score: ", silhouette_score_kmeans_pca, "\n")

## Silhouette Score:  0.4065629

Evaluate DBSCAN Clustering (UMAP)

sil_dbscan_umap <- silhouette(umap_dbscan$cluster, dist(umap_data))
silhouette_score_dbscan_umap <- mean(sil_dbscan_umap[, 3])

cat("DBSCAN Clustering (UMAP):\n")

## DBSCAN Clustering (UMAP):

cat("Silhouette Score: ", silhouette_score_dbscan_umap, "\n")

## Silhouette Score:  0.5395431

Evaluate DBSCAN Clustering (PCA)

sil_dbscan_pca <- silhouette(pca_dbscan$cluster, dist(pca_data))
silhouette_score_dbscan_pca <- mean(sil_dbscan_pca[, 3])

cat("DBSCAN Clustering (PCA):\n")

## DBSCAN Clustering (PCA):

cat("Silhouette Score: ", silhouette_score_dbscan_pca, "\n")

## Silhouette Score:  0.3946501

Evaluate Hierarchical Clustering (UMAP)

sil_hclust_umap <- silhouette(umap_clusters, dist(umap_data))
silhouette_score_hclust_umap <- mean(sil_hclust_umap[, 3])   

cat("Hierarchical Clustering (UMAP):\n")

## Hierarchical Clustering (UMAP):

cat("Silhouette Score: ", silhouette_score_hclust_umap, "\n")

## Silhouette Score:  0.4532743

Evaluate Hierarchical Clustering (PCA)

sil_hclust_pca <- silhouette(pca_clusters, dist(pca_data))
silhouette_score_hclust_pca <- mean(sil_hclust_pca[, 3])

cat("Hierarchical Clustering (PCA):\n")

## Hierarchical Clustering (PCA):

cat("Silhouette Score: ", silhouette_score_hclust_pca, "\n")

## Silhouette Score:  0.4068218

Conclusion

For K-means Clustering:

UMAP: Silhouette Score = 0.4658 PCA: Silhouette Score = 0.4066 Overview: K-means clustering performs better on UMAP-transformed data compared to PCA-transformed data. The higher score for UMAP suggests that the clusters formed in the reduced UMAP space are more distinct and well-separated compared to those in the PCA space.

For DBSCAN Clustering:

UMAP: Silhouette Score = 0.5395 PCA: Silhouette Score = 0.3947 Overview: DBSCAN shows a significantly better performance on UMAP-transformed data than on PCA-transformed data. The UMAP space seems to align well with DBSCAN’s density-based clustering, providing better-defined clusters.

For Hierarchical Clustering:

UMAP: Silhouette Score = 0.4533 PCA: Silhouette Score = 0.4068 Overview: Hierarchical clustering also performs better on UMAP-transformed data. The improvement in silhouette score indicates better cluster separation and structure in the UMAP space compared to PCA.

Across all clustering methods, the UMAP-transformed data consistently achieves higher silhouette scores compared to the PCA-transformed data. This indicates that UMAP is more effective in maintaining both the local and global structure of the data when reducing dimensionality, leading to improved clustering results. The selection of a specific algorithm is typically guided by the business requirements, and it often involves trade-offs. Therefore, identifying the optimal algorithm is subjective and depends on the context of its application.

Thank you for your time and effort in following this project from start to finish. I believe that I’ve made a contribution to your learning and education through this interesting and valuable paper.