The World Development Indicators dataset, derived from the World Bank’s Data portal, evaluates various economic, social, and environmental metrics across more than 190 countries. These indicators assess critical dimensions such as health, education, poverty, and infrastructure, providing insights into the developmental progress and challenges faced by different nations. By applying clustering and dimensionality reduction techniques to this dataset, we can uncover patterns in how countries perform across these indicators, group them based on similar developmental characteristics, and identify the key factors that influence overall development. This analysis helps to compare and contrast countries’ progress and supports targeted policy-making and investment decisions aimed at improving global development outcomes.
Evaluating 7 most important of these economic indicators is crucial for assessing a country’s overall performance as they provide a multidimensional view of its economic health, resilience, and development trajector:
Economic Growth and Development Trends: Indicators like GDP growth and GDP per capita highlight the scale of economic expansion and the inclusiveness of prosperity. They are fundamental for measuring progress over time and comparing nations’ economic well-being.
Investment and Structural Shifts: Metrics such as gross capital formation and sectoral transitions reveal a country’s focus on industrialization, modernization, or service-oriented growth. They are vital for understanding long-term economic transformation and diversification.
Trade Dynamics: Export and import data, along with net trade figures, showcase integration into global markets and competitiveness. These metrics are key for evaluating a nation’s trade strategy and its impact on the domestic economy.
Inflation and Price Stability: Stability in consumer prices reflects sound economic governance and policy effectiveness. High inflation or deflation can signal underlying economic imbalances or policy failures.
Resilience to Global Shocks: Recovery from crises like the 2008 financial crash or COVID-19 highlights economic adaptability and the robustness of governance. It is essential for gauging the ability to withstand and rebound from external disruptions.
Socioeconomic Changes: Indicators like population growth, urbanization, and demographic trends provide insights into labor market dynamics, infrastructure needs, and potential future challenges such as aging populations.
Environmental and Sustainability Metrics: Energy use and CO2 emissions shed light on the environmental impact of economic activities, emphasizing the importance of sustainability in long-term growth.
I chose this dataset because it provides valuable insights into key economic indicators that reflect the performance and development of countries. By analyzing metrics such as GDP growth, trade dynamics, investment trends, and sustainability factors, I can uncover patterns and trends crucial for evaluating global economic shifts. This dataset is particularly interesting as it captures a wide range of factors-economic growth, trade, demographics, and environmental metrics-that influence and shape the overall economic landscape of nations.
Additionally, the courses I have undertaken, such as Macroeconomics and Microeconomics, have equipped me with the theoretical knowledge needed to explore this dataset effectively. These courses have allowed me to delve deeper into understanding the dataset and uncover meaningful insights about global economic dynamics.
Clustering will enable the grouping of countries with similar economic characteristics, simplifying the comparison of economic performance across different clusters. By applying clustering algorithms, we can uncover relationships between key indicators like GDP growth, trade balance, and demographic trends. This will help identify patterns of development, such as clusters of emerging economies or countries with similar challenges. Additionally, clustering can aid in spotting outliers or unique cases that require deeper analysis, providing a foundation for identifying areas of improvement or further investigation into global economic trends.
The dataset likely contains numerous indicators, leading to high-dimensional data. Dimensionality reduction techniques like PCA (Principal Component Analysis) reduce the number of dimensions while retaining most of the original information, making the data easier to visualize and interpret.By reducing dimensions, these techniques identify the most significant indicators that contribute to the variance in the dataset, helping focus on the most critical factors for analysis. Overall, dimensionality reduction makes the dataset more manageable and improves the effectiveness of further analysis and modeling.
Data Loading: This step involves importing data into a working environment. Preprocessing: Preprocessing cleans and transforms the data to make it suitable for analysis. This includes handling missing values, scaling features, encoding categorical variables, and removing outliers. Visualization: Visualizing data helps understand its distribution, identify patterns, detect outliers, and recognize relationships between variables. Visualization, through methods like histograms and scatter plots, helps understand data distribution, detect patterns, and identify outliers. These steps improve data quality, ensure proper feature scaling, and allow better selection and interpretation of clustering methods.
K-means: K-means clustering partitions data into K clusters by minimizing the variance within clusters. It uses centroids to define cluster centers and is sensitive to outliers. The number of clusters must be predefined.
PAM (Partitioning Around Medoids): PAM is a robust clustering method that uses medoids (central points) instead of centroids, making it less sensitive to outliers. It is suitable for datasets with mixed types or where medoids are more interpretable.
Hierarchical Clustering: This method builds a tree-like structure of clusters, either agglomeratively (bottom-up) or divisively (top-down). It does not require specifying the number of clusters beforehand, making it flexible for exploring data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on dense regions of points and can discover clusters of arbitrary shapes. It handles noise and outliers effectively, unlike K-means or hierarchical clustering.
PCA (Principal Component Analysis): PCA reduces dimensionality by transforming the original variables into a smaller set of uncorrelated variables called principal components, which capture the most variance in the data. It’s useful for linear data structures.
t-SNE (t-Distributed Stochastic Neighbor Embedding): t-SNE reduces dimensionality by converting the distances between data points into probabilities, preserving local relationships and creating a lower-dimensional map.
UMAP (Uniform Manifold Approximation and Projection): UMAP is a nonlinear dimension reduction technique that focuses on preserving the global structure of the data while also maintaining local neighbor relations. It’s faster and scales better than t-SNE, making it suitable for large datasets.
Factor Analysis: Factor Analysis identifies latent variables (factors) that explain the correlations among observed variables. It assumes that underlying factors influence the observed data and is often used in psychometrics and social sciences.
MDS (Multidimensional Scaling): MDS reduces dimensionality by preserving the pairwise distances between data points. It aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible.
The Elbow Method is a technique that helps determine the best number of clusters when using algorithms like K-means. You plot the variance explained by each cluster against the number of clusters, and you look for a point where the curve starts to level off, forming an “elbow.” This elbow indicates the ideal number of clusters, as adding more clusters beyond this point doesn’t improve the model much. It’s a helpful way to balance between overfitting (too many clusters) and underfitting (too few clusters).
The Silhouette Score is a way to measure how well the data points fit into their assigned clusters in clustering algorithms like K-means. It checks how similar each point is to others in the same cluster compared to points in nearby clusters.
+1 indicates that the points are well clustered, far from neighboring clusters.
0 indicates that the points are on or near the decision boundary between clusters.
-1 indicates that the points may have been assigned to the wrong cluster.
It’s really helpful for determining whether the clusters make sense and for comparing different clustering models to find the best fit for your data.
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 4.4.2
pacman::p_load("yaml", "plotly", "ggplot2", "Rtsne", "htmlwidgets", "stats", "ggcorrplot",
"readxl", "knitr", "clustertend", "hopkins", "factoextra", "GGally", "mclust",
"gridExtra", "dendextend", "cluster", "fpc", "clustMixType", "reshape2",
"MASS", "FactoMineR", "BiocManager", "gplots", "pheatmap", "tidyr", "umap",
"fpc", "flexclust", "cluster", "ClusterR", "NbClust", "DescTools", "pastecs",
"corrplot", "psych", "dbscan", "cluster.stats")
## Installing package into 'C:/Users/khami/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## Warning: package 'cluster.stats' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.4:
## cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.4/PACKAGES'
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'cluster.stats'
## Warning in pacman::p_load("yaml", "plotly", "ggplot2", "Rtsne", "htmlwidgets", : Failed to install/load:
## cluster.stats
Loading the required packages
setwd("C:\\Users\\khami\\Desktop\\University of Warsaw\\R\\Elbek aka")
Dataset <- read_excel("2011-2021_Data_Extract_From_World_Development_Indicators.xlsx",
sheet = "2017-21")
Setting the working directory and below summarizing the dataset
dim(Dataset)
## [1] 1330 19
#head(Dataset)
str(Dataset)
## tibble [1,330 × 19] (S3: tbl_df/tbl/data.frame)
## $ Time : num [1:1330] 2017 2017 2017 2017 2017 ...
## $ Time Code : chr [1:1330] "YR2017" "YR2017" "YR2017" "YR2017" ...
## $ Country Name : chr [1:1330] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country Code : chr [1:1330] "AFG" "ALB" "DZA" "ASM" ...
## $ GDP (current US$) : num [1:1330] 1.88e+10 1.30e+10 1.90e+11 6.12e+08 3.00e+09 ...
## $ Gross capital formation (annual % growth) : num [1:1330] NA 4.37 -2.23 NA NA ...
## $ Exports of goods and services (% of GDP) : num [1:1330] NA 31.6 20.3 59.2 NA ...
## $ Inflation, consumer prices (annual %) : num [1:1330] 4.98 1.99 5.59 NA NA ...
## $ GNI per capita, Atlas method (current US$) : num [1:1330] 530 4290 4440 NA NA ...
## $ Foreign direct investment, net inflows (% of GDP) : num [1:1330] 0.275 7.855 0.648 NA NA ...
## $ Domestic credit provided by financial sector (% of GDP) : num [1:1330] NA 63.6 NA NA NA ...
## $ Trade (% of GDP) : num [1:1330] NA 78.2 49.8 161.4 NA ...
## $ Net trade in goods and services (BoP, current US$) : num [1:1330] -6.80e+09 -1.98e+09 -2.24e+10 NA NA ...
## $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 NA NA ...
## $ Employment in agriculture (% of total employment) : num [1:1330] 44 38.1 10.2 NA NA ...
## $ Age dependency ratio (% of working-age population) : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
## $ Tax revenue (% of GDP) : num [1:1330] 9.9 18.9 NA NA NA ...
## $ Energy use (kg of oil equivalent per capita) : logi [1:1330] NA NA NA NA NA NA ...
## $ Government expenditure on education, total (% of GDP) : num [1:1330] 4.34 3.61 5.73 NA 3.22 ...
summary(Dataset)
## Time Time Code Country Name Country Code
## Min. :2017 Length:1330 Length:1330 Length:1330
## 1st Qu.:2018 Class :character Class :character Class :character
## Median :2019 Mode :character Mode :character Mode :character
## Mean :2019
## 3rd Qu.:2020
## Max. :2021
##
## GDP (current US$) Gross capital formation (annual % growth)
## Min. :4.528e+07 Min. :-938.996
## 1st Qu.:9.617e+09 1st Qu.: -3.395
## Median :5.516e+10 Median : 3.564
## Mean :2.809e+12 Mean : 2.054
## 3rd Qu.:5.621e+11 3rd Qu.: 9.444
## Max. :9.785e+13 Max. : 182.284
## NA's :42 NA's :352
## Exports of goods and services (% of GDP) Inflation, consumer prices (annual %)
## Min. : 2.25 Min. : -3.233
## 1st Qu.: 22.29 1st Qu.: 1.414
## Median : 31.57 Median : 2.558
## Mean : 39.74 Mean : 5.543
## 3rd Qu.: 47.86 3rd Qu.: 4.283
## Max. :213.22 Max. :557.202
## NA's :199 NA's :199
## GNI per capita, Atlas method (current US$)
## Min. : 210
## 1st Qu.: 2083
## Median : 6110
## Mean : 14788
## 3rd Qu.: 18003
## Max. :120990
## NA's :94
## Foreign direct investment, net inflows (% of GDP)
## Min. :-1303.108
## 1st Qu.: 1.067
## Median : 2.225
## Mean : 4.850
## 3rd Qu.: 4.221
## Max. : 602.375
## NA's :153
## Domestic credit provided by financial sector (% of GDP) Trade (% of GDP)
## Min. : 6.613 Min. : 4.128
## 1st Qu.: 42.711 1st Qu.: 51.204
## Median : 64.740 Median : 71.222
## Mean : 85.079 Mean : 84.632
## 3rd Qu.: 98.346 3rd Qu.:101.218
## Max. :393.451 Max. :402.460
## NA's :1073 NA's :204
## Net trade in goods and services (BoP, current US$)
## Min. :-8.481e+11
## 1st Qu.:-2.281e+09
## Median :-2.180e+08
## Mean : 5.234e+09
## 3rd Qu.: 2.463e+09
## Max. : 5.252e+11
## NA's :410
## Labor force participation rate, total (% of total population ages 15+)
## Min. :31.40
## 1st Qu.:55.55
## Median :60.66
## Mean :60.24
## 3rd Qu.:65.76
## Max. :87.64
## NA's :155
## Employment in agriculture (% of total employment)
## Min. : 0.1078
## 1st Qu.: 5.8431
## Median :18.3271
## Mean :24.5136
## 3rd Qu.:38.9447
## Max. :86.3098
## NA's :155
## Age dependency ratio (% of working-age population) Tax revenue (% of GDP)
## Min. : 17.45 Min. : 0.0001
## 1st Qu.: 48.44 1st Qu.:11.6061
## Median : 54.54 Median :15.4216
## Mean : 58.65 Mean :16.1972
## 3rd Qu.: 67.75 3rd Qu.:20.5240
## Max. :105.29 Max. :44.4023
## NA's :5 NA's :494
## Energy use (kg of oil equivalent per capita)
## Mode:logical
## NA's:1330
##
##
##
##
##
## Government expenditure on education, total (% of GDP)
## Min. : 0.3585
## 1st Qu.: 3.2962
## Median : 4.1319
## Mean : 4.3082
## 3rd Qu.: 5.0636
## Max. :15.3770
## NA's :277
stat.desc(Dataset)
## Time Time Code Country Name Country Code GDP (current US$)
## nbr.val 1.330000e+03 NA NA NA 1.288000e+03
## nbr.null 0.000000e+00 NA NA NA 0.000000e+00
## nbr.na 0.000000e+00 NA NA NA 4.200000e+01
## min 2.017000e+03 NA NA NA 4.527660e+07
## max 2.021000e+03 NA NA NA 9.784830e+13
## range 4.000000e+00 NA NA NA 9.784825e+13
## sum 2.685270e+06 NA NA NA 3.618306e+15
## median 2.019000e+03 NA NA NA 5.515636e+10
## mean 2.019000e+03 NA NA NA 2.809244e+12
## SE.mean 3.879292e-02 NA NA NA 2.673606e+11
## CI.mean.0.95 7.610204e-02 NA NA NA 5.245104e+11
## var 2.001505e+00 NA NA NA 9.206841e+25
## std.dev 1.414746e+00 NA NA NA 9.595228e+12
## coef.var 7.007160e-04 NA NA NA 3.415591e+00
## Gross capital formation (annual % growth)
## nbr.val 978.000000
## nbr.null 0.000000
## nbr.na 352.000000
## min -938.995687
## max 182.283579
## range 1121.279267
## sum 2008.484716
## median 3.564137
## mean 2.053665
## SE.mean 1.204690
## CI.mean.0.95 2.364078
## var 1419.349717
## std.dev 37.674258
## coef.var 18.344887
## Exports of goods and services (% of GDP)
## nbr.val 1.131000e+03
## nbr.null 0.000000e+00
## nbr.na 1.990000e+02
## min 2.249870e+00
## max 2.132227e+02
## range 2.109728e+02
## sum 4.494378e+04
## median 3.156980e+01
## mean 3.973809e+01
## SE.mean 8.827692e-01
## CI.mean.0.95 1.732051e+00
## var 8.813674e+02
## std.dev 2.968783e+01
## coef.var 7.470875e-01
## Inflation, consumer prices (annual %)
## nbr.val 1131.0000000
## nbr.null 0.0000000
## nbr.na 199.0000000
## min -3.2333893
## max 557.2018174
## range 560.4352067
## sum 6269.5973731
## median 2.5575448
## mean 5.5434106
## SE.mean 0.7051755
## CI.mean.0.95 1.3836005
## var 562.4151196
## std.dev 23.7152929
## coef.var 4.2781051
## GNI per capita, Atlas method (current US$)
## nbr.val 1.236000e+03
## nbr.null 0.000000e+00
## nbr.na 9.400000e+01
## min 2.100000e+02
## max 1.209900e+05
## range 1.207800e+05
## sum 1.827822e+07
## median 6.110074e+03
## mean 1.478820e+04
## SE.mean 5.612330e+02
## CI.mean.0.95 1.101076e+03
## var 3.893184e+08
## std.dev 1.973115e+04
## coef.var 1.334250e+00
## Foreign direct investment, net inflows (% of GDP)
## nbr.val 1177.000000
## nbr.null 5.000000
## nbr.na 153.000000
## min -1303.108267
## max 602.374526
## range 1905.482793
## sum 5708.707365
## median 2.225386
## mean 4.850219
## SE.mean 1.434614
## CI.mean.0.95 2.814689
## var 2422.404873
## std.dev 49.217932
## coef.var 10.147570
## Domestic credit provided by financial sector (% of GDP)
## nbr.val 2.570000e+02
## nbr.null 0.000000e+00
## nbr.na 1.073000e+03
## min 6.612871e+00
## max 3.934513e+02
## range 3.868384e+02
## sum 2.186543e+04
## median 6.473999e+01
## mean 8.507951e+01
## SE.mean 4.231875e+00
## CI.mean.0.95 8.333722e+00
## var 4.602554e+03
## std.dev 6.784212e+01
## coef.var 7.973968e-01
## Trade (% of GDP)
## nbr.val 1.126000e+03
## nbr.null 0.000000e+00
## nbr.na 2.040000e+02
## min 4.127549e+00
## max 4.024597e+02
## range 3.983321e+02
## sum 9.529540e+04
## median 7.122244e+01
## mean 8.463179e+01
## SE.mean 1.608174e+00
## CI.mean.0.95 3.155357e+00
## var 2.912087e+03
## std.dev 5.396376e+01
## coef.var 6.376298e-01
## Net trade in goods and services (BoP, current US$)
## nbr.val 9.200000e+02
## nbr.null 0.000000e+00
## nbr.na 4.100000e+02
## min -8.480720e+11
## max 5.252276e+11
## range 1.373300e+12
## sum 4.815380e+12
## median -2.179975e+08
## mean 5.234109e+09
## SE.mean 2.259529e+09
## CI.mean.0.95 4.434436e+09
## var 4.697034e+21
## std.dev 6.853491e+10
## coef.var 1.309390e+01
## Labor force participation rate, total (% of total population ages 15+)
## nbr.val 1.175000e+03
## nbr.null 0.000000e+00
## nbr.na 1.550000e+02
## min 3.140200e+01
## max 8.764100e+01
## range 5.623900e+01
## sum 7.078724e+04
## median 6.065558e+01
## mean 6.024446e+01
## SE.mean 2.784123e-01
## CI.mean.0.95 5.462413e-01
## var 9.107827e+01
## std.dev 9.543493e+00
## coef.var 1.584128e-01
## Employment in agriculture (% of total employment)
## nbr.val 1.175000e+03
## nbr.null 0.000000e+00
## nbr.na 1.550000e+02
## min 1.077741e-01
## max 8.630985e+01
## range 8.620208e+01
## sum 2.880347e+04
## median 1.832706e+01
## mean 2.451359e+01
## SE.mean 6.059417e-01
## CI.mean.0.95 1.188850e+00
## var 4.314193e+02
## std.dev 2.077064e+01
## coef.var 8.473111e-01
## Age dependency ratio (% of working-age population)
## nbr.val 1.325000e+03
## nbr.null 0.000000e+00
## nbr.na 5.000000e+00
## min 1.745307e+01
## max 1.052923e+02
## range 8.783927e+01
## sum 7.771582e+04
## median 5.453799e+01
## mean 5.865345e+01
## SE.mean 4.430052e-01
## CI.mean.0.95 8.690687e-01
## var 2.600360e+02
## std.dev 1.612563e+01
## coef.var 2.749307e-01
## Tax revenue (% of GDP)
## nbr.val 8.360000e+02
## nbr.null 0.000000e+00
## nbr.na 4.940000e+02
## min 6.279221e-05
## max 4.440227e+01
## range 4.440220e+01
## sum 1.354087e+04
## median 1.542158e+01
## mean 1.619721e+01
## SE.mean 2.151279e-01
## CI.mean.0.95 4.222550e-01
## var 3.869009e+01
## std.dev 6.220136e+00
## coef.var 3.840251e-01
## Energy use (kg of oil equivalent per capita)
## nbr.val NA
## nbr.null NA
## nbr.na NA
## min NA
## max NA
## range NA
## sum NA
## median NA
## mean NA
## SE.mean NA
## CI.mean.0.95 NA
## var NA
## std.dev NA
## coef.var NA
## Government expenditure on education, total (% of GDP)
## nbr.val 1.053000e+03
## nbr.null 0.000000e+00
## nbr.na 2.770000e+02
## min 3.584790e-01
## max 1.537700e+01
## range 1.501852e+01
## sum 4.536514e+03
## median 4.131948e+00
## mean 4.308180e+00
## SE.mean 5.141068e-02
## CI.mean.0.95 1.008792e-01
## var 2.783141e+00
## std.dev 1.668275e+00
## coef.var 3.872342e-01
#View(Dataset)
options(scipen=999)
colSums(is.na(Dataset))
## Time
## 0
## Time Code
## 0
## Country Name
## 0
## Country Code
## 0
## GDP (current US$)
## 42
## Gross capital formation (annual % growth)
## 352
## Exports of goods and services (% of GDP)
## 199
## Inflation, consumer prices (annual %)
## 199
## GNI per capita, Atlas method (current US$)
## 94
## Foreign direct investment, net inflows (% of GDP)
## 153
## Domestic credit provided by financial sector (% of GDP)
## 1073
## Trade (% of GDP)
## 204
## Net trade in goods and services (BoP, current US$)
## 410
## Labor force participation rate, total (% of total population ages 15+)
## 155
## Employment in agriculture (% of total employment)
## 155
## Age dependency ratio (% of working-age population)
## 5
## Tax revenue (% of GDP)
## 494
## Energy use (kg of oil equivalent per capita)
## 1330
## Government expenditure on education, total (% of GDP)
## 277
Let’s examine what each of these dimensions corresponds to:
GDP (current US$) ~ Total monetary value of all goods and services produced within a country in a specific period, measured in current US dollars.
Gross capital formation (annual % growth) ~ Annual percentage growth rate of gross capital formation, indicating investments in physical assets like buildings, machinery, and infrastructure.
Exports of goods and services (% of GDP) ~ Value of all goods and services exported as a percentage of the country’s GDP.
Inflation, consumer prices (annual %) ~ Annual percentage change in the cost to the average consumer of acquiring a basket of goods and services.
GNI per capita, Atlas method (current US$) ~ Gross National Income divided by the population, adjusted using the Atlas method, expressed in current US dollars.
Foreign direct investment, net inflows (% of GDP) ~ The net inflows of investment to acquire a lasting management interest in a business operating in another economy, as a percentage of GDP.
Domestic credit provided by financial sector (% of GDP) ~ Total credit extended to various sectors of the economy by financial intermediaries, expressed as a percentage of GDP.
Trade (% of GDP) ~ Sum of exports and imports of goods and services measured as a share of GDP.
Net trade in goods and services (BoP, current US$) ~ The difference between exports and imports of goods and services, expressed in current US dollars.
Labor force participation rate, total (% of total population ages 15+) ~ The percentage of the population aged 15 and older that is economically active.
Employment in agriculture (% of total employment) ~ Percentage of total employment that is engaged in agriculture, forestry, and fishing.
Age dependency ratio (% of working-age population) ~ The ratio of dependents (people younger than 15 or older than 64) to the working-age population.
Tax revenue (% of GDP) ~ The total government tax revenue expressed as a percentage of GDP.
Government expenditure on education, total (% of GDP) ~ The total public expenditure on education as a percentage of GDP.
Visualizing the missing values per column in bar chart, histogram and scatter plot for better visibility of missing data
missing_counts <- colSums(is.na(Dataset))
missing_counts <- missing_counts[missing_counts > 0]
Bar chart for missing values per column
if (length(missing_counts) > 0) {
bar_chart <- ggplot(data.frame(Columns = names(missing_counts), Missing = missing_counts), aes(x = reorder(Columns, Missing), y = Missing)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Missing Values per Column", x = "Columns", y = "Number of Missing Values") +
theme_minimal()
} else {
bar_chart <- ggplot() +
annotate("text", x = 0.5, y = 0.5, label = "No Missing Values", size = 6, color = "green", hjust = 0.5, vjust = 0.5) +
theme_void() +
labs(title = "Missing Values per Column")
}
Histogram of missing values per row
gg_plot <- ggplot(data.frame(missing_per_row = rowSums(is.na(Dataset))), aes(x = missing_per_row)) +
geom_histogram(binwidth = 1, fill = "blue", color = "white") +
labs(title = "Histogram of Missing Values per Row", x = "Number of Missing Values", y = "Frequency")
Arranging all visualizations in one line
grid.arrange(gg_plot, bar_chart)
Data Cleaning Part started with filling NA with the mean of each column
colSums(is.na(Dataset))
## Time
## 0
## Time Code
## 0
## Country Name
## 0
## Country Code
## 0
## GDP (current US$)
## 42
## Gross capital formation (annual % growth)
## 352
## Exports of goods and services (% of GDP)
## 199
## Inflation, consumer prices (annual %)
## 199
## GNI per capita, Atlas method (current US$)
## 94
## Foreign direct investment, net inflows (% of GDP)
## 153
## Domestic credit provided by financial sector (% of GDP)
## 1073
## Trade (% of GDP)
## 204
## Net trade in goods and services (BoP, current US$)
## 410
## Labor force participation rate, total (% of total population ages 15+)
## 155
## Employment in agriculture (% of total employment)
## 155
## Age dependency ratio (% of working-age population)
## 5
## Tax revenue (% of GDP)
## 494
## Energy use (kg of oil equivalent per capita)
## 1330
## Government expenditure on education, total (% of GDP)
## 277
dim(Dataset)
## [1] 1330 19
Dataset[, 5:19] <- lapply(Dataset[, 5:19], function(x) { ifelse(is.na(x), mean(x, na.rm = TRUE), x)})
double checking
colSums(is.na(Dataset))
## Time
## 0
## Time Code
## 0
## Country Name
## 0
## Country Code
## 0
## GDP (current US$)
## 0
## Gross capital formation (annual % growth)
## 0
## Exports of goods and services (% of GDP)
## 0
## Inflation, consumer prices (annual %)
## 0
## GNI per capita, Atlas method (current US$)
## 0
## Foreign direct investment, net inflows (% of GDP)
## 0
## Domestic credit provided by financial sector (% of GDP)
## 0
## Trade (% of GDP)
## 0
## Net trade in goods and services (BoP, current US$)
## 0
## Labor force participation rate, total (% of total population ages 15+)
## 0
## Employment in agriculture (% of total employment)
## 0
## Age dependency ratio (% of working-age population)
## 0
## Tax revenue (% of GDP)
## 0
## Energy use (kg of oil equivalent per capita)
## 1330
## Government expenditure on education, total (% of GDP)
## 0
clearly found a column without any values, so to drop it
blank_cols <- apply(Dataset, 2, function(col) all(is.na(col) | col == "sheet1"))
Dataset<-Dataset[,!blank_cols]
#colSums(is.na(Dataset))
last inspection of the dataset; Count of missing values per column
missing_counts <- colSums(is.na(Dataset))
missing_counts <- missing_counts[missing_counts > 0]
Bar Chart for Missing Values per Column
if (length(missing_counts) > 0) {
bar_chart <- ggplot(data.frame(Columns = names(missing_counts), Missing = missing_counts), aes(x = reorder(Columns, Missing), y = Missing)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Missing Values per Column", x = "Columns", y = "Number of Missing Values") +
theme_minimal()
} else {
bar_chart <- ggplot() +
annotate("text", x = 0.5, y = 0.5, label = "No Missing Values", size = 6, color = "black", hjust = 0.5, vjust = 0.5) +
theme_void() +
labs(title = "Missing Values per Column")
}
Scatter Plot of Missing Values per Column
missing_indices <- which(is.na(Dataset), arr.ind = TRUE)
scatter_data <- data.frame(
Rows = missing_indices[, 1],
Columns = factor(missing_indices[, 2], levels = 1:ncol(Dataset))
)
scatter_plot <- ggplot(scatter_data, aes(x = Columns, y = Rows)) +
geom_point(color = "red", size = 1.5) +
labs(title = "Scatter Plot of Missing Values", x = "Columns", y = "Rows") +
theme_minimal()
Histogram of Missing Values per Row
gg_plot <- ggplot(data.frame(missing_per_row = rowSums(is.na(Dataset))), aes(x = missing_per_row)) +
geom_histogram(binwidth = 1, fill = "brown", color = "white") +
labs(title = "Missing Values per Row", x = "Number of Missing Values", y = "Frequency")
Proportion of Missing Values in the Dataset (vizualizing it in Pie Chart)
missing_proportion <- sum(is.na(Dataset)) / (nrow(Dataset) * ncol(Dataset)) * 100
pie_data <- data.frame(
Category = c("Missing", "Non-Missing"),
Value = c(sum(is.na(Dataset)), (nrow(Dataset) * ncol(Dataset)) - sum(is.na(Dataset)))
)
pie_chart <- ggplot(pie_data, aes(x = "", y = Value, fill = Category)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = paste("Proportion of Missing Values: ", round(missing_proportion, 2), "%")) +
theme_void() +
scale_fill_manual(values = c("black", "beige"))
Arranging all visualizations into a single view
grid.arrange(gg_plot, bar_chart, scatter_plot, pie_chart, ncol = 2)
Selecting only integer columns to start the procedure
summary(Dataset)
## Time Time Code Country Name Country Code
## Min. :2017 Length:1330 Length:1330 Length:1330
## 1st Qu.:2018 Class :character Class :character Class :character
## Median :2019 Mode :character Mode :character Mode :character
## Mean :2019
## 3rd Qu.:2020
## Max. :2021
## GDP (current US$) Gross capital formation (annual % growth)
## Min. : 45276595 Min. :-938.9957
## 1st Qu.: 10293768251 1st Qu.: -0.2633
## Median : 61945227965 Median : 2.0537
## Mean : 2809243890590 Mean : 2.0537
## 3rd Qu.: 870364123213 3rd Qu.: 6.6354
## Max. :97848299706500 Max. : 182.2836
## Exports of goods and services (% of GDP) Inflation, consumer prices (annual %)
## Min. : 2.25 Min. : -3.233
## 1st Qu.: 23.49 1st Qu.: 1.627
## Median : 36.81 Median : 3.016
## Mean : 39.74 Mean : 5.543
## 3rd Qu.: 44.26 3rd Qu.: 5.543
## Max. :213.22 Max. :557.202
## GNI per capita, Atlas method (current US$)
## Min. : 210
## 1st Qu.: 2220
## Median : 6946
## Mean : 14788
## 3rd Qu.: 16655
## Max. :120990
## Foreign direct investment, net inflows (% of GDP)
## Min. :-1303.108
## 1st Qu.: 1.258
## Median : 2.625
## Mean : 4.850
## 3rd Qu.: 4.850
## Max. : 602.375
## Domestic credit provided by financial sector (% of GDP) Trade (% of GDP)
## Min. : 6.613 Min. : 4.128
## 1st Qu.: 85.079 1st Qu.: 54.459
## Median : 85.079 Median : 79.906
## Mean : 85.079 Mean : 84.632
## 3rd Qu.: 85.079 3rd Qu.: 93.338
## Max. :393.451 Max. :402.460
## Net trade in goods and services (BoP, current US$)
## Min. :-848072000000
## 1st Qu.: -1138954934
## Median : 1637745563
## Mean : 5234108909
## 3rd Qu.: 5234108909
## Max. : 525227577932
## Labor force participation rate, total (% of total population ages 15+)
## Min. :31.40
## 1st Qu.:56.39
## Median :60.24
## Mean :60.24
## 3rd Qu.:65.09
## Max. :87.64
## Employment in agriculture (% of total employment)
## Min. : 0.1078
## 1st Qu.: 7.3030
## Median :24.5136
## Mean :24.5136
## 3rd Qu.:36.0843
## Max. :86.3098
## Age dependency ratio (% of working-age population) Tax revenue (% of GDP)
## Min. : 17.45 Min. : 0.00006
## 1st Qu.: 48.46 1st Qu.:13.90486
## Median : 54.58 Median :16.19721
## Mean : 58.65 Mean :16.19721
## 3rd Qu.: 67.67 3rd Qu.:17.55884
## Max. :105.29 Max. :44.40227
## Government expenditure on education, total (% of GDP)
## Min. : 0.3585
## 1st Qu.: 3.5262
## Median : 4.3082
## Mean : 4.3082
## 3rd Qu.: 4.8626
## Max. :15.3770
str(Dataset)
## tibble [1,330 × 18] (S3: tbl_df/tbl/data.frame)
## $ Time : num [1:1330] 2017 2017 2017 2017 2017 ...
## $ Time Code : chr [1:1330] "YR2017" "YR2017" "YR2017" "YR2017" ...
## $ Country Name : chr [1:1330] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country Code : chr [1:1330] "AFG" "ALB" "DZA" "ASM" ...
## $ GDP (current US$) : num [1:1330] 18753456498 13019726212 189880896903 612000000 3000162081 ...
## $ Gross capital formation (annual % growth) : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
## $ Exports of goods and services (% of GDP) : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
## $ Inflation, consumer prices (annual %) : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
## $ GNI per capita, Atlas method (current US$) : num [1:1330] 530 4290 4440 14788 14788 ...
## $ Foreign direct investment, net inflows (% of GDP) : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
## $ Domestic credit provided by financial sector (% of GDP) : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
## $ Trade (% of GDP) : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
## $ Net trade in goods and services (BoP, current US$) : num [1:1330] -6803560118 -1978658722 -22402635934 5234108909 5234108909 ...
## $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
## $ Employment in agriculture (% of total employment) : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
## $ Age dependency ratio (% of working-age population) : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
## $ Tax revenue (% of GDP) : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
## $ Government expenditure on education, total (% of GDP) : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...
dim(Dataset)
## [1] 1330 18
measures<-Dataset[5:18]
Observing distribution of all columns in the dataset; selecting numeric columns; melting
numeric_columns <- sapply(measures, is.numeric)
melted_data_before <- melt(measures[, numeric_columns], variable.name = "Column", value.name = "Value")
## No id variables; using all as measure variables
head(melted_data_before)
## Column Value
## 1 GDP (current US$) 18753456498
## 2 GDP (current US$) 13019726212
## 3 GDP (current US$) 189880896903
## 4 GDP (current US$) 612000000
## 5 GDP (current US$) 3000162081
## 6 GDP (current US$) 73690154991
Ploting the distributions before standardization
distribution_plot_before <- ggplot(melted_data_before, aes(x = Value)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
facet_wrap(~ Column, scales = "free", ncol = 3) +
labs(title = "Distribution of Numeric Columns (Before Standardization)", x = "Value", y = "Frequency") +
theme_minimal()
Display the before standardization plot
grid.arrange(distribution_plot_before, ncol = 1)
## Warning: Computation failed in `stat_bin()`.
## Computation failed in `stat_bin()`.
## Caused by error in `bin_breaks_width()`:
## ! The number of histogram bins must be less than 1,000,000.
## ℹ Did you make `binwidth` too small?
View(measures)
Observing the outliers in Boxplot
numerical_columns <- sapply(measures, is.numeric)
numerical_data <- measures[, numerical_columns]
long_numerical_data <- gather(numerical_data, key = "variable", value = "value")
ggplot(long_numerical_data, aes(x = variable, y = value)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Boxplot to detect outliers")
summary(measures$`GDP (current US$)`)
## Min. 1st Qu. Median Mean 3rd Qu.
## 45276595 10293768251 61945227965 2809243890592 870364123213
## Max.
## 97848299706455
colSums(is.na(measures))
## GDP (current US$)
## 0
## Gross capital formation (annual % growth)
## 0
## Exports of goods and services (% of GDP)
## 0
## Inflation, consumer prices (annual %)
## 0
## GNI per capita, Atlas method (current US$)
## 0
## Foreign direct investment, net inflows (% of GDP)
## 0
## Domestic credit provided by financial sector (% of GDP)
## 0
## Trade (% of GDP)
## 0
## Net trade in goods and services (BoP, current US$)
## 0
## Labor force participation rate, total (% of total population ages 15+)
## 0
## Employment in agriculture (% of total employment)
## 0
## Age dependency ratio (% of working-age population)
## 0
## Tax revenue (% of GDP)
## 0
## Government expenditure on education, total (% of GDP)
## 0
Since we detected outliers in GDP, GNI and Net trade; first log transform(there are non positive numbers converted to NaN’s); I’ll handle them by adding a constant value to make them positive)
measures$log_GDP <- log(measures$`GDP (current US$)`)
measures$log_Trade <- log(measures$`Net trade in goods and services (BoP, current US$)`)
## Warning in log(measures$`Net trade in goods and services (BoP, current US$)`):
## NaNs produced
summary(measures$`Net trade in goods and services (BoP, current US$)`)
## Min. 1st Qu. Median Mean 3rd Qu.
## -848072000000 -1138954934 1637745563 5234108909 5234108909
## Max.
## 525227577932
sum(measures$`Net trade in goods and services (BoP, current US$)` <= 0)
## [1] 561
measures$log_Trade <- log(measures$`Net trade in goods and services (BoP, current US$)` + abs(min(measures$`Net trade in goods and services (BoP, current US$)`, na.rm = TRUE)) + 1)
measures$log_GNI <- log(measures$`GNI per capita, Atlas method (current US$)`)
Checking if i’ve got rid of NaN’s and all fine?
summary(measures$log_Trade)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.46 27.47 27.45 27.47 27.95
sum(is.na(measures$log_Trade))
## [1] 0
I will exclude the original GDP from the dataset; reshaped it for better visualization. The boxplots helped spot outliers, and I wrapped it up with a quick summary of log_GDP. Simple and quick to understand the data.
dim(measures)
## [1] 1330 17
str(measures)
## tibble [1,330 × 17] (S3: tbl_df/tbl/data.frame)
## $ GDP (current US$) : num [1:1330] 18753456498 13019726212 189880896903 612000000 3000162081 ...
## $ Gross capital formation (annual % growth) : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
## $ Exports of goods and services (% of GDP) : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
## $ Inflation, consumer prices (annual %) : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
## $ GNI per capita, Atlas method (current US$) : num [1:1330] 530 4290 4440 14788 14788 ...
## $ Foreign direct investment, net inflows (% of GDP) : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
## $ Domestic credit provided by financial sector (% of GDP) : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
## $ Trade (% of GDP) : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
## $ Net trade in goods and services (BoP, current US$) : num [1:1330] -6803560118 -1978658722 -22402635934 5234108909 5234108909 ...
## $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
## $ Employment in agriculture (% of total employment) : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
## $ Age dependency ratio (% of working-age population) : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
## $ Tax revenue (% of GDP) : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
## $ Government expenditure on education, total (% of GDP) : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...
## $ log_GDP : num [1:1330] 23.7 23.3 26 20.2 21.8 ...
## $ log_Trade : num [1:1330] 27.5 27.5 27.4 27.5 27.5 ...
## $ log_GNI : num [1:1330] 6.27 8.36 8.4 9.6 9.6 ...
measures<-measures[,2:17]
measures$`Net trade in goods and services (BoP, current US$)`<-NULL
measures$`GNI per capita, Atlas method (current US$)`<-NULL
View(measures)
numerical_columns <- sapply(measures, is.numeric)
numerical_data <- measures[, numerical_columns]
long_numerical_data <- gather(numerical_data, key = "variable", value = "value")
ggplot(long_numerical_data, aes(x = variable, y = value)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Boxplot to detect outliers")
colSums(is.na(measures))
## Gross capital formation (annual % growth)
## 0
## Exports of goods and services (% of GDP)
## 0
## Inflation, consumer prices (annual %)
## 0
## Foreign direct investment, net inflows (% of GDP)
## 0
## Domestic credit provided by financial sector (% of GDP)
## 0
## Trade (% of GDP)
## 0
## Labor force participation rate, total (% of total population ages 15+)
## 0
## Employment in agriculture (% of total employment)
## 0
## Age dependency ratio (% of working-age population)
## 0
## Tax revenue (% of GDP)
## 0
## Government expenditure on education, total (% of GDP)
## 0
## log_GDP
## 0
## log_Trade
## 0
## log_GNI
## 0
boxplot(measures$log_GDP, main = "Boxplot of Log-Transformed GDP")
summary(measures$log_GDP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.63 23.05 24.85 25.18 27.49 32.21
Calculate the correlation matrix
cor_matrix <- cor(numerical_data, use = "complete.obs")
long_cor_data <- melt(cor_matrix)
ggplot(data = long_cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
coord_fixed() +
labs(title = "Correlation Heatmap")
Each cell in the heatmap represents the Pearson correlation coefficient between two variables, with the color intensity indicating the strength and direction of the correlation: Red indicates a strong positive correlation, meaning that as one variable increases, the other tends to increase as well. Blue indicates a strong negative correlation, meaning that as one variable increases, the other tends to decrease. Light colors near white indicate weak or no correlation between the variables. Exports of goods and services (% of GDP) vs. GDP (current US$): There is a very strong positive correlation (0.97), indicating that higher exports are strongly associated with higher GDP. GNI per capita (Atlas method) vs. Domestic credit provided by the financial sector: There is a moderate positive correlation, suggesting that countries with higher GNI per capita tend to have higher domestic credit provided by the financial sector. Government expenditure on education vs. Tax revenue: These variables have a positive correlation, indicating that higher tax revenues are often associated with higher government spending on education. Inflation (consumer prices) vs. Net trade in goods and services: There is a negative correlation, indicating that higher inflation is generally associated with lower net trade in goods and services.
Now, I’ll create a pairplot using ggpairs
ggpairs(numerical_data, title = "Pairplot of Numerical Features")
Z-Standardizing the Dataset measures and naming it “measures_z” for further analysis
lapply(measures, class)
## $`Gross capital formation (annual % growth)`
## [1] "numeric"
##
## $`Exports of goods and services (% of GDP)`
## [1] "numeric"
##
## $`Inflation, consumer prices (annual %)`
## [1] "numeric"
##
## $`Foreign direct investment, net inflows (% of GDP)`
## [1] "numeric"
##
## $`Domestic credit provided by financial sector (% of GDP)`
## [1] "numeric"
##
## $`Trade (% of GDP)`
## [1] "numeric"
##
## $`Labor force participation rate, total (% of total population ages 15+)`
## [1] "numeric"
##
## $`Employment in agriculture (% of total employment)`
## [1] "numeric"
##
## $`Age dependency ratio (% of working-age population)`
## [1] "numeric"
##
## $`Tax revenue (% of GDP)`
## [1] "numeric"
##
## $`Government expenditure on education, total (% of GDP)`
## [1] "numeric"
##
## $log_GDP
## [1] "numeric"
##
## $log_Trade
## [1] "numeric"
##
## $log_GNI
## [1] "numeric"
measures_z<-as.data.frame(lapply(measures, scale))
K-Means
Let’s try with 5 clusters
kmeans_result <- kmeans(measures_z, centers = 5)
check sizes and centers of these clusters
kmeans_result$size
## [1] 298 200 539 35 258
kmeans_result$centers
## Gross.capital.formation..annual...growth.
## 1 0.01975934
## 2 0.07999539
## 3 -0.01209056
## 4 -0.71583054
## 5 0.03753303
## Exports.of.goods.and.services....of.GDP.
## 1 -0.2377363
## 2 -0.5594463
## 3 0.2459651
## 4 4.4789645
## 5 -0.4131949
## Inflation..consumer.prices..annual...
## 1 -0.1198305
## 2 0.2006299
## 3 -0.1040173
## 4 -0.1738549
## 5 0.2237742
## Foreign.direct.investment..net.inflows....of.GDP.
## 1 -0.14985136
## 2 -0.05866582
## 3 0.03211127
## 4 1.47054232
## 5 -0.04801585
## Domestic.credit.provided.by.financial.sector....of.GDP. Trade....of.GDP.
## 1 0.32282162 -0.3946953
## 2 -0.13548389 -0.5315882
## 3 -0.06450083 0.3181928
## 4 0.00000000 4.4317897
## 5 -0.13309348 -0.3979913
## Labor.force.participation.rate..total....of.total.population.ages.15..
## 1 0.3257064
## 2 -1.5001958
## 3 0.0340419
## 4 -0.2459693
## 5 0.7489884
## Employment.in.agriculture....of.total.employment.
## 1 -0.4701392
## 2 0.1481540
## 3 -0.4428752
## 4 -1.0159575
## 5 1.4912362
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1 -0.6074597 -0.6211124
## 2 0.3304216 -0.3371721
## 3 -0.4457363 0.6346148
## 4 -0.7747784 0.4330110
## 5 1.4818131 -0.4057629
## Government.expenditure.on.education..total....of.GDP. log_GDP log_Trade
## 1 -0.1008557 1.11499483 -0.08565954
## 2 -0.2171585 0.03298209 0.01397512
## 3 0.4408338 -0.54989407 0.02649677
## 4 -0.3957379 -0.22161380 0.07611573
## 5 -0.5824492 -0.13455617 0.02242523
## log_GNI
## 1 0.5628902
## 2 -0.7098658
## 3 0.4567612
## 4 1.2001553
## 5 -1.2169296
measures$cluster <- kmeans_result$cluster
head(measures)
## # A tibble: 6 × 15
## Gross capital formation (annua…¹ Exports of goods and…² Inflation, consumer …³
## <dbl> <dbl> <dbl>
## 1 2.05 39.7 4.98
## 2 4.37 31.6 1.99
## 3 -2.23 20.3 5.59
## 4 2.05 59.2 5.54
## 5 2.05 39.7 5.54
## 6 3.00 29.0 29.8
## # ℹ abbreviated names: ¹`Gross capital formation (annual % growth)`,
## # ²`Exports of goods and services (% of GDP)`,
## # ³`Inflation, consumer prices (annual %)`
## # ℹ 12 more variables:
## # `Foreign direct investment, net inflows (% of GDP)` <dbl>,
## # `Domestic credit provided by financial sector (% of GDP)` <dbl>,
## # `Trade (% of GDP)` <dbl>, …
str(measures)
## tibble [1,330 × 15] (S3: tbl_df/tbl/data.frame)
## $ Gross capital formation (annual % growth) : num [1:1330] 2.05 4.37 -2.23 2.05 2.05 ...
## $ Exports of goods and services (% of GDP) : num [1:1330] 39.7 31.6 20.3 59.2 39.7 ...
## $ Inflation, consumer prices (annual %) : num [1:1330] 4.98 1.99 5.59 5.54 5.54 ...
## $ Foreign direct investment, net inflows (% of GDP) : num [1:1330] 0.275 7.855 0.648 4.85 4.85 ...
## $ Domestic credit provided by financial sector (% of GDP) : num [1:1330] 85.1 63.6 85.1 85.1 85.1 ...
## $ Trade (% of GDP) : num [1:1330] 84.6 78.2 49.8 161.4 84.6 ...
## $ Labor force participation rate, total (% of total population ages 15+): num [1:1330] 47.3 58.1 41.8 60.2 60.2 ...
## $ Employment in agriculture (% of total employment) : num [1:1330] 44 38.1 10.2 24.5 24.5 ...
## $ Age dependency ratio (% of working-age population) : num [1:1330] 91.3 46.2 54.8 57.8 38.8 ...
## $ Tax revenue (% of GDP) : num [1:1330] 9.9 18.9 16.2 16.2 16.2 ...
## $ Government expenditure on education, total (% of GDP) : num [1:1330] 4.34 3.61 5.73 4.31 3.22 ...
## $ log_GDP : num [1:1330] 23.7 23.3 26 20.2 21.8 ...
## $ log_Trade : num [1:1330] 27.5 27.5 27.4 27.5 27.5 ...
## $ log_GNI : num [1:1330] 6.27 8.36 8.4 9.6 9.6 ...
## $ cluster : int [1:1330] 2 3 2 3 3 5 3 1 3 3 ...
Mean of log_Trade/log_GDP/log_GNI
aggregate(data = measures, log_Trade ~ cluster, mean)
## cluster log_Trade
## 1 1 27.38524
## 2 2 27.46068
## 3 3 27.47016
## 4 4 27.50773
## 5 5 27.46708
aggregate(data = measures, log_GDP ~ cluster, mean)
## cluster log_GDP
## 1 1 28.49906
## 2 2 25.27370
## 3 3 23.53621
## 4 4 24.51478
## 5 5 24.77429
aggregate(data = measures, log_GNI ~ cluster, mean)
## cluster log_GNI
## 1 1 9.570689
## 2 2 7.854246
## 3 3 9.427563
## 4 4 10.430107
## 5 5 7.170418
attributes(kmeans_result)
## $names
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
##
## $class
## [1] "kmeans"
Changing the number of clusters to 4
measures_reduced <- measures[9:12]
measures_reduced_z <- as.data.frame(lapply(measures_reduced, scale))
measures_reduced_km3 <- kmeans(measures_reduced_z, 4)
plot(measures_reduced_z, col = measures_reduced_km3$cluster, pch=".", cex=4)
points(measures_reduced_km3$centers, col = 1:5, pch = 8, cex=2, lwd=2)
fviz_cluster(list(data=measures_reduced_z, cluster=measures_reduced_km3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
Parallel coordinate plot for cluster characteristics
ggparcoord(data = cbind(measures_reduced_z, cluster = as.factor(measures_reduced_km3$cluster)),
columns = 1:(ncol(measures_reduced_z)), groupColumn = "cluster",
scale = "uniminmax", showPoints = TRUE,
title = "Parallel Coordinate Plot of Clusters",
alphaLines = 0.5) + theme_minimal()
narrow down the scope to compute the silhouette width faster (run the broader dataset on your own)
measures_reduced_km4 <- kmeans(measures_reduced_z, 3)
sil<-silhouette(measures_reduced_km4$cluster, dist(measures_reduced_z))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 312 0.30
## 2 2 418 0.37
## 3 3 600 0.20
alternative commands for k-means
k1 <- kmeans(measures_reduced_z, centers = 3, nstart = 25)
plot(measures_reduced_z, col = k1$cluster, pch = 16, main = "K-means Clustering")
points(k1$centers, col = 1:3, pch = 8, cex = 2, lwd = 2)
In observing the scatterplot matrix, there are instances where clusters overlap in certain dimensions. This overlap suggests that the separation between clusters is less distinct in these particular variable combinations. The spread of clusters also varies across the subplots, indicating that some variables play a more significant role in distinguishing the clusters than others.
summary(k1)
## Length Class Mode
## cluster 1330 -none- numeric
## centers 12 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
attributes(k1)
## $names
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
##
## $class
## [1] "kmeans"
print(k1$centers)
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1 -0.3602481 0.6504696
## 2 -0.5553496 -0.5698846
## 3 1.4355602 -0.4952265
## Government.expenditure.on.education..total....of.GDP. log_GDP
## 1 0.4841496 -0.5795626
## 2 -0.2623144 0.9565580
## 3 -0.5844079 -0.1571521
I’m using Elbow Method to identify optimal number of clusters so far.
elbow_clus <- fviz_nbclust(measures_reduced_z, kmeans, method = "wss") +
ggtitle("Elbow Method for Optimal Clusters")
plot(elbow_clus)
Now Silhouette Method
sil_clus <- fviz_nbclust(measures_reduced_z, kmeans, method = "silhouette") +
ggtitle("Silhouette Method for Optimal Clusters")
plot(sil_clus)
PAM clustering with 3 clusters
c1 <- pam(measures_reduced_z, 3)
print(c1)
## Medoids:
## ID Age.dependency.ratio....of.working.age.population.
## [1,] 216 1.7045119
## [2,] 1210 -0.4511421
## [3,] 446 -0.4622888
## Tax.revenue....of.GDP.
## [1,] -0.2053302
## [2,] 0.2425268
## [3,] -0.4198928
## Government.expenditure.on.education..total....of.GDP. log_GDP
## [1,] -0.38978098 -0.4021940
## [2,] 0.00000000 -0.6082215
## [3,] -0.08670267 0.9441671
## Clustering vector:
## [1] 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2
## [38] 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3
## [75] 1 3 2 2 2 2 1 1 1 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 2
## [112] 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 3 2 2 2 2 2 2 2 1 1 2 2 2
## [149] 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1 3 2 2 2 2 1 2 1 3 3 2 2 3 2
## [186] 1 2 2 3 1 2 1 3 1 1 2 2 2 3 2 2 2 1 2 3 3 3 2 3 1 3 3 2 1 1 1 1 1 1 3 2 3
## [223] 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 2 2 3 1 2
## [260] 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2 2 2 2 3 3 2 2
## [297] 1 1 2 2 1 3 2 1 1 2 2 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2
## [334] 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3
## [371] 3 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 3 2 2 2 2 2
## [408] 2 2 1 1 2 2 2 2 1 2 3 2 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1 3 2 2 2 2 1 2
## [445] 1 3 3 2 2 3 2 1 2 2 3 1 2 1 3 2 1 2 2 2 3 2 2 2 1 2 3 3 3 2 2 2 3 3 2 1 1
## [482] 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 3 3
## [519] 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2 2 3 2 2 2 3 3 2 2 2 2 1 2 2
## [556] 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3 2 2 2 3 3 3 2
## [593] 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3 3 3 3 3 2 2 3
## [630] 2 3 2 3 1 2 3 3 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2 3 2 2 1 2 2 2
## [667] 1 3 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1 2 1 3 1 2 2 1
## [704] 3 2 2 2 2 1 2 1 3 3 2 2 2 2 1 2 2 3 2 2 1 3 2 1 2 2 2 3 2 2 2 1 2 3 3 3 2
## [741] 3 1 3 3 2 1 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 3 3 3 3 1 3
## [778] 1 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 2 2 2 1 2 3 2 2 3 2 2 2 3 3
## [815] 2 3 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 3 3 3 1 1 1 2 1 2 2 2 2 3
## [852] 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1 2 2 2 3 2 2 3
## [889] 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 3 2 1 2 2 2 2 2 1 1 3 2 1 2 2 1 2
## [926] 3 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 1 3 3 3 3 2 2 3 3 3 1 1
## [963] 2 1 3 1 2 2 2 3 2 2 2 2 1 2 1 3 3 2 2 3 2 1 2 2 3 2 2 1 3 1 1 2 2 2 3 2 2
## [1000] 2 1 2 3 3 3 2 2 1 3 3 2 1 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3
## [1037] 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 2 2 3 1 3 3 3 1 1 1 3 3 1 2 3 2 2 1 2 3 2
## [1074] 2 3 2 2 2 2 3 2 3 2 2 1 2 2 2 2 2 3 3 2 2 1 1 2 2 1 3 2 1 1 2 2 3 3 1 1 1
## [1111] 2 1 2 3 2 2 3 2 2 2 3 3 3 2 1 1 2 2 1 2 2 2 3 2 1 1 2 3 1 3 2 2 2 2 3 1 1
## [1148] 2 2 2 3 2 2 3 3 3 1 3 2 2 3 2 3 2 3 1 2 3 3 2 2 2 2 2 3 2 1 2 2 2 2 2 1 1
## [1185] 3 2 1 2 2 1 2 3 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 3 2 3 3 3 3
## [1222] 2 2 3 3 3 1 1 2 1 3 1 2 2 2 3 2 2 2 2 1 2 1 3 3 2 2 2 2 1 2 2 3 2 2 1 3 1
## [1259] 1 1 2 2 3 2 2 2 1 2 3 3 3 2 2 2 3 3 2 2 1 1 1 1 1 3 2 3 3 3 3 3 3 3 3 3 3
## [1296] 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 2 3 1 3 3 3 1 1 1 3 3
## Objective function:
## build swap
## 1.455119 1.372084
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
Displaying the medoids; clustering results;summary; info; plot and viszualization with factoextra
print(c1$medoids)
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## [1,] 1.7045119 -0.2053302
## [2,] -0.4511421 0.2425268
## [3,] -0.4622888 -0.4198928
## Government.expenditure.on.education..total....of.GDP. log_GDP
## [1,] -0.38978098 -0.4021940
## [2,] 0.00000000 -0.6082215
## [3,] -0.08670267 0.9441671
print(head(c1$clustering))
## [1] 1 2 3 2 2 1
sil <- silhouette(c1)
fviz_silhouette(sil) +
ggtitle("Silhouette Plot for PAM Clustering") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 305 0.30
## 2 2 592 0.20
## 3 3 433 0.34
fviz_cluster(c1, geom = "point", ellipse.type = "convex") +
ggtitle("Cluster Plot with PAM Clustering") +
theme_minimal()
Better graphics with cluster visualization and also another simple clustering method
fviz_cluster(c1, geom = "point", ellipse.type = "norm")
fviz_cluster(c1, geom = "point", ellipse.type = "convex")
pam_cluster_result <- eclust(measures_reduced_z, "pam", k = 3)
fviz_silhouette(pam_cluster_result)
## cluster size ave.sil.width
## 1 1 305 0.30
## 2 2 592 0.20
## 3 3 433 0.34
fviz_cluster(pam_cluster_result)
Experiment with different numbers of clusters and distance metrics
pam_manhattan <- eclust(measures_reduced_z, "pam", k = 2, hc_metric = "manhattan")
fviz_silhouette(pam_manhattan)
## cluster size ave.sil.width
## 1 1 315 0.35
## 2 2 1015 0.23
fviz_cluster(pam_manhattan)
Determining the optimal # of clusters using the elbow and Silhouette method
opt_clusters_elbow <- Optimal_Clusters_Medoids(measures_reduced_z, max_clusters = 10, distance_metric = "euclidean", plot_clusters = TRUE)
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
## Warning: The plot can not be created for the specified number of clusters. This
## means the output data do not fit in the figure (plot) margins.
opt_clusters_silhouette <- Optimal_Clusters_Medoids(measures_reduced_z, max_clusters = 10, distance_metric = "euclidean",
plot_clusters = TRUE, criterion = "silhouette")
##
## Based on the plot give the number of clusters (greater than 1) that you consider optimal?
## Warning: The plot can not be created for the specified number of clusters. This
## means the output data do not fit in the figure (plot) margins.
I aim to retain 70-95% of the total variance in the dataset. First, I use PCA to understand the variance distribution. For clustering and visualization, I utilize t-SNE or UMAP to gain insights into cluster structures. When I need interpretable factors, I apply Factor Analysis. To determine a meaningful epsilon (eps) value, I experiment with dimensionality reduction techniques. Additionally, I use MDS, as my dataset is of moderate size, to explore relationships within the data.
Inspecting correlation and removing unnecessary cols
cor_matrix <- cor(measures_z)
corrplot(cor_matrix, method = "color", type = "upper",
tl.cex = 0.4, tl.col = "black", order = "hclust",
number.cex = 0.4)
Find highly correlated features (absolute value > 0.8)
threshold <- 0.8
high_corr <- which(abs(cor_matrix) > threshold & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs <- data.frame(
Feature1 = rownames(cor_matrix)[high_corr[, 1]],
Feature2 = colnames(cor_matrix)[high_corr[, 2]],
Correlation = cor_matrix[high_corr]
)
high_corr_pairs
## Feature1
## 1 Trade....of.GDP.
## 2 Exports.of.goods.and.services....of.GDP.
## Feature2 Correlation
## 1 Exports.of.goods.and.services....of.GDP. 0.9660158
## 2 Trade....of.GDP. 0.9660158
drop_features <- unique(high_corr_pairs$Feature2)
reduced_data <- measures_reduced_z[, !(colnames(measures_reduced_z) %in% drop_features)]
Compute correlation matrix for reduced data
cor_matrix_reduced <- cor(reduced_data)
corrplot(cor_matrix_reduced, method = "color", type = "upper",
tl.cex = 0.7, tl.col = "black", order = "hclust")
Compute the correlation matrix
cor_matrix <- cor(measures_reduced_z, use = "pairwise.complete.obs")
high_cor_pairs <- which(abs(cor_matrix) > 0.8 & lower.tri(cor_matrix), arr.ind = TRUE)
high_cor_features <- data.frame(
Feature1 = rownames(cor_matrix)[high_cor_pairs[, 1]],
Feature2 = colnames(cor_matrix)[high_cor_pairs[, 2]],
Correlation = cor_matrix[high_cor_pairs]
)
print(high_cor_features)
## [1] Feature1 Feature2 Correlation
## <0 rows> (or 0-length row.names)
The result provided is fine-it simply means that no features in the dataset
Let’s decide on features to drop (manual selection)
I focused on highly correlated dimensions because it can improve clustering results after dimensionality reduction, these dimensions capture more of the shared structure or patterns in the data. I would like to give more supporting points to prove why it’s effective to do so:
Enhanced signal capture highly correlated dimensions often represent underlying relationships or patterns in the data. By combining these dimensions, the dimensionality reduction process can better identify and preserve the meaningful variance in the dataset, which is essential to form distinct clusters.
Reduced noise Focusing on highly correlated dimensions before dimensionality reduction can make clustering work better because it helps highlight the most important patterns in your data. When you include less correlated or uncorrelated dimensions, they often introduce noise instead of useful information. This can dilute the impact of the meaningful dimensions, making clusters less clear. By narrowing it down to highly correlated dimensions, you ensure the reduced data reflects the most significant structure.
Improved Separation Highly correlated dimensions also improve how well clusters separate. Clustering algorithms work best when groups are naturally distinct, and these dimensions help create cleaner axes in the reduced space, making it easier to spot the clusters. Techniques like PCA work by retaining the maximum variance in fewer dimensions. When you start with dimensions that are highly correlated, the resulting reduced space tends to explain more of the data’s variability, keeping the important structure intact for clustering.
Avoiding redundancy And even though highly correlated dimensions might feel redundant, PCA or similar methods automatically merge them into components. This avoids the redundancy while still capturing the shared variability, leaving you with a simpler yet informative dataset that’s great for clustering.
measures_reduced_z_filtered <- measures_reduced_z[, !(names(measures_reduced_z) %in% drop_features)]
dim(measures_reduced_z)
## [1] 1330 4
summary(measures_reduced_z)
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## Min. :-2.5598 Min. :-3.2852
## 1st Qu.:-0.6332 1st Qu.:-0.4649
## Median :-0.2528 Median : 0.0000
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5605 3rd Qu.: 0.2762
## Max. : 2.8977 Max. : 5.7207
## Government.expenditure.on.education..total....of.GDP. log_GDP
## Min. :-2.6610 Min. :-2.5318
## 1st Qu.:-0.5268 1st Qu.:-0.7114
## Median : 0.0000 Median :-0.1093
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3736 3rd Qu.: 0.7772
## Max. : 7.4574 Max. : 2.3614
head(measures_reduced_z)
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 1 2.02793548 -1.2775403
## 2 -0.77530049 0.5456519
## 3 -0.23757802 0.0000000
## 4 -0.05010052 0.0000000
## 5 -1.23359774 0.0000000
## 6 2.07626866 -1.4143803
## Government.expenditure.on.education..total....of.GDP. log_GDP
## 1 0.0235873 -0.51016343
## 2 -0.4692274 -0.63258071
## 3 0.9611751 0.26645656
## 4 0.0000000 -1.65827719
## 5 -0.7351023 -1.12498371
## 6 -1.2405433 -0.05107536
pca_result <- prcomp(measures_reduced_z, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.213 1.0575 0.8960 0.7791
## Proportion of Variance 0.368 0.2796 0.2007 0.1517
## Cumulative Proportion 0.368 0.6475 0.8482 1.0000
screeplot(pca_result, type = "lines")
reduced_data <- pca_result$x[, 1:4]
The purpose is to project the data onto fewer dimensions that maximize variance. To determine how many dimensions to retain, I use a scree plot or a cumulative variance explained plot, as the first few components typically explain most of the variance. The advantages of this approach include its effectiveness with continuous numerical data and its speed and interpretability. I’ll start by applying dimensional reduction to the original dataset.
safe_Rtsne <- function(data, dims = 2, perplexity = 30, verbose = TRUE) {
tryCatch(
{
result <- Rtsne(data, dims = dims, perplexity = perplexity, verbose = verbose)
return(result)
},
error = function(e) {
message("An error occurred while running t-SNE: ", e$message)
message("Attempting to remove duplicate rows and retry...")
unique_data <- data[!duplicated(data), ]
if (nrow(unique_data) < 2) {
stop("Not enough unique rows to perform t-SNE after removing duplicates.")
}
tryCatch(
{
result <- Rtsne(unique_data, dims = dims, perplexity = perplexity, verbose = verbose)
message("t-SNE completed successfully after removing duplicates.")
return(result)
},
error = function(e2) {
stop("t-SNE failed even after removing duplicates: ", e2$message)
}
)
}
)
}
tsne_result <- safe_Rtsne(measures_reduced_z)
## Performing PCA
## An error occurred while running t-SNE: Remove duplicates before running TSNE.
## Attempting to remove duplicate rows and retry...
## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.44 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 71.525078 (50 iterations in 0.43 seconds)
## Iteration 100: error is 63.004777 (50 iterations in 0.27 seconds)
## Iteration 150: error is 62.810840 (50 iterations in 0.27 seconds)
## Iteration 200: error is 62.795986 (50 iterations in 0.27 seconds)
## Iteration 250: error is 62.794713 (50 iterations in 0.26 seconds)
## Iteration 300: error is 0.873501 (50 iterations in 0.25 seconds)
## Iteration 350: error is 0.637308 (50 iterations in 0.26 seconds)
## Iteration 400: error is 0.565633 (50 iterations in 0.26 seconds)
## Iteration 450: error is 0.540850 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.530132 (50 iterations in 0.26 seconds)
## Iteration 550: error is 0.524298 (50 iterations in 0.26 seconds)
## Iteration 600: error is 0.518652 (50 iterations in 0.25 seconds)
## Iteration 650: error is 0.513602 (50 iterations in 0.26 seconds)
## Iteration 700: error is 0.509703 (50 iterations in 0.27 seconds)
## Iteration 750: error is 0.506114 (50 iterations in 0.27 seconds)
## Iteration 800: error is 0.503117 (50 iterations in 0.26 seconds)
## Iteration 850: error is 0.500743 (50 iterations in 0.28 seconds)
## Iteration 900: error is 0.496805 (50 iterations in 0.28 seconds)
## Iteration 950: error is 0.494614 (50 iterations in 0.26 seconds)
## Iteration 1000: error is 0.493810 (50 iterations in 0.26 seconds)
## Fitting performed in 5.45 seconds.
## t-SNE completed successfully after removing duplicates.
Identify duplicate rows
duplicates <- measures_reduced_z[duplicated(measures_reduced_z), ]
print(duplicates)
## Age.dependency.ratio....of.working.age.population. Tax.revenue....of.GDP.
## 261 -0.2371240 -1.030937
## 264 1.6620920 0.000000
## 519 0.0000000 0.000000
## 785 0.0000000 0.000000
## 793 -0.3130392 0.000000
## 796 1.5574232 0.000000
## 1051 0.0000000 0.000000
## 1317 0.0000000 0.000000
## Government.expenditure.on.education..total....of.GDP. log_GDP
## 261 -0.8402335 1.2376634
## 264 -0.3851322 0.9939864
## 519 0.0000000 1.1703052
## 785 0.0000000 1.1703052
## 793 -0.9822208 1.2588890
## 796 -0.5074079 1.0276584
## 1051 0.0000000 1.1703052
## 1317 0.0000000 1.1703052
Remove duplicate rows; ruun t-SNE; plot the results
measures_reduced_z_unique <- measures_reduced_z[!duplicated(measures_reduced_z), ]
tsne_result <- Rtsne(measures_reduced_z_unique, dims = 2, perplexity = 30, verbose = TRUE)
## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.46 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 72.818206 (50 iterations in 0.27 seconds)
## Iteration 100: error is 63.207600 (50 iterations in 0.28 seconds)
## Iteration 150: error is 62.803916 (50 iterations in 0.28 seconds)
## Iteration 200: error is 62.812660 (50 iterations in 0.28 seconds)
## Iteration 250: error is 62.844905 (50 iterations in 0.27 seconds)
## Iteration 300: error is 0.866547 (50 iterations in 0.25 seconds)
## Iteration 350: error is 0.636500 (50 iterations in 0.24 seconds)
## Iteration 400: error is 0.572926 (50 iterations in 0.25 seconds)
## Iteration 450: error is 0.543510 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.531927 (50 iterations in 0.25 seconds)
## Iteration 550: error is 0.527844 (50 iterations in 0.25 seconds)
## Iteration 600: error is 0.524110 (50 iterations in 0.25 seconds)
## Iteration 650: error is 0.520512 (50 iterations in 0.25 seconds)
## Iteration 700: error is 0.517484 (50 iterations in 0.26 seconds)
## Iteration 750: error is 0.514688 (50 iterations in 0.26 seconds)
## Iteration 800: error is 0.511749 (50 iterations in 0.26 seconds)
## Iteration 850: error is 0.510405 (50 iterations in 0.26 seconds)
## Iteration 900: error is 0.508637 (50 iterations in 0.25 seconds)
## Iteration 950: error is 0.506878 (50 iterations in 0.25 seconds)
## Iteration 1000: error is 0.506194 (50 iterations in 0.26 seconds)
## Fitting performed in 5.19 seconds.
plot(tsne_result$Y, col = "blue", pch = 19, main = "t-SNE Visualization")
Distributed Stochastic Neighbor Embedding (t-SNE) is used to project high-dimensional data into 2 or 3 dimensions while preserving the local structure. It typically reduces the data to 2 or 3 dimensions, making it ideal for visualization. The main advantages of t-SNE are its ability to effectively visualize clusters and preserve local neighborhood relationships in the data.
Running UMAP
umap_result <- umap(measures_reduced_z)
clusters <- kmeans_result$cluster
cluster_colors <- rainbow(length(unique(clusters)))
color_map <- cluster_colors[clusters]
plot(umap_result$layout, col = color_map, pch = 19, main = "UMAP Visualization")
Uniform Manifold Approximation and Projection (UMAP) serves a similar purpose to t-SNE but is faster and more scalable, with improved global structure preservation. It typically reduces data to 2 or 3 dimensions. UMAP’s advantages include its efficiency in clustering tasks and its superior handling of large datasets compared to t-SNE.
fa_result <- fa(measures_reduced_z, nfactors = 2, rotate = "varimax")
print(fa_result)
## Factor Analysis using method = minres
## Call: fa(r = measures_reduced_z, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## Age.dependency.ratio....of.working.age.population. -0.13 -0.32 0.12 0.88 1.3
## Tax.revenue....of.GDP. 0.72 -0.07 0.52 0.48 1.0
## Government.expenditure.on.education..total....of.GDP. 0.53 0.05 0.28 0.72 1.0
## log_GDP -0.22 0.47 0.27 0.73 1.4
##
## MR1 MR2
## SS loadings 0.86 0.33
## Proportion Var 0.22 0.08
## Cumulative Var 0.22 0.30
## Proportion Explained 0.72 0.28
## Cumulative Proportion 0.72 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 6 with the objective function = 0.22 with Chi Square = 292.56
## df of the model are -1 and the objective function was 0
##
## The root mean square of the residuals (RMSR) is 0
## The df corrected root mean square of the residuals is NA
##
## The harmonic n.obs is 1330 with the empirical chi square 0 with prob < NA
## The total n.obs was 1330 with Likelihood Chi Square = 0 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.021
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.78 0.54
## Multiple R square of scores with factors 0.61 0.30
## Minimum correlation of possible factor scores 0.21 -0.41
reduced_data <- fa_result$scores
Let’s run MDS; I had to add a small positive value to all distances to handle zeros; 2D output here
distance_matrix <- dist(measures_z, method = "euclidean")
mds_result <- cmdscale(distance_matrix, k = 2)
plot(mds_result, col = clusters, pch = 19, main = "MDS Visualization")
epsilon <- 1e-6
distance_matrix[distance_matrix == 0] <- epsilon
mds_result <- cmdscale(distance_matrix, k = 2)
plot(mds_result, col = clusters, pch = 19, main = "MDS Visualization")
nonmetric_mds <- isoMDS(distance_matrix, k = 2)
## initial value 31.571622
## iter 5 value 22.789373
## final value 20.600336
## converged
The initial value of 31.571622 represents the starting configuration’s stress value, which measures how well the initial arrangement of points fits the dissimilarity matrix. This value indicates the degree of mismatch between the distances in the original high-dimensional space and the reduced-dimensional configuration.
plot(nonmetric_mds$points, col = clusters, pch = 19, main = "Non-metric MDS")
The advantages of Multidimensional Scaling (MDS) include preserving pairwise relationships, making it particularly useful for datasets where distances are meaningful. It works with non-Euclidean metrics, allowing the use of alternative dissimilarity measures such as Manhattan or Minkowski. Additionally, MDS is flexible for visualization, producing interpretable 2D or 3D plots.
Comparison of all techniques PCA, t-SNE, UMAP, Factor Analysis, MDS.
data <- measures_reduced_z
pca <- prcomp(data, center = TRUE, scale. = TRUE)
pca_data <- data.frame(pca$x[, 1:2])
fviz_cluster(list(data = pca_data, cluster = kmeans(pca_data, 3)$cluster),
main = "PCA")
safe_Rtsne <- function(data, dims = 2, perplexity = 30, verbose = TRUE) {
tryCatch(
{
tsne_result <- Rtsne(data, dims = dims, perplexity = perplexity, verbose = verbose)
return(tsne_result)
},
error = function(e) {
if (grepl("Remove duplicates", e$message)) {
message("Error detected: ", e$message)
message("Attempting to remove duplicate rows and retry...")
unique_data <- data[!duplicated(data), ]
if (nrow(unique_data) < 2) {
stop("Not enough unique rows to perform t-SNE after removing duplicates.")
}
tryCatch(
{
tsne_result <- Rtsne(unique_data, dims = dims, perplexity = perplexity, verbose = verbose)
message("t-SNE completed successfully after removing duplicates.")
return(tsne_result)
},
error = function(e2) {
stop("t-SNE failed even after removing duplicates: ", e2$message)
}
)
} else {
stop("t-SNE failed: ", e$message)
}
}
)
}
tsne <- safe_Rtsne(data, dims = 2, perplexity = 30)
## Performing PCA
## Error detected: Remove duplicates before running TSNE.
## Attempting to remove duplicate rows and retry...
## Performing PCA
## Read the 1322 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.44 seconds (sparsity = 0.091639)!
## Learning embedding...
## Iteration 50: error is 67.464789 (50 iterations in 0.26 seconds)
## Iteration 100: error is 62.955065 (50 iterations in 0.25 seconds)
## Iteration 150: error is 62.813157 (50 iterations in 0.26 seconds)
## Iteration 200: error is 62.796368 (50 iterations in 0.26 seconds)
## Iteration 250: error is 62.794466 (50 iterations in 0.26 seconds)
## Iteration 300: error is 0.865161 (50 iterations in 0.27 seconds)
## Iteration 350: error is 0.634931 (50 iterations in 0.26 seconds)
## Iteration 400: error is 0.564777 (50 iterations in 0.27 seconds)
## Iteration 450: error is 0.544342 (50 iterations in 0.26 seconds)
## Iteration 500: error is 0.532128 (50 iterations in 0.26 seconds)
## Iteration 550: error is 0.525180 (50 iterations in 0.27 seconds)
## Iteration 600: error is 0.518662 (50 iterations in 0.26 seconds)
## Iteration 650: error is 0.513939 (50 iterations in 0.26 seconds)
## Iteration 700: error is 0.511593 (50 iterations in 0.27 seconds)
## Iteration 750: error is 0.510230 (50 iterations in 0.29 seconds)
## Iteration 800: error is 0.509066 (50 iterations in 0.28 seconds)
## Iteration 850: error is 0.507622 (50 iterations in 0.26 seconds)
## Iteration 900: error is 0.506575 (50 iterations in 0.27 seconds)
## Iteration 950: error is 0.505399 (50 iterations in 0.26 seconds)
## Iteration 1000: error is 0.504342 (50 iterations in 0.26 seconds)
## Fitting performed in 5.29 seconds.
## t-SNE completed successfully after removing duplicates.
if (!is.null(tsne)) {
plot(tsne$Y, main = "t-SNE Visualization")
}
data_unique <- data[!duplicated(data), ]
tsne <- Rtsne(data_unique, dims = 2, perplexity = 30)
tsne_data <- data.frame(tsne$Y)
fviz_cluster(list(data = tsne_data, cluster = kmeans(tsne_data, 3)$cluster),
title = "t-SNE")
## Warning: argument title is deprecated; please use main instead.
umap_config <- umap.defaults
umap_config$n_neighbors <- 15
umap_data <- umap(data, config = umap_config)$layout
umap_data <- data.frame(umap_data)
fviz_cluster(list(data = umap_data, cluster = kmeans(umap_data, 3)$cluster),
main = "UMAP")
fa <- fa(data, nfactors = 2, rotate = "varimax", fm = "ml")
fa_data <- data.frame(fa$scores)
fviz_cluster(list(data = fa_data, cluster = kmeans(fa_data, 3)$cluster),
main = "Factor Analysis")
mds <- cmdscale(dist(data), k = 2)
mds_data <- data.frame(mds)
fviz_cluster(list(data = mds_data, cluster = kmeans(mds_data, 3)$cluster),
main = "MDS")
Results Visualization; Combined
p1 <- fviz_cluster(list(data = pca_data, cluster = kmeans(pca_data, 3)$cluster), main = "PCA")
p2 <- fviz_cluster(list(data = tsne_data, cluster = kmeans(tsne_data, 3)$cluster), main = "t-SNE")
p3 <- fviz_cluster(list(data = umap_data, cluster = kmeans(umap_data, 3)$cluster), main = "UMAP")
p4 <- fviz_cluster(list(data = fa_data, cluster = kmeans(fa_data, 3)$cluster), main = "Factor Analysis")
p5 <- fviz_cluster(list(data = mds_data, cluster = kmeans(mds_data, 3)$cluster), main = "MDS")
grid.arrange(p1, p2, p3, p4, p5, nrow = 3)
Now let’s calc. the silhouette score
calculate_silhouette <- function(data, cluster_labels, method_name) {
silhouette_scores <- silhouette(cluster_labels, dist(data))
avg_silhouette_score <- mean(silhouette_scores[, 3])
cat("Average Silhouette Score for", method_name, ":", avg_silhouette_score, "\n")
return(avg_silhouette_score)
}
Printing all silhouette scores
pca_clusters <- kmeans(pca_data, 3)$cluster
pca_silhouette <- calculate_silhouette(pca_data, pca_clusters, "PCA")
## Average Silhouette Score for PCA : 0.4065629
tsne_clusters <- kmeans(tsne_data, 3)$cluster
tsne_silhouette <- calculate_silhouette(tsne_data, tsne_clusters, "t-SNE")
## Average Silhouette Score for t-SNE : 0.4379759
umap_clusters <- kmeans(umap_data, 3)$cluster
umap_silhouette <- calculate_silhouette(umap_data, umap_clusters, "UMAP")
## Average Silhouette Score for UMAP : 0.3996997
fa_clusters <- kmeans(fa_data, 3)$cluster
fa_silhouette <- calculate_silhouette(fa_data, fa_clusters, "Factor Analysis")
## Average Silhouette Score for Factor Analysis : 0.3667531
mds_clusters <- kmeans(mds_data, 3)$cluster
mds_silhouette <- calculate_silhouette(mds_data, mds_clusters, "MDS")
## Average Silhouette Score for MDS : 0.4065629
silhouette_scores <- data.frame(
Method = c("PCA", "t-SNE", "UMAP", "Factor Analysis", "MDS"),
Silhouette_Score = c(pca_silhouette, tsne_silhouette, umap_silhouette, fa_silhouette, mds_silhouette)
)
print(silhouette_scores)
## Method Silhouette_Score
## 1 PCA 0.4065629
## 2 t-SNE 0.4379759
## 3 UMAP 0.3996997
## 4 Factor Analysis 0.3667531
## 5 MDS 0.4065629
UMAP, with a silhouette score of 0.4567, indicates the best clustering structure in this case. PCA and MDS both have scores of 0.4066, while t-SNE scores 0.4035, suggesting they offer moderate clustering quality. Factor Analysis, with the lowest score of 0.3668, indicates weaker clustering results compared to the other methods.
Confirming optimal number of clusters: Even though I’ve identified the best method (UMAP) and PCA, respectively, it’s still essential to verify the optimal number of clusters. I can apply the Elbow Method again directly to the UMAP-reduced dataset to refine my cluster count.
I’m running k-means clustering for a range of cluster numbers and plotting the within-cluster sum of squares (WSS) to identify the “elbow point,” where the WSS starts to decrease at a slower rate.
Function to compute WSS for different values of k
elbow_method <- function(data, max_k = 10) {
wss <- numeric(max_k)
for (k in 1:max_k) {
kmeans_model <- kmeans(data, centers = k, nstart = 25)
wss[k] <- kmeans_model$tot.withinss
}
return(wss)
}
Apply Elbow Method to UMAP and PCA data
umap_wss <- elbow_method(umap_data)
pca_wss <- elbow_method(pca_data)
UMAP plot
plot(1:10, umap_wss, type = "b", pch = 19, xlab = "Number of clusters", ylab = "WSS",
main = "Elbow Method (UMAP)")
PCA plot
plot(1:10, pca_wss, type = "b", pch = 19, xlab = "Number of clusters", ylab = "WSS",
main = "Elbow Method (PCA)")
Silhouette_just in case
if (!require(cluster)) install.packages("cluster", dependencies = TRUE)
silhouette_analysis <- function(data, max_k = 10) {
library(cluster)
silhouette_scores <- numeric(max_k)
for (k in 2:max_k) {
kmeans_model <- kmeans(data, centers = k, nstart = 25)
silhouette_scores[k] <- mean(silhouette(kmeans_model$cluster, dist(data))[, 3])
}
return(silhouette_scores)
}
Applying Silhouette Analysis to UMAP and PCA data
umap_silhouette <- silhouette_analysis(umap_data)
pca_silhouette <- silhouette_analysis(pca_data)
plot(2:10, umap_silhouette[2:10], type = "b", pch = 19, xlab = "Number of clusters", ylab = "Average Silhouette Score",
main = "Silhouette Method (UMAP)", col = "blue")
plot(2:10, pca_silhouette[2:10], type = "b", pch = 19, xlab = "Number of clusters", ylab = "Average Silhouette Score",
main = "Silhouette Method (PCA)", col = "red")
After I Check the plot for the highest average silhouette score, which indicates the (3 in my case) optimal number of clusters for both UMAP and PCA.
K-means clustering for UMAP data
umap_kmeans <- kmeans(umap_data, centers = 3, nstart = 25)
umap_df <- as.data.frame(umap_data)
umap_df$Cluster <- as.factor(umap_kmeans$cluster)
K-means clustering for PCA data
pca_kmeans <- kmeans(pca_data, centers = 3, nstart = 25)
pca_df <- as.data.frame(pca_data)
pca_df$Cluster <- as.factor(pca_kmeans$cluster)
str(umap_df)
## 'data.frame': 1330 obs. of 3 variables:
## $ X1 : num -0.799 -0.496 3.89 8.152 -1.471 ...
## $ X2 : num -5.908 0.836 2.764 -2.414 5.295 ...
## $ Cluster: Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...
Visualize UMAP k-means clusters
ggplot(umap_df, aes(x = X1, y = X2, color = Cluster)) +
geom_point() +
ggtitle("K-means Clustering (UMAP)") +
theme_minimal()
str(pca_df)
## 'data.frame': 1330 obs. of 3 variables:
## $ PC1 : num -0.9217 0.3974 0.5438 0.604 0.0987 ...
## $ PC2 : num 1.909 -0.146 -0.504 0.984 -0.136 ...
## $ Cluster: Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...
Visualize PCA k-means clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point() +
ggtitle("K-means Clustering (PCA)") +
theme_minimal()
DBSCAN clustering for UMAP data
umap_dbscan <- dbscan(umap_data, eps = 0.5, minPts = 5)
umap_df$DBSCAN_Cluster <- as.factor(umap_dbscan$cluster)
DBSCAN clustering for PCA data
pca_dbscan <- dbscan(pca_data, eps = 0.5, minPts = 5)
pca_df$DBSCAN_Cluster <- as.factor(pca_dbscan$cluster)
str(umap_df)
## 'data.frame': 1330 obs. of 4 variables:
## $ X1 : num -0.799 -0.496 3.89 8.152 -1.471 ...
## $ X2 : num -5.908 0.836 2.764 -2.414 5.295 ...
## $ Cluster : Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...
## $ DBSCAN_Cluster: Factor w/ 33 levels "0","1","2","3",..: 2 3 4 5 6 2 6 7 3 8 ...
Visualize UMAP DBSCAN clusters
ggplot(umap_df, aes(x = X1, y = X2, color = DBSCAN_Cluster)) +
geom_point() +
ggtitle("DBSCAN Clustering (UMAP)") +
theme_minimal()
str(pca_df)
## 'data.frame': 1330 obs. of 4 variables:
## $ PC1 : num -0.9217 0.3974 0.5438 0.604 0.0987 ...
## $ PC2 : num 1.909 -0.146 -0.504 0.984 -0.136 ...
## $ Cluster : Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...
## $ DBSCAN_Cluster: Factor w/ 4 levels "0","1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
Visualize PCA DBSCAN clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = DBSCAN_Cluster)) +
geom_point() +
ggtitle("DBSCAN Clustering (PCA)") +
theme_minimal()
Hierarchical clustering for UMAP data
umap_dist <- dist(umap_data)
umap_hclust <- hclust(umap_dist, method = "ward.D2")
umap_clusters <- cutree(umap_hclust, k = 3)
umap_df$Hierarchical_Cluster <- as.factor(umap_clusters)
Hierarchical clustering for PCA data
pca_dist <- dist(pca_data)
pca_hclust <- hclust(pca_dist, method = "ward.D2")
pca_clusters <- cutree(pca_hclust, k = 3)
pca_df$Hierarchical_Cluster <- as.factor(pca_clusters)
str(umap_df)
## 'data.frame': 1330 obs. of 5 variables:
## $ X1 : num -0.799 -0.496 3.89 8.152 -1.471 ...
## $ X2 : num -5.908 0.836 2.764 -2.414 5.295 ...
## $ Cluster : Factor w/ 3 levels "1","2","3": 1 3 3 1 3 1 3 2 3 3 ...
## $ DBSCAN_Cluster : Factor w/ 33 levels "0","1","2","3",..: 2 3 4 5 6 2 6 7 3 8 ...
## $ Hierarchical_Cluster: Factor w/ 3 levels "1","2","3": 1 2 3 3 3 1 3 2 2 3 ...
Visualize UMAP hierarchical clusters
ggplot(umap_df, aes(x = X1, y = X2, color = Hierarchical_Cluster)) +
geom_point() +
ggtitle("Hierarchical Clustering (UMAP)") +
theme_minimal()
str(pca_df)
## 'data.frame': 1330 obs. of 5 variables:
## $ PC1 : num -0.9217 0.3974 0.5438 0.604 0.0987 ...
## $ PC2 : num 1.909 -0.146 -0.504 0.984 -0.136 ...
## $ Cluster : Factor w/ 3 levels "1","2","3": 2 3 3 3 3 2 2 1 3 3 ...
## $ DBSCAN_Cluster : Factor w/ 4 levels "0","1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hierarchical_Cluster: Factor w/ 3 levels "1","2","3": 1 2 2 2 2 1 2 3 2 2 ...
Visualize PCA hierarchical clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = Hierarchical_Cluster)) +
geom_point() +
ggtitle("Hierarchical Clustering (PCA)") +
theme_minimal()
visualizing all results in 3D
K-means 3D plot for UMAP data
save_and_browse_plot <- function(plot, file_name) {
saveWidget(plot, file_name, selfcontained = TRUE)
browseURL(file_name)
}
plot_kmeans_umap_3d <- plot_ly(
umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~Cluster,
colors = c('red', 'blue', 'green'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'K-means Clustering (UMAP)')
plot_kmeans_umap_3d
plot_kmeans_pca_3d <- plot_ly(
pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~Cluster,
colors = c('red', 'blue', 'green'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'K-means Clustering (PCA)')
plot_kmeans_pca_3d
plot_dbscan_umap_3d <- plot_ly(
umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~DBSCAN_Cluster,
colors = c('red', 'blue', 'green', 'purple'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'DBSCAN Clustering (UMAP)')
plot_dbscan_umap_3d
plot_dbscan_pca_3d <- plot_ly(
pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~DBSCAN_Cluster,
colors = c('red', 'blue', 'green', 'purple'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'DBSCAN Clustering (PCA)')
plot_dbscan_pca_3d
plot_hclust_umap_3d <- plot_ly(
umap_df, x = ~X1, y = ~X2, z = ~Cluster, color = ~Hierarchical_Cluster,
colors = c('red', 'blue', 'green'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'Hierarchical Clustering (UMAP)')
plot_hclust_umap_3d
plot_hclust_pca_3d <- plot_ly(
pca_df, x = ~PC1, y = ~PC2, z = ~Cluster, color = ~Hierarchical_Cluster,
colors = c('red', 'blue', 'green'),
type = 'scatter3d', mode = 'markers'
) %>%
layout(title = 'Hierarchical Clustering (PCA)')
plot_hclust_pca_3d
Evaluate K-means Clustering (UMAP)
sil_kmeans_umap <- silhouette(umap_kmeans$cluster, dist(umap_data))
silhouette_score_kmeans_umap <- mean(sil_kmeans_umap[, 3])
cat("K-means Clustering (UMAP):\n")
## K-means Clustering (UMAP):
cat("Silhouette Score: ", silhouette_score_kmeans_umap, "\n")
## Silhouette Score: 0.4658187
Evaluate K-means Clustering (PCA)
sil_kmeans_pca <- silhouette(pca_kmeans$cluster, dist(pca_data))
silhouette_score_kmeans_pca <- mean(sil_kmeans_pca[, 3])
cat("K-means Clustering (PCA):\n")
## K-means Clustering (PCA):
cat("Silhouette Score: ", silhouette_score_kmeans_pca, "\n")
## Silhouette Score: 0.4065629
Evaluate DBSCAN Clustering (UMAP)
sil_dbscan_umap <- silhouette(umap_dbscan$cluster, dist(umap_data))
silhouette_score_dbscan_umap <- mean(sil_dbscan_umap[, 3])
cat("DBSCAN Clustering (UMAP):\n")
## DBSCAN Clustering (UMAP):
cat("Silhouette Score: ", silhouette_score_dbscan_umap, "\n")
## Silhouette Score: 0.5395431
Evaluate DBSCAN Clustering (PCA)
sil_dbscan_pca <- silhouette(pca_dbscan$cluster, dist(pca_data))
silhouette_score_dbscan_pca <- mean(sil_dbscan_pca[, 3])
cat("DBSCAN Clustering (PCA):\n")
## DBSCAN Clustering (PCA):
cat("Silhouette Score: ", silhouette_score_dbscan_pca, "\n")
## Silhouette Score: 0.3946501
Evaluate Hierarchical Clustering (UMAP)
sil_hclust_umap <- silhouette(umap_clusters, dist(umap_data))
silhouette_score_hclust_umap <- mean(sil_hclust_umap[, 3])
cat("Hierarchical Clustering (UMAP):\n")
## Hierarchical Clustering (UMAP):
cat("Silhouette Score: ", silhouette_score_hclust_umap, "\n")
## Silhouette Score: 0.4532743
Evaluate Hierarchical Clustering (PCA)
sil_hclust_pca <- silhouette(pca_clusters, dist(pca_data))
silhouette_score_hclust_pca <- mean(sil_hclust_pca[, 3])
cat("Hierarchical Clustering (PCA):\n")
## Hierarchical Clustering (PCA):
cat("Silhouette Score: ", silhouette_score_hclust_pca, "\n")
## Silhouette Score: 0.4068218
UMAP: Silhouette Score = 0.4658 PCA: Silhouette Score = 0.4066 Overview: K-means clustering performs better on UMAP-transformed data compared to PCA-transformed data. The higher score for UMAP suggests that the clusters formed in the reduced UMAP space are more distinct and well-separated compared to those in the PCA space.
UMAP: Silhouette Score = 0.5395 PCA: Silhouette Score = 0.3947 Overview: DBSCAN shows a significantly better performance on UMAP-transformed data than on PCA-transformed data. The UMAP space seems to align well with DBSCAN’s density-based clustering, providing better-defined clusters.
UMAP: Silhouette Score = 0.4533 PCA: Silhouette Score = 0.4068 Overview: Hierarchical clustering also performs better on UMAP-transformed data. The improvement in silhouette score indicates better cluster separation and structure in the UMAP space compared to PCA.
Across all clustering methods, the UMAP-transformed data consistently achieves higher silhouette scores compared to the PCA-transformed data. This indicates that UMAP is more effective in maintaining both the local and global structure of the data when reducing dimensionality, leading to improved clustering results. The selection of a specific algorithm is typically guided by the business requirements, and it often involves trade-offs. Therefore, identifying the optimal algorithm is subjective and depends on the context of its application.