Analysis of similarity between countries based on Euclidean distance using socio-economic indicators

Objective

The objective of this analysis is to assess the degree of similarity between the analyzed units (countries) by constructing a proximity matrix based on the Euclidean distance, using variables such as GDP, level of urbanization, and CO₂ emissions. The calculation of distances allows the identification of relationships of similarity or dissimilarity between observations, providing the foundation for highlighting the structure of the data and for the subsequent application of classification methods.
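As a minimal sketch of how such a proximity matrix can be built in R, the example below uses a small made-up data frame. The unit labels `A`, `B`, `C` and all indicator values are hypothetical and are not taken from the analysed data set.

```r
# Toy data: three hypothetical units with made-up indicator values
toy <- data.frame(
  gdp          = c(500, 4000, 45000),
  urbanization = c(25, 60, 85),
  co2          = c(0.3, 1.6, 9.0),
  row.names    = c("A", "B", "C")
)

# dist() returns the pairwise Euclidean distances (lower triangle only)
d_toy <- dist(toy, method = "euclidean")

# as.matrix() expands the result into the full symmetric proximity matrix
prox <- as.matrix(d_toy)
print(round(prox, 2))
```

The diagonal is zero because each unit is at distance zero from itself, and the matrix is symmetric because the distance from A to B equals the distance from B to A.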

Finding the smallest distance between countries

## The smallest distance is: 2.552263

## and is registered between Mongolia and Armenia

The countries with the smallest distance are Mongolia and Armenia. Minimum distance value: 2.5523.
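One possible way to locate the closest pair in a distance object is sketched below. The helper name `closest_pair` and the toy coordinates are assumptions made for illustration. The report does not show the code that produced the output above.

```r
# Hypothetical helper: return the labels and distance of the closest pair
closest_pair <- function(d) {
  m <- as.matrix(d)
  diag(m) <- Inf                                   # ignore the zero self-distances
  idx <- which(m == min(m), arr.ind = TRUE)[1, ]   # first cell holding the minimum
  list(units = rownames(m)[idx], distance = m[idx[1], idx[2]])
}

# Toy points: A = (0, 0), B = (1, 0), C = (5, 5)
toy <- matrix(c(0, 0, 1, 0, 5, 5), ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("x", "y")))
res <- closest_pair(dist(toy))
print(res)
```

The diagonal has to be masked because every unit is at distance zero from itself; without that step the minimum would always be a self-distance.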

An excerpt of the proximity matrix computed on the original (unscaled) variables:

Table. The proximity matrix (excerpt)
	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Argentina
Afghanistan	0.000	3087.0913	3892.1637	37615.567	2269.2984	16403.735	11827.601
Albania	3087.091	0.0000	805.0987	34528.575	817.9743	13316.805	8740.539
Algeria	3892.164	805.0987	0.0000	33723.601	1622.9236	12511.885	7935.529
Andorra	37615.567	34528.5746	33723.6010	0.000	35346.5112	21211.868	25788.099
Angola	2269.298	817.9743	1622.9236	35346.511	0.0000	14134.762	9558.450
Antigua and Barbuda	16403.735	13316.8047	12511.8847	21211.868	14134.7616	0.000	4576.768
Argentina	11827.601	8740.5392	7935.5291	25788.099	9558.4503	4576.768	0.000
Armenia	2553.015	534.3506	1339.2147	35062.811	283.7184	13851.074	9274.744
Aruba	28782.294	25695.3337	24890.3821	8833.339	26513.2806	12378.565	16954.940
Australia	53545.559	50458.5790	49653.6103	15930.014	51276.5184	37141.835	41718.112

Finding the largest distance between countries

## The highest value is:  73619.37

## and is registered between Norway and Burundi

The countries with the largest distance are Norway and Burundi. Maximum distance value: 73619.37.

Scaling the variables

To compare observations properly, we first scale the numeric variables:

\[ z_i = \frac{x_i - \bar{x}}{\sigma_x} \]

where:
- \(x_i\) is the original value of observation \(i\)
- \(\bar{x}\) is the mean of the variable
- \(\sigma_x\) is the standard deviation of the variable
- \(z_i\) is the scaled value
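The standardization formula can be checked on a small made-up vector, as in the sketch below. Note that R's `sd()` and `scale()` both use the sample standard deviation (denominator \(n - 1\)), so \(\sigma_x\) is interpreted that way here. The values in `x` are hypothetical.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values

# Manual z-scores: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() performs the same centring and scaling column by column
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_scale), sd = sd(z_scale)))
```

After scaling, the variable has mean 0 and standard deviation 1, up to floating-point rounding.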

After scaling, the Euclidean distance between two observations \(i\) and \(j\) is recalculated as:

\[ d_{ij} = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2} \]

where:
- \(p\) is the number of variables
- \(z_{ik}, z_{jk}\) are the scaled values of observations \(i\) and \(j\) for variable \(k\)

This process ensures that the distances between observations are not biased by the scale of the variables.
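The effect of scaling can be illustrated with a hedged toy example. The four units and their values below are made up. They are chosen so that the large-magnitude variable decides the nearest neighbour on the raw data, but not after standardization.

```r
# Hypothetical units: A and D have similar gdp but very different urbanization
toy <- data.frame(
  gdp          = c(500, 4000, 45000, 600),
  urbanization = c(25, 60, 85, 80),
  row.names    = c("A", "B", "C", "D")
)

d_raw    <- as.matrix(dist(toy))          # distances on the original values
z        <- scale(toy)                    # z-scores, column by column
d_scaled <- as.matrix(dist(z))            # distances on the scaled values

# The distance formula applied by hand to units A and D
d_AD <- sqrt(sum((z["A", ] - z["D", ])^2))
print(all.equal(d_AD, d_scaled["A", "D"]))

# Nearest neighbour of A before and after scaling
print(round(d_raw["A", c("B", "D")], 2))
print(round(d_scaled["A", c("B", "D")], 2))
```

On the raw values, D is the nearest neighbour of A because the difference in `gdp` is small and dwarfs everything else. After scaling, the difference in `urbanization` carries comparable weight and B becomes the nearest neighbour.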

For comparison, the minimum and maximum distances computed on the original (unscaled) variables are:

## Minimum distance: 5.033195 is registered between Tanzania and Tajikistan

## Maximum distance: 73619.37 is registered between Norway and Burundi

Table. Minimum and maximum distance between countries (unscaled variables)
method	type	Country_1	Country_2	distance
euclidean	min	Tanzania	Tajikistan	5.0332
euclidean	max	Norway	Burundi	73619.3750

The proximity matrix calculated using the scaled values of the variables

## Minimum distance: 0.00978173 is registered between Mozambique and Madagascar

## Maximum distance: 6.434425 is registered between Kuwait and Burundi

Table. Minimum and maximum distance between countries
method	type	Country_1	Country_2	distance
euclidean	min	Mozambique	Madagascar	0.0098
euclidean	max	Kuwait	Burundi	6.4344
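A summary table of this kind could be assembled as sketched below. The helper `extreme_pairs`, its column names, and the toy points are assumptions for illustration only, not the code used in the report.

```r
# Hypothetical helper: one row for the closest pair, one for the farthest pair
extreme_pairs <- function(d, method = "euclidean") {
  m <- as.matrix(d)
  m[upper.tri(m, diag = TRUE)] <- NA   # keep each pair once, drop self-distances
  lo <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  hi <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  data.frame(
    method   = method,
    type     = c("min", "max"),
    unit_1   = rownames(m)[c(lo[1], hi[1])],
    unit_2   = colnames(m)[c(lo[2], hi[2])],
    distance = c(m[lo[1], lo[2]], m[hi[1], hi[2]])
  )
}

# Toy points: A = (0, 0), B = (1, 0), C = (5, 5)
toy <- matrix(c(0, 0, 1, 0, 5, 5), ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("x", "y")))
tab <- extreme_pairs(dist(toy))
print(tab)
```

Masking the upper triangle and the diagonal means that each pair is considered once and that self-distances cannot be reported as the minimum.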

# Dendrogram for hc_complete (plotting is left commented out)
# plot(hc_complete, labels = pdata_country$country, main = "Dendrogram: hierarchical clustering",
#      ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage algorithm",
#      cex = 0.9,
#      cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
#      hang = -1)
# rect.hclust(hc_complete, k = 3, border = 2:4)  # split into 3 clusters

set.seed(123)
# Hierarchical clustering on the first 50 rows, using the scaled variables
subset_data <- pdata_country[1:50, ]
# `p` is only used by method = "minkowski", so it is omitted for the Euclidean distance
d <- dist(scale(subset_data[, 1:3]), method = "euclidean")
hc <- hclust(d, method = "complete")

# Dendrogram of the subset; `cex` may be passed only once
# plot(hc, labels = subset_data$country, main = "Dendrogram of a data subset",
#      ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage algorithm",
#      cex = 0.7,
#      cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
#      hang = -1)
# rect.hclust(hc, k = 2, border = c("red", "blue"))  # split into 2 clusters
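Cluster membership can also be read off the tree without plotting it. The sketch below applies `cutree()` to a made-up data set with two well-separated groups. The object names `toy`, `hc_toy`, and `groups` are hypothetical and independent of `pdata_country`.

```r
# Toy data: units A, B, C lie close together, as do D, E, F
toy <- data.frame(
  x = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
  y = c(1.0, 0.8, 1.1, 9.0, 9.2, 8.8),
  row.names = c("A", "B", "C", "D", "E", "F")
)

d_toy  <- dist(scale(toy), method = "euclidean")
hc_toy <- hclust(d_toy, method = "complete")

# Cut the tree into k = 2 groups; the result is a named integer vector
groups <- cutree(hc_toy, k = 2)
print(split(names(groups), groups))
```

Units separated by small distances end up with the same group label, while units separated by large distances receive different labels.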

Conclusions

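Small distances between countries indicate a high degree of similarity, so such countries tend to be grouped in the same cluster. On the scaled variables, the closest pair is Mozambique and Madagascar (0.0098).

Large distances between countries indicate low similarity, which is why such countries tend to be assigned to different clusters. On the scaled variables, the most distant pair is Kuwait and Burundi (6.4344).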

Bibliography

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Wickham H, Hester J, Bryan J (2025). readr: Read Rectangular Text Data. doi:10.32614/CRAN.package.readr https://doi.org/10.32614/CRAN.package.readr, R package version 2.1.6, https://CRAN.R-project.org/package=readr.

Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50, https://yihui.org/knitr/.

Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

de Vries A, Ripley BD (2024). ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. doi:10.32614/CRAN.package.ggdendro https://doi.org/10.32614/CRAN.package.ggdendro, R package version 0.2.0, https://CRAN.R-project.org/package=ggdendro.

How to Calculate Euclidean Distance in R?, https://www.geeksforgeeks.org/r-language/how-to-calculate-euclidean-distance-in-r/

Distance Matrix Computation, https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html

Zach Bobbitt, How to Calculate Euclidean Distance in R (With Examples), https://www.statology.org/euclidean-distance-in-r/

Analysis of similarity between countries based on Euclidean distance using socio-economic indicators

by Irimia Mihaela

2026-03-29
