Objective

The objective of this analysis is to assess the degree of similarity between the analyzed units (countries) by constructing a proximity matrix based on the Euclidean distance, using variables such as GDP, level of urbanization, and CO₂ emissions. The calculation of distances allows the identification of relationships of similarity or dissimilarity between observations, providing the foundation for highlighting the structure of the data and for the subsequent application of classification methods.

Finding the smallest distance between countries

## The smallest distance is: 2.552263

The countries with the smallest distance are:

## is registered between Mongolia and Armenia

The countries with the smallest distance are Mongolia and Armenia. Minimum distance value: 2.5523

The proximity matrix is:

Table. The proximity matrix (first 10 rows, 7 columns)

| | Afghanistan | Albania | Algeria | Andorra | Angola | Antigua and Barbuda | Argentina |
|---|---|---|---|---|---|---|---|
| Afghanistan | 0.000 | 3087.0913 | 3892.1637 | 37615.567 | 2269.2984 | 16403.735 | 11827.601 |
| Albania | 3087.091 | 0.0000 | 805.0987 | 34528.575 | 817.9743 | 13316.805 | 8740.539 |
| Algeria | 3892.164 | 805.0987 | 0.0000 | 33723.601 | 1622.9236 | 12511.885 | 7935.529 |
| Andorra | 37615.567 | 34528.5746 | 33723.6010 | 0.000 | 35346.5112 | 21211.868 | 25788.099 |
| Angola | 2269.298 | 817.9743 | 1622.9236 | 35346.511 | 0.0000 | 14134.762 | 9558.450 |
| Antigua and Barbuda | 16403.735 | 13316.8047 | 12511.8847 | 21211.868 | 14134.7616 | 0.000 | 4576.768 |
| Argentina | 11827.601 | 8740.5392 | 7935.5291 | 25788.099 | 9558.4503 | 4576.768 | 0.000 |
| Armenia | 2553.015 | 534.3506 | 1339.2147 | 35062.811 | 283.7184 | 13851.074 | 9274.744 |
| Aruba | 28782.294 | 25695.3337 | 24890.3821 | 8833.339 | 26513.2806 | 12378.565 | 16954.940 |
| Australia | 53545.559 | 50458.5790 | 49653.6103 | 15930.014 | 51276.5184 | 37141.835 | 41718.112 |
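
As a minimal sketch of how a proximity matrix of this kind might be built with `dist()` from base R, consider the example below. The data frame `toy`, its column names, and its values are invented for illustration and are not the data used in this report.

```r
# Hypothetical toy data: three indicators for four made-up units.
toy <- data.frame(
  gdp   = c(500, 4500, 4000, 42000),
  urban = c(25, 60, 70, 88),
  co2   = c(0.2, 1.6, 3.5, 6.0),
  row.names = c("A", "B", "C", "D")
)

# dist() returns the pairwise Euclidean distances (lower triangle only).
d <- dist(toy, method = "euclidean")

# as.matrix() expands them into a full symmetric matrix with a zero diagonal.
prox <- as.matrix(d)
round(prox, 3)
```

Each entry `prox[i, j]` is the distance between units `i` and `j`, so the matrix is symmetric and every unit is at distance zero from itself.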

The maximum distance value:

## The highest value is:  73619.37
## and is registered between Norway and Burundi

The countries with the largest distance are Norway and Burundi. Maximum distance value: 73619.37.
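
The closest and the most distant pair can be located directly in the distance matrix. The sketch below shows one possible way to do it; the data frame `toy` and the output column names (`country_1`, `country_2`, `distance`) are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data (not the report's data set).
toy <- data.frame(
  x = c(0, 1, 5, 9),
  y = c(0, 1, 5, 2),
  row.names = c("A", "B", "C", "D")
)
prox <- as.matrix(dist(toy))

# Keep each pair once: blank out the diagonal and the upper triangle.
prox[upper.tri(prox, diag = TRUE)] <- NA

# Row/column positions of the smallest and largest remaining entries.
i_min <- which(prox == min(prox, na.rm = TRUE), arr.ind = TRUE)[1, ]
i_max <- which(prox == max(prox, na.rm = TRUE), arr.ind = TRUE)[1, ]

extremes <- data.frame(
  type      = c("min", "max"),
  country_1 = rownames(prox)[c(i_min["row"], i_max["row"])],
  country_2 = colnames(prox)[c(i_min["col"], i_max["col"])],
  distance  = c(min(prox, na.rm = TRUE), max(prox, na.rm = TRUE))
)
extremes
```

Masking the diagonal matters for the minimum: without it, the zero distance of each unit to itself would always be reported as the smallest value.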

Scaling the variables

To compare observations properly, we first scale the numeric variables:

\[ z_i = \frac{x_i - \bar{x}}{\sigma_x} \]

where:
- \(x_i\) is the original value of observation \(i\)
- \(\bar{x}\) is the mean of the variable
- \(\sigma_x\) is the standard deviation of the variable
- \(z_i\) is the scaled value
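
The scaling formula can be checked on a small example. The sketch below uses a hypothetical vector `x` and compares the manual computation with base R's `scale()`. Note that both `sd()` and `scale()` use the sample standard deviation (denominator \(n - 1\)) in the role of \(\sigma_x\).

```r
# Hypothetical values of one variable for five units.
x <- c(2, 4, 4, 6, 9)

# Manual z-scores; sd() is the sample standard deviation (n - 1 denominator).
z_manual <- (x - mean(x)) / sd(x)

# scale() centres the variable and divides by the same standard deviation.
z_scale <- as.numeric(scale(x))

# The two computations agree.
all.equal(z_manual, z_scale)

# After scaling, the variable has mean 0 and standard deviation 1.
round(c(mean = mean(z_manual), sd = sd(z_manual)), 10)
```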

After scaling, the Euclidean distance between two observations \(i\) and \(j\) is recalculated as:

\[ d_{ij} = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2} \]

where:
- \(p\) is the number of variables
- \(z_{ik}, z_{jk}\) are the scaled values of observations \(i\) and \(j\) for variable \(k\)

This process ensures that the distances between observations are not biased by the scale of the variables.
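
To illustrate the effect of scaling on distances, the sketch below uses a hypothetical data frame `toy` with one large-scale and one small-scale variable. It is an illustration under invented values, not a reproduction of the report's computation.

```r
# Hypothetical toy data: one large-scale and one small-scale variable.
toy <- data.frame(
  gdp   = c(1000, 1100, 30000),
  urban = c(20, 90, 25),
  row.names = c("A", "B", "C")
)

z <- scale(toy)  # column-wise z-scores

# Distance between A and B computed from the formula on the scaled values.
d_manual <- sqrt(sum((z["A", ] - z["B", ])^2))

# The same quantity taken from dist().
d_scaled <- as.matrix(dist(z))
all.equal(d_manual, d_scaled["A", "B"])

# Raw distances for comparison: gdp dominates because of its larger range.
d_raw <- as.matrix(dist(toy))
round(d_raw, 2)
round(d_scaled, 2)
```

In the raw data the A-C distance is hundreds of times larger than the A-B distance, because only `gdp` matters. After scaling, the large difference in `urban` between A and B counts as well, and the two distances are of comparable size.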

The minimum and maximum distances computed from the unscaled variables are summarized below:

## Minimum distance: 5.033195 is registered between Tanzania and Tajikistan
## Maximum distance: 73619.37 is registered between Norway and Burundi

Table. Minimum and maximum distance between countries (unscaled variables)

| method | type | Country_1 | Country_2 | distance |
|---|---|---|---|---|
| euclidean | min | Tanzania | Tajikistan | 5.0332 |
| euclidean | max | Norway | Burundi | 73619.3750 |

The proximity matrix calculated using the scaled values of the variables gives:

## Minimum distance: 0.00978173 is registered between Mozambique and Madagascar
## Maximum distance: 6.434425 is registered between Kuwait and Burundi

Table. Minimum and maximum distance between countries (scaled variables)

| method | type | Country_1 | Country_2 | distance |
|---|---|---|---|---|
| euclidean | min | Mozambique | Madagascar | 0.0098 |
| euclidean | max | Kuwait | Burundi | 6.4344 |
# plot(hc_complete, labels = pdata_country$country, main = "Dendrogram: hierarchical clustering",
#      ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage",
#      cex = 0.9,
#      cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
#      hang = -1)
# rect.hclust(hc_complete, k = 3, border = 2:4)  # Split into 3 clusters

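set.seed(123)  # hclust() is deterministic; the seed only matters if random steps are added
# Work on the first 50 countries only
subset_data <- pdata_country[1:50, ]
# Standardize columns 1 to 3, then compute pairwise Euclidean distances
d <- dist(scale(subset_data[, 1:3]), method = "euclidean")
# Complete linkage: the distance between clusters is that of their farthest members
hc <- hclust(d, method = "complete")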

# plot(hc, labels = subset_data$country, main = "Dendrogram of a data subset",
#      ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage",
#      cex = 0.7,
#      cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
#      hang = -1)
# rect.hclust(hc, k = 2, border = c("red", "blue"))
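
Besides drawing the dendrogram, cluster membership can be extracted with `cutree()`. The sketch below shows this on hypothetical toy data; the data frame `toy` and the choice `k = 2` are assumptions made for illustration only.

```r
# Hypothetical toy data forming two visibly separated groups.
toy <- data.frame(
  v1 = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
  v2 = c(2.0, 2.1, 1.8, 9.0, 9.2, 8.8),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Same pipeline as in the report: scale, Euclidean distance, complete linkage.
d  <- dist(scale(toy), method = "euclidean")
hc <- hclust(d, method = "complete")

# cutree() cuts the tree into k groups and returns one label per unit.
groups <- cutree(hc, k = 2)
groups

# Number of units in each cluster.
table(groups)
```

Units separated by small distances (A, B, C and D, E, F here) receive the same label, which is the relationship the dendrogram shows graphically.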

Conclusions

Small distances between countries indicate high similarity, so such countries tend to be placed in the same cluster.

Large distances between countries indicate low similarity, so such countries tend to be placed in different clusters.

Bibliography

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Wickham H, Hester J, Bryan J (2025). readr: Read Rectangular Text Data. doi:10.32614/CRAN.package.readr https://doi.org/10.32614/CRAN.package.readr, R package version 2.1.6, https://CRAN.R-project.org/package=readr.

Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

de Vries A, Ripley BD (2024). ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. doi:10.32614/CRAN.package.ggdendro https://doi.org/10.32614/CRAN.package.ggdendro, R package version 0.2.0, https://CRAN.R-project.org/package=ggdendro.

How to Calculate Euclidean Distance in R?, https://www.geeksforgeeks.org/r-language/how-to-calculate-euclidean-distance-in-r/

Distance Matrix Computation, https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html

Zach Bobbitt, How to Calculate Euclidean Distance in R (With Examples), https://www.statology.org/euclidean-distance-in-r/