Objective
The objective of this analysis is to assess the degree of similarity between the analyzed units (countries) by constructing a proximity matrix based on the Euclidean distance, using variables such as GDP, level of urbanization, and CO₂ emissions. The calculation of distances allows the identification of relationships of similarity or dissimilarity between observations, providing the foundation for highlighting the structure of the data and for the subsequent application of classification methods.
Finding the smallest distance between countries
## The smallest distance is: 2.552263
The countries with the smallest distance are:
## is registered between Mongolia and Armenia
The countries with the smallest distance are Mongolia and Armenia. Minimum distance value: 2.5523
An excerpt of the proximity matrix:
| | Afghanistan | Albania | Algeria | Andorra | Angola | Antigua and Barbuda | Argentina |
|---|---|---|---|---|---|---|---|
| Afghanistan | 0.000 | 3087.0913 | 3892.1637 | 37615.567 | 2269.2984 | 16403.735 | 11827.601 |
| Albania | 3087.091 | 0.0000 | 805.0987 | 34528.575 | 817.9743 | 13316.805 | 8740.539 |
| Algeria | 3892.164 | 805.0987 | 0.0000 | 33723.601 | 1622.9236 | 12511.885 | 7935.529 |
| Andorra | 37615.567 | 34528.5746 | 33723.6010 | 0.000 | 35346.5112 | 21211.868 | 25788.099 |
| Angola | 2269.298 | 817.9743 | 1622.9236 | 35346.511 | 0.0000 | 14134.762 | 9558.450 |
| Antigua and Barbuda | 16403.735 | 13316.8047 | 12511.8847 | 21211.868 | 14134.7616 | 0.000 | 4576.768 |
| Argentina | 11827.601 | 8740.5392 | 7935.5291 | 25788.099 | 9558.4503 | 4576.768 | 0.000 |
| Armenia | 2553.015 | 534.3506 | 1339.2147 | 35062.811 | 283.7184 | 13851.074 | 9274.744 |
| Aruba | 28782.294 | 25695.3337 | 24890.3821 | 8833.339 | 26513.2806 | 12378.565 | 16954.940 |
| Australia | 53545.559 | 50458.5790 | 49653.6103 | 15930.014 | 51276.5184 | 37141.835 | 41718.112 |
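A matrix of this shape can be obtained in base R with `dist()` followed by `as.matrix()`. The sketch below is illustrative only: the data frame `toy`, its column names (`gdp`, `urban`, `co2`) and all values are invented and are not the data used in this report.

```r
# Toy data: values are invented for illustration only
toy <- data.frame(
  gdp   = c(500, 5000, 40000),
  urban = c(25, 60, 88),
  co2   = c(0.3, 1.8, 6.0),
  row.names = c("A", "B", "C")
)

# Pairwise Euclidean distances on the raw (unscaled) variables
d_raw <- dist(toy, method = "euclidean")

# Convert the 'dist' object into a full symmetric matrix with a zero diagonal
prox <- as.matrix(d_raw)
print(round(prox, 3))
```

Because `dist()` stores only the lower triangle, `as.matrix()` is what makes the result readable as a country-by-country table.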
The maximum distance value:
## The highest value is: 73619.37
## and is registered between Norway and Burundi
The countries with the largest distance are Norway and Burundi. Maximum distance value: 73619.37.
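One possible way to locate the closest and the farthest pair in a distance object is sketched below. The helper `extreme_pairs()` and its output column names are hypothetical, and the example data are invented; the report's own code may do this differently.

```r
# Hypothetical helper: find the closest and farthest pair in a 'dist' object
extreme_pairs <- function(d) {
  m <- as.matrix(d)
  # Keep only the upper triangle so each pair is counted once
  # and the zero diagonal is ignored
  m[lower.tri(m, diag = TRUE)] <- NA
  lo <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  hi <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  data.frame(
    type     = c("min", "max"),
    unit_1   = rownames(m)[c(lo["row"], hi["row"])],
    unit_2   = colnames(m)[c(lo["col"], hi["col"])],
    distance = c(m[lo["row"], lo["col"]], m[hi["row"], hi["col"]])
  )
}

# Toy one-dimensional example with invented values
x <- matrix(c(0, 1, 5, 20), ncol = 1,
            dimnames = list(c("A", "B", "C", "D"), "v"))
res <- extreme_pairs(dist(x))
print(res)
```

If several pairs tie for the minimum or maximum, this sketch reports only the first one found.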
To compare observations properly, we first scale the numeric variables:
\[ z_i = \frac{x_i - \bar{x}}{\sigma_x} \]
where:
- \(x_i\) is the original value of observation \(i\)
- \(\bar{x}\) is the mean of the variable
- \(\sigma_x\) is the standard deviation of the variable
- \(z_i\) is the scaled value
After scaling, the Euclidean distance between two observations \(i\) and \(j\) is recalculated as:
\[ d_{ij} = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2} \]
where:
- \(p\) is the number of variables
- \(z_{ik}\) and \(z_{jk}\) are the scaled values of observations \(i\) and \(j\) for variable \(k\)
This process ensures that the distances between observations are not biased by the scale of the variables.
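The two formulas can be checked numerically with a small sketch. The data frame `toy`, its column names and its values are invented for illustration. Note that base R's `scale()` divides by the sample standard deviation (denominator \(n - 1\)).

```r
# Toy data with invented values; column names are hypothetical
toy <- data.frame(
  gdp   = c(500, 5000, 40000, 12000),
  urban = c(25, 60, 88, 70),
  co2   = c(0.3, 1.8, 6.0, 4.1),
  row.names = c("A", "B", "C", "D")
)

# z-scores: subtract the column mean, divide by the sample standard deviation
z <- scale(toy)

# Distance between A and B computed directly from the formula
d_manual <- sqrt(sum((z["A", ] - z["B", ])^2))

# The same distance taken from dist() on the scaled data
d_scaled <- as.matrix(dist(z))
print(c(manual = d_manual, dist = d_scaled["A", "B"]))

# On raw data the distance is dominated by the large-valued variable (gdp):
# the result is close to |500 - 5000| = 4500
d_raw <- as.matrix(dist(toy))
print(d_raw["A", "B"])
```

In this toy case the raw distance between A and B is almost entirely the GDP difference, which is the bias that scaling removes.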
The proximity matrix calculated using the original (unscaled) values of the variables:
## Minimum distance: 5.033195 is registered between Tanzania and Tajikistan
## Maximum distance: 73619.37 is registered between Norway and Burundi
| method | type | Country_1 | Country_2 | distance |
|---|---|---|---|---|
| euclidean | min | Tanzania | Tajikistan | 5.0332 |
| euclidean | max | Norway | Burundi | 73619.3750 |
The proximity matrix calculated using the scaled values of the variables:
## Minimum distance: 0.00978173 is registered between Mozambique and Madagascar
## Maximum distance: 6.434425 is registered between Kuwait and Burundi
| method | type | Country_1 | Country_2 | distance |
|---|---|---|---|---|
| euclidean | min | Mozambique | Madagascar | 0.0098 |
| euclidean | max | Kuwait | Burundi | 6.4344 |
# plot(hc_complete, labels = pdata_country$country, main = "Dendrogram: hierarchical clustering",
# ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage",
# cex = 0.9,
# cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
# hang = -1)
# rect.hclust(hc_complete, k = 3, border = 2:4) # Split into 3 clusters
set.seed(123)
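# Subset: first 50 rows of the data
subset_data <- pdata_country[1:50, ]
# Standardize columns 1 to 3, then compute Euclidean distances
d <- dist(scale(subset_data[, 1:3]), method = "euclidean")
# Complete-linkage hierarchical clustering
hc <- hclust(d, method = "complete")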
# plot(hc, labels = subset_data$country, cex = 0.7, main = "Dendrogram of a data subset",
# ylab = "Tree height", xlab = "Euclidean distance", sub = "Complete linkage",
# cex.sub = 1.4, cex.main = 1.4, cex.lab = 1.4, font.lab = 2, font.sub = 2,
# hang = -1)
# rect.hclust(hc, k = 2, border = c("red", "blue"))
Small distances between countries indicate high similarity, so such countries tend to be placed in the same cluster.
Large distances between countries indicate low similarity, which is why such countries tend to be placed in different clusters.
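The link between distance and cluster membership can be illustrated with `cutree()`, which cuts a hierarchical tree into a chosen number of groups. This is a minimal sketch on invented data (the object names `toy`, `hc_toy` and `groups` are hypothetical), not the clustering of the countries analyzed here.

```r
# Toy data: two visibly separated groups (values invented for illustration)
toy <- data.frame(
  x = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
  y = c(2.0, 2.1, 1.8, 9.0, 9.2, 8.8),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Complete-linkage hierarchical clustering on scaled Euclidean distances
hc_toy <- hclust(dist(scale(toy)), method = "complete")

# Cut the tree into two clusters; the result is a named integer vector
groups <- cutree(hc_toy, k = 2)
print(groups)
```

Observations A, B and C lie close to each other and end up in one group, while D, E and F form the other.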
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Wickham H, Hester J, Bryan J (2025). readr: Read Rectangular Text Data. doi:10.32614/CRAN.package.readr https://doi.org/10.32614/CRAN.package.readr, R package version 2.1.6, https://CRAN.R-project.org/package=readr.
Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50, https://yihui.org/knitr/.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.
de Vries A, Ripley BD (2024). ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. doi:10.32614/CRAN.package.ggdendro https://doi.org/10.32614/CRAN.package.ggdendro, R package version 0.2.0, https://CRAN.R-project.org/package=ggdendro.
How to Calculate Euclidean Distance in R?, https://www.geeksforgeeks.org/r-language/how-to-calculate-euclidean-distance-in-r/
Distance Matrix Computation, https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html
Zach Bobbitt, How to Calculate Euclidean Distance in R (With Examples), https://www.statology.org/euclidean-distance-in-r/