## Rows: 210
## Columns: 7
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "American…
## $ `Population (millions)` <dbl> 40.09946, 2.81167, 44.17797, 0.04504, 0.07903,…
## $ Surface_area <dbl> 652.86, 28.75, 2381.74, 0.20, 0.47, 1246.70, 0…
## $ Population_density <dbl> 59.75228, 103.57113, 18.24366, 230.94500, 165.…
## $ Gross_national_income <dbl> 15.69540, 17.18400, 161.49500, NA, 3.55199, 58…
## $ Purchasing_power_parity <dbl> 67.3584, 43.8569, 524.0170, NA, NA, 206.2780, …
## $ Gross_domestic_product <dbl> -22.96530, 9.52603, 1.79842, 0.64838, 7.11048,…
A look at the first few rows of the data set shows that the variables are on very different scales, so we will need to scale the data before clustering. First, let us check for NAs or anything else that would require cleaning the data set.
## [1] 8
The data set contains 8 NAs in total. We will drop the rows that contain them (210 rows become 204), assuming this is not going to affect our results.
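As a minimal sketch of this step, the base R snippet below counts the missing cells and drops the incomplete rows. The `toy` data frame and its values are made up for illustration; only the column names echo the real data set, and the original analysis may have used different functions.

```r
# Toy stand-in for the real data; the values are illustrative only.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Gross_national_income = c(15.7, NA, 3.6, 58.9),
  Purchasing_power_parity = c(67.4, NA, NA, 206.3)
)

# Total number of missing cells across the whole data frame.
n_missing <- sum(is.na(toy))

# Keep only the rows that have no missing value in any column.
toy_clean <- toy[complete.cases(toy), ]

n_missing        # 3 missing cells
nrow(toy_clean)  # 2 complete rows remain
```

Note that the number of NAs and the number of dropped rows need not match, since a single row can hold several NAs.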
## Country Population (millions) Surface_area Population_density
## 1: Afghanistan 40.09946 652.86 59.75228
## 2: Albania 2.81167 28.75 103.57113
## 3: Algeria 44.17797 2381.74 18.24366
## 4: Angola 34.50377 1246.70 26.81358
## 5: Antigua and Barbuda 0.09322 0.44 210.60000
## ---
## 200: Sub-Saharan Africa 1181.16000 24328.35 48.18998
## 201: Low income 709.08850 16027.49 43.87508
## 202: Lower middle income 3398.19000 25638.20 135.14403
## 203: Upper middle income 2503.14000 54597.60 46.83663
## 204: High income 1240.63000 37488.82 35.20690
## Gross_national_income Purchasing_power_parity Gross_domestic_product
## 1: 15.69540 67.3584 -22.96530
## 2: 17.18400 43.8569 9.52603
## 3: 161.49500 524.0170 1.79842
## 4: 58.87880 206.2780 -2.05072
## 5: 1.47135 1.9365 4.64410
## ---
## 200: 1845.56000 4641.3900 1.56710
## 201: 532.35500 1468.6600 -0.15827
## 202: 8391.83000 26605.4000 4.31219
## 203: 25927.80000 49003.8000 7.06526
## 204: 59699.60000 68513.9000 5.25904
Now that our data set is free of NAs, we can continue with the clustering analysis.
Let's normalize our data. We will only normalize columns 5 to 7 (Gross_national_income, Purchasing_power_parity, and Gross_domestic_product).
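The normalized values shown in the output (for example, a Gross_domestic_product of exactly 0 for the smallest observation) look consistent with min-max scaling to the interval [0, 1]. A minimal sketch under that assumption follows; the helper `min_max` and the `toy` data frame are hypothetical, and the sketch assumes no NAs and no constant columns.

```r
# Hypothetical helper: rescale a numeric vector to [0, 1].
# Assumes x has no NAs and is not constant (max(x) > min(x)).
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy stand-in with the same column layout as the real data (values made up).
toy <- data.frame(
  Country = c("A", "B", "C"),
  Population = c(40.1, 2.8, 44.2),
  Surface_area = c(652.9, 28.8, 2381.7),
  Population_density = c(59.8, 103.6, 18.2),
  Gross_national_income = c(15.7, 17.2, 161.5),
  Purchasing_power_parity = c(67.4, 43.9, 524.0),
  Gross_domestic_product = c(-23.0, 9.5, 1.8)
)

# Normalize only columns 5 to 7; columns 1 to 4 are left untouched.
toy[5:7] <- lapply(toy[5:7], min_max)

range(toy$Gross_domestic_product)  # 0 1
```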
## Rows: 204
## Columns: 7
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola",…
## $ `Population (millions)` <dbl> 40.09946, 2.81167, 44.17797, 34.50377, 0.09322…
## $ Surface_area <dbl> 652.86, 28.75, 2381.74, 1246.70, 0.44, 2780.40…
## $ Population_density <dbl> 59.75228, 103.57113, 18.24366, 26.81358, 210.6…
## $ Gross_national_income <dbl> 1.646042e-04, 1.802964e-04, 1.701565e-03, 6.19…
## $ Purchasing_power_parity <dbl> 4.607352e-04, 2.997906e-04, 3.588057e-03, 1.41…
## $ Gross_domestic_product <dbl> 0.0000000, 0.5173576, 0.3943113, 0.3330217, 0.…
Now that our data is normalized, let us compute the distances between the observations and build the hierarchical clustering; we will also create a dendrogram. Let us first find the optimal number of clusters and validate it:
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 4 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
According to the majority rule, our optimal number of clusters is 2.
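To make the majority rule concrete, the sketch below tallies the votes reported in the output above and picks the number of clusters with the most votes. The vector `votes` is simply typed in from that output; it is not produced by the original analysis code.

```r
# Votes per candidate number of clusters, copied from the output above.
votes <- c("2" = 7, "3" = 5, "5" = 1, "6" = 2, "7" = 4, "9" = 2, "10" = 3)

# The winner is the candidate with the largest number of votes.
best_k <- as.integer(names(votes)[which.max(votes)])
best_k  # 2

# Share of the voting indices that proposed k = 2 (7 out of 24).
votes[["2"]] / sum(votes)
```

The winner here has the most votes but not more than half of them, so it is worth cross-checking the choice with other validation measures.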
Optimal number of partitions and method to be used
##
## Clustering Methods:
## hierarchical kmeans diana fanny pam clara agnes
##
## Cluster sizes:
## 2 3 4 5 6 7
##
## Validation Measures:
## 2 3 4 5 6 7
##
## hierarchical Connectivity 4.8758 6.0008 9.8587 15.1567 19.8044 22.1794
## Dunn 0.3421 0.3421 0.2582 0.1522 0.1760 0.1760
## Silhouette 0.8827 0.8309 0.7564 0.6769 0.6951 0.5718
## kmeans Connectivity 4.8758 8.6266 21.1524 17.4349 19.4567 28.3933
## Dunn 0.3421 0.1064 0.0042 0.0072 0.0078 0.0142
## Silhouette 0.8827 0.7464 0.5028 0.5225 0.5285 0.5136
## diana Connectivity 4.8758 8.7337 15.3917 19.4504 20.5754 25.8734
## Dunn 0.3421 0.2582 0.0124 0.0124 0.0197 0.0294
## Silhouette 0.8827 0.7666 0.3855 0.4581 0.4568 0.4498
## fanny Connectivity 14.1119 16.4496 35.2198 35.4758 52.3488 67.5194
## Dunn 0.0023 0.0035 0.0015 0.0011 0.0002 0.0007
## Silhouette 0.3253 0.2650 0.1961 0.2404 0.1655 0.1549
## pam Connectivity 14.8845 12.1190 20.7956 23.6385 27.4111 32.7944
## Dunn 0.0009 0.0038 0.0044 0.0088 0.0048 0.0048
## Silhouette 0.2921 0.3315 0.4249 0.4342 0.4246 0.4135
## clara Connectivity 25.7433 22.1448 16.3179 28.7151 41.1179 35.1266
## Dunn 0.0026 0.0038 0.0051 0.0014 0.0027 0.0025
## Silhouette 0.3872 0.3317 0.4210 0.3258 0.3708 0.4125
## agnes Connectivity 4.8758 6.0008 9.8587 15.1567 19.8044 22.1794
## Dunn 0.3421 0.3421 0.2582 0.1522 0.1760 0.1760
## Silhouette 0.8827 0.8309 0.7564 0.6769 0.6951 0.5718
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 4.8758 hierarchical 2
## Dunn 0.3421 hierarchical 2
## Silhouette 0.8827 hierarchical 2
The clValid function indicates that the best method to use is hierarchical clustering and that the best number of clusters is \(k = 2\).
We decided to use a phylogenetic-type tree because it is easier to visualize than other dendrograms. This algorithm groups the observations that are closest to each other. Now we will display two other phylogenetic trees, using the complete and the average methods.
With the complete method, we can see that the observations are more spread out and the dominant color is blue; with the average method, we see almost the same tree, flipped and with its colors inverted.
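As a rough sketch of how the two linkage methods can be compared, the base R snippet below clusters a small synthetic data set with both and cuts each tree into two groups. The data are made up, and the plotting functions used for the phylogenetic trees in this report are not reproduced here.

```r
set.seed(1)

# Two well-separated synthetic groups of 10 points each (illustrative only).
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2)
)

# Euclidean distance matrix between all pairs of observations.
d <- dist(toy, method = "euclidean")

# Agglomerative clustering with two different linkage rules.
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Cut each tree into k = 2 groups and cross-tabulate the assignments.
groups_complete <- cutree(hc_complete, k = 2)
groups_average  <- cutree(hc_average, k = 2)
table(groups_complete, groups_average)

# plot(hc_complete) would draw a standard dendrogram for this tree.
```

Cluster labels, and therefore plot colors, are arbitrary, so an inverted coloring between two trees may reflect relabeling rather than a different grouping.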
We will keep using \(k = 2\).
From the graph above we can see that there are few countries in the red cluster. We can also notice that a higher Gross National Income tends to go with a higher Purchasing Power Parity (except for three observations around 2500 on the x-axis).
Now we will use k-medians:
Here we can see that having a greater population doesn't necessarily mean that the national income will be greater. The clusters are driven by differences in national income: greater incomes are colored blue and lower incomes are in the red cluster. We can also notice that only a few countries, 9 in total, have a higher income than the rest (blue cluster).
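A partition like this one can be sketched in base R. Base R has no built-in k-medians function, so the snippet below uses `kmeans` as a stand-in, which is an assumption and may differ from the method actually used for the plot. The `income` vector is made up for illustration.

```r
set.seed(42)

# Toy incomes: five low values and three clearly higher ones (illustrative only).
income <- c(10, 12, 15, 20, 25, 900, 950, 1000)

# k-means with k = 2; several random starts guard against a poor initialization.
fit <- kmeans(income, centers = 2, nstart = 10)

# Cluster numbers are arbitrary, so find the high-income cluster by its center.
high_id <- which.max(fit$centers)

# Number of observations assigned to the high-income cluster.
n_high <- sum(fit$cluster == high_id)
n_high  # 3
```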