Exam 2

## Rows: 210
## Columns: 7
## $ Country                 <chr> "Afghanistan", "Albania", "Algeria", "American…
## $ `Population (millions)` <dbl> 40.09946, 2.81167, 44.17797, 0.04504, 0.07903,…
## $ Surface_area            <dbl> 652.86, 28.75, 2381.74, 0.20, 0.47, 1246.70, 0…
## $ Population_density      <dbl> 59.75228, 103.57113, 18.24366, 230.94500, 165.…
## $ Gross_national_income   <dbl> 15.69540, 17.18400, 161.49500, NA, 3.55199, 58…
## $ Purchasing_power_parity <dbl> 67.3584, 43.8569, 524.0170, NA, NA, 206.2780, …
## $ Gross_domestic_product  <dbl> -22.96530, 9.52603, 1.79842, 0.64838, 7.11048,…

A glimpse at the first few rows shows that the variables are on very different scales, so we will need to scale the data before clustering. First, let us check for NAs or anything else that requires cleaning.

## [1] 8

The data set contains 8 NAs in total. We will remove the rows that contain them, which reduces the data set from 210 to 204 rows, assuming this does not affect our results.
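
The number of missing cells and the number of rows removed are different quantities: the outputs report 8 NAs but a drop from 210 to 204 rows, so some rows must hold more than one NA. The following is a minimal Python sketch of that bookkeeping, not the R code used for this report. It uses the first five values of Gross_national_income and Purchasing_power_parity from the listing at the top, with `None` standing in for NA; the helper names `is_missing`, `count_missing`, and `drop_incomplete` are hypothetical.

```python
import math

# First five values of two columns from the listing; None stands in for NA.
gross_national_income = [15.69540, 17.18400, 161.49500, None, 3.55199]
purchasing_power_parity = [67.3584, 43.8569, 524.0170, None, None]

# One tuple per row: (Gross_national_income, Purchasing_power_parity).
rows = list(zip(gross_national_income, purchasing_power_parity))


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Count missing cells across all rows."""
    return sum(is_missing(value) for row in rows for value in row)


def drop_incomplete(rows):
    """Keep only the rows that have no missing cell."""
    return [row for row in rows if not any(is_missing(value) for value in row)]


missing_cells = count_missing(rows)
complete_rows = drop_incomplete(rows)
print(missing_cells, len(rows) - len(complete_rows))
```

In this sample, three cells are missing but only two rows are dropped, which is the same pattern as the 8 NAs and 6 removed rows in the full data set.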

##                  Country Population (millions) Surface_area Population_density
##   1:         Afghanistan              40.09946       652.86           59.75228
##   2:             Albania               2.81167        28.75          103.57113
##   3:             Algeria              44.17797      2381.74           18.24366
##   4:              Angola              34.50377      1246.70           26.81358
##   5: Antigua and Barbuda               0.09322         0.44          210.60000
##  ---                                                                          
## 200:  Sub-Saharan Africa            1181.16000     24328.35           48.18998
## 201:          Low income             709.08850     16027.49           43.87508
## 202: Lower middle income            3398.19000     25638.20          135.14403
## 203: Upper middle income            2503.14000     54597.60           46.83663
## 204:         High income            1240.63000     37488.82           35.20690
##      Gross_national_income Purchasing_power_parity Gross_domestic_product
##   1:              15.69540                 67.3584              -22.96530
##   2:              17.18400                 43.8569                9.52603
##   3:             161.49500                524.0170                1.79842
##   4:              58.87880                206.2780               -2.05072
##   5:               1.47135                  1.9365                4.64410
##  ---                                                                     
## 200:            1845.56000               4641.3900                1.56710
## 201:             532.35500               1468.6600               -0.15827
## 202:            8391.83000              26605.4000                4.31219
## 203:           25927.80000              49003.8000                7.06526
## 204:           59699.60000              68513.9000                5.25904

Now that our data set is free of NAs, we will continue with the clustering analysis.

Hierarchical Clustering

Let’s normalize our data. We will only normalize fields 5 to 7 (Gross_national_income, Purchasing_power_parity, and Gross_domestic_product).

## Rows: 204
## Columns: 7
## $ Country                 <chr> "Afghanistan", "Albania", "Algeria", "Angola",…
## $ `Population (millions)` <dbl> 40.09946, 2.81167, 44.17797, 34.50377, 0.09322…
## $ Surface_area            <dbl> 652.86, 28.75, 2381.74, 1246.70, 0.44, 2780.40…
## $ Population_density      <dbl> 59.75228, 103.57113, 18.24366, 26.81358, 210.6…
## $ Gross_national_income   <dbl> 1.646042e-04, 1.802964e-04, 1.701565e-03, 6.19…
## $ Purchasing_power_parity <dbl> 4.607352e-04, 2.997906e-04, 3.588057e-03, 1.41…
## $ Gross_domestic_product  <dbl> 0.0000000, 0.5173576, 0.3943113, 0.3330217, 0.…
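
The normalized output is consistent with min-max scaling, \(x' = (x - \min x) / (\max x - \min x)\): Afghanistan has a negative Gross_domestic_product value and maps to exactly 0, as the column minimum would. A minimal Python sketch of that rescaling follows, assuming min-max scaling is what was applied; it is not the R code used for this report. `min_max` is a hypothetical helper, and because only four sample values are scaled here, the results differ from the full-column output above.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


# First four Gross_domestic_product values after NA removal (from the output).
gdp_sample = [-22.96530, 9.52603, 1.79842, -2.05072]
gdp_scaled = min_max(gdp_sample)
print(gdp_scaled)
```

Min-max scaling keeps the ordering of the values but is sensitive to extreme observations: a single very large value compresses everything else toward 0, which matches the very small normalized Gross_national_income values in the output.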

Now that our data is normalized, let us compute the distances between observations and build the hierarchical clustering; we will also create a dendrogram. First, let us find the optimal number of clusters and validate it:

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 4 proposed 7 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

According to the majority rule, our optimal number of clusters is 2.
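
The majority rule is a simple vote count over the indices. The short Python sketch below tallies the votes exactly as listed in the output above; the variable names are illustrative only.

```python
# Number of indices that proposed each cluster count, copied from the output.
votes = {2: 7, 3: 5, 5: 1, 6: 2, 7: 4, 9: 2, 10: 3}

# The winner is the cluster count proposed by the most indices.
best_k = max(votes, key=votes.get)
total_votes = sum(votes.values())
share = votes[best_k] / total_votes

print(best_k, votes[best_k], total_votes)
```

Note that 2 wins with 7 of the 24 votes, so it is the most frequent proposal rather than an absolute majority; 3 clusters follows with 5 votes.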

Optimal number of partitions and method to be used

## 
## Clustering Methods:
##  hierarchical kmeans diana fanny pam clara agnes 
## 
## Cluster sizes:
##  2 3 4 5 6 7 
## 
## Validation Measures:
##                                  2       3       4       5       6       7
##                                                                           
## hierarchical Connectivity   4.8758  6.0008  9.8587 15.1567 19.8044 22.1794
##              Dunn           0.3421  0.3421  0.2582  0.1522  0.1760  0.1760
##              Silhouette     0.8827  0.8309  0.7564  0.6769  0.6951  0.5718
## kmeans       Connectivity   4.8758  8.6266 21.1524 17.4349 19.4567 28.3933
##              Dunn           0.3421  0.1064  0.0042  0.0072  0.0078  0.0142
##              Silhouette     0.8827  0.7464  0.5028  0.5225  0.5285  0.5136
## diana        Connectivity   4.8758  8.7337 15.3917 19.4504 20.5754 25.8734
##              Dunn           0.3421  0.2582  0.0124  0.0124  0.0197  0.0294
##              Silhouette     0.8827  0.7666  0.3855  0.4581  0.4568  0.4498
## fanny        Connectivity  14.1119 16.4496 35.2198 35.4758 52.3488 67.5194
##              Dunn           0.0023  0.0035  0.0015  0.0011  0.0002  0.0007
##              Silhouette     0.3253  0.2650  0.1961  0.2404  0.1655  0.1549
## pam          Connectivity  14.8845 12.1190 20.7956 23.6385 27.4111 32.7944
##              Dunn           0.0009  0.0038  0.0044  0.0088  0.0048  0.0048
##              Silhouette     0.2921  0.3315  0.4249  0.4342  0.4246  0.4135
## clara        Connectivity  25.7433 22.1448 16.3179 28.7151 41.1179 35.1266
##              Dunn           0.0026  0.0038  0.0051  0.0014  0.0027  0.0025
##              Silhouette     0.3872  0.3317  0.4210  0.3258  0.3708  0.4125
## agnes        Connectivity   4.8758  6.0008  9.8587 15.1567 19.8044 22.1794
##              Dunn           0.3421  0.3421  0.2582  0.1522  0.1760  0.1760
##              Silhouette     0.8827  0.8309  0.7564  0.6769  0.6951  0.5718
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 4.8758 hierarchical 2       
## Dunn         0.3421 hierarchical 2       
## Silhouette   0.8827 hierarchical 2

The clValid function indicates that the best method to use is hierarchical clustering and that the best number of clusters is \(k = 2\).
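
The optimal scores follow from how each measure is read: connectivity should be as small as possible, while the Dunn index and the silhouette width should be as large as possible. The Python sketch below applies that rule to part of the validation table (three methods, \(k = 2\) and \(k = 3\)). The `best` helper is hypothetical, and keeping the first method listed when scores tie is an assumption of this sketch.

```python
# Scores copied from the validation table: measure -> method -> {k: score}.
scores = {
    "Connectivity": {
        "hierarchical": {2: 4.8758, 3: 6.0008},
        "kmeans": {2: 4.8758, 3: 8.6266},
        "pam": {2: 14.8845, 3: 12.1190},
    },
    "Dunn": {
        "hierarchical": {2: 0.3421, 3: 0.3421},
        "kmeans": {2: 0.3421, 3: 0.1064},
        "pam": {2: 0.0009, 3: 0.0038},
    },
    "Silhouette": {
        "hierarchical": {2: 0.8827, 3: 0.8309},
        "kmeans": {2: 0.8827, 3: 0.7464},
        "pam": {2: 0.2921, 3: 0.3315},
    },
}

# Connectivity is minimized; Dunn and Silhouette are maximized.
MINIMIZED = {"Connectivity"}


def best(measure):
    """Return (score, method, k) with the best score for the given measure."""
    candidates = [
        (score, method, k)
        for method, by_k in scores[measure].items()
        for k, score in by_k.items()
    ]
    pick = min if measure in MINIMIZED else max
    # min and max return the first candidate encountered when scores tie.
    return pick(candidates, key=lambda candidate: candidate[0])


optimal = {measure: best(measure) for measure in scores}
print(optimal)
```

In the table, hierarchical, kmeans, diana, and agnes all share the same three scores at \(k = 2\), so these methods validate equally well there.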

We chose a phylogenetic-type tree because it is easier to read than other dendrogram layouts. The algorithm groups the observations that are closest to each other. Now we will display two other phylogenetic trees, built with the complete and the average linkage methods.

With the complete method, the observations are more spread out and the dominant color is blue; with the average method, we get almost the same tree, flipped and with its colors inverted.
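
The two methods differ only in how the distance between two clusters is measured: complete linkage uses the largest pairwise distance, average linkage the mean pairwise distance. The following is a minimal Python sketch of agglomerative clustering under both rules, not the R code behind the trees. The points are made-up values on a 0 to 1 scale, and `linkage_distance` and `agglomerate` are hypothetical helpers.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def linkage_distance(a, b, method):
    """Distance between clusters a and b under the given linkage method."""
    pairwise = [dist(p, q) for p, q in product(a, b)]
    if method == "complete":
        return max(pairwise)   # farthest pair of points
    if method == "average":
        return mean(pairwise)  # mean over all pairs of points
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, k, method):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # combinations yields index pairs with i < j.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical observations: three close together, two far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]
complete_clusters = agglomerate(points, 2, "complete")
average_clusters = agglomerate(points, 2, "average")
print(complete_clusters)
print(average_clusters)
```

On well-separated data like this, both methods recover the same two groups; they tend to differ when clusters are elongated or contain outliers, because complete linkage penalizes the farthest pair.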

Partitioning methods

We will keep using \(k = 2\).

k-means

From the graph above we can see that the red cluster contains only a few countries. We can also see that a higher Gross National Income goes with a higher Purchasing Power Parity (except for three observations around 2500 on the x-axis).
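
For reference, k-means alternates between assigning each observation to its nearest center and moving each center to the mean of its cluster. Below is a minimal Python sketch of that loop with \(k = 2\), not the R call used for the plot. The points and starting centers are made up, and `kmeans` is a hypothetical helper that takes the starting centers explicitly so the run is deterministic.

```python
from math import dist


def kmeans(points, centers, iterations=100):
    """Lloyd's algorithm: assign to nearest center, then recompute the means."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        updated = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centers:
            break  # centers stopped moving
        centers = updated
    return centers, clusters


# Hypothetical (income, purchasing power) pairs on a 0 to 1 scale.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]
kmeans_centers, kmeans_clusters = kmeans(points, centers=[points[0], points[-1]])
print(kmeans_centers)
```

Because the centers are means, a few very large observations can pull a center toward them and end up forming a small cluster of their own.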

k-medians

Now we will use k-medians:

Here we can see that a larger population does not necessarily mean a higher national income. The clusters are separated by national income: higher incomes are colored blue and lower incomes fall in the red cluster. We can also see that only a few countries, 9 in total, have a higher income than the rest (blue cluster).
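
k-medians follows the same assign-and-update loop as k-means, but each center is the coordinate-wise median of its cluster, and distances are usually measured with the Manhattan (L1) metric. The following is a minimal Python sketch under those assumptions, not the R implementation used for the plot. The data are made up, and `manhattan` and `kmedians` are hypothetical helpers.

```python
from statistics import median


def manhattan(p, q):
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmedians(points, centers, iterations=100):
    """Assign to the nearest center (L1), then recompute coordinate-wise medians."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [
            tuple(median(coord) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centers:
            break  # centers stopped moving
        centers = updated
    return centers, clusters


# Hypothetical (population, income) pairs; the last one is an extreme value.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0),
          (40.0, 50.0), (42.0, 52.0), (400.0, 500.0)]
kmedians_centers, kmedians_clusters = kmedians(points, centers=[points[0], points[3]])
print(kmedians_centers)
```

In this example the extreme observation joins the second cluster without dragging its center away: the median stays at (42.0, 52.0), whereas the mean of the same three points would be roughly (160.7, 200.7).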

Billy J Cedeño Nazario

2023-05-17
