This dataset contains information on 167 countries of the world, described by 10 variables:
First, I load the dataset.
I don't need the `Country` label column because, as mentioned before, I'll use unsupervised machine learning techniques, which work with unlabeled data.
Let's look at the data we are working with:
## 'data.frame': 167 obs. of 9 variables:
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|
| Min. : 2.60 | Min. : 0.109 | Min. : 1.810 | Min. : 0.0659 | Min. : 609 | Min. : -4.210 | Min. :32.10 | Min. :1.150 | Min. : 231 |
| 1st Qu.: 8.25 | 1st Qu.: 23.800 | 1st Qu.: 4.920 | 1st Qu.: 30.2000 | 1st Qu.: 3355 | 1st Qu.: 1.810 | 1st Qu.:65.30 | 1st Qu.:1.795 | 1st Qu.: 1330 |
| Median : 19.30 | Median : 35.000 | Median : 6.320 | Median : 43.3000 | Median : 9960 | Median : 5.390 | Median :73.10 | Median :2.410 | Median : 4660 |
| Mean : 38.27 | Mean : 41.109 | Mean : 6.816 | Mean : 46.8902 | Mean : 17145 | Mean : 7.782 | Mean :70.56 | Mean :2.948 | Mean : 12964 |
| 3rd Qu.: 62.10 | 3rd Qu.: 51.350 | 3rd Qu.: 8.600 | 3rd Qu.: 58.7500 | 3rd Qu.: 22800 | 3rd Qu.: 10.750 | 3rd Qu.:76.80 | 3rd Qu.:3.880 | 3rd Qu.: 14050 |
| Max. :208.00 | Max. :200.000 | Max. :17.900 | Max. :174.0000 | Max. :125000 | Max. :104.000 | Max. :82.80 | Max. :7.490 | Max. :105000 |
First we have to explore and visualize the data:
Histogram of Variables
I haven't included `income`, `gdpp`, and `life_expec` in the boxplot because, as seen in the density plots, their values span a much wider range than the other variables.
There are not many outliers; the few present occur in `imports`, `health`, `inflation`, and `total_fer`.
Then I plotted the correlations among the variables:
The most positively correlated pairs are `total_fer` with `child_mort`, and `income` with `gdpp`.
The most negatively correlated pairs are `child_mort` with `life_expec`, and again `life_expec` with `total_fer`.
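A minimal sketch of how such a correlation plot could be produced with the corrplot package (assuming the nine-variable data frame is called `data`; object names here are illustrative):

```r
library(corrplot)

# correlation matrix of the nine numeric variables
corr_mat <- cor(data)

# plot the correlations; hierarchical ordering groups related variables together
corrplot(corr_mat, method = "circle", type = "upper", order = "hclust")
```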
Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a dataset with no pre-existing labels.
In this first part of the analysis I'm going to use two different unsupervised clustering methods, the K-means algorithm and hierarchical clustering, in order to classify the countries into groups.
So, I'm going to use the Country-data to cluster the countries according to their variables.
The goal is to categorise the countries using the socio-economic and health factors that determine their overall development.
I have to normalize the variables to express them in the same range of values.
In other words, normalization means adjusting values measured on different scales to a common scale.
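A minimal sketch of this step using base R's `scale()`, which standardizes each variable to mean 0 and standard deviation 1 (the data frame name `data` is an assumption):

```r
# z-score standardization: centre each variable and scale it to unit variance
data_scaled <- as.data.frame(scale(data))

# quick sanity check: means should be ~0 and standard deviations ~1
round(colMeans(data_scaled), 3)
round(apply(data_scaled, 2, sd), 3)
```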
K-means is an unsupervised machine learning algorithm that works with unlabeled data.
Its aim is to minimize the differences within clusters and to maximize the differences between clusters.
K-means clustering is a vector quantization method that aims to partition n observations into k clusters, where each observation belongs to the cluster with the closest mean (the cluster centroid), which serves as a prototype of the cluster.
The K-Elbow Visualizer implements the elbow method of selecting the optimal number of clusters for K-means clustering.
The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
This method suggests 3 clusters.
## [1] 1.0000000 0.6993750 0.5531766 0.4662047 0.4356131 0.3907742 0.3712062
## [8] 0.3542132
## attr(,"class")
## [1] "k-means clustering"
To confirm this, I also estimated the optimal number of clusters with the gap statistic.
This method compares the total intra-cluster variation for different values of k with the expected values under a null reference distribution of the data.
This method also suggests that 3 is the optimal number of clusters.
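A sketch of the gap-statistic computation with `cluster::clusGap` (the number of reference samples B = 50 and the object names are assumptions):

```r
library(cluster)
library(factoextra)

set.seed(123)

# gap statistic for k = 1..8, using 50 Monte Carlo reference samples
gap <- clusGap(data_scaled, FUN = kmeans, nstart = 25, K.max = 8, B = 50)

# plot the gap statistic; the suggested k maximizes the gap
fviz_gap_stat(gap)
```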
So, let's create 3 clusters with the K-means algorithm.
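A minimal sketch of the K-means call (using `nstart = 25` random restarts to avoid poor local minima; names are assumptions):

```r
set.seed(123)

# K-means with 3 clusters on the standardized variables
km <- kmeans(data_scaled, centers = 3, nstart = 25)

# cluster sizes and the assignment of the first few countries
km$size
head(km$cluster)
```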
In this method, agglomerative hierarchical clustering, each object starts in its own cluster.
Then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
At each stage, distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method that is in use.
Dissimilarity matrix
The distance between two clusters is defined as the maximum distance between an observation in one cluster and an observation in the other cluster.
In other words, the criterion for choosing the pair of clusters to merge at each step is the proximity between their two most distant objects.
Plotting the complete-linkage dendrogram, I can see that it suggests a split into 3 clusters.
However, the clusters are not well separated and, moreover, one cluster is very small.
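A sketch of the complete-linkage clustering on the Euclidean dissimilarity matrix (object names are illustrative):

```r
# Euclidean dissimilarity matrix on the standardized variables
d <- dist(data_scaled, method = "euclidean")

# complete linkage: cluster distance = distance between the two farthest members
hc_complete <- hclust(d, method = "complete")

# dendrogram with the 3-cluster cut highlighted
plot(hc_complete, labels = FALSE, main = "Complete linkage")
rect.hclust(hc_complete, k = 3)
```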
Ward's minimum variance method is similar; however, with the Ward.D2 implementation the dissimilarities are squared before clustering.
The criterion for choosing the pair of clusters to merge at each step is the optimal value of an objective function, which for Ward's method is the smallest increase in total within-cluster variance.
Plotting the Ward.D2 linkage, the dendrogram again suggests a split into 3 clusters.
This time the plot is better than the previous one: the clusters are well defined, so I decided to keep the Ward.D2 linkage as the hierarchical clustering method.
Hence, let's extract the clusters from the hierarchical clustering with Ward.D2 linkage.
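A sketch of the Ward.D2 clustering and the cluster extraction (reusing the dissimilarity matrix `d` from the previous sketch):

```r
# Ward's minimum variance method on squared dissimilarities
hc_ward <- hclust(d, method = "ward.D2")

plot(hc_ward, labels = FALSE, main = "Ward.D2 linkage")
rect.hclust(hc_ward, k = 3)

# extract the cluster membership (1, 2 or 3) of each country
hc_clusters <- cutree(hc_ward, k = 3)
table(hc_clusters)
```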
Going forward, I want to find out which method has given the better clustering.
So, I compared the two solutions through their silhouettes and plotted the clusters.
I can conclude that, between the two clusterings, I prefer the K-means method.
The K-means clustering has an average silhouette width of 0.28.
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
The hierarchical clustering with Ward.D2 linkage has an average silhouette width of 0.25.
## cluster size ave.sil.width
## 1 1 27 0.45
## 2 2 106 0.21
## 3 3 34 0.19
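A sketch of how the silhouette comparison could be computed with the cluster package (assuming the `km`, `hc_clusters`, and `d` objects from the sketches above):

```r
library(cluster)

# silhouette widths for the two partitions, based on the same distance matrix
sil_km   <- silhouette(km$cluster, d)
sil_ward <- silhouette(hc_clusters, d)

# average silhouette width of each solution (higher is better)
mean(sil_km[, "sil_width"])
mean(sil_ward[, "sil_width"])
```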
Let’s visualize the chosen clusters.
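For example, with factoextra the chosen K-means clusters can be plotted on the first two principal components (a sketch; plotting options are illustrative):

```r
library(factoextra)

# K-means clusters projected onto the first two principal components
fviz_cluster(km, data = data_scaled,
             ellipse.type = "convex",
             ggtheme = theme_minimal())
```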
Then let's also visualise the boxplots of the clusters.
Moreover, in order to see what the clusters represent, I added to the dataset a variable indicating the cluster to which each country belongs, and then aggregated the data to obtain the mean of each variable within each cluster.
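A sketch of this step (object names are illustrative; the column name `Kmeans_cluster` matches the output shown below):

```r
# attach the K-means cluster label to the original (unscaled) data
data_clustered <- data
data_clustered$Kmeans_cluster <- km$cluster

# mean of every variable within each cluster
aggregate(. ~ Kmeans_cluster, data = data_clustered, FUN = mean)
```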
Through the cluster features we can see what the clusters represent.
Cluster 1 - Developed countries: the countries in this group have the best values of the variables that represent development, such as `exports`, `health`, `imports`, `income`, `life_expec`, and `gdpp`, and low values of the other variables.
Cluster 2 - Developing countries: this group has values of the variables between those of the first and third groups.
Cluster 3 - Underdeveloped countries, the so-called "third world countries": the countries in this group have characteristics opposite to those of the first group.
## Kmeans_cluster child_mort exports health imports income inflation
## 1 1 5.00000 58.73889 8.807778 51.49167 45672.222 2.671250
## 2 2 21.92738 40.24392 6.200952 47.47340 12305.595 7.600905
## 3 3 92.96170 29.15128 6.388511 42.32340 3942.404 12.019681
## life_expec total_fer gdpp
## 1 80.12778 1.752778 42494.444
## 2 72.81429 2.307500 6486.452
## 3 59.18723 5.008085 1922.383
Now let's visualize a plot showing the size of each cluster.
| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| N. of countries | 36 | 84 | 47 |
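A minimal sketch of how the cluster sizes could be tabulated and plotted in base R (illustrative):

```r
# number of countries in each K-means cluster
sizes <- table(km$cluster)
sizes

# simple bar plot of the cluster sizes
barplot(sizes, xlab = "Cluster", ylab = "Number of countries")
```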
In conclusion, I group the countries according to the clusters and print the names of the countries that belong to each cluster.
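A sketch of this step, assuming the country names have been kept as the row names of `data`:

```r
# country names grouped by their K-means cluster
countries_by_cluster <- split(rownames(data), km$cluster)

# print the members of each cluster in turn
countries_by_cluster[["1"]]
countries_by_cluster[["2"]]
countries_by_cluster[["3"]]
```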
The countries classified in this group (Cluster 1, developed countries) are:
## [1] "Australia" "Austria" "Bahrain"
## [4] "Belgium" "Brunei" "Canada"
## [7] "Cyprus" "Czech Republic" "Denmark"
## [10] "Finland" "France" "Germany"
## [13] "Greece" "Iceland" "Ireland"
## [16] "Israel" "Italy" "Japan"
## [19] "Kuwait" "Luxembourg" "Malta"
## [22] "Netherlands" "New Zealand" "Norway"
## [25] "Portugal" "Qatar" "Singapore"
## [28] "Slovak Republic" "Slovenia" "South Korea"
## [31] "Spain" "Sweden" "Switzerland"
## [34] "United Arab Emirates" "United Kingdom" "United States"
The countries classified in this group (Cluster 2, developing countries) are:
## [1] "Albania" "Algeria"
## [3] "Antigua and Barbuda" "Argentina"
## [5] "Armenia" "Azerbaijan"
## [7] "Bahamas" "Bangladesh"
## [9] "Barbados" "Belarus"
## [11] "Belize" "Bhutan"
## [13] "Bolivia" "Bosnia and Herzegovina"
## [15] "Brazil" "Bulgaria"
## [17] "Cambodia" "Cape Verde"
## [19] "Chile" "China"
## [21] "Colombia" "Costa Rica"
## [23] "Croatia" "Dominican Republic"
## [25] "Ecuador" "Egypt"
## [27] "El Salvador" "Estonia"
## [29] "Fiji" "Georgia"
## [31] "Grenada" "Guatemala"
## [33] "Guyana" "Hungary"
## [35] "India" "Indonesia"
## [37] "Iran" "Jamaica"
## [39] "Jordan" "Kazakhstan"
## [41] "Kyrgyz Republic" "Latvia"
## [43] "Lebanon" "Libya"
## [45] "Lithuania" "Macedonia, FYR"
## [47] "Malaysia" "Maldives"
## [49] "Mauritius" "Micronesia, Fed. Sts."
## [51] "Moldova" "Mongolia"
## [53] "Montenegro" "Morocco"
## [55] "Myanmar" "Nepal"
## [57] "Oman" "Panama"
## [59] "Paraguay" "Peru"
## [61] "Philippines" "Poland"
## [63] "Romania" "Russia"
## [65] "Samoa" "Saudi Arabia"
## [67] "Serbia" "Seychelles"
## [69] "Solomon Islands" "Sri Lanka"
## [71] "St. Vincent and the Grenadines" "Suriname"
## [73] "Tajikistan" "Thailand"
## [75] "Tonga" "Tunisia"
## [77] "Turkey" "Turkmenistan"
## [79] "Ukraine" "Uruguay"
## [81] "Uzbekistan" "Vanuatu"
## [83] "Venezuela" "Vietnam"
The countries classified in this group (Cluster 3, underdeveloped countries) are:
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "Iraq"
## [23] "Kenya" "Kiribati"
## [25] "Lao" "Lesotho"
## [27] "Liberia" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mozambique"
## [33] "Namibia" "Niger"
## [35] "Nigeria" "Pakistan"
## [37] "Rwanda" "Senegal"
## [39] "Sierra Leone" "South Africa"
## [41] "Sudan" "Tanzania"
## [43] "Timor-Leste" "Togo"
## [45] "Uganda" "Yemen"
## [47] "Zambia"
Dimensionality reduction is the problem of taking a matrix with many variables and "packing" it into a matrix with fewer variables that preserves as much of the information in the full matrix as possible.
Principal components are the simplest methodology for doing this: the method searches for an orthonormal basis (a set of perpendicular unit vectors) within the dimensional space of the dataset that explains the greatest possible amount of variance in the dataset.
Principal component analysis is therefore the simplest example of dimensionality reduction.
The goal is to find the number of principal components that can represent my dataset and to understand their characteristics.
Principal Component Analysis, or PCA, is a dimensionality reduction method that is often used on large datasets, turning a large set of variables into a smaller one that still contains most of the information of the original set.
PCA finds the directions of greatest variance.
PCA is generally used as a tool for exploratory data analysis and visualization.
The main elements of PCA are:

- Eigenvectors: the principal axes of the subspace of maximum variance.
- Eigenvalues: the variance of the inputs projected along the principal axes.
- Estimated dimensionality: the number of significant eigenvalues (for example, those greater than 1).
After deleting the cluster variable that I assigned in the previous analysis, the first step for PCA is the normalization of the data.
Then I find the principal components, visualize their characteristics, and plot them.
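A minimal sketch of the PCA with base R's `prcomp` (the original chunk is not shown; centering and scaling are done inside the call):

```r
# PCA on the nine variables, centred and scaled to unit variance
pca <- prcomp(data, center = TRUE, scale. = TRUE)

# loadings (rotation matrix) and variance explained by each component
print(pca)
summary(pca)
```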
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
Variables that are positively correlated are grouped together, while those that are negatively correlated are positioned on opposite sides of the plot origin.
In order to find the best number of principal components, we can look at the percentage of explained variance of each component and at their eigenvalues, where an eigenvalue is a number telling us how much variance there is in the data along that direction.
I look for a cumulative explained variance of at least 80%.
In this case the right number of PCs is 4.
## [1] "Percentage of explained variance with 4 PC is: 0.87191"
I also look for the PCs that have an eigenvalue greater than 1.
In this case too the right number of PCs is 4 (the fourth eigenvalue, 0.995, is essentially 1), which confirms what we saw with the percentage of explained variance.
Then let's also visualise the table with all the eigenvalues and the percentages of explained variance.
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.13565658 45.9517398 45.95174
## Dim.2 1.54634631 17.1816257 63.13337
## Dim.3 1.17038330 13.0042589 76.13762
## Dim.4 0.99478456 11.0531618 87.19079
## Dim.5 0.66061903 7.3402114 94.53100
## Dim.6 0.22358112 2.4842347 97.01523
## Dim.7 0.11343874 1.2604304 98.27566
## Dim.8 0.08831536 0.9812817 99.25694
## Dim.9 0.06687501 0.7430556 100.00000
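A sketch of how such a table could be obtained with factoextra; equivalently, the eigenvalues are simply the squared standard deviations of the PCs:

```r
library(factoextra)

# eigenvalue, percentage of variance and cumulative percentage per dimension
get_eigenvalue(pca)

# the eigenvalues by hand: squared standard deviations of the components
pca$sdev^2
```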
Going forward, let's also display the most significant variables that make up each of the 4 most important PCs, listed below by the magnitude of their loadings.
## life_expec child_mort total_fer income gdpp exports inflation
## 0.4258394 -0.4195194 -0.4037290 0.3984411 0.3926448 0.2838970 -0.1931729
## imports health
## 0.1614824 0.1508378
## imports exports health life_expec child_mort total_fer
## -0.671820644 -0.613163494 0.243086779 0.222706743 -0.192883937 -0.155233106
## gdpp income inflation
## 0.046022396 -0.022535530 0.008404473
## inflation health income imports exports gdpp
## -0.64251951 0.59663237 -0.30154750 0.29992674 -0.14476069 -0.12297749
## life_expec child_mort total_fer
## -0.11391854 0.02954353 -0.01954925
## gdpp health income total_fer child_mort life_expec
## -0.531994575 -0.461897497 -0.392159039 -0.378303645 -0.370653262 0.203797235
## inflation imports exports
## -0.150441762 0.071907461 -0.003091019
Then let's visualize graphically each variable's contribution to the 4 most important PCs.
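A sketch of these contribution plots with factoextra (one plot per component, selected via `axes`):

```r
library(factoextra)

# percentage contribution of each variable to the first four principal components
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)
fviz_contrib(pca, choice = "var", axes = 3)
fviz_contrib(pca, choice = "var", axes = 4)
```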
I can conclude my analysis by saying that:
The first component is related to the variables concerning lifespan; in fact, the largest contributions are given by `life_expec`, `child_mort`, and `total_fer`.
The second component concerns trade relations with other countries; in fact, the most important variables are `imports` and `exports`.
The third component is characterized by the variables that reflect people's quality of life; in fact, the largest contributions are given by `inflation` and `health`.
The fourth component concerns the economic productivity of a country; in fact, the highest contribution is given by `gdpp`.