This dataset contains information on 167 countries of the world, described by 10 variables:
First, I load the dataset.
I don't need the `Country` label column because, as mentioned before, I'll use unsupervised machine learning techniques, which work with unlabeled data.
Let's look at the data we are working with:
## 'data.frame': 167 obs. of 9 variables:
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|
| Min. : 2.60 | Min. : 0.109 | Min. : 1.810 | Min. : 0.0659 | Min. : 609 | Min. : -4.210 | Min. :32.10 | Min. :1.150 | Min. : 231 |
| 1st Qu.: 8.25 | 1st Qu.: 23.800 | 1st Qu.: 4.920 | 1st Qu.: 30.2000 | 1st Qu.: 3355 | 1st Qu.: 1.810 | 1st Qu.:65.30 | 1st Qu.:1.795 | 1st Qu.: 1330 |
| Median : 19.30 | Median : 35.000 | Median : 6.320 | Median : 43.3000 | Median : 9960 | Median : 5.390 | Median :73.10 | Median :2.410 | Median : 4660 |
| Mean : 38.27 | Mean : 41.109 | Mean : 6.816 | Mean : 46.8902 | Mean : 17145 | Mean : 7.782 | Mean :70.56 | Mean :2.948 | Mean : 12964 |
| 3rd Qu.: 62.10 | 3rd Qu.: 51.350 | 3rd Qu.: 8.600 | 3rd Qu.: 58.7500 | 3rd Qu.: 22800 | 3rd Qu.: 10.750 | 3rd Qu.:76.80 | 3rd Qu.:3.880 | 3rd Qu.: 14050 |
| Max. :208.00 | Max. :200.000 | Max. :17.900 | Max. :174.0000 | Max. :125000 | Max. :104.000 | Max. :82.80 | Max. :7.490 | Max. :105000 |
First we have to explore and visualize the data:
Histogram of Variables
I haven't included `income`, `gdpp`, and `life_expec` in the boxplot because, as seen in the density plots, their values span a much wider range than the other variables.
There are not many outliers; the few present occur in `imports`, `health`, `inflation`, and `total_fer`.
Then I plotted the correlations among the variables:
The most positively correlated pairs are `total_fer` with `child_mort`, and `income` with `gdpp`.
The most negatively correlated pairs are `child_mort` with `life_expec`, and again `life_expec` with `total_fer`.
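A minimal sketch of how such a correlation plot could be produced with the corrplot package (assuming the nine-variable data frame is called `data`; object names here are illustrative):

```r
library(corrplot)

# correlation matrix of the nine numeric variables
corr_mat <- cor(data)

# plot the correlations; hierarchical ordering groups related variables together
corrplot(corr_mat, method = "circle", type = "upper", order = "hclust")
```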
Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a dataset with no pre-existing labels.
In this first part of the analysis I'm going to use two different unsupervised clustering methods, the K-means algorithm and hierarchical clustering, in order to classify the countries into groups.
So, I'm going to use the Country-data to cluster the countries according to their variables.
The goal is to categorise the countries using the socio-economic and health factors that determine their overall development.
I have to normalize the variables to express them in the same range of values.
In other words, normalization means adjusting values measured on different scales to a common scale.
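A minimal sketch of this step using base R's `scale()`, which standardizes each variable to mean 0 and standard deviation 1 (the data frame name `data` is an assumption):

```r
# z-score standardization: centre each variable and scale it to unit variance
data_scaled <- as.data.frame(scale(data))

# quick sanity check: means should be ~0 and standard deviations ~1
round(colMeans(data_scaled), 3)
round(apply(data_scaled, 2, sd), 3)
```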
K-means is an unsupervised machine learning algorithm that works with unlabeled data.
Its aim is to minimize the differences within clusters and to maximize the differences between clusters.
K-means clustering is a vector quantization method that aims to partition n observations into k clusters, where each observation belongs to the cluster with the closest mean (the cluster centroid), which serves as a prototype of the cluster.
The K-Elbow Visualizer implements the elbow method of selecting the optimal number of clusters for K-means clustering.
The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
This method suggests 3 clusters.
## [1] 1.0000000 0.6993750 0.5531766 0.4662047 0.4356131 0.3907742 0.3712062
## [8] 0.3542132
## attr(,"class")
## [1] "k-means clustering"
To confirm this, I also estimated the optimal number of clusters with the gap statistic.
This method compares the total intra-cluster variation for different values of k with the expected values under a null reference distribution of the data.
This method also suggests that 3 is the optimal number of clusters.
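A sketch of the gap-statistic computation with `cluster::clusGap` (the number of reference samples B = 50 and the object names are assumptions):

```r
library(cluster)
library(factoextra)

set.seed(123)

# gap statistic for k = 1..8, using 50 Monte Carlo reference samples
gap <- clusGap(data_scaled, FUN = kmeans, nstart = 25, K.max = 8, B = 50)

# plot the gap statistic; the suggested k maximizes the gap
fviz_gap_stat(gap)
```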
So, let's create 3 clusters with the K-means algorithm.
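A minimal sketch of the K-means call (using `nstart = 25` random restarts to avoid poor local minima; names are assumptions):

```r
set.seed(123)

# K-means with 3 clusters on the standardized variables
km <- kmeans(data_scaled, centers = 3, nstart = 25)

# cluster sizes and the assignment of the first few countries
km$size
head(km$cluster)
```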
In this method, agglomerative hierarchical clustering, each object starts in its own cluster.
Then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
At each stage, distances between clusters are recomputed by a dissimilarity formula according to the particular clustering method that is in use.
Dissimilarity matrix
The distance between two clusters is defined as the maximum distance between an observation in one cluster and an observation in the other cluster.
In other words, the criterion for choosing the pair of clusters to merge at each step is the proximity between their two most distant objects.
Plotting the complete-linkage dendrogram, I can see that it suggests a split into 3 clusters.
However, the clusters are not well separated and, moreover, one cluster is very small.
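A sketch of the complete-linkage clustering on the Euclidean dissimilarity matrix (object names are illustrative):

```r
# Euclidean dissimilarity matrix on the standardized variables
d <- dist(data_scaled, method = "euclidean")

# complete linkage: cluster distance = distance between the two farthest members
hc_complete <- hclust(d, method = "complete")

# dendrogram with the 3-cluster cut highlighted
plot(hc_complete, labels = FALSE, main = "Complete linkage")
rect.hclust(hc_complete, k = 3)
```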
Ward's minimum variance method is similar; however, with the Ward.D2 implementation the dissimilarities are squared before clustering.
The criterion for choosing the pair of clusters to merge at each step is the optimal value of an objective function, which for Ward's method is the smallest increase in total within-cluster variance.
Plotting the Ward.D2 linkage, the dendrogram again suggests a split into 3 clusters.
This time the plot is better than the previous one: the clusters are well defined, so I decided to keep the Ward.D2 linkage as the hierarchical clustering method.
Hence, let's extract the clusters from the hierarchical clustering with Ward.D2 linkage.
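A sketch of the Ward.D2 clustering and the cluster extraction (reusing the dissimilarity matrix `d` from the previous sketch):

```r
# Ward's minimum variance method on squared dissimilarities
hc_ward <- hclust(d, method = "ward.D2")

plot(hc_ward, labels = FALSE, main = "Ward.D2 linkage")
rect.hclust(hc_ward, k = 3)

# extract the cluster membership (1, 2 or 3) of each country
hc_clusters <- cutree(hc_ward, k = 3)
table(hc_clusters)
```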
Going forward, I want to find out which method has given the better clustering.
So, I compared the two solutions through their silhouettes and plotted the clusters.
I can conclude that, between the two clusterings, I prefer the K-means method.
The K-means clustering has an average silhouette width of 0.28.
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
The hierarchical clustering with Ward.D2 linkage has an average silhouette width of 0.25.
## cluster size ave.sil.width
## 1 1 27 0.45
## 2 2 106 0.21
## 3 3 34 0.19
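A sketch of how the silhouette comparison could be computed with the cluster package (assuming the `km`, `hc_clusters`, and `d` objects from the sketches above):

```r
library(cluster)

# silhouette widths for the two partitions, based on the same distance matrix
sil_km   <- silhouette(km$cluster, d)
sil_ward <- silhouette(hc_clusters, d)

# average silhouette width of each solution (higher is better)
mean(sil_km[, "sil_width"])
mean(sil_ward[, "sil_width"])
```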
Let’s visualize the chosen clusters.
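For example, with factoextra the chosen K-means clusters can be plotted on the first two principal components (a sketch; plotting options are illustrative):

```r
library(factoextra)

# K-means clusters projected onto the first two principal components
fviz_cluster(km, data = data_scaled,
             ellipse.type = "convex",
             ggtheme = theme_minimal())
```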
Then let's also visualise the boxplots of the clusters.
Moreover, in order to see what the clusters represent, I added to the dataset a variable indicating the cluster to which each country belongs, and then aggregated the data to obtain the mean of each variable within each cluster.
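A sketch of this step (object names are illustrative; the column name `Kmeans_cluster` matches the output shown below):

```r
# attach the K-means cluster label to the original (unscaled) data
data_clustered <- data
data_clustered$Kmeans_cluster <- km$cluster

# mean of every variable within each cluster
aggregate(. ~ Kmeans_cluster, data = data_clustered, FUN = mean)
```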
Through the cluster features we can see what the clusters represent.
Cluster 1 - Developed countries: the countries in this group have the best values of the variables that represent development, such as `exports`, `health`, `imports`, `income`, `life_expec`, and `gdpp`, and low values of the other variables.
Cluster 2 - Developing countries: this group has values of the variables between those of the first and third groups.
Cluster 3 - Underdeveloped countries, the so-called "third world countries": the countries in this group have characteristics opposite to those of the first group.
## Kmeans_cluster child_mort exports health imports income inflation
## 1 1 5.00000 58.73889 8.807778 51.49167 45672.222 2.671250
## 2 2 21.92738 40.24392 6.200952 47.47340 12305.595 7.600905
## 3 3 92.96170 29.15128 6.388511 42.32340 3942.404 12.019681
## life_expec total_fer gdpp
## 1 80.12778 1.752778 42494.444
## 2 72.81429 2.307500 6486.452
## 3 59.18723 5.008085 1922.383
Now let's visualize a plot showing the size of each cluster.
| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| N. of countries | 36 | 84 | 47 |
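A minimal sketch of how the cluster sizes could be tabulated and plotted in base R (illustrative):

```r
# number of countries in each K-means cluster
sizes <- table(km$cluster)
sizes

# simple bar plot of the cluster sizes
barplot(sizes, xlab = "Cluster", ylab = "Number of countries")
```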
In conclusion, I group the countries according to the clusters and print the names of the countries that belong to each cluster.
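A sketch of this step, assuming the country names have been kept as the row names of `data`:

```r
# country names grouped by their K-means cluster
countries_by_cluster <- split(rownames(data), km$cluster)

# print the members of each cluster in turn
countries_by_cluster[["1"]]
countries_by_cluster[["2"]]
countries_by_cluster[["3"]]
```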
The countries classified in this group (Cluster 1, developed countries) are:
## [1] "Australia" "Austria" "Bahrain"
## [4] "Belgium" "Brunei" "Canada"
## [7] "Cyprus" "Czech Republic" "Denmark"
## [10] "Finland" "France" "Germany"
## [13] "Greece" "Iceland" "Ireland"
## [16] "Israel" "Italy" "Japan"
## [19] "Kuwait" "Luxembourg" "Malta"
## [22] "Netherlands" "New Zealand" "Norway"
## [25] "Portugal" "Qatar" "Singapore"
## [28] "Slovak Republic" "Slovenia" "South Korea"
## [31] "Spain" "Sweden" "Switzerland"
## [34] "United Arab Emirates" "United Kingdom" "United States"
The countries classified in this group (Cluster 2, developing countries) are:
## [1] "Albania" "Algeria"
## [3] "Antigua and Barbuda" "Argentina"
## [5] "Armenia" "Azerbaijan"
## [7] "Bahamas" "Bangladesh"
## [9] "Barbados" "Belarus"
## [11] "Belize" "Bhutan"
## [13] "Bolivia" "Bosnia and Herzegovina"
## [15] "Brazil" "Bulgaria"
## [17] "Cambodia" "Cape Verde"
## [19] "Chile" "China"
## [21] "Colombia" "Costa Rica"
## [23] "Croatia" "Dominican Republic"
## [25] "Ecuador" "Egypt"
## [27] "El Salvador" "Estonia"
## [29] "Fiji" "Georgia"
## [31] "Grenada" "Guatemala"
## [33] "Guyana" "Hungary"
## [35] "India" "Indonesia"
## [37] "Iran" "Jamaica"
## [39] "Jordan" "Kazakhstan"
## [41] "Kyrgyz Republic" "Latvia"
## [43] "Lebanon" "Libya"
## [45] "Lithuania" "Macedonia, FYR"
## [47] "Malaysia" "Maldives"
## [49] "Mauritius" "Micronesia, Fed. Sts."
## [51] "Moldova" "Mongolia"
## [53] "Montenegro" "Morocco"
## [55] "Myanmar" "Nepal"
## [57] "Oman" "Panama"
## [59] "Paraguay" "Peru"
## [61] "Philippines" "Poland"
## [63] "Romania" "Russia"
## [65] "Samoa" "Saudi Arabia"
## [67] "Serbia" "Seychelles"
## [69] "Solomon Islands" "Sri Lanka"
## [71] "St. Vincent and the Grenadines" "Suriname"
## [73] "Tajikistan" "Thailand"
## [75] "Tonga" "Tunisia"
## [77] "Turkey" "Turkmenistan"
## [79] "Ukraine" "Uruguay"
## [81] "Uzbekistan" "Vanuatu"
## [83] "Venezuela" "Vietnam"
The countries classified in this group (Cluster 3, underdeveloped countries) are:
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "Iraq"
## [23] "Kenya" "Kiribati"
## [25] "Lao" "Lesotho"
## [27] "Liberia" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mozambique"
## [33] "Namibia" "Niger"
## [35] "Nigeria" "Pakistan"
## [37] "Rwanda" "Senegal"
## [39] "Sierra Leone" "South Africa"
## [41] "Sudan" "Tanzania"
## [43] "Timor-Leste" "Togo"
## [45] "Uganda" "Yemen"
## [47] "Zambia"
Dimensionality reduction is the problem of taking a matrix with many variables and "packing" it into a matrix with fewer variables that preserves as much of the information in the full matrix as possible.
Principal components are the simplest methodology for doing this: the method searches for an orthonormal basis (a set of perpendicular unit vectors) within the dimensional space of the dataset that explains the greatest possible amount of variance in the dataset.
Principal component analysis is therefore the simplest example of dimensionality reduction.
The goal is to find the number of principal components that can represent my dataset and to understand their characteristics.
Principal Component Analysis, or PCA, is a dimensionality reduction method that is often used on large datasets, turning a large set of variables into a smaller one that still contains most of the information of the original set.
PCA finds the directions of greatest variance.
PCA is generally used as a tool for exploratory data analysis and visualization.
The main elements of PCA are:

- Eigenvectors: the principal axes of the subspace of maximum variance.
- Eigenvalues: the variance of the inputs projected along the principal axes.
- Estimated dimensionality: the number of significant eigenvalues (for example, those greater than 1).
After deleting the cluster variable that I assigned in the previous analysis, the first step for PCA is the normalization of the data.
Then I find the principal components, visualize their characteristics, and plot them.
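A minimal sketch of the PCA with base R's `prcomp` (the original chunk is not shown; centering and scaling are done inside the call):

```r
# PCA on the nine variables, centred and scaled to unit variance
pca <- prcomp(data, center = TRUE, scale. = TRUE)

# loadings (rotation matrix) and variance explained by each component
print(pca)
summary(pca)
```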
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
## Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
## Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
## PC8 PC9
## Standard deviation 0.29718 0.25860
## Proportion of Variance 0.00981 0.00743
## Cumulative Proportion 0.99257 1.00000
Variables that are positively correlated are grouped together, while those that are negatively correlated are positioned on opposite sides of the plot origin.
In order to find the best number of principal components, we can look at the percentage of explained variance of each component and at their eigenvalues, where an eigenvalue is a number telling us how much variance there is in the data along that direction.
I look for a cumulative explained variance of at least 80%.
In this case the right number of PCs is 4.
## [1] "Percentage of explained variance with 4 PC is: 0.87191"
I also look for the PCs that have an eigenvalue greater than 1.
In this case too the right number of PCs is 4 (the fourth eigenvalue, 0.995, is essentially 1), which confirms what we saw with the percentage of explained variance.
Then let's also visualise the table with all the eigenvalues and the percentages of explained variance.
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.13565658 45.9517398 45.95174
## Dim.2 1.54634631 17.1816257 63.13337
## Dim.3 1.17038330 13.0042589 76.13762
## Dim.4 0.99478456 11.0531618 87.19079
## Dim.5 0.66061903 7.3402114 94.53100
## Dim.6 0.22358112 2.4842347 97.01523
## Dim.7 0.11343874 1.2604304 98.27566
## Dim.8 0.08831536 0.9812817 99.25694
## Dim.9 0.06687501 0.7430556 100.00000
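A sketch of how such a table could be obtained with factoextra; equivalently, the eigenvalues are simply the squared standard deviations of the PCs:

```r
library(factoextra)

# eigenvalue, percentage of variance and cumulative percentage per dimension
get_eigenvalue(pca)

# the eigenvalues by hand: squared standard deviations of the components
pca$sdev^2
```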
Going forward, let's also display the most significant variables that make up each of the 4 most important PCs, listed below by the magnitude of their loadings.
## life_expec child_mort total_fer income gdpp exports inflation
## 0.4258394 -0.4195194 -0.4037290 0.3984411 0.3926448 0.2838970 -0.1931729
## imports health
## 0.1614824 0.1508378
## imports exports health life_expec child_mort total_fer
## -0.671820644 -0.613163494 0.243086779 0.222706743 -0.192883937 -0.155233106
## gdpp income inflation
## 0.046022396 -0.022535530 0.008404473
## inflation health income imports exports gdpp
## -0.64251951 0.59663237 -0.30154750 0.29992674 -0.14476069 -0.12297749
## life_expec child_mort total_fer
## -0.11391854 0.02954353 -0.01954925
## gdpp health income total_fer child_mort life_expec
## -0.531994575 -0.461897497 -0.392159039 -0.378303645 -0.370653262 0.203797235
## inflation imports exports
## -0.150441762 0.071907461 -0.003091019
Then let's visualize graphically each variable's contribution to the 4 most important PCs.
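A sketch of these contribution plots with factoextra (one plot per component, selected via `axes`):

```r
library(factoextra)

# percentage contribution of each variable to the first four principal components
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)
fviz_contrib(pca, choice = "var", axes = 3)
fviz_contrib(pca, choice = "var", axes = 4)
```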
I can conclude my analysis by saying that:
The first component is related to the variables concerning lifespan; in fact, the largest contributions are given by `life_expec`, `child_mort`, and `total_fer`.
The second component concerns trade relations with other countries; in fact, the most important variables are `imports` and `exports`.
The third component is characterized by the variables that reflect people's quality of life; in fact, the largest contributions are given by `inflation` and `health`.
The fourth component concerns the economic productivity of a country; in fact, the highest contribution is given by `gdpp`.