Many past studies have clustered countries on the basis of GDP and life expectancy. However, few have considered other socio-economic factors such as mortality rate and level of schooling, and the effect of these factors has not been widely publicised. This study therefore applies different clustering techniques, such as K-means and PAM, to gain insights from the dataset made available by Deeksha Russell and Duan Wang, who gathered the data from the WHO and United Nations websites.
library(factoextra)
library(clValid)
library(flexclust)
library(clustertend)
library(cluster)
library(ClusterR)
library(readxl)
library(fpc)
library(gridExtra)
library(corrplot)
data <- read_excel("dataset.xlsx")
The dataset includes 183 observations (countries) and 22 variables. The variables are listed in the output of str(data) below.
The data on life expectancy and health factors for 183 countries were collected from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), and the corresponding economic data were collected from the United Nations website, all for the year 2015.
head(data)
## # A tibble: 6 x 22
## Country Year Status `Life expectanc~ `Adult Mortalit~ `infant deaths` Alcohol
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ 2015 Devel~ 65 263 62 0.01
## 2 Albania 2015 Devel~ 77.8 74 0 4.6
## 3 Algeria 2015 Devel~ 75.6 19 21 NA
## 4 Angola 2015 Devel~ 52.4 335 66 NA
## 5 Antigu~ 2015 Devel~ 76.4 13 0 NA
## 6 Argent~ 2015 Devel~ 76.3 116 8 NA
## # ... with 15 more variables: percentage expenditure <dbl>, Hepatitis B <dbl>,
## # Measles <dbl>, BMI <dbl>, under-five deaths <dbl>, Polio <dbl>,
## # Total expenditure <dbl>, Diphtheria <dbl>, HIV/AIDS <dbl>, GDP <dbl>,
## # Population <dbl>, thinness 10-19 years <dbl>, thinness 5-9 years <dbl>,
## # Income composition of resources <dbl>, Schooling <dbl>
str(data)
## tibble [183 x 22] (S3: tbl_df/tbl/data.frame)
## $ Country : chr [1:183] "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Year : num [1:183] 2015 2015 2015 2015 2015 ...
## $ Status : chr [1:183] "Developing" "Developing" "Developing" "Developing" ...
## $ Life expectancy : num [1:183] 65 77.8 75.6 52.4 76.4 76.3 74.8 82.8 81.5 72.7 ...
## $ Adult Mortality : num [1:183] 263 74 19 335 13 116 118 59 65 118 ...
## $ infant deaths : num [1:183] 62 0 21 66 0 8 1 1 0 5 ...
## $ Alcohol : num [1:183] 0.01 4.6 NA NA NA NA NA NA NA NA ...
## $ percentage expenditure : num [1:183] 71.3 365 0 0 0 ...
## $ Hepatitis B : num [1:183] 65 99 95 64 99 94 94 93 93 96 ...
## $ Measles : num [1:183] 1154 0 63 118 0 ...
## $ BMI : num [1:183] 19.1 58 59.5 23.3 47.7 62.8 54.9 66.6 57.6 52.5 ...
## $ under-five deaths : num [1:183] 83 0 24 98 0 9 1 1 0 6 ...
## $ Polio : num [1:183] 6 99 95 7 86 93 96 93 93 98 ...
## $ Total expenditure : num [1:183] 8.16 6 NA NA NA NA NA NA NA NA ...
## $ Diphtheria : num [1:183] 65 99 95 64 99 94 94 93 93 96 ...
## $ HIV/AIDS : num [1:183] 0.1 0.1 0.1 1.9 0.2 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num [1:183] 584 3954 4133 3696 13567 ...
## $ Population : num [1:183] 33736494 28873 39871528 2785935 NA ...
## $ thinness 10-19 years : num [1:183] 17.2 1.2 6 8.3 3.3 1 2.1 0.6 1.9 2.8 ...
## $ thinness 5-9 years : num [1:183] 17.3 1.3 5.8 8.2 3.3 0.9 2.2 0.6 2.1 2.9 ...
## $ Income composition of resources: num [1:183] 0.479 0.762 0.743 0.531 0.784 0.826 0.741 0.937 0.892 0.758 ...
## $ Schooling : num [1:183] 10.1 14.2 14.4 11.4 13.9 17.3 12.7 20.4 15.9 12.7 ...
As the output above shows, some variables contain missing values, and some variables are probably not important for the analysis, so the dataset needs cleaning.
# Removing the Year and Status variables
data <- data[,c(-2,-3)]
sapply(data, function(x) sum(is.na(x)))
## Country Life expectancy
## 0 0
## Adult Mortality infant deaths
## 0 0
## Alcohol percentage expenditure
## 177 0
## Hepatitis B Measles
## 9 0
## BMI under-five deaths
## 2 0
## Polio Total expenditure
## 0 181
## Diphtheria HIV/AIDS
## 0 0
## GDP Population
## 29 41
## thinness 10-19 years thinness 5-9 years
## 2 2
## Income composition of resources Schooling
## 10 10
Clearly, missing values exist in several variables. In particular, two variables, “Alcohol” and “Total expenditure”, are missing for almost all countries (177 and 181 of the 183 observations). So, it is better to drop these two variables.
data[c("Alcohol","Total expenditure")] <- NULL
data <- na.omit(data)
After dropping these variables and removing the rows with missing values, the dataset contains information about 130 countries in 18 columns: the country name and 17 numeric factors.
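The cleaning rule used above (drop the columns that are mostly missing, then keep only complete rows) can be sketched on a small made-up data frame. The data, the object names `toy`, `na_share` and `toy_clean`, and the 50% threshold are all assumptions for illustration; they are not part of the original analysis.

```r
# Toy data frame (hypothetical values) with one mostly-missing column
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  x       = c(1.0, 2.0, NA, 4.0, 5.0),
  y       = c(NA, NA, NA, NA, 3.0),
  z       = c(10, 20, 30, 40, 50)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Drop columns above an assumed 50% threshold, then keep complete rows only
max_na_share <- 0.5
toy_clean <- na.omit(toy[, na_share <= max_na_share, drop = FALSE])

dim(toy_clean)
```

Dropping the sparse column first matters: calling na.omit() on the untouched data would discard almost every row, because nearly every row has a missing value in that column.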
The variable with the names of the countries is of character type. Hence, let’s use it as the row names instead of keeping it as a variable.
rownames(data) <- data$Country
finaldata <- data[,2:18]
rownames(finaldata) <- rownames(data)
finaldata <- scale(finaldata) # Scaling the dataset
clusterable <- get_clust_tendency(finaldata, n = nrow(finaldata)-1, graph = FALSE)
clusterable$hopkins_stat
## [1] 0.8615404
The value of the Hopkins statistic (0.8615404) is close to 1, which suggests that the dataset is highly clusterable.
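To make the statistic less of a black box, here is a minimal base-R sketch of one common formulation of the Hopkins statistic. It is not the code behind get_clust_tendency(), which may differ in details such as how the artificial points are generated or whether distances are raised to a power. The function name `hopkins_sketch`, its arguments and the synthetic data are assumptions made for this example.

```r
# Hopkins statistic sketch: values near 0.5 suggest uniformly random data,
# values near 1 suggest clustered data (convention assumed here).
hopkins_sketch <- function(x, m = 10, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # m real points sampled from the data
  idx <- sample(n, m)

  # m artificial points drawn uniformly within the range of each column
  rng <- apply(x, 2, range)
  unif <- sapply(seq_len(p), function(j) runif(m, rng[1, j], rng[2, j]))
  unif <- matrix(unif, nrow = m)

  # Euclidean distance from one point to every row of x
  dist_to_data <- function(pt) sqrt(rowSums(sweep(x, 2, pt)^2))

  # u: nearest-neighbour distance from each artificial point to the data
  u <- apply(unif, 1, function(pt) min(dist_to_data(pt)))
  # w: nearest-neighbour distance from each sampled real point to the other points
  w <- sapply(idx, function(i) min(dist_to_data(x[i, ])[-i]))

  sum(u) / (sum(u) + sum(w))
}

# Two well-separated synthetic groups (illustrative data only)
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
h <- hopkins_sketch(toy, m = 20)
h
```

The idea is that in clustered data the real points sit much closer to their neighbours than uniformly scattered points do, so the artificial distances dominate the ratio.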
To find the optimal number of clusters, different methods are compared below.
f1 <- fviz_nbclust(finaldata, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(finaldata, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
The above plots suggest that, with both the K-means and PAM algorithms, 2 clusters is the best choice for this dataset.
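The silhouette method behind these plots can be reproduced by hand: fit the algorithm for each candidate k, average the silhouette widths, and pick the k with the highest average. The sketch below does this on synthetic data with kmeans() and cluster::silhouette(). The data and the object names `toy`, `avg_sil` and `best_k` are assumptions for illustration, not part of the original analysis.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Synthetic data with three well-separated groups (illustrative only)
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2),
             cbind(rnorm(30, mean = 0, sd = 0.4), rnorm(30, mean = 4, sd = 0.4)))
d <- dist(toy)

# Average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# Candidate with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```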
f3 <- fviz_nbclust(finaldata, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
f4 <- fviz_nbclust(finaldata, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f3, f4, ncol=2)
This plot is harder to read, as there is no obvious answer as to how many clusters would be ideal. Looking at the plot closely, one can argue for 2, 5 or 9 clusters. So, all these possibilities are tried out in the analysis.
k_max <- 10
wss <- sapply(1:k_max, function(k){kmeans(finaldata, k,
nstart=50,iter.max = 1000 )$tot.withinss})
wss
## [1] 2193.0000 1632.7573 1350.9019 1191.8486 1057.0326 944.5931 814.7547
## [8] 728.3500 653.1678 604.9573
plot(1:k_max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
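This plot supports using 2 clusters: the largest drop in the total within-cluster sum of squares occurs between 1 and 2 clusters (from 2193.00 to 1632.76), after which the curve declines more gradually without another sharp bend.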
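The elbow can also be read numerically by looking at how much the within-cluster sum of squares falls with each extra cluster. The sketch below reuses the ten values printed above; the object names `wss_printed`, `drop` and `rel_drop` are introduced here for illustration only.

```r
# WSS values copied from the output above (k = 1, ..., 10)
wss_printed <- c(2193.0000, 1632.7573, 1350.9019, 1191.8486, 1057.0326,
                 944.5931, 814.7547, 728.3500, 653.1678, 604.9573)

# Reduction in WSS gained by moving from k - 1 to k clusters
drop <- -diff(wss_printed)
names(drop) <- paste0(1:9, "->", 2:10)

# Same reduction as a share of the WSS at k = 1
rel_drop <- round(drop / wss_printed[1], 3)
rel_drop
```

The step from 1 to 2 clusters gives by far the largest reduction; the later steps are smaller and fairly similar to one another, which is why the elbow plot alone does not single out 5 or 9.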
kmean2 <- eclust(finaldata, k = 2, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
cluster2 <- fviz_cluster(kmean2, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 2 clusters")
sil2 <- fviz_silhouette(kmean2)
## cluster size ave.sil.width
## 1 1 82 0.43
## 2 2 48 0.11
grid.arrange(cluster2, sil2, ncol=2)
kmean2$cluster
## Afghanistan Albania Algeria
## 2 1 1
## Angola Argentina Armenia
## 2 1 1
## Australia Austria Azerbaijan
## 1 1 1
## Bangladesh Belarus Belgium
## 2 1 1
## Belize Benin Bhutan
## 1 2 2
## Bosnia and Herzegovina Botswana Brazil
## 1 1 1
## Bulgaria Burkina Faso Burundi
## 1 2 2
## Cabo Verde Cambodia Cameroon
## 1 1 2
## Canada Central African Republic Chad
## 1 2 2
## Chile China Colombia
## 1 1 1
## Comoros Costa Rica Croatia
## 2 1 1
## Cyprus Djibouti Dominican Republic
## 1 2 1
## Ecuador El Salvador Equatorial Guinea
## 1 1 2
## Estonia Ethiopia Fiji
## 1 2 1
## France Gabon Georgia
## 1 2 1
## Germany Ghana Greece
## 1 2 1
## Guatemala Guinea Guinea-Bissau
## 2 2 2
## Guyana Haiti Honduras
## 1 2 1
## India Indonesia Iraq
## 2 2 2
## Ireland Israel Italy
## 1 1 1
## Jamaica Jordan Kazakhstan
## 1 1 1
## Kenya Kiribati Latvia
## 2 1 1
## Lebanon Lesotho Liberia
## 1 2 2
## Lithuania Luxembourg Madagascar
## 1 1 2
## Malawi Malaysia Maldives
## 2 1 1
## Mali Malta Mauritania
## 2 1 2
## Mauritius Mexico Mongolia
## 1 1 1
## Montenegro Morocco Mozambique
## 1 1 2
## Myanmar Namibia Nepal
## 2 2 2
## Netherlands Nicaragua Niger
## 1 1 2
## Nigeria Pakistan Panama
## 2 2 1
## Paraguay Peru Philippines
## 1 1 2
## Poland Portugal Romania
## 1 1 1
## Russian Federation Rwanda Samoa
## 1 2 1
## Sao Tome and Principe Senegal Serbia
## 1 2 1
## Seychelles Sierra Leone Solomon Islands
## 1 2 1
## South Africa Spain Sri Lanka
## 2 1 1
## Suriname Swaziland Sweden
## 1 2 1
## Tajikistan Thailand Timor-Leste
## 1 1 2
## Togo Tonga Trinidad and Tobago
## 2 1 1
## Tunisia Turkey Turkmenistan
## 1 1 1
## Uganda Ukraine Uruguay
## 2 1 1
## Uzbekistan Vanuatu Zambia
## 1 1 2
## Zimbabwe
## 2
The average silhouette width is positive for both clusters (0.43 and 0.11), which suggests that most points are assigned to the right cluster, although the value of 0.11 indicates that the second cluster is only weakly separated.
kmean5 <- eclust(finaldata, k = 5, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
cluster5 <- fviz_cluster(kmean5, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 5 clusters")
sil5 <- fviz_silhouette(kmean5)
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 18 -0.01
## 3 3 30 0.18
## 4 4 26 0.20
## 5 5 55 0.23
grid.arrange(cluster5, sil5, ncol=2)
kmean5$cluster
## Afghanistan Albania Algeria
## 3 4 5
## Angola Argentina Armenia
## 2 4 5
## Australia Austria Azerbaijan
## 4 4 5
## Bangladesh Belarus Belgium
## 3 5 4
## Belize Benin Bhutan
## 5 3 3
## Bosnia and Herzegovina Botswana Brazil
## 5 5 5
## Bulgaria Burkina Faso Burundi
## 5 3 3
## Cabo Verde Cambodia Cameroon
## 5 5 3
## Canada Central African Republic Chad
## 4 2 2
## Chile China Colombia
## 4 5 5
## Comoros Costa Rica Croatia
## 3 5 4
## Cyprus Djibouti Dominican Republic
## 5 3 5
## Ecuador El Salvador Equatorial Guinea
## 5 5 2
## Estonia Ethiopia Fiji
## 4 3 5
## France Gabon Georgia
## 4 2 5
## Germany Ghana Greece
## 4 3 4
## Guatemala Guinea Guinea-Bissau
## 2 2 3
## Guyana Haiti Honduras
## 5 2 5
## India Indonesia Iraq
## 1 2 2
## Ireland Israel Italy
## 4 4 4
## Jamaica Jordan Kazakhstan
## 5 5 5
## Kenya Kiribati Latvia
## 3 5 4
## Lebanon Lesotho Liberia
## 5 3 2
## Lithuania Luxembourg Madagascar
## 4 4 3
## Malawi Malaysia Maldives
## 3 5 5
## Mali Malta Mauritania
## 3 4 3
## Mauritius Mexico Mongolia
## 5 5 5
## Montenegro Morocco Mozambique
## 4 5 2
## Myanmar Namibia Nepal
## 3 3 3
## Netherlands Nicaragua Niger
## 4 5 3
## Nigeria Pakistan Panama
## 2 3 5
## Paraguay Peru Philippines
## 5 2 2
## Poland Portugal Romania
## 4 4 5
## Russian Federation Rwanda Samoa
## 5 3 5
## Sao Tome and Principe Senegal Serbia
## 5 3 5
## Seychelles Sierra Leone Solomon Islands
## 5 3 5
## South Africa Spain Sri Lanka
## 3 4 5
## Suriname Swaziland Sweden
## 5 2 4
## Tajikistan Thailand Timor-Leste
## 5 5 5
## Togo Tonga Trinidad and Tobago
## 3 5 5
## Tunisia Turkey Turkmenistan
## 5 5 5
## Uganda Ukraine Uruguay
## 3 2 4
## Uzbekistan Vanuatu Zambia
## 5 5 2
## Zimbabwe
## 3
Although some clusters have a positive average silhouette width, two clusters have values close to 0 (0.00 and -0.01), and one of them contains a single country (India). This suggests that using 5 clusters is not a good choice. However, let’s carry out the same analysis for 9 clusters.
kmean9 <- eclust(finaldata, k = 9, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
cluster9 <- fviz_cluster(kmean9, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 9 clusters")
sil9 <- fviz_silhouette(kmean9)
## cluster size ave.sil.width
## 1 1 8 0.16
## 2 2 12 0.09
## 3 3 16 0.20
## 4 4 9 0.50
## 5 5 20 0.19
## 6 6 51 0.27
## 7 7 2 0.13
## 8 8 11 0.07
## 9 9 1 0.00
grid.arrange(cluster9, sil9, ncol=2)
kmean9$cluster
## Afghanistan Albania Algeria
## 1 6 6
## Angola Argentina Armenia
## 2 6 6
## Australia Austria Azerbaijan
## 4 4 6
## Bangladesh Belarus Belgium
## 1 6 6
## Belize Benin Bhutan
## 5 3 1
## Bosnia and Herzegovina Botswana Brazil
## 6 3 6
## Bulgaria Burkina Faso Burundi
## 6 5 3
## Cabo Verde Cambodia Cameroon
## 5 5 3
## Canada Central African Republic Chad
## 4 2 2
## Chile China Colombia
## 6 6 6
## Comoros Costa Rica Croatia
## 5 6 6
## Cyprus Djibouti Dominican Republic
## 6 3 6
## Ecuador El Salvador Equatorial Guinea
## 6 6 2
## Estonia Ethiopia Fiji
## 6 5 6
## France Gabon Georgia
## 4 2 6
## Germany Ghana Greece
## 4 5 6
## Guatemala Guinea Guinea-Bissau
## 8 2 3
## Guyana Haiti Honduras
## 5 2 6
## India Indonesia Iraq
## 9 7 8
## Ireland Israel Italy
## 6 4 6
## Jamaica Jordan Kazakhstan
## 6 6 6
## Kenya Kiribati Latvia
## 3 8 6
## Lebanon Lesotho Liberia
## 6 3 2
## Lithuania Luxembourg Madagascar
## 6 6 5
## Malawi Malaysia Maldives
## 3 5 1
## Mali Malta Mauritania
## 3 4 5
## Mauritius Mexico Mongolia
## 6 6 6
## Montenegro Morocco Mozambique
## 6 6 2
## Myanmar Namibia Nepal
## 1 3 1
## Netherlands Nicaragua Niger
## 4 6 5
## Nigeria Pakistan Panama
## 7 1 8
## Paraguay Peru Philippines
## 5 8 2
## Poland Portugal Romania
## 6 6 8
## Russian Federation Rwanda Samoa
## 6 5 8
## Sao Tome and Principe Senegal Serbia
## 5 5 6
## Seychelles Sierra Leone Solomon Islands
## 6 3 5
## South Africa Spain Sri Lanka
## 3 4 1
## Suriname Swaziland Sweden
## 6 2 6
## Tajikistan Thailand Timor-Leste
## 5 6 5
## Togo Tonga Trinidad and Tobago
## 3 8 8
## Tunisia Turkey Turkmenistan
## 6 6 5
## Uganda Ukraine Uruguay
## 3 8 6
## Uzbekistan Vanuatu Zambia
## 6 8 2
## Zimbabwe
## 3
The above results and plots suggest that the clusters overlap considerably, which is undesirable. Of the options tried with the K-means algorithm, classifying the countries into 2 clusters is therefore the best choice.
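One way to put the three K-means solutions on a single scale is to combine the per-cluster silhouette averages into an overall average, weighting each cluster by its size. The sketch below uses only the sizes and widths printed in the tables above. Since those widths are rounded to two decimals, the results are approximations, and the object names `kmeans_tables` and `overall_sil` are introduced here for illustration.

```r
# Cluster sizes and average silhouette widths copied from the K-means tables above
kmeans_tables <- list(
  k2 = list(size = c(82, 48), sil = c(0.43, 0.11)),
  k5 = list(size = c(1, 18, 30, 26, 55), sil = c(0.00, -0.01, 0.18, 0.20, 0.23)),
  k9 = list(size = c(8, 12, 16, 9, 20, 51, 2, 11, 1),
            sil = c(0.16, 0.09, 0.20, 0.50, 0.19, 0.27, 0.13, 0.07, 0.00))
)

# Size-weighted mean of the per-cluster averages approximates the overall average
overall_sil <- sapply(kmeans_tables, function(t) weighted.mean(t$sil, t$size))
round(overall_sil, 2)
```

On these rounded figures the 2-cluster solution has the highest overall average, which is consistent with the silhouette plot used earlier to choose the number of clusters.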
Let’s now use the PAM algorithm to check what results it gives.
pam2 <- eclust(finaldata, k = 2, FUNcluster = "pam", hc_metric = "euclidean", graph = FALSE)
cpam2 <- fviz_cluster(pam2, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("PAM with 2 clusters")
silpam2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 54 0.11
## 2 2 76 0.42
grid.arrange(cpam2, silpam2, ncol=2)
pam2$clustering
## Afghanistan Albania Algeria
## 1 2 2
## Angola Argentina Armenia
## 1 2 2
## Australia Austria Azerbaijan
## 2 2 2
## Bangladesh Belarus Belgium
## 1 2 2
## Belize Benin Bhutan
## 2 1 1
## Bosnia and Herzegovina Botswana Brazil
## 2 1 2
## Bulgaria Burkina Faso Burundi
## 2 1 1
## Cabo Verde Cambodia Cameroon
## 2 1 1
## Canada Central African Republic Chad
## 2 1 1
## Chile China Colombia
## 2 2 2
## Comoros Costa Rica Croatia
## 1 2 2
## Cyprus Djibouti Dominican Republic
## 2 1 2
## Ecuador El Salvador Equatorial Guinea
## 2 2 1
## Estonia Ethiopia Fiji
## 2 1 2
## France Gabon Georgia
## 2 1 2
## Germany Ghana Greece
## 2 1 2
## Guatemala Guinea Guinea-Bissau
## 1 1 1
## Guyana Haiti Honduras
## 1 1 2
## India Indonesia Iraq
## 1 1 1
## Ireland Israel Italy
## 2 2 2
## Jamaica Jordan Kazakhstan
## 2 2 2
## Kenya Kiribati Latvia
## 1 2 2
## Lebanon Lesotho Liberia
## 2 1 1
## Lithuania Luxembourg Madagascar
## 2 2 1
## Malawi Malaysia Maldives
## 1 2 2
## Mali Malta Mauritania
## 1 2 1
## Mauritius Mexico Mongolia
## 2 2 2
## Montenegro Morocco Mozambique
## 2 2 1
## Myanmar Namibia Nepal
## 1 1 1
## Netherlands Nicaragua Niger
## 2 2 1
## Nigeria Pakistan Panama
## 1 1 2
## Paraguay Peru Philippines
## 2 2 1
## Poland Portugal Romania
## 2 2 2
## Russian Federation Rwanda Samoa
## 2 1 2
## Sao Tome and Principe Senegal Serbia
## 1 1 2
## Seychelles Sierra Leone Solomon Islands
## 2 1 1
## South Africa Spain Sri Lanka
## 1 2 2
## Suriname Swaziland Sweden
## 2 1 2
## Tajikistan Thailand Timor-Leste
## 1 2 1
## Togo Tonga Trinidad and Tobago
## 1 2 2
## Tunisia Turkey Turkmenistan
## 2 2 2
## Uganda Ukraine Uruguay
## 1 2 2
## Uzbekistan Vanuatu Zambia
## 2 2 1
## Zimbabwe
## 1
These outputs agree with those of K-means: classifying the countries into 2 clusters is a reasonable way forward. However, let’s consider other possibilities before concluding anything.
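Because K-means and PAM number their clusters independently, the two label vectors cannot be compared directly; a cross-tabulation shows how far the partitions actually agree. The sketch below illustrates this on synthetic data. The data and the object names `toy`, `km`, `pm`, `agreement` and `matched` are assumptions for this example; for the country data the same idea would apply to `kmean2$cluster` and `pam2$clustering`.

```r
library(cluster)  # provides pam()

set.seed(7)
# Two synthetic groups of 40 points each (illustrative only)
toy <- rbind(matrix(rnorm(80, mean = 0), ncol = 2),
             matrix(rnorm(80, mean = 3), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 25)
pm <- pam(toy, k = 2)

# Rows: K-means labels, columns: PAM labels
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
agreement

# Cluster numbers are arbitrary, so take the better of the two possible matchings
matched <- max(sum(diag(agreement)), sum(agreement) - sum(diag(agreement)))
matched / sum(agreement)
```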
pam5 <- eclust(finaldata, k = 5, FUNcluster = "pam", hc_metric = "euclidean", graph = FALSE)
cpam5 <- fviz_cluster(pam5, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("PAM with 5 clusters")
silpam5 <- fviz_silhouette(pam5)
## cluster size ave.sil.width
## 1 1 46 0.13
## 2 2 65 0.31
## 3 3 9 0.53
## 4 4 9 0.30
## 5 5 1 0.00
grid.arrange(cpam5, silpam5, ncol=2)
pam5$clustering
## Afghanistan Albania Algeria
## 1 2 2
## Angola Argentina Armenia
## 1 2 2
## Australia Austria Azerbaijan
## 3 3 2
## Bangladesh Belarus Belgium
## 1 2 2
## Belize Benin Bhutan
## 2 1 1
## Bosnia and Herzegovina Botswana Brazil
## 2 1 2
## Bulgaria Burkina Faso Burundi
## 2 1 1
## Cabo Verde Cambodia Cameroon
## 2 1 1
## Canada Central African Republic Chad
## 3 1 1
## Chile China Colombia
## 2 2 2
## Comoros Costa Rica Croatia
## 1 2 2
## Cyprus Djibouti Dominican Republic
## 2 1 2
## Ecuador El Salvador Equatorial Guinea
## 2 2 4
## Estonia Ethiopia Fiji
## 2 1 2
## France Gabon Georgia
## 3 4 2
## Germany Ghana Greece
## 3 1 2
## Guatemala Guinea Guinea-Bissau
## 1 1 1
## Guyana Haiti Honduras
## 1 4 2
## India Indonesia Iraq
## 5 1 1
## Ireland Israel Italy
## 2 3 2
## Jamaica Jordan Kazakhstan
## 2 2 2
## Kenya Kiribati Latvia
## 1 2 2
## Lebanon Lesotho Liberia
## 2 1 1
## Lithuania Luxembourg Madagascar
## 2 2 1
## Malawi Malaysia Maldives
## 1 2 2
## Mali Malta Mauritania
## 1 3 1
## Mauritius Mexico Mongolia
## 2 2 2
## Montenegro Morocco Mozambique
## 2 2 4
## Myanmar Namibia Nepal
## 1 1 1
## Netherlands Nicaragua Niger
## 3 2 1
## Nigeria Pakistan Panama
## 1 1 2
## Paraguay Peru Philippines
## 2 4 4
## Poland Portugal Romania
## 2 2 2
## Russian Federation Rwanda Samoa
## 2 1 2
## Sao Tome and Principe Senegal Serbia
## 1 1 2
## Seychelles Sierra Leone Solomon Islands
## 2 1 1
## South Africa Spain Sri Lanka
## 1 3 2
## Suriname Swaziland Sweden
## 2 4 2
## Tajikistan Thailand Timor-Leste
## 1 2 1
## Togo Tonga Trinidad and Tobago
## 1 2 2
## Tunisia Turkey Turkmenistan
## 2 2 2
## Uganda Ukraine Uruguay
## 1 4 2
## Uzbekistan Vanuatu Zambia
## 2 2 4
## Zimbabwe
## 1
The result suggests that 5 clusters is not a bad choice. However, the overlap between clusters seen in the plot, the negative silhouette widths of some countries and the presence of a single-country cluster suggest that it is worth considering more options.
pam9 <- eclust(finaldata, k = 9, FUNcluster = "pam", hc_metric = "euclidean", graph = FALSE)
cpam9 <- fviz_cluster(pam9, finaldata, ellipse.type = "convex", geom = c("point")) + ggtitle("PAM with 9 clusters")
silpam9 <- fviz_silhouette(pam9)
## cluster size ave.sil.width
## 1 1 8 0.19
## 2 2 1 0.00
## 3 3 54 0.28
## 4 4 33 0.18
## 5 5 9 0.50
## 6 6 13 0.27
## 7 7 9 0.24
## 8 8 1 0.00
## 9 9 2 0.14
grid.arrange(cpam9, silpam9, ncol=2)
pam9$clustering
## Afghanistan Albania Algeria
## 1 2 3
## Angola Argentina Armenia
## 4 3 3
## Australia Austria Azerbaijan
## 5 5 3
## Bangladesh Belarus Belgium
## 1 3 3
## Belize Benin Bhutan
## 6 4 1
## Bosnia and Herzegovina Botswana Brazil
## 3 4 3
## Bulgaria Burkina Faso Burundi
## 3 4 4
## Cabo Verde Cambodia Cameroon
## 3 6 4
## Canada Central African Republic Chad
## 5 4 4
## Chile China Colombia
## 3 3 3
## Comoros Costa Rica Croatia
## 4 6 3
## Cyprus Djibouti Dominican Republic
## 6 4 3
## Ecuador El Salvador Equatorial Guinea
## 3 3 7
## Estonia Ethiopia Fiji
## 3 4 3
## France Gabon Georgia
## 5 7 3
## Germany Ghana Greece
## 5 4 3
## Guatemala Guinea Guinea-Bissau
## 6 4 4
## Guyana Haiti Honduras
## 4 7 3
## India Indonesia Iraq
## 8 9 4
## Ireland Israel Italy
## 3 5 3
## Jamaica Jordan Kazakhstan
## 3 3 3
## Kenya Kiribati Latvia
## 4 3 3
## Lebanon Lesotho Liberia
## 3 4 4
## Lithuania Luxembourg Madagascar
## 3 3 4
## Malawi Malaysia Maldives
## 4 6 1
## Mali Malta Mauritania
## 4 5 4
## Mauritius Mexico Mongolia
## 3 3 3
## Montenegro Morocco Mozambique
## 3 3 7
## Myanmar Namibia Nepal
## 1 4 1
## Netherlands Nicaragua Niger
## 5 3 4
## Nigeria Pakistan Panama
## 9 1 3
## Paraguay Peru Philippines
## 6 7 7
## Poland Portugal Romania
## 3 3 6
## Russian Federation Rwanda Samoa
## 6 4 3
## Sao Tome and Principe Senegal Serbia
## 6 4 6
## Seychelles Sierra Leone Solomon Islands
## 3 4 6
## South Africa Spain Sri Lanka
## 4 5 1
## Suriname Swaziland Sweden
## 3 7 3
## Tajikistan Thailand Timor-Leste
## 6 3 4
## Togo Tonga Trinidad and Tobago
## 4 3 3
## Tunisia Turkey Turkmenistan
## 3 3 3
## Uganda Ukraine Uruguay
## 4 7 3
## Uzbekistan Vanuatu Zambia
## 3 3 7
## Zimbabwe
## 4
Again, the same issue of overlapping clusters exists in this case. Considering all the options, using 2 clusters seems to be the best choice.
hcluster2 <- eclust(finaldata, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hcluster2, cex=0.6, hang=-1, main = "Dendrogram for 2 clusters")
rect.hclust(hcluster2, k=2, border='blue')
hcluster5 <- eclust(finaldata, k=5, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hcluster5, cex=0.6, hang=-1, main = "Dendrogram for 5 clusters")
rect.hclust(hcluster5, k=5, border='blue')
hcluster9 <- eclust(finaldata, k=9, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hcluster9, cex=0.6, hang=-1, main = "Dendrogram for 9 clusters")
rect.hclust(hcluster9, k=9, border='blue')
Considering all the possibilities, 2 clusters is the best option for the analysis. This conclusion is the result of analysing and comparing the results of the K-means, PAM and hierarchical clustering techniques applied with different numbers of clusters.
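For the hierarchical part, it is worth noting that a single tree can be cut at any number of clusters, so the tree does not have to be rebuilt for each k. The sketch below shows this with base R's hclust() and cutree() on synthetic data. It is an alternative to the eclust() calls above rather than a reproduction of them, and the data and the object names `toy`, `hc` and `sizes` are assumptions for illustration.

```r
set.seed(3)
# Two synthetic groups of 20 points each, standardised like the country data
toy <- scale(rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                   matrix(rnorm(40, mean = 4), ncol = 2)))
rownames(toy) <- paste0("obs", seq_len(nrow(toy)))

# Complete-linkage agglomerative clustering on Euclidean distances
hc <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Cut the same tree at several levels and compare cluster sizes
sizes <- lapply(c(2, 5, 9), function(k) table(cutree(hc, k = k)))
names(sizes) <- c("k2", "k5", "k9")
sizes
```

Very small clusters appearing as k grows are a sign that the extra cuts are isolating outliers instead of revealing new groups.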
str(finaldata)
## num [1:130, 1:17] -0.718 0.883 0.608 -2.293 0.695 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:130] "Afghanistan" "Albania" "Algeria" "Angola" ...
## ..$ : chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
## - attr(*, "scaled:center")= Named num [1:17] 70.74 158.68 27.72 3.36 80.65 ...
## ..- attr(*, "names")= chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
## - attr(*, "scaled:scale")= Named num [1:17] 8 99.5 96.4 32.6 25 ...
## ..- attr(*, "names")= chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
summary(finaldata)
## Life expectancy Adult Mortality infant deaths percentage expenditure
## Min. :-2.4685 Min. :-1.5849 Min. :-0.28750 Min. :-0.103
## 1st Qu.:-0.6273 1st Qu.:-0.8009 1st Qu.:-0.28750 1st Qu.:-0.103
## Median : 0.1761 Median :-0.1325 Median :-0.25639 Median :-0.103
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.6606 3rd Qu.: 0.5661 3rd Qu.:-0.07491 3rd Qu.:-0.103
## Max. : 1.7829 Max. : 3.2701 Max. : 9.14972 Max. :11.104
## Hepatitis B Measles BMI under-five deaths
## Min. :-2.9870 Min. :-0.1942 Min. :-1.79432 Min. :-0.29658
## 1st Qu.:-0.1362 1st Qu.:-0.1942 1st Qu.:-0.81184 1st Qu.:-0.28851
## Median : 0.4140 Median :-0.1924 Median : 0.03618 Median :-0.27238
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6140 3rd Qu.:-0.1712 3rd Qu.: 0.97502 3rd Qu.:-0.09085
## Max. : 0.7341 Max. : 9.7050 Max. : 1.74874 Max. : 8.57783
## Polio Diphtheria HIV/AIDS GDP
## Min. :-2.9535 Min. :-3.2996 Min. :-0.4508 Min. :-0.59040
## 1st Qu.:-0.1130 1st Qu.:-0.2081 1st Qu.:-0.4508 1st Qu.:-0.52421
## Median : 0.4080 Median : 0.4145 Median :-0.4508 Median :-0.37280
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6242 3rd Qu.: 0.6077 3rd Qu.:-0.1877 3rd Qu.: 0.03107
## Max. : 0.7028 Max. : 0.6936 Max. : 5.6010 Max. : 5.00462
## Population thinness 10-19 years thinness 5-9 years
## Min. :-0.38534 Min. :-1.0457 Min. :-1.0455
## 1st Qu.:-0.37741 1st Qu.:-0.7204 1st Qu.:-0.7216
## Median :-0.32265 Median :-0.2924 Median :-0.2972
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.02251 3rd Qu.: 0.4153 3rd Qu.: 0.4008
## Max. : 8.16959 Max. : 5.0265 Max. : 5.0298
## Income composition of resources Schooling
## Min. :-2.1677 Min. :-2.661775
## 1st Qu.:-0.8220 1st Qu.:-0.705366
## Median : 0.1766 Median : 0.001115
## Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.7222 3rd Qu.: 0.698538
## Max. : 1.7340 Max. : 2.772694
correlation <- cor(finaldata, method = 'pearson')
round(correlation, 2) # Rounding off the values to 2 decimal digits
## Life expectancy Adult Mortality infant deaths
## Life expectancy 1.00 -0.73 -0.21
## Adult Mortality -0.73 1.00 0.15
## infant deaths -0.21 0.15 1.00
## percentage expenditure 0.06 -0.06 -0.02
## Hepatitis B 0.37 -0.13 -0.08
## Measles -0.05 0.03 0.82
## BMI 0.54 -0.35 -0.21
## under-five deaths -0.24 0.18 0.99
## Polio 0.49 -0.30 -0.12
## Diphtheria 0.47 -0.23 -0.11
## HIV/AIDS -0.62 0.63 0.07
## GDP 0.49 -0.31 -0.12
## Population -0.03 0.03 0.27
## thinness 10-19 years -0.46 0.25 0.56
## thinness 5-9 years -0.45 0.26 0.56
## Income composition of resources 0.90 -0.59 -0.20
## Schooling 0.81 -0.47 -0.22
## percentage expenditure Hepatitis B Measles
## Life expectancy 0.06 0.37 -0.05
## Adult Mortality -0.06 -0.13 0.03
## infant deaths -0.02 -0.08 0.82
## percentage expenditure 1.00 0.05 -0.02
## Hepatitis B 0.05 1.00 0.03
## Measles -0.02 0.03 1.00
## BMI 0.05 0.15 -0.13
## under-five deaths -0.02 -0.09 0.79
## Polio 0.01 0.50 -0.01
## Diphtheria 0.05 0.90 0.02
## HIV/AIDS -0.05 -0.34 -0.04
## GDP -0.03 0.09 -0.07
## Population -0.02 -0.05 0.13
## thinness 10-19 years -0.02 -0.04 0.38
## thinness 5-9 years -0.02 -0.09 0.37
## Income composition of resources 0.03 0.28 -0.06
## Schooling 0.03 0.30 -0.06
## BMI under-five deaths Polio Diphtheria
## Life expectancy 0.54 -0.24 0.49 0.47
## Adult Mortality -0.35 0.18 -0.30 -0.23
## infant deaths -0.21 0.99 -0.12 -0.11
## percentage expenditure 0.05 -0.02 0.01 0.05
## Hepatitis B 0.15 -0.09 0.50 0.90
## Measles -0.13 0.79 -0.01 0.02
## BMI 1.00 -0.22 0.20 0.17
## under-five deaths -0.22 1.00 -0.14 -0.13
## Polio 0.20 -0.14 1.00 0.58
## Diphtheria 0.17 -0.13 0.58 1.00
## HIV/AIDS -0.27 0.10 -0.38 -0.41
## GDP 0.39 -0.12 0.22 0.20
## Population 0.01 0.31 -0.23 -0.05
## thinness 10-19 years -0.49 0.55 -0.18 -0.08
## thinness 5-9 years -0.51 0.54 -0.18 -0.13
## Income composition of resources 0.62 -0.22 0.44 0.40
## Schooling 0.61 -0.24 0.39 0.39
## HIV/AIDS GDP Population thinness 10-19 years
## Life expectancy -0.62 0.49 -0.03 -0.46
## Adult Mortality 0.63 -0.31 0.03 0.25
## infant deaths 0.07 -0.12 0.27 0.56
## percentage expenditure -0.05 -0.03 -0.02 -0.02
## Hepatitis B -0.34 0.09 -0.05 -0.04
## Measles -0.04 -0.07 0.13 0.38
## BMI -0.27 0.39 0.01 -0.49
## under-five deaths 0.10 -0.12 0.31 0.55
## Polio -0.38 0.22 -0.23 -0.18
## Diphtheria -0.41 0.20 -0.05 -0.08
## HIV/AIDS 1.00 -0.19 0.02 0.17
## GDP -0.19 1.00 0.07 -0.29
## Population 0.02 0.07 1.00 -0.01
## thinness 10-19 years 0.17 -0.29 -0.01 1.00
## thinness 5-9 years 0.15 -0.29 -0.02 0.97
## Income composition of resources -0.48 0.57 0.03 -0.51
## Schooling -0.39 0.57 0.05 -0.50
## thinness 5-9 years
## Life expectancy -0.45
## Adult Mortality 0.26
## infant deaths 0.56
## percentage expenditure -0.02
## Hepatitis B -0.09
## Measles 0.37
## BMI -0.51
## under-five deaths 0.54
## Polio -0.18
## Diphtheria -0.13
## HIV/AIDS 0.15
## GDP -0.29
## Population -0.02
## thinness 10-19 years 0.97
## thinness 5-9 years 1.00
## Income composition of resources -0.50
## Schooling -0.49
## Income composition of resources Schooling
## Life expectancy 0.90 0.81
## Adult Mortality -0.59 -0.47
## infant deaths -0.20 -0.22
## percentage expenditure 0.03 0.03
## Hepatitis B 0.28 0.30
## Measles -0.06 -0.06
## BMI 0.62 0.61
## under-five deaths -0.22 -0.24
## Polio 0.44 0.39
## Diphtheria 0.40 0.39
## HIV/AIDS -0.48 -0.39
## GDP 0.57 0.57
## Population 0.03 0.05
## thinness 10-19 years -0.51 -0.50
## thinness 5-9 years -0.50 -0.49
## Income composition of resources 1.00 0.92
## Schooling 0.92 1.00
corrplot(correlation, type = 'lower')
The plot shows that several variables are strongly correlated with one another, for example ‘infant deaths’ with ‘under-five deaths’ (0.99) and ‘thinness 10-19 years’ with ‘thinness 5-9 years’ (0.97). This means we can use a dimension reduction technique to simplify the analysis.
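With 17 variables the correlation matrix is hard to scan by eye, so it can help to list only the pairs above a chosen cutoff. The sketch below does this on made-up data. The data, the 0.9 cutoff and the object names `toy`, `cm`, `idx` and `high_pairs` are assumptions for illustration; the same steps could be applied to the `correlation` matrix computed above.

```r
set.seed(11)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b = a + rnorm(100, sd = 0.1),   # nearly a copy of a
                  c = rnorm(100),
                  d = rnorm(100))
cm <- cor(toy)

# Keep each pair once (upper triangle) and filter on an assumed 0.9 cutoff
cutoff <- 0.9
idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r    = round(cm[idx], 2))
high_pairs
```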
pca <- prcomp(finaldata, center = TRUE, scale. = TRUE)
fviz_eig(pca)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4628 1.7347 1.3759 1.0696 1.00512 0.9475 0.88970
## Proportion of Variance 0.3568 0.1770 0.1114 0.0673 0.05943 0.0528 0.04656
## Cumulative Proportion 0.3568 0.5338 0.6452 0.7125 0.77191 0.8247 0.87127
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.74603 0.67020 0.63889 0.5638 0.42917 0.32022 0.30815
## Proportion of Variance 0.03274 0.02642 0.02401 0.0187 0.01083 0.00603 0.00559
## Cumulative Proportion 0.90401 0.93043 0.95444 0.9731 0.98398 0.99001 0.99559
## PC15 PC16 PC17
## Standard deviation 0.21087 0.16581 0.05414
## Proportion of Variance 0.00262 0.00162 0.00017
## Cumulative Proportion 0.99821 0.99983 1.00000
The above results show that the first component, PC1, explains the most variance of any single component, around 35.68% of the total. However, this share is too small to retain only this component. So, let’s look at the eigenvalue of each component.
eigen(cor(finaldata))$values
## [1] 6.065615085 3.009336738 1.893096958 1.144090950 1.010267284 0.897680014
## [7] 0.791562083 0.556563930 0.449167193 0.408180830 0.317863416 0.184191068
## [13] 0.102539868 0.094953774 0.044467282 0.027492730 0.002930797
fviz_eig(pca, choice='eigenvalue')
Looking at the results above and at the scree plot, it becomes clear that 5 components have an eigenvalue greater than 1. As a general rule of thumb (the Kaiser criterion), it is good practice to retain the components with an eigenvalue of 1 or more. This means the first five principal components, which together explain about 77.19% of the total variance, are retained for further analysis.
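The eigenvalue rule can be checked directly from the numbers printed above. The sketch below copies those eigenvalues and counts how many exceed 1, then computes the share of variance the retained components cover. The object names `ev`, `n_keep` and `cum_share` are introduced here for illustration.

```r
# Eigenvalues copied from the output above
ev <- c(6.065615085, 3.009336738, 1.893096958, 1.144090950, 1.010267284,
        0.897680014, 0.791562083, 0.556563930, 0.449167193, 0.408180830,
        0.317863416, 0.184191068, 0.102539868, 0.094953774, 0.044467282,
        0.027492730, 0.002930797)

# Kaiser rule: keep components whose eigenvalue exceeds 1
n_keep <- sum(ev > 1)

# Share of total variance captured by the retained components
cum_share <- cumsum(ev) / sum(ev)
c(components = n_keep, cumulative_share = round(cum_share[n_keep], 4))
```

Since the variables are standardised, the eigenvalues sum to the number of variables (17), so an eigenvalue above 1 means the component carries more variance than any single original variable.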
variablepca <- get_pca_var(pca)
options(ggrepel.max.overlaps = Inf) # increasing the overlap capacity
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib",
gradient.cols = c("blue", "green", "red"), repel = TRUE)
fviz_contrib(pca, choice = "var", axes = 1:5)
The plot above suggests that the variables ‘infant deaths’, ‘under-five deaths’, ‘percentage expenditure’, ‘Diphtheria’, ‘Life expectancy’, ‘Hepatitis B’ and ‘Income composition of resources’ make the largest contributions. Apart from these, ‘thinness 5-9 years’, ‘Schooling’, ‘thinness 10-19 years’ and ‘Adult Mortality’ are also important contributors to dimensions 1 to 5.
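To show where such contribution figures come from, the sketch below computes them by hand from a prcomp() fit. It uses the built-in USArrests data as a stand-in for the country data. The eigenvalue weighting used to combine several components is an assumption about how such a summary is typically built, not a statement of what fviz_contrib() does internally. The object names `pca_toy`, `contrib` and `combined` are introduced for illustration.

```r
# Built-in USArrests data stands in for the scaled country data
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to each component:
# squared loadings, since every column of the rotation matrix has unit length
contrib <- 100 * pca_toy$rotation^2

# Combined contribution over the first two components,
# weighting each component by its eigenvalue (variance)
eig <- pca_toy$sdev^2
axes <- 1:2
combined <- as.vector(contrib[, axes] %*% eig[axes]) / sum(eig[axes])
names(combined) <- rownames(contrib)
sort(combined, decreasing = TRUE)
```

With p variables, a variable contributing more than 100 / p percent contributes more than it would if all variables contributed equally, which gives a simple reference level for reading the bar chart.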