Clustering and Dimension reduction for nations on the basis of various socio-economic factors

Introduction

Many past studies have clustered countries on the basis of GDP and life expectancy, but few have considered other socio-economic factors such as mortality rate and level of schooling, and their effect has not been widely publicised. This study therefore applies several clustering techniques, such as K-means and PAM, to gain some insight into the dataset made available by Deeksha Russell and Duan Wang, who gathered the data from the WHO and United Nations websites.

Loading the necessary libraries

library(factoextra)
library(clValid)
library(flexclust)
library(clustertend)
library(cluster)
library(ClusterR)
library(readxl)
library(fpc)
library(gridExtra)
library(corrplot)

Importing the dataset (XLSX format)

data <- read_excel("dataset.xlsx")

About the dataset

The dataset includes 183 observations (countries) and 22 variables. The variables are described as follows:

Country - Names of the countries
Year - Year of observation (in this case only 2015)
Status - whether developed or developing
Life Expectancy - Average time a citizen of any country is expected to live(in years)
Adult Mortality - Probability of dying between 15 and 60 years per 1000 population
Infant deaths - Number of Infant Deaths per 1000 population
Alcohol - Alcohol, recorded per capita (15+) consumption (in litres)
Percentage expenditure - Expenditure on health as a percentage of GDP per capita (%)
Hepatitis B - Immunization coverage among 1-year olds (%)
Measles - Number of reported cases per 1000 population
BMI - Average Body Mass Index of entire population
Under-five deaths - Number of under-five deaths per 1000 population
Polio - Immunization coverage among 1-year olds (%)
Total expenditure - Government expenditure on health as a percentage of total government expenditure (%)
Diphtheria - Immunization coverage among 1-year olds (%)
HIV/AIDS - HIV/AIDS deaths per 1,000 live births (0-4 years)
GDP - Gross Domestic Product per capita (in USD)
Population - Population of the country
Thinness 10-19 years - Prevalence of thinness among children and adolescents aged 10 to 19 (%)
Thinness 5-9 years - Prevalence of thinness among children aged 5 to 9 (%)
Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling - Number of years of Schooling

The data on life expectancy and health factors for 183 countries were collected from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), and the corresponding economic data were collected from the United Nations website, all for the year 2015.

Basic Exploratory Data Analysis

Viewing some initial observations

head(data)

## # A tibble: 6 x 22
##   Country  Year Status `Life expectanc~ `Adult Mortalit~ `infant deaths` Alcohol
##   <chr>   <dbl> <chr>             <dbl>            <dbl>           <dbl>   <dbl>
## 1 Afghan~  2015 Devel~             65                263              62    0.01
## 2 Albania  2015 Devel~             77.8               74               0    4.6 
## 3 Algeria  2015 Devel~             75.6               19              21   NA   
## 4 Angola   2015 Devel~             52.4              335              66   NA   
## 5 Antigu~  2015 Devel~             76.4               13               0   NA   
## 6 Argent~  2015 Devel~             76.3              116               8   NA   
## # ... with 15 more variables: percentage expenditure <dbl>, Hepatitis B <dbl>,
## #   Measles <dbl>, BMI <dbl>, under-five deaths <dbl>, Polio <dbl>,
## #   Total expenditure <dbl>, Diphtheria <dbl>, HIV/AIDS <dbl>, GDP <dbl>,
## #   Population <dbl>, thinness  10-19 years <dbl>, thinness 5-9 years <dbl>,
## #   Income composition of resources <dbl>, Schooling <dbl>

Structure of the dataset

str(data)

## tibble [183 x 22] (S3: tbl_df/tbl/data.frame)
##  $ Country                        : chr [1:183] "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Year                           : num [1:183] 2015 2015 2015 2015 2015 ...
##  $ Status                         : chr [1:183] "Developing" "Developing" "Developing" "Developing" ...
##  $ Life expectancy                : num [1:183] 65 77.8 75.6 52.4 76.4 76.3 74.8 82.8 81.5 72.7 ...
##  $ Adult Mortality                : num [1:183] 263 74 19 335 13 116 118 59 65 118 ...
##  $ infant deaths                  : num [1:183] 62 0 21 66 0 8 1 1 0 5 ...
##  $ Alcohol                        : num [1:183] 0.01 4.6 NA NA NA NA NA NA NA NA ...
##  $ percentage expenditure         : num [1:183] 71.3 365 0 0 0 ...
##  $ Hepatitis B                    : num [1:183] 65 99 95 64 99 94 94 93 93 96 ...
##  $ Measles                        : num [1:183] 1154 0 63 118 0 ...
##  $ BMI                            : num [1:183] 19.1 58 59.5 23.3 47.7 62.8 54.9 66.6 57.6 52.5 ...
##  $ under-five deaths              : num [1:183] 83 0 24 98 0 9 1 1 0 6 ...
##  $ Polio                          : num [1:183] 6 99 95 7 86 93 96 93 93 98 ...
##  $ Total expenditure              : num [1:183] 8.16 6 NA NA NA NA NA NA NA NA ...
##  $ Diphtheria                     : num [1:183] 65 99 95 64 99 94 94 93 93 96 ...
##  $ HIV/AIDS                       : num [1:183] 0.1 0.1 0.1 1.9 0.2 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num [1:183] 584 3954 4133 3696 13567 ...
##  $ Population                     : num [1:183] 33736494 28873 39871528 2785935 NA ...
##  $ thinness  10-19 years          : num [1:183] 17.2 1.2 6 8.3 3.3 1 2.1 0.6 1.9 2.8 ...
##  $ thinness 5-9 years             : num [1:183] 17.3 1.3 5.8 8.2 3.3 0.9 2.2 0.6 2.1 2.9 ...
##  $ Income composition of resources: num [1:183] 0.479 0.762 0.743 0.531 0.784 0.826 0.741 0.937 0.892 0.758 ...
##  $ Schooling                      : num [1:183] 10.1 14.2 14.4 11.4 13.9 17.3 12.7 20.4 15.9 12.7 ...

As the output above shows, there are some missing values, and some variables are probably not important for further analysis, so the dataset needs cleaning.

Data Cleaning

# Removing the Year and Status variables
data <- data[,c(-2,-3)]

Checking for column-wise missing values

sapply(data, function(x) sum(is.na(x)))

##                         Country                 Life expectancy 
##                               0                               0 
##                 Adult Mortality                   infant deaths 
##                               0                               0 
##                         Alcohol          percentage expenditure 
##                             177                               0 
##                     Hepatitis B                         Measles 
##                               9                               0 
##                             BMI               under-five deaths 
##                               2                               0 
##                           Polio               Total expenditure 
##                               0                             181 
##                      Diphtheria                        HIV/AIDS 
##                               0                               0 
##                             GDP                      Population 
##                              29                              41 
##           thinness  10-19 years              thinness 5-9 years 
##                               2                               2 
## Income composition of resources                       Schooling 
##                              10                              10

Missing values exist in half of the variables (10 of 20). However, two variables, “Alcohol” and “Total expenditure”, have a very large number of NAs (177 and 181 of the 183 observations, respectively), so it is better to drop these two variables.
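As a rough illustration of this decision, the share of missing values per column can be compared against a cut-off. The sketch below uses a small made-up data frame (`toy`) and an arbitrary 50% threshold; both are assumptions for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for the real dataset (all values are made up)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Alcohol = c(0.01, NA, NA, NA),     # mostly missing
  GDP     = c(584, 3954, NA, 3696)   # one missing value
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Columns whose missing share exceeds the (assumed) 50% threshold
drop_cols <- names(na_share)[na_share > 0.5]

# Keep only the remaining columns
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(na_share)
print(names(toy_reduced))
```

Dropping a column that is almost entirely missing keeps far more rows than letting `na.omit()` discard every country with an NA in that column.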

Dropping missing values

data[c("Alcohol","Total expenditure")] <- NULL
data <- na.omit(data)

After cleaning the dataset and removing the missing values, the dataset contains information about 130 countries in 18 columns: the country name and 17 numeric factors.
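To see why the row count shrinks, here is a minimal sketch on made-up data: `na.omit()` removes every row that has at least one NA, and `complete.cases()` lets one count the surviving rows beforehand. The data frame `toy` is hypothetical.

```r
# Made-up data: rows 2 and 3 each contain one missing value
toy <- data.frame(
  GDP       = c(584, NA, 4133, 3696),
  Schooling = c(10.1, 14.2, NA, 11.4)
)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# na.omit() keeps exactly those rows
toy_clean <- na.omit(toy)

print(n_complete)
print(nrow(toy_clean))
```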

Data pre-processing

The variable with the names of the nations is of character type, so let’s use it as the row names instead of keeping it as a variable.

# Convert to a plain data frame first: setting row names on a tibble is deprecated
finaldata <- as.data.frame(data[, 2:18])
rownames(finaldata) <- data$Country

finaldata <- scale(finaldata)    # Standardise each variable to mean 0 and standard deviation 1
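Scaling matters here because the clustering methods used below rely on Euclidean distance, so a variable measured in thousands (such as GDP) would otherwise dominate one measured in tens (such as Schooling). A minimal sketch of what `scale()` does, using made-up values in a hypothetical data frame `toy`:

```r
# Made-up values on very different scales
toy <- data.frame(
  GDP       = c(584, 3954, 4133, 3696),
  Schooling = c(10.1, 14.2, 14.4, 11.4)
)

# scale() centres each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same computation done by hand for one column
manual_gdp <- (toy$GDP - mean(toy$GDP)) / sd(toy$GDP)

print(round(colMeans(scaled), 10))   # column means after scaling
print(apply(scaled, 2, sd))          # column standard deviations after scaling
```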

Checking the clustering tendency of the dataset

clusterable <- get_clust_tendency(finaldata, n = nrow(finaldata)-1, graph = FALSE)
clusterable$hopkins_stat

## [1] 0.8615404

The Hopkins statistic (0.8615404) is well above 0.5, which suggests that the dataset is highly clusterable.
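To make the idea behind the statistic concrete, here is a simplified base-R sketch of one common formulation: compare nearest-neighbour distances of sampled real points (`w`) with those of uniformly random points (`u`) and report `sum(u) / (sum(u) + sum(w))`. Under this formulation values near 0.5 indicate random data and values near 1 indicate clustered data. The function name `hopkins_sketch` and the toy data are hypothetical, and `get_clust_tendency()` may differ in its details.

```r
# Simplified Hopkins statistic (illustrative sketch, not the factoextra code)
hopkins_sketch <- function(x, m) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # Distance from one point to its nearest neighbour among the rows of pts
  nn_dist <- function(point, pts) {
    min(sqrt(colSums((t(pts) - point)^2)))
  }

  # w: nearest-neighbour distances for m sampled real observations
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  # u: nearest-neighbour distances for m points drawn uniformly
  #    within the range of each variable
  rng <- apply(x, 2, range)
  u_pts <- matrix(runif(m * p,
                        min = rep(rng[1, ], each = m),
                        max = rep(rng[2, ], each = m)),
                  nrow = m)
  u <- sapply(seq_len(m), function(i) nn_dist(u_pts[i, ], x))

  sum(u) / (sum(u) + sum(w))
}

# Made-up data with two tight, well-separated groups
set.seed(42)
clustered <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
h <- hopkins_sketch(clustered, m = 10)
print(h)
```

Because the statistic depends on random sampling, calling `set.seed()` beforehand keeps the value reproducible.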

Finding optimal number of clusters

To find the optimal number of clusters, several methods are used and their results compared.

Using the silhouette statistic

f1 <- fviz_nbclust(finaldata, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")

f2 <- fviz_nbclust(finaldata, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

The plots above suggest that, with both the K-means and the PAM algorithm, 2 clusters is the best choice for this dataset.
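The quantity behind these plots is the average silhouette width for each candidate number of clusters. A minimal sketch of that computation for K-means, using `cluster::silhouette()` on made-up data with three well-separated groups (the object `toy` and the range 2 to 6 are assumptions for illustration):

```r
library(cluster)  # for silhouette()

set.seed(1)
# Three well-separated made-up groups in two dimensions
toy <- rbind(
  cbind(rnorm(30, mean = 0,  sd = 0.5), rnorm(30, mean = 0,  sd = 0.5)),
  cbind(rnorm(30, mean = 10, sd = 0.5), rnorm(30, mean = 0,  sd = 0.5)),
  cbind(rnorm(30, mean = 0,  sd = 0.5), rnorm(30, mean = 10, sd = 0.5))
)
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the highest average silhouette width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])

print(round(avg_sil, 3))
print(best_k)
```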

Using the total within-cluster sum of squares

f3 <- fviz_nbclust(finaldata, FUNcluster = kmeans, method = "wss") + 
  ggtitle("Optimal number of clusters \n K-means")

f4 <- fviz_nbclust(finaldata, FUNcluster = cluster::pam, method = "wss") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f3, f4, ncol=2)

This plot is harder to read, as there is no obvious answer as to how many clusters would be ideal. Looking at the plot, one could argue for 2, 5 or 9 clusters, so all three possibilities are tried out in the analysis.

Using the elbow method

k_max <- 10

wss <- sapply(1:k_max, function(k){kmeans(finaldata, k, 
                                          nstart=50,iter.max = 1000 )$tot.withinss})

wss

##  [1] 2193.0000 1632.7573 1350.9019 1191.8486 1057.0326  944.5931  814.7547
##  [8]  728.3500  653.1678  604.9573

plot(1:k_max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

This plot further supports using 2 clusters for the dataset, as the elbow appears at 2.
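The elbow can also be read off numerically from the `wss` values printed above by looking at how much each additional cluster reduces the total. The sketch below copies those values; the names `drops` and `remaining` are introduced here for illustration. As a sanity check, for standardised data with n rows and p columns the total sum of squares at k = 1 is (n - 1) × p, and 129 × 17 = 2193 matches the first value.

```r
# Total within-cluster sum of squares for k = 1..10, copied from the output above
wss <- c(2193.0000, 1632.7573, 1350.9019, 1191.8486, 1057.0326,
         944.5931, 814.7547, 728.3500, 653.1678, 604.9573)

# Reduction in WSS gained by moving from k - 1 to k clusters
drops <- -diff(wss)
names(drops) <- paste0(1:9, "->", 2:10)

# Share of the k = 1 sum of squares that remains within clusters at each k
remaining <- wss / wss[1]

print(round(drops, 2))
print(round(remaining, 3))
```

The largest reduction by far (about 560) comes from going from 1 to 2 clusters; the next step gains about 282, and later steps gain less.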

K-means clustering

Considering 2 clusters

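kmean2 <- eclust(finaldata, k = 2, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)

# Plot the clusters on the first two principal components, with a convex hull around each cluster
cluster2 <- fviz_cluster(kmean2, finaldata, ellipse.type = "convex", geom = "point") + ggtitle("K-means with 2 clusters")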

sil2 <- fviz_silhouette(kmean2)

##   cluster size ave.sil.width
## 1       1   82          0.43
## 2       2   48          0.11

grid.arrange(cluster2, sil2, ncol=2)

kmean2$cluster

##              Afghanistan                  Albania                  Algeria 
##                        2                        1                        1 
##                   Angola                Argentina                  Armenia 
##                        2                        1                        1 
##                Australia                  Austria               Azerbaijan 
##                        1                        1                        1 
##               Bangladesh                  Belarus                  Belgium 
##                        2                        1                        1 
##                   Belize                    Benin                   Bhutan 
##                        1                        2                        2 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        1                        1                        1 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        1                        2                        2 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        1                        1                        2 
##                   Canada Central African Republic                     Chad 
##                        1                        2                        2 
##                    Chile                    China                 Colombia 
##                        1                        1                        1 
##                  Comoros               Costa Rica                  Croatia 
##                        2                        1                        1 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        1                        2                        1 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        1                        1                        2 
##                  Estonia                 Ethiopia                     Fiji 
##                        1                        2                        1 
##                   France                    Gabon                  Georgia 
##                        1                        2                        1 
##                  Germany                    Ghana                   Greece 
##                        1                        2                        1 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        2                        2                        2 
##                   Guyana                    Haiti                 Honduras 
##                        1                        2                        1 
##                    India                Indonesia                     Iraq 
##                        2                        2                        2 
##                  Ireland                   Israel                    Italy 
##                        1                        1                        1 
##                  Jamaica                   Jordan               Kazakhstan 
##                        1                        1                        1 
##                    Kenya                 Kiribati                   Latvia 
##                        2                        1                        1 
##                  Lebanon                  Lesotho                  Liberia 
##                        1                        2                        2 
##                Lithuania               Luxembourg               Madagascar 
##                        1                        1                        2 
##                   Malawi                 Malaysia                 Maldives 
##                        2                        1                        1 
##                     Mali                    Malta               Mauritania 
##                        2                        1                        2 
##                Mauritius                   Mexico                 Mongolia 
##                        1                        1                        1 
##               Montenegro                  Morocco               Mozambique 
##                        1                        1                        2 
##                  Myanmar                  Namibia                    Nepal 
##                        2                        2                        2 
##              Netherlands                Nicaragua                    Niger 
##                        1                        1                        2 
##                  Nigeria                 Pakistan                   Panama 
##                        2                        2                        1 
##                 Paraguay                     Peru              Philippines 
##                        1                        1                        2 
##                   Poland                 Portugal                  Romania 
##                        1                        1                        1 
##       Russian Federation                   Rwanda                    Samoa 
##                        1                        2                        1 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        1                        2                        1 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        1                        2                        1 
##             South Africa                    Spain                Sri Lanka 
##                        2                        1                        1 
##                 Suriname                Swaziland                   Sweden 
##                        1                        2                        1 
##               Tajikistan                 Thailand              Timor-Leste 
##                        1                        1                        2 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        2                        1                        1 
##                  Tunisia                   Turkey             Turkmenistan 
##                        1                        1                        1 
##                   Uganda                  Ukraine                  Uruguay 
##                        2                        1                        1 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        1                        1                        2 
##                 Zimbabwe 
##                        2

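The output above suggests that this is an acceptable fit for the dataset, as the average silhouette width of both clusters is positive. Cluster 1 (0.43) is reasonably well separated, whereas the low value for cluster 2 (0.11) indicates that its points are, on average, only weakly separated from cluster 1.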
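One way to interpret what distinguishes the clusters is to compare the mean of each variable within each cluster. The sketch below does this on made-up data with hypothetical column names; with the real objects one would presumably combine `kmean2$cluster` with the unscaled variables, but that usage is an assumption and is not shown in the original analysis.

```r
# Made-up data: three low-value and three high-value observations
toy <- data.frame(
  life_expectancy = c(52, 55, 58, 76, 79, 82),
  schooling       = c(8, 9, 10, 14, 15, 16)
)

set.seed(123)
# Cluster on the standardised variables
km <- kmeans(scale(toy), centers = 2, nstart = 25)

# Mean of each original (unscaled) variable within each cluster
profile <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
print(profile)
```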

Considering 5 clusters

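kmean5 <- eclust(finaldata, k = 5, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)

# Plot the clusters on the first two principal components, with a convex hull around each cluster
cluster5 <- fviz_cluster(kmean5, finaldata, ellipse.type = "convex", geom = "point") + ggtitle("K-means with 5 clusters")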

sil5 <- fviz_silhouette(kmean5)

##   cluster size ave.sil.width
## 1       1    1          0.00
## 2       2   18         -0.01
## 3       3   30          0.18
## 4       4   26          0.20
## 5       5   55          0.23

grid.arrange(cluster5, sil5, ncol=2)

kmean5$cluster

##              Afghanistan                  Albania                  Algeria 
##                        3                        4                        5 
##                   Angola                Argentina                  Armenia 
##                        2                        4                        5 
##                Australia                  Austria               Azerbaijan 
##                        4                        4                        5 
##               Bangladesh                  Belarus                  Belgium 
##                        3                        5                        4 
##                   Belize                    Benin                   Bhutan 
##                        5                        3                        3 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        5                        5                        5 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        5                        3                        3 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        5                        5                        3 
##                   Canada Central African Republic                     Chad 
##                        4                        2                        2 
##                    Chile                    China                 Colombia 
##                        4                        5                        5 
##                  Comoros               Costa Rica                  Croatia 
##                        3                        5                        4 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        5                        3                        5 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        5                        5                        2 
##                  Estonia                 Ethiopia                     Fiji 
##                        4                        3                        5 
##                   France                    Gabon                  Georgia 
##                        4                        2                        5 
##                  Germany                    Ghana                   Greece 
##                        4                        3                        4 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        2                        2                        3 
##                   Guyana                    Haiti                 Honduras 
##                        5                        2                        5 
##                    India                Indonesia                     Iraq 
##                        1                        2                        2 
##                  Ireland                   Israel                    Italy 
##                        4                        4                        4 
##                  Jamaica                   Jordan               Kazakhstan 
##                        5                        5                        5 
##                    Kenya                 Kiribati                   Latvia 
##                        3                        5                        4 
##                  Lebanon                  Lesotho                  Liberia 
##                        5                        3                        2 
##                Lithuania               Luxembourg               Madagascar 
##                        4                        4                        3 
##                   Malawi                 Malaysia                 Maldives 
##                        3                        5                        5 
##                     Mali                    Malta               Mauritania 
##                        3                        4                        3 
##                Mauritius                   Mexico                 Mongolia 
##                        5                        5                        5 
##               Montenegro                  Morocco               Mozambique 
##                        4                        5                        2 
##                  Myanmar                  Namibia                    Nepal 
##                        3                        3                        3 
##              Netherlands                Nicaragua                    Niger 
##                        4                        5                        3 
##                  Nigeria                 Pakistan                   Panama 
##                        2                        3                        5 
##                 Paraguay                     Peru              Philippines 
##                        5                        2                        2 
##                   Poland                 Portugal                  Romania 
##                        4                        4                        5 
##       Russian Federation                   Rwanda                    Samoa 
##                        5                        3                        5 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        5                        3                        5 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        5                        3                        5 
##             South Africa                    Spain                Sri Lanka 
##                        3                        4                        5 
##                 Suriname                Swaziland                   Sweden 
##                        5                        2                        4 
##               Tajikistan                 Thailand              Timor-Leste 
##                        5                        5                        5 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        3                        5                        5 
##                  Tunisia                   Turkey             Turkmenistan 
##                        5                        5                        5 
##                   Uganda                  Ukraine                  Uruguay 
##                        3                        2                        4 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        5                        5                        2 
##                 Zimbabwe 
##                        3

Although some clusters have a positive average silhouette width, two clusters have values close to 0 (0.00 and -0.01), and cluster 1 contains a single country (India), which suggests that using 5 clusters is not a good idea. Still, let’s run the same analysis for 9 clusters.
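The value of 0.00 for the cluster of size 1 deserves a note: `cluster::silhouette()` assigns a silhouette width of 0 to an observation that forms a cluster on its own, so that figure reflects a convention and not a measured degree of separation. A minimal sketch with made-up points and hand-assigned labels (both hypothetical):

```r
library(cluster)  # for silhouette()

# Two compact made-up groups plus one isolated point
toy <- matrix(c( 0,  0,
                 0,  1,
                 1,  0,
                10, 10,
                10, 11,
                11, 10,
                50, 50), ncol = 2, byrow = TRUE)

# Hand-assigned cluster labels; the last point is a singleton cluster
labels <- c(1L, 1L, 1L, 2L, 2L, 2L, 3L)

sil <- silhouette(labels, dist(toy))
print(sil[, c("cluster", "sil_width")])
```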

Considering 9 clusters

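kmean9 <- eclust(finaldata, k = 9, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)

# Plot the clusters on the first two principal components, with a convex hull around each cluster
cluster9 <- fviz_cluster(kmean9, finaldata, ellipse.type = "convex", geom = "point") + ggtitle("K-means with 9 clusters")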

sil9 <- fviz_silhouette(kmean9)

##   cluster size ave.sil.width
## 1       1    8          0.16
## 2       2   12          0.09
## 3       3   16          0.20
## 4       4    9          0.50
## 5       5   20          0.19
## 6       6   51          0.27
## 7       7    2          0.13
## 8       8   11          0.07
## 9       9    1          0.00

grid.arrange(cluster9, sil9, ncol=2)

kmean9$cluster

##              Afghanistan                  Albania                  Algeria 
##                        1                        6                        6 
##                   Angola                Argentina                  Armenia 
##                        2                        6                        6 
##                Australia                  Austria               Azerbaijan 
##                        4                        4                        6 
##               Bangladesh                  Belarus                  Belgium 
##                        1                        6                        6 
##                   Belize                    Benin                   Bhutan 
##                        5                        3                        1 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        6                        3                        6 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        6                        5                        3 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        5                        5                        3 
##                   Canada Central African Republic                     Chad 
##                        4                        2                        2 
##                    Chile                    China                 Colombia 
##                        6                        6                        6 
##                  Comoros               Costa Rica                  Croatia 
##                        5                        6                        6 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        6                        3                        6 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        6                        6                        2 
##                  Estonia                 Ethiopia                     Fiji 
##                        6                        5                        6 
##                   France                    Gabon                  Georgia 
##                        4                        2                        6 
##                  Germany                    Ghana                   Greece 
##                        4                        5                        6 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        8                        2                        3 
##                   Guyana                    Haiti                 Honduras 
##                        5                        2                        6 
##                    India                Indonesia                     Iraq 
##                        9                        7                        8 
##                  Ireland                   Israel                    Italy 
##                        6                        4                        6 
##                  Jamaica                   Jordan               Kazakhstan 
##                        6                        6                        6 
##                    Kenya                 Kiribati                   Latvia 
##                        3                        8                        6 
##                  Lebanon                  Lesotho                  Liberia 
##                        6                        3                        2 
##                Lithuania               Luxembourg               Madagascar 
##                        6                        6                        5 
##                   Malawi                 Malaysia                 Maldives 
##                        3                        5                        1 
##                     Mali                    Malta               Mauritania 
##                        3                        4                        5 
##                Mauritius                   Mexico                 Mongolia 
##                        6                        6                        6 
##               Montenegro                  Morocco               Mozambique 
##                        6                        6                        2 
##                  Myanmar                  Namibia                    Nepal 
##                        1                        3                        1 
##              Netherlands                Nicaragua                    Niger 
##                        4                        6                        5 
##                  Nigeria                 Pakistan                   Panama 
##                        7                        1                        8 
##                 Paraguay                     Peru              Philippines 
##                        5                        8                        2 
##                   Poland                 Portugal                  Romania 
##                        6                        6                        8 
##       Russian Federation                   Rwanda                    Samoa 
##                        6                        5                        8 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        5                        5                        6 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        6                        3                        5 
##             South Africa                    Spain                Sri Lanka 
##                        3                        4                        1 
##                 Suriname                Swaziland                   Sweden 
##                        6                        2                        6 
##               Tajikistan                 Thailand              Timor-Leste 
##                        5                        6                        5 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        3                        8                        8 
##                  Tunisia                   Turkey             Turkmenistan 
##                        6                        6                        5 
##                   Uganda                  Ukraine                  Uruguay 
##                        3                        8                        6 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        6                        8                        2 
##                 Zimbabwe 
##                        3

The above results and plots clearly suggest that there is serious overlap between clusters, which is undesirable. Of all the possibilities tried with the K-means algorithm, classifying the countries into 2 clusters appears to be the best choice.
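A visual comparison like this can be backed up with a number by computing the average silhouette width for each candidate number of clusters. The sketch below is only an illustration, not the analysis itself: it uses the built-in USArrests data as a stand-in for finaldata (an assumption), and the helper avg_sil_kmeans is a hypothetical name introduced here.

```r
library(cluster)  # silhouette()

# Stand-in data (assumption): scaled USArrests instead of the document's finaldata
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# Average silhouette width of a K-means partition with k clusters
# (hypothetical helper, not part of the original analysis)
avg_sil_kmeans <- function(k, x, d) {
  set.seed(123)  # fixed seed so the random starts are reproducible
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
}

ks <- c(2, 5, 9)
avg_sil <- sapply(ks, avg_sil_kmeans, x = x, d = d)
names(avg_sil) <- paste0("k=", ks)
print(round(avg_sil, 2))
```

A higher average silhouette width points to better separated clusters, so the k with the largest value is the natural candidate.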

Let’s now use the PAM algorithm and see what results it gives.

PAM clustering

Considering 2 clusters

pam2 <- eclust(finaldata, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

cpam2 <- fviz_cluster(pam2, finaldata, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")

silpam2 <- fviz_silhouette(pam2)

##   cluster size ave.sil.width
## 1       1   54          0.11
## 2       2   76          0.42

grid.arrange(cpam2, silpam2, ncol=2)

pam2$clustering

##              Afghanistan                  Albania                  Algeria 
##                        1                        2                        2 
##                   Angola                Argentina                  Armenia 
##                        1                        2                        2 
##                Australia                  Austria               Azerbaijan 
##                        2                        2                        2 
##               Bangladesh                  Belarus                  Belgium 
##                        1                        2                        2 
##                   Belize                    Benin                   Bhutan 
##                        2                        1                        1 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        2                        1                        2 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        2                        1                        1 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        2                        1                        1 
##                   Canada Central African Republic                     Chad 
##                        2                        1                        1 
##                    Chile                    China                 Colombia 
##                        2                        2                        2 
##                  Comoros               Costa Rica                  Croatia 
##                        1                        2                        2 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        2                        1                        2 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        2                        2                        1 
##                  Estonia                 Ethiopia                     Fiji 
##                        2                        1                        2 
##                   France                    Gabon                  Georgia 
##                        2                        1                        2 
##                  Germany                    Ghana                   Greece 
##                        2                        1                        2 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        1                        1                        1 
##                   Guyana                    Haiti                 Honduras 
##                        1                        1                        2 
##                    India                Indonesia                     Iraq 
##                        1                        1                        1 
##                  Ireland                   Israel                    Italy 
##                        2                        2                        2 
##                  Jamaica                   Jordan               Kazakhstan 
##                        2                        2                        2 
##                    Kenya                 Kiribati                   Latvia 
##                        1                        2                        2 
##                  Lebanon                  Lesotho                  Liberia 
##                        2                        1                        1 
##                Lithuania               Luxembourg               Madagascar 
##                        2                        2                        1 
##                   Malawi                 Malaysia                 Maldives 
##                        1                        2                        2 
##                     Mali                    Malta               Mauritania 
##                        1                        2                        1 
##                Mauritius                   Mexico                 Mongolia 
##                        2                        2                        2 
##               Montenegro                  Morocco               Mozambique 
##                        2                        2                        1 
##                  Myanmar                  Namibia                    Nepal 
##                        1                        1                        1 
##              Netherlands                Nicaragua                    Niger 
##                        2                        2                        1 
##                  Nigeria                 Pakistan                   Panama 
##                        1                        1                        2 
##                 Paraguay                     Peru              Philippines 
##                        2                        2                        1 
##                   Poland                 Portugal                  Romania 
##                        2                        2                        2 
##       Russian Federation                   Rwanda                    Samoa 
##                        2                        1                        2 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        1                        1                        2 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        2                        1                        1 
##             South Africa                    Spain                Sri Lanka 
##                        1                        2                        2 
##                 Suriname                Swaziland                   Sweden 
##                        2                        1                        2 
##               Tajikistan                 Thailand              Timor-Leste 
##                        1                        2                        1 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        1                        2                        2 
##                  Tunisia                   Turkey             Turkmenistan 
##                        2                        2                        2 
##                   Uganda                  Ukraine                  Uruguay 
##                        1                        2                        2 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        2                        2                        1 
##                 Zimbabwe 
##                        1

These outputs agree with K-means: classifying the nations into 2 clusters looks like a good way forward in the analysis. However, let’s consider other possibilities before drawing any conclusion.
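To check how closely two partitions agree, the cluster labels can be cross-tabulated. The following is a minimal sketch on stand-in data (the built-in USArrests data, an assumption); km2_demo, pam2_demo, agreement and share_same are illustrative names, not objects from the analysis above.

```r
library(cluster)  # pam()

# Stand-in data (assumption): scaled USArrests instead of the document's finaldata
x <- scale(USArrests)

set.seed(123)
km2_demo  <- kmeans(x, centers = 2, nstart = 25)
pam2_demo <- pam(x, k = 2, metric = "euclidean")

# Rows: K-means labels, columns: PAM labels
agreement <- table(kmeans = km2_demo$cluster, pam = pam2_demo$clustering)
print(agreement)

# Cluster labels are arbitrary, so take the better of the two possible matchings
on_diag <- sum(diag(agreement))
share_same <- max(on_diag, sum(agreement) - on_diag) / sum(agreement)
print(round(share_same, 2))

# PAM centres are actual observations (medoids), so they can be named
print(rownames(pam2_demo$medoids))
```

A share close to 1 means both algorithms put almost every observation in the same group, whichever way the labels happen to be numbered.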

Considering 5 clusters

pam5 <- eclust(finaldata, k=5, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

cpam5 <- fviz_cluster(pam5, finaldata, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 5 clusters")

silpam5 <- fviz_silhouette(pam5)

##   cluster size ave.sil.width
## 1       1   46          0.13
## 2       2   65          0.31
## 3       3    9          0.53
## 4       4    9          0.30
## 5       5    1          0.00

grid.arrange(cpam5, silpam5, ncol=2)

pam5$clustering

##              Afghanistan                  Albania                  Algeria 
##                        1                        2                        2 
##                   Angola                Argentina                  Armenia 
##                        1                        2                        2 
##                Australia                  Austria               Azerbaijan 
##                        3                        3                        2 
##               Bangladesh                  Belarus                  Belgium 
##                        1                        2                        2 
##                   Belize                    Benin                   Bhutan 
##                        2                        1                        1 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        2                        1                        2 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        2                        1                        1 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        2                        1                        1 
##                   Canada Central African Republic                     Chad 
##                        3                        1                        1 
##                    Chile                    China                 Colombia 
##                        2                        2                        2 
##                  Comoros               Costa Rica                  Croatia 
##                        1                        2                        2 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        2                        1                        2 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        2                        2                        4 
##                  Estonia                 Ethiopia                     Fiji 
##                        2                        1                        2 
##                   France                    Gabon                  Georgia 
##                        3                        4                        2 
##                  Germany                    Ghana                   Greece 
##                        3                        1                        2 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        1                        1                        1 
##                   Guyana                    Haiti                 Honduras 
##                        1                        4                        2 
##                    India                Indonesia                     Iraq 
##                        5                        1                        1 
##                  Ireland                   Israel                    Italy 
##                        2                        3                        2 
##                  Jamaica                   Jordan               Kazakhstan 
##                        2                        2                        2 
##                    Kenya                 Kiribati                   Latvia 
##                        1                        2                        2 
##                  Lebanon                  Lesotho                  Liberia 
##                        2                        1                        1 
##                Lithuania               Luxembourg               Madagascar 
##                        2                        2                        1 
##                   Malawi                 Malaysia                 Maldives 
##                        1                        2                        2 
##                     Mali                    Malta               Mauritania 
##                        1                        3                        1 
##                Mauritius                   Mexico                 Mongolia 
##                        2                        2                        2 
##               Montenegro                  Morocco               Mozambique 
##                        2                        2                        4 
##                  Myanmar                  Namibia                    Nepal 
##                        1                        1                        1 
##              Netherlands                Nicaragua                    Niger 
##                        3                        2                        1 
##                  Nigeria                 Pakistan                   Panama 
##                        1                        1                        2 
##                 Paraguay                     Peru              Philippines 
##                        2                        4                        4 
##                   Poland                 Portugal                  Romania 
##                        2                        2                        2 
##       Russian Federation                   Rwanda                    Samoa 
##                        2                        1                        2 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        1                        1                        2 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        2                        1                        1 
##             South Africa                    Spain                Sri Lanka 
##                        1                        3                        2 
##                 Suriname                Swaziland                   Sweden 
##                        2                        4                        2 
##               Tajikistan                 Thailand              Timor-Leste 
##                        1                        2                        1 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        1                        2                        2 
##                  Tunisia                   Turkey             Turkmenistan 
##                        2                        2                        2 
##                   Uganda                  Ukraine                  Uruguay 
##                        1                        4                        2 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        2                        2                        4 
##                 Zimbabwe 
##                        1

The results suggest that 5 clusters is not a bad choice, but the overlap visible in the plot and the negative silhouette widths for some countries are warning signs. Cluster 5 also contains a single country (India). It is therefore better to consider more options.
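Observations with a negative silhouette width sit closer, on average, to a neighbouring cluster than to their own. A minimal sketch of how to list them is shown below, again on the built-in USArrests data as a stand-in (an assumption). For the objects in this document, the analogous per-observation table should be available as pam5$silinfo$widths, though that is not verified here.

```r
library(cluster)  # pam()

# Stand-in data (assumption): scaled USArrests instead of the document's finaldata
x <- scale(USArrests)
fit <- pam(x, k = 5, metric = "euclidean")

# One row per observation: own cluster, nearest other cluster, silhouette width
widths <- fit$silinfo$widths

# Keep only the observations that may be misassigned
negative <- widths[widths[, "sil_width"] < 0, , drop = FALSE]
print(negative)
```

The neighbor column shows which cluster each flagged observation would fit better.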

Considering 9 clusters

pam9 <- eclust(finaldata, k=9, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

cpam9 <- fviz_cluster(pam9, finaldata, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 9 clusters")

silpam9 <- fviz_silhouette(pam9)

##   cluster size ave.sil.width
## 1       1    8          0.19
## 2       2    1          0.00
## 3       3   54          0.28
## 4       4   33          0.18
## 5       5    9          0.50
## 6       6   13          0.27
## 7       7    9          0.24
## 8       8    1          0.00
## 9       9    2          0.14

grid.arrange(cpam9, silpam9, ncol=2)

pam9$clustering

##              Afghanistan                  Albania                  Algeria 
##                        1                        2                        3 
##                   Angola                Argentina                  Armenia 
##                        4                        3                        3 
##                Australia                  Austria               Azerbaijan 
##                        5                        5                        3 
##               Bangladesh                  Belarus                  Belgium 
##                        1                        3                        3 
##                   Belize                    Benin                   Bhutan 
##                        6                        4                        1 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                        3                        4                        3 
##                 Bulgaria             Burkina Faso                  Burundi 
##                        3                        4                        4 
##               Cabo Verde                 Cambodia                 Cameroon 
##                        3                        6                        4 
##                   Canada Central African Republic                     Chad 
##                        5                        4                        4 
##                    Chile                    China                 Colombia 
##                        3                        3                        3 
##                  Comoros               Costa Rica                  Croatia 
##                        4                        6                        3 
##                   Cyprus                 Djibouti       Dominican Republic 
##                        6                        4                        3 
##                  Ecuador              El Salvador        Equatorial Guinea 
##                        3                        3                        7 
##                  Estonia                 Ethiopia                     Fiji 
##                        3                        4                        3 
##                   France                    Gabon                  Georgia 
##                        5                        7                        3 
##                  Germany                    Ghana                   Greece 
##                        5                        4                        3 
##                Guatemala                   Guinea            Guinea-Bissau 
##                        6                        4                        4 
##                   Guyana                    Haiti                 Honduras 
##                        4                        7                        3 
##                    India                Indonesia                     Iraq 
##                        8                        9                        4 
##                  Ireland                   Israel                    Italy 
##                        3                        5                        3 
##                  Jamaica                   Jordan               Kazakhstan 
##                        3                        3                        3 
##                    Kenya                 Kiribati                   Latvia 
##                        4                        3                        3 
##                  Lebanon                  Lesotho                  Liberia 
##                        3                        4                        4 
##                Lithuania               Luxembourg               Madagascar 
##                        3                        3                        4 
##                   Malawi                 Malaysia                 Maldives 
##                        4                        6                        1 
##                     Mali                    Malta               Mauritania 
##                        4                        5                        4 
##                Mauritius                   Mexico                 Mongolia 
##                        3                        3                        3 
##               Montenegro                  Morocco               Mozambique 
##                        3                        3                        7 
##                  Myanmar                  Namibia                    Nepal 
##                        1                        4                        1 
##              Netherlands                Nicaragua                    Niger 
##                        5                        3                        4 
##                  Nigeria                 Pakistan                   Panama 
##                        9                        1                        3 
##                 Paraguay                     Peru              Philippines 
##                        6                        7                        7 
##                   Poland                 Portugal                  Romania 
##                        3                        3                        6 
##       Russian Federation                   Rwanda                    Samoa 
##                        6                        4                        3 
##    Sao Tome and Principe                  Senegal                   Serbia 
##                        6                        4                        6 
##               Seychelles             Sierra Leone          Solomon Islands 
##                        3                        4                        6 
##             South Africa                    Spain                Sri Lanka 
##                        4                        5                        1 
##                 Suriname                Swaziland                   Sweden 
##                        3                        7                        3 
##               Tajikistan                 Thailand              Timor-Leste 
##                        6                        3                        4 
##                     Togo                    Tonga      Trinidad and Tobago 
##                        4                        3                        3 
##                  Tunisia                   Turkey             Turkmenistan 
##                        3                        3                        3 
##                   Uganda                  Ukraine                  Uruguay 
##                        4                        7                        3 
##               Uzbekistan                  Vanuatu                   Zambia 
##                        3                        3                        7 
##                 Zimbabwe 
##                        4

Again, the same overlap issue exists in this case, and two of the nine clusters contain only a single country. Considering all the options, using 2 clusters seems to be the best choice.
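The three PAM silhouette tables above can be condensed into one overall figure per solution by weighting each cluster's average width by its size. The sketch below simply re-enters the rounded values printed above, so the results are approximate; sil_tables and overall are names introduced for illustration.

```r
# Cluster sizes and average silhouette widths copied from the PAM outputs above
sil_tables <- list(
  k2 = data.frame(size = c(54, 76),
                  ave_sil = c(0.11, 0.42)),
  k5 = data.frame(size = c(46, 65, 9, 9, 1),
                  ave_sil = c(0.13, 0.31, 0.53, 0.30, 0.00)),
  k9 = data.frame(size = c(8, 1, 54, 33, 9, 13, 9, 1, 2),
                  ave_sil = c(0.19, 0.00, 0.28, 0.18, 0.50, 0.27, 0.24, 0.00, 0.14))
)

# Size-weighted mean of the per-cluster widths = overall average silhouette width
overall <- sapply(sil_tables, function(tb) weighted.mean(tb$ave_sil, tb$size))
print(round(overall, 3))
```

This gives roughly 0.29 for 2 clusters, 0.26 for 5 clusters and 0.25 for 9 clusters, which is consistent with preferring the 2-cluster solution.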

Hierarchical clustering

Considering 2 clusters

hcluster2 <- eclust(finaldata, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

plot(hcluster2, cex=0.6, hang=-1, main = "Dendrogram for 2 clusters")
rect.hclust(hcluster2, k=2, border='blue')

Considering 5 clusters

hcluster5 <- eclust(finaldata, k=5, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

plot(hcluster5, cex=0.6, hang=-1, main = "Dendrogram for 5 clusters")
rect.hclust(hcluster5, k=5, border='blue')

Considering 9 clusters

hcluster9 <- eclust(finaldata, k=9, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

plot(hcluster9, cex=0.6, hang=-1, main = "Dendrogram for 9 clusters")
rect.hclust(hcluster9, k=9, border='blue')
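Dendrograms are hard to compare by eye, so it can help to cut one tree at each candidate k and look at the resulting cluster sizes and average silhouette widths. The sketch below uses base hclust() and cutree() on the built-in USArrests data as a stand-in (an assumption), with the same Euclidean distance and complete linkage as above; hc_summary is a hypothetical helper.

```r
library(cluster)  # silhouette()

# Stand-in data (assumption): scaled USArrests instead of the document's finaldata
x <- scale(USArrests)
d <- dist(x, method = "euclidean")
hc <- hclust(d, method = "complete")

# Cut the tree into k groups and return the average silhouette width
# (hypothetical helper, not part of the original analysis)
hc_summary <- function(k) {
  groups <- cutree(hc, k = k)
  cat("k =", k, "| cluster sizes:", paste(table(groups), collapse = " "), "\n")
  mean(silhouette(groups, d)[, "sil_width"])
}

hc_sil <- sapply(c(2, 5, 9), hc_summary)
names(hc_sil) <- c("k=2", "k=5", "k=9")
print(round(hc_sil, 2))
```

Because the tree is built only once, the three cuts are nested: the 5-cluster solution is obtained by splitting groups of the 2-cluster solution.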

Conclusion:

Considering all the possibilities, 2 clusters is the best option for the analysis. This conclusion comes from comparing the results of the K-means, PAM and hierarchical clustering techniques applied with different numbers of clusters.

Dimension reduction

str(finaldata)

##  num [1:130, 1:17] -0.718 0.883 0.608 -2.293 0.695 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:130] "Afghanistan" "Albania" "Algeria" "Angola" ...
##   ..$ : chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
##  - attr(*, "scaled:center")= Named num [1:17] 70.74 158.68 27.72 3.36 80.65 ...
##   ..- attr(*, "names")= chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
##  - attr(*, "scaled:scale")= Named num [1:17] 8 99.5 96.4 32.6 25 ...
##   ..- attr(*, "names")= chr [1:17] "Life expectancy" "Adult Mortality" "infant deaths" "percentage expenditure" ...
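The scaled:center and scaled:scale attributes in the output above are what scale() attaches to its result: the column means and standard deviations used for standardisation. The sketch below shows how they can be used to go back to the original units. It runs on the built-in USArrests data as a stand-in (an assumption), and the variable names are illustrative.

```r
# Stand-in data (assumption): USArrests instead of the document's raw data
x <- scale(USArrests)

centers <- attr(x, "scaled:center")  # column means of the raw data
scales  <- attr(x, "scaled:scale")   # column standard deviations of the raw data

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(x), 10))
print(apply(x, 2, sd))

# Undo the scaling: multiply by the standard deviation, then add the mean back
original <- sweep(sweep(x, 2, scales, "*"), 2, centers, "+")
print(head(original, 3))
```

This is handy when interpreting cluster centres or medoids, since values in the original units are easier to read than z-scores.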

summary(finaldata)

##  Life expectancy   Adult Mortality   infant deaths      percentage expenditure
##  Min.   :-2.4685   Min.   :-1.5849   Min.   :-0.28750   Min.   :-0.103        
##  1st Qu.:-0.6273   1st Qu.:-0.8009   1st Qu.:-0.28750   1st Qu.:-0.103        
##  Median : 0.1761   Median :-0.1325   Median :-0.25639   Median :-0.103        
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000        
##  3rd Qu.: 0.6606   3rd Qu.: 0.5661   3rd Qu.:-0.07491   3rd Qu.:-0.103        
##  Max.   : 1.7829   Max.   : 3.2701   Max.   : 9.14972   Max.   :11.104        
##   Hepatitis B         Measles             BMI           under-five deaths 
##  Min.   :-2.9870   Min.   :-0.1942   Min.   :-1.79432   Min.   :-0.29658  
##  1st Qu.:-0.1362   1st Qu.:-0.1942   1st Qu.:-0.81184   1st Qu.:-0.28851  
##  Median : 0.4140   Median :-0.1924   Median : 0.03618   Median :-0.27238  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.6140   3rd Qu.:-0.1712   3rd Qu.: 0.97502   3rd Qu.:-0.09085  
##  Max.   : 0.7341   Max.   : 9.7050   Max.   : 1.74874   Max.   : 8.57783  
##      Polio           Diphtheria         HIV/AIDS            GDP          
##  Min.   :-2.9535   Min.   :-3.2996   Min.   :-0.4508   Min.   :-0.59040  
##  1st Qu.:-0.1130   1st Qu.:-0.2081   1st Qu.:-0.4508   1st Qu.:-0.52421  
##  Median : 0.4080   Median : 0.4145   Median :-0.4508   Median :-0.37280  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6242   3rd Qu.: 0.6077   3rd Qu.:-0.1877   3rd Qu.: 0.03107  
##  Max.   : 0.7028   Max.   : 0.6936   Max.   : 5.6010   Max.   : 5.00462  
##    Population       thinness  10-19 years thinness 5-9 years
##  Min.   :-0.38534   Min.   :-1.0457       Min.   :-1.0455   
##  1st Qu.:-0.37741   1st Qu.:-0.7204       1st Qu.:-0.7216   
##  Median :-0.32265   Median :-0.2924       Median :-0.2972   
##  Mean   : 0.00000   Mean   : 0.0000       Mean   : 0.0000   
##  3rd Qu.:-0.02251   3rd Qu.: 0.4153       3rd Qu.: 0.4008   
##  Max.   : 8.16959   Max.   : 5.0265       Max.   : 5.0298   
##  Income composition of resources   Schooling        
##  Min.   :-2.1677                 Min.   :-2.661775  
##  1st Qu.:-0.8220                 1st Qu.:-0.705366  
##  Median : 0.1766                 Median : 0.001115  
##  Mean   : 0.0000                 Mean   : 0.000000  
##  3rd Qu.: 0.7222                 3rd Qu.: 0.698538  
##  Max.   : 1.7340                 Max.   : 2.772694

Checking correlation among variables

correlation <- cor(finaldata, method = 'pearson')
round(correlation, 2) # Round the values to 2 decimal places

##                                 Life expectancy Adult Mortality infant deaths
## Life expectancy                            1.00           -0.73         -0.21
## Adult Mortality                           -0.73            1.00          0.15
## infant deaths                             -0.21            0.15          1.00
## percentage expenditure                     0.06           -0.06         -0.02
## Hepatitis B                                0.37           -0.13         -0.08
## Measles                                   -0.05            0.03          0.82
## BMI                                        0.54           -0.35         -0.21
## under-five deaths                         -0.24            0.18          0.99
## Polio                                      0.49           -0.30         -0.12
## Diphtheria                                 0.47           -0.23         -0.11
## HIV/AIDS                                  -0.62            0.63          0.07
## GDP                                        0.49           -0.31         -0.12
## Population                                -0.03            0.03          0.27
## thinness  10-19 years                     -0.46            0.25          0.56
## thinness 5-9 years                        -0.45            0.26          0.56
## Income composition of resources            0.90           -0.59         -0.20
## Schooling                                  0.81           -0.47         -0.22
##                                 percentage expenditure Hepatitis B Measles
## Life expectancy                                   0.06        0.37   -0.05
## Adult Mortality                                  -0.06       -0.13    0.03
## infant deaths                                    -0.02       -0.08    0.82
## percentage expenditure                            1.00        0.05   -0.02
## Hepatitis B                                       0.05        1.00    0.03
## Measles                                          -0.02        0.03    1.00
## BMI                                               0.05        0.15   -0.13
## under-five deaths                                -0.02       -0.09    0.79
## Polio                                             0.01        0.50   -0.01
## Diphtheria                                        0.05        0.90    0.02
## HIV/AIDS                                         -0.05       -0.34   -0.04
## GDP                                              -0.03        0.09   -0.07
## Population                                       -0.02       -0.05    0.13
## thinness  10-19 years                            -0.02       -0.04    0.38
## thinness 5-9 years                               -0.02       -0.09    0.37
## Income composition of resources                   0.03        0.28   -0.06
## Schooling                                         0.03        0.30   -0.06
##                                   BMI under-five deaths Polio Diphtheria
## Life expectancy                  0.54             -0.24  0.49       0.47
## Adult Mortality                 -0.35              0.18 -0.30      -0.23
## infant deaths                   -0.21              0.99 -0.12      -0.11
## percentage expenditure           0.05             -0.02  0.01       0.05
## Hepatitis B                      0.15             -0.09  0.50       0.90
## Measles                         -0.13              0.79 -0.01       0.02
## BMI                              1.00             -0.22  0.20       0.17
## under-five deaths               -0.22              1.00 -0.14      -0.13
## Polio                            0.20             -0.14  1.00       0.58
## Diphtheria                       0.17             -0.13  0.58       1.00
## HIV/AIDS                        -0.27              0.10 -0.38      -0.41
## GDP                              0.39             -0.12  0.22       0.20
## Population                       0.01              0.31 -0.23      -0.05
## thinness  10-19 years           -0.49              0.55 -0.18      -0.08
## thinness 5-9 years              -0.51              0.54 -0.18      -0.13
## Income composition of resources  0.62             -0.22  0.44       0.40
## Schooling                        0.61             -0.24  0.39       0.39
##                                 HIV/AIDS   GDP Population thinness  10-19 years
## Life expectancy                    -0.62  0.49      -0.03                 -0.46
## Adult Mortality                     0.63 -0.31       0.03                  0.25
## infant deaths                       0.07 -0.12       0.27                  0.56
## percentage expenditure             -0.05 -0.03      -0.02                 -0.02
## Hepatitis B                        -0.34  0.09      -0.05                 -0.04
## Measles                            -0.04 -0.07       0.13                  0.38
## BMI                                -0.27  0.39       0.01                 -0.49
## under-five deaths                   0.10 -0.12       0.31                  0.55
## Polio                              -0.38  0.22      -0.23                 -0.18
## Diphtheria                         -0.41  0.20      -0.05                 -0.08
## HIV/AIDS                            1.00 -0.19       0.02                  0.17
## GDP                                -0.19  1.00       0.07                 -0.29
## Population                          0.02  0.07       1.00                 -0.01
## thinness  10-19 years               0.17 -0.29      -0.01                  1.00
## thinness 5-9 years                  0.15 -0.29      -0.02                  0.97
## Income composition of resources    -0.48  0.57       0.03                 -0.51
## Schooling                          -0.39  0.57       0.05                 -0.50
##                                 thinness 5-9 years
## Life expectancy                              -0.45
## Adult Mortality                               0.26
## infant deaths                                 0.56
## percentage expenditure                       -0.02
## Hepatitis B                                  -0.09
## Measles                                       0.37
## BMI                                          -0.51
## under-five deaths                             0.54
## Polio                                        -0.18
## Diphtheria                                   -0.13
## HIV/AIDS                                      0.15
## GDP                                          -0.29
## Population                                   -0.02
## thinness  10-19 years                         0.97
## thinness 5-9 years                            1.00
## Income composition of resources              -0.50
## Schooling                                    -0.49
##                                 Income composition of resources Schooling
## Life expectancy                                            0.90      0.81
## Adult Mortality                                           -0.59     -0.47
## infant deaths                                             -0.20     -0.22
## percentage expenditure                                     0.03      0.03
## Hepatitis B                                                0.28      0.30
## Measles                                                   -0.06     -0.06
## BMI                                                        0.62      0.61
## under-five deaths                                         -0.22     -0.24
## Polio                                                      0.44      0.39
## Diphtheria                                                 0.40      0.39
## HIV/AIDS                                                  -0.48     -0.39
## GDP                                                        0.57      0.57
## Population                                                 0.03      0.05
## thinness  10-19 years                                     -0.51     -0.50
## thinness 5-9 years                                        -0.50     -0.49
## Income composition of resources                            1.00      0.92
## Schooling                                                  0.92      1.00

corrplot(correlation, type = 'lower')

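The plot shows that several variables are strongly correlated with one another, for example ‘infant deaths’ and ‘under-five deaths’ (0.99), ‘thinness 10-19 years’ and ‘thinness 5-9 years’ (0.97), ‘Income composition of resources’ and ‘Schooling’ (0.92), and ‘Hepatitis B’ and ‘Diphtheria’ (0.90). Because much of this information is redundant, dimension reduction techniques can summarise the data in fewer components and simplify the analysis.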
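To make the visual impression from the correlation plot concrete, the strongly correlated pairs can also be listed programmatically. The following is a minimal sketch: `high_cor_pairs` is a hypothetical helper that is not part of the original analysis, and the built-in `mtcars` data set is used only as a stand-in so that the example runs on its own.

```r
# Hypothetical helper (not part of the original analysis): list variable
# pairs whose absolute correlation is at least `threshold`.
high_cor_pairs <- function(cor_mat, threshold = 0.9) {
  # Use only the upper triangle so each pair appears once and the
  # diagonal (always 1) is ignored.
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= threshold, arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first
  pairs[order(-abs(pairs$r)), , drop = FALSE]
}

# Stand-in data: mtcars is used only so the sketch is self-contained.
demo_cor <- cor(mtcars)
strong_pairs <- high_cor_pairs(demo_cor, threshold = 0.8)
print(strong_pairs)
```

In this report the equivalent call would be `high_cor_pairs(correlation)`, assuming `correlation` is the matrix passed to `corrplot()` and that it carries row and column names.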

Principal Component Analysis(PCA)

Finding optimal number of components

pca <- prcomp(finaldata, center = TRUE, scale. = TRUE)
fviz_eig(pca)

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     2.4628 1.7347 1.3759 1.0696 1.00512 0.9475 0.88970
## Proportion of Variance 0.3568 0.1770 0.1114 0.0673 0.05943 0.0528 0.04656
## Cumulative Proportion  0.3568 0.5338 0.6452 0.7125 0.77191 0.8247 0.87127
##                            PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.74603 0.67020 0.63889 0.5638 0.42917 0.32022 0.30815
## Proportion of Variance 0.03274 0.02642 0.02401 0.0187 0.01083 0.00603 0.00559
## Cumulative Proportion  0.90401 0.93043 0.95444 0.9731 0.98398 0.99001 0.99559
##                           PC15    PC16    PC17
## Standard deviation     0.21087 0.16581 0.05414
## Proportion of Variance 0.00262 0.00162 0.00017
## Cumulative Proportion  0.99821 0.99983 1.00000

The results above show that the first component, PC1, explains the largest share of the total variance, around 35.68%. However, this is not high enough to rely on this component alone. So, let's look at the eigenvalue of each component.

eigen(cor(finaldata))$values

##  [1] 6.065615085 3.009336738 1.893096958 1.144090950 1.010267284 0.897680014
##  [7] 0.791562083 0.556563930 0.449167193 0.408180830 0.317863416 0.184191068
## [13] 0.102539868 0.094953774 0.044467282 0.027492730 0.002930797

fviz_eig(pca, choice='eigenvalue')

Looking at the results above and at the scree plot, it becomes clear that there are 5 components with an eigenvalue greater than 1. As a general rule of thumb (the Kaiser criterion), it is good practice to retain the components whose eigenvalue is greater than 1. This means the first five components, which together explain about 77.19% of the total variance, are retained for further analysis.
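The eigenvalue rule of thumb can be checked directly from the numbers printed above. The sketch below copies the eigenvalues from the `eigen(cor(finaldata))$values` output rather than recomputing them; the names `eig`, `n_keep`, `prop_var` and `cum_var` are introduced here for illustration only.

```r
# Eigenvalues copied from the eigen(cor(finaldata))$values output above
eig <- c(6.065615085, 3.009336738, 1.893096958, 1.144090950, 1.010267284,
         0.897680014, 0.791562083, 0.556563930, 0.449167193, 0.408180830,
         0.317863416, 0.184191068, 0.102539868, 0.094953774, 0.044467282,
         0.027492730, 0.002930797)

# Rule of thumb: keep the components whose eigenvalue exceeds 1
n_keep <- sum(eig > 1)

# Share of the total variance captured by each component, and cumulatively
prop_var <- eig / sum(eig)
cum_var  <- cumsum(prop_var)

n_keep                      # 5
round(cum_var[n_keep], 4)   # 0.7719, the cumulative proportion at PC5
round(sqrt(eig[1]), 4)      # 2.4628, the PC1 standard deviation in summary(pca)
```

Note that the fifth eigenvalue (about 1.01) only just passes the cut-off, so the choice between four and five components is a close call under this rule.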

Component analysis

variablepca <- get_pca_var(pca)
options(ggrepel.max.overlaps = Inf)    # allow any number of overlapping labels so none are dropped
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", 
             gradient.cols = c("blue", "green", "red"), repel = TRUE)

Checking the contribution of each variable

fviz_contrib(pca, choice = "var", axes = 1:5)
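As a rough illustration of what a combined contribution over several components means, the sketch below computes it with base R only. It rests on two assumptions: that a variable's contribution to a single component is its squared loading expressed as a percentage, and that contributions across components are averaged with the eigenvalues as weights, which is believed to match what `fviz_contrib()` plots. The built-in `USArrests` data set and the names `pca_demo`, `contrib` and `combined` are stand-ins used so the example runs on its own.

```r
# Stand-in data so the sketch is self-contained; in this report the input
# would be the `pca` object fitted on finaldata.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (%) of each variable to each component: squared loadings
# within a component sum to 1, so multiplying by 100 gives percentages.
contrib <- 100 * pca_demo$rotation^2

# Combined contribution over several components, weighting each component
# by its eigenvalue (assumption: this mirrors fviz_contrib with axes = 1:2).
axes <- 1:2
eig  <- pca_demo$sdev[axes]^2
combined <- as.vector(contrib[, axes, drop = FALSE] %*% eig) / sum(eig)
names(combined) <- rownames(contrib)

# Variables ordered from largest to smallest combined contribution
sort(combined, decreasing = TRUE)
```

For the analysis in this report, `axes` would be `1:5` and `pca_demo` would be replaced by `pca`. The combined contributions always sum to 100, which is a quick sanity check on the calculation.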

Conclusion:

The plot above suggests that ‘infant deaths’, ‘under-five deaths’, ‘percentage expenditure’, ‘Diphtheria’, ‘Life expectancy’, ‘Hepatitis B’ and ‘Income composition of resources’ make the largest contributions. Apart from these, ‘thinness 5-9 years’, ‘Schooling’, ‘thinness 10-19 years’ and ‘Adult Mortality’ are also important contributors to dimensions 1 to 5.

Gunneet Singh

February 25, 2022
