Dataset Description

This dataset contains socio-economic and health indicators for each country. These indicators help estimate each country's level of development and compare countries with one another. Example uses include:

  • setting country-specific pricing for a product
  • deciding how to allocate humanitarian aid among underdeveloped countries
  • assigning each country to a development level (as we do in this analysis)

Source of the dataset: Unsupervised Learning on Country Data.

Dataset upload

# Load the dataset. The CSV file must be in the same directory as this Rmd file.
temp <- read.csv("Country-data.csv")

# Use the "country" column as row names, then drop it, since it holds
# character values and cannot be scaled.
rownames(temp) <- temp$country
temp2 <- temp[, -1]

# Standardize every remaining variable to mean 0 and standard deviation 1.
country_data <- scale(temp2)
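
The standardization step above can be illustrated on a small example. The sketch below uses a hypothetical toy data frame (the values and the names `toy`, `toy_scaled`, and `manual_income` are made up for illustration and are not taken from Country-data.csv) to show that scale() subtracts each column's mean and divides by its standard deviation, while keeping the row names.

```r
# Toy data frame with hypothetical values (not from Country-data.csv)
toy <- data.frame(
  income     = c(1000, 5000, 9000),
  life_expec = c(55, 70, 82),
  row.names  = c("A", "B", "C")
)

# scale() centers each column to mean 0 and divides by its standard deviation
toy_scaled <- scale(toy)

# Manual equivalent for a single column
manual_income <- (toy$income - mean(toy$income)) / sd(toy$income)

# Column means are 0 (up to rounding) and column standard deviations are 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))

# The manual calculation matches the scale() result
print(all.equal(unname(toy_scaled[, "income"]), manual_income))
```

Without this step, variables measured on large scales (such as income) would dominate the Euclidean distances used later.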

Variables

Summary of Unsupervised Learning on Country Dataset Variables

Variable    Description
country     Name of the country
child_mort  Deaths of children under 5 years of age per 1,000 live births
exports     Exports of goods and services per capita, given as a percentage of GDP per capita
health      Total health spending per capita, given as a percentage of GDP per capita
imports     Imports of goods and services per capita, given as a percentage of GDP per capita
income      Net income per person
inflation   Measurement of the annual growth rate of the total GDP
life_expec  Average number of years a newborn child would live if current mortality patterns remained the same
total_fer   Number of children that would be born to each woman if current age-fertility rates remained the same
gdpp        GDP per capita, calculated as the total GDP divided by the total population

Creating Hierarchical Clustering

# Compute pairwise Euclidean distances between countries
distances <- dist(country_data, method = "euclidean")

# Build the hierarchical clustering with Ward's minimum-variance linkage
clusterCountries <- hclust(distances, method = "ward.D2")

# Plot the dendrogram
plot(clusterCountries)

# Cut the tree into 3 clusters
clusterGroups <- cutree(clusterCountries, k = 3)

# Plot the cluster assignment (1 to 3) of each country by row index
plot(clusterGroups)

# Mean of each standardized variable in cluster 1
colMeans(subset(country_data, clusterGroups == 1))
##  child_mort     exports      health     imports      income   inflation 
##  1.65638682 -0.63911207 -0.11236616 -0.29852849 -0.80687277 -0.06045525 
##  life_expec   total_fer        gdpp 
## -1.49637729  1.64200130 -0.67087483
# Mean of each standardized variable in cluster 2
colMeans(subset(country_data, clusterGroups == 2))
##  child_mort     exports      health     imports      income   inflation 
## -0.16494698 -0.04080723 -0.16819684  0.04937351 -0.30100537  0.12664922 
##  life_expec   total_fer        gdpp 
##  0.04115624 -0.19377211 -0.35773328
# Mean of each standardized variable in cluster 3
colMeans(subset(country_data, clusterGroups == 3))
##  child_mort     exports      health     imports      income   inflation 
## -0.80111954  0.63475270  0.61361032  0.08313756  1.57918040 -0.34683900 
##  life_expec   total_fer        gdpp 
##  1.05998899 -0.69982916  1.64803967
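
The three colMeans() calls above can also be collapsed into a single summary. The following is a minimal sketch, assuming the built-in USArrests data as a stand-in so that it runs without Country-data.csv; the names `demo_data`, `demo_tree`, `demo_groups`, `cluster_sizes`, and `cluster_means` are illustrative and do not appear in the original analysis.

```r
# Stand-in data: USArrests ships with base R (it is NOT the country data)
demo_data <- scale(USArrests)

# Same pipeline as above: Euclidean distances, Ward linkage, cut into 3 groups
demo_tree   <- hclust(dist(demo_data, method = "euclidean"), method = "ward.D2")
demo_groups <- cutree(demo_tree, k = 3)

# Number of observations in each cluster
cluster_sizes <- table(demo_groups)
print(cluster_sizes)

# One row per cluster, one column per standardized variable
cluster_means <- aggregate(as.data.frame(demo_data),
                           by = list(cluster = demo_groups),
                           FUN = mean)
print(cluster_means)
```

Applied to `country_data` and `clusterGroups`, the same aggregate() call would produce one table holding the three sets of means printed above.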

Conclusion

The countries are divided into 3 clusters representing development levels (Underdeveloped, Developing, Developed). To link each cluster to a development level, a few things need to be considered:

  • For the positive indicators (exports, health, etc.), a higher value is more desirable.
  • For the negative indicators (child_mort, inflation, etc.), a lower value is more desirable.

Based on these indicators and the cluster means above, each cluster can be matched to a development level. Specifically:

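  • Cluster 1 is Underdeveloped. It tends to have the lowest values for the positive indicators and the highest values for the negative indicators compared with the other clusters (for example, the highest child_mort at 1.66 and the lowest life_expec at -1.50, in standardized units).
  • Cluster 2 is Developing. Its values tend to fall between those of the other two clusters (for example, income at -0.30 and life_expec at 0.04).
  • Cluster 3 is Developed. It tends to have the highest values for the positive indicators and the lowest values for the negative indicators compared with the other clusters (for example, the highest income at 1.58 and the lowest child_mort at -0.80).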
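
The matching described above can be sketched as a simple score. This is an illustrative heuristic, not part of the original analysis: it assumes exports, health, income, life_expec, and gdpp count as positive indicators and child_mort, inflation, and total_fer as negative ones, and it leaves out imports because its direction is ambiguous. The cluster means are copied from the colMeans() output above; the names `positive`, `negative`, `score`, `ranked`, and `development_level` are hypothetical.

```r
# Cluster means copied from the colMeans() output above (imports omitted)
cluster_means <- rbind(
  "1" = c(child_mort = 1.65638682, exports = -0.63911207, health = -0.11236616,
          income = -0.80687277, inflation = -0.06045525,
          life_expec = -1.49637729, total_fer = 1.64200130, gdpp = -0.67087483),
  "2" = c(child_mort = -0.16494698, exports = -0.04080723, health = -0.16819684,
          income = -0.30100537, inflation = 0.12664922,
          life_expec = 0.04115624, total_fer = -0.19377211, gdpp = -0.35773328),
  "3" = c(child_mort = -0.80111954, exports = 0.63475270, health = 0.61361032,
          income = 1.57918040, inflation = -0.34683900,
          life_expec = 1.05998899, total_fer = -0.69982916, gdpp = 1.64803967)
)

# ASSUMPTION: this grouping of indicators is chosen here for illustration
positive <- c("exports", "health", "income", "life_expec", "gdpp")
negative <- c("child_mort", "inflation", "total_fer")

# Higher score = better on positive indicators and lower on negative ones
score <- rowMeans(cluster_means[, positive]) - rowMeans(cluster_means[, negative])

# Order clusters from lowest to highest score and attach the three labels
ranked <- names(sort(score))
development_level <- setNames(c("Underdeveloped", "Developing", "Developed"), ranked)

print(round(score, 2))
print(development_level)
```

Under these assumptions the ordering agrees with the labels assigned above, although a different choice of indicators or weights could shift the score values.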

As a result,

Creating K-means Clustering

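# Load factoextra, which provides fviz_cluster(), and ggplot2 for theme_minimal()
library(factoextra)
library(ggplot2)

# Create the k-means clusters (3 centers, best of 25 random starts)
country_kmeans <- kmeans(country_data, centers = 3, nstart = 25)

# Plot the k-means clusters on the first two principal components
fviz_cluster(country_kmeans,
             data = country_data,
             geom = "point",
             ggtheme = theme_minimal())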
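
Since both hierarchical and k-means clustering are applied with 3 groups, it can be useful to check how far the two assignments agree. The following is a minimal sketch of such a comparison, assuming the built-in USArrests data as a stand-in for the country data; the seed value and the names `demo_data`, `demo_kmeans`, `demo_hclust`, and `agreement` are illustrative choices, not part of the original analysis.

```r
# Stand-in data: USArrests ships with base R (it is NOT the country data)
demo_data <- scale(USArrests)

# k-means starts from random centers, so fix a seed for repeatable results
set.seed(42)
demo_kmeans <- kmeans(demo_data, centers = 3, nstart = 25)

# Hierarchical clustering with the same settings as in the section above
demo_hclust <- cutree(hclust(dist(demo_data, method = "euclidean"),
                             method = "ward.D2"),
                      k = 3)

# Cross-tabulate the two assignments: rows are hierarchical clusters,
# columns are k-means clusters
agreement <- table(hierarchical = demo_hclust, kmeans = demo_kmeans$cluster)
print(agreement)

# Size of each k-means cluster
print(demo_kmeans$size)
```

Cluster numbers are arbitrary in both methods, so agreement shows up as one dominant cell per row rather than as matching labels on the diagonal.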