Objective:
To categorize the countries using socio-economic and health factors that determine the overall development of the country.
Problem Statement:
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorize the countries using socio-economic and health factors that determine the overall development of a country, and then to suggest the countries on which the CEO should focus the most.
Context:
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
Packages used: e1071, tidyverse, plotly, htmltools, devtools, caret, NbClust, reshape2, rvest, magrittr, stringr, cowplot, ggmap
DATA DICTIONARY
* country: Name of the country
* child_mort: Death of children under 5 years of age per 1000 live births
* exports: Exports of goods and services per capita. Given as a percentage of the GDP per capita
* health: Total health spending per capita. Given as a percentage of the GDP per capita
* imports: Imports of goods and services per capita. Given as a percentage of the GDP per capita
* income: Net income per person
* inflation: The measurement of the annual growth rate of the Total GDP
* life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
* total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
* gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.
Peek at the First Six Rows
## # A tibble: 6 × 10
## country child_mort exports health imports income inflation life_expec
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 90.2 10 7.58 44.9 1610 9.44 56.2
## 2 Albania 16.6 28 6.55 48.6 9930 4.49 76.3
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.1 76.5
## 4 Angola 119 62.3 2.85 42.9 5900 22.4 60.1
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8
## 6 Argentina 14.5 18.9 8.1 16 18700 20.9 75.8
## # … with 2 more variables: total_fer <dbl>, gdpp <dbl>
Data Dimensions
## Shape: 167 10
## Columns: country child_mort exports health imports income inflation life_expec total_fer gdpp
## Country Labels: 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda' ...
Fortunately, our data set has no missing values, so we can proceed without worrying that NA values will hinder our model’s performance.
## Total Missing Values: 0
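The check above can be sketched in base R as follows. This is a minimal illustration, not the report's original code: the data frame name `countries_raw` is an assumption, and it holds only four columns of the six rows shown in the preview rather than the full 167-row data set.

```r
# Four columns of the first six rows, copied from the preview earlier in this report.
# `countries_raw` is a hypothetical name used only for this sketch.
countries_raw <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  stringsAsFactors = FALSE
)

# Total number of missing cells across the whole data frame.
total_missing <- sum(is.na(countries_raw))
cat("Total Missing Values:", total_missing, "\n")

# Per-column breakdown, useful when the total is not zero.
missing_by_column <- colSums(is.na(countries_raw))
print(missing_by_column)
```

If the total were not zero, the per-column counts would show where imputation or row removal is needed.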
To get the most out of the K-Means algorithm, we need to ensure that every variable has an appropriate type. Because we removed the country name before training, there are no chr variables left to convert to factors, and all nine remaining variables are numeric.
## tibble [167 × 9] (S3: tbl_df/tbl/data.frame)
## $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
We visualized the pairwise correlations between the variables to see how strongly they are related. This gives a good indication of which variables are likely to drive the partition into clusters: GDP per capita, life expectancy, and income.
Additionally, we plotted the distribution of each variable to look for outliers and to better understand the data in general.
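A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is illustrative only: it uses the six rows from the preview (with `gdpp` taken from the structure listing), so the coefficients will differ from those of the full data set, and the name `sample_rows` is an assumption.

```r
# Six preview rows; `sample_rows` is a hypothetical name for this sketch.
sample_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(sample_rows)
print(round(cor_matrix, 2))
```

Even on this small sample, income and GDP per capita move together, while child mortality moves against life expectancy.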
We implemented a min-max scaler to prepare the data set for the K-Means algorithm. Distance is the central mechanism by which K-Means clusters the data, so every variable needs to be on the same scale; otherwise, variables with large ranges such as income would dominate the distance calculation.
## # A tibble: 6 × 9
## child_mort exports health imports income inflation life_expec total_fer
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.426 0.0495 0.359 0.258 0.00805 0.126 0.475 0.737
## 2 0.0682 0.140 0.295 0.279 0.0749 0.0804 0.872 0.0789
## 3 0.120 0.192 0.147 0.180 0.0988 0.188 0.876 0.274
## 4 0.567 0.311 0.0646 0.246 0.0425 0.246 0.552 0.790
## 5 0.0375 0.227 0.262 0.338 0.149 0.0522 0.882 0.155
## 6 0.0579 0.0940 0.391 0.0916 0.145 0.232 0.862 0.192
## # … with 1 more variable: gdpp <dbl>
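Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The following is a minimal sketch of such a scaler, not the report's original implementation; the names `min_max_scale` and `sample_rows` are assumptions. Because it runs on only six preview rows, the scaled values differ from the table above, where the minimum and maximum come from all 167 countries.

```r
# Rescale a numeric vector to [0, 1]. A constant column would divide by zero,
# so this sketch assumes every column has at least two distinct values.
min_max_scale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Six preview rows; `sample_rows` is a hypothetical name for this sketch.
sample_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8)
)

# Apply the scaler column by column.
scaled <- as.data.frame(lapply(sample_rows, min_max_scale))
print(round(scaled, 3))
```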
We decided to use K-Means Clustering because our goal is to categorize the countries using socio-economic and health factors that determine the overall development of the country.
Initially, we fit a K-Means model with 2 centers because we expected the countries to split into two groups: developed and underdeveloped. However, the variance explained was low at 0.3935. Before tuning the number of clusters, we first visualize this model to see how it looks.
country_kmeans <- kmeans(
  countries,
  centers = 2,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.393483
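The report does not show how this figure was computed. A common choice, assumed here, is the ratio of between-cluster to total sum of squares, both of which `kmeans()` returns. The sketch below uses synthetic two-blob data (`toy` is a hypothetical stand-in for the scaled `countries` table), so its output is unrelated to the value reported above.

```r
set.seed(42)  # K-Means starts from random centers; fix the seed for repeatability

# Synthetic stand-in data: two well-separated blobs in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0.2, sd = 0.05), ncol = 2),
  matrix(rnorm(100, mean = 0.8, sd = 0.05), ncol = 2)
)

fit <- kmeans(toy, centers = 2, algorithm = "Lloyd", iter.max = 30)

# Share of the total variance captured by the separation between clusters.
variance_explained <- fit$betweenss / fit$totss
cat("Variance Explained:", round(variance_explained, 6), "\n")
```

A value close to 1 means points sit tightly around their cluster centers; a low value, as with the 2-center model here, means much of the spread remains inside the clusters.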
The visualized socio-economic clusters look reasonable. However, with only 2 clusters the result is too coarse: the second cluster contains only countries from Africa and some from Asia. We wanted to narrow the grouping further to get a better idea of which countries are in the direst need of aid.
Load Map Data
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
Visualize Socio-Economic Clusters
We used a few different methods to determine the best number of clusters. Both the elbow chart and NbClust gave 3 as the best number of clusters to use. Therefore, we retrain the model with 3 centers.
Elbow Method
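The elbow method fits K-Means for a range of k and looks for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below is a base-R illustration on synthetic three-blob data, not the report's original code; `toy`, `k_values`, and `wss` are assumed names, and `nstart = 10` is an added choice to reduce sensitivity to the random start.

```r
set.seed(42)

# Synthetic stand-in data: three well-separated blobs in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0.1, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.5, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.9, sd = 0.03), ncol = 2)
)

# Total within-cluster sum of squares for each candidate number of clusters.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10, iter.max = 30)$tot.withinss
})

print(data.frame(k = k_values, wss = round(wss, 4)))
```

Plotting `wss` against `k_values`, for example with `plot(k_values, wss, type = "b")`, gives the elbow chart; the bend marks the k beyond which extra clusters add little.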
NbClust Method
Variance explained has increased to about 0.548, which is considerably better than the previous model (0.3935). We visualize the resulting clusters to see whether more tuning is needed.
The graph gives more detail by breaking the countries down by socio-economic status. However, the group of countries with lower socio-economic status remains largely the same (the majority are from Africa, with some from Asia). We therefore concluded that the output would not improve much further with our current data set and decided to use this as our final model. We will visualize the model in 3D to determine the 10 countries that need aid the most.
final_kmeans <- kmeans(
  countries,
  centers = 3,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.547986
Visualize Socio-Economic Clusters
To further visualize the K-Means clusters, we used plotly to create a 3D plot of income, life expectancy, and GDP per capita. As briefly mentioned before, these variables are highly correlated with each other and clearly illustrate the breakdown of the clusters. The purple data points represent the underdeveloped nations, the orange points the developing nations, and the green points the developed nations. Beyond correlation, our group thought these variables made the most sense given how important each is to a nation’s wellbeing.
In conclusion, the 10 countries we selected as being in the direst need of aid are: Haiti, Central African Republic, Lesotho, Malawi, Zambia, Mozambique, Sierra Leone, Guinea-Bissau, Afghanistan, and Uganda. These countries have the lowest net income per person, life expectancy, and GDP per capita; furthermore, as the graph above shows, they all fall in the underdeveloped cluster.
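The report does not show how the ten countries were ranked. One possible approach, an assumption rather than the method actually used, is to rank every country on each of the three variables and average the ranks. The sketch below applies this to the six preview rows only, so it cannot reproduce the list above; the names `preview` and `neediest` are hypothetical.

```r
# Six preview rows with the three variables used for the final selection.
preview <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300),
  stringsAsFactors = FALSE
)

# Rank 1 is the lowest (worst) value on each variable; average the three ranks.
preview$need_rank <- rowMeans(cbind(
  rank(preview$income),
  rank(preview$life_expec),
  rank(preview$gdpp)
))

# Countries with the smallest average rank are the neediest under this assumption.
neediest <- head(preview[order(preview$need_rank), ], 2)
print(neediest[, c("country", "need_rank")])
```

On the full data set, the same ordering could be restricted to the underdeveloped cluster and `head(..., 10)` used to obtain ten countries.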
Our K-Means model performed fairly well and was able to capture distinct clusters of nations. Furthermore, the consensus of three clusters for the final model makes sense, as nations can generally be classified as underdeveloped, developing, or developed. Although the explained variance of our model wasn’t especially high, this is expected for a small number of clusters, since the model groups together countries from entirely different continents and opposite ends of the Earth. To substantially increase the explained variance, we would need to significantly increase the number of clusters; this would inevitably lead to overfitting and unnecessary complexity in the model.
Overall, we are happy with the model’s performance and believe it definitely provides a lot of value. Recognizing and visualizing the disparity in the wellbeing of nations is a necessary step to providing aid.
Our data set does not include any protected classes and therefore we will not be conducting a fairness assessment. That being said, it is important to note that each country inherently has a wide variety of demographic information.
While building the model, we treated net income per person, life expectancy, and GDP per capita as the vital variables in deciding the socio-economic status of a country. However, other variables might be considered more important. Furthermore, even though we identified ten countries in the direst need of aid, we still need to dig further into which areas HELP International wants to focus on. Each of the selected countries has specific problem areas that must be prioritized, and aid cannot be given blindly. Therefore, the next step should be further analysis of the areas in which those countries most need aid.
Additionally, trying a variety of unsupervised learning algorithms might have produced better results. Although K-Means is likely one of the best algorithms for a clustering problem like this, there are other options. For example, algorithms like DBSCAN could potentially work as well.
Lastly, we could benefit from more data for training. Yet, unlike many other machine learning problems where more data can be collected, there is a finite number of countries that can be used as data points. Therefore, we could experiment with feature engineering on the existing columns, or look for different features altogether that we believe are relevant to the wellbeing of a country.