DS3001 Final Project

PROJECT DESCRIPTION

Objective:
To categorize the countries using socio-economic and health factors that determine the overall development of the country.

Problem Statement:
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorize the countries using socio-economic and health factors that determine each country's overall development, and then to suggest the countries the CEO should focus on the most.

Context:
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

DATA PREPARATION

Step 0 — Import Libraries

e1071, tidyverse, plotly, htmltools, devtools, caret, NbClust, reshape2, rvest, magrittr, stringr, cowplot, ggmap

Step 1 — Load Data

DATA DICTIONARY
* country: Name of the country
* child_mort: Death of children under 5 years of age per 1000 live births
* exports: Exports of goods and services per capita. Given as %age of the GDP per capita
* health: Total health spending per capita. Given as %age of GDP per capita
* imports: Imports of goods and services per capita. Given as %age of the GDP per capita
* income: Net income per person
* inflation: The measurement of the annual growth rate of the Total GDP
* life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
* total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
* gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Peek at the First Six Rows

## # A tibble: 6 × 10
##   country             child_mort exports health imports income inflation life_expec
##   <chr>                    <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1 Afghanistan               90.2    10     7.58    44.9   1610      9.44       56.2
## 2 Albania                   16.6    28     6.55    48.6   9930      4.49       76.3
## 3 Algeria                   27.3    38.4   4.17    31.4  12900     16.1        76.5
## 4 Angola                   119      62.3   2.85    42.9   5900     22.4        60.1
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44       76.8
## 6 Argentina                 14.5    18.9   8.1     16    18700     20.9        75.8
## # … with 2 more variables: total_fer <dbl>, gdpp <dbl>

Data Dimensions

## Shape: 167 10

## Columns: country child_mort exports health imports income inflation life_expec total_fer gdpp

## Country Labels: 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda' ...

Step 2 — Check for Missing Values

Our data set does not have any missing values, so we can proceed without worrying that NA values will hinder our model’s performance.

## Total Missing Values: 0
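
A count like the one above can be produced with `is.na()`. The sketch below is only an illustration: `toy` is a hypothetical three-row table with one `NA` injected on purpose so the check has something to find, whereas the real data set reports zero.

```r
# Hypothetical toy table; the NA is injected deliberately for the demo
toy <- data.frame(
  child_mort = c(90.2, 16.6, NA),
  income     = c(1610, 9930, 12900)
)

total_missing  <- sum(is.na(toy))      # NA cells across the whole table
missing_by_col <- colSums(is.na(toy))  # per-column breakdown

cat("Total Missing Values:", total_missing, "\n")
print(missing_by_col)
```

The per-column breakdown is useful when the total is non-zero, since it shows which variable would need imputation or removal.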

Step 3 — Ensure Correct Data Types

Before running the K-Means algorithm, we need to ensure that all variables are appropriately typed. K-Means computes distances on numeric columns, and all nine remaining variables are numeric. Because we removed the country name before training the model, there are no chr variables left to convert.

## tibble [167 × 9] (S3: tbl_df/tbl/data.frame)
##  $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
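
As a rough sketch of this step, the label column can be dropped and the remaining columns checked for type. The names `raw` and `features` are hypothetical, and the two rows are taken from the preview above.

```r
# Two rows from the preview above; `raw` and `features` are hypothetical names
raw <- data.frame(
  country    = c("Afghanistan", "Albania"),
  child_mort = c(90.2, 16.6),
  income     = c(1610, 9930),
  stringsAsFactors = FALSE
)

# Drop the label column so only numeric features reach the clustering step
features <- raw[, names(raw) != "country", drop = FALSE]

# Fail early if any remaining column is not numeric
stopifnot(all(sapply(features, is.numeric)))
str(features)
```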

Step 4 — Explore Data Distributions

We visualized the correlations between the variables to see the strength of the relationship between each pair. This gives us a good indication of which variables are likely to drive how the clusters partition the data into groups: GDP per capita, life expectancy, and income.

Additionally, we plotted the distribution of each variable to look for outliers and to better understand the data in general.

Scatter Matrix

Histogram Matrix

Correlation Matrix
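
The numbers behind a correlation matrix can be obtained with `cor()`. The sketch below uses only four columns from the six preview rows shown in Step 1, so the coefficients are illustrative and will differ from those computed on all 167 countries.

```r
# Four columns from the six preview rows shown in Step 1
head_rows <- data.frame(
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300),
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5)
)

# Pearson correlation between every pair of columns
cor_mat <- round(cor(head_rows), 2)
print(cor_mat)
```

Even on this tiny sample, income and GDP per capita move together, while child mortality moves against life expectancy.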

Step 5 — Normalize the Data

We applied min-max scaling to prepare the data set for the K-Means algorithm. Distance is the central mechanism by which K-Means clusters the data, so every variable needs to be on the same scale.

## # A tibble: 6 × 9
##   child_mort exports health imports  income inflation life_expec total_fer
##        <dbl>   <dbl>  <dbl>   <dbl>   <dbl>     <dbl>      <dbl>     <dbl>
## 1     0.426   0.0495 0.359   0.258  0.00805    0.126       0.475    0.737 
## 2     0.0682  0.140  0.295   0.279  0.0749     0.0804      0.872    0.0789
## 3     0.120   0.192  0.147   0.180  0.0988     0.188       0.876    0.274 
## 4     0.567   0.311  0.0646  0.246  0.0425     0.246       0.552    0.790 
## 5     0.0375  0.227  0.262   0.338  0.149      0.0522      0.882    0.155 
## 6     0.0579  0.0940 0.391   0.0916 0.145      0.232       0.862    0.192 
## # … with 1 more variable: gdpp <dbl>
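
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). A minimal sketch follows; `min_max` is a hypothetical helper name, and because it runs on only six preview rows the output will not match the table above, which was scaled over all 167 countries.

```r
# Hypothetical helper: rescale a numeric vector to [0, 1]
# (assumes the vector is not constant, otherwise the denominator is zero)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Three columns from the six preview rows shown in Step 1
head_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8)
)

# Apply the scaler column by column
scaled <- as.data.frame(lapply(head_rows, min_max))
print(round(scaled, 3))
```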

K-MEANS CLUSTERING

We decided to use K-Means Clustering because our goal is to categorize the countries using socio-economic and health factors that determine the overall development of the country.

Initially, we created a K-Means model with 2 centers because we expected the countries to split into two groups: developed and underdeveloped. However, the variance explained was low at 0.3935. Before tuning the hyperparameters, we first visualize the model to see how it looks.

Step 6 — Initial K-Means Model

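# Initial model with two centers
country_kmeans <- kmeans(
  countries,
  centers=2,          # expected groups: developed vs. underdeveloped
  algorithm="Lloyd",
  iter.max=30         # upper bound on iterations
)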

Evaluate Cluster Quality

## Variance Explained: 0.393483
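
A figure like the one above can be computed from the fitted object as the ratio of between-cluster to total sum of squares. The sketch below assumes that definition, since the report does not show its own calculation, and runs on a min-max-scaled toy table built from the six preview rows, so its value will differ from the full-data result.

```r
# Toy features: three columns from the six preview rows shown in Step 1
toy <- data.frame(
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Min-max scale each column to [0, 1]
toy_scaled <- as.data.frame(
  lapply(toy, function(x) (x - min(x)) / (max(x) - min(x)))
)

set.seed(3001)  # assumed seed, only to make the toy run repeatable
fit <- kmeans(toy_scaled, centers = 2, algorithm = "Lloyd", iter.max = 30)

# Share of total variation captured by the cluster assignment
var_explained <- fit$betweenss / fit$totss
cat("Variance Explained:", round(var_explained, 4), "\n")
```

Because K-Means starts from random centers, fixing a seed is what makes such a number reproducible from run to run.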

Step 7 — Visualize Clusters

The visualized socio-economic clusters look reasonable. But with only 2 clusters, the map does not give us much detail and the result looks too generalized: the second cluster contains only countries from Africa and some from Asia. We wanted to narrow this down further to get a better idea of which countries have the direst need for aid.

Load Map Data

##        long      lat group order region subregion
## 1 -69.89912 12.45200     1     1  Aruba      <NA>
## 2 -69.89571 12.42300     1     2  Aruba      <NA>
## 3 -69.94219 12.43853     1     3  Aruba      <NA>
## 4 -70.00415 12.50049     1     4  Aruba      <NA>
## 5 -70.06612 12.54697     1     5  Aruba      <NA>
## 6 -70.05088 12.59707     1     6  Aruba      <NA>

Visualize Socio-Economic Clusters

Step 8 — Hyperparameter Tuning

We used a few different methods to determine the best number of clusters. Both the elbow chart and NbClust gave 3 as the best number of clusters to use. Therefore, we will retrain the model with 3 centers.
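
For reference, an elbow chart plots the total within-cluster sum of squares against k and looks for the point where the curve flattens. The sketch below uses synthetic data with three well-separated blobs, not the country data; `nstart = 25` and the default `kmeans` algorithm are choices made for this illustration, not settings taken from the report.

```r
set.seed(42)  # assumed seed for a repeatable illustration

# Synthetic data: three blobs of 30 points each in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 2)))
# plot(ks, wss, type = "b") would draw the elbow chart itself
```

On data like this the curve drops steeply up to k = 3 and flattens afterwards, which is the pattern read as an "elbow".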

Elbow Method

NbClust Method

Step 9 — Final Model (k=3)

Variance explained has increased to about 0.548, which is considerably better than the previous model (0.3935). We will visualize the clusters to see if we need more tuning.

The map gives us more detail by breaking the countries down by socio-economic status. However, the set of countries with lower socio-economic status remains largely the same (the majority are from Africa and some are from Asia). We therefore determined that the output of the model would not improve further with our current dataset and decided to use this as our final model. We will visualize the model in 3D to determine the 10 countries that need aid the most.

final_kmeans <- kmeans(
    countries, 
    centers=3,
    algorithm="Lloyd",
    iter.max=30
)

Evaluate Cluster Quality

## Variance Explained: 0.547986

Visualize Socio-Economic Clusters

Step 10 — Visualize 3D Clusters

To further visualize the K-Means clusters, we used plotly to create a 3D plot of income, life expectancy, and GDP per capita. As briefly mentioned before, these variables are highly correlated with each other and clearly illustrate the breakdown of the clusters. The purple data points represent underdeveloped nations, the orange represent developing nations, and the green represent developed nations. Beyond correlation, our group thought these variables made the most sense considering the importance of each to a nation’s wellbeing.

CONCLUSION

In conclusion, the 10 countries we selected as being in the direst need of aid are: Haiti, Central African Republic, Lesotho, Malawi, Zambia, Mozambique, Sierra Leone, Guinea-Bissau, Afghanistan, and Uganda. These countries have the lowest net income per person, life expectancy, and GDP per capita; furthermore, as made evident by the graph above, they are all in the underdeveloped cluster.
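
One possible way to turn "lowest income, life expectancy, and GDP per capita" into a shortlist is to rank each indicator and average the ranks. This is a hypothetical scoring rule for illustration only, not necessarily how the list above was chosen. It runs on the six preview rows from Step 1 and picks two countries rather than ten.

```r
# Six preview rows from Step 1; the scoring rule below is an assumption
head_rows <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300),
  stringsAsFactors = FALSE
)

# Rank each indicator (1 = lowest value), then average the three ranks
indicators <- c("income", "life_expec", "gdpp")
need_score <- rowMeans(sapply(head_rows[indicators], rank))

# Lowest average rank = greatest need under this rule
ranked <- head_rows[order(need_score), ]
print(head(ranked$country, 2))
```

In a full analysis the ranking would be applied only to countries in the underdeveloped cluster, with the cutoff set to ten.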

Our K-Means model performed reasonably well and was able to capture distinct clusters of nations. Furthermore, the consensus of three clusters for the final model makes sense, as nations can generally be classified as underdeveloped, developing, or developed. Although the explained variance of our model wasn’t especially high, this is expected for a limited number of clusters, as the model is grouping countries across whole continents and countries on opposite ends of the Earth. To substantially increase the explained variance, we would need to significantly increase the number of clusters; this would inevitably lead to overfitting and unnecessary complexity in the model.

Overall, we are happy with the model’s performance and believe it definitely provides a lot of value. Recognizing and visualizing the disparity in the wellbeing of nations is a necessary step to providing aid.

FAIRNESS ASSESSMENT

Our data set does not include any protected classes and therefore we will not be conducting a fairness assessment. That being said, it is important to note that each country inherently has a wide variety of demographic information.

FUTURE WORK

While building the model, we treated net income per person, life expectancy, and GDP per capita as the vital variables in deciding the socio-economic status of a country. However, other variables might be considered more important. Furthermore, even though we identified ten countries with the direst need of aid, we still need to dig further into which areas HELP International wants to focus on. Each selected country has specific problem areas that must be prioritized, and aid cannot be given blindly. Therefore, the next step should be further analysis of which areas each of those countries needs aid in.

Additionally, trying a variety of unsupervised learning algorithms might produce better results. Although K-Means is a natural fit for a clustering problem like this, there are multiple other options. For example, algorithms like DBSCAN could potentially work as well.

Lastly, we could benefit from more data for training. Yet, unlike many other machine learning problems where more data can be collected, there is a finite number of countries that can be used as data points. Therefore, we could experiment with feature engineering on the existing columns, or look for different features altogether that we believe are relevant to the wellbeing of a country.