World Bank Country Development Project
Executive Summary:
The labels ‘developing’ and ‘developed’ are too coarse to describe the
progress of countries around the world. Using economic data collected
from the World Bank, this project moves beyond these traditional
groupings and builds new ones. The Exploratory Data Analysis section
covers the initial data analysis and the cleaning of the data. The
Methods section covers imputing missing values, reducing the number of
predictors, and clustering the countries to obtain the new groupings.
Each plot included is fully interactive and can be downloaded. Finally,
the Discussion section examines the groups themselves in detail.
Problem Description:
Using a dataset derived from the World Bank Development Indicators,
containing a variety of social, economic, and environmental metrics,
this project seeks to develop a small number of new “development
archetypes” that summarize broad global patterns in national progress.
The traditional groupings (“developed” and “developing”) are simplistic
and may fail to capture new trends.
Exploratory Data Analysis:
First and foremost, the data must be made ready for analysis. The
given data set contained a fair amount of missing data. The figure below
shows the proportion of missing values for each explanatory
variable.
Figure 1.1
| Variable | Proportion missing |
| Country | 0.0000000 |
| Code | 0.0000000 |
| Access_electricity | 0.0092166 |
| Agriculture | 0.0921659 |
| Birth_rate | 0.0000000 |
| CO2 | 0.0645161 |
| Health_expenditure | 0.1152074 |
| GDP | 0.0368664 |
| Education_expenditure | 0.2165899 |
| Hospital_beds | 0.4423963 |
| Internet_use | 0.1428571 |
| Life_expectancy | 0.0000000 |
| Literacy_rate | 0.8064516 |
| Infant_mortality | 0.0967742 |
| Access_water | 0.3686636 |
| Cropland | 0.0737327 |
| Physicians | 0.4331797 |
| HIV | 0.2995392 |
| Unemployment | 0.4516129 |
| Urban | 0.0092166 |
| Renewable | 0.0230415 |
| Education_years | 0.0967742 |
| Imports | 0.1612903 |
| Exports | 0.1612903 |
| Food_production | 0.1013825 |
| Women_parliament | 0.1244240 |
| Poverty | 0.6820276 |
As can be seen in Figure 1.1, some variables are missing up to 80% of
their values. This not only diminishes their usefulness as predictors
but also calls into question their inclusion in the data set. For this
project, I opted to drop all variables with more than 20% of their
values missing.
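The filtering rule above can be sketched in a few lines of Python. This is only an illustration: the miniature table, its values, and the names `data`, `MAX_MISSING`, and `missing_fraction` are hypothetical and are not taken from the World Bank data or from the code used for this project.

```python
# Hypothetical miniature of the indicator table; None marks a missing value.
data = {
    "Birth_rate":    [10.2, 35.1, 18.4, 12.0, 27.3],
    "GDP":           [4.1e10, None, 2.3e9, 8.8e11, 1.2e10],
    "Literacy_rate": [None, None, 0.91, None, None],
    "Hospital_beds": [2.9, None, None, 5.1, 1.3],
}

MAX_MISSING = 0.20  # threshold stated in the text


def missing_fraction(values):
    """Return the share of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)


fractions = {name: missing_fraction(col) for name, col in data.items()}

# Keep a variable only if no more than 20% of its values are missing.
kept = [name for name, frac in fractions.items() if frac <= MAX_MISSING]

for name, frac in fractions.items():
    status = "keep" if name in kept else "drop"
    print(f"{name:15s} {frac:.2f} {status}")
```

A variable sitting exactly at the threshold is kept here, since the rule in the text drops only variables with more than 20% missing.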
Figure 1.2
| Variable | Proportion missing |
| Country | 0.0000000 |
| Code | 0.0000000 |
| Access_electricity | 0.0092166 |
| Agriculture | 0.0921659 |
| Birth_rate | 0.0000000 |
| CO2 | 0.0645161 |
| Health_expenditure | 0.1152074 |
| GDP | 0.0368664 |
| Internet_use | 0.1428571 |
| Life_expectancy | 0.0000000 |
| Infant_mortality | 0.0967742 |
| Cropland | 0.0737327 |
| Urban | 0.0092166 |
| Renewable | 0.0230415 |
| Education_years | 0.0967742 |
| Imports | 0.1612903 |
| Exports | 0.1612903 |
| Food_production | 0.1013825 |
| Women_parliament | 0.1244240 |
Figure 1.2 shows the remaining variables that will be considered in
the analysis. These are the variables used to cluster the countries in
the data.
Next, there was the issue of the remaining missing values. Principal
Component Analysis (PCA) requires a complete data set, so missing data
is a large hurdle in the analysis. To tackle this problem, I sought to
impute, or fill in new values for, all the missing data. Two approaches,
MICE (Multiple Imputation by Chained Equations) and KNN (K-Nearest
Neighbors), would work for this data given the wealth of other
predictors available when a row contains a missing value. The
methodology and outcomes of this imputation will be discussed in the
Methods section. Because I dropped the sparsest predictors, each
remaining variable is missing 20% or less of its values, which keeps the
imputed share small enough for KNN imputation to remain reasonable. In
addition, with at least 80% of each variable observed, the true data
still carries most of the information and is not washed out by imputed
values.
Next, I had to choose between performing PCA on the correlation matrix
or on the covariance matrix. The core difference is that the correlation
matrix puts all variables on a common scale, whereas the covariance
matrix gives more weight to variables with larger variances. That is,
the correlation matrix allows a better focus on relationships between
variables, which is what this project seeks to explore.
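The difference between the two matrices can be seen on a small synthetic example. The sketch below is a NumPy illustration under stated assumptions: the two indicators, their scales, and the helper `first_component` are hypothetical and are not part of the project data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two correlated indicators on very different scales.
n = 200
small = rng.normal(size=n)                               # unit-scale indicator
large = 1e6 * (0.8 * small + 0.6 * rng.normal(size=n))   # dollar-scale indicator
X = np.column_stack([small, large])

S = np.cov(X, rowvar=False)        # covariance matrix
R = np.corrcoef(X, rowvar=False)   # correlation matrix


def first_component(matrix):
    """Return the unit eigenvector with the largest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, np.argmax(eigenvalues)]


pc1_S = np.abs(first_component(S))
pc1_R = np.abs(first_component(R))

print("PC1 loadings from S:", pc1_S.round(3))
print("PC1 loadings from R:", pc1_R.round(3))
```

With the covariance matrix, the first component is essentially the large-scale variable alone. With the correlation matrix, both variables load equally, so the component reflects their relationship rather than their units.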
What can be noted from Figures 1.3 and 1.4 is the number of principal
components carried into further analysis: 3. The plots are almost
identical in terms of variance explained, both in total and by each
component. Both plots show an “elbow” at 3 principal components, which
together explain about 62% of all variance in the data. Since the
variance explained was extremely similar for the covariance matrix (S)
and the correlation matrix (R) at the same number of principal
components, I opted to use the correlation matrix. Because the values in
the original data set vary wildly in scale across predictors, a scaled
data set is better suited to examining relationships between variables
and should allow for more effective clustering. Moving forward, the
countries are split into clusters using K-means on their principal
component scores. K-means and principal component scores are explained
in depth in the Methods section.
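The variance explained by each component and the principal component scores can be computed directly from the correlation matrix. The following is a minimal NumPy sketch, not the code used for this project: the function `pca_correlation` and the randomly generated stand-in for the 17-predictor table are assumptions made for illustration.

```python
import numpy as np


def pca_correlation(X, n_components):
    """PCA on the correlation matrix: standardize, then eigendecompose."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]          # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()    # share of variance per PC
    scores = Z @ eigenvectors[:, :n_components]    # principal component scores
    return explained, scores


# Hypothetical stand-in for the imputed 17-predictor table.
rng = np.random.default_rng(1)
latent = rng.normal(size=(150, 3))
X = latent @ rng.normal(size=(3, 17)) + 0.5 * rng.normal(size=(150, 17))

explained, scores = pca_correlation(X, n_components=3)
print("cumulative variance explained:", np.cumsum(explained)[:5].round(3))
print("score matrix shape:", scores.shape)
```

Plotting `explained` against the component number gives a scree plot of the kind used to look for an elbow, and the columns of `scores` are the inputs to the clustering step.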
Methods:
First, I explain the imputation in more depth. I opted to employ
K-nearest neighbors (KNN) imputation to account for the missing data.
For each point with a missing value, KNN computes the Euclidean distance
between that point and all other points, finds the K closest points in
the data (for a selected number K), and averages their values of the
missing variable to create a value for the missing point. KNN takes
advantage of correlations between variables to impute, which matters
when the project is focused on relationships between features. Once
imputed, the data set was complete and ready for analysis.
Next comes PCA. In general, PCA is a dimension reduction method that
condenses the relationships among many variables into just a few. The
algorithm takes correlated variables and finds new axes, in place of the
original variable axes, that each explain as much of the remaining
variance as possible. In our case, there are 17 different predictors in
the data, and we want to condense those down to a number of components
that can be visualized and clustered.
Once PCA was completed, I employed K-means clustering to group the
data. K-means takes a set number (K) of groups and breaks the data into
that many groups. More specifically, each group starts with its own
center, and every point is assigned to the group whose center is closest
by Euclidean distance. The centers are then recalculated as the average
of the points in each group; in this specific case, the center of each
cluster is an average principal component score. The algorithm keeps
iterating until the group assignments are unchanged across 2 consecutive
iterations.
K is chosen using the Within-Group Sum of Squares (WGSS), the quantity
K-means minimizes for a given K. WGSS measures how close the points in
each cluster are to each other, so a small WGSS points towards a good
selection of clusters. Because WGSS generally keeps shrinking as K
grows, K is selected by computing WGSS for each candidate value of K and
choosing the point where the curve shows an elbow, beyond which adding
clusters yields little further reduction.
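The KNN imputation described at the start of this section can be sketched as follows. This is a from-scratch NumPy illustration rather than the routine used for the project: the function `knn_impute`, the toy table, and the choice to average squared differences over the columns two rows share are assumptions of the sketch.

```python
import numpy as np


def knn_impute(X, k=3):
    """Fill NaNs with the mean of the k nearest rows that observe that column.

    Distance is Euclidean over the columns both rows have observed, averaged
    over the number of shared columns so rows with different amounts of
    missing data stay comparable (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for other in range(X.shape[0]):
            if other == i or np.isnan(X[other, j]):
                continue  # a neighbor must have the target column observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[other])
            if not shared.any():
                continue
            diff = X[i, shared] - X[other, shared]
            candidates.append((np.sqrt(np.mean(diff ** 2)), X[other, j]))
        candidates.sort(key=lambda pair: pair[0])
        neighbors = [value for _, value in candidates[:k]]
        if neighbors:
            filled[i, j] = np.mean(neighbors)
    return filled


# Hypothetical toy table: rows are countries, columns are indicators.
X = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [0.9, 1.9, np.nan],   # close to the first two rows
    [8.0, 9.0, 50.0],
    [8.2, 9.1, 52.0],
])
filled = knn_impute(X, k=2)
print(filled[2, 2])  # average of the two nearest rows' third column
```

In practice the indicators would need to be on comparable scales before distances are computed, otherwise the largest-valued variable dominates the choice of neighbors.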
As can be seen in Figure 2.1, 9 groups is the optimal number for
clustering. Therefore, we will move forward creating 9 clusters. The
goodness of fit of the clusters will be evaluated in the Discussion
section. The clusters were labeled according to several factors:
population growth or stagnation (birth rate), economy size (GDP), urban
versus rural, pre- versus post-industrial (CO2 emissions), and food
versus non-food production.
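The K-means procedure and the WGSS comparison across values of K can be sketched as below. The sketch also reports the Between-Group Sum of Squares (BGSS) that the Discussion section uses to judge separation. It is a plain NumPy illustration under stated assumptions: the functions `kmeans` and `sums_of_squares`, the random starting centers, and the three synthetic blobs standing in for principal component scores are hypothetical and not the project's code or data.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm) with randomly chosen starting centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Assign each point to the nearest center by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged between consecutive iterations
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers


def sums_of_squares(points, labels, centers):
    """Return (WGSS, BGSS) for a given clustering."""
    wgss = sum(((points[labels == c] - centers[c]) ** 2).sum()
               for c in range(len(centers)))
    overall = points.mean(axis=0)
    bgss = sum((labels == c).sum() * ((centers[c] - overall) ** 2).sum()
               for c in range(len(centers)))
    return wgss, bgss


# Hypothetical stand-in for the principal component scores: 3 blobs in 3-D.
rng = np.random.default_rng(2)
blob_means = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6]])
scores = np.vstack([m + rng.normal(size=(40, 3)) for m in blob_means])

for k in range(1, 7):
    labels, centers = kmeans(scores, k)
    wgss, bgss = sums_of_squares(scores, labels, centers)
    print(f"K={k}  WGSS={wgss:9.1f}  BGSS={bgss:9.1f}")
```

For any K, WGSS and BGSS add up to the total sum of squares around the overall mean, so a drop in one is a rise in the other. The elbow is the value of K after which WGSS stops falling quickly.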
Results:
Cluster 1 - Stagnating Large Urbanized Post-Industrial
Non-Agricultural Importers
| Bahrain |
| Brunei Darussalam |
| Gibraltar |
| Kuwait |
| New Caledonia |
| Oman |
| Palau |
| Qatar |
| Saudi Arabia |
| United Arab Emirates |
Cluster 2 - Stagnating Medium Rural Pre-Industrial Non-Food
Exporters
| Angola |
| Benin |
| Botswana |
| Cameroon |
| Congo, Rep. |
| Cote d’Ivoire |
| Equatorial Guinea |
| Eritrea |
| Eswatini |
| Gabon |
| Gambia, The |
| Haiti |
| Kenya |
| Lesotho |
| Mauritania |
| Myanmar |
| Namibia |
| Nigeria |
| Pakistan |
| Papua New Guinea |
| Solomon Islands |
| Sudan |
| Togo |
| Vanuatu |
| Yemen, Rep. |
| Zambia |
| Zimbabwe |
Cluster 3 - Stagnating Medium Urbanizing Industrializing Non-Food
Exporters
| Albania |
| American Samoa |
| Azerbaijan |
| Bahamas, The |
| Belarus |
| Belize |
| Bosnia and Herzegovina |
| Bulgaria |
| Channel Islands |
| China |
| Croatia |
| Cyprus |
| Czechia |
| Estonia |
| Georgia |
| Greece |
| Guam |
| Hungary |
| Iran, Islamic Rep. |
| Jamaica |
| Jordan |
| Kazakhstan |
| Korea, Rep. |
| Latvia |
| Lebanon |
| Libya |
| Lithuania |
| Malaysia |
| Maldives |
| Montenegro |
| Morocco |
| Northern Mariana Islands |
| Panama |
| Poland |
| Romania |
| Russian Federation |
| Serbia |
| Seychelles |
| Slovak Republic |
| Slovenia |
| St. Martin (French part) |
| Suriname |
| Thailand |
| Trinidad and Tobago |
| Tunisia |
| Turkiye |
| Ukraine |
| Viet Nam |
| West Bank and Gaza |
Cluster 4 - Growing Small Rural Pre-Industrial Raw Material
Exporters
| Antigua and Barbuda |
| Armenia |
| Aruba |
| Barbados |
| Bolivia |
| Brazil |
| British Virgin Islands |
| Cabo Verde |
| Colombia |
| Dominica |
| Dominican Republic |
| Ecuador |
| Egypt, Arab Rep. |
| El Salvador |
| Grenada |
| Guatemala |
| Honduras |
| Korea, Dem. People’s Rep. |
| Kosovo |
| Mauritius |
| Mexico |
| Moldova |
| North Macedonia |
| Paraguay |
| Peru |
| Philippines |
| Samoa |
| Sri Lanka |
| St. Kitts and Nevis |
| St. Lucia |
| St. Vincent and the Grenadines |
| Tonga |
| Venezuela, RB |
Cluster 5 - Rich Micro and City-States
| Andorra |
| Argentina |
| Australia |
| Austria |
| Belgium |
| Bermuda |
| Canada |
| Cayman Islands |
| Chile |
| Costa Rica |
| Cuba |
| Curacao |
| Denmark |
| Faroe Islands |
| Finland |
| France |
| French Polynesia |
| Germany |
| Greenland |
| Iceland |
| Isle of Man |
| Israel |
| Italy |
| Japan |
| Macao SAR, China |
| Netherlands |
| New Zealand |
| Norway |
| Portugal |
| Puerto Rico (US) |
| Sint Maarten (Dutch part) |
| Spain |
| Sweden |
| Switzerland |
| Turks and Caicos Islands |
| United Kingdom |
| United States |
| Uruguay |
| Virgin Islands (U.S.) |
Cluster 6 - Growing Small Rural Pre-Industrial Cash Crop
Exporters
| Afghanistan |
| Burkina Faso |
| Burundi |
| Central African Republic |
| Chad |
| Congo, Dem. Rep. |
| Ethiopia |
| Guinea |
| Guinea-Bissau |
| Liberia |
| Madagascar |
| Malawi |
| Mali |
| Mozambique |
| Niger |
| Rwanda |
| Sierra Leone |
| Somalia, Fed. Rep. |
| South Sudan |
| Tanzania |
| Uganda |
Cluster 7 - Stagnating Large Hyper Urban Post-Industrial Oil
Exporters
| Comoros |
| Kiribati |
| Marshall Islands |
| Micronesia, Fed. Sts. |
| Nauru |
| Sao Tome and Principe |
| Tuvalu |
Cluster 8 - Growing Small Rural Pre-Industrial Food Producers
| Algeria |
| Bangladesh |
| Bhutan |
| Cambodia |
| Djibouti |
| Fiji |
| Ghana |
| Guyana |
| India |
| Indonesia |
| Iraq |
| Kyrgyz Republic |
| Lao PDR |
| Mongolia |
| Nepal |
| Nicaragua |
| Senegal |
| South Africa |
| Syrian Arab Republic |
| Tajikistan |
| Timor-Leste |
| Turkmenistan |
| Uzbekistan |
Cluster 9 - Stagnating Large Post-Industrial Balanced
| Hong Kong SAR, China |
| Ireland |
| Liechtenstein |
| Luxembourg |
| Malta |
| Monaco |
| San Marino |
| Singapore |
Discussion:
This section will explore my cluster selection in depth and discuss
curiosities within the clusters.
To evaluate cluster selection, as stated in the Methods section, a
measure of Within-Group Sum of Squares (WGSS) is often used. WGSS is the
quantity that K-means clustering minimizes. As noted before, the WGSS of
the 9 clusters is fairly low, at least compared to where the WGSS began.
Figure 4.1 shows the WGSS for each value of K. Additionally, as can be
seen in Figure 4.2, at the selected K of 9, the Between-Group Sum of
Squares (BGSS) is very high. Where WGSS measures the “closeness” of
observations within their own cluster, BGSS measures how far apart the
cluster centers are. A high BGSS means the clusters are well separated,
which supports the separation between groups as appropriate and
informative. BGSS and WGSS convey different information, but together
they support the cluster assignments. Since both measures indicate solid
cluster splits, I feel confident that my groups are well defined.
As for labeling the clusters, the process proved very difficult with 9
of them. Giving each group a distinct, characteristic name while
adhering to the criteria I set was challenging, but I believe it went
well. Every pair of names differs, if only in one variable. Not every
name features every variable of interest (birth rate, GDP, urban
population, CO2 emissions, agriculture, food production), because in
some clusters another characteristic took precedence. For example, I
noticed that the countries in Cluster 5 (apart from Ireland) were all
micro-states (Singapore, Liechtenstein, Monaco, etc.), so I opted to
just use that characteristic to name the cluster. Similarly, with
Cluster 7, I chose to denote the countries as “Stagnating Large Hyper
Urban Post-Industrial Oil Exporters”. According to my criteria,
“Stagnating Large Hyper Urban Post-Industrial Non-Food Exporters” would
be a sufficient description, but the majority of the countries are large
oil exporters, like Saudi Arabia, the United Arab Emirates, and Kuwait.
Although these countries are among the world leaders in oil exports, so
are the United States and Russia. Why are those two states not included?
I think it points to the balance of the economy. Cluster 7 states focus
primarily on oil, whereas larger countries like the US and Russia are
more balanced or are focused elsewhere.
I found it interesting that these groupings largely transcended
regional boundaries. Apart from the few distinctly regional African and
Middle Eastern clusters, a few of the groups have countries from all
over. Specifically, clusters 3 and 6 both feature countries from Africa,
South America, the Pacific, and Asia. This shows that economic progress,
at least in 2020 and according to my clustering algorithm, can and does
transcend regional boundaries.
There were also a few anomalies in the clusters. Two very apparent ones
were Ireland (Cluster 5) and New Caledonia (Cluster 7). I have already
mentioned that Cluster 5 holds the rich microstates of the world, so the
Republic of Ireland being there was certainly unexpected. What is more,
another distinct small group, Cluster 7, contained something of an
outsider. New Caledonia, the small French territory in the South
Pacific, being placed with the many Middle Eastern oil giants seemed
curious.
Figure 4.3
| Country | Variable | Value |
| New Caledonia | Agriculture | -0.9631722 |
| New Caledonia | CO2 | 2.5021291 |
| New Caledonia | Exports | 1.2719960 |
However, some digging revealed that New Caledonia exports much more
than the mean, has very high CO2 emissions compared to the mean, and has
relatively low agricultural output. This places New Caledonia among
these perceived “oil giants” despite its economy not being as geared
towards oil. While this may not be the exact reason New Caledonia is in
this group, I still find it interesting that a South Pacific territory
can bear many similarities to small Middle Eastern states.
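Values such as those in Figure 4.3 read as distances from the mean, so they can be interpreted as standardized scores. That reading is an assumption, since the text only says the values are relative to the mean. Under that assumption, a minimal sketch of the calculation on made-up numbers (the names and CO2 values below are hypothetical, not World Bank data):

```python
import numpy as np

# Hypothetical CO2 values for five unnamed territories (not World Bank data).
names = ["A", "B", "C", "D", "E"]
co2 = np.array([1.0, 2.0, 3.0, 4.0, 15.0])

# Standardize: subtract the mean and divide by the sample standard deviation.
z = (co2 - co2.mean()) / co2.std(ddof=1)

for name, value in zip(names, z):
    print(f"{name}: {value:+.2f}")
```

A positive score marks a value above the mean and a negative score a value below it, measured in standard deviations.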
Overall, the 9 groups are extremely informative, indicating distinct
shifts in what determines a country’s progress in the modern age.
Whereas “developed” and “developing” are very broad labels, a more
in-depth analysis suggests that birth rate, urbanization, the
relationship between industry and agriculture, and economy size are the
most telling indicators of a country’s progress. In this analysis, more
urbanized, balanced, large economies with lower birth rates are the
clearest signs of development. Finally, Figure 4.4 lists my 9 groups
from least to most developed (least developed at the top, most at the
bottom).
Figure 4.4
| Cluster name | Cluster number |
| Growing Small Rural Pre-Industrial Raw Material Exporters | 4 |
| Growing Small Rural Pre-Industrial Food Producers | 8 |
| Growing Small Rural Pre-Industrial Cash Crop Exporters | 6 |
| Stagnating Medium Rural Pre-Industrial Non-Food Exporters | 2 |
| Stagnating Medium Urbanizing Industrializing Non-Food Exporters | 3 |
| Stagnating Large Urbanized Post-Industrial Non-Agricultural Importers | 1 |
| Stagnating Large Hyper Urban Post-Industrial Oil Exporters | 7 |
| Stagnating Large Post-Industrial Balanced | 9 |
| Rich Micro and City-States | 5 |