World Bank Country Development Project

Executive Summary:

The labels ‘developing’ and ‘developed’, used to describe the progress of countries around the world, are in need of a revamp. Using economic data collected from the World Bank, this project moves beyond these traditional groupings and constructs new ones. The following sections cover the initial data analysis and, in the Methods section, the steps for cleaning the data, reducing the number of predictors, and clustering the countries to obtain these new groupings. Each plot included is fully interactive and able to be downloaded. Finally, the Discussion section examines the groups themselves in detail.

Problem Description:

Using a dataset derived from the World Bank Development Indicators, containing a variety of social, economic, and environmental metrics, this project seeks to develop a small number of new “development archetypes” that summarize broad global patterns in national progress. The traditional groupings (“developed” and “developing”) are simplistic and may fail to capture new trends.

Exploratory Data Analysis:

First and foremost, the data must be made ready for analysis. The given data set contained a fair amount of missing data. The figure below gives the proportion of missing values for each variable.

Figure 1.1
Variable Proportion_missing
Country 0.0000000
Code 0.0000000
Access_electricity 0.0092166
Agriculture 0.0921659
Birth_rate 0.0000000
CO2 0.0645161
Health_expenditure 0.1152074
GDP 0.0368664
Education_expenditure 0.2165899
Hospital_beds 0.4423963
Internet_use 0.1428571
Life_expectancy 0.0000000
Literacy_rate 0.8064516
Infant_mortality 0.0967742
Access_water 0.3686636
Cropland 0.0737327
Physicians 0.4331797
HIV 0.2995392
Unemployment 0.4516129
Urban 0.0092166
Renewable 0.0230415
Education_years 0.0967742
Imports 0.1612903
Exports 0.1612903
Food_production 0.1013825
Women_parliament 0.1244240
Poverty 0.6820276

As can be seen in Figure 1.1, some variables are missing as much as 80% of their values (Literacy_rate is missing about 81%). This not only weakens them as predictors but also calls their inclusion in the data set into question. For this project, I dropped every variable with more than 20% of its values missing.
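The 20% rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the project: the proportions are copied from Figure 1.1, while the names `missing`, `THRESHOLD`, and `keep_columns` are hypothetical.

```python
# Proportion of missing values per variable, copied from Figure 1.1.
missing = {
    "Country": 0.0000000, "Code": 0.0000000,
    "Access_electricity": 0.0092166, "Agriculture": 0.0921659,
    "Birth_rate": 0.0000000, "CO2": 0.0645161,
    "Health_expenditure": 0.1152074, "GDP": 0.0368664,
    "Education_expenditure": 0.2165899, "Hospital_beds": 0.4423963,
    "Internet_use": 0.1428571, "Life_expectancy": 0.0000000,
    "Literacy_rate": 0.8064516, "Infant_mortality": 0.0967742,
    "Access_water": 0.3686636, "Cropland": 0.0737327,
    "Physicians": 0.4331797, "HIV": 0.2995392,
    "Unemployment": 0.4516129, "Urban": 0.0092166,
    "Renewable": 0.0230415, "Education_years": 0.0967742,
    "Imports": 0.1612903, "Exports": 0.1612903,
    "Food_production": 0.1013825, "Women_parliament": 0.1244240,
    "Poverty": 0.6820276,
}

THRESHOLD = 0.20  # drop any variable with more than 20% of its values missing


def keep_columns(missing_share, threshold=THRESHOLD):
    """Return the variables whose missing share does not exceed the threshold."""
    return [name for name, share in missing_share.items() if share <= threshold]


kept = keep_columns(missing)
dropped = [name for name in missing if name not in kept]
print(len(kept), "kept;", len(dropped), "dropped:", dropped)
```

Applied to the values in Figure 1.1, this keeps the two identifier columns plus 17 predictors and drops the 8 sparsest variables.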

Figure 1.2
Variable Proportion_missing
Country 0.0000000
Code 0.0000000
Access_electricity 0.0092166
Agriculture 0.0921659
Birth_rate 0.0000000
CO2 0.0645161
Health_expenditure 0.1152074
GDP 0.0368664
Internet_use 0.1428571
Life_expectancy 0.0000000
Infant_mortality 0.0967742
Cropland 0.0737327
Urban 0.0092166
Renewable 0.0230415
Education_years 0.0967742
Imports 0.1612903
Exports 0.1612903
Food_production 0.1013825
Women_parliament 0.1244240

Figure 1.2 shows the variables that remain and will be considered in the analysis. These are the variables the clustering will rely on.

The remaining missing values still had to be dealt with. Principal Component Analysis (PCA) requires a complete data set, so missing values are a major obstacle to the analysis. To tackle this problem, I chose to impute, that is, fill in estimated values for, all of the missing data. Two approaches, MICE (Multiple Imputation by Chained Equations) and KNN (K-Nearest Neighbors), would both work for this data, given the wealth of other predictors available whenever a row contains a missing value. The methodology and outcomes of the imputation are discussed in the Methods section. Because of the variables dropped above, every remaining variable is missing 20% or less of its values, which I treated as the upper limit for KNN imputation. Moreover, with at least 80% of each variable observed, the real data still carries a decent amount of information and is not washed out by the imputed values.

Next, I had to choose between performing PCA on the correlation matrix or on the covariance matrix. The core difference is that the correlation matrix puts all variables on a common scale, whereas the covariance matrix gives more weight to variables with larger variances, which are typically those measured in larger units. A correlation matrix therefore keeps the focus on the relationships between variables, which is what this project seeks to explore.
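The difference between the two matrices can be illustrated with a small NumPy sketch. The data here is synthetic (two made-up indicators on very different scales), and the helper `leading_component` is a hypothetical name; none of it comes from the World Bank data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two indicators on very different scales
# (hypothetical data, not the World Bank values).
n = 200
small = rng.normal(0.0, 1.0, n)                            # values of order 1
large = 1000.0 * (0.5 * small + rng.normal(0.0, 1.0, n))   # values of order 1000
X = np.column_stack([small, large])

S = np.cov(X, rowvar=False)        # covariance matrix
R = np.corrcoef(X, rowvar=False)   # correlation matrix

# The correlation matrix is the covariance matrix of the standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def leading_component(matrix):
    """Return the unit eigenvector belonging to the largest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # eigenvalues ascending
    return eigenvectors[:, -1]


pc1_S = np.abs(leading_component(S))
pc1_R = np.abs(leading_component(R))
print("PC1 loadings from S:", pc1_S.round(3))
print("PC1 loadings from R:", pc1_R.round(3))
```

With the covariance matrix, the first component points almost entirely along the large-scale variable; with the correlation matrix, both variables contribute equally.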

Figures 1.3 and 1.4 show the number of principal components to carry into further analysis: 3. The plots are almost identical in terms of variance explained, both in total and by each component. Both plots show an “elbow” at 3 principal components, which together explain about 62% of all variance in the data. Since the variance explained was extremely similar for the covariance matrix (S) and the correlation matrix (R) at the same number of principal components, I opted to use the correlation matrix. Because the values in the original data set vary wildly in magnitude across the predictors, a scaled data set should better capture relationships between variables and allow for more effective clustering. Moving forward, the countries are split into clusters using k-means on their principal component scores. K-means and principal component scores are explained in depth in the Methods section.
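As a rough sketch of how the variance explained and the principal component scores mentioned above can be obtained, the NumPy example below runs PCA on the correlation matrix of a synthetic matrix. The data is randomly generated with the same shape as the cleaned data set (217 countries, 17 predictors), so the printed share of variance will not match the 62% reported here; `pca_correlation` is a hypothetical helper name.

```python
import numpy as np


def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = countries, columns = variables)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column
    R = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)      # eigenvalues ascending
    order = np.argsort(eigenvalues)[::-1]              # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()        # share of variance per component
    scores = Z @ eigenvectors                          # principal component scores
    return explained, scores


rng = np.random.default_rng(1)
# Hypothetical data: 217 rows and 17 columns driven by 3 latent factors plus noise.
latent = rng.normal(size=(217, 3))
X = latent @ rng.normal(size=(3, 17)) + rng.normal(size=(217, 17))

explained, scores = pca_correlation(X)
cumulative = np.cumsum(explained)
print("share of variance explained by the first 3 components:",
      round(float(cumulative[2]), 3))

pc_scores = scores[:, :3]   # the columns that would be passed on to clustering
```

Plotting `explained` against the component number gives the kind of scree plot in which an elbow is looked for.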

Methods:

First, the imputation in more depth. I used K-nearest neighbors (KNN) imputation to account for the missing data. For an observation with a missing value, KNN computes the Euclidean distance between that observation and all other observations, finds the K closest ones (for a chosen K), and averages their values of the missing variable to produce the imputed value. KNN takes advantage of correlations between variables, which matters a great deal when the project is focused on relationships between features. Once imputed, the data set was complete and ready for full analysis.

Next came PCA. In general, PCA is a dimension reduction method that condenses the relationships among many variables down to just a few. The algorithm takes correlated variables and finds new axes, rather than the original variable axes, each chosen to explain as much of the remaining variance as possible. In our case, there are 17 predictors in the data, and we want to condense them into a number of components small enough to be visualized and clustered.

Once PCA was complete, I used K-means clustering to group the data. K-means takes a set number (K) of groups and partitions the data into that many groups. More specifically, each group is given its own center (usually an average), and each point is assigned to the group whose center is closest in Euclidean distance. In this specific case, the center of each cluster is a vector of average principal component scores. The centers are then recalculated from the new groups. The algorithm continues iterating until every point is in a group and the group assignments no longer change between 2 consecutive iterations.

K is chosen using the Within-Group Sum of Squares (WGSS), a metric that evaluates how close the points in each cluster are to each other. A small WGSS points towards a good selection of clusters. Because WGSS generally keeps shrinking as K grows, K is selected by computing the WGSS for each candidate value of K and choosing the value at the “elbow”, the point beyond which adding clusters yields only small further reductions.
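Returning to the imputation step described at the start of this section, the idea behind KNN imputation can be sketched as follows. This is a simplified illustration under stated assumptions, not the routine used for the project: distances are computed only over columns observed in both rows and rescaled by the number of shared columns, the function name `knn_impute` and the toy matrix are hypothetical, and in practice the columns would usually be standardized first.

```python
import numpy as np


def knn_impute(X, k=5):
    """Fill NaNs with the mean of the k nearest rows (Euclidean distance).

    Distances use only the columns observed in both rows, rescaled by the
    number of shared columns so rows with different gaps stay comparable.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    for i in np.where(~observed.all(axis=1))[0]:      # rows with at least one gap
        shared = observed & observed[i]               # columns seen in row i and row j
        n_shared = shared.sum(axis=1)
        diff = np.where(shared, X - X[i], 0.0)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[i] = np.inf                              # a row is not its own neighbour
        dist[n_shared == 0] = np.inf                  # nothing in common to compare
        for j in np.where(~observed[i])[0]:           # each missing column in row i
            candidates = np.where(observed[:, j] & np.isfinite(dist))[0]
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled


# Toy example (made-up numbers): the second row is missing its third value.
X = np.array([
    [1.0, 10.0, 100.0],
    [1.1, 11.0, np.nan],
    [0.9,  9.0,  90.0],
    [5.0, 50.0, 500.0],
    [1.0, 10.0, 110.0],
])
filled = knn_impute(X, k=2)
print(filled)
```

In the toy matrix, the two rows closest to the incomplete row have third-column values of 100 and 110, so the gap is filled with their average.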

As can be seen in Figure 2.1, 9 is the optimal number of clusters. Therefore, we move forward with 9 clusters. The goodness of fit of the clusters is evaluated in the Discussion section. The cluster labels are based on several factors: population growth or stagnation (birth rate), economy size (GDP), urban versus rural population, pre- versus post-industrial status (CO2 emissions), and food versus non-food production.
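The K-means procedure and the WGSS-versus-K comparison described above can be sketched with NumPy as follows. This is a minimal version of the standard Lloyd iteration on synthetic two-dimensional points, not the project's code: the function names, seeds, and blob data are all assumptions made for illustration. It also computes the Between-Group Sum of Squares (BGSS) that the Discussion section uses.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign to the nearest center, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments unchanged: converged
            break
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):              # keep the old center if a cluster empties
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers


def wgss_bgss(points, labels, centers):
    """Within-group and between-group sums of squares for a clustering."""
    wgss = ((points - centers[labels]) ** 2).sum()
    sizes = np.bincount(labels, minlength=len(centers))
    bgss = (sizes[:, None] * (centers - points.mean(axis=0)) ** 2).sum()
    return wgss, bgss


# Synthetic stand-in for principal component scores: three made-up blobs.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centers])
tss = ((points - points.mean(axis=0)) ** 2).sum()   # total sum of squares

results = {}
for k in range(1, 7):
    labels, centers = kmeans(points, k, seed=k)
    results[k] = wgss_bgss(points, labels, centers)
    print(k, round(float(results[k][0]), 1), round(float(results[k][1]), 1))
```

Because each center is the mean of its cluster, WGSS and BGSS always add up to the total sum of squares, so a lower WGSS at a given K goes hand in hand with a higher BGSS. A single random start can settle in a poor local optimum, so several starts per K are commonly used in practice.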

Results:

Cluster 1 - Stagnating Large Urbanized Post-Industrial Non-Agricultural Importers

Country
Bahrain
Brunei Darussalam
Gibraltar
Kuwait
New Caledonia
Oman
Palau
Qatar
Saudi Arabia
United Arab Emirates

Cluster 2 - Stagnating Medium Rural Pre-Industrial Non-Food Exporters

Country
Angola
Benin
Botswana
Cameroon
Congo, Rep.
Cote d’Ivoire
Equatorial Guinea
Eritrea
Eswatini
Gabon
Gambia, The
Haiti
Kenya
Lesotho
Mauritania
Myanmar
Namibia
Nigeria
Pakistan
Papua New Guinea
Solomon Islands
Sudan
Togo
Vanuatu
Yemen, Rep.
Zambia
Zimbabwe

Cluster 3 - Stagnating Medium Urbanizing Industrializing Non-Food Exporters

Country
Albania
American Samoa
Azerbaijan
Bahamas, The
Belarus
Belize
Bosnia and Herzegovina
Bulgaria
Channel Islands
China
Croatia
Cyprus
Czechia
Estonia
Georgia
Greece
Guam
Hungary
Iran, Islamic Rep.
Jamaica
Jordan
Kazakhstan
Korea, Rep.
Latvia
Lebanon
Libya
Lithuania
Malaysia
Maldives
Montenegro
Morocco
Northern Mariana Islands
Panama
Poland
Romania
Russian Federation
Serbia
Seychelles
Slovak Republic
Slovenia
St. Martin (French part)
Suriname
Thailand
Trinidad and Tobago
Tunisia
Turkiye
Ukraine
Viet Nam
West Bank and Gaza

Cluster 4 - Growing Small Rural Pre-Industrial Raw Material Exporters

Country
Antigua and Barbuda
Armenia
Aruba
Barbados
Bolivia
Brazil
British Virgin Islands
Cabo Verde
Colombia
Dominica
Dominican Republic
Ecuador
Egypt, Arab Rep.
El Salvador
Grenada
Guatemala
Honduras
Korea, Dem. People’s Rep.
Kosovo
Mauritius
Mexico
Moldova
North Macedonia
Paraguay
Peru
Philippines
Samoa
Sri Lanka
St. Kitts and Nevis
St. Lucia
St. Vincent and the Grenadines
Tonga
Venezuela, RB

Cluster 5 - Rich Micro and City-States

Country
Andorra
Argentina
Australia
Austria
Belgium
Bermuda
Canada
Cayman Islands
Chile
Costa Rica
Cuba
Curacao
Denmark
Faroe Islands
Finland
France
French Polynesia
Germany
Greenland
Iceland
Isle of Man
Israel
Italy
Japan
Macao SAR, China
Netherlands
New Zealand
Norway
Portugal
Puerto Rico (US)
Sint Maarten (Dutch part)
Spain
Sweden
Switzerland
Turks and Caicos Islands
United Kingdom
United States
Uruguay
Virgin Islands (U.S.)

Cluster 6 - Growing Small Rural Pre-Industrial Cash Crop Exporters

Country
Afghanistan
Burkina Faso
Burundi
Central African Republic
Chad
Congo, Dem. Rep.
Ethiopia
Guinea
Guinea-Bissau
Liberia
Madagascar
Malawi
Mali
Mozambique
Niger
Rwanda
Sierra Leone
Somalia, Fed. Rep.
South Sudan
Tanzania
Uganda

Cluster 7 - Stagnating Large Hyper Urban Post-Industrial Oil Exporters

Country
Comoros
Kiribati
Marshall Islands
Micronesia, Fed. Sts.
Nauru
Sao Tome and Principe
Tuvalu

Cluster 8 - Growing Small Rural Pre-Industrial Food Producers

Country
Algeria
Bangladesh
Bhutan
Cambodia
Djibouti
Fiji
Ghana
Guyana
India
Indonesia
Iraq
Kyrgyz Republic
Lao PDR
Mongolia
Nepal
Nicaragua
Senegal
South Africa
Syrian Arab Republic
Tajikistan
Timor-Leste
Turkmenistan
Uzbekistan

Cluster 9 - Stagnating Large Post-Industrial Balanced

Country
Hong Kong SAR, China
Ireland
Liechtenstein
Luxembourg
Malta
Monaco
San Marino
Singapore

Discussion:

This section will explore my cluster selection in depth and discuss curiosities within the clusters.

To evaluate the cluster selection, as stated in the Methods section, a measure of Within-Group Sum of Squares (WGSS) is often used. WGSS is the quantity that K-means clustering minimizes. As before, the WGSS of the 9 clusters is fairly low, at least compared to where the WGSS began. Figure 4.1 shows the WGSS for each value of K. Additionally, as can be seen in Figure 4.2, at the selected K of 9 the Between-Group Sum of Squares (BGSS) is very high. Where WGSS measures the “closeness” of observations within their own cluster, BGSS measures how far apart the cluster centers are. A high BGSS means the clusters are well separated, which indicates that the separation between groups is appropriate and informative. WGSS and BGSS convey different information, but together they support the cluster assignments. Since both measures indicate solid cluster splits, I feel confident that my groups are well defined.

As for labeling the clusters, the process proved very difficult with 9 of them. Giving each group a distinct, characteristic name while adhering to the criteria I set was challenging, but I believe it went well. Every group name differs from the others, if only in one variable. Not every name features every variable of interest (birth rate, GDP, urban population, CO2 emissions, agriculture, food production), because some characteristics superseded these criteria. For example, I noticed that the countries in Cluster 5 (apart from Ireland) were all micro-states (Singapore, Liechtenstein, Monaco, etc.), so I opted to use that characteristic alone to name the cluster. Similarly, I chose to label Cluster 7 “Stagnating Large Hyper Urban Post-Industrial Oil Exporters”. According to my criteria, “Stagnating Large Hyper Urban Post-Industrial Non-Food Exporters” would be a sufficient description, but the majority of the countries are large oil exporters, such as Saudi Arabia, the United Arab Emirates, and Kuwait. Although these countries are among the world leaders in oil exports, so are the United States and Russia. Why are those two states not included? I think it points to the balance of the economy. Cluster 7 states focus primarily on oil, whereas larger countries like the US and Russia are more balanced or are focused elsewhere.

I found it interesting that these groupings largely transcended regional boundaries. Apart from a few distinctly regional African and Middle Eastern clusters, several of the groups have countries from all over the world. Specifically, clusters 3 and 6 both feature countries from Africa, South America, the Pacific, and Asia. This shows that economic progress, at least in 2020 and according to my clustering algorithm, can and does transcend regional boundaries.

There were also a few anomalies in the clusters. Two very apparent ones were Ireland (Cluster 5) and New Caledonia (Cluster 7). As already mentioned, Cluster 5 holds the rich microstates of the world, so finding the Republic of Ireland there was certainly unexpected. What’s more, another distinct small group, Cluster 7, also contained something of an outsider: New Caledonia, the small French territory in the South Pacific, was placed with the many Middle Eastern oil giants, which seemed curious.

Figure 4.3
Country Variable Value
New Caledonia Agriculture -0.9631722
New Caledonia CO2 2.5021291
New Caledonia Exports 1.2719960

However, some digging revealed that New Caledonia exports much more than the mean, has very high CO2 emissions compared to the mean, and has relatively low agricultural output. This places New Caledonia among the perceived “oil giants” even though its economy is not as geared towards oil. While this may not be the exact reason New Caledonia is in this group, I still find it interesting that a South Pacific territory can bear many similarities to small Middle Eastern states.

Overall, the 9 groups are extremely informative, indicating distinct shifts in what determines a country’s progress in the modern age. Though “developed” and “developing” are very broad, more in-depth analysis suggests that birth rate, urbanization, the relationship between industry and agriculture, and economy size are the most telling indicators of a country’s progress. More urbanized, balanced, large economies with lower birth rates indicate development better than anything else. Finally, Figure 4.4 lists my 9 groups from least to most developed (least developed at the top, most developed at the bottom).

Figure 4.4
Designation Cluster
Growing Small Rural Pre-Industrial Raw Material Exporters 4
Growing Small Rural Pre-Industrial Food Producers 8
Growing Small Rural Pre-Industrial Cash Crop Exporters 6
Stagnating Medium Rural Pre-Industrial Non-Food Exporters 2
Stagnating Medium Urbanizing Industrializing Non-Food Exporters 3
Stagnating Large Urbanized Post-Industrial Non-Agricultural Importers 1
Stagnating Large Hyper Urban Post-Industrial Oil Exporters 7
Stagnating Large Post-Industrial Balanced 9
Rich Micro and City-States 5