No single metric for evaluating development encapsulates both human and industrial development. This paper proposes a more comprehensive approach to classifying development, using per capita GDP, life expectancy, urban population percentage, and carbon dioxide emissions as equally weighted metrics. Countries were classified into levels of development through a k-means clustering algorithm, which formed four distinct groups; in general, the higher the multivariate average of a country’s attributes, the more developed the country was considered. Using the same cluster centers, we then matched U.S. states to their corresponding development clusters, and all but four states were classified in the highest development group. Our country classification produced development groups similar to those formed using other commonly used development indexes, but it elevated countries with high emissions to higher development groups than those indexes assign them. We argue that environmental impact is an important component that should be considered in the creation of any future comprehensive development metric.
Twenty percent of the world’s population lives on less than one dollar a day, while in highly developed countries such as the United States, the average household spends 140 times as much (“Economics Online”). A country’s level of development has a profound impact on its resources, capital, and ultimately the lives of its citizens. Although no universal definition of development exists, typical methods for evaluating development rely on a variety of economic, political, and social factors. The most commonly cited metric is the Human Development Index (HDI), which combines a country’s average life expectancy, educational attainment, and per capita income into a single score. The HDI, however, provides a very limited scope of development, as it does not consider factors related to industrialization, such as infrastructure and trade. An alternative metric is the Competitive Industrial Performance (CIP) Index, which assigns a score based on factors related to manufacturing but does not consider variables related to human capital. Currently, no well-defined index measures a country’s development while incorporating both human development and industrialization factors.
The goal of this paper is to propose a method of classifying countries’ development that is more comprehensive than the commonly used development metrics. For both the HDI and the CIP, ranges of scores are used to form groups of countries based on their level of development. We form groups of countries by combining factors from the HDI and CIP so that our groups can be compared with the groups formed from these indexes. Additionally, we include a component found in neither index, environmental impact, which we believe is an important yet often overlooked aspect of a country’s development. The four variables we used to group countries are per capita carbon dioxide emissions, per capita GDP, life expectancy, and urban population percentage.
We performed a k-means clustering analysis with these variables to identify four distinct groups of countries by their level of development. We analyzed these groups by comparing the average values of the four variables and by considering other trends and similarities within groups, to gain a better understanding of the relationships among the variables. To evaluate our cluster results, we then matched U.S. states to the country cluster groups through a matching algorithm. Lastly, we compared the groups generated through clustering to the groups formed by the HDI and CIP metrics. Through our findings, we hope a new metric can be built that provides a more accurate representation of a country’s development in a global context.
Our data was formed by compiling four separate datasets obtained from the World Bank’s open data portal (“World Bank Open Data”). Each dataset contained one of our variables of interest (emissions, per capita GDP, life expectancy, and urban population) over time for each country. We excluded figures for groups of countries, such as “Heavily Indebted Poor Countries (HIPC)” and “European Union”, as figures for each member country of these groups were already accounted for individually. After cleaning our data, we were left with 186 countries. We then merged the data into one dataset and selected data from 2010-2013. Most of the data was only available through 2013, but rather than selecting a single year, we felt it would provide a more stable and accurate analysis to summarize each metric over a range of years. This also helped deal with missing values, as a country only needed one value over the four-year span to be included in the final dataset. We chose the median of the four years rather than the mean, as this reduced the potential for a large outlier to distort the summary. Alternatively, we could have averaged over a larger time frame (ten or fifteen years); however, time itself has an impact on some of these variables and could introduce a latent variable into our analysis.
rm(list=ls()) # Clear objects from Memory
cat("\014") # Clear Console:
setwd("~/Documents/Loyola Coursework/Spring 2017/Nonparametric/Nonparametric Project")
library(dplyr)
##### Emission Per Capita #####
# Remove countries that are missing emissions data for all of 2010-2013 (columns 52:55),
# then record the median of the 2010-2013 values for each country as emi$ave
emi <- read.csv("Emi_percapita.csv")
head(emi)
emi <- emi[,-c(2:4)]
head(emi)
ncol(emi)
emi$good <- rep(NA, length.out = nrow(emi))
for(i in 1:nrow(emi)){
  emi[i, "good"] <- sum(is.na(emi[i, 52:55])) < 4
}
emi <- emi[emi$good, ]
emi$ave <- rep(NA, length.out = nrow(emi))
for(i in 1:nrow(emi)){
  x <- as.numeric(emi[i, 52:55])
  emi[i, "ave"] <- median(x, na.rm = TRUE)
}
#hist(emi$ave, breaks = 15)
head(emi)
##### GDP Per Capita #####
# Remove countries that are missing GDP data for all of 2010-2013 (columns 51:54),
# then record the median of the 2010-2013 values for each country as ave
gdp <- read.csv("Gdp.csv")
gdp <- gdp[,-c(2:5, 59:61)]
names(gdp)
ncol(gdp)
gdp$good <- rep(NA, length.out = nrow(gdp))
for(i in 1:nrow(gdp)){
  gdp[i, "good"] <- sum(is.na(gdp[i, 51:54])) < 4
}
gdp <- gdp[gdp$good, ]
gdp$ave <- rep(NA, length.out = nrow(gdp))
for(i in 1:nrow(gdp)){
  x <- as.numeric(gdp[i, 51:54])
  gdp[i, "ave"] <- median(x, na.rm = TRUE)
}
head(gdp)
#hist(gdp$ave, breaks = 100)
##### Life Expectancy #####
# Remove countries that are missing life expectancy data for all of 2010-2013 (columns 51:54),
# then record the median of the 2010-2013 values for each country as ave
life <- read.csv("life.csv")
ncol(life)
names(life)
life <- life[, -c(2:5,59:61)]
names(life)
life$good <- rep(NA, length.out = nrow(life))
for(i in 1:nrow(life)){
  life[i, "good"] <- sum(is.na(life[i, 51:54])) < 4
}
life <- life[life$good, ]
life$ave <- rep(NA, length.out = nrow(life))
for(i in 1:nrow(life)){
  x <- as.numeric(life[i, 51:54])
  life[i, "ave"] <- median(x, na.rm = TRUE)
}
head(life)
#hist(life$ave, breaks = 100)
##### Urban Population #####
# Remove countries that are missing urban population data for all of 2010-2013 (columns 51:54),
# then record the median of the 2010-2013 values for each country as ave
urban <- read.csv("urban.csv")
ncol(urban)
names(urban)
urban <- urban[, -c(2:5, 59:61)]
names(urban)
urban$good <- rep(NA, length.out = nrow(urban))
for(i in 1:nrow(urban)){
  urban[i, "good"] <- sum(is.na(urban[i, 51:54])) < 4
}
urban <- urban[urban$good, ]
urban$ave <- rep(NA, length.out = nrow(urban))
for(i in 1:nrow(urban)){
  x <- as.numeric(urban[i, 51:54])
  urban[i, "ave"] <- median(x, na.rm = TRUE)
}
head(urban)
#hist(urban$ave, breaks = 100)
#### Merging Data ######
# Pulling the average for each of the above attributes and combining them into a single dataset by country name. Any countries that do not contain entries for all four datasets will be excluded when the merge occurs.
names(emi)[names(emi)=="ave"] <- "EmiAve"
names(gdp)[names(gdp)=="ave"] <- "GdpAve"
names(life)[names(life) == "ave"] <- "LifeAve"
names(urban)[names(urban) == "ave"] <- "UrbanAve"
names(gdp)
emimerge <- emi[, c(1,ncol(emi))]
gdpmerge <- gdp[, c(1, ncol(gdp))]
lifemerge <- life[, c(1,ncol(life))]
urbanmerge <- urban[, c(1, ncol(urban))]
head(urbanmerge)
total <- merge(emimerge,gdpmerge,by="Country.Name")
head(total)
total1 <- merge(total, lifemerge, by = "Country.Name")
head(total1)
kdat <- merge(total1, urbanmerge, by = "Country.Name")
head(kdat)
# Removing entries that are regional or income-group aggregates rather than individual countries, then writing the dataset to be used in k-means clustering
kdat<-kdat[!(kdat$Country.Name %in% c("Arab World","Caribbean small states", "Central Europe and the Baltics", "Early-demographic dividend","East Asia & Pacific", "East Asia & Pacific (excluding high income)", "East Asia & Pacific (IDA & IBRD countries)", "Euro area", "Europe & Central Asia","Europe & Central Asia (excluding high income)", "Europe & Central Asia (IDA & IBRD countries)","European Union", "Fragile and conflict affected situations", "Heavily indebted poor countries (HIPC)", "IDA & IBRD total", "IDA blend", "IDA only", "IDA total",
"Late-demographic dividend", "Latin America & Caribbean",
"Latin America & Caribbean (excluding high income)", "Latin America & the Caribbean (IDA & IBRD countries)", "Least developed countries: UN classification", "Low & middle income", "Low income", "Lower middle income", "Middle East & North Africa", "Middle East & North Africa (excluding high income)", "Middle East & North Africa (IDA & IBRD countries)", "Middle income", "North America", "OECD members", "Other small states", "Pacific island small states",
"Post-demographic dividend", "Pre-demographic dividend",
"Small states", "South Asia", "South Asia (IDA & IBRD)", "Sub-Saharan Africa",
"Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)",
"Upper middle income", "World")),]
write.csv(kdat, file = "kdat.csv", row.names = FALSE)
####### State Data ######
# Reading in the state data for each of the attributes examined. Each dataset
# contains one value per state: the attribute's value in 2010.
# The attributes are then merged into a single dataset and written to a csv to be used later in matching
stateemi <- read.csv("State_EmiperCapita.csv")
statelife <- read.csv("State_life.csv")
stateurban <- read.csv("States_Urban.csv")
stateGDP <- read.csv("States_GDP.csv")
emimerge <- stateemi[, c(1, ncol(stateemi))]
gdpmerge <- stateGDP[, c(1, ncol(stateGDP))]
lifemerge <- statelife[, c(1,ncol(statelife))]
urbanmerge <- stateurban[, c(1, ncol(stateurban))]
head(emimerge)
total <- merge(emimerge,gdpmerge,by="State")
head(total)
total1 <- merge(total,lifemerge,by="State")
head(total1)
pdat <- merge(total1, urbanmerge, by = "State")
head(pdat)
write.csv(pdat, file = "pdat.csv", row.names = FALSE)
To assign the countries to groups based on the variables stated previously, we used the nonparametric method known as k-means clustering. K-means clustering is an iterative procedure used to classify data points into a specified number of groups, k (Jeevan, 2015). It is considered an unsupervised machine learning technique, and the objective of unsupervised machine learning is to gain information about the underlying distribution of the data. Through k-means clustering, we gain further insight into the data by obtaining clusters, which can then be analyzed and interpreted within the context of the data. To form the groups, k centroids, the centers of the groups, are first placed randomly. Each data point is then assigned to the group whose centroid is closest to it, with closeness measured by Euclidean distance. The centroids are then relocated to the average of the points within each cluster, and the assignment and update steps are repeated until convergence (Jeevan, 2015). A minimal sketch of this assign-and-update cycle is shown below.
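The following sketch is illustrative only and is not the code used for our analysis, which relies on R’s built-in kmeans() function later in this paper; the helper name simple_kmeans and its arguments are assumptions introduced here for exposition.
# Illustrative sketch of the k-means assign/update cycle (the analysis itself uses kmeans())
simple_kmeans <- function(X, k, iterations = 25){
  X <- as.matrix(X)
  # Start from k randomly chosen observations as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for(iter in 1:iterations){
    # Assignment step: label each point with its nearest centroid (squared Euclidean distance)
    dists <- sapply(1:k, function(j)
      rowSums((X - matrix(centroids[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    labels <- apply(dists, 1, which.min)
    # Update step: move each centroid to the mean of the points assigned to it
    for(j in 1:k){
      if(any(labels == j)) centroids[j, ] <- colMeans(X[labels == j, , drop = FALSE])
    }
  }
  list(cluster = labels, centers = centroids)
}
# e.g. simple_kmeans(kdatNorm[, -1], k = 4), once the normalized data below has been constructed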
In order to perform k-means clustering, it must be assumed that equal weighting applies to each variable, which usually requires the data to be normalized. If the data is not normalized, the resulting clusters will be influenced by the magnitude of the values within a variable rather than by the variable’s relative importance. For example, GDP contains very large numbers compared to the rest of the variables, so the resulting clusters would have been driven largely by a country’s GDP had the data not been normalized. We normalized our data nonparametrically rather than through a typical z-score normalization: for life expectancy and urban population, we subtracted the median value of the variable from each country’s value and then divided by the variable’s median absolute deviation. This normalization was not sufficient for GDP and emissions, as these variables contained large outliers that overpowered the clustering process, so for GDP and emissions we instead applied a log transformation. In the side-by-side scatter plots provided below, one can observe the relative linearity achieved from the transformation of the four variables. With equal weighting now a reasonable assumption, we could begin forming and analyzing our clusters.
kdat <- read.csv("kdat.csv")
head(kdat)
kdatNorm <- kdat
# Median/MAD normalization for life expectancy and urban population:
# x' = (x - median(x)) / MAD(x)
scalemadpar <- function(x){
  c(median = median(x, na.rm = TRUE), mad = mad(x, na.rm = TRUE))
}
madpars <- sapply(kdat[, c("LifeAve", "UrbanAve")], scalemadpar)
for(colname in colnames(madpars)){
  kdatNorm[, colname] <- kdatNorm[, colname] - madpars["median", colname]
  kdatNorm[, colname] <- kdatNorm[, colname] / madpars["mad", colname]
}
# Log transformation for the heavily skewed GDP and emissions variables
kdatNorm["GdpAve"] <- log10(kdatNorm["GdpAve"])
kdatNorm["EmiAve"] <- log10(kdatNorm["EmiAve"])
head(kdatNorm)
# Write Normalized Data
write.csv(kdatNorm, file = "kdatNORM.csv", row.names = FALSE)
To choose the appropriate number of clusters for our data, we used the “elbow method”. The goal of the elbow method is to find the smallest value of k (the number of clusters) that sufficiently minimizes the sum of squared errors (SSE). In general, as the number of clusters increases, the SSE decreases toward zero. For the first few clusters the SSE decreases dramatically, and then at a distinct point or range of points, “the elbow”, each additional cluster reduces the SSE only slightly. The elbow therefore provides a suggestion for the number of clusters that should be used for a given dataset.
Using the elbow method, we determined that the optimal number of clusters for our data was between two and four. The elbow occurs around k = 4: the within-groups sum of squares drops sharply up to four clusters and improves only incrementally as k increases beyond four, as can be seen in the plot produced by the code below. The HDI and CIP typically use four groups in their analyses as well, so choosing four groups allowed for a consistent comparison of our results with those metrics.
kdatNorm <- read.csv("kdatNORM.csv")
mydata <- kdatNorm[,-1]
# Choosing Number of Clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Determining the Optimal Value of k")
##### Kmeans Groups and Plot
The following figure plots our country data separated into four clusters after the k-means algorithm was run in R. The plot is a matrix of pairwise scatter plots of the four variables used in our clustering analysis, giving a two-dimensional representation of the four-dimensional clustering results. Each point represents the normalized values for a given country, and the colors and symbols represent the four groups generated by our clustering algorithm. There is a strong positive linear correlation among the normalized values of emissions, GDP, and life expectancy, and a weaker positive linear correlation between urban population and each of the other three variables. This plot tells a story about the relationships among these variables and their effect on the clustering process. Though our understanding is speculative, we interpret the graph in the following way. Wealthier countries often achieve their prosperity through industrial development, and the effects of this are evident through higher emissions. Wealthier countries also have more capital to invest in healthcare, contributing to higher life expectancies. The percentage of the population living in urban centers, while not as strongly correlated with the other three variables, is an important variable nonetheless. Flourishing urban centers are a sign that jobs exist, and educated people move to these areas to fill those jobs. Urban population therefore serves as a proxy for infrastructure and human capital.
# Table
set.seed(4)
kmeansmod <- kmeans(x = kdatNorm[,-1], centers = 4, nstart = 10)
kgroups <- data.frame(kdatNorm[1], kmeansmod$cluster)
kdatNormClusters <- data.frame(kdatNorm[order(kmeansmod$cluster), 1], kmeansmod$cluster[order(kmeansmod$cluster)])
names(kdatNormClusters) <- c("Country.Name", "Cluster")
# Graph
pairs(kdatNorm[,-1], col=c(1:5)[kmeansmod$cluster],
pch=c(0:(5 - 1))[kmeansmod$cluster], main="K-means Clustering for Country Data")
kdatNormClusters1 <- merge(kdatNormClusters, kdat, by = "Country.Name")
kdatNormClusters1[order(kdatNormClusters1$Cluster),]
write.csv(kdatNormClusters1, file = "kdatClusters.csv", row.names = FALSE)
The sizes of the clusters are similar. The largest is cluster two, which contains approximately 32% of the countries, and the smallest is cluster four, which contains approximately 21% of the countries. A table displaying the countries in each cluster is provided in the appendix, and the cluster proportions can be tabulated directly from the fitted k-means object, as shown below.
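This quick check is not part of the original script; note that the raw labels returned by kmeans() are arbitrary and do not necessarily match the ordering (most industrialized = cluster one) used in the discussion that follows.
# Count and proportion of countries under each raw k-means cluster label
table(kmeansmod$cluster)
round(prop.table(table(kmeansmod$cluster)), 2)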
Next, we identified and analyzed the defining characteristics of each cluster. The countries in cluster one have the highest average values for emissions, GDP, life expectancy, and urban population. Comprised mostly of wealthy European nations, the Middle East “oil giants”, and the wealthier countries of North and South America, this cluster in some ways represents the world’s economic elite. Based on the attributes of this group, we have labeled cluster one the “most industrialized countries”.
Cluster two contains most of the countries located in Central and South America. The European countries with smaller urban populations and smaller GDPs are also in this cluster, along with a portion of the Middle Eastern countries that do not export large amounts of oil. This is the most geographically diverse cluster, and we have named it the “mostly developed countries”. The factor that most differentiates cluster one from cluster two is GDP: there is an 80% drop (from $33,833 to $6,830) in average GDP, which is the largest change in GDP seen when comparing the clusters in a hierarchical manner.
Members of cluster three are mainly island nations and Asian Pacific countries; it is interesting to note that many of the countries in this cluster are located near the equator. We have labeled cluster three the “developing countries”. When comparing the developing countries to the mostly developed countries (cluster two), we see a significant shift in the average urban population percentage: a 42% decrease, which is the greatest change in urban population percentage seen when comparing the four development groups.
With the exception of Afghanistan and Haiti, cluster four is comprised entirely of African countries. All of the countries in this cluster have very low emissions, GDP, life expectancies, and percentages of urban population. We will refer to this group as the “third world countries”. Interestingly, the average urban population percentage differs by less than one percentage point between the developing and the third world countries. There is, however, an 83% decrease in emissions, which is the largest decrease observed for any factor in any comparison, as well as a 16% decrease in life expectancy, the largest decrease observed for that variable. People residing in the developing countries live, on average, ten years longer than those in the third world countries.
##### Can unnormalize centroids for analysis
centerOG <- as.data.frame(kmeansmod$centers)
for(colname in colnames(madpars)){
  centerOG[, colname] <- centerOG[, colname] * madpars["mad", colname]
  centerOG[, colname] <- centerOG[, colname] + madpars["median", colname]
}
# Back-transform the logged GDP and emissions variables
centerOG["GdpAve"] <- 10^(centerOG["GdpAve"])
centerOG["EmiAve"] <- 10^(centerOG["EmiAve"])
centerOG
# Order the centroids from highest to lowest emissions and label them clusters 1-4
CenterTable <- centerOG[order(-centerOG$EmiAve), ]
Cluster <- c(1, 2, 3, 4)
CenterTable <- cbind(Cluster, CenterTable[, 1:4])
CenterTable
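As a supplementary check, not part of the original script, the percentage changes quoted in the cluster descriptions above can be computed directly from this un-normalized centroid table:
# Relative change (%) in each un-normalized attribute from one cluster to the next
attrs <- CenterTable[, c("EmiAve", "GdpAve", "LifeAve", "UrbanAve")]
round(100 * (attrs[-1, ] - attrs[-nrow(attrs), ]) / attrs[-nrow(attrs), ], 1)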
As a test of our resulting clusters, we compiled the same data for the U.S. states and used principles of matching to assign each state to a corresponding development cluster. Matching is a technique used to compare similar observations in which one observation receives a treatment and the other does not. The technique is commonly used in medicine to evaluate the effectiveness of medication: patients with similar characteristics are identified (“matched”) for the purpose of conducting a controlled study. For our dataset, we match individual states to countries with similar characteristics in order to obtain a cluster assignment for each state.
More specifically, matching was achieved by comparing the attributes of the state data with the cluster centroids created from the country data. The state data was first normalized using the same median/MAD parameters and log transformations applied to the country data, which ensured the state data would be considered on the same scale. We then wrote a function that iteratively compares the dimensions of each centroid to the dimensions of an observed state. Finally, we assigned each state to the cluster whose centroid minimizes the Euclidean distance to the state’s normalized attributes.
pdat <- read.csv("pdat.csv")
colnames(pdat) <- c("State", "EmiAve","GdpAve", "LifeAve", "UrbanAve")
head(pdat)
pdatNorm <- pdat
# Normalize the state data using the parameters estimated from the country data
for(colname in colnames(madpars)){
  pdatNorm[, colname] <- pdatNorm[, colname] - madpars["median", colname]
  pdatNorm[, colname] <- pdatNorm[, colname] / madpars["mad", colname]
}
pdatNorm["GdpAve"] <- log10(pdatNorm["GdpAve"])
pdatNorm["EmiAve"] <- log10(pdatNorm["EmiAve"])
head(pdatNorm)
# Return the index of the centroid with the smallest squared Euclidean distance to the observation
MatchKMeans <- function(observation, centroids)
{
  SumOfSquareDist <- rep(0, nrow(centroids))
  for (centroidNumber in 1:nrow(centroids))
  {
    for(dim in 1:ncol(centroids))
    {
      Dist <- centroids[centroidNumber, dim] - observation[dim]
      SumOfSquareDist[centroidNumber] <- SumOfSquareDist[centroidNumber] + Dist^2
    } # for dim
  } # for centroidNumber
  which.min(SumOfSquareDist)
} # MatchKMeans
# Assign each state to the nearest country cluster centroid
clusternum <- numeric(nrow(pdatNorm))
for(i in 1:nrow(pdatNorm)){
  clusternum[i] <- MatchKMeans(observation = as.numeric(pdatNorm[i, c("EmiAve", "GdpAve", "LifeAve", "UrbanAve")]),
                               centroids = kmeansmod$centers)
}
pdatNorm<-cbind(pdatNorm,clusternum)
The goal was to use the state data to determine which cluster each state would belong to if treated as an individual country. States were matched to whichever cluster minimized the distance between the normalized state averages and the country cluster center. Upon completing the matching procedure, we found that 46 of the states belong to the most industrialized cluster. Based on our data, the other four states, namely Maine, Mississippi, Vermont, and West Virginia, are most similar to the mostly developed countries. All of these states have a very low urban population percentage (around 44%) and a relatively low GDP compared to the U.S. average. The low urban population is likely the main determinant of their placement into cluster two rather than cluster one.
These results were in line with our expectations for the matching procedure. Although in the United States we often focus on the discrepancies between states, the truth is that, compared with other countries in the world, every state is rich, life expectancies are high, and emissions are large. As a whole, the United States has a lower urban population percentage than the average for the most industrialized countries (74% compared with 83% for cluster one).
pdat <- cbind(pdat, clusternum)
colMeans(pdat[,-1])
cluster1 <- group_by(pdat, clusternum)
summarise(cluster1, Emissions = mean(EmiAve), Gdp = mean(GdpAve),
LifeExpectancy = mean(LifeAve), UrbanPopulation = mean(UrbanAve))
By using emissions, GDP, life expectancy, and urban population percentage, we were able to form meaningful clusters that serve as an indicator of a country’s level of development. There are clear distinctions between groups, and a coherent trend exists among the variables: in general, the higher the multivariate average of a country’s attributes, the lower its cluster number assignment.
Our method of classifying countries provides an alternative perspective on a country’s development compared with the traditionally used HDI and CIP metrics. Whereas the HDI focuses mostly on human capital and living conditions, and the CIP focuses on capabilities related to production, our clustering analysis provides a blend of human and industrial development factors. Additionally, our grouping method incorporates an environmental factor found in neither index, measured through carbon dioxide emissions. We believe that environmental impact is an important, yet often overlooked, component of a country’s development. As we compare our method of classification to the HDI and CIP, this environmental factor takes a central role in understanding the differences between the groupings.
There are widespread similarities between our clusters and the groups created from the HDI and CIP metrics. Our group labeled “most industrialized countries” is almost identical to the HDI’s “very high human development” group. The main differences are likely due to the impact that emission rates have on our clusters: because high emission rates are a defining characteristic of this cluster, some countries with very high development according to the HDI, such as Liechtenstein and Hungary, are excluded, while the Middle East oil giants, which the HDI’s highest development group does not contain, are included. For the other three groups, our clusters were also similar to the HDI groups. Our clustering algorithm elevated countries such as Thailand, China, and Honduras, likely because of their industrial capabilities, which are not captured by any of the HDI variables.
The countries with the largest CIP values also closely resemble our “most industrialized countries” and the HDI’s countries with “very high human development.” As with the HDI, the CIP does not include the Middle East oil giants in this elite group, but unlike the HDI or our clusters, the CIP includes countries with major manufacturing industries: Mexico, China, Thailand, and Malaysia. In general, the CIP rewards countries that generate output through manufacturing over countries that earn revenue through tourism, service jobs, or the sale of raw materials. However, as was the case for the HDI and our clustering mechanism, the poorest countries reside in the categories associated with low development.
It is evident that our method of classification serves as a hybrid of the HDI and CIP, balancing social and economic development. It also provides a unique perspective on the importance of considering environmental impact as a component of industrialization. As the world recognizes the pressing need to combat climate change and commits to reducing its carbon footprint, we may expect to see substantial changes in the economic makeup of the world’s great polluters. As the countries belonging to our “most industrialized countries” group are forced to reduce emissions, will they retain their high GDP, life expectancy, and urban population? Only time will tell.
In further research, we would like to examine alternative proxies for environmental and human development. Specifically, we are interested in how literacy rate or years of education would affect the clusters if used in place of urban population percentage or life expectancy. Likewise, we are interested in replacing emissions with a similar environmental factor and comparing the results to our previous clusters. Incorporating an economic indicator that accounts for income disparities could also enhance our classification, and thus improve on the economic metrics currently in use. In the future, it may also be valuable to fit a regression model on our data and compare the results to our findings from the clustering analysis. Through our work and continued research, we believe a better metric for development can be achieved. Specifically, we see the need for a more comprehensive development index that considers some of the negative impacts of development, such as environmental pollution.