Cluster Analysis on Countries

Objectives

  1. To learn and summarize (very briefly) what k-means and hierarchical clustering are.
  2. To explore the countries dataset!
  3. To apply both techniques to the countries dataset, with the goal of finding a useful grouping (unsupervised learning).

Summary Notes

What is cluster analysis?

Cluster analysis is a form of unsupervised learning (where a response variable of interest isn’t measured for the observations in a dataset). The goal of cluster analysis is to group the observations in a way that minimizes the distance between observations within groupings. This distance can be defined and computed under several different criteria and methods.

Clustering takes a chosen value (whether an actual observation or an arbitrary position in the data space) and asks which observation is closest to it in terms of “distance” across all the predictors.

What is hierarchical clustering?

Hierarchical clustering is a more visual method of clustering. It starts at the bottom of a dendrogram (think upside-down tree) and fuses observations together into clusters. If an observation is closest to a group of already-fused observations, it links with that group. Branch by branch, the dendrogram is built.

A choice is then made for the value h, the cutoff height that the distance between groups may not exceed. This helps avoid overfitting the data.
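
As a minimal sketch of this flow in R (using the built-in USArrests data purely for illustration; the cutoff h = 4 is an arbitrary choice):

# hierarchical clustering sketch: distances -> tree -> cut at height h
d  <- dist(scale(USArrests), method = "euclidean")  # pairwise distances
hc <- hclust(d, method = "average")                 # average-linkage tree
plot(hc)                                            # inspect the dendrogram
groups <- cutree(hc, h = 4)  # cut where fusions would exceed height h = 4
table(groups)                # membership of the resulting clusters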

What is k-means clustering?

K-means clustering attempts to group the data into k distinct groups via a simple workflow. First, a value for the number of clusters is chosen (we can rely on methods such as the elbow plot to determine k if we don’t have a predetermined idea). Second, every observation is randomly assigned to one of the k groups.

Third, centroids are placed in the data space, each taking on the mean value across all the predictors of the observations belonging to it. Iteratively, each observation is reassigned to the centroid it is closest to and the centroids are recomputed. This should gradually minimize the distances between the observations within a group.
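
To make those steps concrete, here is a bare-bones sketch of a single assignment-and-update round, written by hand for illustration only (in practice we simply call kmeans(), as I do later):

# one k-means iteration by hand on a scaled numeric matrix
x <- scale(USArrests)
k <- 3
set.seed(1)
centroids <- x[sample(nrow(x), k), ]   # initialize from k random observations

# assignment step: each observation joins its nearest centroid
dists <- sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))
assignment <- max.col(-dists)          # column index of the smallest distance

# update step: each centroid moves to the mean of its members
centroids <- apply(x, 2, function(col) tapply(col, assignment, mean))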

Exploratory Data Analysis

Library and Loading

Loading the relevant data analysis libraries.

load.libraries <- c( 'purrr', 'cluster', 'Amelia', 'dendextend',
                    'corrplot', 'tidyverse', 'GGally')
install.lib <- load.libraries[!load.libraries %in% installed.packages()]
for(libs in install.lib) install.packages(libs, dependencies = TRUE)
sapply(load.libraries, require, character.only = TRUE)
##      purrr    cluster     Amelia dendextend   corrplot  tidyverse 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##     GGally 
##       TRUE

Loading the dataset.

countries <- read_csv("countries of the world.csv")

Data Preparation/Survey

It is time to analyze the structure of the data with quick head() and str() calls.

head(countries)
## # A tibble: 6 x 20
##   Country Region Population `Area (sq. mi.)` `Pop. Density (~
##   <chr>   <chr>       <dbl>            <dbl> <chr>           
## 1 Afghan~ ASIA ~   31056997           647500 48,0            
## 2 Albania EASTE~    3581655            28748 124,6           
## 3 Algeria NORTH~   32930091          2381740 13,8            
## 4 Americ~ OCEAN~      57794              199 290,4           
## 5 Andorra WESTE~      71201              468 152,1           
## 6 Angola  SUB-S~   12127071          1246700 9,7             
## # ... with 15 more variables: `Coastline (coast/area ratio)` <chr>, `Net
## #   migration` <chr>, `Infant mortality (per 1000 births)` <dbl>, `GDP ($
## #   per capita)` <dbl>, `Literacy (%)` <dbl>, `Phones (per 1000)` <chr>,
## #   `Arable (%)` <chr>, `Crops (%)` <chr>, `Other (%)` <dbl>,
## #   Climate <dbl>, Birthrate <dbl>, Deathrate <dbl>, Agriculture <chr>,
## #   Industry <chr>, Service <chr>
str(countries)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 227 obs. of  20 variables:
##  $ Country                           : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Region                            : chr  "ASIA (EX. NEAR EAST)" "EASTERN EUROPE" "NORTHERN AFRICA" "OCEANIA" ...
##  $ Population                        : num  31056997 3581655 32930091 57794 71201 ...
##  $ Area (sq. mi.)                    : num  647500 28748 2381740 199 468 ...
##  $ Pop. Density (per sq. mi.)        : chr  "48,0" "124,6" "13,8" "290,4" ...
##  $ Coastline (coast/area ratio)      : chr  "0,00" "1,26" "0,04" "58,29" ...
##  $ Net migration                     : chr  "23,06" "-4,93" "-0,39" "-20,71" ...
##  $ Infant mortality (per 1000 births): num  16307 2152 31 927 405 ...
##  $ GDP ($ per capita)                : num  700 4500 6000 8000 19000 1900 8600 11000 11200 3500 ...
##  $ Literacy (%)                      : num  360 865 700 970 1000 420 950 890 971 986 ...
##  $ Phones (per 1000)                 : chr  "3,2" "71,2" "78,1" "259,5" ...
##  $ Arable (%)                        : chr  "12,13" "21,09" "3,22" "10" ...
##  $ Crops (%)                         : chr  "0,22" "4,42" "0,25" "15" ...
##  $ Other (%)                         : num  8765 7449 9653 75 9778 ...
##  $ Climate                           : num  1 3 1 2 3 NA 2 2 3 4 ...
##  $ Birthrate                         : num  466 1511 1714 2246 871 ...
##  $ Deathrate                         : num  2034 522 461 327 625 ...
##  $ Agriculture                       : chr  "0,38" "0,232" "0,101" NA ...
##  $ Industry                          : chr  "0,24" "0,188" "0,6" NA ...
##  $ Service                           : chr  "0,38" "0,579" "0,298" NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Region = col_character(),
##   ..   Population = col_double(),
##   ..   `Area (sq. mi.)` = col_double(),
##   ..   `Pop. Density (per sq. mi.)` = col_character(),
##   ..   `Coastline (coast/area ratio)` = col_character(),
##   ..   `Net migration` = col_character(),
##   ..   `Infant mortality (per 1000 births)` = col_number(),
##   ..   `GDP ($ per capita)` = col_double(),
##   ..   `Literacy (%)` = col_number(),
##   ..   `Phones (per 1000)` = col_character(),
##   ..   `Arable (%)` = col_character(),
##   ..   `Crops (%)` = col_character(),
##   ..   `Other (%)` = col_number(),
##   ..   Climate = col_number(),
##   ..   Birthrate = col_number(),
##   ..   Deathrate = col_number(),
##   ..   Agriculture = col_character(),
##   ..   Industry = col_character(),
##   ..   Service = col_character()
##   .. )

There are a number of columns I am not interested in. For clustering with a distance-based linkage criterion, I will want all my predictors of interest to be numeric, so a couple of non-numeric columns must go. One is Region. The other is Climate, which takes on a handful of discrete values and behaves as a factor variable; neither will function well in this analysis.

drop.cols <- c('Region', 'Climate')

countries <- countries %>% 
  select(-one_of(drop.cols))

Next, the dataset stores several index and numeric variables as characters. I would like to change that. The first step is to apply a regular expression that replaces the commas with decimal points. This won’t change their interpretation so long as I standardize the data before conducting the analysis.

# columns 2:18 hold numbers stored as characters with "," for the decimal;
# swap in "." and coerce each to numeric
for(y in colnames(countries[2:18])) {  
  countries[[y]] <- gsub(pattern = ",", replacement = ".", x = countries[[y]])
  countries[[y]] <- as.numeric(countries[[y]])
} 
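
As an aside, readr can handle this at parse time instead. A sketch, assuming every comma in the file is a decimal mark rather than a thousands separator (and noting it re-reads from scratch, so the column drops above would need repeating):

# equivalent fix at read time: treat "," as the decimal mark while parsing
countries <- read_csv("countries of the world.csv",
                      locale = locale(decimal_mark = ","))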

Now things are good! I want to rename a few columns to help with my coding later on.

names(countries)[names(countries) == "Phones (per 1000)"] <- "Phones"
names(countries)[names(countries) == "GDP ($ per capita)"] <- "GDP"
names(countries)[names(countries) == "Pop. Density (per sq. mi.)"] <- "Pop_Dens"
names(countries)[names(countries) == "Infant mortality (per 1000 births)"] <- "Infant_Mort"
names(countries)[names(countries) == "Literacy (%)"] <- "Literacy"

I am going to look into our NAs next. Amelia is one of my favorite packages; it contains a function called missmap() that visualizes NAs across a dataset.

missmap(countries)

3% missing! Not too bad. However, the observations that have missing values tend to be missing across multiple columns. There is no good way to support assumptions for imputing this data, so I will omit the NAs. In the future, learning in depth how to handle NAs in cluster analysis will be a great next step for me.

countries_naOmit <- na.omit(countries)

The na.omit() call dropped our country observations from 227 to 179. Not great practice, but workable for now.
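
To see where the losses come from, a quick per-column NA tally (base R, no extra packages needed):

# how many NAs does each column contribute?
sort(sapply(countries, function(col) sum(is.na(col))), decreasing = TRUE)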

Now that our data is free of missing values and we have changed all our values to numeric data, it’s a good time to start analyzing some relationships!

Examining the Big Relationships

countries %>%
  gather(Features, value, 2:18) %>%
  ggplot(aes(x=value)) +
  geom_histogram(fill="orange3", colour="black") +
  facet_wrap(~Features, scales="free_x") +
  labs(x="Values", y="Frequency") 

corrplot(cor(countries_naOmit[2:18]), type = "lower", method = "ellipse")

The histograms show the distribution of each variable measured. The correlation plot shows the correlation between every pair of these variables. There are not many tightly linear relationships, with the exception of the correlation between the number of phones and the GDP of a country (interesting!). Most pairs do not explain much of the variation in the data, as shown by the large number of faded ellipses.

Here is a close-up on a linear regression fit of GDP against phones.

ggplot(countries_naOmit, aes(x= Phones, y= GDP)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) + 
  ggtitle(label = "Regression of GDP and Phones Owned (Per 1000) Across Countries")

There appears to be some heteroscedasticity, with the variance increasing as GDP and phones per capita increase.

To further this analysis, I am going to examine some outlier countries for a few interesting variables. Which countries sit at the extreme lows or highs of the GDP spectrum? What about population density? I will use the dataset without the NAs omitted for this analysis.

GDP Analysis

I’m going to make subsets of this dataframe, filtering for the values at the high and low extremes. First, it’ll be useful to examine the distribution of the GDP feature.

countries_gdp <- countries %>%
  select(GDP, Country)

summary(countries_gdp)
##       GDP          Country         
##  Min.   :  500   Length:227        
##  1st Qu.: 1900   Class :character  
##  Median : 5550   Mode  :character  
##  Mean   : 9690                     
##  3rd Qu.:15700                     
##  Max.   :55100                     
##  NA's   :1
countries_gdp %>%
  ggplot(aes(x = 1, y = GDP)) + 
  geom_boxplot() +
  coord_flip()+
  ggtitle(label = "GDP per capita Across Countries")

Let’s see who’s at the high extreme of GDP. I’ll set the cutoff at $40,000 per capita. Big money!

countries_gdp_top <- countries_gdp %>% 
  filter(GDP > 40000) 

head(countries_gdp_top) #luxembourg? 
## # A tibble: 1 x 2
##     GDP Country   
##   <dbl> <chr>     
## 1 55100 Luxembourg

Luxembourg has the highest GDP per capita: a small European country bordered by Belgium, France, and Germany.

Let’s see who’s at the bottom of the GDP range. Using the five-number summary above, I decided on $600 as the per capita GDP cutoff.

countries_gdp_bot <- countries_gdp %>% 
  filter(GDP < 600) 

head(countries_gdp_bot)
## # A tibble: 3 x 2
##     GDP Country     
##   <dbl> <chr>       
## 1   500 East Timor  
## 2   500 Sierra Leone
## 3   500 Somalia

Three countries are reported all the way at the bottom of GDP per capita: East Timor, Sierra Leone, and Somalia, each at $500.

Density Analysis

Let’s look for extremes in the population density statistic.

countries_pDen <- countries %>%
  select(Pop_Dens, Country)

summary(countries_pDen)
##     Pop_Dens          Country         
##  Min.   :    0.00   Length:227        
##  1st Qu.:   29.15   Class :character  
##  Median :   78.80   Mode  :character  
##  Mean   :  379.05                     
##  3rd Qu.:  190.15                     
##  Max.   :16271.50
countries_pDen %>%
  ggplot(aes(x = 1, y = Pop_Dens)) + 
  geom_boxplot() +
  coord_flip()+
  ggtitle(label = "Population Density (per Sq. Mi.) Across Countries")

This distribution is heavily skewed by outliers. Let’s figure out what those observations are. Then I will remove the extreme outliers (> 15,000) and re-examine the distribution with a histogram.

countries_pDen_top <- countries_pDen %>% 
  filter(Pop_Dens > 15000) 

head(countries_pDen_top)
## # A tibble: 2 x 2
##   Pop_Dens Country
##      <dbl> <chr>  
## 1   16183  Macau  
## 2   16272. Monaco

Macau and Monaco have very high population densities. Let’s look into the distribution without such high outliers.

countries_pDen %>%
  filter(Pop_Dens < 2000) %>%
  ggplot(aes(x = Pop_Dens)) + 
  geom_histogram() +
  ggtitle(label = "Population Density (per Sq. Mi.) Across Countries (Outlier Removed)")

countries_pDen %>% 
  arrange(Pop_Dens) %>% 
  head()
## # A tibble: 6 x 2
##   Pop_Dens Country       
##      <dbl> <chr>         
## 1      0   Greenland     
## 2      1   Western Sahara
## 3      1.8 Mongolia      
## 4      2.2 French Guiana 
## 5      2.5 Namibia       
## 6      2.6 Australia

Anything higher than 500 people per square mile is a bit outside the norm for countries. At the low extreme are countries like Western Sahara, Mongolia, and Australia. Greenland’s population density is recorded as 0, which is better explained by rounding (its true density is well under 0.1 per sq. mi.) than by a data entry error.

Cluster Analysis

Time to attempt to group our countries based on this litany of different variables. I won’t go into this pretending to know the best way to group them. Instead, I’ll experiment with two different methods: k-means clustering and hierarchical clustering.

Preparation

First, the data must be scaled. This is absolutely crucial because our features are measured on very different scales. For the clustering to work across all these different indices and numeric ranges, standardizing is important!

countries_scaled <- as.data.frame(scale(countries_naOmit[2:18]))
countries_scaled <- cbind(countries_naOmit[,1], countries_scaled)
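
A quick sanity check that the standardization behaved as intended; every feature should now have mean roughly 0 and standard deviation 1:

# column means should be ~0 and standard deviations 1 after scaling
round(colMeans(countries_scaled[2:18]), 3)
round(sapply(countries_scaled[2:18], sd), 3)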

Hierarchical Clustering - Attempt 1

For hierarchical clustering, I will follow this flow: first, create a distance matrix. Second, build the hierarchy and its dendrogram (it will look very cluttered, but it will show the relative Euclidean distances between the countries and help me settle on a useful distance cutoff). Third, extract the clusters from the dendrogram. Lastly, explore the clusters and the interesting differences between them.

Step one, make the distance matrix.

dist_countries <- dist(countries_scaled, method = "euclidean")
## Warning in dist(countries_scaled, method = "euclidean"): NAs introduced by
## coercion
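
That warning is worth a note: the character Country column went along for the ride, and dist() coerces it to numeric, producing NAs. The clustering still runs, but a cleaner call passes the numeric columns only; a sketch (the name below is just illustrative):

# distance on the numeric features only (column 1 is Country)
dist_countries_num <- dist(countries_scaled[, -1], method = "euclidean")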

Step two, build the tree. For the linkage criterion, I have chosen “average”, which measures the dissimilarity between two clusters as the mean of all pairwise distances between their members.

hc_countries <- hclust(dist_countries, method = "average")

Step three, visualize the hierarchy with a dendrogram. Greater height on the y-axis means greater dissimilarity between the groups being fused. I am interested in what this will look like!

dend_countries <- as.dendrogram(hc_countries) 
plot(dend_countries, main = "Clustering across 179 Countries", 
     ann = FALSE)

Not ideal! It is difficult to discern a good cutoff value from this dendrogram. The countries are similar and dissimilar to one another in too many different ways. It’s time to go back and focus on a few variables for a cleaner clustering.

Hierarchical Clustering - Attempt 2

I’m going to isolate a few features: GDP, literacy, infant mortality, birthrate, deathrate, and phones. These variables are more closely related to one another than, say, the amount of coastline a country has.

countries_t2 <- countries_naOmit %>% 
  select(GDP, Literacy, Infant_Mort, 
         Birthrate, Deathrate, Phones)

Let’s repeat the process with this more focused hierarchical clustering analysis.

countries_t2_scaled <- as.data.frame(scale(countries_t2))

dist_countries_t2 <- dist(countries_t2_scaled, method = "euclidean")

hc_countries_t2 <- hclust(dist_countries_t2, method = "average")

dend_countries_t2 <- as.dendrogram(hc_countries_t2) 
plot(dend_countries_t2)

Not quite as bad! The countries do seem to be better grouped based on these characteristics. I’m going to pick h = 2 to capture some clusters that offer useful statistical comparisons.

dend_countries_colored <- color_branches(dend_countries_t2, h = 2)
# relabel the leaves with country names (plot.dendrogram has no `label` argument)
labels(dend_countries_colored) <- countries_naOmit$Country[order.dendrogram(dend_countries_colored)]
plot(dend_countries_colored)

As shown by the coloration, we now have a handful of clusters of countries, each with fairly low internal distances. Let’s observe what we can from the different clusters.

There are a few different methods for visualizing and analyzing hierarchical clusters. To extract cluster assignments I will cut the tree with cutree(), cutting a little higher, at h = 3, to keep the number of extracted clusters manageable.

clusters_h2 <- cutree(hc_countries_t2, h = 3)
df_h2_average <- mutate(countries_t2, cluster = clusters_h2)
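
Worth noting: cutree() also accepts k directly if you would rather fix the number of clusters than the cut height. A sketch (clusters_k3_tree is just an illustrative name):

# equivalent extraction with a fixed cluster count instead of a height
clusters_k3_tree <- cutree(hc_countries_t2, k = 3)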

First, let’s analyze the membership.

count(df_h2_average, cluster)
## # A tibble: 8 x 2
##   cluster     n
##     <int> <int>
## 1       1     1
## 2       2   137
## 3       3     1
## 4       4     1
## 5       5    46
## 6       6     5
## 7       7     1
## 8       8     2

There are eight clusters, but their memberships vary dramatically. Clusters 2 and 5 have decent-sized memberships (137 and 46 countries), while the rest are limited to five or fewer. Let’s focus on clusters 2, 3, and 5.

df_h2_average %>% 
  filter(cluster %in% c(2, 3, 5)) %>% 
  group_by(cluster) %>%
  summarize_all(funs(mean(.)))
## # A tibble: 3 x 7
##   cluster    GDP Literacy Infant_Mort Birthrate Deathrate Phones
##     <int>  <dbl>    <dbl>       <dbl>     <dbl>     <dbl>  <dbl>
## 1       2 12288.     924.       1837.     1524.      677.  290. 
## 2       3  1900      420       19119      4511       242     7.8
## 3       5  1554.     563.       6973.     3699.     1132.   16.6
ggpairs(data = df_h2_average, columns=1:6, aes(colour=factor(cluster), alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both")

According to the summarize output, there is a certain logic to the clusters.

Cluster 5 (46 countries): low GDP per capita, low literacy, high infant mortality and birthrate, the highest death rate, and very few phones per capita. These are likely the countries with the lowest economic development.

Cluster 2 (137 countries): the highest mean GDP per capita of the three, plus the highest literacy and phone ownership and the lowest infant mortality and birthrate. This is by far the largest cluster, spanning the middle and upper range of development.

Cluster 3 (a single country): the most extreme profile of all, with the lowest literacy and phone ownership and the highest infant mortality and birthrate. With only one member, it reads more as an outlier than as a true group.

The clusters, in short, separate the countries by level of economic development, though the cut leaves one of the three as a lone outlier.

Now it’s time to visit a different clustering method: k-means.

K-means Clustering

For this form of clustering, I will follow a different but similar workflow. Since the data has already been processed, I can go straight to my cleaned dataset. First, I will estimate the best number of clusters (k) via the elbow method, refitting the clustering model at various values of k and plotting the total within-cluster sum of squares. Second, I will check the candidate k via the silhouette method. Then I will extract clusters with the chosen k and explore them.

First, the elbow plot. The kmeans() model is fit to the (scaled) countries_t2 data from my second attempt at hierarchical clustering.

tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = countries_t2_scaled, centers = k)
  model$tot.withinss  # total within-cluster sum of squares at this k
})

elbow_df <- data.frame(k = 1:10, tot_withinss = tot_withinss)

elbow_df %>%
  ggplot(aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
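
One caveat before reading the plot: kmeans() starts from random centroids, so the curve can shift a little between runs. A common hedge is the nstart argument, which refits from multiple random starts and keeps the best solution; a sketch:

# a more stable fit: keep the best of 25 random starts
model <- kmeans(x = countries_t2_scaled, centers = 3, nstart = 25)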

According to the elbow plot, moving to two clusters gives a big reduction in the total within-cluster sum of squares, and there is another bend at k = 3. Since our hierarchical model seemed to work well with three clusters, I will operate under the “least developed”, “mid developed”, “most developed” mindset for this k-means model.

The next test of k values is the silhouette, which compares each point’s distance to the other points within its own cluster against its distance to the points of the nearest neighboring cluster. On a scale from -1 to 1, a value near 1 means the point is well matched to its cluster and a value near -1 means it is poorly matched. Using k = 3, I will see how well this clustering fits.

pam_k3 <- pam(countries_t2_scaled, k = 3) 
stil_plot <- silhouette(pam_k3) 
plot(stil_plot, main= "Silhouette Plot of K = 3")

According to the silhouette analysis, we have an average width of 0.36. The model is reasonably well fit, but not perfect: a few countries don’t sit comfortably in their cluster (in our hierarchical model, this showed up as the several tiny extra clusters). These outliers bring down the overall fit.

Out of curiosity, I am going to run another silhouette analysis with k = 2.

pam_k2 <- pam(countries_t2_scaled, k = 2) 
stil_plot_k2 <- silhouette(pam_k2) 
plot(stil_plot_k2, main= "Silhouette Plot of K = 2")

We get a better model fit, but at the expense of less interesting groups: with fewer, larger clusters, the observations we can make are less nuanced. I will stick with k = 3 for the analysis.

set.seed(9989)
countries_k3 <- kmeans(countries_t2_scaled, centers = 3)
countries_k3_final <- cbind(countries_t2, countries_k3$cluster)

count(countries_k3_final, countries_k3$cluster)
## # A tibble: 3 x 2
##   `countries_k3$cluster`     n
##                    <int> <int>
## 1                      1    43
## 2                      2    95
## 3                      3    56
countries_k3_final %>% 
  group_by(countries_k3$cluster) %>%
  summarize_all(funs(mean(.)))
## # A tibble: 3 x 7
##   `countries_k3$cl~    GDP Literacy Infant_Mort Birthrate Deathrate Phones
##               <int>  <dbl>    <dbl>       <dbl>     <dbl>     <dbl>  <dbl>
## 1                 1 25353.     971.        598.     1106.      701.  533. 
## 2                 2  6863.     895.       2263.     1661.      667.  184. 
## 3                 3  1632.     582.       7691.     3576.     1278.   15.5
ggpairs(data = countries_k3_final, columns=1:6, aes(colour=factor(countries_k3$cluster), alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both")

In our k-means clustering model, the groups have 43, 95, and 56 countries respectively. The model assigned the countries to three groups that follow the same interpretation as our hierarchical clustering model. Group 1 is the well-developed set, with the highest GDP per capita (by a lot), literacy, and phone ownership, and the lowest birthrate and infant mortality rate. However, we run into one confusing feature with the death rate: the most developed cluster has a slightly higher average death rate than the mid-developed one. That seems counter-intuitive, though it may simply reflect the older age structure of developed countries.

Cluster 3 holds our least developed countries, with the lowest GDP, literacy, and phone ownership. The infant mortality rate is the highest, as are the birthrate and deathrate.

Conclusion

In this exploratory data analysis I gave clustering my first shot! I used the hierarchical modeling approach first and learned its challenges and benefits. Then I used the k-means approach. Both models pointed to broadly the same result: the countries group by level of economic development, from least to most developed.