The aim of this project is to analyze world countries based on the World Happiness Report of 2019 and create interesting maps based on the report.

The dataset is provided by Kaggle and can be accessed here.

1. Importing and cleaning data

Let’s start by loading the necessary packages and reading in the data.

library(tidyverse)
library(readr)
world_happiness_report_2019 <- read_csv("~/Downloads/world-happiness-report-2019.csv")
df <- world_happiness_report_2019

Let’s now look at the first 5 rows of the data set to see what it looks like:

head(df, 5)

## # A tibble: 5 x 11
##   `Country (regio~ Ladder `SD of Ladder` `Positive affec~ `Negative affec~
##   <chr>            <chr>           <int>            <int>            <int>
## 1 "per capita\""   "Heal~             NA               NA               NA
## 2 Finland          1                   4               41               10
## 3 Denmark          2                  13               24               26
## 4 Norway           3                   8               16               29
## 5 Iceland          4                   9                3                3
## # ... with 6 more variables: `Social support` <int>, Freedom <int>,
## #   Corruption <int>, Generosity <int>, `Log of GDP\nper capita` <int>,
## #   `Healthy life\nexpectancy` <int>

This certainly looks a bit messy and needs some cleaning up before we can continue. Specifically, we need to rename the columns, remove the row with missing values and change the Ladder variable type to integer.

names(df) <- c("Country", "Ladder", "SD_of_ladder", "Positive_affect", 
               "Negative_affect", "Social_support", "Freedom", "Corruption", 
               "Generosity", "Log_of_GDP_per_cap", "Healthy_life_exp")
df <- na.omit(df)
df$Ladder <- as.integer(df$Ladder)
glimpse(df)

## Observations: 140
## Variables: 11
## $ Country            <chr> "Finland", "Denmark", "Norway", "Iceland", ...
## $ Ladder             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ SD_of_ladder       <int> 4, 13, 8, 9, 1, 11, 18, 15, 23, 10, 26, 62,...
## $ Positive_affect    <int> 41, 24, 16, 3, 12, 44, 34, 22, 18, 64, 47, ...
## $ Negative_affect    <int> 10, 26, 29, 3, 25, 21, 8, 12, 49, 24, 37, 8...
## $ Social_support     <int> 2, 4, 3, 1, 15, 13, 25, 5, 20, 31, 7, 42, 3...
## $ Freedom            <int> 5, 6, 3, 7, 19, 11, 10, 8, 9, 26, 17, 16, 9...
## $ Corruption         <int> 4, 3, 8, 45, 12, 7, 6, 5, 11, 19, 13, 58, 7...
## $ Generosity         <int> 47, 22, 11, 3, 7, 16, 17, 8, 14, 25, 6, 75,...
## $ Log_of_GDP_per_cap <int> 22, 14, 7, 15, 12, 8, 13, 26, 19, 16, 18, 6...
## $ Healthy_life_exp   <int> 27, 23, 12, 13, 18, 4, 17, 14, 8, 15, 10, 2...

Everything seems good now. We can see that the variables contain rankings for each country - the lower the score, the better the ranking.

2. Clustering with kmeans

It would be interesting to see which countries fall into similar groups (clusters) based on the report. For this we will set a random seed for reproducibility and first “guesstimate”" that the number of clusters might be 5, for the purposes of this project. We will then see what the results look like and confirm the actual optimal number of clusters.

set.seed(12345) 
countries <- df 
df_clus <- df[-1]
cluster_solution <- kmeans(df_clus, centers = 5)
countries$cluster <- as.factor(cluster_solution$cluster)

Let’s take a look at the distribution of cluster assignments and visualize it:

table(countries$cluster)

## 
##  1  2  3  4  5 
## 27 29 37 25 22

ggplot(countries, aes(cluster, Ladder, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Life Satisfaction")

We can immediately see that the clusters seem to be quite uniformly distributed. If we look at the plot above we can notice that clusters 1 & 4 and 2 & 3 look rather similar in terms of distribution.

In order to confirm that the kmeans algorithm is indeed working properly, let’s repeat this with a different random seed to see if the results are similar:

set.seed(23456) 
countries2 <- countries 
df_clus2 <- df_clus
cluster_solution2 <- kmeans(df_clus2, centers = 5)
countries2$cluster <- as.factor(cluster_solution2$cluster)

table(countries2$cluster)

## 
##  1  2  3  4  5 
## 22 34 29 28 27

ggplot(countries2, aes(cluster, Ladder, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Ladder (Life Satisfaction) - Comparison")

The results are quite similar so we can conclude that the algorithm works well. This time the clusters 2 & 3 and 4 & 5 look very similar. Could it be that 5 is not the optimal number of clusters for this dataset? Let’s find out.

library(factoextra)
library(NbClust)

# Elbow method
fviz_nbclust(df_clus, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

# Silhouette method
fviz_nbclust(df_clus, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

# Gap statistic
set.seed(1234)
fviz_nbclust(df_clus, kmeans, nstart = 25,  method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method")

The elbow method suggests 4, while the silhouette method suggests 2 and the gap statistic method suggests 6. We’ll take the middle ground and repeat the clustering with 4 centers.

set.seed(12345) 
countries4 <- countries 
df_clus4 <- df[-1]
cluster_solution <- kmeans(df_clus4, centers = 4)
countries4$cluster <- as.factor(cluster_solution$cluster)

Let’s have a look at the distribution of the four cluster assignments:

# Load the GGally package
library(GGally)
table(countries4$cluster)

## 
##  1  2  3  4 
## 30 43 25 42

#Group by the cluster assignment and calculate averages
df4_clus_avg <- countries4 %>%
  group_by(cluster) %>%
  summarize_if(is.numeric, mean, na.rm=TRUE)

# View the resulting table
df4_clus_avg

## # A tibble: 4 x 11
##   cluster Ladder SD_of_ladder Positive_affect Negative_affect
##   <fct>    <dbl>        <dbl>           <dbl>           <dbl>
## 1 1         93.2        107.             52.9            98  
## 2 2         59.9         60.3            88.8            58.5
## 3 3         17.8         26.2            38.5            34.4
## 4 4        125.         108.            109.            114. 
## # ... with 6 more variables: Social_support <dbl>, Freedom <dbl>,
## #   Corruption <dbl>, Generosity <dbl>, Log_of_GDP_per_cap <dbl>,
## #   Healthy_life_exp <dbl>

#Create a parallel coordinate plot of the values:
ggparcoord(df4_clus_avg, columns = c(2:11), 
           groupColumn = "cluster", scale = "globalminmax", order = "skewness") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

3. Exploratory data analysis

Would it be possible to guess which countries those clusters might represent based on the above plot? My guess is that the blue line (cluster 3) most likely represents economically powerful countries such as those in Scandinavia, Western Europe, North America and Oceania (Australia and New Zealand). Notice how the Negative Affect in those countries ranks higher than Positive Affect.

It is also interesting to note how clusters 1 (the red line) and 2 (the green line) seem to be the opposities of each other - where one scores higher, the other one scores lower respectively. Now let’s look more closely. The countries in cluster 2 have the lowest average ranking in Corruption (meaning that they experience the highest rates of it), but at the same time they are the closest to the countries in cluster 3 in many other measures, such as Log of GDP per capita, Ladder (Life satisfaction), Negative effect, Social support and Healthy life expectancy. In fact, there even appears to be some correlation between clusters 3 and 2 - where one ranks high, the other one tends to follow.

Cluster 1 (the red line), on the other hand, appears to be quite the opposite of both cluster 2 and 3 - where those two rank high, cluster 1 ranks low. Interestingly though, cluster 1 ranks higher in Generosity and considerably higher in Freedom and Positive affect compared to cluster 2, and is much closer to cluster 3 in those respects. However, it is the second lowest ranking cluster in terms of Social support and Healthy life expectancy. While it is difficult to estimate which countries may belong to each cluster, I will guess that among countries in other regions we may find some Latin American countries in cluster 1 and post-socialist countries in cluster 2.

Cluster 4 consists of countries that rank the lowest in all measures except Corruption where the cluster holds the third best ranking. This cluster most likely consists of poor African, Middle Eastern and Asian countries.

Before we confirm what countries belong to each cluster, can visualize the distribution of the four clusters per each measure more closely:

#make a dot plot
countries4_gathered <- countries4 %>% gather(Measure, Ranking, Ladder:Healthy_life_exp) 

ggplot(countries4_gathered, aes(Measure, Ranking, col = cluster)) +
  geom_point(alpha = 0.3) +
  geom_jitter() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  ggtitle("Distribution of clusters per measure - 4 clusters")

countries4_gathered %>% 
ggplot(aes(cluster, Ranking, col = cluster)) +
  geom_point(alpha = 0.3) +
  geom_jitter() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  ggtitle("Distribution of 4 clusters - facetted by measure") +
  facet_wrap(~Measure)

Some very interesting observations stick out. While the highest ranking cluster in Corruption is cluster 3, one country from cluster 1 stands out as well. Let’s see what country that is:

countries4 %>%
  filter(cluster == 1) %>%
  select(Country, Corruption) %>%
  arrange(Corruption) %>%
  head(n = 5)

## # A tibble: 5 x 2
##   Country Corruption
##   <chr>        <int>
## 1 Rwanda           2
## 2 Myanmar         24
## 3 Bhutan          25
## 4 Gambia          26
## 5 Laos            27

We have just found out that Rwanda is the second highest ranking country in Corruption meaning that it is the second least corrupt country. Some countries from cluster 1 also stick out as high ranking in Freedom, Generosity and Positive affect. Let’s find out which countries from cluster 1 ranked 25 or higher in those measures:

countries4_gathered %>%
   filter(cluster == 1, 
          Measure %in% c("Freedom", "Generosity", "Positive_affect"),
          Ranking <= 25) %>%
   select(Measure, Country, Ranking) %>%
   arrange(Measure, Ranking)

## # A tibble: 15 x 3
##    Measure         Country     Ranking
##    <chr>           <chr>         <int>
##  1 Freedom         Cambodia          2
##  2 Freedom         Philippines      15
##  3 Freedom         Rwanda           21
##  4 Freedom         Laos             22
##  5 Freedom         Guatemala        25
##  6 Generosity      Myanmar           1
##  7 Generosity      Indonesia         2
##  8 Generosity      Bhutan           13
##  9 Positive_affect Paraguay          1
## 10 Positive_affect Laos              5
## 11 Positive_affect Guatemala         8
## 12 Positive_affect Indonesia         9
## 13 Positive_affect Ecuador          11
## 14 Positive_affect Honduras         13
## 15 Positive_affect El Salvador      23

We can see that Cambodia ranks 2nd highest in Freedom, Myanmar and Indonesia are the two highest ranking countries in Generosity and Paraguay ranks the highest in Positive affect.

We can also visualize the clusters more closely in separate plots per each measure:

ggplot(countries4, aes(cluster, Ladder, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Life Satisfaction")

ggplot(countries4, aes(cluster, Positive_affect, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Positive Affect")

ggplot(countries4, aes(cluster, Negative_affect, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Negative Affect")

ggplot(countries4, aes(cluster, Social_support, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Social Support")

ggplot(countries4, aes(cluster, Freedom, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Freedom")

ggplot(countries4, aes(cluster, Corruption, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Corruption")

ggplot(countries4, aes(cluster, Generosity, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Generosity")

ggplot(countries4, aes(cluster, Healthy_life_exp, col = cluster)) +
  geom_point(alpha = 0.6) +
  geom_jitter() +
  ggtitle("Distribution of clusters by Healthy Life Expectancy")

And finally, let’s see which countries belong to each cluster. To avoid long lists, we will limit the number of countries per cluster to 5 in no particular order. We will have a full overview of all the countries and clusters later on a world map.

#Cluster 1
countries4 %>%
  filter(cluster == 1) %>%
  select(Country) %>%
  head(5)

## # A tibble: 5 x 1
##   Country    
##   <chr>      
## 1 Guatemala  
## 2 El Salvador
## 3 Nicaragua  
## 4 Ecuador    
## 5 Honduras

#Cluster 2
countries4 %>%
  filter(cluster == 2) %>%
  select(Country) %>%
  head(5)

## # A tibble: 5 x 1
##   Country       
##   <chr>         
## 1 Israel        
## 2 Czech Republic
## 3 Mexico        
## 4 Chile         
## 5 Spain

#Cluster 3
countries4 %>%
  filter(cluster == 3) %>%
  select(Country) %>%
  head(5)

## # A tibble: 5 x 1
##   Country    
##   <chr>      
## 1 Finland    
## 2 Denmark    
## 3 Norway     
## 4 Iceland    
## 5 Netherlands

#Cluster 4
countries4 %>%
  filter(cluster == 4) %>%
  select(Country) %>%
  head(5)

## # A tibble: 5 x 1
##   Country    
##   <chr>      
## 1 Pakistan   
## 2 Algeria    
## 3 Morocco    
## 4 Cameroon   
## 5 Ivory Coast

4. Putting everything on a map

Now let’s put all the countries on a world map. Note that not all countries are included in the original report, therefore some countries will be left without any info on the map.

library(ggmap)
library(leaflet)
library(htmltools)

register_google(key = api)
geo <- geocode(countries4$Country, source = "google", messaging = FALSE)

#check that the new dataframe has the same nr of observations as countries4
glimpse(geo)

## Observations: 140
## Variables: 2
## $ lon <dbl> 25.748151, 9.501785, 8.468946, -19.020835, 5.291266, 8.227...
## $ lat <dbl> 61.924110, 56.263920, 60.472024, 64.963051, 52.132633, 46....

glimpse(countries4)

## Observations: 140
## Variables: 12
## $ Country            <chr> "Finland", "Denmark", "Norway", "Iceland", ...
## $ Ladder             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ SD_of_ladder       <int> 4, 13, 8, 9, 1, 11, 18, 15, 23, 10, 26, 62,...
## $ Positive_affect    <int> 41, 24, 16, 3, 12, 44, 34, 22, 18, 64, 47, ...
## $ Negative_affect    <int> 10, 26, 29, 3, 25, 21, 8, 12, 49, 24, 37, 8...
## $ Social_support     <int> 2, 4, 3, 1, 15, 13, 25, 5, 20, 31, 7, 42, 3...
## $ Freedom            <int> 5, 6, 3, 7, 19, 11, 10, 8, 9, 26, 17, 16, 9...
## $ Corruption         <int> 4, 3, 8, 45, 12, 7, 6, 5, 11, 19, 13, 58, 7...
## $ Generosity         <int> 47, 22, 11, 3, 7, 16, 17, 8, 14, 25, 6, 75,...
## $ Log_of_GDP_per_cap <int> 22, 14, 7, 15, 12, 8, 13, 26, 19, 16, 18, 6...
## $ Healthy_life_exp   <int> 27, 23, 12, 13, 18, 4, 17, 14, 8, 15, 10, 2...
## $ cluster            <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3...

#add geocodes to countries4
countries4$lon <- geo$lon
countries4$lat <- geo$lat

#just a little double check to make sure everything looks ok now
countries4 %>% select(Country, Ladder, lon, lat)

## # A tibble: 140 x 4
##    Country     Ladder     lon   lat
##    <chr>        <int>   <dbl> <dbl>
##  1 Finland          1   25.7   61.9
##  2 Denmark          2    9.50  56.3
##  3 Norway           3    8.47  60.5
##  4 Iceland          4  -19.0   65.0
##  5 Netherlands      5    5.29  52.1
##  6 Switzerland      6    8.23  46.8
##  7 Sweden           7   18.6   60.1
##  8 New Zealand      8  175.   -40.9
##  9 Canada           9 -106.    56.1
## 10 Austria         10   14.6   47.5
## # ... with 130 more rows

# Create data frames for each clusters
clus1 <- filter(countries4, cluster == "1")  
clus2 <- filter(countries4, cluster == "2")
clus3 <- filter(countries4, cluster == "3")
clus4 <- filter(countries4, cluster == "4")

# Create a layered leaflet map 
pal <- colorFactor(topo.colors(4), countries4$cluster)

c1 <- leaflet() %>% 
  addProviderTiles("CartoDB") %>% 
  addCircleMarkers(data = clus1, radius = 5, label = ~htmlEscape(cluster),
                   color = ~pal(cluster), group = "Cluster1")

c2 <- c1 %>%
  addCircleMarkers(data = clus2, radius = 5, label = ~htmlEscape(cluster),
                   color = ~pal(cluster), group = "Cluster2")
c3 <- c2 %>%
  addCircleMarkers(data = clus3, radius = 5, label = ~htmlEscape(cluster),
                   color = ~pal(cluster), group = "Cluster3")
c4 <- c3 %>%
  addCircleMarkers(data = clus4, radius = 5, label = ~htmlEscape(cluster),
                   color = ~pal(cluster), group = "Cluster4")

final_map <- c4 %>%
  addLayersControl(overlayGroups = c("Cluster1", "Cluster2", "Cluster3", "Cluster4"))

4.1. World countries clustered

final_map

Note that the country of Georgia was incorrectly geocoded as the state of Georgia (USA).

4.2. Maps of all rankings

#map of positive affect rankings
pal1 <- colorNumeric(palette = "RdBu", domain = countries4$Positive_affect)

pa_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
              color = ~pal1(Positive_affect)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Positive affect ranking",
            position = "topleft")
pa_map

#map of negative affect rankings
na_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
                   color = ~pal1(Negative_affect)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Negative affect ranking",
            position = "topleft")
na_map

#map of social support rankings
ss_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
                   color = ~pal1(Social_support)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Social support ranking",
            position = "topleft")
ss_map

#map of freedom rankings
f_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
                   color = ~pal1(Freedom)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Freedom ranking",
            position = "topleft")
f_map

#map of corruption rankings (highest is least corrupt)
c_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
                   color = ~pal1(Corruption)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Corruption ranking",
            position = "topleft")
c_map

#map of generosity rankings
g_map <- leaflet(countries4) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(radius = 5, 
                   color = ~pal1(Generosity)) %>%
  addLegend(pal = pal1,
            values = c(1:140),
            opacity = 0.75,
            title = "Generosity ranking",
            position = "topleft")
g_map