Ed Bullen, 12 August 2016

Introduction

This document is the output from a “Hackathon” session with the Central London Data-Science Meetup Group. The task was to explore the World Happiness Data-Set published by the UN. The data provided includes a set of country metrics that are expected to influence general happiness and a separate summary survey “Happiness Score” for each country.

The Happiness Score is derived simply by interviewing a random population sample from each country in a poll that “asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale”.

The original suggestion was to look at which of the metrics supplied for each country most affected the happiness score reported by the survey. I took a different tack and looked at how well the countries divided into clusters based on the supplied data, and whether K-Means clustering could be used to group countries into “Happy” and “Unhappy” sets.

I then tried using K-Means clustering to predict Happiness for a given test data-set after “training” a K-Means cluster with a larger training set of data from the World Happiness Data-Set.

This was a fairly artificial exercise to experiment with K-Means clustering and try out a very simple Machine Learning technique, but it still gave some interesting insight into the real data. This is more of a technical guide than a piece of real statistical research, so I have left all the code in-line in the document.

Approach

After sourcing the data (described in Part 1), the first stage was just to look at how well the data separated into groups and whether these groups could be classified as different levels of Happiness (Part 2).

The next step (Parts 3 to 5) was to artificially split the data into a Training and a Test data-set and create a K-Means cluster from the Training data. Then I used the known happiness scores from the Gallup Poll data to classify the clusters. Finally, I mapped the Test data-set to the appropriate clusters to see whether each country fell in the Top, Middle or Bottom cluster-group for happiness.

Outcome

The results were imperfect, as explored in more detail in “Part 2” below, but it was possible to get a fairly reliable indicator of a country’s happiness based on the 6 metrics provided. Given this, K-Means clustering offered a simple way for a “Machine Learning” approach to classify country data into a happiness group.

The imperfections in the results returned by the model are interesting - maybe there are other metrics that need to be considered or maybe this analysis is highlighting how imperfect the poll survey of Happiness is.

The main thing I took from this, however, was a simple template technical process to do this type of automated classification given the necessary training data-set.

Part 1: Source the Data

Two sets of data are required. Firstly, the various metrics associated with each country that are expected to influence the country’s happiness - i.e. GDP, Life Expectancy etc. Secondly, the Happiness Scores for each country, which are simply gathered via a Gallup Poll to give a score on a 0 to 10 scale, as described in the introduction.

## Download Data from "worldhappiness.report" web-site ##
library("XLConnect")   # library for reading Excel Spreadsheet Data

hurl <- 'http://worldhappiness.report/wp-content/uploads/sites/2/2016/03/Online-data-for-chapter-2-whr-2016.xlsx'
file <- "happiness.xlsx"
download.file(hurl, file, mode="wb")

h <- readWorksheetFromFile(file, sheet=3, startRow = 1, endCol = 2) # happiness scores
data <- readWorksheetFromFile(file, sheet=1, startRow = 1, endCol = 14) # country survey metrics

# Data tidy-up
data <- data[, -1]  #drop duplicate col
names(h)[1] <- "country"  #lowercase col-name
data <- data[data$year == 2015,] # just focus on one year 
data$country <- gsub('\\s', '', data$country) #strip spaces to join with h (happiness scores)
h$country <- gsub('\\s', '', h$country) #strip spaces to join with data (happiness scores)
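
As a quick sanity check (a minimal sketch - the exact counts depend on the spreadsheet version downloaded), it is worth confirming that the two data-frames will join on the cleaned-up country names:

# how many 2015 metric rows there are, and how many also have a survey score
nrow(data)
nrow(h)
sum(data$country %in% h$country)  # countries present in both data-frames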

The data-frame data contains the country metrics as follows:

names(data)
##  [1] "country"                          
##  [2] "year"                             
##  [3] "Life.Ladder"                      
##  [4] "Log.GDP.per.capita"               
##  [5] "Social.support"                   
##  [6] "Healthy.life.expectancy.at.birth" 
##  [7] "Freedom.to.make.life.choices"     
##  [8] "Generosity"                       
##  [9] "Perceptions.of.corruption"        
## [10] "Positive.affect"                  
## [11] "Negative.affect"                  
## [12] "Confidence.in.national.government"
## [13] "Democratic.Quality"

and the data-frame h contains the happiness scores - the first few rows are listed below:

head(h)
##       country Happiness.score
## 1     Denmark           7.526
## 2 Switzerland           7.509
## 3     Iceland           7.501
## 4      Norway           7.498
## 5     Finland           7.413
## 6      Canada           7.404

Part 2: Create K-Means Cluster and Visualise Clustering Effectiveness

In this section we create a K-Means cluster from a cleaned-up subset of the data data-frame using the standard R k-means clustering function kmeans().

I didn’t spend much time looking at what would be a good number of clusters - this was a quick hackathon project and the choice of three clusters was not perfect but “seemed to work OK”.

The script below also adds the separately surveyed overall “Happiness Score” for each country.
It is important to note that the clusters are defined by 6 dimensions that are not themselves “Happiness” - they are measures such as GDP, Freedom, Corruption etc. The clusters are just clusters; we then use the Happiness Score to see how well these clusters align with the actual Happiness ranking.

# Just focus on col 4:9 - GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption
# where there is a complete data-set (no "NAs")
subset <- data[complete.cases(data[,c(1,4:9)]), ]
#join the happiness ranking on as 1st column
subset <- merge(h,subset, by.x = "country", by.y="country")

# Create a K-Means cluster with 3 groups based on cols 5:10
# (GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption)
km <- kmeans(subset[, 5:10],3, iter.max=100)
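
The fitted km object returned by kmeans() holds the cluster centres, the per-cluster sizes and the cluster assignment for each country, and can be inspected directly (values will differ between runs, because kmeans() starts from random centres):

km$centers         # mean of the 6 metrics within each of the 3 clusters
km$size            # number of countries assigned to each cluster
head(km$cluster)   # cluster ID (1 to 3) for the first few rows of subset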

Next, get 3 vectors of Happiness Score g1..g3 from the subset data-frame, referenced by the KMeans cluster 1..3:

g1 <- subset[km$cluster == 1,]$Happiness.score
g2 <- subset[km$cluster == 2,]$Happiness.score
g3 <- subset[km$cluster == 3,]$Happiness.score

Now, we can plot a Histogram showing the distribution of Happiness Score for each of the three clusters:

# plot option "col=rgb(x,x,x,0.5)" gives fill transparency
hist(g1, xlim=c(0,10), col=rgb(1,0,0,0.5), breaks=seq(0.25,10,0.25)  
     , main = "Histogram of Happiness Score for 3 cluster-groups"
     , xlab = "Country Happiness Score")
hist(g2, xlim=c(0,10), col=rgb(0,1,0,0.5), breaks=seq(0.25,10,0.25), add=T)
hist(g3, xlim=c(0,10), col=rgb(0,0,1,0.5), breaks=seq(0.25,10,0.25), add=T)
legend("topright", c("Group1", "Group2", "Group3")
       , fill=c(rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(0,0,1,0.5)) )

From this we can see that the three clusters approximately map to three levels of “happiness” - i.e. the most happy set, the least happy set, and a medium set.

There is quite a bit of overlap, however. There are a number of exception cases where countries classed in the happiest set have an actual happiness score that would position them much lower down the scale. I didn’t do any detailed analysis other than list out the countries placed in the happiest group whose happiness score is actually below the median (see further down in this section).

This imperfection is very interesting - to me it implies that, despite having all the attributes that make most countries happy, these exceptions are still not happy due to some other, unknown factor.

Number of Countries in Each Cluster

print(km$size)
## [1] 47 36 30

Who is in the Happiest Group?

top <- which.max(c(mean(g1),mean(g2), mean(g3))) # which is the top group
happiest <- subset[km$cluster == top, 1:2]
print(happiest[order(happiest$Happiness.score, decreasing=TRUE), ], row.names = FALSE)
##               country Happiness.score
##               Denmark           7.526
##           Switzerland           7.509
##                Norway           7.498
##               Finland           7.413
##                Canada           7.404
##           Netherlands           7.339
##            NewZealand           7.334
##             Australia           7.313
##                Sweden           7.291
##                Israel           7.267
##               Austria           7.119
##          UnitedStates           7.104
##             CostaRica           7.087
##               Germany           6.994
##               Belgium           6.929
##               Ireland           6.907
##            Luxembourg           6.871
##                Mexico           6.778
##         UnitedKingdom           6.725
##                 Chile           6.705
##                Panama           6.701
##         CzechRepublic           6.596
##               Uruguay           6.545
##                France           6.478
##                 Spain           6.361
##              Slovakia           6.078
##                 Italy           5.977
##                 Japan           5.921
##            SouthKorea           5.835
##              Slovenia           5.768
##               Croatia           5.488
##  BosniaandHerzegovina           5.163
##               Lebanon           5.129
##              Portugal           5.123
##                Greece           5.033
##               Albania           4.655

Identify Exceptions in the Happiest Group

Identify who in the top group has lower happiness than the median happiness for the entire set of countries:

print(happiest[happiest$Happiness.score < median(h$Happiness.score), ],row.names = FALSE)
##               country Happiness.score
##               Albania           4.655
##  BosniaandHerzegovina           5.163
##                Greece           5.033
##               Lebanon           5.129
##              Portugal           5.123

Part 3: Split the Data into a Training and Test Data-Set

Having explored the concept of grouping the data into clusters of different Happiness based on certain indicators, it should be possible to develop a simple model to predict happiness based on a training set of data with known happiness scores - with the caveat that the accuracy will only be as good as seen previously in Part 2.

First, we need to split our data-set from the World Happiness Report into two parts - a Training Set and a Test Set:

subset <- data[complete.cases(data[,c(1,4:9)]), ]

# random split into a training and small test data-set
set.seed(1)
ind <- sample(nrow(subset),10)
# add TRUE / FALSE index to data
subset[["train"]] <- TRUE
subset[["train"]][ind] <- FALSE
train <- subset[subset[["train"]]==TRUE, ]
test <- subset[subset[["train"]]==FALSE, ]
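
A quick check of the split sizes (the test set is always the 10 sampled rows; the training count depends on how many complete rows survived the NA filter):

nrow(train)   # countries used to build the clustering model
nrow(test)    # the 10 randomly held-out countries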

Part 4: Use the Training Data to Create a K-Means Clustering Model

I couldn’t work out how to use the standard R K-means function kmeans() to do predictions with multi-dimension data-sets such as this. However, the kcca() function from the flexclust package looks very feature-rich and did everything I needed.

First, create a K-Means cluster:

library("flexclust")
# create K-Means cluster of 3 groups based on cols 4:9
#     (GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption)
km <- kcca(train[ , 4:9], k=3, kccaFamily("kmeans"))

Next, summarise the training data in 3 groups based on the cluster assignments:

g1 <- train[clusters(km) == 1, 1:2]  #cols 1:2 = country,Happiness.score
g2 <- train[clusters(km) == 2, 1:2]
g3 <- train[clusters(km) == 3, 1:2]

We have 3 groups, but we don’t know which is the happiest in our training set. We can use the survey data to classify our clusters developed in the training stage:

# join the happiness ranking on to the training data
g1 <- merge(h,g1, by.x = "country", by.y="country")
g2 <- merge(h,g2, by.x = "country", by.y="country")
g3 <- merge(h,g3, by.x = "country", by.y="country")           
# who is in the top and bottom groups?
top <- which.max(c(mean(g1$Happiness.score),mean(g2$Happiness.score), mean(g3$Happiness.score))) #
bottom <- which.min(c(mean(g1$Happiness.score),mean(g2$Happiness.score), mean(g3$Happiness.score)))

print(paste("Happiest Group is g", top, sep=""))
## [1] "Happiest Group is g1"
print(paste("Least Happy Group is g", bottom, sep=""))
## [1] "Least Happy Group is g2"

Now we are all set to use our km k-means model built with kcca() to make predictions against our Test data-set.

Part 5: Predict the Happiness of Countries in the Test Data Set

If we “pretend” that we don’t have survey happiness scores for the countries listed in the test data-set, we can use our K-Means model created in Part 4 to predict their happiness based on the metrics GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption.

First, just print out the Test data-set so we can see what we are working with:

names(test)[6:7] <- c("Life.expectancy", "Freedom")  # shorten two column names for printing
print("Test Data Set is:", row.names = FALSE)
## [1] "Test Data Set is:"
print(test[, c(1,4,6,7)], row.names = FALSE)
##             country Log.GDP.per.capita Life.expectancy   Freedom
##             Belarus           9.725568        65.31599 0.6227534
##  Congo(Brazzaville)           8.685216        53.51811 0.8501725
##          ElSalvador           9.001607        63.90189 0.7333559
##              Guinea           7.037234        50.16096 0.6659530
##              Malawi           6.660712        54.48933 0.8013907
##          Mauritania           8.231690        53.24210 0.4470866
##          Montenegro           9.616770        65.11017 0.5833173
##               Spain          10.402864        73.37998 0.7320005
##              Sweden          10.712334        71.74087 0.9350721
##          Tajikistan           7.869648        61.64697 0.8465421

Next, run the prediction:

# predict.kcca only needs the fitted model and the new data
pred_test <- predict(km, newdata=test[ , 4:9])
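
pred_test is just a vector of cluster IDs (1 to 3), one per test country. For readability it can be paired up with the country names before translating the IDs into the groups identified in Part 4 - for example:

# raw cluster assignment for each held-out country
data.frame(country = test$country, predicted_cluster = pred_test)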

Finally, print out some results (remember, we worked out what the top and bottom group IDs were in Part 4):

Top Group

print(paste("Predict which country is in the happiest group, Group", top, sep=""), row.names = FALSE)
## [1] "Predict which country is in the happiest group, Group1"
print(test[pred_test == top, ]$country, row.names = FALSE)
## [1] "Spain"  "Sweden"

Bottom Group

print(paste("Predict which country is in the least happy group, Group", bottom, sep=""), row.names = FALSE)
## [1] "Predict which country is in the least happy group, Group2"
print(test[pred_test == bottom, ]$country, row.names = FALSE)
## [1] "Congo(Brazzaville)" "Guinea"             "Malawi"            
## [4] "Mauritania"

If these countries are checked back against their happiness score and ranking, it can be seen that the system works pretty well:

print(merge(h,test, by.x = "country", by.y="country")[1:2], row.names = FALSE)
##             country Happiness.score
##             Belarus           5.802
##  Congo(Brazzaville)           4.236
##          ElSalvador           6.068
##              Guinea           3.607
##              Malawi           4.156
##          Mauritania           4.201
##          Montenegro           5.161
##               Spain           6.361
##              Sweden           7.291
##          Tajikistan           4.996

Repeated testing with different random seeds shows that the prediction is far from perfect, however - there is some misclassification, as suggested by the overlapping histogram distributions of Happiness in the clusters investigated in Part 2 above.
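
A rough way to quantify that instability is to repeat the split / fit / predict cycle over a number of seeds and count how often a country predicted into the happiest cluster really does sit above the overall median happiness score. This is only a sketch, not part of the original analysis - the 20 seeds are arbitrary, and columns 4:9 are the same six metrics used throughout:

# Sketch: repeat the Part 3-5 pipeline for several seeds and measure how often
# "happiest cluster" predictions are genuinely above-median happy
hits <- 0; total <- 0
med <- median(h$Happiness.score)
for (s in 1:20) {
  set.seed(s)
  idx <- sample(nrow(subset), 10)
  tr <- subset[-idx, ]
  te <- subset[idx, ]

  fit <- kcca(tr[, 4:9], k=3, kccaFamily("kmeans"))

  # label the training clusters using the known survey scores
  cl <- clusters(fit)
  grp.mean <- sapply(1:3, function(g)
    mean(merge(h, tr[cl == g, ], by="country")$Happiness.score))
  top.grp <- which.max(grp.mean)

  # predict the held-out countries and check the "happiest" assignments
  pred <- predict(fit, newdata=te[, 4:9])
  pred.top <- merge(h, te[pred == top.grp, ], by="country")
  hits <- hits + sum(pred.top$Happiness.score > med)
  total <- total + nrow(pred.top)
}
hits / total  # proportion of happiest-cluster predictions that are truly above-median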