Ed Bullen, 12 August 2016
This document is the output from a “Hackathon” session with the Central London Data-Science Meetup Group. The task was to explore the World Happiness Data-Set published by the UN. The data provided includes a set of country metrics that are expected to influence general happiness and a separate summary survey “Happiness Score” for each country.
The Happiness Score is derived simply by interviewing a random population sample from each country in a poll that “asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale.”
The original suggestion was to look at which measures in the metrics supplied for each country affected the happiness score reported by the survey the most. I took a different tack and looked at how well the countries divided up into clusters based on the supplied data and whether this could be used to group countries into “Happy” and “Unhappy” sets of countries using K-Means clustering.
I then tried using K-Means clustering to predict Happiness for a given test data-set after “training” a K-Means cluster with a larger training set of data from the World Happiness Data-Set.
This was a fairly artificial exercise to experiment with K-Means clustering and try out a very simple Machine Learning technique, but it still gave some interesting insight into the real data. This is more a technical guide than a piece of real statistical research, so I have left all the code in-line in the document.
After sourcing the data (described in Part 1), the first stage was just to look at how well the data separated into groups and whether these groups could be classified as different levels of Happiness (Part 2).
The next step (Parts 3-5) was to artificially split the data into a Training and a Test data-set and create a K-Means cluster from the Training data. I then used the known happiness scores from the Gallup Poll data to classify the clusters. Finally, I mapped the countries in my Test data-set to the appropriate cluster to see whether they were in the Top, Middle or Bottom cluster-group for happiness.
The results were imperfect, as explored in more detail in “Part 2” below, but it was possible to get a fairly reliable indicator of a country’s happiness based on the 6 metrics provided. Given this, K-Means clustering offered a simple way for a “Machine Learning” approach to classify country data into a happiness group.
The imperfections in the results returned by the model are interesting - maybe there are other metrics that need to be considered or maybe this analysis is highlighting how imperfect the poll survey of Happiness is.
The main thing I took from this, however, was a simple template technical process to do this type of automated classification given the necessary training data-set.
Two sets of data are required. Firstly, the various metrics associated with each country that are expected to influence the country’s happiness, e.g. GDP, Life Expectancy etc. Secondly, the Happiness Scores for each country, which are gathered via a Gallup Poll to give a score from 0 to 10, as described in the introduction.
## Download Data from "worldhappiness.report" web-site ##
library("XLConnect") # library for reading Excel Spreadsheet Data
hurl <- 'http://worldhappiness.report/wp-content/uploads/sites/2/2016/03/Online-data-for-chapter-2-whr-2016.xlsx'
file <- "happiness.xlsx"
download.file(hurl, file, mode="wb")
h <- readWorksheetFromFile(file, sheet=3, startRow = 1, endCol = 2) # happiness scores
data <- readWorksheetFromFile(file, sheet=1, startRow = 1, endCol = 14) # country survey metrics
# Data tidy-up
data <- data[, -1] #drop duplicate col
names(h)[1] <- "country" #lowercase col-name
data <- data[data$year == 2015,] # just focus on one year
data$country <- gsub('\\s', '', data$country) #strip spaces to join with h (happiness scores)
h$country <- gsub('\\s', '', h$country) #strip spaces to join with data (country metrics)
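The whitespace stripping matters because merge() only joins rows whose keys match exactly. A minimal sketch with made-up rows illustrates this; the data frames `scores` and `metrics` and their values are hypothetical and are not taken from the spreadsheet:

```r
# Toy data: the same countries with stray spaces in different places
scores  <- data.frame(country = c("New Zealand ", " Costa Rica"),
                      score   = c(1, 2),
                      stringsAsFactors = FALSE)
metrics <- data.frame(country = c("New Zealand", "Costa Rica"),
                      gdp     = c(10, 9),
                      stringsAsFactors = FALSE)

# Without cleaning, no key matches exactly, so the join is empty
raw_join <- merge(scores, metrics, by = "country")

# Strip all whitespace from the key on both sides, then join again
scores$country  <- gsub("\\s", "", scores$country)
metrics$country <- gsub("\\s", "", metrics$country)
clean_join <- merge(scores, metrics, by = "country")

print(nrow(raw_join))
print(clean_join)
```

Removing every space (rather than only leading and trailing ones) changes the displayed names, which is why countries appear as "NewZealand" and "CostaRica" in the output later on.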
Data-frame data contains the data with country metrics as follows:
names(data)
## [1] "country"
## [2] "year"
## [3] "Life.Ladder"
## [4] "Log.GDP.per.capita"
## [5] "Social.support"
## [6] "Healthy.life.expectancy.at.birth"
## [7] "Freedom.to.make.life.choices"
## [8] "Generosity"
## [9] "Perceptions.of.corruption"
## [10] "Positive.affect"
## [11] "Negative.affect"
## [12] "Confidence.in.national.government"
## [13] "Democratic.Quality"
and data-frame h contains the happiness scores - first few rows listed below:
head(h)
## country Happiness.score
## 1 Denmark 7.526
## 2 Switzerland 7.509
## 3 Iceland 7.501
## 4 Norway 7.498
## 5 Finland 7.413
## 6 Canada 7.404
In this section we create a K-Means cluster from a cleaned-up subset of the data data-frame using kmeans(), the standard K-Means clustering function in R’s stats package.
I didn’t spend much time looking at what would be a good number of clusters - this was a quick hackathon project and three clusters was not perfect but “seemed to work OK”.
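For anyone who does want to look at the number of clusters, one common heuristic is the "elbow" plot of total within-cluster sum of squares against k. The sketch below was not part of the hackathon work; it runs on made-up data (the matrix `toy` is hypothetical) and only shows the mechanics:

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Hypothetical stand-in data: three loose groups in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 5),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 2..6
ks  <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow" where adding clusters stops helping much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With real data the elbow is often much less clear than with a toy set like this, so it is a guide rather than a rule.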
The script below also adds the separately surveyed overall “Happiness Score” for each country.
It is important to note that the clusters are defined by 6 dimensions that are not specifically “Happiness” - they are measures such as GDP etc. The clusters are just clusters; we then use the Happiness Score to see how aligned these clusters defined by GDP, Freedom, Corruption etc are with the actual Happiness ranking.
# Just focus on col 4:9 - GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption
# where there is a complete data-set (no "NAs")
subset <- data[complete.cases(data[,c(1,4:9)]), ]
#join the happiness score on (country becomes col 1, Happiness.score col 2)
subset <- merge(h,subset, by.x = "country", by.y="country")
# Create a K-Means cluster with 3 groups based on cols 5:10
# (GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption)
km <- kmeans(subset[, 5:10],3, iter.max=100)
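One thing worth keeping in mind is that kmeans() uses Euclidean distance on the raw columns, so a column with a wide range (life expectancy in years) can carry more weight than one confined to 0..1 (such as Freedom). I did not test whether this matters for the results here. The sketch below, on made-up numbers, only illustrates how scale() could be used to standardise the columns first; the data frame `toy` and its values are hypothetical:

```r
set.seed(1)

# Hypothetical metrics on very different scales
toy <- data.frame(life_expectancy = runif(30, 50, 75),  # years
                  freedom         = runif(30, 0, 1))    # proportion

# Standardise each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# Cluster the raw and the standardised versions for comparison
km_raw    <- kmeans(toy,        centers = 3, nstart = 10)
km_scaled <- kmeans(toy_scaled, centers = 3, nstart = 10)

# Cross-tabulate the two assignments to see how much they differ
print(table(raw = km_raw$cluster, scaled = km_scaled$cluster))
```

The cluster numbers themselves are arbitrary, so the table should be read for how rows are regrouped rather than for matching ids.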
Next, get 3 vectors of Happiness Score, g1..g3, from the subset data-frame, indexed by the K-Means clusters 1..3:
g1 <- subset[km$cluster == 1,]$Happiness.score
g2 <- subset[km$cluster == 2,]$Happiness.score
g3 <- subset[km$cluster == 3,]$Happiness.score
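The same per-cluster grouping can also be done without creating one vector per cluster. A small sketch on made-up values, using base R's split() and tapply(); the vectors `score` and `cluster` are hypothetical:

```r
# Hypothetical happiness scores and cluster ids for six countries
score   <- c(7.5, 7.3, 5.1, 4.9, 3.6, 4.0)
cluster <- c(1, 1, 2, 2, 3, 3)

# split() returns a list with one vector of scores per cluster
by_cluster <- split(score, cluster)

# tapply() applies a function to each cluster's scores in one call
cluster_means <- tapply(score, cluster, mean)
print(cluster_means)
```

This form scales to any number of clusters, whereas g1..g3 need editing if k changes.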
Now, we can plot a Histogram showing the distribution of Happiness Score for each of the three clusters:
# plot option "col=rgb(x,x,x,0.5)" gives fill transparency
hist(g1, xlim=c(0,10), col=rgb(1,0,0,0.5), breaks=seq(0.25,10,0.25)
, main = "Histogram of Happiness Score for 3 cluster-groups"
, xlab = "Country Happiness Score")
hist(g2, xlim=c(0,10), col=rgb(0,1,0,0.5), breaks=seq(0.25,10,0.25), add=TRUE)
hist(g3, xlim=c(0,10), col=rgb(0,0,1,0.5), breaks=seq(0.25,10,0.25), add=TRUE)
legend("topright", c("Group1", "Group2", "Group3")
, fill=c(rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(0,0,1,0.5)) )
From this we can see that the three clusters approximately map to three levels of “happiness” - i.e. the most happy set, the least happy set, and a medium set.
There is quite a bit of overlap, however. There are a number of exception cases where countries classed in the happiest set can be seen to have an actual happiness score that would position them much lower down the scale. I didn’t do any detailed analysis other than list out countries placed in the happiest group whose happiness score is actually below the median (see further down in this section).
This imperfection is very interesting - to me it implies that, despite having all the aspects that make most countries happy, these exceptions are still not happy because of some other unknown factor.
print(km$size)
## [1] 47 36 30
top <- which.max(c(mean(g1),mean(g2), mean(g3))) # which is the top group
happiest <- subset[km$cluster == top, 1:2]
print(happiest[order(happiest$Happiness.score, decreasing=TRUE), ], row.names = FALSE)
## country Happiness.score
## Denmark 7.526
## Switzerland 7.509
## Norway 7.498
## Finland 7.413
## Canada 7.404
## Netherlands 7.339
## NewZealand 7.334
## Australia 7.313
## Sweden 7.291
## Israel 7.267
## Austria 7.119
## UnitedStates 7.104
## CostaRica 7.087
## Germany 6.994
## Belgium 6.929
## Ireland 6.907
## Luxembourg 6.871
## Mexico 6.778
## UnitedKingdom 6.725
## Chile 6.705
## Panama 6.701
## CzechRepublic 6.596
## Uruguay 6.545
## France 6.478
## Spain 6.361
## Slovakia 6.078
## Italy 5.977
## Japan 5.921
## SouthKorea 5.835
## Slovenia 5.768
## Croatia 5.488
## BosniaandHerzegovina 5.163
## Lebanon 5.129
## Portugal 5.123
## Greece 5.033
## Albania 4.655
Identify which countries in the top group have a happiness score below the median for the entire set of countries:
print(happiest[happiest$Happiness.score < median(h$Happiness.score), ],row.names = FALSE)
## country Happiness.score
## Albania 4.655
## BosniaandHerzegovina 5.163
## Greece 5.033
## Lebanon 5.129
## Portugal 5.123
Having explored the concept of grouping the data into clusters of different Happiness based on certain indicators, it should be possible to develop a simple model to predict happiness based on a training set of data with known happiness scores - with the caveat that the accuracy will only be as good as seen previously in Part 2.
First, we need to split our data-set from the World Happiness Report into two parts - a Training Set and a Test Set:
subset <- data[complete.cases(data[,c(1,4:9)]), ]
# random split into a training and small test data-set
set.seed(1)
ind <- sample(nrow(subset),10)
# add TRUE / FALSE index to data
subset[["train"]] <- TRUE
subset[["train"]][ind] <- FALSE
train <- subset[subset[["train"]]==TRUE, ]
test <- subset[subset[["train"]]==FALSE, ]
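The same split can be written more compactly with negative indexing. A minimal sketch on a made-up data frame (`df`, `train_df` and `test_df` are hypothetical names), with two sanity checks that are useful after any random split:

```r
set.seed(1)  # makes the random split repeatable

# Hypothetical data frame with 20 rows
df <- data.frame(id = 1:20, x = rnorm(20))

# Draw 5 row numbers for the test set; everything else is training data
test_idx <- sample(nrow(df), 5)
test_df  <- df[test_idx, ]
train_df <- df[-test_idx, ]

# Sanity checks: sizes add up and no row is in both sets
print(nrow(train_df) + nrow(test_df) == nrow(df))
print(length(intersect(train_df$id, test_df$id)) == 0)
```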
I couldn’t work out how to use the standard R K-means function kmeans() to do predictions with multi-dimension data-sets such as this. However, the kcca() function from the flexclust package looks very feature-rich and did everything I needed.
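For reference, base R provides no predict() method for kmeans() fits, but a fitted model stores its centres in `$centers`, so new rows can be assigned to the nearest centre by hand. The following is a minimal sketch, assuming squared Euclidean distance; the helper `predict_kmeans()` is hypothetical, is not part of any package, and was not used for the results in this document:

```r
# Hypothetical helper: assign each row of newdata to its nearest centre
predict_kmeans <- function(fit, newdata) {
  newdata <- as.matrix(newdata)
  centres <- fit$centers
  # Squared Euclidean distance from every row to each centre
  d <- sapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(newdata, 2, centres[j, ])^2)
  })
  d <- matrix(d, nrow = nrow(newdata))  # keep matrix shape for one row
  max.col(-d, ties.method = "first")    # column index of smallest distance
}

set.seed(1)
# Made-up training data: two well-separated groups
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# New points near each group should land in different clusters
new_points <- rbind(c(0, 0), c(8, 8))
print(predict_kmeans(fit, new_points))
```

The new data must have the same columns, in the same order, as the data used to fit the model.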
First, create a K-Means cluster:
library("flexclust")
# create K-Means cluster of 3 groups based on cols 4:9
# (GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption)
km <- kcca(train[, 4:9], k = 3, family = kccaFamily("kmeans"))
Next, summarise in 3 groups based on the clustering algorithm:
g1 <- train[clusters(km) == 1, 1:2] # cols 1:2 = country, year (Happiness.score is joined on in the next step)
g2 <- train[clusters(km) == 2, 1:2]
g3 <- train[clusters(km) == 3, 1:2]
We have 3 groups, but we don’t know which is the happiest in our training set. We can use the survey data to classify our clusters developed in the training stage:
# join the happiness ranking on to the training data
g1 <- merge(h,g1, by.x = "country", by.y="country")
g2 <- merge(h,g2, by.x = "country", by.y="country")
g3 <- merge(h,g3, by.x = "country", by.y="country")
# which are the top and bottom groups?
top <- which.max(c(mean(g1$Happiness.score), mean(g2$Happiness.score), mean(g3$Happiness.score)))
bottom <- which.min(c(mean(g1$Happiness.score), mean(g2$Happiness.score), mean(g3$Happiness.score)))
print(paste("Happiest Group is g", top, sep=""))
## [1] "Happiest Group is g1"
print(paste("Least Happy Group is g", bottom, sep=""))
## [1] "Least Happy Group is g2"
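The cluster ids can also be turned into Top / Middle / Bottom labels in one step. A small sketch, assuming exactly three clusters with no tied means; the numbers in `cluster_means` and `predicted` are made up for illustration:

```r
# Hypothetical mean happiness score for clusters 1, 2 and 3
cluster_means <- c(6.5, 4.2, 5.4)

# rank() gives 1 for the lowest mean and 3 for the highest
labels <- c("Bottom", "Middle", "Top")[rank(cluster_means)]
names(labels) <- seq_along(cluster_means)
print(labels)

# Look up the label for a vector of predicted cluster ids
predicted <- c(2, 1, 3, 1)
print(unname(labels[predicted]))
```

Because the cluster numbering from k-means is arbitrary and can change between runs, deriving the labels from the means each time is safer than hard-coding which id is "top".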
Now we are all set to use our km k-means model, built with kcca(), to make predictions against our Test data-set.
If we “pretend” that we don’t have survey happiness scores for the countries listed in the test data-set, we can use our K-Means model created in Part 4 to predict their happiness based on the metrics GDP, Social Support, Life Expectancy, Freedom, Generosity, Corruption.
First, just print out the Test data-set so we can see what we are working with:
names(test)[6:7] <- c("Life.expectancy", "Freedom")
print("Test Data Set is:", row.names = FALSE)
## [1] "Test Data Set is:"
print(test[, c(1,4,6,7)], row.names = FALSE)
## country Log.GDP.per.capita Life.expectancy Freedom
## Belarus 9.725568 65.31599 0.6227534
## Congo(Brazzaville) 8.685216 53.51811 0.8501725
## ElSalvador 9.001607 63.90189 0.7333559
## Guinea 7.037234 50.16096 0.6659530
## Malawi 6.660712 54.48933 0.8013907
## Mauritania 8.231690 53.24210 0.4470866
## Montenegro 9.616770 65.11017 0.5833173
## Spain 10.402864 73.37998 0.7320005
## Sweden 10.712334 71.74087 0.9350721
## Tajikistan 7.869648 61.64697 0.8465421
Next, run the prediction:
pred_test <- predict(km, newdata = test[, 4:9])
Finally, print out some results (remember, we worked out what the top and bottom group IDs were in Part 4):
Top Group
print(paste("Predict which country is in the happiest group, Group", top, sep=""), row.names = FALSE)
## [1] "Predict which country is in the happiest group, Group1"
print(test[pred_test == top, ]$country, row.names = FALSE)
## [1] "Spain" "Sweden"
Bottom Group
print(paste("Predict which country is in the least happy group, Group", bottom, sep=""), row.names = FALSE)
## [1] "Predict which country is in the least happy group, Group2"
print(test[pred_test == bottom, ]$country, row.names = FALSE)
## [1] "Congo(Brazzaville)" "Guinea" "Malawi"
## [4] "Mauritania"
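If these countries are checked back against their happiness scores, it can be seen that the system works pretty well for this split: Spain and Sweden have the two highest scores in the test set, and the four countries placed in the least happy group have the four lowest: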
print(merge(h,test, by.x = "country", by.y="country")[1:2], row.names = FALSE)
## country Happiness.score
## Belarus 5.802
## Congo(Brazzaville) 4.236
## ElSalvador 6.068
## Guinea 3.607
## Malawi 4.156
## Mauritania 4.201
## Montenegro 5.161
## Spain 6.361
## Sweden 7.291
## Tajikistan 4.996
Repeated testing with different random seeds shows that the prediction is far from perfect, however - there is some error, as suggested by the overlapping distributions of Happiness Score across the clusters investigated in Part 2 above.
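A repeat-over-seeds check of this kind could be structured as in the sketch below. It is only an outline under stated assumptions: it uses made-up data (`toy`), base R kmeans() with a hypothetical nearest-centre helper instead of kcca(), and a simple error measure (the gap between a test row's actual score and the mean training score of its predicted cluster). None of it reproduces the results above.

```r
# Hypothetical helper: assign rows to the nearest cluster centre
nearest_centre <- function(centres, newdata) {
  newdata <- as.matrix(newdata)
  d <- sapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(newdata, 2, centres[j, ])^2)
  })
  max.col(-matrix(d, nrow = nrow(newdata)), ties.method = "first")
}

# Made-up data: one metric that loosely drives a happiness-like score
set.seed(100)
toy <- data.frame(metric = runif(60, 0, 10))
toy$score <- 3 + 0.4 * toy$metric + rnorm(60, sd = 0.5)

# Repeat the split / cluster / predict cycle for several seeds
seeds <- 1:20
errors <- sapply(seeds, function(s) {
  set.seed(s)
  idx   <- sample(nrow(toy), 10)
  train <- toy[-idx, ]
  test  <- toy[idx, ]
  fit   <- kmeans(train["metric"], centers = 3, nstart = 10)
  # Mean training score of each cluster, then of each test row's cluster
  means <- as.numeric(tapply(train$score, fit$cluster, mean))
  pred  <- nearest_centre(fit$centers, test["metric"])
  mean(abs(means[pred] - test$score))
})

# Spread of the error across seeds
print(summary(errors))
```

Looking at the spread rather than a single run gives a fairer picture of how much the result depends on which countries happen to land in the test set.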