How do countries group together under hierarchical clustering of the variables used to calculate the happiness score in the 2016 World Happiness Report? Furthermore, is there a noticeable geographical pattern?
8/14/2020
Clustering is an algorithmic approach to grouping similar observations.
K-means -> choose K clusters, then iterate through the observations, assigning each to the cluster with the closest “centroid” (cluster mean), and recompute the centroids until the assignments stop changing.
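A minimal Python sketch of the K-means loop just described, assuming Euclidean distance and caller-supplied starting centroids. The function name `kmeans`, its parameters, and the toy points are illustrative assumptions, not part of the original analysis.

```python
from math import dist

def kmeans(points, centroids, iters=100):
    """Assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            # Index of the closest centroid for this observation.
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # New centroid = mean of the assigned points (keep the old one if empty).
        new = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new == centroids:  # assignments have stabilised
            break
        centroids = new
    return centroids, groups

# Made-up toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, groups = kmeans(points, [(0.0, 0.0), (0.0, 2.0)])
```

The result depends on the starting centroids, which is one reason the hierarchical approach is attractive here.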
Hierarchical clustering -> unlike K-means, we build a “dendrogram” (tree). We start by assigning each observation to its own cluster. At each step, the two most similar clusters, as measured by a distance metric, are merged, and the algorithm continues until the desired number of clusters is reached.
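The merge procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the slides: it uses Euclidean distance, a farthest-point rule for comparing clusters, and made-up 2-D points. The names `agglomerate` and `farthest_point_distance` are hypothetical.

```python
from itertools import combinations
from math import dist

def farthest_point_distance(a, b):
    """Distance between two clusters, taken as their two farthest points."""
    return max(dist(p, q) for p in a for q in b)

def agglomerate(points, n_clusters, linkage=farthest_point_distance):
    # Start with every observation in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (i < j, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up toy data, not taken from the World Happiness Report.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 4.9), (9.0, 0.0)]
clusters = agglomerate(points, 3)
```

Recording the order of the merges, rather than stopping at a fixed count, is what yields the full dendrogram.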
Linkage methods we will explore:
Ward’s -> the proximity between two clusters is the amount by which the sum of squares of their joint cluster exceeds the combined sum of squares of the two separate clusters.
Complete linkage -> the proximity between two clusters is determined by their two farthest points.
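To make the two proximity definitions above concrete, here is a small Python sketch that computes both for a pair of toy clusters. It assumes Euclidean distance and sums of squares measured about each cluster's centroid. The function names and the data are hypothetical.

```python
from math import dist

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def sum_of_squares(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)

def ward_proximity(a, b):
    # Increase in the sum of squares caused by merging a and b.
    return sum_of_squares(a + b) - (sum_of_squares(a) + sum_of_squares(b))

def farthest_point_proximity(a, b):
    # Largest distance between a point in a and a point in b.
    return max(dist(p, q) for p in a for q in b)

# Made-up clusters lying on a line, so the arithmetic is easy to follow.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
ward = ward_proximity(a, b)                # 20 - (2 + 2) = 16
farthest = farthest_point_proximity(a, b)  # distance from (0, 0) to (6, 0) = 6
```

The Ward value is not a distance between any two points; it measures how much tightness is lost by the merge, which tends to favour compact clusters of similar size.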
Choosing the number of clusters is not an exact science.
The gap statistic compares the total intra-cluster variation for each value of K with its expected value under a null reference distribution of the data.
We want a large gap statistic, which would indicate a clustering structure very different from a random distribution of points.
In theory, we would want to maximize the gap statistic. In practice, clusters are often not well defined and the gap statistic may not have a clear maximum.
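A rough Python sketch of the gap statistic for a single value of K, under several assumptions that are not from the original analysis: the reference data sets are drawn uniformly over the bounding box of the observations, the clustering is a simple farthest-point agglomerative routine, and `n_refs` and `seed` are hypothetical parameters.

```python
import random
from itertools import combinations
from math import dist, log

def cluster(points, k):
    """Agglomerative clustering (farthest-point rule) down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: max(
                dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        del clusters[j]
        clusters[i] = merged
    return clusters

def within_variation(clusters):
    """Total intra-cluster variation: squared distances to each centroid."""
    total = 0.0
    for c in clusters:
        center = tuple(sum(x) / len(c) for x in zip(*c))
        total += sum(dist(p, center) ** 2 for p in c)
    return total

def gap_statistic(points, k, n_refs=10, seed=0):
    rng = random.Random(seed)
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    observed = log(within_variation(cluster(points, k)))
    expected = 0.0
    for _ in range(n_refs):
        # Reference data set: uniform points in the bounding box (assumption).
        ref = [
            tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points
        ]
        expected += log(within_variation(cluster(ref, k))) / n_refs
    return expected - observed

# Made-up data with two obvious groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.7, 0.7),
          (10.0, 10.0), (10.4, 10.1), (10.1, 10.6), (10.6, 10.5)]
gap_1 = gap_statistic(points, 1)
gap_2 = gap_statistic(points, 2)
```

On data this cleanly separated, the gap for K = 2 exceeds the gap for K = 1. Real data rarely gives such a clear answer.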
Our data comes from the World Happiness Report, which is publicly available and published each year.
## [1] "country_name"
## [2] "life_ladder"
## [3] "log_gdp_per_capita"
## [4] "social_support"
## [5] "healthy_life_expectancy_at_birth"
## [6] "freedom_to_make_life_choices"
## [7] "generosity"
## [8] "perceptions_of_corruption"
## [9] "positive_affect"
## [10] "negative_affect"
## [11] "confidence_in_national_government"
## [12] "democratic_quality"
## [13] "delivery_quality"
## [14] "gini_of_household_income_reported_in_gallup_by_wp5_year"
## [15] "gini_est"
log_gdp_per_capita -> natural log of GDP per capita
healthy_life_expectancy_at_birth -> constructed from WHO Global Health Observatory data from 2016
social_support -> national average of binary responses to the Gallup poll question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them?”
freedom_to_make_life_choices -> national average of binary responses to the Gallup poll question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
generosity -> the residual from regressing the national average of Gallup responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.
gini_est -> Gini index taken from World Bank estimates
gini_of_household_income_reported_in_gallup_by_wp5_year -> Gini index computed in Stata, with sample weights, from household income reported to Gallup and converted into international dollars
perceptions_of_corruption -> average of binary answers to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure.
positive_affect -> the average of previous-day affect measures for laughter and enjoyment, used for survey waves where the happiness question was not asked. The general form of the affect questions is: “Did you experience the following feelings during a lot of the day yesterday?”
negative_affect -> the average of previous-day affect measures for worry, sadness, and anger
confidence_in_national_government -> national average of binary responses to the Gallup poll question “Do you have confidence in the national government?”
democratic_quality -> mean of “Voice and Accountability” and “Political Stability/Absence of Violence” from the Worldwide Governance Indicators (WGI), which are based on the views of enterprises, citizens, and experts collected through surveys.
delivery_quality -> Also using the WGI, we take the average of “Government Effectiveness”, “Regulatory Quality”, “Rule of Law”, and “Control of Corruption” to obtain delivery_quality
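The generosity definition above is a regression residual, which a short Python sketch may make clearer. It fits a one-predictor least-squares line and keeps what the line does not explain. All numbers are invented for illustration, and the names `ols_residuals`, `gdp_per_capita`, and `donated_share` are hypothetical.

```python
def ols_residuals(x, y):
    """Residuals from regressing y on x with an intercept (simple OLS)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from x.
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Invented values for five hypothetical countries.
gdp_per_capita = [2.0, 8.0, 15.0, 30.0, 45.0]     # thousands of dollars
donated_share = [0.20, 0.22, 0.35, 0.33, 0.50]    # share answering "yes"
generosity = ols_residuals(gdp_per_capita, donated_share)
```

A positive residual marks a country that donates more than its GDP per capita would predict, and a negative one marks a country that donates less.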
For the majority of our analysis we will work with the “complete” cases, dropping any country with a missing (NA) value in any of the above variables.
At the very end we will experiment with data imputation and attempt to create a more complete world map.
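The complete-case filter described above amounts to keeping only rows with no missing entries. Here is a minimal Python sketch, assuming missing values are represented as `None`. The rows are invented placeholders, not real report values.

```python
def complete_cases(rows):
    """Keep only rows in which every variable has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Placeholder rows with made-up numbers; None marks a missing value.
rows = [
    {"country_name": "A", "life_ladder": 7.1, "gini_est": 0.30},
    {"country_name": "B", "life_ladder": 5.4, "gini_est": None},
    {"country_name": "C", "life_ladder": None, "gini_est": 0.41},
    {"country_name": "D", "life_ladder": 4.2, "gini_est": 0.45},
]
kept = complete_cases(rows)
kept_names = [r["country_name"] for r in kept]
```

Every dropped row is a country missing from the final map, which is the motivation for trying imputation afterwards.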
Before clustering, we reduce the dimensionality of our data so that the distances between countries are easier to compute and to visualize. Since we have many variables, we use Principal Component Analysis (PCA) to project them onto 2 dimensions while retaining as much of the variance as possible.
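As an illustration of the reduction step above, the following Python sketch standardizes the variables and projects each row onto the first two principal components. It uses power iteration with deflation in place of a library routine, so it is a teaching sketch rather than the method used for the slides. The data and all function names are hypothetical.

```python
from math import sqrt

def standardize(rows):
    """Centre each column and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]

def covariance(rows):
    """Covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def top_eigenpair(m, iters=500):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(m)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
    return value, v

def pca_2d(rows):
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    components = []
    for _ in range(2):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflate: remove the component just found before finding the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    # Score of each row on each of the two components.
    return [[sum(x * c for x, c in zip(r, comp)) for comp in components] for r in z]

# Made-up data: 5 observations of 3 correlated variables.
rows = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.5],
    [3.0, 5.9, 8.0],
    [4.0, 8.2, 3.0],
    [5.0, 9.8, 4.5],
]
scores = pca_2d(rows)
```

The resulting two-column scores are what a distance-based clustering would then operate on, and they can be plotted directly.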
## -- Imputation 1 --
##
##   1  2  3  4  5  6  7
##
## -- Imputation 2 --
##
##   1  2  3  4  5  6  7
##
## -- Imputation 3 --
##
##   1  2  3  4  5  6  7
##
## -- Imputation 4 --
##
##   1  2  3  4  5  6  7  8  9 10 11
##
## -- Imputation 5 --
##
##   1  2  3  4  5  6  7  8