About the data

The World Happiness Report is a survey of the state of global happiness. It ranks countries by how happy their citizens perceive themselves to be, primarily using data from the Gallup World Poll. The first report was published in 2012, and each subsequent report has focused on a specific theme. Data is collected from people in over 150 countries. Each variable is reported as a population-weighted average score; the overall happiness score, called the ‘ladder score,’ runs on a scale from 0 to 10.
Data source: World Happiness Report 2021, kaggle.com

Manipulating the data

Add “rank” column

The data doesn’t include a country ranking by default. It requires users to sort by ladder score to see which countries rank the highest and which the lowest. I added a “rank” variable to the data for easier calculations.
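
A minimal sketch of this step in Python with pandas. The tooling is an assumption, since the write-up doesn't show its own code, and the column names and toy values are made up for illustration; they are not taken from the report.

```python
import pandas as pd

# Toy data: placeholder countries and invented scores, not report values.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "score": [5.1, 7.8, 3.6, 6.4],
})

# Rank 1 is the highest ladder score. method="min" gives tied
# countries the same rank (the lowest rank in the tied group).
df["rank"] = df["score"].rank(ascending=False, method="min").astype(int)

# Sort so the happiest country comes first.
df = df.sort_values("rank").reset_index(drop=True)
print(df)
```

Storing the rank as a column means later steps can filter on it directly (for example, `df[df["rank"] <= 10]`) instead of re-sorting each time.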

Add “trust” column

One of the calculated variables, “perception of corruption”, behaves differently from the others. The other self-reported variable scores (support, freedom, and generosity) all contribute positively to the happiness score, meaning a higher score in any of those categories goes with a higher happiness score. Perception of corruption runs the other way: the lower the score, the higher the happiness score. This makes it hard to compare the importance of the calculated variables to the overall happiness score. To rectify this, I created a new column called “trust,” the opposite of corruption, by subtracting the corruption score from 1.
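
The transformation can be sketched as follows, again assuming pandas, with hypothetical column names and invented toy values. The sketch also checks that the new column's correlation with the score has the same strength as the original column's, with only the sign flipped.

```python
import pandas as pd

# Toy data: invented values, not taken from the report.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [7.0, 5.5, 4.0],
    "corruption": [0.25, 0.5, 0.875],
})

# "trust" is defined as 1 minus the perception-of-corruption score.
df["trust"] = 1 - df["corruption"]

# Pearson correlation of each column with the happiness score.
corr_corruption = df["score"].corr(df["corruption"])
corr_trust = df["score"].corr(df["trust"])

print(df[["country", "corruption", "trust"]])
print(corr_corruption, corr_trust)
```

Because `trust` is a linear function of `corruption` with a negative slope, the two correlations have equal magnitude and opposite sign, so the new column can be compared directly with the other positively oriented factors.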

Data subset

I created a subset of the 2021 data to use for this EDA, focusing on the countries, their geographic region, and the six main calculated factors of happiness.

There are 149 rows in this data subset and 14 columns.
Rows reference individual countries.
Columns reference the country attributes and calculated variables that make up the happiness score.
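
As a rough illustration of subsetting, assuming pandas: the column names below are hypothetical stand-ins, and `dystopia_residual` is just an example of a column that might be left out.

```python
import pandas as pd

# Toy stand-in for the raw file; names and values are invented.
raw = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["North", "South"],
    "score": [6.0, 4.5],
    "gdp": [10.2, 8.1],
    "dystopia_residual": [2.1, 1.9],  # example of a dropped column
})

# Keep only the columns needed for the analysis.
keep = ["country", "region", "score", "gdp"]
subset = raw[keep].copy()

# shape is (rows, columns): one row per country.
n_rows, n_cols = subset.shape
print(n_rows, n_cols)
```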

Explanation of variables

Variable | Data type | Explanation
Rank | integer | Rank by happiness ladder score.
Country | character | Country name (149 countries).
Region | character | Geographic region of country (10 total).
Score | numeric | Happiness score or subjective well-being (also known as ‘ladder score’ or ‘Cantril life ladder’). The English wording of the question is ‘Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?’
GDP | numeric | GDP per capita: a country’s Gross Domestic Product divided by its population.
Social Support | numeric | Having friends and family to assist in times of need or crisis. Social support improves the quality of life and provides a buffer against adverse life events.
Healthy Life Expectancy | numeric | The average timespan that a newborn can expect to live in “full health”, in other words, not hampered by disabling illnesses or injuries.
Freedom of Choice | numeric | An individual’s opportunity and autonomy to perform an action selected from at least two available options, unconstrained by external parties.
Generosity | numeric | Residual of regressing the national average of responses to the question, ‘Have you donated money to a charity in past months?’ on GDP per capita.
Perception of Corruption | numeric | The Corruption Perceptions Index (CPI), an index published annually by Transparency International since 1995, ranks countries ‘by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys.’
Trust | numeric | 1 minus the perception of corruption score.

Research questions

Questions I’ll be researching throughout this exploratory data analysis:

  1. Which factors contributed most to the overall happiness score?
  2. Are the factors that contribute the most consistent regardless of overall score?
  3. Do happiness scores differ by geographic region?

Preliminary plots

To get a sense of what data might be interesting to look at, I created scatter plots that show the relationship between happiness scores of all countries vs. calculated variables of all countries.

Scatter plots

Scatter plot notes:

  • Overall, these scatter plots give a sense that a higher score in any one factor (except generosity) contributes to a higher ladder score.
  • The linear regression lines are hard to compare to each other, so this doesn’t look like the best way to see which factors might contribute more than others.

Correlation plot

My second attempt to see which factors might contribute most to the happiness score was a correlation heat map.

Correlation plot notes:

  • This plot helps compare the factors more than the scatter plots, but not by much.
  • It’s very apparent that generosity is the least correlated score based on this color palette.
  • This plot has too much unnecessary information to easily answer the question. What I really want to see is the bottom row, correlating the overall happiness score to the different variables. For this EDA, I’m not looking at correlations between the variables themselves.
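
One way to pull out just that row, sketched with pandas. The column names are hypothetical and the numbers are invented toy values, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Toy data: invented values for illustration only.
df = pd.DataFrame({
    "score":      [7.5, 6.8, 6.0, 5.2, 4.4, 3.9],
    "gdp":        [11.0, 10.6, 10.1, 9.4, 8.8, 8.0],
    "support":    [0.95, 0.90, 0.88, 0.80, 0.70, 0.60],
    "generosity": [0.10, -0.05, 0.12, -0.10, 0.05, 0.00],
})

# Full Pearson correlation matrix (every column against every other).
corr = df.corr()

# Keep only the "score" row and drop its self-correlation of 1.0,
# then order the factors from most to least positively correlated.
score_corr = corr.loc["score"].drop("score").sort_values(ascending=False)
print(score_corr)
```

A single sorted series like this is easier to read than the full matrix when the only question is how each factor relates to the overall score.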

EDA Questions

1. Which factors contributed most to the overall happiness score?

If you look at the correlation coefficients of each variable compared to the overall happiness score, the country’s GDP and healthy life expectancy scores are the most positively correlated. This is interesting because these are the two scores that are not self-reported by the country’s citizens, meaning they are the most objective factors.

2. Are the factors that contribute the most consistent regardless of overall score?

To answer this question, I looked at the correlation values of the “happiest” countries and compared them to the “least happy” countries. I split the data frame in half: the top 50% are the “happiest” countries and the bottom 50% are the “least happy.”
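
The split-and-compare step might look like the sketch below, assuming pandas. The column names, the helper `score_correlations`, and the toy values are all hypothetical.

```python
import pandas as pd

# Toy data: invented values for illustration only.
df = pd.DataFrame({
    "score": [7.6, 7.2, 6.9, 6.1, 5.0, 4.6, 4.1, 3.5],
    "gdp":   [11.2, 10.9, 10.8, 10.1, 9.3, 9.0, 8.2, 7.9],
    "trust": [0.80, 0.62, 0.45, 0.30, 0.20, 0.26, 0.18, 0.24],
})

# Order countries from happiest to least happy, then cut in the middle.
# With an odd number of rows, the extra row lands in the bottom half.
ranked = df.sort_values("score", ascending=False).reset_index(drop=True)
half = len(ranked) // 2
top, bottom = ranked.iloc[:half], ranked.iloc[half:]

def score_correlations(frame):
    """Correlation of each factor with the score within one group."""
    return frame.corr()["score"].drop("score")

# Side-by-side table: one row per factor, one column per half.
comparison = pd.DataFrame({
    "top_half": score_correlations(top),
    "bottom_half": score_correlations(bottom),
})
print(comparison)
```

Each half covers a narrower range of scores than the full data set, which can by itself weaken correlations, so within-half values are best compared with each other rather than with the full-sample values.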

The most striking difference between the top and bottom 50% is how strongly “trust” is correlated with the overall happiness score. For the top 50% “happiest” countries, trust in the public sector - or, lack of perceived corruption - is the most positively correlated factor. For the bottom 50%, it’s the least positively correlated factor. In plain language, this means that among the happiest countries, trust in the public sector (governing body) tracks overall happiness much more closely than it does among the least happy countries.

If we look back at the scatter plot for trust, this discrepancy is easy to see: a subset of the highest-ranking countries is pulling the regression line up.

3. Do happiness scores differ by geographic region?

Comparing the top 10 and bottom 10 countries indicates that geographic region is a major factor in overall happiness score. The bar chart below shows that 9 out of the 10 “happiest” countries are in Western Europe, and 7 out of the 10 “least happy” countries are in Sub-Saharan Africa.
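
The counts behind a chart like this can be sketched with pandas. The regions, column names, and values are invented, and `n = 2` stands in for the 10 used in the write-up so the toy table stays small.

```python
import pandas as pd

# Toy data: placeholder countries and regions, invented scores.
df = pd.DataFrame({
    "country": list("ABCDEF"),
    "region": ["North", "North", "East", "South", "South", "South"],
    "score": [7.4, 7.1, 6.0, 5.2, 4.0, 3.7],
})

n = 2  # the write-up compares the top and bottom 10

# Count how many of the n happiest / least happy countries
# fall in each region.
top_regions = df.nlargest(n, "score")["region"].value_counts()
bottom_regions = df.nsmallest(n, "score")["region"].value_counts()

print(top_regions)
print(bottom_regions)
```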

To get a larger picture, I created box plots to compare the happiness scores, by region, for the entire data set. This shows that the top and bottom 10 countries are fairly indicative of the data set as a whole.
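
The numbers a box plot draws (minimum, quartiles, median, maximum) can also be tabulated per region. This is a sketch assuming pandas, with hypothetical column names and invented values.

```python
import pandas as pd

# Toy data: three placeholder regions with invented scores.
df = pd.DataFrame({
    "region": ["North"] * 3 + ["East"] * 3 + ["South"] * 3,
    "score": [7.4, 7.0, 6.6, 6.0, 5.5, 5.0, 4.8, 4.0, 3.6],
})

# describe() returns count, mean, std, min, quartiles, and max per
# group; keep the five values that a box plot is built from.
summary = (
    df.groupby("region")["score"]
    .describe()[["min", "25%", "50%", "75%", "max"]]
    .sort_values("50%", ascending=False)
)
print(summary)
```

Sorting by the median (`50%`) puts the regions in the same order a sorted box plot would, which makes it easy to check whether the top and bottom 10 reflect the regional pattern overall.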

Additional research opportunities

There are many more questions to explore within the 2021 World Happiness Report data, and when comparing report data from 2012-2021. For example:

  • Create a heat map of happiness scores or certain factors overlaying a world map graphic, or an animated gif showing change in happiness scores over time.
  • Look at trends in particular countries over time.
  • Look at trends in calculated factor correlation over time.
  • Analyze trends or correlations from the “dystopian” side of the data (this data was included in the original .csv but removed for my EDA subset).