library(tidyverse)
library(janitor)
library(RColorBrewer)
library(plotly)
library(ggcorrplot)
library(tinytex)
The datasets I use in this project are about World Happiness (link here). There are two data sets:
world-happiness-report-2021, which focuses on the year 2021.
world-happiness-report, which summarizes the happiness score and several other factors from 2005 to 2020.
It is also important to note that not every country has data for every year from 2005 to 2020. Now let’s load the data!
happiness_2021 <- read_csv("world-happiness-report-2021.csv")
happiness <- read_csv("world-happiness-report.csv")
Now that we have imported the datasets successfully, we should take a look at them to check whether they need some polishing.
glimpse(happiness)
## Rows: 1,949
## Columns: 11
## $ `Country name` <chr> "Afghanistan", "Afghanistan", "Afgh~
## $ year <dbl> 2008, 2009, 2010, 2011, 2012, 2013,~
## $ `Life Ladder` <dbl> 3.724, 4.402, 4.758, 3.832, 3.783, ~
## $ `Log GDP per capita` <dbl> 7.370, 7.540, 7.647, 7.620, 7.705, ~
## $ `Social support` <dbl> 0.451, 0.552, 0.539, 0.521, 0.521, ~
## $ `Healthy life expectancy at birth` <dbl> 50.80, 51.20, 51.60, 51.92, 52.24, ~
## $ `Freedom to make life choices` <dbl> 0.718, 0.679, 0.600, 0.496, 0.531, ~
## $ Generosity <dbl> 0.168, 0.190, 0.121, 0.162, 0.236, ~
## $ `Perceptions of corruption` <dbl> 0.882, 0.850, 0.707, 0.731, 0.776, ~
## $ `Positive affect` <dbl> 0.518, 0.584, 0.618, 0.611, 0.710, ~
## $ `Negative affect` <dbl> 0.258, 0.237, 0.275, 0.267, 0.268, ~
glimpse(happiness_2021)
## Rows: 149
## Columns: 20
## $ `Country name` <chr> "Finland", "Denmark", "Sw~
## $ `Regional indicator` <chr> "Western Europe", "Wester~
## $ `Ladder score` <dbl> 7.842, 7.620, 7.571, 7.55~
## $ `Standard error of ladder score` <dbl> 0.032, 0.035, 0.036, 0.05~
## $ upperwhisker <dbl> 7.904, 7.687, 7.643, 7.67~
## $ lowerwhisker <dbl> 7.780, 7.552, 7.500, 7.43~
## $ `Logged GDP per capita` <dbl> 10.775, 10.933, 11.117, 1~
## $ `Social support` <dbl> 0.954, 0.954, 0.942, 0.98~
## $ `Healthy life expectancy` <dbl> 72.000, 72.700, 74.400, 7~
## $ `Freedom to make life choices` <dbl> 0.949, 0.946, 0.919, 0.95~
## $ Generosity <dbl> -0.098, 0.030, 0.025, 0.1~
## $ `Perceptions of corruption` <dbl> 0.186, 0.179, 0.292, 0.67~
## $ `Ladder score in Dystopia` <dbl> 2.43, 2.43, 2.43, 2.43, 2~
## $ `Explained by: Log GDP per capita` <dbl> 1.446, 1.502, 1.566, 1.48~
## $ `Explained by: Social support` <dbl> 1.106, 1.108, 1.079, 1.17~
## $ `Explained by: Healthy life expectancy` <dbl> 0.741, 0.763, 0.816, 0.77~
## $ `Explained by: Freedom to make life choices` <dbl> 0.691, 0.686, 0.653, 0.69~
## $ `Explained by: Generosity` <dbl> 0.124, 0.208, 0.204, 0.29~
## $ `Explained by: Perceptions of corruption` <dbl> 0.481, 0.485, 0.413, 0.17~
## $ `Dystopia + residual` <dbl> 3.253, 2.868, 2.839, 2.96~
We notice that the column names are in a beautiful, easy-to-read format. However, names with spaces and capital letters are awkward to work with in code, so we are going to clean the column names to keep them consistent throughout the analysis.
happiness <- clean_names(happiness)
happiness_2021 <- clean_names(happiness_2021)
colnames(happiness)
## [1] "country_name" "year"
## [3] "life_ladder" "log_gdp_per_capita"
## [5] "social_support" "healthy_life_expectancy_at_birth"
## [7] "freedom_to_make_life_choices" "generosity"
## [9] "perceptions_of_corruption" "positive_affect"
## [11] "negative_affect"
colnames(happiness_2021)
## [1] "country_name"
## [2] "regional_indicator"
## [3] "ladder_score"
## [4] "standard_error_of_ladder_score"
## [5] "upperwhisker"
## [6] "lowerwhisker"
## [7] "logged_gdp_per_capita"
## [8] "social_support"
## [9] "healthy_life_expectancy"
## [10] "freedom_to_make_life_choices"
## [11] "generosity"
## [12] "perceptions_of_corruption"
## [13] "ladder_score_in_dystopia"
## [14] "explained_by_log_gdp_per_capita"
## [15] "explained_by_social_support"
## [16] "explained_by_healthy_life_expectancy"
## [17] "explained_by_freedom_to_make_life_choices"
## [18] "explained_by_generosity"
## [19] "explained_by_perceptions_of_corruption"
## [20] "dystopia_residual"
Smooth like butter! Now you can see the column names are all in “snake” format, which is the default format of the clean_names() function. By the way, the clean_names() function lives in the janitor library. Check it out if you want!
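To see the renaming in isolation, here is a minimal sketch on a made-up one-row tibble. The column names are copied from the glimpse() output of the 2021 file; the country "Exampleland" and all values are placeholders, not real data.

```r
library(tibble)
library(janitor)

# Hypothetical one-row table: the column names come from the 2021 file,
# the values are placeholders used only for illustration.
toy <- tibble(
  `Country name` = "Exampleland",
  `Ladder score` = 5.0,
  `Explained by: Log GDP per capita` = 1.0,
  `Dystopia + residual` = 2.0
)

# clean_names() lower-cases the names and replaces spaces and
# punctuation with underscores (snake case is the default).
cleaned <- clean_names(toy)
colnames(cleaned)
```

The result should match the corresponding names printed by colnames(happiness_2021) above. If you prefer another style, clean_names() also takes a case argument (for example case = "lower_camel").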
For this Exploratory Data Analysis, we will try to answer several questions regarding happiness:
Statista published their ranking of the happiest countries in the world in 2020 (link here). Not surprisingly, the top five countries are all in Europe. In descending order, they are:
Let’s create our own list for the year 2021! This is going to be exciting!
Firstly, let’s find out the top 5 happiest countries based on the data from the happiness_2021 data frame.
top_5_happiest_country <- happiness_2021 %>%
select(country_name, ladder_score, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices) %>%
head(5)
print(top_5_happiest_country)
## # A tibble: 5 x 6
## country_name ladder_score logged_gdp_per_ca~ social_support healthy_life_expe~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Finland 7.84 10.8 0.954 72
## 2 Denmark 7.62 10.9 0.954 72.7
## 3 Switzerland 7.57 11.1 0.942 74.4
## 4 Iceland 7.55 10.9 0.983 73
## 5 Netherlands 7.46 10.9 0.942 72.4
## # ... with 1 more variable: freedom_to_make_life_choices <dbl>
Wow! As I expected, the top 5 happiest countries stay the same. However, the order is a little mixed up. Finland and the Netherlands stay in place, while Iceland drops down to 4th place, Denmark climbs up to 2nd place, and Switzerland climbs one rank to 3rd place.
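A side note on the code above: head(5) returns the happiest countries only because the 2021 file already appears to be sorted by ladder score. Here is a minimal sketch of a ranking that does not rely on the row order, using dplyr's slice_max() and slice_min(). The scores are the rounded values from the tables in this post, and the rows are shuffled on purpose for illustration.

```r
library(dplyr)

# Toy data, deliberately NOT sorted. Scores are rounded values taken from
# the tables in this post; the shuffled row order is for illustration only.
toy <- tibble(
  country_name = c("Rwanda", "Finland", "Lesotho", "Denmark", "Afghanistan", "Iceland"),
  ladder_score = c(3.42, 7.84, 3.51, 7.62, 2.52, 7.55)
)

# slice_max() keeps the n rows with the largest values, highest first
top_3 <- toy %>% slice_max(ladder_score, n = 3)

# slice_min() keeps the n rows with the smallest values, lowest first
bottom_3 <- toy %>% slice_min(ladder_score, n = 3)

top_3
bottom_3
```

On the real data, the same idea would be happiness_2021 %>% slice_max(ladder_score, n = 5).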
Pretty exciting findings so far. Now let’s plot the happiness scores of the countries on our podium.
#Top 5 happiest countries
ggplot(data=top_5_happiest_country,mapping=aes(x=reorder(country_name, -ladder_score),y=ladder_score,fill=country_name))+
geom_bar(stat='identity')+
geom_text(aes(label=round(ladder_score,1)), vjust=-0.5)+
theme(panel.grid.major = element_blank(),
panel.grid.minor=element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.position = "none")+
labs(title="Happiness Score Of The Five Happiest Countries In The World",
x="Country",
y="Happiness Score")+
scale_fill_discrete(name="Country Name by Colors")
As an Asian, I have always been told that people who live in Europe have a great life in terms of healthcare, education, finances, and freedom. Basically, we always imagine European countries to be the greatest places to settle.
At the other end of the scale, we will now find the five least happy countries on this planet.
top_5_unhappiest_country <- happiness_2021 %>%
select(country_name, ladder_score, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices) %>%
arrange(ladder_score) %>% head(5)
print(top_5_unhappiest_country)
## # A tibble: 5 x 6
## country_name ladder_score logged_gdp_per_ca~ social_support healthy_life_expe~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 2.52 7.70 0.463 52.5
## 2 Zimbabwe 3.14 7.94 0.75 56.2
## 3 Rwanda 3.42 7.68 0.552 61.4
## 4 Botswana 3.47 9.78 0.784 59.3
## 5 Lesotho 3.51 7.93 0.787 48.7
## # ... with 1 more variable: freedom_to_make_life_choices <dbl>
As we can observe, Afghanistan, Zimbabwe, Rwanda, Botswana, and Lesotho are the least happy countries in the world. Four of the five are in Sub-Saharan Africa, where living conditions and healthcare are not as fortunate as elsewhere. Afghanistan has been at war in the past year, which I believe is the main reason for its low happiness score.
Now let’s plot their happiness scores on a bar chart!
#Top 5 unhappiest countries
ggplot(data=top_5_unhappiest_country,mapping=aes(x=reorder(country_name, -ladder_score),y=ladder_score,fill=country_name))+
geom_bar(stat='identity')+
geom_text(aes(label=round(ladder_score,1)), vjust=-0.5)+
ylim(0,8)+
theme(panel.grid.major = element_blank(),
panel.grid.minor=element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.position = "none")+
labs(title="Happiness Score Of The Five Unhappiest Countries In The World",
x="Country",
y="Happiness Score")+
scale_fill_discrete(name="Country Name by Colors")
Now we know that the top 5 happiest countries are all in Europe, and the top 5 unhappiest countries are in Sub-Saharan Africa and South Asia. How about the other regions? What are their average happiness scores in 2021? Let’s take a look!
happiness_by_region <- happiness_2021 %>%
group_by(regional_indicator) %>%
summarize(mean_happiness_score=mean(ladder_score),
highest_score=max(ladder_score),
lowest_score=min(ladder_score)) %>%
arrange(-mean_happiness_score)
We successfully got the average, highest, and lowest happiness score for each region on this planet. Now let’s plot!
#Average happiness score by region
ggplot(data=happiness_by_region,mapping=aes(x=mean_happiness_score,
y=reorder(regional_indicator,mean_happiness_score),
fill=regional_indicator)) +
geom_bar(stat='identity')+
geom_text(aes(label=round(mean_happiness_score,1)), hjust=1.5)+
xlim(0,8) +
theme(legend.position="none",
axis.title.x=element_blank(),
axis.title.y=element_blank())+
theme(panel.grid.major = element_blank(),
panel.grid.minor=element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))+
scale_fill_brewer(palette = "Set3")+
labs(title="Average Happiness Score of Each Region")
The box plot below shows the range of happiness scores in each region in more detail. Earlier, we saw that the top 5 happiest countries are all from Western Europe. At that moment, I thought Western Europe would definitely be the happiest region, no doubt! But funny enough, despite having some of the highest happiness scores, Western Europe still falls behind North America and ANZ in average score. North America and ANZ countries have consistent scores overall, as shown by the “narrow” box, while Western Europe has quite a big range in its happiness scores. Sounds like the “American Dream” is still true huh 😂
#Happiness by region
regional_box <- ggplot(data=happiness_2021) +
geom_boxplot(aes(x=ladder_score,
y=reorder(regional_indicator,ladder_score),
fill=regional_indicator))+
theme_classic()+
theme(legend.position="none",
axis.title.x=element_blank(),
axis.title.y=element_blank())+
scale_fill_brewer(palette = "Pastel1")+
labs(title="Happiness Score By Region")
regional_box
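The “narrow box” observation can also be backed with numbers. Below is a minimal sketch that reports how many countries each region contains and how spread out their scores are, next to the mean. The two regions and all scores are made up for illustration; a mean based on only a handful of countries is worth reading with that in mind.

```r
library(dplyr)

# Made-up scores for two hypothetical regions, for illustration only.
toy <- tibble(
  regional_indicator = c("Region A", "Region A", "Region A", "Region A",
                         "Region B", "Region B"),
  ladder_score = c(7.8, 6.5, 5.9, 7.2, 7.1, 7.2)
)

region_spread <- toy %>%
  group_by(regional_indicator) %>%
  summarize(n_countries = n(),                  # how many countries are in the region
            mean_score = mean(ladder_score),    # average happiness score
            iqr_score = IQR(ladder_score),      # height of the box in a box plot
            .groups = "drop") %>%
  arrange(desc(mean_score))

region_spread
```

In this toy example, Region B has the higher mean and the tighter spread, but it also contains only two countries.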
I believe this question is one of the most controversial questions of all time 🤣. Who doesn’t want money anyway, but does money actually lead to a happier life? 🤨 Let’s find out using data visualizations, shall we?
To answer this question, we need to look for a strong positive correlation between money and happiness. To spot this relationship, I will use two main metrics: the logarithm of Gross Domestic Product (GDP) per capita, logged_gdp_per_capita, and the happiness score, ladder_score.
#GDP vs. Happiness
gdp_happiness <- ggplot(data=happiness_2021)+
geom_point(mapping=aes(x=logged_gdp_per_capita, y=ladder_score, color=regional_indicator))+
geom_smooth(mapping=aes(x=logged_gdp_per_capita, y=ladder_score),
method= lm)+
scale_color_discrete(name="Region Name by Colors")+
theme_classic() +
labs(title="GDP vs. Happiness",
x="Logged GDP per Capita",
y="Happiness Score")
ggplotly(gdp_happiness)
Well, it looks like we have a strong positive correlation between GDP and Happiness Score. I added a fitted line with geom_smooth() to better showcase the positive relationship between the two variables.
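To put a number on “strong”, we can compute the correlation coefficient directly. The sketch below uses only the ten countries printed in the two tables earlier (five happiest, five least happy), with their rounded values. This is an assumption-laden shortcut so the example runs without the CSV file: picking only the extremes exaggerates the relationship, so treat it as a demonstration of the calls, not as a result.

```r
# Rounded values for the ten countries printed in the two tables above
# (five happiest, five least happy). This is NOT the full data set, and
# keeping only the extremes exaggerates the relationship.
ladder_score <- c(7.84, 7.62, 7.57, 7.55, 7.46, 2.52, 3.14, 3.42, 3.47, 3.51)
logged_gdp   <- c(10.8, 10.9, 11.1, 10.9, 10.9, 7.70, 7.94, 7.68, 9.78, 7.93)

# Pearson correlation coefficient, between -1 and 1
r <- cor(logged_gdp, ladder_score)

# Straight-line fit, the kind of line geom_smooth(method = lm) draws
fit <- lm(ladder_score ~ logged_gdp)
slope <- unname(coef(fit)["logged_gdp"])

r
slope
```

On the full data, the equivalent call would be cor(happiness_2021$logged_gdp_per_capita, happiness_2021$ladder_score).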
I think we now have an answer to our question, Does money buy happiness? I believe a country with a strong economy has a better chance of providing its people with a high-quality healthcare system and education, which consequently leads to a better life overall. But is this the final answer to the problem of Happiness?
Next, we are looking at another crucial factor, healthcare. We want to see if there is any relationship between a longer life expectancy and a happier life. As with GDP vs. Happiness, we will plot healthy_life_expectancy and ladder_score on a scatter plot to find the pattern.
#Healthcare vs. Happiness
health_happiness <- ggplot(data=happiness_2021)+
geom_point(mapping=aes(x=healthy_life_expectancy, y=ladder_score, color = regional_indicator))+
geom_smooth(mapping=aes(x=healthy_life_expectancy, y=ladder_score),
method=lm)+
scale_color_discrete(name="Region Name by Colors")+
theme_classic() +
labs(title="Health vs. Happiness",
x="Life Expectancy",
y="Happiness Score")
ggplotly(health_happiness)
Though the smooth line looks curvy, we can spot an upward trend, which represents a positive relationship between Healthcare and Happiness. But is it true that happier countries have longer life expectancies? Let’s compare these box plots.
Even though we see a strong correlation between life expectancy and happiness, it does not hold in every case. Life expectancy does depend on healthcare systems, but it also depends on personal lifestyle, personal health, and the culture people come from. Generally speaking, though, a country with a great healthcare system tends to have happier citizens.
There is nothing more precious than the freedom to live your own life and to do the things that you actually love. We all want freedom: mentally, physically, and financially. But reality does not always cooperate, sadly. In many countries people are free to live their lives, but that is not the case in some other countries, which are still striving for their freedom.
Let’s find out how freedom can make changes to the happiness score of a country.
#Freedom vs happiness
freedom_happiness <- happiness_2021 %>% ggplot()+
geom_point(aes(x=freedom_to_make_life_choices,y=ladder_score, color = regional_indicator))+
geom_smooth(mapping=aes(x=freedom_to_make_life_choices,y=ladder_score),method=lm) +
scale_color_discrete(name="Region Name by Colors")+
theme_classic() +
labs(title="Freedom to make life choices vs. Happiness",
x="Freedom Score",
y="Happiness Score")
ggplotly(freedom_happiness)
There is indeed a strong correlation between the two factors.
Interestingly, the generosity factor has several negative values, and it is the only factor that has negative numbers. Let’s see what generosity has to do with the happiness score.
generosity_happiness <- happiness_2021 %>% ggplot()+
geom_point(aes(x=generosity,y=ladder_score, color = regional_indicator))+
geom_smooth(mapping=aes(x=generosity,y=ladder_score),method=lm) +
scale_color_discrete(name="Region Name by Colors")+
theme_classic() +
labs(title="Generosity vs. Happiness",
x="Generosity",
y="Happiness Score")
ggplotly(generosity_happiness)
There doesn’t seem to be any relationship between the two factors. Let’s take a look at the individual regions.
A correlation does exist in some regions, but it is too faint to recognize easily.
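One way to look at individual regions is to compute the correlation separately for each one. This is a minimal sketch on made-up values for two hypothetical regions, not the author's original per-region code; it only shows how an overall “no relationship” can hide opposite patterns inside the groups.

```r
library(dplyr)

# Made-up values for two hypothetical regions, for illustration only.
toy <- tibble(
  regional_indicator = rep(c("Region A", "Region B"), each = 4),
  generosity   = c(-0.1, 0.0, 0.1, 0.2,  -0.1, 0.0, 0.1, 0.2),
  ladder_score = c( 5.0, 5.5, 6.1, 6.4,   6.0, 5.2, 5.9, 5.1)
)

cor_by_region <- toy %>%
  group_by(regional_indicator) %>%
  summarize(n = n(),                              # countries in the region
            r = cor(generosity, ladder_score),    # correlation within the region
            .groups = "drop")

cor_by_region
```

For a visual version on the real data, adding facet_wrap(~ regional_indicator) to the scatter plot above would draw one panel per region.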
Looking at the data table, there seems to be a negative correlation between these two factors: the higher the happiness score, the lower the perceptions of corruption score seems to be. Let’s test this hypothesis with a plot, shall we?
#Perceptions of corruption vs happiness
perception_happiness <- happiness_2021 %>% ggplot()+
geom_point(aes(x=perceptions_of_corruption,y=ladder_score, color = regional_indicator))+
geom_smooth(mapping=aes(x=perceptions_of_corruption,y=ladder_score),method=lm) +
scale_color_discrete(name="Region Name by Colors")+
theme_classic() +
labs(title="Perceptions of Corruption vs. Happiness",
x="Perceptions of Corruption",
y="Happiness Score")
ggplotly(perception_happiness)
Looks like we are correct!
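The scatter plots so far differ only in the x variable and the labels, so the shared part could be wrapped in a small function. The helper below, plot_factor_vs_happiness(), is hypothetical (it is not part of the original analysis), and the toy rows are made up so the sketch runs without the CSV files.

```r
library(ggplot2)

# Hypothetical helper (not part of the original analysis): plots one factor
# against the happiness score with a linear fit, like the plots above.
plot_factor_vs_happiness <- function(data, x_var, x_label) {
  # .data[[x_var]] looks the column up by its name given as a string
  ggplot(data, aes(x = .data[[x_var]], y = ladder_score)) +
    geom_point(aes(color = regional_indicator)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    scale_color_discrete(name = "Region Name by Colors") +
    theme_classic() +
    labs(title = paste(x_label, "vs. Happiness"),
         x = x_label,
         y = "Happiness Score")
}

# Made-up rows for illustration only.
toy <- data.frame(
  regional_indicator = c("Region A", "Region A", "Region B", "Region B"),
  generosity = c(-0.1, 0.1, 0.0, 0.2),
  ladder_score = c(5.0, 6.0, 5.5, 6.5)
)

p <- plot_factor_vs_happiness(toy, "generosity", "Generosity")
```

With the real data, a call such as plot_factor_vs_happiness(happiness_2021, "perceptions_of_corruption", "Perceptions of Corruption") would reproduce the last plot, and the result can still be passed to ggplotly().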
Now that we have explored the relationships between happiness and GDP, Healthcare, Social Support, Freedom to make life choices, Generosity, and Perceptions of Corruption, we want to find out which of these factors is most strongly associated with the happiness level. To answer this question, we can use a correlation matrix, displayed as a heat map.
In order to use this method, we need the package ggcorrplot.
So now we are going to create a separate data frame df containing only the variables that we want, and rename the columns into a more readable format, for visual purposes only 🤣.
df <- happiness_2021 %>%
  # Select by name (and rename in the same step) instead of by column position
  select(Happiness = ladder_score,
         GDP = logged_gdp_per_capita,
         SociSupport = social_support,
         Healthcare = healthy_life_expectancy,
         Freedom = freedom_to_make_life_choices,
         Generosity = generosity,
         Corruption = perceptions_of_corruption)
We have our df data frame ready, now let’s compute the correlation matrix.
corr <- round(cor(df), 1)
And now we will plot the correlation matrix we just calculated.
ggcorrplot(corr,
lab=TRUE,
legend.title = "Corr",
title = "Happiness Heat Map",
colors = c("#0fc0d4","white","#d40f61"),
show.legend = TRUE)
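Side note: How to read this map? The correlation matrix has values ranging from -1 to 1, where -1 means a perfect negative linear relationship, 0 means no linear relationship, and 1 means a perfect positive linear relationship.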
There we go! Looks cute, right? 🤣 Looking at this heat map, we can observe that the 4 factors most strongly correlated with Happiness are GDP, Social Support, Healthcare, and Freedom to make life choices. The plot also shows that the factors we are considering are correlated with each other. Wow, Happiness is very complicated, huh? 🤣
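The ranking can also be read off the matrix in code, without squinting at the colors. Here is a minimal sketch that takes the Happiness row of a correlation matrix and sorts it by absolute value. It uses only the ten countries printed in the tables earlier, with their rounded values, so it is not the full data set and the numbers it produces should not be compared with the heat map.

```r
# Rounded values for the ten countries printed earlier in this post
# (five happiest, five least happy); NOT the full data set.
toy <- data.frame(
  Happiness   = c(7.84, 7.62, 7.57, 7.55, 7.46, 2.52, 3.14, 3.42, 3.47, 3.51),
  GDP         = c(10.8, 10.9, 11.1, 10.9, 10.9, 7.70, 7.94, 7.68, 9.78, 7.93),
  SociSupport = c(0.954, 0.954, 0.942, 0.983, 0.942, 0.463, 0.75, 0.552, 0.784, 0.787),
  Healthcare  = c(72, 72.7, 74.4, 73, 72.4, 52.5, 56.2, 61.4, 59.3, 48.7)
)

corr_toy <- cor(toy)

# Take the Happiness row and drop its correlation with itself
happiness_corr <- corr_toy["Happiness", colnames(corr_toy) != "Happiness"]

# Sort by absolute value so strong negative correlations rank high too
ranked <- sort(abs(happiness_corr), decreasing = TRUE)
ranked
```

With the real matrix, the same two lines applied to corr would list all six factors in order of strength.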
Back in 2005, I was still a 5-year-old kid. Oh my, the good old days 😭. I really miss them, even though I hardly remember anything that far back 🤣. So with that in mind, I would love to see (with data) how the world has changed in terms of happiness.
#World happiness box plots
world_happiness_plot <- ggplot(data=happiness) +
geom_boxplot(aes(x=as.factor(year),
y=life_ladder),
fill="lightcyan")+
theme_classic()+
theme(legend.position="none",
axis.title.x=element_blank(),
axis.title.y=element_blank())+
labs(title="World Happiness From 2005 - 2020")
ggplotly(world_happiness_plot)
Looks like when I was still a five-year-old kid, everyone was feeling great, huh? 🤣 Then things started to get a bit worse after that, especially around the economic recession. What seems weird to me is that 2020 has the highest happiness value compared to every other year besides 2005, even though 2020 was the peak of the Covid-19 pandemic. Anyone know why? 😵
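One thing that might be worth checking before reading too much into any single year is how many countries were surveyed that year, since not every country has data for every year. Below is a minimal sketch on made-up rows (the countries "A", "B", "C" and their scores are hypothetical) showing how the median can rise simply because lower-scoring countries drop out.

```r
library(dplyr)

# Made-up rows for three hypothetical countries, for illustration only.
toy <- tibble(
  country_name = c("A", "B", "C", "A", "B", "A"),
  year         = c(2018, 2018, 2018, 2019, 2019, 2020),
  life_ladder  = c(7.0, 5.0, 3.0, 7.1, 5.1, 7.2)
)

by_year <- toy %>%
  group_by(year) %>%
  summarize(n_countries = n_distinct(country_name),  # countries with data that year
            median_score = median(life_ladder),      # middle line of the box plot
            .groups = "drop")

by_year
```

Running the same summary on the happiness data frame would show whether the set of countries in 2005 or 2020 differs much from the other years.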
But anyway, next I would like to look closer into Viet Nam, where I’m from.
#Viet Nam
happiness %>%
filter(country_name == "Vietnam") %>%
ggplot(aes(x = as.factor(year), y = life_ladder))+
geom_bar(stat='identity', fill='lightcoral')+
geom_text(aes(label=round(life_ladder,1)), vjust=-0.5)+
ylim(0,8)+
theme_classic()+theme(legend.position="none",
axis.title.x=element_blank(),
axis.title.y=element_blank())+
labs(title="Happiness in Viet Nam")
And Canada, where my brother is currently studying and working abroad.
#Canada
happiness %>%
filter(country_name == "Canada") %>%
ggplot(aes(x = as.factor(year), y = life_ladder))+
geom_bar(stat='identity', fill='navajowhite1')+
geom_text(aes(label=round(life_ladder,1)), vjust=-0.5)+
ylim(0,8)+
theme_classic()+theme(legend.position="none",
axis.title.x=element_blank(),
axis.title.y=element_blank())+
labs(title="Happiness in Canada")
Certainly you are happier than I am right now, huh Khang? 😒
This also brings me to the end of this analysis. Conclusions and endings below!
Through this analysis, we were able to draw several insights:
Finally, thank you for reading this far; I really appreciate your time and effort. I am still a beginner in this field, so it will take me a lot more practice to do better than this, and I will always try my best. If you have any messages, recommendations, or suggestions for me, feel free to contact me through my social media or email:
I also do amateur food photography as a side hobby, check my Instagram @aknobofbutter
Once again, thank you for all your time and support. I’m looking forward to presenting you another analysis in the future, stay tuned!