The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. The first World Happiness Report was prepared in 2012 in support of the United Nations High-Level meeting on ‘Well-Being and Happiness: Defining a New Economic Paradigm’ which was held on 2 April 2012. The Report details and ranks the countries by how happy their citizens perceive themselves to be.
It stresses that the social well-being of a country's citizens is defined not only by macroeconomic factors such as Gross Domestic Product (GDP) but also by social indicators such as Freedom to make life choices and Healthy Life Expectancy. The World Happiness Report 2020 focuses mainly on the impact of social environments on the well-being of citizens.
We will study how indicators such as Ladder score, Logged GDP per capita, Perceptions of corruption, Social support, Generosity, Healthy life expectancy and Freedom to make life choices are related to each other.
Through this Exploratory Data Analysis (EDA), we also aim to answer the below questions:
Are there social indicators other than GDP per capita that can be used to evaluate the well-being of a country's citizens? Can these social indicators be used along with macroeconomic indicators when developing social policies?
Does having a high GDP per capita always mean a higher level of happiness perceived by the people?
What factors influence the Ladder scores (happiness levels) of each country or across regions?
What factors are correlated to each other?
Which countries have reported the highest and the lowest happiness scores?
How does the average life expectancy of the citizens vary across the regions?
I initially wanted to compare the happiness scores across the years. However, inspecting the datasets of previous years revealed some challenges. For example, certain countries such as Hong Kong were labelled differently from year to year, so combining the datasets of different years produced duplicate entries for some countries.
Hence, the scope of this analysis is limited to the World Happiness 2020 dataset. It was also important to understand how the variables and their scores were calculated. The Happiness 2020 dataset was sourced from Kaggle. (https://www.kaggle.com/mathurinache/world-happiness-report)
The geospatial data was sourced from https://hub.arcgis.com/datasets/a21fdb46d23e4ef896f31475217cbb08_1/data
Below are the variables from the combined dataset:
CNTRY_NAME - Country name (141 countries)
Regional.indicator - Region (10 regions)
Ladder.score - Life evaluation score
Logged.GDP.per.capita - Natural logarithm of GDP per capita (the extent to which it contributes to the Ladder score is recorded separately in Explained.by..Log.GDP.per.capita)
Healthy Life Expectancy - Healthy life expectancies at birth based on the data extracted from the World Health Organisation (WHO) data repository
Social support - Defined as having someone to count on in times of trouble (ranges from 0 to 1)
Freedom to make life choices - Defined as the national average of responses to the Gallup World Poll question (“Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”)
Generosity - National average of responses to the question - “Have you donated money to a charity in the past month?”
Perceptions of corruption - National average of responses to the questions (“Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”)
Details on the metadata of the dataset are provided under the References section.
When the World Happiness 2020 dataset (153 countries) was combined with the geo-spatial dataset (251 countries), 12 countries were omitted as a result of the join. After some investigation, I realised this was likely because some countries were written or labelled differently in the two datasets, so their names did not match in the join. One example is Hong Kong versus Hong Kong SAR.
So, in this analysis, only 141 countries were taken into consideration. This part of the pre-processing step was particularly challenging.
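The name mismatch described above can be diagnosed before joining. The following is a minimal sketch in base R, assuming toy name vectors in place of happiness_2020$Country.name and maplocation$CNTRY_NAME. The spellings and the recode_map lookup table are illustrative, not taken from the actual files.

```r
# Toy stand-ins for the two name columns (illustrative spellings only).
happiness_names <- c("Finland", "Hong Kong SAR", "Singapore")
map_names <- c("Aruba", "Finland", "Hong Kong", "Singapore")

# Names in the happiness data that have no exact match in the map data.
unmatched <- setdiff(happiness_names, map_names)
print(unmatched)

# Hypothetical lookup table: happiness spelling -> map spelling.
recode_map <- c("Hong Kong SAR" = "Hong Kong")

# Replace only the names that appear in the lookup table.
hit <- happiness_names %in% names(recode_map)
happiness_names[hit] <- recode_map[happiness_names[hit]]

# After recoding, every happiness name should match a map name.
still_unmatched <- setdiff(happiness_names, map_names)
```

Running setdiff() in both directions on the real columns would list the 12 unmatched countries, which could then be recoded before the join instead of being dropped.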
In this step, the steps to prepare the various visualisations will be discussed.
Load the packages:
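The original loading chunk is not shown, so the package list below is an assumption inferred from the functions called later in this post (read_sf, left_join, ggplot, ggplotly, tm_shape, str_wrap, and the heatmaply and RColorBrewer packages mentioned in the text). A minimal sketch:

```r
# Packages inferred from the functions used in this post (an assumption).
required_pkgs <- c("sf", "dplyr", "ggplot2", "plotly", "tmap",
                   "heatmaply", "RColorBrewer", "stringr")

# Check which packages are installed before attaching anything.
installed <- vapply(required_pkgs, requireNamespace,
                    logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available.
invisible(lapply(required_pkgs[installed], library, character.only = TRUE))
```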
Load the below datasets :
The World Happiness 2020 dataset contains 153 countries, categorised into 10 regions: Western Europe, North America and ANZ, Middle East and North Africa, Latin America and Caribbean, Central and Eastern Europe, East Asia, Southeast Asia, Commonwealth of Independent States, Sub-Saharan Africa and South Asia.
The geo-spatial data has been sourced from ArcGIS Hub (https://hub.arcgis.com/datasets/a21fdb46d23e4ef896f31475217cbb08_1/data). After combining the two datasets, it was observed that 12 countries did not match. This is one of the challenges faced in a data-driven problem, which will be further elaborated in the ‘Challenges’ section.
Hence, in this exercise, we will be studying the impacts of the various social indicators.
#Loading Dataset(happiness_2020)
#Set working directory
happiness_2020 <- read.csv("2020.csv")
str(happiness_2020)
## 'data.frame': 153 obs. of 20 variables:
## $ Country.name : Factor w/ 153 levels "Afghanistan",..: 43 36 132 57 106 100 131 101 7 81 ...
## $ Regional.indicator : Factor w/ 10 levels "Central and Eastern Europe",..: 10 10 10 10 10 10 10 6 10 10 ...
## $ Ladder.score : num 7.81 7.65 7.56 7.5 7.49 ...
## $ Standard.error.of.ladder.score : num 0.0312 0.0335 0.035 0.0596 0.0348 ...
## $ upperwhisker : num 7.87 7.71 7.63 7.62 7.56 ...
## $ lowerwhisker : num 7.75 7.58 7.49 7.39 7.42 ...
## $ Logged.GDP.per.capita : num 10.6 10.8 11 10.8 11.1 ...
## $ Social.support : num 0.954 0.956 0.943 0.975 0.952 ...
## $ Healthy.life.expectancy : num 71.9 72.4 74.1 73 73.2 ...
## $ Freedom.to.make.life.choices : num 0.949 0.951 0.921 0.949 0.956 ...
## $ Generosity : num -0.0595 0.0662 0.1059 0.2469 0.1345 ...
## $ Perceptions.of.corruption : num 0.195 0.168 0.304 0.712 0.263 ...
## $ Ladder.score.in.Dystopia : num 1.97 1.97 1.97 1.97 1.97 ...
## $ Explained.by..Log.GDP.per.capita : num 1.29 1.33 1.39 1.33 1.42 ...
## $ Explained.by..Social.support : num 1.5 1.5 1.47 1.55 1.5 ...
## $ Explained.by..Healthy.life.expectancy : num 0.961 0.979 1.041 1.001 1.008 ...
## $ Explained.by..Freedom.to.make.life.choices: num 0.662 0.665 0.629 0.662 0.67 ...
## $ Explained.by..Generosity : num 0.16 0.243 0.269 0.362 0.288 ...
## $ Explained.by..Perceptions.of.corruption : num 0.478 0.495 0.408 0.145 0.434 ...
## $ Dystopia...residual : num 2.76 2.43 2.35 2.46 2.17 ...
#unique(happiness_2020$Country.name)
maplocation <-
read_sf("Longitude_Graticules_and_World_Countries_Boundaries-shp/99bfd9e7-bb42-4728-87b5-07f8c8ac631c2020328-1-1vef4ev.lu5nk.shp")
str(maplocation)
## tibble [251 x 3] (S3: sf/tbl_df/tbl/data.frame)
## $ OBJECTID : int [1:251] 1 2 3 4 5 6 7 8 9 10 ...
## $ CNTRY_NAME: chr [1:251] "Aruba" "Antigua and Barbuda" "Afghanistan" "Algeria" ...
## $ geometry :sfc_MULTIPOLYGON of length 251; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:11, 1:2] -69.9 -69.9 -70.1 -70.1 -70 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
## ..- attr(*, "names")= chr [1:2] "OBJECTID" "CNTRY_NAME"
#unique(maplocation$CNTRY_NAME)
combined_dataset <- left_join(maplocation,happiness_2020 ,
by = c("CNTRY_NAME" = "Country.name"))
names(combined_dataset)
## [1] "OBJECTID"
## [2] "CNTRY_NAME"
## [3] "geometry"
## [4] "Regional.indicator"
## [5] "Ladder.score"
## [6] "Standard.error.of.ladder.score"
## [7] "upperwhisker"
## [8] "lowerwhisker"
## [9] "Logged.GDP.per.capita"
## [10] "Social.support"
## [11] "Healthy.life.expectancy"
## [12] "Freedom.to.make.life.choices"
## [13] "Generosity"
## [14] "Perceptions.of.corruption"
## [15] "Ladder.score.in.Dystopia"
## [16] "Explained.by..Log.GDP.per.capita"
## [17] "Explained.by..Social.support"
## [18] "Explained.by..Healthy.life.expectancy"
## [19] "Explained.by..Freedom.to.make.life.choices"
## [20] "Explained.by..Generosity"
## [21] "Explained.by..Perceptions.of.corruption"
## [22] "Dystopia...residual"
In this step, we will remove the NA values from the combined dataset.
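The code for this step is not shown, so the following is only a sketch of one way to do it, using a toy data frame in place of combined_dataset. After the left join, map rows without a happiness record carry NA in the happiness columns, so filtering on Ladder.score is enough to drop them. The values below are illustrative.

```r
# Toy stand-in for combined_dataset (values are illustrative only).
toy_combined <- data.frame(
  CNTRY_NAME = c("Aruba", "Finland", "Singapore"),
  Ladder.score = c(NA, 7.8, 6.4),
  stringsAsFactors = FALSE
)

# Keep only the rows that found a match in the happiness data.
toy_clean <- toy_combined[!is.na(toy_combined$Ladder.score), ]
nrow(toy_clean)
```

The same row-indexing pattern should also apply to the sf object in this post, since sf objects support logical row subsetting with `[`.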
Social support is highly correlated to Ladder Score (0.76), Healthy Life Expectancy (0.73) and the Logged GDP per Capita (0.78).
Logged GDP per Capita is highly correlated to Ladder Score (0.77), Healthy Life Expectancy (0.83) and Social support (0.78).
The Happiness 2020 dataset is quite different from the datasets of previous years (2019, 2018, 2017). The happiness level of each country was previously referred to as the happiness score. In the 2020 dataset, happiness scores are renamed Ladder Scores to better reflect that they are life evaluation scores.
The Ladder Scores represent the happiness level of each country. According to the World Happiness Report, the score is the national average response to a life evaluation question. Respondents in each country were given the survey prompt “Please imagine a Ladder, with steps numbered from 0 at the bottom to 10 at the top”, where the top of the ladder represents the best possible life and the bottom represents the worst possible life, and were asked to rate their own lives on that scale. The Ladder scores were computed from these ratings. A person who feels they have the best opportunity to improve their life in their country would assign a higher Ladder score.
Ladder Score can be best explained by Healthy Life Expectancy, Logged GDP per Capita and Social support as observed from the Correlation plot.
The heatmaply package was loaded to plot the interactive heatmap and the RColorBrewer package was loaded to apply the palette colours. To plot the heatmap, the x values passed to the cor() function had to be numeric. Hence, the dataset was converted from a tibble to a data frame using the as.data.frame() function, and only the numeric columns were used.
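The correlation step can be sketched in base R as follows. This is an assumed reconstruction on toy data, not the original chunk: the column names mirror the dataset, but the values are randomly generated.

```r
set.seed(1)

# Toy stand-in for the cleaned dataset (random values, for illustration).
toy <- data.frame(
  Ladder.score = rnorm(20, mean = 5.5, sd = 1),
  Social.support = runif(20),
  Healthy.life.expectancy = rnorm(20, mean = 65, sd = 7),
  Regional.indicator = sample(c("Region A", "Region B"), 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# cor() only accepts numeric input, so select the numeric columns first.
is_num <- vapply(toy, is.numeric, logical(1))
cor_mat <- cor(as.data.frame(toy[, is_num]), use = "complete.obs")
round(cor_mat, 2)
```

For the sf object used in this post, the geometry column would also need to be dropped before this step (for example with sf::st_drop_geometry()). The resulting matrix could then be passed to heatmaply() for the interactive plot.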
From our correlation matrix plot above, we can observe that the variables Ladder Score and Social Support are closely correlated.
The below scatter plots have been created with the regions grouped.
We observed from the correlation plot that the GDP per capita is also correlated to the Healthy Life Expectancy.
Hence, we wanted to explore whether countries with a high GDP per capita also have a higher life expectancy.
p <- ggplot(combined_dataset,
aes(x = Logged.GDP.per.capita, y=Healthy.life.expectancy,
colour =Regional.indicator ,text = paste("country:", CNTRY_NAME))) +
geom_point(show.legend = FALSE, alpha = 0.7) +
scale_colour_brewer(type = "seq", palette = "Spectral") +
scale_size(range = c(2, 12)) +
scale_x_log10()+
theme_minimal()+
labs(x = "Logged GDP per capita", y = "Life expectancy",
caption = "Data source: World Happiness Report 2020")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
fig <- ggplotly(p)
fig
The countries in Western Europe with high GDP per capita generally have a higher life expectancy than countries in other regions.
We can also observe from this chart that Singapore, one of the countries with the highest GDP per capita, also has a high life expectancy of about 76.8 years.
The Central African Republic in the Sub-Saharan Africa region has one of the lowest logged GDP per capita values (6.63) and the lowest life expectancy (45 years). Violence and the displacement of its people are among the leading causes of the low life expectancy. The Central African Republic has faced decades of political instability since it gained independence from France in 1960. It has also been grappling with diseases such as HIV/AIDS, influenza, pneumonia and diarrhoeal diseases.
More details on the situation of Central African Republic can be found below: (https://borgenproject.org)
In this section, we will look at the ten happiest and the ten least happy countries by their Ladder scores. The arrange() function was applied to combined_dataset to sort the records in decreasing order of Ladder score, and the head() and tail() functions were applied to extract the top 10 and bottom 10 countries respectively.
When plotting the dot plot, we wanted to show the countries in decreasing order of Ladder score. Hence, reorder(CNTRY_NAME, -Ladder.score) was applied to sort the countries on the x-axis.
combined_dataset <- combined_dataset %>% arrange(desc(Ladder.score))
top10 <- head(combined_dataset, 10)
p2 <- ggplot(top10, aes(x= reorder(CNTRY_NAME,-Ladder.score),
y=Ladder.score, fill=Regional.indicator))+
geom_point( color="#C4961A", size=4, shape=18) +
geom_segment( aes(x=reorder(CNTRY_NAME,-Ladder.score),
xend=reorder(CNTRY_NAME,-Ladder.score),
y=0, yend=Ladder.score), color="grey") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank()
) +
scale_fill_brewer(palette = "Dark2")+
xlab("") +
ylab("Ladder score")+
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
ggtitle("Top 10 Countries with high Ladder Scores" )
ggplotly(p2)
bottom10 <- tail(combined_dataset, 10)
p3 <- ggplot(bottom10, aes(x= reorder(CNTRY_NAME,-Ladder.score),
y=Ladder.score, fill=Regional.indicator))+
geom_point( color="#00AFBB", size=4, shape=18) +
geom_segment( aes(x=reorder(CNTRY_NAME,-Ladder.score),
xend=reorder(CNTRY_NAME,-Ladder.score),
y=0, yend=Ladder.score), color="grey") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank()
) +
scale_fill_brewer(palette = "Set2")+
xlab("Country") +
ylab("Ladder score")+
scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +
ggtitle("Bottom 10 Countries with low Ladder Scores" )
ggplotly(p3)
What is the Ladder score across the different regions?
p2 <- ggplot(combined_dataset, aes(x=Regional.indicator, y = Ladder.score))+
geom_boxplot()+
theme_minimal()+
geom_violin(aes(fill=Regional.indicator))+
scale_color_brewer(palette = "Dark2") +
stat_summary(geom = 'point', fun = 'mean', color='red')+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(p2)
It can be observed that, in general, the countries in the North America and ANZ region and in the Western Europe region have reported high mean ladder scores. However, the spread of the ladder scores is larger in the Western Europe region, where the maximum ladder score is 7.8 and the minimum is 5.51.
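The mean and spread read off the violin plot can also be tabulated directly. The following is a minimal base R sketch on a toy data frame. The column names mirror combined_dataset, but the scores are illustrative, not the actual figures.

```r
# Toy stand-in for combined_dataset (illustrative scores only).
toy <- data.frame(
  Regional.indicator = c("Western Europe", "Western Europe", "Western Europe",
                         "South Asia", "South Asia"),
  Ladder.score = c(7.8, 6.5, 5.5, 4.0, 5.0),
  stringsAsFactors = FALSE
)

# Group the scores by region, then summarise each group.
by_region <- split(toy$Ladder.score, toy$Regional.indicator)
region_summary <- data.frame(
  region = names(by_region),
  mean = vapply(by_region, mean, numeric(1)),
  min = vapply(by_region, min, numeric(1)),
  max = vapply(by_region, max, numeric(1)),
  row.names = NULL
)

# Spread is taken here as the range (max minus min) within each region.
region_summary$spread <- region_summary$max - region_summary$min
print(region_summary)
```

Using the range as the measure of spread is a choice made for this sketch. The standard deviation or interquartile range would be reasonable alternatives.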
What is the average life expectancy across the different regions?
p3 <- ggplot(combined_dataset, aes(x=Regional.indicator, y = Healthy.life.expectancy))+
geom_boxplot()+
geom_violin(aes(fill=Regional.indicator))+
theme_minimal()+
stat_summary(geom = 'point', fun = 'mean', color='red')+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(p3)
The tmap package was loaded to create the maps below. In this analysis, we will focus on the countries in the Southeast Asia region. With the help of the tm_text() function, we were able to overlay each country's name on the map.
The various indicators, namely Ladder score, Logged GDP per capita, Perceptions of corruption, Social support, Generosity, Healthy life expectancy and Freedom to make life choices, were plotted across the countries in Southeast Asia.
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Ladder.score",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Logged.GDP.per.capita",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Perceptions.of.corruption",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Social.support",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Generosity",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Healthy.life.expectancy",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
tm_shape(combined_dataset[combined_dataset$Regional.indicator=="Southeast Asia", ]) +
tm_fill("Freedom.to.make.life.choices",
style = "quantile",
palette = "Greens") +
tm_borders(alpha = 0.5) +
tm_text("CNTRY_NAME", size="CNTRY_NAME")+
tmap_style("watercolor")
## tmap style set to "watercolor"
## other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "albatross", "beaver", "bw", "classic"
## Text size will be constant in view mode. Set tm_view(text.size.variable = TRUE) to enable variable text sizes.
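The seven map chunks above differ only in the indicator passed to tm_fill(). A possible way to reduce the repetition is sketched below. The helper make_indicator_map() and the object name sea are hypothetical, and the tmap calls reuse the same arguments as the chunks above, except that tm_text() uses the default constant text size.

```r
# Indicator columns mapped in this section.
indicators <- c("Ladder.score", "Logged.GDP.per.capita",
                "Perceptions.of.corruption", "Social.support",
                "Generosity", "Healthy.life.expectancy",
                "Freedom.to.make.life.choices")

# Hypothetical helper: build one choropleth for a given indicator column.
make_indicator_map <- function(data, indicator) {
  tmap::tm_shape(data) +
    tmap::tm_fill(indicator, style = "quantile", palette = "Greens") +
    tmap::tm_borders(alpha = 0.5) +
    tmap::tm_text("CNTRY_NAME")
}

# Usage (requires tmap and the combined_dataset built earlier):
# sea <- combined_dataset[combined_dataset$Regional.indicator == "Southeast Asia", ]
# maps <- lapply(indicators, make_indicator_map, data = sea)
# maps[[1]]
```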
The below observations can be made from the above maps.
A high GDP per capita does not always mean that the happiness level/satisfaction level perceived among its citizens is high as well.
The low life expectancy is very concerning in countries such as the Central African Republic.
Finland is one of the happiest countries to live in with a high ladder score reported.
Countries with high GDP per capita tend to have high perception of corruption. More trust has to be built among its citizens.
In countries with high GDP per capita, higher level of life expectancy is observed in most cases.