1 Introduction

Happiness… How to define it ? How to measure it ? Quite a complex question considering how different cultures might define happiness as well as how complicated it is to measure such a subjective concept.

Based on observed data of six variables (GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption) and estimates of their associations with the Happiness Score, the World Happiness Reports explains the variation across countries.

The World Happiness Report is a partnership of Gallup, the Oxford Wellbeing Research Centre, the UN Sustainable Development Solutions Network, and the WHR’s Editorial Board. The report is produced under the editorial control of the WHR Editorial Board.

The World Happiness Report reflects a worldwide demand for more attention to happiness and well-being as criteria for government policy. It reviews the state of happiness in the world today and shows how the science of happiness explains personal and national variations in happiness.

In this second and final Capstone Project for the HarvardX Professional Certificate in Data Science (PH125.9x), we will explore and visually examine the 2024 World Happiness Report to develop a machine-learning model that shall predict the Happiness Score on a validation set with a Root Mean Squared Error (RMSE) between 0.2 an 0.5. Four models will be developed and compared based on their resulting RMSE.

2 Loading the data & Initial Data Wrangling

Importing Data from the World Happiness Report official website .

# Import dataset
url <- "https://happiness-report.s3.amazonaws.com/2024/DataForFigure2.1+with+sub+bars+2024.xls"
destfile <- "WHR_2024.xls"
curl::curl_download(url, destfile)
whr2024 <- read_excel(destfile)

Modify columns’ names for better readability.

whr2024 <- whr2024 %>% 
  rename("Country" = "Country name", "Happiness Score" = "Ladder score", "Expl. by Economy" = "Explained by: Log GDP per capita", "Expl. by Social" = "Explained by: Social support", "Expl. by Health" = "Explained by: Healthy life expectancy", "Expl. by Freedom" = "Explained by: Freedom to make life choices", "Expl. by Generosity" = "Explained by: Generosity", "Expl. by Trust" = "Explained by: Perceptions of corruption")

Adding Country Code column to the Data Set in order to plot maps.

whr2024 <- whr2024 %>% 
  mutate("Country Code" = countrycode(
    sourcevar = whr2024$`Country`,
    origin = 'country.name',
    destination = 'wb'
  ))

whr2024 <- whr2024 %>% 
  relocate("Country Code", .after = "Country")

Rounding all numbers at 4 decimals

whr2024 <- whr2024 %>%
  mutate(across(.cols = 3:12, .fns = ~round(.x, digits = 4)))

Eliminate unnecessary columns

whr2024 <- whr2024 %>% 
  select(!c(upperwhisker, lowerwhisker))

3 Exploratory Data Analysis

The objective of this EDA is to:

Identify potential errors/typos, better understand patterns within the data, detect outliers or anomalous events, identify potential relations among the variables;
Help defining the most relevant strategy to develop our prediction models.

We will first check the whr2024 Data Set, its 5 first head rows, its structure info. We will then check its related summary statistics, determine what “explanatory variables” have the highest correlation coefficient with the “Happiness Score” and their related distribution. Finally, we’ll check whether “Explanatory Variables” have outliers.

3.1 Head Rows

kable(head(whr2024, 5), booktabs = TRUE) %>%
  kable_styling(position = "center", latex_options = c("striped", "scale_down"))

Country	Country Code	Happiness Score	Expl. by Economy	Expl. by Social	Expl. by Health	Expl. by Freedom	Expl. by Generosity	Expl. by Trust	Dystopia + residual
Finland	FIN	7.7407	1.8441	1.5724	0.6948	0.8593	0.1417	0.5462	2.0824
Denmark	DNK	7.5827	1.9078	1.5204	0.6989	0.8227	0.2036	0.5484	1.8809
Iceland	ISL	7.5251	1.8807	1.6165	0.7183	0.8185	0.2583	0.1825	2.0502
Sweden	SWE	7.3441	1.8781	1.5008	0.7239	0.8383	0.2215	0.5238	1.6577
Israel	ISR	7.3411	1.8029	1.5128	0.7398	0.6415	0.1532	0.1928	2.2980

3.2 Data Set Structure

str(whr2024)

## tibble [143 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Country            : chr [1:143] "Finland" "Denmark" "Iceland" "Sweden" ...
##  $ Country Code       : chr [1:143] "FIN" "DNK" "ISL" "SWE" ...
##  $ Happiness Score    : num [1:143] 7.74 7.58 7.53 7.34 7.34 ...
##  $ Expl. by Economy   : num [1:143] 1.84 1.91 1.88 1.88 1.8 ...
##  $ Expl. by Social    : num [1:143] 1.57 1.52 1.62 1.5 1.51 ...
##  $ Expl. by Health    : num [1:143] 0.695 0.699 0.718 0.724 0.74 ...
##  $ Expl. by Freedom   : num [1:143] 0.859 0.823 0.818 0.838 0.641 ...
##  $ Expl. by Generosity: num [1:143] 0.142 0.204 0.258 0.222 0.153 ...
##  $ Expl. by Trust     : num [1:143] 0.546 0.548 0.182 0.524 0.193 ...
##  $ Dystopia + residual: num [1:143] 2.08 1.88 2.05 1.66 2.3 ...

The Data Set contains 20 variables:

Country : Character. Country Name;
Country Code : Character. World Bank Country Codes;
Happiness Score : Numeric. XXXXXXX
Expl. by Economy : Numeric. How much does the Economy variable contributes to the Happiness Score. GDP per capita. Gross Domestic Product, or how much each country produces, divided by the number of people in the country. GDP per capita gives information about the size of the economy and how the economy is performing;
Expl. by Social : Numeric. Explanatory variable. How much does the Social Support variable contributes to the Happiness Score. Social support, or having someone to count on in times of trouble. “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”;
Expl. by Health : Numeric. Explanatory variable. How much does the Health variable contributes to the Happiness Score. Healthy Life Expectancy. More than life expectancy, how is your physical and mental health? Mental health is a key component of subjective well-being and is also a risk factor for future physical health and longevity. Mental health influences and drives a number of individual choices, behaviours, and outcomes;
Expl. by Freedom : Numeric. Explanatory variable. How much does the Freedom variable contributes to the Happiness Score. Freedom to make Life Choices. “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” This also includes Human Rights. Inherent to all human beings, regardless of race, sex, nationality, ethnicity, language, religion, or any other status. Human rights include the right to life and liberty, freedom from slavery and torture, freedom of opinion and expression, the right to work and education, and many more. Everyone is entitled to these rights without discrimination;
Expl. by Generosity : Numeric. Explanatory variable. How much does the Generosity variable contributes to the Happiness Score. “Have you donated money to a charity in the past month?” A clear marker for a sense of positive community engagement and a central way that humans connect with each other. Research shows that in all cultures, starting in early childhood, people are drawn to behaviours which benefit other people;
Expl. by Trust : Numeric. Explanatory variable. How much does the Trust variable contributes to the Happiness Score. “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” Do people trust their governments and have trust in the benevolence of others?
Dystopia + residual : Numeric. Explanatory variable. How much does the Dystopia + residual variable contributes to the Happiness Score.

3.3 Summary Statistics

summary_whr <- whr2024 %>% 
  select(!c(Country, "Country Code"))

summary_whr <- summary(summary_whr)

kable(summary_whr, booktabs = TRUE, caption = "Summary Statistics") %>%
  kable_styling(position = "center", latex_options = c("scale_down", "striped"))

Summary Statistics
Happiness Score	Expl. by Economy	Expl. by Social	Expl. by Health	Expl. by Freedom	Expl. by Generosity	Expl. by Trust	Dystopia + residual
Min. :1.721	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :0.00000	Min. :-0.0728
1st Qu.:4.726	1st Qu.:1.077	1st Qu.:0.9214	1st Qu.:0.3981	1st Qu.:0.5274	1st Qu.:0.09115	1st Qu.:0.06842	1st Qu.: 1.3079
Median :5.785	Median :1.431	Median :1.2372	Median :0.5492	Median :0.6411	Median :0.13630	Median :0.12015	Median : 1.6444
Mean :5.528	Mean :1.379	Mean :1.1343	Mean :0.5208	Mean :0.6206	Mean :0.14633	Mean :0.15413	Mean : 1.5759
3rd Qu.:6.416	3rd Qu.:1.741	3rd Qu.:1.3834	3rd Qu.:0.6482	3rd Qu.:0.7363	3rd Qu.:0.19268	3rd Qu.:0.19350	3rd Qu.: 1.8816
Max. :7.741	Max. :2.140	Max. :1.6165	Max. :0.8573	Max. :0.8633	Max. :0.40080	Max. :0.57510	Max. : 2.9979
NA	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3

Check NAs

print("Row and Col positions of NA values")

## [1] "Row and Col positions of NA values"

which(is.na(whr2024), arr.ind = TRUE)

##       row col
##  [1,]  62   4
##  [2,]  88   4
##  [3,] 103   4
##  [4,]  62   5
##  [5,]  88   5
##  [6,] 103   5
##  [7,]  62   6
##  [8,]  88   6
##  [9,] 103   6
## [10,]  62   7
## [11,]  88   7
## [12,] 103   7
## [13,]  62   8
## [14,]  88   8
## [15,] 103   8
## [16,]  62   9
## [17,]  88   9
## [18,] 103   9
## [19,]  62  10
## [20,]  88  10
## [21,] 103  10

Bahrain, Tajikistan and Palestine don’t have explanatory variables info and will therefore be removed from the dataset.

whr2024 <- whr2024[-c(62, 88, 103),]

3.4 Correlation

corr <- whr2024 %>%
  select("Happiness Score", "Expl. by Economy", "Expl. by Social", "Expl. by Health", "Expl. by Freedom", "Expl. by Generosity", "Expl. by Trust", "Dystopia + residual")

whr_corr = cor(corr)

corrplot(whr_corr,
         method = "circle",
         outline = TRUE,
         diag = FALSE,
         addCoef.col = TRUE,
         number.cex = .8,
         type = "upper",
         tl.col = "black")

Highest level of correlation between Happiness Score and:

Social Support (.81);
Economy (.77);
Health (.76);
Freedom (.64).

3.5 Key Explanatory Variables Visualization & Distribution

3.5.1 Social Support

3.5.1.1 Mapping the “Social Support” explanatory variable.

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Social`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Social Support' Contribution to Happiness Score",
    geo = list(
      showcountries = TRUE,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.1.2 Top and bottom 5 Countries

# Defining colour palette for "top/bottom 5" tables
color_palette <- rev(paletteer_d("RColorBrewer::RdYlGn", n = 11))

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Social`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Social`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Social`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Social`, Group) %>%
  mutate(`Expl. by Social` = cell_spec(`Expl. by Social`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Social Support Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)

Top & Bottom Countries / Social Support Explanatory Variable
Country	Expl. by Social	Group
Iceland	1.6165	Top 5
Finland	1.5724	Top 5
Slovakia	1.5403	Top 5
Hungary	1.5282	Top 5
Estonia	1.5271	Top 5
Malawi	0.4104	Bottom 5
Comoros	0.328	Bottom 5
Bangladesh	0.2492	Bottom 5
Benin	0.1283	Bottom 5
Afghanistan	0	Bottom 5

3.5.1.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_social <- mean(whr2024$`Expl. by Social`, na.rm = TRUE)
median_social <- median(whr2024$`Expl. by Social`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Social`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Social Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_social, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_social, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Social” variable suggests that while most countries have high levels of social support (as indicated by a median above 1), a number of countries with lower social support scores exist, pulling the average (mean) down and causing a slight left skew in the distribution.

3.5.1.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Social`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Social Support",
       x = "Happiness Score",
       y = "Social Support Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

The positive correlation between “Happiness Score” and “Social” suggests a significant association where countries with higher levels of social support tend to report higher happiness scores. This relationship, underscored by the scatter plot and linear model trend line, points towards social support as an important factor in influencing overall happiness. The mean and median values for both variables provide a central tendency context, with the plot illustrating a broad range of scores across countries yet maintaining a discernible positive trend.

3.5.2 Economy

3.5.2.1 Mapping the “Economy” explanatory variable

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Economy`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Economy' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.2.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Economy`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Economy`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Economy`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Economy`, Group) %>%
  mutate(`Expl. by Economy` = cell_spec(`Expl. by Economy`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Economy Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)

Top & Bottom Countries / Economy Explanatory Variable
Country	Expl. by Economy	Group
Luxembourg	2.1405	Top 5
Ireland	2.1288	Top 5
Singapore	2.1181	Top 5
United Arab Emirates	1.9831	Top 5
Switzerland	1.9703	Top 5
Chad	0.6033	Bottom 5
Niger	0.5728	Bottom 5
Mozambique	0.5598	Bottom 5
Congo (Kinshasa)	0.5337	Bottom 5
Venezuela	0	Bottom 5

3.5.2.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_economy <- mean(whr2024$`Expl. by Economy`, na.rm = TRUE)
median_economy <- median(whr2024$`Expl. by Economy`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Economy`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Economy Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_economy, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_economy, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Economy” variable suggests a relatively high and concentrated level of economic contribution to happiness across different countries, with a slight indication of variability or outliers on the lower end. The histogram visually convey this distribution’s characteristics, indicating both the general trend towards higher economic contributions to happiness and the presence of variability across countries.

3.5.2.4 Plotting the variable distribution vs. Happiness Score

#Plotting distribution 
ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Economy`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Economy",
       x = "Happiness Score",
       y = "Economy Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The positive correlation between “Happiness Score” and “Economy” suggests a significant association where countries with higher levels of economic contributions tend to report higher happiness scores. This relationship, underscored by the scatter plot and linear model trend line, points towards the economy as an important factor in influencing overall happiness. The mean and median values for both variables provide a central tendency context, with the plot illustrating a broad range of scores across countries yet maintaining a discernible positive trend.

3.5.3 Health

3.5.3.1 Mapping the “Health” explanatory variable.

# Plotting map
plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Health`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Health' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.3.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Health`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Health`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Health`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Health`, Group) %>%
  mutate(`Expl. by Health` = cell_spec(`Expl. by Health`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Health Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)

Top & Bottom Countries / Health Explanatory Variable
Country	Expl. by Health	Group
Hong Kong S.A.R. of China	0.8573	Top 5
Japan	0.7848	Top 5
South Korea	0.7696	Top 5
Singapore	0.7689	Top 5
Switzerland	0.7467	Top 5
Zimbabwe	0.2321	Bottom 5
Chad	0.1985	Bottom 5
Eswatini	0.1759	Bottom 5
Mozambique	0.1556	Bottom 5
Lesotho	0	Bottom 5

3.5.3.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_health <- mean(whr2024$`Expl. by Health`, na.rm = TRUE)
median_health <- median(whr2024$`Expl. by Health`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Health`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Health Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_health, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_health, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of “Health” across countries is showing a generally high level of health’s contribution to happiness, with potential slight variations indicating the presence of outliers or a modest range of scores. The close values of mean and median, coupled with the visualization’s aesthetic choices, emphasize a relatively symmetrical distribution with slight deviations.

3.5.3.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Health`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Health",
       x = "Happiness Score",
       y = "Health Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The plot shows a positive relationship between “Health” and “Happiness Score”, signified by the upward slope of the linear model line. This implies that countries with better health metrics tend to report higher happiness scores, suggesting a significant role of health in influencing a country’s overall happiness.

3.5.4 Freedom

3.5.4.1 Mapping the “Freedom” explanatory variable.

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Freedom`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Freedom' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.4.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Freedom`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Freedom`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Freedom`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Freedom`, Group) %>%
  mutate(`Expl. by Freedom` = cell_spec(`Expl. by Freedom`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Freedom Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)

Top & Bottom Countries / Freedom Explanatory Variable
Country	Expl. by Freedom	Group
Cambodia	0.8633	Top 5
Finland	0.8593	Top 5
Vietnam	0.8426	Top 5
Sweden	0.8383	Top 5
Norway	0.8354	Top 5
Algeria	0.247	Bottom 5
Turkiye	0.2021	Bottom 5
Lebanon	0.1732	Bottom 5
Comoros	0.1721	Bottom 5
Afghanistan	0	Bottom 5

3.5.4.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_free <- mean(whr2024$`Expl. by Freedom`, na.rm = TRUE)
median_free <- median(whr2024$`Expl. by Freedom`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Freedom`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Freedom Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_free, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_free, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Freedom” variable within the whr2024 dataset reflects a range of freedom contributions to happiness across countries, with a generally symmetrical distribution around the central values of mean and median. The lack of a significant difference between the mean and median suggests minimal skewness, indicating that for most countries, the contribution of freedom to happiness does not deviate dramatically from the average.

This symmetrical distribution could imply that while variations in freedom’s contribution to happiness exist across different nations, extreme outliers are less common, and most countries cluster around a central level of freedom’s contribution to overall happiness.

3.5.4.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Freedom`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Freedom",
       x = "Happiness Score",
       y = "Freedom Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

The plot shows a positive relationship between “Happiness Score” and “Freedom”, where countries with higher levels of freedom tend to report higher happiness scores. The slope of the linear model line visually confirm this relationship.

3.6 Outliers

The next step is to handle the outliers of explanatory variables present in our dataset.

whr2024_long <- whr2024 %>%
  pivot_longer(cols = c(`Expl. by Social`, `Expl. by Economy`, `Expl. by Health`, `Expl. by Freedom`, `Expl. by Trust`),
               names_to = "Category",
               values_to = "Value") %>%
  mutate(Category = factor(Category,
                           levels = c("Expl. by Social", "Expl. by Economy", "Expl. by Health", "Expl. by Freedom", "Expl. by Trust"),
                           labels = c("Social", "Economy", "Health", "Freedom", "Trust")))

ggplot(whr2024_long, aes(x = Category, y = Value, fill = Category)) +
  geom_boxplot() +
  scale_fill_paletteer_d("ggthemes::excel_Retrospect") + 
  labs(title = "Outliers", x = "Explanatory Variables", y = "Value") +
  theme_minimal()

Outliers in Social, Economy, Health, Freedom and Trust explanatory variables will be removed before developing our prediction models.

3.7 Further Data Cleaning & Wrangling

Removing Outliers

# Capping function
cap <- function(x){
  quantiles <- quantile(x, c(0.5, .25, .75, .95))
  x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
x}

# Removing outliers
whr2024$`Expl. by Social` <- whr2024$`Expl. by Social` %>% 
  cap()
whr2024$`Expl. by Economy` <- whr2024$`Expl. by Economy` %>% 
  cap()
whr2024$`Expl. by Health` <- whr2024$`Expl. by Health` %>% 
  cap()
whr2024$`Expl. by Freedom` <- whr2024$`Expl. by Freedom` %>% 
  cap()
whr2024$`Expl. by Trust` <- whr2024$`Expl. by Trust` %>% 
  cap

The WHR dataset will be divided into two subsets with an arbitrary partition of 0.7/0.3:

“whr_train,” serving for training, developing, then selecting the best algorithm;
“whr_test” that will be used for evaluating the RMSE of the selected algorithm.

# test_whr set will be 30% of whr_2024 data set
set.seed(1)

test_index <- createDataPartition(y = whr2024$`Happiness Score`, times = 1, p = 0.3, list = FALSE)

whr_train <- whr2024[-test_index,] 
whr_test <- whr2024[test_index,]

In the next section, we build several models to determine the best way to predict the Happiness Score.

4 Modelling Approaches

4.1 RMSE

The RMSE measures the average difference between a statistical model’s predicted values and the actual values. Mathematically, it is the standard deviation of the residuals which represent the distance between the regression line and the data points. RMSE quantifies how dispersed these residuals are, revealing how tightly the observed data clusters around the predicted values.

We will save the following RMSE/Loss function in our environment: \[ RMSE = \sqrt{\frac{1}{N}\displaystyle\sum_{u,i} (\hat{y}_{u,i}-y_{u,i})^{2}} \]

RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

4.2 Model 1: Benchmark

The benchmark model we will use is to naively predict all ratings as the average rating of the training set.

The formula we will use is defined as follow: \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating and \(\epsilon_{u, i}\) = independent errors centred at 0. \[Y_{u, i} = \mu + \epsilon_{u, i}\]

#Define formula / mu
whr_mu <- mean(whr_train$`Happiness Score`)

# Extract RMSE from the model's result
RMSE_1 <- RMSE(whr_train$`Happiness Score`, whr_mu)

# Create a table with the result
result1_table <- tibble(Model = "Benchmark", RMSE = RMSE_1)
result1_table %>% 
  knitr::kable()

Model	RMSE
Benchmark	1.205863

The RMSE for this model is 1.2058, which is unsurprisingly high.This will nevertheless be used as a benchmark for comparison with the different models we will now design

4.3 Model 2: The Generalized Linear Model

The General Linear Model (GLM) allows the response variable to have a distribution other than the normal distribution (e.g., binomial for logistic regression, Poisson for count data) and by using a link function to relate the mean of the observed response to the linear predictor (combination of coefficients and predictors).

A GLM using all four explanatory variables with the highest correlation is fitted using the caret package. score is predicted using all seven factors.

# Define formula
formula <- `Happiness Score` ~ `Expl. by Economy` + `Expl. by Social` + `Expl. by Health` + `Expl. by Freedom`

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Train the model using the formula
set.seed(123)
glm_model <- train(formula, data = whr_train, method = "glm", trControl = train_control)

# Extract RMSE from glm model's results
RMSE_glm <- min(glm_model$results$RMSE)

# Create a table with the result
result2_table <- tibble(Model = "GLM", RMSE = RMSE_glm)
result2_table %>%
  knitr::kable()

Model	RMSE
GLM	0.5864053

4.4 Model 3: GLM Model without the “Freedom” Explanatory Variable

In an attempt to improve the model, we will remove the “Freedom” Explanatory Variable (Freedom to make life choices) as this variable is the least correlated to the “Happiness Score”

# Define formula
formula2 <- `Happiness Score` ~ `Expl. by Economy` + `Expl. by Social` + `Expl. by Health`

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Train the model using the formula
set.seed(123)
glm_model <- train(formula2, data = whr_train, method = "glm", trControl = train_control)

# Extract RMSE from glm model's results
RMSE_glm <- min(glm_model$results$RMSE)

# Create a table with the result
result3_table <- tibble(Model = "GLM w/o Freedom", RMSE = RMSE_glm)
result3_table %>%
  knitr::kable()

Model	RMSE
GLM w/o Freedom	0.6397102

With a RMSE at .634, this was obviously a wrong intuition to remove the “Freedom” explanatory variable. We will therefore keep this variable in the following models and try to further improve it.

4.5 Model 4: Regularised GLM Model

The regularized GLM adds penalty terms to the model’s loss function to constrain the size of coefficients. By adding these penalty terms, regularized GLMs reduce the risk of overfitting, making them more robust, especially when dealing with datasets with many predictors or when predictors are correlated.

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train the model using glmnet (regularized GLM)
set.seed(123) 
glmnet_model <- train(formula, data = whr_train, 
                      method = "glmnet", 
                      trControl = train_control)

# Extract RMSE from the trained model's results
RMSE_glmnet <- min(glmnet_model$results$RMSE)

# Create a table with the result
result4_table <- tibble(Model = "Regularized GLM", RMSE = RMSE_glmnet)
  result4_table %>% 
  knitr::kable()

Model	RMSE
Regularized GLM	0.5755984

5 Results

Summary of of the different models’ RMSEs we developed along this project:

results_table <- tibble(Model = c("Benchmark", "GLM", "GLM w/o Freedom", "Regularised GLM"), RMSE = c(result1_table$RMSE, result2_table$RMSE, result3_table$RMSE, result4_table$RMSE))

results_table %>% 
  knitr::kable()

Model	RMSE
Benchmark	1.2058628
GLM	0.5864053
GLM w/o Freedom	0.6397102
Regularised GLM	0.5755984

The Regularised GLM model b, with a RMSE at .5756 is the the best model we developed. We will now apply it to the Test Set: whr_test.

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train the model using glmnet (regularized GLM)
set.seed(123) 
glmnet_model <- train(formula, data = whr_test, 
                      method = "glmnet", 
                      trControl = train_control)

# Extract RMSE from the trained model's results
RMSE_glmnet <- min(glmnet_model$results$RMSE)

# Create a table with the result
result5_table <- tibble(Model = "Regularized GLM", RMSE = RMSE_glmnet)
  result5_table %>% 
  knitr::kable()

Model	RMSE
Regularized GLM	0.5970947

With a RMSE at .5970 we can say we only achieved a passable result considering that an acceptable RMSE should be comprised between .2 and .5 for an accurate the model to predict the actual value.

6 Conclusion

While we built a machine learning model to predict the Happiness Score in the WHR Dataset, it is clear that we could have gone further at building a stronger model by taking in account other perspective into account. We could have as well developed more complex models based on, for example, Distributed Random Forest or many more. We focused on the most basic model development. I nevertheless hope we learned some interesting information about the World Happiness Score.

7 References

World Happiness Report 2024: https://worldhappiness.report/ed/2024/

Kaggle: https://www.kaggle.com/datasets/sazidthe1/global-happiness-scores-and-factors

Citation: Helliwell, J. F., Layard, R., Sachs, J. D., De Neve, J.-E., Aknin, L. B., & Wang, S. (Eds.). (2024). World Happiness Report 2024. University of Oxford: Wellbeing Research Centre.

HarvardX PH125.9x | Data Science 2024 World Happiness Report Capstone Project

Thierry Laurent-Badin

April 2024