1 Introduction

Happiness… How to define it ? How to measure it ? Quite a complex question considering how different cultures might define happiness as well as how complicated it is to measure such a subjective concept.

Based on observed data of six variables (GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption) and estimates of their associations with the Happiness Score, the World Happiness Reports explains the variation across countries.

The World Happiness Report is a partnership of Gallup, the Oxford Wellbeing Research Centre, the UN Sustainable Development Solutions Network, and the WHR’s Editorial Board. The report is produced under the editorial control of the WHR Editorial Board.

The World Happiness Report reflects a worldwide demand for more attention to happiness and well-being as criteria for government policy. It reviews the state of happiness in the world today and shows how the science of happiness explains personal and national variations in happiness.

In this second and final Capstone Project for the HarvardX Professional Certificate in Data Science (PH125.9x), we will explore and visually examine the 2024 World Happiness Report to develop a machine-learning model that shall predict the Happiness Score on a validation set with a Root Mean Squared Error (RMSE) between 0.2 an 0.5. Four models will be developed and compared based on their resulting RMSE.

2 Loading the data & Initial Data Wrangling

Importing Data from the World Happiness Report official website .

# Import dataset
url <- "https://happiness-report.s3.amazonaws.com/2024/DataForFigure2.1+with+sub+bars+2024.xls"
destfile <- "WHR_2024.xls"
curl::curl_download(url, destfile)
whr2024 <- read_excel(destfile)

Modify columns’ names for better readability.

whr2024 <- whr2024 %>% 
  rename("Country" = "Country name", "Happiness Score" = "Ladder score", "Expl. by Economy" = "Explained by: Log GDP per capita", "Expl. by Social" = "Explained by: Social support", "Expl. by Health" = "Explained by: Healthy life expectancy", "Expl. by Freedom" = "Explained by: Freedom to make life choices", "Expl. by Generosity" = "Explained by: Generosity", "Expl. by Trust" = "Explained by: Perceptions of corruption")

Adding Country Code column to the Data Set in order to plot maps.

whr2024 <- whr2024 %>% 
  mutate("Country Code" = countrycode(
    sourcevar = whr2024$`Country`,
    origin = 'country.name',
    destination = 'wb'
  ))

whr2024 <- whr2024 %>% 
  relocate("Country Code", .after = "Country")

Rounding all numbers at 4 decimals

whr2024 <- whr2024 %>%
  mutate(across(.cols = 3:12, .fns = ~round(.x, digits = 4)))

Eliminate unnecessary columns

whr2024 <- whr2024 %>% 
  select(!c(upperwhisker, lowerwhisker))

3 Exploratory Data Analysis

The objective of this EDA is to:

We will first check the whr2024 Data Set, its 5 first head rows, its structure info. We will then check its related summary statistics, determine what “explanatory variables” have the highest correlation coefficient with the “Happiness Score” and their related distribution. Finally, we’ll check whether “Explanatory Variables” have outliers.

3.1 Head Rows

kable(head(whr2024, 5), booktabs = TRUE) %>%
  kable_styling(position = "center", latex_options = c("striped", "scale_down")) 
Country Country Code Happiness Score Expl. by Economy Expl. by Social Expl. by Health Expl. by Freedom Expl. by Generosity Expl. by Trust Dystopia + residual
Finland FIN 7.7407 1.8441 1.5724 0.6948 0.8593 0.1417 0.5462 2.0824
Denmark DNK 7.5827 1.9078 1.5204 0.6989 0.8227 0.2036 0.5484 1.8809
Iceland ISL 7.5251 1.8807 1.6165 0.7183 0.8185 0.2583 0.1825 2.0502
Sweden SWE 7.3441 1.8781 1.5008 0.7239 0.8383 0.2215 0.5238 1.6577
Israel ISR 7.3411 1.8029 1.5128 0.7398 0.6415 0.1532 0.1928 2.2980

3.2 Data Set Structure

str(whr2024)
## tibble [143 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Country            : chr [1:143] "Finland" "Denmark" "Iceland" "Sweden" ...
##  $ Country Code       : chr [1:143] "FIN" "DNK" "ISL" "SWE" ...
##  $ Happiness Score    : num [1:143] 7.74 7.58 7.53 7.34 7.34 ...
##  $ Expl. by Economy   : num [1:143] 1.84 1.91 1.88 1.88 1.8 ...
##  $ Expl. by Social    : num [1:143] 1.57 1.52 1.62 1.5 1.51 ...
##  $ Expl. by Health    : num [1:143] 0.695 0.699 0.718 0.724 0.74 ...
##  $ Expl. by Freedom   : num [1:143] 0.859 0.823 0.818 0.838 0.641 ...
##  $ Expl. by Generosity: num [1:143] 0.142 0.204 0.258 0.222 0.153 ...
##  $ Expl. by Trust     : num [1:143] 0.546 0.548 0.182 0.524 0.193 ...
##  $ Dystopia + residual: num [1:143] 2.08 1.88 2.05 1.66 2.3 ...

The Data Set contains 20 variables:

  • Country : Character. Country Name;

  • Country Code : Character. World Bank Country Codes;

  • Happiness Score : Numeric. XXXXXXX

  • Expl. by Economy : Numeric. How much does the Economy variable contributes to the Happiness Score. GDP per capita. Gross Domestic Product, or how much each country produces, divided by the number of people in the country. GDP per capita gives information about the size of the economy and how the economy is performing;

  • Expl. by Social : Numeric. Explanatory variable. How much does the Social Support variable contributes to the Happiness Score. Social support, or having someone to count on in times of trouble. “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”;

  • Expl. by Health : Numeric. Explanatory variable. How much does the Health variable contributes to the Happiness Score. Healthy Life Expectancy. More than life expectancy, how is your physical and mental health? Mental health is a key component of subjective well-being and is also a risk factor for future physical health and longevity. Mental health influences and drives a number of individual choices, behaviours, and outcomes;

  • Expl. by Freedom : Numeric. Explanatory variable. How much does the Freedom variable contributes to the Happiness Score. Freedom to make Life Choices. “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” This also includes Human Rights. Inherent to all human beings, regardless of race, sex, nationality, ethnicity, language, religion, or any other status. Human rights include the right to life and liberty, freedom from slavery and torture, freedom of opinion and expression, the right to work and education, and many more. Everyone is entitled to these rights without discrimination;

  • Expl. by Generosity : Numeric. Explanatory variable. How much does the Generosity variable contributes to the Happiness Score. “Have you donated money to a charity in the past month?” A clear marker for a sense of positive community engagement and a central way that humans connect with each other. Research shows that in all cultures, starting in early childhood, people are drawn to behaviours which benefit other people;

  • Expl. by Trust : Numeric. Explanatory variable. How much does the Trust variable contributes to the Happiness Score. “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” Do people trust their governments and have trust in the benevolence of others?

  • Dystopia + residual : Numeric. Explanatory variable. How much does the Dystopia + residual variable contributes to the Happiness Score.

3.3 Summary Statistics

summary_whr <- whr2024 %>% 
  select(!c(Country, "Country Code"))

summary_whr <- summary(summary_whr)

kable(summary_whr, booktabs = TRUE, caption = "Summary Statistics") %>%
  kable_styling(position = "center", latex_options = c("scale_down", "striped"))
Summary Statistics
Happiness Score Expl. by Economy Expl. by Social Expl. by Health Expl. by Freedom Expl. by Generosity Expl. by Trust Dystopia + residual
Min. :1.721 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :-0.0728
1st Qu.:4.726 1st Qu.:1.077 1st Qu.:0.9214 1st Qu.:0.3981 1st Qu.:0.5274 1st Qu.:0.09115 1st Qu.:0.06842 1st Qu.: 1.3079
Median :5.785 Median :1.431 Median :1.2372 Median :0.5492 Median :0.6411 Median :0.13630 Median :0.12015 Median : 1.6444
Mean :5.528 Mean :1.379 Mean :1.1343 Mean :0.5208 Mean :0.6206 Mean :0.14633 Mean :0.15413 Mean : 1.5759
3rd Qu.:6.416 3rd Qu.:1.741 3rd Qu.:1.3834 3rd Qu.:0.6482 3rd Qu.:0.7363 3rd Qu.:0.19268 3rd Qu.:0.19350 3rd Qu.: 1.8816
Max. :7.741 Max. :2.140 Max. :1.6165 Max. :0.8573 Max. :0.8633 Max. :0.40080 Max. :0.57510 Max. : 2.9979
NA NA’s :3 NA’s :3 NA’s :3 NA’s :3 NA’s :3 NA’s :3 NA’s :3

Check NAs

print("Row and Col positions of NA values") 
## [1] "Row and Col positions of NA values"
which(is.na(whr2024), arr.ind = TRUE)
##       row col
##  [1,]  62   4
##  [2,]  88   4
##  [3,] 103   4
##  [4,]  62   5
##  [5,]  88   5
##  [6,] 103   5
##  [7,]  62   6
##  [8,]  88   6
##  [9,] 103   6
## [10,]  62   7
## [11,]  88   7
## [12,] 103   7
## [13,]  62   8
## [14,]  88   8
## [15,] 103   8
## [16,]  62   9
## [17,]  88   9
## [18,] 103   9
## [19,]  62  10
## [20,]  88  10
## [21,] 103  10

Bahrain, Tajikistan and Palestine don’t have explanatory variables info and will therefore be removed from the dataset.

whr2024 <- whr2024[-c(62, 88, 103),]

3.4 Correlation

corr <- whr2024 %>%
  select("Happiness Score", "Expl. by Economy", "Expl. by Social", "Expl. by Health", "Expl. by Freedom", "Expl. by Generosity", "Expl. by Trust", "Dystopia + residual")

whr_corr = cor(corr)

corrplot(whr_corr,
         method = "circle",
         outline = TRUE,
         diag = FALSE,
         addCoef.col = TRUE,
         number.cex = .8,
         type = "upper",
         tl.col = "black")

Highest level of correlation between Happiness Score and:

  1. Social Support (.81);

  2. Economy (.77);

  3. Health (.76);

  4. Freedom (.64).

3.5 Key Explanatory Variables Visualization & Distribution

3.5.1 Social Support

3.5.1.1 Mapping the “Social Support” explanatory variable.

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Social`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Social Support' Contribution to Happiness Score",
    geo = list(
      showcountries = TRUE,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.1.2 Top and bottom 5 Countries

# Defining colour palette for "top/bottom 5" tables
color_palette <- rev(paletteer_d("RColorBrewer::RdYlGn", n = 11))

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Social`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Social`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Social`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Social`, Group) %>%
  mutate(`Expl. by Social` = cell_spec(`Expl. by Social`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Social Support Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)
Top & Bottom Countries / Social Support Explanatory Variable
Country Expl. by Social Group
Iceland 1.6165 Top 5
Finland 1.5724 Top 5
Slovakia 1.5403 Top 5
Hungary 1.5282 Top 5
Estonia 1.5271 Top 5
Malawi 0.4104 Bottom 5
Comoros 0.328 Bottom 5
Bangladesh 0.2492 Bottom 5
Benin 0.1283 Bottom 5
Afghanistan 0 Bottom 5

3.5.1.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_social <- mean(whr2024$`Expl. by Social`, na.rm = TRUE)
median_social <- median(whr2024$`Expl. by Social`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Social`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Social Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_social, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_social, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Social” variable suggests that while most countries have high levels of social support (as indicated by a median above 1), a number of countries with lower social support scores exist, pulling the average (mean) down and causing a slight left skew in the distribution.

3.5.1.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Social`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Social Support",
       x = "Happiness Score",
       y = "Social Support Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

The positive correlation between “Happiness Score” and “Social” suggests a significant association where countries with higher levels of social support tend to report higher happiness scores. This relationship, underscored by the scatter plot and linear model trend line, points towards social support as an important factor in influencing overall happiness. The mean and median values for both variables provide a central tendency context, with the plot illustrating a broad range of scores across countries yet maintaining a discernible positive trend.

3.5.2 Economy

3.5.2.1 Mapping the “Economy” explanatory variable

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Economy`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Economy' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.2.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Economy`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Economy`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Economy`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Economy`, Group) %>%
  mutate(`Expl. by Economy` = cell_spec(`Expl. by Economy`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Economy Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)
Top & Bottom Countries / Economy Explanatory Variable
Country Expl. by Economy Group
Luxembourg 2.1405 Top 5
Ireland 2.1288 Top 5
Singapore 2.1181 Top 5
United Arab Emirates 1.9831 Top 5
Switzerland 1.9703 Top 5
Chad 0.6033 Bottom 5
Niger 0.5728 Bottom 5
Mozambique 0.5598 Bottom 5
Congo (Kinshasa) 0.5337 Bottom 5
Venezuela 0 Bottom 5

3.5.2.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_economy <- mean(whr2024$`Expl. by Economy`, na.rm = TRUE)
median_economy <- median(whr2024$`Expl. by Economy`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Economy`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Economy Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_economy, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_economy, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Economy” variable suggests a relatively high and concentrated level of economic contribution to happiness across different countries, with a slight indication of variability or outliers on the lower end. The histogram visually convey this distribution’s characteristics, indicating both the general trend towards higher economic contributions to happiness and the presence of variability across countries.

3.5.2.4 Plotting the variable distribution vs. Happiness Score

#Plotting distribution 
ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Economy`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Economy",
       x = "Happiness Score",
       y = "Economy Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The positive correlation between “Happiness Score” and “Economy” suggests a significant association where countries with higher levels of economic contributions tend to report higher happiness scores. This relationship, underscored by the scatter plot and linear model trend line, points towards the economy as an important factor in influencing overall happiness. The mean and median values for both variables provide a central tendency context, with the plot illustrating a broad range of scores across countries yet maintaining a discernible positive trend.

3.5.3 Health

3.5.3.1 Mapping the “Health” explanatory variable.

# Plotting map
plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Health`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Health' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.3.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Health`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Health`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Health`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Health`, Group) %>%
  mutate(`Expl. by Health` = cell_spec(`Expl. by Health`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Health Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)
Top & Bottom Countries / Health Explanatory Variable
Country Expl. by Health Group
Hong Kong S.A.R. of China 0.8573 Top 5
Japan 0.7848 Top 5
South Korea 0.7696 Top 5
Singapore 0.7689 Top 5
Switzerland 0.7467 Top 5
Zimbabwe 0.2321 Bottom 5
Chad 0.1985 Bottom 5
Eswatini 0.1759 Bottom 5
Mozambique 0.1556 Bottom 5
Lesotho 0 Bottom 5

3.5.3.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_health <- mean(whr2024$`Expl. by Health`, na.rm = TRUE)
median_health <- median(whr2024$`Expl. by Health`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Health`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Health Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_health, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_health, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of “Health” across countries is showing a generally high level of health’s contribution to happiness, with potential slight variations indicating the presence of outliers or a modest range of scores. The close values of mean and median, coupled with the visualization’s aesthetic choices, emphasize a relatively symmetrical distribution with slight deviations.

3.5.3.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Health`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Health",
       x = "Happiness Score",
       y = "Health Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot shows a positive relationship between “Health” and “Happiness Score”, signified by the upward slope of the linear model line. This implies that countries with better health metrics tend to report higher happiness scores, suggesting a significant role of health in influencing a country’s overall happiness.

3.5.4 Freedom

3.5.4.1 Mapping the “Freedom” explanatory variable.

plot_ly(whr2024,
              type = 'choropleth',
              locations = ~`Country Code`,
              z = whr2024$`Expl. by Freedom`, 
              text = ~Country,
              reversescale = F) %>% 
  layout(
    title = "'Freedom' Contribution to Happiness Score",
    geo = list(
      showcountries = T,
      countrycolor = toRGB("gray33"),
      showocean = TRUE,
      oceancolor = toRGB('lightblue1'), 
      projection = list(
        type = 'natural earth')))

3.5.4.2 Top and bottom 5 Countries

# Create and output the table
whr2024 %>%
  mutate(Group = if_else(rank(-`Expl. by Freedom`) <= 5, "Top 5", 
                         if_else(rank(`Expl. by Freedom`) <= 5, "Bottom 5", NA_character_))) %>%
  filter(!is.na(Group)) %>%
  arrange(desc(`Expl. by Freedom`)) %>%
  mutate(Group = factor(Group, levels = c("Top 5", "Bottom 5"))) %>%
  select(Country, `Expl. by Freedom`, Group) %>%
  mutate(`Expl. by Freedom` = cell_spec(`Expl. by Freedom`, color = "white", bold = TRUE,
                                        background = color_palette[as.integer(cut(1:n(), breaks = length(color_palette), labels = FALSE))])) %>%
  kable("html", escape = FALSE, align = 'c', caption = "Top & Bottom Countries / Freedom Explanatory Variable") %>%
  kable_classic("striped", full_width = FALSE)
Top & Bottom Countries / Freedom Explanatory Variable
Country Expl. by Freedom Group
Cambodia 0.8633 Top 5
Finland 0.8593 Top 5
Vietnam 0.8426 Top 5
Sweden 0.8383 Top 5
Norway 0.8354 Top 5
Algeria 0.247 Bottom 5
Turkiye 0.2021 Bottom 5
Lebanon 0.1732 Bottom 5
Comoros 0.1721 Bottom 5
Afghanistan 0 Bottom 5

3.5.4.3 Plotting the variable distribution

# Pre-calculate mean and median
mean_free <- mean(whr2024$`Expl. by Freedom`, na.rm = TRUE)
median_free <- median(whr2024$`Expl. by Freedom`, na.rm = TRUE)

# Plot
ggplot(whr2024, aes(x = `Expl. by Freedom`)) +
  geom_histogram(bins = 20, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Freedom Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_vline(aes(xintercept = mean_free, color = "Mean"), lwd = 1) +
  geom_vline(aes(xintercept = median_free, color = "Median"), lwd = 1) +
  scale_color_manual(values = c("Mean" = "dodgerblue4", "Median" = "darkorange4"),
                     name = "", 
                     guide = guide_legend(override.aes = list(linetype = c("solid", "solid")))) +
  theme_minimal() +
  guides(color = guide_legend())

The distribution of the “Freedom” variable within the whr2024 dataset reflects a range of freedom contributions to happiness across countries, with a generally symmetrical distribution around the central values of mean and median. The lack of a significant difference between the mean and median suggests minimal skewness, indicating that for most countries, the contribution of freedom to happiness does not deviate dramatically from the average.

This symmetrical distribution could imply that while variations in freedom’s contribution to happiness exist across different nations, extreme outliers are less common, and most countries cluster around a central level of freedom’s contribution to overall happiness.

3.5.4.4 Plotting the variable distribution vs. Happiness Score

ggplot(whr2024, aes(x = `Happiness Score`, y = `Expl. by Freedom`)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Happiness Score vs. Freedom",
       x = "Happiness Score",
       y = "Freedom Explanatory Variable") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(method = "lm", color = "black") + 
  theme_minimal()

The plot shows a positive relationship between “Happiness Score” and “Freedom”, where countries with higher levels of freedom tend to report higher happiness scores. The slope of the linear model line visually confirm this relationship.

3.6 Outliers

The next step is to handle the outliers of explanatory variables present in our dataset.

whr2024_long <- whr2024 %>%
  pivot_longer(cols = c(`Expl. by Social`, `Expl. by Economy`, `Expl. by Health`, `Expl. by Freedom`, `Expl. by Trust`),
               names_to = "Category",
               values_to = "Value") %>%
  mutate(Category = factor(Category,
                           levels = c("Expl. by Social", "Expl. by Economy", "Expl. by Health", "Expl. by Freedom", "Expl. by Trust"),
                           labels = c("Social", "Economy", "Health", "Freedom", "Trust")))

ggplot(whr2024_long, aes(x = Category, y = Value, fill = Category)) +
  geom_boxplot() +
  scale_fill_paletteer_d("ggthemes::excel_Retrospect") + 
  labs(title = "Outliers", x = "Explanatory Variables", y = "Value") +
  theme_minimal()

Outliers in Social, Economy, Health, Freedom and Trust explanatory variables will be removed before developing our prediction models.

3.7 Further Data Cleaning & Wrangling

Removing Outliers

# Capping function
cap <- function(x){
  quantiles <- quantile(x, c(0.5, .25, .75, .95))
  x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
x}

# Removing outliers
whr2024$`Expl. by Social` <- whr2024$`Expl. by Social` %>% 
  cap()
whr2024$`Expl. by Economy` <- whr2024$`Expl. by Economy` %>% 
  cap()
whr2024$`Expl. by Health` <- whr2024$`Expl. by Health` %>% 
  cap()
whr2024$`Expl. by Freedom` <- whr2024$`Expl. by Freedom` %>% 
  cap()
whr2024$`Expl. by Trust` <- whr2024$`Expl. by Trust` %>% 
  cap

The WHR dataset will be divided into two subsets with an arbitrary partition of 0.7/0.3:

  • “whr_train,” serving for training, developing, then selecting the best algorithm;

  • “whr_test” that will be used for evaluating the RMSE of the selected algorithm.

# test_whr set will be 30% of whr_2024 data set
set.seed(1)

test_index <- createDataPartition(y = whr2024$`Happiness Score`, times = 1, p = 0.3, list = FALSE)

whr_train <- whr2024[-test_index,] 
whr_test <- whr2024[test_index,] 

In the next section, we build several models to determine the best way to predict the Happiness Score.

4 Modelling Approaches

4.1 RMSE

The RMSE measures the average difference between a statistical model’s predicted values and the actual values. Mathematically, it is the standard deviation of the residuals which represent the distance between the regression line and the data points. RMSE quantifies how dispersed these residuals are, revealing how tightly the observed data clusters around the predicted values.

We will save the following RMSE/Loss function in our environment: \[ RMSE = \sqrt{\frac{1}{N}\displaystyle\sum_{u,i} (\hat{y}_{u,i}-y_{u,i})^{2}} \]

RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

4.2 Model 1: Benchmark

The benchmark model we will use is to naively predict all ratings as the average rating of the training set.

The formula we will use is defined as follow: \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating and \(\epsilon_{u, i}\) = independent errors centred at 0. \[Y_{u, i} = \mu + \epsilon_{u, i}\]

#Define formula / mu
whr_mu <- mean(whr_train$`Happiness Score`)

# Extract RMSE from the model's result
RMSE_1 <- RMSE(whr_train$`Happiness Score`, whr_mu)

# Create a table with the result
result1_table <- tibble(Model = "Benchmark", RMSE = RMSE_1)
result1_table %>% 
  knitr::kable()
Model RMSE
Benchmark 1.205863

The RMSE for this model is 1.2058, which is unsurprisingly high.This will nevertheless be used as a benchmark for comparison with the different models we will now design

4.3 Model 2: The Generalized Linear Model

The General Linear Model (GLM) allows the response variable to have a distribution other than the normal distribution (e.g., binomial for logistic regression, Poisson for count data) and by using a link function to relate the mean of the observed response to the linear predictor (combination of coefficients and predictors).

A GLM using all four explanatory variables with the highest correlation is fitted using the caret package. score is predicted using all seven factors.

# Define formula
formula <- `Happiness Score` ~ `Expl. by Economy` + `Expl. by Social` + `Expl. by Health` + `Expl. by Freedom`

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Train the model using the formula
set.seed(123)
glm_model <- train(formula, data = whr_train, method = "glm", trControl = train_control)

# Extract RMSE from glm model's results
RMSE_glm <- min(glm_model$results$RMSE)

# Create a table with the result
result2_table <- tibble(Model = "GLM", RMSE = RMSE_glm)
result2_table %>%
  knitr::kable()
Model RMSE
GLM 0.5864053

4.4 Model 3: GLM Model without the “Freedom” Explanatory Variable

In an attempt to improve the model, we will remove the “Freedom” Explanatory Variable (Freedom to make life choices) as this variable is the least correlated to the “Happiness Score”

# Define formula
formula2 <- `Happiness Score` ~ `Expl. by Economy` + `Expl. by Social` + `Expl. by Health`

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Train the model using the formula
set.seed(123)
glm_model <- train(formula2, data = whr_train, method = "glm", trControl = train_control)

# Extract RMSE from glm model's results
RMSE_glm <- min(glm_model$results$RMSE)

# Create a table with the result
result3_table <- tibble(Model = "GLM w/o Freedom", RMSE = RMSE_glm)
result3_table %>%
  knitr::kable()
Model RMSE
GLM w/o Freedom 0.6397102

With a RMSE at .634, this was obviously a wrong intuition to remove the “Freedom” explanatory variable. We will therefore keep this variable in the following models and try to further improve it.

4.5 Model 4: Regularised GLM Model

The regularized GLM adds penalty terms to the model’s loss function to constrain the size of coefficients. By adding these penalty terms, regularized GLMs reduce the risk of overfitting, making them more robust, especially when dealing with datasets with many predictors or when predictors are correlated.

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train the model using glmnet (regularized GLM)
set.seed(123) 
glmnet_model <- train(formula, data = whr_train, 
                      method = "glmnet", 
                      trControl = train_control)

# Extract RMSE from the trained model's results
RMSE_glmnet <- min(glmnet_model$results$RMSE)

# Create a table with the result
result4_table <- tibble(Model = "Regularized GLM", RMSE = RMSE_glmnet)
  result4_table %>% 
  knitr::kable()
Model RMSE
Regularized GLM 0.5755984

5 Results

Summary of of the different models’ RMSEs we developed along this project:

results_table <- tibble(Model = c("Benchmark", "GLM", "GLM w/o Freedom", "Regularised GLM"), RMSE = c(result1_table$RMSE, result2_table$RMSE, result3_table$RMSE, result4_table$RMSE))

results_table %>% 
  knitr::kable()
Model RMSE
Benchmark 1.2058628
GLM 0.5864053
GLM w/o Freedom 0.6397102
Regularised GLM 0.5755984

The Regularised GLM model b, with a RMSE at .5756 is the the best model we developed. We will now apply it to the Test Set: whr_test.

# Configure training control for cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train the model using glmnet (regularized GLM)
set.seed(123) 
glmnet_model <- train(formula, data = whr_test, 
                      method = "glmnet", 
                      trControl = train_control)

# Extract RMSE from the trained model's results
RMSE_glmnet <- min(glmnet_model$results$RMSE)

# Create a table with the result
result5_table <- tibble(Model = "Regularized GLM", RMSE = RMSE_glmnet)
  result5_table %>% 
  knitr::kable()
Model RMSE
Regularized GLM 0.5970947

With a RMSE at .5970 we can say we only achieved a passable result considering that an acceptable RMSE should be comprised between .2 and .5 for an accurate the model to predict the actual value.

6 Conclusion

While we built a machine learning model to predict the Happiness Score in the WHR Dataset, it is clear that we could have gone further at building a stronger model by taking in account other perspective into account. We could have as well developed more complex models based on, for example, Distributed Random Forest or many more. We focused on the most basic model development. I nevertheless hope we learned some interesting information about the World Happiness Score.

7 References

World Happiness Report 2024: https://worldhappiness.report/ed/2024/

Kaggle: https://www.kaggle.com/datasets/sazidthe1/global-happiness-scores-and-factors

Citation: Helliwell, J. F., Layard, R., Sachs, J. D., De Neve, J.-E., Aknin, L. B., & Wang, S. (Eds.). (2024). World Happiness Report 2024. University of Oxford: Wellbeing Research Centre.