1 Objective

The goal of this project is to analyse the World Happiness Report 2024 dataset, which covers 143 countries, and to build predictive models capable of estimating a country’s happiness score (Ladder Score) from its socio-economic indicators.

The specific objectives are:

Data Cleaning: Load the dataset, inspect its structure, remove irrelevant or non-predictor columns, identify and handle missing values, check for duplicates, and prepare a lean, clean dataset ready for analysis.
Exploratory Data Analysis (EDA): Perform a thorough examination of the data through statistical summaries, visualisations, correlation analysis, and trend analysis. Each visualisation is accompanied by inferences and justifications.
Detailed Analysis: Apply and evaluate four machine learning regression models :— Multiple Linear Regression, Ridge Regression, Lasso Regression, and Random Forest — and compare their predictive performance using MSE, RMSE, R², and MAE.

# ── Core data wrangling ──────────────────────────────────────────────────────
library(tidyverse)
library(dplyr)
library(tidyr)

# ── Visualisation ────────────────────────────────────────────────────────────
library(ggplot2)
library(corrplot)
library(scales)
library(gridExtra)
library(ggrepel)

# ── Machine learning ─────────────────────────────────────────────────────────
library(caret)
library(glmnet)
library(randomForest)

# ── Tables ───────────────────────────────────────────────────────────────────
library(knitr)
library(kableExtra)

df_raw <- read.csv("WHR2024.csv", stringsAsFactors = FALSE)

# Rename columns for convenience
df <- df_raw %>%
  rename(
    country         = Country.name,
    happiness       = Ladder.score,
    upper_ci        = upperwhisker,
    lower_ci        = lowerwhisker,
    gdp             = Explained.by..Log.GDP.per.capita,
    social_support  = Explained.by..Social.support,
    life_expectancy = Explained.by..Healthy.life.expectancy,
    freedom         = Explained.by..Freedom.to.make.life.choices,
    generosity      = Explained.by..Generosity,
    corruption      = Explained.by..Perceptions.of.corruption,
    dystopia        = Dystopia...residual
  )

cat("Dataset loaded successfully:", nrow(df), "countries,", ncol(df), "variables\n")

## Dataset loaded successfully: 143 countries, 11 variables

2 Data Cleaning

2.1 Initial Inspection

Before any cleaning, the full raw structure of the dataset is examined to understand its dimensions, variable types, and general content.

head(df) %>%
  kable(digits = 3, caption = "First 6 Rows of the Raw WHR 2024 Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

First 6 Rows of the Raw WHR 2024 Dataset
country	happiness	upper_ci	lower_ci	gdp	social_support	life_expectancy	freedom	generosity	corruption	dystopia
Finland	7.741	7.815	7.667	1.844	1.572	0.695	0.859	0.142	0.546	2.082
Denmark	7.583	7.665	7.500	1.908	1.520	0.699	0.823	0.204	0.548	1.881
Iceland	7.525	7.618	7.433	1.881	1.617	0.718	0.819	0.258	0.182	2.050
Sweden	7.344	7.422	7.267	1.878	1.501	0.724	0.838	0.221	0.524	1.658
Israel	7.341	7.405	7.277	1.803	1.513	0.740	0.641	0.153	0.193	2.298
Netherlands	7.319	7.383	7.256	1.901	1.462	0.706	0.725	0.247	0.372	1.906

tail(df) %>%
  kable(digits = 3, caption = "Last 6 Rows of the Raw WHR 2024 Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

Last 6 Rows of the Raw WHR 2024 Dataset
	country	happiness	upper_ci	lower_ci	gdp	social_support	life_expectancy	freedom	generosity	corruption	dystopia
138	Zimbabwe	3.341	3.457	3.226	0.748	0.850	0.232	0.487	0.096	0.131	0.797
139	Congo (Kinshasa)	3.295	3.462	3.128	0.534	0.665	0.262	0.473	0.189	0.072	1.102
140	Sierra Leone	3.245	3.366	3.124	0.654	0.566	0.253	0.469	0.181	0.053	1.068
141	Lesotho	3.186	3.469	2.904	0.771	0.851	0.000	0.523	0.082	0.085	0.875
142	Lebanon	2.707	2.797	2.616	1.377	0.577	0.556	0.173	0.068	0.029	-0.073
143	Afghanistan	1.721	1.775	1.667	0.628	0.000	0.242	0.000	0.091	0.088	0.672

Inference: The dataset is sorted in descending order of happiness. The first rows are the happiest nations (Nordic countries) and the last rows are the least happy (conflict-affected or low-income countries). This confirms that the data is well-organised and requires no re-sorting.

str(df)

## 'data.frame':    143 obs. of  11 variables:
##  $ country        : chr  "Finland" "Denmark" "Iceland" "Sweden" ...
##  $ happiness      : num  7.74 7.58 7.53 7.34 7.34 ...
##  $ upper_ci       : num  7.82 7.66 7.62 7.42 7.41 ...
##  $ lower_ci       : num  7.67 7.5 7.43 7.27 7.28 ...
##  $ gdp            : num  1.84 1.91 1.88 1.88 1.8 ...
##  $ social_support : num  1.57 1.52 1.62 1.5 1.51 ...
##  $ life_expectancy: num  0.695 0.699 0.718 0.724 0.74 0.706 0.704 0.708 0.747 0.692 ...
##  $ freedom        : num  0.859 0.823 0.819 0.838 0.641 0.725 0.835 0.801 0.759 0.756 ...
##  $ generosity     : num  0.142 0.204 0.258 0.221 0.153 0.247 0.224 0.146 0.173 0.225 ...
##  $ corruption     : num  0.546 0.548 0.182 0.524 0.193 0.372 0.484 0.432 0.498 0.323 ...
##  $ dystopia       : num  2.08 1.88 2.05 1.66 2.3 ...

summary(df) %>%
  kable(caption = "Statistical Summary of All Variables (Before Cleaning)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Statistical Summary of All Variables (Before Cleaning)
country	happiness	upper_ci	lower_ci	gdp	social_support	life_expectancy	freedom	generosity	corruption	dystopia
Length:143	Min. :1.721	Min. :1.775	Min. :1.667	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000	Min. :-0.073
Class :character	1st Qu.:4.726	1st Qu.:4.846	1st Qu.:4.606	1st Qu.:1.078	1st Qu.:0.9217	1st Qu.:0.3980	1st Qu.:0.5275	1st Qu.:0.0910	1st Qu.:0.06875	1st Qu.: 1.308
Mode :character	Median :5.785	Median :5.895	Median :5.674	Median :1.431	Median :1.2375	Median :0.5495	Median :0.6410	Median :0.1365	Median :0.12050	Median : 1.645
NA	Mean :5.528	Mean :5.641	Mean :5.414	Mean :1.379	Mean :1.1343	Mean :0.5209	Mean :0.6206	Mean :0.1463	Mean :0.15412	Mean : 1.576
NA	3rd Qu.:6.416	3rd Qu.:6.508	3rd Qu.:6.319	3rd Qu.:1.742	3rd Qu.:1.3833	3rd Qu.:0.6485	3rd Qu.:0.7360	3rd Qu.:0.1925	3rd Qu.:0.19375	3rd Qu.: 1.882
NA	Max. :7.741	Max. :7.815	Max. :7.667	Max. :2.141	Max. :1.6170	Max. :0.8570	Max. :0.8630	Max. :0.4010	Max. :0.57500	Max. : 2.998
NA	NA	NA	NA	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3	NA’s :3

Inference: The dataset contains 1 categorical variable (country) and 10 numerical variables. The Ladder Score (happiness) ranges from 1.72 to 7.74 with a mean of approximately 5.5, indicating a wide global spread in life satisfaction. Several variables :— upper_ci, lower_ci, country, and dystopia , are identified as candidates for removal in the next step.

2.2 Removing Irrelevant Columns

The following four columns are dropped before any further analysis:

Column	Reason for Removal
`country`	A text identifier , not a numerical predictor(used only for labelling)
`upperwhisker` (upper_ci)	Confidence interval bound , a statistical artefact of the Ladder Score, not an independent predictor
`lowerwhisker` (lower_ci)	Confidence interval bound , a statistical artefact of the Ladder Score, not an independent predictor
`dystopia` (Dystopia + Residual)	Represents the unexplained remainder after the six factors are accounted for, including it would artificially inflate model performance since it is derived from the target variable

df <- df %>%
  select(-country, -upper_ci, -lower_ci, -dystopia)

cat("Columns remaining after removal:", ncol(df), "\n")

## Columns remaining after removal: 7

cat("Column names:", paste(names(df), collapse = ", "), "\n")

## Column names: happiness, gdp, social_support, life_expectancy, freedom, generosity, corruption

head(df) %>%
  kable(digits = 3, caption = "Dataset After Removing Irrelevant Columns") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

Dataset After Removing Irrelevant Columns
happiness	gdp	social_support	life_expectancy	freedom	generosity	corruption
7.741	1.844	1.572	0.695	0.859	0.142	0.546
7.583	1.908	1.520	0.699	0.823	0.204	0.548
7.525	1.881	1.617	0.718	0.819	0.258	0.182
7.344	1.878	1.501	0.724	0.838	0.221	0.524
7.341	1.803	1.513	0.740	0.641	0.153	0.193
7.319	1.901	1.462	0.706	0.725	0.247	0.372

Inference: The dataset now contains only the happiness score (target) and the six socio-economic predictor variables: GDP per capita, Social Support, Healthy Life Expectancy, Freedom, Generosity, and Perceptions of Corruption. This is a clean and meaningful feature set for predictive modelling.

2.3 Missing Value Analysis

missing_df <- df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(Missing_Count / nrow(df) * 100, 2))

missing_df %>%
  kable(caption = "Missing Value Count per Variable (After Column Removal)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(missing_df$Missing_Count > 0), background = "#fff3cd")

Missing Value Count per Variable (After Column Removal)
Variable	Missing_Count	Missing_Pct
happiness	0	0.0
gdp	3	2.1
social_support	3	2.1
life_expectancy	3	2.1
freedom	3	2.1
generosity	3	2.1
corruption	3	2.1

missing_df %>%
  filter(Missing_Count > 0) %>%
  ggplot(aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Variable)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(Missing_Count, " missing (", Missing_Pct, "%)")),
            vjust = -0.5, size = 4) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(limits = c(0, 5)) +
  labs(title = "Variables with Missing Values",
       x = "Variable", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Inference: Three countries (Bahrain, Tajikistan, and State of Palestine) have missing values across all six predictor columns. Since the missingness is complete across all predictors for these rows, imputation is not feasible, there is no partial information to base an estimate on. These rows will be removed entirely.

2.4 Duplicate Check

dup_count <- sum(duplicated(df))
cat("Number of duplicate rows:", dup_count, "\n")

## Number of duplicate rows: 0

Inference: No duplicate rows exist in the dataset.

2.5 Removing Missing Rows

df_clean <- df %>% drop_na()

cat("Rows before removal:", nrow(df), "\n")

## Rows before removal: 143

cat("Rows after removal: ", nrow(df_clean), "\n")

## Rows after removal:  140

cat("Rows removed:       ", nrow(df) - nrow(df_clean), "\n")

## Rows removed:        3

2.6 Final Clean Dataset

head(df_clean) %>%
  kable(digits = 3, caption = "Final Clean Dataset — Ready for EDA and Modelling") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

Final Clean Dataset — Ready for EDA and Modelling
happiness	gdp	social_support	life_expectancy	freedom	generosity	corruption
7.741	1.844	1.572	0.695	0.859	0.142	0.546
7.583	1.908	1.520	0.699	0.823	0.204	0.548
7.525	1.881	1.617	0.718	0.819	0.258	0.182
7.344	1.878	1.501	0.724	0.838	0.221	0.524
7.341	1.803	1.513	0.740	0.641	0.153	0.193
7.319	1.901	1.462	0.706	0.725	0.247	0.372

cat("Final dataset dimensions:", nrow(df_clean), "rows x", ncol(df_clean), "columns\n")

## Final dataset dimensions: 140 rows x 7 columns

cat("Variables:", paste(names(df_clean), collapse = ", "), "\n")

## Variables: happiness, gdp, social_support, life_expectancy, freedom, generosity, corruption

Inference: The final clean dataset contains 140 countries and 7 variables , the happiness score and six predictor variables. All columns are numeric, all rows are complete, and no irrelevant features remain. The dataset is ready for exploratory data analysis.

3 Exploratory Data Analysis

3.1 Distribution of the Happiness Score

h_mean   <- mean(df_clean$happiness)
h_median <- median(df_clean$happiness)
h_sd     <- sd(df_clean$happiness)
h_min    <- min(df_clean$happiness)
h_max    <- max(df_clean$happiness)

# Summary table
data.frame(
  Statistic = c("Mean", "Median", "Std Dev", "Minimum", "Maximum", "Range"),
  Value     = round(c(h_mean, h_median, h_sd, h_min, h_max, h_max - h_min), 3)
) %>%
  kable(caption = "Descriptive Statistics — Happiness Score") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics — Happiness Score
Statistic	Value
Mean	5.531
Median	5.800
Std Dev	1.181
Minimum	1.721
Maximum	7.741
Range	6.020

ggplot(df_clean, aes(x = happiness)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "#4e79a7", colour = "white", alpha = 0.85) +
  geom_density(colour = "#e15759", linewidth = 1.2) +
  geom_vline(xintercept = h_mean,   linetype = "dashed",
             colour = "#f28e2b", linewidth = 1) +
  geom_vline(xintercept = h_median, linetype = "dotted",
             colour = "#59a14f", linewidth = 1) +
  annotate("text", x = h_mean + 0.12, y = 0.32,
           label = paste("Mean =", round(h_mean, 2)),
           colour = "#f28e2b", hjust = 0, size = 4) +
  annotate("text", x = h_median - 0.12, y = 0.28,
           label = paste("Median =", round(h_median, 2)),
           colour = "#59a14f", hjust = 1, size = 4) +
  labs(title    = "Distribution of Global Happiness Scores",
       subtitle = "World Happiness Report 2024 — 140 Countries",
       x = "Happiness Score (Ladder Score)", y = "Density") +
  theme_minimal(base_size = 13)

The distribution of happiness scores is unimodal and approximately bell-shaped, though it exhibits a notable left skew. The mean sits below the median, a result of the low-scoring outliers (nations in extreme adversity) pulling the average downward. A standard deviation of confirms a significant global spread in life satisfaction. The primary concentration of countries falls within the 5–7 range, indicating that moderate-to-high happiness is the global norm. However, the expansive range of roughly 6 points—from a minimum of 1.72 to a maximum of 7.74—underscores a profound global inequality in well-being.

3.2 Distribution of Each Predictor Variable

predictors  <- c("gdp", "social_support", "life_expectancy",
                 "freedom", "generosity", "corruption")
pred_labels <- c("Log GDP per capita", "Social Support", "Healthy Life Expectancy",
                 "Freedom", "Generosity", "Perceptions of Corruption")

plots <- lapply(seq_along(predictors), function(i) {
  m <- mean(df_clean[[predictors[i]]], na.rm = TRUE)
  ggplot(df_clean, aes_string(x = predictors[i])) +
    geom_histogram(fill = "#76b7b2", colour = "white", bins = 20, alpha = 0.9) +
    geom_vline(xintercept = m, linetype = "dashed", colour = "#e15759", linewidth = 0.9) +
    labs(title = pred_labels[i], x = NULL, y = "Count") +
    theme_minimal(base_size = 11)
})
do.call(grid.arrange, c(plots, ncol = 2))

Inference:

GDP per capita shows a bimodal (twin-peaked) tendency, with a primary peak around 1.8 and a secondary peak near 1.3. A distinct left tail is present, pulled down by a small number of very low-income nations.
Social Support is strongly left-skewed, with the vast majority of countries reporting very high scores (clustered between 1.2 and 1.5), and a long tail of lower-scoring outliers on the left.
Healthy Life Expectancy exhibits a roughly bimodal pattern, reflecting a global health divide. One significant cluster of countries peaks around 0.6–0.7 (high healthcare access), while a smaller, separate peak appears near 0.3 (developing health outcomes).
Freedom is relatively widely distributed but remains left-skewed. While it has more spread than other metrics, most countries still cluster toward the higher end of the scale (0.6–0.8), with fewer reporting very low personal autonomy.
Generosity is distinctly right-skewed and concentrated at low values. Most countries contribute very little to happiness through recorded generosity, with only a tiny number of nations scoring as high-end outliers near 0.4.
Perceptions of Corruption is also right-skewed, low anti-corruption contribution dominates with few countries where low corruption meaningfully boosts happiness scores.

3.3 Descriptive Statistics for All Variables

df_clean %>%
  summary() %>%
  kable(caption = "Summary Statistics — Clean Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Summary Statistics — Clean Dataset
happiness	gdp	social_support	life_expectancy	freedom	generosity	corruption
Min. :1.721	Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.00000
1st Qu.:4.632	1st Qu.:1.078	1st Qu.:0.9217	1st Qu.:0.3980	1st Qu.:0.5275	1st Qu.:0.0910	1st Qu.:0.06875
Median :5.801	Median :1.431	Median :1.2375	Median :0.5495	Median :0.6410	Median :0.1365	Median :0.12050
Mean :5.531	Mean :1.379	Mean :1.1343	Mean :0.5209	Mean :0.6206	Mean :0.1463	Mean :0.15412
3rd Qu.:6.426	3rd Qu.:1.742	3rd Qu.:1.3833	3rd Qu.:0.6485	3rd Qu.:0.7360	3rd Qu.:0.1925	3rd Qu.:0.19375
Max. :7.741	Max. :2.141	Max. :1.6170	Max. :0.8570	Max. :0.8630	Max. :0.4010	Max. :0.57500

3.4 Boxplot Analysis of All Variables

df_clean %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  mutate(variable = factor(variable,
                           levels = c("happiness", predictors),
                           labels = c("Happiness", pred_labels))) %>%
  ggplot(aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 2) +
  scale_fill_brewer(palette = "Set3") +
  coord_flip() +
  labs(title    = "Boxplots of All Variables",
       subtitle = "Red dots = outliers",
       x = NULL, y = "Value") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Inference:

Happiness shows the widest overall spread (ranging from roughly 3 to 8), indicating high global variance, with a single high-end outlier.
Log GDP per capita, Social Support, and Healthy Life Expectancy show the most considerable interquartile ranges (IQR) among the predictors, reflecting a massive divide in economic and social infrastructure across nations.
Freedom is relatively compact but features distinct low-end outliers, indicating countries with significantly restricted personal autonomy.
Generosity and Perceptions of Corruption have the most compact boxes (smallest IQRs), as most countries are tightly clustered at the low end of the scale.
Corruption is particularly notable for its cluster of high-end outliers, representing the few nations that significantly outperform the global norm in transparency.
Outliers (red dots) are present across almost every variable; these represent genuine extreme cases and are retained to ensure the model accounts for the full global spectrum.

3.5 Outlier Detection: Happiness Score

Q1      <- quantile(df_clean$happiness, 0.25)
Q3      <- quantile(df_clean$happiness, 0.75)
IQR_val <- Q3 - Q1
lower_b <- Q1 - 1.5 * IQR_val
upper_b <- Q3 + 1.5 * IQR_val

data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Fence (Q1 - 1.5*IQR)", "Upper Fence (Q3 + 1.5*IQR)"),
  Value  = round(c(Q1, Q3, IQR_val, lower_b, upper_b), 3)
) %>%
  kable(caption = "IQR-Based Outlier Detection — Happiness Score") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

IQR-Based Outlier Detection — Happiness Score
Metric	Value
Q1	4.632
Q3	6.426
IQR	1.795
Lower Fence (Q1 - 1.5*IQR)	1.940
Upper Fence (Q3 + 1.5*IQR)	9.118

outliers_h <- df_raw %>%
  rename(country = Country.name, happiness = Ladder.score) %>%
  filter(happiness < lower_b | happiness > upper_b) %>%
  select(country, happiness)

if (nrow(outliers_h) > 0) {
  outliers_h %>%
    kable(caption = "Statistical Outliers in Happiness Score (IQR Method)") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
} else {
  cat("No statistical outliers detected.\n")
}

Statistical Outliers in Happiness Score (IQR Method)
country	happiness
Afghanistan	1.721

ggplot(df_clean, aes(y = happiness)) +
  geom_boxplot(fill = "#4e79a7", alpha = 0.7, outlier.colour = "red",
               outlier.size = 3, width = 0.4) +
  geom_hline(yintercept = c(lower_b, upper_b),
             linetype = "dashed", colour = "#e15759", linewidth = 1) +
  labs(title    = "Boxplot of Happiness Score with IQR Fences",
       subtitle = "Red dashed lines mark the outlier boundaries",
       y = "Happiness Score", x = NULL) +
  theme_minimal(base_size = 13)

Inference: A small number of countries fall outside the IQR fences as statistical outliers. These represent the genuine extremes of global development—most notably a very low-performing outlier in the Happiness Score. These are real-world data points, not data entry errors, and should be retained. Removing them would bias the model toward middle-range countries and reduce its ability to generalize across the full spectrum of global happiness, from the high-performing nations to those facing the most significant challenges.”

3.6 Correlation Analysis

cor_matrix <- cor(df_clean, use = "complete.obs")

colnames(cor_matrix) <- rownames(cor_matrix) <-
  c("Happiness", "GDP", "Social Support", "Life Exp.",
    "Freedom", "Generosity", "Corruption")

corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         tl.col      = "black",
         tl.srt      = 45,
         addCoef.col = "black",
         number.cex  = 0.85,
         col         = colorRampPalette(c("#e15759", "white", "#4e79a7"))(200),
         title       = "Correlation Matrix — WHR 2024 (6 Predictors)",
         mar         = c(0, 0, 2, 0))

Inference:

Social Support (r = 0.81) has the strongest positive correlation with happiness, followed closely by GDP (r = 0.77) and Life Expectancy (r = 0.76). These three form the dominant “trio” of happiness predictors.
Freedom shows a solid moderate-to-strong correlation (r = 0.64), indicating that personal autonomy contributes meaningfully to well-being.
Perceptions of Corruption (r = 0.45) has a moderate positive effect, suggesting that transparency and low corruption levels are linked to higher happiness.
Generosity (r = 0.13) shows the weakest link to happiness at the national level in this specific dataset.
Multicollinearity: Importantly, GDP, Social Support, and Life Expectancy are very strongly intercorrelated (r between 0.71 and 0.83). This confirms that wealthier nations typically enjoy better healthcare and social safety nets.

3.7 Feature Importance: Correlation with Happiness

cor_happy <- cor_matrix["Happiness", -1]

data.frame(
  Feature     = names(cor_happy),
  Correlation = round(as.numeric(cor_happy), 3)
) %>%
  arrange(desc(Correlation)) %>%
  kable(caption = "Correlation of Each Predictor with Happiness Score") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Correlation of Each Predictor with Happiness Score
Feature	Correlation
Social Support	0.814
GDP	0.769
Life Exp.	0.760
Freedom	0.644
Corruption	0.452
Generosity	0.130

data.frame(
  Feature     = names(cor_happy),
  Correlation = as.numeric(cor_happy)
) %>%
  mutate(Feature = factor(Feature,
                           levels = c("GDP", "Social Support", "Life Exp.",
                                      "Freedom", "Corruption", "Generosity"))) %>%
  ggplot(aes(x = reorder(Feature, Correlation), y = Correlation)) +
  geom_col(fill = "#4e79a7", alpha = 0.9, width = 0.6) +
  geom_text(aes(label = round(Correlation, 3)), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1.0)) +
  labs(title    = "Predictor Correlation with Happiness Score",
       subtitle = "Ordered by strength of association",
       x = NULL, y = "Pearson Correlation (r)") +
  theme_minimal(base_size = 13)

Inference:

GDP, Social Support, and Life Expectancy form a clear top tier of predictors, all correlating strongly at \(r\ge 0.76\).These represent the fundamental structural pillars of national well-being.
Freedom sits in a strong mid-tier position at r = 0.644,indicating that personal autonomy is a major driver, carrying more weight than institutional or charitable factors.
Corruption (r = 0.452) and Generosity (r = 0.13) are the weakest predictors. Their lower correlations suggest that while they contribute to happiness, their impact is smaller and likely more context-dependent than economic or social security.

3.8 Scatterplots: Each Predictor vs Happiness

scatter_plots <- lapply(seq_along(predictors), function(i) {
  ggplot(df_clean, aes_string(x = predictors[i], y = "happiness")) +
    geom_point(colour = "#4e79a7", alpha = 0.65, size = 2.2) +
    geom_smooth(method = "lm", colour = "#e15759", se = TRUE, linewidth = 1) +
    labs(title = paste(pred_labels[i], "vs Happiness"),
         x = pred_labels[i], y = "Happiness Score") +
    theme_minimal(base_size = 11)
})
do.call(grid.arrange, c(scatter_plots, ncol = 2))

Inference: All six predictor variables show a positive linear relationship with happiness. The tightest, most consistent linear trends are seen for GDP, Social Support, and Life Expectancy, aligning with their high correlation values. Freedom also shows a clear upward slope, though with slightly more variance. Generosity and Corruption display more scattered patterns with wider confidence bands (gray shading), confirming their weaker and noisier predictive signals. Overall, the linear assumption for regression modeling appears reasonable for all six predictors.

3.9 Trend Analysis: Happiness by Category

df_clean <- df_clean %>%
  mutate(happiness_group = case_when(
    happiness >= 6.5 ~ "Very Happy (>=6.5)",
    happiness >= 5.0 ~ "Moderately Happy (5-6.5)",
    happiness >= 3.5 ~ "Unhappy (3.5-5)",
    TRUE             ~ "Very Unhappy (<3.5)"
  ),
  happiness_group = factor(happiness_group,
    levels = c("Very Happy (>=6.5)", "Moderately Happy (5-6.5)",
               "Unhappy (3.5-5)", "Very Unhappy (<3.5)")))

df_clean %>%
  count(happiness_group) %>%
  ggplot(aes(x = happiness_group, y = n, fill = happiness_group)) +
  geom_col(alpha = 0.9, width = 0.6) +
  geom_text(aes(label = n), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("#4e79a7", "#76b7b2", "#f28e2b", "#e15759")) +
  labs(title = "Number of Countries by Happiness Category",
       x = "Happiness Category", y = "Number of Countries") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

df_clean %>%
  group_by(happiness_group) %>%
  summarise(across(all_of(predictors), mean, na.rm = TRUE)) %>%
  pivot_longer(-happiness_group, names_to = "factor", values_to = "mean_value") %>%
  mutate(factor = factor(factor, levels = predictors, labels = pred_labels)) %>%
  ggplot(aes(x = factor, y = mean_value, fill = happiness_group)) +
  geom_col(position = "dodge", alpha = 0.9, width = 0.75) +
  scale_fill_manual(values = c("#4e79a7", "#76b7b2", "#f28e2b", "#e15759")) +
  coord_flip() +
  labs(title    = "Average Factor Contributions by Happiness Group",
       subtitle = "Higher value = greater contribution to happiness",
       x = NULL, y = "Mean Contribution", fill = "Happiness Group") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Inference:

A clear and consistent stepwise gradient is visible across all groups; Very Happy countries score substantially higher on GDP, Social Support, and Life Expectancy than their unhappy counterparts.
Freedom also serves as a strong differentiator, showing a steady climb across every tier.
In contrast, Generosity and Corruption show minimal variation across the bottom three groups, only rising notably for the “Very Happy” tier. This confirms they are the least reliable separators for the majority of the world’s population.
The majority of nations fall into the Moderately Happy (5–6.5) category, which reflects the central peak of the global distribution and serves as the world’s “average” well-being benchmark.

3.10 Pairplot: All Variable Relationships

pairs(df_clean %>% select(-happiness_group),
      col    = "#4e79a7",
      pch    = 16,
      cex    = 0.7,
      main   = "Pairplot of All Variables — WHR 2024",
      labels = c("Happiness", "GDP", "Social\nSupport",
                 "Life\nExp.", "Freedom", "Generosity", "Corruption"))

Inference: The pairplot provides a comprehensive view of all pairwise relationships simultaneously. The top row confirms the strong, tight linear associations of GDP, Social Support, and Life Expectancy with Happiness. The off-diagonal plots reveal high density and “tunneling” between the predictors themselves—particularly between GDP and Life Expectancy and GDP and Social Support. This visual clustering further highlights the multicollinearity present in the dataset, confirming that these factors often improve in tandem. In contrast, Generosity and Corruption appear as diffuse, scattered clouds against all other variables, consistent with their weak correlations and noisier predictive signals.

4 Detailed Analysis

This section consolidates the key inferences drawn from the Exploratory Data Analysis into a structured summary, highlighting the most important patterns, relationships, and observations uncovered before modelling.

4.1 Summary of EDA Inferences

1. Happiness Score Distribution

The global happiness score follows an approximately bell-shaped distribution centred around a mean of ≈ 5.53, with a standard deviation of ≈ 1.18. The distribution has a slight left skew, meaning more countries cluster in the moderate-to-high happiness range than at the very low end. The total range spans nearly 6 points — from Afghanistan (1.72) to Finland (7.74) — revealing a profound global inequality in subjective well-being.

2. Country-Level Extremes

Nordic countries (Finland, Denmark, Iceland, Sweden, Norway) consistently occupy the top happiness rankings, driven by their strong social welfare systems, economic prosperity, and high levels of social trust. The bottom tier is dominated by conflict-affected and low-income nations (Afghanistan, Lebanon, Lesotho, Sierra Leone), where insecurity and poverty severely suppress life satisfaction. The near-6-point gap between the happiest and least happy nation illustrates the scale of this global disparity.

3. Predictor Distributions

Log GDP per capita: Shows a bimodal tendency with a primary peak around 1.8 and a secondary peak near 1.3, reflecting the economic divide between high-income and middle-income nations.
Social Support: Displays a strong left skew; the majority of countries are clustered at the high end (1.2–1.5), with a long, thin tail of lower-scoring outliers stretching toward zero.
Healthy Life Expectancy: Exhibits a distinct bimodal (two-peaked) pattern. One cluster peaks around 0.3 (developing outcomes) and another significantly larger peak sits at 0.6–0.7 (advanced healthcare access), visualising the “global health gap.”
Freedom: Appears as the most symmetrically distributed and widely spread metric, though it still maintains a slight lean toward higher autonomy scores.
Generosity & Perceptions of Corruption: Both are heavily right-skewed and concentrated at low values. The majority of nations sit at the bottom of these scales, with only a few exceptional outliers (the far-right bars) where these factors provide a high contribution to happiness.

4. Outliers

IQR-based outlier detection identified a small number of statistical outliers in the happiness score — countries at both extremes of the global spectrum (e.g. Finland, Afghanistan, Lebanon). These are genuine real-world data points representing true inequality and are retained in the dataset, as removing them would bias the models toward average-happiness countries.

5. Correlation Analysis

Three predictors emerge as the strongest drivers of national happiness:

Social Support (r ≈ 0.81) — The strongest driver, followed by GDP (r ≈ 0.77) and Healthy Life Expectancy (r ≈ 0.76).
Freedom (r ≈ 0.64): A strong mid-tier predictor.
Corruption (r ≈ 0.45) & Generosity (r ≈ 0.13): The weakest associations.
Multicollinearity: Strong inter-correlation (r > 0.70) exists between the “Big Three” (GDP, Support, Health), motivating the use of Regularised Regression (Ridge/Lasso) to ensure model stability.

6. Scatterplot Trends

All six predictors show positive linear relationships with happiness. Trends for GDP, Social Support, and Life Expectancy are the tightest and most consistent, while Generosity and Corruption show more scattered patterns and wider confidence bands.

7. Happiness Category Trends

When countries are grouped into four happiness tiers (Very Happy, Moderately Happy, Unhappy, Very Unhappy), a clear stepwise gradient is visible across GDP, Social Support, and Life Expectancy — confirming their role as the primary separators of happiness levels. Freedom also differentiates groups noticeably. Generosity, however, shows minimal variation across groups, further confirming it is the least reliable predictor of national happiness at scale.

8. Pairplot Observations

The pairplot confirms the dominant role of GDP, Social Support, and Life Expectancy — both in their individual relationships with happiness and in their strong inter-predictor correlations. Generosity and Corruption appear largely uncorrelated with both happiness and the other predictors, reinforcing their limited utility as standalone features.

Overall Conclusion from EDA: The EDA strongly suggests that a country’s happiness is primarily driven by its economic capacity, social infrastructure, and health system — three deeply interconnected dimensions. Any predictive model must account for their multicollinearity. Generosity and Corruption, while theoretically relevant, contribute less predictive value at the national level in this dataset.

5 Model Building Phase

5.1 Data Preparation for Modelling

# Remove the grouping column before modelling
model_df <- df_clean %>% select(-happiness_group)

set.seed(42)
train_idx  <- createDataPartition(model_df$happiness, p = 0.80, list = FALSE)
train_data <- model_df[ train_idx, ]
test_data  <- model_df[-train_idx, ]

cat("Training set:", nrow(train_data), "countries\n")

## Training set: 112 countries

cat("Testing set: ", nrow(test_data),  "countries\n")

## Testing set:  28 countries

# 10-fold cross-validation
cv_control <- trainControl(method = "cv", number = 10, verboseIter = FALSE)

An 80/20 train-test split is applied. Given the relatively small sample size (140 countries), 10-fold cross-validation is used during training to improve reliability of performance estimates and reduce the effect of any single lucky or unlucky split.

5.2 Model 1: Multiple Linear Regression

Multiple Linear Regression models the happiness score as a weighted linear combination of all six predictor variables. It serves as the interpretable baseline model against which regularised and ensemble models are benchmarked.

mlr_model <- train(happiness ~ ., data = train_data,
                   method = "lm", trControl = cv_control)

mlr_pred <- predict(mlr_model, newdata = test_data)

mlr_res <- data.frame(
  Actual    = test_data$happiness,
  Predicted = mlr_pred,
  Residual  = test_data$happiness - mlr_pred,
  Model     = "MLR"
)

mlr_mse  <- mean(mlr_res$Residual^2)
mlr_rmse <- sqrt(mlr_mse)
mlr_r2   <- cor(mlr_res$Actual, mlr_res$Predicted)^2
mlr_mae  <- mean(abs(mlr_res$Residual))

cat(sprintf("MLR -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            mlr_mse, mlr_rmse, mlr_r2, mlr_mae))

## MLR -> MSE: 0.2201 | RMSE: 0.4691 | R2: 0.8995 | MAE: 0.3641

summary(mlr_model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57408 -0.24270  0.09035  0.31875  1.04750 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.2690     0.2460   5.159 1.18e-06 ***
## gdp               0.4694     0.2272   2.066   0.0413 *  
## social_support    1.3175     0.2517   5.234 8.57e-07 ***
## life_expectancy   1.1656     0.5813   2.005   0.0475 *  
## freedom           2.0188     0.3822   5.282 6.96e-07 ***
## generosity        0.8076     0.7748   1.042   0.2997    
## corruption        1.1132     0.4877   2.282   0.0245 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5298 on 105 degrees of freedom
## Multiple R-squared:  0.7942, Adjusted R-squared:  0.7825 
## F-statistic: 67.55 on 6 and 105 DF,  p-value: < 2.2e-16

par(mfrow = c(2, 2))
plot(mlr_model$finalModel)

par(mfrow = c(1, 1))

Inference:

Residuals vs Fitted: This plot shows a generally random scatter around the zero line, confirming that the linear assumption is reasonable. However, a slight downward trend in the red line suggests the model may slightly over-predict happiness at the lower end.
Normal Q-Q: Most residuals follow the dashed line, confirming normality. The deviations at the bottom left (points 125, 131, 134) represent the extreme low-happiness outliers identified earlier, where the model’s error is higher.
Scale-Location: This plot checks for homoscedasticity (constant variance). The slight downward slope indicates that the model’s predictive precision varies slightly across different happiness levels.
Residuals vs Leverage: This identifies influential cases. While most points are clustered, point 78 has high leverage, and points like 134 have large residuals. These represent genuine extreme cases (like Afghanistan or Hong Kong) that exert more “pull” on the regression line than average nations.

5.3 Model 2: Ridge Regression

Ridge Regression applies L2 regularisation, adding a penalty term proportional to the square of each coefficient. This shrinks all coefficients toward zero without eliminating any — making it ideal when predictors are highly correlated, as is the case here.

ridge_model <- train(happiness ~ ., data = train_data,
                     method    = "glmnet",
                     trControl = cv_control,
                     tuneGrid  = expand.grid(
                       alpha  = 0,
                       lambda = 10^seq(-4, 1, length = 100)))

ridge_pred <- predict(ridge_model, newdata = test_data)

ridge_res <- data.frame(
  Actual    = test_data$happiness,
  Predicted = ridge_pred,
  Residual  = test_data$happiness - ridge_pred,
  Model     = "Ridge"
)

ridge_mse  <- mean(ridge_res$Residual^2)
ridge_rmse <- sqrt(ridge_mse)
ridge_r2   <- cor(ridge_res$Actual, ridge_res$Predicted)^2
ridge_mae  <- mean(abs(ridge_res$Residual))

cat(sprintf("Ridge -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            ridge_mse, ridge_rmse, ridge_r2, ridge_mae))

## Ridge -> MSE: 0.2412 | RMSE: 0.4911 | R2: 0.9000 | MAE: 0.3878

cat("Optimal lambda:", round(ridge_model$bestTune$lambda, 5), "\n")

## Optimal lambda: 0.17074

plot(ridge_model, main = "Ridge Regression: Lambda Tuning via Cross-Validation")

Inference: The lambda tuning plot tracks the RMSE (Root Mean Square Error) across a range of regularization strengths, with the optimal lambda selected where the error is minimized. Ridge regression is employed here to retain all six predictors with shrunken coefficients—a better approach than feature removal since every variable shows some correlation with happiness. The small optimal lambda (near zero) indicates that heavy regularization is not required, suggesting that the linear structure of the data is well-behaved and that the “Big Three” predictors already provide a strong, stable signal.

5.4 Model 3: Lasso Regression

Lasso Regression applies L1 regularisation, which can shrink some coefficients entirely to zero — combining prediction with automatic feature selection. This makes Lasso particularly useful for identifying the most parsimonious model.

lasso_model <- train(happiness ~ ., data = train_data,
                     method    = "glmnet",
                     trControl = cv_control,
                     tuneGrid  = expand.grid(
                       alpha  = 1,
                       lambda = 10^seq(-4, 1, length = 100)))

lasso_pred <- predict(lasso_model, newdata = test_data)

lasso_res <- data.frame(
  Actual    = test_data$happiness,
  Predicted = lasso_pred,
  Residual  = test_data$happiness - lasso_pred,
  Model     = "Lasso"
)

lasso_mse  <- mean(lasso_res$Residual^2)
lasso_rmse <- sqrt(lasso_mse)
lasso_r2   <- cor(lasso_res$Actual, lasso_res$Predicted)^2
lasso_mae  <- mean(abs(lasso_res$Residual))

cat(sprintf("Lasso -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            lasso_mse, lasso_rmse, lasso_r2, lasso_mae))

## Lasso -> MSE: 0.2202 | RMSE: 0.4693 | R2: 0.9004 | MAE: 0.3648

cat("Optimal lambda:", round(lasso_model$bestTune$lambda, 5), "\n")

## Optimal lambda: 0.00413

# Lasso coefficient output — dots indicate variables zeroed out
lasso_coefs <- coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)
print(lasso_coefs)

## 7 x 1 sparse Matrix of class "dgCMatrix"
##                 s=0.004132012
## (Intercept)         1.2960232
## gdp                 0.4644887
## social_support      1.3149066
## life_expectancy     1.1611156
## freedom             2.0107022
## generosity          0.7531208
## corruption          1.0993834

plot(lasso_model, main = "Lasso Regression: Lambda Tuning via Cross-Validation")

Inference:

The Lasso tuning plot identifies the optimal regularization strength that minimizes RMSE. As with the Ridge model, the optimal lambda is very small (near 0), indicating that the model performs best when it retains all or most of the original predictors.
The sharp increase in RMSE as lambda increases from 0 to 1 shows that Lasso is aggressively zeroing out coefficients. Because the error rises so quickly, it confirms that the predictors (especially the “Big Three”) are too essential to be discarded.
The plateau at \(\lambda \approx 1\) (where RMSE levels off around 1.1) represents the point where the model has likely dropped all predictors, leaving only the intercept. This “null model” results in a significantly higher error, highlighting the predictive value of the dataset.
While Lasso provides feature selection, the low optimal lambda suggests that for this specific year of the World Happiness Report, a dense model (keeping all features) is superior to a sparse one.

5.5 Model 4: Random Forest Regression

Random Forest is an ensemble method that builds 500 decision trees on bootstrapped samples and averages their predictions. It captures non-linear relationships and variable interaction effects that all three linear models cannot, making it the most flexible approach in this comparison.

rf_model <- train(happiness ~ ., data = train_data,
                  method    = "rf",
                  trControl = cv_control,
                  tuneGrid  = data.frame(mtry = c(2, 3, 4, 5)),
                  ntree     = 500)

rf_pred <- predict(rf_model, newdata = test_data)

rf_res <- data.frame(
  Actual    = test_data$happiness,
  Predicted = rf_pred,
  Residual  = test_data$happiness - rf_pred,
  Model     = "Random Forest"
)

rf_mse  <- mean(rf_res$Residual^2)
rf_rmse <- sqrt(rf_mse)
rf_r2   <- cor(rf_res$Actual, rf_res$Predicted)^2
rf_mae  <- mean(abs(rf_res$Residual))

cat(sprintf("RF -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            rf_mse, rf_rmse, rf_r2, rf_mae))

## RF -> MSE: 0.3633 | RMSE: 0.6028 | R2: 0.8746 | MAE: 0.4213

cat("Best mtry:", rf_model$bestTune$mtry, "\n")

## Best mtry: 2

as.data.frame(importance(rf_model$finalModel)) %>%
  rownames_to_column("Feature") %>%
  rename(Importance = IncNodePurity) %>%
  mutate(Feature = factor(Feature, levels = predictors, labels = pred_labels)) %>%
  ggplot(aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
  geom_col(alpha = 0.9, width = 0.6) +
  geom_text(aes(label = round(Importance, 2)), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#76b7b2", high = "#4e79a7") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(title    = "Random Forest: Variable Importance",
       subtitle = "Higher = greater contribution to reducing prediction error across all trees",
       x = NULL, y = "Importance (Mean Decrease in Node Impurity)") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Inference: The Random Forest variable importance plot identifies which predictors most consistently reduced prediction error (Node Impurity) across all decision trees. Social Support (44.45), GDP (31.85), and Life Expectancy (27.39) emerge as the top tier of contributors—a finding that is fully consistent with the preceding EDA and linear models. Notably, the Random Forest is inherently robust to the multicollinearity identified earlier; because each tree split is based on a random subset of features (mtry), the model naturally decorrelates the predictors. Given that this ensemble method can capture complex interaction effects, it provides the most robust confirmation of our feature hierarchy.

5.6 Model Comparison

comparison <- data.frame(
  Model = c("Multiple Linear Regression", "Ridge Regression",
            "Lasso Regression", "Random Forest"),
  MSE   = round(c(mlr_mse,   ridge_mse,   lasso_mse,   rf_mse),   4),
  RMSE  = round(c(mlr_rmse,  ridge_rmse,  lasso_rmse,  rf_rmse),  4),
  R2    = round(c(mlr_r2,    ridge_r2,    lasso_r2,    rf_r2),    4),
  MAE   = round(c(mlr_mae,   ridge_mae,   lasso_mae,   rf_mae),   4)
) %>% arrange(RMSE)

comparison %>%
  kable(caption = "Model Performance Comparison — Test Set (Best model highlighted)",
        col.names = c("Model", "MSE", "RMSE", "R2", "MAE")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(1, bold = TRUE, background = "#d4edda")

Model Performance Comparison — Test Set (Best model highlighted)
Model	MSE	RMSE	R2	MAE
Multiple Linear Regression	0.2201	0.4691	0.8995	0.3641
Lasso Regression	0.2202	0.4693	0.9004	0.3648
Ridge Regression	0.2412	0.4911	0.9000	0.3878
Random Forest	0.3633	0.6028	0.8746	0.4213

comparison %>%
  pivot_longer(-Model, names_to = "Metric", values_to = "Value") %>%
  filter(Metric %in% c("RMSE", "R2", "MAE")) %>%
  ggplot(aes(x = reorder(Model, Value), y = Value, fill = Model)) +
  geom_col(alpha = 0.85, width = 0.6) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
  labs(title = "Model Comparison: RMSE, R2, and MAE",
       x = NULL, y = "Value") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x  = element_text(angle = 30, hjust = 1),
        legend.position = "none")

5.7 Actual vs Predicted: All Models

bind_rows(mlr_res, ridge_res, lasso_res, rf_res) %>%
  ggplot(aes(x = Actual, y = Predicted, colour = Model)) +
  geom_point(alpha = 0.8, size = 3) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "black", linewidth = 0.9) +
  facet_wrap(~ Model, ncol = 2) +
  scale_colour_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
  labs(title    = "Actual vs Predicted Happiness Scores — All Models",
       subtitle = "Points along the dashed line = perfect prediction",
       x = "Actual Happiness Score", y = "Predicted Happiness Score") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

5.8 Residual Analysis: All Models

bind_rows(mlr_res, ridge_res, lasso_res, rf_res) %>%
  ggplot(aes(x = Predicted, y = Residual, colour = Model)) +
  geom_point(alpha = 0.8, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed",
             colour = "black", linewidth = 0.9) +
  facet_wrap(~ Model, ncol = 2) +
  scale_colour_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
  labs(title    = "Residual Plots — All Models",
       subtitle = "Residuals should be randomly scattered around zero",
       x = "Predicted Value", y = "Residual (Actual - Predicted)") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Inference — Overall Model Comparison:

The Multiple Linear Regression (MLR) and Lasso Regression models identified as the top performers on the test set, achieving the lowest RMSE (0.469) and the highest \(R^2\) (0.90).
Contrary to expectations, the Random Forest underperformed the linear models (\(R^2\) = 0.87). This suggests that the relationship between these six predictors and happiness is fundamentally linear, and the added complexity of an ensemble model did not provide a predictive advantage.
Lasso and MLR show near-identical results, indicating that while multicollinearity is present, the “Big Three” predictors (GDP, Social Support, and Life Expectancy) provide such a strong, stable signal that heavy regularization was not required.
Lasso provides the most valuable interpretive output—it confirms that even after penalizing coefficients, all predictors (or nearly all) remain relevant, though it effectively highlights which features carry the most “weight” for policy communication.
Residual plots across all models show values scattered reasonably close to zero with no dramatic systematic patterns. This validates that the models are appropriately specified, though all models struggle with the same extreme low-end outlier. The high \(R^2\) values (averaging 0.89) across the models reflect the massive predictive power of these six features, explaining nearly 90% of the variance in global happiness scores.

6 Conclusion

This project analyzed the World Happiness Report 2024 to identify the key determinants of national happiness and build predictive models for the Ladder Score across 140 countries.

Data Cleaning: Four columns were removed (Country name, upper/lower whiskers, Dystopia + Residual) to produce a lean, meaningful feature set. Three rows with missing predictor data were removed, yielding a clean dataset of 140 countries and 7 variables.

Key EDA Findings:

The “Big Three”: GDP per capita, Social Support, and Healthy Life Expectancy are the primary drivers of happiness, each correlating strongly (\(r \ge 0.76\)) with the Ladder Score.
Autonomy & Institutions: Freedom is a strong mid-tier predictor (r = 0.64), while Generosity and Corruption have the weakest national-level effects, being more context-dependent.
Global Inequality: A massive happiness gap exists—spanning nearly 6 points between the highest and lowest-ranked nations.
Multicollinearity: Strong inter-predictor correlations (especially between GDP and Life Expectancy at r = 0.83) justified the use of Ridge and Lasso regression to ensure model stability.

Modelling Results:

Linear Superiority: Multiple Linear Regression (MLR) and Lasso Regression emerged as the top performers (\(R^2 \approx 0.90\) and \(\text{RMSE} \approx 0.47\)). This indicates the relationships are fundamentally linear and well-captured by simpler models.
Random Forest: While robust, the ensemble model slightly underperformed the linear models (\(R^2\) = 0.87), likely due to the dataset’s relatively small size and the overwhelming linear strength of the “Big Three” predictors.
Model Validation): Diagnostic and residual plots confirm that the linear models are well-specified, though all models struggle to fully explain the extreme low-end outlier (Afghanistan).
Interpretability: Lasso confirmed that even under regularization, the primary predictors remain essential, providing a parsimonious framework for policy evaluation.

Policy Implications: Nations seeking to improve national well-being should prioritize investments in economic stability, social safety nets, and healthcare infrastructure. These three factors provide the most consistent and significant “stepwise” increase in global happiness scores.

Predictive Modelling for Global Happiness Index

Sruthy Elsa Jacob (2026EMAI10057)

2026-04-17

1 Objective

2 Data Cleaning

2.1 Initial Inspection

2.2 Removing Irrelevant Columns

2.3 Missing Value Analysis

2.4 Duplicate Check

2.5 Removing Missing Rows

2.6 Final Clean Dataset

3 Exploratory Data Analysis

3.1 Distribution of the Happiness Score

3.2 Distribution of Each Predictor Variable

3.3 Descriptive Statistics for All Variables

3.4 Boxplot Analysis of All Variables

3.5 Outlier Detection: Happiness Score

3.6 Correlation Analysis

3.7 Feature Importance: Correlation with Happiness

3.8 Scatterplots: Each Predictor vs Happiness

3.9 Trend Analysis: Happiness by Category

3.10 Pairplot: All Variable Relationships

4 Detailed Analysis

4.1 Summary of EDA Inferences

5 Model Building Phase

5.1 Data Preparation for Modelling

5.2 Model 1: Multiple Linear Regression

5.3 Model 2: Ridge Regression

5.4 Model 3: Lasso Regression

5.5 Model 4: Random Forest Regression

5.6 Model Comparison

5.7 Actual vs Predicted: All Models

5.8 Residual Analysis: All Models

6 Conclusion