The goal of this project is to analyse the World Happiness Report 2024 dataset, which covers 143 countries, and to build predictive models capable of estimating a country’s happiness score (Ladder Score) from its socio-economic indicators.
The specific objectives are:
Data Cleaning: Load the dataset, inspect its structure, remove irrelevant or non-predictor columns, identify and handle missing values, check for duplicates, and prepare a lean, clean dataset ready for analysis.
Exploratory Data Analysis (EDA): Perform a thorough examination of the data through statistical summaries, visualisations, correlation analysis, and trend analysis. Each visualisation is accompanied by inferences and justifications.
Detailed Analysis: Apply and evaluate four machine learning regression models :— Multiple Linear Regression, Ridge Regression, Lasso Regression, and Random Forest — and compare their predictive performance using MSE, RMSE, R², and MAE.
# ── Core data wrangling ──────────────────────────────────────────────────────
library(tidyverse)
library(dplyr)
library(tidyr)
# ── Visualisation ────────────────────────────────────────────────────────────
library(ggplot2)
library(corrplot)
library(scales)
library(gridExtra)
library(ggrepel)
# ── Machine learning ─────────────────────────────────────────────────────────
library(caret)
library(glmnet)
library(randomForest)
# ── Tables ───────────────────────────────────────────────────────────────────
library(knitr)
library(kableExtra)df_raw <- read.csv("WHR2024.csv", stringsAsFactors = FALSE)
# Rename columns for convenience
df <- df_raw %>%
rename(
country = Country.name,
happiness = Ladder.score,
upper_ci = upperwhisker,
lower_ci = lowerwhisker,
gdp = Explained.by..Log.GDP.per.capita,
social_support = Explained.by..Social.support,
life_expectancy = Explained.by..Healthy.life.expectancy,
freedom = Explained.by..Freedom.to.make.life.choices,
generosity = Explained.by..Generosity,
corruption = Explained.by..Perceptions.of.corruption,
dystopia = Dystopia...residual
)
cat("Dataset loaded successfully:", nrow(df), "countries,", ncol(df), "variables\n")## Dataset loaded successfully: 143 countries, 11 variables
Before any cleaning, the full raw structure of the dataset is examined to understand its dimensions, variable types, and general content.
head(df) %>%
kable(digits = 3, caption = "First 6 Rows of the Raw WHR 2024 Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| country | happiness | upper_ci | lower_ci | gdp | social_support | life_expectancy | freedom | generosity | corruption | dystopia |
|---|---|---|---|---|---|---|---|---|---|---|
| Finland | 7.741 | 7.815 | 7.667 | 1.844 | 1.572 | 0.695 | 0.859 | 0.142 | 0.546 | 2.082 |
| Denmark | 7.583 | 7.665 | 7.500 | 1.908 | 1.520 | 0.699 | 0.823 | 0.204 | 0.548 | 1.881 |
| Iceland | 7.525 | 7.618 | 7.433 | 1.881 | 1.617 | 0.718 | 0.819 | 0.258 | 0.182 | 2.050 |
| Sweden | 7.344 | 7.422 | 7.267 | 1.878 | 1.501 | 0.724 | 0.838 | 0.221 | 0.524 | 1.658 |
| Israel | 7.341 | 7.405 | 7.277 | 1.803 | 1.513 | 0.740 | 0.641 | 0.153 | 0.193 | 2.298 |
| Netherlands | 7.319 | 7.383 | 7.256 | 1.901 | 1.462 | 0.706 | 0.725 | 0.247 | 0.372 | 1.906 |
tail(df) %>%
kable(digits = 3, caption = "Last 6 Rows of the Raw WHR 2024 Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| country | happiness | upper_ci | lower_ci | gdp | social_support | life_expectancy | freedom | generosity | corruption | dystopia | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 138 | Zimbabwe | 3.341 | 3.457 | 3.226 | 0.748 | 0.850 | 0.232 | 0.487 | 0.096 | 0.131 | 0.797 |
| 139 | Congo (Kinshasa) | 3.295 | 3.462 | 3.128 | 0.534 | 0.665 | 0.262 | 0.473 | 0.189 | 0.072 | 1.102 |
| 140 | Sierra Leone | 3.245 | 3.366 | 3.124 | 0.654 | 0.566 | 0.253 | 0.469 | 0.181 | 0.053 | 1.068 |
| 141 | Lesotho | 3.186 | 3.469 | 2.904 | 0.771 | 0.851 | 0.000 | 0.523 | 0.082 | 0.085 | 0.875 |
| 142 | Lebanon | 2.707 | 2.797 | 2.616 | 1.377 | 0.577 | 0.556 | 0.173 | 0.068 | 0.029 | -0.073 |
| 143 | Afghanistan | 1.721 | 1.775 | 1.667 | 0.628 | 0.000 | 0.242 | 0.000 | 0.091 | 0.088 | 0.672 |
Inference: The dataset is sorted in descending order of happiness. The first rows are the happiest nations (Nordic countries) and the last rows are the least happy (conflict-affected or low-income countries). This confirms that the data is well-organised and requires no re-sorting.
## 'data.frame': 143 obs. of 11 variables:
## $ country : chr "Finland" "Denmark" "Iceland" "Sweden" ...
## $ happiness : num 7.74 7.58 7.53 7.34 7.34 ...
## $ upper_ci : num 7.82 7.66 7.62 7.42 7.41 ...
## $ lower_ci : num 7.67 7.5 7.43 7.27 7.28 ...
## $ gdp : num 1.84 1.91 1.88 1.88 1.8 ...
## $ social_support : num 1.57 1.52 1.62 1.5 1.51 ...
## $ life_expectancy: num 0.695 0.699 0.718 0.724 0.74 0.706 0.704 0.708 0.747 0.692 ...
## $ freedom : num 0.859 0.823 0.819 0.838 0.641 0.725 0.835 0.801 0.759 0.756 ...
## $ generosity : num 0.142 0.204 0.258 0.221 0.153 0.247 0.224 0.146 0.173 0.225 ...
## $ corruption : num 0.546 0.548 0.182 0.524 0.193 0.372 0.484 0.432 0.498 0.323 ...
## $ dystopia : num 2.08 1.88 2.05 1.66 2.3 ...
summary(df) %>%
kable(caption = "Statistical Summary of All Variables (Before Cleaning)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)| country | happiness | upper_ci | lower_ci | gdp | social_support | life_expectancy | freedom | generosity | corruption | dystopia | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:143 | Min. :1.721 | Min. :1.775 | Min. :1.667 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :-0.073 | |
| Class :character | 1st Qu.:4.726 | 1st Qu.:4.846 | 1st Qu.:4.606 | 1st Qu.:1.078 | 1st Qu.:0.9217 | 1st Qu.:0.3980 | 1st Qu.:0.5275 | 1st Qu.:0.0910 | 1st Qu.:0.06875 | 1st Qu.: 1.308 | |
| Mode :character | Median :5.785 | Median :5.895 | Median :5.674 | Median :1.431 | Median :1.2375 | Median :0.5495 | Median :0.6410 | Median :0.1365 | Median :0.12050 | Median : 1.645 | |
| NA | Mean :5.528 | Mean :5.641 | Mean :5.414 | Mean :1.379 | Mean :1.1343 | Mean :0.5209 | Mean :0.6206 | Mean :0.1463 | Mean :0.15412 | Mean : 1.576 | |
| NA | 3rd Qu.:6.416 | 3rd Qu.:6.508 | 3rd Qu.:6.319 | 3rd Qu.:1.742 | 3rd Qu.:1.3833 | 3rd Qu.:0.6485 | 3rd Qu.:0.7360 | 3rd Qu.:0.1925 | 3rd Qu.:0.19375 | 3rd Qu.: 1.882 | |
| NA | Max. :7.741 | Max. :7.815 | Max. :7.667 | Max. :2.141 | Max. :1.6170 | Max. :0.8570 | Max. :0.8630 | Max. :0.4010 | Max. :0.57500 | Max. : 2.998 | |
| NA | NA | NA | NA | NA’s :3 | NA’s :3 | NA’s :3 | NA’s :3 | NA’s :3 | NA’s :3 | NA’s :3 |
Inference: The dataset contains 1 categorical
variable (country) and 10 numerical variables. The Ladder
Score (happiness) ranges from 1.72 to 7.74 with a mean of approximately
5.5, indicating a wide global spread in life satisfaction. Several
variables :— upper_ci, lower_ci,
country, and dystopia , are identified as
candidates for removal in the next step.
The following four columns are dropped before any further analysis:
| Column | Reason for Removal |
|---|---|
country |
A text identifier , not a numerical predictor(used only for labelling) |
upperwhisker (upper_ci) |
Confidence interval bound , a statistical artefact of the Ladder Score, not an independent predictor |
lowerwhisker (lower_ci) |
Confidence interval bound , a statistical artefact of the Ladder Score, not an independent predictor |
dystopia (Dystopia + Residual) |
Represents the unexplained remainder after the six factors are accounted for, including it would artificially inflate model performance since it is derived from the target variable |
df <- df %>%
select(-country, -upper_ci, -lower_ci, -dystopia)
cat("Columns remaining after removal:", ncol(df), "\n")## Columns remaining after removal: 7
## Column names: happiness, gdp, social_support, life_expectancy, freedom, generosity, corruption
head(df) %>%
kable(digits = 3, caption = "Dataset After Removing Irrelevant Columns") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| happiness | gdp | social_support | life_expectancy | freedom | generosity | corruption |
|---|---|---|---|---|---|---|
| 7.741 | 1.844 | 1.572 | 0.695 | 0.859 | 0.142 | 0.546 |
| 7.583 | 1.908 | 1.520 | 0.699 | 0.823 | 0.204 | 0.548 |
| 7.525 | 1.881 | 1.617 | 0.718 | 0.819 | 0.258 | 0.182 |
| 7.344 | 1.878 | 1.501 | 0.724 | 0.838 | 0.221 | 0.524 |
| 7.341 | 1.803 | 1.513 | 0.740 | 0.641 | 0.153 | 0.193 |
| 7.319 | 1.901 | 1.462 | 0.706 | 0.725 | 0.247 | 0.372 |
Inference: The dataset now contains only the happiness score (target) and the six socio-economic predictor variables: GDP per capita, Social Support, Healthy Life Expectancy, Freedom, Generosity, and Perceptions of Corruption. This is a clean and meaningful feature set for predictive modelling.
missing_df <- df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Pct = round(Missing_Count / nrow(df) * 100, 2))
missing_df %>%
kable(caption = "Missing Value Count per Variable (After Column Removal)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(missing_df$Missing_Count > 0), background = "#fff3cd")| Variable | Missing_Count | Missing_Pct |
|---|---|---|
| happiness | 0 | 0.0 |
| gdp | 3 | 2.1 |
| social_support | 3 | 2.1 |
| life_expectancy | 3 | 2.1 |
| freedom | 3 | 2.1 |
| generosity | 3 | 2.1 |
| corruption | 3 | 2.1 |
missing_df %>%
filter(Missing_Count > 0) %>%
ggplot(aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Variable)) +
geom_col(show.legend = FALSE, width = 0.5) +
geom_text(aes(label = paste0(Missing_Count, " missing (", Missing_Pct, "%)")),
vjust = -0.5, size = 4) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(limits = c(0, 5)) +
labs(title = "Variables with Missing Values",
x = "Variable", y = "Count") +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))Inference: Three countries (Bahrain, Tajikistan, and State of Palestine) have missing values across all six predictor columns. Since the missingness is complete across all predictors for these rows, imputation is not feasible, there is no partial information to base an estimate on. These rows will be removed entirely.
## Number of duplicate rows: 0
Inference: No duplicate rows exist in the dataset.
## Rows before removal: 143
## Rows after removal: 140
## Rows removed: 3
head(df_clean) %>%
kable(digits = 3, caption = "Final Clean Dataset — Ready for EDA and Modelling") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| happiness | gdp | social_support | life_expectancy | freedom | generosity | corruption |
|---|---|---|---|---|---|---|
| 7.741 | 1.844 | 1.572 | 0.695 | 0.859 | 0.142 | 0.546 |
| 7.583 | 1.908 | 1.520 | 0.699 | 0.823 | 0.204 | 0.548 |
| 7.525 | 1.881 | 1.617 | 0.718 | 0.819 | 0.258 | 0.182 |
| 7.344 | 1.878 | 1.501 | 0.724 | 0.838 | 0.221 | 0.524 |
| 7.341 | 1.803 | 1.513 | 0.740 | 0.641 | 0.153 | 0.193 |
| 7.319 | 1.901 | 1.462 | 0.706 | 0.725 | 0.247 | 0.372 |
## Final dataset dimensions: 140 rows x 7 columns
## Variables: happiness, gdp, social_support, life_expectancy, freedom, generosity, corruption
Inference: The final clean dataset contains 140 countries and 7 variables , the happiness score and six predictor variables. All columns are numeric, all rows are complete, and no irrelevant features remain. The dataset is ready for exploratory data analysis.
h_mean <- mean(df_clean$happiness)
h_median <- median(df_clean$happiness)
h_sd <- sd(df_clean$happiness)
h_min <- min(df_clean$happiness)
h_max <- max(df_clean$happiness)
# Summary table
data.frame(
Statistic = c("Mean", "Median", "Std Dev", "Minimum", "Maximum", "Range"),
Value = round(c(h_mean, h_median, h_sd, h_min, h_max, h_max - h_min), 3)
) %>%
kable(caption = "Descriptive Statistics — Happiness Score") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| Mean | 5.531 |
| Median | 5.800 |
| Std Dev | 1.181 |
| Minimum | 1.721 |
| Maximum | 7.741 |
| Range | 6.020 |
ggplot(df_clean, aes(x = happiness)) +
geom_histogram(aes(y = after_stat(density)), bins = 20,
fill = "#4e79a7", colour = "white", alpha = 0.85) +
geom_density(colour = "#e15759", linewidth = 1.2) +
geom_vline(xintercept = h_mean, linetype = "dashed",
colour = "#f28e2b", linewidth = 1) +
geom_vline(xintercept = h_median, linetype = "dotted",
colour = "#59a14f", linewidth = 1) +
annotate("text", x = h_mean + 0.12, y = 0.32,
label = paste("Mean =", round(h_mean, 2)),
colour = "#f28e2b", hjust = 0, size = 4) +
annotate("text", x = h_median - 0.12, y = 0.28,
label = paste("Median =", round(h_median, 2)),
colour = "#59a14f", hjust = 1, size = 4) +
labs(title = "Distribution of Global Happiness Scores",
subtitle = "World Happiness Report 2024 — 140 Countries",
x = "Happiness Score (Ladder Score)", y = "Density") +
theme_minimal(base_size = 13)The distribution of happiness scores is unimodal and approximately bell-shaped, though it exhibits a notable left skew. The mean sits below the median, a result of the low-scoring outliers (nations in extreme adversity) pulling the average downward. A standard deviation of confirms a significant global spread in life satisfaction. The primary concentration of countries falls within the 5–7 range, indicating that moderate-to-high happiness is the global norm. However, the expansive range of roughly 6 points—from a minimum of 1.72 to a maximum of 7.74—underscores a profound global inequality in well-being.
predictors <- c("gdp", "social_support", "life_expectancy",
"freedom", "generosity", "corruption")
pred_labels <- c("Log GDP per capita", "Social Support", "Healthy Life Expectancy",
"Freedom", "Generosity", "Perceptions of Corruption")
plots <- lapply(seq_along(predictors), function(i) {
m <- mean(df_clean[[predictors[i]]], na.rm = TRUE)
ggplot(df_clean, aes_string(x = predictors[i])) +
geom_histogram(fill = "#76b7b2", colour = "white", bins = 20, alpha = 0.9) +
geom_vline(xintercept = m, linetype = "dashed", colour = "#e15759", linewidth = 0.9) +
labs(title = pred_labels[i], x = NULL, y = "Count") +
theme_minimal(base_size = 11)
})
do.call(grid.arrange, c(plots, ncol = 2))Inference:
df_clean %>%
summary() %>%
kable(caption = "Summary Statistics — Clean Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)| happiness | gdp | social_support | life_expectancy | freedom | generosity | corruption | |
|---|---|---|---|---|---|---|---|
| Min. :1.721 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | |
| 1st Qu.:4.632 | 1st Qu.:1.078 | 1st Qu.:0.9217 | 1st Qu.:0.3980 | 1st Qu.:0.5275 | 1st Qu.:0.0910 | 1st Qu.:0.06875 | |
| Median :5.801 | Median :1.431 | Median :1.2375 | Median :0.5495 | Median :0.6410 | Median :0.1365 | Median :0.12050 | |
| Mean :5.531 | Mean :1.379 | Mean :1.1343 | Mean :0.5209 | Mean :0.6206 | Mean :0.1463 | Mean :0.15412 | |
| 3rd Qu.:6.426 | 3rd Qu.:1.742 | 3rd Qu.:1.3833 | 3rd Qu.:0.6485 | 3rd Qu.:0.7360 | 3rd Qu.:0.1925 | 3rd Qu.:0.19375 | |
| Max. :7.741 | Max. :2.141 | Max. :1.6170 | Max. :0.8570 | Max. :0.8630 | Max. :0.4010 | Max. :0.57500 |
df_clean %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
mutate(variable = factor(variable,
levels = c("happiness", predictors),
labels = c("Happiness", pred_labels))) %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 2) +
scale_fill_brewer(palette = "Set3") +
coord_flip() +
labs(title = "Boxplots of All Variables",
subtitle = "Red dots = outliers",
x = NULL, y = "Value") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Inference:
Q1 <- quantile(df_clean$happiness, 0.25)
Q3 <- quantile(df_clean$happiness, 0.75)
IQR_val <- Q3 - Q1
lower_b <- Q1 - 1.5 * IQR_val
upper_b <- Q3 + 1.5 * IQR_val
data.frame(
Metric = c("Q1", "Q3", "IQR", "Lower Fence (Q1 - 1.5*IQR)", "Upper Fence (Q3 + 1.5*IQR)"),
Value = round(c(Q1, Q3, IQR_val, lower_b, upper_b), 3)
) %>%
kable(caption = "IQR-Based Outlier Detection — Happiness Score") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Q1 | 4.632 |
| Q3 | 6.426 |
| IQR | 1.795 |
| Lower Fence (Q1 - 1.5*IQR) | 1.940 |
| Upper Fence (Q3 + 1.5*IQR) | 9.118 |
outliers_h <- df_raw %>%
rename(country = Country.name, happiness = Ladder.score) %>%
filter(happiness < lower_b | happiness > upper_b) %>%
select(country, happiness)
if (nrow(outliers_h) > 0) {
outliers_h %>%
kable(caption = "Statistical Outliers in Happiness Score (IQR Method)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
} else {
cat("No statistical outliers detected.\n")
}| country | happiness |
|---|---|
| Afghanistan | 1.721 |
ggplot(df_clean, aes(y = happiness)) +
geom_boxplot(fill = "#4e79a7", alpha = 0.7, outlier.colour = "red",
outlier.size = 3, width = 0.4) +
geom_hline(yintercept = c(lower_b, upper_b),
linetype = "dashed", colour = "#e15759", linewidth = 1) +
labs(title = "Boxplot of Happiness Score with IQR Fences",
subtitle = "Red dashed lines mark the outlier boundaries",
y = "Happiness Score", x = NULL) +
theme_minimal(base_size = 13)Inference: A small number of countries fall outside the IQR fences as statistical outliers. These represent the genuine extremes of global development—most notably a very low-performing outlier in the Happiness Score. These are real-world data points, not data entry errors, and should be retained. Removing them would bias the model toward middle-range countries and reduce its ability to generalize across the full spectrum of global happiness, from the high-performing nations to those facing the most significant challenges.”
cor_matrix <- cor(df_clean, use = "complete.obs")
colnames(cor_matrix) <- rownames(cor_matrix) <-
c("Happiness", "GDP", "Social Support", "Life Exp.",
"Freedom", "Generosity", "Corruption")
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.85,
col = colorRampPalette(c("#e15759", "white", "#4e79a7"))(200),
title = "Correlation Matrix — WHR 2024 (6 Predictors)",
mar = c(0, 0, 2, 0))Inference:
cor_happy <- cor_matrix["Happiness", -1]
data.frame(
Feature = names(cor_happy),
Correlation = round(as.numeric(cor_happy), 3)
) %>%
arrange(desc(Correlation)) %>%
kable(caption = "Correlation of Each Predictor with Happiness Score") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Feature | Correlation |
|---|---|
| Social Support | 0.814 |
| GDP | 0.769 |
| Life Exp. | 0.760 |
| Freedom | 0.644 |
| Corruption | 0.452 |
| Generosity | 0.130 |
data.frame(
Feature = names(cor_happy),
Correlation = as.numeric(cor_happy)
) %>%
mutate(Feature = factor(Feature,
levels = c("GDP", "Social Support", "Life Exp.",
"Freedom", "Corruption", "Generosity"))) %>%
ggplot(aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_col(fill = "#4e79a7", alpha = 0.9, width = 0.6) +
geom_text(aes(label = round(Correlation, 3)), hjust = -0.1, size = 4) +
coord_flip() +
scale_y_continuous(limits = c(0, 1.0)) +
labs(title = "Predictor Correlation with Happiness Score",
subtitle = "Ordered by strength of association",
x = NULL, y = "Pearson Correlation (r)") +
theme_minimal(base_size = 13)Inference:
scatter_plots <- lapply(seq_along(predictors), function(i) {
ggplot(df_clean, aes_string(x = predictors[i], y = "happiness")) +
geom_point(colour = "#4e79a7", alpha = 0.65, size = 2.2) +
geom_smooth(method = "lm", colour = "#e15759", se = TRUE, linewidth = 1) +
labs(title = paste(pred_labels[i], "vs Happiness"),
x = pred_labels[i], y = "Happiness Score") +
theme_minimal(base_size = 11)
})
do.call(grid.arrange, c(scatter_plots, ncol = 2))Inference: All six predictor variables show a positive linear relationship with happiness. The tightest, most consistent linear trends are seen for GDP, Social Support, and Life Expectancy, aligning with their high correlation values. Freedom also shows a clear upward slope, though with slightly more variance. Generosity and Corruption display more scattered patterns with wider confidence bands (gray shading), confirming their weaker and noisier predictive signals. Overall, the linear assumption for regression modeling appears reasonable for all six predictors.
df_clean <- df_clean %>%
mutate(happiness_group = case_when(
happiness >= 6.5 ~ "Very Happy (>=6.5)",
happiness >= 5.0 ~ "Moderately Happy (5-6.5)",
happiness >= 3.5 ~ "Unhappy (3.5-5)",
TRUE ~ "Very Unhappy (<3.5)"
),
happiness_group = factor(happiness_group,
levels = c("Very Happy (>=6.5)", "Moderately Happy (5-6.5)",
"Unhappy (3.5-5)", "Very Unhappy (<3.5)")))
df_clean %>%
count(happiness_group) %>%
ggplot(aes(x = happiness_group, y = n, fill = happiness_group)) +
geom_col(alpha = 0.9, width = 0.6) +
geom_text(aes(label = n), vjust = -0.5, size = 5, fontface = "bold") +
scale_fill_manual(values = c("#4e79a7", "#76b7b2", "#f28e2b", "#e15759")) +
labs(title = "Number of Countries by Happiness Category",
x = "Happiness Category", y = "Number of Countries") +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))df_clean %>%
group_by(happiness_group) %>%
summarise(across(all_of(predictors), mean, na.rm = TRUE)) %>%
pivot_longer(-happiness_group, names_to = "factor", values_to = "mean_value") %>%
mutate(factor = factor(factor, levels = predictors, labels = pred_labels)) %>%
ggplot(aes(x = factor, y = mean_value, fill = happiness_group)) +
geom_col(position = "dodge", alpha = 0.9, width = 0.75) +
scale_fill_manual(values = c("#4e79a7", "#76b7b2", "#f28e2b", "#e15759")) +
coord_flip() +
labs(title = "Average Factor Contributions by Happiness Group",
subtitle = "Higher value = greater contribution to happiness",
x = NULL, y = "Mean Contribution", fill = "Happiness Group") +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")Inference:
pairs(df_clean %>% select(-happiness_group),
col = "#4e79a7",
pch = 16,
cex = 0.7,
main = "Pairplot of All Variables — WHR 2024",
labels = c("Happiness", "GDP", "Social\nSupport",
"Life\nExp.", "Freedom", "Generosity", "Corruption"))Inference: The pairplot provides a comprehensive view of all pairwise relationships simultaneously. The top row confirms the strong, tight linear associations of GDP, Social Support, and Life Expectancy with Happiness. The off-diagonal plots reveal high density and “tunneling” between the predictors themselves—particularly between GDP and Life Expectancy and GDP and Social Support. This visual clustering further highlights the multicollinearity present in the dataset, confirming that these factors often improve in tandem. In contrast, Generosity and Corruption appear as diffuse, scattered clouds against all other variables, consistent with their weak correlations and noisier predictive signals.
This section consolidates the key inferences drawn from the Exploratory Data Analysis into a structured summary, highlighting the most important patterns, relationships, and observations uncovered before modelling.
1. Happiness Score Distribution
The global happiness score follows an approximately bell-shaped distribution centred around a mean of ≈ 5.53, with a standard deviation of ≈ 1.18. The distribution has a slight left skew, meaning more countries cluster in the moderate-to-high happiness range than at the very low end. The total range spans nearly 6 points — from Afghanistan (1.72) to Finland (7.74) — revealing a profound global inequality in subjective well-being.
2. Country-Level Extremes
Nordic countries (Finland, Denmark, Iceland, Sweden, Norway) consistently occupy the top happiness rankings, driven by their strong social welfare systems, economic prosperity, and high levels of social trust. The bottom tier is dominated by conflict-affected and low-income nations (Afghanistan, Lebanon, Lesotho, Sierra Leone), where insecurity and poverty severely suppress life satisfaction. The near-6-point gap between the happiest and least happy nation illustrates the scale of this global disparity.
3. Predictor Distributions
4. Outliers
IQR-based outlier detection identified a small number of statistical outliers in the happiness score — countries at both extremes of the global spectrum (e.g. Finland, Afghanistan, Lebanon). These are genuine real-world data points representing true inequality and are retained in the dataset, as removing them would bias the models toward average-happiness countries.
5. Correlation Analysis
Three predictors emerge as the strongest drivers of national happiness:
6. Scatterplot Trends
All six predictors show positive linear relationships with happiness. Trends for GDP, Social Support, and Life Expectancy are the tightest and most consistent, while Generosity and Corruption show more scattered patterns and wider confidence bands.
7. Happiness Category Trends
When countries are grouped into four happiness tiers (Very Happy, Moderately Happy, Unhappy, Very Unhappy), a clear stepwise gradient is visible across GDP, Social Support, and Life Expectancy — confirming their role as the primary separators of happiness levels. Freedom also differentiates groups noticeably. Generosity, however, shows minimal variation across groups, further confirming it is the least reliable predictor of national happiness at scale.
8. Pairplot Observations
The pairplot confirms the dominant role of GDP, Social Support, and Life Expectancy — both in their individual relationships with happiness and in their strong inter-predictor correlations. Generosity and Corruption appear largely uncorrelated with both happiness and the other predictors, reinforcing their limited utility as standalone features.
Overall Conclusion from EDA: The EDA strongly suggests that a country’s happiness is primarily driven by its economic capacity, social infrastructure, and health system — three deeply interconnected dimensions. Any predictive model must account for their multicollinearity. Generosity and Corruption, while theoretically relevant, contribute less predictive value at the national level in this dataset.
# Remove the grouping column before modelling
model_df <- df_clean %>% select(-happiness_group)
set.seed(42)
train_idx <- createDataPartition(model_df$happiness, p = 0.80, list = FALSE)
train_data <- model_df[ train_idx, ]
test_data <- model_df[-train_idx, ]
cat("Training set:", nrow(train_data), "countries\n")## Training set: 112 countries
## Testing set: 28 countries
# 10-fold cross-validation
cv_control <- trainControl(method = "cv", number = 10, verboseIter = FALSE)An 80/20 train-test split is applied. Given the relatively small sample size (140 countries), 10-fold cross-validation is used during training to improve reliability of performance estimates and reduce the effect of any single lucky or unlucky split.
Multiple Linear Regression models the happiness score as a weighted linear combination of all six predictor variables. It serves as the interpretable baseline model against which regularised and ensemble models are benchmarked.
mlr_model <- train(happiness ~ ., data = train_data,
method = "lm", trControl = cv_control)
mlr_pred <- predict(mlr_model, newdata = test_data)
mlr_res <- data.frame(
Actual = test_data$happiness,
Predicted = mlr_pred,
Residual = test_data$happiness - mlr_pred,
Model = "MLR"
)
mlr_mse <- mean(mlr_res$Residual^2)
mlr_rmse <- sqrt(mlr_mse)
mlr_r2 <- cor(mlr_res$Actual, mlr_res$Predicted)^2
mlr_mae <- mean(abs(mlr_res$Residual))
cat(sprintf("MLR -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
mlr_mse, mlr_rmse, mlr_r2, mlr_mae))## MLR -> MSE: 0.2201 | RMSE: 0.4691 | R2: 0.8995 | MAE: 0.3641
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57408 -0.24270 0.09035 0.31875 1.04750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2690 0.2460 5.159 1.18e-06 ***
## gdp 0.4694 0.2272 2.066 0.0413 *
## social_support 1.3175 0.2517 5.234 8.57e-07 ***
## life_expectancy 1.1656 0.5813 2.005 0.0475 *
## freedom 2.0188 0.3822 5.282 6.96e-07 ***
## generosity 0.8076 0.7748 1.042 0.2997
## corruption 1.1132 0.4877 2.282 0.0245 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5298 on 105 degrees of freedom
## Multiple R-squared: 0.7942, Adjusted R-squared: 0.7825
## F-statistic: 67.55 on 6 and 105 DF, p-value: < 2.2e-16
Inference:
Ridge Regression applies L2 regularisation, adding a penalty term proportional to the square of each coefficient. This shrinks all coefficients toward zero without eliminating any — making it ideal when predictors are highly correlated, as is the case here.
ridge_model <- train(happiness ~ ., data = train_data,
method = "glmnet",
trControl = cv_control,
tuneGrid = expand.grid(
alpha = 0,
lambda = 10^seq(-4, 1, length = 100)))
ridge_pred <- predict(ridge_model, newdata = test_data)
ridge_res <- data.frame(
Actual = test_data$happiness,
Predicted = ridge_pred,
Residual = test_data$happiness - ridge_pred,
Model = "Ridge"
)
ridge_mse <- mean(ridge_res$Residual^2)
ridge_rmse <- sqrt(ridge_mse)
ridge_r2 <- cor(ridge_res$Actual, ridge_res$Predicted)^2
ridge_mae <- mean(abs(ridge_res$Residual))
cat(sprintf("Ridge -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
ridge_mse, ridge_rmse, ridge_r2, ridge_mae))## Ridge -> MSE: 0.2412 | RMSE: 0.4911 | R2: 0.9000 | MAE: 0.3878
## Optimal lambda: 0.17074
Inference: The lambda tuning plot tracks the RMSE (Root Mean Square Error) across a range of regularization strengths, with the optimal lambda selected where the error is minimized. Ridge regression is employed here to retain all six predictors with shrunken coefficients—a better approach than feature removal since every variable shows some correlation with happiness. The small optimal lambda (near zero) indicates that heavy regularization is not required, suggesting that the linear structure of the data is well-behaved and that the “Big Three” predictors already provide a strong, stable signal.
Lasso Regression applies L1 regularisation, which can shrink some coefficients entirely to zero — combining prediction with automatic feature selection. This makes Lasso particularly useful for identifying the most parsimonious model.
lasso_model <- train(happiness ~ ., data = train_data,
method = "glmnet",
trControl = cv_control,
tuneGrid = expand.grid(
alpha = 1,
lambda = 10^seq(-4, 1, length = 100)))
lasso_pred <- predict(lasso_model, newdata = test_data)
lasso_res <- data.frame(
Actual = test_data$happiness,
Predicted = lasso_pred,
Residual = test_data$happiness - lasso_pred,
Model = "Lasso"
)
lasso_mse <- mean(lasso_res$Residual^2)
lasso_rmse <- sqrt(lasso_mse)
lasso_r2 <- cor(lasso_res$Actual, lasso_res$Predicted)^2
lasso_mae <- mean(abs(lasso_res$Residual))
cat(sprintf("Lasso -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
lasso_mse, lasso_rmse, lasso_r2, lasso_mae))## Lasso -> MSE: 0.2202 | RMSE: 0.4693 | R2: 0.9004 | MAE: 0.3648
## Optimal lambda: 0.00413
# Lasso coefficient output — dots indicate variables zeroed out
lasso_coefs <- coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)
print(lasso_coefs)## 7 x 1 sparse Matrix of class "dgCMatrix"
## s=0.004132012
## (Intercept) 1.2960232
## gdp 0.4644887
## social_support 1.3149066
## life_expectancy 1.1611156
## freedom 2.0107022
## generosity 0.7531208
## corruption 1.0993834
Inference:
Random Forest is an ensemble method that builds 500 decision trees on bootstrapped samples and averages their predictions. It captures non-linear relationships and variable interaction effects that all three linear models cannot, making it the most flexible approach in this comparison.
rf_model <- train(happiness ~ ., data = train_data,
method = "rf",
trControl = cv_control,
tuneGrid = data.frame(mtry = c(2, 3, 4, 5)),
ntree = 500)
rf_pred <- predict(rf_model, newdata = test_data)
rf_res <- data.frame(
Actual = test_data$happiness,
Predicted = rf_pred,
Residual = test_data$happiness - rf_pred,
Model = "Random Forest"
)
rf_mse <- mean(rf_res$Residual^2)
rf_rmse <- sqrt(rf_mse)
rf_r2 <- cor(rf_res$Actual, rf_res$Predicted)^2
rf_mae <- mean(abs(rf_res$Residual))
cat(sprintf("RF -> MSE: %.4f | RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
rf_mse, rf_rmse, rf_r2, rf_mae))## RF -> MSE: 0.3633 | RMSE: 0.6028 | R2: 0.8746 | MAE: 0.4213
## Best mtry: 2
as.data.frame(importance(rf_model$finalModel)) %>%
rownames_to_column("Feature") %>%
rename(Importance = IncNodePurity) %>%
mutate(Feature = factor(Feature, levels = predictors, labels = pred_labels)) %>%
ggplot(aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
geom_col(alpha = 0.9, width = 0.6) +
geom_text(aes(label = round(Importance, 2)), hjust = -0.1, size = 4) +
coord_flip() +
scale_fill_gradient(low = "#76b7b2", high = "#4e79a7") +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
labs(title = "Random Forest: Variable Importance",
subtitle = "Higher = greater contribution to reducing prediction error across all trees",
x = NULL, y = "Importance (Mean Decrease in Node Impurity)") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Inference: The Random Forest variable importance plot identifies which predictors most consistently reduced prediction error (Node Impurity) across all decision trees. Social Support (44.45), GDP (31.85), and Life Expectancy (27.39) emerge as the top tier of contributors—a finding that is fully consistent with the preceding EDA and linear models. Notably, the Random Forest is inherently robust to the multicollinearity identified earlier; because each tree split is based on a random subset of features (mtry), the model naturally decorrelates the predictors. Given that this ensemble method can capture complex interaction effects, it provides the most robust confirmation of our feature hierarchy.
comparison <- data.frame(
Model = c("Multiple Linear Regression", "Ridge Regression",
"Lasso Regression", "Random Forest"),
MSE = round(c(mlr_mse, ridge_mse, lasso_mse, rf_mse), 4),
RMSE = round(c(mlr_rmse, ridge_rmse, lasso_rmse, rf_rmse), 4),
R2 = round(c(mlr_r2, ridge_r2, lasso_r2, rf_r2), 4),
MAE = round(c(mlr_mae, ridge_mae, lasso_mae, rf_mae), 4)
) %>% arrange(RMSE)
comparison %>%
kable(caption = "Model Performance Comparison — Test Set (Best model highlighted)",
col.names = c("Model", "MSE", "RMSE", "R2", "MAE")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
row_spec(1, bold = TRUE, background = "#d4edda")| Model | MSE | RMSE | R2 | MAE |
|---|---|---|---|---|
| Multiple Linear Regression | 0.2201 | 0.4691 | 0.8995 | 0.3641 |
| Lasso Regression | 0.2202 | 0.4693 | 0.9004 | 0.3648 |
| Ridge Regression | 0.2412 | 0.4911 | 0.9000 | 0.3878 |
| Random Forest | 0.3633 | 0.6028 | 0.8746 | 0.4213 |
comparison %>%
pivot_longer(-Model, names_to = "Metric", values_to = "Value") %>%
filter(Metric %in% c("RMSE", "R2", "MAE")) %>%
ggplot(aes(x = reorder(Model, Value), y = Value, fill = Model)) +
geom_col(alpha = 0.85, width = 0.6) +
facet_wrap(~ Metric, scales = "free_y") +
scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
labs(title = "Model Comparison: RMSE, R2, and MAE",
x = NULL, y = "Value") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")bind_rows(mlr_res, ridge_res, lasso_res, rf_res) %>%
ggplot(aes(x = Actual, y = Predicted, colour = Model)) +
geom_point(alpha = 0.8, size = 3) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", colour = "black", linewidth = 0.9) +
facet_wrap(~ Model, ncol = 2) +
scale_colour_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
labs(title = "Actual vs Predicted Happiness Scores — All Models",
subtitle = "Points along the dashed line = perfect prediction",
x = "Actual Happiness Score", y = "Predicted Happiness Score") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")bind_rows(mlr_res, ridge_res, lasso_res, rf_res) %>%
ggplot(aes(x = Predicted, y = Residual, colour = Model)) +
geom_point(alpha = 0.8, size = 3) +
geom_hline(yintercept = 0, linetype = "dashed",
colour = "black", linewidth = 0.9) +
facet_wrap(~ Model, ncol = 2) +
scale_colour_manual(values = c("#4e79a7", "#f28e2b", "#59a14f", "#e15759")) +
labs(title = "Residual Plots — All Models",
subtitle = "Residuals should be randomly scattered around zero",
x = "Predicted Value", y = "Residual (Actual - Predicted)") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")Inference — Overall Model Comparison:
This project analyzed the World Happiness Report 2024 to identify the key determinants of national happiness and build predictive models for the Ladder Score across 140 countries.
Data Cleaning: Four columns were removed (Country name, upper/lower whiskers, Dystopia + Residual) to produce a lean, meaningful feature set. Three rows with missing predictor data were removed, yielding a clean dataset of 140 countries and 7 variables.
Key EDA Findings:
Modelling Results:
Policy Implications: Nations seeking to improve national well-being should prioritize investments in economic stability, social safety nets, and healthcare infrastructure. These three factors provide the most consistent and significant “stepwise” increase in global happiness scores.