The goal of this project is to model and predict happiness scores in 2019 using economic and social factors from the World Happiness Report 2019. We use linear regression from scratch to find the best predictors. The dataset was obtained from Kaggle.
First, we will analyze the 2019 dataset only. The dataset includes variables such as GDP per capita, social support, life expectancy, freedom, generosity, and perception of corruption.
# Load the dataset using base R
# This reads in the CSV file and creates a data frame named df
df <- read.csv("2019.csv")
# Preview the structure of the dataset
# This shows column names, data types, and a few observations
str(df)
## 'data.frame': 156 obs. of 9 variables:
## $ Overall.rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country.or.region : chr "Finland" "Denmark" "Norway" "Iceland" ...
## $ Score : num 7.77 7.6 7.55 7.49 7.49 ...
## $ GDP.per.capita : num 1.34 1.38 1.49 1.38 1.4 ...
## $ Social.support : num 1.59 1.57 1.58 1.62 1.52 ...
## $ Healthy.life.expectancy : num 0.986 0.996 1.028 1.026 0.999 ...
## $ Freedom.to.make.life.choices: num 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
## $ Generosity : num 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
## $ Perceptions.of.corruption : num 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
# Display summary statistics for each column
# Helps us understand ranges, means, and identify missing data
summary(df)
## Overall.rank Country.or.region Score GDP.per.capita
## Min. : 1.00 Length:156 Min. :2.853 Min. :0.0000
## 1st Qu.: 39.75 Class :character 1st Qu.:4.545 1st Qu.:0.6028
## Median : 78.50 Mode :character Median :5.380 Median :0.9600
## Mean : 78.50 Mean :5.407 Mean :0.9051
## 3rd Qu.:117.25 3rd Qu.:6.184 3rd Qu.:1.2325
## Max. :156.00 Max. :7.769 Max. :1.6840
## Social.support Healthy.life.expectancy Freedom.to.make.life.choices
## Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.056 1st Qu.:0.5477 1st Qu.:0.3080
## Median :1.272 Median :0.7890 Median :0.4170
## Mean :1.209 Mean :0.7252 Mean :0.3926
## 3rd Qu.:1.452 3rd Qu.:0.8818 3rd Qu.:0.5072
## Max. :1.624 Max. :1.1410 Max. :0.6310
## Generosity Perceptions.of.corruption
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1087 1st Qu.:0.0470
## Median :0.1775 Median :0.0855
## Mean :0.1848 Mean :0.1106
## 3rd Qu.:0.2482 3rd Qu.:0.1412
## Max. :0.5660 Max. :0.4530
# Plot GDP per Capita vs Happiness Score using base R
# This helps visually assess the correlation between the two variables
plot(df$GDP.per.capita, df$Score,
main = "GDP per Capita vs Happiness Score",
xlab = "GDP per Capita", ylab = "Happiness Score",
pch = 19, col = "blue")
# Add a best-fit line from a simple linear model to show the trend
abline(lm(Score ~ GDP.per.capita, data = df), col = "red")
# Fit a multiple linear regression model using 6 predictors
# This estimates the impact of each variable on happiness score
model <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, data = df)
# Show model summary
# Includes coefficients, significance levels, R-squared, etc.
summary(model)
##
## Call:
## lm(formula = Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75304 -0.35306 0.05703 0.36695 1.19059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7952 0.2111 8.505 1.77e-14 ***
## GDP.per.capita 0.7754 0.2182 3.553 0.000510 ***
## Social.support 1.1242 0.2369 4.745 4.83e-06 ***
## Healthy.life.expectancy 1.0781 0.3345 3.223 0.001560 **
## Freedom.to.make.life.choices 1.4548 0.3753 3.876 0.000159 ***
## Generosity 0.4898 0.4977 0.984 0.326709
## Perceptions.of.corruption 0.9723 0.5424 1.793 0.075053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared: 0.7792, Adjusted R-squared: 0.7703
## F-statistic: 87.62 on 6 and 149 DF, p-value: < 2.2e-16
# Predict happiness scores using the model
predicted <- predict(model)
# Plot predicted vs actual scores
# A good model should show points close to the diagonal line
plot(df$Score, predicted,
main = "Actual vs Predicted Happiness Score",
xlab = "Actual Score", ylab = "Predicted Score",
pch = 19, col = "darkgreen")
abline(0, 1, lty = 2, col = "blue") # reference line for perfect predictions
# Fit simpler models with fewer predictors to compare performance
model1 <- lm(Score ~ GDP.per.capita, data = df)
model2 <- lm(Score ~ GDP.per.capita + Social.support, data = df)
model3 <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy, data = df)
# Extract Adjusted R-squared from each model to compare their explanatory power
adj_r2_1 <- summary(model1)$adj.r.squared
adj_r2_2 <- summary(model2)$adj.r.squared
adj_r2_3 <- summary(model3)$adj.r.squared
adj_r2_full <- summary(model)$adj.r.squared
# Print results to console
cat("Adjusted R² - Model 1 (GDP):", adj_r2_1, "\n")
## Adjusted R² - Model 1 (GDP): 0.627849
cat("Adjusted R² - Model 2 (+Social Support):", adj_r2_2, "\n")
## Adjusted R² - Model 2 (+Social Support): 0.6998346
cat("Adjusted R² - Model 3 (+Life Expectancy):", adj_r2_3, "\n")
## Adjusted R² - Model 3 (+Life Expectancy): 0.7209235
cat("Adjusted R² - Full Model:", adj_r2_full, "\n")
## Adjusted R² - Full Model: 0.7702711
# Set seed for reproducibility of random sampling
set.seed(1)
# Randomly split the data into 70% training and 30% testing
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train <- df[train_index, ]
test <- df[-train_index, ]
# Train the model on the training set
train_model <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, data = train)
# Predict on the testing set
pred_test <- predict(train_model, newdata = test)
# Compute Mean Squared Error on test set
# This measures the average prediction error
mse <- mean((test$Score - pred_test)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 0.2718677
The simpler models (with fewer variables) tend to have higher bias and lower variance. As we add more predictors, the variance increases but we may reduce bias. Subset selection helps us find a good balance by comparing model fit statistics like Adjusted R². Our simple cross-validation gives us an estimate of the model’s generalization performance.
From our analysis we can conclude that: 1) GDP per capita and social support are strong predictors of happiness. 2) We manually tested different combinations of predictors. 3) The model performs well with 3 variables and improves slightly with the full model. 4) Cross-validation shows the model generalizes reasonably well.