Introduction

The goal of this project is to model and predict happiness scores in 2019 using economic and social factors from the World Happiness Report 2019. We use linear regression from scratch to find the best predictors. The dataset was obtained from Kaggle.

Dataset

First, we will analyze the 2019 dataset only. The dataset includes variables such as GDP per capita, social support, life expectancy, freedom, generosity, and perception of corruption.

# Load the dataset using base R
# This reads in the CSV file and creates a data frame named df
df <- read.csv("2019.csv")

# Preview the structure of the dataset
# This shows column names, data types, and a few observations
str(df)
## 'data.frame':    156 obs. of  9 variables:
##  $ Overall.rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country.or.region           : chr  "Finland" "Denmark" "Norway" "Iceland" ...
##  $ Score                       : num  7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP.per.capita              : num  1.34 1.38 1.49 1.38 1.4 ...
##  $ Social.support              : num  1.59 1.57 1.58 1.62 1.52 ...
##  $ Healthy.life.expectancy     : num  0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom.to.make.life.choices: num  0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity                  : num  0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions.of.corruption   : num  0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...

Exploratory Data Analysis

# Display summary statistics for each column
# Helps us understand ranges, means, and identify missing data
summary(df)
##   Overall.rank    Country.or.region      Score       GDP.per.capita  
##  Min.   :  1.00   Length:156         Min.   :2.853   Min.   :0.0000  
##  1st Qu.: 39.75   Class :character   1st Qu.:4.545   1st Qu.:0.6028  
##  Median : 78.50   Mode  :character   Median :5.380   Median :0.9600  
##  Mean   : 78.50                      Mean   :5.407   Mean   :0.9051  
##  3rd Qu.:117.25                      3rd Qu.:6.184   3rd Qu.:1.2325  
##  Max.   :156.00                      Max.   :7.769   Max.   :1.6840  
##  Social.support  Healthy.life.expectancy Freedom.to.make.life.choices
##  Min.   :0.000   Min.   :0.0000          Min.   :0.0000              
##  1st Qu.:1.056   1st Qu.:0.5477          1st Qu.:0.3080              
##  Median :1.272   Median :0.7890          Median :0.4170              
##  Mean   :1.209   Mean   :0.7252          Mean   :0.3926              
##  3rd Qu.:1.452   3rd Qu.:0.8818          3rd Qu.:0.5072              
##  Max.   :1.624   Max.   :1.1410          Max.   :0.6310              
##    Generosity     Perceptions.of.corruption
##  Min.   :0.0000   Min.   :0.0000           
##  1st Qu.:0.1087   1st Qu.:0.0470           
##  Median :0.1775   Median :0.0855           
##  Mean   :0.1848   Mean   :0.1106           
##  3rd Qu.:0.2482   3rd Qu.:0.1412           
##  Max.   :0.5660   Max.   :0.4530
# Plot GDP per Capita vs Happiness Score using base R
# This helps visually assess the correlation between the two variables
plot(df$GDP.per.capita, df$Score,
     main = "GDP per Capita vs Happiness Score",
     xlab = "GDP per Capita", ylab = "Happiness Score",
     pch = 19, col = "blue")

# Add a best-fit line from a simple linear model to show the trend
abline(lm(Score ~ GDP.per.capita, data = df), col = "red")

Linear Regression Model

# Fit a multiple linear regression model using 6 predictors
# This estimates the impact of each variable on happiness score
model <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, data = df)
# Show model summary
# Includes coefficients, significance levels, R-squared, etc.
summary(model)
## 
## Call:
## lm(formula = Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75304 -0.35306  0.05703  0.36695  1.19059 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.7952     0.2111   8.505 1.77e-14 ***
## GDP.per.capita                 0.7754     0.2182   3.553 0.000510 ***
## Social.support                 1.1242     0.2369   4.745 4.83e-06 ***
## Healthy.life.expectancy        1.0781     0.3345   3.223 0.001560 ** 
## Freedom.to.make.life.choices   1.4548     0.3753   3.876 0.000159 ***
## Generosity                     0.4898     0.4977   0.984 0.326709    
## Perceptions.of.corruption      0.9723     0.5424   1.793 0.075053 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7703 
## F-statistic: 87.62 on 6 and 149 DF,  p-value: < 2.2e-16
# Predict happiness scores using the model
predicted <- predict(model)

# Plot predicted vs actual scores
# A good model should show points close to the diagonal line
plot(df$Score, predicted,
     main = "Actual vs Predicted Happiness Score",
     xlab = "Actual Score", ylab = "Predicted Score",
     pch = 19, col = "darkgreen")
abline(0, 1, lty = 2, col = "blue")  # reference line for perfect predictions

Manual Subset Selection

# Fit simpler models with fewer predictors to compare performance
model1 <- lm(Score ~ GDP.per.capita, data = df)
model2 <- lm(Score ~ GDP.per.capita + Social.support, data = df)
model3 <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy, data = df)

# Extract Adjusted R-squared from each model to compare their explanatory power
adj_r2_1 <- summary(model1)$adj.r.squared
adj_r2_2 <- summary(model2)$adj.r.squared
adj_r2_3 <- summary(model3)$adj.r.squared
adj_r2_full <- summary(model)$adj.r.squared

# Print results to console
cat("Adjusted R² - Model 1 (GDP):", adj_r2_1, "\n")
## Adjusted R² - Model 1 (GDP): 0.627849
cat("Adjusted R² - Model 2 (+Social Support):", adj_r2_2, "\n")
## Adjusted R² - Model 2 (+Social Support): 0.6998346
cat("Adjusted R² - Model 3 (+Life Expectancy):", adj_r2_3, "\n")
## Adjusted R² - Model 3 (+Life Expectancy): 0.7209235
cat("Adjusted R² - Full Model:", adj_r2_full, "\n")
## Adjusted R² - Full Model: 0.7702711

Simple Cross-Validation (Hold-Out)

# Set seed for reproducibility of random sampling
set.seed(1)

# Randomly split the data into 70% training and 30% testing
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train <- df[train_index, ]
test <- df[-train_index, ]

# Train the model on the training set
train_model <- lm(Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, data = train)

# Predict on the testing set
pred_test <- predict(train_model, newdata = test)

# Compute Mean Squared Error on test set
# This measures the average prediction error
mse <- mean((test$Score - pred_test)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 0.2718677

Bias-Variance Tradeoff Discussion

The simpler models (with fewer variables) tend to have higher bias and lower variance. As we add more predictors, the variance increases but we may reduce bias. Subset selection helps us find a good balance by comparing model fit statistics like Adjusted R². Our simple cross-validation gives us an estimate of the model’s generalization performance.

Conclusion

From our analysis we can conclude that: 1) GDP per capita and social support are strong predictors of happiness. 2) We manually tested different combinations of predictors. 3) The model performs well with 3 variables and improves slightly with the full model. 4) Cross-validation shows the model generalizes reasonably well.

References