LinhLe- DI HW Linear Regression

Part 1: Formulating a Research Question

Research question: How do a country’s social support systems, freedom of life choices, generosity, and perceptions of corruption predict national happiness scores?

Dependent variable: Happiness score (0–10 scale)

Independent variables: Social support, freedom to make life choices, generosity index, corruption perception score

Why the question is of your interest Because I always concern what can decide a happy country as a happy country would have more good people and the more good people the better the world.

Dataset: World Happiness Report 2021 - Kaggle - https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021

Part 2: Data Exploration and Preparation

About:

The World Happiness Report is a landmark survey of the state of global happiness

The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions.

Leading experts across fields-economics, psycology, surevy analysis, national statistics, health, public policy and more-describe how measurements of well-being can be used effectively to assess the progress of nations.

Source: - The happiness scores and rankings use data from the Gallup World Poll - a set of nationally representative surveys undertaken in more than 160 countries in over 140 languages.

Variables: Main variable

var<-tribble(
  ~variable, ~meaning, 
  "Ladder Score (Happiness Score)", "Based on the Cantril Ladder of Life Scale — respondents imagine a ladder from 0 (worst possible life) to 10 (best possible life) and rate where they currently stand. This is the national average of those responses.",
  "Log GDP per capita", "Acountry's economic output per person, log-transformed",
  
 "Social support", "Average answer to 'Do you have someone to count on in times of trouble?' (0 or 1)",

"Healthy life expectancy", "Average number of years a person can expect to live in good health",


"Freedom to make life choices", "Average response to 'Are you satisfied with your freedom to choose what to do with your life?'",

"Generosity", "Residual of regressing national average of donation behavior on GDP per capita",

"Perceptions of corruption", "Average of two binary questions about corruption in government and business",

"Regional indicator", "Region",

"Upper and lower confidence interval bounds"," Confidence interval bounds for the ladder score"
)
library(knitr)
kable(var)

variable	meaning
Ladder Score (Happiness Score)	Based on the Cantril Ladder of Life Scale — respondents imagine a ladder from 0 (worst possible life) to 10 (best possible life) and rate where they currently stand. This is the national average of those responses.
Log GDP per capita	Acountry’s economic output per person, log-transformed
Social support	Average answer to ‘Do you have someone to count on in times of trouble?’ (0 or 1)
Healthy life expectancy	Average number of years a person can expect to live in good health
Freedom to make life choices	Average response to ‘Are you satisfied with your freedom to choose what to do with your life?’
Generosity	Residual of regressing national average of donation behavior on GDP per capita
Perceptions of corruption	Average of two binary questions about corruption in government and business
Regional indicator	Region
Upper and lower confidence interval bounds	Confidence interval bounds for the ladder score

view(var)

Table of the Dataset:

wh <- read.csv("~/Downloads/School material/D-Visual/archive/world-happiness-report-2021.csv", header=TRUE)

glimpse(wh)

## Rows: 149
## Columns: 20
## $ Country.name                               <chr> "Finland", "Denmark", "Swit…
## $ Regional.indicator                         <chr> "Western Europe", "Western …
## $ Ladder.score                               <dbl> 7.842, 7.620, 7.571, 7.554,…
## $ Standard.error.of.ladder.score             <dbl> 0.032, 0.035, 0.036, 0.059,…
## $ upperwhisker                               <dbl> 7.904, 7.687, 7.643, 7.670,…
## $ lowerwhisker                               <dbl> 7.780, 7.552, 7.500, 7.438,…
## $ Logged.GDP.per.capita                      <dbl> 10.775, 10.933, 11.117, 10.…
## $ Social.support                             <dbl> 0.954, 0.954, 0.942, 0.983,…
## $ Healthy.life.expectancy                    <dbl> 72.000, 72.700, 74.400, 73.…
## $ Freedom.to.make.life.choices               <dbl> 0.949, 0.946, 0.919, 0.955,…
## $ Generosity                                 <dbl> -0.098, 0.030, 0.025, 0.160…
## $ Perceptions.of.corruption                  <dbl> 0.186, 0.179, 0.292, 0.673,…
## $ Ladder.score.in.Dystopia                   <dbl> 2.43, 2.43, 2.43, 2.43, 2.4…
## $ Explained.by..Log.GDP.per.capita           <dbl> 1.446, 1.502, 1.566, 1.482,…
## $ Explained.by..Social.support               <dbl> 1.106, 1.108, 1.079, 1.172,…
## $ Explained.by..Healthy.life.expectancy      <dbl> 0.741, 0.763, 0.816, 0.772,…
## $ Explained.by..Freedom.to.make.life.choices <dbl> 0.691, 0.686, 0.653, 0.698,…
## $ Explained.by..Generosity                   <dbl> 0.124, 0.208, 0.204, 0.293,…
## $ Explained.by..Perceptions.of.corruption    <dbl> 0.481, 0.485, 0.413, 0.170,…
## $ Dystopia...residual                        <dbl> 3.253, 2.868, 2.839, 2.967,…

Visualize:

wh %>%
  slice_max(Ladder.score, n = 20) %>%
  ggplot(aes(x = reorder(Country.name, Ladder.score), y = Ladder.score)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Happiest Countries - 2021",
    x = "Country",
    y = "Ladder Score"
  ) +
  theme_minimal()

Task 3: Linear Regression Analysis

Choice of variables:

Dependent variable: Ladder.score (Happiness) Independent variables: GDP, Social support, Life expectancy,Freedom, Generosity, Corruption

We are predicting happiness using the 6 factors that the World Happiness Report uses in their research: GDP, social support, healthy life expectancy, freedom, generosity, and perceptions of corruption

This is because the studies and theory both support that these things affect how happy people feel.

Model and anova

# Fit the full model
model <- lm( Ladder.score ~ Logged.GDP.per.capita + 
               Social.support + 
               Healthy.life.expectancy + 
               Freedom.to.make.life.choices + 
               Generosity + 
               Perceptions.of.corruption, 
             data = wh)
#View 
summary(model)

## 
## Call:
## lm(formula = Ladder.score ~ Logged.GDP.per.capita + Social.support + 
##     Healthy.life.expectancy + Freedom.to.make.life.choices + 
##     Generosity + Perceptions.of.corruption, data = wh)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85049 -0.30026  0.05735  0.33368  1.04878 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.23722    0.63049  -3.548 0.000526 ***
## Logged.GDP.per.capita         0.27953    0.08684   3.219 0.001595 ** 
## Social.support                2.47621    0.66822   3.706 0.000301 ***
## Healthy.life.expectancy       0.03031    0.01333   2.274 0.024494 *  
## Freedom.to.make.life.choices  2.01046    0.49480   4.063 7.98e-05 ***
## Generosity                    0.36438    0.32121   1.134 0.258541    
## Perceptions.of.corruption    -0.60509    0.29051  -2.083 0.039058 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5417 on 142 degrees of freedom
## Multiple R-squared:  0.7558, Adjusted R-squared:  0.7455 
## F-statistic: 73.27 on 6 and 142 DF,  p-value: < 2.2e-16

#Anova
anova(model)

## Analysis of Variance Table
## 
## Response: Ladder.score
##                               Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Logged.GDP.per.capita          1 106.463 106.463 362.7575 < 2.2e-16 ***
## Social.support                 1   8.320   8.320  28.3503 3.869e-07 ***
## Healthy.life.expectancy        1   3.476   3.476  11.8455 0.0007596 ***
## Freedom.to.make.life.choices   1   8.769   8.769  29.8807 2.009e-07 ***
## Generosity                     1   0.713   0.713   2.4306 0.1212116    
## Perceptions.of.corruption      1   1.273   1.273   4.3383 0.0390577 *  
## Residuals                    142  41.674   0.293                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Reduced model

model_r <- lm( Ladder.score ~ Logged.GDP.per.capita + 
               Social.support + 
               Healthy.life.expectancy + 
               Freedom.to.make.life.choices + 
               Perceptions.of.corruption, 
             data = wh)

summary(model_r)

## 
## Call:
## lm(formula = Ladder.score ~ Logged.GDP.per.capita + Social.support + 
##     Healthy.life.expectancy + Freedom.to.make.life.choices + 
##     Perceptions.of.corruption, data = wh)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93303 -0.29768  0.06863  0.33924  1.02304 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.11039    0.62112  -3.398 0.000880 ***
## Logged.GDP.per.capita         0.26400    0.08584   3.075 0.002518 ** 
## Social.support                2.50670    0.66835   3.751 0.000256 ***
## Healthy.life.expectancy       0.02936    0.01332   2.204 0.029095 *  
## Freedom.to.make.life.choices  2.13266    0.48342   4.412 2.01e-05 ***
## Perceptions.of.corruption    -0.66778    0.28549  -2.339 0.020718 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5423 on 143 degrees of freedom
## Multiple R-squared:  0.7536, Adjusted R-squared:  0.745 
## F-statistic: 87.49 on 5 and 143 DF,  p-value: < 2.2e-16

# Compare full vs reduced model
anova(model_r, model)

## Analysis of Variance Table
## 
## Model 1: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Perceptions.of.corruption
## Model 2: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    143 42.052                           
## 2    142 41.674  1   0.37767 1.2868 0.2585

Residual plots, Normality, Collinearity:

# Residual and normality plots (4 plots in one)
par(mfrow = c(2,2))
plot(model)

# Check collinearity using VIF 
# VIF > 5 = moderate concern, VIF > 10 = serious problem

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:openintro':
## 
##     densityPlot

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

vif(model)

##        Logged.GDP.per.capita               Social.support 
##                     5.104890                     2.972200 
##      Healthy.life.expectancy Freedom.to.make.life.choices 
##                     4.099348                     1.585807 
##                   Generosity    Perceptions.of.corruption 
##                     1.180982                     1.367122

Interaction or higher-order terms

wh_selected <- wh %>%
  dplyr:: select(Ladder.score,
         Logged.GDP.per.capita,
         Social.support,
         Healthy.life.expectancy,
         Freedom.to.make.life.choices,
         Generosity,
         Perceptions.of.corruption)
cor(wh_selected)

##                              Ladder.score Logged.GDP.per.capita Social.support
## Ladder.score                   1.00000000             0.7897597      0.7568876
## Logged.GDP.per.capita          0.78975970             1.0000000      0.7852987
## Social.support                 0.75688765             0.7852987      1.0000000
## Healthy.life.expectancy        0.76809946             0.8594606      0.7232561
## Freedom.to.make.life.choices   0.60775307             0.4323235      0.4829298
## Generosity                    -0.01779928            -0.1992864     -0.1149459
## Perceptions.of.corruption     -0.42114000            -0.3423374     -0.2032070
##                              Healthy.life.expectancy
## Ladder.score                               0.7680995
## Logged.GDP.per.capita                      0.8594606
## Social.support                             0.7232561
## Healthy.life.expectancy                    1.0000000
## Freedom.to.make.life.choices               0.4614939
## Generosity                                -0.1617503
## Perceptions.of.corruption                 -0.3643735
##                              Freedom.to.make.life.choices  Generosity
## Ladder.score                                    0.6077531 -0.01779928
## Logged.GDP.per.capita                           0.4323235 -0.19928640
## Social.support                                  0.4829298 -0.11494585
## Healthy.life.expectancy                         0.4614939 -0.16175028
## Freedom.to.make.life.choices                    1.0000000  0.16943737
## Generosity                                      0.1694374  1.00000000
## Perceptions.of.corruption                      -0.4013630 -0.16396173
##                              Perceptions.of.corruption
## Ladder.score                                -0.4211400
## Logged.GDP.per.capita                       -0.3423374
## Social.support                              -0.2032070
## Healthy.life.expectancy                     -0.3643735
## Freedom.to.make.life.choices                -0.4013630
## Generosity                                  -0.1639617
## Perceptions.of.corruption                    1.0000000

With a correlations higher than 0.85, we can tell the correlations between GDP and Health is quite high which show that rich country will live longer.

Wealthier countries here tend to have stronger social systems because our correlation between GDP and social is about 0.78

Check all the residual plot

wh_diag <- wh %>%
  mutate(residuals = resid(model))

predictors <- c("Logged.GDP.per.capita", "Social.support",
                "Healthy.life.expectancy", "Freedom.to.make.life.choices",
                "Generosity", "Perceptions.of.corruption")

for (var in predictors) {
  p <- ggplot(wh_diag, aes_string(x = var, y = "residuals")) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "steelblue") +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(title = paste("Residuals vs", var), y = "Residuals") +
    theme_minimal()
  print(p)
}

Residual plots were examined for all six predictors to assess whether higher-order or interaction terms were necessary.

Freedom to make life choices showed a near-flat pattern around zero which mean there is no need of adjustment.

The remaining variables showed varying degrees of non-linearity:

Healthy life expectancy displayed the strongest pattern which provide a clear justification of a quadratic term
Social support showed a bit curve in the mid-range
GDP showed only a mild S-curve
Perceptions of corruption and Generosity showed minor deviations.

Based on this visual evidence, quadratic terms for Healthy.life.expectancy and Social.support were tested using partial F-tests via anova()

# strongest candidate
model_quad_health <- lm(Ladder.score ~ Logged.GDP.per.capita +
                        Social.support +
                        Healthy.life.expectancy +
                        I(Healthy.life.expectancy^2) +
                        Freedom.to.make.life.choices +
                        Generosity +
                        Perceptions.of.corruption,
                      data = wh)

# Second candidate
model_quad_social <- lm(Ladder.score ~ Logged.GDP.per.capita +
                          Social.support +
                          I(Social.support^2) +
                          Healthy.life.expectancy +
                          Freedom.to.make.life.choices +
                          Generosity +
                          Perceptions.of.corruption,
                        data = wh)

# Compare each to base model
anova(model, model_quad_health)    # p < 0.05 = keep it

## Analysis of Variance Table
## 
## Model 1: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
## Model 2: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     I(Healthy.life.expectancy^2) + Freedom.to.make.life.choices + 
##     Generosity + Perceptions.of.corruption
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    142 41.674                                
## 2    141 39.705  1    1.9694 6.9939 0.009106 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model, model_quad_social)  # p < 0.05 = keep it

## Analysis of Variance Table
## 
## Model 1: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
## Model 2: Ladder.score ~ Logged.GDP.per.capita + Social.support + I(Social.support^2) + 
##     Healthy.life.expectancy + Freedom.to.make.life.choices + 
##     Generosity + Perceptions.of.corruption
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
## 1    142 41.674                             
## 2    141 40.343  1     1.331 4.652 0.03271 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Also check adjusted R² — did it improve?
summary(model)$adj.r.squared

## [1] 0.7455308

summary(model_quad_health)$adj.r.squared

## [1] 0.7558371

summary(model_quad_social)$adj.r.squared

## [1] 0.7519113

#both quadratic terms together
model_quad_both <- lm(Ladder.score ~ Logged.GDP.per.capita +
                        Social.support +
                        I(Social.support^2) +
                        Healthy.life.expectancy +
                        I(Healthy.life.expectancy^2) +
                        Freedom.to.make.life.choices +
                        Generosity +
                        Perceptions.of.corruption,
                      data = wh)

summary(model_quad_both)$adj.r.squared

## [1] 0.7601577

anova(model_quad_health, model_quad_both)

## Analysis of Variance Table
## 
## Model 1: Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     I(Healthy.life.expectancy^2) + Freedom.to.make.life.choices + 
##     Generosity + Perceptions.of.corruption
## Model 2: Ladder.score ~ Logged.GDP.per.capita + Social.support + I(Social.support^2) + 
##     Healthy.life.expectancy + I(Healthy.life.expectancy^2) + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
##   Res.Df    RSS Df Sum of Sq    F  Pr(>F)  
## 1    141 39.705                            
## 2    140 38.726  1   0.97922 3.54 0.06198 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since adding Social.support^2 on top of Healthy.life.expectancy^2 does not significantly improve fit because p = 0.062 > 0.05 threshold

So that we keep my final model would be Healthy.life.expectancy^2.

# Final 
model_final <- lm(Ladder.score ~ Logged.GDP.per.capita +
                        Social.support +
                        Healthy.life.expectancy +
                        I(Healthy.life.expectancy^2) +
                        Freedom.to.make.life.choices +
                        Generosity +
                        Perceptions.of.corruption,
                      data = wh)
summary(model_final)

## 
## Call:
## lm(formula = Ladder.score ~ Logged.GDP.per.capita + Social.support + 
##     Healthy.life.expectancy + I(Healthy.life.expectancy^2) + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption, 
##     data = wh)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73154 -0.24981  0.06797  0.33892  1.12600 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   7.758067   3.829640   2.026 0.044672 *  
## Logged.GDP.per.capita         0.225680   0.087467   2.580 0.010897 *  
## Social.support                2.561439   0.655343   3.909 0.000144 ***
## Healthy.life.expectancy      -0.297325   0.124576  -2.387 0.018328 *  
## I(Healthy.life.expectancy^2)  0.002650   0.001002   2.645 0.009106 ** 
## Freedom.to.make.life.choices  2.274823   0.494882   4.597 9.45e-06 ***
## Generosity                    0.353807   0.314667   1.124 0.262760    
## Perceptions.of.corruption    -0.312564   0.305309  -1.024 0.307701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5307 on 141 degrees of freedom
## Multiple R-squared:  0.7674, Adjusted R-squared:  0.7558 
## F-statistic: 66.45 on 7 and 141 DF,  p-value: < 2.2e-16

Interpretation:

The final model explains 75.6% of the variance in national happiness scores (adjusted R² = 0.7558), indicating a strong fit.

For log GDP per capita, it is associated with a 0.23 point increase in happiness. This is statistically significant with p value <0.05, confirming that wealthier nations tend to be happier.

For Social support, and Freedom, these are the strongest significant predictor since a one-unit increase in Social support is associated with a 2.56 point increase in happiness and a one-unit increase in freedom of life choices is associated with a 2.27 point increase in happiness. And for my research question : How do a country’s social support systems, freedom of life choices, generosity, and perceptions of corruption predict national happiness scores, we can tell that Social support and freedom of life have the most influent on the national happiness scores.

For Healthy life expectancy, we can’t interpreted as a single number because of the quadratic term. The relationship is curved, not straight

For Generosity and Corruption, these two variables are not significant since their p value are larger than 0.05.

Part 4: Discussion and Model Improvement

Potential Model Improvements: I can add regional indicator to capture more information

Additional Variables or Transformations:

We can add Unemployment rate and Income inequality to show how Financial insecurity affects happiness beyond average GDP and how Two countries can have the same GDP but very different happiness if wealth is unevenly distributed

We can also change it from just in 2021 to 2015-2021 or make it update until now since the year of 2021 is the year of Pandemic that it might not be representative.

Limitation:

2021 was affected by the pandemic, which may have changed the happiness scores, social support measures, and life expectancy in ways that are not representative of normal conditions.

This data set might be biased since all variables are national averages, which hides enormous within-country variation. For example: A person in rural China and urban China may have completely different happiness drivers, but they contribute to the same data point.