Assessing the impact of socio economic status across different demographics to understand the disparities: Analyzing Global Life Expectancy

Author

Karim 32228643

Introduction

Research Question

The conceptual research question guiding this analysis is: “How do a country’s socio-economic factors, including healthcare access, income levels, and consumption patterns, influence life expectancy globally?” This question aims to explore the complex relationship between these socio-economic determinants and health outcomes, highlighting disparities across countries with different demographics.

Research Hypothesis

The practical research hypothesis posited in this report is: “Countries with higher levels of income accessibility to healthcare services exhibit higher life expectancy and lower mortality rates” we believe the data we uses support the research hypothesis. The full dataset overview will be given below. For instance, Health expenditure, GDP per capita, and GNI per capita all increase significantly from low-income to high-income countries, indicating that wealthier nations invest more in healthcare infrastructure and services (Boniol et al., 2022). Disease death rates and both child and adult mortality rates decrease with rising income levels, reflecting the effectiveness of healthcare services in wealthier countries (Arias et al., 2022). The number of hospital beds and urban population percentages also increase, highlighting improved healthcare capacity and access in richer nations. Despite higher alcohol consumption in high-income countries, the overall health benefits from better healthcare infrastructure outweigh potential negative impacts (Watsons, 2022).

Libraries used

To perform the analysis and generate the necessary visualizations and s for this technical report, the following R packages will be utilized:

show code
library(readr)
library(tidyverse)
library(readxl)
library(visdat)
library(gt)
library(gtsummary)
library(tidymodels)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(rsample)
library(tune)
library(car)
library(patchwork)
library(glmnet)
library(vip)
library(DALEX)
library(knitr)
library(kableExtra)
library(huxtable)
library(modelsummary)
library(gtsummary)
library(psych)
library(randomForest)
library(caret)
library(glmnet) 
library(corrplot)
library(rpart.plot)
library(rpart)

Dataset Overview

To perform this analysis, we utilizes the dataset from the previous infographic assignment. The dataset given, is a joined dataset from numerous sources which consists of socio-economic predictors and wealth status. The previous infographic dataset consists of aggregated countries, region and UN classify territories. Therefore, the code chunk below filter out those non-country values and keep all 193 UN member countries.

Initial Data Preprocessing

1. Data collection and integration

show code
# load previous infographic dataset 
IA3_df = read_excel("Datasets/Raw_IA3_df.xlsx")

# convert 0s into NA
full_df <- IA3_df %>%
  mutate(across(where(is.numeric), ~na_if(., 0)))
# countries column has loads of region and state names, even aggregated names.
# collect country code that consists of all recognized 195 countries, however this datasets do not consists of Palestine and Vatican city as they are non-member observer state.
codes = read_csv("Datasets/countries.csv") %>% select(en, alpha3) %>% 
  transmute(Code = toupper(alpha3),
            Country = en)

# get continents dataset for grouping descriptive summary results
continent = read_csv("Datasets/continents.csv") %>%
  rename(country = Country)
  


# join two datasets, and remove rows that is inconsistent with the earlier country code
cleaned_df = full_df %>% full_join(codes, by = "Code") %>% 
  filter(Country != "0") %>% select(-Country) %>% distinct(Code, .keep_all =TRUE) %>% inner_join(continent, by = "country")


# rewrite the name to be shorter and simpler to write
cleaned_df = cleaned_df %>% 
  rename(adult_mortality = `Adult Mortality rate per 1,000`,
         total_mortality = `Total Mortality rate per 1,000`)

# replace white space with underscores
names(cleaned_df) <- gsub(" ", "_",tolower(names(cleaned_df)))

Dependent Variable: Life_Expectancy

The dependent variable we use is `Life_Expectancy` variable which fortunately has no missing values. Data is obtained from World Bank Databank specifically taken from year 2019 before the heavy influx of Covid-19 that may affect our predictions.

2. Handling Missing Values

show code
# calculate the percentage amount of NA relative to the dataset population
percent_na = sum(is.na(cleaned_df)) / (nrow(cleaned_df) * ncol(cleaned_df)) 

# calculate the total NA
full_na = sum(is.na(cleaned_df))

# visualize the missing values
vis_miss(cleaned_df)

In this dataset, it has 416 missing “NA” values. Which means 11.97% of this dataset is missing. gini_index and doctors dataset has the highest missing values while life expectancy (dependent variable) has no missing values. Because more than half of the observation is missing, imputing the value would not be sui as the predicted values would not replicate the supposed actual value.

2.1 Removing variables with very high percentage of missing values

Missing Values of Upper middle income to High income countries

show code
# we visualizes the data of the upper-middle- income and above
cleaned_df %>%
  filter(gni_percap >=3995)%>% vis_miss()

Missing Values of Low middle income to Low income countries

show code
# we visualizes the data of the low-middle- income and below
cleaned_df %>%
  filter(gni_percap <=3995)%>% vis_miss() 

It can be seen that the lower income countries has relatively higher amount of missing values. This is no coincidence, likely to be Missing Not at Random (MNAR), suggesting that regions with lower income are less likely to report this data, even 50% of the hospital_beds variable is missing. Although docters and gini_index might be useful for the model, this non-informative missingness constitutes a large portion of the overall data (more than 50%) for doctors and gini_index. Therefore, we delete this variable and perform imputation on the rest.

2.2. Handling missing values through imputation methods

Use Bagged Trees (Random Forest) for Higher Missingness: (more than or equal to 10%)

Random forest imputation is robust and can handle higher levels of missing data better than KNN (Hu & Szymczak, 2022). It considers the relationships between variables and leverages the ensemble method to make better predictions (Hu & Szymczak, 2022).

Variables: unemployment_(%), hospital_beds, adult_mortality


Use K-Nearest Neighbor for Lower Missingness: (less than 10%)

KNN is effective for lower levels of missing data and can be quicker to implement. It uses the closest neighbors to impute the missing values, which works well when missingness is lower (Keith, 2021)..

Variables: health_expenditure, gdp_percap, gni_percap, disease_deathrate, disease_death, adult_mortality, total_mortality, child_mortality

show code
# delete gini index and doctors
deleted_df = select(cleaned_df,-c(gini_index, doctors))%>% 
  # convert character columns to factors or vice versa as needed
  mutate_if(is.character, as.factor)

# variable name with low missingness
low_missingness = colnames(select(deleted_df,-c(`unemployment_(%)`, hospital_beds, adult_mortality)))


# Create a recipe for data preprocessing with imputation
rec0 <- recipe(~ ., data = deleted_df) %>%
  step_impute_bag(c(`unemployment_(%)`, hospital_beds, adult_mortality)) %>%
  step_impute_knn(c(low_missingness), neighbors = 3) 
  

# Prepare the recipe
rec_prep0 <- prep(rec0, training = deleted_df)

# Apply the recipe to the data (bake the recipe)
final_data <- bake(rec_prep0, new_data =deleted_df )

# Check the result
vis_miss(final_data)

Perfect!

2.3 Full dataset before input selection and normalization

show code
table_1_df <- final_data %>%
  mutate(`Income Level` = factor(case_when(
    gni_percap < 1035 ~ "Low-income",
    gni_percap >= 1036 & gni_percap <= 4045 ~ "Lower-middle-income",
    gni_percap >= 4046 & gni_percap <= 12535 ~ "Upper-middle-income",
    gni_percap >= 12535 ~ "High-income"),
    levels = c("Low-income", "Lower-middle-income","Upper-middle-income","High-income"))
    ) %>% select(!c(country, code))


summary_by_income <- table_1_df %>%
  tbl_summary(
    by = `Income Level`,
    missing = "no"
  ) %>%
  modify_header(label = "**Variable**")


kable(summary_by_income,
      caption = "*Full dataset summary statistics aggregated by income levels*")
Full dataset summary statistics aggregated by income levels
Variable Low-income, N = 24 Lower-middle-income, N = 50 Upper-middle-income, N = 60 High-income, N = 59
health_expenditure 107 (72, 152) 213 (146, 412) 907 (657, 1,306) 3,324 (2,200, 5,314)
life_expectancy 62 (60, 64) 68 (63, 71) 74 (72, 77) 81 (78, 83)
unemployment_(%) 3.9 (3.0, 7.3) 5.1 (3.4, 9.3) 5.9 (4.4, 11.7) 5.0 (3.6, 6.4)
gdp_percap 698 (540, 810) 2,145 (1,454, 3,049) 6,697 (5,172, 9,158) 32,279 (19,138, 51,004)
gni_percap 695 (520, 823) 2,130 (1,525, 3,158) 6,635 (5,260, 8,975) 34,660 (18,630, 50,470)
disease_deathrate 291 (212, 407) 134 (61, 250) 44 (27, 75) 36 (25, 48)
diseases_death 46,205 (23,463, 90,374) 16,992 (2,982, 55,250) 2,049 (467, 8,696) 1,720 (222, 6,922)
hospital_beds 9 (8, 20) 11 (9, 18) 22 (13, 40) 31 (25, 51)
child_mortality 52 (41, 70) 26 (15, 38) 11 (7, 18) 4 (3, 6)
adult_mortality 290 (266, 325) 231 (187, 282) 183 (141, 227) 83 (69, 139)
total_mortality 7.93 (6.57, 9.10) 6.67 (5.66, 7.92) 7.17 (5.58, 8.51) 7.94 (6.32, 9.98)
alcohol_percap 2.0 (0.4, 3.8) 3.3 (1.4, 5.0) 5.3 (2.7, 7.4) 9.4 (6.3, 10.9)
urban_pop 37 (24, 43) 42 (31, 57) 64 (54, 77) 82 (68, 90)
continent NA NA NA NA
Africa 21 (88%) 23 (46%) 9 (15%) 1 (1.7%)
Asia 3 (13%) 17 (34%) 12 (20%) 11 (19%)
Europe 0 (0%) 1 (2.0%) 12 (20%) 34 (58%)
North America 0 (0%) 3 (6.0%) 13 (22%) 7 (12%)
Oceania 0 (0%) 5 (10%) 5 (8.3%) 4 (6.8%)
South America 0 (0%) 1 (2.0%) 9 (15%) 2 (3.4%)
show code
# Print the 
kable(describe(final_data),
      caption = "*Full dataset summary statistics for each variable*",
      digits = 4)
Full dataset summary statistics for each variable
vars n mean sd median trimmed mad min max range skew kurtosis se
country* 1 193 97.0000 55.8585 97.0000 97.0000 71.1648 1.0000 193.0000 192.0000 0.0000 -1.2187 4.0208
health_expenditure 2 193 1573.7975 1897.9076 791.7386 1193.4901 957.4787 39.1323 10661.0284 10621.8961 1.8082 3.2498 136.6144
code* 3 193 97.0000 55.8585 97.0000 97.0000 71.1648 1.0000 193.0000 192.0000 0.0000 -1.2187 4.0208
life_expectancy 4 193 72.4264 7.6160 73.4000 72.7697 8.5991 52.9000 86.5000 33.6000 -0.3792 -0.6855 0.5482
unemployment_(%) 5 193 6.8661 5.1275 5.0200 6.0661 2.5797 0.1000 26.3070 26.2070 1.4854 1.8643 0.3691
gdp_percap 6 193 15889.1266 25892.1206 6070.3881 10373.6659 6901.5154 216.9730 199382.8386 199165.8656 3.6552 18.3176 1863.7556
gni_percap 7 193 14490.6563 19284.8572 5940.0000 10385.2903 6775.4820 230.0000 83160.0000 82930.0000 1.7921 2.3469 1388.1544
disease_deathrate 8 193 116.6439 135.6945 57.2600 89.4985 50.7642 4.1600 820.5100 816.3500 2.0516 4.8176 9.7675
diseases_death 9 193 40634.7789 150938.8479 5385.0000 15385.2344 7807.3716 6.0000 1841182.0000 1841176.0000 9.5761 105.6194 10864.8160
hospital_beds 10 193 27.5813 20.8002 21.5000 24.7495 17.4947 1.7000 128.8000 127.1000 1.3482 2.3763 1.4972
child_mortality 11 193 20.5006 21.2119 12.5398 16.5013 12.7742 0.8317 98.4470 97.6153 1.7055 2.6173 1.5269
adult_mortality 12 193 191.9923 96.5979 185.8290 184.8435 114.1291 46.4220 516.4180 469.9960 0.5762 -0.0935 6.9533
total_mortality 13 193 7.6299 2.6311 7.2900 7.4650 2.2387 0.9910 15.5000 14.5090 0.5186 0.4896 0.1894
alcohol_percap 14 193 5.4208 3.8830 4.9250 5.2308 4.6465 0.0051 16.9896 16.9846 0.3819 -0.8859 0.2795
urban_pop 15 193 59.2395 22.9545 59.0390 59.6902 27.2472 13.2500 100.0000 86.7500 -0.1482 -0.9836 1.6523
continent* 16 193 2.6684 1.4874 2.0000 2.5097 1.4826 1.0000 6.0000 5.0000 0.6513 -0.4678 0.1071

3. Correlation analysis with target variable

3.1 Full data correlation matrix

show code
# store a numeric only dataframe for our linear regression
numeric_df = select(final_data, where(is.numeric)) 

# form a correlation matrix
cor_matrix = numeric_df %>% cor() 

# identify the variable with high correlation 
cor_matrix %>%
  corrplot(method ="number",
           order = "hclust")

show code
# identify the correlated variables, setting a threshold of 0.3 correlation. (Moderate to strong correlation)
correlated_variables = as.data.frame(cor_matrix) %>% 
  select(life_expectancy) %>%
  filter(life_expectancy > 0.3 | life_expectancy < -0.3 )%>%
  filter(life_expectancy != 1) 

# these variables with the most correlation 
kable(correlated_variables %>%
  rownames_to_column() %>% 
  arrange(desc(life_expectancy)),
  caption = "*Variables with moderate to strong Correlation with life expectancy*",
  label= "*Variable*")
Variables with moderate to strong Correlation with life expectancy
rowname life_expectancy
health_expenditure 0.7162116
gni_percap 0.6961471
gdp_percap 0.6107644
urban_pop 0.5798739
alcohol_percap 0.4023439
hospital_beds 0.3246269
disease_deathrate -0.8006771
child_mortality -0.8412293
adult_mortality -0.9340389
show code
# Melt the data frame
df_melt0 <- melt(numeric_df%>%
  select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars = "life_expectancy")

# Create scatterplots
scatterplots0 <- ggplot(df_melt0, aes(x = value, y = life_expectancy)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Scatterplot Matrix Against Life Expectancy")+
  geom_smooth(size = 0.3, color = "red",linetype = 2, se = FALSE)+
  theme_bw()

print(scatterplots0)

show code
distributions0 <- ggplot(df_melt0, aes(x = value)) +
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.3) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions Against Life Expectancy")+
  theme_bw()

distributions0

4. Final selected Inputs

show code
# select corelated variable without normalization 
correlated_inputs = select(final_data, c(colnames(t(correlated_variables)), "life_expectancy", ))
kable(correlated_inputs %>%
  mutate(`Income Level` = factor(case_when(
    gni_percap < 1035 ~ "Low-income",
    gni_percap >= 1036 & gni_percap <= 4045 ~ "Lower-middle-income",
    gni_percap >= 4046 & gni_percap <= 12535 ~ "Upper-middle-income",
    gni_percap >= 12535 ~ "High-income"),
    levels = c("Low-income", "Lower-middle-income","Upper-middle-income","High-income"))) %>%
  tbl_summary(
    by = `Income Level`,
    missing = "no"
  ) %>%
  modify_header(label = "**Variable**"),
      caption = "*Selected dataset summary statistics aggregated by income levels*",digit = 4)
Selected dataset summary statistics aggregated by income levels
Variable Low-income, N = 24 Lower-middle-income, N = 50 Upper-middle-income, N = 60 High-income, N = 59
health_expenditure 107 (72, 152) 213 (146, 412) 907 (657, 1,306) 3,324 (2,200, 5,314)
gdp_percap 698 (540, 810) 2,145 (1,454, 3,049) 6,697 (5,172, 9,158) 32,279 (19,138, 51,004)
gni_percap 695 (520, 823) 2,130 (1,525, 3,158) 6,635 (5,260, 8,975) 34,660 (18,630, 50,470)
disease_deathrate 291 (212, 407) 134 (61, 250) 44 (27, 75) 36 (25, 48)
hospital_beds 9 (8, 20) 11 (9, 18) 22 (13, 40) 31 (25, 51)
child_mortality 52 (41, 70) 26 (15, 38) 11 (7, 18) 4 (3, 6)
adult_mortality 290 (266, 325) 231 (187, 282) 183 (141, 227) 83 (69, 139)
alcohol_percap 2.0 (0.4, 3.8) 3.3 (1.4, 5.0) 5.3 (2.7, 7.4) 9.4 (6.3, 10.9)
urban_pop 37 (24, 43) 42 (31, 57) 64 (54, 77) 82 (68, 90)
life_expectancy 62 (60, 64) 68 (63, 71) 74 (72, 77) 81 (78, 83)
show code
kable(describe(correlated_inputs),
      caption = "*Selected dataset summary statistics for each variable*",
      digits = 4)
Selected dataset summary statistics for each variable
vars n mean sd median trimmed mad min max range skew kurtosis se
health_expenditure 1 193 1573.7975 1897.9076 791.7386 1193.4901 957.4787 39.1323 10661.0284 10621.8961 1.8082 3.2498 136.6144
gdp_percap 2 193 15889.1266 25892.1206 6070.3881 10373.6659 6901.5154 216.9730 199382.8386 199165.8656 3.6552 18.3176 1863.7556
gni_percap 3 193 14490.6563 19284.8572 5940.0000 10385.2903 6775.4820 230.0000 83160.0000 82930.0000 1.7921 2.3469 1388.1544
disease_deathrate 4 193 116.6439 135.6945 57.2600 89.4985 50.7642 4.1600 820.5100 816.3500 2.0516 4.8176 9.7675
hospital_beds 5 193 27.5813 20.8002 21.5000 24.7495 17.4947 1.7000 128.8000 127.1000 1.3482 2.3763 1.4972
child_mortality 6 193 20.5006 21.2119 12.5398 16.5013 12.7742 0.8317 98.4470 97.6153 1.7055 2.6173 1.5269
adult_mortality 7 193 191.9923 96.5979 185.8290 184.8435 114.1291 46.4220 516.4180 469.9960 0.5762 -0.0935 6.9533
alcohol_percap 8 193 5.4208 3.8830 4.9250 5.2308 4.6465 0.0051 16.9896 16.9846 0.3819 -0.8859 0.2795
urban_pop 9 193 59.2395 22.9545 59.0390 59.6902 27.2472 13.2500 100.0000 86.7500 -0.1482 -0.9836 1.6523
life_expectancy 10 193 72.4264 7.6160 73.4000 72.7697 8.5991 52.9000 86.5000 33.6000 -0.3792 -0.6855 0.5482
show code
sources = as.data.frame(t(correlated_inputs))[2] %>% rownames_to_column()
sources$V2 <- c("WorldBank","WorldBank","WorldBank","Our World In Data", "World Health Organization (WHO)","UNICEF", "WorldBank", "WorldBank", "WorldBank", "WorldBank")
colnames(sources) <- c("variable", "source")
kable(sources, caption = "*Each selected variables sources*")
Each selected variables sources
variable source
health_expenditure WorldBank
gdp_percap WorldBank
gni_percap WorldBank
disease_deathrate Our World In Data
hospital_beds World Health Organization (WHO)
child_mortality UNICEF
adult_mortality WorldBank
alcohol_percap WorldBank
urban_pop WorldBank
life_expectancy WorldBank

How would this variable be useful?

All these 9 selected variables have moderate to strong correlation (threshold of 0.3 correlation). Hence, these variables have a noticable relationship with life expectancy. Each of these features provide valuable information for the predictive model to run. Analyzing these variables collectively can help develop predictive models for estimating life expectancy accurately. However, to fit the model, further data preprocessing will be used. By bringing features to a similair scale, the model is able to eliminate bias from outliers and produce more accurate prediction.

5. Data Normalization of selected inputs

show code
# Create a recipe for data preprocessing with imputation
rec1 <- recipe(~ ., data = final_data) %>%
    step_YeoJohnson(colnames(t(correlated_variables))) # Normalize numeric predictors

# Prepare the recipe
rec_prep1 <- prep(rec1, training = final_data)

# Apply the recipe to the data (bake the recipe)
normalized_final_data <- bake(rec_prep1, new_data =final_data )

df_melt1 <- melt(select(normalized_final_data, where(is.numeric))%>%
  select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars = "life_expectancy")

distributions1 <- ggplot(df_melt1, aes(x = value)) +
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.3) +  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions Against Life Expectancy")+
  theme_bw()

distributions1

show code
scatterplots1 <- ggplot(df_melt1, aes(x = value, y = life_expectancy)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Scatterplot Matrix Against Life Expectancy")+
  geom_smooth(size = 0.3, color = "red",linetype = 2, se = FALSE)+
  theme_bw()

print(scatterplots1)

6. Data partitioning

show code
# our selected data
selected_inputs = select(normalized_final_data, 
                           c("life_expectancy",
                             colnames(as.data.frame(t(correlated_variables)))))
set.seed(000)
# split data to training and testing
df_split = initial_split(selected_inputs, prop = 0.7, strata = life_expectancy)

# input train and test data 
df_train = training(df_split)
df_test = testing(df_split)

Multiple Linear Regression - Ordinary Least Squared

\(y=β0​+β1​x1​+β2​x2​+...​+ϵ\)

1. Training Multiple Linear Regression

1.1. Train and evaluate initial regression model

show code
# form the formula
first_fmla = as.formula("life_expectancy ~ health_expenditure +
                        gdp_percap+ gni_percap + disease_deathrate + 
                        child_mortality + adult_mortality + urban_pop +
                        alcohol_percap + hospital_beds")

# formula being used
linear_model0 = lm(first_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results0 = cbind(df_train,
        as.data.frame(predict(linear_model0,interval = "prediction")))

# model summary
linear0_summary = summary(linear_model0)
modelsummary_output <- modelsummary(
  list(linear_model0), 
  output = "gt",
  stars = TRUE,
  statistic = c("t = {statistic}", "se = {std.error}", 'p.value'),
  shape = 
)
# Convert the gt  to kableExtra
kable(modelsummary_output %>%
  as.data.frame(),
  caption = "*Initial training model summary*",
  full_width = FALSE, position = "center")
Initial training model summary
(1)
(Intercept) 94.831***
t = 27.843
se = 3.406
(<0.001)
health_expenditure 0.427
t = 1.404
se = 0.304
(0.163)
gdp_percap 3.718*
t = 2.560
se = 1.452
(0.012)
gni_percap -2.779*
t = -2.098
se = 1.325
(0.038)
disease_deathrate -1.373*
t = -2.367
se = 0.580
(0.020)
child_mortality -3.209***
t = -6.382
se = 0.503
(<0.001)
adult_mortality -1.131***
t = -13.562
se = 0.083
(<0.001)
urban_pop -0.008
t = -0.834
se = 0.010
(0.406)
alcohol_percap 0.216
t = 1.525
se = 0.142
(0.130)
hospital_beds -0.643**
t = -2.987
se = 0.215
(0.003)
Num.Obs. 133
R2 0.955
R2 Adj. 0.951
AIC 536.1
BIC 567.9
Log.Lik. -257.047
F 286.747
RMSE 1.67

1.2. Checking for Multicollinearity: Variance of Influence (VIF)

show code
# form a VIF plot 
VIF_linear_0 = as.data.frame(vif(linear_model0)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model0)))+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

VIF_linear_0

The VIF plot suggest gni_percap and gdp_percap variable is correlated. Therefore we delete the variable wit the highest VIF values, gni_percap. Additionally, we delete child_mortality as well, this is likely to be correlated with adult_mortality. Removing multicollinearity may help improve model interpretability and enhance statistical significance.

show code
# form the formula
second_fmla = as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds + health_expenditure")

# formula being used
linear_model1 = lm(second_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results1 = cbind(df_train,
        as.data.frame(predict(linear_model1,interval = "prediction")))

# recieve residuals
df_results1$residuals = linear_model1$residuals
# model summary
linear1_summary = summary(linear_model1) 
VIF_linear_1 = as.data.frame(vif(linear_model1)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model1))
)+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

(VIF_linear_0 + ggtitle(label = "Before removing gni_percap and child_mortality") ) / 
(VIF_linear_1 + ggtitle(label = "After removing gni_percap and child_mortality ") ) 

Health_expenditure is still correlated with gdp_percap, therefore we ultimately remove this.

show code
third_fmla = as.formula("life_expectancy ~ gdp_percap + disease_deathrate + 
                        adult_mortality + urban_pop + alcohol_percap + 
                        hospital_beds")

# formula being used
linear_model2 = lm(third_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results2 = cbind(df_train,
        as.data.frame(predict(linear_model2,interval = "prediction")))

# recieve residuals
df_results2$residuals = linear_model2$residuals
# model summary
linear2_summary = summary(linear_model2) 

VIF_linear_2 = as.data.frame(vif(linear_model2)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model2)))+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

(VIF_linear_1 + ggtitle(label = "Before removing health_expenditure") ) / 
(VIF_linear_2 + ggtitle(label = "After removing health_expenditure") ) 

1.3. Training model evaluation after adjusting for multicollinearity

show code
# calculate RMSE
rmse_lm_train <- sqrt(mean(df_results2$residuals^2))

# calculate MAE
mae_lm_train <- mean(abs(df_results2$residuals))

# calculate MSE
mse_lm_train <- mean(df_results2$residuals^2)

# calculate AIC
aic_value_lm_train <- AIC(linear_model2)

# calculate BIC
bic_value_lm_train <- BIC(linear_model2)

modelsummary_output1 <- modelsummary(
  list(linear_model2), 
  output = "gt",
  stars = TRUE,
  statistic = c("t = {statistic}", "se = {std.error}", 'p.value')
)
kable(as.data.frame(modelsummary_output1),
      caption = "*Training model summary*")
Training model summary
(1)
(Intercept) 87.483***
t = 24.310
se = 3.599
(<0.001)
gdp_percap 1.820***
t = 5.549
se = 0.328
(<0.001)
disease_deathrate -2.742***
t = -4.380
se = 0.626
(<0.001)
adult_mortality -1.351***
t = -15.003
se = 0.090
(<0.001)
urban_pop -0.015
t = -1.385
se = 0.011
(0.169)
alcohol_percap 0.524***
t = 3.404
se = 0.154
(<0.001)
hospital_beds -0.270
t = -1.115
se = 0.242
(0.267)
Num.Obs. 133
R2 0.937
R2 Adj. 0.934
AIC 574.4
BIC 597.5
Log.Lik. -279.186
F 309.896
RMSE 1.97
show code
# collect train metrics
lm_train_metrics = data.frame(
  metric = c("rmse", "rsq", "mae", "mse", "aic","bic"),
  lm_train = c(rmse_lm_train, linear2_summary$r.squared, mae_lm_train, mse_lm_train, aic_value_lm_train, bic_value_lm_train ))

kable(lm_train_metrics,
      caption = "*Training model performance metrics*")
Training model performance metrics
metric lm_train
rmse 1.974277
rsq 0.936536
mae 1.469630
mse 3.897769
aic 574.371421
bic 597.494214

We’ve eventually arrive to this initial final model after adjusting for multicollinearity.

1.4. Prediction vs actual plot

show code
# plot the prediction
r_squared_linear1 = ggplot(df_results2, aes(x = fit, life_expectancy)) +
  geom_point(alpha = 0.4) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred()+
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Training Model using Linear Regression") + 
  theme_bw() +
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(linear2_summary$r.squared,2)))

r_squared_linear1

1.5. Residuals diagnostics

show code
# form a residual histogram plot
residual_linear1 = ggplot(df_results2, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(df_results2$residuals), 
             linetype = 2, color = "blue") +
  geom_vline(xintercept = median(df_results2$residuals), 
             linetype = 2, color = "darkgreen") +
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(df_results2$residuals),2)), 
           color = "blue" ) +
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(df_results2$residuals),2)), 
           color = "darkgreen" ) +
  ylim(0, 0.21) +
  theme_bw() +
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit1 = ggplot(df_results2, aes(x = fit, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw() +
  ggtitle("Residual plot ")

# qq plot
qq_plot1 = ggplot(df_results2, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot ", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

 residual_linear1 + (  qq_plot1/ residual_fit1 ) 

  • Random Distribution: The residuals appear to be randomly scattered, indicating that the model fits the data well and that there are no obvious patterns or trends.

  • Homoscedasticity: The spread of residuals seems consistent across the range of fitted values, suggesting that the variance of the residuals is constant (homoscedasticity). No heteroskedasticity.

  • No Systematic Patterns: There are no clear curves or systematic patterns, implying that the model captures the relationship between predictors and the response variable adequately.

  • Outliers and normality: Good amount of residuals fall relatively far from zero but has a roughly symmetric distribution with a longer tails to the left, indicating potential outliers that may need further investigation.

  • Center and Spread: Mean of residuals is approximately 0 and the median is 0.26, Since the residuals are centered around 0, it suggest no significant bias in the model.

2. K-fold Cross-validation

2.1. Cross-validation performance evaluation

show code
# Specify a model
lm_model0 <- linear_reg() %>%
  set_engine("lm")

# Create a recipe for data preprocessing
rec <- recipe(third_fmla, 
              data = df_train) 

# Create a workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model0)

# Create 10-fold cross-validation resamples
set.seed(123) # For reproducibility
folds <- vfold_cv(df_train, v = 10)

# Perform cross-validation
results <- wf %>%
  fit_resamples(
    resamples = folds,
    metrics = metric_set(rmse, rsq),
    control = control_resamples(save_pred = TRUE)
  )

# Collect and summarize metrics
metrics_cv = results %>%
  collect_metrics()

# Visualize results 
lm_cv_predictions = results %>%
  collect_predictions()
# Function to fit model and extract AIC/BIC
extract_aic_bic <- function(split) {
  train_data <- analysis(split)
  mod_fit <- lm(life_expectancy ~ ., data = train_data)
  aic <- AIC(mod_fit)
  bic <- BIC(mod_fit)
  
  tibble(
    id = split$id,
    AIC = aic,
    BIC = bic
  )
}

aic_bic_results <- folds %>%
  mutate(results = map(splits, extract_aic_bic)) %>%
  select(id, results) %>%
  unnest(cols = results, names_repair = "unique")

# Summarize the results
aic_bic_summary <- aic_bic_results %>%
  summarise(
    mean_AIC = mean(AIC),
    mean_BIC = mean(BIC)
  )

# collect cv metrics
lm_cv_metrics = metrics_cv %>% 
  transmute(metric = .metric, lm_crossvalidate = mean,) 

# calculate residuals 
lm_cv_residuals = lm_cv_predictions$.pred - lm_cv_predictions$life_expectancy

# Calculate MAE
mae_cv <- mean(abs(lm_cv_residuals))

# Calculate MSE
mse_cv <- mean((lm_cv_residuals)^2)

lm_cv_metrics = lm_cv_metrics %>% 
  full_join(data.frame(metric = c("mae", "mse", "aic", "bic"),
           lm_crossvalidate = c(mae_cv, mse_cv,
                                aic_bic_summary$mean_AIC, 
                                aic_bic_summary$mean_BIC)), 
                                        by = c("lm_crossvalidate", "metric"))
# combine all metrics in linear model
train_metrics_lm = full_join(lm_train_metrics, lm_cv_metrics,
  by = "metric")

kable(train_metrics_lm,
      caption = "*Train and Cross-validation performance metrics*")
Train and Cross-validation performance metrics
metric lm_train lm_crossvalidate
rmse 1.974277 2.0193737
rsq 0.936536 0.9450802
mae 1.469630 1.5712364
mse 3.897769 4.5356599
aic 574.371421 483.0233169
bic 597.494214 513.6581109

2.2. Prediction vs Actual Plot

show code
lm_cv_predictions%>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "K-Fold Cross-validation resamples for Linear Regression") +   coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(metrics_cv$mean[2],2)))

We’ve reached our optimum OLS linear model, the R-squared seems to reach 0.95 even with cross-validation testing!

2.3. Residual Diagnostic

show code
df_cv_resid = data.frame("residuals" = lm_cv_residuals,
                         fit = df_train$life_expectancy)
# form a residual histogram plot
residual_linear_cv = ggplot(df_cv_resid, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(df_cv_resid$residuals), 
             linetype = 2, color = "blue") +
  geom_vline(xintercept = median(df_cv_resid$residuals), 
             linetype = 2, color = "darkgreen") +
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(df_cv_resid$residuals),2)), 
           color = "blue" ) +
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(df_cv_resid$residuals),2)), 
           color = "darkgreen" ) +
  ylim(0, 0.21) +
  theme_bw() +
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_cv = ggplot(df_cv_resid, aes(x = fit, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw() +
  ggtitle("Residual plot ")

# qq plot
qq_plot_cv = ggplot(df_cv_resid, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot ", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

 residual_linear_cv + (  qq_plot_cv/ residual_fit_cv ) 

3. Prediction and Evaluation

3.1. Testing performance evaluation (

show code
# Predict on the test dataset
model_fit0 <- wf %>%
  fit(data = df_train)

test_predictions0 <- model_fit0 %>%
  predict(new_data = df_test) %>%
  bind_cols(df_test)

# get residuals 
test_predictions0 <- test_predictions0 %>%
  mutate(residuals = life_expectancy - .pred)

metrics0 <- test_predictions0 %>%
  metrics(truth = life_expectancy, estimate = .pred)

# collect testing metrics
# get mse
mse_testing <- data.frame(metric = "mse",
                          lm_testing = mean((test_predictions0$residuals)^2))
# make test metrics 
lm_test_metrics = metrics0 %>% 
  transmute(metric = .metric, 
            lm_testing = .estimate) %>%
  full_join(mse_testing)

# combine all metrics in linear model
joined_metrics_lm = full_join(lm_train_metrics, lm_cv_metrics,
  by = "metric") %>% inner_join(lm_test_metrics)

# Summary statistics of the test dataset
kable(model_fit0 %>%
  extract_fit_parsnip() %>%
  tidy(),
  caption = "*Testing model summary*")
Testing model summary
term estimate std.error statistic p.value
(Intercept) 87.4834234 3.5986188 24.310278 0.0000000
gdp_percap 1.8201193 0.3279959 5.549214 0.0000002
disease_deathrate -2.7424293 0.6261265 -4.379993 0.0000247
adult_mortality -1.3511587 0.0900610 -15.002699 0.0000000
urban_pop -0.0151245 0.0109225 -1.384708 0.1685889
alcohol_percap 0.5243082 0.1540085 3.404412 0.0008898
hospital_beds -0.2696177 0.2418432 -1.114845 0.2670386

The testing model summary highlights the relationships between key predictors and life expectancy. GDP per capita positively influences life expectancy (estimate = 1.795, p < 0.001), indicating economic stability enhances health outcomes. Disease death rate (estimate = -2.755, p < 0.001) and adult mortality (estimate = -1.344, p < 0.001) negatively affect life expectancy, underscoring the adverse impact of health burdens. Urban population and hospital beds were not significant predictors (p > 0.1), while alcohol consumption per capita had a positive but moderate effect (estimate = 0.526, p < 0.001). These findings align with the hypothesis that healthcare accessibility and economic factors significantly influence life expectancy.

3.2. Overall OLS regression’s performance metrics

show code
kable(joined_metrics_lm,
      caption = "*Multiple Linear Regression's perfomance metrics*")
Multiple Linear Regression’s perfomance metrics
metric lm_train lm_crossvalidate lm_testing
rmse 1.974277 2.0193737 2.1707938
rsq 0.936536 0.9450802 0.9060414
mae 1.469630 1.5712364 1.6833470
mse 3.897769 4.5356599 4.7123459

3.3. Prediction vs actual plot

show code
test_predictions0 %>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Linear Regression Testing set") +   coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 78, y = 50,
           label = paste0("R-squared:", 
                          round(metrics0$.estimate[2],2)))

It performed really well!

3.4. Residual Diagnostic of testing data

show code
# form a residual histogram plot
histogram_linear_test = ggplot(test_predictions0, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(test_predictions0$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(test_predictions0$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(test_predictions0$residuals),2)), 
           color = "blue" )+
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(test_predictions0$residuals),2)), 
           color = "darkgreen" )+
  ylim(0, 0.21)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_test_linear = ggplot(test_predictions0, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq_linear_test = ggplot(test_predictions0, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_linear_test + (qq_linear_test/residual_fit_test_linear)

Regularized Regression : L2 and Elastic Net

Since there is severe multicollinearity in the original selected input, we had to delete 3 variables from the selected input even though we still believe those inputs are still useful. Hence, we could’ve capture important information that can better predict life_expectancy. Therefore, we decided to use Ridge Regression because we wanted to keep all selected variables. We will assign them different weightage and penalty.

We wanted to find the optimal values of lambda. We do this by using 10 fold cross validation

1. Further preprocessing: Standardization

show code
# standardize the train variable
blueprint = recipe(life_expectancy ~ ., data = df_train)%>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

prepare <- prep(blueprint, training = df_train)

ridge_train <- bake(prepare, new_data = df_train)

# standardize the test variable
blueprint_test = recipe(life_expectancy ~ ., data = df_test)%>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

prepare_test  <- prep(blueprint, training = df_test)

ridge_test <- bake(prepare, new_data = df_test)

X_test <- model.matrix(life_expectancy ~ ., ridge_test)[, -1]

# transform y with log transformation
Y_test <- ridge_test$life_expectancy

X <- model.matrix(life_expectancy ~ ., ridge_train)[, -1]

# transform y with log transformation
Y <- ridge_train$life_expectancy
X %>% summary()
 health_expenditure    gdp_percap         gni_percap        disease_deathrate 
 Min.   :-2.040838   Min.   :-1.91869   Min.   :-1.835669   Min.   :-3.33174  
 1st Qu.:-0.845484   1st Qu.:-0.71710   1st Qu.:-0.753251   1st Qu.:-0.71886  
 Median :-0.001998   Median : 0.01193   Median : 0.006561   Median :-0.07494  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
 3rd Qu.: 0.750230   3rd Qu.: 0.79157   3rd Qu.: 0.768977   3rd Qu.: 0.72109  
 Max.   : 2.120375   Max.   : 2.33050   Max.   : 1.855852   Max.   : 2.09019  
 hospital_beds      child_mortality     adult_mortality    alcohol_percap    
 Min.   :-2.59460   Min.   :-2.302658   Min.   :-1.91589   Min.   :-1.82613  
 1st Qu.:-0.82147   1st Qu.:-0.796102   1st Qu.:-0.83967   1st Qu.:-0.84449  
 Median : 0.05711   Median :-0.004018   Median : 0.09513   Median : 0.04171  
 Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.: 0.77446   3rd Qu.: 0.837161   3rd Qu.: 0.70380   3rd Qu.: 0.84159  
 Max.   : 2.40967   Max.   : 1.900753   Max.   : 2.35828   Max.   : 1.82986  
   urban_pop       
 Min.   :-1.96012  
 1st Qu.:-0.77573  
 Median : 0.04723  
 Mean   : 0.00000  
 3rd Qu.: 0.81685  
 Max.   : 1.74379  

2. Ridge Regression Training and Evaluation

2.1 Regularization path plot

show code
ridge <- glmnet(
  x = X,
  y = Y,
  alpha = 0) # a = ridge 

plot(ridge, main = "Ridge penalty\n")

This plot shows how ridge regression coefficients shrink as the regularization parameter λ increases (moving right). Higher λ values lead to smaller coefficients, indicating stronger regularization, which reduces overfitting by limiting the model’s sensitivity to individual features. Unlike LASSO, coefficients remain non-zero, meaning all features contribute to the model. This helps balance model complexity and prediction accuracy, making it robust, especially when handling multicollinearity or many features relative to observations.

2.2. Cross-validation to search optimal lambda

show code
# cross validation ridge regression
ridge_cv <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 0,
  type.measure ="mse"
)

plot(ridge, main = "Ridge penalty\n")
abline(v = ridge_cv$lambda.min, col = "red", lty = "dashed")
abline(v = ridge_cv$lambda.1se, col = "blue", lty = "dashed")

2.4. Cross-validation plot

show code
# we will be using MSE to compare models 
plot(ridge_cv, main = "Ridge penalty\n\n")

2.5. Cross-validation results

show code
ridge_cv

Call:  cv.glmnet(x = X, y = Y, type.measure = "mse", alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.7358   100   3.658 0.3665       9
1se 1.2859    94   3.967 0.4291       9

The 10-fold cross-validation Mean Squared Error (MSE) for a ridge model is depicted in a plot. The first vertical dotted line indicates the value of λ that corresponds to the smallest MSE, while the second vertical line represents the λ value that is within one standard error of the minimum MSE.

Minimum Lambda Found: 0.7358318

The crossvalidation gives us this optimized lambda values that allow us to yield the lowest Mean Squared Error, now we will use this value to predict using the training dataset.

2.3. Uses the minimum lambda to predict training set

show code
# using this crossvalidation
ridge_cv_result = ridge_cv %>% 
  predict(newx = X) %>% rowMeans()%>% 
  cbind(Y)%>% 
  as.data.frame() 

ridge_cv_residuals = (Y - ridge_cv %>% 
  predict(newx = X))

SST <- sum((Y - mean(Y))^2)  # Total Sum of Squares
SSR <- sum(ridge_cv_residuals^2) # Residual Sum of Squares

R_squared_ridgecv_train <- 1 - (SSR / SST)

ridge_cv_result %>% 
  ggplot(aes(x = . , y = Y)) + 
  geom_point(alpha = 0.5)+   
  geom_abline(linetype = 2, color = "blue") +
  coord_obs_pred() +
  theme_bw() +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(R_squared_ridgecv_train,2)))+
  xlab ("Predicted life expectancy") +
  ylab ("Actual life expectancy")

3. Elastic net to find the optimal tuning

3.1. Cross-validation plot

show code
# grid search across 
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  #preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

# plot cross-validated RMSE
ggplot(cv_glmnet)+ theme_bw()

This plot visualizes which parameter yields the lowest RMSE in each of the 10 folds. Notice how the lambda has changes a bit, but allow us to use alpha of 0.2.

show code
kable(cv_glmnet$bestTune, caption = "*Grid seach result*")
Grid seach result
alpha lambda
3 0.1 0.0181434

Alpha of 0.1 with lambda of 0.0078 gives the lowest rmse.

3.2. Variable Importance plot

show code
vip(cv_glmnet, bar = FALSE) + theme_bw()

3.2 Final evaluation of ridge and elastic net models

show code
# predict life_expectancy on training data
ridge_grid_result <- data.frame(pred = predict(cv_glmnet, X),
                                fit = df_train$life_expectancy)

ridge_grid_residuals = ridge_grid_result$pred - ridge_grid_result$fit

# compute RMSE of transformed predicted
rmse_grid = RMSE(ridge_grid_result$pred, ridge_grid_result$fit)
mae_grid = MAE(ridge_grid_result$pred, ridge_grid_result$fit)
mse_grid = mean((ridge_grid_residuals)^2)


SST <- sum((Y - mean(Y))^2)  # Total Sum of Squares
SSR_grid <- sum(ridge_grid_residuals^2) # Residual Sum of Squares

R_squared_gridsq_train <- 1 - (SSR_grid / SST)

ridge_cv_grid_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  elastic_train = c(rmse_grid, 
                       R_squared_gridsq_train, 
                       mae_grid ,mse_grid))

# calculate MAE
mae_cv_ridge <- mean(abs(ridge_cv_residuals))

# calculate MSE
mse_cv_ridge <- mean((ridge_cv_residuals)^2)

# calculate MSE
rmse_cv_ridge <- sqrt(mse_cv_ridge)


# calculate MAE
mae_cv_ridge <- mean(abs(ridge_cv_residuals))

# calculate MSE
mse_cv_ridge <- mean((ridge_cv_residuals)^2)

# calculate MSE
rmse_cv_ridge <- sqrt(mse_cv_ridge)
ridge_cv_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  ridge_train = c(rmse_cv_ridge, R_squared_ridgecv_train, mae_cv_ridge ,mse_cv_ridge)
)

train_regular_metrics = ridge_cv_metrics %>% 
  full_join(ridge_cv_grid_metrics, by = "metric")
kable(train_regular_metrics,
      caption = "*Ridge and Elastic net regressions training performance*")
Ridge and Elastic net regressions training performance
metric ridge_train elastic_train
rmse 1.8708484 1.6795753
rsq 0.9430113 0.9540685
mae 1.4392721 1.2902426
mse 3.5000736 2.8209732

4. Prediction and evaluation

4.1. Ridge regression and Elastic results

show code
ridge_predict = 
  predict(ridge_cv, s=ridge_cv$lambda.1se, newx = X_test, lambda = ridge_cv$lambda.min )

ridge_result =cbind(ridge_predict, Y_test) %>% 
  as.data.frame()

SST_r_test <- sum((Y_test - mean(Y_test))^2)  # Total Sum of Squares
SSR_r_test<- sum((Y_test - ridge_predict)^2) # Residual Sum of Squares

R_squared_r_test <- 1 - ( SSR_r_test/SST_r_test)

ridge_result0 = ridge_result %>%
  transmute(life_expectancy = Y_test,
         .pred = s1,
         residuals = Y_test - s1)

ridge_plot = ridge_result0 %>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Ridge Regression") +   
  coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy")+
  annotate("label", x = 76, y = 50,
           label = paste0("R-squared:", 
                          round(R_squared_r_test,3))) 

elastic_predict <- data.frame(pred = predict(cv_glmnet, 
                                               newdata = X_test))
elastic_result =cbind(elastic_predict, Y_test) %>% 
  as.data.frame()

SST_net_test <- sum((Y_test - mean(Y_test))^2)  # Total Sum of Squares
SSR_net_test<- sum((Y_test - elastic_predict)^2) # Residual Sum of Squares

R_squared_net_test <- 1 - (SSR_net_test / SST_net_test)

elastic_result0 = elastic_result %>%
  mutate(residuals = Y_test - pred)

elastic_plot = elastic_result0 %>%
  ggplot(aes(x = pred, y = Y_test)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Elastic net") +   
  coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy")+
  annotate("label", x = 76, y = 50,
           label = paste0("R-squared:", 
                          round(R_squared_net_test,3))) 
ridge_plot+elastic_plot

4.2 Performance Metrics of testing and training

show code
# Ridge------------
ridge_residuals = ridge_result[1] - ridge_result[2]

# compute RMSE of transformed predicted
rmse_ridge = RMSE(ridge_result$s1, ridge_result$Y_test)
mae_ridge= MAE(ridge_result$s1,ridge_result$Y_test)
mse_ridge = mean((ridge_residuals$s1)^2)

SSR_ridge <- sum(ridge_grid_residuals^2) # Residual Sum of Squares

# calculate MSE
ridge_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  ridge_test = c(rmse_ridge, 
                       R_squared_r_test, 
                       mae_ridge ,mse_ridge))

# Elastic------------
elastic_residuals = elastic_result$pred - elastic_result$Y_test

# compute RMSE of transformed predicted
rmse_elastic = RMSE(elastic_result$pred , elastic_result$Y_test)
mae_elastic = MAE(elastic_result$pred ,elastic_result$Y_test)
mse_elastic = mean((elastic_residuals)^2)

SSR_elastic <- sum(elastic_residuals^2) # Residual Sum of Squares

# calculate MSE
elastic_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  elastic_test = c(rmse_elastic, 
                       R_squared_net_test, 
                       mae_elastic ,mse_elastic))
join_test_r = ridge_metrics %>% full_join(elastic_metrics, by = 'metric') 
joined_metrics_r = train_regular_metrics %>% 
  full_join(join_test_r, by = 'metric')
kable(joined_metrics_r,
      caption = "*Ridge and Elastic net regression performance metrics*")
Ridge and Elastic net regression performance metrics
metric ridge_train elastic_train ridge_test elastic_test
rmse 1.8708484 1.6795753 1.9318764 1.8293091
rsq 0.9430113 0.9540685 0.9244032 0.9322173
mae 1.4392721 1.2902426 1.5130543 1.4343533
mse 3.5000736 2.8209732 3.7321464 3.3463717

4.3 Residual Diagnostic best model

show code
elastic_residuals_df = data.frame(residuals = elastic_residuals,
                                  life_expectancy = Y_test)
# form a residual histogram plot
histogram_test = ggplot(elastic_residuals_df, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(elastic_residuals_df$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(elastic_residuals_df$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(elastic_residuals_df$residuals),2)), 
           color = "blue" )+
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(elastic_residuals_df$residuals),2)), 
           color = "darkgreen" )+
  ylim(0, 0.21)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit = ggplot(elastic_residuals_df, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq = ggplot(elastic_residuals_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_test + (qq/residual_fit)

Random Forest Regression - Ensemble Learning

Ensemble-based method is used to seek the bias variance trade-off. We plan to improve predictive accuracy but unfortunately, reduces interpretability as random forest is a black-box model that is difficult to interpret as compare to linear or regularized regression.

1. Model Building

Random Forest models are inherently robust to the scale of input features, as they base splits on relative values, making normalization less crucial (Hu & Szymczak, 2022). However, normalization can still enhance performance by ensuring balanced splits and equal feature contribution, leading to slightly better accuracy. This improvement is minor compared to methods like ridge regression or linear OLS, where normalization is essential for stability and convergence of coefficients. Therefore, to calculate the best predictive model, we’ll still use the same preprocessed data.

show code
# Building the random forest model
set.seed(123)
n_features = length(setdiff(names(df_test), "life_expectancy"))

# for each of the trees 

rf_model_normalized <- randomForest(
  life_expectancy ~ .,
  data = df_train,
  importance = TRUE,
  mtry = floor(n_features/3), # regression
  xtest = subset(df_test,
                 select = -life_expectancy),
                ytest = df_test$life_expectancy,
                         keep.forest = TRUE)

# Print the model summary
print(rf_model_normalized)

Call:
 randomForest(formula = life_expectancy ~ ., data = df_train,      importance = TRUE, mtry = floor(n_features/3), xtest = subset(df_test,          select = -life_expectancy), ytest = df_test$life_expectancy,      keep.forest = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 2.528765
                    % Var explained: 95.88
                       Test set MSE: 2.52
                    % Var explained: 94.95

1.2. Variable importance

show code
# Plot variable importance
vip_rf1 <- vip(rf_model_normalized) + 
  ggtitle("Variable Importance")+theme_bw()

vip_rf1

show code
# create an explainer for the model
explainer_rf = explain(model = rf_model_normalized, data = df_train,
                        y = df_train$life_expectancy, 
                        label = "Random Forest")
Preparation of a new explainer is initiated
  -> model label       :  Random Forest 
  -> data              :  133  rows  10  cols 
  -> data              :  tibble converted into a data.frame 
  -> target variable   :  133  values 
  -> predict function  :  yhat.randomForest  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
  -> predicted values  :  numerical, min =  55.02809 , mean =  72.31606 , max =  84.59589  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -2.44344 , mean =  -0.009295188 , max =  1.904113  
  A new explainer has been created!  

1.3. Partial Dependence plot

show code
# plot Partial Dependence for three selected feature
pdp_rf_adult <- model_profile(explainer = explainer_rf, 
                        variables = "adult_mortality")

pdp_rf_disease <- model_profile(explainer = explainer_rf, 
                        variables = "disease_deathrate")

pdp_rf_health <- model_profile(explainer = explainer_rf, 
                        variables = "health_expenditure")

pdp_rf_gdp <- model_profile(explainer = explainer_rf, 
                        variables = "gdp_percap")
# plot the PDP put them in a 2x2 figure
(plot(pdp_rf_adult) + ggtitle(NULL, NULL) + 
  plot(pdp_rf_disease) + ggtitle(NULL, NULL))/
  (plot(pdp_rf_health) + ggtitle(NULL, NULL)+
     plot(pdp_rf_gdp)+ ggtitle(NULL, NULL))

Adult mortality has the greatest influence on the model’s predictions compared to other variables. As the adult mortality decreases, the predicted probability decrease exponentially, till the lowest life expectancy. The diseases death rate suggest simlair results, which make sense where more people are dying, the lower the estimate life expectancy is. In contrast, Health expenditure and GDP percap positively influence the predicted probability. Suggesting as countries invest more in health expenditure the lesser people will die, and the richer the country the higher the life expectancy.

1.4. Single tree instance example

show code
# Extract a single tree from the random forest model
single_tree <- getTree(rf_model_normalized, k = 1, labelVar = TRUE)

# Convert it to a decision tree model
tree_model <- rpart(first_fmla, data = df_train)

# Plot the tree
# Plot the decision tree
rpart.plot(tree_model, type = 1, extra = 101, 
           fallen.leaves = TRUE, main = "Decision Tree for Life Expectancy")

2. Model Prediction and Evaluation

show code
# Predictions on training data
train_preds <- predict(rf_model_normalized, newdata = df_train)

# Predictions on testing data
test_preds <- predict(rf_model_normalized, newdata = df_test)

# Calculate metrics for training data
train_rmse <- RMSE(pred = train_preds, obs = df_train$life_expectancy)
train_mae <- MAE(pred = train_preds, obs = df_train$life_expectancy)
train_mse <- mean((train_preds - df_train$life_expectancy)^2)
train_rsq <- R2(pred = train_preds, obs = df_train$life_expectancy)

# Calculate metrics for testing data
test_rmse <- RMSE(pred = test_preds, obs = df_test$life_expectancy)
test_mae <- MAE(pred = test_preds, obs = df_test$life_expectancy)
test_mse <- mean((test_preds - df_test$life_expectancy)^2)
test_rsq <- R2(pred = test_preds, obs = df_test$life_expectancy)

# Combine metrics into a dataframe
rf_metrics <- data.frame(
  metric = c("rf_train", "rf_test"),
  rmse = c(train_rmse, test_rmse),
  rsq = c(train_rsq, test_rsq),
  mae = c(train_mae, test_mae),
  mse = c(train_mse, test_mse)
)

# Print the metrics
print(rf_metrics)
    metric      rmse       rsq       mae       mse
1 rf_train 0.6857352 0.9932940 0.5139248 0.4702328
2  rf_test 1.5869542 0.9502163 1.1706776 2.5184236
show code
rf_residuals_df = data.frame(residuals = Y_test - test_preds,
                                  life_expectancy = Y_test)

histogram_rf = ggplot(rf_residuals_df, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(rf_residuals_df$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(rf_residuals_df$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("label",x = 16, y = 0.03, 
           label = paste0("Mean: ", round(mean(rf_residuals_df$residuals),2)), 
           color = "blue" , size = 3.5)+
  annotate("label",x = 17, y = 0.025, 
           label = paste0("Median: ", round(median(rf_residuals_df$residuals),2)), 
           color = "darkgreen" , size = 3.5)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_test_rf = ggplot(rf_residuals_df, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq_rf_test = ggplot(rf_residuals_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_rf + (qq_rf_test/residual_fit_test_rf)

Model Comparison and Conclusion

show code
t1 = joined_metrics_r %>% 
  full_join(joined_metrics_lm, by = "metric") %>%
  t() %>% as.data.frame() %>%rownames_to_column() 
  
  # transposed metrics
colnames(t1) <- as.character(unlist(t1[1, ]))
 t1 <-  t1[-1, ]

t1$rmse <- as.numeric(t1$rmse)
t1$rsq <- as.numeric(t1$rsq)
t1$mae <- as.numeric(t1$mae)
t1$mse <- as.numeric(t1$mse)

model_comparison = t1 %>% full_join(rf_metrics, by = colnames(t1)) %>% arrange(rmse)
kable(model_comparison, caption = "All model's performance metric comparison")
All model’s performance metric comparison
metric rmse rsq mae mse
rf_train 0.6857352 0.9932940 0.5139248 0.4702328
rf_test 1.5869542 0.9502163 1.1706776 2.5184236
elastic_train 1.6795753 0.9540685 1.2902426 2.8209732
elastic_test 1.8293091 0.9322173 1.4343533 3.3463717
ridge_train 1.8708484 0.9430113 1.4392721 3.5000736
ridge_test 1.9318764 0.9244032 1.5130543 3.7321464
lm_train 1.9742770 0.9365360 1.4696300 3.8977690
lm_crossvalidate 2.0193737 0.9450802 1.5712364 4.5356599
lm_testing 2.1707938 0.9060414 1.6833470 4.7123459

Consideration throughout the modelling workflow

Throughout the modeling workflow, careful consideration was given to achieving a balance between interpretability and predictive accuracy. The process began with initial model estimation using multiple linear regression to establish a baseline and identify key predictors. Cross-validation was employed to ensure the model’s generalizability, addressing the bias-variance tradeoff by evaluating performance across multiple folds. Due to multicollinearity, we had to remove 3 highly correlated variables. Thus, regularization techniques, including ridge regression and elastic net, were incorporated to keep the 3 variables and handle multicollinearity. The Random Forest model was ultimately chosen for its superior predictive performance, despite its complexity. For every model, steps were taken to validate model assumptions, such as checking for homoscedasticity and normality of residuals, ensuring robustness and reliability in the final model.

Final models results conclusion

1. Multiple Linear Regression

The multiple linear regression model establishes the baseline relationship between predictors and life expectancy. Significant predictors include GDP per capita, disease death rate, and adult mortality, indicating that economic status, disease prevalence, and mortality rates substantially impact life expectancy. The model’s performance is moderate, with some limitations due to multicollinearity among predictors. Adjustments and removal of highly collinear variables were necessary to enhance model reliability and interpretability.

2. Elastic Net Regression

Elastic Net regression combines the penalties of both ridge and lasso regression to address multicollinearity and variable selection. This model showed improved performance over the linear model, with lower RMSE and higher R² values, indicating better predictive accuracy. Significant predictors remain consistent, reaffirming the importance of economic and health-related factors on life expectancy. The model’s regularization helps in retaining valuable predictors while penalizing less significant ones, ensuring a balanced and interpretable model.

3. Random Forest regression

The Random Forest model, an ensemble learning method, provided the best performance among the models. It handled multicollinearity effectively and captured complex relationships in the data. Key predictors like GDP per capita, disease death rate, and adult mortality were identified as the most influential. The model’s high accuracy, reflected in the lowest RMSE and highest R² values, underscores its robustness. However, its black-box nature limits interpretability compared to linear models. Thus, partial dependence plot was visualized. Overall, Random Forest proved highly effective in predicting life expectancy, reinforcing the significance of economic stability and health factors.

Was the research hypothesis achieved?

Yes, the research hypothesis was achieved. Overall the model was able predict the estimated life expectancy with very high accuracy, indicating that the variables we used can deeply explain life expectancy. According to our results, there is sufficient evidence to prove that countries with higher levels of income accessibility to healthcare services do infact exhibit higher life expectancy and lower mortality rates. As countries get wealthier, they increase health expenditure and afford more hospital beds, increasing their likelihood of higher live expectancy compared to countries that are from poorer income levels.

Salient Findings

Economic Wealth plays a role: Our Random Forest model corroborates existing research on the negative association between mortality rate, diseases and life expectancy. Our analysis suggest the importance of addressing the health disparity between high- and low-income countries.

Multifaceted Influences on Health: While the model confirms GDP’s significance, it identifies disease death rate and adult mortality as key factors. Obviously disease and mortality rates are key factors, other variables suggests there are external influences that capture the multifaceted nature of life expectancy. If we remove the mortality rate and diseases death rate, and alllow more datasets and variables to be imported, it may give a more detailed insights about life expectancy.

Proposed Solutions

The analysis shows money impacts health. To tackle this, governments can invest more in public healthcare for wider access, create programs for low-income areas, and focus on preventative care. Wealthy nations can also aid developing countries. By using data like this to identify problem areas, governments can create policies that target specific needs and work towards a future where everyone has equal chances for good health.

Limitations

  • Balancing Interpretability and Accuracy: This analysis aimed to strike a balance between a model that’s easy to understand (interpretable) and one that predicts well (accurate). While linear regression offered clear cause-and-effect relationships, its accuracy was limited by multicollinearity (correlated variables). Random Forest, on the other hand, achieved highest accuracy but its complex nature makes it harder to pinpoint why certain factors influence life expectancy.

  • Data Availability and Generalizability: The analysis relied on available data, which might not capture all relevant factors affecting life expectancy. Socioeconomic factors beyond GDP, like education or access to clean water,could be missing. Additionally, the chosen models may not generalize perfectly to all countries due to cultural or environmental variations not included in the data.

Reference list