Assessing the impact of socio economic status across different demographics to understand the disparities: Analyzing Global Life Expectancy

Author

Karim 32228643

Introduction

Research Question

The conceptual research question guiding this analysis is: “How do a country’s socio-economic factors, including healthcare access, income levels, and consumption patterns, influence life expectancy globally?” This question aims to explore the complex relationship between these socio-economic determinants and health outcomes, highlighting disparities across countries with different demographics.

Research Hypothesis

The practical research hypothesis posited in this report is: “Countries with higher levels of income accessibility to healthcare services exhibit higher life expectancy and lower mortality rates” we believe the data we uses support the research hypothesis. The full dataset overview will be given below. For instance, Health expenditure, GDP per capita, and GNI per capita all increase significantly from low-income to high-income countries, indicating that wealthier nations invest more in healthcare infrastructure and services (Boniol et al., 2022). Disease death rates and both child and adult mortality rates decrease with rising income levels, reflecting the effectiveness of healthcare services in wealthier countries (Arias et al., 2022). The number of hospital beds and urban population percentages also increase, highlighting improved healthcare capacity and access in richer nations. Despite higher alcohol consumption in high-income countries, the overall health benefits from better healthcare infrastructure outweigh potential negative impacts (Watsons, 2022).

Libraries used

To perform the analysis and generate the necessary visualizations and s for this technical report, the following R packages will be utilized:

show code

library(readr)
library(tidyverse)
library(readxl)
library(visdat)
library(gt)
library(gtsummary)
library(tidymodels)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(rsample)
library(tune)
library(car)
library(patchwork)
library(glmnet)
library(vip)
library(DALEX)
library(knitr)
library(kableExtra)
library(huxtable)
library(modelsummary)
library(gtsummary)
library(psych)
library(randomForest)
library(caret)
library(glmnet) 
library(corrplot)
library(rpart.plot)
library(rpart)

Dataset Overview

To perform this analysis, we utilizes the dataset from the previous infographic assignment. The dataset given, is a joined dataset from numerous sources which consists of socio-economic predictors and wealth status. The previous infographic dataset consists of aggregated countries, region and UN classify territories. Therefore, the code chunk below filter out those non-country values and keep all 193 UN member countries.

Initial Data Preprocessing

1. Data collection and integration

show code

# load previous infographic dataset 
IA3_df = read_excel("Datasets/Raw_IA3_df.xlsx")

# convert 0s into NA
full_df <- IA3_df %>%
  mutate(across(where(is.numeric), ~na_if(., 0)))
# countries column has loads of region and state names, even aggregated names.
# collect country code that consists of all recognized 195 countries, however this datasets do not consists of Palestine and Vatican city as they are non-member observer state.
codes = read_csv("Datasets/countries.csv") %>% select(en, alpha3) %>% 
  transmute(Code = toupper(alpha3),
            Country = en)

# get continents dataset for grouping descriptive summary results
continent = read_csv("Datasets/continents.csv") %>%
  rename(country = Country)
  


# join two datasets, and remove rows that is inconsistent with the earlier country code
cleaned_df = full_df %>% full_join(codes, by = "Code") %>% 
  filter(Country != "0") %>% select(-Country) %>% distinct(Code, .keep_all =TRUE) %>% inner_join(continent, by = "country")


# rewrite the name to be shorter and simpler to write
cleaned_df = cleaned_df %>% 
  rename(adult_mortality = `Adult Mortality rate per 1,000`,
         total_mortality = `Total Mortality rate per 1,000`)

# replace white space with underscores
names(cleaned_df) <- gsub(" ", "_",tolower(names(cleaned_df)))

Dependent Variable: Life_Expectancy

The dependent variable we use is `Life_Expectancy` variable which fortunately has no missing values. Data is obtained from World Bank Databank specifically taken from year 2019 before the heavy influx of Covid-19 that may affect our predictions.

2. Handling Missing Values

show code

# calculate the percentage amount of NA relative to the dataset population
percent_na = sum(is.na(cleaned_df)) / (nrow(cleaned_df) * ncol(cleaned_df)) 

# calculate the total NA
full_na = sum(is.na(cleaned_df))

# visualize the missing values
vis_miss(cleaned_df)

In this dataset, it has 416 missing “NA” values. Which means 11.97% of this dataset is missing. gini_index and doctors dataset has the highest missing values while life expectancy (dependent variable) has no missing values. Because more than half of the observation is missing, imputing the value would not be sui as the predicted values would not replicate the supposed actual value.

2.1 Removing variables with very high percentage of missing values

Missing Values of Upper middle income to High income countries

show code

# we visualizes the data of the upper-middle- income and above
cleaned_df %>%
  filter(gni_percap >=3995)%>% vis_miss()

Missing Values of Low middle income to Low income countries

show code

# we visualizes the data of the low-middle- income and below
cleaned_df %>%
  filter(gni_percap <=3995)%>% vis_miss()

It can be seen that the lower income countries has relatively higher amount of missing values. This is no coincidence, likely to be Missing Not at Random (MNAR), suggesting that regions with lower income are less likely to report this data, even 50% of the hospital_beds variable is missing. Although docters and gini_index might be useful for the model, this non-informative missingness constitutes a large portion of the overall data (more than 50%) for doctors and gini_index. Therefore, we delete this variable and perform imputation on the rest.

2.2. Handling missing values through imputation methods

Use Bagged Trees (Random Forest) for Higher Missingness: (more than or equal to 10%)

Random forest imputation is robust and can handle higher levels of missing data better than KNN (Hu & Szymczak, 2022). It considers the relationships between variables and leverages the ensemble method to make better predictions (Hu & Szymczak, 2022).

Variables: unemployment_(%), hospital_beds, adult_mortality

Use K-Nearest Neighbor for Lower Missingness: (less than 10%)

KNN is effective for lower levels of missing data and can be quicker to implement. It uses the closest neighbors to impute the missing values, which works well when missingness is lower (Keith, 2021)..

Variables: health_expenditure, gdp_percap, gni_percap, disease_deathrate, disease_death, adult_mortality, total_mortality, child_mortality

show code

# delete gini index and doctors
deleted_df = select(cleaned_df,-c(gini_index, doctors))%>% 
  # convert character columns to factors or vice versa as needed
  mutate_if(is.character, as.factor)

# variable name with low missingness
low_missingness = colnames(select(deleted_df,-c(`unemployment_(%)`, hospital_beds, adult_mortality)))


# Create a recipe for data preprocessing with imputation
rec0 <- recipe(~ ., data = deleted_df) %>%
  step_impute_bag(c(`unemployment_(%)`, hospital_beds, adult_mortality)) %>%
  step_impute_knn(c(low_missingness), neighbors = 3) 
  

# Prepare the recipe
rec_prep0 <- prep(rec0, training = deleted_df)

# Apply the recipe to the data (bake the recipe)
final_data <- bake(rec_prep0, new_data =deleted_df )

# Check the result
vis_miss(final_data)

Perfect!

2.3 Full dataset before input selection and normalization

show code

table_1_df <- final_data %>%
  mutate(`Income Level` = factor(case_when(
    gni_percap < 1035 ~ "Low-income",
    gni_percap >= 1036 & gni_percap <= 4045 ~ "Lower-middle-income",
    gni_percap >= 4046 & gni_percap <= 12535 ~ "Upper-middle-income",
    gni_percap >= 12535 ~ "High-income"),
    levels = c("Low-income", "Lower-middle-income","Upper-middle-income","High-income"))
    ) %>% select(!c(country, code))


summary_by_income <- table_1_df %>%
  tbl_summary(
    by = `Income Level`,
    missing = "no"
  ) %>%
  modify_header(label = "**Variable**")


kable(summary_by_income,
      caption = "*Full dataset summary statistics aggregated by income levels*")

*Full dataset summary statistics aggregated by income levels*
Variable	Low-income, N = 24	Lower-middle-income, N = 50	Upper-middle-income, N = 60	High-income, N = 59
health_expenditure	107 (72, 152)	213 (146, 412)	907 (657, 1,306)	3,324 (2,200, 5,314)
life_expectancy	62 (60, 64)	68 (63, 71)	74 (72, 77)	81 (78, 83)
unemployment_(%)	3.9 (3.0, 7.3)	5.1 (3.4, 9.3)	5.9 (4.4, 11.7)	5.0 (3.6, 6.4)
gdp_percap	698 (540, 810)	2,145 (1,454, 3,049)	6,697 (5,172, 9,158)	32,279 (19,138, 51,004)
gni_percap	695 (520, 823)	2,130 (1,525, 3,158)	6,635 (5,260, 8,975)	34,660 (18,630, 50,470)
disease_deathrate	291 (212, 407)	134 (61, 250)	44 (27, 75)	36 (25, 48)
diseases_death	46,205 (23,463, 90,374)	16,992 (2,982, 55,250)	2,049 (467, 8,696)	1,720 (222, 6,922)
hospital_beds	9 (8, 20)	11 (9, 18)	22 (13, 40)	31 (25, 51)
child_mortality	52 (41, 70)	26 (15, 38)	11 (7, 18)	4 (3, 6)
adult_mortality	290 (266, 325)	231 (187, 282)	183 (141, 227)	83 (69, 139)
total_mortality	7.93 (6.57, 9.10)	6.67 (5.66, 7.92)	7.17 (5.58, 8.51)	7.94 (6.32, 9.98)
alcohol_percap	2.0 (0.4, 3.8)	3.3 (1.4, 5.0)	5.3 (2.7, 7.4)	9.4 (6.3, 10.9)
urban_pop	37 (24, 43)	42 (31, 57)	64 (54, 77)	82 (68, 90)
continent	NA	NA	NA	NA
Africa	21 (88%)	23 (46%)	9 (15%)	1 (1.7%)
Asia	3 (13%)	17 (34%)	12 (20%)	11 (19%)
Europe	0 (0%)	1 (2.0%)	12 (20%)	34 (58%)
North America	0 (0%)	3 (6.0%)	13 (22%)	7 (12%)
Oceania	0 (0%)	5 (10%)	5 (8.3%)	4 (6.8%)
South America	0 (0%)	1 (2.0%)	9 (15%)	2 (3.4%)

show code

# Print the 
kable(describe(final_data),
      caption = "*Full dataset summary statistics for each variable*",
      digits = 4)

*Full dataset summary statistics for each variable*
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
country*	1	193	97.0000	55.8585	97.0000	97.0000	71.1648	1.0000	193.0000	192.0000	0.0000	-1.2187	4.0208
health_expenditure	2	193	1573.7975	1897.9076	791.7386	1193.4901	957.4787	39.1323	10661.0284	10621.8961	1.8082	3.2498	136.6144
code*	3	193	97.0000	55.8585	97.0000	97.0000	71.1648	1.0000	193.0000	192.0000	0.0000	-1.2187	4.0208
life_expectancy	4	193	72.4264	7.6160	73.4000	72.7697	8.5991	52.9000	86.5000	33.6000	-0.3792	-0.6855	0.5482
unemployment_(%)	5	193	6.8661	5.1275	5.0200	6.0661	2.5797	0.1000	26.3070	26.2070	1.4854	1.8643	0.3691
gdp_percap	6	193	15889.1266	25892.1206	6070.3881	10373.6659	6901.5154	216.9730	199382.8386	199165.8656	3.6552	18.3176	1863.7556
gni_percap	7	193	14490.6563	19284.8572	5940.0000	10385.2903	6775.4820	230.0000	83160.0000	82930.0000	1.7921	2.3469	1388.1544
disease_deathrate	8	193	116.6439	135.6945	57.2600	89.4985	50.7642	4.1600	820.5100	816.3500	2.0516	4.8176	9.7675
diseases_death	9	193	40634.7789	150938.8479	5385.0000	15385.2344	7807.3716	6.0000	1841182.0000	1841176.0000	9.5761	105.6194	10864.8160
hospital_beds	10	193	27.5813	20.8002	21.5000	24.7495	17.4947	1.7000	128.8000	127.1000	1.3482	2.3763	1.4972
child_mortality	11	193	20.5006	21.2119	12.5398	16.5013	12.7742	0.8317	98.4470	97.6153	1.7055	2.6173	1.5269
adult_mortality	12	193	191.9923	96.5979	185.8290	184.8435	114.1291	46.4220	516.4180	469.9960	0.5762	-0.0935	6.9533
total_mortality	13	193	7.6299	2.6311	7.2900	7.4650	2.2387	0.9910	15.5000	14.5090	0.5186	0.4896	0.1894
alcohol_percap	14	193	5.4208	3.8830	4.9250	5.2308	4.6465	0.0051	16.9896	16.9846	0.3819	-0.8859	0.2795
urban_pop	15	193	59.2395	22.9545	59.0390	59.6902	27.2472	13.2500	100.0000	86.7500	-0.1482	-0.9836	1.6523
continent*	16	193	2.6684	1.4874	2.0000	2.5097	1.4826	1.0000	6.0000	5.0000	0.6513	-0.4678	0.1071

3. Correlation analysis with target variable

3.1 Full data correlation matrix

show code

# store a numeric only dataframe for our linear regression
numeric_df = select(final_data, where(is.numeric)) 

# form a correlation matrix
cor_matrix = numeric_df %>% cor() 

# identify the variable with high correlation 
cor_matrix %>%
  corrplot(method ="number",
           order = "hclust")

show code

# identify the correlated variables, setting a threshold of 0.3 correlation. (Moderate to strong correlation)
correlated_variables = as.data.frame(cor_matrix) %>% 
  select(life_expectancy) %>%
  filter(life_expectancy > 0.3 | life_expectancy < -0.3 )%>%
  filter(life_expectancy != 1) 

# these variables with the most correlation 
kable(correlated_variables %>%
  rownames_to_column() %>% 
  arrange(desc(life_expectancy)),
  caption = "*Variables with moderate to strong Correlation with life expectancy*",
  label= "*Variable*")

*Variables with moderate to strong Correlation with life expectancy*
rowname	life_expectancy
health_expenditure	0.7162116
gni_percap	0.6961471
gdp_percap	0.6107644
urban_pop	0.5798739
alcohol_percap	0.4023439
hospital_beds	0.3246269
disease_deathrate	-0.8006771
child_mortality	-0.8412293
adult_mortality	-0.9340389

show code

# Melt the data frame
df_melt0 <- melt(numeric_df%>%
  select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars = "life_expectancy")

# Create scatterplots
scatterplots0 <- ggplot(df_melt0, aes(x = value, y = life_expectancy)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Scatterplot Matrix Against Life Expectancy")+
  geom_smooth(size = 0.3, color = "red",linetype = 2, se = FALSE)+
  theme_bw()

print(scatterplots0)

show code

distributions0 <- ggplot(df_melt0, aes(x = value)) +
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.3) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions Against Life Expectancy")+
  theme_bw()

distributions0

4. Final selected Inputs

show code

# select corelated variable without normalization 
correlated_inputs = select(final_data, c(colnames(t(correlated_variables)), "life_expectancy", ))
kable(correlated_inputs %>%
  mutate(`Income Level` = factor(case_when(
    gni_percap < 1035 ~ "Low-income",
    gni_percap >= 1036 & gni_percap <= 4045 ~ "Lower-middle-income",
    gni_percap >= 4046 & gni_percap <= 12535 ~ "Upper-middle-income",
    gni_percap >= 12535 ~ "High-income"),
    levels = c("Low-income", "Lower-middle-income","Upper-middle-income","High-income"))) %>%
  tbl_summary(
    by = `Income Level`,
    missing = "no"
  ) %>%
  modify_header(label = "**Variable**"),
      caption = "*Selected dataset summary statistics aggregated by income levels*",digit = 4)

*Selected dataset summary statistics aggregated by income levels*
Variable	Low-income, N = 24	Lower-middle-income, N = 50	Upper-middle-income, N = 60	High-income, N = 59
health_expenditure	107 (72, 152)	213 (146, 412)	907 (657, 1,306)	3,324 (2,200, 5,314)
gdp_percap	698 (540, 810)	2,145 (1,454, 3,049)	6,697 (5,172, 9,158)	32,279 (19,138, 51,004)
gni_percap	695 (520, 823)	2,130 (1,525, 3,158)	6,635 (5,260, 8,975)	34,660 (18,630, 50,470)
disease_deathrate	291 (212, 407)	134 (61, 250)	44 (27, 75)	36 (25, 48)
hospital_beds	9 (8, 20)	11 (9, 18)	22 (13, 40)	31 (25, 51)
child_mortality	52 (41, 70)	26 (15, 38)	11 (7, 18)	4 (3, 6)
adult_mortality	290 (266, 325)	231 (187, 282)	183 (141, 227)	83 (69, 139)
alcohol_percap	2.0 (0.4, 3.8)	3.3 (1.4, 5.0)	5.3 (2.7, 7.4)	9.4 (6.3, 10.9)
urban_pop	37 (24, 43)	42 (31, 57)	64 (54, 77)	82 (68, 90)
life_expectancy	62 (60, 64)	68 (63, 71)	74 (72, 77)	81 (78, 83)

show code

kable(describe(correlated_inputs),
      caption = "*Selected dataset summary statistics for each variable*",
      digits = 4)

*Selected dataset summary statistics for each variable*
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
health_expenditure	1	193	1573.7975	1897.9076	791.7386	1193.4901	957.4787	39.1323	10661.0284	10621.8961	1.8082	3.2498	136.6144
gdp_percap	2	193	15889.1266	25892.1206	6070.3881	10373.6659	6901.5154	216.9730	199382.8386	199165.8656	3.6552	18.3176	1863.7556
gni_percap	3	193	14490.6563	19284.8572	5940.0000	10385.2903	6775.4820	230.0000	83160.0000	82930.0000	1.7921	2.3469	1388.1544
disease_deathrate	4	193	116.6439	135.6945	57.2600	89.4985	50.7642	4.1600	820.5100	816.3500	2.0516	4.8176	9.7675
hospital_beds	5	193	27.5813	20.8002	21.5000	24.7495	17.4947	1.7000	128.8000	127.1000	1.3482	2.3763	1.4972
child_mortality	6	193	20.5006	21.2119	12.5398	16.5013	12.7742	0.8317	98.4470	97.6153	1.7055	2.6173	1.5269
adult_mortality	7	193	191.9923	96.5979	185.8290	184.8435	114.1291	46.4220	516.4180	469.9960	0.5762	-0.0935	6.9533
alcohol_percap	8	193	5.4208	3.8830	4.9250	5.2308	4.6465	0.0051	16.9896	16.9846	0.3819	-0.8859	0.2795
urban_pop	9	193	59.2395	22.9545	59.0390	59.6902	27.2472	13.2500	100.0000	86.7500	-0.1482	-0.9836	1.6523
life_expectancy	10	193	72.4264	7.6160	73.4000	72.7697	8.5991	52.9000	86.5000	33.6000	-0.3792	-0.6855	0.5482

show code

sources = as.data.frame(t(correlated_inputs))[2] %>% rownames_to_column()
sources$V2 <- c("WorldBank","WorldBank","WorldBank","Our World In Data", "World Health Organization (WHO)","UNICEF", "WorldBank", "WorldBank", "WorldBank", "WorldBank")
colnames(sources) <- c("variable", "source")
kable(sources, caption = "*Each selected variables sources*")

*Each selected variables sources*
variable	source
health_expenditure	WorldBank
gdp_percap	WorldBank
gni_percap	WorldBank
disease_deathrate	Our World In Data
hospital_beds	World Health Organization (WHO)
child_mortality	UNICEF
adult_mortality	WorldBank
alcohol_percap	WorldBank
urban_pop	WorldBank
life_expectancy	WorldBank

How would this variable be useful?

All these 9 selected variables have moderate to strong correlation (threshold of 0.3 correlation). Hence, these variables have a noticable relationship with life expectancy. Each of these features provide valuable information for the predictive model to run. Analyzing these variables collectively can help develop predictive models for estimating life expectancy accurately. However, to fit the model, further data preprocessing will be used. By bringing features to a similair scale, the model is able to eliminate bias from outliers and produce more accurate prediction.

5. Data Normalization of selected inputs

show code

# Create a recipe for data preprocessing with imputation
rec1 <- recipe(~ ., data = final_data) %>%
    step_YeoJohnson(colnames(t(correlated_variables))) # Normalize numeric predictors

# Prepare the recipe
rec_prep1 <- prep(rec1, training = final_data)

# Apply the recipe to the data (bake the recipe)
normalized_final_data <- bake(rec_prep1, new_data =final_data )

df_melt1 <- melt(select(normalized_final_data, where(is.numeric))%>%
  select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars = "life_expectancy")

distributions1 <- ggplot(df_melt1, aes(x = value)) +
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.3) +  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions Against Life Expectancy")+
  theme_bw()

distributions1

show code

scatterplots1 <- ggplot(df_melt1, aes(x = value, y = life_expectancy)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Scatterplot Matrix Against Life Expectancy")+
  geom_smooth(size = 0.3, color = "red",linetype = 2, se = FALSE)+
  theme_bw()

print(scatterplots1)

6. Data partitioning

show code

# our selected data
selected_inputs = select(normalized_final_data, 
                           c("life_expectancy",
                             colnames(as.data.frame(t(correlated_variables)))))
set.seed(000)
# split data to training and testing
df_split = initial_split(selected_inputs, prop = 0.7, strata = life_expectancy)

# input train and test data 
df_train = training(df_split)
df_test = testing(df_split)

Multiple Linear Regression - Ordinary Least Squared

\(y=β0+β1x1+β2x2+...+ϵ\)

1. Training Multiple Linear Regression

1.1. Train and evaluate initial regression model

show code

# form the formula
first_fmla = as.formula("life_expectancy ~ health_expenditure +
                        gdp_percap+ gni_percap + disease_deathrate + 
                        child_mortality + adult_mortality + urban_pop +
                        alcohol_percap + hospital_beds")

# formula being used
linear_model0 = lm(first_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results0 = cbind(df_train,
        as.data.frame(predict(linear_model0,interval = "prediction")))

# model summary
linear0_summary = summary(linear_model0)
modelsummary_output <- modelsummary(
  list(linear_model0), 
  output = "gt",
  stars = TRUE,
  statistic = c("t = {statistic}", "se = {std.error}", 'p.value'),
  shape = 
)
# Convert the gt  to kableExtra
kable(modelsummary_output %>%
  as.data.frame(),
  caption = "*Initial training model summary*",
  full_width = FALSE, position = "center")

*Initial training model summary*
	(1)
(Intercept)	94.831***
	t = 27.843
	se = 3.406
	(<0.001)
health_expenditure	0.427
	t = 1.404
	se = 0.304
	(0.163)
gdp_percap	3.718*
	t = 2.560
	se = 1.452
	(0.012)
gni_percap	-2.779*
	t = -2.098
	se = 1.325
	(0.038)
disease_deathrate	-1.373*
	t = -2.367
	se = 0.580
	(0.020)
child_mortality	-3.209***
	t = -6.382
	se = 0.503
	(<0.001)
adult_mortality	-1.131***
	t = -13.562
	se = 0.083
	(<0.001)
urban_pop	-0.008
	t = -0.834
	se = 0.010
	(0.406)
alcohol_percap	0.216
	t = 1.525
	se = 0.142
	(0.130)
hospital_beds	-0.643**
	t = -2.987
	se = 0.215
	(0.003)
Num.Obs.	133
R2	0.955
R2 Adj.	0.951
AIC	536.1
BIC	567.9
Log.Lik.	-257.047
F	286.747
RMSE	1.67

1.2. Checking for Multicollinearity: Variance of Influence (VIF)

show code

# form a VIF plot 
VIF_linear_0 = as.data.frame(vif(linear_model0)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model0)))+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

VIF_linear_0

The VIF plot suggest gni_percap and gdp_percap variable is correlated. Therefore we delete the variable wit the highest VIF values, gni_percap. Additionally, we delete child_mortality as well, this is likely to be correlated with adult_mortality. Removing multicollinearity may help improve model interpretability and enhance statistical significance.

show code

# form the formula
second_fmla = as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds + health_expenditure")

# formula being used
linear_model1 = lm(second_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results1 = cbind(df_train,
        as.data.frame(predict(linear_model1,interval = "prediction")))

# recieve residuals
df_results1$residuals = linear_model1$residuals
# model summary
linear1_summary = summary(linear_model1) 
VIF_linear_1 = as.data.frame(vif(linear_model1)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model1))
)+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

(VIF_linear_0 + ggtitle(label = "Before removing gni_percap and child_mortality") ) / 
(VIF_linear_1 + ggtitle(label = "After removing gni_percap and child_mortality ") )

Health_expenditure is still correlated with gdp_percap, therefore we ultimately remove this.

show code

third_fmla = as.formula("life_expectancy ~ gdp_percap + disease_deathrate + 
                        adult_mortality + urban_pop + alcohol_percap + 
                        hospital_beds")

# formula being used
linear_model2 = lm(third_fmla,
# dataset being used
   data = df_train) 

# recieve results
df_results2 = cbind(df_train,
        as.data.frame(predict(linear_model2,interval = "prediction")))

# recieve residuals
df_results2$residuals = linear_model2$residuals
# model summary
linear2_summary = summary(linear_model2) 

VIF_linear_2 = as.data.frame(vif(linear_model2)) %>% 
  rownames_to_column() %>% 
  ggplot(
  aes(x = rowname, y = vif(linear_model2)))+
  geom_col()+
  coord_flip() +
  geom_hline(yintercept = 5, linetype = 2, color = 'blue')+
  labs(title = "VIF Values of Predictor Variables", x= NULL, y = 'VIF values') +
  theme_bw()+
   theme(axis.text.y = element_text(size = 12)) 

(VIF_linear_1 + ggtitle(label = "Before removing health_expenditure") ) / 
(VIF_linear_2 + ggtitle(label = "After removing health_expenditure") )

1.3. Training model evaluation after adjusting for multicollinearity

show code

# calculate RMSE
rmse_lm_train <- sqrt(mean(df_results2$residuals^2))

# calculate MAE
mae_lm_train <- mean(abs(df_results2$residuals))

# calculate MSE
mse_lm_train <- mean(df_results2$residuals^2)

# calculate AIC
aic_value_lm_train <- AIC(linear_model2)

# calculate BIC
bic_value_lm_train <- BIC(linear_model2)

modelsummary_output1 <- modelsummary(
  list(linear_model2), 
  output = "gt",
  stars = TRUE,
  statistic = c("t = {statistic}", "se = {std.error}", 'p.value')
)
kable(as.data.frame(modelsummary_output1),
      caption = "*Training model summary*")

*Training model summary*
	(1)
(Intercept)	87.483***
	t = 24.310
	se = 3.599
	(<0.001)
gdp_percap	1.820***
	t = 5.549
	se = 0.328
	(<0.001)
disease_deathrate	-2.742***
	t = -4.380
	se = 0.626
	(<0.001)
adult_mortality	-1.351***
	t = -15.003
	se = 0.090
	(<0.001)
urban_pop	-0.015
	t = -1.385
	se = 0.011
	(0.169)
alcohol_percap	0.524***
	t = 3.404
	se = 0.154
	(<0.001)
hospital_beds	-0.270
	t = -1.115
	se = 0.242
	(0.267)
Num.Obs.	133
R2	0.937
R2 Adj.	0.934
AIC	574.4
BIC	597.5
Log.Lik.	-279.186
F	309.896
RMSE	1.97

show code

# collect train metrics
lm_train_metrics = data.frame(
  metric = c("rmse", "rsq", "mae", "mse", "aic","bic"),
  lm_train = c(rmse_lm_train, linear2_summary$r.squared, mae_lm_train, mse_lm_train, aic_value_lm_train, bic_value_lm_train ))

kable(lm_train_metrics,
      caption = "*Training model performance metrics*")

*Training model performance metrics*
metric	lm_train
rmse	1.974277
rsq	0.936536
mae	1.469630
mse	3.897769
aic	574.371421
bic	597.494214

We’ve eventually arrive to this initial final model after adjusting for multicollinearity.

1.4. Prediction vs actual plot

show code

# plot the prediction
r_squared_linear1 = ggplot(df_results2, aes(x = fit, life_expectancy)) +
  geom_point(alpha = 0.4) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred()+
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Training Model using Linear Regression") + 
  theme_bw() +
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(linear2_summary$r.squared,2)))

r_squared_linear1

1.5. Residuals diagnostics

show code

# form a residual histogram plot
residual_linear1 = ggplot(df_results2, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(df_results2$residuals), 
             linetype = 2, color = "blue") +
  geom_vline(xintercept = median(df_results2$residuals), 
             linetype = 2, color = "darkgreen") +
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(df_results2$residuals),2)), 
           color = "blue" ) +
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(df_results2$residuals),2)), 
           color = "darkgreen" ) +
  ylim(0, 0.21) +
  theme_bw() +
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit1 = ggplot(df_results2, aes(x = fit, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw() +
  ggtitle("Residual plot ")

# qq plot
qq_plot1 = ggplot(df_results2, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot ", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

 residual_linear1 + (  qq_plot1/ residual_fit1 )

Random Distribution: The residuals appear to be randomly scattered, indicating that the model fits the data well and that there are no obvious patterns or trends.
Homoscedasticity: The spread of residuals seems consistent across the range of fitted values, suggesting that the variance of the residuals is constant (homoscedasticity). No heteroskedasticity.
No Systematic Patterns: There are no clear curves or systematic patterns, implying that the model captures the relationship between predictors and the response variable adequately.
Outliers and normality: Good amount of residuals fall relatively far from zero but has a roughly symmetric distribution with a longer tails to the left, indicating potential outliers that may need further investigation.
Center and Spread: Mean of residuals is approximately 0 and the median is 0.26, Since the residuals are centered around 0, it suggest no significant bias in the model.

2. K-fold Cross-validation

2.1. Cross-validation performance evaluation

show code

# Specify a model
lm_model0 <- linear_reg() %>%
  set_engine("lm")

# Create a recipe for data preprocessing
rec <- recipe(third_fmla, 
              data = df_train) 

# Create a workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model0)

# Create 10-fold cross-validation resamples
set.seed(123) # For reproducibility
folds <- vfold_cv(df_train, v = 10)

# Perform cross-validation
results <- wf %>%
  fit_resamples(
    resamples = folds,
    metrics = metric_set(rmse, rsq),
    control = control_resamples(save_pred = TRUE)
  )

# Collect and summarize metrics
metrics_cv = results %>%
  collect_metrics()

# Visualize results 
lm_cv_predictions = results %>%
  collect_predictions()
# Function to fit model and extract AIC/BIC
extract_aic_bic <- function(split) {
  train_data <- analysis(split)
  mod_fit <- lm(life_expectancy ~ ., data = train_data)
  aic <- AIC(mod_fit)
  bic <- BIC(mod_fit)
  
  tibble(
    id = split$id,
    AIC = aic,
    BIC = bic
  )
}

aic_bic_results <- folds %>%
  mutate(results = map(splits, extract_aic_bic)) %>%
  select(id, results) %>%
  unnest(cols = results, names_repair = "unique")

# Summarize the results
aic_bic_summary <- aic_bic_results %>%
  summarise(
    mean_AIC = mean(AIC),
    mean_BIC = mean(BIC)
  )

# collect cv metrics
lm_cv_metrics = metrics_cv %>% 
  transmute(metric = .metric, lm_crossvalidate = mean,) 

# calculate residuals 
lm_cv_residuals = lm_cv_predictions$.pred - lm_cv_predictions$life_expectancy

# Calculate MAE
mae_cv <- mean(abs(lm_cv_residuals))

# Calculate MSE
mse_cv <- mean((lm_cv_residuals)^2)

lm_cv_metrics = lm_cv_metrics %>% 
  full_join(data.frame(metric = c("mae", "mse", "aic", "bic"),
           lm_crossvalidate = c(mae_cv, mse_cv,
                                aic_bic_summary$mean_AIC, 
                                aic_bic_summary$mean_BIC)), 
                                        by = c("lm_crossvalidate", "metric"))
# combine all metrics in linear model
train_metrics_lm = full_join(lm_train_metrics, lm_cv_metrics,
  by = "metric")

kable(train_metrics_lm,
      caption = "*Train and Cross-validation performance metrics*")

*Train and Cross-validation performance metrics*
metric	lm_train	lm_crossvalidate
rmse	1.974277	2.0193737
rsq	0.936536	0.9450802
mae	1.469630	1.5712364
mse	3.897769	4.5356599
aic	574.371421	483.0233169
bic	597.494214	513.6581109

2.2. Prediction vs Actual Plot

show code

lm_cv_predictions%>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "K-Fold Cross-validation resamples for Linear Regression") +   coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(metrics_cv$mean[2],2)))

We’ve reached our optimum OLS linear model, the R-squared seems to reach 0.95 even with cross-validation testing!

2.3. Residual Diagnostic

show code

df_cv_resid = data.frame("residuals" = lm_cv_residuals,
                         fit = df_train$life_expectancy)
# form a residual histogram plot
residual_linear_cv = ggplot(df_cv_resid, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(df_cv_resid$residuals), 
             linetype = 2, color = "blue") +
  geom_vline(xintercept = median(df_cv_resid$residuals), 
             linetype = 2, color = "darkgreen") +
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(df_cv_resid$residuals),2)), 
           color = "blue" ) +
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(df_cv_resid$residuals),2)), 
           color = "darkgreen" ) +
  ylim(0, 0.21) +
  theme_bw() +
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_cv = ggplot(df_cv_resid, aes(x = fit, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw() +
  ggtitle("Residual plot ")

# qq plot
qq_plot_cv = ggplot(df_cv_resid, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot ", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

 residual_linear_cv + (  qq_plot_cv/ residual_fit_cv )

3. Prediction and Evaluation

3.1. Testing performance evaluation (

show code

# Predict on the test dataset
model_fit0 <- wf %>%
  fit(data = df_train)

test_predictions0 <- model_fit0 %>%
  predict(new_data = df_test) %>%
  bind_cols(df_test)

# get residuals 
test_predictions0 <- test_predictions0 %>%
  mutate(residuals = life_expectancy - .pred)

metrics0 <- test_predictions0 %>%
  metrics(truth = life_expectancy, estimate = .pred)

# collect testing metrics
# get mse
mse_testing <- data.frame(metric = "mse",
                          lm_testing = mean((test_predictions0$residuals)^2))
# make test metrics 
lm_test_metrics = metrics0 %>% 
  transmute(metric = .metric, 
            lm_testing = .estimate) %>%
  full_join(mse_testing)

# combine all metrics in linear model
joined_metrics_lm = full_join(lm_train_metrics, lm_cv_metrics,
  by = "metric") %>% inner_join(lm_test_metrics)

# Summary statistics of the test dataset
kable(model_fit0 %>%
  extract_fit_parsnip() %>%
  tidy(),
  caption = "*Testing model summary*")

*Testing model summary*
term	estimate	std.error	statistic	p.value
(Intercept)	87.4834234	3.5986188	24.310278	0.0000000
gdp_percap	1.8201193	0.3279959	5.549214	0.0000002
disease_deathrate	-2.7424293	0.6261265	-4.379993	0.0000247
adult_mortality	-1.3511587	0.0900610	-15.002699	0.0000000
urban_pop	-0.0151245	0.0109225	-1.384708	0.1685889
alcohol_percap	0.5243082	0.1540085	3.404412	0.0008898
hospital_beds	-0.2696177	0.2418432	-1.114845	0.2670386

The testing model summary highlights the relationships between key predictors and life expectancy. GDP per capita positively influences life expectancy (estimate = 1.795, p < 0.001), indicating economic stability enhances health outcomes. Disease death rate (estimate = -2.755, p < 0.001) and adult mortality (estimate = -1.344, p < 0.001) negatively affect life expectancy, underscoring the adverse impact of health burdens. Urban population and hospital beds were not significant predictors (p > 0.1), while alcohol consumption per capita had a positive but moderate effect (estimate = 0.526, p < 0.001). These findings align with the hypothesis that healthcare accessibility and economic factors significantly influence life expectancy.

3.2. Overall OLS regression’s performance metrics

show code

kable(joined_metrics_lm,
      caption = "*Multiple Linear Regression's perfomance metrics*")

*Multiple Linear Regression’s perfomance metrics*
metric	lm_train	lm_crossvalidate	lm_testing
rmse	1.974277	2.0193737	2.1707938
rsq	0.936536	0.9450802	0.9060414
mae	1.469630	1.5712364	1.6833470
mse	3.897769	4.5356599	4.7123459

3.3. Prediction vs actual plot

show code

test_predictions0 %>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Linear Regression Testing set") +   coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy") +
  annotate("label", x = 78, y = 50,
           label = paste0("R-squared:", 
                          round(metrics0$.estimate[2],2)))

It performed really well!

3.4. Residual Diagnostic of testing data

show code

# form a residual histogram plot
histogram_linear_test = ggplot(test_predictions0, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(test_predictions0$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(test_predictions0$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(test_predictions0$residuals),2)), 
           color = "blue" )+
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(test_predictions0$residuals),2)), 
           color = "darkgreen" )+
  ylim(0, 0.21)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_test_linear = ggplot(test_predictions0, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq_linear_test = ggplot(test_predictions0, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_linear_test + (qq_linear_test/residual_fit_test_linear)

Regularized Regression : L2 and Elastic Net

Since there is severe multicollinearity in the original selected input, we had to delete 3 variables from the selected input even though we still believe those inputs are still useful. Hence, we could’ve capture important information that can better predict life_expectancy. Therefore, we decided to use Ridge Regression because we wanted to keep all selected variables. We will assign them different weightage and penalty.

We wanted to find the optimal values of lambda. We do this by using 10 fold cross validation

1. Further preprocessing: Standardization

show code

# standardize the train variable
blueprint = recipe(life_expectancy ~ ., data = df_train)%>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

prepare <- prep(blueprint, training = df_train)

ridge_train <- bake(prepare, new_data = df_train)

# standardize the test variable
blueprint_test = recipe(life_expectancy ~ ., data = df_test)%>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

prepare_test  <- prep(blueprint, training = df_test)

ridge_test <- bake(prepare, new_data = df_test)

X_test <- model.matrix(life_expectancy ~ ., ridge_test)[, -1]

# transform y with log transformation
Y_test <- ridge_test$life_expectancy

X <- model.matrix(life_expectancy ~ ., ridge_train)[, -1]

# transform y with log transformation
Y <- ridge_train$life_expectancy
X %>% summary()

 health_expenditure    gdp_percap         gni_percap        disease_deathrate 
 Min.   :-2.040838   Min.   :-1.91869   Min.   :-1.835669   Min.   :-3.33174  
 1st Qu.:-0.845484   1st Qu.:-0.71710   1st Qu.:-0.753251   1st Qu.:-0.71886  
 Median :-0.001998   Median : 0.01193   Median : 0.006561   Median :-0.07494  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
 3rd Qu.: 0.750230   3rd Qu.: 0.79157   3rd Qu.: 0.768977   3rd Qu.: 0.72109  
 Max.   : 2.120375   Max.   : 2.33050   Max.   : 1.855852   Max.   : 2.09019  
 hospital_beds      child_mortality     adult_mortality    alcohol_percap    
 Min.   :-2.59460   Min.   :-2.302658   Min.   :-1.91589   Min.   :-1.82613  
 1st Qu.:-0.82147   1st Qu.:-0.796102   1st Qu.:-0.83967   1st Qu.:-0.84449  
 Median : 0.05711   Median :-0.004018   Median : 0.09513   Median : 0.04171  
 Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.: 0.77446   3rd Qu.: 0.837161   3rd Qu.: 0.70380   3rd Qu.: 0.84159  
 Max.   : 2.40967   Max.   : 1.900753   Max.   : 2.35828   Max.   : 1.82986  
   urban_pop       
 Min.   :-1.96012  
 1st Qu.:-0.77573  
 Median : 0.04723  
 Mean   : 0.00000  
 3rd Qu.: 0.81685  
 Max.   : 1.74379

2. Ridge Regression Training and Evaluation

2.1 Regularization path plot

show code

ridge <- glmnet(
  x = X,
  y = Y,
  alpha = 0) # a = ridge 

plot(ridge, main = "Ridge penalty\n")

This plot shows how ridge regression coefficients shrink as the regularization parameter λ increases (moving right). Higher λ values lead to smaller coefficients, indicating stronger regularization, which reduces overfitting by limiting the model’s sensitivity to individual features. Unlike LASSO, coefficients remain non-zero, meaning all features contribute to the model. This helps balance model complexity and prediction accuracy, making it robust, especially when handling multicollinearity or many features relative to observations.

2.2. Cross-validation to search optimal lambda

show code

# cross validation ridge regression
ridge_cv <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 0,
  type.measure ="mse"
)

plot(ridge, main = "Ridge penalty\n")
abline(v = ridge_cv$lambda.min, col = "red", lty = "dashed")
abline(v = ridge_cv$lambda.1se, col = "blue", lty = "dashed")

2.4. Cross-validation plot

show code

# we will be using MSE to compare models 
plot(ridge_cv, main = "Ridge penalty\n\n")

2.5. Cross-validation results

show code

ridge_cv


Call:  cv.glmnet(x = X, y = Y, type.measure = "mse", alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.7358   100   3.658 0.3665       9
1se 1.2859    94   3.967 0.4291       9

The 10-fold cross-validation Mean Squared Error (MSE) for a ridge model is depicted in a plot. The first vertical dotted line indicates the value of λ that corresponds to the smallest MSE, while the second vertical line represents the λ value that is within one standard error of the minimum MSE.

Minimum Lambda Found: 0.7358318

The crossvalidation gives us this optimized lambda values that allow us to yield the lowest Mean Squared Error, now we will use this value to predict using the training dataset.

2.3. Uses the minimum lambda to predict training set

show code

# using this crossvalidation
ridge_cv_result = ridge_cv %>% 
  predict(newx = X) %>% rowMeans()%>% 
  cbind(Y)%>% 
  as.data.frame() 

ridge_cv_residuals = (Y - ridge_cv %>% 
  predict(newx = X))

SST <- sum((Y - mean(Y))^2)  # Total Sum of Squares
SSR <- sum(ridge_cv_residuals^2) # Residual Sum of Squares

R_squared_ridgecv_train <- 1 - (SSR / SST)

ridge_cv_result %>% 
  ggplot(aes(x = . , y = Y)) + 
  geom_point(alpha = 0.5)+   
  geom_abline(linetype = 2, color = "blue") +
  coord_obs_pred() +
  theme_bw() +
  annotate("label", x = 80, y = 50,
           label = paste0("R-squared:", round(R_squared_ridgecv_train,2)))+
  xlab ("Predicted life expectancy") +
  ylab ("Actual life expectancy")

3. Elastic net to find the optimal tuning

3.1. Cross-validation plot

show code

# grid search across 
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  #preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

# plot cross-validated RMSE
ggplot(cv_glmnet)+ theme_bw()

This plot visualizes which parameter yields the lowest RMSE in each of the 10 folds. Notice how the lambda has changes a bit, but allow us to use alpha of 0.2.

show code

kable(cv_glmnet$bestTune, caption = "*Grid seach result*")

*Grid seach result*
	alpha	lambda
3	0.1	0.0181434

Alpha of 0.1 with lambda of 0.0078 gives the lowest rmse.

3.2. Variable Importance plot

show code

vip(cv_glmnet, bar = FALSE) + theme_bw()

3.2 Final evaluation of ridge and elastic net models

show code

# predict life_expectancy on training data
ridge_grid_result <- data.frame(pred = predict(cv_glmnet, X),
                                fit = df_train$life_expectancy)

ridge_grid_residuals = ridge_grid_result$pred - ridge_grid_result$fit

# compute RMSE of transformed predicted
rmse_grid = RMSE(ridge_grid_result$pred, ridge_grid_result$fit)
mae_grid = MAE(ridge_grid_result$pred, ridge_grid_result$fit)
mse_grid = mean((ridge_grid_residuals)^2)


SST <- sum((Y - mean(Y))^2)  # Total Sum of Squares
SSR_grid <- sum(ridge_grid_residuals^2) # Residual Sum of Squares

R_squared_gridsq_train <- 1 - (SSR_grid / SST)

ridge_cv_grid_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  elastic_train = c(rmse_grid, 
                       R_squared_gridsq_train, 
                       mae_grid ,mse_grid))

# calculate MAE
mae_cv_ridge <- mean(abs(ridge_cv_residuals))

# calculate MSE
mse_cv_ridge <- mean((ridge_cv_residuals)^2)

# calculate MSE
rmse_cv_ridge <- sqrt(mse_cv_ridge)


# calculate MAE
mae_cv_ridge <- mean(abs(ridge_cv_residuals))

# calculate MSE
mse_cv_ridge <- mean((ridge_cv_residuals)^2)

# calculate MSE
rmse_cv_ridge <- sqrt(mse_cv_ridge)
ridge_cv_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  ridge_train = c(rmse_cv_ridge, R_squared_ridgecv_train, mae_cv_ridge ,mse_cv_ridge)
)

train_regular_metrics = ridge_cv_metrics %>% 
  full_join(ridge_cv_grid_metrics, by = "metric")
kable(train_regular_metrics,
      caption = "*Ridge and Elastic net regressions training performance*")

*Ridge and Elastic net regressions training performance*
metric	ridge_train	elastic_train
rmse	1.8708484	1.6795753
rsq	0.9430113	0.9540685
mae	1.4392721	1.2902426
mse	3.5000736	2.8209732

4. Prediction and evaluation

4.1. Ridge regression and Elastic results

show code

ridge_predict = 
  predict(ridge_cv, s=ridge_cv$lambda.1se, newx = X_test, lambda = ridge_cv$lambda.min )

ridge_result =cbind(ridge_predict, Y_test) %>% 
  as.data.frame()

SST_r_test <- sum((Y_test - mean(Y_test))^2)  # Total Sum of Squares
SSR_r_test<- sum((Y_test - ridge_predict)^2) # Residual Sum of Squares

R_squared_r_test <- 1 - ( SSR_r_test/SST_r_test)

ridge_result0 = ridge_result %>%
  transmute(life_expectancy = Y_test,
         .pred = s1,
         residuals = Y_test - s1)

ridge_plot = ridge_result0 %>%
  ggplot(aes(x = .pred, y = life_expectancy)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Ridge Regression") +   
  coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy")+
  annotate("label", x = 76, y = 50,
           label = paste0("R-squared:", 
                          round(R_squared_r_test,3))) 

elastic_predict <- data.frame(pred = predict(cv_glmnet, 
                                               newdata = X_test))
elastic_result =cbind(elastic_predict, Y_test) %>% 
  as.data.frame()

SST_net_test <- sum((Y_test - mean(Y_test))^2)  # Total Sum of Squares
SSR_net_test<- sum((Y_test - elastic_predict)^2) # Residual Sum of Squares

R_squared_net_test <- 1 - (SSR_net_test / SST_net_test)

elastic_result0 = elastic_result %>%
  mutate(residuals = Y_test - pred)

elastic_plot = elastic_result0 %>%
  ggplot(aes(x = pred, y = Y_test)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = 2, color = "blue") +
  ggtitle("Predicted vs Actual life expectancy", 
          label = "Elastic net") +   
  coord_obs_pred()+
  theme_bw()+
  xlab("Predicted Life Expectancy")+
  ylab("Actual Life Expectancy")+
  annotate("label", x = 76, y = 50,
           label = paste0("R-squared:", 
                          round(R_squared_net_test,3))) 
ridge_plot+elastic_plot

4.2 Performance Metrics of testing and training

show code

# Ridge------------
ridge_residuals = ridge_result[1] - ridge_result[2]

# compute RMSE of transformed predicted
rmse_ridge = RMSE(ridge_result$s1, ridge_result$Y_test)
mae_ridge= MAE(ridge_result$s1,ridge_result$Y_test)
mse_ridge = mean((ridge_residuals$s1)^2)

SSR_ridge <- sum(ridge_grid_residuals^2) # Residual Sum of Squares

# calculate MSE
ridge_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  ridge_test = c(rmse_ridge, 
                       R_squared_r_test, 
                       mae_ridge ,mse_ridge))

# Elastic------------
elastic_residuals = elastic_result$pred - elastic_result$Y_test

# compute RMSE of transformed predicted
rmse_elastic = RMSE(elastic_result$pred , elastic_result$Y_test)
mae_elastic = MAE(elastic_result$pred ,elastic_result$Y_test)
mse_elastic = mean((elastic_residuals)^2)

SSR_elastic <- sum(elastic_residuals^2) # Residual Sum of Squares

# calculate MSE
elastic_metrics = data.frame(
  metric = c("rmse","rsq", "mae", "mse"), 
  elastic_test = c(rmse_elastic, 
                       R_squared_net_test, 
                       mae_elastic ,mse_elastic))
join_test_r = ridge_metrics %>% full_join(elastic_metrics, by = 'metric') 
joined_metrics_r = train_regular_metrics %>% 
  full_join(join_test_r, by = 'metric')
kable(joined_metrics_r,
      caption = "*Ridge and Elastic net regression performance metrics*")

*Ridge and Elastic net regression performance metrics*
metric	ridge_train	elastic_train	ridge_test	elastic_test
rmse	1.8708484	1.6795753	1.9318764	1.8293091
rsq	0.9430113	0.9540685	0.9244032	0.9322173
mae	1.4392721	1.2902426	1.5130543	1.4343533
mse	3.5000736	2.8209732	3.7321464	3.3463717

4.3 Residual Diagnostic best model

show code

elastic_residuals_df = data.frame(residuals = elastic_residuals,
                                  life_expectancy = Y_test)
# form a residual histogram plot
histogram_test = ggplot(elastic_residuals_df, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(elastic_residuals_df$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(elastic_residuals_df$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("text",x = -1.5, y = 0.17, 
           label = paste0("Mean: ", round(mean(elastic_residuals_df$residuals),2)), 
           color = "blue" )+
  annotate("text",x = 2.3, y = 0.20, 
           label = paste0("Median: ", round(median(elastic_residuals_df$residuals),2)), 
           color = "darkgreen" )+
  ylim(0, 0.21)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit = ggplot(elastic_residuals_df, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq = ggplot(elastic_residuals_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_test + (qq/residual_fit)

Random Forest Regression - Ensemble Learning

Ensemble-based method is used to seek the bias variance trade-off. We plan to improve predictive accuracy but unfortunately, reduces interpretability as random forest is a black-box model that is difficult to interpret as compare to linear or regularized regression.

1. Model Building

Random Forest models are inherently robust to the scale of input features, as they base splits on relative values, making normalization less crucial (Hu & Szymczak, 2022). However, normalization can still enhance performance by ensuring balanced splits and equal feature contribution, leading to slightly better accuracy. This improvement is minor compared to methods like ridge regression or linear OLS, where normalization is essential for stability and convergence of coefficients. Therefore, to calculate the best predictive model, we’ll still use the same preprocessed data.

show code

# Building the random forest model
set.seed(123)
n_features = length(setdiff(names(df_test), "life_expectancy"))

# for each of the trees 

rf_model_normalized <- randomForest(
  life_expectancy ~ .,
  data = df_train,
  importance = TRUE,
  mtry = floor(n_features/3), # regression
  xtest = subset(df_test,
                 select = -life_expectancy),
                ytest = df_test$life_expectancy,
                         keep.forest = TRUE)

# Print the model summary
print(rf_model_normalized)


Call:
 randomForest(formula = life_expectancy ~ ., data = df_train,      importance = TRUE, mtry = floor(n_features/3), xtest = subset(df_test,          select = -life_expectancy), ytest = df_test$life_expectancy,      keep.forest = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 2.528765
                    % Var explained: 95.88
                       Test set MSE: 2.52
                    % Var explained: 94.95

1.2. Variable importance

show code

# Plot variable importance
vip_rf1 <- vip(rf_model_normalized) + 
  ggtitle("Variable Importance")+theme_bw()

vip_rf1

show code

# create an explainer for the model
explainer_rf = explain(model = rf_model_normalized, data = df_train,
                        y = df_train$life_expectancy, 
                        label = "Random Forest")

Preparation of a new explainer is initiated
  -> model label       :  Random Forest 
  -> data              :  133  rows  10  cols 
  -> data              :  tibble converted into a data.frame 
  -> target variable   :  133  values 
  -> predict function  :  yhat.randomForest  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
  -> predicted values  :  numerical, min =  55.02809 , mean =  72.31606 , max =  84.59589  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -2.44344 , mean =  -0.009295188 , max =  1.904113  
  A new explainer has been created!

1.3. Partial Dependence plot

show code

# plot Partial Dependence for three selected feature
pdp_rf_adult <- model_profile(explainer = explainer_rf, 
                        variables = "adult_mortality")

pdp_rf_disease <- model_profile(explainer = explainer_rf, 
                        variables = "disease_deathrate")

pdp_rf_health <- model_profile(explainer = explainer_rf, 
                        variables = "health_expenditure")

pdp_rf_gdp <- model_profile(explainer = explainer_rf, 
                        variables = "gdp_percap")
# plot the PDP put them in a 2x2 figure
(plot(pdp_rf_adult) + ggtitle(NULL, NULL) + 
  plot(pdp_rf_disease) + ggtitle(NULL, NULL))/
  (plot(pdp_rf_health) + ggtitle(NULL, NULL)+
     plot(pdp_rf_gdp)+ ggtitle(NULL, NULL))

Adult mortality has the greatest influence on the model’s predictions compared to other variables. As the adult mortality decreases, the predicted probability decrease exponentially, till the lowest life expectancy. The diseases death rate suggest simlair results, which make sense where more people are dying, the lower the estimate life expectancy is. In contrast, Health expenditure and GDP percap positively influence the predicted probability. Suggesting as countries invest more in health expenditure the lesser people will die, and the richer the country the higher the life expectancy.

1.4. Single tree instance example

show code

# Extract a single tree from the random forest model
single_tree <- getTree(rf_model_normalized, k = 1, labelVar = TRUE)

# Convert it to a decision tree model
tree_model <- rpart(first_fmla, data = df_train)

# Plot the tree
# Plot the decision tree
rpart.plot(tree_model, type = 1, extra = 101, 
           fallen.leaves = TRUE, main = "Decision Tree for Life Expectancy")

2. Model Prediction and Evaluation

show code

# Predictions on training data
train_preds <- predict(rf_model_normalized, newdata = df_train)

# Predictions on testing data
test_preds <- predict(rf_model_normalized, newdata = df_test)

# Calculate metrics for training data
train_rmse <- RMSE(pred = train_preds, obs = df_train$life_expectancy)
train_mae <- MAE(pred = train_preds, obs = df_train$life_expectancy)
train_mse <- mean((train_preds - df_train$life_expectancy)^2)
train_rsq <- R2(pred = train_preds, obs = df_train$life_expectancy)

# Calculate metrics for testing data
test_rmse <- RMSE(pred = test_preds, obs = df_test$life_expectancy)
test_mae <- MAE(pred = test_preds, obs = df_test$life_expectancy)
test_mse <- mean((test_preds - df_test$life_expectancy)^2)
test_rsq <- R2(pred = test_preds, obs = df_test$life_expectancy)

# Combine metrics into a dataframe
rf_metrics <- data.frame(
  metric = c("rf_train", "rf_test"),
  rmse = c(train_rmse, test_rmse),
  rsq = c(train_rsq, test_rsq),
  mae = c(train_mae, test_mae),
  mse = c(train_mse, test_mse)
)

# Print the metrics
print(rf_metrics)

    metric      rmse       rsq       mae       mse
1 rf_train 0.6857352 0.9932940 0.5139248 0.4702328
2  rf_test 1.5869542 0.9502163 1.1706776 2.5184236

show code

rf_residuals_df = data.frame(residuals = Y_test - test_preds,
                                  life_expectancy = Y_test)

histogram_rf = ggplot(rf_residuals_df, aes(x = residuals)) + 
  geom_histogram(aes(y = ..density..), color = "black", alpha = 0.5) +
  geom_density(color = "red", size = 0.5) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Density") +
  geom_vline(xintercept = mean(rf_residuals_df$residuals), 
             linetype = 2, color = "blue")+
  geom_vline(xintercept = median(rf_residuals_df$residuals), 
             linetype = 2, color = "darkgreen")+
  annotate("label",x = 16, y = 0.03, 
           label = paste0("Mean: ", round(mean(rf_residuals_df$residuals),2)), 
           color = "blue" , size = 3.5)+
  annotate("label",x = 17, y = 0.025, 
           label = paste0("Median: ", round(median(rf_residuals_df$residuals),2)), 
           color = "darkgreen" , size = 3.5)+
  theme_bw()+
  ylab(NULL)

# lm model, uses ordinary least squares regession to fit a straight line
residual_fit_test_rf = ggplot(rf_residuals_df, aes(x = life_expectancy, y = residuals))+
  geom_point(alpha= 0.5) +
  geom_smooth(method ="loess", color = 'red', size = 0.3, se = FALSE) + 
  theme_bw()+
  labs(title = "fit vs residuals")


qq_rf_test = ggplot(rf_residuals_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles")+
  theme_bw()

# plot 
histogram_rf + (qq_rf_test/residual_fit_test_rf)

Model Comparison and Conclusion

show code

t1 = joined_metrics_r %>% 
  full_join(joined_metrics_lm, by = "metric") %>%
  t() %>% as.data.frame() %>%rownames_to_column() 
  
  # transposed metrics
colnames(t1) <- as.character(unlist(t1[1, ]))
 t1 <-  t1[-1, ]

t1$rmse <- as.numeric(t1$rmse)
t1$rsq <- as.numeric(t1$rsq)
t1$mae <- as.numeric(t1$mae)
t1$mse <- as.numeric(t1$mse)

model_comparison = t1 %>% full_join(rf_metrics, by = colnames(t1)) %>% arrange(rmse)
kable(model_comparison, caption = "All model's performance metric comparison")

All model’s performance metric comparison
metric	rmse	rsq	mae	mse
rf_train	0.6857352	0.9932940	0.5139248	0.4702328
rf_test	1.5869542	0.9502163	1.1706776	2.5184236
elastic_train	1.6795753	0.9540685	1.2902426	2.8209732
elastic_test	1.8293091	0.9322173	1.4343533	3.3463717
ridge_train	1.8708484	0.9430113	1.4392721	3.5000736
ridge_test	1.9318764	0.9244032	1.5130543	3.7321464
lm_train	1.9742770	0.9365360	1.4696300	3.8977690
lm_crossvalidate	2.0193737	0.9450802	1.5712364	4.5356599
lm_testing	2.1707938	0.9060414	1.6833470	4.7123459

Consideration throughout the modelling workflow

Throughout the modeling workflow, careful consideration was given to achieving a balance between interpretability and predictive accuracy. The process began with initial model estimation using multiple linear regression to establish a baseline and identify key predictors. Cross-validation was employed to ensure the model’s generalizability, addressing the bias-variance tradeoff by evaluating performance across multiple folds. Due to multicollinearity, we had to remove 3 highly correlated variables. Thus, regularization techniques, including ridge regression and elastic net, were incorporated to keep the 3 variables and handle multicollinearity. The Random Forest model was ultimately chosen for its superior predictive performance, despite its complexity. For every model, steps were taken to validate model assumptions, such as checking for homoscedasticity and normality of residuals, ensuring robustness and reliability in the final model.

Final models results conclusion

1. Multiple Linear Regression

The multiple linear regression model establishes the baseline relationship between predictors and life expectancy. Significant predictors include GDP per capita, disease death rate, and adult mortality, indicating that economic status, disease prevalence, and mortality rates substantially impact life expectancy. The model’s performance is moderate, with some limitations due to multicollinearity among predictors. Adjustments and removal of highly collinear variables were necessary to enhance model reliability and interpretability.

2. Elastic Net Regression

Elastic Net regression combines the penalties of both ridge and lasso regression to address multicollinearity and variable selection. This model showed improved performance over the linear model, with lower RMSE and higher R² values, indicating better predictive accuracy. Significant predictors remain consistent, reaffirming the importance of economic and health-related factors on life expectancy. The model’s regularization helps in retaining valuable predictors while penalizing less significant ones, ensuring a balanced and interpretable model.

3. Random Forest regression

The Random Forest model, an ensemble learning method, provided the best performance among the models. It handled multicollinearity effectively and captured complex relationships in the data. Key predictors like GDP per capita, disease death rate, and adult mortality were identified as the most influential. The model’s high accuracy, reflected in the lowest RMSE and highest R² values, underscores its robustness. However, its black-box nature limits interpretability compared to linear models. Thus, partial dependence plot was visualized. Overall, Random Forest proved highly effective in predicting life expectancy, reinforcing the significance of economic stability and health factors.

Was the research hypothesis achieved?

Yes, the research hypothesis was achieved. Overall the model was able predict the estimated life expectancy with very high accuracy, indicating that the variables we used can deeply explain life expectancy. According to our results, there is sufficient evidence to prove that countries with higher levels of income accessibility to healthcare services do infact exhibit higher life expectancy and lower mortality rates. As countries get wealthier, they increase health expenditure and afford more hospital beds, increasing their likelihood of higher live expectancy compared to countries that are from poorer income levels.

Salient Findings

• Economic Wealth plays a role: Our Random Forest model corroborates existing research on the negative association between mortality rate, diseases and life expectancy. Our analysis suggest the importance of addressing the health disparity between high- and low-income countries.

• Multifaceted Influences on Health: While the model confirms GDP’s significance, it identifies disease death rate and adult mortality as key factors. Obviously disease and mortality rates are key factors, other variables suggests there are external influences that capture the multifaceted nature of life expectancy. If we remove the mortality rate and diseases death rate, and alllow more datasets and variables to be imported, it may give a more detailed insights about life expectancy.

Proposed Solutions

The analysis shows money impacts health. To tackle this, governments can invest more in public healthcare for wider access, create programs for low-income areas, and focus on preventative care. Wealthy nations can also aid developing countries. By using data like this to identify problem areas, governments can create policies that target specific needs and work towards a future where everyone has equal chances for good health.

Limitations

Balancing Interpretability and Accuracy: This analysis aimed to strike a balance between a model that’s easy to understand (interpretable) and one that predicts well (accurate). While linear regression offered clear cause-and-effect relationships, its accuracy was limited by multicollinearity (correlated variables). Random Forest, on the other hand, achieved highest accuracy but its complex nature makes it harder to pinpoint why certain factors influence life expectancy.
Data Availability and Generalizability: The analysis relied on available data, which might not capture all relevant factors affecting life expectancy. Socioeconomic factors beyond GDP, like education or access to clean water,could be missing. Additionally, the chosen models may not generalize perfectly to all countries due to cultural or environmental variations not included in the data.

Introduction

Research Question

Research Hypothesis

Libraries used

Dataset Overview

Initial Data Preprocessing

1. Data collection and integration

Dependent Variable: Life_Expectancy

2. Handling Missing Values

2.1 Removing variables with very high percentage of missing values

Missing Values of Upper middle income to High income countries

Missing Values of Low middle income to Low income countries

2.2. Handling missing values through imputation methods

Use Bagged Trees (Random Forest) for Higher Missingness: (more than or equal to 10%)

Use K-Nearest Neighbor for Lower Missingness: (less than 10%)

2.3 Full dataset before input selection and normalization

3. Correlation analysis with target variable

3.1 Full data correlation matrix

4. Final selected Inputs

5. Data Normalization of selected inputs

6. Data partitioning

Multiple Linear Regression - Ordinary Least Squared

1. Training Multiple Linear Regression

1.1. Train and evaluate initial regression model

1.2. Checking for Multicollinearity: Variance of Influence (VIF)

1.3. Training model evaluation after adjusting for multicollinearity

1.4. Prediction vs actual plot

1.5. Residuals diagnostics

2. K-fold Cross-validation

2.1. Cross-validation performance evaluation

2.2. Prediction vs Actual Plot

2.3. Residual Diagnostic

3. Prediction and Evaluation

3.1. Testing performance evaluation (

3.2. Overall OLS regression’s performance metrics

3.3. Prediction vs actual plot

3.4. Residual Diagnostic of testing data

Regularized Regression : L2 and Elastic Net

1. Further preprocessing: Standardization

2. Ridge Regression Training and Evaluation

2.1 Regularization path plot

2.2. Cross-validation to search optimal lambda

2.4. Cross-validation plot

2.5. Cross-validation results

2.3. Uses the minimum lambda to predict training set

3. Elastic net to find the optimal tuning

3.1. Cross-validation plot

3.2. Variable Importance plot

3.2 Final evaluation of ridge and elastic net models

4. Prediction and evaluation

4.1. Ridge regression and Elastic results

4.2 Performance Metrics of testing and training

4.3 Residual Diagnostic best model

Random Forest Regression - Ensemble Learning

1. Model Building

1.2. Variable importance

1.3. Partial Dependence plot

1.4. Single tree instance example

2. Model Prediction and Evaluation

Model Comparison and Conclusion

Consideration throughout the modelling workflow

Final models results conclusion

1. Multiple Linear Regression

2. Elastic Net Regression

3. Random Forest regression

Was the research hypothesis achieved?

Salient Findings

Proposed Solutions

Limitations

Reference list