The conceptual research question guiding this analysis is: “How do a country’s socio-economic factors, including healthcare access, income levels, and consumption patterns, influence life expectancy globally?” This question aims to explore the complex relationship between these socio-economic determinants and health outcomes, highlighting disparities across countries with different demographics.
Research Hypothesis
The practical research hypothesis posited in this report is: “Countries with higher levels of income accessibility to healthcare services exhibit higher life expectancy and lower mortality rates” we believe the data we uses support the research hypothesis. The full dataset overview will be given below. For instance, Health expenditure, GDP per capita, and GNI per capita all increase significantly from low-income to high-income countries, indicating that wealthier nations invest more in healthcare infrastructure and services (Boniol et al., 2022). Disease death rates and both child and adult mortality rates decrease with rising income levels, reflecting the effectiveness of healthcare services in wealthier countries (Arias et al., 2022). The number of hospital beds and urban population percentages also increase, highlighting improved healthcare capacity and access in richer nations. Despite higher alcohol consumption in high-income countries, the overall health benefits from better healthcare infrastructure outweigh potential negative impacts (Watsons, 2022).
Libraries used
To perform the analysis and generate the necessary visualizations and s for this technical report, the following R packages will be utilized:
To perform this analysis, we utilizes the dataset from the previous infographic assignment. The dataset given, is a joined dataset from numerous sources which consists of socio-economic predictors and wealth status. The previous infographic dataset consists of aggregated countries, region and UN classify territories. Therefore, the code chunk below filter out those non-country values and keep all 193 UN member countries.
Initial Data Preprocessing
1. Data collection and integration
show code
# load previous infographic dataset IA3_df =read_excel("Datasets/Raw_IA3_df.xlsx")# convert 0s into NAfull_df <- IA3_df %>%mutate(across(where(is.numeric), ~na_if(., 0)))# countries column has loads of region and state names, even aggregated names.# collect country code that consists of all recognized 195 countries, however this datasets do not consists of Palestine and Vatican city as they are non-member observer state.codes =read_csv("Datasets/countries.csv") %>%select(en, alpha3) %>%transmute(Code =toupper(alpha3),Country = en)# get continents dataset for grouping descriptive summary resultscontinent =read_csv("Datasets/continents.csv") %>%rename(country = Country)# join two datasets, and remove rows that is inconsistent with the earlier country codecleaned_df = full_df %>%full_join(codes, by ="Code") %>%filter(Country !="0") %>%select(-Country) %>%distinct(Code, .keep_all =TRUE) %>%inner_join(continent, by ="country")# rewrite the name to be shorter and simpler to writecleaned_df = cleaned_df %>%rename(adult_mortality =`Adult Mortality rate per 1,000`,total_mortality =`Total Mortality rate per 1,000`)# replace white space with underscoresnames(cleaned_df) <-gsub(" ", "_",tolower(names(cleaned_df)))
Dependent Variable: Life_Expectancy
The dependent variable we use is `Life_Expectancy` variable which fortunately has no missing values. Data is obtained from World Bank Databank specifically taken from year 2019 before the heavy influx of Covid-19 that may affect our predictions.
2. Handling Missing Values
show code
# calculate the percentage amount of NA relative to the dataset populationpercent_na =sum(is.na(cleaned_df)) / (nrow(cleaned_df) *ncol(cleaned_df)) # calculate the total NAfull_na =sum(is.na(cleaned_df))# visualize the missing valuesvis_miss(cleaned_df)
In this dataset, it has 416 missing “NA” values. Which means 11.97% of this dataset is missing. gini_index and doctors dataset has the highest missing values while life expectancy (dependent variable) has no missing values. Because more than half of the observation is missing, imputing the value would not be sui as the predicted values would not replicate the supposed actual value.
2.1 Removing variables with very high percentage of missing values
Missing Values of Upper middle income to High income countries
show code
# we visualizes the data of the upper-middle- income and abovecleaned_df %>%filter(gni_percap >=3995)%>%vis_miss()
Missing Values of Low middle income to Low income countries
show code
# we visualizes the data of the low-middle- income and belowcleaned_df %>%filter(gni_percap <=3995)%>%vis_miss()
It can be seen that the lower income countries has relatively higher amount of missing values. This is no coincidence, likely to be Missing Not at Random (MNAR), suggesting that regions with lower income are less likely to report this data, even 50% of the hospital_beds variable is missing. Although docters and gini_index might be useful for the model, this non-informative missingness constitutes a large portion of the overall data (more than 50%) for doctors and gini_index. Therefore, we delete this variable and perform imputation on the rest.
2.2. Handling missing values through imputation methods
Use Bagged Trees (Random Forest) for Higher Missingness: (more than or equal to 10%)
Random forest imputation is robust and can handle higher levels of missing data better than KNN (Hu & Szymczak, 2022). It considers the relationships between variables and leverages the ensemble method to make better predictions (Hu & Szymczak, 2022).
Use K-Nearest Neighbor for Lower Missingness: (less than 10%)
KNN is effective for lower levels of missing data and can be quicker to implement. It uses the closest neighbors to impute the missing values, which works well when missingness is lower (Keith, 2021)..
# delete gini index and doctorsdeleted_df =select(cleaned_df,-c(gini_index, doctors))%>%# convert character columns to factors or vice versa as neededmutate_if(is.character, as.factor)# variable name with low missingnesslow_missingness =colnames(select(deleted_df,-c(`unemployment_(%)`, hospital_beds, adult_mortality)))# Create a recipe for data preprocessing with imputationrec0 <-recipe(~ ., data = deleted_df) %>%step_impute_bag(c(`unemployment_(%)`, hospital_beds, adult_mortality)) %>%step_impute_knn(c(low_missingness), neighbors =3) # Prepare the reciperec_prep0 <-prep(rec0, training = deleted_df)# Apply the recipe to the data (bake the recipe)final_data <-bake(rec_prep0, new_data =deleted_df )# Check the resultvis_miss(final_data)
Perfect!
2.3 Full dataset before input selection and normalization
Full dataset summary statistics aggregated by income levels
Variable
Low-income, N = 24
Lower-middle-income, N = 50
Upper-middle-income, N = 60
High-income, N = 59
health_expenditure
107 (72, 152)
213 (146, 412)
907 (657, 1,306)
3,324 (2,200, 5,314)
life_expectancy
62 (60, 64)
68 (63, 71)
74 (72, 77)
81 (78, 83)
unemployment_(%)
3.9 (3.0, 7.3)
5.1 (3.4, 9.3)
5.9 (4.4, 11.7)
5.0 (3.6, 6.4)
gdp_percap
698 (540, 810)
2,145 (1,454, 3,049)
6,697 (5,172, 9,158)
32,279 (19,138, 51,004)
gni_percap
695 (520, 823)
2,130 (1,525, 3,158)
6,635 (5,260, 8,975)
34,660 (18,630, 50,470)
disease_deathrate
291 (212, 407)
134 (61, 250)
44 (27, 75)
36 (25, 48)
diseases_death
46,205 (23,463, 90,374)
16,992 (2,982, 55,250)
2,049 (467, 8,696)
1,720 (222, 6,922)
hospital_beds
9 (8, 20)
11 (9, 18)
22 (13, 40)
31 (25, 51)
child_mortality
52 (41, 70)
26 (15, 38)
11 (7, 18)
4 (3, 6)
adult_mortality
290 (266, 325)
231 (187, 282)
183 (141, 227)
83 (69, 139)
total_mortality
7.93 (6.57, 9.10)
6.67 (5.66, 7.92)
7.17 (5.58, 8.51)
7.94 (6.32, 9.98)
alcohol_percap
2.0 (0.4, 3.8)
3.3 (1.4, 5.0)
5.3 (2.7, 7.4)
9.4 (6.3, 10.9)
urban_pop
37 (24, 43)
42 (31, 57)
64 (54, 77)
82 (68, 90)
continent
NA
NA
NA
NA
Africa
21 (88%)
23 (46%)
9 (15%)
1 (1.7%)
Asia
3 (13%)
17 (34%)
12 (20%)
11 (19%)
Europe
0 (0%)
1 (2.0%)
12 (20%)
34 (58%)
North America
0 (0%)
3 (6.0%)
13 (22%)
7 (12%)
Oceania
0 (0%)
5 (10%)
5 (8.3%)
4 (6.8%)
South America
0 (0%)
1 (2.0%)
9 (15%)
2 (3.4%)
show code
# Print the kable(describe(final_data),caption ="*Full dataset summary statistics for each variable*",digits =4)
Full dataset summary statistics for each variable
vars
n
mean
sd
median
trimmed
mad
min
max
range
skew
kurtosis
se
country*
1
193
97.0000
55.8585
97.0000
97.0000
71.1648
1.0000
193.0000
192.0000
0.0000
-1.2187
4.0208
health_expenditure
2
193
1573.7975
1897.9076
791.7386
1193.4901
957.4787
39.1323
10661.0284
10621.8961
1.8082
3.2498
136.6144
code*
3
193
97.0000
55.8585
97.0000
97.0000
71.1648
1.0000
193.0000
192.0000
0.0000
-1.2187
4.0208
life_expectancy
4
193
72.4264
7.6160
73.4000
72.7697
8.5991
52.9000
86.5000
33.6000
-0.3792
-0.6855
0.5482
unemployment_(%)
5
193
6.8661
5.1275
5.0200
6.0661
2.5797
0.1000
26.3070
26.2070
1.4854
1.8643
0.3691
gdp_percap
6
193
15889.1266
25892.1206
6070.3881
10373.6659
6901.5154
216.9730
199382.8386
199165.8656
3.6552
18.3176
1863.7556
gni_percap
7
193
14490.6563
19284.8572
5940.0000
10385.2903
6775.4820
230.0000
83160.0000
82930.0000
1.7921
2.3469
1388.1544
disease_deathrate
8
193
116.6439
135.6945
57.2600
89.4985
50.7642
4.1600
820.5100
816.3500
2.0516
4.8176
9.7675
diseases_death
9
193
40634.7789
150938.8479
5385.0000
15385.2344
7807.3716
6.0000
1841182.0000
1841176.0000
9.5761
105.6194
10864.8160
hospital_beds
10
193
27.5813
20.8002
21.5000
24.7495
17.4947
1.7000
128.8000
127.1000
1.3482
2.3763
1.4972
child_mortality
11
193
20.5006
21.2119
12.5398
16.5013
12.7742
0.8317
98.4470
97.6153
1.7055
2.6173
1.5269
adult_mortality
12
193
191.9923
96.5979
185.8290
184.8435
114.1291
46.4220
516.4180
469.9960
0.5762
-0.0935
6.9533
total_mortality
13
193
7.6299
2.6311
7.2900
7.4650
2.2387
0.9910
15.5000
14.5090
0.5186
0.4896
0.1894
alcohol_percap
14
193
5.4208
3.8830
4.9250
5.2308
4.6465
0.0051
16.9896
16.9846
0.3819
-0.8859
0.2795
urban_pop
15
193
59.2395
22.9545
59.0390
59.6902
27.2472
13.2500
100.0000
86.7500
-0.1482
-0.9836
1.6523
continent*
16
193
2.6684
1.4874
2.0000
2.5097
1.4826
1.0000
6.0000
5.0000
0.6513
-0.4678
0.1071
3. Correlation analysis with target variable
3.1 Full data correlation matrix
show code
# store a numeric only dataframe for our linear regressionnumeric_df =select(final_data, where(is.numeric)) # form a correlation matrixcor_matrix = numeric_df %>%cor() # identify the variable with high correlation cor_matrix %>%corrplot(method ="number",order ="hclust")
show code
# identify the correlated variables, setting a threshold of 0.3 correlation. (Moderate to strong correlation)correlated_variables =as.data.frame(cor_matrix) %>%select(life_expectancy) %>%filter(life_expectancy >0.3| life_expectancy <-0.3 )%>%filter(life_expectancy !=1) # these variables with the most correlation kable(correlated_variables %>%rownames_to_column() %>%arrange(desc(life_expectancy)),caption ="*Variables with moderate to strong Correlation with life expectancy*",label="*Variable*")
Variables with moderate to strong Correlation with life expectancy
rowname
life_expectancy
health_expenditure
0.7162116
gni_percap
0.6961471
gdp_percap
0.6107644
urban_pop
0.5798739
alcohol_percap
0.4023439
hospital_beds
0.3246269
disease_deathrate
-0.8006771
child_mortality
-0.8412293
adult_mortality
-0.9340389
show code
# Melt the data framedf_melt0 <-melt(numeric_df%>%select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars ="life_expectancy")# Create scatterplotsscatterplots0 <-ggplot(df_melt0, aes(x = value, y = life_expectancy)) +geom_point(alpha =0.2) +facet_wrap(~ variable, scales ="free") +labs(title ="Scatterplot Matrix Against Life Expectancy")+geom_smooth(size =0.3, color ="red",linetype =2, se =FALSE)+theme_bw()print(scatterplots0)
show code
distributions0 <-ggplot(df_melt0, aes(x = value)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.3) +facet_wrap(~ variable, scales ="free") +labs(title ="Distributions Against Life Expectancy")+theme_bw()distributions0
Selected dataset summary statistics aggregated by income levels
Variable
Low-income, N = 24
Lower-middle-income, N = 50
Upper-middle-income, N = 60
High-income, N = 59
health_expenditure
107 (72, 152)
213 (146, 412)
907 (657, 1,306)
3,324 (2,200, 5,314)
gdp_percap
698 (540, 810)
2,145 (1,454, 3,049)
6,697 (5,172, 9,158)
32,279 (19,138, 51,004)
gni_percap
695 (520, 823)
2,130 (1,525, 3,158)
6,635 (5,260, 8,975)
34,660 (18,630, 50,470)
disease_deathrate
291 (212, 407)
134 (61, 250)
44 (27, 75)
36 (25, 48)
hospital_beds
9 (8, 20)
11 (9, 18)
22 (13, 40)
31 (25, 51)
child_mortality
52 (41, 70)
26 (15, 38)
11 (7, 18)
4 (3, 6)
adult_mortality
290 (266, 325)
231 (187, 282)
183 (141, 227)
83 (69, 139)
alcohol_percap
2.0 (0.4, 3.8)
3.3 (1.4, 5.0)
5.3 (2.7, 7.4)
9.4 (6.3, 10.9)
urban_pop
37 (24, 43)
42 (31, 57)
64 (54, 77)
82 (68, 90)
life_expectancy
62 (60, 64)
68 (63, 71)
74 (72, 77)
81 (78, 83)
show code
kable(describe(correlated_inputs),caption ="*Selected dataset summary statistics for each variable*",digits =4)
Selected dataset summary statistics for each variable
vars
n
mean
sd
median
trimmed
mad
min
max
range
skew
kurtosis
se
health_expenditure
1
193
1573.7975
1897.9076
791.7386
1193.4901
957.4787
39.1323
10661.0284
10621.8961
1.8082
3.2498
136.6144
gdp_percap
2
193
15889.1266
25892.1206
6070.3881
10373.6659
6901.5154
216.9730
199382.8386
199165.8656
3.6552
18.3176
1863.7556
gni_percap
3
193
14490.6563
19284.8572
5940.0000
10385.2903
6775.4820
230.0000
83160.0000
82930.0000
1.7921
2.3469
1388.1544
disease_deathrate
4
193
116.6439
135.6945
57.2600
89.4985
50.7642
4.1600
820.5100
816.3500
2.0516
4.8176
9.7675
hospital_beds
5
193
27.5813
20.8002
21.5000
24.7495
17.4947
1.7000
128.8000
127.1000
1.3482
2.3763
1.4972
child_mortality
6
193
20.5006
21.2119
12.5398
16.5013
12.7742
0.8317
98.4470
97.6153
1.7055
2.6173
1.5269
adult_mortality
7
193
191.9923
96.5979
185.8290
184.8435
114.1291
46.4220
516.4180
469.9960
0.5762
-0.0935
6.9533
alcohol_percap
8
193
5.4208
3.8830
4.9250
5.2308
4.6465
0.0051
16.9896
16.9846
0.3819
-0.8859
0.2795
urban_pop
9
193
59.2395
22.9545
59.0390
59.6902
27.2472
13.2500
100.0000
86.7500
-0.1482
-0.9836
1.6523
life_expectancy
10
193
72.4264
7.6160
73.4000
72.7697
8.5991
52.9000
86.5000
33.6000
-0.3792
-0.6855
0.5482
show code
sources =as.data.frame(t(correlated_inputs))[2] %>%rownames_to_column()sources$V2 <-c("WorldBank","WorldBank","WorldBank","Our World In Data", "World Health Organization (WHO)","UNICEF", "WorldBank", "WorldBank", "WorldBank", "WorldBank")colnames(sources) <-c("variable", "source")kable(sources, caption ="*Each selected variables sources*")
Each selected variables sources
variable
source
health_expenditure
WorldBank
gdp_percap
WorldBank
gni_percap
WorldBank
disease_deathrate
Our World In Data
hospital_beds
World Health Organization (WHO)
child_mortality
UNICEF
adult_mortality
WorldBank
alcohol_percap
WorldBank
urban_pop
WorldBank
life_expectancy
WorldBank
How would this variable be useful?
All these 9 selected variables have moderate to strong correlation (threshold of 0.3 correlation). Hence, these variables have a noticable relationship with life expectancy. Each of these features provide valuable information for the predictive model to run. Analyzing these variables collectively can help develop predictive models for estimating life expectancy accurately. However, to fit the model, further data preprocessing will be used. By bringing features to a similair scale, the model is able to eliminate bias from outliers and produce more accurate prediction.
5. Data Normalization of selected inputs
show code
# Create a recipe for data preprocessing with imputationrec1 <-recipe(~ ., data = final_data) %>%step_YeoJohnson(colnames(t(correlated_variables))) # Normalize numeric predictors# Prepare the reciperec_prep1 <-prep(rec1, training = final_data)# Apply the recipe to the data (bake the recipe)normalized_final_data <-bake(rec_prep1, new_data =final_data )df_melt1 <-melt(select(normalized_final_data, where(is.numeric))%>%select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars ="life_expectancy")distributions1 <-ggplot(df_melt1, aes(x = value)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.3) +facet_wrap(~ variable, scales ="free") +labs(title ="Distributions Against Life Expectancy")+theme_bw()distributions1
show code
scatterplots1 <-ggplot(df_melt1, aes(x = value, y = life_expectancy)) +geom_point(alpha =0.2) +facet_wrap(~ variable, scales ="free") +labs(title ="Scatterplot Matrix Against Life Expectancy")+geom_smooth(size =0.3, color ="red",linetype =2, se =FALSE)+theme_bw()print(scatterplots1)
6. Data partitioning
show code
# our selected dataselected_inputs =select(normalized_final_data, c("life_expectancy",colnames(as.data.frame(t(correlated_variables)))))set.seed(000)# split data to training and testingdf_split =initial_split(selected_inputs, prop =0.7, strata = life_expectancy)# input train and test data df_train =training(df_split)df_test =testing(df_split)
Multiple Linear Regression - Ordinary Least Squared
\(y=β0+β1x1+β2x2+...+ϵ\)
1. Training Multiple Linear Regression
1.1. Train and evaluate initial regression model
show code
# form the formulafirst_fmla =as.formula("life_expectancy ~ health_expenditure + gdp_percap+ gni_percap + disease_deathrate + child_mortality + adult_mortality + urban_pop + alcohol_percap + hospital_beds")# formula being usedlinear_model0 =lm(first_fmla,# dataset being useddata = df_train) # recieve resultsdf_results0 =cbind(df_train,as.data.frame(predict(linear_model0,interval ="prediction")))# model summarylinear0_summary =summary(linear_model0)modelsummary_output <-modelsummary(list(linear_model0), output ="gt",stars =TRUE,statistic =c("t = {statistic}", "se = {std.error}", 'p.value'),shape =)# Convert the gt to kableExtrakable(modelsummary_output %>%as.data.frame(),caption ="*Initial training model summary*",full_width =FALSE, position ="center")
Initial training model summary
(1)
(Intercept)
94.831***
t = 27.843
se = 3.406
(<0.001)
health_expenditure
0.427
t = 1.404
se = 0.304
(0.163)
gdp_percap
3.718*
t = 2.560
se = 1.452
(0.012)
gni_percap
-2.779*
t = -2.098
se = 1.325
(0.038)
disease_deathrate
-1.373*
t = -2.367
se = 0.580
(0.020)
child_mortality
-3.209***
t = -6.382
se = 0.503
(<0.001)
adult_mortality
-1.131***
t = -13.562
se = 0.083
(<0.001)
urban_pop
-0.008
t = -0.834
se = 0.010
(0.406)
alcohol_percap
0.216
t = 1.525
se = 0.142
(0.130)
hospital_beds
-0.643**
t = -2.987
se = 0.215
(0.003)
Num.Obs.
133
R2
0.955
R2 Adj.
0.951
AIC
536.1
BIC
567.9
Log.Lik.
-257.047
F
286.747
RMSE
1.67
1.2. Checking for Multicollinearity: Variance of Influence (VIF)
show code
# form a VIF plot VIF_linear_0 =as.data.frame(vif(linear_model0)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model0)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) VIF_linear_0
The VIF plot suggest gni_percap and gdp_percap variable is correlated. Therefore we delete the variable wit the highest VIF values, gni_percap. Additionally, we delete child_mortality as well, this is likely to be correlated with adult_mortality. Removing multicollinearity may help improve model interpretability and enhance statistical significance.
show code
# form the formulasecond_fmla =as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds + health_expenditure")# formula being usedlinear_model1 =lm(second_fmla,# dataset being useddata = df_train) # recieve resultsdf_results1 =cbind(df_train,as.data.frame(predict(linear_model1,interval ="prediction")))# recieve residualsdf_results1$residuals = linear_model1$residuals# model summarylinear1_summary =summary(linear_model1) VIF_linear_1 =as.data.frame(vif(linear_model1)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model1)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) (VIF_linear_0 +ggtitle(label ="Before removing gni_percap and child_mortality") ) /(VIF_linear_1 +ggtitle(label ="After removing gni_percap and child_mortality ") )
Health_expenditure is still correlated with gdp_percap, therefore we ultimately remove this.
show code
third_fmla =as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds")# formula being usedlinear_model2 =lm(third_fmla,# dataset being useddata = df_train) # recieve resultsdf_results2 =cbind(df_train,as.data.frame(predict(linear_model2,interval ="prediction")))# recieve residualsdf_results2$residuals = linear_model2$residuals# model summarylinear2_summary =summary(linear_model2) VIF_linear_2 =as.data.frame(vif(linear_model2)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model2)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) (VIF_linear_1 +ggtitle(label ="Before removing health_expenditure") ) /(VIF_linear_2 +ggtitle(label ="After removing health_expenditure") )
1.3. Training model evaluation after adjusting for multicollinearity
We’ve eventually arrive to this initial final model after adjusting for multicollinearity.
1.4. Prediction vs actual plot
show code
# plot the predictionr_squared_linear1 =ggplot(df_results2, aes(x = fit, life_expectancy)) +geom_point(alpha =0.4) +geom_abline(color ='blue', linetype =2) +coord_obs_pred()+ggtitle("Predicted vs Actual life expectancy", label ="Training Model using Linear Regression") +theme_bw() +xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =80, y =50,label =paste0("R-squared:", round(linear2_summary$r.squared,2)))r_squared_linear1
1.5. Residuals diagnostics
show code
# form a residual histogram plotresidual_linear1 =ggplot(df_results2, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(df_results2$residuals), linetype =2, color ="blue") +geom_vline(xintercept =median(df_results2$residuals), linetype =2, color ="darkgreen") +annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(df_results2$residuals),2)), color ="blue" ) +annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(df_results2$residuals),2)), color ="darkgreen" ) +ylim(0, 0.21) +theme_bw() +ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit1 =ggplot(df_results2, aes(x = fit, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw() +ggtitle("Residual plot ")# qq plotqq_plot1 =ggplot(df_results2, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot ", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw() residual_linear1 + ( qq_plot1/ residual_fit1 )
Random Distribution: The residuals appear to be randomly scattered, indicating that the model fits the data well and that there are no obvious patterns or trends.
Homoscedasticity: The spread of residuals seems consistent across the range of fitted values, suggesting that the variance of the residuals is constant (homoscedasticity). No heteroskedasticity.
No Systematic Patterns: There are no clear curves or systematic patterns, implying that the model captures the relationship between predictors and the response variable adequately.
Outliers and normality: Good amount of residuals fall relatively far from zero but has a roughly symmetric distribution with a longer tails to the left, indicating potential outliers that may need further investigation.
Center and Spread: Mean of residuals is approximately 0 and the median is 0.26, Since the residuals are centered around 0, it suggest no significant bias in the model.
2. K-fold Cross-validation
2.1. Cross-validation performance evaluation
show code
# Specify a modellm_model0 <-linear_reg() %>%set_engine("lm")# Create a recipe for data preprocessingrec <-recipe(third_fmla, data = df_train) # Create a workflowwf <-workflow() %>%add_recipe(rec) %>%add_model(lm_model0)# Create 10-fold cross-validation resamplesset.seed(123) # For reproducibilityfolds <-vfold_cv(df_train, v =10)# Perform cross-validationresults <- wf %>%fit_resamples(resamples = folds,metrics =metric_set(rmse, rsq),control =control_resamples(save_pred =TRUE) )# Collect and summarize metricsmetrics_cv = results %>%collect_metrics()# Visualize results lm_cv_predictions = results %>%collect_predictions()# Function to fit model and extract AIC/BICextract_aic_bic <-function(split) { train_data <-analysis(split) mod_fit <-lm(life_expectancy ~ ., data = train_data) aic <-AIC(mod_fit) bic <-BIC(mod_fit)tibble(id = split$id,AIC = aic,BIC = bic )}aic_bic_results <- folds %>%mutate(results =map(splits, extract_aic_bic)) %>%select(id, results) %>%unnest(cols = results, names_repair ="unique")# Summarize the resultsaic_bic_summary <- aic_bic_results %>%summarise(mean_AIC =mean(AIC),mean_BIC =mean(BIC) )# collect cv metricslm_cv_metrics = metrics_cv %>%transmute(metric = .metric, lm_crossvalidate = mean,) # calculate residuals lm_cv_residuals = lm_cv_predictions$.pred - lm_cv_predictions$life_expectancy# Calculate MAEmae_cv <-mean(abs(lm_cv_residuals))# Calculate MSEmse_cv <-mean((lm_cv_residuals)^2)lm_cv_metrics = lm_cv_metrics %>%full_join(data.frame(metric =c("mae", "mse", "aic", "bic"),lm_crossvalidate =c(mae_cv, mse_cv, aic_bic_summary$mean_AIC, aic_bic_summary$mean_BIC)), by =c("lm_crossvalidate", "metric"))# combine all metrics in linear modeltrain_metrics_lm =full_join(lm_train_metrics, lm_cv_metrics,by ="metric")kable(train_metrics_lm,caption ="*Train and Cross-validation performance metrics*")
Train and Cross-validation performance metrics
metric
lm_train
lm_crossvalidate
rmse
1.974277
2.0193737
rsq
0.936536
0.9450802
mae
1.469630
1.5712364
mse
3.897769
4.5356599
aic
574.371421
483.0233169
bic
597.494214
513.6581109
2.2. Prediction vs Actual Plot
show code
lm_cv_predictions%>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="K-Fold Cross-validation resamples for Linear Regression") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =80, y =50,label =paste0("R-squared:", round(metrics_cv$mean[2],2)))
We’ve reached our optimum OLS linear model, the R-squared seems to reach 0.95 even with cross-validation testing!
2.3. Residual Diagnostic
show code
df_cv_resid =data.frame("residuals"= lm_cv_residuals,fit = df_train$life_expectancy)# form a residual histogram plotresidual_linear_cv =ggplot(df_cv_resid, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(df_cv_resid$residuals), linetype =2, color ="blue") +geom_vline(xintercept =median(df_cv_resid$residuals), linetype =2, color ="darkgreen") +annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(df_cv_resid$residuals),2)), color ="blue" ) +annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(df_cv_resid$residuals),2)), color ="darkgreen" ) +ylim(0, 0.21) +theme_bw() +ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit_cv =ggplot(df_cv_resid, aes(x = fit, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw() +ggtitle("Residual plot ")# qq plotqq_plot_cv =ggplot(df_cv_resid, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot ", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw() residual_linear_cv + ( qq_plot_cv/ residual_fit_cv )
3. Prediction and Evaluation
3.1. Testing performance evaluation (
show code
# Predict on the test datasetmodel_fit0 <- wf %>%fit(data = df_train)test_predictions0 <- model_fit0 %>%predict(new_data = df_test) %>%bind_cols(df_test)# get residuals test_predictions0 <- test_predictions0 %>%mutate(residuals = life_expectancy - .pred)metrics0 <- test_predictions0 %>%metrics(truth = life_expectancy, estimate = .pred)# collect testing metrics# get msemse_testing <-data.frame(metric ="mse",lm_testing =mean((test_predictions0$residuals)^2))# make test metrics lm_test_metrics = metrics0 %>%transmute(metric = .metric, lm_testing = .estimate) %>%full_join(mse_testing)# combine all metrics in linear modeljoined_metrics_lm =full_join(lm_train_metrics, lm_cv_metrics,by ="metric") %>%inner_join(lm_test_metrics)# Summary statistics of the test datasetkable(model_fit0 %>%extract_fit_parsnip() %>%tidy(),caption ="*Testing model summary*")
Testing model summary
term
estimate
std.error
statistic
p.value
(Intercept)
87.4834234
3.5986188
24.310278
0.0000000
gdp_percap
1.8201193
0.3279959
5.549214
0.0000002
disease_deathrate
-2.7424293
0.6261265
-4.379993
0.0000247
adult_mortality
-1.3511587
0.0900610
-15.002699
0.0000000
urban_pop
-0.0151245
0.0109225
-1.384708
0.1685889
alcohol_percap
0.5243082
0.1540085
3.404412
0.0008898
hospital_beds
-0.2696177
0.2418432
-1.114845
0.2670386
The testing model summary highlights the relationships between key predictors and life expectancy. GDP per capita positively influences life expectancy (estimate = 1.795, p < 0.001), indicating economic stability enhances health outcomes. Disease death rate (estimate = -2.755, p < 0.001) and adult mortality (estimate = -1.344, p < 0.001) negatively affect life expectancy, underscoring the adverse impact of health burdens. Urban population and hospital beds were not significant predictors (p > 0.1), while alcohol consumption per capita had a positive but moderate effect (estimate = 0.526, p < 0.001). These findings align with the hypothesis that healthcare accessibility and economic factors significantly influence life expectancy.
3.2. Overall OLS regression’s performance metrics
show code
kable(joined_metrics_lm,caption ="*Multiple Linear Regression's perfomance metrics*")
Multiple Linear Regression’s perfomance metrics
metric
lm_train
lm_crossvalidate
lm_testing
rmse
1.974277
2.0193737
2.1707938
rsq
0.936536
0.9450802
0.9060414
mae
1.469630
1.5712364
1.6833470
mse
3.897769
4.5356599
4.7123459
3.3. Prediction vs actual plot
show code
test_predictions0 %>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Linear Regression Testing set") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =78, y =50,label =paste0("R-squared:", round(metrics0$.estimate[2],2)))
It performed really well!
3.4. Residual Diagnostic of testing data
show code
# form a residual histogram plothistogram_linear_test =ggplot(test_predictions0, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(test_predictions0$residuals), linetype =2, color ="blue")+geom_vline(xintercept =median(test_predictions0$residuals), linetype =2, color ="darkgreen")+annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(test_predictions0$residuals),2)), color ="blue" )+annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(test_predictions0$residuals),2)), color ="darkgreen" )+ylim(0, 0.21)+theme_bw()+ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit_test_linear =ggplot(test_predictions0, aes(x = life_expectancy, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw()+labs(title ="fit vs residuals")qq_linear_test =ggplot(test_predictions0, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw()# plot histogram_linear_test + (qq_linear_test/residual_fit_test_linear)
Regularized Regression : L2 and Elastic Net
Since there is severe multicollinearity in the original selected input, we had to delete 3 variables from the selected input even though we still believe those inputs are still useful. Hence, we could’ve capture important information that can better predict life_expectancy. Therefore, we decided to use Ridge Regression because we wanted to keep all selected variables. We will assign them different weightage and penalty.
We wanted to find the optimal values of lambda. We do this by using 10 fold cross validation
1. Further preprocessing: Standardization
show code
# standardize the train variableblueprint =recipe(life_expectancy ~ ., data = df_train)%>%step_center(all_numeric(), -all_outcomes()) %>%step_scale(all_numeric(), -all_outcomes())prepare <-prep(blueprint, training = df_train)ridge_train <-bake(prepare, new_data = df_train)# standardize the test variableblueprint_test =recipe(life_expectancy ~ ., data = df_test)%>%step_center(all_numeric(), -all_outcomes()) %>%step_scale(all_numeric(), -all_outcomes())prepare_test <-prep(blueprint, training = df_test)ridge_test <-bake(prepare, new_data = df_test)X_test <-model.matrix(life_expectancy ~ ., ridge_test)[, -1]# transform y with log transformationY_test <- ridge_test$life_expectancyX <-model.matrix(life_expectancy ~ ., ridge_train)[, -1]# transform y with log transformationY <- ridge_train$life_expectancyX %>%summary()
health_expenditure gdp_percap gni_percap disease_deathrate
Min. :-2.040838 Min. :-1.91869 Min. :-1.835669 Min. :-3.33174
1st Qu.:-0.845484 1st Qu.:-0.71710 1st Qu.:-0.753251 1st Qu.:-0.71886
Median :-0.001998 Median : 0.01193 Median : 0.006561 Median :-0.07494
Mean : 0.000000 Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
3rd Qu.: 0.750230 3rd Qu.: 0.79157 3rd Qu.: 0.768977 3rd Qu.: 0.72109
Max. : 2.120375 Max. : 2.33050 Max. : 1.855852 Max. : 2.09019
hospital_beds child_mortality adult_mortality alcohol_percap
Min. :-2.59460 Min. :-2.302658 Min. :-1.91589 Min. :-1.82613
1st Qu.:-0.82147 1st Qu.:-0.796102 1st Qu.:-0.83967 1st Qu.:-0.84449
Median : 0.05711 Median :-0.004018 Median : 0.09513 Median : 0.04171
Mean : 0.00000 Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.77446 3rd Qu.: 0.837161 3rd Qu.: 0.70380 3rd Qu.: 0.84159
Max. : 2.40967 Max. : 1.900753 Max. : 2.35828 Max. : 1.82986
urban_pop
Min. :-1.96012
1st Qu.:-0.77573
Median : 0.04723
Mean : 0.00000
3rd Qu.: 0.81685
Max. : 1.74379
2. Ridge Regression Training and Evaluation
2.1 Regularization path plot
show code
ridge <-glmnet(x = X,y = Y,alpha =0) # a = ridge plot(ridge, main ="Ridge penalty\n")
This plot shows how ridge regression coefficients shrink as the regularization parameter λ increases (moving right). Higher λ values lead to smaller coefficients, indicating stronger regularization, which reduces overfitting by limiting the model’s sensitivity to individual features. Unlike LASSO, coefficients remain non-zero, meaning all features contribute to the model. This helps balance model complexity and prediction accuracy, making it robust, especially when handling multicollinearity or many features relative to observations.
2.2. Cross-validation to search optimal lambda
show code
# cross validation ridge regressionridge_cv <-cv.glmnet(x = X,y = Y,alpha =0,type.measure ="mse")plot(ridge, main ="Ridge penalty\n")abline(v = ridge_cv$lambda.min, col ="red", lty ="dashed")abline(v = ridge_cv$lambda.1se, col ="blue", lty ="dashed")
2.4. Cross-validation plot
show code
# we will be using MSE to compare models plot(ridge_cv, main ="Ridge penalty\n\n")
2.5. Cross-validation results
show code
ridge_cv
Call: cv.glmnet(x = X, y = Y, type.measure = "mse", alpha = 0)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.7358 100 3.658 0.3665 9
1se 1.2859 94 3.967 0.4291 9
The 10-fold cross-validation Mean Squared Error (MSE) for a ridge model is depicted in a plot. The first vertical dotted line indicates the value of λ that corresponds to the smallest MSE, while the second vertical line represents the λ value that is within one standard error of the minimum MSE.
Minimum Lambda Found: 0.7358318
The crossvalidation gives us this optimized lambda values that allow us to yield the lowest Mean Squared Error, now we will use this value to predict using the training dataset.
2.3. Uses the minimum lambda to predict training set
show code
# using this crossvalidationridge_cv_result = ridge_cv %>%predict(newx = X) %>%rowMeans()%>%cbind(Y)%>%as.data.frame() ridge_cv_residuals = (Y - ridge_cv %>%predict(newx = X))SST <-sum((Y -mean(Y))^2) # Total Sum of SquaresSSR <-sum(ridge_cv_residuals^2) # Residual Sum of SquaresR_squared_ridgecv_train <-1- (SSR / SST)ridge_cv_result %>%ggplot(aes(x = . , y = Y)) +geom_point(alpha =0.5)+geom_abline(linetype =2, color ="blue") +coord_obs_pred() +theme_bw() +annotate("label", x =80, y =50,label =paste0("R-squared:", round(R_squared_ridgecv_train,2)))+xlab ("Predicted life expectancy") +ylab ("Actual life expectancy")
This plot visualizes which parameter yields the lowest RMSE in each of the 10 folds. Notice how the lambda has changes a bit, but allow us to use alpha of 0.2.
Alpha of 0.1 with lambda of 0.0078 gives the lowest rmse.
3.2. Variable Importance plot
show code
vip(cv_glmnet, bar =FALSE) +theme_bw()
3.2 Final evaluation of ridge and elastic net models
show code
# predict life_expectancy on training dataridge_grid_result <-data.frame(pred =predict(cv_glmnet, X),fit = df_train$life_expectancy)ridge_grid_residuals = ridge_grid_result$pred - ridge_grid_result$fit# compute RMSE of transformed predictedrmse_grid =RMSE(ridge_grid_result$pred, ridge_grid_result$fit)mae_grid =MAE(ridge_grid_result$pred, ridge_grid_result$fit)mse_grid =mean((ridge_grid_residuals)^2)SST <-sum((Y -mean(Y))^2) # Total Sum of SquaresSSR_grid <-sum(ridge_grid_residuals^2) # Residual Sum of SquaresR_squared_gridsq_train <-1- (SSR_grid / SST)ridge_cv_grid_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), elastic_train =c(rmse_grid, R_squared_gridsq_train, mae_grid ,mse_grid))# calculate MAEmae_cv_ridge <-mean(abs(ridge_cv_residuals))# calculate MSEmse_cv_ridge <-mean((ridge_cv_residuals)^2)# calculate MSErmse_cv_ridge <-sqrt(mse_cv_ridge)# calculate MAEmae_cv_ridge <-mean(abs(ridge_cv_residuals))# calculate MSEmse_cv_ridge <-mean((ridge_cv_residuals)^2)# calculate MSErmse_cv_ridge <-sqrt(mse_cv_ridge)ridge_cv_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), ridge_train =c(rmse_cv_ridge, R_squared_ridgecv_train, mae_cv_ridge ,mse_cv_ridge))train_regular_metrics = ridge_cv_metrics %>%full_join(ridge_cv_grid_metrics, by ="metric")kable(train_regular_metrics,caption ="*Ridge and Elastic net regressions training performance*")
Ridge and Elastic net regressions training performance
metric
ridge_train
elastic_train
rmse
1.8708484
1.6795753
rsq
0.9430113
0.9540685
mae
1.4392721
1.2902426
mse
3.5000736
2.8209732
4. Prediction and evaluation
4.1. Ridge regression and Elastic results
show code
ridge_predict =predict(ridge_cv, s=ridge_cv$lambda.1se, newx = X_test, lambda = ridge_cv$lambda.min )ridge_result =cbind(ridge_predict, Y_test) %>%as.data.frame()SST_r_test <-sum((Y_test -mean(Y_test))^2) # Total Sum of SquaresSSR_r_test<-sum((Y_test - ridge_predict)^2) # Residual Sum of SquaresR_squared_r_test <-1- ( SSR_r_test/SST_r_test)ridge_result0 = ridge_result %>%transmute(life_expectancy = Y_test,.pred = s1,residuals = Y_test - s1)ridge_plot = ridge_result0 %>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Ridge Regression") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy")+annotate("label", x =76, y =50,label =paste0("R-squared:", round(R_squared_r_test,3))) elastic_predict <-data.frame(pred =predict(cv_glmnet, newdata = X_test))elastic_result =cbind(elastic_predict, Y_test) %>%as.data.frame()SST_net_test <-sum((Y_test -mean(Y_test))^2) # Total Sum of SquaresSSR_net_test<-sum((Y_test - elastic_predict)^2) # Residual Sum of SquaresR_squared_net_test <-1- (SSR_net_test / SST_net_test)elastic_result0 = elastic_result %>%mutate(residuals = Y_test - pred)elastic_plot = elastic_result0 %>%ggplot(aes(x = pred, y = Y_test)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Elastic net") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy")+annotate("label", x =76, y =50,label =paste0("R-squared:", round(R_squared_net_test,3))) ridge_plot+elastic_plot
4.2 Performance Metrics of testing and training
show code
# Ridge------------ridge_residuals = ridge_result[1] - ridge_result[2]# compute RMSE of transformed predictedrmse_ridge =RMSE(ridge_result$s1, ridge_result$Y_test)mae_ridge=MAE(ridge_result$s1,ridge_result$Y_test)mse_ridge =mean((ridge_residuals$s1)^2)SSR_ridge <-sum(ridge_grid_residuals^2) # Residual Sum of Squares# calculate MSEridge_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), ridge_test =c(rmse_ridge, R_squared_r_test, mae_ridge ,mse_ridge))# Elastic------------elastic_residuals = elastic_result$pred - elastic_result$Y_test# compute RMSE of transformed predictedrmse_elastic =RMSE(elastic_result$pred , elastic_result$Y_test)mae_elastic =MAE(elastic_result$pred ,elastic_result$Y_test)mse_elastic =mean((elastic_residuals)^2)SSR_elastic <-sum(elastic_residuals^2) # Residual Sum of Squares# calculate MSEelastic_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), elastic_test =c(rmse_elastic, R_squared_net_test, mae_elastic ,mse_elastic))join_test_r = ridge_metrics %>%full_join(elastic_metrics, by ='metric') joined_metrics_r = train_regular_metrics %>%full_join(join_test_r, by ='metric')kable(joined_metrics_r,caption ="*Ridge and Elastic net regression performance metrics*")
Ridge and Elastic net regression performance metrics
metric
ridge_train
elastic_train
ridge_test
elastic_test
rmse
1.8708484
1.6795753
1.9318764
1.8293091
rsq
0.9430113
0.9540685
0.9244032
0.9322173
mae
1.4392721
1.2902426
1.5130543
1.4343533
mse
3.5000736
2.8209732
3.7321464
3.3463717
4.3 Residual Diagnostic best model
show code
elastic_residuals_df =data.frame(residuals = elastic_residuals,life_expectancy = Y_test)# form a residual histogram plothistogram_test =ggplot(elastic_residuals_df, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(elastic_residuals_df$residuals), linetype =2, color ="blue")+geom_vline(xintercept =median(elastic_residuals_df$residuals), linetype =2, color ="darkgreen")+annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(elastic_residuals_df$residuals),2)), color ="blue" )+annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(elastic_residuals_df$residuals),2)), color ="darkgreen" )+ylim(0, 0.21)+theme_bw()+ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit =ggplot(elastic_residuals_df, aes(x = life_expectancy, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw()+labs(title ="fit vs residuals")qq =ggplot(elastic_residuals_df, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw()# plot histogram_test + (qq/residual_fit)
Random Forest Regression - Ensemble Learning
Ensemble-based method is used to seek the bias variance trade-off. We plan to improve predictive accuracy but unfortunately, reduces interpretability as random forest is a black-box model that is difficult to interpret as compare to linear or regularized regression.
1. Model Building
Random Forest models are inherently robust to the scale of input features, as they base splits on relative values, making normalization less crucial (Hu & Szymczak, 2022). However, normalization can still enhance performance by ensuring balanced splits and equal feature contribution, leading to slightly better accuracy. This improvement is minor compared to methods like ridge regression or linear OLS, where normalization is essential for stability and convergence of coefficients. Therefore, to calculate the best predictive model, we’ll still use the same preprocessed data.
show code
# Building the random forest modelset.seed(123)n_features =length(setdiff(names(df_test), "life_expectancy"))# for each of the trees rf_model_normalized <-randomForest( life_expectancy ~ .,data = df_train,importance =TRUE,mtry =floor(n_features/3), # regressionxtest =subset(df_test,select =-life_expectancy),ytest = df_test$life_expectancy,keep.forest =TRUE)# Print the model summaryprint(rf_model_normalized)
Call:
randomForest(formula = life_expectancy ~ ., data = df_train, importance = TRUE, mtry = floor(n_features/3), xtest = subset(df_test, select = -life_expectancy), ytest = df_test$life_expectancy, keep.forest = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 2.528765
% Var explained: 95.88
Test set MSE: 2.52
% Var explained: 94.95
# create an explainer for the modelexplainer_rf =explain(model = rf_model_normalized, data = df_train,y = df_train$life_expectancy, label ="Random Forest")
Preparation of a new explainer is initiated
-> model label : Random Forest
-> data : 133 rows 10 cols
-> data : tibble converted into a data.frame
-> target variable : 133 values
-> predict function : yhat.randomForest will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package randomForest , ver. 4.7.1.1 , task regression ( default )
-> predicted values : numerical, min = 55.02809 , mean = 72.31606 , max = 84.59589
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -2.44344 , mean = -0.009295188 , max = 1.904113
A new explainer has been created!
1.3. Partial Dependence plot
show code
# plot Partial Dependence for three selected featurepdp_rf_adult <-model_profile(explainer = explainer_rf, variables ="adult_mortality")pdp_rf_disease <-model_profile(explainer = explainer_rf, variables ="disease_deathrate")pdp_rf_health <-model_profile(explainer = explainer_rf, variables ="health_expenditure")pdp_rf_gdp <-model_profile(explainer = explainer_rf, variables ="gdp_percap")# plot the PDP put them in a 2x2 figure(plot(pdp_rf_adult) +ggtitle(NULL, NULL) +plot(pdp_rf_disease) +ggtitle(NULL, NULL))/ (plot(pdp_rf_health) +ggtitle(NULL, NULL)+plot(pdp_rf_gdp)+ggtitle(NULL, NULL))
Adult mortality has the greatest influence on the model’s predictions compared to other variables. As the adult mortality decreases, the predicted probability decrease exponentially, till the lowest life expectancy. The diseases death rate suggest simlair results, which make sense where more people are dying, the lower the estimate life expectancy is. In contrast, Health expenditure and GDP percap positively influence the predicted probability. Suggesting as countries invest more in health expenditure the lesser people will die, and the richer the country the higher the life expectancy.
1.4. Single tree instance example
show code
# Extract a single tree from the random forest modelsingle_tree <-getTree(rf_model_normalized, k =1, labelVar =TRUE)# Convert it to a decision tree modeltree_model <-rpart(first_fmla, data = df_train)# Plot the tree# Plot the decision treerpart.plot(tree_model, type =1, extra =101, fallen.leaves =TRUE, main ="Decision Tree for Life Expectancy")
2. Model Prediction and Evaluation
show code
# Predictions on training datatrain_preds <-predict(rf_model_normalized, newdata = df_train)# Predictions on testing datatest_preds <-predict(rf_model_normalized, newdata = df_test)# Calculate metrics for training datatrain_rmse <-RMSE(pred = train_preds, obs = df_train$life_expectancy)train_mae <-MAE(pred = train_preds, obs = df_train$life_expectancy)train_mse <-mean((train_preds - df_train$life_expectancy)^2)train_rsq <-R2(pred = train_preds, obs = df_train$life_expectancy)# Calculate metrics for testing datatest_rmse <-RMSE(pred = test_preds, obs = df_test$life_expectancy)test_mae <-MAE(pred = test_preds, obs = df_test$life_expectancy)test_mse <-mean((test_preds - df_test$life_expectancy)^2)test_rsq <-R2(pred = test_preds, obs = df_test$life_expectancy)# Combine metrics into a dataframerf_metrics <-data.frame(metric =c("rf_train", "rf_test"),rmse =c(train_rmse, test_rmse),rsq =c(train_rsq, test_rsq),mae =c(train_mae, test_mae),mse =c(train_mse, test_mse))# Print the metricsprint(rf_metrics)
Throughout the modeling workflow, careful consideration was given to achieving a balance between interpretability and predictive accuracy. The process began with initial model estimation using multiple linear regression to establish a baseline and identify key predictors. Cross-validation was employed to ensure the model’s generalizability, addressing the bias-variance tradeoff by evaluating performance across multiple folds. Due to multicollinearity, we had to remove 3 highly correlated variables. Thus, regularization techniques, including ridge regression and elastic net, were incorporated to keep the 3 variables and handle multicollinearity. The Random Forest model was ultimately chosen for its superior predictive performance, despite its complexity. For every model, steps were taken to validate model assumptions, such as checking for homoscedasticity and normality of residuals, ensuring robustness and reliability in the final model.
Final models results conclusion
1. Multiple Linear Regression
The multiple linear regression model establishes the baseline relationship between predictors and life expectancy. Significant predictors include GDP per capita, disease death rate, and adult mortality, indicating that economic status, disease prevalence, and mortality rates substantially impact life expectancy. The model’s performance is moderate, with some limitations due to multicollinearity among predictors. Adjustments and removal of highly collinear variables were necessary to enhance model reliability and interpretability.
2. Elastic Net Regression
Elastic Net regression combines the penalties of both ridge and lasso regression to address multicollinearity and variable selection. This model showed improved performance over the linear model, with lower RMSE and higher R² values, indicating better predictive accuracy. Significant predictors remain consistent, reaffirming the importance of economic and health-related factors on life expectancy. The model’s regularization helps in retaining valuable predictors while penalizing less significant ones, ensuring a balanced and interpretable model.
3. Random Forest regression
The Random Forest model, an ensemble learning method, provided the best performance among the models. It handled multicollinearity effectively and captured complex relationships in the data. Key predictors like GDP per capita, disease death rate, and adult mortality were identified as the most influential. The model’s high accuracy, reflected in the lowest RMSE and highest R² values, underscores its robustness. However, its black-box nature limits interpretability compared to linear models. Thus, partial dependence plot was visualized. Overall, Random Forest proved highly effective in predicting life expectancy, reinforcing the significance of economic stability and health factors.
Was the research hypothesis achieved?
Yes, the research hypothesis was achieved. Overall the model was able predict the estimated life expectancy with very high accuracy, indicating that the variables we used can deeply explain life expectancy. According to our results, there is sufficient evidence to prove that countries with higher levels of income accessibility to healthcare services do infact exhibit higher life expectancy and lower mortality rates. As countries get wealthier, they increase health expenditure and afford more hospital beds, increasing their likelihood of higher live expectancy compared to countries that are from poorer income levels.
Salient Findings
• Economic Wealth plays a role: Our Random Forest model corroborates existing research on the negative association between mortality rate, diseases and life expectancy. Our analysis suggest the importance of addressing the health disparity between high- and low-income countries.
• Multifaceted Influences on Health: While the model confirms GDP’s significance, it identifies disease death rate and adult mortality as key factors. Obviously disease and mortality rates are key factors, other variables suggests there are external influences that capture the multifaceted nature of life expectancy. If we remove the mortality rate and diseases death rate, and alllow more datasets and variables to be imported, it may give a more detailed insights about life expectancy.
Proposed Solutions
The analysis shows money impacts health. To tackle this, governments can invest more in public healthcare for wider access, create programs for low-income areas, and focus on preventative care. Wealthy nations can also aid developing countries. By using data like this to identify problem areas, governments can create policies that target specific needs and work towards a future where everyone has equal chances for good health.
Limitations
Balancing Interpretability and Accuracy: This analysis aimed to strike a balance between a model that’s easy to understand (interpretable) and one that predicts well (accurate). While linear regression offered clear cause-and-effect relationships, its accuracy was limited by multicollinearity (correlated variables). Random Forest, on the other hand, achieved highest accuracy but its complex nature makes it harder to pinpoint why certain factors influence life expectancy.
Data Availability and Generalizability: The analysis relied on available data, which might not capture all relevant factors affecting life expectancy. Socioeconomic factors beyond GDP, like education or access to clean water,could be missing. Additionally, the chosen models may not generalize perfectly to all countries due to cultural or environmental variations not included in the data.
Reference list
Source Code
---title: "Assessing the impact of socio economic status across different demographics to understand the disparities: Analyzing Global Life Expectancy"author: "Karim 32228643"format: html: code-fold: true code-tools: true code-summary: "show code" toc: true message: false warning: falseeditor: visual---# Introduction## Research QuestionThe conceptual research question guiding this analysis is: "How do a country's socio-economic factors, including healthcare access, income levels, and consumption patterns, influence life expectancy globally?" This question aims to explore the complex relationship between these socio-economic determinants and health outcomes, highlighting disparities across countries with different demographics.## Research HypothesisThe practical research hypothesis posited in this report is: "*Countries with higher levels of income accessibility to healthcare services exhibit higher life expectancy and lower mortality rates*" we believe the data we uses support the research hypothesis. The full dataset overview will be given below. For instance, Health expenditure, GDP per capita, and GNI per capita all increase significantly from low-income to high-income countries, indicating that wealthier nations invest more in healthcare infrastructure and services (Boniol et al., 2022). Disease death rates and both child and adult mortality rates decrease with rising income levels, reflecting the effectiveness of healthcare services in wealthier countries (Arias et al., 2022). The number of hospital beds and urban population percentages also increase, highlighting improved healthcare capacity and access in richer nations. Despite higher alcohol consumption in high-income countries, the overall health benefits from better healthcare infrastructure outweigh potential negative impacts (Watsons, 2022).## Libraries usedTo perform the analysis and generate the necessary visualizations and s for this technical report, the following R packages will be utilized:```{r,message=FALSE}library(readr)library(tidyverse)library(readxl)library(visdat)library(gt)library(gtsummary)library(tidymodels)library(ggplot2)library(reshape2)library(gridExtra)library(rsample)library(tune)library(car)library(patchwork)library(glmnet)library(vip)library(DALEX)library(knitr)library(kableExtra)library(huxtable)library(modelsummary)library(gtsummary)library(psych)library(randomForest)library(caret)library(glmnet) library(corrplot)library(rpart.plot)library(rpart)```## Dataset OverviewTo perform this analysis, **we utilizes the dataset from the previous infographic assignment.** The dataset given, is a joined dataset from numerous sources which consists of socio-economic predictors and wealth status. The previous infographic dataset consists of aggregated countries, region and UN classify territories. Therefore, the code chunk below filter out those non-country values and keep all 193 UN member countries.# Initial Data Preprocessing## 1. Data collection and integration```{r, message =FALSE}# load previous infographic dataset IA3_df = read_excel("Datasets/Raw_IA3_df.xlsx")# convert 0s into NAfull_df <- IA3_df %>% mutate(across(where(is.numeric), ~na_if(., 0)))# countries column has loads of region and state names, even aggregated names.# collect country code that consists of all recognized 195 countries, however this datasets do not consists of Palestine and Vatican city as they are non-member observer state.codes = read_csv("Datasets/countries.csv") %>% select(en, alpha3) %>% transmute(Code = toupper(alpha3), Country = en)# get continents dataset for grouping descriptive summary resultscontinent = read_csv("Datasets/continents.csv") %>% rename(country = Country)# join two datasets, and remove rows that is inconsistent with the earlier country codecleaned_df = full_df %>% full_join(codes, by = "Code") %>% filter(Country != "0") %>% select(-Country) %>% distinct(Code, .keep_all =TRUE) %>% inner_join(continent, by = "country")# rewrite the name to be shorter and simpler to writecleaned_df = cleaned_df %>% rename(adult_mortality = `Adult Mortality rate per 1,000`, total_mortality = `Total Mortality rate per 1,000`)# replace white space with underscoresnames(cleaned_df) <- gsub(" ", "_",tolower(names(cleaned_df)))```### Dependent Variable: Life_ExpectancyThe dependent variable we use is \`Life_Expectancy\` variable which fortunately has no missing values. Data is obtained from World Bank Databank specifically taken from year 2019 before the heavy influx of Covid-19 that may affect our predictions.## 2. Handling Missing Values```{r}# calculate the percentage amount of NA relative to the dataset populationpercent_na =sum(is.na(cleaned_df)) / (nrow(cleaned_df) *ncol(cleaned_df)) # calculate the total NAfull_na =sum(is.na(cleaned_df))# visualize the missing valuesvis_miss(cleaned_df)```In this dataset, it has `r full_na` missing "NA" values. Which means `r paste0(round(percent_na*100,2),"%")` of this dataset is missing. gini_index and doctors dataset has the highest missing values while life expectancy (dependent variable) has no missing values. Because more than half of the observation is missing, imputing the value would not be sui as the predicted values would not replicate the supposed actual value.### 2.1 Removing variables with very high percentage of missing values#### Missing Values of Upper middle income to High income countries```{r}# we visualizes the data of the upper-middle- income and abovecleaned_df %>%filter(gni_percap >=3995)%>%vis_miss()```#### Missing Values of Low middle income to Low income countries```{r}# we visualizes the data of the low-middle- income and belowcleaned_df %>%filter(gni_percap <=3995)%>%vis_miss() ```It can be seen that the lower income countries has relatively higher amount of missing values. This is no coincidence, likely to be Missing Not at Random (MNAR), suggesting that regions with lower income are less likely to report this data, even 50% of the hospital_beds variable is missing. Although docters and gini_index might be useful for the model, this non-informative missingness constitutes a large portion of the overall data (more than 50%) for doctors and gini_index. Therefore, we delete this variable and perform imputation on the rest.### 2.2. Handling missing values through imputation methods#### **Use Bagged Trees (Random Forest) for Higher Missingness**: **(more than or equal to 10%)**Random forest imputation is robust and can handle higher levels of missing data better than KNN (Hu & Szymczak, 2022). It considers the relationships between variables and leverages the ensemble method to make better predictions (Hu & Szymczak, 2022).Variables: **unemployment\_(%), hospital_beds, adult_mortality**------------------------------------------------------------------------#### **Use K-Nearest Neighbor for Lower Missingness: (less than 10%)**KNN is effective for lower levels of missing data and can be quicker to implement. It uses the closest neighbors to impute the missing values, which works well when missingness is lower (Keith, 2021)..Variables: **health_expenditure, gdp_percap, gni_percap, disease_deathrate, disease_death, adult_mortality, total_mortality, child_mortality**```{r}# delete gini index and doctorsdeleted_df =select(cleaned_df,-c(gini_index, doctors))%>%# convert character columns to factors or vice versa as neededmutate_if(is.character, as.factor)# variable name with low missingnesslow_missingness =colnames(select(deleted_df,-c(`unemployment_(%)`, hospital_beds, adult_mortality)))# Create a recipe for data preprocessing with imputationrec0 <-recipe(~ ., data = deleted_df) %>%step_impute_bag(c(`unemployment_(%)`, hospital_beds, adult_mortality)) %>%step_impute_knn(c(low_missingness), neighbors =3) # Prepare the reciperec_prep0 <-prep(rec0, training = deleted_df)# Apply the recipe to the data (bake the recipe)final_data <-bake(rec_prep0, new_data =deleted_df )# Check the resultvis_miss(final_data)```Perfect!### 2.3 Full dataset before input selection and normalization```{r}table_1_df <- final_data %>%mutate(`Income Level`=factor(case_when( gni_percap <1035~"Low-income", gni_percap >=1036& gni_percap <=4045~"Lower-middle-income", gni_percap >=4046& gni_percap <=12535~"Upper-middle-income", gni_percap >=12535~"High-income"),levels =c("Low-income", "Lower-middle-income","Upper-middle-income","High-income")) ) %>%select(!c(country, code))summary_by_income <- table_1_df %>%tbl_summary(by =`Income Level`,missing ="no" ) %>%modify_header(label ="**Variable**")kable(summary_by_income,caption ="*Full dataset summary statistics aggregated by income levels*")# Print the kable(describe(final_data),caption ="*Full dataset summary statistics for each variable*",digits =4)```## 3. Correlation analysis with target variable### 3.1 Full data correlation matrix```{r}# store a numeric only dataframe for our linear regressionnumeric_df =select(final_data, where(is.numeric)) # form a correlation matrixcor_matrix = numeric_df %>%cor() # identify the variable with high correlation cor_matrix %>%corrplot(method ="number",order ="hclust")``````{r}# identify the correlated variables, setting a threshold of 0.3 correlation. (Moderate to strong correlation)correlated_variables =as.data.frame(cor_matrix) %>%select(life_expectancy) %>%filter(life_expectancy >0.3| life_expectancy <-0.3 )%>%filter(life_expectancy !=1) # these variables with the most correlation kable(correlated_variables %>%rownames_to_column() %>%arrange(desc(life_expectancy)),caption ="*Variables with moderate to strong Correlation with life expectancy*",label="*Variable*")``````{r}# Melt the data framedf_melt0 <-melt(numeric_df%>%select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars ="life_expectancy")# Create scatterplotsscatterplots0 <-ggplot(df_melt0, aes(x = value, y = life_expectancy)) +geom_point(alpha =0.2) +facet_wrap(~ variable, scales ="free") +labs(title ="Scatterplot Matrix Against Life Expectancy")+geom_smooth(size =0.3, color ="red",linetype =2, se =FALSE)+theme_bw()print(scatterplots0)``````{r}distributions0 <-ggplot(df_melt0, aes(x = value)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.3) +facet_wrap(~ variable, scales ="free") +labs(title ="Distributions Against Life Expectancy")+theme_bw()distributions0```## 4. Final selected Inputs```{r}# select corelated variable without normalization correlated_inputs =select(final_data, c(colnames(t(correlated_variables)), "life_expectancy", ))kable(correlated_inputs %>%mutate(`Income Level`=factor(case_when( gni_percap <1035~"Low-income", gni_percap >=1036& gni_percap <=4045~"Lower-middle-income", gni_percap >=4046& gni_percap <=12535~"Upper-middle-income", gni_percap >=12535~"High-income"),levels =c("Low-income", "Lower-middle-income","Upper-middle-income","High-income"))) %>%tbl_summary(by =`Income Level`,missing ="no" ) %>%modify_header(label ="**Variable**"),caption ="*Selected dataset summary statistics aggregated by income levels*",digit =4)kable(describe(correlated_inputs),caption ="*Selected dataset summary statistics for each variable*",digits =4)``````{r}sources =as.data.frame(t(correlated_inputs))[2] %>%rownames_to_column()sources$V2 <-c("WorldBank","WorldBank","WorldBank","Our World In Data", "World Health Organization (WHO)","UNICEF", "WorldBank", "WorldBank", "WorldBank", "WorldBank")colnames(sources) <-c("variable", "source")kable(sources, caption ="*Each selected variables sources*")```**How would this variable be useful?**All these 9 selected variables have moderate to strong correlation (threshold of 0.3 correlation). Hence, these variables have a noticable relationship with life expectancy. Each of these features provide valuable information for the predictive model to run. Analyzing these variables collectively can help develop predictive models for estimating life expectancy accurately. However, to fit the model, further data preprocessing will be used. By bringing features to a similair scale, the model is able to eliminate bias from outliers and produce more accurate prediction.## 5. Data Normalization of selected inputs```{r}# Create a recipe for data preprocessing with imputationrec1 <-recipe(~ ., data = final_data) %>%step_YeoJohnson(colnames(t(correlated_variables))) # Normalize numeric predictors# Prepare the reciperec_prep1 <-prep(rec1, training = final_data)# Apply the recipe to the data (bake the recipe)normalized_final_data <-bake(rec_prep1, new_data =final_data )df_melt1 <-melt(select(normalized_final_data, where(is.numeric))%>%select(c(colnames(t(correlated_variables)),"life_expectancy")), id.vars ="life_expectancy")distributions1 <-ggplot(df_melt1, aes(x = value)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.3) +facet_wrap(~ variable, scales ="free") +labs(title ="Distributions Against Life Expectancy")+theme_bw()distributions1``````{r}scatterplots1 <-ggplot(df_melt1, aes(x = value, y = life_expectancy)) +geom_point(alpha =0.2) +facet_wrap(~ variable, scales ="free") +labs(title ="Scatterplot Matrix Against Life Expectancy")+geom_smooth(size =0.3, color ="red",linetype =2, se =FALSE)+theme_bw()print(scatterplots1)```## 6. Data partitioning```{r}# our selected dataselected_inputs =select(normalized_final_data, c("life_expectancy",colnames(as.data.frame(t(correlated_variables)))))set.seed(000)# split data to training and testingdf_split =initial_split(selected_inputs, prop =0.7, strata = life_expectancy)# input train and test data df_train =training(df_split)df_test =testing(df_split)```# Multiple Linear Regression - Ordinary Least Squared$y=β0+β1x1+β2x2+...+ϵ$## 1. Training Multiple Linear Regression### 1.1. Train and evaluate initial regression model```{r}# form the formulafirst_fmla =as.formula("life_expectancy ~ health_expenditure + gdp_percap+ gni_percap + disease_deathrate + child_mortality + adult_mortality + urban_pop + alcohol_percap + hospital_beds")# formula being usedlinear_model0 =lm(first_fmla,# dataset being useddata = df_train) # recieve resultsdf_results0 =cbind(df_train,as.data.frame(predict(linear_model0,interval ="prediction")))# model summarylinear0_summary =summary(linear_model0)modelsummary_output <-modelsummary(list(linear_model0), output ="gt",stars =TRUE,statistic =c("t = {statistic}", "se = {std.error}", 'p.value'),shape =)# Convert the gt to kableExtrakable(modelsummary_output %>%as.data.frame(),caption ="*Initial training model summary*",full_width =FALSE, position ="center")```### 1.2. Checking for Multicollinearity: Variance of Influence (VIF)```{r}# form a VIF plot VIF_linear_0 =as.data.frame(vif(linear_model0)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model0)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) VIF_linear_0```The VIF plot suggest gni_percap and gdp_percap variable is correlated. Therefore we delete the variable wit the highest VIF values, gni_percap. Additionally, we delete child_mortality as well, this is likely to be correlated with adult_mortality. Removing multicollinearity may help improve model interpretability and enhance statistical significance.```{r}# form the formulasecond_fmla =as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds + health_expenditure")# formula being usedlinear_model1 =lm(second_fmla,# dataset being useddata = df_train) # recieve resultsdf_results1 =cbind(df_train,as.data.frame(predict(linear_model1,interval ="prediction")))# recieve residualsdf_results1$residuals = linear_model1$residuals# model summarylinear1_summary =summary(linear_model1) VIF_linear_1 =as.data.frame(vif(linear_model1)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model1)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) (VIF_linear_0 +ggtitle(label ="Before removing gni_percap and child_mortality") ) /(VIF_linear_1 +ggtitle(label ="After removing gni_percap and child_mortality ") ) ```Health_expenditure is still correlated with gdp_percap, therefore we ultimately remove this.```{r}third_fmla =as.formula("life_expectancy ~ gdp_percap + disease_deathrate + adult_mortality + urban_pop + alcohol_percap + hospital_beds")# formula being usedlinear_model2 =lm(third_fmla,# dataset being useddata = df_train) # recieve resultsdf_results2 =cbind(df_train,as.data.frame(predict(linear_model2,interval ="prediction")))# recieve residualsdf_results2$residuals = linear_model2$residuals# model summarylinear2_summary =summary(linear_model2) VIF_linear_2 =as.data.frame(vif(linear_model2)) %>%rownames_to_column() %>%ggplot(aes(x = rowname, y =vif(linear_model2)))+geom_col()+coord_flip() +geom_hline(yintercept =5, linetype =2, color ='blue')+labs(title ="VIF Values of Predictor Variables", x=NULL, y ='VIF values') +theme_bw()+theme(axis.text.y =element_text(size =12)) (VIF_linear_1 +ggtitle(label ="Before removing health_expenditure") ) /(VIF_linear_2 +ggtitle(label ="After removing health_expenditure") ) ```### 1.3. Training model evaluation after adjusting for multicollinearity```{r}# calculate RMSErmse_lm_train <-sqrt(mean(df_results2$residuals^2))# calculate MAEmae_lm_train <-mean(abs(df_results2$residuals))# calculate MSEmse_lm_train <-mean(df_results2$residuals^2)# calculate AICaic_value_lm_train <-AIC(linear_model2)# calculate BICbic_value_lm_train <-BIC(linear_model2)modelsummary_output1 <-modelsummary(list(linear_model2), output ="gt",stars =TRUE,statistic =c("t = {statistic}", "se = {std.error}", 'p.value'))kable(as.data.frame(modelsummary_output1),caption ="*Training model summary*")``````{r}# collect train metricslm_train_metrics =data.frame(metric =c("rmse", "rsq", "mae", "mse", "aic","bic"),lm_train =c(rmse_lm_train, linear2_summary$r.squared, mae_lm_train, mse_lm_train, aic_value_lm_train, bic_value_lm_train ))kable(lm_train_metrics,caption ="*Training model performance metrics*")```We've eventually arrive to this initial final model after adjusting for multicollinearity.### 1.4. Prediction vs actual plot```{r}# plot the predictionr_squared_linear1 =ggplot(df_results2, aes(x = fit, life_expectancy)) +geom_point(alpha =0.4) +geom_abline(color ='blue', linetype =2) +coord_obs_pred()+ggtitle("Predicted vs Actual life expectancy", label ="Training Model using Linear Regression") +theme_bw() +xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =80, y =50,label =paste0("R-squared:", round(linear2_summary$r.squared,2)))r_squared_linear1```### 1.5. Residuals diagnostics```{r}# form a residual histogram plotresidual_linear1 =ggplot(df_results2, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(df_results2$residuals), linetype =2, color ="blue") +geom_vline(xintercept =median(df_results2$residuals), linetype =2, color ="darkgreen") +annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(df_results2$residuals),2)), color ="blue" ) +annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(df_results2$residuals),2)), color ="darkgreen" ) +ylim(0, 0.21) +theme_bw() +ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit1 =ggplot(df_results2, aes(x = fit, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw() +ggtitle("Residual plot ")# qq plotqq_plot1 =ggplot(df_results2, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot ", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw() residual_linear1 + ( qq_plot1/ residual_fit1 ) ```- **Random Distribution**: The residuals appear to be randomly scattered, indicating that the model fits the data well and that there are no obvious patterns or trends.- **Homoscedasticity**: The spread of residuals seems consistent across the range of fitted values, suggesting that the variance of the residuals is constant (homoscedasticity). No heteroskedasticity.- **No Systematic Patterns**: There are no clear curves or systematic patterns, implying that the model captures the relationship between predictors and the response variable adequately.- **Outliers and normality**: Good amount of residuals fall relatively far from zero but has a roughly symmetric distribution with a longer tails to the left, indicating potential outliers that may need further investigation.- **Center and Spread:** Mean of residuals is approximately `r round(mean(df_results1$residuals),2)` and the median is `r round(median(df_results1$residuals),2)`, Since the residuals are centered around 0, it suggest no significant bias in the model.## 2. K-fold Cross-validation### 2.1. Cross-validation performance evaluation```{r}# Specify a modellm_model0 <-linear_reg() %>%set_engine("lm")# Create a recipe for data preprocessingrec <-recipe(third_fmla, data = df_train) # Create a workflowwf <-workflow() %>%add_recipe(rec) %>%add_model(lm_model0)# Create 10-fold cross-validation resamplesset.seed(123) # For reproducibilityfolds <-vfold_cv(df_train, v =10)# Perform cross-validationresults <- wf %>%fit_resamples(resamples = folds,metrics =metric_set(rmse, rsq),control =control_resamples(save_pred =TRUE) )# Collect and summarize metricsmetrics_cv = results %>%collect_metrics()# Visualize results lm_cv_predictions = results %>%collect_predictions()# Function to fit model and extract AIC/BICextract_aic_bic <-function(split) { train_data <-analysis(split) mod_fit <-lm(life_expectancy ~ ., data = train_data) aic <-AIC(mod_fit) bic <-BIC(mod_fit)tibble(id = split$id,AIC = aic,BIC = bic )}aic_bic_results <- folds %>%mutate(results =map(splits, extract_aic_bic)) %>%select(id, results) %>%unnest(cols = results, names_repair ="unique")# Summarize the resultsaic_bic_summary <- aic_bic_results %>%summarise(mean_AIC =mean(AIC),mean_BIC =mean(BIC) )# collect cv metricslm_cv_metrics = metrics_cv %>%transmute(metric = .metric, lm_crossvalidate = mean,) # calculate residuals lm_cv_residuals = lm_cv_predictions$.pred - lm_cv_predictions$life_expectancy# Calculate MAEmae_cv <-mean(abs(lm_cv_residuals))# Calculate MSEmse_cv <-mean((lm_cv_residuals)^2)lm_cv_metrics = lm_cv_metrics %>%full_join(data.frame(metric =c("mae", "mse", "aic", "bic"),lm_crossvalidate =c(mae_cv, mse_cv, aic_bic_summary$mean_AIC, aic_bic_summary$mean_BIC)), by =c("lm_crossvalidate", "metric"))# combine all metrics in linear modeltrain_metrics_lm =full_join(lm_train_metrics, lm_cv_metrics,by ="metric")kable(train_metrics_lm,caption ="*Train and Cross-validation performance metrics*")```### 2.2. Prediction vs Actual Plot```{r}lm_cv_predictions%>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="K-Fold Cross-validation resamples for Linear Regression") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =80, y =50,label =paste0("R-squared:", round(metrics_cv$mean[2],2)))```We've reached our optimum OLS linear model, the R-squared seems to reach `r round(metrics_cv$mean[2],2)` even with cross-validation testing!### 2.3. Residual Diagnostic```{r}df_cv_resid =data.frame("residuals"= lm_cv_residuals,fit = df_train$life_expectancy)# form a residual histogram plotresidual_linear_cv =ggplot(df_cv_resid, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(df_cv_resid$residuals), linetype =2, color ="blue") +geom_vline(xintercept =median(df_cv_resid$residuals), linetype =2, color ="darkgreen") +annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(df_cv_resid$residuals),2)), color ="blue" ) +annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(df_cv_resid$residuals),2)), color ="darkgreen" ) +ylim(0, 0.21) +theme_bw() +ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit_cv =ggplot(df_cv_resid, aes(x = fit, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw() +ggtitle("Residual plot ")# qq plotqq_plot_cv =ggplot(df_cv_resid, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot ", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw() residual_linear_cv + ( qq_plot_cv/ residual_fit_cv ) ```## 3. Prediction and Evaluation### 3.1. Testing performance evaluation (```{r}# Predict on the test datasetmodel_fit0 <- wf %>%fit(data = df_train)test_predictions0 <- model_fit0 %>%predict(new_data = df_test) %>%bind_cols(df_test)# get residuals test_predictions0 <- test_predictions0 %>%mutate(residuals = life_expectancy - .pred)metrics0 <- test_predictions0 %>%metrics(truth = life_expectancy, estimate = .pred)# collect testing metrics# get msemse_testing <-data.frame(metric ="mse",lm_testing =mean((test_predictions0$residuals)^2))# make test metrics lm_test_metrics = metrics0 %>%transmute(metric = .metric, lm_testing = .estimate) %>%full_join(mse_testing)# combine all metrics in linear modeljoined_metrics_lm =full_join(lm_train_metrics, lm_cv_metrics,by ="metric") %>%inner_join(lm_test_metrics)# Summary statistics of the test datasetkable(model_fit0 %>%extract_fit_parsnip() %>%tidy(),caption ="*Testing model summary*")```The testing model summary highlights the relationships between key predictors and life expectancy. GDP per capita positively influences life expectancy (estimate = 1.795, p \< 0.001), indicating economic stability enhances health outcomes. Disease death rate (estimate = -2.755, p \< 0.001) and adult mortality (estimate = -1.344, p \< 0.001) negatively affect life expectancy, underscoring the adverse impact of health burdens. Urban population and hospital beds were not significant predictors (p \> 0.1), while alcohol consumption per capita had a positive but moderate effect (estimate = 0.526, p \< 0.001). These findings align with the hypothesis that healthcare accessibility and economic factors significantly influence life expectancy.### 3.2. Overall OLS regression's performance metrics```{r}kable(joined_metrics_lm,caption ="*Multiple Linear Regression's perfomance metrics*")```### 3.3. Prediction vs actual plot```{r}test_predictions0 %>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Linear Regression Testing set") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy") +annotate("label", x =78, y =50,label =paste0("R-squared:", round(metrics0$.estimate[2],2)))```It performed really well!### 3.4. Residual Diagnostic of testing data```{r}# form a residual histogram plothistogram_linear_test =ggplot(test_predictions0, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(test_predictions0$residuals), linetype =2, color ="blue")+geom_vline(xintercept =median(test_predictions0$residuals), linetype =2, color ="darkgreen")+annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(test_predictions0$residuals),2)), color ="blue" )+annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(test_predictions0$residuals),2)), color ="darkgreen" )+ylim(0, 0.21)+theme_bw()+ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit_test_linear =ggplot(test_predictions0, aes(x = life_expectancy, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw()+labs(title ="fit vs residuals")qq_linear_test =ggplot(test_predictions0, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw()# plot histogram_linear_test + (qq_linear_test/residual_fit_test_linear)```# Regularized Regression : L2 and Elastic NetSince there is severe multicollinearity in the original selected input, we had to delete 3 variables from the selected input even though we still believe those inputs are still useful. Hence, we could've capture important information that can better predict life_expectancy. Therefore, we decided to use Ridge Regression because we wanted to keep all selected variables. We will assign them different weightage and penalty.We wanted to find the optimal values of lambda. We do this by using 10 fold cross validation## 1. Further preprocessing: Standardization```{r}# standardize the train variableblueprint =recipe(life_expectancy ~ ., data = df_train)%>%step_center(all_numeric(), -all_outcomes()) %>%step_scale(all_numeric(), -all_outcomes())prepare <-prep(blueprint, training = df_train)ridge_train <-bake(prepare, new_data = df_train)# standardize the test variableblueprint_test =recipe(life_expectancy ~ ., data = df_test)%>%step_center(all_numeric(), -all_outcomes()) %>%step_scale(all_numeric(), -all_outcomes())prepare_test <-prep(blueprint, training = df_test)ridge_test <-bake(prepare, new_data = df_test)X_test <-model.matrix(life_expectancy ~ ., ridge_test)[, -1]# transform y with log transformationY_test <- ridge_test$life_expectancyX <-model.matrix(life_expectancy ~ ., ridge_train)[, -1]# transform y with log transformationY <- ridge_train$life_expectancyX %>%summary()```## 2. Ridge Regression Training and Evaluation### 2.1 Regularization path plot```{r}ridge <-glmnet(x = X,y = Y,alpha =0) # a = ridge plot(ridge, main ="Ridge penalty\n")```This plot shows how ridge regression coefficients shrink as the regularization parameter λ increases (moving right). Higher λ values lead to smaller coefficients, indicating stronger regularization, which reduces overfitting by limiting the model's sensitivity to individual features. Unlike LASSO, coefficients remain non-zero, meaning all features contribute to the model. This helps balance model complexity and prediction accuracy, making it robust, especially when handling multicollinearity or many features relative to observations.### 2.2. Cross-validation to search optimal lambda```{r}# cross validation ridge regressionridge_cv <-cv.glmnet(x = X,y = Y,alpha =0,type.measure ="mse")plot(ridge, main ="Ridge penalty\n")abline(v = ridge_cv$lambda.min, col ="red", lty ="dashed")abline(v = ridge_cv$lambda.1se, col ="blue", lty ="dashed")```### 2.4. Cross-validation plot```{r}# we will be using MSE to compare models plot(ridge_cv, main ="Ridge penalty\n\n")```### 2.5. Cross-validation results```{r}ridge_cv```The 10-fold cross-validation Mean Squared Error (MSE) for a ridge model is depicted in a plot. The first vertical dotted line indicates the value of λ that corresponds to the smallest MSE, while the second vertical line represents the λ value that is within one standard error of the minimum MSE.**Minimum Lambda Found:** `r ridge_cv$lambda.min`The crossvalidation gives us this optimized lambda values that allow us to yield the lowest Mean Squared Error, now we will use this value to predict using the training dataset.### 2.3. Uses the minimum lambda to predict training set```{r}# using this crossvalidationridge_cv_result = ridge_cv %>%predict(newx = X) %>%rowMeans()%>%cbind(Y)%>%as.data.frame() ridge_cv_residuals = (Y - ridge_cv %>%predict(newx = X))SST <-sum((Y -mean(Y))^2) # Total Sum of SquaresSSR <-sum(ridge_cv_residuals^2) # Residual Sum of SquaresR_squared_ridgecv_train <-1- (SSR / SST)ridge_cv_result %>%ggplot(aes(x = . , y = Y)) +geom_point(alpha =0.5)+geom_abline(linetype =2, color ="blue") +coord_obs_pred() +theme_bw() +annotate("label", x =80, y =50,label =paste0("R-squared:", round(R_squared_ridgecv_train,2)))+xlab ("Predicted life expectancy") +ylab ("Actual life expectancy")```## 3. Elastic net to find the optimal tuning### 3.1. Cross-validation plot```{r}# grid search across cv_glmnet <-train(x = X,y = Y,method ="glmnet",#preProc = c("zv", "center", "scale"),trControl =trainControl(method ="cv", number =10),tuneLength =10)# plot cross-validated RMSEggplot(cv_glmnet)+theme_bw()```This plot visualizes which parameter yields the lowest RMSE in each of the 10 folds. Notice how the lambda has changes a bit, but allow us to use alpha of 0.2.```{r}kable(cv_glmnet$bestTune, caption ="*Grid seach result*")```Alpha of 0.1 with lambda of 0.0078 gives the lowest rmse.### 3.2. Variable Importance plot```{r}vip(cv_glmnet, bar =FALSE) +theme_bw()```### 3.2 Final evaluation of ridge and elastic net models```{r}# predict life_expectancy on training dataridge_grid_result <-data.frame(pred =predict(cv_glmnet, X),fit = df_train$life_expectancy)ridge_grid_residuals = ridge_grid_result$pred - ridge_grid_result$fit# compute RMSE of transformed predictedrmse_grid =RMSE(ridge_grid_result$pred, ridge_grid_result$fit)mae_grid =MAE(ridge_grid_result$pred, ridge_grid_result$fit)mse_grid =mean((ridge_grid_residuals)^2)SST <-sum((Y -mean(Y))^2) # Total Sum of SquaresSSR_grid <-sum(ridge_grid_residuals^2) # Residual Sum of SquaresR_squared_gridsq_train <-1- (SSR_grid / SST)ridge_cv_grid_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), elastic_train =c(rmse_grid, R_squared_gridsq_train, mae_grid ,mse_grid))# calculate MAEmae_cv_ridge <-mean(abs(ridge_cv_residuals))# calculate MSEmse_cv_ridge <-mean((ridge_cv_residuals)^2)# calculate MSErmse_cv_ridge <-sqrt(mse_cv_ridge)# calculate MAEmae_cv_ridge <-mean(abs(ridge_cv_residuals))# calculate MSEmse_cv_ridge <-mean((ridge_cv_residuals)^2)# calculate MSErmse_cv_ridge <-sqrt(mse_cv_ridge)ridge_cv_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), ridge_train =c(rmse_cv_ridge, R_squared_ridgecv_train, mae_cv_ridge ,mse_cv_ridge))train_regular_metrics = ridge_cv_metrics %>%full_join(ridge_cv_grid_metrics, by ="metric")kable(train_regular_metrics,caption ="*Ridge and Elastic net regressions training performance*")```## 4. Prediction and evaluation### 4.1. Ridge regression and Elastic results```{r}ridge_predict =predict(ridge_cv, s=ridge_cv$lambda.1se, newx = X_test, lambda = ridge_cv$lambda.min )ridge_result =cbind(ridge_predict, Y_test) %>%as.data.frame()SST_r_test <-sum((Y_test -mean(Y_test))^2) # Total Sum of SquaresSSR_r_test<-sum((Y_test - ridge_predict)^2) # Residual Sum of SquaresR_squared_r_test <-1- ( SSR_r_test/SST_r_test)ridge_result0 = ridge_result %>%transmute(life_expectancy = Y_test,.pred = s1,residuals = Y_test - s1)ridge_plot = ridge_result0 %>%ggplot(aes(x = .pred, y = life_expectancy)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Ridge Regression") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy")+annotate("label", x =76, y =50,label =paste0("R-squared:", round(R_squared_r_test,3))) elastic_predict <-data.frame(pred =predict(cv_glmnet, newdata = X_test))elastic_result =cbind(elastic_predict, Y_test) %>%as.data.frame()SST_net_test <-sum((Y_test -mean(Y_test))^2) # Total Sum of SquaresSSR_net_test<-sum((Y_test - elastic_predict)^2) # Residual Sum of SquaresR_squared_net_test <-1- (SSR_net_test / SST_net_test)elastic_result0 = elastic_result %>%mutate(residuals = Y_test - pred)elastic_plot = elastic_result0 %>%ggplot(aes(x = pred, y = Y_test)) +geom_point(alpha =0.4) +geom_abline(linetype =2, color ="blue") +ggtitle("Predicted vs Actual life expectancy", label ="Elastic net") +coord_obs_pred()+theme_bw()+xlab("Predicted Life Expectancy")+ylab("Actual Life Expectancy")+annotate("label", x =76, y =50,label =paste0("R-squared:", round(R_squared_net_test,3))) ridge_plot+elastic_plot```### 4.2 Performance Metrics of testing and training```{r}# Ridge------------ridge_residuals = ridge_result[1] - ridge_result[2]# compute RMSE of transformed predictedrmse_ridge =RMSE(ridge_result$s1, ridge_result$Y_test)mae_ridge=MAE(ridge_result$s1,ridge_result$Y_test)mse_ridge =mean((ridge_residuals$s1)^2)SSR_ridge <-sum(ridge_grid_residuals^2) # Residual Sum of Squares# calculate MSEridge_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), ridge_test =c(rmse_ridge, R_squared_r_test, mae_ridge ,mse_ridge))# Elastic------------elastic_residuals = elastic_result$pred - elastic_result$Y_test# compute RMSE of transformed predictedrmse_elastic =RMSE(elastic_result$pred , elastic_result$Y_test)mae_elastic =MAE(elastic_result$pred ,elastic_result$Y_test)mse_elastic =mean((elastic_residuals)^2)SSR_elastic <-sum(elastic_residuals^2) # Residual Sum of Squares# calculate MSEelastic_metrics =data.frame(metric =c("rmse","rsq", "mae", "mse"), elastic_test =c(rmse_elastic, R_squared_net_test, mae_elastic ,mse_elastic))join_test_r = ridge_metrics %>%full_join(elastic_metrics, by ='metric') joined_metrics_r = train_regular_metrics %>%full_join(join_test_r, by ='metric')kable(joined_metrics_r,caption ="*Ridge and Elastic net regression performance metrics*")```### 4.3 Residual Diagnostic best model```{r}elastic_residuals_df =data.frame(residuals = elastic_residuals,life_expectancy = Y_test)# form a residual histogram plothistogram_test =ggplot(elastic_residuals_df, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(elastic_residuals_df$residuals), linetype =2, color ="blue")+geom_vline(xintercept =median(elastic_residuals_df$residuals), linetype =2, color ="darkgreen")+annotate("text",x =-1.5, y =0.17, label =paste0("Mean: ", round(mean(elastic_residuals_df$residuals),2)), color ="blue" )+annotate("text",x =2.3, y =0.20, label =paste0("Median: ", round(median(elastic_residuals_df$residuals),2)), color ="darkgreen" )+ylim(0, 0.21)+theme_bw()+ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit =ggplot(elastic_residuals_df, aes(x = life_expectancy, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw()+labs(title ="fit vs residuals")qq =ggplot(elastic_residuals_df, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw()# plot histogram_test + (qq/residual_fit)```# Random Forest Regression - Ensemble LearningEnsemble-based method is used to seek the bias variance trade-off. We plan to improve predictive accuracy but unfortunately, reduces interpretability as random forest is a black-box model that is difficult to interpret as compare to linear or regularized regression.## 1. Model BuildingRandom Forest models are inherently robust to the scale of input features, as they base splits on relative values, making normalization less crucial (Hu & Szymczak, 2022). However, normalization can still enhance performance by ensuring balanced splits and equal feature contribution, leading to slightly better accuracy. This improvement is minor compared to methods like ridge regression or linear OLS, where normalization is essential for stability and convergence of coefficients. Therefore, to calculate the best predictive model, we'll still use the same preprocessed data.```{r}# Building the random forest modelset.seed(123)n_features =length(setdiff(names(df_test), "life_expectancy"))# for each of the trees rf_model_normalized <-randomForest( life_expectancy ~ .,data = df_train,importance =TRUE,mtry =floor(n_features/3), # regressionxtest =subset(df_test,select =-life_expectancy),ytest = df_test$life_expectancy,keep.forest =TRUE)# Print the model summaryprint(rf_model_normalized)```### 1.2. Variable importance```{r}# Plot variable importancevip_rf1 <-vip(rf_model_normalized) +ggtitle("Variable Importance")+theme_bw()vip_rf1``````{r}# create an explainer for the modelexplainer_rf =explain(model = rf_model_normalized, data = df_train,y = df_train$life_expectancy, label ="Random Forest")```### 1.3. Partial Dependence plot```{r}# plot Partial Dependence for three selected featurepdp_rf_adult <-model_profile(explainer = explainer_rf, variables ="adult_mortality")pdp_rf_disease <-model_profile(explainer = explainer_rf, variables ="disease_deathrate")pdp_rf_health <-model_profile(explainer = explainer_rf, variables ="health_expenditure")pdp_rf_gdp <-model_profile(explainer = explainer_rf, variables ="gdp_percap")# plot the PDP put them in a 2x2 figure(plot(pdp_rf_adult) +ggtitle(NULL, NULL) +plot(pdp_rf_disease) +ggtitle(NULL, NULL))/ (plot(pdp_rf_health) +ggtitle(NULL, NULL)+plot(pdp_rf_gdp)+ggtitle(NULL, NULL))```Adult mortality has the greatest influence on the model's predictions compared to other variables. As the adult mortality decreases, the predicted probability decrease exponentially, till the lowest life expectancy. The diseases death rate suggest simlair results, which make sense where more people are dying, the lower the estimate life expectancy is. In contrast, Health expenditure and GDP percap positively influence the predicted probability. Suggesting as countries invest more in health expenditure the lesser people will die, and the richer the country the higher the life expectancy.### 1.4. Single tree instance example```{r}# Extract a single tree from the random forest modelsingle_tree <-getTree(rf_model_normalized, k =1, labelVar =TRUE)# Convert it to a decision tree modeltree_model <-rpart(first_fmla, data = df_train)# Plot the tree# Plot the decision treerpart.plot(tree_model, type =1, extra =101, fallen.leaves =TRUE, main ="Decision Tree for Life Expectancy")```## 2. Model Prediction and Evaluation```{r}# Predictions on training datatrain_preds <-predict(rf_model_normalized, newdata = df_train)# Predictions on testing datatest_preds <-predict(rf_model_normalized, newdata = df_test)# Calculate metrics for training datatrain_rmse <-RMSE(pred = train_preds, obs = df_train$life_expectancy)train_mae <-MAE(pred = train_preds, obs = df_train$life_expectancy)train_mse <-mean((train_preds - df_train$life_expectancy)^2)train_rsq <-R2(pred = train_preds, obs = df_train$life_expectancy)# Calculate metrics for testing datatest_rmse <-RMSE(pred = test_preds, obs = df_test$life_expectancy)test_mae <-MAE(pred = test_preds, obs = df_test$life_expectancy)test_mse <-mean((test_preds - df_test$life_expectancy)^2)test_rsq <-R2(pred = test_preds, obs = df_test$life_expectancy)# Combine metrics into a dataframerf_metrics <-data.frame(metric =c("rf_train", "rf_test"),rmse =c(train_rmse, test_rmse),rsq =c(train_rsq, test_rsq),mae =c(train_mae, test_mae),mse =c(train_mse, test_mse))# Print the metricsprint(rf_metrics)``````{r}rf_residuals_df =data.frame(residuals = Y_test - test_preds,life_expectancy = Y_test)histogram_rf =ggplot(rf_residuals_df, aes(x = residuals)) +geom_histogram(aes(y = ..density..), color ="black", alpha =0.5) +geom_density(color ="red", size =0.5) +labs(title ="Histogram of Residuals",x ="Residuals",y ="Density") +geom_vline(xintercept =mean(rf_residuals_df$residuals), linetype =2, color ="blue")+geom_vline(xintercept =median(rf_residuals_df$residuals), linetype =2, color ="darkgreen")+annotate("label",x =16, y =0.03, label =paste0("Mean: ", round(mean(rf_residuals_df$residuals),2)), color ="blue" , size =3.5)+annotate("label",x =17, y =0.025, label =paste0("Median: ", round(median(rf_residuals_df$residuals),2)), color ="darkgreen" , size =3.5)+theme_bw()+ylab(NULL)# lm model, uses ordinary least squares regession to fit a straight lineresidual_fit_test_rf =ggplot(rf_residuals_df, aes(x = life_expectancy, y = residuals))+geom_point(alpha=0.5) +geom_smooth(method ="loess", color ='red', size =0.3, se =FALSE) +theme_bw()+labs(title ="fit vs residuals")qq_rf_test =ggplot(rf_residuals_df, aes(sample = residuals)) +stat_qq() +stat_qq_line(color ="blue") +labs(title ="Q-Q Plot", x ="Theoretical Quantiles", y ="Sample Quantiles")+theme_bw()# plot histogram_rf + (qq_rf_test/residual_fit_test_rf)```# Model Comparison and Conclusion```{r}t1 = joined_metrics_r %>%full_join(joined_metrics_lm, by ="metric") %>%t() %>%as.data.frame() %>%rownames_to_column() # transposed metricscolnames(t1) <-as.character(unlist(t1[1, ])) t1 <- t1[-1, ]t1$rmse <-as.numeric(t1$rmse)t1$rsq <-as.numeric(t1$rsq)t1$mae <-as.numeric(t1$mae)t1$mse <-as.numeric(t1$mse)model_comparison = t1 %>%full_join(rf_metrics, by =colnames(t1)) %>%arrange(rmse)kable(model_comparison, caption ="All model's performance metric comparison")```# Consideration throughout the modelling workflowThroughout the modeling workflow, careful consideration was given to achieving a balance between interpretability and predictive accuracy. The process began with initial model estimation using multiple linear regression to establish a baseline and identify key predictors. Cross-validation was employed to ensure the model's generalizability, addressing the bias-variance tradeoff by evaluating performance across multiple folds. Due to multicollinearity, we had to remove 3 highly correlated variables. Thus, regularization techniques, including ridge regression and elastic net, were incorporated to keep the 3 variables and handle multicollinearity. The Random Forest model was ultimately chosen for its superior predictive performance, despite its complexity. For every model, steps were taken to validate model assumptions, such as checking for homoscedasticity and normality of residuals, ensuring robustness and reliability in the final model.# Final models results conclusion## 1. Multiple Linear RegressionThe multiple linear regression model establishes the baseline relationship between predictors and life expectancy. Significant predictors include GDP per capita, disease death rate, and adult mortality, indicating that economic status, disease prevalence, and mortality rates substantially impact life expectancy. The model's performance is moderate, with some limitations due to multicollinearity among predictors. Adjustments and removal of highly collinear variables were necessary to enhance model reliability and interpretability.## 2. Elastic Net RegressionElastic Net regression combines the penalties of both ridge and lasso regression to address multicollinearity and variable selection. This model showed improved performance over the linear model, with lower RMSE and higher R² values, indicating better predictive accuracy. Significant predictors remain consistent, reaffirming the importance of economic and health-related factors on life expectancy. The model's regularization helps in retaining valuable predictors while penalizing less significant ones, ensuring a balanced and interpretable model.## 3. Random Forest regressionThe Random Forest model, an ensemble learning method, provided the best performance among the models. It handled multicollinearity effectively and captured complex relationships in the data. Key predictors like GDP per capita, disease death rate, and adult mortality were identified as the most influential. The model's high accuracy, reflected in the lowest RMSE and highest R² values, underscores its robustness. However, its black-box nature limits interpretability compared to linear models. Thus, partial dependence plot was visualized. Overall, Random Forest proved highly effective in predicting life expectancy, reinforcing the significance of economic stability and health factors.### Was the research hypothesis achieved?Yes, the research hypothesis was achieved. Overall the model was able predict the estimated life expectancy with very high accuracy, indicating that the variables we used can deeply explain life expectancy. According to our results, there is sufficient evidence to prove that countries with higher levels of income accessibility to healthcare services do infact exhibit higher life expectancy and lower mortality rates. As countries get wealthier, they increase health expenditure and afford more hospital beds, increasing their likelihood of higher live expectancy compared to countries that are from poorer income levels.# Salient Findings• **Economic Wealth plays a role:** Our Random Forest model corroborates existing research on the negative association between mortality rate, diseases and life expectancy. Our analysis suggest the importance of addressing the health disparity between high- and low-income countries.• **Multifaceted Influences on Health:** While the model confirms GDP's significance, it identifies disease death rate and adult mortality as key factors. Obviously disease and mortality rates are key factors, other variables suggests there are external influences that capture the multifaceted nature of life expectancy. If we remove the mortality rate and diseases death rate, and alllow more datasets and variables to be imported, it may give a more detailed insights about life expectancy.## Proposed SolutionsThe analysis shows money impacts health. To tackle this, governments can invest more in public healthcare for wider access, create programs for low-income areas, and focus on preventative care. Wealthy nations can also aid developing countries. By using data like this to identify problem areas, governments can create policies that target specific needs and work towards a future where everyone has equal chances for good health.## Limitations- **Balancing Interpretability and Accuracy:** This analysis aimed to strike a balance between a model that's easy to understand (interpretable) and one that predicts well (accurate). While linear regression offered clear cause-and-effect relationships, its accuracy was limited by multicollinearity (correlated variables). Random Forest, on the other hand, achieved highest accuracy but its complex nature makes it harder to pinpoint why certain factors influence life expectancy.- **Data Availability and Generalizability:** The analysis relied on available data, which might not capture all relevant factors affecting life expectancy. Socioeconomic factors beyond GDP, like education or access to clean water,could be missing. Additionally, the chosen models may not generalize perfectly to all countries due to cultural or environmental variations not included in the data.# Reference list{width="232"}\