1. Introduction

1.1. Life Expectancy Concept

Life expectancy, a term that known to many individuals such as policymakers, and health professionals worldwide, encapsulates the average number of years a person is expected to live based on a myriad of factors. Originating from life tables in demography, life expectancy takes into account the age-specific mortality rates for a given population in a specific year or over a period. For instance, life expectancy at birth is the average number of years a newborn can expect to live if current mortality rates remain consistent throughout their lifetime. However, as one ages, they’ve essentially survived certain risks, so their life expectancy can change.

Several factors influence life expectancy, including but not limited to:

  • Healthcare: Access to and quality of medical services can lead to early detection and better treatment of diseases.
  • Socioeconomic Factors: Wealth, education, and employment can affect access to nutritious food, healthcare, and opportunities for physical activity.
  • Lifestyle: Factors such as diet, physical activity, smoking, and alcohol consumption play a significant role.
  • Environment: Clean water, sanitation, and air quality are critical determinants.
  • Genetics: Hereditary factors can predispose individuals to certain health conditions.

Global disparities in life expectancy are evident, often highlighting the socio-economic divides and inequalities in access to healthcare. Over the past century, global life expectancy has witnessed a substantial rise, owing to advancements in medicine, improved living conditions, and reduced child mortality, among other factors. Hence, we will try to do a regression analysis and prediction regarding life expectancy and the surrounding factors that might affect it. Rooted in statistical analysis, this measure provides a snapshot of the overall health and well-being of populations, and it serves as a crucial barometer for assessing the efficacy of healthcare systems, policies, and interventions.

1.2. Datasets

The dataset is mostly from 1 dataset from Kaggle about life expectancy of 193 countries with mostly health and country-related factors with the year duration from 2000 until 2015; however, since then, these is an updated version of this dataset where the data is claimed to be updating missing and wrong value. Some countries that have more than 4 missing value is also removed from the new dataset such as Sudan, South Sudan, and North Korea. However, this updated dataset have missed 1 variable from the old dataset which is Human Development Index (HDI) and we will incorporate it with another HDI dataset which listed all 189 countries HDI from 1990-2017. We reworded and merged both the updated version and hdi dataset in order to incorporate HDI feature into completed dataset (NA value in HDI column are imputed with the average yearly change of the HDI trend of that country).

## Rows: 2,848
## Columns: 22
## $ Country                     <chr> "Hungary", "Afghanistan", "Solomon Islands…
## $ Region                      <chr> "European Union", "Asia", "Oceania", "Afri…
## $ Year                        <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ Infant_deaths               <dbl> 8.7, 90.5, 25.6, 106.5, 30.3, 75.7, 6.0, 1…
## $ Under_five_deaths           <dbl> 10.1, 129.2, 30.7, 159.3, 37.0, 119.9, 7.1…
## $ Adult_mortality             <dbl> 192.9690, 310.8305, 228.1690, 344.2520, 17…
## $ Alcohol_consumption         <dbl> 12.230, 0.020, 0.710, 2.050, 2.470, 1.080,…
## $ Hepatitis_B                 <int> 88, 62, 81, 69, 93, 81, 95, 89, 79, 79, 66…
## $ Measles                     <int> 99, 12, 65, 64, 83, 64, 86, 86, 64, 64, 96…
## $ BMI                         <dbl> 25.9, 21.7, 24.9, 20.7, 24.8, 22.0, 26.3, …
## $ Polio                       <int> 99, 24, 88, 42, 88, 63, 86, 80, 73, 78, 90…
## $ Diphtheria                  <int> 99, 24, 86, 40, 94, 64, 86, 88, 75, 78, 90…
## $ Incidents_HIV               <dbl> 0.08, 0.02, 0.17, 1.07, 0.53, 2.97, 0.04, …
## $ GDP_per_capita              <int> 8971, 148, 1939, 351, 1790, 506, 41623, 76…
## $ Population_mln              <dbl> 10.21, 20.78, 0.41, 47.11, 6.57, 4.92, 3.8…
## $ Thinness_ten_nineteen_years <dbl> 2.3, 2.3, 1.4, 12.4, 2.8, 9.4, 0.3, 2.4, 8…
## $ Thinness_five_nine_years    <dbl> 2.3, 2.5, 1.4, 12.3, 2.7, 9.3, 0.3, 2.3, 7…
## $ Schooling                   <dbl> 10.2, 2.2, 4.6, 3.3, 4.3, 4.0, 10.8, 8.0, …
## $ Economy_status_Developed    <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Economy_status_Developing   <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
## $ Life_expectancy             <dbl> 71.2, 55.8, 67.4, 50.0, 70.7, 53.5, 76.5, …
## $ HDI                         <dbl> 0.7690000, 0.7689774, 0.4500000, 0.3330000…

There are 22 columns in our dataset. These columns’s label are listed below.

  • Country - 178 Countries
  • Region - These countries are divided into 9 regions: Africa,Asia, Central America and Caribbean, European Union, Middle East, North America, Oceania, Rest of Europe, South America
  • Year - all of the data is taken during 15 years spans of 2000 - 2015
  • Infant_deaths - Number of Infant Deaths per 1000 population
  • under-five deaths - Number of under-five deaths per 1000 population
  • Adult_mortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  • Alcohol_consumption - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • Hepatitis_B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  • Measles - number of reported cases per 1000 population
  • BMI - Average Body Mass Index of entire population
  • Polio - (Pol3) immunization coverage among 1-year-olds (%)
  • Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  • Incidents_HIV - Deaths per 1000 live births HIV/AIDS (0-4 years)
  • GDP_per_capita - Gross Domestic Product per capita (in USD)
  • Population_mln - Population of the country
  • Thinness_ten_nineteen_years - Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
  • Thinness_five_nine_years - Prevalence of thinness among children for Age 5 to 9(%)
  • Schooling - Number of years of Schooling(years)
  • Economy_status_Developed - The status of developed country economy
  • Economy_status_Developing - The status of developing country economy
  • Life expectancy - Life Expectancy in age, our dependent/target variable
  • HDI - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

2. Data Preprocess

2.1. Pre-split

In this analysis, we will also measure how good our regression model is in predicting unseen data which will allow us to see whether our model overfit or not; however, before splitting the data into train/test, we will need to turn either country or region feature into dummy variables in order to capture their effect on the life expectancy (you can also use panel modeling since this dataset have entities and time). We will choose the region as our dummy variable since the country in this dataset have 178 unique value which will turns into a lot of dummy variables compared with 9 unique regions and the sheer number of country dummy variables will make our model prone to overfit more. After we turned our region feature into dummy variable, we will remove the country and region column as well as splitting our dataset into train and test set based on the year (train dataset will be derived from 2000-2013 and test dataset will be derived from 2014-2015).

2.2. Post-split

After splitting the dataset into train/test, we need to check whether the features need to be transformed or not. We can see the features that needed to be transformed or not from the boxplot of each numeric & non-dummy features where we can see the outliers and the distribution the said features from histogram.

There are 13 variables that have outliers that might affect our regression: Infants_deaths, Under_five_deaths, Adult_mortality, Hepatitis_B, Measles, Polio, Diphtheria, Incidents_HIV, GDP_per_capita, Population_mln, Thinness_ten_nineteen_years, Thinness_five_nine_years,life_expectancy. We will not focused on treating outliers as in this dataset, outliers are also important informations.

From the histograms, it appears the following 14 columns (the same 13 columns from outlier problem + alcohol consumptions) might benefit from a transformation due to their data are not normally distributed (have either left or right skewness). So there are 3 reason to do transformation on variables: outliers, distribution/skewness and scales uniformity (uniforming variables value range). We will do a Yeo-Johnson power transformation due to its capabilities in taking zero and negative value where as logarithmic and Box-Cox cannot transform zero value to make the data more normally distributed.

After transforming the columns and storing their lambda (λ) which represents the power to which original data values are raised which helped in stabilizing variance for later use, we can see that there is improvements to the data distribution of the changed columns by Yeo-Johnson power transformation. Next, we will do a series of assumption test to ensure that our regression assumptions are not violated to be able to create a good predicting and interpretation model.

3. Assumption Tests

There are 5 assumptions that we will test in this Life expectancy regression model: Multicollinearity, Linearity, Autocorrelation, Heteroscedasticity and Normality. we will start with Multicollinearity first because it is related with feature selection.

3.1. Multicollinearity

Multicollinearity refers how impactful independent variable can impact other independent variables. If there are independent variables that can affect other independent variables, there will be more variations and error in predicting and interpreting models, hence, Independent variables should not be too highly correlated with each other (Example: country economy status might already be represented by GDP_per_capite where high GDP means developed economy). We will use Variance Inflation Factor (VIF) test and hoped that all of our variables VIF score less than 10 (In model below, we will cut both economy statys developed and Region_Asia as to avoid aliased coefficients)

##                                 Year                        Infant_deaths 
##                             1.164121                           305.718524 
##                    Under_five_deaths                      Adult_mortality 
##                           370.027291                             8.319065 
##                  Alcohol_consumption                          Hepatitis_B 
##                             3.649400                             3.296382 
##                              Measles                                  BMI 
##                             2.050875                             4.364934 
##                                Polio                           Diphtheria 
##                            15.210188                            16.274813 
##                        Incidents_HIV                       GDP_per_capita 
##                             3.805251                             9.742814 
##                       Population_mln          Thinness_ten_nineteen_years 
##                             1.614902                            11.295801 
##             Thinness_five_nine_years                            Schooling 
##                            11.756535                             7.099457 
##            Economy_status_Developing                                  HDI 
##                             7.205920                            18.626039 
##                        Region_Africa Region_Central_America_and_Caribbean 
##                             4.559939                             2.603636 
##                Region_European_Union                   Region_Middle_East 
##                             6.683058                             2.285719 
##                 Region_North_America                       Region_Oceania 
##                             1.515814                             2.490851 
##                Region_Rest_of_Europe                 Region_South_America 
##                             2.499140                             2.084235

We can see that there are some concerningly high vif score that exceeds 10 such as Infant_deaths and Under_five_deaths. We will remove the variables with high VIF score from the model and rerun VIF test until we can get satisfactory VIF results.

##                                 Year                      Adult_mortality 
##                             1.142952                             5.155392 
##                  Alcohol_consumption                          Hepatitis_B 
##                             3.518365                             2.926214 
##                              Measles                                  BMI 
##                             1.952402                             3.747846 
##                                Polio                        Incidents_HIV 
##                             3.821591                             3.701664 
##                       GDP_per_capita                       Population_mln 
##                             5.717783                             1.555507 
##          Thinness_ten_nineteen_years                            Schooling 
##                             2.834949                             4.981127 
##                        Region_Africa Region_Central_America_and_Caribbean 
##                             3.894593                             2.529881 
##                Region_European_Union                   Region_Middle_East 
##                             3.678440                             2.166233 
##                 Region_North_America                       Region_Oceania 
##                             1.397787                             2.308858 
##                Region_Rest_of_Europe                 Region_South_America 
##                             2.218543                             1.979518

After iterative process of removing high VIF variables and rerun the VIF test, we can get the results above where we removed dummy variable of countries economy status where it already represented by GDP, under five or infant deaths, HDI and Diphtheria vaccination where it might represented by other vaccinations variables, and thinness 5-9 years where it also might be represented by thinness 10-19 years.

3.2. Linearity

Linearity refers to the relationship between the independent and dependent variable where it should be linear. If the relationship is not linear, the regression model will have a hard time to capture the relationship between independent and dependent variable. A simple scatter plot of residual vs. fitted values can help in visualizing linearity and sometimes homoscedasticity assumptions.

The dotted horizontal line indicates where residuals would lie if they were perfectly linear. The red line represents a smoothed curve (locally weighted scatter plot smoothing, LOWESS) to aid in visualizing any patterns in the residuals which ideally should be flat and hover around the zero line if the relationship is perfectly linear. In this plot, the line does show a slight curvature, especially towards the lower ends of the fitted values. This suggests that there might be some non-linearity in the relationship that hasn’t been captured by the model.

Ideally, the spread of residuals should be fairly consistent (homoscedastic) across all fitted values. If the variance of residuals changes (either increasing or decreasing) as the fitted values increase, it’s a sign of heteroscedasticity. From the plot, there seems to be a slight fan shape, where the spread of residuals tends to increase a bit as the fitted values increase. This indicates potential heteroscedasticity.

3.3. Heteroscedasticity

Heteroscedasticity refers to the error/residual terms of regression model should be the same but unknown variance, hence Homogenous in nature. If the error term of our model varies widely, it means that there is Heteroscedasticity in our model. We will use Breusch-Pagan test in order to test the heteroscedasticity of our model.

## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 322.95, df = 20, p-value < 2.2e-16

Since the p-value from Breusch-Pagan test is less than 0.05, it means that we have Heteroscedasticity in our model by rejecting our null hypothesis. In order for us to have better interpretation capabilities, we will use Robust standard error to ensure robustness and reliability of models

3.4. Autocorrelation

Autocorrelation refers to whether the residuals should be independent, i.e., there should be no correlation between consecutive residuals.The presence of autocorrelation can be problematic in time series modeling and regression analysis because many statistical methods assume that observations are independent of each other. When autocorrelation is present, these assumptions are violated, which can lead to inefficient parameter estimates and incorrect conclusions. We will use Durbin-Watson test to test whether there is autocorrelation in our dataset or not.

## 
##  Durbin-Watson test
## 
## data:  fit1
## DW = 1.9973, p-value = 0.4662
## alternative hypothesis: true autocorrelation is greater than 0

Since the p-value of our Durbin-Watson test is more than 0.05, it means that we would not reject the null hypothesis. In the context of the DW test, this means you do not have sufficient evidence to claim that there’s first-order autocorrelation in the residuals.

3.5. Normality of Residuals

Normality of Residuals refers to the assumption that the residuals (or errors) from a regression model are normally distributed which ensures that the statistical tests (like the t-test for individual coefficients) are valid. We will use Shapiro-Wilk normality test in order to test our model normality

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit1)
## W = 0.99361, p-value = 5.789e-09

W = 0.99252: The W statistic is close to 1, which is generally a good sign, suggesting that the data is approximately normally distributed. p-value = 5.033e-10: However, the p-value is extremely small. This indicates that the residuals are not normally distributed. Given the large size of our data, even small deviations from normality can produce significant p-values.

4. Interpretation and Prediction

4.1. Model Interpretation

Now, we will interpret our model and see how good it is in predicting Life_expectancy in test dataset.

## 
## Call:
## lm(formula = Life_expectancy ~ ., data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64505 -0.10795  0.00621  0.10641  0.69276 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          -9.1104002  1.8267801  -4.987 6.55e-07 ***
## Year                                  0.0045056  0.0009161   4.918 9.32e-07 ***
## Adult_mortality                      -0.6899321  0.0078451 -87.945  < 2e-16 ***
## Alcohol_consumption                   0.0441816  0.0064809   6.817 1.16e-11 ***
## Hepatitis_B                           0.0024826  0.0059104   0.420 0.674500    
## Measles                              -0.0063020  0.0048278  -1.305 0.191891    
## BMI                                  -0.0093914  0.0030630  -3.066 0.002192 ** 
## Polio                                 0.0768296  0.0067544  11.375  < 2e-16 ***
## Incidents_HIV                        -0.0491185  0.0066476  -7.389 2.01e-13 ***
## GDP_per_capita                        0.1089825  0.0082619  13.191  < 2e-16 ***
## Population_mln                        0.0454504  0.0043093  10.547  < 2e-16 ***
## Thinness_ten_nineteen_years           0.0036816  0.0058175   0.633 0.526890    
## Schooling                             0.0313091  0.0024507  12.776  < 2e-16 ***
## Region_Africa                        -0.0590259  0.0151684  -3.891 0.000102 ***
## Region_Central_America_and_Caribbean  0.3495110  0.0177940  19.642  < 2e-16 ***
## Region_European_Union                 0.1171846  0.0184697   6.345 2.64e-10 ***
## Region_Middle_East                   -0.0486078  0.0188871  -2.574 0.010122 *  
## Region_North_America                  0.1580368  0.0317278   4.981 6.76e-07 ***
## Region_Oceania                        0.0675518  0.0217993   3.099 0.001965 ** 
## Region_Rest_of_Europe                 0.1009959  0.0185222   5.453 5.45e-08 ***
## Region_South_America                  0.2601763  0.0193836  13.423  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1724 on 2471 degrees of freedom
## Multiple R-squared:  0.9705, Adjusted R-squared:  0.9703 
## F-statistic:  4065 on 20 and 2471 DF,  p-value: < 2.2e-16

As we can see that from our model, we have a very good number of significance from F-test p-value which is less than 0.05 (our model is significant) and can explain 96.55% of our model with its adjusted R-squared; however, we need to use Robust standard error since our model have heteroscedasticity.

Robust standard errors are used in econometrics and statistics to provide valid hypothesis tests and confidence intervals for regression coefficients when the assumptions of the classical linear regression model are violated. In practice, using robust standard errors can provide more trustworthy p-values and confidence intervals in the face of certain model violations.

## 
## t test of coefficients:
## 
##                                         Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)                          -9.11040019  1.93207029  -4.7154 2.547e-06
## Year                                  0.00450563  0.00096829   4.6532 3.441e-06
## Adult_mortality                      -0.68993206  0.00857755 -80.4346 < 2.2e-16
## Alcohol_consumption                   0.04418165  0.00675305   6.5425 7.330e-11
## Hepatitis_B                           0.00248256  0.00601303   0.4129 0.6797425
## Measles                              -0.00630205  0.00481262  -1.3095 0.1904924
## BMI                                  -0.00939137  0.00305366  -3.0754 0.0021249
## Polio                                 0.07682961  0.00727322  10.5634 < 2.2e-16
## Incidents_HIV                        -0.04911852  0.00628288  -7.8178 7.889e-15
## GDP_per_capita                        0.10898248  0.00954311  11.4200 < 2.2e-16
## Population_mln                        0.04545036  0.00454639   9.9970 < 2.2e-16
## Thinness_ten_nineteen_years           0.00368163  0.00593805   0.6200 0.5353110
## Schooling                             0.03130909  0.00276880  11.3078 < 2.2e-16
## Region_Africa                        -0.05902590  0.01536175  -3.8424 0.0001249
## Region_Central_America_and_Caribbean  0.34951105  0.01862831  18.7624 < 2.2e-16
## Region_European_Union                 0.11718458  0.01892934   6.1906 6.995e-10
## Region_Middle_East                   -0.04860785  0.02222822  -2.1868 0.0288533
## Region_North_America                  0.15803681  0.02012734   7.8518 6.058e-15
## Region_Oceania                        0.06755176  0.02126077   3.1773 0.0015049
## Region_Rest_of_Europe                 0.10099593  0.01757811   5.7456 1.029e-08
## Region_South_America                  0.26017627  0.01922722  13.5317 < 2.2e-16
##                                         
## (Intercept)                          ***
## Year                                 ***
## Adult_mortality                      ***
## Alcohol_consumption                  ***
## Hepatitis_B                             
## Measles                                 
## BMI                                  ** 
## Polio                                ***
## Incidents_HIV                        ***
## GDP_per_capita                       ***
## Population_mln                       ***
## Thinness_ten_nineteen_years             
## Schooling                            ***
## Region_Africa                        ***
## Region_Central_America_and_Caribbean ***
## Region_European_Union                ***
## Region_Middle_East                   *  
## Region_North_America                 ***
## Region_Oceania                       ** 
## Region_Rest_of_Europe                ***
## Region_South_America                 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The only change from using normal SE into robust SE is the alcohol consumption and Hepatitis_B immunization which are becoming less significant even though its still significant. Most of the variables are significant with the exceptions of Measles immunization, Population and Thinness of 10-19 y.o (these variables have insufficient evidence in the data to conclude that the coefficient is different from zero).

How to interpret the variables coefficient and significance: foe example: with the increase in year variable by 1 unit and holding all of the variables constant, there will be an increase in life expectancy by 0.03826 years and we have strong evidence against the null hypothesis that the coefficient is zero, suggesting that year has an association with life expectancy.

Interpretation with Asia as the Omitted Region: When you use dummy variables to represent categorical data, one category is typically left out as the reference category to avoid the dummy variable trap (perfect multicollinearity). In this model, the Asia region was the omitted category, meaning all other regions’ coefficients are interpreted in comparison to Asia.For instance, the coefficient for “Region_North_America” represents the average difference in the response variable between Asia and North America, holding all else constant. If “Region_North_America” has a positive coefficient, it means that, on average and holding all else constant, being in North America might increase life expectancy by approximately 1.4371 years compared to Asia.

4.2. Model Prediction

So, we will need to prepare out test dataset first by applying Yeo-Johnson power transformation with the same lambda from the train dataset. and then, we will use the same model and use these metrics (MAE, MSE, RMSE and R-squared) to evaluate our model predictive capability on preprocessed test dataset. Then, we will compare it with the baseline model where The metrics explanation:

Mean Absolute Error (MAE): Represents the average absolute error between the observed actual outturns and the predictions. Lower values are better, but the acceptability of a value is context-dependent. Compare it to the range and standard deviation of the response variable to get a sense of scale.

Root Mean Squared Error (RMSE): Represents the sample standard deviation of the differences between predicted values and observed values (or errors). Again, lower values are better. Comparing it to the standard deviation of the response variable can give insights.

R-squared: Indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1. A higher R-squared means the model explains more variability of the response.

## [1] "MAE: 0.143294109383935"
## [1] "RMSE: 0.183896158850675"
## [1] "R-squared: 0.966086941359291"
## [1] "Baseline Mean Absolute Error (MAE):  0.826880966880805"
## [1] "Baseline Root Mean Squared Error (RMSE):  0.99859451792805"

Evaluating Metrics in Context:

Mean Absolute Error (MAE): Given MAE is 0.142, this means that on average, the model’s predictions are about 0.142 units away from the actual values. In the context of the range (-2 to 2), this represents about 7.1% of the range, which is fairly small.Compared to the baseline MAE where predictions are made using just the mean of the dependent variable without any model, the baseline MAE of 0.8269 represents about 41.34% of the range. A higher baseline MAE compared to the model’s MAE suggests the model is doing a good job

Root Mean Squared Error (RMSE): An RMSE of 0.1839, slightly higher than MAE, suggests that there are some predictions that deviate more from the actual values, as RMSE gives more weight to larger errors.In the context of the range, this is about 9.2% of the range. The RMSE is particularly useful when large errors are more problematic than smaller ones. Comparing the baseline RMSE of 0.9986 is about 49.93% of the data range, The model’s RMSE is significantly lower, showcasing the model’s value

R-squared: An R-squared value of 0.9611 is quite high, indicating that the model explains 96.11% of the variance in the dependent variable. This is typically seen as very good.

5. Conclusion & Improvements

5.1. Conclusion

We can conclude that our model worked well on unseen data of life expectancy and other variables of 179 countries in 2014-2015 by training our model on training dataset of life expectancy and its variables in 2000-2013. Here are key takeaways:

  1. Life expectancy shows a rising trend each year, suggesting improvements in health, technology, or other systemic factors over time.
  2. Direct negative factors such as mortality and HIV incidents are decreasing people’s life expectancy
  3. Immunization factors such as Polio are positive; however, mostly immunization factors such as Hepatitis B have negative relationship with life expectancy which might indicates many speculations like: other variables that are not included in the model that are affecting both the immunization rates and life expectancy like recent vaccination campaign on low life expectancy countries (Confounding Factors) or like regions with declining life expectancy due to external factors (e.g., economic downturns, conflicts, natural disasters) might receive more international aid, which includes vaccination campaigns (Reverse Causation).
  4. Interestingly, higher alcohol consumption is linked to increased life expectancy. This could be indicative of moderate consumption patterns or cultural factors that positively impact health.
  5. A higher BMI is associated with a decrease in life expectancy, hinting at the negative health implications of obesity.
  6. GDP per capita and Schooling indicates that Economy and Education have positive impact toward people’s life expectancy
  7. Different regions have varied impacts on life expectancy. For instance, being in Central America and the Caribbean or the European Union is associated with higher life expectancy, while being in Africa tends to reduce it when compared to Asia. However, it’s crucial to note that for regions like Oceania, the effect isn’t statistically significant, meaning we can’t draw a clear conclusion.

5.2. Improvemets

  • Experiment more with Panel or hierarchical data structure (The Author have done it; however, the model become overfit with complicated models and lack of domain knowledge)
  • Even though Linearity and Heteroscedasticity have been mitigated by Robust standard error, both are still remained on the datasets and might need to use additional transformation on variables