Life expectancy, a term familiar to policymakers and health professionals worldwide, encapsulates the average number of years a person is expected to live based on a myriad of factors. Originating from life tables in demography, life expectancy takes into account the age-specific mortality rates for a given population in a specific year or over a period. For instance, life expectancy at birth is the average number of years a newborn can expect to live if current mortality rates remain constant throughout their lifetime. However, as people age, they have already survived certain risks, so their remaining life expectancy changes.
Several factors influence life expectancy, including, among others, socioeconomic status, education, access to healthcare, immunization coverage, and lifestyle.
Global disparities in life expectancy are evident, often highlighting socio-economic divides and inequalities in access to healthcare. Over the past century, global life expectancy has risen substantially, owing to advancements in medicine, improved living conditions, and reduced child mortality, among other factors. Rooted in statistical analysis, this measure provides a snapshot of the overall health and well-being of populations, and it serves as a crucial barometer for assessing the efficacy of healthcare systems, policies, and interventions. Hence, we will perform a regression analysis to model and predict life expectancy from the surrounding factors that might affect it.
The data come primarily from a single Kaggle dataset on the life expectancy of 193 countries, with mostly health- and country-related factors covering the years 2000 through 2015. Since its release, an updated version of this dataset has been published in which missing and erroneous values are claimed to have been corrected; countries with more than four missing values, such as Sudan, South Sudan, and North Korea, were removed from it. However, the updated dataset dropped one variable from the original: the Human Development Index (HDI). To restore it, we reworked and merged the updated dataset with a separate HDI dataset listing the HDI of 189 countries from 1990 to 2017. NA values in the resulting HDI column were imputed using the average yearly change of that country's HDI trend.
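A minimal sketch of this imputation step, assuming the merged data frame `life` has `Country`, `Year`, and `HDI` columns (the helper name `impute_hdi` is ours for illustration):

```r
library(dplyr)

# Fill missing HDI values per country by extending the observed series
# with that country's average yearly HDI change (a simple linear trend).
# (Countries with fewer than two observed values would need separate handling.)
impute_hdi <- function(df) {
  df %>%
    group_by(Country) %>%
    arrange(Year, .by_group = TRUE) %>%
    mutate(
      # average yearly change over the observed span
      slope = (last(HDI[!is.na(HDI)]) - first(HDI[!is.na(HDI)])) /
              (last(Year[!is.na(HDI)]) - first(Year[!is.na(HDI)])),
      anchor_year = last(Year[!is.na(HDI)]),  # last year with observed HDI
      anchor_hdi  = last(HDI[!is.na(HDI)]),
      HDI = ifelse(is.na(HDI),
                   anchor_hdi + slope * (Year - anchor_year),  # extrapolate
                   HDI)
    ) %>%
    select(-slope, -anchor_year, -anchor_hdi) %>%
    ungroup()
}

life <- impute_hdi(life)
```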
## Rows: 2,848
## Columns: 22
## $ Country <chr> "Hungary", "Afghanistan", "Solomon Islands…
## $ Region <chr> "European Union", "Asia", "Oceania", "Afri…
## $ Year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ Infant_deaths <dbl> 8.7, 90.5, 25.6, 106.5, 30.3, 75.7, 6.0, 1…
## $ Under_five_deaths <dbl> 10.1, 129.2, 30.7, 159.3, 37.0, 119.9, 7.1…
## $ Adult_mortality <dbl> 192.9690, 310.8305, 228.1690, 344.2520, 17…
## $ Alcohol_consumption <dbl> 12.230, 0.020, 0.710, 2.050, 2.470, 1.080,…
## $ Hepatitis_B <int> 88, 62, 81, 69, 93, 81, 95, 89, 79, 79, 66…
## $ Measles <int> 99, 12, 65, 64, 83, 64, 86, 86, 64, 64, 96…
## $ BMI <dbl> 25.9, 21.7, 24.9, 20.7, 24.8, 22.0, 26.3, …
## $ Polio <int> 99, 24, 88, 42, 88, 63, 86, 80, 73, 78, 90…
## $ Diphtheria <int> 99, 24, 86, 40, 94, 64, 86, 88, 75, 78, 90…
## $ Incidents_HIV <dbl> 0.08, 0.02, 0.17, 1.07, 0.53, 2.97, 0.04, …
## $ GDP_per_capita <int> 8971, 148, 1939, 351, 1790, 506, 41623, 76…
## $ Population_mln <dbl> 10.21, 20.78, 0.41, 47.11, 6.57, 4.92, 3.8…
## $ Thinness_ten_nineteen_years <dbl> 2.3, 2.3, 1.4, 12.4, 2.8, 9.4, 0.3, 2.4, 8…
## $ Thinness_five_nine_years <dbl> 2.3, 2.5, 1.4, 12.3, 2.7, 9.3, 0.3, 2.3, 7…
## $ Schooling <dbl> 10.2, 2.2, 4.6, 3.3, 4.3, 4.0, 10.8, 8.0, …
## $ Economy_status_Developed <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ Economy_status_Developing <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
## $ Life_expectancy <dbl> 71.2, 55.8, 67.4, 50.0, 70.7, 53.5, 76.5, …
## $ HDI <dbl> 0.7690000, 0.7689774, 0.4500000, 0.3330000…
There are 22 columns in our dataset; their labels are listed above.
In this analysis, we will also measure how well our regression model predicts unseen data, which lets us check whether the model overfits. Before splitting the data into train/test sets, we need to turn either the country or the region feature into dummy variables to capture its effect on life expectancy (panel modeling would also be possible, since this dataset has both entities and time). We choose region as our dummy variable: the country column has 178 unique values, which would produce far more dummy variables than the 9 unique regions, and that sheer number of country dummies would make the model more prone to overfitting. After encoding region as dummy variables, we remove the country and region columns and split the dataset into train and test sets by year (the train set covers 2000-2013 and the test set covers 2014-2015), as sketched below.
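A sketch of the encoding and split, assuming the cleaned data frame is `life` and using `fastDummies` for one-hot encoding (other approaches such as `model.matrix()` work too):

```r
library(fastDummies)

# One-hot encode Region (9 levels) and drop the original Region and
# high-cardinality Country columns.
life_enc <- dummy_cols(life, select_columns = "Region",
                       remove_selected_columns = TRUE)
life_enc$Country <- NULL
names(life_enc) <- gsub(" ", "_", names(life_enc))  # e.g. Region_European_Union

# Time-based split: 2000-2013 for training, 2014-2015 for testing.
train <- subset(life_enc, Year <= 2013)
test  <- subset(life_enc, Year >= 2014)
```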
After splitting the dataset into train/test sets, we need to check whether any features should be transformed. We can identify candidates from the boxplot of each numeric, non-dummy feature, which reveals outliers, and from histograms, which show each feature's distribution.
There are 13 variables with outliers that might affect our regression: Infant_deaths, Under_five_deaths, Adult_mortality, Hepatitis_B, Measles, Polio, Diphtheria, Incidents_HIV, GDP_per_capita, Population_mln, Thinness_ten_nineteen_years, Thinness_five_nine_years, and Life_expectancy. We will not focus on treating these outliers, because in this dataset outliers also carry important information.
From the histograms, it appears that 14 columns (the same 13 columns with outlier problems, plus Alcohol_consumption) might benefit from a transformation because their data are not normally distributed (they are skewed to the left or right). There are thus three reasons to transform variables: outliers, skewed distributions, and scale uniformity (bringing variables into a similar value range). We will apply a Yeo-Johnson power transformation because, unlike the logarithmic and Box-Cox transformations, it can handle zero and negative values while making the data more normally distributed.
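A sketch of the transformation step, assuming the `bestNormalize` package, whose `yeojohnson()` estimates λ per column and stores it in the returned object for later reuse on the test set:

```r
library(bestNormalize)

# The 13 outlier-prone columns plus Alcohol_consumption.
skewed_cols <- c("Infant_deaths", "Under_five_deaths", "Adult_mortality",
                 "Hepatitis_B", "Measles", "Polio", "Diphtheria",
                 "Incidents_HIV", "GDP_per_capita", "Population_mln",
                 "Thinness_ten_nineteen_years", "Thinness_five_nine_years",
                 "Alcohol_consumption", "Life_expectancy")

# Fit Yeo-Johnson on the training data only; each fit stores its lambda.
yj_fits <- lapply(train[skewed_cols], yeojohnson)
train[skewed_cols] <- lapply(yj_fits, function(f) f$x.t)  # transformed values
```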
After transforming the columns and storing each one's lambda (λ), the power to which the original values are raised and which helps stabilize variance, for later use, we can see that the Yeo-Johnson power transformation improved the distributions of the changed columns. Next, we will run a series of assumption tests to ensure that our regression assumptions are not violated, so that we can build a model that both predicts and interprets well.
There are five assumptions we will test for this life expectancy regression model: multicollinearity, linearity, autocorrelation, heteroscedasticity, and normality. We start with multicollinearity because it is tied to feature selection.
Multicollinearity refers to independent variables being strongly related to one another. When independent variables can be predicted from each other, the model's coefficient estimates become more variable and harder to interpret; hence, independent variables should not be too highly correlated with each other (for example, a country's economy status may already be represented by GDP_per_capita, since high GDP implies a developed economy). We will use the Variance Inflation Factor (VIF) test and aim for all variables to have VIF scores below 10. (In the model below, we drop both Economy_status_Developed and Region_Asia to avoid aliased coefficients.)
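A sketch of the VIF check, assuming the `car` package and the transformed training data in `train` (with dummy columns):

```r
library(car)

# Fit the full model without the aliased dummies, then compute VIF;
# scores above ~10 flag problematic multicollinearity.
fit_vif <- lm(Life_expectancy ~ . - Economy_status_Developed - Region_Asia,
              data = train)
vif(fit_vif)
```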
## Year Infant_deaths
## 1.164121 305.718524
## Under_five_deaths Adult_mortality
## 370.027291 8.319065
## Alcohol_consumption Hepatitis_B
## 3.649400 3.296382
## Measles BMI
## 2.050875 4.364934
## Polio Diphtheria
## 15.210188 16.274813
## Incidents_HIV GDP_per_capita
## 3.805251 9.742814
## Population_mln Thinness_ten_nineteen_years
## 1.614902 11.295801
## Thinness_five_nine_years Schooling
## 11.756535 7.099457
## Economy_status_Developing HDI
## 7.205920 18.626039
## Region_Africa Region_Central_America_and_Caribbean
## 4.559939 2.603636
## Region_European_Union Region_Middle_East
## 6.683058 2.285719
## Region_North_America Region_Oceania
## 1.515814 2.490851
## Region_Rest_of_Europe Region_South_America
## 2.499140 2.084235
We can see some concerningly high VIF scores that exceed 10, such as Infant_deaths and Under_five_deaths. We will remove the variables with the highest VIF scores from the model and rerun the VIF test until we get satisfactory results.
## Year Adult_mortality
## 1.142952 5.155392
## Alcohol_consumption Hepatitis_B
## 3.518365 2.926214
## Measles BMI
## 1.952402 3.747846
## Polio Incidents_HIV
## 3.821591 3.701664
## GDP_per_capita Population_mln
## 5.717783 1.555507
## Thinness_ten_nineteen_years Schooling
## 2.834949 4.981127
## Region_Africa Region_Central_America_and_Caribbean
## 3.894593 2.529881
## Region_European_Union Region_Middle_East
## 3.678440 2.166233
## Region_North_America Region_Oceania
## 1.397787 2.308858
## Region_Rest_of_Europe Region_South_America
## 2.218543 1.979518
After this iterative process of removing high-VIF variables and rerunning the VIF test, we arrive at the results above. We removed the economy-status dummy (already represented by GDP), Infant_deaths and Under_five_deaths, HDI, Diphtheria vaccination (likely represented by the other vaccination variables), and thinness among 5-9-year-olds (likely represented by thinness among 10-19-year-olds).
Linearity refers to the relationship between the independent variables and the dependent variable, which should be linear. If the relationship is not linear, the regression model will struggle to capture it. A simple scatter plot of residuals vs. fitted values helps visualize the linearity assumption and, sometimes, homoscedasticity as well.
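The plot discussed next can be reproduced with base R's diagnostic plotting, assuming the fitted model is `fit1`:

```r
# First lm diagnostic plot: residuals vs. fitted values,
# with a red LOWESS smoother overlaid.
plot(fit1, which = 1)
```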
The dotted horizontal line marks zero residuals, where points would lie if the model's predictions were perfect. The red line is a smoothed curve (locally weighted scatterplot smoothing, LOWESS) that aids in visualizing any patterns in the residuals; ideally it should be flat and hover around the zero line if the relationship is linear. In this plot, the line shows a slight curvature, especially towards the lower end of the fitted values, suggesting some non-linearity that the model has not captured.
Ideally, the spread of residuals should be fairly consistent (homoscedastic) across all fitted values. If the variance of the residuals changes (increasing or decreasing) as the fitted values increase, that is a sign of heteroscedasticity. In this plot there is a slight fan shape, with the spread of residuals increasing somewhat as the fitted values increase, which indicates potential heteroscedasticity.
Homoscedasticity means the error/residual terms of the regression model share the same (if unknown) variance, i.e., they are homogeneous in nature. If the error variance of our model varies widely across observations, the model exhibits heteroscedasticity. We will use the Breusch-Pagan test to check for heteroscedasticity in our model.
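A sketch of the test, assuming the `lmtest` package and the fitted model `fit1`:

```r
library(lmtest)

# Breusch-Pagan test; H0: residuals are homoscedastic.
bptest(fit1)
```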
##
## studentized Breusch-Pagan test
##
## data: fit1
## BP = 322.95, df = 20, p-value < 2.2e-16
Since the p-value of the Breusch-Pagan test is less than 0.05, we reject the null hypothesis: there is heteroscedasticity in our model. To retain valid inference, we will use robust standard errors, which keep the model's hypothesis tests reliable despite heteroscedasticity.
Autocorrelation concerns whether the residuals are independent; there should be no correlation between consecutive residuals. The presence of autocorrelation is problematic in time series modeling and regression analysis because many statistical methods assume that observations are independent of each other. When autocorrelation is present, these assumptions are violated, which can lead to inefficient parameter estimates and incorrect conclusions. We will use the Durbin-Watson test to check whether there is autocorrelation in our dataset.
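Again assuming `lmtest` and `fit1`:

```r
library(lmtest)

# Durbin-Watson test; H0: no first-order autocorrelation in the residuals.
dwtest(fit1)
```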
##
## Durbin-Watson test
##
## data: fit1
## DW = 1.9973, p-value = 0.4662
## alternative hypothesis: true autocorrelation is greater than 0
Since the p-value of our Durbin-Watson test is greater than 0.05, we do not reject the null hypothesis: there is insufficient evidence of first-order autocorrelation in the residuals.
Normality of residuals refers to the assumption that the residuals (errors) of a regression model are normally distributed, which ensures that statistical tests (like the t-tests on individual coefficients) are valid. We will use the Shapiro-Wilk test to check the normality of our residuals.
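In base R this is a one-liner on the residuals of `fit1` (note that `shapiro.test()` accepts at most 5,000 observations, which our training set satisfies):

```r
# Shapiro-Wilk test; H0: the residuals are normally distributed.
shapiro.test(residuals(fit1))
```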
##
## Shapiro-Wilk normality test
##
## data: residuals(fit1)
## W = 0.99361, p-value = 5.789e-09
W = 0.99361: the W statistic is close to 1, which is generally a good sign, suggesting the data are approximately normally distributed. p-value = 5.789e-09: however, the p-value is extremely small, indicating that the residuals deviate from normality. Given the large size of our data, even small deviations from normality can produce significant p-values.
Now, we will interpret our model and see how well it predicts Life_expectancy on the test dataset.
##
## Call:
## lm(formula = Life_expectancy ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64505 -0.10795 0.00621 0.10641 0.69276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1104002 1.8267801 -4.987 6.55e-07 ***
## Year 0.0045056 0.0009161 4.918 9.32e-07 ***
## Adult_mortality -0.6899321 0.0078451 -87.945 < 2e-16 ***
## Alcohol_consumption 0.0441816 0.0064809 6.817 1.16e-11 ***
## Hepatitis_B 0.0024826 0.0059104 0.420 0.674500
## Measles -0.0063020 0.0048278 -1.305 0.191891
## BMI -0.0093914 0.0030630 -3.066 0.002192 **
## Polio 0.0768296 0.0067544 11.375 < 2e-16 ***
## Incidents_HIV -0.0491185 0.0066476 -7.389 2.01e-13 ***
## GDP_per_capita 0.1089825 0.0082619 13.191 < 2e-16 ***
## Population_mln 0.0454504 0.0043093 10.547 < 2e-16 ***
## Thinness_ten_nineteen_years 0.0036816 0.0058175 0.633 0.526890
## Schooling 0.0313091 0.0024507 12.776 < 2e-16 ***
## Region_Africa -0.0590259 0.0151684 -3.891 0.000102 ***
## Region_Central_America_and_Caribbean 0.3495110 0.0177940 19.642 < 2e-16 ***
## Region_European_Union 0.1171846 0.0184697 6.345 2.64e-10 ***
## Region_Middle_East -0.0486078 0.0188871 -2.574 0.010122 *
## Region_North_America 0.1580368 0.0317278 4.981 6.76e-07 ***
## Region_Oceania 0.0675518 0.0217993 3.099 0.001965 **
## Region_Rest_of_Europe 0.1009959 0.0185222 5.453 5.45e-08 ***
## Region_South_America 0.2601763 0.0193836 13.423 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1724 on 2471 degrees of freedom
## Multiple R-squared: 0.9705, Adjusted R-squared: 0.9703
## F-statistic: 4065 on 20 and 2471 DF, p-value: < 2.2e-16
As we can see from the model summary, the F-test p-value is below 0.05 (the model as a whole is significant), and the model explains 97.03% of the variance in the dependent variable per its adjusted R-squared. However, we need to use robust standard errors, since our model exhibits heteroscedasticity.
Robust standard errors are used in econometrics and statistics to provide valid hypothesis tests and confidence intervals for regression coefficients when the assumptions of the classical linear regression model are violated. In practice, using robust standard errors can provide more trustworthy p-values and confidence intervals in the face of certain model violations.
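A sketch of the robust re-estimation, assuming the `lmtest` and `sandwich` packages; HC1 is one common estimator (the exact variant used here is an assumption):

```r
library(lmtest)
library(sandwich)

# Re-test the coefficients of fit1 using heteroscedasticity-consistent
# (robust) standard errors instead of the ordinary OLS ones.
coeftest(fit1, vcov = vcovHC(fit1, type = "HC1"))
```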
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.11040019 1.93207029 -4.7154 2.547e-06
## Year 0.00450563 0.00096829 4.6532 3.441e-06
## Adult_mortality -0.68993206 0.00857755 -80.4346 < 2.2e-16
## Alcohol_consumption 0.04418165 0.00675305 6.5425 7.330e-11
## Hepatitis_B 0.00248256 0.00601303 0.4129 0.6797425
## Measles -0.00630205 0.00481262 -1.3095 0.1904924
## BMI -0.00939137 0.00305366 -3.0754 0.0021249
## Polio 0.07682961 0.00727322 10.5634 < 2.2e-16
## Incidents_HIV -0.04911852 0.00628288 -7.8178 7.889e-15
## GDP_per_capita 0.10898248 0.00954311 11.4200 < 2.2e-16
## Population_mln 0.04545036 0.00454639 9.9970 < 2.2e-16
## Thinness_ten_nineteen_years 0.00368163 0.00593805 0.6200 0.5353110
## Schooling 0.03130909 0.00276880 11.3078 < 2.2e-16
## Region_Africa -0.05902590 0.01536175 -3.8424 0.0001249
## Region_Central_America_and_Caribbean 0.34951105 0.01862831 18.7624 < 2.2e-16
## Region_European_Union 0.11718458 0.01892934 6.1906 6.995e-10
## Region_Middle_East -0.04860785 0.02222822 -2.1868 0.0288533
## Region_North_America 0.15803681 0.02012734 7.8518 6.058e-15
## Region_Oceania 0.06755176 0.02126077 3.1773 0.0015049
## Region_Rest_of_Europe 0.10099593 0.01757811 5.7456 1.029e-08
## Region_South_America 0.26017627 0.01922722 13.5317 < 2.2e-16
##
## (Intercept) ***
## Year ***
## Adult_mortality ***
## Alcohol_consumption ***
## Hepatitis_B
## Measles
## BMI **
## Polio ***
## Incidents_HIV ***
## GDP_per_capita ***
## Population_mln ***
## Thinness_ten_nineteen_years
## Schooling ***
## Region_Africa ***
## Region_Central_America_and_Caribbean ***
## Region_European_Union ***
## Region_Middle_East *
## Region_North_America ***
## Region_Oceania **
## Region_Rest_of_Europe ***
## Region_South_America ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Switching from ordinary to robust standard errors barely changes the results: a few variables, such as Alcohol_consumption, become slightly less significant but remain significant. Most variables are significant, with the exceptions of Hepatitis_B immunization, Measles immunization, and thinness among 10-19-year-olds; for these, there is insufficient evidence in the data to conclude that the coefficient differs from zero.
How to interpret a coefficient and its significance, using Year as an example: a one-unit increase in Year, holding all other variables constant, is associated with an increase of about 0.0045 units in (transformed) life expectancy, and we have strong evidence against the null hypothesis that the coefficient is zero, suggesting that Year is associated with life expectancy.
Interpretation with Asia as the omitted region: when dummy variables represent categorical data, one category is typically left out as the reference category to avoid the dummy variable trap (perfect multicollinearity). In this model, the Asia region is the omitted category, so the other regions' coefficients are interpreted relative to Asia. For instance, the coefficient for Region_North_America represents the average difference in the response variable between North America and Asia, holding all else constant: its positive coefficient means that, on average and holding all else constant, being in North America is associated with a life expectancy roughly 0.158 (transformed) units higher than in Asia.
So, we first need to prepare our test dataset by applying the Yeo-Johnson power transformation with the same lambdas learned from the train dataset. Then we use the same model and evaluate its predictive capability on the preprocessed test dataset with three metrics (MAE, RMSE, and R-squared), comparing against a baseline model; a sketch follows the metric definitions below.
Mean Absolute Error (MAE): Represents the average absolute error between the observed values and the predictions. Lower values are better, but what counts as acceptable is context-dependent; compare it to the range and standard deviation of the response variable to get a sense of scale.
Root Mean Squared Error (RMSE): Represents the sample standard deviation of the differences between predicted values and observed values (or errors). Again, lower values are better. Comparing it to the standard deviation of the response variable can give insights.
R-squared: Indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1. A higher R-squared means the model explains more variability of the response.
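A minimal sketch of the evaluation, assuming `fit1`, the `yj_fits`/`skewed_cols` objects from the transformation step, and the `train`/`test` split from earlier; the baseline simply predicts the training-set mean:

```r
# Apply the training-set Yeo-Johnson fits (same lambdas) to the test set.
test[skewed_cols] <- Map(function(f, col) predict(f, newdata = col),
                         yj_fits, test[skewed_cols])

pred   <- predict(fit1, newdata = test)
actual <- test$Life_expectancy

mae  <- mean(abs(actual - pred))
rmse <- sqrt(mean((actual - pred)^2))
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
print(paste("MAE:", mae)); print(paste("RMSE:", rmse)); print(paste("R-squared:", r2))

# Baseline: predict the training-set mean for every test observation.
base <- mean(train$Life_expectancy)
print(paste("Baseline Mean Absolute Error (MAE):", mean(abs(actual - base))))
print(paste("Baseline Root Mean Squared Error (RMSE):", sqrt(mean((actual - base)^2))))
```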
## [1] "MAE: 0.143294109383935"
## [1] "RMSE: 0.183896158850675"
## [1] "R-squared: 0.966086941359291"
## [1] "Baseline Mean Absolute Error (MAE): 0.826880966880805"
## [1] "Baseline Root Mean Squared Error (RMSE): 0.99859451792805"
Evaluating Metrics in Context:
Mean Absolute Error (MAE): An MAE of 0.1433 means that, on average, the model's predictions are about 0.14 (transformed) units away from the actual values; against the roughly (-2, 2) range of the transformed response, that is about 3.6% of the 4-unit range, which is fairly small. The baseline MAE of 0.8269, obtained by predicting the mean of the dependent variable for every observation without any model, is about 20.7% of that range; the much higher baseline MAE suggests the model is doing a good job.
Root Mean Squared Error (RMSE): An RMSE of 0.1839, slightly higher than the MAE, suggests that some predictions deviate more from the actual values, since RMSE gives more weight to larger errors; it is about 4.6% of the range. RMSE is particularly useful when large errors are more problematic than small ones. The baseline RMSE of 0.9986 is about 25% of the range, so the model's much lower RMSE showcases its value.
R-squared: An R-squared of 0.9661 is quite high, indicating that the model explains 96.61% of the variance in the dependent variable on the test set. This is typically seen as very good.
We can conclude that our model, trained on life expectancy and its covariates for 2000-2013, works well on the unseen 2014-2015 data for the 178 countries in the dataset. Here are the key takeaways: