Introduction and Background

Human life expectancy is an estimate of the average age that members of a population group will be when they die, and is the dominant metric for analysing population health. Economic, social, environmental, genetic and cultural factors all influence the overall health and well-being of individuals, communities and populations, and therefore play a significant role in determining life expectancy (Hernandez and Blazer 2006). A higher income makes better healthcare affordable, and greater access to medical facilities ensures timely treatment, so both improve life expectancy (Senate Community Affairs References Committee 2014). On the other hand, smoking, alcohol consumption, genetic factors that may predispose certain people to diseases or conditions (such as heart disease, cancer and diabetes), poor living conditions, low sanitation levels, poor environmental quality and lack of access to clean water can adversely affect overall health and well-being, and therefore reduce life expectancy (Hernandez and Blazer 2006; Ford et al. 2011).

Mackenbach et al. (2019) find that smoking, low income and high bodyweight are the most important factors reducing life expectancy in Europe, whereas ambient ozone pollution is recognised as an adverse factor in China and India (Kerdprasop, Kerdprasop and Chuaybamroong 2020). In studies that consider a larger number of countries, employment rate, gross national income, age dependency ratio, the percentage of adults infected with HIV, income and schooling have all been identified as influential factors (Nandi et al. 2023; Mondal and Shitan 2013; Raphael, Prasad and Ronmi 2023; Bali et al. 2021). Evidently, human life expectancy is shaped by a complex interplay of many factors and determinants.

Developing a predictive model that estimates and forecasts life expectancy based on these potentially key factors is therefore a complex task, but can be achieved by applying machine learning techniques to comprehensive, publicly available data sets. The World Bank, the United Nations Development Programme, the Institute for Health Metrics and Evaluation and the World Health Organisation (WHO) are some of the organisations that offer an extensive and wide-ranging collection of global development and health data that can be utilised to estimate life expectancy.

With recent advancements in the accuracy and capabilities of predictive modelling, demand for life expectancy predictions from governments and private organisations has increased (Bali et al. 2021). In turn, many analytical studies have been conducted to model life expectancy. Out of several different approaches, Bali et al. (2021) and Raphael, Prasad and Ronmi (2023) find the greatest success (96% and 95% prediction accuracy respectively) with a random forest model. The important features these models identify are mentioned above. Wang (2021) performs a similar analysis and obtains similar results with a random forest model, but finds that feature importance differs from continent to continent. These three studies all utilise the same publicly available WHO data set, which consists of 2000-2015 data from 193 countries and contains 22 attributes for each observation (Life Expectancy (WHO) | Kaggle). Alternatively, Kerdprasop, Kerdprasop and Chuaybamroong (2020) restrict their predictions to China and India, and observe the best results with a linear regression model. The authors utilise World Bank data from 1960-2017, consisting of 16 economic and environmental features.

Models that can accurately predict life expectancy are of great benefit to insurance companies, pension funds and other financial organisations, as they can assist in making data-informed business decisions and managing resources effectively. Knowledge of which factors are associated with higher life expectancy can also guide strategic spending and investment (Bali et al. 2021). For example, if the research findings indicate that access to healthcare services is a significant factor in determining life expectancy, policymakers can allocate resources to improve healthcare infrastructure and services in regions with limited access (Raphael, Prasad and Ronmi 2023).

Considering that further research is required to fully leverage the potential of these predictive models, we now present our problem statement.

Problem Statement:

This project aims to develop predictive models to accurately predict life expectancy and understand the key factors influencing it. This will involve recommending how to strategically allocate resources to bridge the life expectancy gap between different countries, ultimately improving overall health outcomes.

Data Understanding and Preparation

The Data Set

We will obtain the WHO data set ‘Life Expectancy (WHO) Fixed’ from Kaggle, which is an updated version of the WHO data set mentioned previously. This contains 21 variables and 2864 rows, covers 179 countries, and spans the years 2000-2015. The attributes are or relate to: country, region, year, infant deaths, under-five deaths, adult mortality, alcohol consumption, hepatitis B immunisation, measles immunisation, BMI, polio immunisation, diphtheria immunisation, HIV incidents, GDP per capita, population, teenage thinness, child thinness, schooling, developed or developing indication, and finally life expectancy, which will serve as our response variable. These potential explanatory variables cover a broad range of factors, and many have been identified as key predictive indicators in the literature, so this data set is sufficient for investigating the project objectives outlined above.

Data Pre-Processing

To begin our analysis we first conduct data pre-processing. The appropriate packages and the data set are loaded in, then some variables are converted to factors.

# load packages
library(tidyverse)
library(kableExtra)
library(ggResidpanel)
library(caret)
library(glmnet)
library(markdown)
library(randomForest)
library(ggpubr)

# load data in
lifeexp <- read.csv("Life-Expectancy-Data-Updated.csv")

# convert to factors
lifeexp$Economy_status_Developed <- as.factor(lifeexp$Economy_status_Developed)
lifeexp$Economy_status_Developing <- as.factor(lifeexp$Economy_status_Developing)
lifeexp$Region <- as.factor(lifeexp$Region)

We then confirm that there are no NA entries, and that all of the data is present, i.e. each country has an observation for all years from 2000 to 2015. We also confirm that there are no instances of countries being recorded as developed and developing simultaneously. Given this, we remove the developing indication, as keeping both is unnecessary: they provide the same information, just inverted.

yearstesttable <- lifeexp %>% group_by(Country) %>% 
  summarise(Years_included = n_distinct(Year))
all(yearstesttable$Years_included == 16) # every country covers all 16 years

any(is.na(lifeexp)) # no NAs

any(lifeexp$Economy_status_Developed == lifeexp$Economy_status_Developing)
# no instances of this

# remove Economy_status_Developing variable
lifeexp <- select(lifeexp, -Economy_status_Developing)
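
The checks above print their results for visual inspection. As a minimal sketch of how they could be turned into explicit assertions, the following uses a small made-up data frame (`toy`, with hypothetical countries and values) rather than the real data; only the column names mirror those in the data set.

```r
# toy data standing in for lifeexp (values are made up for illustration)
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(2000:2002, times = 2),
  Economy_status_Developed = c(1, 1, 1, 0, 0, 0),
  Economy_status_Developing = c(0, 0, 0, 1, 1, 1)
)

# number of distinct years recorded for each country
years_per_country <- tapply(toy$Year, toy$Country,
                            function(y) length(unique(y)))

# every country should cover the full span of years in the data
all_years_present <- all(years_per_country == length(unique(toy$Year)))

# no missing values anywhere
no_missing <- !any(is.na(toy))

# exactly one of the two status flags should be set in every row
flags_complementary <- all(
  toy$Economy_status_Developed + toy$Economy_status_Developing == 1
)

# stops with an error if any check fails
stopifnot(all_years_present, no_missing, flags_complementary)
```

Because `stopifnot()` halts on the first failed condition, a check written this way cannot be overlooked when the report is re-knitted on updated data.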

Since the data set was retrieved from an online repository that allows for discussion regarding the data set, we review these comments for any problems other people have encountered and reported, yet we find nothing necessary to address.

Exploratory Data Analysis

In this section we explore the data through visualisations and tables.

# average life expectancy table 

# filter to just use 2000 and 2015 
filtered_data <- lifeexp %>%
  filter(Year %in% c(2000, 2015))

# average life expectancy for each region in 2000
avg_2000 <- filtered_data %>%
  filter(Year == 2000) %>%
  group_by(Region) %>%
  summarise(Average_2000 = mean(Life_expectancy))

# average life expectancy for each region in 2015
avg_2015 <- filtered_data %>%
  filter(Year == 2015) %>%
  group_by(Region) %>%
  summarise(Average_2015 = mean(Life_expectancy))

# merge the average values for 2000 and 2015 by region
result <- merge(avg_2000, avg_2015, by = "Region") %>% 
  arrange(desc(Average_2015))

# 1 decimal place
result$Average_2000 <- sprintf("%.1f", result$Average_2000)
result$Average_2015 <- sprintf("%.1f", result$Average_2015)

# put results in table
kable(result, caption = "Average Life Expectancy by Region in 2000 and 2015. ")

Average Life Expectancy by Region in 2000 and 2015.
Region	Average_2000	Average_2015
European Union	75.7	79.5
North America	76.7	78.5
Rest of Europe	72.8	76.4
Middle East	72.3	75.3
South America	70.8	74.5
Central America and Caribbean	70.8	73.9
Asia	66.6	72.0
Oceania	67.5	71.2
Africa	54.1	62.5

# plotting life expectancy over years
lifeexp %>%
  group_by(Region, Year) %>%
  summarize(Life_expectancy = mean(Life_expectancy)) %>%
  ggplot(aes(x = Year, y = Life_expectancy, col = Region)) +
  geom_line() +
  theme_bw()

Average life expectancy per region from 2000-2015.

Figure 1 displays the average life expectancy across global regions from 2000-2015. Table 1 lists these averages for the earliest (2000) and latest years (2015). Evidently, the average life expectancy increases over this time period for every region, with African countries increasing the most, followed by Asian. Despite the overall improvement Africa achieved over the years, it maintains the lowest average life expectancy by a considerable margin. The European Union surpassed North America around 2009 and prevails as the region with the highest life expectancy as of 2015. In Figure 1 we also see that the life expectancy trend lines of all regions other than Africa tend to congregate into pairs. For the most part, these regions are close to each other geographically, which suggests there may be shared factors or commonalities that contribute to similar trends among neighbouring regions. Nevertheless, the plot generally indicates the persisting disparities and inequalities that exist between global regions.
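
The regional gains described above can be read directly off Table 1. The short sketch below recomputes them from the rounded averages printed in that table (so the gains are accurate to one decimal place only); the object names `region_avgs` and `region_gains` are introduced here purely for illustration.

```r
# averages copied from Table 1 (years)
region_avgs <- data.frame(
  Region = c("European Union", "North America", "Rest of Europe",
             "Middle East", "South America",
             "Central America and Caribbean", "Asia", "Oceania", "Africa"),
  Average_2000 = c(75.7, 76.7, 72.8, 72.3, 70.8, 70.8, 66.6, 67.5, 54.1),
  Average_2015 = c(79.5, 78.5, 76.4, 75.3, 74.5, 73.9, 72.0, 71.2, 62.5)
)

# gain in average life expectancy between 2000 and 2015
region_avgs$Gain <- round(region_avgs$Average_2015 - region_avgs$Average_2000, 1)

# regions ordered from largest to smallest gain
region_gains <- region_avgs[order(region_avgs$Gain, decreasing = TRUE), ]
print(region_gains[, c("Region", "Gain")])
```

On these figures Africa gains 8.4 years and Asia 5.4 years, while North America gains the least at 1.8 years.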

# plotting schooling vs. life expectancy
ggplot(lifeexp, aes(x=Schooling, y=Life_expectancy, col=Region))+
  geom_point(alpha=0.4)+
  facet_wrap(~Economy_status_Developed, 
             labeller=as_labeller(c("0" = "Developing", "1" = "Developed")))+
  theme_bw()

Life expectancy against schooling levels of each region, with each country separated into developed or developing.

In the literature review, we mention that schooling level has been identified as an influential predictor of life expectancy. Figure 2 visualises this relationship and provides us with several insights.

Firstly, we confirm the presence of a somewhat weak positive linear relationship between life expectancy and schooling. Despite having relatively very low life expectancies, African countries do not necessarily exhibit the lowest schooling levels. There are observations with comparable schooling levels to African countries, yet significantly higher life expectancies. This discrepancy highlights the influence of other factors beyond schooling that contribute to differences in life expectancy.

By separating countries based on their economic status as developed and developing, we observe that developed countries consistently exhibit relatively very high life expectancies. However, it is worth noting that a handful of developing countries manage to match the life expectancy levels of some developed countries. This indicates there are cases in developing countries where other factors contribute to higher life expectancies.

Furthermore, the plot demonstrates a greater variation in life expectancies among developing countries, ranging from 40 to 80 years. In contrast, developed countries tend to have a narrower range, typically falling between 70 and 85 years. This variation signifies again the interplay of a range of factors that impact life expectancy, particularly in developing countries.

Finally, streak patterns observed in the plot reflect the slight year-to-year changes experienced by countries.

# plotting correlated predictors
p1 <- ggplot(lifeexp, aes(x=Infant_deaths, y=Under_five_deaths, col=Life_expectancy))+
  geom_point()+
  theme_bw()

p2 <- ggplot(lifeexp, aes(x=GDP_per_capita, y=Adult_mortality, col=Life_expectancy))+
  geom_point()+
  theme_bw()

p3 <- ggplot(lifeexp,aes(x=Polio, y=Diphtheria, col=Life_expectancy))+
  geom_point()+
  theme_bw()

ggarrange(p1, p2, p3, common.legend = TRUE)

Scatter plots of several different predictors with each point coloured by life expectancy.

Figure 3 shows that under five deaths and infant deaths are quite positively linearly correlated, which is expected as both indicators reflect child mortality rates. Adult mortality generally increases exponentially as GDP per capita decreases toward the lower end of its range, suggesting the strong impact of wealth and economic conditions on health outcomes. On the far right panel we also see that Diphtheria and Polio are quite positively linearly correlated.

Considering the linear correlations, we note that some variables may carry redundant information, so multicollinearity may be present and cause problems in further analysis. We also note that the exponential relationship between GDP and adult mortality may cause issues due to its non-linearity.
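
One common way to quantify this kind of redundancy is the variance inflation factor (VIF). The sketch below is not part of the original analysis: it uses simulated predictors (`x1`, `x2`, `x3`, all hypothetical) in base R to show how a near-duplicate pair produces a large VIF, computed as 1 / (1 - R²) from regressing each predictor on the others.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# simulated predictors: x2 is almost a copy of x1, x3 is unrelated
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

# pairwise correlations between the predictors
cor_matrix <- cor(sim)

# variance inflation factor: 1 / (1 - R^2), where R^2 comes from
# regressing one predictor on all of the others
vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  fit <- lm(reformulate(others, response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})

print(round(cor_matrix, 2))
print(round(vif, 1))
```

In this simulation the two near-duplicate predictors receive very large VIFs while the unrelated one stays close to 1. The same calculation could in principle be applied to pairs such as infant deaths and under-five deaths, although that is left as a suggestion rather than a reported result.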

Predictive Modelling

In this predictive modelling section, we use linear regression and random forest methods to predict life expectancy based on the factors listed above in the data set subsection. These two chosen methods offer distinct advantages in capturing the complexity of life expectancy prediction. Linear regression allows us to examine the linear relationship between the predictor variables and the response variable, providing interpretable coefficients that indicate the direction and strength of the associations. On the other hand, random forest offers a more flexible approach by constructing an ensemble of decision trees and aggregating their predictions, enabling the detection of nonlinear and interaction effects among the predictors.

It is common practice to first split the data into a training set and a testing set. The model is trained on the training set, then evaluated by making predictions on the test set and comparing those predictions to the known true values. This allows us to assess the model’s performance and generalisation capabilities. To ensure consistency and comparability between the two methods, we will employ the same train-test split for both. Our data set will be divided into a training set spanning the years 2000 to 2011, which accounts for 75% of the data, and a testing set covering the years 2012 to 2015, constituting the remaining 25% of the data. The twelve-year training set enables the model to learn the complex relationships and gradual long-term trends, while the test set allows the predictive accuracy to be evaluated in a shorter, more recent time frame.

# train and test split
train <- lifeexp %>% filter(Year %in% 2000:2011)
test <- lifeexp %>% filter(Year %in% 2012:2015)
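
As a quick sanity check on the proportions quoted above, the sketch below applies the same year-based rule to a toy balanced panel (three hypothetical countries, one row per country per year). Because every country has the same 16 years, the shares do not depend on how many countries there are.

```r
# toy panel: 3 hypothetical countries observed every year from 2000 to 2015
panel <- expand.grid(Country = c("A", "B", "C"), Year = 2000:2015)

# same time-based rule as the split above
toy_train <- panel[panel$Year %in% 2000:2011, ]
toy_test  <- panel[panel$Year %in% 2012:2015, ]

# share of rows that falls in each set
train_share <- nrow(toy_train) / nrow(panel)
test_share  <- nrow(toy_test) / nrow(panel)

# the latest training year precedes the earliest test year
no_overlap <- max(toy_train$Year) < min(toy_test$Year)

print(c(train = train_share, test = test_share))
```

Splitting by year rather than at random means the test set always lies in the future relative to the training set, which matches the forecasting use case described in the introduction.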

Linear Regression

To begin our linear regression analysis, we first fit a linear model with all of our predictors except country and region, and run diagnostics on the result. These two variables are excluded for two reasons. Firstly, one-hot encoding (the creation of dummy variables) for each country or region would result in an excessive number of variables, especially considering the large number of countries in the data set. This can lead to computational issues and difficulties in interpreting the coefficients associated with each individual country or region. Secondly, this study is focused on identifying modifiable factors that health interventions or policy changes can target to improve health outcomes. Country and region are not modifiable, so excluding them lets the analysis focus on the predictors that can actually be changed and that directly influence health outcomes, and on their relationships with life expectancy globally.
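
To make the first point concrete, the sketch below shows how many design-matrix columns R would create for a country factor with 179 levels, the number of countries in the data set. The country labels are placeholders generated for illustration, not the real names.

```r
# 179 placeholder country labels, matching the number of countries in the data
countries <- factor(sprintf("Country_%03d", 1:179))

# design matrix R would build for a model with country as a predictor
X_country <- model.matrix(~ countries)

# one intercept column plus one dummy per non-reference country
n_columns <- ncol(X_country)
n_dummies <- n_columns - 1
print(c(columns = n_columns, dummies = n_dummies))
```

Including country would therefore add 178 dummy variables to a model that otherwise has 17 predictors.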

lm1 <- lm(Life_expectancy~., data = train[,3:20])
resid_panel(lm1) # diagnostic test

Distribution of residuals panel plot.

# obtain coefficients and p-values of variables 
summary_lm1 <- summary(lm1)
p_values <- summary_lm1$coefficients[, "Pr(>|t|)"]
coefficients <- summary_lm1$coefficients[, "Estimate"]
variable_p_values <- data.frame(Variable = names(p_values),
                                Coefficient = coefficients, P_Value = p_values)
variable_p_values <- variable_p_values[variable_p_values$Variable != "(Intercept)",]

# sort p-values in ascending order
sorted_variable_p_values <- variable_p_values[order(variable_p_values$P_Value),]

# make table
kable(sorted_variable_p_values[,-1], format = "markdown", 
      col.names = c("Coefficient", "P-Value"), caption=
        "Coefficient estimates and p-values of all predictors 
         when fit to a linear regression model.")

Coefficient estimates and p-values of all predictors when fit to a linear regression model.
	Coefficient	P-Value
Adult_mortality	-0.0475451	0.0000000
Under_five_deaths	-0.0594336	0.0000000
GDP_per_capita	0.0000240	0.0000000
BMI	-0.1622148	0.0000000
Infant_deaths	-0.0434048	0.0000000
Alcohol_consumption	0.0693065	0.0000000
Incidents_HIV	0.0931510	0.0000032
Economy_status_Developed1	0.5658491	0.0000044
Schooling	0.0822316	0.0000165
Hepatitis_B	-0.0085671	0.0011943
Thinness_ten_nineteen_years	-0.0477732	0.0134310
Year	0.0109296	0.2030168
Polio	0.0051492	0.4596677
Population_mln	-0.0001338	0.5650512
Thinness_five_nine_years	0.0083247	0.6562104
Measles	-0.0002666	0.8876142
Diphtheria	-0.0003438	0.9612193

# evaluate on test set
test_predictions <- predict(lm1, newdata = test)
lm1_mse <- mean((test$Life_expectancy - test_predictions)^2)

The diagnostic plots in Figure 4 do not reveal any concerning patterns. The residuals show no systematic pattern, appear approximately normally distributed, and follow a straight line in the Q-Q plot. These outcomes suggest that the model assumptions of linearity between the response variable and predictors, constant variance of the residuals, and normality of the residuals are reasonably met. However, there does appear to be an outlier.

The p-value column in Table 2 reveals that only 7 of the 17 variables are not statistically significant at a significance level of 0.01. This observation emphasises the intricate interplay of the various factors, possibly 10, that contribute to life expectancy and the complexity involved in predicting it.
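
The counts above can be reproduced from the printed table. The sketch below copies the p-values from Table 2 (as displayed, rounded to seven decimal places) into a named vector and applies the 0.01 threshold; `p_table2` and `alpha` are names introduced here for illustration.

```r
# p-values copied from Table 2 (rounded as printed)
p_table2 <- c(
  Adult_mortality = 0.0000000, Under_five_deaths = 0.0000000,
  GDP_per_capita = 0.0000000, BMI = 0.0000000,
  Infant_deaths = 0.0000000, Alcohol_consumption = 0.0000000,
  Incidents_HIV = 0.0000032, Economy_status_Developed1 = 0.0000044,
  Schooling = 0.0000165, Hepatitis_B = 0.0011943,
  Thinness_ten_nineteen_years = 0.0134310, Year = 0.2030168,
  Polio = 0.4596677, Population_mln = 0.5650512,
  Thinness_five_nine_years = 0.6562104, Measles = 0.8876142,
  Diphtheria = 0.9612193
)

alpha <- 0.01  # significance level used in the text

# split the predictors by whether they fall below the threshold
significant <- names(p_table2)[p_table2 < alpha]
not_significant <- names(p_table2)[p_table2 >= alpha]

print(length(significant))
print(not_significant)
```

Note that thinness 10-19 years, with a p-value of about 0.013, is the one predictor whose classification would change under the more lenient 0.05 threshold.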

Feature Selection

The LASSO (Least Absolute Shrinkage and Selection Operator) method will now be used to select the most relevant predictors, in order to reinforce, add to, or negate the variables identified as significant in the model produced above (Table 2). The LASSO technique introduces a penalty term to the traditional linear model, aiming to shrink the coefficients of less important predictors towards 0, thus promoting the exclusion of irrelevant variables. Lambda is the regularisation hyperparameter that controls the amount of shrinkage applied to the coefficients. To select the optimal lambda, k-fold cross-validation is employed. In this process, the data set is divided into k equally-sized folds, and the LASSO model is trained and evaluated k times. During each iteration, one fold is used as the validation set, and the remaining k-1 folds are used for model training. The performance metric mean squared error (MSE), which measures the average squared difference between the predicted values and the actual values in a data set, is calculated for each fold. The average MSE across all iterations is then used to evaluate and compare the models with different lambda values. We can then identify the value of lambda that minimises the cross-validated MSE and so maximises the model’s predictive power. Finally, we evaluate this optimal model on the test set and obtain a list of relevant predictors.
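
For reference, the quantities described above can be written out as follows. The notation is introduced here and does not appear in the original analysis: $n$ observations, $p$ predictors, response $y_i$, predictor vector $x_i$, and folds $F_1, \dots, F_k$. The $1/(2n)$ scaling follows the convention used by `glmnet` for a Gaussian response; other texts scale the squared-error term differently.

```latex
% LASSO estimate for a given penalty \lambda \ge 0
\hat{\beta}(\lambda) = \underset{\beta_0,\,\beta}{\arg\min}
  \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}

% mean squared error on a set S of observations with predictions \hat{y}_i
\mathrm{MSE}(S) = \frac{1}{\lvert S \rvert} \sum_{i \in S} \left( y_i - \hat{y}_i \right)^2

% k-fold cross-validation error, where the predictions for fold F_m come from
% the model fitted with penalty \lambda on the other k-1 folds
\mathrm{CV}(\lambda) = \frac{1}{k} \sum_{m=1}^{k} \mathrm{MSE}_{\lambda}(F_m),
\qquad
\hat{\lambda} = \underset{\lambda}{\arg\min}\; \mathrm{CV}(\lambda)
```

Setting $\lambda = 0$ recovers ordinary least squares, while a sufficiently large $\lambda$ shrinks every $\beta_j$ to exactly zero, which is what allows the method to act as a variable selector.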

set.seed(30)
# creating an x matrix and a y vector for train and test sets
xtrain <- model.matrix(Life_expectancy ~ ., train[,3:20])
ytrain <- train$Life_expectancy
xtest <- model.matrix(Life_expectancy ~ ., test[,3:20])
ytest <- test$Life_expectancy

# grid of lambda values
grid <- 10^seq(10, -2, length=100)

lasso.mod <- glmnet(xtrain, ytrain, alpha = 1, lambda = grid) 

# determining the optimal lambda with CV
cv.out <- cv.glmnet(xtrain, ytrain, alpha = 1)

# create a data frame from  cv.out 
cv_data <- data.frame(lambda = cv.out$lambda, mse = cv.out$cvm,  
                      ncoeff = cv.out$nzero)

# subset this data for plot aesthetic purposes
subset_data <- cv_data[c(seq(1, nrow(cv_data), by = 8)), ]

# plotting mse vs. log lambda
ggplot(cv_data, aes(x = log(lambda), y = mse)) +
  geom_line() +
  labs(x = "Log(lambda)", y = "Mean Squared Error") +
  theme_bw()+
  geom_text(data = subset_data, aes(x = log(lambda), y = mse, label = ncoeff), 
            hjust = -0.1, vjust = -0.5, size = 3, color = "red")

MSE of each log lambda deduced with the Lasso method. The number of coefficients corresponding to each point is given by the numbers in red above the line.

# obtain the optimal lambda
bestlam <- cv.out$lambda.min

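# evaluating the model on the test set
set.seed(30)
lasso.pred <- predict(lasso.mod, s=bestlam, newx = xtest )
# note: the coefficients reported below come from a separate LASSO fit
# on the test data at the chosen lambda, not from the training fit
test.out <- glmnet(xtest,ytest, alpha = 1, lambda = bestlam)
lassocoef <- predict(test.out, type = "coefficients", s = bestlam)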
kable(as.matrix(lassocoef)[-1,], caption=
        "Coefficient values of each predictor, deduced with the Lasso method.")

Coefficient values of each predictor, deduced with the Lasso method.
	x
(Intercept)	0.0000000
Year	0.0553218
Infant_deaths	-0.0268663
Under_five_deaths	-0.0762040
Adult_mortality	-0.0488684
Alcohol_consumption	0.0820437
Hepatitis_B	-0.0089793
Measles	0.0057556
BMI	-0.1049565
Polio	0.0000000
Diphtheria	0.0000000
Incidents_HIV	0.0165080
GDP_per_capita	0.0000329
Population_mln	-0.0001349
Thinness_ten_nineteen_years	0.0000000
Thinness_five_nine_years	-0.0222421
Schooling	0.0091622
Economy_status_Developed1	0.9989239

# mse
lasso_mse <- mean((lasso.pred - ytest)^2)

Table 3 reveals that the LASSO method deduced 14 variables to be relevant (those with non-zero coefficients). The three variables shrunk to zero are polio, diphtheria and thinness 10-19 years. All three were also insignificant in the traditional linear model above, although that model additionally found year, measles, population and thinness 5-9 years to be insignificant.

Figure 5 reveals that as more coefficients are removed (increasing lambda), the MSE increases monotonically. However, this trend only gathers momentum when the model is down to 5-6 predictors, so removing up to about 10 does not have a discernible impact on the model's performance. We therefore suspect that more variables could be removed from the LASSO-selected variable list to increase simplicity without sacrificing model performance.

Nevertheless, we put these 14 variables back into a linear regression model manually, present the coefficients and their p-values, then evaluate on the test set.

# fit linear regression model with lasso-identified variables
lm2 <- lm(Life_expectancy ~ Year + Infant_deaths + Under_five_deaths +
            Adult_mortality + Alcohol_consumption + Hepatitis_B + Measles +
            BMI + Incidents_HIV + GDP_per_capita + Population_mln +
            Thinness_five_nine_years + Schooling + Economy_status_Developed,
            data=train)

# obtain coefficients and p-values of variables 
summary_lm2 <- summary(lm2)
p_values <- summary_lm2$coefficients[, "Pr(>|t|)"]
coefficients <- summary_lm2$coefficients[, "Estimate"]
variable_p_values <- data.frame(Variable = names(p_values), 
                                Coefficient = coefficients, P_Value = p_values)
variable_p_values <- variable_p_values[variable_p_values$Variable != "(Intercept)", ]

# sort p-values in ascending order
sorted_variable_p_values <- variable_p_values[order(variable_p_values$P_Value), ]

# make table
kable(sorted_variable_p_values[,-1], format = "markdown", 
      col.names = c("Coefficient", "P-Value"), caption=
        "Coefficient estimates and p-values of 
         the linear regression model created with lasso-deduced variables.")

Coefficient estimates and p-values of the linear regression model created with lasso-deduced variables.
	Coefficient	P-Value
Adult_mortality	-0.0475402	0.0000000
Under_five_deaths	-0.0605910	0.0000000
GDP_per_capita	0.0000237	0.0000000
BMI	-0.1612018	0.0000000
Infant_deaths	-0.0433355	0.0000000
Alcohol_consumption	0.0707675	0.0000000
Incidents_HIV	0.0959669	0.0000015
Economy_status_Developed1	0.5679478	0.0000040
Schooling	0.0866749	0.0000050
Thinness_five_nine_years	-0.0324624	0.0002661
Hepatitis_B	-0.0069532	0.0013010
Year	0.0101710	0.2359176
Population_mln	-0.0001756	0.4494011
Measles	-0.0004070	0.8265777

# evaluate on test set
test_predictions2 <- predict(lm2, newdata = test)
lm2_mse <- mean((test$Life_expectancy - test_predictions2)^2)

Still using a significance level of 0.01, Table 4 reveals that year, population and measles are not significant in this model. We therefore remove them from the model formula and refit.

lm3 <- lm(Life_expectancy ~ Infant_deaths + Under_five_deaths +
            Adult_mortality + Alcohol_consumption + Hepatitis_B  +
            BMI + Incidents_HIV + GDP_per_capita +
            Thinness_five_nine_years + Schooling + Economy_status_Developed,
            data=train)

# obtain coefficients and p-values of variables 
summary_lm3 <- summary(lm3)
p_values <- summary_lm3$coefficients[, "Pr(>|t|)"]
coefficients <- summary_lm3$coefficients[, "Estimate"]
variable_p_values <- data.frame(Variable = names(p_values), 
                                Coefficient = coefficients, P_Value = p_values)
variable_p_values <- variable_p_values[variable_p_values$Variable != "(Intercept)", ]

# sort p-values in ascending order
sorted_variable_p_values <- variable_p_values[order(variable_p_values$P_Value), ]

# make table
kable(sorted_variable_p_values[,-1], format = "markdown", 
      col.names = c("Coefficient", "P-Value"), caption=
    "Coefficient estimates and p-values of the final linear regression model.")

Coefficient estimates and p-values of the final linear regression model.
	Coefficient	P-Value
Adult_mortality	-0.0474786	0.0000000
Under_five_deaths	-0.0603793	0.0000000
GDP_per_capita	0.0000238	0.0000000
BMI	-0.1581001	0.0000000
Infant_deaths	-0.0435972	0.0000000
Alcohol_consumption	0.0699261	0.0000000
Incidents_HIV	0.0944122	0.0000021
Schooling	0.0867501	0.0000039
Economy_status_Developed1	0.5660255	0.0000041
Thinness_five_nine_years	-0.0331801	0.0001281
Hepatitis_B	-0.0066292	0.0014436

# evaluate on test set
test_predictions3 <- predict(lm3, newdata = test)
lm3_mse <- mean((test$Life_expectancy - test_predictions3)^2)

In this final model, all 11 included variables are significant (Table 5). Ten of these were already identified as significant by the original model; thinness 5-9 years is the only addition.

As seen in the code chunks above, while fitting each subsequent model we computed its test MSE. We present the values in Table 6.

# create a table comparing the MSEs of each LR model
MSE <- c(lm1_mse, lasso_mse, lm2_mse, lm3_mse)
Methods <- c("Original LM","Lasso model","LM w/ Lasso vars.","Final model")
metrictable <- data.frame(Methods, MSE)
metrictable %>% 
  kable(caption="Test set MSE values attained with each linear model method 
        tried during the analysis.")

Test set MSE values attained with each linear model method tried during the analysis.
Methods	MSE
Original LM	2.068526
Lasso model	2.091013
LM w/ Lasso vars.	2.060113
Final model	2.098765

The MSE values of the models are extremely similar, so the final model having the highest MSE is not necessarily concerning. The slightly higher MSE might indicate a slight compromise in predictive performance, but the gain in interpretability and the ability to identify the most relevant factors influencing life expectancy make it the preferable choice in this context.

Random Forest

We now fit a random forest model on the training set and employ 10-fold cross-validation, repeated three times, to tune the parameter ‘mtry’, which specifies the number of randomly selected predictors considered at each split when growing the trees in the random forest. The value is chosen to minimise the model’s Root Mean Squared Error (RMSE), i.e. the square root of the MSE. Similar to before, in each repeat the data set is divided into 10 equally-sized folds, and the random forest model is trained and evaluated 10 times. During each iteration, one fold is used as the validation set, while the remaining nine folds are used for model training. For each value of mtry, the RMSE is averaged across all folds and repeats. The optimal mtry corresponds to the minimum average RMSE.
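
The cross-validation loop that `caret` runs internally can be sketched in base R. To keep the example fast and free of package dependencies, it deliberately swaps in a different model and tuning parameter: polynomial degree in `lm()` stands in for mtry in a random forest, and the data are simulated. Only the fold logic is meant to carry over, and a single round of 10-fold cross-validation is shown.

```r
set.seed(2)  # arbitrary seed for reproducible folds and toy data

# simulated data with a curved relationship (purely illustrative)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.5)
sim <- data.frame(x, y)

k <- 10
# assign each row to one of k equally-sized folds at random
fold_id <- sample(rep(1:k, length.out = n))

# candidate values of the tuning parameter
# (polynomial degree here, standing in for mtry)
degrees <- 1:4

cv_rmse <- sapply(degrees, function(d) {
  fold_rmse <- sapply(1:k, function(f) {
    # train on the other k-1 folds, validate on fold f
    fit <- lm(y ~ poly(x, d), data = sim[fold_id != f, ])
    pred <- predict(fit, newdata = sim[fold_id == f, ])
    sqrt(mean((sim$y[fold_id == f] - pred)^2))
  })
  mean(fold_rmse)  # average RMSE across the k folds
})

# tuning value with the lowest cross-validated RMSE
best_degree <- degrees[which.min(cv_rmse)]
print(round(cv_rmse, 3))
print(best_degree)
```

Repeating the procedure with fresh fold assignments and averaging again, as `repeats = 3` does in the code that follows, reduces the dependence of the chosen value on any one random partition.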

set.seed(2)

# grid to search for optimal tuning parameter
control <- trainControl(method="repeatedcv", number=10, repeats=3, 
                        search="grid")
tunegrid <- expand.grid(.mtry=c(1:17))

# fit the random forest model 
rf_gridsearch <- train(Life_expectancy~., data=train[,3:20], method="rf",
                       metric="RMSE", tuneGrid=tunegrid,
                       trControl=control)

# plot RMSE against number of predictors
plot(rf_gridsearch)

Random forest model cross-validation RMSE against number of predictors included (mtry).

# predict on test set using the optimal model (from 'rf_gridsearch')
test_predictionsrf <- predict(rf_gridsearch, newdata = test)

# attain test mse
rf_mse <- mean((test$Life_expectancy - test_predictionsrf)^2)

As seen in Figure 6, an mtry of 7 minimises the cross-validation RMSE. However, this does not single out 7 specific predictors: mtry only sets how many randomly chosen predictors are considered at each split, and all 17 predictors remain available to the forest.

Feature Selection

Random forest is considered a black-box technique as its ensemble structure makes it difficult to interpret. However, it does offer a measure of variable importance that indicates the relative contribution of input features to accurate predictions. This measure is given as the increase in node purity (IncNodePurity), which for regression is the total reduction in residual sum of squares achieved by splits on each predictor, averaged over all trees. Higher values suggest that the corresponding variable plays a more significant role in the decision-making process of the random forest model, and therefore has a more significant impact on predicting life expectancy. We extract this value from the model below and use it to visualise how impactful each of our variables is.

# extract variable importance
rf_imp <- rf_gridsearch$finalModel$importance

# convert to data frame
rf_imp_df <- data.frame(Variables = rownames(rf_imp), IncNodePurity = rf_imp[, 1])

# create column graph of variable importance
ggplot(rf_imp_df, aes(x = reorder(Variables, -IncNodePurity), y = IncNodePurity)) +
  geom_col() +
  labs(x = "Variables", y = "IncNodePurity")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Increase in node purity of each variable in the optimal random forest model.

The top 7 variables, as indicated by the heights of their respective bars, have the largest impact on the prediction of life expectancy in this random forest model (Figure 7), while the remaining variables exhibit very small bar heights. In decreasing order of importance, the top 7 are: under-five deaths, adult mortality, infant deaths, HIV incidents, BMI, GDP per capita and schooling. However, the exact strength and direction of the associations cannot be determined.

Model Results

# create a table comparing the MSEs of each final model
MSE <- c(lm3_mse, rf_mse)
Methods <- c("Linear Regression", "Random Forest")
metrictable <- data.frame(Methods, MSE)
metrictable %>% kable(caption="Test set MSE values attained with each method.")

Test set MSE values attained with each method.
Methods	MSE
Linear Regression	2.098765
Random Forest	0.880979

Our random forest model achieves a much lower test set MSE than our linear regression model (Table 7). Random forest may be performing better due to the multicollinearity between predictors we discovered in the exploratory data analysis section. Random forest can handle complex interactions between features and its predictions are largely insensitive to multicollinearity, whereas linear regression coefficient estimates become unstable when predictors are strongly correlated.
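
To put the values in Table 7 on the scale of the response, the short sketch below converts each MSE to an RMSE (in years) and computes the relative reduction in MSE. The two MSE values are copied from Table 7; the object names `mse`, `rmse` and `reduction` are introduced here for illustration only.

```r
# test set MSE values reported in Table 7
mse <- c("Linear Regression" = 2.098765, "Random Forest" = 0.880979)

# RMSE is on the same scale as life expectancy (years)
rmse <- sqrt(mse)
round(rmse, 2)

# relative reduction in MSE achieved by random forest over linear regression
reduction <- 1 - mse[["Random Forest"]] / mse[["Linear Regression"]]
round(reduction, 2)
```

This gives an RMSE of about 1.45 years for linear regression and about 0.94 years for random forest, a reduction in MSE of roughly 58%.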

Despite this better performance, the linear regression results are still valuable as they provide interpretable coefficient estimates, giving the direction and magnitude of each variable's effect. We therefore compare the variables identified by both methods below.

# important RF variables
optimal_rf_predictors <- rf_gridsearch$finalModel$mtry
sorted_df <- rf_imp_df[order(rf_imp_df$IncNodePurity, decreasing = TRUE), ]
important_rf_variables <- head(sorted_df$Variables, n = optimal_rf_predictors)

# order LR variables by importance (p-value)
summary_table <- summary(lm3)
p_values <- summary_table$coefficients[-1, "Pr(>|t|)"]
sorted_lr_vars <- names(p_values)[order(p_values)]

# pad RF variables with blanks to match the length of the LR list for the table
n_pad <- length(sorted_lr_vars) - length(important_rf_variables)
important_rf_variables <- c(important_rf_variables, rep(" ", n_pad))

# table of both
kable(data.frame(sorted_lr_vars, important_rf_variables), 
      col.names = c("Linear Regression", "Random Forest"),
      caption="Important variables identified by each method, 
      listed in order of significance.")

Important variables identified by each method, listed in order of significance.
Linear Regression	Random Forest
Adult_mortality	Under_five_deaths
Under_five_deaths	Adult_mortality
GDP_per_capita	Infant_deaths
BMI	Incidents_HIV
Infant_deaths	BMI
Alcohol_consumption	GDP_per_capita
Incidents_HIV	Schooling
Schooling
Economy_status_Developed1
Thinness_five_nine_years
Hepatitis_B

Adult mortality and under five deaths occupy the top two places in both lists given in Table 8, indicating their strong significance in predicting life expectancy. GDP per capita, BMI, infant deaths, HIV incidents and schooling are also identified by both methods, although their rankings vary. Alcohol consumption, economy status, hepatitis B and thinness 5-9 years appear only in the linear regression list and are not among the top variables of the random forest model. It is worth noting that three of these four variables are the three least significant in linear regression.
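
The overlap described above can be checked with base R set operations. In the sketch below the two variable lists are typed in by hand from Table 8 so that the example is self-contained; the names `lr_vars`, `rf_vars`, `shared` and `lr_only` are introduced here for illustration.

```r
# variables as listed in Table 8, in order of significance
lr_vars <- c("Adult_mortality", "Under_five_deaths", "GDP_per_capita", "BMI",
             "Infant_deaths", "Alcohol_consumption", "Incidents_HIV",
             "Schooling", "Economy_status_Developed1",
             "Thinness_five_nine_years", "Hepatitis_B")
rf_vars <- c("Under_five_deaths", "Adult_mortality", "Infant_deaths",
             "Incidents_HIV", "BMI", "GDP_per_capita", "Schooling")

# variables identified by both methods
shared <- intersect(lr_vars, rf_vars)

# variables that appear only in the linear regression list
lr_only <- setdiff(lr_vars, rf_vars)

# rank of each shared variable in the two lists, side by side
data.frame(Variable = shared,
           LR_rank = match(shared, lr_vars),
           RF_rank = match(shared, rf_vars))
```

Seven variables are shared, and the four in `lr_only` are the ones exclusive to linear regression.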

The linear regression coefficient estimates are given in Table 5. Adult mortality, under five deaths, BMI, infant deaths, thinness 5-9 years and hepatitis B all have negative coefficients, meaning that, in general, countries with lower values of these variables have higher life expectancy, holding the other variables constant. On the other hand, GDP per capita, alcohol consumption, HIV incidents, schooling and being a developed country have positive relationships with life expectancy.

Interpretations & Recommendations

The models above provide valuable insights into the factors influencing life expectancy. In this section, we interpret 7 of the most important variables identified above, based on their coefficients in Table 5, and consequently provide recommendations on how to strategically allocate resources to bridge the life expectancy gap between different countries to ultimately improve overall health outcomes.

Adult mortality, under five deaths and infant deaths: Adult mortality and under five deaths have the strongest relationships with life expectancy, with infant deaths closely following in both lists. Their negative associations indicate that higher death rates among adults, children and infants equate to a reduction in life expectancy. The relationship can be quantitatively estimated by interpreting the magnitude of the coefficient estimates. Adult mortality and under five deaths have coefficients of -0.047 and -0.060 respectively. Since the data points are given in number of deaths per 1000 people, for every 1 extra death per 1000 people due to adult mortality, the average life expectancy of that country decreases by 0.047 years, holding the other variables constant. Similarly, for every 1 extra death per 1000 children under the age of five, the average life expectancy decreases by 0.060 years, holding the other variables constant. Addressing these areas could lead to significant improvements in life expectancy. Plans to reduce these mortality rates would need to concentrate on investing more in the healthcare system. This could include increasing the number of healthcare facilities, improving healthcare infrastructure, and ensuring essential medicines and supplies are accessible and available. A focus could be put on improving healthcare during childbirth to reduce maternal and infant mortality rates. This could include enhancing access to prenatal care, skilled birth attendants and emergency obstetric services.
BMI: Surprisingly, BMI also shows a negative relationship with life expectancy, suggesting that BMI should be reduced to increase life expectancy. However, having a very low BMI can lead to health concerns such as malnutrition and weakened muscles and organs. It is therefore not desirable to aim for low BMIs, but rather for healthy, mid-range BMIs. To attain these optimal levels amongst a population, policymakers should focus on promoting and providing access to healthy nutrition and balanced diets.
GDP per capita: A positive relationship exists between GDP per capita and life expectancy. Higher GDP per capita reflects better economic resources and development, enabling countries to invest in healthcare infrastructure, education, and social programs. Policymakers should prioritise strategies that foster economic growth and equitable distribution of resources, as these can positively impact life expectancy.
Alcohol consumption: Also surprisingly, the model suggests a positive relationship between alcohol consumption and life expectancy. However, we recognise that this relationship might be influenced by various other factors, such as affordability and accessibility of alcohol, or cultural differences. Policymakers in countries with excessive alcohol consumption should focus on implementing effective alcohol control policies, ensuring responsible consumption and addressing the potential health risks excessive drinking may cause.
HIV incidents: The positive relationship between HIV incidents and life expectancy is also unexpected. This contradicts the negative relationship that appears when the two variables are plotted against each other (Figure 8). The many other factors that are taken into account simultaneously in the linear regression model may be counterbalancing the negative impact of HIV incidents on life expectancy, leading to an unexpected positive relationship. Perhaps some countries with many HIV incidents have high levels of healthcare access and advanced medical treatments that contribute to better overall health outcomes for people living with the virus. It is crucial to note that this result should not be interpreted as an endorsement for increasing HIV incidents. Policymakers must continue efforts to prevent and manage HIV/AIDS through comprehensive healthcare services, awareness campaigns, and access to appropriate treatment and support.
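
As a worked example of the coefficient interpretation above, the sketch below applies the two coefficients quoted in the text to a hypothetical scenario in which both death rates fall by 10 per 1000. The scenario and the object names are assumptions made for illustration, and the result is an association implied by the linear model rather than a causal estimate.

```r
# coefficient estimates quoted in the text (years per death per 1000)
coefs <- c(Adult_mortality = -0.047, Under_five_deaths = -0.060)

# hypothetical scenario: both rates fall by 10 deaths per 1000
change <- c(Adult_mortality = -10, Under_five_deaths = -10)

# expected change in life expectancy for each variable, others held constant
effect <- coefs * change
effect

# combined expected change in years
sum(effect)
```

Under this scenario the model implies a gain of 0.47 years from adult mortality and 0.60 years from under five deaths, about 1.07 years in total.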

ggplot(lifeexp, aes(x=Incidents_HIV, y=Life_expectancy))+
  geom_point()+
  geom_smooth(method = 'lm')+
  theme_bw()

Life expectancy against HIV incidents.
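
One way a coefficient can change sign once correlated predictors are included is illustrated below with simulated data. This is only a sketch of the general mechanism: the variables `x1`, `x2` and `y` are simulated and hypothetical, and the example does not establish that this is what happens with HIV incidents in our linear regression model.

```r
set.seed(1)
n <- 500

# simulated predictors: x1 is strongly correlated with x2
x2 <- rnorm(n)
x1 <- x2 + rnorm(n, sd = 0.5)

# simulated response: x1 has a positive effect, x2 a larger negative one
y <- 1 * x1 - 3 * x2 + rnorm(n)

# on its own, x1 shows a negative slope because it stands in for x2
simple <- lm(y ~ x1)

# once x2 is included, the coefficient of x1 becomes positive
multi <- lm(y ~ x1 + x2)

coef(simple)["x1"]
coef(multi)["x1"]
```

Here `x1` plays the role of a variable that looks harmful when plotted alone, yet has a positive coefficient after adjusting for a correlated variable with a stronger negative effect.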

It is important to recognise that these recommendations should be tailored to the specific context and needs of each country. For example, countries with relatively high infant deaths should allocate more resources to the childbirth healthcare schemes outlined above, rather than to ensuring responsible alcohol consumption when very little alcohol is even available. We explore an actual example of this below.

Zimbabwe Case Study

Zimbabwe has one of the lowest life expectancies in Africa, as seen in Figure 9. In this section, we aim to understand why this is the case using our important variables identified above, and thus recommend how Zimbabwe could improve its life expectancy.

lifeexp %>% 
  filter(Region == 'Africa') %>% 
  ggplot(aes(x = reorder(Country, -Life_expectancy), y = Life_expectancy, 
             fill = Country == 'Zimbabwe')) +
  geom_col() +
  theme_bw()+
  scale_fill_manual(values = c("FALSE" = "darkgrey", "TRUE" = "red")) + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Countries")+
  guides(fill = 'none')

Life expectancies of African countries. Zimbabwe is highlighted in red. The values on the y-axis reflect the cumulative sum of average life expectancies over the 16-year data range.

# helper: column graph of one variable across African countries,
# with Zimbabwe highlighted in red
plot_africa <- function(var) {
  lifeexp %>% 
    filter(Region == 'Africa') %>% 
    ggplot(aes(x = reorder(Country, -.data[[var]]), y = .data[[var]], 
               fill = Country == 'Zimbabwe')) +
    geom_col() +
    theme_bw() +
    scale_fill_manual(values = c("FALSE" = "darkgrey", "TRUE" = "red")) + 
    theme(axis.text.x = element_blank()) +
    labs(x = "Countries", y = var)
}

c1 <- plot_africa("Adult_mortality")
c2 <- plot_africa("Under_five_deaths")
c3 <- plot_africa("BMI")
c4 <- plot_africa("Alcohol_consumption")
c5 <- plot_africa("Incidents_HIV")
c6 <- plot_africa("GDP_per_capita")

ggarrange(c1, c2, c3, c4, c5, c6, legend = 'none')

Column graphs representing how African countries rank considering 6 important variables, with Zimbabwe highlighted in red.

The column graphs in Figure 10 show that Zimbabwe has around average alcohol consumption, GDP per capita and BMI values when compared to other African countries. However, Zimbabwe has the highest adult mortality rate and almost the highest number of HIV incidents, yet nowhere near the highest under five death rate. We therefore suggest Zimbabwe’s poor life expectancy compared to the rest of Africa is predominantly due to its adult mortality rate, which may be caused by the HIV epidemic. The country should thus prioritise investment in HIV prevention and treatment programs, as well as general healthcare infrastructure and services. It would not be strategic to emphasise the development of child and infant healthcare, alcohol limitation, healthy eating or economic growth, as these are relatively less severe issues.

This case study exemplifies how struggling countries and policymakers can utilise the influential variables we uncovered to detect the driving force of their poor life expectancies. Obviously all of the recommendations we provided above could not be implemented at once, so countries should be targeting interventions where they are most needed. For Zimbabwe, this should be adult healthcare and HIV prevention.
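
The ranking logic behind this case study can be sketched in base R as follows. The data frame `toy` contains made-up values for three hypothetical countries and is not taken from our data set; the sketch averages each variable per country over the years and then ranks the countries, which keeps the values on their original scale.

```r
# toy panel data (illustrative values only): three countries, two years each
toy <- data.frame(
  Country = rep(c("A", "B", "C"), each = 2),
  Adult_mortality = c(300, 320, 150, 170, 400, 420),
  Under_five_deaths = c(90, 80, 60, 50, 40, 30)
)

# average each variable per country over the years
means <- aggregate(cbind(Adult_mortality, Under_five_deaths) ~ Country,
                   data = toy, FUN = mean)

# rank countries on each variable (1 = highest value)
ranks <- data.frame(
  Country = means$Country,
  Adult_mortality_rank = rank(-means$Adult_mortality),
  Under_five_deaths_rank = rank(-means$Under_five_deaths)
)
ranks
```

In this toy example, country C ranks first for adult mortality but last for under five deaths, which is the kind of pattern that would point towards prioritising adult healthcare over child healthcare.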

Conclusion & Future Work

Both random forest and linear regression models brought unique advantages to our analysis and both allowed us to identify the most important variables that influence life expectancy. Random forest handled the outliers and multicollinear predictors present in our data set, providing more robust and more accurate predictions. Linear regression allowed us to assess the direction and strength of the relationships between the predictors and life expectancy, and confirm the impact of the predictors random forest identified. The two models complemented each other well in this project, providing us with a comprehensive understanding of the predictors of life expectancy.

By combining the results from both models, we were able to identify a set of variables that consistently appeared as significant predictors of life expectancy. These variables include adult mortality, under five deaths, GDP per capita, BMI, infant deaths and HIV incidents. For each of these, we provided recommendations to policymakers to improve life expectancy outcomes. This included enhancing healthcare systems, improving economic conditions and promoting healthy lifestyle choices.

As these recommendations were quite general and cannot be applied all at once, we wish to reinforce that policymakers should tailor these findings to their specific needs. Nevertheless, policymakers also must understand that life expectancy is influenced by a multitude of interrelated factors that may require comprehensive and multidimensional approaches. By adopting a holistic approach when considering our results and the recommendations provided, they can work towards achieving sustainable improvements in life expectancy and overall well-being for their populations over time.

In terms of future work, the feature importance rankings may differ between countries, so fitting separate models for individual countries or regions could be worthwhile. We also did not incorporate the region variable in our analysis. Future work may find value in doing so, and we suspect it would increase prediction accuracy given the disparity seen between regions in our exploratory data analysis section. Finally, since we discovered that adult mortality and under five deaths are the most significant variables, investigating what is actually causing these deaths (e.g. a specific disease or maternal complications) would be extremely valuable to countries, as more targeted and specific preventive schemes could then be devised.

References

Bali, V., Aggarwal, D., Singh, S. and Shukla, A. (2021), ‘Life Expectancy: Prediction & Analysis using ML’, Conference: 2021 9th International Conference on Reliability, Infocom Technologies and Optimization, 1-8. DOI:10.1109/ICRITO51393.2021.9596123.

Ford, E., Zhao, G., Tsai, J. and Li, C. (2011), ‘Low-Risk Lifestyle Behaviors and All-Cause Mortality: Findings From the National Health and Nutrition Examination Survey III Mortality Study’, American Journal of Public Health 101, 1922-1929.

Hernandez, L., Blazer, D. et al. (2006), ‘The Impact of Social and Cultural Environment on Health’, Institute of Medicine (US) Committee on Assessing Interactions Among Social, Behavioral, and Genetic Factors in Health; Genes, Behavior, and the Social Environment: Moving Beyond the Nature/Nurture Debate 2.

Kerdprasop, N., Kerdprasop, K. and Chuaybamroong, P. (2020), ‘Economic and Environmental Analysis of Life Expectancy in China and India: A Data Driven Approach’, Advances in Science, Technology and Engineering Systems Journal 5(5), 308-313.

Levantesi, S. and Nigri, A. (2020), ‘A random forest algorithm to improve the Lee–Carter mortality forecasting: impact on q-forward’, Soft Computing 24(12), 8553–8567.

Life Expectancy (WHO) | Kaggle. https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who (accessed Apr. 16, 2023).

Life Expectancy (WHO) Fixed | Kaggle. https://www.kaggle.com/datasets/lashagoch/life-expectancy-who-updated (accessed Apr. 16, 2023).

Mackenbach, J. et al. (2019), ‘Determinants of inequalities in life expectancy: an international comparative study of eight risk factors’, The Lancet Public Health 4(10), 529-537.

Mondal, M. and Shitan, M. (2013), ‘Impact of Socio-Health Factors on Life Expectancy in the Low and Lower Middle Income Countries’, Iran J Public Health 42(12), 1354-1362.

Raphael, B., Ronmi, A. and Prasad, D. (2023), ‘How can Artificial Intelligence and Data Science Algorithms predict Life Expectancy - An empirical investigation spanning 193 countries’, International Journal of Information Management 3.

Senate Community Affairs References Committee. (2014), ‘Bridging our growing divide: inequality in Australia’. ISBN 978-1-76010-114-5.

Predicting Human Life Expectancy

Tara Lind
