Source: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/data
The dataset I found on Kaggle focuses on factors affecting life expectancy, including demographic variables, income composition, mortality rates, immunization, and the Human Development Index (HDI). It uses data from 193 countries over a period from 2000 to 2015. The dataset aims to fill gaps in previous studies by including significant immunization factors like Hepatitis B, Polio, and Diphtheria, and considering data from a longer period. This comprehensive dataset, compiled from the World Health Organization (WHO) and United Nations, includes 22 columns and 2938 rows. The variables are categorized into immunization-related factors, mortality factors, economic factors, and social factors. The study provides insights for countries to identify key areas affecting life expectancy and suggests improvements in health policies and interventions.
In the beginning, I loaded the necessary packages and the dataset. I also performed basic data cleaning steps, such as removing rows with NA values and dropping irrelevant columns. This is crucial for ensuring data quality before any further analysis.
# Load necessary packages
library(glmnet)
library(ggplot2)
library(ggpubr)
library(Metrics)
library(MASS)
library(car)
library(kableExtra)
# Load the data set
life.ex <- read.csv("C:/Users/qu981/OneDrive/桌面/Project 3/life expentancy data.csv")
# View Data Structure
str(life.ex)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 54.8 55.3 56.2 56.7 57 57.3 57.3 57.5 58.1 58.6 ...
## $ Adult.Mortality : int 321 316 3 295 293 291 295 295 287 281 ...
## $ Infant.d : int 88 88 88 87 87 85 84 82 80 77 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.02 0.03 0.01 ...
## $ Expenditure.p : num 10.4 10.6 16.9 11.1 15.3 ...
## $ Hepatitis.B : int 62 63 64 65 67 66 64 63 64 63 ...
## $ Measles : int 6532 8762 2486 798 466 1296 1990 1141 1599 2861 ...
## $ BMI : num 12.2 12.6 13 13.4 13.8 14.2 14.7 15.2 15.7 16.2 ...
## $ Under.five.d : int 122 122 122 122 120 118 116 113 110 106 ...
## $ Polio : int 24 35 36 41 5 58 58 63 64 63 ...
## $ Total.expenditure: num 8.2 7.8 7.76 8.82 8.79 8.7 7.43 6.73 8.33 9.42 ...
## $ Diphtheria : int 24 33 36 41 5 58 58 63 64 63 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 115 117 188 199 219 ...
## $ Population : num 293756 2966463 21979923 2364851 24118979 ...
## $ Thinnes.1to19 : num 2.3 2.1 19.9 19.7 19.5 19.3 19.2 19 18.8 18.6 ...
## $ Thinness.5to9 : num 2.5 2.4 2.2 19.9 19.7 19.5 19.3 19.1 18.9 18.7 ...
## $ Income.cor : num 0.338 0.34 0.341 0.373 0.381 0.396 0.405 0.415 0.433 0.434 ...
## $ Schooling : num 5.5 5.9 6.2 6.5 6.8 7.9 8.1 8.4 8.7 8.9 ...
This dataset contains 22 variables and 2938 observations. Among them, 2 are categorical (Country and Status), and the remaining 20, including Year, are stored as numeric (integer or double).
# Checking NA values
colSums(is.na(life.ex))
## Country Year Status Life.expectancy
## 0 0 0 10
## Adult.Mortality Infant.d Alcohol Expenditure.p
## 10 0 194 0
## Hepatitis.B Measles BMI Under.five.d
## 553 0 34 0
## Polio Total.expenditure Diphtheria HIV.AIDS
## 19 226 19 0
## GDP Population Thinnes.1to19 Thinness.5to9
## 448 652 34 34
## Income.cor Schooling
## 167 163
# Remove rows contain NA(s)
n.data <- na.omit(life.ex)
Removing rows with NA values ensures the analysis is based on complete cases, which helps maintain data integrity and accuracy.
# Remove unusable features
n.data <- n.data[, !(names(n.data) %in% c("Country", "Year"))]
The “Country” and “Year” columns are removed because they are not directly relevant to this analysis, which focuses on the factors affecting life expectancy in general rather than on country-specific or time-specific effects.
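To see how much data the cleaning step discards, a quick dimension check is useful (a sketch; these counts are not printed in the original output, though the 1313 residual degrees of freedom reported later imply roughly 1,649 complete rows remain out of 2,938).
# Dimensions before and after cleaning (rows x columns)
dim(life.ex)  # 2938 x 22 before cleaning
dim(n.data)   # complete cases only, with Country and Year removed (20 columns)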
LASSO regression is used to identify the most important features in the dataset. Cross-validation determines the optimal lambda value, which reduces overfitting and enhances model interpretability by selecting only the most relevant variables.
# Generate design matrix using model.matrix and remove intercept
x <- model.matrix(Life.expectancy ~ ., data = n.data)[, -1]
y <- n.data$Life.expectancy
# Perform LASSO regression with cross-validation
set.seed(123)
lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
# Plot cross-validation results
plot(lasso)
# View the best lambda
lambda.1se <- lasso$lambda.1se
lambda.1se
## [1] 0.03495256
I chose lambda.1se instead of lambda.min because lambda.1se gives a simpler model with fewer non-zero coefficients, which helps prevent overfitting and offers a better balance between model complexity and generalizability.
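For a concrete sense of that difference, the sparsity of the two candidate models can be compared directly from the cross-validated fit (a sketch using the lasso object above; the exact counts depend on the CV folds and are not part of the original output).
# Count non-zero coefficients (excluding the intercept) at each candidate lambda
n.min <- sum(coef(lasso, s = "lambda.min")[-1] != 0)
n.1se <- sum(coef(lasso, s = "lambda.1se")[-1] != 0)
cat("Non-zero coefficients at lambda.min:", n.min, "\n")
cat("Non-zero coefficients at lambda.1se:", n.1se, "\n")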
# Fit the model using the best lambda
model.1se <- glmnet(x, y, alpha = 1, lambda = lambda.1se)
coef.1se <- coef(model.1se)
coef.1se
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.344786e+01
## StatusDeveloping -7.822316e-01
## Adult.Mortality -1.779206e-02
## Infant.d .
## Alcohol -1.046294e-01
## Expenditure.p 3.316084e-04
## Hepatitis.B -3.278626e-03
## Measles 1.058403e-05
## BMI 3.503956e-02
## Under.five.d -2.329758e-03
## Polio 8.756226e-03
## Total.expenditure 5.423603e-02
## Diphtheria 1.743139e-02
## HIV.AIDS -4.353962e-01
## GDP 1.119383e-05
## Population 1.783431e-09
## Thinnes.1to19 -1.957785e-02
## Thinness.5to9 -1.409825e-02
## Income.cor 1.026360e+01
## Schooling 8.825419e-01
I fitted a LASSO regression model using the glmnet function and selected features based on the lambda.1se value. By examining the absolute values of the coefficients, regardless of their direction, I assessed the strength of each feature’s relationship with life expectancy and ranked them accordingly. (Note that glmnet reports coefficients on the original scale of each predictor, so this ranking reflects variable scale as well as effect size.)
# Get the importance of features
importance <- abs(coef.1se[-1])
features <- rownames(coef.1se)[-1]
# Create data frame for important features
i.df <- data.frame(Feature = features, Importance = importance)
i.df <- i.df[order(i.df$Importance, decreasing = TRUE), ]
# Plot feature importance
# Plot feature importance
ggplot(i.df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "brown1") +
  coord_flip() +
  labs(title = "Feature Importance from LASSO Regression", x = "Features", y = "Importance") +
  theme(plot.title = element_text(hjust = 0.5, family = "Arial", face = "bold"),
        axis.title.y = element_text(color = "black", size = 12),
        axis.title.x = element_text(color = "black", size = 12))
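For reference, the top of the ranked data frame lists the five features used in the EDA and regression models below (StatusDeveloping is the dummy coding of Status):
# Top 5 features by absolute LASSO coefficient
head(i.df, 5)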
Next, I conducted exploratory data analysis (EDA) to uncover patterns, trends, and potential outliers in the relationships between the top five important features and life expectancy (four scatter plots and one box plot).
# Apply theme_minimal() before theme() so the bold title settings are not overridden
ggplot(n.data, aes(x = Income.cor, y = Life.expectancy)) +
  geom_point(color = "blue3", shape = 16, size = 3, alpha = 0.6) +
  labs(title = "Income.cor vs Life Expectancy", x = "Income.cor", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
There is a clear positive correlation between Income.cor and life expectancy. Most data points follow an upward trend, though a few outliers show low life expectancy despite higher income composition.
ggplot(n.data, aes(x = Schooling, y = Life.expectancy)) +
  geom_point(color = "orange", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "Schooling vs Life Expectancy", x = "Schooling", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
The scatter plot also shows a clear positive relationship between schooling and life expectancy: as years of schooling increase, life expectancy tends to rise. However, some observations with less education still achieve high life expectancy, suggesting that additional factors contribute to longevity.
ggplot(n.data, aes(x = HIV.AIDS, y = Life.expectancy)) +
  geom_point(color = "green4", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "HIV.AIDS vs Life Expectancy", x = "HIV.AIDS", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
The plot suggests a negative correlation, but most data points are clustered near zero on the x-axis, making it difficult to identify a distinct pattern across the dataset.
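One way to spread out that left-hand cluster is to put HIV.AIDS on a log scale; a minimal sketch (using log1p to handle values near zero, anticipating the log transformation applied later):
# Same relationship with HIV.AIDS on a log1p scale to spread the near-zero cluster
ggplot(n.data, aes(x = log1p(HIV.AIDS), y = Life.expectancy)) +
  geom_point(color = "green4", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "log(1 + HIV.AIDS) vs Life Expectancy",
       x = "log(1 + HIV.AIDS)", y = "Life Expectancy") +
  theme_minimal()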
ggplot(n.data, aes(x = Alcohol, y = Life.expectancy)) +
  geom_point(color = "deeppink", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "Alcohol vs Life Expectancy", x = "Alcohol", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
This relationship is also not clearly defined. The data points are widely dispersed, suggesting that alcohol consumption alone does not determine life expectancy.
ggboxplot(n.data, x = "Status", y = "Life.expectancy",
color = "Status", palette = c("brown1","green3"),
title = "Life Expectancy by Status",
xlab = "Status", ylab = "Life Expectancy")
There is an obvious distinction in life expectancy between developing and developed countries. Developed countries show higher life expectancy with less variability, while developing countries have a lower median life expectancy and a broader range. Notably, some developing countries exhibit extremely low life expectancy, highlighting severe disparities and potential data quality issues.
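A quick numeric check of the medians by status backs up the visual comparison (a sketch; the exact values are not shown in the original output):
# Median life expectancy by development status
aggregate(Life.expectancy ~ Status, data = n.data, FUN = median)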
Using the 5 features selected by the LASSO regression, I fit a linear regression model to predict life expectancy. The model’s coefficients and summary statistics provide insights into the relationships between the predictors and life expectancy.
# LASSO Selected features
selected.data <- subset(n.data, select = c(Life.expectancy, Income.cor, Schooling,
Status, HIV.AIDS, Alcohol))
# Split data set
set.seed(123)
index <- sample(x = nrow(selected.data), size = nrow(selected.data) * 0.8)
train <- selected.data[index, ]
test <- selected.data[-index, ]
# Fit the model for further selection
model1 <- lm(Life.expectancy ~ ., data = train)
summary(model1)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.692 -2.703 0.102 2.844 13.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.60737 0.82439 58.962 < 2e-16 ***
## Income.cor 14.10698 1.02719 13.734 < 2e-16 ***
## Schooling 1.31519 0.07101 18.522 < 2e-16 ***
## StatusDeveloping -2.37264 0.42091 -5.637 2.12e-08 ***
## HIV.AIDS -0.59272 0.01956 -30.305 < 2e-16 ***
## Alcohol -0.19686 0.04120 -4.779 1.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.221 on 1313 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7702
## F-statistic: 884.5 on 5 and 1313 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(model1)
# Multicollinearity
vif(model1)
## Income.cor Schooling Status HIV.AIDS Alcohol
## 2.677647 2.927047 1.673182 1.097298 2.073824
# Outlier test
outlierTest(model1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 2305 -3.50149 0.00047819 0.63073
# Predict test & train set
test.predict <- predict(model1, newdata = test)
train.predict <- predict(model1, newdata = train)
# Calculate AIC
AIC <- round(AIC(model1))
# Calculate Adjusted R-squared
adj.r <- round(summary(model1)$adj.r.squared, 2)
# Calculate RMSE
rmse.test <- round(rmse(test$Life.expectancy, test.predict), 3)
rmse.train <- round(rmse(train$Life.expectancy, train.predict), 3)
# Calculate RMSE Gap
rmse.gap <- round(rmse.test - rmse.train,3)
# Print the model performance indicators
cat("AIC:", AIC, "\n")
cat("Adjusted R-squared:", adj.r, "\n")
cat("Test RMSE:", rmse.test, "\n")
cat("Train RMSE:", rmse.train, "\n")
cat("RMSE Gap", rmse.gap, "\n")
## AIC: 7550
## Adjusted R-squared: 0.77
## Test RMSE: 4.267
## Train RMSE: 4.212
## RMSE Gap 0.055
The residual plots reveal heteroscedasticity and slight non-linearity. The VIF values are low, indicating no serious multicollinearity. The outlier test flags no observation with a Bonferroni p-value below 0.05, so no single point appears unduly influential. Thus, to address the heteroscedasticity, transforming the features is a suitable next step for model improvement.
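The visual diagnosis can also be corroborated with a formal test; since car is already loaded, ncvTest() provides a score test for non-constant error variance (a sketch; its result is not part of the original output):
# Score test for non-constant error variance; a small p-value supports the
# heteroscedasticity seen in the residual plots
ncvTest(model1)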
A log transformation was applied to address the heteroscedasticity, since it helps stabilize the variance and improve model fit. Here, I re-fit the model with the transformed features and evaluate its performance.
# Create new vectors for transformed features
n.data$log.Income.cor <- log(n.data$Income.cor + 1)
n.data$log.Schooling <- log(n.data$Schooling + 1)
n.data$log.HIV.AIDS <- log(n.data$HIV.AIDS + 1)
n.data$log.Alcohol <- log(n.data$Alcohol + 1)
selected.data2 <- subset(n.data, select = c(Life.expectancy, log.Income.cor, log.Schooling,
Status, log.HIV.AIDS, log.Alcohol))
# Split data set
set.seed(123)
index2 <- sample(x = nrow(selected.data2), size = nrow(selected.data2) * 0.8)
train2 <- selected.data2[index2, ]
test2 <- selected.data2[-index2, ]
# Fit the model for further selection
model2 <- lm(Life.expectancy ~ ., data = train2)
summary(model2)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7931 -2.2911 0.2661 2.1286 15.5769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7937 1.6093 24.107 < 2e-16 ***
## log.Income.cor 13.5392 1.2155 11.139 < 2e-16 ***
## log.Schooling 11.1255 0.7031 15.824 < 2e-16 ***
## StatusDeveloping -2.1574 0.3452 -6.249 5.56e-10 ***
## log.HIV.AIDS -6.1170 0.1450 -42.174 < 2e-16 ***
## log.Alcohol 0.4737 0.1573 3.011 0.00265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.746 on 1313 degrees of freedom
## Multiple R-squared: 0.8197, Adjusted R-squared: 0.819
## F-statistic: 1194 on 5 and 1313 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(model2)
# Multicollinearity
vif(model2)
## log.Income.cor log.Schooling Status log.HIV.AIDS log.Alcohol
## 2.178615 2.492001 1.428985 1.281164 1.755818
# Outlier test
outlierTest(model2)
## rstudent unadjusted p-value Bonferroni p
## 763 4.207096 2.7625e-05 0.036437
# Predict test & train set
test.predict2 <- predict(model2, newdata = test2)
train.predict2 <- predict(model2, newdata = train2)
# Calculate AIC
AIC2 <- round(AIC(model2))
# Calculate Adjusted R-squared
adj.r2 <- round(summary(model2)$adj.r.squared, 2)
# Calculate RMSE
rmse.test2 <- round(rmse(test2$Life.expectancy, test.predict2), 3)
rmse.train2 <- round(rmse(train2$Life.expectancy, train.predict2), 3)
# Calculate RMSE Gap
rmse.gap2 <- round(rmse.test2 - rmse.train2,3)
# Print the model performance indicators
cat("AIC:", AIC2, "\n")
cat("Adjusted R-squared:", adj.r2, "\n")
cat("Test RMSE:", rmse.test2, "\n")
cat("Train RMSE:", rmse.train2, "\n")
cat("RMSE Gap", rmse.gap2, "\n")
## AIC: 7235
## Adjusted R-squared: 0.82
## Test RMSE: 3.866
## Train RMSE: 3.738
## RMSE Gap 0.128
Log transformation of the numerical features substantially reduced the heteroscedasticity, as seen in the residual plots. Multicollinearity remained low, and the outlier test flagged observation 763 as a significant outlier (Bonferroni p < 0.05), so I removed it to further improve the model’s accuracy.
# Remove the outlier flagged above (763)
n.train2 <- train2[-763, ]
n.selected.data2 <- subset(n.train2, select = c(Life.expectancy, log.Income.cor, log.Schooling,
Status, log.HIV.AIDS, log.Alcohol))
# Split data set
set.seed(123)
n.index2 <- sample(x = nrow(n.selected.data2), size = nrow(n.selected.data2) * 0.8)
n.train2 <- n.selected.data2[n.index2, ]
n.test2 <- n.selected.data2[-n.index2, ]
# Fit the model for further selection
n.model2 <- lm(Life.expectancy ~ ., data = n.train2)
summary(n.model2)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = n.train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7100 -2.2438 0.2779 2.1226 11.4901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.8683 1.7642 20.898 < 2e-16 ***
## log.Income.cor 12.2658 1.2259 10.006 < 2e-16 ***
## log.Schooling 12.0957 0.7597 15.923 < 2e-16 ***
## StatusDeveloping -1.9961 0.3749 -5.325 1.24e-07 ***
## log.HIV.AIDS -6.1706 0.1594 -38.723 < 2e-16 ***
## log.Alcohol 0.4604 0.1728 2.665 0.00782 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.646 on 1048 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8262
## F-statistic: 1002 on 5 and 1048 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(n.model2)
# Multicollinearity
vif(n.model2)
## log.Income.cor log.Schooling Status log.HIV.AIDS log.Alcohol
## 2.018930 2.407147 1.434448 1.279039 1.776430
# Outlier test
outlierTest(n.model2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 2305 -3.793225 0.00015721 0.1657
# Predict test & train set
n.test.predict2 <- predict(n.model2, newdata = n.test2)
n.train.predict2 <- predict(n.model2, newdata = n.train2)
# Calculate AIC
n.AIC2 <- round(AIC(n.model2))
# Calculate Adjusted R-squared
n.adj.r2 <- round(summary(n.model2)$adj.r.squared, 2)
# Calculate RMSE
n.rmse.test2 <- round(rmse(n.test2$Life.expectancy, n.test.predict2), 3)
n.rmse.train2 <- round(rmse(n.train2$Life.expectancy, n.train.predict2), 3)
# Calculate RMSE Gap
n.rmse.gap2 <- round(n.rmse.test2 - n.rmse.train2,3)
# Print the model performance indicators
cat("AIC:", n.AIC2, "\n")
cat("Adjusted R-squared:", n.adj.r2, "\n")
cat("Test RMSE:", n.rmse.test2, "\n")
cat("Train RMSE:", n.rmse.train2, "\n")
cat("RMSE Gap", n.rmse.gap2, "\n")
## AIC: 5726
## Adjusted R-squared: 0.83
## Test RMSE: 4.141
## Train RMSE: 3.636
## RMSE Gap 0.505
After removing the flagged outlier (763), no further significant outliers were identified: the outlier test shows that observation 2305 has the largest studentized residual, but its Bonferroni p-value is well above 0.05, so I did not exclude it.
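As an optional extra check beyond outlierTest(), the largest Cook's distances in the final model (also drawn in one of the plot(n.model2) panels) can be listed directly; a minimal sketch, with values not shown in the original output:
# Five largest Cook's distances in the final model
cooks <- cooks.distance(n.model2)
head(sort(cooks, decreasing = TRUE), 5)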
# Create data frame for the comparison
comparison <- data.frame(
Model = c("Original Model", "Log Transformed", "Log Transformed + Outlier Removal"),
Adjusted.R2 = c(adj.r, adj.r2, n.adj.r2),
AIC = c(AIC, AIC2, n.AIC2),
Test.RMSE = c(rmse.test, rmse.test2, n.rmse.test2),
Train.RMSE = c(rmse.train, rmse.train2, n.rmse.train2),
RMSE.Gap = c(rmse.gap, rmse.gap2, n.rmse.gap2))
# Comparison table
kable(comparison)
| Model | Adjusted.R2 | AIC | Test.RMSE | Train.RMSE | RMSE.Gap |
|---|---|---|---|---|---|
| Original Model | 0.77 | 7550 | 4.267 | 4.212 | 0.055 |
| Log Transformed | 0.82 | 7235 | 3.866 | 3.738 | 0.128 |
| Log Transformed + Outlier Removal | 0.83 | 5726 | 4.141 | 3.636 | 0.505 |
The log transformation clearly improves on the original model across every metric. Removing the outlier raises the adjusted R-squared slightly and lowers the training RMSE, but the test RMSE and the train/test gap grow; its lower AIC also partly reflects the smaller sample after re-splitting, so the AIC values of the last two models are not directly comparable.