Source: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/data
The dataset I found on Kaggle focuses on factors affecting life expectancy, including demographic variables, income composition, mortality rates, immunization, and the Human Development Index (HDI). It uses data from 193 countries over a period from 2000 to 2015. The dataset aims to fill gaps in previous studies by including significant immunization factors like Hepatitis B, Polio, and Diphtheria, and considering data from a longer period. This comprehensive dataset, compiled from the World Health Organization (WHO) and United Nations, includes 22 columns and 2938 rows. The variables are categorized into immunization-related factors, mortality factors, economic factors, and social factors. The study provides insights for countries to identify key areas affecting life expectancy and suggests improvements in health policies and interventions.
In the beginning, I loaded the necessary packages and the dataset. I also performed basic data cleaning steps, such as removing rows with NA values and dropping irrelevant columns. This is crucial for ensuring data quality before any further analysis.
# Load necessary packages
library(glmnet)
library(ggplot2)
library(ggpubr)
library(Metrics)
library(MASS)
library(car)
library(kableExtra)
# Load the data set
life.ex <- read.csv("C:/Users/qu981/OneDrive/桌面/Project 3/life expentancy data.csv")
# View Data Structure
str(life.ex)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 54.8 55.3 56.2 56.7 57 57.3 57.3 57.5 58.1 58.6 ...
## $ Adult.Mortality : int 321 316 3 295 293 291 295 295 287 281 ...
## $ Infant.d : int 88 88 88 87 87 85 84 82 80 77 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.02 0.03 0.01 ...
## $ Expenditure.p : num 10.4 10.6 16.9 11.1 15.3 ...
## $ Hepatitis.B : int 62 63 64 65 67 66 64 63 64 63 ...
## $ Measles : int 6532 8762 2486 798 466 1296 1990 1141 1599 2861 ...
## $ BMI : num 12.2 12.6 13 13.4 13.8 14.2 14.7 15.2 15.7 16.2 ...
## $ Under.five.d : int 122 122 122 122 120 118 116 113 110 106 ...
## $ Polio : int 24 35 36 41 5 58 58 63 64 63 ...
## $ Total.expenditure: num 8.2 7.8 7.76 8.82 8.79 8.7 7.43 6.73 8.33 9.42 ...
## $ Diphtheria : int 24 33 36 41 5 58 58 63 64 63 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 115 117 188 199 219 ...
## $ Population : num 293756 2966463 21979923 2364851 24118979 ...
## $ Thinnes.1to19 : num 2.3 2.1 19.9 19.7 19.5 19.3 19.2 19 18.8 18.6 ...
## $ Thinness.5to9 : num 2.5 2.4 2.2 19.9 19.7 19.5 19.3 19.1 18.9 18.7 ...
## $ Income.cor : num 0.338 0.34 0.341 0.373 0.381 0.396 0.405 0.415 0.433 0.434 ...
## $ Schooling : num 5.5 5.9 6.2 6.5 6.8 7.9 8.1 8.4 8.7 8.9 ...
This dataset contains 22 variables and 2938 observations. Among them, 2 are categorical (Country and Status), and the remaining 20, including Year, are stored as numeric (integer or double).
# Checking NA values
colSums(is.na(life.ex))
## Country Year Status Life.expectancy
## 0 0 0 10
## Adult.Mortality Infant.d Alcohol Expenditure.p
## 10 0 194 0
## Hepatitis.B Measles BMI Under.five.d
## 553 0 34 0
## Polio Total.expenditure Diphtheria HIV.AIDS
## 19 226 19 0
## GDP Population Thinnes.1to19 Thinness.5to9
## 448 652 34 34
## Income.cor Schooling
## 167 163
# Remove rows contain NA(s)
n.data <- na.omit(life.ex)
Removing rows with NA values ensures the analysis is based on complete cases, which helps maintain data integrity and accuracy.
# Remove unusable features
n.data <- n.data[, !(names(n.data) %in% c("Country", "Year"))]
The “Country” and “Year” columns are removed because they are not directly relevant to this analysis, which focuses on the factors affecting life expectancy in general rather than on country-specific or time-specific effects.
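To see how much data the cleaning step discards, a quick dimension check is useful (a sketch; these counts are not printed in the original output, though the 1313 residual degrees of freedom reported later imply roughly 1,649 complete rows remain out of 2,938).
# Dimensions before and after cleaning (rows x columns)
dim(life.ex)  # 2938 x 22 before cleaning
dim(n.data)   # complete cases only, with Country and Year removed (20 columns)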
LASSO regression is used to identify the most important features in the dataset. Cross-validation determines the optimal lambda value, which reduces overfitting and enhances model interpretability by selecting only the most relevant variables.
# Generate design matrix using model.matrix and remove intercept
x <- model.matrix(Life.expectancy ~ ., data = n.data)[, -1]
y <- n.data$Life.expectancy
# Perform LASSO regression with cross-validation
set.seed(123)
lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
# Plot cross-validation results
plot(lasso)
# View the best lambda
lambda.1se <- lasso$lambda.1se
lambda.1se
## [1] 0.03495256
I chose lambda.1se instead of lambda.min because lambda.1se gives a simpler model with fewer non-zero coefficients, which helps prevent overfitting and offers a better balance between model complexity and generalizability.
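For a concrete sense of that difference, the sparsity of the two candidate models can be compared directly from the cross-validated fit (a sketch using the lasso object above; the exact counts depend on the CV folds and are not part of the original output).
# Count non-zero coefficients (excluding the intercept) at each candidate lambda
n.min <- sum(coef(lasso, s = "lambda.min")[-1] != 0)
n.1se <- sum(coef(lasso, s = "lambda.1se")[-1] != 0)
cat("Non-zero coefficients at lambda.min:", n.min, "\n")
cat("Non-zero coefficients at lambda.1se:", n.1se, "\n")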
# Fit the model using the best lambda
model.1se <- glmnet(x, y, alpha = 1, lambda = lambda.1se)
coef.1se <- coef(model.1se)
coef.1se
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.344786e+01
## StatusDeveloping -7.822316e-01
## Adult.Mortality -1.779206e-02
## Infant.d .
## Alcohol -1.046294e-01
## Expenditure.p 3.316084e-04
## Hepatitis.B -3.278626e-03
## Measles 1.058403e-05
## BMI 3.503956e-02
## Under.five.d -2.329758e-03
## Polio 8.756226e-03
## Total.expenditure 5.423603e-02
## Diphtheria 1.743139e-02
## HIV.AIDS -4.353962e-01
## GDP 1.119383e-05
## Population 1.783431e-09
## Thinnes.1to19 -1.957785e-02
## Thinness.5to9 -1.409825e-02
## Income.cor 1.026360e+01
## Schooling 8.825419e-01
I fitted a LASSO regression model using the glmnet function and selected features based on the lambda.1se value. By examining the absolute values of the coefficients, regardless of their direction, I assessed the strength of each feature’s relationship with life expectancy and ranked them accordingly. (Note that glmnet reports coefficients on the original scale of each predictor, so this ranking reflects variable scale as well as effect size.)
# Get the importance of features
importance <- abs(coef.1se[-1])
features <- rownames(coef.1se)[-1]
# Create data frame for important features
i.df <- data.frame(Feature = features, Importance = importance)
i.df <- i.df[order(i.df$Importance, decreasing = TRUE), ]
# Plot feature importance
# Plot feature importance
ggplot(i.df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "brown1") +
  coord_flip() +
  labs(title = "Feature Importance from LASSO Regression", x = "Features", y = "Importance") +
  theme(plot.title = element_text(hjust = 0.5, family = "Arial", face = "bold"),
        axis.title.y = element_text(color = "black", size = 12),
        axis.title.x = element_text(color = "black", size = 12))
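For reference, the top of the ranked data frame lists the five features used in the EDA and regression models below (StatusDeveloping is the dummy coding of Status):
# Top 5 features by absolute LASSO coefficient
head(i.df, 5)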
Next, I conducted exploratory data analysis (EDA) to uncover patterns, trends, and potential outliers in the relationships between the top five important features and life expectancy (four scatter plots and one box plot).
# Apply theme_minimal() before theme() so the bold title settings are not overridden
ggplot(n.data, aes(x = Income.cor, y = Life.expectancy)) +
  geom_point(color = "blue3", shape = 16, size = 3, alpha = 0.6) +
  labs(title = "Income.cor vs Life Expectancy", x = "Income.cor", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
There is a clear positive correlation between Income.cor and life expectancy. Most data points follow an upward trend, though a few outliers show low life expectancy despite higher income composition.
ggplot(n.data, aes(x = Schooling, y = Life.expectancy)) +
  geom_point(color = "orange", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "Schooling vs Life Expectancy", x = "Schooling", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
The scatter plot also shows a clear positive relationship between schooling and life expectancy: as years of schooling increase, life expectancy tends to rise. However, some observations with less education still achieve high life expectancy, suggesting that additional factors contribute to longevity.
ggplot(n.data, aes(x = HIV.AIDS, y = Life.expectancy)) +
  geom_point(color = "green4", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "HIV.AIDS vs Life Expectancy", x = "HIV.AIDS", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
The plot suggests a negative correlation, but most data points are clustered near zero on the x-axis, making it difficult to identify a distinct pattern across the dataset.
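One way to spread out that left-hand cluster is to put HIV.AIDS on a log scale; a minimal sketch (using log1p to handle values near zero, anticipating the log transformation applied later):
# Same relationship with HIV.AIDS on a log1p scale to spread the near-zero cluster
ggplot(n.data, aes(x = log1p(HIV.AIDS), y = Life.expectancy)) +
  geom_point(color = "green4", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "log(1 + HIV.AIDS) vs Life Expectancy",
       x = "log(1 + HIV.AIDS)", y = "Life Expectancy") +
  theme_minimal()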
ggplot(n.data, aes(x = Alcohol, y = Life.expectancy)) +
  geom_point(color = "deeppink", shape = 16, size = 3, alpha = 0.5) +
  labs(title = "Alcohol vs Life Expectancy", x = "Alcohol", y = "Life Expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))
This relationship is also not clearly defined. The data points are widely dispersed, suggesting that alcohol consumption alone does not determine life expectancy.
ggboxplot(n.data, x = "Status", y = "Life.expectancy",
color = "Status", palette = c("brown1","green3"),
title = "Life Expectancy by Status",
xlab = "Status", ylab = "Life Expectancy")
There is an obvious distinction in life expectancy between developing and developed countries. Developed countries show higher life expectancy with less variability, while developing countries have a lower median life expectancy and a broader range. Notably, some developing countries exhibit extremely low life expectancy, highlighting severe disparities and potential data quality issues.
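A quick numeric check of the medians by status backs up the visual comparison (a sketch; the exact values are not shown in the original output):
# Median life expectancy by development status
aggregate(Life.expectancy ~ Status, data = n.data, FUN = median)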
Using the 5 features selected by the LASSO regression, I fit a linear regression model to predict life expectancy. The model’s coefficients and summary statistics provide insights into the relationships between the predictors and life expectancy.
# LASSO Selected features
selected.data <- subset(n.data, select = c(Life.expectancy, Income.cor, Schooling,
Status, HIV.AIDS, Alcohol))
# Split data set
set.seed(123)
index <- sample(x = nrow(selected.data), size = nrow(selected.data) * 0.8)
train <- selected.data[index, ]
test <- selected.data[-index, ]
# Fit the model for further selection
model1 <- lm(Life.expectancy ~ ., data = train)
summary(model1)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.692 -2.703 0.102 2.844 13.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.60737 0.82439 58.962 < 2e-16 ***
## Income.cor 14.10698 1.02719 13.734 < 2e-16 ***
## Schooling 1.31519 0.07101 18.522 < 2e-16 ***
## StatusDeveloping -2.37264 0.42091 -5.637 2.12e-08 ***
## HIV.AIDS -0.59272 0.01956 -30.305 < 2e-16 ***
## Alcohol -0.19686 0.04120 -4.779 1.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.221 on 1313 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7702
## F-statistic: 884.5 on 5 and 1313 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(model1)
# Multicollinearity
vif(model1)
## Income.cor Schooling Status HIV.AIDS Alcohol
## 2.677647 2.927047 1.673182 1.097298 2.073824
# Outlier test
outlierTest(model1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 2305 -3.50149 0.00047819 0.63073
# Predict test & train set
test.predict <- predict(model1, newdata = test)
train.predict <- predict(model1, newdata = train)
# Calculate AIC
AIC <- round(AIC(model1))
# Calculate Adjusted R-squared
adj.r <- round(summary(model1)$adj.r.squared, 2)
# Calculate RMSE
rmse.test <- round(rmse(test$Life.expectancy, test.predict), 3)
rmse.train <- round(rmse(train$Life.expectancy, train.predict), 3)
# Calculate RMSE Gap
rmse.gap <- round(rmse.test - rmse.train,3)
# Print the model performance indicators
cat("AIC:", AIC, "\n")
cat("Adjusted R-squared:", adj.r, "\n")
cat("Test RMSE:", rmse.test, "\n")
cat("Train RMSE:", rmse.train, "\n")
cat("RMSE Gap", rmse.gap, "\n")
## AIC: 7550
## Adjusted R-squared: 0.77
## Test RMSE: 4.267
## Train RMSE: 4.212
## RMSE Gap 0.055
The residual plots reveal heteroscedasticity and slight non-linearity. The VIF values are low, indicating no serious multicollinearity. The outlier test flags no observation with a Bonferroni p-value below 0.05, so no single point appears unduly influential. Thus, to address the heteroscedasticity, transforming the features is a suitable next step for model improvement.
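The visual diagnosis can also be corroborated with a formal test; since car is already loaded, ncvTest() provides a score test for non-constant error variance (a sketch; its result is not part of the original output):
# Score test for non-constant error variance; a small p-value supports the
# heteroscedasticity seen in the residual plots
ncvTest(model1)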
A log transformation was applied to address the heteroscedasticity, since it helps stabilize the variance and improve model fit. Here, I re-fit the model with the transformed features and evaluate its performance.
# Create new vectors for transformed features
n.data$log.Income.cor <- log(n.data$Income.cor + 1)
n.data$log.Schooling <- log(n.data$Schooling + 1)
n.data$log.HIV.AIDS <- log(n.data$HIV.AIDS + 1)
n.data$log.Alcohol <- log(n.data$Alcohol + 1)
selected.data2 <- subset(n.data, select = c(Life.expectancy, log.Income.cor, log.Schooling,
Status, log.HIV.AIDS, log.Alcohol))
# Split data set
set.seed(123)
index2 <- sample(x = nrow(selected.data2), size = nrow(selected.data2) * 0.8)
train2 <- selected.data2[index2, ]
test2 <- selected.data2[-index2, ]
# Fit the model for further selection
model2 <- lm(Life.expectancy ~ ., data = train2)
summary(model2)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7931 -2.2911 0.2661 2.1286 15.5769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7937 1.6093 24.107 < 2e-16 ***
## log.Income.cor 13.5392 1.2155 11.139 < 2e-16 ***
## log.Schooling 11.1255 0.7031 15.824 < 2e-16 ***
## StatusDeveloping -2.1574 0.3452 -6.249 5.56e-10 ***
## log.HIV.AIDS -6.1170 0.1450 -42.174 < 2e-16 ***
## log.Alcohol 0.4737 0.1573 3.011 0.00265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.746 on 1313 degrees of freedom
## Multiple R-squared: 0.8197, Adjusted R-squared: 0.819
## F-statistic: 1194 on 5 and 1313 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(model2)
# Multicollinearity
vif(model2)
## log.Income.cor log.Schooling Status log.HIV.AIDS log.Alcohol
## 2.178615 2.492001 1.428985 1.281164 1.755818
# Outlier test
outlierTest(model2)
## rstudent unadjusted p-value Bonferroni p
## 763 4.207096 2.7625e-05 0.036437
# Predict test & train set
test.predict2 <- predict(model2, newdata = test2)
train.predict2 <- predict(model2, newdata = train2)
# Calculate AIC
AIC2 <- round(AIC(model2))
# Calculate Adjusted R-squared
adj.r2 <- round(summary(model2)$adj.r.squared, 2)
# Calculate RMSE
rmse.test2 <- round(rmse(test2$Life.expectancy, test.predict2), 3)
rmse.train2 <- round(rmse(train2$Life.expectancy, train.predict2), 3)
# Calculate RMSE Gap
rmse.gap2 <- round(rmse.test2 - rmse.train2,3)
# Print the model performance indicators
cat("AIC:", AIC2, "\n")
cat("Adjusted R-squared:", adj.r2, "\n")
cat("Test RMSE:", rmse.test2, "\n")
cat("Train RMSE:", rmse.train2, "\n")
cat("RMSE Gap", rmse.gap2, "\n")
## AIC: 7235
## Adjusted R-squared: 0.82
## Test RMSE: 3.866
## Train RMSE: 3.738
## RMSE Gap 0.128
Log transformation of the numerical features substantially reduced the heteroscedasticity, as seen in the residual plots. Multicollinearity remained low, and the outlier test flagged observation 763 as a significant outlier (Bonferroni p < 0.05), so I removed it to further improve the model’s accuracy.
# Remove the outlier flagged above (763)
n.train2 <- train2[-763, ]
n.selected.data2 <- subset(n.train2, select = c(Life.expectancy, log.Income.cor, log.Schooling,
Status, log.HIV.AIDS, log.Alcohol))
# Split data set
set.seed(123)
n.index2 <- sample(x = nrow(n.selected.data2), size = nrow(n.selected.data2) * 0.8)
n.train2 <- n.selected.data2[n.index2, ]
n.test2 <- n.selected.data2[-n.index2, ]
# Fit the model for further selection
n.model2 <- lm(Life.expectancy ~ ., data = n.train2)
summary(n.model2)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = n.train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7100 -2.2438 0.2779 2.1226 11.4901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.8683 1.7642 20.898 < 2e-16 ***
## log.Income.cor 12.2658 1.2259 10.006 < 2e-16 ***
## log.Schooling 12.0957 0.7597 15.923 < 2e-16 ***
## StatusDeveloping -1.9961 0.3749 -5.325 1.24e-07 ***
## log.HIV.AIDS -6.1706 0.1594 -38.723 < 2e-16 ***
## log.Alcohol 0.4604 0.1728 2.665 0.00782 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.646 on 1048 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8262
## F-statistic: 1002 on 5 and 1048 DF, p-value: < 2.2e-16
# Residual Analysis
par(mfrow = c(2, 2))
plot(n.model2)
# Multicollinearity
vif(n.model2)
## log.Income.cor log.Schooling Status log.HIV.AIDS log.Alcohol
## 2.018930 2.407147 1.434448 1.279039 1.776430
# Outlier test
outlierTest(n.model2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 2305 -3.793225 0.00015721 0.1657
# Predict test & train set
n.test.predict2 <- predict(n.model2, newdata = n.test2)
n.train.predict2 <- predict(n.model2, newdata = n.train2)
# Calculate AIC
n.AIC2 <- round(AIC(n.model2))
# Calculate Adjusted R-squared
n.adj.r2 <- round(summary(n.model2)$adj.r.squared, 2)
# Calculate RMSE
n.rmse.test2 <- round(rmse(n.test2$Life.expectancy, n.test.predict2), 3)
n.rmse.train2 <- round(rmse(n.train2$Life.expectancy, n.train.predict2), 3)
# Calculate RMSE Gap
n.rmse.gap2 <- round(n.rmse.test2 - n.rmse.train2,3)
# Print the model performance indicators
cat("AIC:", n.AIC2, "\n")
cat("Adjusted R-squared:", n.adj.r2, "\n")
cat("Test RMSE:", n.rmse.test2, "\n")
cat("Train RMSE:", n.rmse.train2, "\n")
cat("RMSE Gap", n.rmse.gap2, "\n")
## AIC: 5726
## Adjusted R-squared: 0.83
## Test RMSE: 4.141
## Train RMSE: 3.636
## RMSE Gap 0.505
After removing the flagged outlier (763), no further significant outliers were identified: the outlier test shows that observation 2305 has the largest studentized residual, but its Bonferroni p-value is well above 0.05, so I did not exclude it.
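As an optional extra check beyond outlierTest(), the largest Cook's distances in the final model (also drawn in one of the plot(n.model2) panels) can be listed directly; a minimal sketch, with values not shown in the original output:
# Five largest Cook's distances in the final model
cooks <- cooks.distance(n.model2)
head(sort(cooks, decreasing = TRUE), 5)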
# Create data frame for the comparison
comparison <- data.frame(
Model = c("Original Model", "Log Transformed", "Log Transformed + Outlier Removal"),
Adjusted.R2 = c(adj.r, adj.r2, n.adj.r2),
AIC = c(AIC, AIC2, n.AIC2),
Test.RMSE = c(rmse.test, rmse.test2, n.rmse.test2),
Train.RMSE = c(rmse.train, rmse.train2, n.rmse.train2),
RMSE.Gap = c(rmse.gap, rmse.gap2, n.rmse.gap2))
# Comparison table
kable(comparison)
| Model | Adjusted.R2 | AIC | Test.RMSE | Train.RMSE | RMSE.Gap |
|---|---|---|---|---|---|
| Original Model | 0.77 | 7550 | 4.267 | 4.212 | 0.055 |
| Log Transformed | 0.82 | 7235 | 3.866 | 3.738 | 0.128 |
| Log Transformed + Outlier Removal | 0.83 | 5726 | 4.141 | 3.636 | 0.505 |
The log transformation clearly improves on the original model across every metric. Removing the outlier raises the adjusted R-squared slightly and lowers the training RMSE, but the test RMSE and the train/test gap grow; its lower AIC also partly reflects the smaller sample after re-splitting, so the AIC values of the last two models are not directly comparable.