In the following analysis, we present a case study of average life expectancy for the populations in 127 countries during the year 2014.
The first data set, Life Expectancy (WHO), records and tracks life expectancy and other health, social, and economic factors in 193 countries between the periods of 2000-2015, comes from the Global Health Observatory (GHO) data repository under the authority of the World Health Organization (WHO). A second data set, Country Mapping - ISO, Continent, Region, was created by Kaggle user andradaolteanu for the explicit purpose of country mapping. The second data set is used solely for merging the region to the Life Expectancy (WHO) data set to determine the region of the country. Our final data set, aptly named Country.stats, contains the following variables:
The purpose of the following analysis is to generate an empirical connection between life expectancy and various social, economic, health, and geographic factors in 127 countries for the year 2014.
There will be five main components to the following analysis:
In this preliminary analysis of the data, the data will be imported, transformed, and cleaned, and two plots, a pairwise scatter plot and an exploratory graph, will help determine the relationship between the variables and develop a narrative to be explored.
Expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv",
header = TRUE) #read in data
Region <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/continents2.csv",
header = TRUE)
expectancy <- filter(Expectancy, Year == 2014) %>%
na.omit() # construct data set containing only the year 2014 and omit NAs.
Country.stats <- inner_join(expectancy, Region, by="Country") %>%
select(Country, Status, Life.expectancy, GDP, HIV.AIDS, Total.expenditure, region,Income.composition.of.resources) #merge data sets expectancy and Region and select only certain variables for testing. After omitting NAs, our data set only has 127 countries
pander(head(Country.stats))
| Country | Status | Life.expectancy | GDP | HIV.AIDS |
|---|---|---|---|---|
| Afghanistan | Developing | 59.9 | 612.7 | 0.1 |
| Albania | Developing | 77.5 | 4576 | 0.1 |
| Algeria | Developing | 75.4 | 547.9 | 0.1 |
| Angola | Developing | 51.7 | 479.3 | 2 |
| Argentina | Developing | 76.2 | 12245 | 0.1 |
| Armenia | Developing | 74.6 | 3995 | 0.1 |
| Total.expenditure | region | Income.composition.of.resources |
|---|---|---|
| 8.18 | Asia | 0.476 |
| 5.88 | Europe | 0.761 |
| 7.21 | Africa | 0.741 |
| 3.31 | Africa | 0.527 |
| 4.79 | Americas | 0.825 |
| 4.48 | Asia | 0.739 |
The following pairwise scatter plot visualizes the distributions of each of the variables and the scatter plots of the relationship between variables. An assessment of the plot reveals that the quantitative variables have the following correlation with the response variable Life.expectancy: GDP = -0.445, HIV.AIDS = -0.611, Total.expenditures = 0.332, and Income.composition.of.resources = 0.891.
ggpairs(Country.stats, columns = 2:8) # pairwise plot of all variables in data set
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following graph is meant to explore some of the variables deeper and define a narrative for the response variable and the independent variables. In particular, this graph shows the correlation between life expectancy and resource utilization with the individual points colored by their respective region, shaped by status of development, and sized by GDP per capita. What is evident on first glance is that the countries furthest to the upper-right area of the graph are disproportionately developed European and Oceanic countries with high GDP per capita (although, not all). Those in the bottom-left area of the map are disproportionately developing African countries with very low GDP per capita.
ggplot(Country.stats, aes(x=Income.composition.of.resources, y=Life.expectancy, col = region, shape=Status, size=GDP))+
geom_point()+
theme_minimal()+
labs(title="Life Expectancy as a Function of Resource Utilization in 127 Countries",
subtitle = "Shaped by Status of Development, Sized by GDP per capita (in USD), and Colored by Region",
x = "Income Composition of Resources",
y = "Life Expectancy")+
scale_color_manual(values=c("#68aed6","#4292c6","#2171b5","#08519c","#08306b"), name="Region")+
guides(size=guide_legend(
override.aes = list(color = c("azure3","azure3","azure3"))
), color=guide_legend(
override.aes = list(size=3)), shape=guide_legend(override.aes = list(size=2)))+
scale_size_continuous(name = "GDP (per capita)")
This concludes the exploratory section of the analysis. The following section will build upon this analysis by fitting three multiple linear regression models and determine which is best suited for modelling life expectancy.
Again, the following analysis will attempt to assess how the independent variables impact life expectancy by building three models: a MLR model, a log-transformation model, and a squared-transformation model. The three models will then be assessed based on a residual analysis and various goodness-of-fit measures.
The following model is a MLR model with the structure \(Y=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with Life.expectancy as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables. An analysis of the model continues below.
full.model <- lm(Life.expectancy~.-Country, data = Country.stats)
pander(summary(full.model))
| Estimate | Std. Error | t value | |
|---|---|---|---|
| (Intercept) | 43.84 | 2.695 | 16.27 |
| StatusDeveloping | -0.7017 | 1.331 | -0.5272 |
| GDP | 4.253e-06 | 2.446e-05 | 0.1739 |
| HIV.AIDS | -1.322 | 0.2615 | -5.056 |
| Total.expenditure | 0.2785 | 0.1285 | 2.167 |
| regionAmericas | 2.091 | 1.153 | 1.814 |
| regionAsia | 0.9378 | 1.046 | 0.8964 |
| regionEurope | 2.11 | 1.503 | 1.404 |
| regionOceania | 1.36 | 1.441 | 0.944 |
| Income.composition.of.resources | 38.11 | 3.399 | 11.21 |
| Pr(>|t|) | |
|---|---|
| (Intercept) | 8.085e-32 |
| StatusDeveloping | 0.5991 |
| GDP | 0.8623 |
| HIV.AIDS | 1.603e-06 |
| Total.expenditure | 0.03229 |
| regionAmericas | 0.07217 |
| regionAsia | 0.3719 |
| regionEurope | 0.163 |
| regionOceania | 0.3471 |
| Income.composition.of.resources | 2.934e-20 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 127 | 3.379 | 0.8561 | 0.845 |
par(mfrow = c(2,2))
plot(full.model)
The multiple linear regression model yields the following results:
An analysis of the residual plots to assess violations to the assumption \(\epsilon \sim U(0,\sigma^2)\) yields minor violations as the residuals are not completely randomly distributed around 0. Points 4, 12, and 105 show some irregularity and may have to be tested as outliers. Further analysis of residuals will be conducted below in section 2.3.
The following model is a log-transformed MLR model with the structure \(log(Y)=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with \(log(\)Life.expectancy\()\) as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables. A nuance of this model compared to the previous model is that in order to calculate the percent increase (or decrease) in the response for every one-unit increase in the independent variable, the following equation has to be applied to the coefficient \((e^\beta-1)*100\). An analysis of the model continues below.
log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)
pander(summary(log.model))
| Estimate | Std. Error | t value | |
|---|---|---|---|
| (Intercept) | 3.858 | 0.03987 | 96.77 |
| StatusDeveloping | -0.004144 | 0.0197 | -0.2104 |
| GDP | -2.364e-08 | 3.619e-07 | -0.06531 |
| HIV.AIDS | -0.0214 | 0.003869 | -5.53 |
| Total.expenditure | 0.003373 | 0.001902 | 1.773 |
| regionAmericas | 0.03277 | 0.01705 | 1.921 |
| regionAsia | 0.01685 | 0.01548 | 1.089 |
| regionEurope | 0.02907 | 0.02224 | 1.307 |
| regionOceania | 0.02363 | 0.02132 | 1.109 |
| Income.composition.of.resources | 0.5575 | 0.05029 | 11.09 |
| Pr(>|t|) | |
|---|---|
| (Intercept) | 1.63e-113 |
| StatusDeveloping | 0.8337 |
| GDP | 0.948 |
| HIV.AIDS | 1.975e-07 |
| Total.expenditure | 0.07878 |
| regionAmericas | 0.0571 |
| regionAsia | 0.2785 |
| regionEurope | 0.1938 |
| regionOceania | 0.2699 |
| Income.composition.of.resources | 5.864e-20 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 127 | 0.04999 | 0.8522 | 0.8408 |
par(mfrow=c(2,2))
plot(log.model)
##define a function to compute percentage change of coefficients
percent <- function(x){
(exp(x)-1)*100
}
The log-transformed multiple linear regression model yields the following results:
An analysis of the residual plots to assess violations to the assumption \(\epsilon \sim U(0,\sigma^2)\) yields minor violations as the residuals are not completely randomly distributed around 0. Points 4, 12, and 105 show some irregularity and may have to be tested as outliers. Further analysis of residuals will be conducted below in section 2.3.
The following model is a squared-transformed MLR model with the structure \(Y^2=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with \(sqrt(\)Life.expectancy\()\) as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables.
sq.model <- lm((Life.expectancy)^2~.-Country, data = Country.stats)
pander(summary(sq.model))
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1367 | 383.8 | 3.56 | 0.0005362 |
| StatusDeveloping | -160.5 | 189.6 | -0.8465 | 0.399 |
| GDP | 0.001416 | 0.003484 | 0.4064 | 0.6852 |
| HIV.AIDS | -163.9 | 37.25 | -4.401 | 2.39e-05 |
| Total.expenditure | 44.69 | 18.31 | 2.441 | 0.01615 |
| regionAmericas | 266.2 | 164.2 | 1.621 | 0.1076 |
| regionAsia | 99.17 | 149 | 0.6655 | 0.5071 |
| regionEurope | 309.4 | 214.1 | 1.445 | 0.1511 |
| regionOceania | 152.5 | 205.2 | 0.7433 | 0.4588 |
| Income.composition.of.resources | 5265 | 484.2 | 10.87 | 1.851e-19 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 127 | 481.3 | 0.8504 | 0.8389 |
par(mfrow=c(2,2))
plot(sq.model)
The squared-transformed multiple linear regression model yields the following results:
An analysis of the residual plots to assess violations to the assumption \(\epsilon \sim U(0,\sigma^2)\) yields minor violations as the residuals are not completely randomly distributed around 0. Points 12, 97, and 105 show some irregularity and may have to be tested as outliers. Further analysis of residuals will be conducted below in section 2.3.
The following analysis builds upon the analyses of the individual models and will examine the residuals of the models plotted side-by-side. Two plots for each model are provided below: one QQ-plot and one Residual vs. Fitted Values plot. Based upon the residual analysis, the log-transformed model is considered to be the best fit as the violation against the assumption \(\epsilon \sim U(0,\sigma^2)\) is the least severe. This is determined by the fact that the points in the QQ-plot are better placed upon the QQ-line than the other models and the residuals show more randomness in the Residual vs. Fitted Values plot.
#define plotting area
par(pty = "s", mfrow = c(2, 3))
#Q-Q plot for original model
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for log transformed model
qqnorm(log.model$residuals, main = "Log-Life Expectancy")
qqline(log.model$residuals)
#Q-Q plot for sq transformed model
qqnorm(sq.model$residuals, main = "Squared-Life Expectancy")
qqline(sq.model$residuals)
#Residuals vs Fitted values plots for all models
plot(full.model$residuals) %>% abline(0,0, col="red")
plot(log.model$residuals) %>% abline(0,0, col="red")
plot(sq.model$residuals) %>% abline(0,0, col="red")
The following analysis of the goodness-of-fit measures contain the measures:
The analysis concluded that the log-transformed model is the best model as it has the lowest SSE, AIC, SBC, and PRESS values. The adjusted-R-squared value is slightly lower than its peer models; however, it is not significantly lower enough to invalidate its performance.
GoF=function(m){ # m is an object: model
e = m$resid # residuals
n0 = length(e) # sample size
SSE=(m$df)*(summary(m)$sigma)^2 # sum of squared error
R.sq=summary(m)$r.squared # Coefficient of determination: R square!
R.adj=summary(m)$adj.r # Adjusted R square
MSE=(summary(m)$sigma)^2 # square error
Cp=(SSE/MSE)-(n0-2*(n0-m$df)) # Mallow's cp
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df) # Akaike information criterion: lower score means the model is more efficient (requires less information)
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df) # Schwarz Bayesian Information criterion: similar to AIC in that the lowest score is considered the most efficient
X=model.matrix(m) # design matrix of the model
H=X%*%solve(t(X)%*%X)%*%t(X) # hat matrix
d=e/(1-diag(H))
PRESS=t(d)%*%d # predicted residual error sum of squares (PRESS)- a cross-validation measure
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
tbl
}
output.sum = rbind(GoF(full.model), GoF(sq.model), GoF(log.model))
row.names(output.sum) = c("full.model", "sq.model", "log.model")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| full.model | 1.335590e+03 | 0.8560808 | 0.8450101 | 10 | 318.8236 | 347.2654 | 1.592216e+03 |
| sq.model | 2.709770e+07 | 0.8504205 | 0.8389144 | 10 | 1578.3881 | 1606.8300 | 3.261404e+07 |
| log.model | 2.923658e-01 | 0.8521880 | 0.8408179 | 10 | -751.3900 | -722.9481 | 3.461065e-01 |
We can explicitly write the final model in the following manner \[log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364*10^-8\times GDP-0.0214*\times HIV.AIDS+\]\[ 0.003373\times Total.expenditures+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] An explanation of the model and its coefficients have already been interpreted in section 2.2.2; however, an explanation of the mathematics behind interpreting the coefficients as a percentage change need to be further elaborated:
Let us presume the assumption a priori that all explanatory variables, with the exception of Status, are held constant at 0. The two countries are equally the same besides the fact that the Status of one country will be Developing, or 1, and the other country is set to Developed, or 0. Then \[log(Developing)-log(Developed)=-0.004144 \to log(\frac{Developing}{Developed})=-0.004144 \to Developing=.995856\times Developed\] The above equation can re-written in the following way: \[Developing-Developed=.995856\times Developed \to \frac {Developing-Developed}{Developed}=-0.0041436=-0.4135058\%\] The life expectancy of a developing country vis-à-vis a developed country is -0.4135058 percent. Similarly, the other regression coefficients can be interpreted, in which, StatusDeveloping, GDP, and HIV.AIDS are negatively associated with life expectancy and the rest are positively associated. The variables that are statistically significant are HIV.AIDS, Total.expenditures, regionAmericas, and Income.composition.of.resources.
pander(summary(log.model)$coef)
| Estimate | Std. Error | t value | |
|---|---|---|---|
| (Intercept) | 3.858 | 0.03987 | 96.77 |
| StatusDeveloping | -0.004144 | 0.0197 | -0.2104 |
| GDP | -2.364e-08 | 3.619e-07 | -0.06531 |
| HIV.AIDS | -0.0214 | 0.003869 | -5.53 |
| Total.expenditure | 0.003373 | 0.001902 | 1.773 |
| regionAmericas | 0.03277 | 0.01705 | 1.921 |
| regionAsia | 0.01685 | 0.01548 | 1.089 |
| regionEurope | 0.02907 | 0.02224 | 1.307 |
| regionOceania | 0.02363 | 0.02132 | 1.109 |
| Income.composition.of.resources | 0.5575 | 0.05029 | 11.09 |
| Pr(>|t|) | |
|---|---|
| (Intercept) | 1.63e-113 |
| StatusDeveloping | 0.8337 |
| GDP | 0.948 |
| HIV.AIDS | 1.975e-07 |
| Total.expenditure | 0.07878 |
| regionAmericas | 0.0571 |
| regionAsia | 0.2785 |
| regionEurope | 0.1938 |
| regionOceania | 0.2699 |
| Income.composition.of.resources | 5.864e-20 |
Three models were constructed using various transformation techniques including log and squared transformations. Analyses were conducted on the individual models using: interpretations of the regression coefficients, residual analysis, and goodness-of-fit measurements. The log-transformed model was determined to be the best fit as the residual analysis and goodness-of-fit measures were considerably better than the peer models.
The regression coefficients were interpreted in section 2.2.2 into a more practical scale, increasing interpretability of confusing log scale coefficients.