1 Introduction

In this report, two models, a simple linear regression and a bootstrap regression, are fitted to estimate the effect of the Income Composition of Resources, a measure from 0 to 1 of how well a country utilizes its resources, on Life Expectancy, the average life span of a country’s population. The data, which track life expectancy and other health, social, and economic factors in 193 countries from 2000 to 2015, come from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO). The purpose of the following analysis is to establish an empirical connection between efficient resource utilization and life expectancy in the 140 countries with complete records for the year 2015.

2 Analysis

There are five components of the following analysis:

  1. a preliminary analysis of the data
  2. a calculation of a simple linear regression model and analysis of residual plots
  3. a construction of a bootstrap sample and bootstrap regression model, and an analysis of the bootstrap residual plots
  4. a construction of a 95% confidence interval of the bootstrap model
  5. a comparison of the two models

2.1 Import and clean the data

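library(dplyr)   # filter(), select(), %>%
library(knitr)   # kable()
library(ggplot2) # ggplot()
library(GGally)  # ggpairs()
library(ggpubr)  # ggarrange()

expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv", 
                   header = TRUE) #read in data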

expectancy <- filter(expectancy, Year == 2015) # construct data set containing only the year 2015

expectancy <- expectancy %>% 
  select(Country, Status, Life.expectancy, Adult.Mortality, infant.deaths, GDP, Population, Income.composition.of.resources, Schooling) %>% na.omit() # select only certain variables for testing and omit NAs. After omitting NAs, our data set only has 140 countries

var.name = names(expectancy)
kable(data.frame(var.name))
var.name
Country
Status
Life.expectancy
Adult.Mortality
infant.deaths
GDP
Population
Income.composition.of.resources
Schooling

IncComp <- expectancy$Income.composition.of.resources # shorthand vector: resource utilization
LifeExp <- expectancy$Life.expectancy # shorthand vector: life expectancy

2.2 Preliminary analysis

2.2.1 Pairwise scatterplot

ggpairs(expectancy, columns = 2:9) # pairwise plot of all variables in data set
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2.2 Scatterplot of life expectancy as a function of resource utilization

## plot of life expectancy ~ resource utilization colored by status of development (developed vs developing) and sized by GDP
ggplot(expectancy, aes(x = IncComp, y = LifeExp, col=Status, size=GDP))+ 
  geom_point()+
  theme_minimal()+
  labs(title = "Life Expectancy as a Function of Resource Utilization in 140 Countries", 
      subtitle = "Colored by Status of Development and Sized by GDP per capita (in USD)",
                     x = "Income Composition of Resources", 
                     y = "Life Expectancy")+
  scale_color_manual(values=c("goldenrod1","cadetblue2"))+
  theme(legend.position = "right")+
  scale_size_continuous(name = "GDP (per capita)")+
  guides(size=guide_legend(
    override.aes = list(color = "azure3") # a single value is recycled across all size keys
  ), color=guide_legend(
    override.aes = list(size=3)))

The preliminary analysis of the data yields the following observations:

  1. the average life expectancy of a country’s population is highly positively correlated with how efficiently that country utilizes its resources
  2. countries with Status equal to “Developed” are disproportionately represented in the high life expectancy and resource utilization area of the graph with most developed countries having a life expectancy above 75 and resource utilization greater than 0.80
  3. GDP per capita seems to play some role in life expectancy and resource utilization, as countries toward the upper right of the graph (higher resource utilization and higher life expectancy) tend to have higher GDP per capita

2.3 Simple linear regression model

## construct simple linear regression model of life expectancy ~ resource utilization
simple_model <- lm(LifeExp~IncComp)
summary(simple_model)
## 
## Call:
## lm(formula = LifeExp ~ IncComp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4963  -1.9681   0.3152   2.3089   6.9504 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.383      1.316   29.17   <2e-16 ***
## IncComp       48.048      1.873   25.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 138 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8255 
## F-statistic: 658.4 on 1 and 138 DF,  p-value: < 2.2e-16
reg.table <- coef(summary(simple_model))
kable(reg.table, caption = "Inferential statistics for the simple linear
      regression model: life expectancy as a function of resource utilization")
Inferential statistics for the simple linear regression model: life expectancy as a function of resource utilization

              Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)   38.38254   1.315886     29.16860   0
IncComp       48.04848   1.872615     25.65849   0

A simple linear regression model of the form \(Y=\beta_0+\beta_1x+\varepsilon\), with LifeExp (the average life expectancy in a country) as the response and IncComp (the income composition of resources, or resource utilization, in a country) as the explanatory variable, yields the following results:

  1. the model and the explanatory variable IncComp are both statistically significant at \(\alpha=.01\) since both p-values are below \(2.2 \times 10^{-16}\)
  2. the model explains a large share of the variation in life expectancy, as the adjusted coefficient of determination is high (adjusted \(R^2 = 0.8255\))
  3. the intercept coefficient is 38.383, meaning that the fitted life expectancy for a country with IncComp equal to 0 is approx. 38 years
  4. the slope coefficient, or the coefficient of IncComp, is 48.048, which means that a 0.01 increase in IncComp is associated with an increase of 0.4805 years in fitted LifeExp
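
To make the coefficient interpretation concrete, the fitted line can be evaluated by hand. The sketch below hard-codes the estimates from the regression table above; the helper name `predict_life_exp` and the input values 0.80 and 0.81 are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the regression table above
b0 <- 38.38254   # intercept
b1 <- 48.04848   # slope for IncComp

# Hypothetical helper: fitted life expectancy for a given IncComp value
predict_life_exp <- function(inc_comp) b0 + b1 * inc_comp

# Change in fitted life expectancy for a 0.01 increase in IncComp
delta <- predict_life_exp(0.81) - predict_life_exp(0.80)
print(round(delta, 4))                    # 0.4805

# Fitted life expectancy at an example value of IncComp = 0.80
print(round(predict_life_exp(0.80), 2))   # 76.82
```

Because the model is linear, the same 0.4805-year difference applies to any 0.01 step in IncComp, whatever the starting value.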

2.4 Residual analysis

## assumption testing using residual plots
par(mfrow = c(2,2))
plot(simple_model) # residual plots

par(mfrow = c(1,1))
plot(fitted(simple_model), resid(simple_model), # residuals vs fitted values plot
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Fitted vs Residuals")
abline(h = 0, col = "red") # horizontal reference line at zero

An analysis of the residual plots to assess violations of the assumption that the errors are independent with \(\varepsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are scattered randomly around 0.
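
A visual check of residuals can be supplemented with a numerical one. The following is a minimal sketch on simulated stand-in data, not the WHO data: the seed, the simulated intercept, slope, and noise level, and the variable names are all assumptions chosen only to resemble the fitted model. It uses the base R function `shapiro.test()` for a normality test of the residuals.

```r
set.seed(321)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data (NOT the WHO data): 140 points on a noisy line
n <- 140
x <- runif(n, 0.35, 0.95)
y <- 38 + 48 * x + rnorm(n, sd = 3.5)

fit <- lm(y ~ x)
res <- resid(fit)

# Residuals from an OLS fit with an intercept average to zero
print(mean(res))

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Residuals against fitted values, with a reference line at zero
plot(fitted(fit), res, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
```

Applied to the report's own objects, the analogous call would be `shapiro.test(resid(simple_model))`; its result is not shown here because it was not part of the original output.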

2.5 Construction of bootstrap sample and regression model

boot.beta0 <- numeric(1000) # vector to store the intercept from each bootstrap regression
boot.beta1 <- numeric(1000) # vector to store the slope from each bootstrap regression

## bootstrap regression models using for-loop
vec.id <- seq_along(LifeExp)   # vector of observation IDs
for(i in 1:1000){ # for loop with 1000 iterations
  boot.id <- sample(vec.id, length(LifeExp), replace = TRUE)   # bootstrap obs ID.
  boot.LifeExp <- LifeExp[boot.id]           # bootstrap life expectancy
  boot.IncComp <- IncComp[boot.id]     # corresponding bootstrap resource utilization
  
  ## regression
  boot.reg <-lm(LifeExp[boot.id] ~ IncComp[boot.id]) # regression using bootstrap
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}
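
The loop above resamples observation IDs with replacement and refits the regression on each resample. The same case-resampling idea can be sketched more compactly with `replicate()`. The example runs on simulated stand-in data rather than the WHO data, and the seed, the simulated coefficients, and the names `boot_coefs`, `boot_beta0`, and `boot_beta1` are assumptions made for illustration.

```r
set.seed(321)  # arbitrary seed, added here so the resamples are reproducible

# Simulated stand-in data (NOT the WHO data)
n <- 140
x <- runif(n, 0.35, 0.95)
y <- 38 + 48 * x + rnorm(n, sd = 3.5)

B <- 1000  # number of bootstrap resamples, as in the loop above

# Each column of boot_coefs holds (intercept, slope) from one resample
boot_coefs <- replicate(B, {
  id <- sample.int(n, n, replace = TRUE)  # resample row indices with replacement
  coef(lm(y[id] ~ x[id]))                 # refit on the resampled pairs
})

boot_beta0 <- boot_coefs[1, ]  # bootstrap intercepts
boot_beta1 <- boot_coefs[2, ]  # bootstrap slopes
print(dim(boot_coefs))         # 2 rows, B columns
```

Setting a seed before resampling is a design choice: without it, every knit of the report produces slightly different bootstrap coefficients and intervals.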

2.6 Bootstrap regression model and residual analysis

## regression summary and residual plots for the final (1000th) bootstrap resample
summary(boot.reg) # boot.reg is overwritten on every loop iteration, so this is the last fit
## 
## Call:
## lm(formula = LifeExp[boot.id] ~ IncComp[boot.id])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2345  -1.8196   0.5052   2.0665   6.9412 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        37.718      1.199   31.47   <2e-16 ***
## IncComp[boot.id]   48.808      1.758   27.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.52 on 138 degrees of freedom
## Multiple R-squared:  0.8481, Adjusted R-squared:  0.847 
## F-statistic: 770.4 on 1 and 138 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(boot.reg) # residual plots

par(mfrow = c(1,1))
plot(fitted(boot.reg), resid(boot.reg), # residuals vs fitted values plot
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Fitted vs Residuals")
abline(h = 0, col = "red") # horizontal reference line at zero

After generating the bootstrap samples, the regression fitted to the final (1000th) resample, which is the only fit retained in boot.reg, was summarized as one example of a bootstrap fit and yielded the following results:

  1. the model and the explanatory variable IncComp are both statistically significant at \(\alpha=.01\) since both p-values are below \(2.2 \times 10^{-16}\)
  2. the model explains a large share of the variation in life expectancy, as the adjusted coefficient of determination is high (adjusted \(R^2 =\) 0.847)
  3. the intercept coefficient is 37.718, meaning that the fitted life expectancy for a country with IncComp equal to 0 is approx. 38 years
  4. the slope coefficient, or the coefficient of IncComp, is 48.808, which means that a 0.01 increase in IncComp is associated with an increase of 0.4881 years in fitted LifeExp

An analysis of the residual plots for the bootstrap model to assess violations of the assumption that the errors are independent with \(\varepsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are scattered randomly around 0.
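
A bootstrap is commonly summarized across all resamples rather than from a single fit. The sketch below, run on simulated stand-in data rather than the WHO data, computes a bootstrap standard error and a bias estimate for the slope and sets them beside the OLS standard error. The seed, the simulated coefficients, and all variable names are assumptions made for illustration.

```r
set.seed(321)  # arbitrary seed for reproducibility

# Simulated stand-in data (NOT the WHO data)
n <- 140
x <- runif(n, 0.35, 0.95)
y <- 38 + 48 * x + rnorm(n, sd = 3.5)

# OLS slope and its model-based standard error
ols <- lm(y ~ x)
ols_slope <- unname(coef(ols)[2])
ols_se <- coef(summary(ols))[2, "Std. Error"]

# Slopes from 1000 case-resampling bootstrap fits
boot_slope <- replicate(1000, {
  id <- sample.int(n, n, replace = TRUE)
  unname(coef(lm(y[id] ~ x[id]))[2])
})

boot_se <- sd(boot_slope)                   # bootstrap standard error
boot_bias <- mean(boot_slope) - ols_slope   # bootstrap bias estimate
print(c(ols_se = ols_se, boot_se = boot_se, boot_bias = boot_bias))
```

With the report's objects, the corresponding quantities would be `sd(boot.beta1)` and `mean(boot.beta1) - coef(simple_model)[2]`.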

2.7 Construction of bootstrap confidence interval

##  95% bootstrap confidence intervals
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2) # CI for intercept
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2) # CI for slope
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
kable(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
Bootstrap confidence intervals of regression coefficients.

                2.5%       97.5%
boot.beta0.ci   35.63901   41.14317
boot.beta1.ci   44.42212   51.59784

The 95% bootstrap percentile confidence interval for the intercept coefficient is [35.63901, 41.14317], and the interval for the slope coefficient is [44.42212, 51.59784]. Both intervals contain the corresponding simple linear regression estimates (38.383 and 48.048), which supports the stability of the coefficients and the model.
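
The percentile interval can be set beside the usual t-based interval from `confint()` to see how closely the two agree. This is a minimal sketch on simulated stand-in data, not the WHO data; the seed, the simulated coefficients, and the names `boot_ci`, `ols_ci`, and `inside` are assumptions for illustration.

```r
set.seed(321)  # arbitrary seed for reproducibility

# Simulated stand-in data (NOT the WHO data)
n <- 140
x <- runif(n, 0.35, 0.95)
y <- 38 + 48 * x + rnorm(n, sd = 3.5)

fit <- lm(y ~ x)

# Slopes from 1000 case-resampling bootstrap fits
boot_slope <- replicate(1000, {
  id <- sample.int(n, n, replace = TRUE)
  unname(coef(lm(y[id] ~ x[id]))[2])
})

# 95% percentile interval, using the same quantile type as the code above
boot_ci <- quantile(boot_slope, c(0.025, 0.975), type = 2)

# 95% t-based interval for the slope from the OLS fit
ols_ci <- confint(fit, "x", level = 0.95)

print(boot_ci)
print(ols_ci)

# Does the OLS point estimate fall inside the percentile interval?
inside <- coef(fit)[["x"]] >= boot_ci[[1]] && coef(fit)[["x"]] <= boot_ci[[2]]
print(inside)
```

In the report's own objects, the analogous comparison would be between `boot.beta1.ci` and `confint(simple_model)`.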

2.8 Plot comparison of the two models

## plot of life expectancy ~ resource utilization with OLS regression line
x1 <- ggplot(expectancy, aes(x=Income.composition.of.resources,y=Life.expectancy))+
  geom_point()+
  geom_smooth(method = "lm", formula = y~x)+
  theme_minimal()+
  labs(title = "Life Expectancy ~ Resource Utilization \n (Simple Lin. Reg.)",
       x = "Income Composition of Resources",
       y = "Life Expectancy")

## plot of the final bootstrap resample (life expectancy ~ resource utilization) with OLS regression line
x2 <- ggplot(expectancy, aes(x=boot.IncComp, y=boot.LifeExp))+
  geom_point()+
  geom_smooth(method="lm",formula=y~x)+
  theme_minimal()+
  labs(title = "Life Expectancy ~ Resource Utilization \n (Bootstrap)",
       x = "Income Composition of Resources",
       y = "Life Expectancy")

figure <- ggarrange(x1, x2, nrow = 1, ncol = 2) #plot both graphs in the same graphic
figure

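A comparison of the two models, the OLS and bootstrap regression, is as follows: the two are essentially equivalent, with similar coefficients and predictive power. The printed fit for the final bootstrap resample shows an adjusted \(R^2\) of 0.847 and standard errors of 1.199 and 1.758, against 0.8255, 1.316, and 1.873 for the OLS model, but those figures describe a single random resample and change from run to run, so they are not a basis for ranking the models. Since the residual diagnostics show no clear violation of the assumption of normally distributed residuals, the simpler OLS model is probably better suited.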

3 Conclusion

In this project, an attempt was made to understand the empirical relationship between resource utilization and life expectancy in 140 countries using two approaches: simple OLS regression and a bootstrap regression model. Both models indicated a clear positive association between more efficient resource utilization and higher life expectancy. However, it was decided that, while both models were essentially equivalent, it would be more effective to use the OLS regression model for the following reasons:

  1. since there were no apparent violations of the assumptions, a resampling approach such as the bootstrap method is unnecessary, as the residuals already appear consistent with the normality assumption
  2. while both models are essentially equivalent with similar coefficients and predictive power, the OLS estimates lie inside the 95% bootstrap confidence intervals, so the bootstrap does not change the conclusion and the simpler OLS model suffices