In this report, two models, a simple linear regression and a bootstrap regression, are built to estimate the effect of the Income Composition of Resources, a measure from 0 to 1 of how well a country utilizes its resources, on Life Expectancy, the average life span of a country’s population. The data, which track life expectancy and other health, social, and economic factors in 193 countries from 2000 to 2015, come from the Global Health Observatory (GHO) data repository under the authority of the World Health Organization (WHO). The purpose of the following analysis is to establish an empirical association between more efficient resource utilization and higher life expectancy (and, conversely, between less efficient utilization and lower life expectancy) in the 140 countries with complete records for the year 2015.
There are five components of the following analysis:
library(dplyr)   # filter(), select(), %>%
library(knitr)   # kable()
library(ggplot2) # ggplot()
library(GGally)  # ggpairs()
library(ggpubr)  # ggarrange()
expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv",
header = TRUE) # read in data
expectancy <- filter(expectancy, Year == 2015) # keep only the records for the year 2015
expectancy <- expectancy %>%
select(Country, Status, Life.expectancy, Adult.Mortality, infant.deaths, GDP, Population, Income.composition.of.resources, Schooling) %>% na.omit() # keep the variables of interest and drop rows with NAs; 140 countries remain
var.name <- names(expectancy)
kable(data.frame(var.name))
| var.name |
|---|
| Country |
| Status |
| Life.expectancy |
| Adult.Mortality |
| infant.deaths |
| GDP |
| Population |
| Income.composition.of.resources |
| Schooling |
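The code comment above notes that omitting missing values leaves 140 countries. As a minimal sketch of how one might check how many rows `na.omit()` removes, and which ones, the following uses a small made-up data frame. The `toy` values are hypothetical and are not taken from the WHO data.

```r
# Hypothetical toy data: three countries, one with a missing GDP value
toy <- data.frame(
  Country = c("A", "B", "C"),
  Life.expectancy = c(70.1, 65.3, 80.2),
  GDP = c(1200, NA, 45000)
)

clean <- na.omit(toy)                            # drop any row containing an NA
n_dropped <- nrow(toy) - nrow(clean)             # number of rows removed
incomplete <- toy$Country[!complete.cases(toy)]  # which countries were incomplete

n_dropped
incomplete
```

Applied to the report's data, the same comparison of `nrow()` before and after `na.omit()` would show how many of the 2015 records were excluded.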
IncComp <- expectancy$Income.composition.of.resources
LifeExp <- expectancy$Life.expectancy
ggpairs(expectancy, columns = 2:9) # pairwise plot of all variables in data set
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## plot of life expectancy ~ resource utilization colored by status of development (developed vs developing) and sized by GDP
ggplot(expectancy, aes(x = IncComp, y = LifeExp, col=Status, size=GDP))+
geom_point()+
theme_minimal()+
labs(title = "Life Expectancy as a Function of Resource Utilization in 140 Countries",
subtitle = "Colored by Status of Development and Sized by GDP per capita (in USD)",
x = "Income Composition of Resources",
y = "Life Expectancy")+
scale_color_manual(values=c("goldenrod1","cadetblue2"))+
theme(legend.position = "right")+
scale_size_continuous(name = "GDP (per capita)")+
guides(size=guide_legend(
override.aes = list(color = c("azure3", "azure3", "azure3", "azure3", "azure3"))
), color=guide_legend(
override.aes = list(size=3)))
The preliminary analysis of the data yields the following observations:
## construct simple linear regression model of life expectancy ~ resource utilization
simple_model <- lm(LifeExp~IncComp)
summary(simple_model)
##
## Call:
## lm(formula = LifeExp ~ IncComp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4963 -1.9681 0.3152 2.3089 6.9504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.383 1.316 29.17 <2e-16 ***
## IncComp 48.048 1.873 25.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.488 on 138 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8255
## F-statistic: 658.4 on 1 and 138 DF, p-value: < 2.2e-16
reg.table <- coef(summary(simple_model))
kable(reg.table, caption = "Inferential statistics for the simple linear
regression model: life expectancy as a function of resource utilization")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 38.38254 | 1.315886 | 29.16860 | 0 |
| IncComp | 48.04848 | 1.872615 | 25.65849 | 0 |
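A simple linear regression model with the structure \(Y=\beta_0+\beta_1x_1+\varepsilon\), with LifeExp (the average life expectancy in a country) as the response and IncComp (the income composition of resources, or resource utilization, in a country) as the explanatory variable, yields the following results: the estimated intercept is 38.38 and the estimated slope is 48.05, both significant at p < 2e-16, and the model explains about 83% of the variation in life expectancy (\(R^2 = 0.8267\)).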
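To illustrate how the fitted coefficients are read, consider a hypothetical country with an income composition of resources of 0.70 (a value chosen only for illustration, not a country from the data). Using the rounded estimates from the summary output:

```latex
\[
\hat{Y} = 38.383 + 48.048\,x_1,
\qquad
\hat{Y}\big|_{x_1 = 0.70} = 38.383 + 48.048 \times 0.70 \approx 72.02 \text{ years}.
\]
```

Equivalently, a 0.1 increase in the index corresponds to an estimated increase of about \(48.048 \times 0.1 \approx 4.8\) years in average life expectancy.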
## assumption testing using residual plots
par(mfrow = c(2,2))
plot(simple_model) # residual plots
par(mfrow = c(1,1))
plot(fitted(simple_model), resid(simple_model), # residuals vs fitted values plot
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red") # horizontal reference line at zero
An analysis of the residual plot to assess the assumption \(\varepsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are scattered randomly around 0.
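Visual inspection of residual plots can be complemented by a formal check. The sketch below is one possible approach and is not part of the original analysis: it fits a simple regression to simulated data (the variables `x`, `y`, and `fit` are hypothetical stand-ins for `IncComp`, `LifeExp`, and `simple_model`) and applies the Shapiro-Wilk test from base R to the residuals.

```r
set.seed(321)                          # arbitrary seed so the simulation is repeatable
n <- 140                               # same sample size as the report, by assumption
x <- runif(n, min = 0.3, max = 0.95)   # simulated predictor on a 0-1 scale
y <- 38 + 48 * x + rnorm(n, sd = 3.5)  # simulated response with normal errors

fit <- lm(y ~ x)                       # stand-in for simple_model
res <- resid(fit)                      # residuals of the fitted model

sw <- shapiro.test(res)                # H0: the residuals are normally distributed
sw$p.value                             # a small p-value would suggest non-normality

qqnorm(res); qqline(res, col = "red")  # visual counterpart: normal Q-Q plot
```

With the report's data, one would call `shapiro.test(resid(simple_model))` instead of using the simulated fit.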
boot.beta0 <- NULL #define empty vector to store coefficient intercept from bootstrap regression
boot.beta1 <- NULL #define empty vector to store coefficient slope from bootstrap regression
## bootstrap regression models using for-loop
vec.id <- 1:length(LifeExp) # vector of observation ID
for(i in 1:1000){ # for loop with 1000 iterations
boot.id <- sample(vec.id, length(LifeExp), replace = TRUE) # bootstrap obs ID.
boot.LifeExp <- LifeExp[boot.id] # bootstrap life expectancy
boot.IncComp <- IncComp[boot.id] # corresponding bootstrap resource utilization
## regression
boot.reg <-lm(LifeExp[boot.id] ~ IncComp[boot.id]) # regression using bootstrap
boot.beta0[i] <- coef(boot.reg)[1] # bootstrap intercept
boot.beta1[i] <- coef(boot.reg)[2] # bootstrap slope
}
## bootstrapped regression and residual plots
summary(boot.reg) # summary of the final (1000th) bootstrap fit only
##
## Call:
## lm(formula = LifeExp[boot.id] ~ IncComp[boot.id])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2345 -1.8196 0.5052 2.0665 6.9412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.718 1.199 31.47 <2e-16 ***
## IncComp[boot.id] 48.808 1.758 27.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.52 on 138 degrees of freedom
## Multiple R-squared: 0.8481, Adjusted R-squared: 0.847
## F-statistic: 770.4 on 1 and 138 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(boot.reg) # residual plots
par(mfrow = c(1,1))
plot(fitted(boot.reg), resid(boot.reg), # residuals vs fitted values plot
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red") # horizontal reference line at zero
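After generating 1,000 bootstrap samples by resampling the 140 countries with replacement, a regression model was fit to each sample. The fit to the final resample, summarized above, yielded the following results: an estimated intercept of 37.72 and an estimated slope of 48.81, both significant at p < 2e-16, with \(R^2 = 0.8481\).
An analysis of the residual plot for the bootstrap model to assess the assumption \(\varepsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are scattered randomly around 0.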
## 95% bootstrap confidence intervals
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2) # CI for intercept
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2) # CI for slope
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))
names(boot.coef) <- c("2.5%", "97.5%")
kable(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
| | 2.5% | 97.5% |
|---|---|---|
| boot.beta0.ci | 35.63901 | 41.14317 |
| boot.beta1.ci | 44.42212 | 51.59784 |
The 95% bootstrap percentile confidence interval for the intercept is [35.63901, 41.14317], and the corresponding interval for the slope is [44.42212, 51.59784]. Both intervals are fairly narrow and contain the simple linear regression estimates (38.38254 and 48.04848, respectively), which supports the stability of the coefficients and of the model.
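The case-resampling procedure behind these percentile intervals can be written compactly. The following is a minimal sketch on simulated data and not the report's own code: `dat`, `boot_coefs`, and the seed are hypothetical, and `replicate()` is swapped in for the for-loop.

```r
set.seed(2015)                                 # arbitrary seed for repeatability
n <- 140
dat <- data.frame(x = runif(n, 0.3, 0.95))     # simulated predictor
dat$y <- 38 + 48 * dat$x + rnorm(n, sd = 3.5)  # simulated response

B <- 1000                                      # number of bootstrap resamples
boot_coefs <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)          # resample row indices with replacement
  coef(lm(y ~ x, data = dat[idx, ]))           # refit and keep (intercept, slope)
})                                             # result: a 2 x B matrix

# Percentile intervals: 2.5% and 97.5% quantiles of each coefficient
ci <- t(apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975)))
ols <- coef(lm(y ~ x, data = dat))             # full-sample OLS estimates
ci
```

Setting a seed makes the intervals repeatable from run to run, which the loop-based version without a seed is not.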
## plot of life expectancy ~ resource utilization with OLS regression line
x1 <- ggplot(expectancy, aes(x=Income.composition.of.resources,y=Life.expectancy))+
geom_point()+
geom_smooth(method = "lm", formula = y~x)+
theme_minimal()+
labs(title = "Life Expectancy ~ Resource Utilization \n (Simple Lin. Reg.)",
x = "Income Composition of Resources", # x-axis holds the predictor
y = "Life Expectancy") # y-axis holds the response
## plot of bootstrapped life expectancy ~ bootstrapped resource utilization with OLS regression line
x2 <- ggplot(expectancy, aes(x=boot.IncComp, y=boot.LifeExp))+ # vectors from the final resample of the loop
geom_point()+
geom_smooth(method="lm",formula=y~x)+
theme_minimal()+
labs(title = "Life Expectancy ~ Resource Utilization \n (Bootstrap)",
x = "Income Composition of Resources", # x-axis holds the predictor
y = "Life Expectancy") # y-axis holds the response
figure <- ggarrange(x1, x2, nrow = 1, ncol = 2) #plot both graphs in the same graphic
figure
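A comparison of the two models, the OLS and bootstrap regression, is as follows. The models are essentially equivalent, with similar coefficients and predictive power. The single resample summarized above shows a slightly higher adjusted \(R^2\) (0.847 versus 0.8255) and slightly lower standard errors than the OLS fit, but these values change from one resample to the next and so do not favor either model. Because the OLS model does not violate the assumption of normally distributed residuals, its standard inference is valid, and it is probably the better-suited and simpler choice.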
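One way to make such a comparison concrete is to set the bootstrap standard error of the slope (the standard deviation of the resampled slopes) beside the analytic OLS standard error. The sketch below does this on simulated data. All names and values are hypothetical, and it is not the report's own computation.

```r
set.seed(99)                                     # arbitrary seed for repeatability
n <- 140
x <- runif(n, 0.3, 0.95)                         # simulated predictor
y <- 38 + 48 * x + rnorm(n, sd = 3.5)            # simulated response

fit <- lm(y ~ x)
ols_se <- coef(summary(fit))["x", "Std. Error"]  # analytic OLS standard error of the slope

# Slope from each of 1000 case-resampled refits
boot_slopes <- vapply(seq_len(1000), function(i) {
  idx <- sample(n, n, replace = TRUE)            # resample observations with replacement
  unname(coef(lm(y[idx] ~ x[idx]))[2])           # keep only the slope
}, numeric(1))

boot_se <- sd(boot_slopes)                       # bootstrap standard error of the slope
c(ols = ols_se, bootstrap = boot_se)
```

When the two standard errors are close, as is expected when the OLS assumptions hold, the bootstrap adds little beyond the standard regression output.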
In this project, the empirical relationship between resource utilization and life expectancy in 140 countries was examined using two approaches: simple OLS regression and a bootstrap regression model. Both models indicated a clear positive association between more efficient resource utilization and higher life expectancy. However, while both models were essentially equivalent, the OLS regression model was judged the more effective choice for the following reasons: