2 Analysis

The following analysis has five components:

a preliminary analysis of the data
a calculation of a simple linear regression model and an analysis of its residual plots
a construction of a bootstrap sample and bootstrap regression model, and an analysis of the bootstrap residual plots
a construction of 95% bootstrap confidence intervals for the regression coefficients
a comparison of the two models

2.1 Import and clean the data

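library(dplyr)   # filter(), select(), %>%
library(knitr)   # kable()
library(GGally)  # ggpairs()
library(ggplot2) # ggplot()
library(ggpubr)  # ggarrange()

expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv", 
                   header = TRUE) # read in data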

expectancy <- filter(expectancy, Year == 2015) # construct data set containing only the year 2015

# keep only the variables used in the analysis and drop rows with missing values;
# after omitting NAs, the data set has 140 countries
expectancy <- expectancy %>% 
  select(Country, Status, Life.expectancy, Adult.Mortality, infant.deaths, GDP,
         Population, Income.composition.of.resources, Schooling) %>% 
  na.omit()

var.name = names(expectancy)
kable(data.frame(var.name))

var.name
Country
Status
Life.expectancy
Adult.Mortality
infant.deaths
GDP
Population
Income.composition.of.resources
Schooling

IncComp <- expectancy$Income.composition.of.resources
LifeExp <- expectancy$Life.expectancy

2.2 Preliminary analysis

2.2.1 Pairwise scatterplot

ggpairs(expectancy, columns = 2:9) # pairwise plot of all variables in data set except Country

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2.2 Scatterplot of life expectancy as a function of resource utilization

## plot of life expectancy ~ resource utilization colored by status of development (developed vs developing) and sized by GDP
ggplot(expectancy, aes(x = IncComp, y = LifeExp, col=Status, size=GDP))+ 
  geom_point()+
  theme_minimal()+
  labs(title = "Life Expectancy as a Function of Resource Utilization in 140 Countries", 
      subtitle = "Colored by Status of Development and Sized by GDP per capita (in USD)",
                     x = "Income Composition of Resources", 
                     y = "Life Expectancy")+
  scale_color_manual(values=c("goldenrod1","cadetblue2"))+
  theme(legend.position = "right")+
  scale_size_continuous(name = "GDP (per capita)")+
  guides(size=guide_legend(
    override.aes = list(color = "azure3") # one gray for every size key, whatever the number of legend breaks
  ), color=guide_legend(
    override.aes = list(size=3)))

The preliminary analysis of the data yields the following observations:

the average life expectancy of a country’s population is highly positively correlated with how efficiently that country utilizes its resources
countries with Status equal to “Developed” are concentrated in the upper-right region of the graph: most developed countries have a life expectancy above 75 years and resource utilization greater than 0.80
GDP per capita appears to be related to both variables, as countries with higher life expectancy and resource utilization tend to have higher GDP per capita

2.3 Simple linear regression model

## construct simple linear regression model of life expectancy ~ resource utilization
simple_model <- lm(LifeExp~IncComp)
summary(simple_model)

## 
## Call:
## lm(formula = LifeExp ~ IncComp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4963  -1.9681   0.3152   2.3089   6.9504 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.383      1.316   29.17   <2e-16 ***
## IncComp       48.048      1.873   25.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 138 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8255 
## F-statistic: 658.4 on 1 and 138 DF,  p-value: < 2.2e-16

reg.table <- coef(summary(simple_model))
kable(reg.table, caption = "Inferential statistics for the simple linear
      regression model: life expectancy as a function of resource utilization")

Inferential statistics for the simple linear regression model: life expectancy as a function of resource utilization
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	38.38254	1.315886	29.16860	0
IncComp	48.04848	1.872615	25.65849	0

A simple linear regression model of the form \(Y=\beta_0+\beta_1x_1+\epsilon\), with the response LifeExp (average life expectancy in a country) and the explanatory variable IncComp (the income composition of resources, or resource utilization, in a country), yields the following results:

the model and the explanatory variable IncComp are both statistically significant at \(\alpha=.01\) since \(p < 2.2\times 10^{-16}\)
the model fits the data well, as the adjusted coefficient of determination is relatively high (adjusted \(R^2 = 0.8255\))
the intercept coefficient is 38.383, meaning that the predicted life expectancy is approx. 38 years when IncComp equals 0
the slope coefficient, or the coefficient of IncComp, is 48.048, which means that a 0.01 increase in IncComp is associated with an average increase in LifeExp of about 0.4805 years
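
As a worked illustration of these coefficients, the sketch below plugs the reported estimates into the fitted line. The coefficient values and \(R^2\) are copied from the summary output above rather than re-estimated; the input value 0.80 and the names `b0`, `b1`, and `predict_life` are chosen only for illustration and are not part of the original analysis.

```r
# Coefficients copied from the OLS summary output (not re-estimated here)
b0 <- 38.383   # intercept
b1 <- 48.048   # slope for IncComp

# Hypothetical helper: predicted life expectancy for a given IncComp value
predict_life <- function(inc_comp) b0 + b1 * inc_comp

pred_080 <- predict_life(0.80)             # prediction at an illustrative IncComp of 0.80
step_001 <- predict_life(0.81) - pred_080  # change associated with a 0.01 increase

# Pearson correlation implied by R^2 in a one-predictor model (the slope is positive)
r_implied <- sqrt(0.8267)

print(round(c(pred_080 = pred_080, step_001 = step_001, r_implied = r_implied), 3))
```

With these estimates, a country with an IncComp of 0.80 has a predicted life expectancy of roughly 76.8 years, and the implied correlation between the two variables is about 0.91.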

2.4 Residual analysis

## assumption testing using residual plots
par(mfrow = c(2,2))
plot(simple_model) # residual plots

par(mfrow = c(1,1))
plot(fitted(simple_model), resid(simple_model), # residuals vs fitted values plot
     xlab="Fitted Values", ylab = "Residuals", 
     main = "Residuals vs Fitted")
abline(h = 0, col = "red") # horizontal reference line at 0

An analysis of the residual plot to assess violations of the assumption \(\epsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are randomly scattered around 0.
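
The visual check can also be supplemented numerically. The following sketch uses simulated data (not the life expectancy data) so that it runs on its own; the seed, predictor range, and parameter values are assumptions chosen to loosely resemble the fitted model, and `shapiro.test()` is offered as one possible supplement to the plots rather than a method used in this report.

```r
set.seed(321)                                    # arbitrary seed for reproducibility
n <- 140                                         # same sample size as the cleaned data
x <- runif(n, min = 0.3, max = 0.95)             # simulated predictor (assumed range)
y <- 38 + 48 * x + rnorm(n, mean = 0, sd = 3.5)  # simulated response with normal errors

sim_model <- lm(y ~ x)                           # hypothetical model fit to simulated data

# Residuals vs fitted values: points should scatter evenly around 0
plot(fitted(sim_model), resid(sim_model),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted (simulated data)")
abline(h = 0, col = "red")

# OLS residuals average to zero by construction when an intercept is included
mean_resid <- mean(resid(sim_model))

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(resid(sim_model))
print(sw$p.value)
```

Because the residual mean is zero by construction, the informative parts of the check are the shape of the scatter (no funnel or curve) and the normality of the residuals.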

2.5 Construction of bootstrap sample and regression model

boot.beta0 <- numeric(1000) # preallocate vector to store the intercept from each bootstrap regression
boot.beta1 <- numeric(1000) # preallocate vector to store the slope from each bootstrap regression

## bootstrap regression models using for-loop
vec.id <- 1:length(LifeExp)   # vector of observation ID
for(i in 1:1000){ # for loop with 1000 iterations
  boot.id <- sample(vec.id, length(LifeExp), replace = TRUE)   # bootstrap obs ID.
  boot.LifeExp <- LifeExp[boot.id]           # bootstrap life expectancy
  boot.IncComp <- IncComp[boot.id]     # corresponding bootstrap resource utilization
  
  ## regression
  boot.reg <-lm(LifeExp[boot.id] ~ IncComp[boot.id]) # regression using bootstrap
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}

2.6 Bootstrap regression model and residual analysis

## regression and residual plots for the final (1000th) bootstrap sample
summary(boot.reg) # fit from the last iteration of the loop

## 
## Call:
## lm(formula = LifeExp[boot.id] ~ IncComp[boot.id])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2345  -1.8196   0.5052   2.0665   6.9412 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        37.718      1.199   31.47   <2e-16 ***
## IncComp[boot.id]   48.808      1.758   27.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.52 on 138 degrees of freedom
## Multiple R-squared:  0.8481, Adjusted R-squared:  0.847 
## F-statistic: 770.4 on 1 and 138 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(boot.reg) # residual plots

par(mfrow = c(1,1))
plot(fitted(boot.reg), resid(boot.reg), # residuals vs fitted values plot
     xlab="Fitted Values", ylab = "Residuals", 
     main = "Residuals vs Fitted")
abline(h = 0, col = "red") # horizontal reference line at 0

After generating the bootstrap samples, the regression fitted to the final (1000th) bootstrap sample was examined. Because the resampling is random, the exact values change from run to run. The fit shown above yielded the following results:

the model and the explanatory variable IncComp are both statistically significant at \(\alpha=.01\) since \(p < 2.2\times 10^{-16}\)
the model fits the data well, as the adjusted coefficient of determination is relatively high (adjusted \(R^2 = 0.847\))
the intercept coefficient is 37.718, meaning that the predicted life expectancy is approx. 38 years when IncComp equals 0
the slope coefficient, or the coefficient of IncComp, is 48.808, which means that a 0.01 increase in IncComp is associated with an average increase in LifeExp of about 0.4881 years

An analysis of the residual plot for the bootstrap model to assess violations of the assumption \(\epsilon \sim N(0,\sigma^2)\) shows no clear violations, as the residuals are randomly scattered around 0.
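
Since `boot.reg` holds only the fit from the last resample, it can help to summarize the coefficients over all resamples as well. The sketch below does this on simulated data so that it runs on its own; the seed, the simulated values, and the names `boot_coefs`, `boot_mean`, and `boot_se` are assumptions for illustration, not part of the original analysis.

```r
set.seed(321)                                   # arbitrary seed so the sketch is reproducible
n <- 140
x <- runif(n, 0.3, 0.95)                        # simulated predictor (assumption)
y <- 38 + 48 * x + rnorm(n, sd = 3.5)           # simulated response (assumption)

B <- 1000                                       # number of bootstrap resamples
boot_coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                     dimnames = list(NULL, c("intercept", "slope")))

for (i in seq_len(B)) {
  id <- sample(seq_len(n), n, replace = TRUE)   # resample cases with replacement
  boot_coefs[i, ] <- coef(lm(y[id] ~ x[id]))    # store both coefficients
}

boot_mean <- colMeans(boot_coefs)               # bootstrap mean of each coefficient
boot_se   <- apply(boot_coefs, 2, sd)           # bootstrap standard error of each coefficient
print(round(rbind(mean = boot_mean, se = boot_se), 3))
```

The standard deviation of the resampled coefficients serves as a bootstrap standard error, which can be set beside the model-based standard errors from `summary()`.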

2.7 Construction of bootstrap confidence interval

##  95% bootstrap confidence intervals
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2) # CI for intercept
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2) # CI for slope
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
kable(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")

Bootstrap confidence intervals of regression coefficients.
	2.5%	97.5%
boot.beta0.ci	35.63901	41.14317
boot.beta1.ci	44.42212	51.59784

The 95% bootstrap confidence interval for the intercept coefficient is [35.63901, 41.14317], and the 95% bootstrap confidence interval for the slope coefficient is [44.42212, 51.59784]. Both intervals are fairly narrow and contain the corresponding simple linear regression estimates (38.383 and 48.048), which supports the stability of the coefficients and the model.
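
For comparison, the usual t-based 95% intervals can be computed from the OLS estimates and standard errors reported earlier. This is a minimal sketch under the assumption that the standard normal-theory interval (estimate plus or minus the critical value times the standard error) is the appropriate benchmark; all numbers are copied from the tables above, and the names `ols_ci` and `boot_ci` are illustrative.

```r
# Estimates and standard errors copied from the OLS coefficient table
est <- c(intercept = 38.38254, slope = 48.04848)
se  <- c(intercept = 1.315886, slope = 1.872615)
df  <- 138                                   # residual degrees of freedom

t_crit <- qt(0.975, df)                      # two-sided 95% critical value
ols_ci <- cbind(lower = est - t_crit * se,
                upper = est + t_crit * se)

# Bootstrap percentile intervals copied from the table above
boot_ci <- cbind(lower = c(35.63901, 44.42212),
                 upper = c(41.14317, 51.59784))

print(round(ols_ci, 3))
print(round(ols_ci - boot_ci, 3))            # difference between the two sets of limits
```

Under these inputs, the t-based limits differ from the bootstrap percentile limits by roughly 0.2 or less, which agrees with the conclusion that the two approaches give similar answers here.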

2.8 Plot comparison of the two models

## plot of life expectancy ~ resource utilization with OLS regression line
x1 <- ggplot(expectancy, aes(x=Income.composition.of.resources,y=Life.expectancy))+
  geom_point()+
  geom_smooth(method = "lm", formula = y~x)+
  theme_minimal()+
  labs(title = "Life Expectancy ~ Resource Utilization \n (Simple Lin. Reg.)",
       x = "Income Composition of Resources",
       y= "Life Expectancy")

## plot of the final bootstrap sample (bootstrapped life expectancy ~ bootstrapped resource utilization) with OLS regression line
x2 <- ggplot(data.frame(boot.IncComp, boot.LifeExp), aes(x=boot.IncComp, y=boot.LifeExp))+
  geom_point()+
  geom_smooth(method="lm",formula=y~x)+
  theme_minimal()+
  labs(title = "Life Expectancy ~ Resource Utilization \n (Bootstrap)",
       x = "Income Composition of Resources",
       y= "Life Expectancy")

figure <- ggarrange(x1, x2, nrow = 1, ncol = 2) #plot both graphs in the same graphic
figure

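A comparison of the two models, the OLS and bootstrap regressions, is as follows: the two models are essentially equivalent, with similar coefficients and predictive power. Because the OLS residuals show no clear violation of the normality assumption, the OLS model is adequate and is probably the better choice, since it is simpler and its inference does not depend on a random resample. The single bootstrap fit shown above has a slightly higher adjusted \(R^2\) (0.847 vs. 0.8255) and slightly lower standard errors, but those values vary from one resample to the next and should not be read as evidence that it is the better model.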

Assignment 2

Angelo Saporito

2023-02-06

1 Introduction