1 Introduction

In the following analysis, we build upon the previous examination of average life expectancy for the populations of 127 countries in 2014. We found previously that the most effective model is a log-transformed model with the explicit structure \[\log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364\times 10^{-8}\times GDP-0.0214\times HIV.AIDS+\]\[ 0.003373\times Total.expenditure+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] However, a residual analysis concluded that there were still minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\). The goal of this report is to address these violations by using bootstrap sampling to construct confidence intervals for the regression coefficients, first by resampling cases and then by resampling residuals.

1.1 Data Description

Our data set for this analysis is the same as the data set from the previous assignment. The first data set, Life Expectancy (WHO), tracks life expectancy and other health, social, and economic factors in 193 countries from 2000 to 2015. It comes from the Global Health Observatory (GHO) data repository under the authority of the World Health Organization (WHO). A second data set, Country Mapping - ISO, Continent, Region, was created by Kaggle user andradaolteanu for the explicit purpose of country mapping. The second data set is used solely to attach each country's region to the Life Expectancy (WHO) data set. Our final data set, named Country.stats, contains the following variables:

  1. Country
  2. Status: categorical variable for determining whether a country is Developed or Developing
  3. Life.expectancy: the average life expectancy of a country
  4. GDP: Gross Domestic Product (GDP) per capita
  5. Income.composition.of.resources: a scale from 0 to 1 of how well a country utilizes its resources
  6. HIV.AIDS: Deaths from HIV/AIDS per 1,000 live births (0-4 years)
  7. Total.expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  8. region: regional location (Americas, Africa, Asia, Oceania, Europe) of country

1.2 Practical Question

The purpose of the following analysis is to verify the empirical connection, determined in the previous analysis, between life expectancy and various social, economic, health, and geographic factors in 127 countries for the year 2014. We do this by using bootstrap sampling of cases and of residuals to construct confidence intervals for the regression coefficients, which do not rely on the normality assumption that the previous analysis found to be mildly violated.

2 Analysis

There will be three main components to the following analysis:

  1. A summary of the exploratory data analysis conducted in the previous assignment where a preliminary examination of the variables and their interaction with each other was conducted
  2. A summary of the final model generated in the previous report
  3. Construction of bootstrap cases using bootstrap sampling to create a bootstrap confidence interval for the regression coefficients and bootstrap residuals for residual analysis.

2.1 Exploratory Data Analysis

As in the previous report, the preliminary analysis imports, transforms, and cleans the data. Two plots, a pairwise scatter plot and an exploratory graph, then help determine the relationships between the variables and develop a narrative to be explored.

2.1.1 Import and Clean Data

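library(dplyr)    # filter(), inner_join(), select(), %>%
library(ggplot2)  # ggplot()
library(GGally)   # ggpairs()
library(pander)   # pander()
library(knitr)    # kable()

Expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv", 
                   header = TRUE) #read in data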

Region <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/continents2.csv", 
                   header = TRUE)

expectancy <- filter(Expectancy, Year == 2014) %>%
  na.omit() # construct data set containing only the year 2014 and omit NAs. 

# merge data sets expectancy and Region and select only the variables used for testing;
# after omitting NAs, our data set only has 127 countries
Country.stats <- inner_join(expectancy, Region, by="Country") %>% 
  select(Country, Status, Life.expectancy, GDP, HIV.AIDS, Total.expenditure, region, Income.composition.of.resources)

pander(head(Country.stats))
Country      Status      Life.expectancy  GDP    HIV.AIDS  Total.expenditure  region    Income.composition.of.resources
Afghanistan  Developing  59.9             612.7  0.1       8.18               Asia      0.476
Albania      Developing  77.5             4576   0.1       5.88               Europe    0.761
Algeria      Developing  75.4             547.9  0.1       7.21               Africa    0.741
Angola       Developing  51.7             479.3  2         3.31               Africa    0.527
Argentina    Developing  76.2             12245  0.1       4.79               Americas  0.825
Armenia      Developing  74.6             3995   0.1       4.48               Asia      0.739

2.1.2 Pairwise Scatterplot

The following pairwise scatter plot visualizes the distribution of each variable and the pairwise relationships between variables. An assessment of the plot reveals that the quantitative variables have the following correlations with the response variable Life.expectancy: GDP = -0.445, HIV.AIDS = -0.611, Total.expenditure = 0.332, and Income.composition.of.resources = 0.891.

ggpairs(Country.stats, columns = 2:8) # pairwise plot of all variables in data set
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1.3 Exploratory Scatter Plot

The following graph explores some of the variables in more depth and defines a narrative for the response variable and the independent variables. In particular, it shows the relationship between life expectancy and resource utilization, with the individual points colored by region, shaped by status of development, and sized by GDP per capita. What is evident at first glance is that the countries furthest toward the upper-right area of the graph are disproportionately (although not exclusively) developed European and Oceanic countries with high GDP per capita. Those in the bottom-left area of the graph are disproportionately developing African countries with very low GDP per capita.

ggplot(Country.stats, aes(x=Income.composition.of.resources, y=Life.expectancy, col = region, shape=Status, size=GDP))+
  geom_point()+
  theme_minimal()+
  labs(title="Life Expectancy as a Function of Resource Utilization in 127 Countries",
       subtitle = "Shaped by Status of Development, Sized by GDP per capita (in USD), and Colored by Region",
                     x = "Income Composition of Resources", 
                     y = "Life Expectancy")+
  scale_color_manual(values=c("#68aed6","#4292c6","#2171b5","#08519c","#08306b"), name="Region")+
  guides(size=guide_legend(
    override.aes = list(color = c("azure3","azure3","azure3"))
  ), color=guide_legend(
    override.aes = list(size=3)), shape=guide_legend(override.aes = list(size=2)))+
  scale_size_continuous(name = "GDP (per capita)")

This concludes the exploratory section of the analysis. The following section will build upon this analysis by restating the final model generated in the previous analysis and then constructing a bootstrap sample to recreate the final model.

2.2 Restating the Final Model

The previous analysis compared three models (a multiple OLS linear regression, a log-transformed regression, and a squared-transformed regression) and found that the regression on the log-transformed response, with the structure \(\log(Y)=\beta_0+\beta_1x_1+\cdots+\beta_kx_k\), is the best model based on residual analysis and goodness-of-fit measures. The model is summarized in the following table.

Inferential Statistics of Final Model
                                 Estimate    Std. Error  t value     Pr(>|t|)
(Intercept)                      3.8580921   0.0398693   96.7685100  0.0000000
StatusDeveloping                 -0.0041436  0.0196952   -0.2103883  0.8337306
GDP                              0.0000000   0.0000004   -0.0653112  0.9480377
HIV.AIDS                         -0.0213968  0.0038689   -5.5304367  0.0000002
Total.expenditure                0.0033727   0.0019019   1.7733084   0.0787795
regionAmericas                   0.0327692   0.0170540   1.9214960   0.0571017
regionAsia                       0.0168546   0.0154794   1.0888439   0.2784605
regionEurope                     0.0290678   0.0222393   1.3070421   0.1937614
regionOceania                    0.0236311   0.0213172   1.1085497   0.2698980
Income.composition.of.resources  0.5574610   0.0502895   11.0850397  0.0000000

As previously stated, the explicit structure of the model is \[\log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364\times 10^{-8}\times GDP-0.0214\times HIV.AIDS+\]\[ 0.003373\times Total.expenditure+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] in which \(\log(\)Life.expectancy\()\) is the response variable and Status, GDP, HIV.AIDS, Total.expenditure, region, and Income.composition.of.resources are the explanatory variables.

We then conducted residual analyses on the final model. The residual plots show minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\).

Residual Analysis Plots


Alongside the residual analysis, goodness-of-fit measures were also conducted using the following measures:

  • Sum of Squares Error (SSE): the sum of squared differences between the observed and predicted values; smaller values indicate a closer fit
  • R.squared: the coefficient of determination, the proportion of variation in the response explained by the model
  • adjusted.R.squared: the coefficient of determination adjusted for the number of predictors, so that adding uninformative predictors does not inflate it
  • Mallows’ Cp: addresses over-fitting by penalizing the addition of unnecessary variables
  • Akaike information criterion (AIC): estimates the quality (prediction power) of multiple models relative to one another (smaller values are preferred)
  • Schwarz-Bayesian Information criterion (SBC): penalizes a model for adding extra (unnecessary) variables (smaller values are preferred)
  • Predicted Residual Sum of Squares Error (PRESS): tests for over-fitting by summing the squared prediction errors obtained when each observation is left out of the fit in turn (smaller values are preferred)
Goodness-of-fit Measures of Log-transformed Model
           SSE        R.sq      R.adj      Cp  AIC      SBC        PRESS
log.model  0.2923658  0.852188  0.8408179  10  -751.39  -722.9481  0.3461065
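The adjusted R-squared, AIC, and SBC in the table can be cross-checked by hand from SSE, R-squared, the sample size (127), and the number of estimated coefficients (10). The sketch below assumes the SSE-based forms AIC = n log(SSE/n) + 2p and SBC = n log(SSE/n) + p log(n). That choice of formula is an assumption about how the previous report computed the measures (it differs from R's built-in AIC() by an additive constant), although it is consistent with the tabled values.

```r
# Values copied from the goodness-of-fit table above
n   <- 127          # countries in the 2014 sample
p   <- 10           # estimated coefficients, including the intercept
sse <- 0.2923658    # sum of squared errors on the log scale
r2  <- 0.852188     # coefficient of determination

# Assumed SSE-based forms of the information criteria
aic   <- n * log(sse / n) + 2 * p
sbc   <- n * log(sse / n) + p * log(n)

# Adjusted R-squared penalizes R-squared for the number of coefficients
r.adj <- 1 - (1 - r2) * (n - 1) / (n - p)

round(c(AIC = aic, SBC = sbc, R.adj = r.adj), 4)
```

Because both criteria share the term n log(SSE/n), they differ only in the penalty: 2 per coefficient for AIC versus log(127), roughly 4.84, per coefficient for SBC.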

Finally, we explained how to interpret the estimated regression coefficients. For this, we elaborated upon the coefficient for the binary dummy variable Status using the following logic:

Suppose two countries have identical values for every explanatory variable except Status: one is Developing (StatusDeveloping = 1) and the other is Developed (StatusDeveloping = 0). Subtracting the two fitted equations gives \[\log(Developing)-\log(Developed)=-0.004144 \to \log\left(\frac{Developing}{Developed}\right)=-0.004144 \to Developing=e^{-0.004144}\times Developed\approx 0.995865\times Developed\] where Developing and Developed denote the respective life expectancies. The above equation can be rewritten in the following way: \[Developing-Developed=(0.995865-1)\times Developed \to \frac {Developing-Developed}{Developed}\approx -0.004135=-0.4135\%\] That is, the life expectancy of a developing country is estimated to be about 0.41 percent lower than that of an otherwise identical developed country. More generally, the percent increase (or decrease) in the response for every one-unit increase in the \(i\)-th independent variable, holding the others fixed, is \((e^{\beta_i}-1)\times 100\).
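As a minimal sketch, the percent-change formula can be applied to several coefficients at once. The estimates below are copied from the final-model table; the vector names beta and pct.change are illustrative only.

```r
# Coefficient estimates copied from the final-model table
beta <- c(StatusDeveloping                = -0.0041436,
          HIV.AIDS                        = -0.0213968,
          Total.expenditure               =  0.0033727,
          Income.composition.of.resources =  0.5574610)

# Percent change in life expectancy per one-unit increase in each predictor,
# holding the other predictors fixed
pct.change <- (exp(beta) - 1) * 100
round(pct.change, 4)
```

Note that a one-unit increase in Income.composition.of.resources spans the variable's entire 0-to-1 scale, so its percent change describes the contrast between the two extremes rather than a typical difference between countries.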

2.3 Bootstrap Cases

In the following section, we use bootstrap sampling to generate a bootstrap regression model of the log-transformed model discussed in the previous section. Following the bootstrap model, we will construct bootstrap confidence intervals for each of the coefficients to assess the stability and significance of each coefficient.

2.3.1 Bootstrap Sampling and Confidence Interval Construction

A loop generates the bootstrap samples and refits the regression on each of 1,000 replicates.

## redefine the log-transformed model
log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)

## define number of bootstrap replicates
B = 1000  # choose 1000 bootstrap replicates to sample

## define number of parameters, sample size, and empty coefficient matrix
num.p = length(coef(log.model))  # number of estimated coefficients (10); each of the four region dummies counts separately
smpl.n = nrow(model.frame(log.model)) # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
## resample cases (rows) with replacement and refit the model
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # row indices of the bootstrap sample
  log.model.btc = lm(log(Life.expectancy)~.-Country, data = Country.stats[bootc.id,])  # fit final model to the bootstrap sample
  coef.mtrx[i,] = coef(log.model.btc)    # extract coefs from bootstrap regression model    
}

Next, we define a function that draws a histogram of the bootstrap sampling distribution of each regression coefficient estimate.

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # kernel density estimate
}
par(mar=c(2,2,2,2))
par(mfrow=c(4,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="StatusDeveloping" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="GDP" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="HIV.AIDS" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Total.expenditure" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="regionAmericas" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=7, var.nm ="regionAsia" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=8, var.nm ="regionEurope" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=9, var.nm ="regionOceania" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=10, var.nm ="Income.composition.of.resources" )

Each of the 10 histograms above contains two density curves:

  • The red curve is a normal density whose mean and standard deviation are those of the bootstrap coefficient estimates. It represents the normal approximation to the sampling distribution, the same kind of approximation that underlies the p-values reported in the regression output.
  • The blue curve is a non-parametric, data-driven kernel estimate of the density of the bootstrap sampling distribution. The bootstrap confidence intervals of the regression coefficients are based on these non-parametric bootstrap sampling distributions.

The two density curves are similar in each of the 10 histograms, and all distributions are approximately normal with only minor deviations.

Finally, we construct 95% bootstrap confidence intervals for each of the coefficients and combine them with the output of the final model to draw further observations about the model.

num.p = (dim(coef.mtrx)[2]) # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[,i], 0.025, type = 2), 8) #lower bound of 95% CI
  uci.975 = round(quantile(coef.mtrx[,i], 0.975, type = 2 ),8) #upper bound of 95% CI
  btc.wd[i] =  uci.975 - lci.025 #difference between the upper and lower bounds
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(summary(log.model)$coef,4,format="f"), btc.ci.95=btc.ci)), 
      caption = "Regression Coefficient Matrix") #combine inferential statistics of the model with the bootstrap CI 
Regression Coefficient Matrix
                                 Estimate  Std. Error  t value  Pr(>|t|)  btc.ci.95
(Intercept)                      3.8581    0.0399      96.7685  0.0000    [ 3.7794 , 3.9361 ]
StatusDeveloping                 -0.0041   0.0197      -0.2104  0.8337    [ -0.0433 , 0.0361 ]
GDP                              -0.0000   0.0000      -0.0653  0.9480    [ 0 , 0 ]
HIV.AIDS                         -0.0214   0.0039      -5.5304  0.0000    [ -0.0326 , -0.0132 ]
Total.expenditure                0.0034    0.0019      1.7733   0.0788    [ -0.0027 , 0.0089 ]
regionAmericas                   0.0328    0.0171      1.9215   0.0571    [ 0.003 , 0.0642 ]
regionAsia                       0.0169    0.0155      1.0888   0.2785    [ -0.0116 , 0.0476 ]
regionEurope                     0.0291    0.0222      1.3070   0.1938    [ -0.0114 , 0.0742 ]
regionOceania                    0.0236    0.0213      1.1085   0.2699    [ -0.0117 , 0.0637 ]
Income.composition.of.resources  0.5575    0.0503      11.0850  0.0000    [ 0.4536 , 0.6537 ]

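The table above summarizes the inferential statistics of the final model alongside the case-bootstrap intervals. At the 0.05 level, the significance tests based on the p-values agree with the bootstrap confidence intervals for every coefficient except regionAmericas, whose p-value (0.0571) is marginally non-significant while its bootstrap interval [0.003, 0.0642] narrowly excludes zero.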
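The comparison between p-values and bootstrap intervals can be made explicit with a short check: a coefficient is flagged as significant by its p-value when p < 0.05 and by its interval when the interval excludes zero. The sketch below copies the rounded values from the table above; the data frame res and its column names are illustrative, and the GDP bounds, which round to zero in the table, are treated as containing zero (an assumption).

```r
# p-values and case-bootstrap 95% CI bounds copied from the table above
res <- data.frame(
  term  = c("(Intercept)", "StatusDeveloping", "GDP", "HIV.AIDS",
            "Total.expenditure", "regionAmericas", "regionAsia",
            "regionEurope", "regionOceania", "Income.composition.of.resources"),
  p     = c(0.0000, 0.8337, 0.9480, 0.0000, 0.0788,
            0.0571, 0.2785, 0.1938, 0.2699, 0.0000),
  lower = c(3.7794, -0.0433, 0, -0.0326, -0.0027,
            0.0030, -0.0116, -0.0114, -0.0117, 0.4536),
  upper = c(3.9361, 0.0361, 0, -0.0132, 0.0089,
            0.0642, 0.0476, 0.0742, 0.0637, 0.6537),
  stringsAsFactors = FALSE
)

res$sig.p  <- res$p < 0.05                    # significant by p-value
res$sig.ci <- res$lower > 0 | res$upper < 0   # CI excludes zero

# terms on which the two criteria disagree
disagree <- res$term[res$sig.p != res$sig.ci]
disagree
```

Running the same check on unrounded bounds taken directly from coef.mtrx would avoid the rounding caveat for GDP.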

2.4 Bootstrap Residuals and Residual Analysis

In the following section, we first restate the residuals from the original model and describe their distribution. Next, we take bootstrap samples of the residuals and use them to construct confidence intervals for the regression coefficients.

2.4.1 Restating the Original Residuals

The distribution of the residuals of the final model is shown in the following histogram.

hist(log.model$residuals, breaks = 40,
     xlab = "Residuals",
     col = "lightblue",
     border = "navy",
     main = "Histogram of Residuals")

The histogram shown above reveals the following information about the distribution of the residuals:

  • there are two possible outliers in the model
  • the distribution is skewed right
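The visual impressions listed above can be supplemented with numerical checks. The following is a minimal sketch on a stand-in model fitted to R's built-in mtcars data, since Country.stats is not available outside this report; demo.model, demo.resid, and the cutoff of 3 for standardized residuals are assumptions for illustration, and log.model would take the place of demo.model in practice.

```r
# Stand-in model on a built-in data set (assumption: replace with log.model)
demo.model <- lm(log(mpg) ~ wt + hp, data = mtcars)
demo.resid <- residuals(demo.model)

# Shapiro-Wilk test: small p-values suggest a departure from normality
sw <- shapiro.test(demo.resid)

# Sample skewness: positive values indicate a right skew
skew <- mean((demo.resid - mean(demo.resid))^3) / sd(demo.resid)^3

# Candidate outliers: standardized residuals larger than 3 in absolute value
out.id <- which(abs(rstandard(demo.model)) > 3)

c(W = unname(sw$statistic), p.value = sw$p.value, skewness = skew)
```

These summaries do not replace the histogram; they only give a number to attach to the skewness and outlier observations.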

2.4.2 Residual Bootstrap Samples

This section takes 1,000 bootstrap samples of the residuals from the final model, adds each sample to the fitted values to form a bootstrap response, and refits the model to obtain bootstrap estimates of the regression coefficients. The following code draws the samples and then constructs histograms of the resulting coefficient distributions.

## Final model
log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)
model.resid = log.model$residuals
##
B=1000
num.p = dim(model.matrix(log.model))[2]   # number of parameters
samp.n = dim(model.matrix(log.model))[1]  # sample size
btr.mtrx = matrix(rep(0,num.p*B), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:B){
  ## Bootstrap response values
  bt.lg.expectancy = log.model$fitted.values + 
        sample(log.model$residuals, samp.n, replace = TRUE)  # bootstrap residuals
  #  send the boot response to the data
  btr.model = lm(bt.lg.expectancy ~ .-Country-Life.expectancy, data = Country.stats)   # bootstrap regression model of original model
  btr.mtrx[i,]=btr.model$coefficients #store coefficients in the zero matrix
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # kernel density estimate
} 
par(mar=c(2,2,2,2))
par(mfrow=c(4,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="StatusDeveloping" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="GDP" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="HIV.AIDS" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Total.expenditure" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="regionAmericas" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=7, var.nm ="regionAsia" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=8, var.nm ="regionEurope" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=9, var.nm ="regionOceania" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=10, var.nm ="Income.composition.of.resources" )

Above are the residual bootstrap sampling distributions of each estimated regression coefficient. The normal and kernel density curves are close to each other, which suggests that inference about the significance of the variables based on p-values and on the residual bootstrap will yield the same results.

The 95% bootstrap residual confidence intervals are combined with the inferential statistics of the final model in the following table.

#
num.p = dim(btr.mtrx)[2]  # number of parameters
btr.ci = NULL #define an empty vector to store bootstrap residual CI
btr.wd = NULL #define an empty vector to store the difference between upper and lower bound
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8) # lower bound
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8) # upper bound
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(summary(log.model)$coef,4,format="f"), btr.ci.95=btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap CI
                                 Estimate  Std. Error  t value  Pr(>|t|)  btr.ci.95
(Intercept)                      3.8581    0.0399      96.7685  0.0000    [ 3.7847 , 3.9316 ]
StatusDeveloping                 -0.0041   0.0197      -0.2104  0.8337    [ -0.0413 , 0.0328 ]
GDP                              -0.0000   0.0000      -0.0653  0.9480    [ 0 , 0 ]
HIV.AIDS                         -0.0214   0.0039      -5.5304  0.0000    [ -0.0292 , -0.0142 ]
Total.expenditure                0.0034    0.0019      1.7733   0.0788    [ -1e-04 , 0.0069 ]
regionAmericas                   0.0328    0.0171      1.9215   0.0571    [ -0.0012 , 0.0666 ]
regionAsia                       0.0169    0.0155      1.0888   0.2785    [ -0.0122 , 0.0462 ]
regionEurope                     0.0291    0.0222      1.3070   0.1938    [ -0.0128 , 0.0731 ]
regionOceania                    0.0236    0.0213      1.1085   0.2699    [ -0.0217 , 0.0634 ]
Income.composition.of.resources  0.5575    0.0503      11.0850  0.0000    [ 0.4622 , 0.6475 ]
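The residual-bootstrap intervals for the intercept, HIV.AIDS, and Income.composition.of.resources exclude zero, while the intervals for all other coefficients contain zero (the GDP bounds round to zero at four decimal places). This agrees with the significance tests represented by the p-values at the 0.05 level.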

3 Conclusion and Discussion

In the analysis conducted above, we first restated the preliminary analysis of the data and the final model found in the previous report. Since the last report discovered minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\), we used case and residual bootstrap sampling with 1,000 replicates each to construct confidence intervals for the regression coefficients. The intervals agree with the tests for significance represented by the p-values for nearly every coefficient.

We combine all inferential statistics, including the bootstrap CIs, in the following table:

kable(as.data.frame(cbind(formatC(summary(log.model)$coef[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
      caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
Final Combined Inferential Statistics: p-values and Bootstrap CIs
                                 Estimate  Std. Error  Pr(>|t|)  btc.ci.95              btr.ci.95
(Intercept)                      3.8581    0.0399      0.0000    [ 3.7794 , 3.9361 ]    [ 3.7847 , 3.9316 ]
StatusDeveloping                 -0.0041   0.0197      0.8337    [ -0.0433 , 0.0361 ]   [ -0.0413 , 0.0328 ]
GDP                              -0.0000   0.0000      0.9480    [ 0 , 0 ]              [ 0 , 0 ]
HIV.AIDS                         -0.0214   0.0039      0.0000    [ -0.0326 , -0.0132 ]  [ -0.0292 , -0.0142 ]
Total.expenditure                0.0034    0.0019      0.0788    [ -0.0027 , 0.0089 ]   [ -1e-04 , 0.0069 ]
regionAmericas                   0.0328    0.0171      0.0571    [ 0.003 , 0.0642 ]     [ -0.0012 , 0.0666 ]
regionAsia                       0.0169    0.0155      0.2785    [ -0.0116 , 0.0476 ]   [ -0.0122 , 0.0462 ]
regionEurope                     0.0291    0.0222      0.1938    [ -0.0114 , 0.0742 ]   [ -0.0128 , 0.0731 ]
regionOceania                    0.0236    0.0213      0.2699    [ -0.0117 , 0.0637 ]   [ -0.0217 , 0.0634 ]
Income.composition.of.resources  0.5575    0.0503      0.0000    [ 0.4536 , 0.6537 ]    [ 0.4622 , 0.6475 ]
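The loops above store the interval widths in btc.wd and btr.wd but never report them. As a rough sketch, the widths can be recomputed from the rounded bounds in the combined table to see which bootstrap method gives the narrower interval for each coefficient. GDP is left out because its bounds round to zero, and the object names below are illustrative.

```r
# 95% CI bounds copied from the combined table (GDP omitted: bounds round to 0)
terms <- c("(Intercept)", "StatusDeveloping", "HIV.AIDS", "Total.expenditure",
           "regionAmericas", "regionAsia", "regionEurope", "regionOceania",
           "Income.composition.of.resources")
case.lower  <- c(3.7794, -0.0433, -0.0326, -0.0027, 0.0030, -0.0116, -0.0114, -0.0117, 0.4536)
case.upper  <- c(3.9361, 0.0361, -0.0132, 0.0089, 0.0642, 0.0476, 0.0742, 0.0637, 0.6537)
resid.lower <- c(3.7847, -0.0413, -0.0292, -0.0001, -0.0012, -0.0122, -0.0128, -0.0217, 0.4622)
resid.upper <- c(3.9316, 0.0328, -0.0142, 0.0069, 0.0666, 0.0462, 0.0731, 0.0634, 0.6475)

# interval widths under each bootstrap method, one row per coefficient
widths <- cbind(case = case.upper - case.lower,
                resid = resid.upper - resid.lower)
rownames(widths) <- terms
round(widths, 4)
```

Widths computed from four-decimal bounds are only approximate; the stored btc.wd and btr.wd vectors would give the unrounded values.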

After combining all inferential statistics, we verify the conclusion of the last report in which we found that:

  1. the model is statistically significant at \(\alpha=.01\) since \(p < .0001\)
  2. the model is relatively effective at prediction and estimation as the adjusted coefficient of determination is relatively high (\(adj.R^2 = 0.8408\))
  3. the intercept coefficient is 3.858, meaning that, holding the explanatory variables constant at 0 (a developed African country, the reference categories), life expectancy is approximately \(e^{3.858}\approx 47.37\) years
  4. the independent variables’ coefficients and significance are as summarized in the combined table above

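An analysis of the case and residual bootstrap confidence intervals for the regression coefficients finds that, despite minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\), the significance tests are largely supported: the intervals for coefficients with significant p-values exclude zero and those for non-significant coefficients contain zero. The one exception is regionAmericas, whose case-bootstrap interval narrowly excludes zero while its p-value (0.0571) and residual-bootstrap interval do not indicate significance.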