1 Introduction

In the following analysis, we build upon the previous examination of average life expectancy in 127 countries during the year 2014. We found previously that the most effective model is a log-transformed model with the explicit structure given by \[\log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364\times 10^{-8}\times GDP-0.0214\times HIV.AIDS+\]\[ 0.003373\times Total.expenditure+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] However, a residual analysis concluded that there were still minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\). The goal of this report is to address these violations by using bootstrap sampling to construct bootstrap confidence intervals for the regression coefficients, first by resampling cases and then by resampling residuals.

1.1 Data Description

Our data set for this analysis is the same as in the previous assignment. The first data set, Life Expectancy (WHO), tracks life expectancy and other health, social, and economic factors in 193 countries from 2000 to 2015; it comes from the Global Health Observatory (GHO) data repository under the authority of the World Health Organization (WHO). A second data set, Country Mapping - ISO, Continent, Region, was created by Kaggle user andradaolteanu for the explicit purpose of country mapping. It is used solely to attach each country's region to the Life Expectancy (WHO) data set. Our final data set, named Country.stats, contains the following variables:

Country
Status: categorical variable for determining whether a country is Developed or Developing
Life.expectancy: the average life expectancy of a country
GDP: Gross Domestic Product (GDP) per capita
Income.composition.of.resources: a scale from 0 to 1 of how well a country utilizes its resources
HIV.AIDS: Deaths from HIV/AIDS per 1,000 live births (0-4 years)
Total.expenditure: General government expenditure on health as a percentage of total government expenditure (%)
region: regional location (Americas, Africa, Asia, Oceania, Europe) of country

1.2 Practical Question

The purpose of the following analysis is to verify the empirical connection between life expectancy and various social, economic, health, and geographic factors in 127 countries for the year 2014 determined in the previous analysis. This will be conducted using bootstrap sampling to construct confidence intervals around the regression coefficients and bootstrap the residuals to correct the violations found in the previous analysis.

2 Analysis

There will be three main components to the following analysis:

A summary of the exploratory data analysis conducted in the previous assignment where a preliminary examination of the variables and their interaction with each other was conducted
A summary of the final model generated in the previous report
Construction of bootstrap cases using bootstrap sampling to create a bootstrap confidence interval for the regression coefficients and bootstrap residuals for residual analysis.

2.1 Exploratory Data Analysis

In accordance with the previous report, the preliminary analysis imports, transforms, and cleans the data. Two plots, a pairwise scatter plot and an exploratory graph, then help determine the relationships between the variables and develop a narrative to be explored.

2.1.1 Import and Clean Data

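library(dplyr)    # filter(), inner_join(), select(), %>%
library(ggplot2)  # ggplot()
library(GGally)   # ggpairs()
library(pander)   # pander()
library(knitr)    # kable()

Expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv", 
                   header = TRUE) #read in data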

Region <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/continents2.csv", 
                   header = TRUE)

expectancy <- filter(Expectancy, Year == 2014) %>%
  na.omit() # construct data set containing only the year 2014 and omit NAs. 

Country.stats <- inner_join(expectancy, Region, by="Country") %>% 
  select(Country, Status, Life.expectancy, GDP, HIV.AIDS, Total.expenditure, region,Income.composition.of.resources) #merge data sets expectancy and Region and select only certain variables for testing. After omitting NAs, our data set only has 127 countries

pander(head(Country.stats))

Country	Status	Life.expectancy	GDP	HIV.AIDS	Total.expenditure	region	Income.composition.of.resources
Afghanistan	Developing	59.9	612.7	0.1	8.18	Asia	0.476
Albania	Developing	77.5	4576	0.1	5.88	Europe	0.761
Algeria	Developing	75.4	547.9	0.1	7.21	Africa	0.741
Angola	Developing	51.7	479.3	2	3.31	Africa	0.527
Argentina	Developing	76.2	12245	0.1	4.79	Americas	0.825
Armenia	Developing	74.6	3995	0.1	4.48	Asia	0.739

2.1.2 Pairwise Scatterplot

The following pairwise scatter plot visualizes the distribution of each variable and the pairwise relationships between variables. An assessment of the plot reveals that the quantitative variables have the following correlations with the response variable Life.expectancy: GDP = -0.445, HIV.AIDS = -0.611, Total.expenditure = 0.332, and Income.composition.of.resources = 0.891.

ggpairs(Country.stats, columns = 2:8) # pairwise plot of all variables in data set

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1.3 Exploratory Scatter Plot

The following graph explores some of the variables in more depth and develops a narrative for the response variable and the independent variables. In particular, it shows the relationship between life expectancy and resource utilization, with the individual points colored by region, shaped by status of development, and sized by GDP per capita. What is evident at first glance is that the countries in the upper-right area of the graph are disproportionately (although not exclusively) developed European and Oceanian countries with high GDP per capita. Those in the bottom-left area of the graph are disproportionately developing African countries with very low GDP per capita.

ggplot(Country.stats, aes(x=Income.composition.of.resources, y=Life.expectancy, col = region, shape=Status, size=GDP))+
  geom_point()+
  theme_minimal()+
  labs(title="Life Expectancy as a Function of Resource Utilization in 127 Countries",
       subtitle = "Shaped by Status of Development, Sized by GDP per capita (in USD), and Colored by Region",
                     x = "Income Composition of Resources", 
                     y = "Life Expectancy")+
  scale_color_manual(values=c("#68aed6","#4292c6","#2171b5","#08519c","#08306b"), name="Region")+
  guides(size=guide_legend(
    override.aes = list(color = c("azure3","azure3","azure3"))
  ), color=guide_legend(
    override.aes = list(size=3)), shape=guide_legend(override.aes = list(size=2)))+
  scale_size_continuous(name = "GDP (per capita)")

This concludes the exploratory section of the analysis. The following section will build upon this analysis by restating the final model generated in the previous analysis and then constructing a bootstrap sample to recreate the final model.

2.2 Restating the Final Model

The previous analysis compared three models: a multiple OLS linear regression, a log-transformed regression, and a squared-transformed regression. Based on residual analysis and goodness-of-fit measures, the regression on the log-transformed response, with the structure \(\log(Y)=\beta_0+\beta_1x_1+\cdots+\beta_kx_k\), is the best model. The model is summarized in the following table.

Inferential Statistics of Final Model
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	3.8580921	0.0398693	96.7685100	0.0000000
StatusDeveloping	-0.0041436	0.0196952	-0.2103883	0.8337306
GDP	0.0000000	0.0000004	-0.0653112	0.9480377
HIV.AIDS	-0.0213968	0.0038689	-5.5304367	0.0000002
Total.expenditure	0.0033727	0.0019019	1.7733084	0.0787795
regionAmericas	0.0327692	0.0170540	1.9214960	0.0571017
regionAsia	0.0168546	0.0154794	1.0888439	0.2784605
regionEurope	0.0290678	0.0222393	1.3070421	0.1937614
regionOceania	0.0236311	0.0213172	1.1085497	0.2698980
Income.composition.of.resources	0.5574610	0.0502895	11.0850397	0.0000000

As previously stated, the explicit structure of the model is \[\log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364\times 10^{-8}\times GDP-0.0214\times HIV.AIDS+\]\[ 0.003373\times Total.expenditure+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] in which \(\log(\)Life.expectancy\()\) is the response variable and Status, GDP, HIV.AIDS, Total.expenditure, region, and Income.composition.of.resources are the explanatory variables.

We then conducted residual analyses on the final model. The residual plots show minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\).

Residual Analysis Plots

Alongside the residual analysis, goodness-of-fit measures were also conducted using the following measures:

Sum of Squares Error (SSE): measures how well a model fits the data by summing the squared errors, i.e. the squared differences between the observed and predicted values
R.squared: the coefficient of determination
adjusted.R.squared: the coefficient of determination adjusted for the number of predictors in the model
Mallows’ cp: addresses over-fitting by penalizing the addition of unnecessary variables
Akaike information criterion (AIC): estimates the quality (prediction power) of multiple models relative to one another (smaller values are preferred)
Schwarz-Bayesian Information criterion (SBC): penalizes a model for adding extra (unnecessary) variables (smaller values are preferred)
Predicted Residual Sum of Squares Error (PRESS): tests for over-fitting by summing the squared prediction errors of each observation when it is left out of the fit (smaller values are preferred)
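The measures listed above can be sketched in a few lines of R. This is an illustration only: the built-in mtcars data stands in for Country.stats, gof_measures is a hypothetical helper name, and the AIC and SBC forms \(n\log(SSE/n)+2p\) and \(n\log(SSE/n)+p\log(n)\) are an assumption about how the reported values were computed (they do reproduce the reported AIC and SBC from SSE = 0.2923658, n = 127, and p = 10). These forms differ from R's built-in AIC() and BIC() by a constant that does not affect model comparison.

```r
# Illustrative sketch: mtcars stands in for Country.stats, and
# gof_measures is a hypothetical helper, not part of the original report.
gof_measures <- function(model) {
  res <- residuals(model)
  n   <- length(res)            # sample size
  p   <- length(coef(model))    # number of estimated coefficients
  sse <- sum(res^2)             # sum of squared errors
  s   <- summary(model)
  c(SSE   = sse,
    R.sq  = s$r.squared,
    R.adj = s$adj.r.squared,
    AIC   = n * log(sse / n) + 2 * p,        # assumed SSE-based form
    SBC   = n * log(sse / n) + p * log(n),   # assumed SSE-based form
    # PRESS via the leave-one-out shortcut e_i / (1 - h_ii)
    PRESS = sum((res / (1 - hatvalues(model)))^2))
}

toy.model <- lm(log(mpg) ~ wt + hp, data = mtcars)
gof <- gof_measures(toy.model)
print(round(gof, 4))
```

PRESS is computed from the hat values rather than by refitting the model once per observation; for a linear model the two approaches give the same result.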

Goodness-of-fit Measures of Log-transformed Model
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
log.model	0.2923658	0.852188	0.8408179	10	-751.39	-722.9481	0.3461065

Finally, we explained how to interpret the estimated regression coefficients. For this, we elaborated upon the coefficient for the binary dummy variable Status using the following logic:

Suppose that all explanatory variables except Status are held constant, so that two countries are identical apart from the fact that the Status of one is Developing, or 1, and the Status of the other is Developed, or 0. Then \[\log(Developing)-\log(Developed)=-0.004144 \to \log(\frac{Developing}{Developed})=-0.004144 \to Developing=e^{-0.004144}\times Developed\approx 0.995865\times Developed\] The above equation can be rewritten in the following way: \[Developing-Developed=(0.995865-1)\times Developed \to \frac {Developing-Developed}{Developed}=-0.004135=-0.4135\%\] That is, the life expectancy of a developing country is about 0.4135 percent lower than that of an otherwise identical developed country. More generally, the percent change in the response for a one-unit increase in an independent variable is obtained by applying the following expression to the individual coefficient: \((e^{\beta_i}-1)\times 100\).
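As a small worked check of the \((e^{\beta_i}-1)\times 100\) rule, the following sketch applies it to a few coefficients copied from the final-model table. The helper name pct.change and its delta argument (the size of the increase in the explanatory variable) are introduced here for illustration only.

```r
# Coefficients copied from the final-model table above.
beta <- c(StatusDeveloping  = -0.0041436,
          Total.expenditure =  0.0033727,
          regionAmericas    =  0.0327692)

# pct.change is a hypothetical helper: percent change in the response
# for an increase of `delta` units in the explanatory variable.
pct.change <- function(b, delta = 1) (exp(b * delta) - 1) * 100

pct <- pct.change(beta)
print(round(pct, 4))

# GDP: effect of a 10,000 increase in GDP per capita
gdp.pct <- pct.change(-2.364e-08, delta = 10000)
print(round(gdp.pct, 4))
```

Because the coefficients are small, the percent change is close to \(100\times\beta_i\); the exponential form matters more for larger coefficients such as that of Income.composition.of.resources.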

2.3 Bootstrap Cases

In the following section, we use bootstrap sampling to generate a bootstrap regression model of the log-transformed model discussed in the previous section. Following the bootstrap model, we will construct bootstrap confidence intervals for each of the coefficients to assess the stability and significance of each coefficient.

2.3.1 Bootstrap Sampling and Confidence Interval Construction

A loop is used to generate the bootstrap samples and refit the regression for each of 1,000 replicates.

## redefine the log-transformed model
log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)

## define number of bootstrap replicates
B = 1000  # choose 1000 bootstrap replicates to sample

## define number of parameters, sample size, and empty coefficient matrix
num.p = length(coef(log.model))  # number of estimated coefficients (intercept, slopes, and the four region dummies)
smpl.n = dim(model.frame(log.model))[1] # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
## case bootstrap: resample rows with replacement and refit the model
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # row indices of the bootstrap sample
  log.model.btc = lm(log(Life.expectancy)~.-Country, data = Country.stats[bootc.id,])  # fit final model to the bootstrap sample
  coef.mtrx[i,] = coef(log.model.btc)    # extract coefs from bootstrap regression model    
}
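The case-resampling idea in the loop above can be tried on a small built-in data set. The following is a minimal sketch under stated assumptions: mtcars and the formula log(mpg) ~ wt + hp stand in for Country.stats and the final model, the seed is arbitrary, and B is reduced to 200 to keep the run short.

```r
set.seed(321)   # arbitrary seed so the sketch is reproducible
B <- 200        # fewer replicates than the report's 1,000, for speed

toy.fit <- lm(log(mpg) ~ wt + hp, data = mtcars)   # stand-in model
n <- nrow(mtcars)

# one row per replicate, one column per coefficient
boot.coefs <- matrix(NA_real_, nrow = B, ncol = length(coef(toy.fit)),
                     dimnames = list(NULL, names(coef(toy.fit))))

for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)          # resample whole rows (cases)
  boot.coefs[b, ] <- coef(lm(log(mpg) ~ wt + hp, data = mtcars[idx, ]))
}

print(head(boot.coefs, 3))
```

Naming the columns of the storage matrix after the coefficients avoids having to track coefficients by position when plotting or summarizing them later.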

Then, a function is defined to draw a histogram of the bootstrap sampling distribution of each regression coefficient estimate.

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
}

par(mar=c(2,2,2,2))
par(mfrow=c(4,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="StatusDeveloping" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="GDP" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="HIV.AIDS" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Total.expenditure" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="regionAmericas" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=7, var.nm ="regionAsia" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=8, var.nm ="regionEurope" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=9, var.nm ="regionOceania" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=10, var.nm ="Income.composition.of.resources" )

Each of the 10 histograms above is overlaid with two density curves:

The red curve is a normal density whose mean and standard deviation are those of the bootstrap coefficient estimates. It stands in for the normal sampling distribution on which the p-values in the regression output rely.
The blue curve is a non-parametric, data-driven (kernel) estimate of the density of the bootstrap sampling distribution. The bootstrap confidence intervals of the regression coefficients are based on these non-parametric bootstrap sampling distributions.

The density curves in each of the 10 histograms are all relatively similar and all distributions are approximately normal with only minor deviations.

Finally, we construct 95% bootstrap confidence intervals for each of the coefficients and combine them with the output of the final model to draw further observations about the model.

num.p = (dim(coef.mtrx)[2]) # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[,i], 0.025, type = 2), 8) #lower bound of 95% CI
  uci.975 = round(quantile(coef.mtrx[,i], 0.975, type = 2 ),8) #upper bound of 95% CI
  btc.wd[i] =  uci.975 - lci.025 #difference between the upper and lower bounds
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(summary(log.model)$coef,4,format="f"), btc.ci.95=btc.ci)), 
      caption = "Regression Coefficient Matrix") #combine inferential statistics of the model with the bootstrap CI

Regression Coefficient Matrix
	Estimate	Std. Error	t value	Pr(>|t|)	btc.ci.95
(Intercept)	3.8581	0.0399	96.7685	0.0000	[ 3.7794 , 3.9361 ]
StatusDeveloping	-0.0041	0.0197	-0.2104	0.8337	[ -0.0433 , 0.0361 ]
GDP	-0.0000	0.0000	-0.0653	0.9480	[ 0 , 0 ]
HIV.AIDS	-0.0214	0.0039	-5.5304	0.0000	[ -0.0326 , -0.0132 ]
Total.expenditure	0.0034	0.0019	1.7733	0.0788	[ -0.0027 , 0.0089 ]
regionAmericas	0.0328	0.0171	1.9215	0.0571	[ 0.003 , 0.0642 ]
regionAsia	0.0169	0.0155	1.0888	0.2785	[ -0.0116 , 0.0476 ]
regionEurope	0.0291	0.0222	1.3070	0.1938	[ -0.0114 , 0.0742 ]
regionOceania	0.0236	0.0213	1.1085	0.2699	[ -0.0117 , 0.0637 ]
Income.composition.of.resources	0.5575	0.0503	11.0850	0.0000	[ 0.4536 , 0.6537 ]

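The table above summarizes the inferential statistics of the final model. At the 0.05 level, the significance tests based on the p-values agree with the corresponding bootstrap confidence intervals for every coefficient except regionAmericas, whose p-value (0.0571) is slightly above 0.05 while its interval [0.003, 0.0642] excludes zero. The GDP interval rounds to [0, 0] at four decimal places because the coefficient is on the order of \(10^{-8}\), so its bounds cannot be read from the table.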
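One way to check the agreement between the p-values and the case bootstrap intervals is to compare, for each coefficient, whether the p-value is below 0.05 with whether the interval excludes zero. The sketch below does this for four rows copied from the table above; the data frame ci.check and its column names are introduced here for illustration.

```r
# Values copied from the "Regression Coefficient Matrix" table above.
ci.check <- data.frame(
  term    = c("StatusDeveloping", "HIV.AIDS", "Total.expenditure", "regionAmericas"),
  p.value = c(0.8337, 0.0000, 0.0788, 0.0571),
  lower   = c(-0.0433, -0.0326, -0.0027, 0.0030),
  upper   = c( 0.0361, -0.0132,  0.0089, 0.0642)
)

ci.check$sig.by.p  <- ci.check$p.value < 0.05                    # significant by p-value
ci.check$sig.by.ci <- ci.check$lower > 0 | ci.check$upper < 0    # interval excludes zero
ci.check$agree     <- ci.check$sig.by.p == ci.check$sig.by.ci

print(ci.check)
```

In this subset the two criteria disagree only for regionAmericas, where the p-value is just above 0.05 and the lower bound of the interval is just above zero.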

2.4 Bootstrap Residuals and Residual Analysis

In the following section, we first restate the residuals from the original model and describe their distribution. Next, we take bootstrap samples of the residuals and use them to construct confidence intervals for the regression coefficients.

2.4.1 Restating the Original Residuals

The distribution of the residuals of the final model is shown in the following histogram.

hist(sort(log.model$residuals),n=40,
     xlab = "Residuals",
     col = "lightblue",
     border = "navy",
     main = "Histogram of Residuals")

The histogram shown above reveals the following information about the distribution of the residuals:

there are two possible outliers in the model
the distribution is skewed right

2.4.2 Residual Bootstrap Samples

This section takes 1,000 bootstrap samples of the residuals from the final model and uses them to construct confidence intervals for the regression coefficients. The following code draws the samples, refits the model, and then constructs histograms of the resulting coefficient distributions.

## Final model
log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)
model.resid = log.model$residuals
##
B=1000
num.p = dim(model.matrix(log.model))[2]   # number of parameters
samp.n = dim(model.matrix(log.model))[1]  # sample size
btr.mtrx = matrix(rep(0,num.p*B), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:B){
  ## Bootstrap response values
  bt.lg.expectancy = log.model$fitted.values + 
        sample(log.model$residuals, samp.n, replace = TRUE)  # bootstrap residuals
  #  send the boot response to the data
  btr.model = lm(bt.lg.expectancy ~ .-Country-Life.expectancy, data = Country.stats)   # bootstrap regression model of original model
  btr.mtrx[i,]=btr.model$coefficients #store coefficients in the zero matrix
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # kernel density curve
}

par(mar=c(2,2,2,2))
par(mfrow=c(4,3))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="StatusDeveloping" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="GDP" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="HIV.AIDS" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Total.expenditure" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="regionAmericas" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=7, var.nm ="regionAsia" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=8, var.nm ="regionEurope" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=9, var.nm ="regionOceania" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=10, var.nm ="Income.composition.of.resources" )

Above are the residual bootstrap sampling distributions of each estimated regression coefficient. The normal and kernel density curves are close to each other, which indicates that inference about the significance of variables based on p-values and on the residual bootstrap will yield the same results.
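As a compact restatement of the residual bootstrap loop above (the symbols \(e_i^{*}\), \(y_i^{*}\), and \(\hat{\beta}_j^{*}\) are notation introduced here and do not appear in the original code), each replicate keeps the explanatory variables fixed and rebuilds the response from the fitted values plus resampled residuals: \[y_i^{*}=\hat{y}_i+e_i^{*},\qquad e_i^{*}\ \text{drawn with replacement from}\ \{e_1,\dots,e_n\},\qquad i=1,\dots,n\] where \(\hat{y}_i\) is the fitted value of \(\log(Life.expectancy)\) for country \(i\), \(e_i\) is its residual, and \(n=127\). Regressing \(y^{*}\) on the original explanatory variables gives one bootstrap estimate \(\hat{\beta}_j^{*}\) of each coefficient, and repeating this \(B=1000\) times yields the 95% percentile interval \[\left[\hat{\beta}_{j,(0.025)}^{*},\ \hat{\beta}_{j,(0.975)}^{*}\right]\] whose endpoints are the 2.5th and 97.5th percentiles of the \(B\) bootstrap estimates.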

The 95% bootstrap residual confidence intervals are combined with the inferential statistics of the final model in the following table.

#
num.p = dim(btr.mtrx)[2]  # number of parameters
btr.ci = NULL #define an empty vector to store bootstrap residual CI
btr.wd = NULL #define an empty vector to store the difference between upper and lower bound
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8) # lower bound
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8) # upper bound
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(summary(log.model)$coef,4,format="f"), btr.ci.95=btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")

Regression Coefficient Matrix with 95% Residual Bootstrap CI
	Estimate	Std. Error	t value	Pr(>|t|)	btr.ci.95
(Intercept)	3.8581	0.0399	96.7685	0.0000	[ 3.7847 , 3.9316 ]
StatusDeveloping	-0.0041	0.0197	-0.2104	0.8337	[ -0.0413 , 0.0328 ]
GDP	-0.0000	0.0000	-0.0653	0.9480	[ 0 , 0 ]
HIV.AIDS	-0.0214	0.0039	-5.5304	0.0000	[ -0.0292 , -0.0142 ]
Total.expenditure	0.0034	0.0019	1.7733	0.0788	[ -0.0001 , 0.0069 ]
regionAmericas	0.0328	0.0171	1.9215	0.0571	[ -0.0012 , 0.0666 ]
regionAsia	0.0169	0.0155	1.0888	0.2785	[ -0.0122 , 0.0462 ]
regionEurope	0.0291	0.0222	1.3070	0.1938	[ -0.0128 , 0.0731 ]
regionOceania	0.0236	0.0213	1.1085	0.2699	[ -0.0217 , 0.0634 ]
Income.composition.of.resources	0.5575	0.0503	11.0850	0.0000	[ 0.4622 , 0.6475 ]

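Apart from GDP, whose interval rounds to [0, 0] at four decimal places, each residual bootstrap confidence interval excludes zero exactly when the corresponding p-value is below 0.05. The intervals therefore support the significance tests represented by the p-values; only the intercept, HIV.AIDS, and Income.composition.of.resources have intervals that clearly exclude zero.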

3 Conclusion and Discussion

In the analysis conducted above, we first restated the preliminary analysis of the data and the final model found in the previous report. Since the last report discovered minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\), we used the case bootstrap and the residual bootstrap, each with 1,000 replicates, to construct confidence intervals for the regression coefficients. The confidence intervals agree with the p-value-based significance tests at the 0.05 level, with the exception of regionAmericas under the case bootstrap, whose interval excludes zero although its p-value is 0.0571.

We combine all inferential statistics, including the bootstrap CIs, in the following table:

kable(as.data.frame(cbind(formatC(summary(log.model)$coef[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
      caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")

Final Combined Inferential Statistics: p-values and Bootstrap CIs
	Estimate	Std. Error	Pr(>|t|)	btc.ci.95	btr.ci.95
(Intercept)	3.8581	0.0399	0.0000	[ 3.7794 , 3.9361 ]	[ 3.7847 , 3.9316 ]
StatusDeveloping	-0.0041	0.0197	0.8337	[ -0.0433 , 0.0361 ]	[ -0.0413 , 0.0328 ]
GDP	-0.0000	0.0000	0.9480	[ 0 , 0 ]	[ 0 , 0 ]
HIV.AIDS	-0.0214	0.0039	0.0000	[ -0.0326 , -0.0132 ]	[ -0.0292 , -0.0142 ]
Total.expenditure	0.0034	0.0019	0.0788	[ -0.0027 , 0.0089 ]	[ -0.0001 , 0.0069 ]
regionAmericas	0.0328	0.0171	0.0571	[ 0.003 , 0.0642 ]	[ -0.0012 , 0.0666 ]
regionAsia	0.0169	0.0155	0.2785	[ -0.0116 , 0.0476 ]	[ -0.0122 , 0.0462 ]
regionEurope	0.0291	0.0222	0.1938	[ -0.0114 , 0.0742 ]	[ -0.0128 , 0.0731 ]
regionOceania	0.0236	0.0213	0.2699	[ -0.0117 , 0.0637 ]	[ -0.0217 , 0.0634 ]
Income.composition.of.resources	0.5575	0.0503	0.0000	[ 0.4536 , 0.6537 ]	[ 0.4622 , 0.6475 ]

After combining all inferential statistics, we verify the conclusion of the last report in which we found that:

the model is statistically significant at \(\alpha=.01\) since \(p < .0001\)
the model is relatively effective at prediction and estimation as the adjusted coefficient of determination is relatively high (\(adj.R^2 = 0.8408\))
the intercept coefficient is 3.858, meaning that, with all explanatory variables set to 0 (a developed country in Africa, the reference region), the predicted life expectancy is approximately \(e^{3.858}\approx 47.37\) years
the independent variables’ coefficients and significance are as follows:

Status has a coefficient of -0.004144, meaning that when Status is equal to Developing, life expectancy is lower by about 0.4135 percent. The variable is not statistically significant as the p-value is equal to 0.8337.
GDP has a coefficient of -2.364e-08, meaning that when GDP per capita increases by 10,000, average life expectancy decreases by about 0.0236 percent. The variable is not statistically significant as the p-value is equal to 0.948.
HIV.AIDS has a coefficient of -0.0214, meaning that a 0.1 increase, or one additional death per 10,000 live births, decreases life expectancy by approximately 0.21 percent. The variable is highly statistically significant with a p-value of 1.975e-07.
Total.expenditure has a coefficient of 0.003373, meaning that a one percentage point increase raises life expectancy by about 0.3378 percent. The variable is statistically significant at the 0.10 level but not at the 0.05 level, as the p-value is equal to 0.07878.
region is split into four dummy categories - Americas, Asia, Europe, and Oceania - with Africa excluded as the reference category. An analysis of the four categories is as follows:
- regionAmericas has a coefficient of 0.03277, meaning that a country being located in the Americas rather than Africa increases life expectancy by about 3.3312 percent. The variable is statistically significant at the 0.10 level but not at the 0.05 level, with a p-value of 0.0571.
- regionAsia has a coefficient of 0.01685, meaning that a country being located in Asia rather than Africa increases life expectancy by approximately 1.6997 percent. The variable is not statistically significant, with a p-value equal to 0.2785.
- regionEurope has a coefficient equal to 0.02907, meaning that a country being located in Europe rather than Africa increases life expectancy by about 2.9494 percent. The variable is not statistically significant, with a p-value of 0.1938.
- regionOceania has a coefficient equal to 0.02363, meaning that a country being located in Oceania rather than Africa increases life expectancy by about 2.3913 percent. The variable is not statistically significant, with a p-value equal to 0.2699.
Income.composition.of.resources has a coefficient equal to 0.5575, meaning that if a country's index of resource utilization increases by 0.1, life expectancy increases by approximately 5.73 percent, since \((e^{0.05575}-1)\times 100\approx 5.73\). The variable is statistically significant with a p-value equal to 5.864e-20.

An analysis of the bootstrap confidence intervals for the regression coefficients finds that, despite minor violations of the assumption \(\epsilon \sim N(0,\sigma^2)\), the p-value-based significance tests are largely supported: with the exception of regionAmericas under the case bootstrap, an interval excludes zero exactly when the corresponding p-value is below 0.05.

Project 1: Bootstrapping Regression Model

Angelo Saporito

2023-02-22