2 Analysis

There will be five main components to the following analysis:

An exploratory data analysis where a preliminary examination of the variables and their interaction with each other will be conducted
A calculation of a multiple linear regression model and two transformation models: log and squared
A residual analysis of the three models
A Goodness-of-fit analysis of the three models
A determination of the best model and a summary of the final model

2.1 Exploratory Data Analysis

In this preliminary analysis of the data, the data will be imported, transformed, and cleaned, and two plots, a pairwise scatter plot and an exploratory graph, will help determine the relationship between the variables and develop a narrative to be explored.

2.1.1 Import and Clean Data

Expectancy <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/Life%20Expectancy%20Data.csv", 
                   header = TRUE) #read in data

Region <- read.csv("https://raw.githubusercontent.com/as927097/STA321/main/continents2.csv", 
                   header = TRUE)

expectancy <- filter(Expectancy, Year == 2014) %>%
  na.omit() # construct data set containing only the year 2014 and omit NAs. 

Country.stats <- inner_join(expectancy, Region, by="Country") %>% 
  select(Country, Status, Life.expectancy, GDP, HIV.AIDS, Total.expenditure, region,Income.composition.of.resources) #merge data sets expectancy and Region and select only certain variables for testing. After omitting NAs, our data set only has 127 countries

pander(head(Country.stats))

Table continues below
Country	Status	Life.expectancy	GDP	HIV.AIDS
Afghanistan	Developing	59.9	612.7	0.1
Albania	Developing	77.5	4576	0.1
Algeria	Developing	75.4	547.9	0.1
Angola	Developing	51.7	479.3	2
Argentina	Developing	76.2	12245	0.1
Armenia	Developing	74.6	3995	0.1

Total.expenditure	region	Income.composition.of.resources
8.18	Asia	0.476
5.88	Europe	0.761
7.21	Africa	0.741
3.31	Africa	0.527
4.79	Americas	0.825
4.48	Asia	0.739

2.1.2 Pairwise Scatterplot

The following pairwise scatter plot visualizes the distributions of each of the variables and the scatter plots of the relationship between variables. An assessment of the plot reveals that the quantitative variables have the following correlation with the response variable Life.expectancy: GDP = -0.445, HIV.AIDS = -0.611, Total.expenditures = 0.332, and Income.composition.of.resources = 0.891.

ggpairs(Country.stats, columns = 2:8) # pairwise plot of all variables in data set

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1.2 Exploratory Scatter Plot

The following graph is meant to explore some of the variables deeper and define a narrative for the response variable and the independent variables. In particular, this graph shows the correlation between life expectancy and resource utilization with the individual points colored by their respective region, shaped by status of development, and sized by GDP per capita. What is evident on first glance is that the countries furthest to the upper-right area of the graph are disproportionately developed European and Oceanic countries with high GDP per capita (although, not all). Those in the bottom-left area of the map are disproportionately developing African countries with very low GDP per capita.

ggplot(Country.stats, aes(x=Income.composition.of.resources, y=Life.expectancy, col = region, shape=Status, size=GDP))+
  geom_point()+
  theme_minimal()+
  labs(title="Life Expectancy as a Function of Resource Utilization in 127 Countries",
       subtitle = "Shaped by Status of Development, Sized by GDP per capita (in USD), and Colored by Region",
                     x = "Income Composition of Resources", 
                     y = "Life Expectancy")+
  scale_color_manual(values=c("#68aed6","#4292c6","#2171b5","#08519c","#08306b"), name="Region")+
  guides(size=guide_legend(
    override.aes = list(color = c("azure3","azure3","azure3"))
  ), color=guide_legend(
    override.aes = list(size=3)), shape=guide_legend(override.aes = list(size=2)))+
  scale_size_continuous(name = "GDP (per capita)")

This concludes the exploratory section of the analysis. The following section will build upon this analysis by fitting three multiple linear regression models and determine which is best suited for modelling life expectancy.

2.2 Fitting MLR to Data

Again, the following analysis will attempt to assess how the independent variables impact life expectancy by building three models: a MLR model, a log-transformation model, and a squared-transformation model. The three models will then be assessed based on a residual analysis and various goodness-of-fit measures.

2.2.1 Original Model

The following model is a MLR model with the structure \(Y=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with Life.expectancy as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables. An analysis of the model continues below.

full.model <- lm(Life.expectancy~.-Country, data = Country.stats)
pander(summary(full.model))

Table continues below
	Estimate	Std. Error	t value
(Intercept)	43.84	2.695	16.27
StatusDeveloping	-0.7017	1.331	-0.5272
GDP	4.253e-06	2.446e-05	0.1739
HIV.AIDS	-1.322	0.2615	-5.056
Total.expenditure	0.2785	0.1285	2.167
regionAmericas	2.091	1.153	1.814
regionAsia	0.9378	1.046	0.8964
regionEurope	2.11	1.503	1.404
regionOceania	1.36	1.441	0.944
Income.composition.of.resources	38.11	3.399	11.21

	Pr(>\|t\|)
(Intercept)	8.085e-32
StatusDeveloping	0.5991
GDP	0.8623
HIV.AIDS	1.603e-06
Total.expenditure	0.03229
regionAmericas	0.07217
regionAsia	0.3719
regionEurope	0.163
regionOceania	0.3471
Income.composition.of.resources	2.934e-20

Fitting linear model: Life.expectancy ~ . - Country
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
127	3.379	0.8561	0.845

par(mfrow = c(2,2))
plot(full.model)

The multiple linear regression model yields the following results:

the model is statistically significant at \(\alpha=.01\) since \(p < .0000\)
the model is relatively effective at prediction and estimation as the adjusted coefficient of determination is relatively high (\(adj.R^2 = 0.845\))
the intercept coefficient is 43.84 meaning that, holding the explanatory variables constant at 0, life expectancy is approx. 44 years
the independent variables’ coefficients and significance are as follows:

Status has a coefficient of -0.7017 meaning that when Status is equal to Developing, the country’s average life expectancy decreases by 0.7 years. The variable is not statistically significant as the p-value is equal to 0.5991.
GDP has a coefficient of 4.253e-06 meaning that when GDP increases by 10,000, average life expectancy increases by 0.043 years. The variable is not statistically significant as the p-value is equal to 0.8623.
HIV.AIDS has a coefficient of -1.322 meaning that a 0.1 increase, or an increase of one death per 100 births, decreases life expectancy by 0.13 years. The variable is highly statistically significant at a p-value of 1.603e-06.
Total.expenditure has a coefficient of 0.2785 meaning that a one percentage point increase increases life expectancy by .28 years. The variable is statistically significant as the p-value is equal to 0.0323.
region is delineated down to four categories - Americas, Asia, Europe, and Oceania - with Africa excluded as a control. An analysis of the four catagories is as follows:
- regionAmericas has a coefficient of 2.091 meaning that a country being located in the Americas increases life expectancy by 2.1 years. The variable is statistically significant at a p-value of 0.0722.
- regionAsia has a coefficient of 0.9378 meaning that a country being located in Asia increases life expectancy by approximately one year. The variable is not statistically significant at a p-value equal to 0.3719
- regionEurope has a coefficient equal to 2.11 meaning that a country being located in Europe increases life expectancy by 2.1 years. The variable is not statistically significant at a p-value of 0.163.
- regionOceania has a coefficient equal to 1.36 meaning that a country being located in Oceania increases life expectancy by 1.4 years. The variable is not statistically significant at a p-value equal to 0.3471.
Income.composition.of.resources has a coefficient equal to 38.11 meaning that if a country increases the efficiency at which they utilize their scarce resources by .1, life expectancy will increase by 3.8 years. The variable is statistically significant at a p-value equal to 2.934e-20.

An analysis of the residual plots to assess violations to the assumption \(\epsilon \sim U(0,\sigma^2)\) yields minor violations as the residuals are not completely randomly distributed around 0. Points 4, 12, and 105 show some irregularity and may have to be tested as outliers. Further analysis of residuals will be conducted below in section 2.3.

2.2.2 Log Transformation Model

The following model is a log-transformed MLR model with the structure \(log(Y)=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with \(log(\)Life.expectancy\()\) as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables. A nuance of this model compared to the previous model is that in order to calculate the percent increase (or decrease) in the response for every one-unit increase in the independent variable, the following equation has to be applied to the coefficient \((e^\beta-1)*100\). An analysis of the model continues below.

log.model <- lm(log(Life.expectancy)~.-Country, data = Country.stats)
pander(summary(log.model))

Table continues below
	Estimate	Std. Error	t value
(Intercept)	3.858	0.03987	96.77
StatusDeveloping	-0.004144	0.0197	-0.2104
GDP	-2.364e-08	3.619e-07	-0.06531
HIV.AIDS	-0.0214	0.003869	-5.53
Total.expenditure	0.003373	0.001902	1.773
regionAmericas	0.03277	0.01705	1.921
regionAsia	0.01685	0.01548	1.089
regionEurope	0.02907	0.02224	1.307
regionOceania	0.02363	0.02132	1.109
Income.composition.of.resources	0.5575	0.05029	11.09

	Pr(>\|t\|)
(Intercept)	1.63e-113
StatusDeveloping	0.8337
GDP	0.948
HIV.AIDS	1.975e-07
Total.expenditure	0.07878
regionAmericas	0.0571
regionAsia	0.2785
regionEurope	0.1938
regionOceania	0.2699
Income.composition.of.resources	5.864e-20

Fitting linear model: log(Life.expectancy) ~ . - Country
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
127	0.04999	0.8522	0.8408

par(mfrow=c(2,2))
plot(log.model)

##define a function to compute percentage change of coefficients
percent <- function(x){
  (exp(x)-1)*100
}

The log-transformed multiple linear regression model yields the following results:

the model is statistically significant at \(\alpha=.01\) since \(p < .0000\)
the model is relatively effective at prediction and estimation as the adjusted coefficient of determination is relatively high (\(adj.R^2 = 0.8408\))
the intercept coefficient is 3.858 meaning that, holding the explanatory variables constant at 0, life expectancy is approx. 47.3748789
the independent variables’ coefficients and significance are as follows:

Status has a coefficient of -0.004144 meaning that when Status is equal to Developing, life expectancy decreases by -0.4135058 percent. The variable is not statistically significant as the p-value is equal to 0.8337.
GDP has a coefficient of -2.364e-08 meaning that when GDP increases by 10,000, average life expectancy decreases by -0.0236372 percent. The variable is not statistically significant as the p-value is equal to 0.948.
HIV.AIDS has a coefficient of -0.0214 meaning that a 0.1 increase, or an increase of one death per 100 births, decreases life expectancy by -0.2116947 percent. The variable is highly statistically significant at a p-value of 1.975e-07.
Total.expenditure has a coefficient of 0.003373 meaning that a one percentage point increase increases life expectancy by 0.3378421 percent. The variable is statistically significant as the p-value is equal to 0.07878.
region is delineated down to four categories - Americas, Asia, Europe, and Oceania - with Africa excluded as a control. An analysis of the four categories is as follows:
- regionAmericas has a coefficient of 0.03277 meaning that a country being located in the Americas increases life expectancy 3.3312066 percent. The variable is statistically significant at a p-value of 0.0571.
- regionAsia has a coefficient of 0.01685 meaning that a country being located in Asia increases life expectancy by approximately 1.6997459 percent. The variable is not statistically significant at a p-value equal to 0.2785.
- regionEurope has a coefficient equal to 2.11 meaning that a country being located in Europe increases life expectancy by 2.9494349 percent. The variable is not statistically significant at a p-value of 0.1938.
- regionOceania has a coefficient equal to 0.02363 meaning that a country being located in Oceania increases life expectancy by 2.3912577 percent. The variable is not statistically significant at a p-value equal to 0.2699.
Income.composition.of.resources has a coefficient equal to 0.5575 meaning that if a country increases the efficiency at which they utilize their scarce resources by .1, life expectancy will increase by 0.2391258 percent. The variable is statistically significant at a p-value equal to 5.864e-20.

2.2.3 Squared-Transformation Model

The following model is a squared-transformed MLR model with the structure \(Y^2=\beta_0+\beta_1x_1+\cdots+\beta_ix_i\) with \(sqrt(\)Life.expectancy\()\) as the response variable and Status, GDP, HIV.AIDS, Total.expenditures, region, and Income.composition.of.resources as the explanatory variables.

sq.model <- lm((Life.expectancy)^2~.-Country, data = Country.stats)
pander(summary(sq.model))

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1367	383.8	3.56	0.0005362
StatusDeveloping	-160.5	189.6	-0.8465	0.399
GDP	0.001416	0.003484	0.4064	0.6852
HIV.AIDS	-163.9	37.25	-4.401	2.39e-05
Total.expenditure	44.69	18.31	2.441	0.01615
regionAmericas	266.2	164.2	1.621	0.1076
regionAsia	99.17	149	0.6655	0.5071
regionEurope	309.4	214.1	1.445	0.1511
regionOceania	152.5	205.2	0.7433	0.4588
Income.composition.of.resources	5265	484.2	10.87	1.851e-19

Fitting linear model: (Life.expectancy)^2 ~ . - Country
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
127	481.3	0.8504	0.8389

par(mfrow=c(2,2))
plot(sq.model)

The squared-transformed multiple linear regression model yields the following results:

the model is statistically significant at \(\alpha=.01\) since \(p < .0000\)
the model is relatively effective at prediction and estimation as the adjusted coefficient of determination is relatively high (\(adj.R^2 = 0.8389\))
the intercept coefficient is 1367 meaning that, holding the explanatory variables constant at 0, the square of life expectancy is approx. 1367 \(years^2\)
the independent variables’ coefficients and significance are as follows:

Status has a coefficient of -160.5 meaning that when Status is equal to Developing, the country’s average life expectancy decreases by 161 \(years^2\). The variable is not statistically significant as the p-value is equal to 0.399.
GDP has a coefficient of 0.001416 meaning that when GDP increases by 10,000, average life expectancy increases by 14 \(years^2\). The variable is not statistically significant as the p-value is equal to 0.6852.
HIV.AIDS has a coefficient of -163.9 meaning that a 0.1 increase, or an increase of one death per 100 births, decreases life expectancy by 16 \(years^2\). The variable is highly statistically significant at a p-value of 0.0000239.
Total.expenditure has a coefficient of 44.69 meaning that a one percentage point increase increases life expectancy by 45 \(years^2\). The variable is statistically significant as the p-value is equal to 0.01615.
region is delineated down to four categories - Americas, Asia, Europe, and Oceania - with Africa excluded as a control. An analysis of the four catagories is as follows:
- regionAmericas has a coefficient of 266.2 meaning that a country being located in the Americas increases life expectancy by 266 \(years^2\). The variable is not statistically significant at a p-value of 0.1076.
- regionAsia has a coefficient of 99.17 meaning that a country being located in Asia increases life expectancy by approximately 99 \(years^2\). The variable is not statistically significant at a p-value equal to 0.5071.
- regionEurope has a coefficient equal to 309.4 meaning that a country being located in Europe increases life expectancy by 309 \(years^2\). The variable is not statistically significant at a p-value of 0.1511.
- regionOceania has a coefficient equal to 152.5 meaning that a country being located in Oceania increases life expectancy by 153 \(years^2\). The variable is not statistically significant at a p-value equal to 0.4588.
Income.composition.of.resources has a coefficient equal to 5265 meaning that if a country increases the efficiency at which they utilize their scarce resources by .1, life expectancy will increase by 527 \(years^2\). The variable is statistically significant at a p-value equal to 1.851e-19.

An analysis of the residual plots to assess violations to the assumption \(\epsilon \sim U(0,\sigma^2)\) yields minor violations as the residuals are not completely randomly distributed around 0. Points 12, 97, and 105 show some irregularity and may have to be tested as outliers. Further analysis of residuals will be conducted below in section 2.3.

2.3 Residual Analysis

The following analysis builds upon the analyses of the individual models and will examine the residuals of the models plotted side-by-side. Two plots for each model are provided below: one QQ-plot and one Residual vs. Fitted Values plot. Based upon the residual analysis, the log-transformed model is considered to be the best fit as the violation against the assumption \(\epsilon \sim U(0,\sigma^2)\) is the least severe. This is determined by the fact that the points in the QQ-plot are better placed upon the QQ-line than the other models and the residuals show more randomness in the Residual vs. Fitted Values plot.

#define plotting area
par(pty = "s", mfrow = c(2, 3))
#Q-Q plot for original model
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for log transformed model
qqnorm(log.model$residuals, main = "Log-Life Expectancy")
qqline(log.model$residuals)
#Q-Q plot for sq transformed model
qqnorm(sq.model$residuals, main = "Squared-Life Expectancy")
qqline(sq.model$residuals)
#Residuals vs Fitted values plots for all models
plot(full.model$residuals) %>% abline(0,0, col="red")
plot(log.model$residuals) %>% abline(0,0, col="red")
plot(sq.model$residuals) %>% abline(0,0, col="red")

2.4 Goodness-of-Fit Measures

The following analysis of the goodness-of-fit measures contain the measures:

Sum of Squares Error (SSE): used to determine how well a model fits the data by squaring the error, or the difference between the observed value and the predicted value
R.squared: the coefficient of determination
adjusted.R.squared: the coefficient of determination that accounts for predictors that are not significant in a regression model
Mallows’ cp: addresses over-fitting by penalizing the addition of unnecessary variables
Akaike information criterion (AIC): estimates the quality (prediction power) of multiple models relative to one another (a smaller value are preferred)
Schwarz-Bayesian Information criterion (SBC): penalizes a model for adding extra (unnecessary) variables (a smaller value are preferred)
Predicted Residual Sum of Squares Error (PRESS): tests for over-fitting by testing the residuals of the left-out or untested observation (a smaller value are preferred)

The analysis concluded that the log-transformed model is the best model as it has the lowest SSE, AIC, SBC, and PRESS values. The adjusted-R-squared value is slightly lower than its peer models; however, it is not significantly lower enough to invalidate its performance.

GoF=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
 R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
 R.adj=summary(m)$adj.r                # Adjusted R square
 MSE=(summary(m)$sigma)^2              # square error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mallow's cp
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion: lower score means the model is more efficient (requires less information)
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion: similar to AIC in that the lowest score is considered the most efficient
 X=model.matrix(m)                     # design matrix of the model
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}

output.sum = rbind(GoF(full.model), GoF(sq.model), GoF(log.model))
row.names(output.sum) = c("full.model", "sq.model", "log.model")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
full.model	1.335590e+03	0.8560808	0.8450101	10	318.8236	347.2654	1.592216e+03
sq.model	2.709770e+07	0.8504205	0.8389144	10	1578.3881	1606.8300	3.261404e+07
log.model	2.923658e-01	0.8521880	0.8408179	10	-751.3900	-722.9481	3.461065e-01

2.5 Summary of Final Model

We can explicitly write the final model in the following manner \[log(Life.expectancy)=3.858-0.004144\times StatusDeveloping-2.364*10^-8\times GDP-0.0214*\times HIV.AIDS+\]\[ 0.003373\times Total.expenditures+0.03277\times regionAmericas+0.01685\times regionAsia+0.02907\times regionEurope+\]\[ 0.02363\times regionOceania+0.5575\times Income.composition.of.resources\] An explanation of the model and its coefficients have already been interpreted in section 2.2.2; however, an explanation of the mathematics behind interpreting the coefficients as a percentage change need to be further elaborated:

Let us presume the assumption a priori that all explanatory variables, with the exception of Status, are held constant at 0. The two countries are equally the same besides the fact that the Status of one country will be Developing, or 1, and the other country is set to Developed, or 0. Then \[log(Developing)-log(Developed)=-0.004144 \to log(\frac{Developing}{Developed})=-0.004144 \to Developing=.995856\times Developed\] The above equation can re-written in the following way: \[Developing-Developed=.995856\times Developed \to \frac {Developing-Developed}{Developed}=-0.0041436=-0.4135058\%\] The life expectancy of a developing country vis-à-vis a developed country is -0.4135058 percent. Similarly, the other regression coefficients can be interpreted, in which, StatusDeveloping, GDP, and HIV.AIDS are negatively associated with life expectancy and the rest are positively associated. The variables that are statistically significant are HIV.AIDS, Total.expenditures, regionAmericas, and Income.composition.of.resources.

pander(summary(log.model)$coef)

Table continues below
	Estimate	Std. Error	t value
(Intercept)	3.858	0.03987	96.77
StatusDeveloping	-0.004144	0.0197	-0.2104
GDP	-2.364e-08	3.619e-07	-0.06531
HIV.AIDS	-0.0214	0.003869	-5.53
Total.expenditure	0.003373	0.001902	1.773
regionAmericas	0.03277	0.01705	1.921
regionAsia	0.01685	0.01548	1.089
regionEurope	0.02907	0.02224	1.307
regionOceania	0.02363	0.02132	1.109
Income.composition.of.resources	0.5575	0.05029	11.09

	Pr(>\|t\|)
(Intercept)	1.63e-113
StatusDeveloping	0.8337
GDP	0.948
HIV.AIDS	1.975e-07
Total.expenditure	0.07878
regionAmericas	0.0571
regionAsia	0.2785
regionEurope	0.1938
regionOceania	0.2699
Income.composition.of.resources	5.864e-20

MLR Assignment

Angelo Saporito

2023-02-14

1 Introduction

1.1 Data Description

1.2 Practical Question