Estimating Life Expectancy Through Multiple Regression

Question 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

For a linear regression model, there are four prerequisites:

A linear relationship between Total Expenditures and Life Expectancy
Homoscedasticity: the variance of the residuals is roughly constant for all values of Total Expenditures
Independence: all observations of Life Expectancy are independent of one another
Normality: the residuals are normally distributed

From the scatter plot, we can see that the relationship between Life Expectancy and Total Expenditures is not linear (violating requirement 1):

plot(LifeExp ~ TotExp,   data=countries)

Further, the residuals do not have constant variance across the values of Total Expenditures (violating requirement 2):

model_1 <- lm(LifeExp ~ TotExp,   data=countries)
plot(resid(model_1) ~ countries$TotExp )
abline(0,0)

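The QQ plot checks normality of the residuals (requirement 4) rather than heteroscedasticity. Here the residuals bend away from the reference line, consistent with the skewed residual quantiles in the summary below (median 3.15 against a minimum of -24.76 and a maximum of 13.29):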

qqnorm(resid(model_1))
qqline(resid(model_1))
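
A formal test can complement the visual check. The sketch below is an illustration, not part of the original analysis: it runs a Shapiro-Wilk test (base R's shapiro.test) on the residuals of a fitted model. Because the countries data frame is not reproduced here, x and y are simulated, hypothetical stand-ins.

```r
# Hypothetical data: a simulated, curved relationship standing in for the
# real `countries` data frame, which is not reproduced here.
set.seed(1)
x <- runif(190, min = 0, max = 10)
y <- 50 + 30 * (1 - exp(-x)) + rnorm(190, sd = 2)

# Fit a straight line even though the true relationship is curved
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(resid(fit))
sw$p.value
```

With the real data, the equivalent call would be shapiro.test(resid(model_1)).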

Therefore, simple linear regression is not appropriate for this dataset. We can still fit the model as an exercise:

summary(model_1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Observations:

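p-value = 7.71e-14, well below 0.05, so the relationship between TotExp and LifeExp is statistically significant and this model is preferred over a model with no variables
F statistic = 65.26 on 1 and 188 degrees of freedom, a significant improvement over a model with no variables
R^2 = 0.2577, so this model explains about 26% of the variance in Life Expectancy
Standard error = the residual standard error is 9.371 years; the coefficient standard errors are about 86 times (intercept) and 8 times (TotExp) smaller than their estimates, so both coefficients are estimated fairly precisely (this is a rule of thumb, not a requirement)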
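
The quantities discussed above can be re-derived from the printed summary alone. As a minimal sketch (the variable names are illustrative, and the inputs are the rounded values copied from the output, so results agree only up to rounding), the t value is the estimate divided by its standard error, and the overall p-value is the upper tail of the F distribution:

```r
# Values copied from the summary(model_1) output above (rounded as printed)
estimate  <- c(intercept = 6.475e+01, TotExp = 6.297e-05)
std_error <- c(intercept = 7.535e-01, TotExp = 7.795e-06)

# t value = estimate / standard error, i.e. how many times smaller the
# standard error is than the estimate
t_value <- estimate / std_error
round(t_value, 1)

# Overall p-value: upper tail of F(1, 188) at the observed F statistic
f_stat  <- 65.26
p_value <- pf(f_stat, df1 = 1, df2 = 188, lower.tail = FALSE)
p_value

# With a single predictor, F equals the squared t value of the slope
t_value[["TotExp"]]^2
```

On a fitted model the same pieces are available directly, for example summary(model_1)$fstatistic, summary(model_1)$r.squared and summary(model_1)$sigma.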

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Raising the two columns with the mutate function, we get:

library(dplyr)  # provides mutate() and %>%
countries <- countries %>%
  mutate(LifeExpRaised = LifeExp^4.6,   # response raised to the 4.6 power
         TotExpRaised = TotExp^0.06)    # predictor raised to the 0.06 power
plot(LifeExpRaised ~ TotExpRaised, data = countries)

model_2 <- lm(LifeExpRaised ~ TotExpRaised,   data=countries)
plot(resid(model_2) ~ countries$TotExpRaised )
abline(0,0)

The scatterplot and residual plot show a much more linear relationship and a more even spread of residuals for our ‘raised’ variables.

The model itself shows an improvement:

summary(model_2)

## 
## Call:
## lm(formula = LifeExpRaised ~ TotExpRaised, data = countries)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExpRaised  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Regarding this model:

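The coefficient standard errors are again small relative to the estimates (about 16 and 23 times smaller, as the t values show), so that’s good
The R^2 improved quite a bit, from 26% to 73%, and the p-value is smaller as well (< 2.2e-16)
Finally, the F-statistic increased from 65.26 to 507.7, showing this is a better model overall
The residual standard error (90490000) is in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 years of the first model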

With the plots and the R^2 statistic, we can conclude this is a ‘better’ model overall than the simple linear regression model.

Question 3

Using the results from Question 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

(-736527910 + 620060216 * 1.5)^(1/4.6)

## [1] 63.31153

(-736527910 + 620060216 * 2.5)^(1/4.6)

## [1] 86.50645

We get estimates of 63.3 and 86.5 years, which appear to be reasonable.
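
The two calculations above share one pattern: evaluate the fitted line on the transformed scale, then undo the 4.6 power. A small helper makes that explicit. The function name forecast_life_exp is hypothetical, and the default coefficients are the ones printed by summary(model_2).

```r
# Hypothetical helper: forecast LifeExp from a value of TotExp^0.06
forecast_life_exp <- function(tot_exp_raised,
                              b0 = -736527910,   # intercept from summary(model_2)
                              b1 = 620060216) {  # slope from summary(model_2)
  life_exp_raised <- b0 + b1 * tot_exp_raised    # prediction of LifeExp^4.6
  life_exp_raised^(1 / 4.6)                      # back-transform to years
}

# Both forecasts from the question in one vectorised call
forecast_life_exp(c(1.5, 2.5))
```

One caveat of this design: for inputs below roughly 1.19 the linear prediction is negative, and the fractional power then returns NaN rather than a forecast.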

Question 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

We can see there are two quantitative terms and an interaction term combining them together.

multiple_model <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp,  # main effects plus interaction
                     data = countries)
summary(multiple_model)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

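With the p-value (< 2.2e-16) and F-statistic (34.49 on 3 and 186 degrees of freedom), we see this model is a significant improvement over a model with no variables, and all four coefficients are individually significant. However, the R^2 shows it explains only 35.7% of the variance in Life Expectancy (34.7% adjusted), while the ‘raised’ power model explains 73% of the variance in its transformed response. The residual standard error is 8.765 years, only slightly below the 9.371 years of the first model.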

This model doesn’t appear to perform as well as the last model.
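
One detail of the formula used for this model: in R's formula syntax, PropMD*TotExp already expands to both main effects plus the interaction, so listing PropMD and TotExp separately is redundant, though harmless. The sketch below checks this with terms(), which needs no data:

```r
# The formula as written for the model above
full  <- LifeExp ~ PropMD + TotExp + PropMD * TotExp
# The shorter, equivalent form
short <- LifeExp ~ PropMD * TotExp

# Both expand to the same three terms
labels_full  <- attr(terms(full), "term.labels")
labels_short <- attr(terms(short), "term.labels")
labels_full
identical(labels_full, labels_short)
```

This is why the coefficient table shows exactly three predictors: PropMD, TotExp and PropMD:TotExp.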

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD <- 0.03
TotExp <- 14

multiple_model$coefficients[1] + 
  multiple_model$coefficients[2] * PropMD + 
  multiple_model$coefficients[3] * TotExp + 
  multiple_model$coefficients[4] * PropMD * TotExp

## (Intercept) 
##     107.696

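We get an estimate of 107.7 years, which is not realistic: it is well above the life expectancy of any country. Most of the excess comes from the PropMD term, which alone adds 1497 x 0.03 = 44.9 years to the intercept of 62.8. The model also fits loosely (residual standard error of 8.77 years, R^2 of about 36%), so its forecasts carry a lot of uncertainty and it can easily overestimate life expectancy.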
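
To see where the 107.7 comes from, the forecast can be split into its four terms. This sketch uses the rounded coefficients printed in the summary above, so the total differs slightly from the 107.696 computed from the unrounded model object. The names b, x and contribution are illustrative.

```r
# Rounded coefficients copied from summary(multiple_model)
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, interaction = -6.026e-03)

# Inputs from the question
prop_md <- 0.03
tot_exp <- 14

# The value each coefficient is multiplied by
x <- c(intercept = 1, PropMD = prop_md,
       TotExp = tot_exp, interaction = prop_md * tot_exp)

# Contribution of each term to the forecast, and their total
contribution <- b * x
round(contribution, 3)
sum(contribution)
```

With the fitted model available, predict(multiple_model, newdata = data.frame(PropMD = 0.03, TotExp = 14)) gives the forecast without manual arithmetic.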