Question 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the

For a linear regression model, there are four prerequisites:

  1. A linear relationship between Total Expenditures and Life Expectancy
  2. Homoscedasticity: the variance of the residuals is roughly constant across all values of Total Expenditures
  3. All observations of Life Expectancy are independent
  4. The residuals are normally distributed

From the scatter plot, we can see that the relationship between Life Expectancy and Total Expenditures is not linear (violating requirement 1).

plot(LifeExp ~ TotExp,   data=countries)

Further, the spread of the residuals is not constant across values of Total Expenditures (violating requirement 2).

model_1 <- lm(LifeExp ~ TotExp,   data=countries)
plot(resid(model_1) ~ countries$TotExp )
abline(0,0)
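The curvature that a residual plot reveals can also be read off numerically. The sketch below uses made-up, deterministic data (x from 1 to 100 with y = log(x), not the countries dataset) to show the typical signature of fitting a straight line to a concave relationship: negative residuals at both ends and positive residuals in the middle.

```r
# Hypothetical, deterministic data: a concave (logarithmic) relationship.
x <- 1:100
y <- log(x)

# Fit a straight line to data that is not linear in x.
fit <- lm(y ~ x)
res <- unname(resid(fit))

# Residuals at the low end, the middle, and the high end of x.
# A systematic sign pattern like this points to a missed curve;
# a well-specified model scatters residuals around 0 with no pattern.
ends_and_middle <- res[c(1, 50, 100)]
print(sign(ends_and_middle))
```

The same sign pattern in a plot of resid(model_1) against TotExp would suggest that a transformation of the variables is worth trying.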

The Q-Q plot shows that the residuals also depart from a normal distribution (violating requirement 4).

qqnorm(resid(model_1))
qqline(resid(model_1))
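A Q-Q plot is a visual check of residual normality, and a rough numeric companion is a quartile-based skewness measure. The sketch below computes Bowley's skewness from the residual quartiles that summary(model_1) reports further down (1Q, Median, 3Q). It is only an illustration of the asymmetry, not a formal normality test.

```r
# Residual quartiles copied from the summary(model_1) output.
q1  <- -4.778
med <-  3.154
q3  <-  7.116

# Bowley (quartile) skewness: 0 for a symmetric distribution,
# negative when the lower half is more spread out than the upper half.
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)
print(bowley)
```

The value comes out near -0.33, so the residuals are skewed to the left, which agrees with the minimum (-24.764) lying much farther from zero than the maximum (13.292).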

Therefore, simple linear regression is not appropriate for this dataset. We can still fit the model as an exercise:

summary(model_1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
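The figures in this summary are tied together by simple identities, which can be handy when interpreting them. The sketch below uses only the numbers printed above (no access to the countries data is assumed) to recover the t value, the F-statistic, R^2, and the overall p-value for a one-predictor model.

```r
# Values copied from the summary(model_1) output above.
estimate  <- 6.297e-05   # TotExp slope
std_error <- 7.795e-06   # standard error of the slope
df_resid  <- 188         # residual degrees of freedom

# t value = estimate / standard error.
t_value <- estimate / std_error

# With a single predictor, the F-statistic is the square of the slope's t value.
f_stat <- t_value^2

# R^2 can be recovered from F and the degrees of freedom (1 and df_resid).
r_squared <- f_stat / (f_stat + df_resid)

# The upper tail of the F distribution gives the overall p-value.
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(t = t_value, F = f_stat, R2 = r_squared, p = p_value))
```

Because the inputs are rounded, the results agree with the printed summary only to about three significant figures.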

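Observations:

  1. F-statistic: 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, so the model fits better than an intercept-only model.
  2. R^2: 0.2577, so the model explains only about 26% of the variance in Life Expectancy.
  3. Standard error: the residual standard error is 9.371 years, and the standard error of the TotExp coefficient (7.795e-06) is about one eighth of the estimate (6.297e-05).
  4. p-values: both the intercept and TotExp are significant well below the 0.001 level.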

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Raising the two columns with the mutate function, we get:

library(dplyr)  # provides %>% and mutate()
countries <- countries %>%
  mutate(LifeExpRaised = LifeExp^4.6,   # response raised to the 4.6 power
         TotExpRaised  = TotExp^0.06)   # predictor raised to the 0.06 power
plot(LifeExpRaised ~ TotExpRaised, data = countries)

model_2 <- lm(LifeExpRaised ~ TotExpRaised,   data=countries)
plot(resid(model_2) ~ countries$TotExpRaised )
abline(0,0)

The scatterplot and residual plot show a more even spread and a more linear relationship between our ‘raised’ variables.
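The question describes the 0.06 power as nearly a log transform. A minimal sketch of why, using a hypothetical grid of values from 1 to 100 rather than the countries data: since x^p = exp(p * log(x)), a small exponent p makes x^p almost a linear function of log(x).

```r
# Hypothetical grid of positive values (not the countries data).
x <- 1:100

# Small positive powers behave much like a logarithm:
# x^p = exp(p * log(x)) is roughly 1 + p * log(x) when p * log(x) is small.
power_06 <- x^0.06
log_x    <- log(x)

# A correlation close to 1 means the two transforms are almost linearly related.
r <- cor(power_06, log_x)
print(r)
```

Over a much wider range of x the approximation loosens, so this is an illustration of the idea rather than a statement about TotExp specifically.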

The model itself shows an improvement:

summary(model_2)
## 
## Call:
## lm(formula = LifeExpRaised ~ TotExpRaised, data = countries)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExpRaised  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

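Regarding this model:

  1. F-statistic: 507.7 on 1 and 188 degrees of freedom, with a p-value below 2.2e-16.
  2. R^2: 0.7298, so the model explains about 73% of the variance in LifeExp^4.6.
  3. Standard error: the residual standard error of 90490000 is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the first model. The standard error of the TotExpRaised coefficient (27518940) is small relative to the estimate (620060216).
  4. p-values: both the intercept and TotExpRaised have p-values below 2e-16.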

With the plots and the R^2 statistic, we can conclude this is a ‘better’ model overall than the simple linear regression model.

Question 3

Using the results from Question 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

(-736527910 + 620060216 * 1.5)^(1/4.6)
## [1] 63.31153
(-736527910 + 620060216 * 2.5)^(1/4.6)
## [1] 86.50645

We get estimates of 63.3 and 86.5 years, both of which appear reasonable.
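The two calculations share one pattern: evaluate the fitted line on the transformed scale, then take the 4.6th root to return to years. A small sketch of that as a reusable function, with the coefficients typed in from the summary(model_2) output (the function name forecast_life_exp is made up for this example):

```r
# Coefficients copied from the summary(model_2) output.
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

# Forecast on the original LifeExp scale for a given value of TotExp^0.06.
# The model predicts LifeExp^4.6, so the 4.6th root undoes the transform.
forecast_life_exp <- function(tot_exp_raised) {
  (b0 + b1 * tot_exp_raised)^(1 / 4.6)
}

# Both forecasts in one vectorised call.
forecasts <- forecast_life_exp(c(1.5, 2.5))
print(forecasts)

# Undo the 0.06 power to see which untransformed TotExp each input corresponds to.
tot_exp <- c(1.5, 2.5)^(1 / 0.06)
print(tot_exp)
```

The fitted line is negative for inputs below roughly 1.19, where the root returns NaN, so the function is only meaningful above that point.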

Question 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

The model has two quantitative predictors and an interaction term formed from their product.

multiple_model <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp,
                     data = countries)
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

With the p-value and F-statistic, we see this model is an improvement over a model with no variables. However, its R^2 of 0.3574 means it explains only about 35.7% of the variance in Life Expectancy, while the ‘raised’ power model explains about 73% of the variance in its transformed response.

This model doesn’t appear to perform as well as the last model.
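Adding predictors can never lower R^2, so adjusted R^2 is the fairer figure when comparing this model with the one-predictor model from Question 1 (both use untransformed LifeExp as the response). The sketch below recomputes it from the printed R^2 values, taking n = 190 from the residual degrees of freedom; the helper name adjusted_r2 is made up for this example.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
# (p excludes the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from the summaries above; n = residual df + p + 1 = 190.
simple_adj   <- adjusted_r2(0.2577, n = 190, p = 1)  # LifeExp ~ TotExp
multiple_adj <- adjusted_r2(0.3574, n = 190, p = 3)  # adds PropMD and the interaction

print(c(simple = simple_adj, multiple = multiple_adj))
```

Both values match the printed summaries to about three decimal places, and the multiple model stays ahead after the penalty for its extra terms.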

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD <- 0.03
TotExp <- 14

multiple_model$coefficients[1] + 
  multiple_model$coefficients[2] * PropMD + 
  multiple_model$coefficients[3] * TotExp + 
  multiple_model$coefficients[4] * PropMD * TotExp
## (Intercept) 
##     107.696

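We get an estimate of 107.7 years, which is unrealistically high for a national life expectancy. Most of the excess comes from the PropMD term, which adds about 44.9 years (1497 x 0.03) to the intercept of 62.8. With a residual standard error of 8.77 and an R^2 of about 36%, the model leaves a lot of variance unexplained, so it can easily overestimate life expectancy.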
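Splitting the forecast into its four terms shows where the 107.7 comes from. The sketch below uses the rounded coefficients printed in the summary(multiple_model) output, so the total differs slightly from the 107.696 obtained with the full-precision coefficients.

```r
# Rounded coefficients copied from the summary(multiple_model) output.
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, interaction = -6.026e-03)

PropMD <- 0.03
TotExp <- 14

# Contribution of each term to the forecast, in years of life expectancy.
contributions <- b * c(1, PropMD, TotExp, PropMD * TotExp)
total <- sum(contributions)

print(round(contributions, 3))
print(total)
```

The TotExp and interaction terms contribute almost nothing at TotExp = 14, so the forecast is essentially the intercept plus the PropMD term.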