# load the required package and data
library(data.table)   # provides fread()
df <- fread("who.csv")
# convert Country to a factor
df$Country <- as.factor(df$Country)

Problem 1

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Scatterplot:

plot(df$TotExp/1000, df$LifeExp,
     main='Life Expectancy and Total Expenditures',
     ylab='Years', xlab='Expenditures ($000)', col='blue')

# to overlay the OLS fit, the predictor must be scaled to match the x-axis units:
# abline(lm(df$LifeExp ~ I(df$TotExp/1000)), col='red', lwd=3)

Linear model

fit1 <- lm(LifeExp~TotExp, data=df)
summary(fit1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933 < 0.0000000000000002 ***
## TotExp       0.000062970  0.000007795   8.079   0.0000000000000771 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 0.00000000000007714

Interpretation

The intercept of 64.75 years is the predicted life expectancy when total expenditures are zero. For each additional unit of total expenditures, predicted life expectancy increases by about 0.00006 years. The R^2 of 0.26 suggests that a linear model is a weak predictor for this data; the model explains only about a quarter of the variation in life expectancy. This could be explained by outliers or, more probably given our plot, by the fact that the relationship between the variables is not linear. Because the p-value of the F-statistic is < 0.05, we can reject the null hypothesis that the model is worthless (i.e., that the coefficient of Total Expenditures is 0). In a simple regression, however, the F-test is equivalent to the t-test on the single slope, so it adds little beyond the coefficient’s p-value.
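
As a quick check (not part of the original output), the equivalence of the F-test and the slope's t-test in a simple regression can be verified directly:

# the overall F statistic equals the squared t statistic of the single slope
tval <- coef(summary(fit1))["TotExp", "t value"]
c(t_squared = tval^2, F = unname(summary(fit1)$fstatistic["value"]))  # both about 65.26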

The coefficient for total expenditures is about eight times its standard error (t = 8.08), an acceptable signal-to-noise ratio, though the residual standard error of 9.37 years is large relative to the spread of life expectancies. The residuals show a strong left skew (median 3.15, minimum -24.76), suggesting the errors are not normally distributed, and, as shown in the plot, the predictor does not appear to be linearly related to the dependent variable. Both violate assumptions of simple linear regression.
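
To examine these assumptions more closely, base R's standard diagnostic plots for fit1 can be drawn (a quick sketch; these plots are not part of the original output):

# residuals vs. fitted, normal Q-Q, scale-location, and leverage plots for the untransformed model
par(mfrow = c(2, 2))
plot(fit1)
par(mfrow = c(1, 1))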

Problem 2

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

y <- df$LifeExp^4.6
X <- df$TotExp^0.06

plot(X, y,
     main='Life Expectancy/Expenditures Transformed',
     col='blue')
abline(lm(y~X), col='red', lwd=3)
abline(v=c(1.5), col='green', lty=2, lwd=3)
abline(v=c(2.5), col='green', lty=2, lwd=3)

fit2 <- lm(y~X)
summary(fit2)
## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73 <0.0000000000000002 ***
## X            620060216   27518940   22.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 0.00000000000000022

Interpretation

With the transformations, we now have a linear relationship. The adjusted R^2 has jumped to 0.73, so the model captures most of the variation in (transformed) life expectancy. The coefficient for transformed expenditures is about 23 times its standard error (t = 22.53), also a dramatic improvement, and the F-statistic of 507.7 is highly significant. Best of all, the residuals are roughly symmetric around zero, per the model summary output and the diagnostic plots that follow.

A drawback is that the coefficients are harder to interpret: predictions are on the LifeExp^4.6 scale and must be raised to the power 1/4.6 to be read as years.
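
For example, a fitted value can be converted back to years with predict() and the inverse transformation (a small sketch using the fit2 object above):

# predicted LifeExp^4.6 at TotExp^0.06 = 1.5, converted back to years
pred_t <- predict(fit2, newdata = data.frame(X = 1.5))
pred_t^(1/4.6)   # roughly 63 years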

The plots below compare the distribution of residuals in the original (left) and transformed (right) models. Standard diagnostic plots for the transformed model follow.

residplot <- function(fit, nbreaks=10) {
               z <- rstudent(fit)
               hist(z, breaks=nbreaks, freq=FALSE,
                xlab="Studentized Residual",
                main="Distribution of Errors")
               rug(jitter(z), col="brown")
               curve(dnorm(x, mean=mean(z), sd=sd(z)),
                       add=TRUE, col="blue", lwd=2)
               lines(density(z)$x, density(z)$y,
                       col="red", lwd=2, lty=2)
               legend("topright",
                        legend = c( "Normal Curve", "Kernel Density Curve"),
                        lty=1:2, col=c("blue","red"), cex=.7)
             }
# show the two residual histograms side by side
par(mfrow=c(1,2))
residplot(fit1)
residplot(fit2)

plot(fit2)

Problem 3

  1. Using the results from the transformed model in Problem 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

Our regression formula in the transformed model is:

LifeExp^4.6 = -736527910 + 620060216 x TotExp^.06 + \(\epsilon\)

At TotExp^.06 = 1.5, the prediction on the transformed scale is

-736527910 + 620060216 x 1.5 = 193562414

We must transform the result back to the original scale:

193562414^(1/4.6) = 63.31 years

Does our result make sense? It appears so, based on plotting the point: see the intersection of the red lines below, which marks a typical life expectancy for a low-spending country.

What about predicting at a transformed Total Expenditures of 2.5? Because that value lies outside the range of our data, it is an inappropriate extrapolation: we have no evidence that the relationship holds outside the observed range of X, roughly [1.17, 2.19] on the transformed scale.

# back-transform the prediction (years) and the predictor (original expenditure units)
yt <- (-736527910 + 620060216*1.5)^(1/4.6)
xt <- 1.5^(1/.06)

plot(df$TotExp/1000, df$LifeExp,
     main='Life Expectancy and Total Expenditures',
     ylab='Years', xlab='Expenditures ($000)', col='blue')
abline(h=yt, col='red', lwd=2)
abline(v=xt/1000, col='red', lwd=2)
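
For reference, the observed range of the transformed predictor, and the mechanical forecast at TotExp^.06 = 2.5, can be computed directly (a quick sketch; the 2.5 forecast is shown only to illustrate that it falls outside the data):

range(X)                                  # roughly 1.17 to 2.19
(-736527910 + 620060216*2.5)^(1/4.6)      # extrapolated forecast, about 86 years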

Problem 4

  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

fit3 <- lm(LifeExp~PropMD + TotExp + PropMD:TotExp, data=df)
summary(fit3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                     Estimate     Std. Error t value             Pr(>|t|)
## (Intercept)     62.772703255    0.795605238  78.899 < 0.0000000000000002
## PropMD        1497.493952519  278.816879652   5.371   0.0000002320602774
## TotExp           0.000072333    0.000008982   8.053   0.0000000000000939
## PropMD:TotExp   -0.006025686    0.001472357  -4.093   0.0000635273294941
##                  
## (Intercept)   ***
## PropMD        ***
## TotExp        ***
## PropMD:TotExp ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 0.00000000000000022

Interpretation

This is a weak but interesting model. All three predictors are significant (i.e., there is evidence to believe their coefficients are not zero) at p < 0.001. But the adjusted R^2 shows the predictors capture only a small fraction (0.35) of the variation in life expectancy across countries, and the F-statistic of 34.49, though significant, is much smaller than in the transformed model. Given the multitude of events that can shorten one’s life, it’s not surprising that R^2 is low.

It is worth noting that the coefficient on PropMD, the proportion of doctors, has a much larger magnitude than those on Total Expenditures or their interaction. Doctors make a difference. The effects plot below substantiates the influence of doctors in countries with fewer resources: it shows the PropMD effect at four TotExp levels (1st quartile, median, mean, and 90th percentile). Note how the slope flattens out as expenditures rise.
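
The four TotExp levels appear to correspond to summary statistics of TotExp; they could be reproduced with something like the following (an assumption about how the levels were chosen, not confirmed in the original):

# candidate TotExp levels: 1st quartile, median, mean, and 90th percentile
c(quantile(df$TotExp, 0.25), median(df$TotExp), mean(df$TotExp), quantile(df$TotExp, 0.90))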

The residual plots below suggest the model violates the normality and homoscedasticity assumptions of linear regression.

They also reveal countries 45 and 146 as wild outliers. Life is good and long in Cyprus and San Marino, even though total expenditures are relatively small. How DO they do it? (Money laundering, maybe?) Actually, no. The two countries have the highest proportion of doctors in the data.
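
The two flagged rows can be inspected directly (a quick sketch, assuming the labels in the diagnostic plots match the row order of df):

# the rows flagged as outliers/high-leverage points in the diagnostic plots
df[c(45, 146), .(Country, LifeExp, PropMD, TotExp)]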

library(effects)   # provides effect()
# PropMD effect at four TotExp levels: 1st quartile, median, mean, 90th percentile
plot(effect("PropMD:TotExp", fit3,
            xlevels=list(TotExp=c(584, 5541, 41696, 166510.8))),
     multiline=TRUE)

residplot(fit3)

plot(fit3)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Problem 5

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Our not-so-good linear model says the formula for life expectancy is:

LifeExp = 62.773 + 1497.494 x PropMD + 0.000072 x TotExp - 0.00603 x PropMD x TotExp + \(\epsilon\)

So for PropMD = .03 and TotExp=14:

LifeExp = 62.7727 + 1497.4939 x 0.03 + 0.000072333 x 14 - 0.00603 x 0.03 x 14

= 107.696 years life expectancy.
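
The same forecast can be obtained directly from the fitted model with predict() (a quick check against the hand calculation above):

# model-based forecast at PropMD = 0.03 and TotExp = 14
predict(fit3, newdata = data.frame(PropMD = 0.03, TotExp = 14))   # about 107.7 years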

We are predicting at the fringes here: a proportion of doctors near the top of the observed range combined with near-bottom expenditures, an unrealistic combination in which PropMD has a disproportionate effect. A forecast of nearly 108 years exceeds any observed life expectancy, so no, it does not seem realistic. (Although take a look at Sierra Leone, where life expectancy is only 40 and TotExp and PropMD are both in the bottom decile.)