# get the data
df <- fread("who.csv")
# change country to factor
df$Country <- as.factor(df$Country)Problem 1
- Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Scatterplot:
plot(df$TotExp/1000, df$LifeExp,
main='Life Expectancy and Total Expenditures',
ylab='Years', xlab='Expenditures ($000)', col='blue')# abline(lm(df$LifeExp~df$TotExp), col='red', lwd=3)Linear model
fit1 <- lm(LifeExp~TotExp, data=df)
summary(fit1)##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 0.0000000000000002 ***
## TotExp 0.000062970 0.000007795 8.079 0.0000000000000771 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 0.00000000000007714
Interpretation
The mean life expectancy is 64.75 years. For every $1B in total expenditures, life expectancy increases by .00006 years. The R2 of 0.25 suggests that a linear model is a weak predictor for this data; the model explains only about a quarter of the variation in life expectancy. This could be explained by outliers, or more probably given our plot, because the relationship between our variables is not linear. Because the p-value of the F-statisic is < 0.05, we can reject the null hypothesis that the model is worthless (i.e, that that coef of Total Expenditures = 0). However, the F-statistic is really designed to measure multivariate models, so it’s not particularly meaningful in this model.
The standard error is 8.08 smaller than the coefficient, an acceptable ratio for error. There is a strong left skew to the residuals, suggesting that variance is not equally distributed across the errors. That would violate one of the assumptions of linear regression; as shown in the plot, the predictor does not appear to be linearly related to the dependent variable.
Problem 2
- Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
y <- df$LifeExp^4.6
X <- df$TotExp^0.06
plot(X, y,
main='Life Expectancy/Expenditures Transformed',
col='blue')
abline(lm(y~X), col='red', lwd=3)
abline(v=c(1.5), col='green', lty=2, lwd=3)
abline(v=c(2.5), col='green', lty=2, lwd=3)fit2 <- lm(y~X)
summary(fit2)##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <0.0000000000000002 ***
## X 620060216 27518940 22.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 0.00000000000000022
Interpretation
With the transformations, we now have a linear model. The adjusted R2 has zoomed to 0.73, so the model is capturing most of the variation in expectancy. The standard error is 23 times smaller than the coefficient for total expenditures, so that is also a dramatic improvement. Best, the residuals are evenly distributed per the model summary output and the diagnostic plots that follow.
A drawback is that the coefficient is harder to interpret and must be transformed back to the original scale to make an understandable prediction.
The plots below compare the distribution of residuals in the original (left) and transformed (right) models. Standard dignostic plots for the transformed model follow.
residplot <- function(fit, nbreaks=10) {
z <- rstudent(fit)
hist(z, breaks=nbreaks, freq=FALSE,
xlab="Studentized Residual",
main="Distribution of Errors")
rug(jitter(z), col="brown")
curve(dnorm(x, mean=mean(z), sd=sd(z)),
add=TRUE, col="blue", lwd=2)
lines(density(z)$x, density(z)$y,
col="red", lwd=2, lty=2)
legend("topright",
legend = c( "Normal Curve", "Kernel Density Curve"),
lty=1:2, col=c("blue","red"), cex=.7)
}plot.new()
par(mfrow=c(1,2))
residplot(fit1)
residplot(fit2)plot(fit2)Problem 3
- Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Our regression formula in the transformed model is:
Life Expectancy = -736527910 + 620060216 x Total Expenditures + \(\epsilon\)
= 193562414
We must transform the result back to the original scale.
= 63.31
Does our result make sense? Looks so, based on plotting the point. See intersection of red lines below. This represents a typical life expectancy at a low-spending country.
What about predicting at Total Expenditures = 2.5 transformed? Because that data point is outside the range of our model, it’s an inappropriate extrapolation. We have no evidence to believe the data will behave the same outside of X in {1.17, 2.19} transformed.
yt <- (-736527910 + 620060216*1.5)^(1/4.6)
xt <- 1.5^(1/.06)
plot(df$TotExp/1000, df$LifeExp,
main='Life Expectancy and Total Expenditures',
ylab='Years', xlab='Expenditures ($000)', col='blue')
abline(h=yt, col='red', lwd=2)
abline(v=xt/1000, col='red', lwd=2)Problem 4
- Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
fit3 <- lm(LifeExp~PropMD + TotExp + PropMD:TotExp, data=df)
summary(fit3)##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.772703255 0.795605238 78.899 < 0.0000000000000002
## PropMD 1497.493952519 278.816879652 5.371 0.0000002320602774
## TotExp 0.000072333 0.000008982 8.053 0.0000000000000939
## PropMD:TotExp -0.006025686 0.001472357 -4.093 0.0000635273294941
##
## (Intercept) ***
## PropMD ***
## TotExp ***
## PropMD:TotExp ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 0.00000000000000022
Interpretation
This is a weak but interesting model. Al three predictors are significant (i.e., there is evidence to believe they are not zero) at p > 0.001. But the adjusted R2 shows the predictors capture only a small (0.35) fraction of the variation in life expectancy in countries, and the F-statistic, though significant, also is small. Given the multitude of events that can shorten one’s life, it’s nor surprising that R2 is low.
An important thing worth noting is the effect of PropMD – the proportion of doctors – has a much higher magnitude than Total Expenditures or their interaction. Doctors make a difference. The effects plot below substantiates the influence of doctors in countries with fewer resoruces. The plot shows the PropMD effect at four TotExp levels; 1st quartile, median, mean and 90th percentile. Note how the slope flattens out.
The residual plots below suggest the model is outside normality and homoscedacity assumptions for a linear regression.
They also reveal countries 45 and 146 as wild outliers. Life is good and long in Cyprus and San Marino, even though total expenditures are relatively small. How DO they do it? (Money laundering, maybe?) Actually, no. The two countries have the highest proportion of doctors in the data.
plot(effect("PropMD:TotExp", fit3,,
list(TotExp=c(584,5541,41696,166510.8)),
multiline=TRUE))residplot(fit3)plot(fit3)## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Problem 5
- Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
Our not-so-good linear model says the formula for life expectancy is:
LifeExp = 62.773 + 1497.494xPropMd + 0.000072xTotExp -0.00603PropMDTotExp + \(\epsilon\)
So for PropMD = .03 and TotExp=14:
LifeExp = 62.7727 + 1497.4939x0.03 + 0.000072333x14 - 0.00603x0.03x14
= 107.696 years life expectancy.
We are predicting at the fringes here, with the highest proportion of doctors and near bottom expenditures, an unrealistic scenario. PropMD has a disproportionate effect. (Although take a look at Sierra Leone, where life expectancy is only 40 and TotExp and PropMD are also in the bottom decile.)