The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the countryLifeExp: average life expectancy for the country in
yearsInfantSurvival: proportion of those surviving to one
year or more Under5Survival: proportion of those surviving to five years
or moreTBFree: proportion of the population without TB.PropMD: proportion of the population who are MDsPropRN: proportion of the population who are RNsPersExp: mean personal expenditures on healthcare in US
dollars at average exchange rateGovtExp: mean government expenditures per capita on
healthcare, US dollars at average exchange rateTotExp: sum of personal and government
expenditures.# data
df <- read_csv('data/who.csv')
# scatterplot
df %>% ggplot(aes(x=TotExp, y=LifeExp)) +
geom_point(alpha=0.2) +
geom_smooth(method='lm', se=FALSE, alpha=0.5, color='blue')
df %>% ggplot(aes(x=TotExp)) +
geom_histogram()
# lm
lm1 <- lm(LifeExp ~ TotExp, data=df)
summary(lm1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# fitted vs residual
get_regression_points(lm1) %>%
ggplot(aes(x=LifeExp_hat, y=residual)) +
geom_point(alpha=0.3) +
geom_hline(yintercept=0, color='blue')
# qq
qqnorm(resid(lm1))
qqline(resid(lm1))
Model Statistics
The Adjusted \(R^2\) is 0.25, suggesting this single-predictor model explains only 25% of the variance in Life Expectancy. The F-Statistic of 65.26 provides the coefficient we’d observe in an intercept-only model, and we can see it is very lose to the coefficient of our intercept here (64.75). The p-value for the F-Statistic is very small (7.714e-14) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 9.37, the amount of total variance in the residual values.
Assumptions of Linear Regression
We can scatterplot the transformed predictor and response, and plainly see we’ve managed to “bend” the original values into a scale that lends itself well to linear regression:
df2 <- df %>%
mutate(LifeExp = LifeExp ^ 4.6, TotExp = TotExp ^ 0.06)
df2 %>% ggplot(aes(x=TotExp, y=LifeExp)) +
geom_point(alpha=0.2) +
geom_smooth(method='lm', se=FALSE, alpha=0.5, color='blue')
lm2 <- lm(LifeExp ~ TotExp, data=df2)
summary(lm2)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
# fitted vs residual
get_regression_points(lm2) %>%
ggplot(aes(x=LifeExp_hat, y=residual)) +
geom_point(alpha=0.3) +
geom_hline(yintercept=0, color='blue')
# qq
qqnorm(resid(lm2))
qqline(resid(lm2))
Model Statistics
Overall, this second model performs better than the first. The Adjusted \(R^2\) is 0.73, suggesting this single-predictor model explains about 73% of the variance in Life Expectancy. The F-Statistic of 507.7 provides the coefficient we’d observe in an intercept-only model, and we can see it is quite far from coefficient of our intercept here. The p-value for the F-Statistic is very small (2.2e-16) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 90490000, the amount of total variance in the residual values - although if we wanted to interpret this (and both model coefficients) in real terms, we would need to transform them back to the original scale.
While the coefficients and statistics are on an exponential scale, we can see the assumptions of linear largely hold true, compared to the first model. The fitted vs. residuals plot demonstrates greater homoscedasticity, and the Q-Q plot shows the residuals are more normally distributed.
newval1 <- data.frame(TotExp = c(1.5))
lm2_predict1 <- predict(lm2, newdata = newval1) ^ (1/4.6)
newval2 <- data.frame(TotExp = c(2.5))
lm2_predict2 <- predict(lm2, newdata = newval2) ^ (1/4.6)
print(lm2_predict1)
## 1
## 63.31153
print(lm2_predict2)
## 1
## 86.50645
Remembering that the raw prediction returned is on the transformed scale of our predictor (to the power of 4.6), we simply transform back to the original scale to get a “real world” estimate of life expectancy.
When TotExp^.06 = 1.5 ($861), the LifeExp is approx 63.3.
When TotExp^.06 = 2.5 ($4.3M), the LifeExp is approx 86.5.
lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
summary(lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
# fitted vs residual
get_regression_points(lm3) %>%
ggplot(aes(x=LifeExp_hat, y=residual)) +
geom_point(alpha=0.3) +
geom_hline(yintercept=0, color='blue')
# qq
qqnorm(resid(lm3))
qqline(resid(lm3))
In this model we return to non-transformed values for the response and predictors, and we can observe similar problems regarding our assumptions for linear regression.
The Adjusted \(R^2\) is 0.35, suggesting this model only explains about 35% of the variance in Life Expectancy. The F-Statistic of 34.5 provides the coefficient we’d observe in an intercept-only model, and we can see it is quite far from coefficient of our intercept here. The p-value for the F-Statistic is very small (2.2e-16) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 8.77, the amount of total variance in the residual values.
The fitted vs residual plot shows extreme heteroscedasticity, and the Q-Q plot shows the residuals are less-normally distributed than our transformed model in the second example.
newval3 <- data.frame(PropMD=c(.03), TotExp = c(1.5))
lm3_predict1 <- predict(lm3, newdata = newval3)
print(lm3_predict1)
## 1
## 107.6974
At first glance, the estimate of an average life expectancy of nearly 108 years seems extremely optimistic! If we compare to the second model in which we used transformed variables, we saw that a per-capita investment of $4.2M (holding all other variables equal) would be required just to raise life expectancy to 85.
Given the low Adjusted \(R^2\) and multiple violations of our assumptions for linear regression – and applying a dose of common sense – we can safely reject this prediction and look for ways to improve the model instead.