We are asked to analyze World Health Organization (WHO) data from 2008 containing life expectancy, child mortality data, proportion of medical staff and total per capita expenditure on healthcare. By analyzing these data using linear and multiple regression models, we seek to understand how to fit a model that explains country-level life expectancy in terms of the other variables.
The data is loaded from a csv file provided called HW12_who.csv. In the code block below, we see representative rows from the dataset. There are 190 observations of 10 variables.
library(tidyverse)
who = read_csv("HW12_who.csv")
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_double(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_double(),
## GovtExp = col_double(),
## TotExp = col_double()
## )
glimpse(who)
## Observations: 190
## Variables: 10
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "…
## $ LifeExp <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 7…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, …
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, …
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99…
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.00329729…
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.00350000…
## $ PersExp <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788,…
## $ GovtExp <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856,…
## $ TotExp <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944…
plot(who$TotExp, who$LifeExp)
mod1 = lm( LifeExp ~ TotExp, data=who)
abline(mod1, color="red")
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...):
## "color" is not a graphical parameter
summary(mod1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
par(mfrow=c(2,2))
plot(mod1)
par(mfrow=c(1,1))
hist(mod1$residuals)
who[c(99, 11, 76, 152),c("Country", "LifeExp", "TotExp")]
## # A tibble: 4 x 3
## Country LifeExp TotExp
## <chr> <dbl> <dbl>
## 1 Luxembourg 80 482750
## 2 Azerbaijan 64 842
## 3 Iceland 81 400776
## 4 Sierra Leone 40 172
F-statistic of 65.26 tells us whether the model’s independent variables jointly provide a better fit to the response variable than a model with no independent variables. I.e. where none of the independent variables are statistically significant predictors of the response. In this case, the p-value of the F-test is very small so the predictor variables are statistically significant.
R^2 0.2577 measures the percentage of variance of the response variable (LifeExp) explained by the model’s predictor variables (TotExp). In this case, it is 25.77%.
standard error – 9.371 defines the average distance that the observations deviate from the regression line. 9.371 is measured in years in this case. It measures how wrong the model is on average.
p-value 7.71e-14 for TotExp, 2e-16 for Intercept - the p-value of the coefficients tell us the probability of the coefficient actually being zero but showing the observed values is nearly zero. In this case, the coefficients of the regression are highly statistically significant.
The assumptions of simple linear regression are not met for several reasons:
Non-linear trend – visually we see the scatterplot appear to define an inverted L-shaped curve – not a linear trend.
Non-constant variance - most of the variance is bunched at the lower range of the TotExp variable from 0 to 1000 dollars per capita.
Residuals don’t fit a normal distribution. Both on the Q-Q plot above, the realized residuals seem to form a thin tailed distribution. Likewise the histogram of residuals above shows a distribution skewed to the left. This is because lifeExp has an upper bound from biological constraints whereas the linear model implies unbounded life expectancy is possible.
In this analysis, we do a power transformation of both the life expectancy and the total expenditure variables raising them to powers 4.6 and 0.06 respectively. Then we run the regression between the two variables and analyze the results.
who2 <- ( who %>% mutate( LifeExp2 = LifeExp^04.6, TotExp2 = TotExp^0.06 ) )
glimpse(who2)
## Observations: 190
## Variables: 12
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "…
## $ LifeExp <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 7…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, …
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, …
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99…
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.00329729…
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.00350000…
## $ PersExp <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788,…
## $ GovtExp <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856,…
## $ TotExp <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944…
## $ LifeExp2 <dbl> 29305338, 327935478, 327935478, 636126841, 262304…
## $ TotExp2 <dbl> 1.327251, 1.625875, 1.672697, 2.061481, 1.560068,…
mod2 = lm( LifeExp2 ~ TotExp2, data=who2)
summary(mod2)
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp2 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(who2$TotExp2, who2$LifeExp2)
abline(mod2)
hist(mod2$residuals)
Looking at the scatter plot and the fit of the regression line, we conclude the linear model is appropriate. This is confirmed by the near normality of the histogram of residuals above. The histogram shows a slight left skew in the distribution of residuals.
The F-statistic is 507.7 and its p-value is nearly zero. This suggests TotExp2 is a statistically significant predictor of LifeExp2 and the model is better than a constant estimate of the LifeExp2.
The R-squared is 72.98% which implies the model can explain 72.98% of the variance of LifeExp2.
The standard error is 90490000 represents 1 standard deviation of the model error around LifeExp2.
The p-values of the coefficients around nearly zero (<2e-16) and suggest both the intercept and the slope are statistically very significantly different from zero.
new.data= data.frame( TotExp2 = c( 1.5 , 2.5 ))
( predictions = predict(mod2, newdata= new.data) )
## 1 2
## 193562414 813622630
( new.pred = data.frame( TotExp2 = new.data$TotExp2,
LifeExp = predictions^(1/4.6),
TotExp = new.data$TotExp2^(1/0.06) )
)
## TotExp2 LifeExp TotExp
## 1 1.5 63.31153 860.705
## 2 2.5 86.50645 4288777.125
knitr::kable(new.pred, digit = 1)
| TotExp2 | LifeExp | TotExp |
|---|---|---|
| 1.5 | 63.3 | 860.7 |
| 2.5 | 86.5 | 4288777.1 |
The forecast life expectancy is 63.3 years when TotExp2 = 1.5 and 86.5 years when TotExp2 = 2.5. These are reasonable values because although 86.5 exceeds the range of current life expectanies. It is well within the range of plausible individual life spans.
The proposed multiple regression model is generated below along with its summary statistics, histogram of residuals and the four standard diagnostic plots.
mod3 = lm( LifeExp ~ PropMD + TotExp + PropMD * TotExp, data=who)
summary(mod3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
hist(mod3$residuals)
par(mfrow=c(2,2))
plot(mod3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
After a review of the plots and diagnostics, I conclude the model fit is poor and don’t recommend its use. First, we evaluate the results. The intercept and sign of the coefficients seem appropriate for PropMD and TotExp.
However, the sign of the interaction term seems wrong. As the proportion of doctors increases, the benefit of increasing expenditures causes life expectancy to decline. This does not make medical sense.
Separately, the R-squared is somewhat low. The model explains only 35.74% of the variation in life expectancy. Although the F-test suggests the model is statistically significant (F = 34.49 and p-value < 2.2e-16), we cannot conclude the model is a good fit for prediction purposes. Lastly, the p-value of all the coefficients are nearly zero so the coefficients are very statistically significant.
Looking at the diagnostic plots, we see that residuals are highly skewed. In the residual vs. Fitted plot and in the Residuals vs. Leverage plots, observation 45 is an influential outlier. This observation is corresponds to the nation of Cyprus with has a PropMD = 3.32% , TotExp is 40,749 per capita which has a life expectancy of over 80 year. The removal of Cyprus will improve the model R-squared substantially.
whoMinusCyprus = who[-45,]
modExCyprus = lm( LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whoMinusCyprus )
summary(modExCyprus)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whoMinusCyprus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6078 -4.5469 0.6976 6.1319 12.6637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.946e+01 8.516e-01 69.823 < 2e-16 ***
## PropMD 4.482e+03 4.927e+02 9.098 < 2e-16 ***
## TotExp 8.102e-05 8.099e-06 10.004 < 2e-16 ***
## PropMD:TotExp -1.608e-02 1.943e-03 -8.276 2.47e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.811 on 185 degrees of freedom
## Multiple R-squared: 0.4888, Adjusted R-squared: 0.4805
## F-statistic: 58.96 on 3 and 185 DF, p-value: < 2.2e-16
hist(modExCyprus$residuals)
In the above, we see that modExCyprus explained 48.8% of the variance whereas mod3 explains only 35.7% of the variance. This change in R-2 confirms the significance of removing the outlier Cyprus.
We now examine the prediction of the multiple regression model when PropMD = 0.03 and TotExp = 14.
new.data2 = data.frame( PropMD = c(0.03), TotExp = c(14))
( p2 = predict(mod3, new.data2) )
## 1
## 107.696
The predicted life expectancy is 107.7 years which is unrealistic because no country is every likely to reach this life expectancy within the 21th century even with gradual improvements in medical care.