Data 605 Assignment Week 12

Alexander Ng

11/17/2019

Overview

We are asked to analyze World Health Organization (WHO) data from 2008 containing life expectancy, child mortality data, proportion of medical staff and total per capita expenditure on healthcare. By analyzing these data using linear and multiple regression models, we seek to understand how to fit a model that explains country-level life expectancy in terms of the other variables.

The data is loaded from a csv file provided called HW12_who.csv. In the code block below, we see representative rows from the dataset. There are 190 observations of 10 variables.

library(tidyverse)
who = read_csv("HW12_who.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   LifeExp = col_double(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_double(),
##   GovtExp = col_double(),
##   TotExp = col_double()
## )
glimpse(who)
## Observations: 190
## Variables: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "…
## $ LifeExp        <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 7…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, …
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, …
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99…
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.00329729…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.00350000…
## $ PersExp        <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788,…
## $ GovtExp        <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856,…
## $ TotExp         <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944…

1. Scatterplot and Simple Linear Regression

plot(who$TotExp, who$LifeExp)
mod1 = lm( LifeExp ~ TotExp, data=who)
abline(mod1, color="red")
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...):
## "color" is not a graphical parameter

summary(mod1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
par(mfrow=c(2,2))
plot(mod1)

par(mfrow=c(1,1))
hist(mod1$residuals)

who[c(99, 11, 76, 152),c("Country", "LifeExp", "TotExp")]
## # A tibble: 4 x 3
##   Country      LifeExp TotExp
##   <chr>          <dbl>  <dbl>
## 1 Luxembourg        80 482750
## 2 Azerbaijan        64    842
## 3 Iceland           81 400776
## 4 Sierra Leone      40    172

F-statistic of 65.26 tells us whether the model’s independent variables jointly provide a better fit to the response variable than a model with no independent variables. I.e. where none of the independent variables are statistically significant predictors of the response. In this case, the p-value of the F-test is very small so the predictor variables are statistically significant.

R^2 0.2577 measures the percentage of variance of the response variable (LifeExp) explained by the model’s predictor variables (TotExp). In this case, it is 25.77%.

standard error – 9.371 defines the average distance that the observations deviate from the regression line. 9.371 is measured in years in this case. It measures how wrong the model is on average.

p-value 7.71e-14 for TotExp, 2e-16 for Intercept - the p-value of the coefficients tell us the probability of the coefficient actually being zero but showing the observed values is nearly zero. In this case, the coefficients of the regression are highly statistically significant.

The assumptions of simple linear regression are not met for several reasons:

  1. Non-linear trend – visually we see the scatterplot appear to define an inverted L-shaped curve – not a linear trend.

  2. Non-constant variance - most of the variance is bunched at the lower range of the TotExp variable from 0 to 1000 dollars per capita.

  3. Residuals don’t fit a normal distribution. Both on the Q-Q plot above, the realized residuals seem to form a thin tailed distribution. Likewise the histogram of residuals above shows a distribution skewed to the left. This is because lifeExp has an upper bound from biological constraints whereas the linear model implies unbounded life expectancy is possible.

2. Second Linear Regression with Powers

In this analysis, we do a power transformation of both the life expectancy and the total expenditure variables raising them to powers 4.6 and 0.06 respectively. Then we run the regression between the two variables and analyze the results.

who2 <- ( who %>% mutate( LifeExp2 = LifeExp^04.6, TotExp2 = TotExp^0.06 ) )

glimpse(who2)
## Observations: 190
## Variables: 12
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "…
## $ LifeExp        <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 7…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, …
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, …
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99…
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.00329729…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.00350000…
## $ PersExp        <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788,…
## $ GovtExp        <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856,…
## $ TotExp         <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944…
## $ LifeExp2       <dbl> 29305338, 327935478, 327935478, 636126841, 262304…
## $ TotExp2        <dbl> 1.327251, 1.625875, 1.672697, 2.061481, 1.560068,…
mod2 = lm( LifeExp2 ~ TotExp2, data=who2)
summary(mod2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(who2$TotExp2, who2$LifeExp2)
abline(mod2)

hist(mod2$residuals)

Looking at the scatter plot and the fit of the regression line, we conclude the linear model is appropriate. This is confirmed by the near normality of the histogram of residuals above. The histogram shows a slight left skew in the distribution of residuals.

The F-statistic is 507.7 and its p-value is nearly zero. This suggests TotExp2 is a statistically significant predictor of LifeExp2 and the model is better than a constant estimate of the LifeExp2.

The R-squared is 72.98% which implies the model can explain 72.98% of the variance of LifeExp2.

The standard error is 90490000 represents 1 standard deviation of the model error around LifeExp2.

The p-values of the coefficients around nearly zero (<2e-16) and suggest both the intercept and the slope are statistically very significantly different from zero.

3. Predictions Of Second Model

new.data= data.frame( TotExp2 = c( 1.5 , 2.5 ))

( predictions = predict(mod2, newdata= new.data) )
##         1         2 
## 193562414 813622630
( new.pred = data.frame( TotExp2 = new.data$TotExp2,
                         LifeExp = predictions^(1/4.6), 
                         TotExp = new.data$TotExp2^(1/0.06) )
                          )
##   TotExp2  LifeExp      TotExp
## 1     1.5 63.31153     860.705
## 2     2.5 86.50645 4288777.125
knitr::kable(new.pred, digit = 1)
TotExp2 LifeExp TotExp
1.5 63.3 860.7
2.5 86.5 4288777.1

The forecast life expectancy is 63.3 years when TotExp2 = 1.5 and 86.5 years when TotExp2 = 2.5. These are reasonable values because although 86.5 exceeds the range of current life expectanies. It is well within the range of plausible individual life spans.

4. Multiple Regression Model

The proposed multiple regression model is generated below along with its summary statistics, histogram of residuals and the four standard diagnostic plots.

mod3 = lm( LifeExp ~ PropMD + TotExp + PropMD * TotExp, data=who)
summary(mod3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
hist(mod3$residuals)

par(mfrow=c(2,2))
plot(mod3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

After a review of the plots and diagnostics, I conclude the model fit is poor and don’t recommend its use. First, we evaluate the results. The intercept and sign of the coefficients seem appropriate for PropMD and TotExp.
However, the sign of the interaction term seems wrong. As the proportion of doctors increases, the benefit of increasing expenditures causes life expectancy to decline. This does not make medical sense.

Separately, the R-squared is somewhat low. The model explains only 35.74% of the variation in life expectancy. Although the F-test suggests the model is statistically significant (F = 34.49 and p-value < 2.2e-16), we cannot conclude the model is a good fit for prediction purposes. Lastly, the p-value of all the coefficients are nearly zero so the coefficients are very statistically significant.

Looking at the diagnostic plots, we see that residuals are highly skewed. In the residual vs. Fitted plot and in the Residuals vs. Leverage plots, observation 45 is an influential outlier. This observation is corresponds to the nation of Cyprus with has a PropMD = 3.32% , TotExp is 40,749 per capita which has a life expectancy of over 80 year. The removal of Cyprus will improve the model R-squared substantially.

whoMinusCyprus = who[-45,]

modExCyprus = lm( LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whoMinusCyprus )

summary(modExCyprus)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whoMinusCyprus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6078  -4.5469   0.6976   6.1319  12.6637 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.946e+01  8.516e-01  69.823  < 2e-16 ***
## PropMD         4.482e+03  4.927e+02   9.098  < 2e-16 ***
## TotExp         8.102e-05  8.099e-06  10.004  < 2e-16 ***
## PropMD:TotExp -1.608e-02  1.943e-03  -8.276 2.47e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.811 on 185 degrees of freedom
## Multiple R-squared:  0.4888, Adjusted R-squared:  0.4805 
## F-statistic: 58.96 on 3 and 185 DF,  p-value: < 2.2e-16
hist(modExCyprus$residuals)

In the above, we see that modExCyprus explained 48.8% of the variance whereas mod3 explains only 35.7% of the variance. This change in R-2 confirms the significance of removing the outlier Cyprus.

5. Predictions Of Multiple Regression Model

We now examine the prediction of the multiple regression model when PropMD = 0.03 and TotExp = 14.

new.data2 = data.frame( PropMD = c(0.03), TotExp = c(14))

( p2 = predict(mod3, new.data2) )
##       1 
## 107.696

The predicted life expectancy is 107.7 years which is unrealistic because no country is every likely to reach this life expectancy within the 21th century even with gradual improvements in medical care.