Life Expectancy Model

Initialization

library(tidyverse)

who <- read_csv('C:\\Users\\pgood\\OneDrive\\Documents\\R\\who.csv')

1

ggplot(who) + geom_point(aes(x = TotExp, y = LifeExp))

life <- lm(LifeExp ~ TotExp, data = who)
ggplot(who) + geom_point(aes(x = TotExp, y = LifeExp)) + geom_smooth(aes(x = TotExp, y = LifeExp), method = 'lm')

#F Statistic
summary(life)$fstatistic
##    value    numdf    dendf 
##  65.2642   1.0000 188.0000
#R Squared
summary(life)$r.squared
## [1] 0.2576922
#Standard Error
summary(life)$sigma
## [1] 9.371033
#P Value
f <- summary(life)$fstatistic
pf(f[1],f[2],f[3],lower.tail=F)
##        value 
## 7.713993e-14

The model is clearly misspecified. The scatter plot demonstrates a nonlinear relationship. The F statistic and p value indicate that there is a relationship between totExp and LifeExp, as does R squared. The standard error is misleading because the residuals will not be normally distributed, but as is says that there is about an 84% chance the actual value is +- 9.4 years from the prediction.

2

who_t <- who %>% 
  mutate(
    LifeExp = LifeExp**4.6,
    TotExp = TotExp**.06
  )

ggplot(who_t) + geom_point(aes(x = TotExp, y = LifeExp))

life <- lm(LifeExp ~ TotExp, data = who_t)

ggplot(who_t) + geom_point(aes(x = TotExp, y = LifeExp)) + geom_smooth(aes(x = TotExp, y = LifeExp), method = 'lm')

#F Statistic
summary(life)$fstatistic
##    value    numdf    dendf 
## 507.6967   1.0000 188.0000
#R Squared
summary(life)$r.squared
## [1] 0.7297673
#Standard Error
summary(life)$sigma
## [1] 90492393
#P Value
f <- summary(life)$fstatistic
pf(f[1],f[2],f[3],lower.tail=F)
##        value 
## 2.601428e-55

The scatterplot now shows a linear relationship. All of the model diagonstics are much better. R squared is 3 times higher, indicating 3 times more of the variance in LifeExp is explained. Taking the 4.6 root of the standard error:

summary(life)$sigma**(1/4.6)
## [1] 53.66588

indicates we’re much more sure of our prediction for life expectancy.

3

predict(life,data.frame(TotExp = 1.5))**(1/4.6)
##        1 
## 63.31153
predict(life,data.frame(TotExp = 2.5))**(1/4.6)
##        1 
## 86.50645

This model may not generalize to higher total expendature levels, but that can’t be determined because we don’t have those data points yet. Technology could allow for life expectancies in the 100s

4

life2 <- lm(LifeExp ~  PropMD + TotExp + PropMD * TotExp, data = who)

#F Statistic
summary(life2)$fstatistic
##     value     numdf     dendf 
##  34.48833   3.00000 186.00000
#R Squared
summary(life2)$adj.r.squared
## [1] 0.3470713
#Standard Error
summary(life2)$sigma
## [1] 8.765493
#P Value
f <- summary(life2)$fstatistic
pf(f[1],f[2],f[3],lower.tail=F)
##        value 
## 9.024193e-18
plot(life2)

This model has worse fit statistics than the transformed model, but slightly better than the single variable model. Based on the plot diagnostics, it apppears to be misspecified, as there are nonlinear trends and nonnormal residuals.

5

predict(life2,  data.frame(PropMD=.03, TotExp = 14))
##       1 
## 107.696

This prediction is not realistic. A country with that level of expendatures would not have the technology to keep people alive that long even if its a population with so many doctors. Extrapolating with a variable like life expectancy will require nonlinear predictors or a transformation of the response so that the response almost reaches an asymptote.