Homework assignment 12

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Data explaination:
* Country: name of the country
* LifeExp: average life expectancy for the country in years
* InfantSurvival: proportion of those surviving to one year or more
* Under5Survival: proportion of those surviving to five years or more
* TBFree: proportion of the population without TB.
* PropMD: proportion of the population who are MDs
* PropRN: proportion of the population who are RNs
* PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
* GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
* TotExp: sum of personal and government expenditures.

who = read_csv("./who.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   LifeExp = col_double(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_double(),
##   GovtExp = col_double(),
##   TotExp = col_double()
## )
head(who)
## # A tibble: 6 x 10
##   Country LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN
##   <chr>     <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>
## 1 Afghan~      42          0.835          0.743  0.998 2.29e-4 5.72e-4
## 2 Albania      71          0.985          0.983  1.000 1.14e-3 4.61e-3
## 3 Algeria      71          0.967          0.962  0.999 1.06e-3 2.09e-3
## 4 Andorra      82          0.997          0.996  1.000 3.30e-3 3.50e-3
## 5 Angola       41          0.846          0.74   0.997 7.04e-5 1.15e-3
## 6 Antigu~      73          0.99           0.989  1.000 1.43e-4 2.77e-3
## # ... with 3 more variables: PersExp <dbl>, GovtExp <dbl>, TotExp <dbl>
options("scipen"=100, "digits"=4)

1.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \({ R }^{ 2 }\), standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

who%>%
  ggplot(aes(LifeExp,TotExp))+
    geom_point(size = 4, alpha = .4)+
    geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

who_linear <- lm(LifeExp ~ TotExp, data = who)

summary(who_linear)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.76  -4.78   3.15   7.12  13.29 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) 64.75337453  0.75353661   85.93 < 0.0000000000000002 ***
## TotExp       0.00006297  0.00000779    8.08    0.000000000000077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.37 on 188 degrees of freedom
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.254 
## F-statistic: 65.3 on 1 and 188 DF,  p-value: 0.0000000000000771

\[LifeExp=64.7533+0.00006297*TotExp\] At this time, this is a good signigicant predictor of evulation since the p-value is near 0 (less than .05). However looking at both the \({ R }^{ 2 }\) figures it fits only ~25% of the data.

qqnorm(who_linear$residuals)
qqline(who_linear$residuals)

The Q-Q plot shows the data is not uniformly scattered and has a definate curve. This matches the initial scatter plot.
I can not say clearly at this time the assumptions of linear regression are met.

2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \({ R }^{ 2 }\), standard error, and p-values. Which model is “better?”

who<-who%>%
  mutate(LifeExp2 = LifeExp^4.6)%>%
  mutate(TotExp2 = TotExp^.06)
  
who_linear2 <- lm(LifeExp2 ~ TotExp2, data = who)
summary(who_linear2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -736527910   46817945   -15.7 <0.0000000000000002 ***
## TotExp2      620060216   27518940    22.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared:  0.73,   Adjusted R-squared:  0.728 
## F-statistic:  508 on 1 and 188 DF,  p-value: <0.0000000000000002

\[Previous\] \[LifeExp=64.7533+0.00006297*TotExp\] \[Updated\] \[LifeExp=-736527910+620060216*TotExp\]

While there is no major change to the p-value (it remains signifiant and under 0.05), there is a large change in the \({ R }^{ 2 }\) figures. Previous model fit ~25% of the data, the updated model now fits ~73% of the data.

qqnorm(who_linear2$residuals)
qqline(who_linear2$residuals)

There is now no longer as much of a curve to the top of the data as it starts fo fall in line with the residual line. There is no change to the bottom of the data, which is to be expected.

3.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

TotExp <- 1.5
TotExp2 <- 2.5
raiseexp <- -736527910+620060216*TotExp
raiseexp2 <- -736527910+620060216*TotExp2

(Lifeexp1 <- raiseexp^(1/4.6))
## [1] 63.31
(Lifeexp2 <- raiseexp2^(1/4.6))
## [1] 86.51

4.

Build the following multiple regression model and interpret the F Statistics, R^2, standard error,and p-values. How good is the model?

\[LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp\]

who_linear3 <- lm(data = who, LifeExp ~ PropMD + TotExp + PropMD*TotExp)
summary(who_linear3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.32  -4.13   2.10   6.54  13.07 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)     62.77270326    0.79560524   78.90 < 0.0000000000000002 ***
## PropMD        1497.49395252  278.81687965    5.37    0.000000232060277 ***
## TotExp           0.00007233    0.00000898    8.05    0.000000000000094 ***
## PropMD:TotExp   -0.00602569    0.00147236   -4.09    0.000063527329494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.77 on 186 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.347 
## F-statistic: 34.5 on 3 and 186 DF,  p-value: <0.0000000000000002

\[LifeExp = 62.7727+0.00007233 TotExp + 1497.4939 PropMD + -0.006025 * TotExp*PropMD\]

All T-Values do show statistically significant predictor values to the evaluation score.
The p-value is also near 0 (under 0.05) show statistically significant predictor values to the evaluation score.
The \({ R }^{ 2 }\) figures are ~38% and ~35% which is low varibility of the model.
The Residual Standard Error shows 8.77 w/ 186 degrees of freedom making it significant. We would have to take in to account the datapoints will be on average 8.77 off from the model. This is not off from the first model.

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
TotExp = 14
PropMD=.03

LifeExp = 62.7727+0.00007233*TotExp + 1497.4939*PropMD + -0.006025*TotExp*PropMD
LifeExp
## [1] 107.7

107 does not seem too far off from the model forecast, however, it does present a small bit of discussion. 3% of your population is Doctors and your total expendatures is only 14 billion does not seem reasonable. To have confidence in the model, I should be able to plug in the data from an actual country and get near enough to the same LifeExp number listed.

Pulling the numbers from UK

TotExp = 243184
PropMD= 0.0022085

LifeExp_mod = 62.7727+0.00007233*TotExp + 1497.4939*PropMD + -0.006025*TotExp*PropMD
LifeExp_mod
## [1] 80.43
LifeExpUK <- who%>%
  filter(Country == "United Kingdom")%>%
  mutate(LifeExpUK = LifeExp)%>%
  mutate(LifeExp_Model = .GlobalEnv$LifeExp_mod)%>%
  select(LifeExpUK, LifeExp_Model)
LifeExpUK
## # A tibble: 1 x 2
##   LifeExpUK LifeExp_Model
##       <dbl>         <dbl>
## 1        79          80.4

This is within the degrees of freedom already identified, I would accept this model forecast.