df_who <- read.csv("who.csv")

summary(df_who)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750
glimpse(df_who)
## Rows: 190
## Columns: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola…
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,…
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0…
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0…
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1…
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761…
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907…

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

plot(df_who$TotExp, df_who$LifeExp)

m1 <- lm(LifeExp ~ TotExp, data=df_who)
summary(m1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df_who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
plot(m1)

Interpret the F statistics, R^2, standard error,and p-values only.

F statistic is indicating that the model overall is significant (better than guessing the mean) at a 99% confidence level. Adjusted R-squared is rather low, but I would usually take this as an indicator that the model is worth attempting to improve (ignoring the obviously failed assumptions for the moment), rather than abandoning. The p-values for the intercept, TotExp, and the model overall are all significant at a 99% confidence level.

Discuss whether the assumptions of simple linear regression met.

Based on the scatterplot we can see the relationship is not linear. Based on the residual plots we can further conclude the relationship is non-linear and the residuals are non-normal. There may be other assumptions not being met, but we can reject this model before proceeding further.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

plot(df_who$TotExp^.06, df_who$LifeExp^4.6)

m2 <- lm(I(LifeExp^4.6) ~ I(TotExp^.06), data=df_who)
summary(m2)
## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = df_who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

The F statistic is indicating the model is extremely significant (better than guessing the mean). Adjusted R-squared is substantially better than the previous model and overall .728 is interpreted as large (Cohen 1988). This model is significantly better.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

(-736527910 + 620060216 * 1.5)^(1/4.6)
## [1] 63.31153
(-736527910 + 620060216 * 2.5)^(1/4.6)
## [1] 86.50645

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data=df_who)
summary(m3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df_who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

The F statistic is indicating the model is very significant (better than guessing the mean). Adjusted R-squared is improved over the first model, but a step back from the second model and overall 0.347 is interpreted as moderate (Cohen 1988). This model is better than the first and worse than the second.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

new <- data.frame(PropMD = c(0.3), TotExp = c(14))
predict(m3, new)
##        1 
## 511.9966

This does not seem realistic as human life expectancy is far below 512 and is not expected to approach that in the near future despite the number of MDs and government expenditures. This is another reminder that models should avoid extrapolation when possible and if it is necessary do not exceed significantly beyond the data values you do have.