Get the Data and Review Data

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 190 obs. of  10 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : num  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : num  20 169 108 2589 36 ...
##  $ GovtExp       : num  92 3128 5184 169725 1620 ...
##  $ TotExp        : num  112 3297 5292 172314 1656 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   LifeExp = col_double(),
##   ..   InfantSurvival = col_double(),
##   ..   Under5Survival = col_double(),
##   ..   TBFree = col_double(),
##   ..   PropMD = col_double(),
##   ..   PropRN = col_double(),
##   ..   PersExp = col_double(),
##   ..   GovtExp = col_double(),
##   ..   TotExp = col_double()
##   .. )
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN         
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750
  1. Provide a scatter plot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Scatter Plot - Life Expectancy vs Total Expenditures

Model - Mod1: lm(LifeExp~TotExp, whodat)

term         estimate   std.error  statistic  p.value
(Intercept)  64.753375  0.7535366  85.932619  0
TotExp        0.000063  0.0000078   8.078626  0

Model Evaluation

r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual
0.2576922  0.2537437      9.371033  65.2642    0        2   -693.7415  1393.483  1403.224  16509.46  188

The scatter plot of Total Expenditures and Life Expectancy does not scream linear relationship, so it is difficult to say that all the assumptions of simple linear regression are met. The intercept of 64.75 means that with no expenditures one could expect to live about 65 years. The coefficient for TotExp is very small; however, this likely reflects the scale of the expenditures. The p-values for the intercept and coefficient (shown as 0 after rounding) indicate that both estimates are statistically significant. The R-Squared and Adj R-Squared of around .25 indicate that TotExp explains approximately 25% of the variance in LifeExp - additional variables may improve the fit. The residual standard error (sigma) of 9.37 means the fitted values are typically off by about 9 years. The F-statistic of 65.26 and its p-value of zero mean that we can reject the hypothesis that the model is no better than the zero beta model.
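As a quick consistency check on the Mod1 output, the R sketch below recomputes the F-statistic two ways and recovers the p-value that the table prints as 0 after rounding. The inputs are copied from the Mod1 tables; the variable names are illustrative only and are not taken from the original analysis.

```r
# Values copied from the Mod1 tables above
t_slope     <- 8.078626   # t-statistic for the TotExp coefficient
r_squared   <- 0.2576922
df_residual <- 188

# In simple linear regression the overall F equals the squared slope t-statistic
f_from_t <- t_slope^2

# F can also be recovered from R-squared:
# F = (R2 / 1) / ((1 - R2) / df_residual)
f_from_r2 <- r_squared / (1 - r_squared) * df_residual

# Upper-tail p-value of the F-test with 1 numerator degree of freedom
p_value <- pf(f_from_t, df1 = 1, df2 = df_residual, lower.tail = FALSE)

print(c(f_from_t = f_from_t, f_from_r2 = f_from_r2))
print(p_value)
```

Both routes should land on the reported statistic of 65.26, and the p-value is tiny rather than exactly zero.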

  2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression mod1 using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which mod1 is “better?”

Scatter Plot

Model - Mod2: lm(LifeExpTrans2 ~ TotExpTrans2, whodat2)

term          estimate    std.error  statistic  p.value
(Intercept)   -736527909  46817945   -15.73174  0
TotExpTrans2   620060216  27518940    22.53213  0

Model Evaluation

r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance      df.residual
0.7297673  0.7283299      90492393  507.6967   0        2   -3749.541  7505.081  7514.822  1.539508e+18  188

The scatter plot for Mod2 shows a linear relationship, and right away we see a big jump in the R-squared and Adj. R-Squared compared to the previous model - roughly 26% vs 73%. Similar to Mod1, both the intercept and coefficient are statistically significant. We have also seen a substantial increase in the F-statistic (507 vs 65). The residual standard error of 90,492,393 is on the scale of LifeExp^4.6, so it cannot be compared directly with Mod1's 9.37 years. By these measures Mod2 is the "better" model. The interesting thing about this model is that while the transforms improved the linear relationship and model fit, they also make the model more difficult to interpret.
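The transformation step can be sketched as follows. This is only an illustration: it uses the first five LifeExp and TotExp values shown in the `str()` output at the top of the report, not the full 190-row data set, and the way the transformed columns are created is an assumption (only the column names `LifeExpTrans2` and `TotExpTrans2` appear in the original model call).

```r
# First five rows as shown in the str() output above (illustration only)
sample_rows <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41),
  TotExp  = c(112, 3297, 5292, 172314, 1656)
)

# Apply the power transforms from the question
sample_rows$LifeExpTrans2 <- sample_rows$LifeExp^4.6
sample_rows$TotExpTrans2  <- sample_rows$TotExp^0.06

# Fit on the transformed scale; with only five rows the estimates will
# differ from the Mod2 table, which used all 190 observations
fit_sketch <- lm(LifeExpTrans2 ~ TotExpTrans2, data = sample_rows)
print(coef(fit_sketch))
```

Because the response is now LifeExp^4.6, every quantity reported by the fit (coefficients, sigma, deviance) is on that scale and has to be back-transformed before it can be read in years.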

  3. Using the transformed model (Mod2), forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
## [1] 63.31153
## [1] 86.50645
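The two forecasts can be reproduced from the Mod2 coefficients alone: predict on the LifeExp^4.6 scale, then raise the result to the power 1/4.6 to get back to years. The sketch below assumes the rounded coefficients printed in the Mod2 table, and `forecast_life_exp` is a hypothetical helper name, so the results may differ from the output above in the last decimal places.

```r
# Coefficients copied from the Mod2 table above
b0 <- -736527909
b1 <- 620060216

# Hypothetical helper: predict LifeExp^4.6, then invert the power transform
forecast_life_exp <- function(tot_exp_trans) {
  (b0 + b1 * tot_exp_trans)^(1 / 4.6)
}

print(forecast_life_exp(c(1.5, 2.5)))
```

With the fitted model object available, the same result would come from calling `predict()` on a new data frame and applying `^(1 / 4.6)` to the prediction.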
  4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
term           estimate       std.error    statistic  p.value
(Intercept)      62.7727033     0.7956052  78.899309  0.00e+00
PropMD         1497.4939525   278.8168797   5.370887  2.00e-07
TotExp            0.0000723     0.0000090   8.053199  0.00e+00
PropMD:TotExp    -0.0060257     0.0014724  -4.092543  6.35e-05

Model Evaluation

r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual
0.3574352  0.3470713      8.765493  34.48833   0        4   -680.0333  1370.067  1386.302  14291.1   186

The R-Squared and Adj R-Squared of .357 and .347 indicate that the model is only capturing about 35% to 36% of the variance. This could be an indication that we are missing some important explanatory variables. The F Statistic of 34.49 and a p-value of 0 mean we can reject the null hypothesis that our model is no better than the zero beta model. Additionally, the p-values of the coefficients are all small and close to zero, thus indicating the coefficients are statistically significant. This is also consistent with our F-Statistic. The sigma, or residual standard error, of 8.765 means the model's predictions are typically off by about 8.8 years; against a lifespan of 80 years that is an error of roughly 11% - not bad.
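The figures quoted in this interpretation can be cross-checked from the model evaluation row. The sketch below assumes the values exactly as printed above and uses illustrative variable names; the numerator degrees of freedom are taken as 3 because the model has three predictor terms (PropMD, TotExp, and their interaction).

```r
# Values copied from the multiple regression output above
f_stat      <- 34.48833
rss         <- 14291.1   # the "deviance" column, i.e. residual sum of squares
df_model    <- 3         # PropMD, TotExp, and PropMD:TotExp
df_residual <- 186

# Residual standard error is sqrt(RSS / residual degrees of freedom)
sigma_check <- sqrt(rss / df_residual)

# Upper-tail p-value for the overall F-test (printed as 0 after rounding)
p_value <- pf(f_stat, df1 = df_model, df2 = df_residual, lower.tail = FALSE)

# Typical error relative to an 80-year lifespan
relative_error <- sigma_check / 80

print(c(sigma = sigma_check, relative_error = relative_error))
print(p_value)
```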

  5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
## [1] 107.6953

While this result is possible, it does not seem realistic: the forecast of 107.7 years is well above the maximum observed life expectancy of 83. The model would seem to indicate that if we reduce spending but increase doctors, we can increase life expectancy. I suspect that there is a correlation between our two explanatory variables and this may be undermining our results.
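To see where the 107.7 comes from, the forecast can be broken into its four terms using the coefficients from the multiple regression table. This is a sketch based on the rounded coefficients as printed, so the total may differ from the output above in the third decimal place; the variable names are illustrative.

```r
# Coefficients copied from the multiple regression table above
b0    <- 62.7727033
b_md  <- 1497.4939525
b_exp <- 0.0000723
b_int <- -0.0060257

# Inputs from the question
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contributions <- c(
  intercept   = b0,
  PropMD      = b_md * prop_md,
  TotExp      = b_exp * tot_exp,
  interaction = b_int * prop_md * tot_exp
)

forecast <- sum(contributions)
print(round(contributions, 4))
print(forecast)
```

Nearly all of the increase over the intercept comes from the PropMD term. A PropMD of .03 sits close to the observed maximum of 0.0351 while a TotExp of 14 sits close to the observed minimum of 13, so the forecast is an extrapolation to a combination at the very edge of the data.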