The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
data<-read.csv("C:\\Users\\jkks9\\Documents\\DATA 605\\who.csv")
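# TotExp goes on the x-axis and LifeExp on the y-axis, matching LifeExp ~ TotExp
plot(data$TotExp, data$LifeExp, xlab = "TotExp", ylab = "LifeExp")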

lm<-lm(data$LifeExp ~ data$TotExp)
summary(lm)
## 
## Call:
## lm(formula = data$LifeExp ~ data$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## data$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

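The F-statistic (65.26 on 1 and 188 degrees of freedom) has a p-value of 7.714e-14, far below .05, so the relationship between LifeExp and TotExp is statistically significant. The R-squared is low, however: the model explains only about 26% of the variance in life expectancy. The residual standard error of 9.371 means predictions are typically off by roughly 9 years. The residuals are also skewed (median 3.154, minimum -24.764), which suggests the assumptions of simple linear regression are not fully met. On that basis, this is a very poor model.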
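
As a quick consistency check on the summary output above, the reported values are tied together: in an ordinary least-squares fit, R^2 = F*df1 / (F*df1 + df2), and the overall p-value is the upper tail of the F distribution. The sketch below uses only the numbers printed by summary(); the helper name r2_from_f is made up for illustration.

```r
# Values copied from the summary() output above
f_stat <- 65.26
df1 <- 1
df2 <- 188

# Hypothetical helper: recover R-squared from the F statistic
r2_from_f <- function(f, df1, df2) (f * df1) / (f * df1 + df2)

r2 <- r2_from_f(f_stat, df1, df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

round(r2, 4)   # close to the reported 0.2577
p_value        # far below .05, in line with the reported 7.714e-14
```

With a single predictor, the F-statistic is also the square of the slope's t value (8.079^2 is about 65.27), which is why the two p-values in the output agree.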

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is "better?"
lm2<-lm((data$LifeExp^4.6)~I(data$TotExp^.06))
summary(lm2)
## 
## Call:
## lm(formula = (data$LifeExp^4.6) ~ I(data$TotExp^0.06))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -736527910   46817945  -15.73   <2e-16 ***
## I(data$TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

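The p-value is even lower (< 2.2e-16), so the model is still statistically significant, and the F-statistic rises from 65.26 to 507.7. R-squared rises from 0.2577 to 0.7298, about 47 percentage points higher, so the model explains about 73% of the variance in the transformed response. The residual standard error (90490000) is on the scale of LifeExp^4.6 and cannot be compared directly with the 9.371 from the first model. Based on R-squared comparisons alone, the second model is better.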
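
Because the second model's residual standard error is in units of LifeExp^4.6, it helps to translate it back to years. The sketch below uses a first-order (delta-method) approximation at a reference life expectancy of 70 years; the reference value and the helper name se_in_years are assumptions chosen for illustration, and the result is only a rough guide.

```r
# Residual standard error of lm2, on the LifeExp^4.6 scale (from the output above)
se_transformed <- 90490000
power <- 4.6

# Hypothetical helper: divide by the slope of L^power at the reference point,
# d/dL (L^power) = power * L^(power - 1)
se_in_years <- function(se, ref_life_exp, power) {
  se / (power * ref_life_exp^(power - 1))
}

approx_se <- se_in_years(se_transformed, ref_life_exp = 70, power = power)
approx_se  # roughly 4.5 years, versus 9.371 years for the untransformed model
```

The approximation changes with the reference point, so a different reference life expectancy would give a different figure.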

  1. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
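LifeExp<-function(forecast)
{   # Predicted LifeExp^4.6 from the lm2 coefficients; forecast is TotExp^.06
    y <- -736527910 + 620060216 * (forecast)
    # Undo the 4.6 power to get life expectancy in years
    y <- y^(1/4.6)
    print(y)
}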
LifeExp(1.5)
## [1] 63.31153
LifeExp(2.5)
## [1] 86.50645
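
The same forecast can be written as a vectorized helper, which also makes it easy to see what raw spending each input implies. This is a sketch using the rounded lm2 coefficients printed above; the names forecast_life_exp and implied_tot_exp are hypothetical.

```r
# Coefficients copied from summary(lm2)
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: predict LifeExp^4.6, then undo the 4.6 power
forecast_life_exp <- function(tot_exp_06) (b0 + b1 * tot_exp_06)^(1 / 4.6)

x <- c(1.5, 2.5)
forecasts <- forecast_life_exp(x)
forecasts  # about 63.31 and 86.51 years

# Raw TotExp implied by each input, since x = TotExp^.06
implied_tot_exp <- x^(1 / 0.06)
implied_tot_exp  # about 861 and 4.3 million
```
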
  1. Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
lm3<-lm(data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD*data$TotExp)
summary(lm3)
## 
## Call:
## lm(formula = data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD * 
##     data$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.277e+01  7.956e-01  78.899  < 2e-16 ***
## data$PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## data$TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## data$PropMD:data$TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

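The overall p-value (< 2.2e-16) is close to the second model's, and every coefficient, including the interaction term, is significant. The R-squared of 0.3574 is better than the first model (0.2577) but about 37 percentage points below the second model (0.7298), so this model explains only about 36% of the variance in life expectancy. The residual standard error of 8.765 is only slightly lower than the first model's 9.371. This model therefore falls between the second model, the best so far, and the first model, the worst of the three.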
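
Since the three models use different numbers of predictors, adjusted R-squared is the fairer comparison. As a sketch, it can be recomputed from the reported R-squared values, assuming n = 190 observations (implied by 188 residual degrees of freedom with two coefficients); the helper name adj_r2 is hypothetical.

```r
n <- 190  # 188 residual df + 2 coefficients in the simple models

# Hypothetical helper: adjusted R-squared for a model with p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

models <- data.frame(
  model = c("lm", "lm2", "lm3"),
  r2    = c(0.2577, 0.7298, 0.3574),  # from the summary() outputs
  p     = c(1, 1, 3)                  # predictors, excluding the intercept
)
models$adj_r2 <- adj_r2(models$r2, n, models$p)
models  # adjusted values agree with the reported ones to about three decimals
```

The ranking is unchanged after the adjustment, although lm2 is fitted to LifeExp^4.6 rather than LifeExp, so its value describes a different response.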

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
LifeExp2<-((6.277*10^1) + (1.497*10^3)*.03 + (7.233*10^(-5))*14 - ((6.026*10^(-3))*0.03*14))
LifeExp2
## [1] 107.6785

The forecasted life expectancy does not appear to be realistic because it is far too high. I base that on looking up current life expectancy: the CDC National Center for Health Statistics puts US life expectancy at around 80 years, which is almost 28 years lower than the forecasted value of 107.68.
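
To see where the 107.68 comes from, the forecast can be split into the contribution of each term. This is a minimal sketch using the rounded coefficients from summary(lm3); the variable names are chosen for illustration.

```r
# Rounded coefficients copied from summary(lm3)
coefs <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
           TotExp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Design row matching the coefficient order
x <- c(1, prop_md, tot_exp, prop_md * tot_exp)

contributions <- coefs * x
contributions       # the PropMD term alone adds about 44.9 years
sum(contributions)  # about 107.68, matching the forecast above
```

Almost all of the excess comes from the PropMD term: at PropMD = .03 the model adds about 45 years to the intercept of 62.77, while the TotExp and interaction terms contribute almost nothing.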