library(dplyr)

0. Load the World Health Organization (WHO) life expectancy dataset and examine the first 20 rows.

df = read.csv("who.csv", header=TRUE)
#df$Country, df$LifeExp, df$PropMD, df$TotExp
head(df,20)
##                Country LifeExp InfantSurvival Under5Survival  TBFree
## 1          Afghanistan      42          0.835          0.743 0.99769
## 2              Albania      71          0.985          0.983 0.99974
## 3              Algeria      71          0.967          0.962 0.99944
## 4              Andorra      82          0.997          0.996 0.99983
## 5               Angola      41          0.846          0.740 0.99656
## 6  Antigua and Barbuda      73          0.990          0.989 0.99991
## 7            Argentina      75          0.986          0.983 0.99952
## 8              Armenia      69          0.979          0.976 0.99920
## 9            Australia      82          0.995          0.994 0.99993
## 10             Austria      80          0.996          0.996 0.99990
## 11          Azerbaijan      64          0.927          0.911 0.99913
## 12             Bahamas      74          0.987          0.986 0.99960
## 13             Bahrain      75          0.991          0.990 0.99955
## 14          Bangladesh      63          0.948          0.931 0.99609
## 15            Barbados      75          0.989          0.988 0.99989
## 16             Belarus      69          0.994          0.992 0.99929
## 17             Belgium      79          0.996          0.995 0.99989
## 18              Belize      69          0.986          0.984 0.99944
## 19               Benin      55          0.912          0.852 0.99865
## 20              Bhutan      64          0.937          0.930 0.99904
##         PropMD      PropRN PersExp GovtExp TotExp
## 1  0.000228841 0.000572294      20      92    112
## 2  0.001143127 0.004614439     169    3128   3297
## 3  0.001060478 0.002091362     108    5184   5292
## 4  0.003297297 0.003500000    2589  169725 172314
## 5  0.000070400 0.001146162      36    1620   1656
## 6  0.000142857 0.002773810     503   12543  13046
## 7  0.002780191 0.000741044     484   19170  19654
## 8  0.003698671 0.004918937      88    1856   1944
## 9  0.002331953 0.009149391    3181  187616 190797
## 10 0.003610904 0.006458749    3788  189354 193142
## 11 0.003660005 0.008477873      62     780    842
## 12 0.000954128 0.004045872    1224   55783  57007
## 13 0.002679296 0.005967524     710   45784  46494
## 14 0.000274894 0.000253034      12      75     87
## 15 0.001098976 0.003372014     725   24433  25158
## 16 0.004758674 0.012457093     204   11315  11519
## 17 0.004230489 0.014079195    3451  239105 242556
## 18 0.000890071 0.001074468     198    5376   5574
## 19 0.000035500 0.000660845      28     600    628
## 20 0.000080100 0.001124807      52     407    459


1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

attach(df)
cor(LifeExp, TotExp)     #check for correlation between the 2 variables
## [1] 0.5076339
plot(TotExp,  LifeExp, main='scatterplot', ylab='Life Expectancy', xlab = 'Total Expenditure', col=2)
abline(lm(LifeExp~TotExp), col=1)

exp.lm = lm(LifeExp~TotExp)
exp.lm
## 
## Call:
## lm(formula = LifeExp ~ TotExp)
## 
## Coefficients:
## (Intercept)       TotExp  
##   6.475e+01    6.297e-05
summary(exp.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Linear Regression Model -
Life Expectancy = 64.75 + 0.000063 * Total Expenditure

The intercept of the model is 64.75, so it predicts a life expectancy of about 65 years for a country that spends nothing on healthcare, and it cannot predict a life expectancy below ~65 for any non-negative Total Expenditure. Solving the equation for a country whose life expectancy is below ~65 would give a negative Total Expenditure, so the model is not realistic from the outset. The model reflects how little the low life expectancy countries spend on healthcare compared with the Total Expenditure of the high life expectancy countries.

Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537 - The low R-squared value tells us that the model explains only about 26% of the variance in life expectancy around its mean. Total Expenditure alone leaves roughly three quarters of the variation unexplained.

F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14 - The p-value of the model is far below 0.05, which means we can confidently reject the null hypothesis (that Total Expenditure DOES NOT contribute to a country’s life expectancy). With a single predictor, this F-test is equivalent to the t-test on TotExp (8.079^2 is approximately 65.26). The variable is a statistically significant predictor, but given the low R-squared it is only a minor contributor.

Residual standard error: 9.371 on 188 degrees of freedom - The residual SE is high: the typical prediction error is about 9.4 years of life expectancy. Some sample data points are far from the fitted line, with residuals running from -24.8 to 13.3. The positive median residual (3.15) shows that more than half of the countries sustain a life expectancy higher than the model predicts for their spending, while a smaller group falls far below the line. Residuals this asymmetric suggest that the normality assumption, and probably the linearity assumption, are not met.
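
The assumption check can be sketched with two standard diagnostic plots. The example below is only an illustration: it refits the model on the 20 rows printed in section 0 (typed in by hand so the block runs on its own), so its numbers will differ from the full 190-country fit. The names `who20` and `fit20` are hypothetical.

```r
# First 20 rows of the WHO data printed in section 0 (NOT the full dataset).
who20 <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73, 75, 69, 82, 80,
              64, 74, 75, 63, 75, 69, 79, 69, 55, 64),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190797, 193142,
              842, 57007, 46494, 87, 25158, 11519, 242556, 5574, 628, 459)
)

# Same untransformed model as exp.lm, fitted on the 20-row subset.
fit20 <- lm(LifeExp ~ TotExp, data = who20)
res   <- resid(fit20)

op <- par(mfrow = c(1, 2))
# Residuals vs fitted: a curve or funnel shape points to non-linearity
# or non-constant variance.
plot(fitted(fit20), res, xlab = 'Fitted', ylab = 'Residual', main = 'Residuals vs fitted')
abline(h = 0, lty = 2)
# Normal Q-Q plot: points far from the line point to non-normal residuals.
qqnorm(res)
qqline(res)
par(op)
```

On the full data, the same two plots can be produced from `exp.lm` by replacing `fit20` with `exp.lm`.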


2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

TotExp2 = TotExp^0.06
LifeExp2 = LifeExp^4.6
cor(LifeExp2,TotExp2)     #check for correlation between the 2 variables
## [1] 0.8542642
plot(TotExp2, LifeExp2, main='scatterplot', ylab='Life Expectancy^4.6', xlab = 'Total Expenditure^0.06', col=2)
abline(lm(LifeExp2~TotExp2), col=1)

exp2.lm = lm(LifeExp2~TotExp2)
exp2.lm
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
## 
## Coefficients:
## (Intercept)      TotExp2  
##  -736527909    620060216
summary(exp2.lm)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Linear Regression Model -
Life Expectancy^4.6 = -736527909 + 620060216 * Total Expenditure^0.06

Comparing the regression line of this transformed model against the previous one, the transformed model is the better model, since the data points are clustered more closely around its regression line.

Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283 - An R-squared value of close to 73% is much better than the ~26% R-squared value for the first model. This means that the transformed predictor (Total Expenditure^0.06) explains about 73% of the variance of the transformed response (Life Expectancy^4.6) around its mean.

F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16 - The p-value of the model is far below 0.05, which means we can confidently reject the null hypothesis (that Total Expenditure^0.06 DOES NOT contribute to a country’s Life Expectancy^4.6). The much larger F-statistic (507.7 versus 65.26) indicates that the variable contributes to this model far more strongly than in the original model.

Residual standard error: 90,490,000 on 188 degrees of freedom - The residual SE looks surprisingly high, but it is measured in units of Life Expectancy^4.6, not in years, so it cannot be compared directly with the 9.371 of the first model. Raising life expectancy to the 4.6 power inflates the response to the order of 10^8 (70^4.6 is about 3 x 10^8), and the residuals are inflated along with it. The large value therefore does not contradict the R-squared and F-statistic findings.
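
The scale dependence of the residual SE can be shown with a small sketch. It is only an illustration, fitted on the 20 rows printed in section 0 rather than on the full dataset; the names `who20`, `fit_raw`, and `fit_scaled` are hypothetical. Dividing the response by one million divides the residual SE by one million and leaves R-squared untouched.

```r
# First 20 rows of the WHO data printed in section 0 (NOT the full dataset).
who20 <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73, 75, 69, 82, 80,
              64, 74, 75, 63, 75, 69, 79, 69, 55, 64),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190797, 193142,
              842, 57007, 46494, 87, 25158, 11519, 242556, 5574, 628, 459)
)

y <- who20$LifeExp^4.6   # transformed response, order of 10^8
x <- who20$TotExp^0.06   # transformed predictor

fit_raw    <- lm(y ~ x)            # response in original transformed units
fit_scaled <- lm(I(y / 1e6) ~ x)   # same response expressed in millions

# The residual SE changes with the units of the response ...
sigma_raw    <- summary(fit_raw)$sigma
sigma_scaled <- summary(fit_scaled)$sigma
c(sigma_raw = sigma_raw, sigma_scaled = sigma_scaled)

# ... while R-squared is unit-free and stays the same.
r2_raw    <- summary(fit_raw)$r.squared
r2_scaled <- summary(fit_scaled)$r.squared
c(r2_raw = r2_raw, r2_scaled = r2_scaled)
```

This is why R-squared, not the residual SE, is the fair way to compare the transformed model with the untransformed one.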


3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

Linear Regression Model -
Life Expectancy^4.6 = -736527909 + 620060216 * Total Expenditure^0.06

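# Predicted LifeExp^4.6 when TotExp^0.06 = 1.5
LifeExp46 = -736527909 +  620060216 * (1.5)
# Back-transform to years: y^(1/4.6), written here as exp(log(y)/4.6)
LifeExp15 = exp(log(LifeExp46)/4.6)
LifeExp15
## [1] 63.31153
# Same forecast when TotExp^0.06 = 2.5
LifeExp46 = -736527909 +  620060216 * (2.5)
LifeExp25 = exp(log(LifeExp46)/4.6)
LifeExp25
## [1] 86.50645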
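
The two forecasts above follow the same pattern, so they can be wrapped in one small helper. This is a sketch rather than part of the original analysis: `forecast_life_exp` is a hypothetical name, and its default coefficients are the rounded values printed by `summary(exp2.lm)`.

```r
# Forecast life expectancy (in years) from a value of TotExp^0.06.
# b0 and b1 default to the coefficients printed by summary(exp2.lm).
forecast_life_exp <- function(x, b0 = -736527909, b1 = 620060216, power = 4.6) {
  y_pow <- b0 + b1 * x        # predicted LifeExp^4.6
  # The back-transform is only defined for positive predictions.
  stopifnot(all(y_pow > 0))
  y_pow^(1 / power)           # undo the 4.6 power to get years
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
round(forecasts, 2)
## [1] 63.31 86.51
```

The positivity check matters because the fitted line crosses zero near TotExp^0.06 = 1.19, below which the back-transform has no real value.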

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

#plot(TotExp,  LifeExp, main='scatterplot', ylab='Life Expentancy', xlab = 'Total Expenditure', col=2)
#abline(lm(LifeExp~TotExp), col=1)
expMUL.lm = lm(LifeExp~TotExp + PropMD + PropMD * TotExp)
expMUL.lm
## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp)
## 
## Coefficients:
##   (Intercept)         TotExp         PropMD  TotExp:PropMD  
##     6.277e+01      7.233e-05      1.497e+03     -6.026e-03
summary(expMUL.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Life Expectancy MR = 62.8 + 0.000072 * Total Expenditure + 1,497 * PropMD - 0.006 * Total Expenditure * PropMD

Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471 - With an adjusted R-squared value of only ~35%, this is not a good model. The predictors in this model (TotExp, PropMD, and their interaction) account for only ~35% of the variance in life expectancy.

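F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16 - The p-value of the F-test is close to zero, which means we can reject the null hypothesis that all three slope coefficients are zero and state with confidence that the predictors, taken together, do contribute to explaining the dependent variable. Each individual coefficient is also significant, with p-values of 9.39e-14 (TotExp), 2.32e-07 (PropMD), and 6.35e-05 (TotExp:PropMD).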

Residual standard error: 8.765 on 186 degrees of freedom - The residual SE is large at 8.765, which means that data points are typically off by about 8.8 years from what the model would have predicted. This is only a small improvement on the 9.371 of the simple model. By this measure, I would have to say the model is not a good fit to its data points.
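
The negative interaction coefficient means the effect of PropMD on life expectancy depends on the level of TotExp. A minimal sketch of that reading, using the coefficients printed by `summary(expMUL.lm)`, follows; `slope_PropMD` and `break_even` are hypothetical names introduced for illustration.

```r
# Coefficients as printed by summary(expMUL.lm).
b_PropMD <- 1.497e+03    # PropMD main effect
b_inter  <- -6.026e-03   # TotExp:PropMD interaction

# Marginal effect of PropMD at a given TotExp:
# d(LifeExp)/d(PropMD) = b_PropMD + b_inter * TotExp
slope_PropMD <- function(TotExp) b_PropMD + b_inter * TotExp

# The slope shrinks as spending rises.
slope_PropMD(c(0, 1000, 100000, 250000))

# TotExp at which the fitted effect of PropMD changes sign.
break_even <- -b_PropMD / b_inter
break_even
```

Under this fitted model, additional physicians are associated with higher life expectancy only while TotExp stays below the break-even value, which is another sign that the model should not be pushed far outside the observed data.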


5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# The interaction coefficient is negative (-6.026e-03), so its term is subtracted
LifeExpMR = 62.8 + .000072 * 14 + 1497 * 0.03 - .006 * 14 * 0.03
LifeExpMR
## [1] 107.7085

The forecast is not realistic. It says that if we increase the proportion of doctors in the population and drastically reduce spending, we can raise life expectancy from the low 80s (the high life expectancy countries) to about 107.7 years. Both inputs lie outside the data: a PropMD of 0.03 means 3% of the population are doctors, about six times the largest value in the rows shown above (0.0048 for Belarus), and a Total Expenditure of 14 is far below even the lowest value shown (87 for Bangladesh). The proportion of doctors is also not independent of Total Expenditure on healthcare. It takes a lot of money to train good doctors, and good doctors expect to be well compensated. Thus, it is not realistic to have a drastic increase in doctors in a population and at the same time a drastic decrease in healthcare spending. The US, for example, spends more on healthcare per capita than any other country, at around $7,000 per capita. To reduce this to $14 per capita and expect a surge in medical doctors would be absurd.
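
The extrapolation argument can be checked with a short sketch. It is an illustration under stated assumptions: the ranges come only from the 20 rows printed in section 0, not the full dataset, and the names `PropMD20`, `TotExp20`, `inside_range`, and `forecast` are hypothetical. The forecast uses the coefficients as printed by `summary(expMUL.lm)`, so it differs slightly from the value obtained with the more heavily rounded coefficients.

```r
# PropMD and TotExp from the 20 rows printed in section 0 (NOT the full dataset).
PropMD20 <- c(0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.000070400,
              0.000142857, 0.002780191, 0.003698671, 0.002331953, 0.003610904,
              0.003660005, 0.000954128, 0.002679296, 0.000274894, 0.001098976,
              0.004758674, 0.004230489, 0.000890071, 0.000035500, 0.000080100)
TotExp20 <- c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190797, 193142,
              842, 57007, 46494, 87, 25158, 11519, 242556, 5574, 628, 459)

new_PropMD <- 0.03
new_TotExp <- 14

# Is each forecast input inside the range seen in the displayed rows?
inside_range <- c(
  PropMD = new_PropMD >= min(PropMD20) && new_PropMD <= max(PropMD20),
  TotExp = new_TotExp >= min(TotExp20) && new_TotExp <= max(TotExp20)
)
inside_range

# Forecast with the coefficients printed by summary(expMUL.lm);
# the interaction coefficient is negative.
b <- c(62.77, 7.233e-05, 1497, -6.026e-03)
forecast <- b[1] + b[2] * new_TotExp + b[3] * new_PropMD + b[4] * new_TotExp * new_PropMD
forecast
```

Both checks come back `FALSE` for the displayed rows, which is consistent with treating the forecast of roughly 107.7 years as an extrapolation rather than a prediction the data can support.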