library(dplyr)
df = read.csv("who.csv", header=TRUE)
# Columns used below: df$Country, df$LifeExp, df$PropMD, df$TotExp
head(df,20)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## 7 Argentina 75 0.986 0.983 0.99952
## 8 Armenia 69 0.979 0.976 0.99920
## 9 Australia 82 0.995 0.994 0.99993
## 10 Austria 80 0.996 0.996 0.99990
## 11 Azerbaijan 64 0.927 0.911 0.99913
## 12 Bahamas 74 0.987 0.986 0.99960
## 13 Bahrain 75 0.991 0.990 0.99955
## 14 Bangladesh 63 0.948 0.931 0.99609
## 15 Barbados 75 0.989 0.988 0.99989
## 16 Belarus 69 0.994 0.992 0.99929
## 17 Belgium 79 0.996 0.995 0.99989
## 18 Belize 69 0.986 0.984 0.99944
## 19 Benin 55 0.912 0.852 0.99865
## 20 Bhutan 64 0.937 0.930 0.99904
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
## 7 0.002780191 0.000741044 484 19170 19654
## 8 0.003698671 0.004918937 88 1856 1944
## 9 0.002331953 0.009149391 3181 187616 190797
## 10 0.003610904 0.006458749 3788 189354 193142
## 11 0.003660005 0.008477873 62 780 842
## 12 0.000954128 0.004045872 1224 55783 57007
## 13 0.002679296 0.005967524 710 45784 46494
## 14 0.000274894 0.000253034 12 75 87
## 15 0.001098976 0.003372014 725 24433 25158
## 16 0.004758674 0.012457093 204 11315 11519
## 17 0.004230489 0.014079195 3451 239105 242556
## 18 0.000890071 0.001074468 198 5376 5574
## 19 0.000035500 0.000660845 28 600 628
## 20 0.000080100 0.001124807 52 407 459
attach(df)
cor(LifeExp, TotExp) #check for correlation between the 2 variables
## [1] 0.5076339
plot(TotExp, LifeExp, main='scatterplot', ylab='Life Expectancy', xlab = 'Total Expenditure', col=2)
abline(lm(LifeExp~TotExp), col=1)
exp.lm = lm(LifeExp~TotExp)
exp.lm
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
summary(exp.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Linear Regression Model -
Life Expectancy = 64.75 + 0.000063 * Total Expenditure
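The y-intercept of the model is 64.75, so it predicts a life expectancy of about 65 years for a country that spends nothing on healthcare, and it cannot predict anything lower for any non-negative expenditure. Read in reverse, a life expectancy below ~65 would require a negative Total Expenditure. The model is therefore not realistic from the outset, since countries such as Afghanistan (42) and Angola (41) fall well below 65. The nearly flat slope reflects how little the low-life-expectancy countries spend on healthcare compared with the Total Expenditure of high-life-expectancy countries.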
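To make this concrete, the fitted line can be inverted to ask what expenditure it associates with a given life expectancy. This is only a minimal sketch using the rounded coefficients from the output above; the helper `implied_totexp` is a hypothetical name introduced for illustration.

```r
# Rounded coefficients copied from the lm() output above
intercept <- 64.75
slope     <- 6.297e-05

# Hypothetical helper: solve LifeExp = intercept + slope * TotExp for TotExp
implied_totexp <- function(life_exp) (life_exp - intercept) / slope

# Afghanistan (42), the intercept itself, and Australia (82)
implied_totexp(c(42, 64.75, 82))
```

Any life expectancy below the intercept maps to a negative expenditure, which is the unrealistic behavior described above.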
Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537 - The low R-squared value tells us that Total Expenditure explains only about 26% of the variance in Life Expectancy around its mean.
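In a regression with a single predictor, R-squared is the square of the correlation coefficient, which gives a quick consistency check. A small sketch using the correlation printed earlier:

```r
r <- 0.5076339        # cor(LifeExp, TotExp) from the output above
r_squared <- r^2      # R-squared of a one-predictor linear model
round(r_squared, 4)
```

The result should agree with the Multiple R-squared of 0.2577 reported by `summary(exp.lm)`.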
F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14 - the p-value of the model is very low, which means we can confidently reject the null hypothesis (that Total Expenditure DOES NOT contribute to a country’s Life Expectancy). We can say that the variable does contribute to the model, although, given the low R-squared, only as a minor contributor.
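With only one predictor, the overall F-test and the t-test on the slope are the same test: F equals t squared, and the p-value is the upper tail of the F distribution. The sketch below checks this against the printed values, using `pf()` from base R.

```r
t_slope <- 8.079   # t value for TotExp from the summary above
f_stat  <- 65.26   # F-statistic from the summary above

# Close to f_stat; small differences come from rounding of the printed t value
t_slope^2

# Upper-tail probability of an F(1, 188) distribution
p_value <- pf(f_stat, df1 = 1, df2 = 188, lower.tail = FALSE)
p_value
```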
Residual standard error: 9.371 on 188 degrees of freedom - The residual standard error is fairly high: a typical observation lies about 9.4 years from the fitted line. Some of the data points are far off the line, with residuals ranging from -24.8 to 13.3 years. Countries with large positive residuals sustain a life expectancy noticeably higher than their healthcare expenditure would predict, while those with large negative residuals fall well short of the prediction.
TotExp2 = TotExp^0.06
LifeExp2 = LifeExp^4.6
cor(LifeExp2,TotExp2) #check for correlation between the 2 variables
## [1] 0.8542642
plot(TotExp2, LifeExp2, main='scatterplot', ylab='Life Expectancy^4.6', xlab = 'Total Expenditure^0.06', col=2)
abline(lm(LifeExp2~TotExp2), col=1)
exp2.lm = lm(LifeExp2~TotExp2)
exp2.lm
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
##
## Coefficients:
## (Intercept) TotExp2
## -736527909 620060216
summary(exp2.lm)
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp2 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Linear Regression Model -
Life Expectancy^4.6 = -736527909 + 620060216 * Total Expenditure^0.06
Comparing the regression line of this transformed model with that of the previous model, I can say that the transformed model is the better one, since the data points are more closely clustered around its regression line.
Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283 - An R-squared value of close to 73% is much better than the ~26% R-squared value for the first model. This means that the predictor (Total Expenditure^0.06) explains about 73% of the variance in the response variable (Life Expectancy^4.6) around its mean.
F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16 - the p-value of the model is very low, which means we can confidently reject the null hypothesis (that Total Expenditure^0.06 DOES NOT contribute to a country’s Life Expectancy^4.6). We can say that the variable does contribute to the model, and to a greater degree than in the original model.
Residual standard error: 90,490,000 on 188 degrees of freedom - The residual SE looks surprisingly high, but it is expressed in units of Life Expectancy^4.6, so it cannot be compared directly with the 9.371 of the first model. It does not contradict the R-squared and F-statistic findings: because Life Expectancy was raised to the power 4.6, the response values are in the hundreds of millions, and the residuals scale up with them.
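One rough way to get a feel for the size of this residual standard error is to back-transform it around a chosen life expectancy. The sketch below assumes a reference value of 70 years, picked only for illustration; it is not part of the original analysis.

```r
rse_transformed <- 90490000   # residual standard error of exp2.lm, in years^4.6
ref_years <- 70               # assumed reference life expectancy (illustrative)

ref_transformed <- ref_years^4.6

# Move one residual standard error up and down on the transformed scale,
# then undo the power transform to get back to years
upper_years <- (ref_transformed + rse_transformed)^(1 / 4.6)
lower_years <- (ref_transformed - rse_transformed)^(1 / 4.6)

c(lower = lower_years, reference = ref_years, upper = upper_years)
```

Near 70 years the band comes out at a few years on either side. It is asymmetric and changes with the reference value, so it is only an informal guide and not a substitute for a proper prediction interval.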
Linear Regression Model -
Life Expectancy^4.6 = -736527909 + 620060216 * Total Expenditure^0.06
LifeExp46 = -736527909 + 620060216 * (1.5)
LifeExp15 = exp(log(LifeExp46)/4.6)
LifeExp15
## [1] 63.31153
LifeExp46 = -736527909 + 620060216 * (2.5)
LifeExp25 = exp(log(LifeExp46)/4.6)
LifeExp25
## [1] 86.50645
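The two forecasts above repeat the same steps, so they can be wrapped in one function. This is a sketch using the coefficients printed for `exp2.lm`; the helper `forecast_life_exp` is a hypothetical name. It also shows what raw Total Expenditure each transformed input corresponds to.

```r
b0 <- -736527909   # intercept of exp2.lm
b1 <-  620060216   # slope on TotExp^0.06

# Hypothetical helper: predict LifeExp^4.6, then undo the power transform.
# For positive x, x^(1/4.6) is the same as exp(log(x)/4.6).
forecast_life_exp <- function(totexp_006) (b0 + b1 * totexp_006)^(1 / 4.6)

forecast_life_exp(c(1.5, 2.5))

# Raw Total Expenditure implied by each transformed input
c(1.5, 2.5)^(1 / 0.06)
```

The first call should reproduce the 63.31 and 86.51 obtained above.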
#plot(TotExp, LifeExp, main='scatterplot', ylab='Life Expentancy', xlab = 'Total Expenditure', col=2)
#abline(lm(LifeExp~TotExp), col=1)
expMUL.lm = lm(LifeExp~TotExp + PropMD + PropMD * TotExp)
expMUL.lm
##
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp)
##
## Coefficients:
## (Intercept) TotExp PropMD TotExp:PropMD
## 6.277e+01 7.233e-05 1.497e+03 -6.026e-03
summary(expMUL.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Life Expectancy MR = 62.8 + 0.000072 * Total Expenditure + 1,497 * PropMD - 0.006 * Total Expenditure * PropMD
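Because of the interaction term, the effect of PropMD on life expectancy is not a single number: it depends on Total Expenditure. Differentiating the fitted equation with respect to PropMD gives 1497 - 0.006026 * TotExp. The sketch below evaluates this with the coefficients printed in the summary; the helper name is hypothetical.

```r
b_propmd      <- 1.497e+03    # PropMD coefficient from the summary above
b_interaction <- -6.026e-03   # TotExp:PropMD coefficient from the summary above

# Hypothetical helper: change in predicted LifeExp per unit change in PropMD
marginal_effect_propmd <- function(totexp) b_propmd + b_interaction * totexp

# Evaluated at a few expenditure levels seen in the head(df, 20) output
marginal_effect_propmd(c(112, 3297, 57007, 242556))

# Expenditure level at which the fitted effect of PropMD changes sign
sign_change_totexp <- -b_propmd / b_interaction
sign_change_totexp
```

The negative interaction means that a higher proportion of doctors is associated with a smaller gain in life expectancy the more a country already spends.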
Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471 - with an adjusted R-squared value of only ~35%, this is not a good model. The predictors in this model (Total Expenditure, PropMD, and their interaction) account for only ~35% of the variance in Life Expectancy.
F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16 - the F-statistic shows that the p-value is very low (close to zero), which means we can reject the null hypothesis that all slope coefficients are zero and state with confidence that the predictors, taken together, do contribute to explaining the dependent variable.
Residual standard error: 8.765 on 186 degrees of freedom - The residual SE is sizeable at 8.765, which means that data points are, on average, off by roughly 8.8 years from what the model would have predicted. By this measure, I would have to say the model is not a good fit to its data.
LifeExpMR = 62.8 + .000072 * 14 + 1497 * 0.03 - .006 * 14 * 0.03 # interaction coefficient is negative
LifeExpMR
## [1] 107.7085
The forecast is not realistic. It says that if we increase the proportion of doctors in the population and drastically reduce spending, we can raise life expectancy from the ~80s (high-life-expectancy countries) to 107. The proportion of doctors is not independent of Total Expenditure on healthcare. It takes a lot of money to train good doctors, and good doctors also expect to be well compensated. Thus, it is not realistic to have a drastic increase in doctors in a population and at the same time a drastic decrease in healthcare spending. 14 is far too low a number for Total Expenditure, even more so for countries with very expensive and inefficient healthcare systems. The US, for example, spends more on healthcare per capita than any other country, at around $7,000 per capita. To drastically reduce this to $14 per capita and expect a surge in medical doctors (hundreds of times the proportion seen in low-spending countries such as Bangladesh or Benin) would be absurd.
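One way to see how far outside the data this forecast sits is to compare its inputs with observed values. The sketch below uses only the 20 rows printed by `head(df, 20)` above, so the ranges describe that subset and not the full data set.

```r
# PropMD and TotExp for the 20 countries shown in the head(df, 20) output
prop_md <- c(0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.000070400,
             0.000142857, 0.002780191, 0.003698671, 0.002331953, 0.003610904,
             0.003660005, 0.000954128, 0.002679296, 0.000274894, 0.001098976,
             0.004758674, 0.004230489, 0.000890071, 0.000035500, 0.000080100)
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190797, 193142,
             842, 57007, 46494, 87, 25158, 11519, 242556, 5574, 628, 459)

range(prop_md)
range(tot_exp)

# How many times larger the forecast PropMD (0.03) is than the largest shown
ratio_to_max <- 0.03 / max(prop_md)
ratio_to_max

# Whether the forecast TotExp (14) is below every value shown
below_all <- 14 < min(tot_exp)
below_all
```

Both inputs fall outside the range of the rows shown, so the forecast is an extrapolation well beyond where the model was fitted.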