Libraries

library(rvest) 
library(dplyr)
library(knitr)
library(rcompanion)
library(MASS)
library(tidyverse)
library(caret)

Reading Data

who<-read.csv(file='who.csv',header=TRUE)

Summary of data

  print(kable(head(who)))
## 
## 
## Country                LifeExp   InfantSurvival   Under5Survival    TBFree      PropMD      PropRN   PersExp   GovtExp   TotExp
## --------------------  --------  ---------------  ---------------  --------  ----------  ----------  --------  --------  -------
## Afghanistan                 42            0.835            0.743   0.99769   0.0002288   0.0005723        20        92      112
## Albania                     71            0.985            0.983   0.99974   0.0011431   0.0046144       169      3128     3297
## Algeria                     71            0.967            0.962   0.99944   0.0010605   0.0020914       108      5184     5292
## Andorra                     82            0.997            0.996   0.99983   0.0032973   0.0035000      2589    169725   172314
## Angola                      41            0.846            0.740   0.99656   0.0000704   0.0011462        36      1620     1656
## Antigua and Barbuda         73            0.990            0.989   0.99991   0.0001429   0.0027738       503     12543    13046
  print(kable(summary(who)))
## 
## 
##                     Country       LifeExp      InfantSurvival   Under5Survival       TBFree           PropMD              PropRN             PersExp           GovtExp             TotExp     
## ---  ------------------------  --------------  ---------------  ---------------  ---------------  ------------------  ------------------  ----------------  -----------------  ---------------
##      Afghanistan        :  1   Min.   :40.00   Min.   :0.8350   Min.   :0.7310   Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00   Min.   :    10.0   Min.   :    13 
##      Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253   1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584 
##      Algeria            :  1   Median :70.00   Median :0.9785   Median :0.9745   Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50   Median :  5385.0   Median :  5541 
##      Andorra            :  1   Mean   :67.38   Mean   :0.9624   Mean   :0.9459   Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00   Mean   : 40953.5   Mean   : 41696 
##      Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900   3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331 
##      Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980   Max.   :0.9970   Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00   Max.   :476420.0   Max.   :482750 
##      (Other)            :184   NA              NA               NA               NA               NA                  NA                  NA                NA                 NA

1 Scatter Plot; Simple Regression

plot(who$TotExp,who$LifeExp) # predictor on the x-axis, response on the y-axis

s_mod<-lm(LifeExp~TotExp,data=who)

summary(s_mod)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic of 65.26 with a tiny p-value indicates that there is some linear relationship between Life Expectancy and Total Expenditures (the slope is not zero). R^2 of 0.2577 indicates that Total Expenditures explains only about 26% of the variation in Life Expectancy, which is small. The standard error of the slope is about 8 times smaller than the coefficient (t = 8.079), so the slope is estimated fairly precisely, but that alone does not make the model good. The residual median (3.154) is well above 0, the first and third quartiles are not symmetric around 0 (-4.778 vs 7.116), and the minimum (-24.764) is almost twice as far from 0 as the maximum (13.292). All this indicates that the residuals are left-skewed rather than normally distributed, and normality of residuals is one of the assumptions of simple linear regression. So the residual normality assumption is not met.
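The residual asymmetry can also be checked numerically instead of reading the quantiles by eye. The following is a minimal sketch, assuming a fitted lm object such as s_mod. The helper name check_residuals is hypothetical, and the demo model on R's built-in cars data is there only so the block runs without who.csv.

```r
# Hypothetical helper: summarise how far a model's residuals are from normal.
check_residuals <- function(model) {
  res <- residuals(model)
  sw <- shapiro.test(res)  # Shapiro-Wilk test; a small p-value suggests non-normality
  list(
    median = median(res),                   # near 0 when residuals are symmetric
    tail_ratio = abs(min(res)) / max(res),  # near 1 when both tails are balanced
    shapiro_p = sw$p.value
  )
}

# Demo on a built-in dataset (not the WHO data)
demo_mod <- lm(dist ~ speed, data = cars)
demo_check <- check_residuals(demo_mod)
print(demo_check)
```

With the WHO model the call would be check_residuals(s_mod). qqnorm(residuals(s_mod)) followed by qqline(residuals(s_mod)) gives the matching plot.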

2 Transformation

who$LifeExp_t<-who$LifeExp^4.6

who$TotExp_t<-who$TotExp^0.06

plot(who$TotExp_t,who$LifeExp_t) # predictor on the x-axis, response on the y-axis

s_mod<-lm(LifeExp_t~TotExp_t,data=who)

summary(s_mod)
## 
## Call:
## lm(formula = LifeExp_t ~ TotExp_t, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_t     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic of 507.7 with a tiny p-value indicates that there is a relationship between transformed Life Expectancy and transformed Total Expenditures (the slope is not zero). R^2 of 0.7298 indicates that transformed Total Expenditures explains about 73% of the variation in transformed Life Expectancy, which is pretty good. The standard error of the slope is about 22 times smaller than the coefficient (t = 22.53), so the slope is estimated precisely. The residual median is still above 0, but the quartiles are now close to symmetric around 0 (-53978977 vs 59139231), and the minimum is about 1.5 times as far from 0 as the maximum. So the residuals are still not exactly normally distributed, and the assumptions are not completely met. The two models use different response scales, so their R^2 values are not strictly comparable, but based on the large improvement in R squared and in the residual distribution, the second model is much better.

3 Forecast

(-736527910+2.5*620060216)^(1/4.6)
## [1] 86.50645
(-736527910+1.5*620060216)^(1/4.6)
## [1] 63.31153
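The two forecasts above apply the fitted line on the transformed scale and then undo the ^4.6 transformation. As a sketch, the same arithmetic can be wrapped in a function. The name forecast_life_exp is hypothetical, and the coefficients are copied by hand from the model summary rather than read from s_mod.

```r
# Coefficients copied from the summary of the transformed model above
b0 <- -736527910
b1 <- 620060216

# Predict LifeExp (years) from TotExp^0.06 by undoing the ^4.6 transformation
forecast_life_exp <- function(totexp_t) {
  (b0 + b1 * totexp_t)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(forecasts)

# Raw expenditure implied by each transformed value: TotExp = TotExp_t^(1/0.06)
implied_totexp <- c(1.5, 2.5)^(1 / 0.06)
print(implied_totexp)
```

The raw expenditure implied by TotExp^0.06 = 2.5 is well above the largest TotExp in the summary table (482750), so the forecast of 86.5 is an extrapolation beyond the observed data.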

4 Multiple Model

m_mod<-lm(LifeExp~PropMD+TotExp+PropMD*TotExp,data=who)

summary(m_mod)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
plot(m_mod)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

The F-statistic with a tiny p-value indicates that there is a relationship between Life Expectancy and at least some of the independent variables (at least one slope is not zero). R^2 of 0.3574 indicates that the model explains about 36% of the variation in Life Expectancy, which is not that good. The p-values show that PropMD, TotExp and their interaction are all statistically significant. The residual median (2.098) is above 0, the quartiles are not symmetric around 0 (-4.132 vs 6.540), and the minimum (-27.320) is about twice as far from 0 as the maximum (13.074). All this indicates that the residuals are not normally distributed, which is one of the assumptions of linear regression, so the assumptions are not met. The residual plots confirm that the residuals are not normally distributed and that their variance is not constant. The model is not good.
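Because the model includes the PropMD:TotExp interaction, the effect of PropMD on Life Expectancy is not a single number: it equals the PropMD coefficient plus the interaction coefficient times TotExp. A short sketch of that calculation follows, with coefficients copied by hand from the summary and hypothetical helper names.

```r
# Coefficients copied from the summary of m_mod above
b_propmd <- 1497
b_inter  <- -0.006026

# Change in LifeExp per unit of PropMD at a given TotExp
propmd_slope <- function(totexp) {
  b_propmd + b_inter * totexp
}

# TotExp at which the PropMD slope crosses zero
totexp_zero <- -b_propmd / b_inter

# Slope at the minimum, median and maximum TotExp from the summary table
print(propmd_slope(c(13, 5541, 482750)))
print(totexp_zero)
```

The zero crossing falls inside the observed TotExp range, so for the highest-spending countries the fitted model implies that more doctors lower life expectancy. That is another sign the model should not be taken at face value.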

5 Forecast

PropMD<-0.03

TotExp<-14

62.77+PropMD*1497+TotExp*0.00007233-0.006026*PropMD*TotExp
## [1] 107.6785

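This seems far too high. The highest life expectancy in the world is 84 (83 in this data set). The inputs are also an unusual combination: TotExp of 14 is near the minimum in the data (13), while PropMD of 0.03 is near the maximum (0.0351), so the forecast is driven almost entirely by the PropMD term (0.03 * 1497 is about 45 years). Expenditures are low, and even though the proportion of doctors is on the high end, there is no reason to believe that a life expectancy of 108 will be achieved. Not anytime soon.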
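The forecast can be wrapped in a function and compared against the observed data as a sanity check. This is a sketch only: predict_life_exp and in_range are hypothetical names, and the coefficients and ranges are copied by hand from the model summary and the summary table.

```r
# Coefficients copied from the summary of m_mod above
predict_life_exp <- function(prop_md, tot_exp) {
  62.77 + 1497 * prop_md + 0.00007233 * tot_exp - 0.006026 * prop_md * tot_exp
}

# TRUE when x lies inside the closed interval [lo, hi]
in_range <- function(x, lo, hi) x >= lo & x <= hi

pred <- predict_life_exp(0.03, 14)
print(pred)

# TRUE means the forecast exceeds the highest LifeExp in the data (83)
print(pred > 83)

# Are the inputs inside the observed ranges from the summary table?
print(c(PropMD = in_range(0.03, 0.0000196, 0.0351290),
        TotExp = in_range(14, 13, 482750)))
```

Each input is inside its own observed range, so a per-variable range check does not flag the problem. It is the combination of very low spending with a very high proportion of doctors that puts the forecast outside what the data support.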