Mulitple Regression from WHO data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
url = "https://raw.githubusercontent.com/josh1den/DATA-605/main/HW/HW12/who.csv"
df = read.csv(url)
head(df)
##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046
names(df)
##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"

ggplot(df, aes(x=TotExp, y=LifeExp)) + 
  geom_point() +
  geom_smooth(se=FALSE) +
  labs(title = "Total Expenditures vs. Life Expectancy", x="Total Expenditures", y="Life Expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df.lm1 = lm(LifeExp ~ TotExp, df)
summary(df.lm1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic compares variable fits in multiple regression models. Since this is a single-intercept model, the f-statistic is not meaningful.

The R-squared of 0.25 indicates Total Expenditures accounts for only 25% of the variance in Life Expectancy. This is very low.

The Standard Error of 7.535e-01 for the intercept and 7.795e-06 for the slope are small relative to the coefficients, which suggests low variability among the parameters.

The P-value of 7.71e-14 is very small and significant. This means there is a high probability total expenditures is relevant to life expectancy.

The takeaway is total expenditures is relevant life expectancy, but doesn’t fully explain it. More is needed.

ggplot(df, aes(x=TotExp^0.06, y=LifeExp^4.6)) + 
  geom_point() +
  geom_smooth(se=FALSE) +
  labs(title = "Total Expenditures vs. Life Expectancy", x="Total Expenditures ^0.06", y="Life Expectancy ^4.6")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df.lm2 = lm(I(LifeExp^4.6) ~ I(TotExp^0.06), df)
summary(df.lm2)
## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic is still comparing a single linear model, so it is not relevant.

The R-squared of 0.73 indicates Total Expenditures accounts for only 73% of the variance in Life Expectancy, which is much stronger than the previous model.

The Standard Error of 46817945 for the intercept and 27518940 for the slope are very large relative to the coefficients, which indicates the model is not estimating them well.

The P-value of 2.2e-16 is very small, in line with the previous model which indicates total expenditures is relevant to life expectancy.

The takeaway is that this model better explains the variance between the variables, but because the standard is larger, does not accurately predict them.

Given the equation \(𝑌 = mx + b\) where \(m\) is the slope, \(b\) is the intercept, and \(x, Y\) are points on the line:

x1 = 1.5 
x2 = 2.5  
b = df.lm2$coefficients[1] #intercept
m = df.lm2$coefficients[2] #slope

pred_1 = m*x1 + b
pred_2 = m*x2 + b

inv_1 = round(pred_1^(1/4.6),2) #inverted prediction
inv_2 = round(pred_2^(1/4.6),2)

print(paste("When TotExp^.06=1.5, Life Expectancy is",inv_1))
## [1] "When TotExp^.06=1.5, Life Expectancy is 63.31"
print(paste("When TotExp^.06=2.5, Life Expectancy is",inv_2))
## [1] "When TotExp^.06=2.5, Life Expectancy is 86.51"

df.lm3 = lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, df)
summary(df.lm3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is very similar to the R-squared value, which suggests removing a variable will have a negligable affect to the model’s reliability.

The R-squared of 0.35 is low, indicating the combined variables account for 35% of the variability of life expectancy.

The Standard Error for the intercept, total expenditures, and proportion of doctors to total expenditures is very low, indicating low variability, while proportion of doctors has high variability.

The P-values are all low, indicating the variables are all significant.

Takeaway - this model does not model the data as well as the previous model, however neither model is reliable at prediction.

b0 = df.lm3$coefficients[1]
b1 = df.lm3$coefficients[2]
b2 = df.lm3$coefficients[3]
b3 = df.lm3$coefficients[4]

pred = b0 + .03*b1 + 14*b2 + .03*14*b3
print(pred)
## (Intercept) 
##     107.696

It does not seem realistic that increasing the percent of doctors can lead to a life expectancy of nearly 108 years!