Mulitple Regression from WHO data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
= "https://raw.githubusercontent.com/josh1den/DATA-605/main/HW/HW12/who.csv"
url = read.csv(url)
df head(df)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
names(df)
## [1] "Country" "LifeExp" "InfantSurvival" "Under5Survival"
## [5] "TBFree" "PropMD" "PropRN" "PersExp"
## [9] "GovtExp" "TotExp"
ggplot(df, aes(x=TotExp, y=LifeExp)) +
geom_point() +
geom_smooth(se=FALSE) +
labs(title = "Total Expenditures vs. Life Expectancy", x="Total Expenditures", y="Life Expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
= lm(LifeExp ~ TotExp, df)
df.lm1 summary(df.lm1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The F-statistic compares variable fits in multiple regression models. Since this is a single-intercept model, the f-statistic is not meaningful.
The R-squared of 0.25 indicates Total Expenditures accounts for only 25% of the variance in Life Expectancy. This is very low.
The Standard Error of 7.535e-01 for the intercept and 7.795e-06 for the slope are small relative to the coefficients, which suggests low variability among the parameters.
The P-value of 7.71e-14 is very small and significant. This means there is a high probability total expenditures is relevant to life expectancy.
The takeaway is total expenditures is relevant life expectancy, but doesn’t fully explain it. More is needed.
ggplot(df, aes(x=TotExp^0.06, y=LifeExp^4.6)) +
geom_point() +
geom_smooth(se=FALSE) +
labs(title = "Total Expenditures vs. Life Expectancy", x="Total Expenditures ^0.06", y="Life Expectancy ^4.6")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
= lm(I(LifeExp^4.6) ~ I(TotExp^0.06), df)
df.lm2 summary(df.lm2)
##
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## I(TotExp^0.06) 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The F-statistic is still comparing a single linear model, so it is not relevant.
The R-squared of 0.73 indicates Total Expenditures accounts for only 73% of the variance in Life Expectancy, which is much stronger than the previous model.
The Standard Error of 46817945 for the intercept and 27518940 for the slope are very large relative to the coefficients, which indicates the model is not estimating them well.
The P-value of 2.2e-16 is very small, in line with the previous model which indicates total expenditures is relevant to life expectancy.
The takeaway is that this model better explains the variance between the variables, but because the standard is larger, does not accurately predict them.
Given the equation \(𝑌 = mx + b\) where \(m\) is the slope, \(b\) is the intercept, and \(x, Y\) are points on the line:
= 1.5
x1 = 2.5
x2 = df.lm2$coefficients[1] #intercept
b = df.lm2$coefficients[2] #slope
m
= m*x1 + b
pred_1 = m*x2 + b
pred_2
= round(pred_1^(1/4.6),2) #inverted prediction
inv_1 = round(pred_2^(1/4.6),2)
inv_2
print(paste("When TotExp^.06=1.5, Life Expectancy is",inv_1))
## [1] "When TotExp^.06=1.5, Life Expectancy is 63.31"
print(paste("When TotExp^.06=2.5, Life Expectancy is",inv_2))
## [1] "When TotExp^.06=2.5, Life Expectancy is 86.51"
= lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, df)
df.lm3 summary(df.lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic is very similar to the R-squared value, which suggests removing a variable will have a negligable affect to the model’s reliability.
The R-squared of 0.35 is low, indicating the combined variables account for 35% of the variability of life expectancy.
The Standard Error for the intercept, total expenditures, and proportion of doctors to total expenditures is very low, indicating low variability, while proportion of doctors has high variability.
The P-values are all low, indicating the variables are all significant.
Takeaway - this model does not model the data as well as the previous model, however neither model is reliable at prediction.
= df.lm3$coefficients[1]
b0 = df.lm3$coefficients[2]
b1 = df.lm3$coefficients[3]
b2 = df.lm3$coefficients[4]
b3
= b0 + .03*b1 + 14*b2 + .03*14*b3
pred print(pred)
## (Intercept)
## 107.696
It does not seem realistic that increasing the percent of doctors can lead to a life expectancy of nearly 108 years!