The attached who.csv dataset contains real-world data from 2008.
The variables included follow.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
who <- read_csv("who.csv")
head(who)
## # A tibble: 6 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
describe(who)
## vars n mean sd median trimmed mad min
## Country* 1 190 95.50 54.99 95.50 95.50 70.42 1.00
## LifeExp 2 190 67.38 10.85 70.00 68.47 10.38 40.00
## InfantSurvival 3 190 0.96 0.04 0.98 0.97 0.02 0.84
## Under5Survival 4 190 0.95 0.06 0.97 0.96 0.03 0.73
## TBFree 5 190 1.00 0.00 1.00 1.00 0.00 0.99
## PropMD 6 190 0.00 0.00 0.00 0.00 0.00 0.00
## PropRN 7 190 0.00 0.01 0.00 0.00 0.00 0.00
## PersExp 8 190 742.00 1354.00 199.50 386.70 256.49 3.00
## GovtExp 9 190 40953.49 86140.65 5385.00 17671.33 7692.47 10.00
## TotExp 10 190 41695.49 87449.85 5541.00 18060.03 7899.29 13.00
## max range skew kurtosis se
## Country* 190.00 189.00 0.00 -1.22 3.99
## LifeExp 83.00 43.00 -0.80 -0.25 0.79
## InfantSurvival 1.00 0.16 -1.34 1.11 0.00
## Under5Survival 1.00 0.27 -1.57 1.71 0.00
## TBFree 1.00 0.01 -1.66 2.70 0.00
## PropMD 0.04 0.04 7.52 64.25 0.00
## PropRN 0.07 0.07 7.25 74.81 0.00
## PersExp 6350.00 6347.00 2.48 5.64 98.23
## GovtExp 476420.00 476410.00 2.86 8.39 6249.30
## TotExp 482750.00 482737.00 2.85 8.32 6344.28
who %>% ggplot(aes(x = TotExp, y = LifeExp)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
who_lm1 <- lm(LifeExp~TotExp, data = who)
summary(who_lm1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
According to the \(R^2\) the simple linear model can account for 25.37% of the variability in life expectancy from the least squared line. For a good model we would like to see a standard error of at least 5 - 10x smaller than the corresponding coefficients. The simple linear model gives a standard error that is 8.07x smaller than the coefficient of total expenditure. Since this is a simple linear model and only has one parameter the F - statistic is about the same as the slope coefficient. The p - value of the whole model is the same as the p-value for the coefficient of the total expenditure parameter which is statistically significant at < 0.05.
who <- who %>% mutate(LifeExp_4.6 = LifeExp ** 4.6, TotExp_0.06 = TotExp**0.06)
who_lm2 <- lm(LifeExp_4.6 ~ TotExp_0.06, data = who)
summary(who_lm2)
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
According to the \(R^2\) of the transformed simple linear model, we can now account for 72.83% of the variability in life expectancy from the least squared line. Again, the p - value of the whole model is the same as the p-value for the coefficient of the total expenditure parameter (<2e-16) which is statistically significant at < 0.05. Transforming the variables also raised the F-statistic which demonstrates that the model has improved. The transformed model is a better model as it accounts for more of the variability in the dependent variable.
The equation that represents this model is:
\(LifeExp_4.6 = -736527910 + 620060216 * TotExp_0.06\)
-736527910 + 620060216 * 1.5
## [1] 193562414
Reversing the transformation we get a forcasted age of 63.31 years old when total expenditure is 1.5
-736527910 + 620060216 * 2.5
## [1] 813622630
Reversing the transformation we get age of 86.5 years old when total expenditure is 2.5
\(LifeExp = b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp\)
who_lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
summary(who_lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The linear model that includes total expenditure, proportion of MD’s, and the interaction term of proportion MD’s and Total expenditure is a pretty good model but only accounts for 34.71% of the variability in life expectancy. P values for all coefficients are significant and the p-value of the F statistic is also significant.
\(LifeExp = 62.77 + 1,497 * PropMD + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\)
PropMD <- .03
TotExp <- 14
who_lm3_prediction <- 62.77 + (1497 * PropMD) + (0.00007233 * TotExp) + (-0.006026 * PropMD * TotExp)
who_lm3_prediction
## [1] 107.6785
The foretasted life expectancy of 107 years doesn’t seem realistic.
According to our best fit line a high proportion of MD’s does seem to have more leverage on the life expectancy bringing it higher. When we look at the transformed model the relationship of total expenditure and life expectancy is moderate (correlation coefficient = 0.854) and the model seems to be the best fit. Based on this it seems that such a low total expenditure would not be related to such a high life expectancy.
who %>% ggplot(aes(x = TotExp_0.06, y = LifeExp_4.6)) +
geom_point()
who %>%
summarise(cor(TotExp_0.06, LifeExp_4.6, use = "complete.obs"))
## # A tibble: 1 × 1
## `cor(TotExp_0.06, LifeExp_4.6, use = "complete.obs")`
## <dbl>
## 1 0.854