Introduction

The attached who.csv dataset contains real-world data from 2008.

The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggpubr)
library(psych)

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

who <- read_csv("who.csv")

head(who)

## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>

describe(who)

##                vars   n     mean       sd  median  trimmed     mad   min
## Country*          1 190    95.50    54.99   95.50    95.50   70.42  1.00
## LifeExp           2 190    67.38    10.85   70.00    68.47   10.38 40.00
## InfantSurvival    3 190     0.96     0.04    0.98     0.97    0.02  0.84
## Under5Survival    4 190     0.95     0.06    0.97     0.96    0.03  0.73
## TBFree            5 190     1.00     0.00    1.00     1.00    0.00  0.99
## PropMD            6 190     0.00     0.00    0.00     0.00    0.00  0.00
## PropRN            7 190     0.00     0.01    0.00     0.00    0.00  0.00
## PersExp           8 190   742.00  1354.00  199.50   386.70  256.49  3.00
## GovtExp           9 190 40953.49 86140.65 5385.00 17671.33 7692.47 10.00
## TotExp           10 190 41695.49 87449.85 5541.00 18060.03 7899.29 13.00
##                      max     range  skew kurtosis      se
## Country*          190.00    189.00  0.00    -1.22    3.99
## LifeExp            83.00     43.00 -0.80    -0.25    0.79
## InfantSurvival      1.00      0.16 -1.34     1.11    0.00
## Under5Survival      1.00      0.27 -1.57     1.71    0.00
## TBFree              1.00      0.01 -1.66     2.70    0.00
## PropMD              0.04      0.04  7.52    64.25    0.00
## PropRN              0.07      0.07  7.25    74.81    0.00
## PersExp          6350.00   6347.00  2.48     5.64   98.23
## GovtExp        476420.00 476410.00  2.86     8.39 6249.30
## TotExp         482750.00 482737.00  2.85     8.32 6344.28

Problem 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression.

Do not transform the variables.
Provide and interpret the F statistics, R^2, standard error,and p-values only.
Discuss whether the assumptions of simple linear regression met.

Scatter plot

who %>% ggplot(aes(x = TotExp, y = LifeExp)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Simple linear model

who_lm1 <- lm(LifeExp~TotExp, data = who)
summary(who_lm1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

According to the \(R^2\) the simple linear model can account for 25.37% of the variability in life expectancy from the least squared line. For a good model we would like to see a standard error of at least 5 - 10x smaller than the corresponding coefficients. The simple linear model gives a standard error that is 8.07x smaller than the coefficient of total expenditure. Since this is a simple linear model and only has one parameter the F - statistic is about the same as the slope coefficient. The p - value of the whole model is the same as the p-value for the coefficient of the total expenditure parameter which is statistically significant at < 0.05.

Problem 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6) and raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06).

Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables.
Provide and interpret the F statistics, R^2, standard error, and p-values.
Which model is “better?”

Transformed simple linear regression

who <- who %>% mutate(LifeExp_4.6 = LifeExp ** 4.6, TotExp_0.06 = TotExp**0.06)
who_lm2 <- lm(LifeExp_4.6 ~ TotExp_0.06, data = who)
summary(who_lm2)

## 
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_0.06  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

According to the \(R^2\) of the transformed simple linear model, we can now account for 72.83% of the variability in life expectancy from the least squared line. Again, the p - value of the whole model is the same as the p-value for the coefficient of the total expenditure parameter (<2e-16) which is statistically significant at < 0.05. Transforming the variables also raised the F-statistic which demonstrates that the model has improved. The transformed model is a better model as it accounts for more of the variability in the dependent variable.

Problem 3

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5.

The equation that represents this model is:

\(LifeExp_4.6 = -736527910 + 620060216 * TotExp_0.06\)

-736527910 + 620060216 * 1.5

## [1] 193562414

Reversing the transformation we get a forcasted age of 63.31 years old when total expenditure is 1.5

Then forecast life expectancy when TotExp^.06=2.5.

-736527910 + 620060216 * 2.5

## [1] 813622630

Reversing the transformation we get age of 86.5 years old when total expenditure is 2.5

Problem 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values.

\(LifeExp = b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp\)

How good is the model?

who_lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
summary(who_lm3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The linear model that includes total expenditure, proportion of MD’s, and the interaction term of proportion MD’s and Total expenditure is a pretty good model but only accounts for 34.71% of the variability in life expectancy. P values for all coefficients are significant and the p-value of the F statistic is also significant.

Problem 5

Forecast LifeExp when PropMD=.03 and TotExp = 14.

\(LifeExp = 62.77 + 1,497 * PropMD + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\)

PropMD <- .03
TotExp <-  14

who_lm3_prediction <- 62.77 + (1497 * PropMD) + (0.00007233 * TotExp) + (-0.006026 * PropMD * TotExp)

who_lm3_prediction

## [1] 107.6785

Does this forecast seem realistic?

The foretasted life expectancy of 107 years doesn’t seem realistic.

Why or why not?

According to our best fit line a high proportion of MD’s does seem to have more leverage on the life expectancy bringing it higher. When we look at the transformed model the relationship of total expenditure and life expectancy is moderate (correlation coefficient = 0.854) and the model seems to be the best fit. Based on this it seems that such a low total expenditure would not be related to such a high life expectancy.

who %>% ggplot(aes(x = TotExp_0.06, y = LifeExp_4.6)) +
  geom_point()

who %>%
  summarise(cor(TotExp_0.06, LifeExp_4.6, use = "complete.obs"))

## # A tibble: 1 × 1
##   `cor(TotExp_0.06, LifeExp_4.6, use = "complete.obs")`
##                                                   <dbl>
## 1                                                 0.854

Week 12 Homework

Dirk Hartog

2024-04-11