Multiple Regression

library(tidyverse)
library(janitor)

Load data

# load data
df <- read_csv("/Users/justinwilliams/Documents/CUNY SPS/605/assignments/data/who.csv") %>% 
  clean_names()

## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Variables include:

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

Question 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Scatterplot function

sct_plot <- function(df, col1, col2, xlab, ylab, title="Scatterplot",
                     title_size = 18) {
  df %>% 
  ggplot(aes(!!sym(col1), !!sym(col2))) +
  geom_point() +
  labs(title = title,
       y = ylab,
       x = xlab) +
  theme(plot.title.position = "plot",
          plot.title = element_text(hjust = 0.5, size = title_size))
}

sct_plot(df, 'tot_exp', 'life_exp',"Total Expenditures",
         "Life Expectancy (yrs)", 
         "Life Expectancy as a function of Total Expenditures")

Run simple regressions on life_exp ~ tot_exp

# slr model 
lm1 <- lm(life_exp ~ tot_exp, data = df)

# summary
summary(lm1)

## 
## Call:
## lm(formula = life_exp ~ tot_exp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## tot_exp     6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The standard error is extremely small here, it shows the statistical \(standard error\) for the tot_exp coefficient. For a good model we would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.

In this case:

# define std error
std_err <- 7.795e-06

# define coefficient estimate
est <- 6.297e-05

# divide by std error
est/std_err

## [1] 8.078255

Here we can see the standard error is about ~8 smaller than the coefficient estimate, which generally indicates a good model.

The F-statistic is 65.26 which compares the current model to to a model that has only a single additional parameter. However, because this model already has a single parameter, it gives the same information as the line for slope estimate, it is the \(t\) value squared. Therefore, since the F-statistic is large and the associated p-value is small, we reject the null hypothesis and conclude that the coefficient is significantly different then zero, which indicates there IS a linear relationship between tot_exp and life_exp

For the predictor tot_exp, the \(t\)-value is 8.079 which is the ratio of the coefficient estimate to the standard error. This then gives us the p-value labeled as \(Pr(>|t|)\) which shows the probability of observing a test statistic as extreme or more extreme as the one observed (assuming there is no linear relationship between the predictor and response variables). Because this p-value is so low, tot_exp is a highly significant predictor of life_exp.

The \(R^2\) is pretty low with a score of 0.2577, tot_exp only accounts for ~25% of the variation in life_exp.

Assumptions of linear regressions can be checked by doing a residual analysis.

par(mfrow=c(2,2))
plot(lm1)

Here in the residuals vs. fitted chart we can see the residuals are NOT fitted around zero. This is an additional sign (besides small \(R^2\)) that we have produced a poor model.

When looking at the Q-Q plot we see some obvious nonlinearities, leading us to be cautious about the residuals being normally distributed.

Let’s check out a histogram.

hist(lm1$residuals)

Here we can clearly see our residuals are NOT normal.

Overall we can conclude that, we have a poor model.

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”

# transform
df$life_exp_pwr <- df$life_exp^4.6
df$tot_exp_pwr <- df$tot_exp^0.06

Re-plot

sct_plot(df, 'tot_exp_pwr', 'life_exp_pwr',"Total Expenditures ^ 0.06",
         "Life Expectancy ^ 4.6", 
         "Life Expectancy as a function of Total Expenditures")

Re-run model

# transformed model
lm2 <- lm(life_exp_pwr ~ tot_exp_pwr, data = df)

#summary
summary(lm2)

## 
## Call:
## lm(formula = life_exp_pwr ~ tot_exp_pwr, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## tot_exp_pwr  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

After transforming the variables, we can have a much better model. The \(R^2\) here is 0.7298, which accounts for almost 73% of the variation in life_exp_pwr. We can see a much larger F-statistic and a much smaller p-value, this is also an indicator of a better model.

Let’s do the same coefficient/standard error check.

# assign variables
std_err2 <- 27518940
est2 <- 620060216

# compute
est2/std_err2

## [1] 22.53213

Here we can see the standard error is almost 23 times smaller then the coefficient estimate. Typically a ratio of 5 - 10 times smaller indicates a decent model. This indicates a better then decent model.

Let’s look at some diagnostic plots.

par(mfrow=c(2,2))
plot(lm2)

Much more normal looking, we can see the residuals are more gathered around zero and the Q-Q plot has a much straighter line.

hist(lm2$residuals)

In short, the transformation produced a model that predicts life_exp much better then the non-transformed version. However, due to the power transformation, interpretability.

Question 3

Using the results above, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

This can be gleaned from the following formula:

\[\widehat{LifeExp4.6}=\beta_0+\beta_1 \times TotExp0.06\]

# write function
forecast_resp <- function(predictor,beta_0, beta_1) {
  (beta_0 + (beta_1 * predictor))^(1/4.6)
}

For 1.5.

# tot_exp^0.06 = 1.5
forecast_resp(1.5, -736527910, 620060216)

## [1] 63.31153

Let’s try this for 2.5

forecast_resp(2.5, -736527910, 620060216)

## [1] 86.50645

Question 4

Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model?

\[\widehat{LifeExp}=\beta_0+\beta_1 \times PropMd+\beta_2 \times TotExp+\beta_3 \times PropMD \times TotExp\]

# multiple regression
lm4 <- lm(life_exp ~ prop_md + tot_exp + prop_md*tot_exp, data = df)

# summary
summary(lm4)

## 
## Call:
## lm(formula = life_exp ~ prop_md + tot_exp + prop_md * tot_exp, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.277e+01  7.956e-01  78.899  < 2e-16 ***
## prop_md          1.497e+03  2.788e+02   5.371 2.32e-07 ***
## tot_exp          7.233e-05  8.982e-06   8.053 9.39e-14 ***
## prop_md:tot_exp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is much lower then the other models indicating that it is not as good of a fit. However, the \(R^2\) value is a touch higher than the first model with a single predictor, so it explains more of the variance, roughly ~36%. The p-values are all significant and very small, therefore we can say that each of these coefficients are highly predictive of life_exp. However, I would not conclude that this is a good model, due to its overall low \(R^2\).

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# define function
for_mult_resp <- function(pred_1, pred_2, beta_0, beta_1, beta_2, beta_3) {
  (beta_0 + (beta_1 * pred_1) + (beta_2 * pred_2) - (beta_3 * pred_1 * pred_2))^(1/4.6)
}

# compute forecast
for_mult_resp(.03, 14, 6.277e+01, 1.497e+03,7.233e-05,-6.026e-03)

## [1] 2.765487

This does not seem realistic, because how could life_exp be 2.765487 years?