Homework 12 - Linear Regression

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

# data
df <- read_csv('data/who.csv')

# scatterplot
df %>% ggplot(aes(x=TotExp, y=LifeExp)) +
  geom_point(alpha=0.2) +
  geom_smooth(method='lm', se=FALSE, alpha=0.5, color='blue')

df %>% ggplot(aes(x=TotExp)) +
  geom_histogram()

# lm
lm1 <- lm(LifeExp ~ TotExp, data=df)
summary(lm1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

# fitted vs residual
get_regression_points(lm1) %>%
  ggplot(aes(x=LifeExp_hat, y=residual)) +
    geom_point(alpha=0.3) +
    geom_hline(yintercept=0, color='blue')

# qq
qqnorm(resid(lm1))
qqline(resid(lm1))

Model Statistics

The Adjusted $R^2$ is 0.25, suggesting this single-predictor model explains only 25% of the variance in Life Expectancy. The F-Statistic of 65.26 provides the coefficient we’d observe in an intercept-only model, and we can see it is very lose to the coefficient of our intercept here (64.75). The p-value for the F-Statistic is very small (7.714e-14) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 9.37, the amount of total variance in the residual values.

Assumptions of Linear Regression

Linearity - The relationship between x and the mean of y should be linear. Our scatterplot shows this is a curvilinear relationship instead - we can still use LR to model it, but will probably have to add some Quadratic terms to our predictors.
Homoscedasticity - The variance of residuals should be the same for any value of x. Our fitted vs residual plot shows the residuals are not randomly distributed around the y-axis, violating this assumption for LR.
Normality of Residuals - The residuals should be normally distributed. Our Q-Q plot shows this assumption is violated for LR.
Lack of Multicollinearity - Doesn’t apply here, since we have only a single predictor.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

We can scatterplot the transformed predictor and response, and plainly see we’ve managed to “bend” the original values into a scale that lends itself well to linear regression:

df2 <- df %>%
  mutate(LifeExp = LifeExp ^ 4.6, TotExp = TotExp ^ 0.06)

df2 %>% ggplot(aes(x=TotExp, y=LifeExp)) +
  geom_point(alpha=0.2) +
  geom_smooth(method='lm', se=FALSE, alpha=0.5, color='blue')

lm2 <- lm(LifeExp ~ TotExp, data=df2)
summary(lm2)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

# fitted vs residual
get_regression_points(lm2) %>%
  ggplot(aes(x=LifeExp_hat, y=residual)) +
    geom_point(alpha=0.3) +
    geom_hline(yintercept=0, color='blue')

# qq
qqnorm(resid(lm2))
qqline(resid(lm2))

Model Statistics

Overall, this second model performs better than the first. The Adjusted $R^2$ is 0.73, suggesting this single-predictor model explains about 73% of the variance in Life Expectancy. The F-Statistic of 507.7 provides the coefficient we’d observe in an intercept-only model, and we can see it is quite far from coefficient of our intercept here. The p-value for the F-Statistic is very small (2.2e-16) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 90490000, the amount of total variance in the residual values - although if we wanted to interpret this (and both model coefficients) in real terms, we would need to transform them back to the original scale.

While the coefficients and statistics are on an exponential scale, we can see the assumptions of linear largely hold true, compared to the first model. The fitted vs. residuals plot demonstrates greater homoscedasticity, and the Q-Q plot shows the residuals are more normally distributed.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

newval1 <- data.frame(TotExp = c(1.5))
lm2_predict1 <- predict(lm2, newdata = newval1) ^ (1/4.6)

newval2 <- data.frame(TotExp = c(2.5))
lm2_predict2 <- predict(lm2, newdata = newval2) ^ (1/4.6)

print(lm2_predict1)

##        1 
## 63.31153

print(lm2_predict2)

##        1 
## 86.50645

Remembering that the raw prediction returned is on the transformed scale of our predictor (to the power of 4.6), we simply transform back to the original scale to get a “real world” estimate of life expectancy.

When TotExp^.06 = 1.5 ($861), the LifeExp is approx 63.3.

When TotExp^.06 = 2.5 ($4.3M), the LifeExp is approx 86.5.

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
summary(lm3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

# fitted vs residual
get_regression_points(lm3) %>%
  ggplot(aes(x=LifeExp_hat, y=residual)) +
    geom_point(alpha=0.3) +
    geom_hline(yintercept=0, color='blue')

# qq
qqnorm(resid(lm3))
qqline(resid(lm3))

In this model we return to non-transformed values for the response and predictors, and we can observe similar problems regarding our assumptions for linear regression.

The Adjusted $R^2$ is 0.35, suggesting this model only explains about 35% of the variance in Life Expectancy. The F-Statistic of 34.5 provides the coefficient we’d observe in an intercept-only model, and we can see it is quite far from coefficient of our intercept here. The p-value for the F-Statistic is very small (2.2e-16) indicating we’d almost never observe a value this extreme in the absence of a linear relationship. The Residual SE is 8.77, the amount of total variance in the residual values.

The fitted vs residual plot shows extreme heteroscedasticity, and the Q-Q plot shows the residuals are less-normally distributed than our transformed model in the second example.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

newval3 <- data.frame(PropMD=c(.03), TotExp = c(1.5))
lm3_predict1 <- predict(lm3, newdata = newval3) 
print(lm3_predict1)

##        1 
## 107.6974

At first glance, the estimate of an average life expectancy of nearly 108 years seems extremely optimistic! If we compare to the second model in which we used transformed variables, we saw that a per-capita investment of $4.2M (holding all other variables equal) would be required just to raise life expectancy to 85.

Given the low Adjusted $R^2$ and multiple violations of our assumptions for linear regression – and applying a dose of common sense – we can safely reject this prediction and look for ways to improve the model instead.

Homework 12 - Linear Regression

DATA 605: Foundations in Computational Mathematics

Jeff Parks