knitr::opts_chunk$set(echo = TRUE)
library(tinytex)
Source files: [https://github.com/djlofland/DATA605_S2020/tree/master/]
WHO 2008 Dataset Regression
The attached who.csv dataset contains real-world data from 2008. The variables included follow:
- Country: name of the country
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or more
- Under5Survival: proportion of those surviving to five years or more
- TBFree: proportion of the population without TB.
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
- TotExp: sum of personal and government expenditures.
who <- read.csv('who.csv')
head(who)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
Problem 1
- Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Model
# ----------- Correlation and simple linear regression -----------
cor(who$LifeExp, who$TotExp)
## [1] 0.5076339
model <- lm(LifeExp ~ TotExp, who)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
plot(LifeExp ~ TotExp, who)
abline(model)

plot(model$residuals ~ who$TotExp)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

hist(model$residuals)

qqnorm(model$residuals)
qqline(model$residuals)

Discussion:
- Assumptions - The assumptions for a linear regression are NOT met. 1) The scatterplot of LifeExp vs TotExp shows that the raw data points do NOT follow a linear relationship; they trace a strongly curved pattern instead. 2) The residuals are NOT normally distributed, and when plotted against TotExp they show a clear pattern (they should be scattered randomly around zero). The QQ plot confirms that the residuals depart from normality and shows strong skew (note the concave shape of the QQ plot).
- F Statistic - 65.26 (on 1 and 188 degrees of freedom) - A value this high would normally indicate significance, but as we saw in the assumptions, a linear regression is NOT appropriate for this data and as a result, the F statistic is not informative.
- R2 - 0.2577 - The linear model explains 25.77% of the variability in LifeExp using TotExp. This is a low value.
- Standard Error - The residual standard error is 9.371 on 188 degrees of freedom, so a typical fitted value misses the observed LifeExp by roughly 9.4 years. It is calculated as sqrt(SSE/(n-(1+k))), where SSE is the sum of squared residuals, k is the number of variables in the model (not counting the intercept), and n is the number of data points in the dataset. The 7.795e-06 in the coefficient table is a different quantity: it is the standard error of the TotExp slope estimate (6.297e-05), which acts like a standard deviation for that estimate.
- p-value - 7.714e-14 - While the p-value would suggest this model is significant (p-value < 0.05), we saw when looking at the assumptions that a linear regression is NOT appropriate for these variables and as a result, the p-value isn’t informative.
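The quantities interpreted above can be recomputed by hand from a fitted model. The following is a minimal sketch, not the analysis itself: it assumes who.csv is unavailable and uses only the six rows printed by head(who) as toy data, so the numbers it prints will differ from the full-data summary. The names who_head, rse_manual, r2_manual and f_manual are illustrative.

```r
# Toy data: the six rows shown by head(who). This is an assumption made so the
# sketch runs on its own; the full who.csv has 190 rows.
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

fit <- lm(LifeExp ~ TotExp, data = who_head)

n <- nrow(who_head)  # number of data points
k <- 1               # number of predictors, not counting the intercept

sse <- sum(residuals(fit)^2)                              # sum of squared residuals
sst <- sum((who_head$LifeExp - mean(who_head$LifeExp))^2) # total sum of squares

rse_manual <- sqrt(sse / (n - (1 + k)))                   # residual standard error
r2_manual  <- 1 - sse / sst                               # R-squared
f_manual   <- ((sst - sse) / k) / (sse / (n - (1 + k)))   # F statistic

# Compare with the values reported by summary()
s <- summary(fit)
print(c(rse = rse_manual, r2 = r2_manual, f = f_manual))
print(c(rse = s$sigma, r2 = s$r.squared, f = unname(s$fstatistic["value"])))
```

Both printed vectors should agree, which is a quick way to confirm which number in the summary output is the residual standard error and which are the per-coefficient standard errors.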
Problem 2
- Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Model
who$LifeExpXform <- who$LifeExp^4.6
who$TotExpXform <- who$TotExp^0.06
cor(who$LifeExpXform, who$TotExpXform)
## [1] 0.8542642
# Note: the predictor in this formula is the untransformed TotExp, not TotExpXform
model <- lm(LifeExpXform ~ TotExp, who)
summary(model)
##
## Call:
## lm(formula = LifeExpXform ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -309107631 -103496133 18566535 100019031 273607812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.542e+08 1.064e+07 23.89 <2e-16 ***
## TotExp 1.290e+03 1.101e+02 11.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132300000 on 188 degrees of freedom
## Multiple R-squared: 0.4222, Adjusted R-squared: 0.4191
## F-statistic: 137.4 on 1 and 188 DF, p-value: < 2.2e-16
plot(LifeExpXform ~ TotExpXform, who)
abline(model)

plot(model$residuals ~ who$TotExpXform)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

hist(model$residuals)

qqnorm(model$residuals)
qqline(model$residuals)

Discussion
While a linear regression is far more appropriate for the transformed data (the correlation rises from 0.508 in Problem 1 to 0.854), it’s not perfect. However, our assumptions are looking better and a linear regression is probably OK in this situation. Note that the formula actually fitted above regresses LifeExpXform on the untransformed TotExp, so the summary statistics below describe that fit. For a simple regression on TotExpXform, R2 equals the squared correlation, 0.8542642^2 ≈ 0.73.
- Assumptions - Our assumptions for a linear regression are mostly met. 1) Plotting LifeExpXform vs TotExpXform, the data points follow a broadly linear relationship. 2) The residuals are closer to normally distributed, and when plotted against TotExpXform they are more randomly distributed. We do see some outliers and a little bit of a pattern in the residuals, but this is far better than the model in Problem 1. Our QQ plot shows that the residuals follow the normal line through most of the data points. We do see some issues at the extremes (outliers) that don’t conform to the normality assumption, but it’s far better than the model in Problem 1.
- F Statistic - 137.4 (on 1 and 188 degrees of freedom) - Since our assumptions are basically met, the high F indicates significance.
- R2 - 0.4222 - The fitted model explains 42.22% of the variability in LifeExp^4.6 using TotExp, up from 25.77% in Problem 1.
- Standard Error - The residual standard error is 132300000 on 188 degrees of freedom. It is measured on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 years from Problem 1. It is calculated as sqrt(SSE/(n-(1+k))), where SSE is the sum of squared residuals, k is the number of variables in the model (not counting the intercept), and n is the number of data points in the dataset. The 1.101e+02 in the coefficient table is the standard error of the TotExp slope estimate (1.290e+03), which acts like a standard deviation for that estimate.
- p-value - < 2.2e-16 - The p-value suggests this model is significant (p-value < 0.05).
- Which model is “better?” - The transformed model. It has the higher R2 (0.4222 vs 0.2577) and its residuals come much closer to meeting the regression assumptions.
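A regression of the transformed response on the transformed predictor can be sketched as follows. This is a hedged illustration that assumes who.csv is unavailable and therefore uses only the six rows printed by head(who); the object names who_head and fit_xform are hypothetical, and the printed values will not match a fit on the full dataset. It also shows the identity used above: in a simple linear regression, R2 is the square of the correlation between the two variables.

```r
# Toy data: the six rows shown by head(who) (an assumption for a runnable sketch)
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Apply the two power transforms from the problem statement
who_head$LifeExpXform <- who_head$LifeExp^4.6
who_head$TotExpXform  <- who_head$TotExp^0.06

# Both sides of the formula use the transformed columns
fit_xform <- lm(LifeExpXform ~ TotExpXform, data = who_head)

# In simple linear regression, R-squared equals the squared correlation
r  <- cor(who_head$LifeExpXform, who_head$TotExpXform)
r2 <- summary(fit_xform)$r.squared
print(c(r_squared_from_cor = r^2, r_squared_from_lm = r2))
```

On the full data the same two lines would be expected to give 0.8542642^2 ≈ 0.73 for both values, since the correlation reported above was computed on the transformed columns.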
Problem 3
- Using the results from Problem 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
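# Plug 1.5 into the Problem 2 equation, then undo the 4.6 power to get back to years.
# The slope 1.290e+03 was estimated per unit of untransformed TotExp.
pred_lifeExp <- 2.542e+08 + 1.290e+03 * 1.5
(pred_lifeExp <- pred_lifeExp^(1/4.6))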
## [1] 67.17579
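Both requested forecasts (TotExp^.06 = 1.5 and 2.5) can be produced in one call with predict(), followed by the inverse of the 4.6 power. The sketch below is an illustration under stated assumptions: who.csv is not available, so the model is fitted to the six rows printed by head(who), and forecast_life_exp is a hypothetical helper name. The values it prints are therefore not the answers for the full dataset.

```r
# Toy data: the six rows shown by head(who) (an assumption for a runnable sketch)
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)
who_head$LifeExpXform <- who_head$LifeExp^4.6
who_head$TotExpXform  <- who_head$TotExp^0.06

fit_xform <- lm(LifeExpXform ~ TotExpXform, data = who_head)

# Hypothetical helper: forecast LifeExp (in years) for given values of TotExp^0.06
forecast_life_exp <- function(fit, tot_exp_xform) {
  new_data   <- data.frame(TotExpXform = tot_exp_xform)
  pred_xform <- predict(fit, newdata = new_data)  # prediction of LifeExp^4.6
  unname(pred_xform^(1 / 4.6))                    # back-transform to years
}

forecast <- forecast_life_exp(fit_xform, c(1.5, 2.5))
print(forecast)
```

Using predict() with a newdata frame keeps the column name tied to the predictor the model was fitted on, which avoids plugging a transformed value into a coefficient estimated on the untransformed scale.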
Problem 4
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
Model
model <- lm(LifeExp ~ PropMD * TotExp, who)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
plot(LifeExp ~ PropMD * TotExp, who)

# No abline() here: a single straight line cannot represent a fit with three slope terms

plot(model$residuals ~ PropMD * TotExp, who)

abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

hist(model$residuals)

qqnorm(model$residuals)
qqline(model$residuals)

Discussion
- Assumptions - Our assumptions for a linear regression are NOT met. As a result, a linear model is not appropriate to model the relationship between these variables. 1) Plotting LifeExp vs TotExp and LifeExp vs PropMD, the data points do not follow a linear relationship. 2) The residuals are NOT normally distributed, and when plotted against TotExp and PropMD they show clear patterns (they should be randomly distributed). Our QQ plot confirms that the residuals depart from normality and shows strong skew (note the concave shape of the QQ plot).
- F Statistic - 34.49 (on 3 and 186 degrees of freedom) - Since our assumptions are NOT met, the F statistic is not particularly informative.
- R2 - 0.3574 (adjusted 0.3471) - The linear model explains 35.74% of the variability in LifeExp using TotExp, PropMD and their interaction. This is only a modest gain over the 25.77% in Problem 1.
- Standard Error - The residual standard error is 8.765 on 186 degrees of freedom, only slightly below the 9.371 years from Problem 1. It is calculated as sqrt(SSE/(n-(1+k))), where SSE is the sum of squared residuals, k is the number of terms in the model (not counting the intercept, so k = 3 here), and n is the number of data points in the dataset. The values 2.788e+02 (PropMD), 8.982e-06 (TotExp) and 1.472e-03 (PropMD:TotExp) are the standard errors of the individual coefficient estimates; each acts like a standard deviation for its estimate.
- p-value - < 2.2e-16 - Normally, the p-value would suggest this model is significant (p-value < 0.05), and each individual coefficient also has a p-value below 0.05; however, since the assumptions for a linear model are not met, the p-value isn’t particularly informative.
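The formula PropMD * TotExp is shorthand for both main effects plus their interaction, which is why the coefficient table has four rows. The sketch below illustrates this, and it also collects the fitted values and residuals that a residuals-vs-fitted check would use. It assumes who.csv is unavailable and fits to the six rows printed by head(who), so with four coefficients and six points the fit itself is not meaningful; the names who_head, fit_star, fit_long and diag_df are illustrative.

```r
# Toy data: the six rows shown by head(who) (an assumption for a runnable sketch)
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  PropMD  = c(0.000228841, 0.001143127, 0.001060478,
              0.003297297, 0.000070400, 0.000142857),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# `*` expands to main effects plus the interaction term
fit_star <- lm(LifeExp ~ PropMD * TotExp, data = who_head)
fit_long <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = who_head)

print(names(coef(fit_star)))
print(all.equal(fitted(fit_star), fitted(fit_long)))

# For a model with several terms, residuals are usually inspected against the
# fitted values rather than against each predictor separately
diag_df <- data.frame(
  fitted = fitted(fit_star),
  resid  = residuals(fit_star)
)
print(diag_df)
```

Passing diag_df$fitted and diag_df$resid to plot() and adding abline(h = 0, lty = 3) would give a single residual plot for the whole model.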
Problem 5
- Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
(pred_lifeExp <- 6.277e+01 + 1.497e+03 * 0.03 + 7.233e-05 * 14 + -6.026e-03 * 0.03 * 14)
## [1] 107.6785
Discussion
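- 107.7 years does NOT seem realistic: very few individuals live to be that old, let alone the average for a whole country. The PropMD term alone contributes 1.497e+03 * 0.03 = 44.91 years on top of the intercept of 62.77. The inputs are also far from the data the model was fitted on: PropMD = 0.03 is about nine times the largest value among the six rows shown by head(who) (0.0033), while TotExp = 14 is well below the smallest (112). If anything, 107 years would be an extreme outlier.
- When building the model, it was clear that our assumptions were not met for a linear regression (even after including the interaction term). As such, we should not rely on any predictions made by the model.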
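The forecast can be broken into per-term contributions to see where the 107.7 years comes from. The sketch below uses only the coefficients reported in the Problem 4 summary; the range check uses the six PropMD values printed by head(who), which is an assumption standing in for the full column. The names coefs, predict_life_exp and prop_md_head are illustrative.

```r
# Coefficients copied from the Problem 4 summary output
coefs <- c(intercept   = 6.277e+01,
           PropMD      = 1.497e+03,
           TotExp      = 7.233e-05,
           interaction = -6.026e-03)

# Hypothetical helper: evaluate the fitted equation for given inputs
predict_life_exp <- function(prop_md, tot_exp) {
  unname(coefs["intercept"] +
         coefs["PropMD"] * prop_md +
         coefs["TotExp"] * tot_exp +
         coefs["interaction"] * prop_md * tot_exp)
}

pred <- predict_life_exp(prop_md = 0.03, tot_exp = 14)
print(round(pred, 4))  # 107.6785, matching the value reported above

# Contribution of each term at PropMD = 0.03, TotExp = 14
contrib <- c(intercept   = unname(coefs["intercept"]),
             PropMD      = unname(coefs["PropMD"]) * 0.03,
             TotExp      = unname(coefs["TotExp"]) * 14,
             interaction = unname(coefs["interaction"]) * 0.03 * 14)
print(contrib)

# Extrapolation check against the six PropMD values printed by head(who)
prop_md_head <- c(0.000228841, 0.001143127, 0.001060478,
                  0.003297297, 0.000070400, 0.000142857)
outside_range <- 0.03 > max(prop_md_head)
print(outside_range)
```

Nearly all of the excess over the intercept comes from the PropMD term, which is the kind of behaviour to expect when a linear term is pushed far beyond the bulk of the data it was estimated from.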