The who.csv dataset contains real-world data from 2008. The variables included follow.
Variable Name | Description |
---|---|
Country | name of the country |
LifeExp | average life expectancy for the country in years |
InfantSurvival | proportion of those surviving to one year or more |
Under5Survival | proportion of those surviving to five years or more |
TBFree | proportion of the population without TB |
PropMD | proportion of the population who are MDs |
PropRN | proportion of the population who are RNs |
PersExp | mean personal expenditures on healthcare in US dollars at average exchange rate |
GovtExp | mean government expenditures per capita on healthcare, US dollars at average exchange rate |
TotExp | sum of personal and government expenditures |
0. Load the Data
who <- read.csv('https://raw.githubusercontent.com/xiaoxiaogao-DD/store/master/who.csv')
head(who)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
summary(who)
## Country LifeExp InfantSurvival
## Afghanistan : 1 Min. :40.00 Min. :0.8350
## Albania : 1 1st Qu.:61.25 1st Qu.:0.9433
## Algeria : 1 Median :70.00 Median :0.9785
## Andorra : 1 Mean :67.38 Mean :0.9624
## Angola : 1 3rd Qu.:75.00 3rd Qu.:0.9910
## Antigua and Barbuda: 1 Max. :83.00 Max. :0.9980
## (Other) :184
## Under5Survival TBFree PropMD PropRN
## Min. :0.7310 Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9253 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9745 Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9459 Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9900 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :0.9970 Max. :1.0000 Max. :0.0351290 Max. :0.0708387
##
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
##
hist(who$LifeExp,main = 'Histogram: Life Expectancy in each country',xlab = 'Life Expectancy')
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
cor(who$LifeExp,who$TotExp)
## [1] 0.5076339
m1 <- lm(LifeExp ~ TotExp, data = who)
summary(m1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
plot(who$TotExp,who$LifeExp,main = 'Scatterplot & Regression Model: Life Expectancy ~ Expenditures',xlab = 'Predictor: personal and government expenditures',ylab = 'avg life expectancy (yr)')
abline(m1)
p-values: The p-values of the intercept and TotExp are both below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected and the linear model is statistically significant (also indicated by the stars next to the p-values).
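The t value and p-value in the TotExp row can be reproduced by hand from the estimate and its standard error. The sketch below copies the rounded numbers from the summary(m1) output, so the result may differ from the printed one in the last digits; the variable names are illustrative.

```r
# Values copied (rounded) from the TotExp row of summary(m1)
estimate  <- 6.297e-05
std_error <- 7.795e-06
df_resid  <- 188   # residual degrees of freedom

# t statistic: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 188 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 2)
p_value < 0.05
```

A small p-value here says only that the slope is unlikely to be exactly zero; it says nothing about how well the line fits.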
F-statistic and standard error: The F-statistic tests whether the model fits better than an intercept-only model. Here it is 65.26 on 1 and 188 degrees of freedom with a p-value of 7.714e-14, so the relationship between TotExp and LifeExp is statistically significant. The F-statistic is not a percentage and does not by itself measure how strong the fit is. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. Since the response is life expectancy in years and the observed values range only from 40 to 83, a residual standard error of 9.371 years is relatively high.
R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. Both R\(^2\) and adjusted R\(^2\) are relatively low in this model, only approximately 25%, but other diagnostics should be evaluated before the model is discarded for its low R\(^2\) value.
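The R\(^2\) and F-statistic reported by summary(m1) are linked to the correlation computed earlier. The sketch below reproduces both from the printed correlation, assuming n = 190 countries (188 residual degrees of freedom plus 2 estimated coefficients); the variable names are illustrative.

```r
# Values copied from the output above
r <- 0.5076339   # cor(who$LifeExp, who$TotExp)
n <- 190         # 188 residual df + 2 estimated coefficients
k <- 1           # number of predictors

# In simple linear regression, R^2 is the squared correlation
r_squared <- r^2

# Overall F statistic written in terms of R^2
f_stat <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# p-value of the F test (upper tail of the F distribution)
f_p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(r_squared, 4)
round(f_stat, 2)
```

Because F is a function of R\(^2\) and the sample size, a model can have a highly significant F-statistic and still explain only a quarter of the variation.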
Assumptions of simple linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: the relationship does not appear linear. This should be judged from the scatterplot with the fitted line above rather than from the histogram of LifeExp.
(2) nearly normal residuals: based on the Q-Q plot below, there are many discrepancies between the reference line and the points formed by the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
(3) constant variability: most of the life expectancy data are concentrated in the low range of expenditure.
plot(fitted(m1),resid(m1))
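The two diagnostic plots used in this section can be wrapped in a small helper so that the same checks are applied to every model. This is a sketch only: `check_assumptions` is a hypothetical function name, not something defined in the original analysis or in base R.

```r
# Hypothetical helper: draws the residuals-vs-fitted plot and the normal
# Q-Q plot for any fitted lm object, and returns the residuals invisibly.
check_assumptions <- function(model) {
  res <- resid(model)
  fit <- fitted(model)

  old_par <- par(mfrow = c(1, 2))   # two plots side by side
  on.exit(par(old_par))             # restore the previous layout on exit

  # Residuals vs fitted: curvature suggests non-linearity,
  # a funnel shape suggests non-constant variance
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot: points should follow the reference line
  qqnorm(res)
  qqline(res)

  invisible(res)
}
```

Calling `check_assumptions(m1)` would then draw both plots for the first model in a single figure.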
2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better”?
who$LifeExp4.6 <- who$LifeExp^4.6
who$TotExp0.06 <- who$TotExp^0.06
cor(who$LifeExp4.6,who$TotExp0.06)
## [1] 0.8542642
m2 <- lm(LifeExp4.6 ~ TotExp0.06, data = who)
summary(m2)
##
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(who$TotExp0.06,who$LifeExp4.6,main = 'Regression Model: Life Expectancy^4.6 ~ Expenditures^0.06',xlab = 'Predictor: personal and government expenditures raised to 0.06 power',ylab = 'avg life expectancy (yr) raised to 4.6 power')
abline(m2)
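The question describes TotExp^.06 as nearly a log transform. A rough way to see this is to compare the two transforms over the observed expenditure range. The sketch below uses an illustrative grid between the minimum (13) and maximum (482750) of TotExp from summary(who), not the real data, and a hypothetical helper `box_cox` for the scaled power transform.

```r
# Illustrative grid over the observed TotExp range, evenly spaced on the
# log scale. This is NOT the real data.
x <- exp(seq(log(13), log(482750), length.out = 200))
log_x <- log(x)

# Correlation between the 0.06 power transform and the log transform
cor_power_log <- cor(x^0.06, log_x)
cor_power_log

# Hypothetical helper: scaled power (Box-Cox) transform for lambda != 0
box_cox <- function(x, lambda) (x^lambda - 1) / lambda

# As lambda shrinks toward 0 the scaled power transform approaches log(x)
max_gap <- max(abs(box_cox(x, 1e-6) - log_x))
max_gap
```

A correlation close to 1 means that, over this range, regressing on TotExp^.06 behaves much like regressing on log(TotExp).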
p-values: The p-values of the intercept and TotExp0.06 are both below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected and the linear model is statistically significant (also indicated by the stars next to the p-values).
F-statistic and standard error: The F-statistic tests whether the model fits better than an intercept-only model. It is 507.7 on 1 and 188 degrees of freedom, far larger than the 65.26 of the first model, with a p-value below 2.2e-16. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. Although 90,490,000 looks large, it is measured in units of LifeExp^4.6 rather than years, so it cannot be compared directly with the 9.371 of the first model. It has to be judged against the scale of the transformed response, whose values run into the hundreds of millions.
R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. Both increase to roughly 73%, which is considerably higher than in the previous model.
Assumptions of simple linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: the relationship looks linear in the scatterplot above, with a correlation coefficient of 0.85.
(2) nearly normal residuals: based on the Q-Q plot below, the points mostly follow the reference line, with some discrepancies at both ends, especially the lower end.
qqnorm(m2$residuals)
qqline(m2$residuals) # adds diagonal line to the normal prob plot
(3) constant variability: the points are scattered randomly, with some empty areas.
plot(fitted(m2),resid(m2))
Overall, this model is much better than the first one.
3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
The equation from the previous question is: LifeExp4.6 = -736527910 + 620060216 \(\times\) TotExp0.06
#when TotExp^.06 =1.5
TE0.06 <- 1.5
LE4.6 <- -736527910 + 620060216*TE0.06
LE4.6^(1/4.6)
## [1] 63.31153
#when TotExp^.06 =2.5
TE0.06 <- 2.5
LE4.6 <- -736527910 + 620060216*TE0.06
LE4.6^(1/4.6)
## [1] 86.50645
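The two forecasts can also be produced by one vectorized function, which makes the back-transformation step explicit. The sketch below reuses the coefficients printed by summary(m2); `forecast_life_exp` and the other names are hypothetical. It also checks whether each input lies inside the range of TotExp^.06 actually observed, using the maximum TotExp of 482750 from summary(who).

```r
# Coefficients copied from summary(m2)
b0 <- -736527910
b1 <-  620060216

# Hypothetical helper: forecast LifeExp (in years) from a value of TotExp^0.06
forecast_life_exp <- function(tot_exp_006) {
  life_exp_46 <- b0 + b1 * tot_exp_006   # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                  # undo the 4.6 power
}

inputs    <- c(1.5, 2.5)
forecasts <- forecast_life_exp(inputs)
forecasts

# Raw total expenditure implied by each transformed value
implied_tot_exp <- inputs^(1 / 0.06)
implied_tot_exp

# Largest transformed value observed in the data (max TotExp is 482750)
max_observed_006 <- 482750^0.06
inputs > max_observed_006   # TRUE means the forecast is an extrapolation
```

If the second input turns out to lie beyond the observed range, the corresponding forecast is an extrapolation and should be read with more caution than the first.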
4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\(\text{LifeExp} = b_0 + b_1 \times \text{PropMD} + b_2 \times \text{TotExp} + b_3 \times \text{PropMD} \times \text{TotExp}\)
m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
summary(m3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
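In an R formula, `PropMD * TotExp` already expands to both main effects plus their interaction, so the formula used for m3 contains redundant terms. The sketch below compares the term labels R derives from the long and the short form; no data is needed, and the variable names `labels_long` and `labels_short` are illustrative.

```r
# Term labels R derives from each formula (symbolic only, no data required)
labels_long  <- attr(terms(LifeExp ~ PropMD + TotExp + PropMD * TotExp),
                     "term.labels")
labels_short <- attr(terms(LifeExp ~ PropMD * TotExp), "term.labels")

labels_long
labels_short

# Both formulas describe the same model
identical(labels_long, labels_short)
```

This is also why the coefficient table lists the interaction as `PropMD:TotExp`: the `:` operator denotes the interaction term alone.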
p-values: The p-values of all coefficients and of the whole model are below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected and the linear model is statistically significant (also indicated by the stars next to the p-values).
F-statistic and standard error: The F-statistic is 34.49 on 3 and 186 degrees of freedom with a p-value below 2.2e-16, so the model as a whole is statistically significant. Its value is lower than the 65.26 of the first model, but the two are not directly comparable because this model has three terms instead of one. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. At 8.765 years it is slightly lower than the 9.371 of the first model but still high for life expectancy.
R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. Both R\(^2\) (approximately 35.7%) and adjusted R\(^2\) (approximately 34.7%) are higher than in model 1 but still relatively low; other diagnostics should be evaluated before the model is discarded for its low R\(^2\) value.
Assumptions of linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: the relationship does not appear linear. The distribution of LifeExp alone cannot show this; the plot of residuals against fitted values is the relevant check.
(2) nearly normal residuals: based on the Q-Q plot below, there are many discrepancies between the reference line and the points formed by the residuals.
qqnorm(m3$residuals)
qqline(m3$residuals) # adds diagonal line to the normal prob plot
(3) constant variability: most of the life expectancy data are concentrated in the low range of expenditure.
plot(fitted(m3),resid(m3))
5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
PropMD <- 0.03
TotExp <- 14
m3$coefficients[1] + m3$coefficients[2]*PropMD + m3$coefficients[3]*TotExp + m3$coefficients[4]*PropMD*TotExp
## (Intercept)
## 107.696
Life expectancy is a statistical measure of the average time an organism is expected to live; here it is an average for each country. A forecast of 107.70 years is unrealistically high: the highest life expectancy in the dataset is 83 years.
Moreover, a PropMD of 0.03 is close to the highest value in the dataset (0.0351), while a TotExp of 14 is close to the lowest (13). This combination of inputs is itself unrealistic.
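Splitting the forecast into the contribution of each term shows where the 107.7 years come from. The sketch below uses the rounded coefficients printed by summary(m3), so its total differs slightly from the value computed from the unrounded model; the vector names `b` and `contrib` are illustrative.

```r
# Coefficients copied (rounded) from summary(m3)
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)

round(contrib, 3)
sum(contrib)   # close to 107.696; the small gap comes from the rounded coefficients
```

Almost all of the excess over the intercept comes from the PropMD term (1497 times 0.03), because 0.03 lies far above the typical PropMD values in the data, whose median is about 0.001.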