library(tidyverse)  # provides read_csv(), %>% and ggplot()

data <- read_csv("who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
summary(data)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
Question 1
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
data %>%
  ggplot(aes(x = TotExp, y = LifeExp)) +  # predictor on x, response on y, matching LifeExp ~ TotExp
  geom_point(position = "jitter")
model1 <- lm(LifeExp ~ TotExp, data = data)
summary(model1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, so the model as a whole is significant. The p-value for TotExp is likewise very close to zero, so the predictor is significant. The residual standard error is 9.371, which is higher than we’d like, but is less than one standard deviation of the response variable. The adjusted R^2 tells us that this model accounts for only 25.37% of the total variance of the response variable, which means we can definitely improve this model.
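As a quick consistency check on the summary output, a simple regression with one predictor has an F-statistic equal to the square of the slope's t value, and R^2 can be recovered from F and the residual degrees of freedom. The sketch below only re-uses the rounded numbers printed above; the variable names are illustrative.

```r
# Values copied from the printed summary(model1) output (rounded).
t_slope  <- 8.079   # t value for TotExp
df_resid <- 188     # residual degrees of freedom

# With a single predictor, F = t^2.
f_stat <- t_slope^2

# R^2 = F / (F + residual df) when there is one predictor.
r_squared <- f_stat / (f_stat + df_resid)

print(c(F = f_stat, R2 = r_squared))
```

Both values agree with the reported 65.26 and 0.2577 up to rounding of the t value.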
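Based on the scatter plot, the assumptions of simple linear regression are not met: the relationship between TotExp and LifeExp does not appear linear. The residual summary points the same way, since the residuals are skewed, with a median of 3.154 and a minimum of -24.764 against a maximum of 13.292.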
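The scatter plot alone is a rough check. A minimal sketch of residual diagnostics is given below, assuming a fitted `lm` object is available; `residual_checks` is a hypothetical helper name introduced here, not something from the original analysis.

```r
# Sketch: basic residual diagnostics for a fitted lm object.
# `residual_checks` is a hypothetical helper, not part of any package.
residual_checks <- function(model) {
  res <- resid(model)
  fit <- fitted(model)

  op <- par(mfrow = c(1, 2))
  on.exit(par(op))

  # Residuals vs fitted: curvature or a funnel shape suggests
  # nonlinearity or non-constant variance.
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot: points far from the line suggest non-normal residuals.
  qqnorm(res)
  qqline(res)

  invisible(data.frame(fitted = fit, residual = res))
}

# Usage with the model fitted above (runs only if `model1` exists).
if (exists("model1")) residual_checks(model1)
```

Base R's `plot(model1)` produces a similar set of diagnostic plots without a helper.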
Question 2
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
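data$LifeExp_t <- data$LifeExp^4.6
# Note: the question asks for TotExp^0.06; the exponent used here is 0.6,
# and the output below was produced with 0.6.
data$TotExp_t <- data$TotExp^0.6
data %>%
  ggplot(aes(x = TotExp_t, y = LifeExp_t)) +  # predictor on x, response on y
  geom_point(position = "jitter")
model2 <- lm(LifeExp_t ~ TotExp_t, data = data)
summary(model2)
##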
## Call:
## lm(formula = LifeExp_t ~ TotExp_t, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257351739 -82599957 14030425 93896945 237720335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211907647 10234512 20.70 <2e-16 ***
## TotExp_t 238461 15021 15.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5705
## F-statistic: 252 on 1 and 188 DF, p-value: < 2.2e-16
The second model is significantly “better” than model 1. For the same degrees of freedom, its F-statistic (252) is several times larger than model 1's (65.26). The p-value for TotExp_t is even closer to zero. The R^2 value tells us that this model accounts for 57% of the total variance of the response variable, compared with about 26% for model 1.
The residual standard error (113,800,000) looks far larger, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with model 1's 9.371.
Question 3
Using the results from Question 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
predict(model2, data.frame(TotExp_t = 1.5))
## 1
## 212265338
predict(model2, data.frame(TotExp_t = 2.5))
## 1
## 212503799
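These predictions are on the transformed scale (LifeExp^4.6), not in years. To read them as life expectancy they need to be raised to the power 1/4.6. A minimal sketch follows, using the two values printed above; `back_transform` is a hypothetical helper name.

```r
# Sketch: convert predictions of LifeExp^4.6 back to years.
# `back_transform` is a hypothetical helper, not part of the original analysis.
back_transform <- function(pred_t, power = 4.6) {
  pred_t^(1 / power)
}

# Predictions copied from the predict() output above.
preds_t <- c(212265338, 212503799)

life_exp_years <- back_transform(preds_t)
print(round(life_exp_years, 2))
```

Both forecasts come out at roughly 64.6 years. The same result can be obtained directly with `predict(model2, ...)^(1/4.6)` when `model2` is in the workspace.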
Question 4
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
multi_model <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)
summary(multi_model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic is large (34.49 on 3 and 186 degrees of freedom, p-value < 2.2e-16), which tells us that the independent variables jointly have predictive power. Each term has a small associated p-value, suggesting that each is individually significant in the model; TotExp has the smallest p-value. R^2 is lower than in model2: this model accounts for only about 35% of the total variance of the response variable. The residual standard error (8.765) is much smaller again because the response is no longer transformed, and it is slightly below model1's 9.371.
Question 5
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
predict(multi_model, data.frame(PropMD = .03, TotExp = 14))
## 1
## 107.696
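This forecast is not realistic. A life expectancy of about 108 years is well above the maximum of 83 observed in the data. The inputs are also an unusual combination: PropMD = 0.03 is close to the largest value in the data (0.0351), while TotExp = 14 is close to the smallest (13), so the model is extrapolating to a region with little or no data.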
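To see where the forecast comes from, the prediction can be rebuilt term by term from the coefficient table. The sketch below uses the rounded coefficients printed in the summary, so the total differs slightly from the 107.696 returned by `predict()`; the variable names are illustrative.

```r
# Coefficients copied from the printed summary(multi_model) output (rounded).
coefs <- c(intercept   = 6.277e+01,
           PropMD      = 1.497e+03,
           TotExp      = 7.233e-05,
           interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term: coefficient times its regressor value.
contrib <- coefs * c(1, prop_md, tot_exp, prop_md * tot_exp)

print(round(contrib, 3))
print(sum(contrib))
```

Almost all of the increase over the intercept comes from the PropMD term (1497 x 0.03 = 44.91 years), which shows how strongly the forecast depends on extrapolating the PropMD slope to a very large value.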