data <- read_csv("who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
summary(data)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750

Question 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

data %>%
  ggplot(aes(x=LifeExp, y=TotExp)) +
  geom_point(position="jitter")

model1 <- lm(LifeExp~TotExp, data=data)
summary(model1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic is 65.26 with 1 degree of freedom on the coefficients and 188 degrees of freedom on the residuals. The p-value associated with TotExp is very close to zero, and therefore significant. The standard error is higher than we’d like, but is also less than 1 standard deviation of the response variable. The adjusted R2 tells us that this model accounts for only 25.37% of the total variance of the response variable, which means we can definitly improve this model.

Based on the scatter plot, we can see that the assumptions for simple linear regression are not met.

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

data$LifeExp_t <- data$LifeExp^4.6
data$TotExp_t <- data$TotExp^0.6
data %>%
  ggplot(aes(x=LifeExp_t, y=TotExp_t)) +
  geom_point(position="jitter")

model2 <- lm(LifeExp_t ~ TotExp_t, data=data)
summary(model2)
## 
## Call:
## lm(formula = LifeExp_t ~ TotExp_t, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -257351739  -82599957   14030425   93896945  237720335 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 211907647   10234512   20.70   <2e-16 ***
## TotExp_t       238461      15021   15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5705 
## F-statistic:   252 on 1 and 188 DF,  p-value: < 2.2e-16

The second model is significanly “better” than model 1. We can see that for the same degrees of freedom, the second model’s F-statistic is multiples greater than model 1. TotExp’s p-value is even more close to zero. The R2 value tells us that this model accounts for 57% of the total variance of the response variable.

The only concerning thing is that the residuals look to be significantly larger, which is likely due to the transformations we made.

Question 3

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

predict(model2,data.frame(TotExp_t=1.5))
##         1 
## 212265338
predict(model2,data.frame(TotExp_t=2.5))
##         1 
## 212503799

Question 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

multi_model <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data=data)
summary(multi_model)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is large and tells us that the independent variables have predictrive power. Each of the independent variables have small associatred p-values, suggesting they are all indivudally important and significant in our model. TotExp has the smallest p-value. R2 has decreased since model2, as this model accounts for only 35% of the total variacne of the response variable. The standard error is much smaller again as we are no longer using transformed values.

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

predict(multi_model,data.frame(PropMD=.03, TotExp=14))
##       1 
## 107.696

This does not seem likely, as most people do not live to be over 100 years old.