HW 12 - Linear Regression

Life Expectancy Dataset

who_df <- read_csv("http://my-cunymsds.com/data605/data/who.csv")

EDA of Dataset

General stats of data set

summary(who_df)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750

1- Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Scatter Plot

ggplot(who_df, aes(x=TotExp, y=LifeExp)) + 
  geom_point(alpha=0.3) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Simple Linear Regression

model_01 <- lm(LifeExp ~ TotExp, data=who_df)
summary(model_01)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Summary of findings.

With an R2 less than 0.26 the model is very far from a good representation of the relationship between the variables. While the predictor is statistically significant and F-stastitic is also statistically significant which measure the significance of the overall model.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Scatter Plot

ggplot(who_df, aes(x=TotExp^0.06, y=LifeExp^4.6)) + 
  geom_point(alpha=0.3) +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

Linear Regression

model_02 <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data=who_df)
summary(model_02)
## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Model Plots

plot(model_02)

Reverse de transformation for better viewing

model_02_preds <- as_tibble(model_02$fitted.values^(1/4.6))
model_02_preds <- rename(model_02_preds, predicted= value)
model_02_preds$actual <- who_df$LifeExp
ggplot(model_02_preds, aes(x=actual, y=predicted)) + 
  geom_point(alpha=0.3) +
  geom_smooth()

Summary of findings

With an R2 of 0.72 this is a huge improvement over the previous model without the transformations. That was clear when we plotted the scatterplot, that the regression algorithm was going to have an easier time finding a linear relatonship between the variables.

3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

b0 <- model_02$coefficients[1]
b1 <- model_02$coefficients[2]
Pred1 <- b0 + b1*1.5
Pred2 <- b0 + b1*2.5

# To makes sense of the prediction LifeExp^4.6, we will invert by raising the model prediction pred^(1/4.6)

print(paste0("First prediction when TotExp^.06=1.5 -> ", as.character(round(Pred1^(1/4.6),2))))
## [1] "First prediction when TotExp^.06=1.5 -> 63.31"
print(paste0("Second prediction when TotExp^.06=2.5 -> ", as.character(round(Pred2^(1/4.6),2))))
## [1] "Second prediction when TotExp^.06=2.5 -> 86.51"

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

model_03 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data=who_df)
summary(model_03)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
plot(model_03)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

model_03_preds <- as_tibble(model_03$fitted.values)
model_03_preds <- rename(model_03_preds, predicted= value)
model_03_preds$actual <- who_df$LifeExp
ggplot(model_03_preds,aes(x=actual, y=predicted)) + 
  geom_point(alpha=0.3) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Summary of findings

Although a little better (Measured by the R2) than the simple one, it still has a very low R2 of 0.36. Definitely the model requires transformation to reach better levels. While all statistics are significant, still is not as good as the second model where we transformed the variables.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

b0 <- model_03$coefficients[1]
b1 <- model_03$coefficients[2]
b2 <- model_03$coefficients[3]
b12 <- model_03$coefficients[4]

Pred1 <- b0 + b1*0.03 + b2*14 + b12*(0.03*14)

print(paste0("Prediction when PropMd=0.03 TotExp=14 -> ", as.character(round(Pred1,2))))
## [1] "Prediction when PropMd=0.03 TotExp=14 -> 107.7"

Summary of findings

The predicted life expectancy was close to 108 years. This is not realistic since is much higher than the highest life expectancy observed in reality. This is due becuase we are using a high number for PropMD. Linear regression is very sensible to outliers, thus the best fitting line for most values, may be very off for the extreme ones.