library(tidyverse)  # provides read_csv(), glimpse(), and ggplot()

wh <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/who.csv")
glimpse(wh)
## Rows: 190
## Columns: 11
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola…
## $ LifeExp        <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,…
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0…
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0…
## $ PersExp        <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1…
## $ GovtExp        <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761…
## $ TotExp         <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907…
## $ ...11          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Problem 1

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
ggplot(data=wh,mapping=aes(TotExp,LifeExp)) + geom_point()

mod1 <- lm(LifeExp~TotExp, data=wh)
summary(mod1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = wh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Evaluating Linear Regression Assumptions

plot(mod1)

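F statistics, \(R^2\), standard error, and p-values:

The F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, so the model fits significantly better than an intercept-only model. The \(R^2\) of 0.2577 means total expenditure explains only about 26% of the variation in life expectancy. The residual standard error is 9.371, so observed life expectancies typically fall about 9 years from the fitted line. The slope for TotExp has a standard error of 7.795e-06 against an estimate of 6.297e-05 (t = 8.079, p = 7.71e-14), so the slope is statistically distinguishable from zero even though the model explains little of the variance.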
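The same quantities can be pulled from the summary object rather than read off the printout. The following is a minimal sketch, not the analysis itself: it uses simulated stand-in data (an assumption, so that it runs without downloading the WHO file), and the names `x`, `y`, `fit`, and `stats` are hypothetical. With the real model, `fit` would be replaced by `mod1` and `"x"` by `"TotExp"`.

```r
# Simulated stand-in data (assumption): NOT the WHO data set
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
s <- summary(fit)

f <- s$fstatistic  # named vector: value, numdf, dendf
stats <- c(
  f_value   = unname(f["value"]),
  r_squared = s$r.squared,
  resid_se  = s$sigma,                               # residual standard error
  slope_se  = s$coefficients["x", "Std. Error"],     # standard error of the slope
  # overall p-value: upper tail of the F distribution
  p_value   = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
)
print(stats)
```

In a simple regression with one predictor, the F-statistic equals the square of the slope's t value, which is why the overall p-value and the slope p-value agree in the output above.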

The main assumptions of linear regression are:

  1. Linearity: the relationship between X and Y must be linear. The scatterplot above shows that LifeExp vs TotExp is not linear, so this condition is not satisfied.
  2. Homoscedasticity: the residuals should have constant variance. In the Residuals vs Fitted plot above, the variance does not appear constant, so this condition is not satisfied.
  3. Normality: the residuals should be approximately normally distributed. The residual summary above (median 3.154, minimum -24.764, maximum 13.292) points to a left-skewed distribution, so this condition is not satisfied either.

Since the linearity, homoscedasticity, and normality conditions are not satisfied, we conclude that the assumptions of linear regression are not met.
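The visual checks above can be complemented with rough numerical ones. The sketch below is illustrative only: the data are simulated (an assumption) to mimic a curved relationship with growing spread, the names `fit`, `res`, `het_p`, `curv_p`, and `checks` are hypothetical, and the auxiliary regression on squared residuals is a simple stand-in for a formal heteroscedasticity test, not a replacement for one.

```r
# Simulated stand-in data (assumption): curved trend, noise that grows with x
set.seed(2)
x <- runif(150, 1, 100)
y <- 10 * log(x) + rnorm(150, sd = 0.05 * x)
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the fitted values;
# a significant slope suggests the spread changes with the fitted value
aux <- summary(lm(I(res^2) ~ fitted(fit)))
het_p <- aux$coefficients[2, "Pr(>|t|)"]

# Linearity: if the straight line were adequate, a quadratic term would add little
curv_p <- anova(fit, lm(y ~ x + I(x^2)))[2, "Pr(>F)"]

checks <- c(shapiro_p = sw$p.value, het_p = het_p, curvature_p = curv_p)
print(checks)
```

Small p-values here point in the same direction as the diagnostic plots; they do not replace looking at the plots.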

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
# Power transforms (the SQ suffix is just a label; these are not squares)
wh$LifeExpSQ <- wh$LifeExp^4.6
wh$TotExpSQ <- wh$TotExp^0.06
head(wh)
## # A tibble: 6 × 13
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 5 more variables: GovtExp <dbl>, TotExp <dbl>, ...11 <lgl>,
## #   LifeExpSQ <dbl>, TotExpSQ <dbl>

Problem 2

ggplot(data=wh,mapping=aes(TotExpSQ,LifeExpSQ)) + geom_point()

mod2 <- lm(LifeExpSQ~TotExpSQ, data=wh)
summary(mod2)
## 
## Call:
## lm(formula = LifeExpSQ ~ TotExpSQ, data = wh)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpSQ     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Evaluating Linear Regression Assumptions

plot(mod2)

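F statistics, \(R^2\), standard error, and p-values:

The F-statistic is 507.7 on 1 and 188 degrees of freedom with a p-value below 2.2e-16, and the slope for TotExpSQ is likewise highly significant (t = 22.53). The \(R^2\) is 0.7298, so the transformed model explains about 73% of the variation in LifeExp^4.6. The residual standard error of 90490000 is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 from model 1.

Model 2 is better than model 1: its \(R^2\) is much higher, at about 73% versus about 26%, and the untransformed model (model 1) failed all of the linear regression assumptions.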
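Because the two residual standard errors are on different scales, one way to compare the models on equal footing is to back-transform the second model's fitted values to years and compute an error in the original units. This is a sketch under stated assumptions: the data are simulated so that the power-transformed relationship holds by construction, and `spend`, `life`, `m_raw`, `m_pow`, `rmse`, and `cmp` are hypothetical names, not objects from the analysis above.

```r
# Simulated stand-in data (assumption): life^4.6 is linear in spend^0.06
set.seed(3)
spend <- exp(runif(200, log(100), log(1e5)))
life  <- (-5e8 + 6e8 * spend^0.06 + rnorm(200, sd = 3e7))^(1 / 4.6)
d <- data.frame(life = life, spend = spend)

m_raw <- lm(life ~ spend, data = d)                  # untransformed model
m_pow <- lm(I(life^4.6) ~ I(spend^0.06), data = d)   # power-transformed model

# Root mean squared error in the original units (years)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Back-transform the second model's fitted values before comparing
cmp <- c(raw = rmse(d$life, fitted(m_raw)),
         transformed = rmse(d$life, fitted(m_pow)^(1 / 4.6)))
print(cmp)
```

With the WHO data the same comparison would use `mod1`, `mod2`, and `wh$LifeExp`; the back-transformation is only valid where the fitted values of the transformed model are positive.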

Problem 3

Using the results from Problem 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

Predictions for TotExp^.06 = 1.5 and TotExp^.06 = 2.5

# Coefficients are taken from summary(mod2) above
# TotExp^.06 = 1.5
TotExpSQ <- 1.5
LifeExpSQ <- -736527910 + TotExpSQ * 620060216
LifeExp <- LifeExpSQ^(1 / 4.6)  # undo the 4.6 power transform
LifeExp
## [1] 63.31153
# TotExp^.06 = 2.5
TotExpSQ <- 2.5
LifeExpSQ <- -736527910 + TotExpSQ * 620060216
LifeExp <- LifeExpSQ^(1 / 4.6)  # undo the 4.6 power transform
LifeExp
## [1] 86.50645
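Written out, the forecast inverts the power transform applied to the response. Using the rounded coefficients printed in the summary of mod2, and writing \(x\) for TotExp^.06 (a symbol introduced here only for convenience):

\[
\widehat{\text{LifeExp}} = \left(-736527910 + 620060216\,x\right)^{1/4.6}
\]

\[
x = 1.5:\ \ (193562414)^{1/4.6} \approx 63.31
\qquad\qquad
x = 2.5:\ \ (813622630)^{1/4.6} \approx 86.51
\]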

Problem 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

mod3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = wh)
summary(mod3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = wh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
plot(mod3)

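F statistics, \(R^2\), standard error, and p-values:

The F-statistic is 34.49 on 3 and 186 degrees of freedom with a p-value below 2.2e-16, so the model as a whole is significant, and each coefficient, including the PropMD:TotExp interaction (p = 6.35e-05), is individually significant. The \(R^2\) is 0.3574 (adjusted 0.3471), so the model explains only about 36% of the variation in life expectancy, a modest gain over the 0.2577 of model 1. The residual standard error is 8.765 years on 186 degrees of freedom. The residuals remain skewed (median 2.098, minimum -27.320), so the model fits poorly despite its significant terms.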
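The negative interaction coefficient is easier to read as a marginal effect. Taking the rounded estimates from the summary of mod3, with \(b_1\) and \(b_3\) denoting the PropMD and PropMD:TotExp coefficients as in the model statement, the fitted effect of PropMD depends on TotExp:

\[
\frac{\partial\,\widehat{\text{LifeExp}}}{\partial\,\text{PropMD}} = b_1 + b_3\,\text{TotExp} \approx 1497 - 0.006026\,\text{TotExp}
\]

Under these estimates the effect shrinks as expenditure rises and would reach zero at \(\text{TotExp} \approx 1497 / 0.006026 \approx 2.5 \times 10^5\).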

Problem 5

Forecast LifeExp when PropMD=0.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD <- 0.03
TotExp <- 14

# Coefficients are taken from summary(mod3) above
LifeExp <- 6.277e+01 + (1.497e+03 * PropMD) + (7.233e-05 * TotExp) - (6.026e-03 * PropMD * TotExp)
max_life <- max(wh$LifeExp)
rng <- range(wh$LifeExp)

cat("The prediction is ",LifeExp)
## The prediction is  107.6785
cat("The maximum life expectancy is ", max_life, " and the range of life expectancy is ", rng, " so the prediction is not realistic, as it falls outside the range of the data")
## The maximum life expectancy is  83  and the range of life expectancy is  40 83  so the prediction is not realistic, as it falls outside the range of the data
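Breaking the forecast into its terms shows where the unrealistic value comes from. With the rounded coefficients from the summary of mod3:

\[
62.77 + 1497(0.03) + 7.233\times10^{-5}(14) - 6.026\times10^{-3}(0.03)(14)
= 62.77 + 44.91 + 0.00101 - 0.00253 \approx 107.68
\]

Almost the entire increase over the intercept, about 44.9 years, comes from the PropMD term alone; the TotExp and interaction terms are negligible at TotExp = 14.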