library(tidyverse)  # provides read_csv(), glimpse(), and ggplot()
wh <- read_csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/who.csv")
glimpse(wh)
## Rows: 190
## Columns: 11
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola…
## $ LifeExp <dbl> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,…
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0…
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0…
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0…
## $ PersExp <dbl> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1…
## $ GovtExp <dbl> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761…
## $ TotExp <dbl> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907…
## $ ...11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Problem 1
ggplot(data=wh,mapping=aes(TotExp,LifeExp)) + geom_point()
mod1 <- lm(LifeExp~TotExp, data=wh)
summary(mod1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = wh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Evaluating Linear Regression Assumptions
plot(mod1)
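F statistics, \(R^2\), standard error, and p-values:
The F-statistic of 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, indicates that TotExp is a statistically significant predictor of LifeExp. However, \(R^2\) is only 0.2577, so the model explains about 26% of the variation in life expectancy, and the residual standard error of 9.371 means a typical prediction misses by roughly 9 years.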
The main assumptions of linear regression are:
Linearity: The relationship between X and Y must be linear. The scatter plot above shows that LifeExp vs TotExp is not linear, so this condition is not satisfied.
Homoscedasticity: The residuals should have constant variance. The Residuals vs Fitted plot above does not show constant variance, so this condition is not satisfied.
Normality: The residuals should be normally distributed. The Q-Q plot above shows that the residuals do not follow a normal distribution.
Independence: The observations should be independent of each other. This is difficult to determine from the data alone, so we may have to rely on information from the data collector.
Since the linearity, homoscedasticity, and normality conditions are not satisfied, we conclude that the assumptions for linear regression are not met.
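The visual checks can be complemented with numeric ones. The sketch below is only an illustration: it uses simulated data (not the WHO dataset), and the names `fit`, `fit_quad`, `sw`, and `curv` are assumptions introduced here. It runs a Shapiro-Wilk test on the residuals and tests whether adding a quadratic term improves the fit, which is one simple way to detect curvature.

```r
# Illustrative data only: simulated, not the WHO dataset.
set.seed(42)
x <- runif(200, min = 1, max = 100)
y <- 40 + 8 * log(x) + rnorm(200, sd = 3)  # deliberately curved relationship

fit <- lm(y ~ x)

# Normality: Shapiro-Wilk test on the residuals.
# A small p-value suggests the residuals are not normally distributed.
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# Linearity: compare the straight-line fit with one that adds x^2.
# A small p-value for the added term suggests the relationship is curved.
fit_quad <- lm(y ~ x + I(x^2))
curv <- anova(fit, fit_quad)
print(curv[2, "Pr(>F)"])
```

The same two calls could be applied to `mod1` in place of `fit`, with `TotExp` in place of `x`.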
# Power transforms: LifeExp^4.6 and TotExp^0.06 (the SQ suffix does not mean squared)
wh$LifeExpSQ <- wh$LifeExp^4.6
wh$TotExpSQ <- wh$TotExp^0.06
head(wh)
## # A tibble: 6 × 13
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 5 more variables: GovtExp <dbl>, TotExp <dbl>, ...11 <lgl>,
## # LifeExpSQ <dbl>, TotExpSQ <dbl>
Problem 2
ggplot(data=wh,mapping=aes(TotExpSQ,LifeExpSQ)) + geom_point()
mod2 <- lm(LifeExpSQ~TotExpSQ, data=wh)
summary(mod2)
##
## Call:
## lm(formula = LifeExpSQ ~ TotExpSQ, data = wh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpSQ 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Evaluating Linear Regression Assumptions
plot(mod2)
F statistics, \(R^2\), standard error, and p-values:
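Model 2 fits better than model 1: its \(R^2\) is about 0.73, compared with about 0.26 for model 1, and its F-statistic is 507.7 compared with 65.26. Model 1, fitted to the untransformed variables, also failed the linearity, homoscedasticity, and normality checks. The residual standard errors of the two models are on different scales (LifeExp^4.6 versus LifeExp), so they cannot be compared directly.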
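The F-statistic and \(R^2\) in each summary are two views of the same fit: \(R^2 = F \cdot df_1 / (F \cdot df_1 + df_2)\). As a quick consistency check, the sketch below recomputes \(R^2\) and adjusted \(R^2\) from the F-statistics and degrees of freedom printed in the two summaries. The helper names `r2_from_f` and `adj_r2` are introduced here for illustration and are not part of the original analysis.

```r
# R-squared from an F-statistic and its degrees of freedom.
r2_from_f <- function(f, df1, df2) {
  (f * df1) / (f * df1 + df2)
}

# Adjusted R-squared, using n - 1 = df1 + df2 and residual df = df2.
adj_r2 <- function(r2, df1, df2) {
  1 - (1 - r2) * (df1 + df2) / df2
}

# F-statistics and degrees of freedom copied from the summaries above.
r2_mod1 <- r2_from_f(65.26, 1, 188)
r2_mod2 <- r2_from_f(507.7, 1, 188)

print(round(c(model1 = r2_mod1, model2 = r2_mod2), 4))
print(round(c(model1 = adj_r2(r2_mod1, 1, 188),
              model2 = adj_r2(r2_mod2, 1, 188)), 4))
```

The results should agree with the reported values to about four decimal places, since the F-statistics are themselves rounded.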
Problem 3
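Using the results from Problem 2, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06} = 2.5\).
Predictions for \(TotExp^{0.06} = 1.5\) and \(TotExp^{0.06} = 2.5\). The model predicts \(LifeExp^{4.6}\), so each prediction is raised to the power 1/4.6 to return to years: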
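# TotExp^0.06 = 1.5
TotExpSQ <- 1.5
LifeExpSQ <- -736527910 + TotExpSQ * 620060216  # prediction on the LifeExp^4.6 scale
LifeExp <- LifeExpSQ^(1 / 4.6)                  # back-transform to years
LifeExp
## [1] 63.31153
# TotExp^0.06 = 2.5
TotExpSQ <- 2.5
LifeExpSQ <- -736527910 + TotExpSQ * 620060216  # prediction on the LifeExp^4.6 scale
LifeExp <- LifeExpSQ^(1 / 4.6)                  # back-transform to years
LifeExp
## [1] 86.50645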
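The two forecasts repeat the same steps, so they can be wrapped in one vectorized function. This is a minimal sketch, assuming the rounded coefficients printed in the model 2 summary; `predict_life_exp` is a hypothetical helper name, not something defined in the original analysis. It also inverts the 0.06 power to show which raw TotExp values the two inputs correspond to.

```r
# Coefficients copied from the model 2 summary (rounded as printed).
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: takes TotExp^0.06 and returns life expectancy in years.
# Returns NaN if the linear prediction is negative, because a negative
# number has no real 1/4.6 power.
predict_life_exp <- function(tot_exp_pow) {
  life_exp_pow <- b0 + b1 * tot_exp_pow  # prediction on the LifeExp^4.6 scale
  life_exp_pow^(1 / 4.6)                 # invert the 4.6 power
}

preds <- predict_life_exp(c(1.5, 2.5))
print(preds)

# Raw TotExp values corresponding to the two inputs (invert the 0.06 power).
print(c(1.5, 2.5)^(1 / 0.06))
```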
Problem 4
Build the following multiple regression model and interpret the F-statistic, \(R^2\), standard error, and p-values. How good is the model? \(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
mod3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = wh)
summary(mod3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = wh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
plot(mod3)
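F statistics, \(R^2\), standard error, and p-values:
The F-statistic of 34.49 on 3 and 186 degrees of freedom (p-value < 2.2e-16) indicates that the model as a whole is statistically significant, and all three terms have p-values below 0.001. However, \(R^2\) is 0.3574 (adjusted 0.3471), so the model explains only about 36% of the variation in LifeExp, and the residual standard error is 8.765 years. The model improves on model 1 but is still a weak fit.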
Problem 5
Forecast LifeExp when PropMD=0.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
PropMD=0.03
TotExp = 14
LifeExp = 6.277e+01 + (1.497e+03 * PropMD) + (7.233e-05 * TotExp) - (6.026e-03 * PropMD * TotExp)
max_life <- max(wh$LifeExp)
rng <- range(wh$LifeExp)
cat("The prediction is ",LifeExp)
## The prediction is 107.6785
cat("The maximum life expectancy is", max_life, "and the range of life expectancy is", rng, "so the prediction is not realistic: it falls outside the range of the data")
## The maximum life expectancy is 83 and the range of life expectancy is 40 83 so the prediction is not realistic: it falls outside the range of the data
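To see where the unrealistic forecast comes from, it helps to split the prediction into its terms. The sketch below assumes the rounded coefficients printed in the model 3 summary and the observed range of 40 to 83 reported above; `predict_mod3` and `observed_range` are hypothetical names introduced here. With PropMD = 0.03, the PropMD term alone contributes about 44.9 years on top of the intercept.

```r
# Coefficients copied from the model 3 summary (rounded as printed).
b <- c(intercept = 6.277e+01, prop_md = 1.497e+03,
       tot_exp = 7.233e-05, interaction = -6.026e-03)

# Hypothetical helper: evaluates the fitted model 3 equation.
predict_mod3 <- function(prop_md, tot_exp) {
  unname(b["intercept"] + b["prop_md"] * prop_md +
           b["tot_exp"] * tot_exp + b["interaction"] * prop_md * tot_exp)
}

pred <- predict_mod3(0.03, 14)
print(pred)

# Contribution of each non-intercept term to the prediction.
print(c(prop_md = unname(b["prop_md"] * 0.03),
        tot_exp = unname(b["tot_exp"] * 14),
        interaction = unname(b["interaction"] * 0.03 * 14)))

# Compare with the observed range of LifeExp reported above.
observed_range <- c(40, 83)
in_range <- pred >= observed_range[1] && pred <= observed_range[2]
print(in_range)
```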