Loading Data

who <- read.csv('https://raw.githubusercontent.com/cocodono/Data605HW12/main/who.csv')

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Initial Plot

plot(who$TotExp, who$LifeExp, xlab = 'Sum of Personal and Government Expenditures', ylab = 'Life Expectancy')

Model 1:

who_model_1 <- lm(formula = LifeExp ~ TotExp, who)

summary(who_model_1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic of 65.26 (on 1 and 188 degrees of freedom) is large, which is evidence that the model fits significantly better than an intercept-only model.

The p-value for this model is 7.714e-14, far below 0.05, indicating a statistically significant relationship between our independent variable (TotExp) and our dependent variable (LifeExp).
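As a quick check on the two statistics above, the overall p-value can be recovered from the F-statistic and its degrees of freedom with base R's `pf`. The sketch below hard-codes the values printed by `summary(who_model_1)`; the variable names are only illustrative. It also shows that, with a single predictor, F is the square of the slope's t-value.

```r
# Values copied from the summary(who_model_1) output above
f_stat  <- 65.26
df_num  <- 1
df_den  <- 188
t_slope <- 8.079

# Upper-tail probability of the F distribution gives the overall p-value
p_overall <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
print(p_overall)

# With one predictor, F equals t^2 (up to rounding in the printed output)
print(t_slope^2)
```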

The slope estimate is 6.297e-05 and its standard error is 7.795e-06, which gives a t-value of 8.079. At the 95% confidence level the threshold is roughly 1.96: if the absolute t-value is greater than that, the coefficient is generally considered statistically significant. Our t-value is well above 1.96, so we can deem the slope statistically significant.
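The t-value reasoning can be reproduced by hand. This is a minimal sketch that assumes the rounded estimate and standard error from the summary table, and it uses the exact t critical value for 188 degrees of freedom instead of the large-sample 1.96. Variable names are illustrative.

```r
# Rounded values copied from the coefficient table of who_model_1
estimate  <- 6.297e-05
std_error <- 7.795e-06
df_resid  <- 188

# t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# Exact two-sided 5% critical value for 188 degrees of freedom
t_crit <- qt(0.975, df_resid)

# Two-sided p-value for the slope
p_value <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

# Approximate 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * t_crit * std_error

print(c(t_value = t_value, t_crit = t_crit, p_value = p_value))
print(ci)
```

Because the inputs are rounded, the computed t-value may differ from the printed 8.079 in the last digit.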

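R^2 values range from 0 to 1, and the values here are: Multiple R-squared: 0.2577 and Adjusted R-squared: 0.2537. In other words, the model explains only about 26% of the variability in life expectancy, which suggests that it is not the best fit. The residual standard error of 9.371 means that observed life expectancies typically deviate from the fitted line by roughly 9.4 years.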
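The R^2, adjusted R^2, and F-statistic are tied together by standard formulas, which the sketch below works through. It assumes n = 190 observations, inferred from the 188 residual degrees of freedom plus the two estimated coefficients, and uses the rounded R^2 from the output.

```r
# Values taken or inferred from the summary(who_model_1) output
r2 <- 0.2577   # Multiple R-squared
n  <- 190      # assumed: 188 residual df + 2 coefficients
k  <- 1        # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The overall F-statistic can be written in terms of R-squared
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

print(c(adj_r2 = adj_r2, f_from_r2 = f_from_r2))
```

Both values should land close to the printed 0.2537 and 65.26, with small differences from rounding.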

Plotting Residuals: Model 1

plot(fitted(who_model_1),resid(who_model_1))
abline(h=0)

Looking at the prior plot, it is not clear whether the residuals are evenly distributed above and below 0 (they could be, but it is not immediately apparent).

QQ Plot: Model 1

qqnorm(resid(who_model_1))
qqline(resid(who_model_1))

Further, from looking at the QQ plot, it seems that the residuals do not really fall on the normal line.

Some of the summary statistics (p-value, F-statistic, and t-value) indicate a statistically significant relationship, but after looking at the residual and QQ plots, the assumptions of simple linear regression do not appear to be met, so this is probably not the best model.
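A visual read of a QQ plot can be backed up with a formal normality test. The following is only a sketch on simulated data (not the WHO data set): it fits a line to data with deliberately skewed errors and applies base R's `shapiro.test` to the residuals. All names and simulated values are hypothetical.

```r
set.seed(1)

# Simulated data with skewed (non-normal) errors; not the WHO data
n <- 190
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + (rexp(n) - 1)

fit <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
print(sw$p.value)
```

The same call, `shapiro.test(resid(who_model_1))`, could be applied to the model above; its result is not shown here.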

2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Field Creation and Initial Plot

who <- who %>%
  mutate(LifeExp4.6 = LifeExp^4.6,
         TotExp0.06 = TotExp^0.06)

plot(who$TotExp0.06, who$LifeExp4.6, xlab = 'Sum of Personal and Government Expenditures ^ 0.06', ylab = 'Life Expectancy ^ 4.6')
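The prompt calls the 0.06 power "nearly a log transform". A rough way to see this is to compare the two transforms over a grid of values. The grid below is hypothetical (log-spaced from 10 to 500,000) and is not taken from the WHO data.

```r
# Hypothetical log-spaced grid of expenditure values
tot_exp <- exp(seq(log(10), log(5e5), length.out = 200))

power_tf <- tot_exp^0.06   # the transform used in this question
log_tf   <- log(tot_exp)   # natural log for comparison

# A correlation near 1 means the two transforms are almost linearly related
print(cor(power_tf, log_tf))
```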

Model 2

who_model_2 <- lm(formula = LifeExp4.6 ~ TotExp0.06, who)

summary(who_model_2)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp0.06   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

This new model has a p-value of < 2.2e-16, which, just like the last model, is less than 0.05, indicating that our independent variable is related to our dependent variable.

The F-statistic here is 507.7, which is far higher than the 65.26 of the prior model. The prior model's F-statistic already indicated statistical significance, and with the same degrees of freedom (1 and 188), the higher value here points to this model being a better fit.

The slope estimate is 620060216 and its standard error is 27518940, which gives a t-value of 22.53. At the 95% confidence level the threshold is roughly 1.96, and our t-value is far above it, so the slope is statistically significant. While both models clear that threshold, the second model's t-value is much larger, meaning its slope is estimated far more precisely relative to its size.

The R^2 values are: Multiple R-squared: 0.7298 and Adjusted R-squared: 0.7283. An R-squared of 0.7298 means the model explains about 73% of the variability in the (transformed) response variable, which is much higher than the 0.2577 of the prior model. The residual standard error of 90490000 is measured on the LifeExp^4.6 scale, so it cannot be compared directly with the 9.371 from Model 1.
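One caveat when comparing the two R^2 values is that Model 2's R^2 describes variance in LifeExp^4.6 rather than LifeExp. A rough way to compare on a common footing is to back-transform the fitted values and compute an R^2 on the original scale. The sketch below does this on simulated data only; `x` and `y` are hypothetical stand-ins for TotExp^0.06 and LifeExp, and this is not a calculation on the WHO data.

```r
set.seed(42)

# Simulated stand-ins: x plays the role of TotExp^0.06, y of LifeExp
n <- 190
x <- runif(n, 1, 2)
y <- 50 + 20 * x + rnorm(n, sd = 3)

# Fit on the transformed response, as in Model 2
y_pow <- y^4.6
fit   <- lm(y_pow ~ x)

# Undo the 4.6 power so fitted values are on the original scale
fitted_orig <- fitted(fit)^(1 / 4.6)

# R-squared computed against the untransformed response
r2_orig <- 1 - sum((y - fitted_orig)^2) / sum((y - mean(y))^2)
print(r2_orig)
```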

Plotting Residuals: Model 2

plot(fitted(who_model_2),resid(who_model_2))
abline(h=0)

This residual plot paints a better picture than that of Model 1. The residuals appear more evenly scattered above and below 0 across the fitted values than those from Model 1, which suggests that the linearity and constant-variance assumptions are closer to being met.

QQ Plot: Model 2

qqnorm(resid(who_model_2))
qqline(resid(who_model_2))

The QQ Plot here is not perfect. While it seems the points more consistently fall on the normal line, there is some straying from the normal line on the outer ends of the theoretical quantiles.

Generally, I would say that the second model is a far better fit than Model 1. Model 2 is not perfect, but its higher R^2 and F-statistic and its better-behaved residual and QQ plots all favor it.

3.

Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# Using slope-intercept form (y = mx + b), then undoing the 4.6 power

intercept = summary(who_model_2)$coefficients[1,1]
slope = summary(who_model_2)$coefficients[2,1]

# forecast life expectancy when TotExp^0.06 = 1.5 and TotExp^0.06 = 2.5

x1 = 1.5
LifeExp1 = (slope*x1 + intercept)^(1/4.6)
paste('The forecasted life expectancy when TotExp^0.06 = 1.5:', LifeExp1)
## [1] "The forecasted life expectancy when TotExp^0.06 = 1.5: 63.3115334469743"
x2 = 2.5
LifeExp2 = (slope*x2 + intercept)^(1/4.6)
paste('The forecasted life expectancy when TotExp^0.06 = 2.5:', LifeExp2)
## [1] "The forecasted life expectancy when TotExp^0.06 = 2.5: 86.5064484844719"
new <- data.frame(TotExp0.06 = c(1.5, 2.5))
predict(who_model_2, new, interval="predict")^(1/4.6)
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414
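To put the two forecast points in context, the transformed inputs can be mapped back to raw TotExp by inverting the 0.06 power. This is a small sketch; the variable names are illustrative only.

```r
# Transformed inputs used in the forecasts above
tot_exp_pow <- c(1.5, 2.5)

# Invert the 0.06 power to recover TotExp on its original scale
tot_exp_raw <- tot_exp_pow^(1 / 0.06)
print(tot_exp_raw)
```

TotExp^0.06 = 1.5 corresponds to a TotExp of roughly 861, while 2.5 corresponds to roughly 4.3 million, so the two forecasts sit at very different expenditure levels.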

4.

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

Multiple Regression Model

who_mult_reg <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, who)
summary(who_mult_reg)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
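A note on the formula syntax: in R's formula language, `a * b` already expands to both main effects plus the interaction `a:b`, so writing the main effects explicitly is harmless but redundant. The sketch below illustrates this on simulated data with hypothetical column names `a`, `b`, and `y`.

```r
set.seed(7)

# Simulated data with an interaction effect; not the WHO data
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- 1 + 2 * d$a - d$b + 0.5 * d$a * d$b + rnorm(50)

# Long form, mirroring the formula used for who_mult_reg
fit_long  <- lm(y ~ a + b + a * b, data = d)

# Short form: a * b expands to a + b + a:b
fit_short <- lm(y ~ a * b, data = d)

# Both specifications estimate the same coefficients
print(all.equal(coef(fit_long), coef(fit_short)))
```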

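The p-value for the full model (< 2.2e-16) is far below 0.05, which is statistically significant, and all of the individual predictors also have p-values below 0.05. The F-statistic of 34.49 on 3 and 186 degrees of freedom supports the same conclusion.

The absolute t-values of all three terms (5.371, 8.053, and 4.093) are greater than 1.96, which again points to statistical significance and provides an indication that this may be a good model.

However, with a low R^2 value (Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471), the model explains only about 36% of the variability in life expectancy, so it may not be the best fit. The residual standard error of 8.765 is only slightly lower than the 9.371 from Model 1.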

Plotting Residuals: Multiple Regression

plot(fitted(who_mult_reg),resid(who_mult_reg))
abline(h=0)

It is not immediately apparent whether the residuals are evenly distributed above and below zero, at least with this plot.

QQ Plot: Multiple Regression

qqnorm(resid(who_mult_reg))
qqline(resid(who_mult_reg))

The points do not really adhere to the normal line, and therefore I would not say that the residuals for this multiple regression are normally distributed.

I would not say that this model is the best fitting prediction for Life Expectancy.

5.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# initial conditions
PropMD = 0.03
TotExp = 14

# model values
b = summary(who_mult_reg)$coefficients[1,1]
PropMDx = summary(who_mult_reg)$coefficients[2,1] * PropMD
TotExpx = summary(who_mult_reg)$coefficients[3,1] * TotExp
PropMD_TotExp = summary(who_mult_reg)$coefficients[4,1] * PropMD * TotExp

# Forecasting (the interaction coefficient is already negative, so every term is added)
LifeExp = b + PropMDx + TotExpx + PropMD_TotExp
paste('The forecasted life expectancy when PropMD=0.03 and TotExp = 14:', round(LifeExp, 3))
## [1] "The forecasted life expectancy when PropMD=0.03 and TotExp = 14: 107.696"
# Predictions
new <- data.frame(PropMD, TotExp)
predict(who_mult_reg, new, interval="predict")
##       fit      lwr      upr
## 1 107.696 84.24791 131.1441
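The forecast can also be worked through term by term. This sketch assumes the rounded coefficients from the summary table, so it will differ slightly from the `predict` output. Each coefficient enters the sum with its own sign, including the negative interaction coefficient. Variable names are illustrative.

```r
# Rounded coefficients copied from summary(who_mult_reg)
b0 <- 6.277e+01    # intercept
b1 <- 1.497e+03    # PropMD
b2 <- 7.233e-05    # TotExp
b3 <- -6.026e-03   # PropMD:TotExp interaction

prop_md <- 0.03
tot_exp <- 14

# Every term is added; b3 carries its own negative sign
life_exp <- b0 + b1 * prop_md + b2 * tot_exp + b3 * prop_md * tot_exp
print(life_exp)

# Contribution of each term to the forecast
print(c(intercept = b0, prop_md = b1 * prop_md,
        tot_exp = b2 * tot_exp, interaction = b3 * prop_md * tot_exp))
```

Nearly all of the increase over the intercept comes from the PropMD term (1497 x 0.03 = 44.91 years), while the TotExp and interaction terms contribute almost nothing at TotExp = 14.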

I do not believe a forecasted average life expectancy of about 108 years is realistic (at least for current human health), and the prediction interval of roughly 84 to 131 years extends even further beyond plausible values. The value makes sense when you distill reality down to the variables in this model, but based on this prediction, this model may not be the best (in addition to the fact that it was not deemed to be the best fit in question #4).