1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

library(tidyverse)
data <- read.csv("/Users/mohamedhassan/Downloads/who.csv")
summary(data)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750
plot(data[,"TotExp"],data[,"LifeExp"], main="Relationship Between Government Expenditures and Life Expectancy",
xlab="Sum of Personal and Government Expenditures.", ylab="Average Life Expectancy")

model1 <- lm(LifeExp~TotExp, data=data)
summary(model1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

In simple linear regression, the F-statistic equals the squared t-value of the slope (8.079^2 is approximately 65.26). The F-statistic of 65.26, taken with the small p-value of 7.71e-14, indicates that the null hypothesis of a zero slope can be rejected and that the model is statistically significant. Because the model has a single predictor, the same p-value shows that TotExp is a statistically significant predictor of LifeExp. The R-squared of 0.2577 (adjusted 0.2537) indicates that only about 25% of the variation in LifeExp is explained by TotExp. The residual standard error is 9.371 years on 188 degrees of freedom, which is large relative to the 43-year range of LifeExp in the data. The standard error of the TotExp coefficient is 7.795e-06, which, as the t-value shows, is 8.079 times smaller than the coefficient itself, 6.297e-05. Typically, the standard error of a good model should be five to ten times smaller than the corresponding coefficient.
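The link between the slope, its standard error, the t-value, and the F-statistic can be checked by hand. The sketch below uses values copied from the summary(model1) output; the variable names are illustrative only, and small differences from the printed output are due to rounding of the inputs.

```r
# Values copied from the summary(model1) output
slope    <- 6.297e-05   # estimate for TotExp
slope_se <- 7.795e-06   # standard error of that estimate

# The t-value is the estimate divided by its standard error
t_value <- slope / slope_se

# With a single predictor, the F-statistic equals the squared t-value
f_value <- t_value^2

# Two-sided p-value for the slope on 188 residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = 188, lower.tail = FALSE)

print(round(t_value, 2))
print(round(f_value, 1))
print(p_value)
```

In a live session the same quantities are available without retyping them, through coef(summary(model1)).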

plot(fitted(model1), residuals(model1), xlab="fitted", ylab="residuals")
abline(h=0)

plot(model1)

ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

The residuals vs. fitted plot shows the points clustered toward the left, with an unequal distribution above and below zero. This indicates that the variance of the residuals is not constant, and that there is heteroscedasticity. Additionally, the right and left tails of the Q-Q plot deviate from the reference line, which indicates that the residuals are not normally distributed. Taken together, the assumptions of simple linear regression do not appear to be met.
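The visual diagnostics can be complemented with simple numerical checks. The sketch below is an informal illustration on simulated data (the variables x, y, and fit are hypothetical stand-ins, not the WHO data): an auxiliary regression of the squared residuals on the fitted values, in the spirit of a Breusch-Pagan test, and a Shapiro-Wilk test of residual normality.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): the noise grows with x,
# so the constant-variance assumption is violated by construction
n <- 190
x <- runif(n, min = 1, max = 10)
y <- 1 + x + rnorm(n, sd = x^2)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on fitted values:
# a significant slope suggests non-constant variance
aux   <- lm(I(residuals(fit)^2) ~ fitted(fit))
aux_p <- summary(aux)$coefficients[2, 4]

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro_p <- shapiro.test(residuals(fit))$p.value

print(aux_p)
print(shapiro_p)
```

Replacing fit with model1 would apply the same two checks to the regression above.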

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

data_new <- data %>% 
  mutate(LifeExp2 = LifeExp^4.6) %>% 
  mutate(TotExp2 = TotExp^.06)
# Axis labels reflect the transformed variables
plot(data_new[,"TotExp2"],data_new[,"LifeExp2"], main="Relationship Between Transformed Total Expenditures and Life Expectancy",
xlab="Sum of Personal and Government Expenditures (TotExp^0.06)", ylab="Average Life Expectancy (LifeExp^4.6)")

model2 <- lm(LifeExp2~TotExp2, data=data_new)
summary(model2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = data_new)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Transforming both variables improved the performance of the model. The F-statistic increased substantially, from 65.26 to 507.7. Combined with the p-value of less than 2.2e-16, this indicates that the null hypothesis can be rejected and that the model is statistically significant. The same p-value indicates that the transformed variable TotExp2 is a statistically significant predictor of the transformed dependent variable, LifeExp2. The Adjusted R-squared increased as well, from 25.37% to 72.83%, which indicates that TotExp2 explains far more of the variation in LifeExp2 than TotExp explained in LifeExp. The residual standard error of 90490000 is in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the first model. The t-value shows that the standard error of the coefficient, 27518940, is 22.53 times smaller than the coefficient itself, 620060216. As stated earlier, a good model has a t-value (the ratio of coefficient to standard error) that shows the standard error being five to ten times smaller than the coefficient. Taken altogether, this model does a better job than the initial model.

plot(fitted(model2), residuals(model2), xlab="fitted", ylab="residuals")
abline(h=0)

plot(model2)

ggplot(model2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

The plots of the new model support the argument that this model is better. The Residual vs. Fitted Values plot shows randomly scattered residual points with no discernible pattern and a more even distribution above and below the horizontal axis. In the Q-Q plot, the points fall closer to the reference line, although the left tail still deviates from it.

3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

x <- 1.5
forecast1 <- round((-736527909 + 620060216 * x)^(1/4.6), 1)
cat("The Forecast of Life Expectancy when TotExp^.06 = 1.5 is", forecast1) 
## The Forecast of Life Expectancy when TotExp^.06 = 1.5 is 63.3
y <- 2.5
forecast2 <- round((-736527909 + 620060216 * y)^(1/4.6), 1)
cat("The Forecast of Life Expectancy when TotExp^.06 = 2.5 is", forecast2)
## The Forecast of Life Expectancy when TotExp^.06 = 2.5 is 86.5
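
The two forecasts can also be produced by one small function that predicts LifeExp^4.6 and then undoes the 4.6 power. This is a minimal sketch: the coefficients are copied from the summary(model2) output, and forecast_life_exp is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from summary(model2); coef(model2) would supply
# the unrounded values in a live session
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: predict LifeExp^4.6, then back-transform to years
forecast_life_exp <- function(tot_exp_pow) {
  (b0 + b1 * tot_exp_pow)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(round(forecasts, 1))

# Untransformed TotExp values that correspond to TotExp^0.06 = 1.5 and 2.5
print(c(1.5, 2.5)^(1 / 0.06))

# Largest transformed value in the data, using the maximum TotExp of 482750
print(482750^0.06)
```

The last line gives about 2.19, so TotExp^.06 = 2.5 lies outside the observed range and the second forecast is an extrapolation.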

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

model3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data=data)
summary(model3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The summary shows that the model is statistically significant but explains only a modest share of the variation in LifeExp. The F-statistic of 34.49 on 3 and 186 degrees of freedom has a p-value below 2.2e-16, so the null hypothesis that all slope coefficients are zero can be rejected. The Adjusted R-squared is 34.71%, which is the percentage of variation in LifeExp that is explained by the independent variables; this is higher than the 25.37% of the first model but well below the 72.83% of the transformed model. The residual standard error is 8.765 years on 186 degrees of freedom, only slightly lower than the 9.371 of the first model. Each independent variable has a p-value less than .05, which indicates that each is a statistically significant predictor of LifeExp. The t-values of PropMD (5.371) and TotExp (8.053) are between 5 and 10, so their standard errors are five to ten times smaller than their respective coefficients. The interaction term PropMD:TotExp has a t-value of -4.093, so its standard error is only about four times smaller than its coefficient, but its p-value of 6.35e-05 shows that it is still statistically significant. Overall, the model is significant but not a good one, since it leaves roughly two thirds of the variation in LifeExp unexplained.
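As a cross-check on the summary output, the p-values of the overall F-test and of the interaction term can be recomputed from the reported statistics and degrees of freedom. This is a small sketch using base R distribution functions; the variable names are illustrative.

```r
# Overall F-test: F = 34.49 on 3 and 186 degrees of freedom
f_p <- pf(34.49, df1 = 3, df2 = 186, lower.tail = FALSE)

# Interaction term: t = -4.093 on 186 residual degrees of freedom (two-sided)
interaction_p <- 2 * pt(abs(-4.093), df = 186, lower.tail = FALSE)

print(f_p)
print(interaction_p)
```

Both values fall far below the usual 0.05 threshold, which is consistent with the significance stars in the coefficient table.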

plot(model3)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

ggplot(model3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

The plots follow the pattern of the initial model. The residual points are clustered toward the left and are not randomly scattered, with an unequal distribution above and below the horizontal axis. This indicates that the variance of the residuals is not constant, and that there is heteroscedasticity. Additionally, the right and left tails of the Q-Q plot deviate from the reference line, which indicates that the residuals are not normally distributed.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

x <- 0.03
z <- 14
forecast3 <- 62.77 + (1497 * x) + (0.00007233 * z) - (0.006026 * x * z)
cat("The Forecast of Life Expectancy when PropMD = 0.03 and TotExp = 14 is", forecast3)
## The Forecast of Life Expectancy when PropMD = 0.03 and TotExp = 14 is 107.6785

This doesn’t seem realistic. The mean of LifeExp in the dataset is 67.38, with a minimum of 40 and a maximum of 83, so a forecast of 107.7 years is far above anything observed. The inputs are also an extreme combination: PropMD = 0.03 is close to the maximum in the data (0.0351) and more than ten times the mean (0.0018), while TotExp = 14 is close to the minimum (13). Almost all of the increase over the intercept comes from the PropMD term (1497 x 0.03 = 44.91 years), so the model is being extrapolated to a region where the data offer little support. A life expectancy of 107 doesn’t seem plausible.
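
Breaking the forecast into its four terms shows where the 107.7 years come from. The sketch below uses the rounded coefficients from the summary(model3) output; the vector names b and terms are illustrative.

```r
# Coefficients copied from summary(model3)
b <- c(intercept = 62.77, prop_md = 1497,
       tot_exp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
terms <- c(
  intercept   = b[["intercept"]],
  prop_md     = b[["prop_md"]] * prop_md,
  tot_exp     = b[["tot_exp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)

print(round(terms, 4))
print(sum(terms))
```

The TotExp and interaction terms contribute almost nothing at TotExp = 14, so the forecast is essentially the intercept plus the PropMD term.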