Import the WHO data (who.csv)

file <- "C:\\Users\\jashb\\OneDrive\\Documents\\Masters Data Science\\Spring 2024\\Fundamentals of Computational Mathematics DATA 605\\Week 12\\who.csv"
who <- read.csv(file)

1.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

model <- lm(LifeExp~TotExp, data = who)
summary(model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

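As seen in the model summary, the p-values for the F-statistic (65.26 on 1 and 188 DF, p = 7.714e-14) and for the TotExp coefficient are extremely small, so we reject the null hypothesis that TotExp has no linear relationship with LifeExp. The multiple R^2 of 0.2577 means that only about 25.8% of the variance in LifeExp is explained by TotExp, the sum of personal and government expenditures. The residual standard error of 9.371 means that observed life expectancies typically deviate from the fitted line by about 9.4 years. The standard error of the TotExp coefficient (7.795e-06) is small relative to its estimate (6.297e-05), which is consistent with the t value of 8.079. The assumptions of simple linear regression do not appear to be met: the residuals are skewed rather than symmetric around zero (median 3.154, minimum -24.764, maximum 13.292), and the scatterplot suggests a curved rather than linear relationship, so a transformed model might fit better.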

plot(x = who$TotExp , y = who$LifeExp)
abline(model, col = "red3")
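
The assumption discussion can be backed up with a quick numeric check on the residuals. The sketch below is not part of the original analysis: `residual_checks` is a hypothetical helper that takes any `lm` fit and returns a Shapiro-Wilk p-value for normality of the residuals, plus a rank correlation between the absolute residuals and the fitted values as a rough indicator of non-constant variance.

```r
# Hypothetical helper, not part of the original analysis.
residual_checks <- function(fit) {
  res <- residuals(fit)
  fit_vals <- fitted(fit)
  list(
    # A small p-value suggests the residuals are not normally distributed
    shapiro_p = shapiro.test(res)$p.value,
    # Values far from 0 suggest the residual spread changes with the fitted values
    spread_trend = cor(abs(res), fit_vals, method = "spearman")
  )
}

# Usage with the model fitted above (assumes `model` exists in the session):
# residual_checks(model)
# plot(fitted(model), residuals(model)); abline(h = 0, lty = 2)
# qqnorm(residuals(model)); qqline(residuals(model))
```

The two commented plot calls are the usual visual counterparts: a residuals-versus-fitted plot for linearity and constant variance, and a normal Q-Q plot for normality.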

2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”

LifeExp_v2 <- who$LifeExp^(4.6) # Raise LifeExp to the power of 4.6
TotExp_v2 <- who$TotExp^(0.06) # Raise TotExp to the power of 0.06
model_v2 <- lm(LifeExp_v2~TotExp_v2)
plot(TotExp_v2, LifeExp_v2)
abline(model_v2)

summary(model_v2)
## 
## Call:
## lm(formula = LifeExp_v2 ~ TotExp_v2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_v2    620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

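The F-statistic (507.7 on 1 and 188 DF) remains significant, as do the p-values for the intercept and the transformed TotExp. The fit of this model is also much stronger than that of the untransformed one: R^2 rose from 0.2577 to 0.7298, that is, from about 25.8% to about 73.0% of explained variability. The residual standard error of 90,490,000 is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 years from the first model. Overall, the transformed model performs much better than the untransformed one.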
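
Because the two models use different response scales, one way to compare them more directly is to back-transform the fitted values and measure the error in years. The sketch below is an illustration only: `rmse_years` is a hypothetical helper, and its usage lines assume the `who`, `model`, and `model_v2` objects from this document exist in the session.

```r
# Hypothetical helper: RMSE in years after undoing the response power transform.
rmse_years <- function(observed, fitted_values, power = 1) {
  # fitted_values are on the transformed scale; power = 1 means no transform.
  # Negative fitted values give NaN under a fractional power and are dropped.
  back <- fitted_values^(1 / power)
  sqrt(mean((observed - back)^2, na.rm = TRUE))
}

# Usage (assumes `who`, `model`, and `model_v2` exist in the session):
# rmse_years(who$LifeExp, fitted(model))                  # untransformed model
# rmse_years(who$LifeExp, fitted(model_v2), power = 4.6)  # transformed model
```

A lower value for the transformed model would support the R^2 comparison on a common scale. Dropping NaN values is a simplification, so any dropped observations should be counted and reported.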

3.

Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

a.

(-736527910 + 620060216 *(1.5))^(1/4.6)
## [1] 63.31153

b.

(-736527910 + 620060216 *(2.5))^(1/4.6)
## [1] 86.50645
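
The two forecasts above can also be wrapped in a small function so the back-transformation is written only once. This is a minimal sketch: `forecast_life_exp` is a hypothetical helper, and the coefficients are copied from the summary of the transformed model, so results can differ slightly from a forecast based on the full-precision fitted object.

```r
# Coefficients as printed in the summary of the transformed model above
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: forecast on the LifeExp^4.6 scale, then undo the power
forecast_life_exp <- function(tot_exp_06, power = 4.6) {
  (b0 + b1 * tot_exp_06)^(1 / power)
}

forecast_life_exp(c(1.5, 2.5))

# With the fitted object, the same idea would be:
# predict(model_v2, newdata = data.frame(TotExp_v2 = c(1.5, 2.5)))^(1 / 4.6)
```

Using `predict()` avoids retyping coefficients, which is why it is shown as the alternative.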

4.

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

model_4 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
summary(model_4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

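In this multiple regression model, all three terms (PropMD, TotExp, and their interaction) are statistically significant, with p-values well below 0.001. The F-statistic (34.49 on 3 and 186 DF, p < 2.2e-16) is also significant, so the predictors jointly explain more than an intercept-only model. The model explains about 35.7% of the variability in life expectancy (R^2 = 0.3574), and the similar adjusted R^2 of 0.3471 shows that the added predictors contribute to the model's explanatory power rather than merely inflating R^2. The residual standard error of 8.765 means that fitted values are typically off by about 8.8 years. This is an improvement over the simple model in 1 (R^2 = 0.2577, residual standard error 9.371), but the model still leaves roughly 64% of the variability unexplained, so it is only a moderately good fit.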
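
A note on the formula: in R, `PropMD * TotExp` already expands to both main effects plus the interaction, which is why the summary lists exactly three predictors. The short check below illustrates this with base R's `terms()`; the variable names `long_form`, `short_form`, `labels_long`, and `labels_short` are introduced here for illustration only.

```r
# Both spellings of the model formula expand to the same terms
long_form  <- LifeExp ~ PropMD + TotExp + PropMD * TotExp
short_form <- LifeExp ~ PropMD * TotExp

labels_long  <- attr(terms(long_form), "term.labels")
labels_short <- attr(terms(short_form), "term.labels")

labels_long
identical(labels_long, labels_short)  # returns TRUE
```

Either spelling therefore fits the same model; the longer one simply makes each term explicit.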

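5.

Forecast LifeExp when PropMD = 0.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?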

# Intercept + PropMD term + TotExp term + PropMD:TotExp interaction term
6.277e1 + 1.497e3 * 0.03 + 7.233e-5 * 14 - 6.026e-3 * 0.03 * 14
## [1] 107.6785

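This forecast is not realistic: very few people live past 100, so a national average life expectancy of about 107.7 years is implausible. Most of the excess comes from the PropMD term, which adds 1497 * 0.03 = 44.91 years to the intercept of 62.77.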
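
To see where the 107.68 comes from, the forecast can be split into the contribution of each term. This is a minimal sketch: `forecast_terms` is a hypothetical helper, and the coefficients are the rounded values from the summary of the multiple regression model.

```r
# Coefficients as rounded in the summary output above
b <- c(intercept = 6.277e1, PropMD = 1.497e3,
       TotExp = 7.233e-5, interaction = -6.026e-3)

# Hypothetical helper: contribution of each term to the forecast, in years
forecast_terms <- function(prop_md, tot_exp) {
  c(intercept   = b[["intercept"]],
    PropMD      = b[["PropMD"]] * prop_md,
    TotExp      = b[["TotExp"]] * tot_exp,
    interaction = b[["interaction"]] * prop_md * tot_exp)
}

terms_q5 <- forecast_terms(0.03, 14)
terms_q5
sum(terms_q5)  # about 107.68, matching the hand calculation

# With the fitted object, the same forecast would be:
# predict(model_4, newdata = data.frame(PropMD = 0.03, TotExp = 14))
```

The breakdown shows that the TotExp and interaction terms are negligible at these inputs, so the forecast is driven almost entirely by the intercept and the PropMD term.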