The questions below use the who.csv dataset, which contains real-world data from 2008.
# Load the data
who_data <- read.csv("who.csv")

1. Provide a scatterplot of LifeExp~TotExp, and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

# 1. Scatterplot and Simple Linear Regression
plot(who_data$TotExp, who_data$LifeExp, xlab = "Total Expenditures", ylab = "Life Expectancy")

lm_fit <- lm(LifeExp ~ TotExp, data = who_data)
summary(lm_fit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpretation:

The simple linear regression of LifeExp on TotExp is statistically significant: the F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, so the hypothesis that the slope is zero is rejected. The R-squared value of 0.2577 indicates that TotExp explains only 25.77% of the variation in LifeExp. The intercept (b0) is 64.75, which is the predicted LifeExp when TotExp is zero. The slope (b1) is 6.297e-05, so each one-unit increase in TotExp is associated with an increase of 6.297e-05 years in predicted LifeExp (about 0.63 years per 10,000 units). Its standard error, 7.795e-06, is small relative to the estimate (t = 8.079). The residual standard error is 9.371, meaning that observed life expectancies typically differ from the fitted values by roughly 9.4 years.

The output alone casts doubt on the assumptions. The residuals are asymmetric (median 3.154, minimum -24.764, maximum 13.292), which points to left skew rather than normality, and the low R-squared suggests that a straight line describes the relationship poorly. Linearity, normality, independence, and equal variance of the residuals should be checked with diagnostic plots before drawing conclusions from this model.
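The assumption checks mentioned above can be sketched with two standard residual plots. This is a minimal sketch, not part of the original analysis: `diagnose_fit` is a hypothetical helper name, and it assumes it is given a fitted `lm` object such as `lm_fit`.

```r
# Hypothetical helper: residuals-vs-fitted and normal Q-Q plots for an lm fit.
diagnose_fit <- function(fit) {
  res <- resid(fit)
  old_par <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(old_par))             # restore plotting settings on exit

  # Curvature here suggests non-linearity; a funnel shape suggests unequal variance
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Points far from the reference line suggest non-normal residuals
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Usage with the model fitted above (skipped if lm_fit is not in the session)
if (exists("lm_fit")) diagnose_fit(lm_fit)
```

Base R's `plot(lm_fit)` produces a similar set of diagnostic plots. The helper only makes the two most relevant ones explicit.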

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”

# 2. Transform Variables and Re-run Regression
who_data$LifeExp_trans <- who_data$LifeExp^4.6
who_data$TotExp_trans <- who_data$TotExp^0.06
plot(who_data$TotExp_trans, who_data$LifeExp_trans, xlab = "Total Expenditures (transformed)", ylab = "Life Expectancy (transformed)")

lm_fit_trans <- lm(LifeExp_trans ~ TotExp_trans, data = who_data)
summary(lm_fit_trans)
## 
## Call:
## lm(formula = LifeExp_trans ~ TotExp_trans, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExp_trans  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Interpretation:

This model regresses LifeExp^4.6 on TotExp^0.06. The intercept is -736,527,910 and the slope is 620,060,216: when TotExp^0.06 is zero, the expected value of LifeExp^4.6 is -736,527,910, and each one-unit increase in TotExp^0.06 raises the expected value of LifeExp^4.6 by 620,060,216. The standard error of the slope is 27,518,940, small relative to the estimate (t = 22.53), so the slope is estimated precisely, and its p-value (< 2e-16) gives strong evidence that it differs from zero. The R-squared is 0.7298, meaning that 72.98% of the variation in LifeExp^4.6 is explained by the linear relationship with TotExp^0.06. The F-statistic is 507.7 on 1 and 188 degrees of freedom with a p-value below 2.2e-16, so the model is significant. The residual standard error is 90,490,000, but it is measured in units of LifeExp^4.6 and cannot be compared directly with the 9.371 years of the first model. Judged by R-squared (0.7298 versus 0.2577) and the F-statistic (507.7 versus 65.26), the transformed model fits better than the model on the original variables, with the caveat that the two R-squared values describe responses on different scales.
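Because the two models have responses on different scales, one way to compare them is to back-transform the fitted values and measure the error in years. The following is a sketch under stated assumptions: `rmse_original` is a hypothetical helper, `power` is the exponent applied to the response before fitting, and negative fitted values are clamped to zero before taking the root, which is a choice made here and not something the original analysis does.

```r
# Hypothetical helper: RMSE on the original response scale.
# `power` is the exponent applied to the response before fitting (1 = none).
rmse_original <- function(fit, power = 1) {
  # Recover the untransformed response from the model frame
  observed <- model.response(model.frame(fit))^(1 / power)
  # Clamp negative fitted values (assumption) so the root is defined
  predicted <- pmax(fitted(fit), 0)^(1 / power)
  sqrt(mean((observed - predicted)^2))
}

# Usage with the two models fitted above (skipped if they are not in the session)
if (exists("lm_fit") && exists("lm_fit_trans")) {
  print(c(
    original    = rmse_original(lm_fit),
    transformed = rmse_original(lm_fit_trans, power = 4.6)
  ))
}
```

A smaller value for the transformed model would support the conclusion drawn from R-squared, this time with both errors expressed in years.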

3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

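# 3. Forecast Life Expectancy
new_data <- data.frame(TotExp_trans = c(1.5, 2.5))
# predict() returns forecasts on the transformed scale (LifeExp^4.6);
# predict.lm() has no `inverse` argument, so back-transform explicitly
predicted_trans <- predict(lm_fit_trans, newdata = new_data)
predicted_trans
##         1         2 
## 193562414 813622630
predicted_lifeexp <- predicted_trans^(1/4.6)
round(predicted_lifeexp, 2)
##     1     2 
## 63.31 86.51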
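The forecasts can also be reproduced by hand from the coefficients reported in the summary for question 2. The model predicts LifeExp^4.6, so the linear prediction has to be raised to the power 1/4.6 to obtain life expectancy in years. The sketch below hard-codes the printed coefficients, and the variable names are illustrative.

```r
# Coefficients copied from summary(lm_fit_trans)
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

x <- c(1.5, 2.5)            # values of TotExp^0.06 to forecast at
y_trans <- b0 + b1 * x      # forecasts of LifeExp^4.6
life_exp <- y_trans^(1 / 4.6)   # back-transform to years

y_trans              # 193562414 813622630
round(life_exp, 2)   # 63.31 86.51
```

The forecast life expectancy is therefore about 63.3 years when TotExp^.06 = 1.5 and about 86.5 years when TotExp^.06 = 2.5.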

4. Build the following multiple regression model and interpret the F statistic, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

# 4. Multiple Linear Regression
lm_mult <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who_data)
summary(lm_mult)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpretation:

The coefficients table gives the estimates for the intercept and each explanatory variable. The intercept is 62.77. Because the model includes an interaction, each main effect describes the slope when the other variable is zero: the PropMD coefficient (1497, p = 2.32e-07) is the effect of PropMD when TotExp is zero, and the TotExp coefficient (0.00007233, p = 9.39e-14) is the effect of TotExp when PropMD is zero. The interaction is also significant (-0.006026, p = 0.0000635); its negative sign means that the effect of TotExp on LifeExp shrinks as PropMD increases, and vice versa.

The residual summary (minimum -27.320, median 2.098, maximum 13.074) is again asymmetric, with a longer lower tail, so the residuals do not look normally distributed.

The model’s R-squared value of 0.3574 indicates that the model explains 35.74% of the variation in LifeExp. The adjusted R-squared of 0.3471 is only slightly lower, so the penalty for the two additional terms is small. The residual standard error is 8.765 on 186 degrees of freedom, a modest improvement on the 9.371 of the one-variable model. The F-statistic of 34.49 with 3 and 186 degrees of freedom and a p-value of < 2.2e-16 indicates that the model as a whole is statistically significant.

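Overall, the model suggests that both PropMD and TotExp are significant predictors of LifeExp, and their interaction should be considered when interpreting the effect of TotExp on LifeExp. It is nevertheless only a modest fit, since it leaves about 64% of the variation in LifeExp unexplained.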
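To make the interaction concrete, the slope of LifeExp with respect to TotExp at a given PropMD is b2 + b3 x PropMD. The sketch below evaluates it from the printed coefficients. The PropMD values are illustrative and not taken from who.csv, and `totexp_slope` is a hypothetical helper name.

```r
# Coefficients copied from summary(lm_mult)
b_totexp      <-  7.233e-05   # TotExp
b_interaction <- -6.026e-03   # PropMD:TotExp

# Slope of LifeExp with respect to TotExp at a given PropMD
totexp_slope <- function(prop_md) b_totexp + b_interaction * prop_md

# Illustrative PropMD values (hypothetical, not taken from who.csv)
prop_md_values <- c(0, 0.001, 0.01, 0.03)
data.frame(PropMD = prop_md_values,
           TotExp_slope = totexp_slope(prop_md_values))

# PropMD at which the fitted TotExp slope changes sign
zero_point <- -b_totexp / b_interaction
zero_point
```

With these coefficients the fitted TotExp slope reaches zero at a PropMD of about 0.012 and is negative above that value, which is one reason to be cautious about forecasts at high PropMD.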

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# 5. Forecast Life Expectancy
new_data2 <- data.frame(PropMD = 0.03, TotExp = 14)
predicted_lifeexp2 <- predict(lm_mult, newdata = new_data2)
predicted_lifeexp2
##       1 
## 107.696

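When PropMD is 0.03 and TotExp is 14, the predicted life expectancy is 107.696 years. This forecast does not seem realistic: no country has an average life expectancy that high. The inputs pair a high proportion of doctors with very low expenditure, a combination that is probably outside the range of the observed data, so the model is extrapolating its large PropMD coefficient well beyond where it was fitted.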
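Breaking the forecast into its terms shows where the value comes from. This sketch uses the rounded coefficients printed by `summary(lm_mult)`, so the total differs slightly from the `predict()` result, and the variable names are illustrative.

```r
# Coefficients copied (rounded) from summary(lm_mult)
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contributions <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)
round(contributions, 4)
sum(contributions)   # about 107.68 with the rounded coefficients

# Optional check against the observed ranges (needs who_data in the session)
if (exists("who_data")) {
  print(range(who_data$PropMD))
  print(range(who_data$TotExp))
}
```

The PropMD term alone adds about 44.9 years to the intercept, while the TotExp and interaction terms are negligible at TotExp = 14. Comparing the inputs with the observed ranges shows whether the forecast is an extrapolation.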