library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(car)

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.3.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Including Plots

You can also embed plots, for example:

df <- read.csv("https://raw.githubusercontent.com/MAB592/DATA-605-SP2024/main/who.csv")
head(df)

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

ggplot(data = df , aes(x= LifeExp,y=TotExp)) + 
  geom_point()

lm <- lm(LifeExp ~ TotExp, data = df)
summary(lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

par(mfrow=c(2,2))
plot(lm)

#Answer

The F-statistic tells us if our model is useful overall. In this case, the F-statistic is 65.26 with a really tiny p-value of 7.714e-14. That small p-value means our model is way better than a simpler one with just an intercept. However, the R-squared value of 0.2577 shows that only about 25.77% of the differences in the data can be explained by our model, so it’s not super strong.Also it looks like many of the assumptions have been violated.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

df2 <- df %>%
  mutate(LifeExp = LifeExp ^ 4.6,
         TotExp = TotExp ^ 0.06)

ggplot(data = df2 , aes(x= LifeExp,y=TotExp)) + 
  geom_point()

lm2 <- lm(LifeExp ~ TotExp, data = df2)
summary_lm2 <- summary(lm2)

print(summary_lm2)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

#Answer

Model 2 definitely performs much better if we look at the r squared values and the scatter plot shows more of a linear correlation.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

intercept <- abs(summary_lm2$coefficients[1, 1])
Beta1 <- summary_lm2$coefficients[2, 1]

predict_LifeExp <- function(TotExp){
  return (intercept + Beta1 * TotExp)^(1/4.6)
}

# TotExp^.06 =1.5
predict_LifeExp(1.5)

## [1] 1666618233

# TotExp^.06=2.5
predict_LifeExp(2.5)

## [1] 2286678449

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df2)
summary_multi <- summary(lm3)

print(summary_multi)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -296470018  -47729263   12183210   60285515  212311883 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.244e+08  5.083e+07 -14.253   <2e-16 ***
## PropMD         4.727e+10  2.258e+10   2.094   0.0376 *  
## TotExp         6.048e+08  3.023e+07  20.005   <2e-16 ***
## PropMD:TotExp -2.121e+10  1.131e+10  -1.876   0.0622 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88520000 on 186 degrees of freedom
## Multiple R-squared:  0.7441, Adjusted R-squared:   0.74 
## F-statistic: 180.3 on 3 and 186 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lm3)

# Answer The F-statistic for our multiple regression model is 180.3, indicating that our model is highly significant. All of the factors, except for PropMD x TotExp0.06, have strong p-values, suggesting they are significant predictors. Additionally, the R-squared value of 0.7441 means that our model explains about 74.41% of the variability in the data, which is quite good.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

summary_multi$coefficients

##                   Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)     -724418697    50825046 -14.253183 1.227840e-31
## PropMD         47273338389 22577050673   2.093867 3.762723e-02
## TotExp           604795792    30232703  20.004688 3.059992e-48
## PropMD:TotExp -21214671638 11308191249  -1.876045 6.221567e-02

multiintercept <- abs(summary_multi$coefficients[1, 1])
multiBeta1 <- summary_multi$coefficients[2, 1]
multiBeta2 <- summary_multi$coefficients[3, 1]
multiBeta3 <- summary_multi$coefficients[4, 1]

predict_multi <- function(PropMD, TotExp_0.06){
  LifeExp_4.6 <- multiintercept + multiBeta1 * PropMD + multiBeta2 * TotExp_0.06 + multiBeta3 * PropMD * TotExp_0.06
  return(LifeExp_4.6^(1/4.6))
}

PropMD=.03 and TotExp = 14

predict_multi(.03, 14^0.06)

## [1] 106.3698

Answer

The prediction suggests that if 3% of the population are doctors and total spending is $\$14$, people would live about 82.57 years. But when we look at the data where 3% of the population are doctors, we see that life expectancy is around 82 years. However, the total spending in those cases is much higher than $\$14$. So, based on this, the prediction doesn’t seem realistic.

Data 605 HW 12

2024-01-29

Including Plots

Answer