library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
You can also embed plots, for example:
df <- read.csv("https://raw.githubusercontent.com/MAB592/DATA-605-SP2024/main/who.csv")
head(df)
ggplot(data = df , aes(x= LifeExp,y=TotExp)) +
geom_point()
lm <- lm(LifeExp ~ TotExp, data = df)
summary(lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
par(mfrow=c(2,2))
plot(lm)
#Answer
The F-statistic tells us if our model is useful overall. In this case, the F-statistic is 65.26 with a really tiny p-value of 7.714e-14. That small p-value means our model is way better than a simpler one with just an intercept. However, the R-squared value of 0.2577 shows that only about 25.77% of the differences in the data can be explained by our model, so it’s not super strong.Also it looks like many of the assumptions have been violated.
df2 <- df %>%
mutate(LifeExp = LifeExp ^ 4.6,
TotExp = TotExp ^ 0.06)
ggplot(data = df2 , aes(x= LifeExp,y=TotExp)) +
geom_point()
lm2 <- lm(LifeExp ~ TotExp, data = df2)
summary_lm2 <- summary(lm2)
print(summary_lm2)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
#Answer
Model 2 definitely performs much better if we look at the r squared values and the scatter plot shows more of a linear correlation.
intercept <- abs(summary_lm2$coefficients[1, 1])
Beta1 <- summary_lm2$coefficients[2, 1]
predict_LifeExp <- function(TotExp){
return (intercept + Beta1 * TotExp)^(1/4.6)
}
# TotExp^.06 =1.5
predict_LifeExp(1.5)
## [1] 1666618233
# TotExp^.06=2.5
predict_LifeExp(2.5)
## [1] 2286678449
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df2)
summary_multi <- summary(lm3)
print(summary_multi)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -296470018 -47729263 12183210 60285515 212311883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.244e+08 5.083e+07 -14.253 <2e-16 ***
## PropMD 4.727e+10 2.258e+10 2.094 0.0376 *
## TotExp 6.048e+08 3.023e+07 20.005 <2e-16 ***
## PropMD:TotExp -2.121e+10 1.131e+10 -1.876 0.0622 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88520000 on 186 degrees of freedom
## Multiple R-squared: 0.7441, Adjusted R-squared: 0.74
## F-statistic: 180.3 on 3 and 186 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3)
# Answer The F-statistic for our multiple regression model is 180.3,
indicating that our model is highly significant. All of the factors,
except for PropMD x TotExp0.06, have strong p-values, suggesting they
are significant predictors. Additionally, the R-squared value of 0.7441
means that our model explains about 74.41% of the variability in the
data, which is quite good.
summary_multi$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -724418697 50825046 -14.253183 1.227840e-31
## PropMD 47273338389 22577050673 2.093867 3.762723e-02
## TotExp 604795792 30232703 20.004688 3.059992e-48
## PropMD:TotExp -21214671638 11308191249 -1.876045 6.221567e-02
multiintercept <- abs(summary_multi$coefficients[1, 1])
multiBeta1 <- summary_multi$coefficients[2, 1]
multiBeta2 <- summary_multi$coefficients[3, 1]
multiBeta3 <- summary_multi$coefficients[4, 1]
predict_multi <- function(PropMD, TotExp_0.06){
LifeExp_4.6 <- multiintercept + multiBeta1 * PropMD + multiBeta2 * TotExp_0.06 + multiBeta3 * PropMD * TotExp_0.06
return(LifeExp_4.6^(1/4.6))
}
PropMD=.03 and TotExp = 14
predict_multi(.03, 14^0.06)
## [1] 106.3698
The prediction suggests that if 3% of the population are doctors and total spending is \(\$14\), people would live about 82.57 years. But when we look at the data where 3% of the population are doctors, we see that life expectancy is around 82 years. However, the total spending in those cases is much higher than \(\$14\). So, based on this, the prediction doesn’t seem realistic.