library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.2.3
##
## Attaching package: 'olsrr'
##
## The following object is masked from 'package:datasets':
##
## rivers
who_data <- read.csv("https://raw.githubusercontent.com/brsingh7/Data/main/who.csv")
plot(who_data$TotExp, who_data$LifeExp, xlab = "TotExp", ylab = "LifeExp", main = "LifeExp vs TotExp")
lm_mod <- lm(LifeExp ~ TotExp, data = who_data)
summary(lm_mod)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The model has an f-statistic of 65.26 and a pvalue less than 0.05 indicating that the model fits the data well. However, the multiple r squared of 0.2577 is fairly low, meaning the variation in lifeexp is only be explained by 25.77% of the explanatory variable. Further, the standard error of 9.371 is larger than we want it to be, showing a large deviation in the predicted and actual values.
hist(lm_mod$residuals)
qqnorm(lm_mod$residuals)
qqline(lm_mod$residuals)
In looking at the residuals, we can see there is a skew to the right,
not normally distributed, which can also raise questions about the
model’s fit. We can see the residual values are not normally
distributed.
who_data$LifeExp_transformed <- who_data$LifeExp^4.6
who_data$TotExp_transformed <- who_data$TotExp^0.06
plot(who_data$TotExp_transformed, who_data$LifeExp_transformed, xlab = "TotExp_Transformed", ylab = "LifeExp_Transformed", main = "LifeExp vs TotExp")
lm_mod2 <- lm(LifeExp_transformed ~ TotExp_transformed, data = who_data)
summary(lm_mod2)
##
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_transformed 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
After transforming the variables, there appears to be a much more linear relationship than the plot not transformed. Since the variables are transformed and scaled, the standard error appears out of wack, but we can’t rely on it or use for comparison based on this. The Multiple-R squared is much higher than the first model, and the pvalue is well below 0.05 significance. Overall, the model appears to be a better fit than the first.
hist(lm_mod2$residuals)
qqnorm(lm_mod2$residuals)
qqline(lm_mod2$residuals)
There is still some skew in the distribution of the residuals but a much tighter fit to in the qqplot than the first model.
#results will be scaled, so have to invert for correct prediction
p_1.5 = (predict(lm_mod2, data.frame(TotExp_transformed=1.5)))^(1/4.6)
p_2.5 = (predict(lm_mod2, data.frame(TotExp_transformed=2.5)))^(1/4.6)
The predicted life expectancy when TotExp = 1.5 is 63.3 years old and 86.5 years old when TotExp = 2.5.
lm_mod3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who_data)
summary(lm_mod3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The explanatory variables all have a pvalue less than 0.05, therefore are statistically significant. However, the multiple r squared of 0.3574 is much lower than the 2nd model. The FStatistic shows that the model is statistically significant at the pvalue below 0.05. However, the residual error is near the first model’s value.
hist(lm_mod3$residuals)
qqnorm(lm_mod3$residuals)
qqline(lm_mod3$residuals)
The residual plots for this model appear to be close to the first, with
a right skewness and high variance in the distribution.
p_prodmd_totexp = predict(lm_mod3, data.frame(PropMD=0.03,TotExp=14))
p_prodmd_totexp
## 1
## 107.696
The model predicts lifeexp at 107.69 years old, which does not seem to be an accurate prediction. Its prediction is significantly higher than the other models, and a high proportion of MDs with a low personal and govt expenditure does not appear to be the case with the actual data (e.g. Japan has the highest lifeexpectancy with a lower proportion of MDs and higher totalExp, as is the case with Andorra, Australia and so on)