file <- "C:\\Users\\jashb\\OneDrive\\Documents\\Masters Data Science\\Spring 2024\\Fundamentals of Computational Mathematics DATA 605\\Week 12\\who.csv"
who <- read.csv(file)
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
model <- lm(LifeExp~TotExp, data = who)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
As the model summary shows, the p-values for the F-statistic (65.26 on 1 and 188 DF, p = 7.714e-14) and for the TotExp coefficient (p = 7.71e-14) are extremely small, so we can reject the null hypothesis that TotExp has no linear effect on LifeExp. The multiple R^2 of 0.2577 means that about 25.8% of the variance in LifeExp is explained by TotExp, the sum of personal and government expenditures. The residual standard error of 9.371 means that observed life expectancy typically deviates from the fitted line by about 9.4 years. The scatterplot suggests the relationship is not a straight line, and the residuals are skewed (minimum -24.764 versus maximum 13.292, median 3.154), so the assumptions of simple linear regression do not appear to be fully met and a different model might fit better.
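As a quick sanity check on the statistics interpreted above, the sketch below recomputes the F-statistic, adjusted R^2, and overall p-value from the rounded values printed by summary(model). The variable names are illustrative, and small differences from the printed output are expected because the inputs are rounded.

```r
# Values copied from the summary(model) output above (rounded as printed)
t_slope <- 8.079        # t value for the TotExp coefficient
r2      <- 0.2577       # Multiple R-squared
df_res  <- 188          # residual degrees of freedom
n       <- df_res + 2   # observations: residual df + intercept + one slope

# With a single predictor, the F-statistic equals the squared t value
f_from_t <- t_slope^2

# F can also be recovered from R^2: (R^2 / 1) / ((1 - R^2) / df_res)
f_from_r2 <- r2 / ((1 - r2) / df_res)

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Upper-tail p-value of the reported F-statistic on 1 and 188 DF
p_from_f <- pf(65.26, 1, df_res, lower.tail = FALSE)

print(c(f_from_t = f_from_t, f_from_r2 = f_from_r2, adj_r2 = adj_r2))
print(p_from_f)
```

Both routes to F should land close to the reported 65.26, and the adjusted R^2 close to the reported 0.2537.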
plot(x = who$TotExp , y = who$LifeExp)
abline(model, col = "red3")
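The prompt also asks whether the regression assumptions hold. One possible way to look at this is sketched below. The helper `check_lm_assumptions()` is a hypothetical name, not part of the original analysis. It plots residuals against fitted values (to look for curvature or non-constant variance), draws a normal Q-Q plot, and returns a Shapiro-Wilk normality test of the residuals.

```r
# Hypothetical helper: basic residual diagnostics for a fitted lm object
check_lm_assumptions <- function(fit) {
  res         <- residuals(fit)
  fitted_vals <- fitted(fit)

  # Show two plots side by side, restoring the old layout on exit
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))

  # Residuals vs fitted: a curved or funnel-shaped pattern suggests
  # that linearity or constant variance is violated
  plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot: points far from the line suggest non-normal residuals
  qqnorm(res)
  qqline(res)

  # Shapiro-Wilk test (small p-value = evidence against normality)
  invisible(shapiro.test(res))
}

# Usage with the fit from above (assumes 'model' exists in the session):
# sw <- check_lm_assumptions(model)
# sw$p.value
```

R's built-in `plot(model)` produces a similar set of diagnostic plots; the helper only makes the individual checks explicit.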
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
LifeExp_v2 <- who$LifeExp^(4.6) # Raise LifeExp to the power of 4.6
TotExp_v2 <- who$TotExp^(0.06) # Raise TotExp to the power of 0.06
model_v2 <- lm(LifeExp_v2~TotExp_v2)
plot(TotExp_v2, LifeExp_v2)
abline(model_v2)
summary(model_v2)
##
## Call:
## lm(formula = LifeExp_v2 ~ TotExp_v2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_v2 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
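The F-statistic (507.7 on 1 and 188 DF) remains significant, as do the p-values for the intercept and TotExp_v2 (both < 2e-16). The fit also improves substantially over the untransformed model: R^2 rises from 0.2577 to 0.7298, so the transformed model explains about 73% of the variability in LifeExp^4.6. The residual standard error of 90,490,000 is on the LifeExp^4.6 scale, so it cannot be compared directly with the 9.371 years from the first model. Overall, the transformed model performs much better than the untransformed one.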
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
(-736527910 + 620060216 *(1.5))^(1/4.6)
## [1] 63.31153
(-736527910 + 620060216 *(2.5))^(1/4.6)
## [1] 86.50645
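The two forecasts above apply the same steps: predict on the LifeExp^4.6 scale, then take the 1/4.6 root to return to years. A minimal sketch that wraps this in a function, assuming the coefficients printed by summary(model_v2); the name `forecast_life_exp()` is illustrative only.

```r
# Coefficients copied from the summary(model_v2) output above
forecast_life_exp <- function(totexp_pow, b0 = -736527910, b1 = 620060216) {
  life_exp_pow <- b0 + b1 * totexp_pow  # prediction on the LifeExp^4.6 scale
  life_exp_pow^(1 / 4.6)                # back-transform to years
}

# Forecasts for TotExp^0.06 = 1.5 and 2.5
forecast_life_exp(c(1.5, 2.5))

# With the fitted object in the session, the unrounded coefficients could be
# used instead (assumes 'model_v2' was fit as shown earlier):
# predict(model_v2, newdata = data.frame(TotExp_v2 = c(1.5, 2.5)))^(1 / 4.6)
```

The function reproduces the hand calculations, giving the same 63.31153 and 86.50645 shown above.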
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
model_4 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
summary(model_4)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
In this multiple regression model, all three terms (PropMD, TotExp, and their interaction) are statistically significant, with p-values well below 0.001. The F-statistic (34.49 on 3 and 186 DF, p < 2.2e-16) is also significant. The model explains about 36% of the variability in life expectancy (R^2 = 0.3574), and the similar adjusted R^2 of 0.3471 indicates that the added predictors contribute to the model's explanatory power rather than merely inflating R^2. The residual standard error is 8.765 years, slightly lower than the 9.371 years of the simple model. This is an improvement over the simple model, although about 64% of the variability remains unexplained.
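One way to put a number on whether PropMD and the interaction add explanatory power is a partial F-test against the simple model LifeExp ~ TotExp. The sketch below approximates it from the residual standard errors printed in the two summaries. Because those values are rounded, the result is only approximate; with both fits in the session, `anova(model, model_4)` would give the exact test.

```r
# Residual standard errors and degrees of freedom from the two summaries
rse_simple <- 9.371; df_simple <- 188   # LifeExp ~ TotExp
rse_full   <- 8.765; df_full   <- 186   # LifeExp ~ PropMD * TotExp

# Residual sum of squares = (residual standard error)^2 * residual df
rss_simple <- rse_simple^2 * df_simple
rss_full   <- rse_full^2 * df_full

# Partial F: drop in RSS per added term, relative to the full model's variance
df_added  <- df_simple - df_full
f_partial <- ((rss_simple - rss_full) / df_added) / (rss_full / df_full)
p_partial <- pf(f_partial, df_added, df_full, lower.tail = FALSE)

print(c(f_partial = f_partial, p_partial = p_partial))
```

A large F with a small p-value here would support the reading that the two added terms jointly improve on the simple model.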
# Forecast LifeExp at PropMD = 0.03 and TotExp = 14: Intercept + PropMD + TotExp + PropMD:TotExp terms
((6.277*10^1)+(1.497*10^3)*0.03 + (7.233*10^(-5))*14 - ((6.026*10^(-3))*0.03*14))
## [1] 107.6785
This forecast is not realistic: it is rare for anyone to live past 100 years, never mind 107.68.
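To see where the 107.68 comes from, the forecast can be split into the contribution of each term. This is a sketch using the rounded coefficients from summary(model_4); the variable names are illustrative.

```r
# Coefficients copied from the summary(model_4) output above
coefs <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
           TotExp = 7.233e-05, interaction = -6.026e-03)

# Inputs used in the forecast
prop_md <- 0.03
tot_exp <- 14

# Value multiplied by each coefficient (1 for the intercept)
inputs <- c(intercept = 1, PropMD = prop_md,
            TotExp = tot_exp, interaction = prop_md * tot_exp)

# Contribution of each term to the forecast, in years
contrib <- inputs * coefs
print(round(contrib, 4))
print(sum(contrib))
```

The PropMD term alone adds 1497 x 0.03 = 44.91 years on top of the intercept of 62.77, while the TotExp and interaction terms are negligible at these inputs, which is why the forecast exceeds 100.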