library(MASS); library(car); library(olsrr)
## Loading required package: carData
## Warning: package 'olsrr' was built under R version 4.0.3
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
We would like to investigate the relationships between cholesterol, weight and/or blood pressure. The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.
The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.
lm.heart <- lm(Cholesterol~Weight, data=heart)
summary(lm.heart)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))
cor(heart$Weight, heart$Cholesterol, method ="spearman")
## [1] 0.1078544
cook.d = cooks.distance(lm.heart)
inf.id=which(cooks.distance(lm.heart)>0.015)
heart[inf.id, ]
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
lm.heart2=lm(Cholesterol~Weight, data=heart[-inf.id, ])
with(heart, plot(Weight, Cholesterol))
abline(lm.heart, col="red")
abline(lm.heart2,col="blue")
legend("topright",col=c("red","blue"),legend=c("w/ all observations", "w/out all observations"), cex=0.8, title.adj=0.15, lty=1)
lm.heart2 <- lm(Cholesterol~Weight, data=heart)
summary(lm.heart2)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
Influential points: Looking more specifically at the normal qq plot and the Standardized residuals there are many points that stray from the line and that there are many points above 1.5. Specifically, these points are at the top 23 and 210. We can assume that normality is not reasonable.
Looking at the Standardized Residuals Plot, we see that there is a pattern in the residual plot. This supports heteroscedasticity
Model Significance: Based on the F test the p value is 9.778e-05 which is very low. With this info we can support the Ha: Model is useful.
Individual Term Significance: Based off the T-test results the p-value for Weight is 9.78e-05. We can support the alternative hypothesis that there is a significant linear relationship between Weight and Cholesterol.
Estimated Regression line: On average, Cholesterol is predicted to have an increase .108, with one unit increase on weight.
R-squared: The prediction power is very low. 0.48 % of the variation of Cholesterol can be explained by the model (Weight). Very low prediction power.
Conclusion: Based on this information I would not suggest that weight alone can predict the outcome of Cholesterol.
The medical director wants to know if blood pressures and weight can better predict cholesterol outcome.
Consider modeling cholesterol as a function of diastolic, systolic, and weight.
lm.heart3 <- lm(Cholesterol~Diastolic+Systolic+Weight, data=heart)
summary(lm.heart3)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## Weight 0.02146 0.02903 0.739 0.4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.heart3, which=c(1:4))
inf.id2=which(cooks.distance(lm.heart3) > 0.015)
heart[inf.id2, ]
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
lm.heart4 =lm(Cholesterol~Systolic + Diastolic + Weight, data=heart[-inf.id2, ])
summary(lm.heart4)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart[-inf.id2,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Weight 0.03671 0.02860 1.284 0.1994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
Diagnostic observations: Looking at the plots specifically the standardized residual plot the influential points include 23, 210, 2035 are above the 2.0 mark. The Normal qq plot shows that the observations follow normal distribution, but you can see the influential points noted above.
vif(lm.heart4)
## Systolic Diastolic Weight
## 2.454207 2.558914 1.120631
Model Significance: Based on the F-test,the p-value is very low at 2.2e-16. With this we can support the alternative hypothesis and say there is linear relationship between some of x’s (Diastolic, Systolic, and Weight) and y (Cholesterol).
Individual term significance: Based on the t-test we determine the linear relationship based on the p-value. High -value (above the .05 significance level) support the null hypothesis that there is no linear relationship between x and y. Low p-value (below the .05 significant level) supports the alternative hypothesis where there is a linear relationship between x and y. - There is linear relationship between Diastolic and Cholesterol (2.2e-06) - There is linear relationship between Systolic and Cholesterol (0.0195) - There is no linear relationship between Weight and Cholesterol (0.1994)
R-squared: Looking at the R-squared we can interpret that only 3% of variation of Cholesterol can be explained by the model Diastolic, Systolic and Weight The prediction power is very low
Statistically Significant Predictor(s): Based off the T-test we would not include Weight as a statistically significant predictor since it has a high p-value.
Multicollinearity: Based of the VIF test of all the predictors (Diastolic, Systolic and Weight) they are all very small (under 10) meaning they all have very low correlation to Cholesterol
Conclusion: I would suggest to the medical director that this is not a good model for the prediction of Cholesterol level. Both predictors, Diastolic and Systolic, have a low correlation to Cholesterol.
Now consider stepwise model selection for the Cholesterol model. We remove influential points detected in Exercise 2, which has cook’s distance larger than 0.015, prior to performing the model selection.
model.stepwise<-ols_step_both_p(lm.heart4, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Systolic addition 0.035 0.035 8.6850 32349.7666 42.3013
## 2 Diastolic addition 0.037 0.037 3.6480 32344.7321 42.2606
## ----------------------------------------------------------------------------------------
plot(model.stepwise)
lm.final=lm(Cholesterol~ Systolic + Diastolic, data=heart[-inf.id2, ])
summary(lm.final)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-inf.id2,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.final, which=c(1:4))
Diagnostic: Looking at the Normal qq plot it appears that the data follows normal distribution. The standardized residual plot the influential points include 23, 210, 2035 are above the 2.0 mark.
Model Significance: Based on the F-test, p-value is very low 2.2e-16. Support Alternative hypothesis: There is linear relationship between some of x’s (Systolic and Diastolic) and y (Cholesterol)
Individual Term Significance: Based on the t-test we determine the linear relationship based on the p-value. Support Alternative hypothesis. - There is linear relationship between Diastolic and Cholesterol (2.2e-06) - There is linear relationship between Systolic and Cholesterol (0.0195)
Estimated Regression line: Y-hat = 159.3317 + 0.3022(Systolic) + 0.2770(Diastolic) + E
R-squared: Looking at the R-squared we can interpret that only 3% of variation of Cholesterol can be explained by the model Diastolic and Systolic. The prediction power is very low
Final Model: Using the automatic Selection, specifically using stepwise selection we were able to determine based on the “Added/Removed” that only Systolic and Diastolic should be included in the final model. Cholesterol ~ Systolic + Diastolic
Variations: The variations for both Exercise 1 (0.48 %) and 2 (3%) had very small prediction power.
Now consider best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which has cook’s distance larger than 0.015, prior to performing the model selection.
model.best.subset<-ols_step_best_subset(lm.heart4)
model.best.subset
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Systolic Diastolic
## 3 Systolic Diastolic Weight
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662
## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647
## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
By looking at the Adjusted R-squared we want to find the biggest one: Model 3: Systolic Diastolic Weight
For AIC the smaller the better: Model 2: Systolic Diastolic
Both Models 2&3 contain Systolic and Diastolic.
The final model approach in Exercise 3 and the best subset approach both used Systolic and Diastolic (Model 2)