We would like to investigate the relationships between Cholesterol, Weight and/or Blood Pressure. The data set contains Weight, Diastolic blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv. The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.
Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.
heart <- read.csv("heart.csv")  # assumes heart.csv is in the working directory
lm.heart = lm(Cholesterol ~ Weight, data = heart)
plot(heart$Weight, heart$Cholesterol, xlab ="Weight", ylab ="Cholesterol")
abline(lm.heart, col ="red")
cor(heart$Weight, heart$Cholesterol, method ="pearson")
## [1] 0.0695377
Conclusion: As computed above, the Pearson correlation between Weight and Cholesterol is 0.0695377. This indicates a very weak positive linear association between Weight and Cholesterol.
par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))
Conclusion:
Normality Check: In the Normal Q-Q plot, the majority of the points fall along the straight grey reference line, but the points towards the upper end of the line clearly pull away from it. This shows that the normality assumption is not reasonable. The standardized residuals point the same way: a number of points in the upper tail have standardized residuals above 1.5, which suggests a right-skewed error distribution.
Equal Variance Check: In the Scale-Location plot (square root of the absolute standardized residuals against the fitted values), the points show a pattern instead of an even, flat band. Hence, the plot supports heteroscedasticity, that is, non-constant error variance.
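The exercise also asks for a check of unduly influential points in this one-predictor model. A minimal sketch of that check with the 0.015 Cook's distance cutoff is given here. It runs on simulated stand-in data (`heart_demo`, `fit_demo`, and `refit_demo` are hypothetical names, not objects from the analysis), so its results say nothing about the real heart data.

```r
# Stand-in data: 'heart_demo' is simulated and only mimics the column names.
set.seed(1)
n <- 200
heart_demo <- data.frame(Weight = rnorm(n, mean = 150, sd = 25))
heart_demo$Cholesterol <- 205 + 0.1 * heart_demo$Weight + rnorm(n, sd = 40)

# Plant one extreme observation so that the check has something to find.
heart_demo$Weight[1] <- 300
heart_demo$Cholesterol[1] <- 600

fit_demo <- lm(Cholesterol ~ Weight, data = heart_demo)

# Flag observations whose Cook's distance exceeds the assignment's cutoff.
cutoff <- 0.015
influential <- which(cooks.distance(fit_demo) > cutoff)

# Refit without the flagged rows. The guard matters: with nothing flagged,
# heart_demo[-influential, ] would select zero rows instead of all rows.
if (length(influential) > 0) {
  refit_demo <- lm(Cholesterol ~ Weight, data = heart_demo[-influential, ])
} else {
  refit_demo <- fit_demo
}
```

The same pattern applies to `lm.heart` and `heart`; the length guard is worth keeping whenever the set of flagged points might be empty.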
Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
summary(lm.heart)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -111.95  -29.59   -4.64   23.49  334.35
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
Conclusion:
Model Significance: For this, we check the p-value of the F-statistic. Above, we can see that the p-value for this model is 9.778e-05, which falls below the significance level of 0.05. Hence, we reject the null hypothesis that all slope coefficients are zero and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual term significance, we rely on the p-value from the t-test of Weight. Above, we can see that it is 9.78e-05, which also falls below the significance level of 0.05. From this, we reject the null hypothesis that the slope is zero and conclude that there is a statistically significant linear relationship between Weight and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, our estimated regression line is \(\hat{y} = 205.86763 + 0.10867x\), where \(x\) is Weight. In other words, Cholesterol is expected to increase by 0.10867 units, on average, for each one-unit increase in Weight.
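To make the slope interpretation concrete, the short sketch below evaluates the fitted line at two weights. The coefficients are copied from the summary output; the two weights (150 and 160) are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary(lm.heart) output.
b0 <- 205.86763   # intercept
b1 <- 0.10867     # slope for Weight

# Hypothetical weights, chosen only for illustration.
weights <- c(150, 160)
y_hat <- b0 + b1 * weights

y_hat        # fitted Cholesterol at each weight
diff(y_hat)  # change in fitted Cholesterol for a 10-unit increase in Weight
```

A 10-unit difference in Weight moves the fitted Cholesterol by only 10 × 0.10867 ≈ 1.09 units, which is small next to the residual standard error of 43.62.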
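R-squared: Although the linear relationship is statistically significant, only 0.4835% of the variation in Cholesterol is explained by Weight (\(R^2\) = 0.004835 in the table above). Hence, the medical director should not use this model for prediction, as it has very low predictive power. The issues noted in the diagnostics plots (non-normal residuals and unequal variance) also remain.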
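As a quick consistency check, in a simple linear regression with a single predictor the coefficient of determination equals the square of the Pearson correlation between the predictor and the response. Using the value returned by `cor()` earlier (the symbol \(r\) for that correlation is introduced here for convenience):

```latex
R^2 = r^2 = (0.0695377)^2 \approx 0.004835
```

This agrees with the Multiple R-squared reported in the summary output.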
The medical director wants to know if blood pressures and weight can better predict Cholesterol outcome. Consider modeling Cholesterol as a function of Diastolic, Systolic, and Weight. Fit a linear regression model for Cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.
pairs(heart)
lm.heart2 <- lm(Cholesterol~., data = heart)
par(mfrow=c(2,2))
plot(lm.heart2, which=c(1:4))
Conclusion:
Normality Check: In the Normal Q-Q plot, the majority of the points fall along the straight grey reference line, but the points towards the upper end of the line clearly pull away from it. This shows that the normality assumption is not reasonable. The standardized residuals point the same way: a number of points in the upper tail have standardized residuals above 1.5, which suggests a right-skewed error distribution.
Equal Variance Check: In the Scale-Location plot (square root of the absolute standardized residuals against the fitted values), the points show a pattern instead of an even, flat band. Hence, the plot supports heteroscedasticity, that is, non-constant error variance.
Cook’s Distance: From the Cook’s Distance Plot, we can see that there are 2 unduly influential points. We’ll confirm if these two points have a Cook’s distance greater than 0.015 below.
ipoint <- which(cooks.distance(lm.heart2) > 0.015)
heart[ipoint, ]
##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500
Conclusion: As seen above, these 2 points - observations 23 and 210 - have Cook’s distance greater than 0.015. Hence, we’ll refit the model without them below.
lm.heart2 <- lm(Cholesterol~., data = heart[-ipoint, ])
Comment on significance of the parameters and how much variation in Cholesterol is described by the model. Comment on the relationship between Cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
summary(lm.heart2)
##
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-ipoint, ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -110.617  -29.371   -4.476   23.755  216.041
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994
## Diastolic     0.24922    0.10665   2.337   0.0195 *
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
Conclusion:
Model Significance: For this, we check the p-value of the F-statistic. Above, we can see that the p-value for this model is < 2.2e-16, which falls far below the significance level of 0.05. Hence, we reject the null hypothesis that all slope coefficients are zero and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual term significance, we rely on the p-values from the t-tests of Weight, Diastolic, and Systolic.
Weight: For this variable, the p-value is 0.1994, which is larger than the significance level of 0.05. Hence, we do not reject the null hypothesis: once Diastolic and Systolic are in the model, there is no significant evidence of a linear relationship between Weight and Cholesterol.
Diastolic: For our second variable, the p-value (0.0195) falls below the significance level of 0.05. We therefore reject the null hypothesis and conclude that there is a linear relationship between Diastolic and Cholesterol.
Systolic: Our last variable has the smallest p-value (2.2e-06), so we also reject the null hypothesis and conclude that there is a linear relationship between Systolic and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, our estimated regression line is \(\hat{y}\) = 156.32618 + 0.03671(Weight) + 0.24922(Diastolic) + 0.30073(Systolic). In other words, with a one-unit increase in Diastolic, Cholesterol is expected to increase by 0.24922 on average, while Weight and Systolic stay constant. With a one-unit increase in Systolic, Cholesterol is expected to increase by 0.30073 on average, while Weight and Diastolic stay constant. The Weight coefficient (0.03671) is not statistically significant.
R-squared: Although the model is statistically significant, only 3.767% of the variation in Cholesterol is explained by Weight, Diastolic, and Systolic together (\(R^2\) = 0.03767 in the table above). Hence, the medical director should not use this model for prediction, as it has low predictive power.
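The adjusted R-squared in the same summary can be reproduced from the multiple R-squared with the standard adjustment formula. In the sketch below, \(n = 3132\) is inferred from the output (3128 residual degrees of freedom plus 4 estimated coefficients) and \(p = 3\) is the number of predictors; the symbols \(n\), \(p\), and \(R^2_{\text{adj}}\) are notation introduced here.

```latex
\begin{aligned}
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \\
                 &= 1 - (1 - 0.03767)\,\frac{3131}{3128} \\
                 &\approx 0.03675
\end{aligned}
```

With more than 3,000 observations and only three predictors, the penalty is tiny, so the adjusted and unadjusted values are nearly identical.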
library(car)  # vif() is assumed to come from the car package
vif(lm.heart2)
##    Weight Diastolic  Systolic
##  1.120631  2.558914  2.454207
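Conclusion: From the VIF table above, we can see that none of the three predictors - Weight, Diastolic, and Systolic - exceeds the VIF cutoff point (10). Diastolic and Systolic have somewhat higher values (about 2.5) than Weight, which suggests that they are moderately correlated with each other, but there is no serious multicollinearity issue among the predictors.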
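For readers who want to see what a VIF measures, the sketch below computes it from its definition, \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. It uses base R only and simulated stand-in data (`demo` and `vif_by_hand` are hypothetical names), so the numbers will not match the table above.

```r
# Simulated stand-in predictors; Systolic is built to correlate with Diastolic.
set.seed(2)
n <- 300
demo <- data.frame(Weight    = rnorm(n, mean = 150, sd = 25),
                   Diastolic = rnorm(n, mean = 85,  sd = 12))
demo$Systolic <- 40 + 1.1 * demo$Diastolic + rnorm(n, sd = 12)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from the regression of
# predictor j on all the other predictors.
vif_by_hand <- sapply(names(demo), function(v) {
  others <- setdiff(names(demo), v)
  r2 <- summary(lm(reformulate(others, response = v), data = demo))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

Read in reverse, the reported VIF of 2.558914 for Diastolic corresponds to \(R_j^2 = 1 - 1/2.558914 \approx 0.61\), meaning that about 61% of the variation in Diastolic is explained by Weight and Systolic.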
Now consider stepwise model selection for the Cholesterol model. We remove influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Perform stepwise model selection with 0.05 criteria and address any issues in diagnostics plots.
library(olsrr)  # provides ols_step_both_p() and ols_step_best_subset()
model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
##                      Added/                      Adj.
## Step    Variable     Removed     R-Square    R-Square      C(p)           AIC       RMSE
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606
## ----------------------------------------------------------------------------------------
plot(model.stepwise)
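The idea behind p-value-based selection can be sketched in base R. The loop below is a forward-only simplification, not the both-direction procedure that `ols_step_both_p()` runs: at each step it adds the candidate with the smallest partial F-test p-value and stops when no candidate is below the entry threshold. The data are simulated stand-ins, and `demo`, `p_enter`, `candidates`, and `selected` are hypothetical names.

```r
# Simulated stand-in data in which only Systolic truly drives Cholesterol.
set.seed(3)
n <- 300
demo <- data.frame(Weight    = rnorm(n, mean = 150, sd = 25),
                   Diastolic = rnorm(n, mean = 85,  sd = 12))
demo$Systolic    <- 40 + 1.1 * demo$Diastolic + rnorm(n, sd = 12)
demo$Cholesterol <- 150 + 0.5 * demo$Systolic + rnorm(n, sd = 20)

p_enter    <- 0.05
candidates <- c("Weight", "Diastolic", "Systolic")
current    <- lm(Cholesterol ~ 1, data = demo)  # start from the intercept only

repeat {
  # Predictors not yet in the model.
  remaining <- setdiff(candidates, attr(terms(current), "term.labels"))
  if (length(remaining) == 0) break

  # Partial F-test for adding each remaining predictor on its own.
  tab  <- add1(current, scope = remaining, test = "F")
  best <- which.min(tab[["Pr(>F)"]])  # the "<none>" row has NA and is skipped
  if (tab[["Pr(>F)"]][best] >= p_enter) break

  # Add the most significant candidate and repeat.
  current <- update(current, as.formula(paste(". ~ . +", rownames(tab)[best])))
}

selected <- attr(terms(current), "term.labels")
selected
```

A both-direction procedure would additionally re-test the predictors already in the model after each addition and drop any whose p-value rises above the removal threshold.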
Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.
lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heart)
summary(lm.step)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -109.52  -29.58   -4.57   23.79  328.47
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527
## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16
Conclusion:
Model Significance: For this, we check the p-value of the F-statistic. Above, we can see that the p-value for this model is < 2.2e-16, which falls far below the significance level of 0.05. Hence, we reject the null hypothesis that all slope coefficients are zero and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual term significance, we rely on the p-values from the t-tests of Systolic and Diastolic.
Systolic: This variable has the smallest p-value (2.89e-06), so we reject the null hypothesis and conclude that there is a linear relationship between Systolic and Cholesterol.
Diastolic: For our second variable, the p-value (0.00932) falls below the significance level of 0.05. We therefore reject the null hypothesis and conclude that there is a linear relationship between Diastolic and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, our estimated regression line is \(\hat{y}\) = 159.63995 + 0.30193(Systolic) + 0.27609(Diastolic). In other words, with a one-unit increase in Systolic, Cholesterol is expected to increase by 0.30193 on average, while Diastolic stays constant. With a one-unit increase in Diastolic, Cholesterol is expected to increase by 0.27609 on average, while Systolic stays constant.
R-squared: Although the model is statistically significant, only 3.589% of the variation in Cholesterol is explained by Systolic and Diastolic together (\(R^2\) = 0.03589 in the table above). Hence, the medical director should not use this model for prediction, as it has low predictive power. Note that this fit uses the full heart data (3131 residual degrees of freedom), so the two influential points are still included, which is why its \(R^2\) is slightly below the 0.037 reported in the stepwise summary.
Variation explained by the models from Exercise 1 and 2:
Exercise 1: Only 0.4835% of the variation in Cholesterol is explained by the model with Weight alone. Therefore, this is not a good model for the prediction of Cholesterol level, as it has low predictive power.
Exercise 2: Only 3.767% of the variation in Cholesterol is explained by the model with Weight, Diastolic, and Systolic. Therefore, this is also not a good model for the prediction of Cholesterol level, as it has low predictive power.
Exercise 3: Only 3.589% of the variation in Cholesterol is explained by the model with Systolic and Diastolic. Therefore, this is not a good model for the prediction of Cholesterol level either, as it has low predictive power.
Adding the blood pressure measurements raises the variation explained from about 0.5% to roughly 3.6-3.8%, but all three models leave more than 96% of the variation in Cholesterol unexplained.
Now consider best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Find the best model based on adjusted-R square criteria and specify which predictors are selected.
model.best.subset = ols_step_best_subset(lm.heart2)
model.best.subset
## Best Subsets Regression
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic
##      2         Diastolic Systolic
##      3         Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                          Adj.        Pred
## Model    R-Square    R-Square    R-Square      C(p)           AIC          SBIC           SBC            MSEP          FPE       HSP       APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
##     1      0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662
##     2      0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647
##     3      0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Conclusion:
The best model for Cholesterol based on the adjusted R-square criterion is Model 3, since it has the highest adjusted R-square of the three models (0.0367).
Selected predictors: Weight, Diastolic, and Systolic
Find the best model based on AIC criteria and specify which predictors are selected.
Conclusion:
The best model for Cholesterol based on the AIC criterion is Model 2, since it has the smallest AIC value of the three models (32344.7321).
Selected predictors: Diastolic and Systolic
Compare final models selected in a) and b). Also compare final models from Best Subset approach with the final model from Stepwise Selection.
Final Conclusion:
Best Subset, adjusted R-square based: Weight, Diastolic, and Systolic
Best Subset, AIC based: Diastolic and Systolic
Stepwise Selection: Diastolic and Systolic
From our final model selection, we can see that both Best Subset models - AIC based and adjusted R-square based - have p-values under the significance level of 0.05, and hence both can be considered useful. Both also contain the Diastolic and Systolic predictors, whose p-values fall below the significance level of 0.05, so we conclude that there is a significant linear relationship between each of these 2 predictors and Cholesterol. The adjusted R-square based model additionally includes Weight, which is not significant (p-value 0.1994) and raises the adjusted R-square only from 0.0365 to 0.0367.
Furthermore, the AIC based Best Subset approach and the Stepwise Selection both returned the Diastolic and Systolic predictors. Hence, the final selected model is \(Y = 159.3317 + 0.2770(\text{Diastolic}) + 0.3022(\text{Systolic}) + \varepsilon\).
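The comparison between the two criteria can be reproduced in base R by fitting every non-empty subset of the predictors and tabulating adjusted R-square and AIC. The sketch below does this on simulated stand-in data (`demo`, `subsets`, and `results` are hypothetical names), and it lists all seven subsets, whereas the `ols_step_best_subset()` output above shows only the best model of each size.

```r
# Simulated stand-in data; not the heart data.
set.seed(4)
n <- 300
demo <- data.frame(Weight    = rnorm(n, mean = 150, sd = 25),
                   Diastolic = rnorm(n, mean = 85,  sd = 12))
demo$Systolic    <- 40 + 1.1 * demo$Diastolic + rnorm(n, sd = 12)
demo$Cholesterol <- 150 + 0.5 * demo$Systolic + rnorm(n, sd = 20)

predictors <- c("Weight", "Diastolic", "Systolic")

# Enumerate every non-empty subset of the predictors (2^3 - 1 = 7 subsets).
subsets <- unlist(lapply(seq_along(predictors), function(k) {
  combn(predictors, k, simplify = FALSE)
}), recursive = FALSE)

# Fit each subset and record the two selection criteria.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Cholesterol"), data = demo)
  data.frame(predictors = paste(vars, collapse = " + "),
             adj_r2     = summary(fit)$adj.r.squared,
             aic        = AIC(fit))
}))

results[which.max(results$adj_r2), ]  # winner by adjusted R-square (largest)
results[which.min(results$aic), ]     # winner by AIC (smallest)
```

The two criteria need not agree: adjusted R-square penalizes an extra predictor more lightly than AIC does, which is how it can favor the three-predictor model while AIC favors the two-predictor one.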