We would like to investigate the relationships between cholesterol, weight, and blood pressure. The data set heart.csv contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for living subjects.
Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.
Fitted Linear Regression Model
lm.heart = lm(Cholesterol~Weight, data = heart)
summary(lm.heart)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
Scatter Plot and Correlation Between Weight and Cholesterol
{plot(heart$Weight, heart$Cholesterol, xlab ="Weight", ylab ="Cholesterol")
abline(lm.heart, col ="red")}
cor(heart$Weight, heart$Cholesterol, method ="spearman")
## [1] 0.1078544
The Spearman correlation was selected because it is rank-based and therefore robust to outliers in the data. The Spearman correlation of 0.1078544 indicates a weak positive monotonic association between Weight and Cholesterol.
Model Diagnostics
par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))
Based on the above Cook’s Distance plot, the following unduly influential points are noted: observations 23 and 210.
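The exercise also asks for the flagged points to be removed and the model refit. A minimal sketch of that step is shown below. It uses simulated data as a stand-in for heart.csv (the data frame `sim`, the planted extreme values, and all numbers in it are assumptions for illustration only), together with the 0.015 cutoff from the exercise.

```r
# Simulated stand-in for the heart data (assumption: not the real heart.csv)
set.seed(1)
n <- 300
sim <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, sd = 40)
# Plant two extreme cholesterol values so that something is clearly flagged
sim$Cholesterol[c(23, 210)] <- c(550, 500)

fit_all <- lm(Cholesterol ~ Weight, data = sim)

# Flag observations whose Cook's distance exceeds the cutoff
cutoff <- 0.015
cd <- cooks.distance(fit_all)
flagged <- which(cd > cutoff)

# Refit without the flagged rows. Guard against an empty selection,
# because sim[-integer(0), ] would return zero rows instead of all rows.
sim_clean <- if (length(flagged) > 0) sim[-flagged, ] else sim
fit_clean <- lm(Cholesterol ~ Weight, data = sim_clean)

print(flagged)
print(rbind(all = coef(fit_all), cleaned = coef(fit_clean)))
```

After a refit, Cook's distances are recomputed on the reduced data, so new points can cross the cutoff; whether to iterate is a judgment call that the exercise leaves open.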
Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
R Output (F-Test, T-test, Estimated Regression Line, R-Squared)
summary(lm.heart)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
Model Significance: The model p-value of 9.778e-05 is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the linear regression model is useful for explaining Cholesterol. There is evidence of some linear relationship between Weight and Cholesterol.
Individual Term Significance: The t-test on the individual term (Weight) yields a p-value of 9.78e-05, which is below 0.05. Therefore, we reject the null hypothesis that the slope is zero and conclude that there is a statistically significant linear relationship between Weight and Cholesterol.
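With a single predictor, the overall F-test and the t-test on the slope are the same test, which is why the two p-values above agree: the F statistic equals the square of the t statistic. A quick check in R, first with the values reported in the summary output and then with a small simulated data set (the simulated `x` and `y` are assumptions for illustration, not the heart data):

```r
# Values copied from the summary(lm.heart) output
t_weight <- 3.901   # t value for Weight
f_model  <- 15.22   # F-statistic on 1 and 3132 DF
print(t_weight^2)   # agrees with f_model up to rounding of the printed values

# The same identity on simulated data, where nothing is rounded
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
s <- summary(lm(y ~ x))
t_sim <- s$coefficients["x", "t value"]
f_sim <- unname(s$fstatistic["value"])
print(c(t_squared = t_sim^2, F = f_sim))
```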
Estimated Regression Line: Predicted Cholesterol = 205.86763 + 0.10867(Weight). On average, Cholesterol is predicted to increase by 0.10867 units for each one-unit increase in Weight.
R-Squared: Although there is a significant relationship between Weight and the behavior of Cholesterol, only 0.4835% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). We would not recommend this model for the prediction of Cholesterol level to the medical director.
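In a simple linear regression, R-squared is the square of the Pearson correlation between the predictor and the response, so the reported R-squared can be translated back into a correlation. The sketch below does this with the value from the summary output and then confirms the identity on simulated data (the simulated `x` and `y` are assumptions for illustration only).

```r
# Multiple R-squared copied from the summary(lm.heart) output
r_squared <- 0.004835
# Positive root, because the estimated Weight slope is positive
pearson_r <- sqrt(r_squared)
print(round(pearson_r, 4))

# Check of the identity R^2 = cor(x, y)^2 on simulated data
set.seed(3)
x <- rnorm(100)
y <- 2 + 0.3 * x + rnorm(100)
r2_sim <- summary(lm(y ~ x))$r.squared
print(c(r_squared = r2_sim, cor_squared = cor(x, y)^2))
```

The implied Pearson correlation of roughly 0.07 is in line with the weak Spearman correlation reported earlier: the relationship is statistically detectable in a sample of this size but explains almost none of the variation.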
Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes of the graph curve away from it. This indicates that the assumption of normally distributed residuals is not reasonable. The sqrt(Standardized Residuals) plot is consistent with this: a considerable number of observations lie above 1.5 on the Y-axis, which corresponds to standardized residuals larger than 2.25 in absolute value.
Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity (non-constant error variance).
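The visual checks above can be complemented with formal tests. The sketch below is one possible approach and not part of the original analysis: it applies the Shapiro-Wilk test to the residuals and computes Koenker's studentized Breusch-Pagan statistic by hand (n times the R-squared of a regression of the squared residuals on the predictor). The data are simulated with a spread that grows with the predictor, which is an assumption made purely so that the test has something to detect.

```r
# Simulated data with non-constant variance (assumption: not the heart data)
set.seed(4)
n <- 500
x <- runif(n, 100, 250)
y <- 200 + 0.1 * x + rnorm(n, sd = 0.2 * x)  # error spread grows with x
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk (base R accepts 3 to 5000 observations)
sw <- shapiro.test(res)

# Equal variance: studentized Breusch-Pagan, computed manually.
# Under constant variance, n * R^2 of the auxiliary regression is
# approximately chi-squared with df = number of predictors (here 1).
aux <- lm(res^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_statistic = bp_stat, bp_p = bp_p))
```

With several thousand observations, as in the heart data, such tests reject even for small departures, so they are best read alongside the plots rather than in place of them.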
The medical director wants to know if blood pressures and Weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of Diastolic, Systolic, and Weight.
Exercise 2.A Fit a linear regression model for cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.
Fit Linear Regression Model
lm.heart2 = lm(Cholesterol~., data = heart)
Model Diagnostics
par(mfrow=c(2,2))
plot(lm.heart2, which=1:4)
Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes of the graph curve away from it. This indicates that the assumption of normally distributed residuals is not reasonable. The sqrt(Standardized Residuals) plot is consistent with this: a considerable number of observations lie above 1.5 on the Y-axis, which corresponds to standardized residuals larger than 2.25 in absolute value.
Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity (non-constant error variance).
Cook’s Distance
Based on the above Cook’s Distance plot, the following unduly influential points are noted: observations 23 and 210.
Confirm Observations With Cook’s Distance Greater Than 0.015
influential.id = which(cooks.distance(lm.heart2) > 0.015)
heart[influential.id, ]
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
Exercise 2.B Comment on significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multi-collinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
R Output (F-Test, T-test, Estimated Regression Line, R-Squared)
summary(lm.heart2)
##
## Call:
## lm(formula = Cholesterol ~ ., data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Weight 0.02146 0.02903 0.739 0.4597
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
Model Significance: The model p-value of < 2.2e-16 is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining Cholesterol (at least one coefficient is not equal to 0).
Individual Term Significance: The t-tests on the terms Diastolic, Systolic, and Weight yield the following results:
Diastolic: The p-value of 0.0166 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Diastolic and Cholesterol, given the other predictors in the model.
Systolic: The p-value of 3.1e-06 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Systolic and Cholesterol, given the other predictors in the model.
Weight: The p-value of 0.4597 is above the significance level of 0.05; therefore, we do not reject the null, and we conclude that there is no significant linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model.
Estimated Regression Line:
Predicted Cholesterol = 157.88394 + 0.02146(Weight) + 0.25983(Diastolic) + 0.30106(Systolic). A one-unit increase in Diastolic blood pressure is associated with a 0.25983 unit increase in Cholesterol, holding Systolic blood pressure and Weight constant. Each additional unit of Systolic blood pressure is associated with a 0.30106 unit increase in Cholesterol, holding Diastolic blood pressure and Weight constant. The Weight coefficient is not statistically significant.
R-Squared: Only 3.6% of the variation in Cholesterol can be explained by the model (Multiple R-squared of 0.03606). Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). We would not recommend this model for the prediction of Cholesterol level to the medical director.
Variance Inflation Factors (VIF) to Check Multicollinearity
vif(lm.heart2)
## Weight Diastolic Systolic
## 1.120375 2.558682 2.454214
pairs(heart)
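VIF: The VIF values for Weight (1.12), Diastolic (2.56), and Systolic (2.45) are all well below the cutoff of 10, so multicollinearity is not severe enough to be a concern for this model.
Relationships: The pairs plot shows a strong linear relationship between Diastolic and Systolic, which is why their VIF values are higher than that of Weight.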
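The VIF of a predictor is 1 / (1 - R-squared), where R-squared comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data (the data frame `sim` and its coefficients are assumptions for illustration, not the heart data), and then inverts the formula for the Diastolic VIF reported above.

```r
# Simulated predictors in which Diastolic is correlated with Systolic
set.seed(5)
n <- 400
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, sd = 8)
Weight    <- rnorm(n, mean = 150, sd = 25)
sim <- data.frame(Weight, Diastolic, Systolic)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 3))

# Inverting the formula for the Diastolic VIF from the vif(lm.heart2) output:
# share of the variation in Diastolic explained by Weight and Systolic
r2_diastolic <- 1 - 1 / 2.558682
print(round(r2_diastolic, 3))
```

Read this way, a VIF of about 2.56 means that roughly 61% of the variation in Diastolic is explained by the other two predictors: a noticeable overlap, but far from the 90% that a VIF of 10 would imply.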
Now consider step-wise model selection for the Cholesterol model. We remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.
Perform step-wise model selection with .05 criteria and address any issues in diagnostics plots.
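library(olsrr)  # provides ols_step_both_p()
# pent / prem: p-value thresholds for entering and removing a predictor
model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)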
model.stepwise
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Systolic addition 0.034 0.033 7.3140 32470.5215 42.9823
## 2 Diastolic addition 0.036 0.035 2.5470 32465.7540 42.9427
## ----------------------------------------------------------------------------------------
plot(model.stepwise)
Results: Based on the results of the step-wise selection process, Systolic and Diastolic will be included in the final model.
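For readers without the olsrr package, a similar both-direction search can be sketched with base R's `step()`. Note that this is a different criterion: `step()` adds and drops terms by AIC, not by the 0.05 p-value thresholds used above, so it will not always select the same model. The data below are simulated (the data frame `sim` and its coefficients are assumptions for illustration only).

```r
# Simulated stand-in for the heart data (assumption: not the real heart.csv)
set.seed(6)
n <- 1000
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, sd = 8)
Weight    <- rnorm(n, mean = 150, sd = 25)
Cholesterol <- 160 + 0.5 * Systolic + 0.3 * Diastolic + rnorm(n, sd = 30)
sim <- data.frame(Cholesterol, Weight, Diastolic, Systolic)

# Start from the intercept-only model and allow terms to enter or leave
null_fit <- lm(Cholesterol ~ 1, data = sim)
step_fit <- step(
  null_fit,
  scope = list(lower = ~ 1, upper = ~ Weight + Diastolic + Systolic),
  direction = "both",
  trace = 0  # suppress the step-by-step log
)

selected <- attr(terms(step_fit), "term.labels")
print(selected)
```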
Model Without Weight
lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heart[-influential.id, ])
Diagnostics
par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))
Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes of the graph curve away from it. This indicates that the assumption of normally distributed residuals is not reasonable. The sqrt(Standardized Residuals) plot is consistent with this: a considerable number of observations lie above 1.5 on the Y-axis, which corresponds to standardized residuals larger than 2.25 in absolute value.
Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity (non-constant error variance).
Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.
summary(lm.step)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-influential.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
Final Model: Step-wise selection determined (based on the “Added/Removed” column) that only Systolic and Diastolic should be included in the final model (Cholesterol ~ Systolic + Diastolic).
Model Significance: The model p-value of < 2.2e-16 is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining Cholesterol (at least one coefficient is not equal to 0).
Individual Term Significance: The t-tests on the terms Diastolic and Systolic yield the following results:
Diastolic: The p-value of 0.00803 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Diastolic and Cholesterol.
Systolic: The p-value of 1.95e-06 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Systolic and Cholesterol.
Estimated Regression Line:
Predicted Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic). A one-unit increase in Diastolic blood pressure is associated with a 0.2770 unit increase in Cholesterol, holding Systolic blood pressure constant. Each additional unit of Systolic blood pressure is associated with a 0.3022 unit increase in Cholesterol, holding Diastolic blood pressure constant.
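To make the fitted line concrete, the sketch below evaluates it for one hypothetical subject. The inputs (Diastolic of 80, Systolic of 120) are assumed values chosen for illustration; the coefficients and the residual standard error are copied from the summary(lm.step) output. The interval is a rough approximation of plus or minus two residual standard errors that ignores the uncertainty in the coefficients.

```r
# Coefficients and residual standard error from summary(lm.step)
b0 <- 159.3317
b_diastolic <- 0.2770
b_systolic  <- 0.3022
rse <- 42.26

# Hypothetical subject (assumed values, not taken from the data)
diastolic <- 80
systolic  <- 120

pred <- b0 + b_diastolic * diastolic + b_systolic * systolic
rough_interval <- pred + c(-2, 2) * rse

print(pred)
print(rough_interval)
```

A plausible range more than 160 units wide around a single prediction illustrates what the low R-squared means in practice for the medical director.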
Variation Explained by the Model Comparison
Exercise 1: Only 0.4835% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).
Exercise 2: Only 3.6% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).
Exercise 3: Only 3.7% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). This is nearly identical to the Exercise 2 model and higher than the Exercise 1 model, but all three models explain very little of the variation in Cholesterol.
Now consider best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.
model.best.subset = ols_step_best_subset(lm.heart2)
model.best.subset
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0338 0.0335 0.0325 7.3140 32470.5215 23576.6105 32488.6717 5789984.7365 1848.6534 0.5901 0.9674
## 2 0.0359 0.0353 0.034 2.5467 32465.7540 23571.8539 32489.9543 5779341.3588 1845.8433 0.5892 0.9660
## 3 0.0361 0.0351 0.0334 4.0000 32467.2067 23573.3102 32497.4570 5780178.8159 1846.6991 0.5894 0.9664
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Find the best model based on adjusted-R square criteria and specify which predictors are selected.
Best Model: Based on the adjusted R-square criterion, the best model in the table above is Model 2, which has the highest adjusted R-square (0.0353), marginally ahead of Model 3 (0.0351) and Model 1 (0.0335).
Selected Predictors: Model 2 includes the Diastolic and Systolic predictors.
Find the best model based on AIC criteria and specify which predictors are selected.
Best Model: Based on the AIC criterion, the best model is Model 2, as it has the lowest AIC (32465.7540).
Selected Predictors: Model 2 includes the Diastolic and Systolic predictors.
Compare final models selected in a) and b). Also compare final models from best subset approach with the final model from step-wise selection.
Final Model Comparisons: a) Best Model (Adjusted R-Squared): model with predictors Diastolic and Systolic. b) Best Model (AIC): model with predictors Diastolic and Systolic.
In the table above, both criteria select the same model. Adding Weight (Model 3) raises R-square only from 0.0359 to 0.0361, which does not offset the penalty for the extra parameter; this is consistent with Weight being non-significant in the full model (p-value of 0.4597). The selected model also matches the final model from step-wise selection (Cholesterol ~ Systolic + Diastolic), in which both predictors have p-values below the significance level of 0.05, so there is a significant linear relationship between these predictors and Cholesterol. Under every criterion, however, the selected model explains less than 4% of the variation in Cholesterol.
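The best subset comparison can also be sketched in base R by fitting every combination of predictors and tabulating adjusted R-square and AIC. Unlike the olsrr table above, which reports the best model of each size, this lists all seven candidate models. The data are simulated (the data frame `sim` and its coefficients are assumptions for illustration, not the heart data).

```r
# Simulated stand-in for the heart data (assumption: not the real heart.csv)
set.seed(7)
n <- 500
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, sd = 8)
Weight    <- rnorm(n, mean = 150, sd = 25)
Cholesterol <- 160 + 0.5 * Systolic + 0.3 * Diastolic + rnorm(n, sd = 30)
sim <- data.frame(Cholesterol, Weight, Diastolic, Systolic)

# Every non-empty subset of the three predictors (2^3 - 1 = 7 models)
predictors <- c("Weight", "Diastolic", "Systolic")
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate and collect the two selection criteria
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Cholesterol"), data = sim)
  data.frame(predictors = paste(vars, collapse = " + "),
             adj_r2 = summary(fit)$adj.r.squared,
             aic = AIC(fit),
             stringsAsFactors = FALSE)
}))

best_adj <- results$predictors[which.max(results$adj_r2)]  # higher is better
best_aic <- results$predictors[which.min(results$aic)]     # lower is better
print(results)
print(c(best_by_adj_r2 = best_adj, best_by_aic = best_aic))
```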