Exercise 1
We would like to investigate the relationships between Cholesterol, Weight, and blood pressure. The data set heart.csv contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for living subjects.
The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.
Exercise 1.A
Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Use a Cook’s distance cutoff of 0.015.
Scatter Plot and Correlation Between Weight and Cholesterol
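lm.heart = lm(Cholesterol ~ Weight, data = heart)  # simple linear regression fit
plot(heart$Weight, heart$Cholesterol, xlab = "Weight", ylab = "Cholesterol")
abline(lm.heart, col = "red")  # overlay the fitted line
cor(heart$Weight, heart$Cholesterol, method = "spearman")  # Spearman rank correlation
## [1] 0.1078544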
- Correlation: The Spearman correlation was selected because it is robust to outliers in the data. The Spearman correlation of 0.1078544 indicates a weak positive association between Weight and Cholesterol.
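The robustness argument can be illustrated with a minimal sketch. The data below are synthetic (they are not the heart.csv values), and the single extreme point is hypothetical; the only purpose is to show how much one outlier moves each correlation measure.

```r
# Synthetic data (NOT the heart.csv values): a weak linear trend plus noise.
set.seed(1)
x <- rnorm(200, mean = 150, sd = 25)
y <- 200 + 0.1 * x + rnorm(200, sd = 40)

pearson_clean  <- cor(x, y, method = "pearson")
spearman_clean <- cor(x, y, method = "spearman")

# Add one extreme, hypothetical observation and recompute both measures.
x_out <- c(x, 400)
y_out <- c(y, 900)
pearson_out  <- cor(x_out, y_out, method = "pearson")
spearman_out <- cor(x_out, y_out, method = "spearman")

# Shift in each measure caused by the single added point.
pearson_shift  <- abs(pearson_out - pearson_clean)
spearman_shift <- abs(spearman_out - spearman_clean)
print(c(pearson = pearson_shift, spearman = spearman_shift))
```

Because Spearman works on ranks, the added point can only move to the top rank in each variable, so its effect on the coefficient stays small, while the Pearson coefficient is pulled strongly toward the outlier.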
Exercise 1.B
Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
R Output (F-Test, T-test, Estimated Regression Line, R-Squared)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -111.95  -29.59   -4.64   23.49  334.35
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
- Model Significance: The model p-value of 9.778e-05 is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the linear regression model is useful for explaining the behavior of Cholesterol (there is some linear relationship between x and y).
- Individual Term Significance: The t-test on the individual term (Weight) yields a p-value of 9.78e-05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Weight and Cholesterol.
- Estimated Regression Line: Predicted Cholesterol = 205.86763 + 0.10867(Weight). On average, Cholesterol is predicted to increase by 0.10867 units when Weight increases by one unit.
- R-Squared: Although there is a significant relationship between Weight and Cholesterol, only 0.4835% of the variation in Cholesterol is explained by the model. Therefore, this is not a good model for predicting Cholesterol level (low predictive power), and we would not recommend it to the medical director for that purpose.
- Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes curve away from it, so an assumption of normally distributed residuals is not reasonable. The Scale-Location plot (square root of the absolute standardized residuals) points the same way: a considerable number of observations fall above 1.5 on the Y-axis, which corresponds to absolute standardized residuals above 2.25.
- Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity.
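To make the gap between "statistically significant" and "useful for prediction" concrete, the short sketch below plugs two hypothetical Weight values into the fitted line and compares the change in the prediction with the residual standard error. The coefficients and the residual standard error are copied from the summary output above; the Weight values 150 and 250 are assumptions chosen only for illustration.

```r
# Coefficients and residual standard error copied from the summary output.
b0  <- 205.86763
b1  <- 0.10867
rse <- 43.62

# Hypothetical Weight values, chosen only for illustration.
weights   <- c(150, 250)
predicted <- b0 + b1 * weights
shift     <- diff(predicted)  # change in predicted Cholesterol over 100 units of Weight

print(predicted)
print(shift / rse)  # the shift as a fraction of the residual standard error
```

A 100-unit difference in Weight moves the predicted Cholesterol by about 10.9 units, which is roughly a quarter of the typical residual size. This is consistent with the very low R-squared.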
Exercise 2
The medical director wants to know if blood pressures and Weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of Diastolic, Systolic, and Weight.
Exercise 2.A
Fit a linear regression model for cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.
Model Diagnostics
Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes curve away from it, so an assumption of normally distributed residuals is not reasonable. The Scale-Location plot (square root of the absolute standardized residuals) points the same way: a considerable number of observations fall above 1.5 on the Y-axis, which corresponds to absolute standardized residuals above 2.25.
Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity.
Cook’s Distance
Based on the above Cook’s Distance plot, the following unduly influential points are noted: Observations 23 and 210.
Confirm Observations With Cook’s Distance Greater Than 0.015
##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500
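The flag-inspect-refit workflow described above can be sketched as follows. This is an illustration on synthetic data (not the heart.csv values) with one planted extreme row; `toy`, `fit_all`, `trimmed`, and `fit_trim` are hypothetical names, while `influential.id` mirrors the name that appears in the model call of the report's output.

```r
# Synthetic stand-in for the heart data (NOT the real values).
set.seed(42)
n <- 300
toy <- data.frame(
  Weight    = rnorm(n, 150, 25),
  Diastolic = rnorm(n, 85, 10),
  Systolic  = rnorm(n, 135, 20)
)
toy$Cholesterol <- 160 + 0.3 * toy$Systolic + 0.25 * toy$Diastolic +
  rnorm(n, sd = 40)

# Plant one extreme, hypothetical observation so that something is flagged.
# Column order: Weight, Diastolic, Systolic, Cholesterol.
toy[1, ] <- c(300, 140, 250, 600)

fit_all <- lm(Cholesterol ~ ., data = toy)
cd <- cooks.distance(fit_all)

# Rows whose Cook's distance exceeds the cutoff used in the exercise.
influential.id <- which(cd > 0.015)
print(toy[influential.id, ])

# Refit without the flagged rows; guard against an empty index, because
# toy[-integer(0), ] would drop every row.
trimmed  <- if (length(influential.id) > 0) toy[-influential.id, ] else toy
fit_trim <- lm(Cholesterol ~ ., data = trimmed)
print(summary(fit_trim)$r.squared)
```

With a small sample such as this one, a cutoff of 0.015 may flag several ordinary rows in addition to the planted one, so inspecting the flagged rows before dropping them is worthwhile.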
Exercise 2.B
Comment on significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multi-collinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.
R Output (F-Test, T-test, Estimated Regression Line, R-Squared)
##
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-influential.id, ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -110.617  -29.371   -4.476   23.755  216.041
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994
## Diastolic     0.24922    0.10665   2.337   0.0195 *
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
- Model Significance: The model p-value (< 2.2e-16) is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining the behavior of Cholesterol (at least one B is not equal to 0).
- Individual Term Significance: The t-tests on Diastolic, Systolic, and Weight yield the following results.
  - Diastolic: The p-value of 0.0195 is below the significance level of 0.05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Diastolic and Cholesterol.
  - Systolic: The p-value of 2.2e-06 is below the significance level of 0.05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Systolic and Cholesterol.
  - Weight: The p-value of 0.1994 is above the significance level of 0.05; therefore, we do NOT reject the null hypothesis. Once Diastolic and Systolic are in the model, there is insufficient evidence of a linear relationship between Weight and Cholesterol.
- Estimated Regression Line: Predicted Cholesterol = 156.32618 + 0.03671(Weight) + 0.24922(Diastolic) + 0.30073(Systolic)
  - A one-unit increase in Diastolic blood pressure is associated with a 0.24922 unit increase in Cholesterol, holding Systolic blood pressure and Weight constant. Each additional unit of Systolic blood pressure is associated with a 0.30073 unit increase in Cholesterol, holding Diastolic blood pressure and Weight constant. The Weight coefficient is not statistically significant.
- R-Squared: Only about 3.8% of the variation in Cholesterol (R-squared of 0.03767) is explained by the model. Therefore, this is not a good model for predicting Cholesterol level (low predictive power), and we would not recommend it to the medical director for that purpose.
Variance Inflation Factors (VIF) to Check Multicollinearity
##    Weight Diastolic  Systolic
##  1.120631  2.558914  2.454207
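- VIF: None of the predictors Diastolic, Systolic, and Weight exceeds the VIF cutoff of 10, so multicollinearity is not severe enough to be a concern.
- Relationships: There appears to be a strong linear relationship between Diastolic and Systolic, which is reflected in their VIFs of about 2.5, compared with about 1.1 for Weight.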
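For reference, the VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on the other predictors. The sketch below computes it with base R only; `vif_manual` is a hypothetical helper written for this illustration (it is not the function used to produce the output above), and the data are synthetic with deliberately correlated blood pressures.

```r
# VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Synthetic stand-in data (NOT the heart.csv values).
set.seed(7)
n   <- 500
sys <- rnorm(n, 135, 20)
toy <- data.frame(
  Weight    = rnorm(n, 150, 25),
  Systolic  = sys,
  Diastolic = 30 + 0.4 * sys + rnorm(n, sd = 6)
)
vifs <- vif_manual(toy, c("Weight", "Diastolic", "Systolic"))
print(vifs)

# Going the other way with the value reported for Diastolic:
r2_diastolic <- 1 - 1 / 2.558914
print(r2_diastolic)
```

Read this way, the reported VIF of 2.558914 means that roughly 61% of the variation in Diastolic is explained by Weight and Systolic together, which is noticeable but well short of the level implied by a VIF of 10 (90%).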
Exercise 3
Now consider step-wise model selection for the Cholesterol model. We remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.
Exercise 3.A
Perform step-wise model selection with .05 criteria and address any issues in diagnostics plots.
library(olsrr)  # provides ols_step_both_p()
model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise
##
##                            Stepwise Selection Summary
## ---------------------------------------------------------------------------------
##                    Added/                    Adj.
## Step   Variable    Removed    R-Square   R-Square     C(p)          AIC      RMSE
## ---------------------------------------------------------------------------------
##    1   Systolic    addition      0.035      0.035   8.6850   32349.7666   42.3013
##    2   Diastolic   addition      0.037      0.037   3.6480   32344.7321   42.2606
## ---------------------------------------------------------------------------------
- Results: Based on the results of the step-wise selection process, Systolic and Diastolic will be included in the final model.
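The idea behind a p-value-based entry step can be sketched with base R alone. This is not the olsrr implementation; it is a single forward step done by hand with `add1()` and a partial F-test, on synthetic data (not the heart.csv values) in which only Systolic truly affects the response. The names `toy`, `fit0`, `step1`, `best`, and `enters` are hypothetical.

```r
# Synthetic stand-in data (NOT the heart.csv values).
set.seed(3)
n <- 400
toy <- data.frame(
  Weight    = rnorm(n, 150, 25),
  Diastolic = rnorm(n, 85, 10),
  Systolic  = rnorm(n, 135, 20)
)
toy$Cholesterol <- 160 + 0.5 * toy$Systolic + rnorm(n, sd = 40)

# One forward step by hand: start from the intercept-only model and test
# each candidate predictor with a partial F-test.
fit0  <- lm(Cholesterol ~ 1, data = toy)
step1 <- add1(fit0, scope = ~ Weight + Diastolic + Systolic, test = "F")
print(step1)

# The candidate with the smallest p-value enters if that p-value is below 0.05.
pvals <- step1[["Pr(>F)"]][-1]       # drop the "<none>" row
names(pvals) <- rownames(step1)[-1]
best   <- names(which.min(pvals))
enters <- min(pvals) < 0.05
print(best)
print(enters)
```

A full step-wise procedure repeats this step on the updated model and, after each addition, also checks whether any variable already in the model has a p-value above the removal threshold.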
Model Without Weight
Diagnostics
Normality Check: In the Normal Q-Q plot, most points fall along the reference line, but the points at the extremes curve away from it, so an assumption of normally distributed residuals is not reasonable. The Scale-Location plot (square root of the absolute standardized residuals) points the same way: a considerable number of observations fall above 1.5 on the Y-axis, which corresponds to absolute standardized residuals above 2.25.
Equal Variance Check: The standardized residuals plot shows a pattern rather than a random scatter, which suggests heteroscedasticity.
Exercise 3.B
Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-influential.id,
## ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -109.332  -29.399   -4.433   23.922  217.241
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
- Final Model: Step-wise selection added only Systolic and Diastolic (see the “Added/Removed” column of the selection summary), so the final model is Cholesterol ~ Systolic + Diastolic.
- Model Significance: The model p-value (< 2.2e-16) is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining the behavior of Cholesterol (at least one B is not equal to 0).
- Individual Term Significance: The t-tests on Diastolic and Systolic yield the following results.
  - Diastolic: The p-value of 0.00803 is below the significance level of 0.05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Diastolic and Cholesterol.
  - Systolic: The p-value of 1.95e-06 is below the significance level of 0.05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Systolic and Cholesterol.
- Estimated Regression Line: Predicted Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic)
  - A one-unit increase in Diastolic blood pressure is associated with a 0.2770 unit increase in Cholesterol, holding Systolic blood pressure constant. Each additional unit of Systolic blood pressure is associated with a 0.3022 unit increase in Cholesterol, holding Diastolic blood pressure constant.
Variation Explained by the Model Comparison
- Exercise 1: Only 0.4835% of the variation in Cholesterol is explained by the model. Therefore, this is not a good model for predicting Cholesterol level (low predictive power).
- Exercise 2: Only about 3.8% of the variation in Cholesterol is explained by the model. Therefore, this is not a good model for predicting Cholesterol level (low predictive power).
- Exercise 3: Only about 3.7% of the variation in Cholesterol is explained by the model. Therefore, this is not a good model for predicting Cholesterol level (low predictive power). This is nearly identical to the Exercise 2 model, and both explain more variation than the Exercise 1 model, although all three have low predictive power.
Exercise 4
Now consider best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.
##         Best Subsets Regression
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic
##      2         Diastolic Systolic
##      3         Weight Diastolic Systolic
## ----------------------------------------
##
##                                              Subsets Regression Summary
## ------------------------------------------------------------------------------------------------------------------------
##                      Adj.      Pred
## Model  R-Square  R-Square  R-Square    C(p)         AIC        SBIC         SBC          MSEP        FPE     HSP     APC
## ------------------------------------------------------------------------------------------------------------------------
##     1    0.0350    0.0347    0.0337  8.6847  32349.7666  23461.5297  32367.9149  5604396.2122  1790.5412  0.5719  0.9662
##     2    0.0372    0.0365    0.0352  3.6475  32344.7321  23456.5056  32368.9298  5593610.3978  1787.6653  0.5710  0.9647
##     3    0.0377    0.0367    0.0351  4.0000  32345.0829  23456.8621  32375.3300  5592453.6261  1787.8655  0.5710  0.9648
## ------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Exercise 4.A
Find the best model based on adjusted-R square criteria and specify which predictors are selected.
Best Model: Based on the adjusted R-square criteria, the best model is Model 3 as it has the highest adjusted R-square of 0.0367.
Selected Predictors: Model 3 includes the Weight, Diastolic, and Systolic predictors.
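As a check on how close this comparison is, adjusted R-squared can be recomputed from the reported R-squared values with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The inputs below are the rounded values from the Exercise 2 and Exercise 3 summaries, so the results agree with the report only up to rounding; `adj_r2` is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual df + p + 1 = 3132 observations in both fits
# (3129 + 2 + 1 and 3128 + 3 + 1).
n <- 3132
adj_two   <- adj_r2(0.03716, n, p = 2)  # Diastolic + Systolic
adj_three <- adj_r2(0.03767, n, p = 3)  # Weight + Diastolic + Systolic
print(round(c(two = adj_two, three = adj_three), 4))
```

The two values differ by about 0.0002, so adding Weight improves the adjusted R-squared only marginally.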
Exercise 4.B
Find the best model based on AIC criteria and specify which predictors are selected.
Best Model: Based on the AIC criteria, the best model is Model 2 as it has the lowest AIC of 32344.7321.
Selected Predictors: Model 2 includes the Diastolic and Systolic predictors.
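The AIC comparison can be reproduced in miniature with base R. The sketch below uses synthetic data (not the heart.csv values) and writes out the Gaussian linear model AIC by hand so that the penalty for each extra coefficient is visible; `aic_manual`, `fit2`, and `fit3` are hypothetical names.

```r
# Synthetic stand-in data (NOT the heart.csv values).
set.seed(11)
n <- 400
toy <- data.frame(
  Weight    = rnorm(n, 150, 25),
  Diastolic = rnorm(n, 85, 10),
  Systolic  = rnorm(n, 135, 20)
)
toy$Cholesterol <- 160 + 0.3 * toy$Systolic + 0.3 * toy$Diastolic +
  rnorm(n, sd = 40)

fit2 <- lm(Cholesterol ~ Diastolic + Systolic, data = toy)
fit3 <- lm(Cholesterol ~ Weight + Diastolic + Systolic, data = toy)

# AIC for a Gaussian linear model, written out by hand:
# n * (log(2 * pi) + log(RSS / n) + 1) + 2 * k,
# where k counts the regression coefficients plus the error variance.
aic_manual <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  k   <- length(coef(fit)) + 1
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * k
}

print(c(two = AIC(fit2), three = AIC(fit3)))
print(c(two = aic_manual(fit2), three = aic_manual(fit3)))
```

Each added predictor costs 2 AIC units, so a variable such as Weight lowers AIC only if it reduces the residual sum of squares enough to offset that penalty.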
Exercise 4.C
Compare final models selected in a) and b). Also compare final models from best subset approach with the final model from step-wise selection.
Final Model Comparisons
- a) Best Model (Adjusted R-Squared): Model with predictors Weight, Diastolic, and Systolic
- b) Best Model (AIC Criteria): Model with predictors Diastolic and Systolic
Comparing the models selected in a) and b), we see that both models have p-values below the significance level of 0.05; therefore, they are both useful. In both models, the Systolic and Diastolic predictors have p-values below the significance level; therefore, there is a significant linear relationship between these predictors and Cholesterol. The two models differ only by Weight, which is not significant in the three-predictor model (p-value of 0.1994), and their adjusted R-squared values are nearly identical (0.0367 versus 0.0365).
- c) Step-wise Selection Best Model: Model with predictors Diastolic and Systolic
Both the step-wise selection and the best subset approach under the AIC criterion yielded a model with the Diastolic and Systolic predictors. Therefore, our final selected model is Predicted Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic).
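As a usage illustration, the final equation can be wrapped in a small function. The coefficients are copied from the Exercise 3 summary output; `predict_cholesterol` is a hypothetical helper name, and the blood pressure readings are assumptions chosen only for illustration.

```r
# Coefficients copied from the Exercise 3 summary output.
predict_cholesterol <- function(diastolic, systolic) {
  159.3317 + 0.2770 * diastolic + 0.3022 * systolic
}

# Hypothetical blood pressure readings, chosen only for illustration.
low  <- predict_cholesterol(diastolic = 80,  systolic = 120)
high <- predict_cholesterol(diastolic = 100, systolic = 160)
print(c(low = low, high = high, difference = high - low))
```

Even for these two quite different blood pressure profiles, the predictions differ by about 17.6 units, which is well under the residual standard error of 42.26 reported for this model. That is in line with the conclusion that the model has low predictive power.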