Exercise 1A

We would like to investigate the relationships between Cholesterol, Weight and/or Blood Pressure. The data set heart.csv contains Weight, Diastolic blood pressure, Systolic blood pressure and Cholesterol for alive subjects. The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.

Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Use a Cook’s distance cutoff of 0.015.

Linear Regression Model (Cholesterol ~ Weight)

heart <- read.csv("heart.csv")  # assumption: heart.csv is in the working directory
lm.heart = lm(Cholesterol ~ Weight, heart)

Scatter Plot and Correlation (Cholesterol ~ Weight)

plot(heart$Weight, heart$Cholesterol, xlab ="Weight", ylab ="Cholesterol")
abline(lm.heart, col ="red")

cor(heart$Weight, heart$Cholesterol, method ="pearson")
## [1] 0.0695377

Conclusion: As shown above, the Pearson correlation between Weight and Cholesterol is 0.0695 (a Spearman correlation, not shown in the output above, gives 0.1078544). Both values indicate a very weak positive association between Weight and Cholesterol.

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))

Conclusion:

  • Normality Check: In the Normal Q-Q plot, most of the points fall along the straight reference line, but the points towards the upper end pull away from it. This indicates that the normality assumption is not reasonable. The standardized residuals point the same way: a noticeable group of points lies above 1.5 on the y-axis, which is consistent with a heavy upper tail.

  • Equal Variance Check: The Scale-Location plot (square root of the absolute standardized residuals against the fitted values) shows a pattern rather than a flat, random band. This suggests heteroscedasticity, so the constant-variance assumption is questionable.
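The exercise also asks for unduly influential points to be removed before refitting. The sketch below shows one way that step could look for the Weight-only model. It runs on simulated data, not heart.csv, and `drop_influential` is a hypothetical helper name introduced here for illustration; the two planted rows are an assumption made so that something gets flagged.

```r
# Simulated stand-in for heart.csv (hypothetical data, not the real file).
set.seed(1)
n <- 3000
sim <- data.frame(Weight = rnorm(n, 150, 25))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, 0, 40)

# Plant two rows with extreme Weight and Cholesterol so they are influential.
sim$Weight[c(5, 50)] <- 300
sim$Cholesterol[c(5, 50)] <- c(550, 500)

# Hypothetical helper: refit until no Cook's distance exceeds the cutoff.
drop_influential <- function(formula, data, cutoff = 0.015) {
  removed <- character(0)
  repeat {
    fit <- lm(formula, data = data)
    flagged <- names(which(cooks.distance(fit) > cutoff))
    if (length(flagged) == 0) break
    removed <- c(removed, flagged)
    data <- data[!(rownames(data) %in% flagged), , drop = FALSE]
  }
  list(fit = fit, removed = removed)
}

res <- drop_influential(Cholesterol ~ Weight, sim)
res$removed                      # row names that were dropped
max(cooks.distance(res$fit))     # no remaining value above the cutoff
```

Repeating the check after each refit matters because removing one point can change the Cook's distances of the others. On the real data, the same call would take `heart` in place of `sim`.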


Exercise 1B

Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

summary(lm.heart)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

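Conclusion: The Weight coefficient is statistically significant (p = 9.78e-05): each additional unit of Weight is associated with an estimated 0.109 increase in mean Cholesterol. However, the model explains only about 0.48% of the variation in Cholesterol (R-squared = 0.004835), with a residual standard error of 43.62. The diagnostic plots also raise concerns about normality and constant variance. Weight alone is therefore not a good model for predicting Cholesterol, even though the relationship is statistically detectable in a sample of this size.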
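As a quick arithmetic check on the numbers reported above, the following sketch uses only values copied from the cor() and summary(lm.heart) outputs. It relies on the standard fact that, with a single predictor, R-squared equals the squared Pearson correlation.

```r
# Values copied from the outputs above.
r     <- 0.0695377   # Pearson correlation from cor()
slope <- 0.10867     # Weight coefficient from summary(lm.heart)

# With one predictor, R-squared is the squared correlation.
r_squared <- r^2
round(r_squared, 6)   # 0.004835, matching Multiple R-squared

# Estimated change in mean Cholesterol for a 10-unit increase in Weight.
slope * 10            # 1.0867
```

Both results say the same thing in different ways: the slope is real but small, and Weight accounts for well under 1% of the variation in Cholesterol.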


Exercise 2A

The medical director wants to know if blood pressures and weight can better predict Cholesterol outcome. Consider modeling Cholesterol as a function of Diastolic, Systolic, and Weight. Fit a linear regression model for Cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.

Scatter Plot (Diastolic, Systolic, and Weight)

pairs(heart)

lm.heart2 <- lm(Cholesterol~., data = heart)

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart2, which=c(1:4))

Conclusion:

  • Normality Check: In the Normal Q-Q plot, most of the points fall along the straight reference line, but the points towards the upper end pull away from it. This indicates that the normality assumption is not reasonable. The standardized residuals point the same way: a noticeable group of points lies above 1.5 on the y-axis, which is consistent with a heavy upper tail.

  • Equal Variance Check: The Scale-Location plot (square root of the absolute standardized residuals against the fitted values) shows a pattern rather than a flat, random band. This suggests heteroscedasticity, so the constant-variance assumption is questionable.

  • Cook’s Distance: From the Cook’s Distance Plot, we can see that there are 2 unduly influential points. We’ll confirm if these two points have a Cook’s distance greater than 0.015 below.

ipoint <- which(cooks.distance(lm.heart2) > 0.015) 
heart[ipoint, ]
##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

Conclusion: The two points, rows 23 and 210, have Cook’s distance greater than 0.015. Hence, we remove them and refit the model below.

lm.heart2 <- lm(Cholesterol~., data = heart[-ipoint, ])

Exercise 2B

Comment on significance of the parameters and how much variation in Cholesterol is described by the model. Comment on the relationship between Cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

summary(lm.heart2)
## 
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-ipoint, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

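Conclusion: After removing the two influential points, Systolic (p = 2.2e-06) and Diastolic (p = 0.0195) are significant at the 0.05 level, while Weight is not (p = 0.1994). Holding the other predictors fixed, a one-unit increase in Systolic is associated with an estimated 0.301 increase in mean Cholesterol, and a one-unit increase in Diastolic with an estimated 0.249 increase. The overall F-test is significant (p < 2.2e-16), but the model explains only about 3.8% of the variation in Cholesterol (R-squared = 0.03767, adjusted R-squared = 0.03675), so it remains a weak model for prediction.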
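The adjusted R-squared in the summary above can be reproduced from the reported R-squared with the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below uses only numbers from the output; n is inferred as the residual degrees of freedom plus the four estimated coefficients.

```r
r2 <- 0.03767    # Multiple R-squared from summary(lm.heart2)
n  <- 3128 + 4   # residual df plus intercept and three slopes
p  <- 3          # number of predictors

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 5)   # 0.03675, matching Adjusted R-squared
```

With more than 3000 observations and only three predictors, the penalty is tiny, which is why the two values are nearly identical.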

Checking Multicollinearity using VIF (Variance Inflation Factors)

library(car)  # assumption: vif() here is car::vif
vif(lm.heart2)
##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207

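Conclusion: From the VIF table above, we can see that none of the three predictors (Weight, Diastolic, and Systolic) exceeds the VIF cutoff of 10. Diastolic and Systolic have VIFs of about 2.5, which reflects some correlation between the two blood pressure measures, but not enough to make multicollinearity a serious concern in this model.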
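For intuition about what the VIF measures, the sketch below computes it by hand as 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. It uses simulated data rather than heart.csv, and `manual_vif` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Simulated stand-in with correlated blood pressure columns (hypothetical data).
set.seed(2)
n <- 500
sim <- data.frame(Weight = rnorm(n, 150, 25), Diastolic = rnorm(n, 85, 12))
sim$Systolic <- 40 + 1.1 * sim$Diastolic + rnorm(n, 0, 12)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is from
# regressing predictor j on the remaining predictors.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(sim, c("Weight", "Diastolic", "Systolic"))
vifs
```

A VIF near 1 means a predictor is almost unrelated to the others; in this simulation the two blood pressure columns get larger values than Weight because one was generated from the other.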


Exercise 3A

Now consider stepwise model selection for the Cholesterol model. We remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Perform stepwise model selection with 0.05 criteria and address any issues in diagnostics plots.

library(olsrr)  # provides ols_step_both_p() and ols_step_best_subset()
model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise
## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------
plot(model.stepwise)


Exercise 3B

Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.

lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heart)
summary(lm.step)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.52  -29.58   -4.57   23.79  328.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

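Conclusion: Stepwise selection with entry and removal thresholds of 0.05 adds Systolic first and then Diastolic; Weight is never entered. In the refit shown above, both predictors are significant (Systolic p = 2.89e-06, Diastolic p = 0.00932), and the model explains about 3.6% of the variation in Cholesterol (R-squared = 0.03589). Note that this refit uses all 3134 observations (3131 residual degrees of freedom), so it still includes the two influential points removed in Exercise 2. The stepwise summary itself, which was run on lm.heart2, reports R-squared = 0.037 for the same two predictors.

Variation explained by the models from Exercise 1 and 2: The Weight-only model from Exercise 1 explains about 0.48% of the variation (R-squared = 0.004835), and the three-predictor model from Exercise 2 explains about 3.8% (R-squared = 0.03767). The stepwise model comes close to the Exercise 2 model with one fewer predictor, and all three models leave more than 96% of the variation in Cholesterol unexplained.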


Exercise 4A

Now consider best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Find the best model based on the adjusted R-squared criterion and specify which predictors are selected.

model.best.subset = ols_step_best_subset(lm.heart2)
model.best.subset
##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

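Conclusion: Based on adjusted R-squared, the best model is Model 3 (Weight, Diastolic, Systolic) with an adjusted R-squared of 0.0367. It is only marginally ahead of Model 2 (Diastolic, Systolic), which has an adjusted R-squared of 0.0365.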


Exercise 4B

Find the best model based on AIC criteria and specify which predictors are selected.

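Conclusion: Using the AIC column of the best subset table from part a), the best model is Model 2 (Diastolic, Systolic), which has the lowest AIC (32344.7321, compared with 32345.0829 for Model 3 and 32349.7666 for Model 1).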
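To make the selection criteria concrete, the sketch below enumerates every non-empty subset of the three predictors with base R and ranks them by adjusted R-squared and by AIC. It runs on simulated data, not heart.csv, and it is not how ols_step_best_subset() works internally; base R's AIC() may also differ from the olsrr value by a constant, which does not change the ranking.

```r
# Simulated stand-in for the cleaned heart data (hypothetical values).
set.seed(3)
n <- 500
sim <- data.frame(Weight = rnorm(n, 150, 25), Diastolic = rnorm(n, 85, 12))
sim$Systolic <- 40 + 1.1 * sim$Diastolic + rnorm(n, 0, 12)
sim$Cholesterol <- 150 + 0.3 * sim$Systolic + 0.25 * sim$Diastolic +
  rnorm(n, 0, 40)

predictors <- c("Weight", "Diastolic", "Systolic")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record adjusted R-squared and AIC.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Cholesterol"), data = sim)
  data.frame(model  = paste(vars, collapse = " + "),
             adj_r2 = summary(fit)$adj.r.squared,
             aic    = AIC(fit))
}))

results[which.max(results$adj_r2), ]   # best by adjusted R-squared
results[which.min(results$aic), ]      # best by AIC
```

The two criteria can disagree because AIC penalizes an extra predictor more heavily than adjusted R-squared does, so a predictor that adds almost nothing may survive under one rule and not the other.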


Exercise 4C

Compare final models selected in a) and b). Also compare final models from Best Subset approach with the final model from Stepwise Selection.

Final Conclusion:

From the best subset selection, the adjusted R-squared criterion picks the three-predictor model (Weight, Diastolic, Systolic), while AIC picks the two-predictor model (Diastolic, Systolic). The two differ only by Weight, which is not significant in the full model (p = 0.1994) and raises adjusted R-squared by just 0.0002. Both selected models are significant overall at the 0.05 level, and in both, Diastolic and Systolic have p-values below 0.05. We can therefore conclude that there is a significant linear relationship between each of these two predictors and Cholesterol.

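Furthermore, the AIC-based Best Subset approach and the Stepwise Selection return the same predictors, Diastolic and Systolic. Hence, the final selected model is Y = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic) + E, where Y is Cholesterol and E is the random error term. Even so, this model explains under 4% of the variation in Cholesterol, so its value for prediction is limited.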
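As a small illustration of using the final equation, the sketch below plugs in the blood pressure values of row 23 from Exercise 2A (Diastolic 82, Systolic 130). The function name `predict_chol` is hypothetical, and the coefficients are copied from the equation above.

```r
# Coefficients copied from the final model equation above.
b0    <- 159.3317
b_dia <- 0.2770
b_sys <- 0.3022

# Hypothetical helper: point prediction of Cholesterol from the final equation.
predict_chol <- function(diastolic, systolic) {
  b0 + b_dia * diastolic + b_sys * systolic
}

predict_chol(82, 130)   # 221.3317
```

Row 23 had an observed Cholesterol of 550, so the prediction misses by more than 300. That row was removed as influential, but the gap still shows how little of an individual's Cholesterol these predictors capture.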