Homework 3

Rudy Martinez, Brenda Parnin, Jose Fernandez

11/7/2020

Exercise 1

We would like to investigate the relationships between Cholesterol, Weight, and/or blood pressure. The data set heart.csv contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for alive subjects.

The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.

Exercise 1.A

Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.

Fitted Linear Regression Model

lm.heart = lm(Cholesterol~Weight, data = heart)

Scatter Plot and Correlation Between Weight and Cholesterol

plot(heart$Weight, heart$Cholesterol, xlab ="Weight", ylab ="Cholesterol")
abline(lm.heart, col ="red")

cor(heart$Weight, heart$Cholesterol, method ="spearman")
## [1] 0.1078544
  • Correlation: The Spearman correlation was selected because it is robust to outliers in the data. The Spearman correlation of 0.1078544 indicates a weak positive monotonic association between Weight and Cholesterol.
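To illustrate why the Spearman measure depends only on ordering, the following sketch uses simulated data (the vectors x and y are hypothetical and are not columns of heart.csv). It shows that Spearman's rho is the Pearson correlation computed on ranks, and that it is unchanged by a monotone transformation that stretches large values.

```r
# Simulated stand-in data; x and y are hypothetical, not columns of heart.csv.
set.seed(42)
x <- rnorm(200)
y <- 0.1 * x + rnorm(200)

# Spearman's rho is Pearson's r applied to the ranks of each variable.
rho_spearman <- cor(x, y, method = "spearman")
rho_from_ranks <- cor(rank(x), rank(y), method = "pearson")

# exp() is strictly increasing, so it stretches large values of y
# without changing their order; the ranks, and hence rho, stay the same.
rho_after_stretch <- cor(x, exp(y), method = "spearman")

print(c(spearman = rho_spearman,
        from_ranks = rho_from_ranks,
        after_stretch = rho_after_stretch))
```

Because an extreme value enters the calculation only through its rank, its magnitude cannot dominate the result the way it can for the Pearson correlation.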

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4)) 

  • Cook’s Distance: Based on the Cook’s Distance plot above, observations 23 and 210 are noted as unduly influential.
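The exercise also asks for the model to be refit without the influential points. The summary reported in Exercise 1.B has 3132 residual degrees of freedom, which appears to correspond to all 3134 observations, so a refit is not shown in this report. A minimal sketch of that step follows, assuming the same 0.015 cutoff; the helper name refit_without_influential is hypothetical and the example data are simulated.

```r
# Hypothetical helper: drop rows whose Cook's distance exceeds the cutoff,
# then refit the same formula on the remaining rows.
refit_without_influential <- function(formula, data, cutoff = 0.015) {
  fit <- lm(formula, data = data)
  cd <- cooks.distance(fit)
  # Row names of the flagged observations (empty if none are flagged).
  dropped <- names(cd)[which(cd > cutoff)]
  # Filtering by row name is safe even when nothing is flagged,
  # unlike data[-integer(0), ], which returns zero rows.
  kept <- data[!(rownames(data) %in% dropped), , drop = FALSE]
  list(dropped = dropped, fit = lm(formula, data = kept))
}

# Simulated example with one extreme, high-leverage point in row 51.
set.seed(1)
sim <- data.frame(Weight = c(rnorm(50, mean = 150, sd = 20), 400))
sim$Cholesterol <- c(200 + 0.1 * sim$Weight[1:50] + rnorm(50, sd = 5), 900)
result <- refit_without_influential(Cholesterol ~ Weight, sim, cutoff = 0.5)
print(result$dropped)
print(coef(result$fit))
```

With the report's data this would be called as refit_without_influential(Cholesterol ~ Weight, heart), and summary() of the returned fit would replace the full-data summary.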

Exercise 1.B

Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

R Output (F-Test, T-test, Estimated Regression Line, R-Squared)

summary(lm.heart)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05
  • Model Significance: The model p-value of 9.778e-05 is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the linear regression model is useful for explaining Cholesterol (there is some linear relationship between Weight and Cholesterol).

  • Individual Term Significance: The t-test on the individual term (Weight) yields a p-value of 9.78e-05; therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between Weight and Cholesterol.

  • Estimated Regression Line: Predicted Cholesterol = 205.86763 + 0.10867(Weight). On average, Cholesterol is predicted to increase by 0.10867 units for each one-unit increase in Weight.

  • R-Squared: Although there is a significant relationship between Weight and the behavior of Cholesterol, only 0.4835% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). We would not recommend this model for the prediction of Cholesterol level to the medical director.

  • Normality Check: In the Normal Q-Q Plot, most points fall along the line, but the points at the extremes curve away from it, so an assumption of normality is not reasonable. The residual summary points the same way: the largest residual (334.35) is about three times the size of the smallest (-111.95), which indicates a long right tail. The sqrt(Standardized Residuals) Plot is primarily a check of constant variance rather than normality; the considerable number of observations above 1.5 on its Y-axis (absolute standardized residuals above 2.25) is consistent with that long tail.

  • Equal Variance Check: Looking at the Residuals vs Fitted and sqrt(Standardized Residuals) plots, we see a pattern rather than an even band of points. This supports heteroscedasticity.
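The significance and R-squared bullets above describe the same fit from two angles, and the reported numbers can be cross-checked against each other. The sketch below uses only values copied from the summary output: it recovers the F statistic from R-squared and confirms that, with a single predictor, the F statistic is the squared t statistic.

```r
# Values copied from the summary(lm.heart) output above.
r_squared <- 0.004835
df_model <- 1
df_resid <- 3132
t_weight <- 3.901

# Overall F statistic implied by R-squared.
f_from_r2 <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# With a single predictor, the overall F statistic equals the squared t statistic.
f_from_t <- t_weight^2

# Upper-tail p-value for the F statistic.
p_value <- pf(f_from_r2, df_model, df_resid, lower.tail = FALSE)

print(round(c(f_from_r2 = f_from_r2, f_from_t = f_from_t), 2))
print(signif(p_value, 3))
```

This is why a model can be highly significant and still explain almost none of the variation: with more than 3000 observations, even a very small R-squared produces a large F statistic.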


Exercise 2

The medical director wants to know if blood pressures and Weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of Diastolic, Systolic, and Weight.

Exercise 2.A

Fit a linear regression model for cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.

Fit Linear Regression Model

lm.heart2 = lm(Cholesterol~., data = heart)

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart2, which=1:4)

  • Normality Check: In the Normal Q-Q Plot, most points fall along the line, but the points at the extremes curve away from it, so an assumption of normality is not reasonable. The sqrt(Standardized Residuals) Plot is primarily a check of constant variance rather than normality; the considerable number of observations above 1.5 on its Y-axis (absolute standardized residuals above 2.25) is consistent with heavy tails.

  • Equal Variance Check: Looking at the Residuals vs Fitted and sqrt(Standardized Residuals) plots, we see a pattern rather than an even band of points. This supports heteroscedasticity.
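The equal variance check above is visual. As a complement, a formal test can be sketched in base R. The function below is a hypothetical helper (bp_koenker is not from any package) implementing the studentized Breusch-Pagan statistic, n times the R-squared of a regression of the squared residuals on the model's predictors; the example data are simulated.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test for an lm fit.
bp_koenker <- function(fit) {
  X <- model.matrix(fit)            # design matrix, including the intercept
  e2 <- residuals(fit)^2            # squared residuals
  aux <- lm.fit(X, e2)              # auxiliary regression of e^2 on the predictors
  r2_aux <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  statistic <- length(e2) * r2_aux  # LM statistic: n * R^2 of the auxiliary fit
  df <- ncol(X) - 1                 # one degree of freedom per predictor
  c(statistic = statistic, df = df,
    p.value = pchisq(statistic, df, lower.tail = FALSE))
}

# Simulated example (hypothetical data): the error spread grows with x.
set.seed(7)
x <- runif(500, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(500, sd = x)
result <- bp_koenker(lm(y ~ x))
print(result)
```

Applied to this report, the call would be bp_koenker(lm.heart2); a small p-value would support the heteroscedasticity seen in the plots.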

Cook’s Distance

Based on the Cook’s Distance plot above, observations 23 and 210 are noted as unduly influential.

Confirm Observations With Cook’s Distance Greater Than 0.015

influential.id = which(cooks.distance(lm.heart2) > 0.015)
heart[influential.id, ]
##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

Refitted Model

lm.heart2 = lm(Cholesterol~., data = heart[-influential.id, ])

Exercise 2.B

Comment on significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multi-collinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

R Output (F-Test, T-test, Estimated Regression Line, R-Squared)

summary(lm.heart2)
## 
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-influential.id, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
  • Model Significance: The model p-value (< 2.2e-16) is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining Cholesterol (at least one coefficient is not equal to 0).

  • Individual Term Significance: The T-test on the terms Diastolic, Systolic, and Weight yields the following results.

    • Diastolic: P-value of 0.0195 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Diastolic and the behavior of Cholesterol.

    • Systolic: P-value of 2.2e-06 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Systolic and the behavior of Cholesterol.

    • Weight: P-value of 0.1994 is above the significance level of 0.05; therefore, we fail to reject the null hypothesis. There is no evidence of a significant linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model.

  • Estimated Regression Line:

    • Predicted Cholesterol = 156.32618 + 0.03671(Weight) + 0.24922(Diastolic) + 0.30073(Systolic). A one-unit increase in Diastolic blood pressure is associated with a 0.24922-unit increase in Cholesterol, holding Weight and Systolic blood pressure constant. Each additional unit of Systolic blood pressure is associated with a 0.30073-unit increase in Cholesterol, holding Weight and Diastolic blood pressure constant. The Weight coefficient (0.03671) is not statistically significant.

  • R-Squared: Only about 3.8% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). We would not recommend this model for the prediction of Cholesterol level to the medical director.

Variance Inflation Factors (VIF) to Check Multicollinearity

vif(lm.heart2)
##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207
pairs(heart)

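  • VIF: The VIFs for Weight, Diastolic, and Systolic are all well below the cutoff of 10, so multicollinearity is not a serious concern for this model.

  • Relationships: The scatterplot matrix shows a strong linear relationship between Diastolic and Systolic. This is why their VIFs (about 2.5) are higher than the VIF for Weight (1.12), but the overlap is not large enough to exceed the cutoff.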
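A VIF has a direct reading: for predictor j, VIF = 1 / (1 - R²), where R² comes from regressing predictor j on the other predictors. The sketch below inverts that formula for the VIFs reported above; for Diastolic it implies that roughly 61% of its variation is shared with Weight and Systolic. The helper vif_manual is hypothetical and is included only to show how the quantity can be computed in base R.

```r
# Hypothetical helper: VIF of each predictor from its regression on the others.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# VIFs copied from the vif(lm.heart2) output above.
reported_vif <- c(Weight = 1.120631, Diastolic = 2.558914, Systolic = 2.454207)

# Share of each predictor's variation explained by the other two predictors.
implied_r2 <- 1 - 1 / reported_vif
print(round(implied_r2, 3))
```

With the report's data the helper would be called as vif_manual(heart[-influential.id, ], c("Weight", "Diastolic", "Systolic")). A cutoff of 10 corresponds to an R² of 0.90 among the predictors.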


Exercise 3

Now consider step-wise model selection for the Cholesterol model. We remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.

Exercise 3.A

Perform step-wise model selection with .05 criteria and address any issues in diagnostics plots.

model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)

model.stepwise
## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------
plot(model.stepwise)

  • Results: Based on the results of the step-wise selection process, Systolic and Diastolic will be included in the final model.
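To make the selection logic concrete, the sketch below implements only the forward (entry) half of a p-value based procedure in base R. It is not the olsrr implementation: the function forward_select_p is hypothetical, it assumes numeric predictors, it never removes a term once added, and the example data are simulated.

```r
# Hypothetical helper: forward selection by p-value (entry step only).
forward_select_p <- function(data, response, p_enter = 0.05) {
  selected <- character(0)
  candidates <- setdiff(names(data), response)
  while (length(candidates) > 0) {
    # p-value of each candidate when added to the currently selected terms.
    p_values <- sapply(candidates, function(term) {
      fit <- lm(reformulate(c(selected, term), response = response), data = data)
      summary(fit)$coefficients[term, "Pr(>|t|)"]
    })
    # Stop when no remaining candidate meets the entry criterion.
    if (min(p_values) > p_enter) break
    best <- names(which.min(p_values))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

# Simulated example: y depends strongly on x1, moderately on x2, not on x3.
set.seed(11)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
sim$y <- 3 * sim$x1 + 1 * sim$x2 + rnorm(200)
selected <- forward_select_p(sim, response = "y")
print(selected)
```

A full step-wise procedure would also re-test the terms already in the model after each addition and drop any whose p-value rises above the removal threshold.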

Model Without Weight

lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heart[-influential.id, ])

Diagnostics

par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))

  • Normality Check: In the Normal Q-Q Plot, most points fall along the line, but the points at the extremes curve away from it, so an assumption of normality is not reasonable. The sqrt(Standardized Residuals) Plot is primarily a check of constant variance rather than normality; the considerable number of observations above 1.5 on its Y-axis (absolute standardized residuals above 2.25) is consistent with heavy tails.

  • Equal Variance Check: Looking at the Residuals vs Fitted and sqrt(Standardized Residuals) plots, we see a pattern rather than an even band of points. This supports heteroscedasticity.

Exercise 3.B

Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.

summary(lm.step)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-influential.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16
  • Final Model: Step-wise selection added only Systolic and Diastolic (see the “Added/Removed” column), so the final model is Cholesterol ~ Systolic + Diastolic.

  • Model Significance: The model p-value (< 2.2e-16) is below the 0.05 significance level; therefore, we reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining Cholesterol (at least one coefficient is not equal to 0).

  • Individual Term Significance: The t-test on the terms Diastolic and Systolic yields the following results.

    • Diastolic: P-value of 0.00803 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Diastolic and the behavior of Cholesterol.

    • Systolic: P-value of 1.95e-06 is below the significance level of 0.05; therefore, we reject the null and conclude that there is a significant linear relationship between Systolic and the behavior of Cholesterol.

  • Estimated Regression Line:

    • Predicted Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic). A one-unit increase in Diastolic blood pressure is associated with a 0.2770-unit increase in Cholesterol, holding Systolic blood pressure constant. Each additional unit of Systolic blood pressure is associated with a 0.3022-unit increase in Cholesterol, holding Diastolic blood pressure constant.

Variation Explained by the Model Comparison

  • Exercise 1: Only 0.4835% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).

  • Exercise 2: Only 3.8% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).

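  • Exercise 3: Only 3.7% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). This is nearly identical to the Exercise 2 model (3.8%), which shows that dropping Weight costs almost nothing, and it is higher than the Exercise 1 model (0.4835%), but all three models explain very little of the variation.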


Exercise 4

Now consider best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection.

model.best.subset = ols_step_best_subset(lm.heart2)

model.best.subset
##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Exercise 4.A

Find the best model based on adjusted-R square criteria and specify which predictors are selected.

  • Best Model: Based on the adjusted R-square criteria, the best model is Model 3 as it has the highest adjusted R-square of 0.0367.

  • Selected Predictors: Model 3 includes Weight, Diastolic, and Systolic predictors.
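The adjusted R-square values behind this choice can be reproduced from the R-squared values in the summary outputs, using adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). In the sketch below, n = 3132 is inferred from the 3128 residual degrees of freedom plus 4 coefficients in the three-predictor fit; the function name adjusted_r2 is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n inferred from the refitted three-predictor model: 3128 residual df + 4 coefficients.
n_obs <- 3132

# R-squared values copied from the summary outputs for the two candidate models.
adj_model2 <- adjusted_r2(0.03716, n_obs, p = 2)  # Diastolic + Systolic
adj_model3 <- adjusted_r2(0.03767, n_obs, p = 3)  # Weight + Diastolic + Systolic

print(round(c(model2 = adj_model2, model3 = adj_model3), 4))
print(adj_model3 - adj_model2)
```

The penalty for the extra predictor is small at this sample size, so even the slight R-squared gain from Weight is enough to leave Model 3 with the higher adjusted value, although the margin is very narrow.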

Exercise 4.B

Find the best model based on AIC criteria and specify which predictors are selected.

  • Best Model: Based on the AIC criteria, the best model is Model 2 as it has the lowest AIC of 32344.7321.

  • Selected Predictors: Model 2 includes Diastolic and Systolic predictors.

Exercise 4.C

Compare final models selected in a) and b). Also compare final models from best subset approach with the final model from step-wise selection.

Final Model Comparisons

  • a) Best Model (Adjusted R-Squared): Model with Predictors Weight, Diastolic, and Systolic

  • b) Best Model (AIC Criteria): Model with Diastolic and Systolic

Comparing the models selected in a) and b), both have overall p-values below the 0.05 significance level, so both are useful. Systolic and Diastolic are significant in both models, so there is a significant linear relationship between these predictors and Cholesterol. The two models differ only by Weight, which is not significant in model a) (p-value of 0.1994) and raises the adjusted R-square only from 0.0365 to 0.0367.
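Because model b) is nested inside model a), the two can also be compared directly with a partial F-test. The sketch below uses simulated data with hypothetical column names (s, d, w standing in for Systolic, Diastolic, and Weight) and shows that, when the models differ by one term, the partial F-test p-value equals the t-test p-value for that term in the larger model.

```r
# Simulated stand-in data; s, d, w are hypothetical predictors and w has no effect on y.
set.seed(5)
sim <- data.frame(s = rnorm(300), d = rnorm(300), w = rnorm(300))
sim$y <- 1 + 0.5 * sim$s + 0.3 * sim$d + rnorm(300)

reduced <- lm(y ~ s + d, data = sim)      # analogue of model b)
full <- lm(y ~ s + d + w, data = sim)     # analogue of model a)

# Partial F-test: does adding w significantly reduce the residual sum of squares?
comparison <- anova(reduced, full)
p_partial_f <- comparison[2, "Pr(>F)"]

# t-test p-value for w in the full model.
p_t_test <- summary(full)$coefficients["w", "Pr(>|t|)"]

print(c(partial_F = p_partial_f, t_test = p_t_test))
```

For this report the equivalent call would be anova(lm.step, lm.heart2), since both models were fit to the same rows; by the equivalence shown here, its p-value should match the 0.1994 reported for Weight.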

  • c) Step-wise Selection Best Model: Model with Diastolic and Systolic

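Step-wise selection and the AIC-based best subset approach both yielded a model with Diastolic and Systolic as predictors; only the adjusted R-square criterion favored adding Weight, and only marginally. Therefore, our final selected model is Predicted Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic).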