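library(readr); library(olsrr); library(car)  # read_csv(); ols_step_*(); vif() (car is assumed to be the source of vif())
heart <- read_csv("./data/heart.csv", show_col_types = FALSE)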

Exercise 1

The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.

a)

Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.

lm.heart = lm(Cholesterol ~ Weight, heart)
summary(lm.heart)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05
plot(heart$Weight, heart$Cholesterol, xlab = "Weight", ylab = "Cholesterol")
abline(lm.heart, col = "maroon")

par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4)) 

Observations: The Weight coefficient is statistically significant, with a small p-value of 9.78e-05.

The scatter plot of Cholesterol against Weight shows how little correlation actually exists between the two variables.

The diagnostic plots give a view of the fit of the model as a whole. Specifically, the Residuals vs Fitted plot and the Q-Q plot both show clear outliers, which suggests that the normality assumption is not fully met.

With a Cook’s distance cutoff of 0.015, the influential points are observations 23 and 210. Observation 3094 is a noticeable outlier but does not qualify as an influential point according to Cook’s distance.
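The exercise also asks for a refit once the influential points are removed. A minimal sketch of that step follows. It uses a small simulated data frame `toy` as a hypothetical stand-in for `heart` so that it runs on its own; the names `toy`, `fit_all`, `fit_clean`, and `inf_rows` are illustrative and not part of the original analysis.

```r
# Hypothetical stand-in data; the real analysis would use the `heart` tibble.
set.seed(1)
n <- 200
toy <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
toy$Cholesterol <- 205 + 0.1 * toy$Weight + rnorm(n, sd = 40)

# Plant one extreme observation so that at least one point is flagged.
toy$Weight[1] <- 300
toy$Cholesterol[1] <- 600

fit_all <- lm(Cholesterol ~ Weight, data = toy)

# Flag rows whose Cook's distance exceeds the 0.015 cutoff from the exercise.
cutoff <- 0.015
inf_rows <- which(cooks.distance(fit_all) > cutoff)

# Refit without the flagged rows. The length check guards the empty case,
# because toy[-integer(0), ] would return zero rows.
toy_clean <- if (length(inf_rows) > 0) toy[-inf_rows, ] else toy
fit_clean <- lm(Cholesterol ~ Weight, data = toy_clean)

# Compare coefficients and R-squared before and after removal.
rbind(
  all   = c(coef(fit_all),   r2 = summary(fit_all)$r.squared),
  clean = c(coef(fit_clean), r2 = summary(fit_clean)$r.squared)
)
```

Applied to the real data, the same pattern would use `lm.heart` and `heart` in place of `fit_all` and `toy`.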

b)

Comment on the significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol levels.

summary(lm.heart)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05
cor(heart$Weight, heart$Cholesterol)
## [1] 0.0695377

Observations:

Estimated Regression Line: The fitted model is \(\hat{y}\) = 205.86763 + 0.10867X, where X is Weight. This means that for every one-unit increase in Weight, the predicted Cholesterol increases by 0.10867 on average.

Model Significance: The F-statistic is 15.22 with a p-value of 9.778e-05, which is below the 0.05 significance level. Therefore we can conclude that the model is useful (meaning at least one beta is not equal to 0).
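The decision in the overall F-test rests on the p-value, not on the size of the F-statistic relative to it. As a rough check, the p-value can be recovered from the statistic and its degrees of freedom with base R's `pf()`; the variable names below are illustrative and the numbers are copied from the `summary()` output.

```r
# Values copied from the summary() output above.
f_stat <- 15.22
df1 <- 1
df2 <- 3132

# The upper-tail probability of the F distribution is the p-value of the overall F-test.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value
p_value < 0.05

# With a single predictor, the F-statistic is the square of the slope's t-statistic
# (up to rounding of the printed values).
t_stat <- 3.901
t_stat^2
```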

Individual Term Significance: To assess the individual term significance, we look at the p-value of the t-test for Weight and see that the value is 9.78e-05, which is below the 0.05 significance level. Therefore, we can reject the null hypothesis and conclude that there is a linear relationship between Weight and Cholesterol.

R-Square: Only 0.4835% of the variation in Cholesterol (Y) can be explained by Weight (X).

Additionally, the correlation between Weight and Cholesterol is approximately 0.0695, which is very low overall. Therefore, with such low predictive power, this is not a good model for prediction.
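The correlation and the R-Square are two views of the same quantity here: in a simple linear regression with one predictor, the R-Square equals the squared correlation between X and Y. A short check with the values reported above (variable names are illustrative):

```r
# Values copied from the cor() and summary() output above.
r <- 0.0695377
r_squared_reported <- 0.004835

# In simple linear regression, R-squared is the squared correlation.
r^2

# Expressed as a percentage of variation explained.
100 * r^2
```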

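Overall, although the Model Significance and Individual Term Significance tests show a statistically significant linear relationship between Weight and Cholesterol, the model explains less than 0.5% of the variation in Cholesterol, so we cannot recommend it to the medical director as a predictor of Cholesterol levels.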

Exercise 2

The medical director wants to know if blood pressures and weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of diastolic, systolic, and weight.

a)

Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. For Cook’s distances, do not leave any points that have Cook’s distance greater than 0.015.

lm.heart2 <- lm(Cholesterol~., data=heart)
lm.heart2
## 
## Call:
## lm(formula = Cholesterol ~ ., data = heart)
## 
## Coefficients:
## (Intercept)       Weight    Diastolic     Systolic  
##   157.88394      0.02146      0.25983      0.30106
par(mfrow=c(2,2))
plot(lm.heart2,which=c(1:4))

par(mfrow=c(2,2))
inf.id=which(cooks.distance(lm.heart2)>0.015)
heart[inf.id, ]
## # A tibble: 2 x 4
##   Weight Diastolic Systolic Cholesterol
##    <dbl>     <dbl>    <dbl>       <dbl>
## 1     90        82      130         550
## 2    100        82      130         500
lm.heart2=lm(Cholesterol~., data=heart[-inf.id, ])

plot(lm.heart2,which=c(1:4))

Observations: Looking at the Q-Q plot, we notice that a majority of the points follow the theoretical line for normality; however, as the quantiles get larger, the points move away from the line and the normality assumption becomes less valid. After removing all points with a Cook’s distance greater than 0.015 (the two observations listed above, with Cholesterol values of 550 and 500), only a couple of outliers remain.

b)

Comment on the significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol levels.

summary(lm.heart2)
## 
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-inf.id, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
vif(lm.heart2)
##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207

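Observation: The F-statistic has a p-value below our significance level of 0.05, allowing us to reject the null hypothesis and state that the linear model is significant. However, when we check the p-values for each term in the model, Weight has a p-value (0.1994) greater than our significance level, so we fail to reject the null hypothesis: there is no evidence of a linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model. For the other two terms, Diastolic (p = 0.0195) and Systolic (p = 2.2e-06), we can state that there is a significant linear relationship with Cholesterol. Holding the other predictors fixed, a one-unit increase in Diastolic is associated with an average increase of 0.249 in Cholesterol, and a one-unit increase in Systolic with an average increase of 0.301.

Multicollinearity: The VIF values (1.12 for Weight, 2.56 for Diastolic, 2.45 for Systolic) are all well below the commonly used threshold of 5, so multicollinearity is not a concern among these predictors.

With our \(R^{2}\) value showing that only 3.767% of the variation in Cholesterol can be explained by the model, and only two of our three terms having significant linear relationships, I would advise the medical director that this is not a good model to use for cholesterol level predictions.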
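The `vif()` output can be read more directly by inverting the definition VIF = 1 / (1 - R_j^2), where R_j^2 is the R-Square from regressing predictor j on the other predictors. The sketch below applies that identity to the reported values; the threshold of 5 is a common rule of thumb and an assumption here, not something stated in the exercise.

```r
# VIF values copied from the vif() output above.
vif_values <- c(Weight = 1.120631, Diastolic = 2.558914, Systolic = 2.454207)

# Share of each predictor's variation explained by the other two predictors.
r2_j <- 1 - 1 / vif_values
round(r2_j, 3)

# Rule-of-thumb check (threshold of 5 is an assumed convention).
any(vif_values > 5)
```

Diastolic and Systolic each have a sizeable share of their variation explained by the other predictors, which is expected for two blood pressure measures, while Weight is nearly unrelated to them.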

Exercise 3

Now consider stepwise model selection for the Cholesterol model. Before performing the model selection, we remove the influential points detected in Exercise 2, which have a Cook’s distance larger than 0.015.

a)

Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

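# Backward elimination at the 0.05 level (olsrr's ols_step_both_p() is the stepwise variant).
model.stepwise <- ols_step_backward_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)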
model.stepwise
## 
## 
##                             Elimination Summary                             
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ---------------------------------------------------------------------------
##    1    Weight        0.0372      0.0365    3.6475    32344.7321    42.2606    
## ---------------------------------------------------------------------------
plot(model.stepwise)
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

b)

Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.

lm.step=lm(Cholesterol ~ Systolic + Diastolic, data=heart)
summary(lm.step)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.52  -29.58   -4.57   23.79  328.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

Observation:

Estimated Regression Line: The fitted model is \(\hat{y}\) = 159.64 + 0.302 Systolic + 0.276 Diastolic, which means that for every one-unit increase in Systolic, the predicted Cholesterol increases by 0.302 on average, holding Diastolic fixed. Similarly, for every one-unit increase in Diastolic, the predicted Cholesterol increases by 0.276, holding Systolic fixed.

Model Significance: The F-statistic is 58.27 with a p-value below 2.2e-16, which is far below the 0.05 significance level. Therefore we can conclude that the model is useful (meaning at least one beta is not equal to 0).

Individual Term Significance: To assess the individual term significance, we look at the p-values of the t-tests, which are 2.89e-06 for Systolic and 0.00932 for Diastolic. Since both are below the 0.05 significance level, we reject the null hypothesis for each term and conclude that both Systolic and Diastolic have a linear relationship with Cholesterol.

R-Square: 3.589% of the Cholesterol variation can be explained by Systolic and Diastolic. This is low predictive power, so it is not a good model for prediction. Note that this fit uses the full heart data; on the data with the influential points removed, the elimination summary above reports an R-Square of 0.0372 for the same two predictors.

Compare the variations explained by the models from Exercise 1 and 2.

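Exercise 1: 0.4835% of the variation in Cholesterol (Y) can be explained by Weight (X). This is low predictive power, so it is not a good model.

Exercise 2: 3.767% of the variation in Cholesterol can be explained by the model. This is also low predictive power, so it is not a good model.

The final model (3.589%) explains far more variation than the Weight-only model from Exercise 1 and nearly as much as the three-predictor model from Exercise 2, which indicates that dropping Weight costs almost nothing.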
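Because these models use different numbers of predictors, the adjusted R-Square is the fairer basis for comparison. A small sketch of the standard formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), applied to the reported values; the helper `adj_r2` is a hypothetical name, and each n is inferred from the residual degrees of freedom in the corresponding `summary()` output.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual df + p + 1, taken from each summary() output.
ex1   <- adj_r2(r2 = 0.004835, n = 3132 + 2, p = 1)  # Weight only, full data
ex2   <- adj_r2(r2 = 0.03767,  n = 3128 + 4, p = 3)  # three predictors, influential points removed
final <- adj_r2(r2 = 0.03589,  n = 3131 + 3, p = 2)  # Systolic + Diastolic, full data

round(c(exercise1 = ex1, exercise2 = ex2, final = final), 5)
```

The results agree with the adjusted values printed by `summary()` up to rounding.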

Exercise 4

Now consider the best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have a Cook’s distance larger than 0.015, before performing the model selection.

a) Find the best model based on adjusted-R square criteria and specify which predictors are selected.

best.model = ols_step_best_subset(lm.heart2)
best.model
##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Observation: According to the adjusted R-squared criterion, Model 3 is the best model, with an adjusted R\(^2\) of 0.0367. The predictors for Model 3 are Systolic, Diastolic, and Weight.

b) Find the best model based on AIC criteria and specify which predictors are selected.

Observation: Based on the lowest AIC value, we can see that Model 2 is the best, with predictors Diastolic and Systolic.
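The two criteria point in opposite directions (larger adjusted R-Square is better, smaller AIC is better), which is easy to misread in a wide table. A minimal sketch of both selections, with the values typed in by hand from the Subsets Regression Summary; the data frame `subsets` is a hypothetical container, not an object returned by olsrr.

```r
# Values copied from the Subsets Regression Summary above.
subsets <- data.frame(
  model      = 1:3,
  predictors = c("Systolic", "Diastolic Systolic", "Weight Diastolic Systolic"),
  adj_r2     = c(0.0347, 0.0365, 0.0367),
  aic        = c(32349.7666, 32344.7321, 32345.0829)
)

# Adjusted R-squared: larger is better.
best_adj_r2 <- subsets$model[which.max(subsets$adj_r2)]

# AIC: smaller is better.
best_aic <- subsets$model[which.min(subsets$aic)]

c(adj_r2 = best_adj_r2, aic = best_aic)
```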

c) Compare final models selected in a) and b). Also, compare final models from the best subset approach with the final model from the stepwise selection.

lm.4a=lm(Cholesterol ~ Systolic + Diastolic + Weight, data=heart)
summary(lm.4a)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
lm.4b=lm(Cholesterol ~ Systolic + Diastolic, data=heart)
summary(lm.4b)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.52  -29.58   -4.57   23.79  328.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

Observation: Final Models,

  a) Systolic, Diastolic, and Weight

  b) Diastolic and Systolic

  Stepwise selection) Systolic and Diastolic

Both models selected in a) and b) have overall F-test p-values below our significance level, so each model as a whole is significant. In the model from a), however, Weight is not individually significant (p = 0.4597), and it raises the adjusted R\(^2\) only marginally in the best subset table (0.0367 versus 0.0365). The model from b) contains the same predictors as the model from the stepwise selection, Systolic and Diastolic, and both are individually significant. Since the AIC-based best subset and the stepwise selection agree, we conclude that our final model equation is \(\hat{y}\) = 159.64 + 0.302 Systolic + 0.276 Diastolic.
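To make the final equation concrete, the sketch below evaluates it at the blood pressures shared by the two influential observations listed in Exercise 2 (Systolic 130, Diastolic 82). The helper `predict_chol` is a hypothetical name, and it uses the rounded coefficients from the equation rather than the fitted `lm` object.

```r
# Rounded coefficients from the final model equation above.
predict_chol <- function(systolic, diastolic) {
  159.64 + 0.302 * systolic + 0.276 * diastolic
}

# Blood pressures shared by the two influential rows identified in Exercise 2.
fitted_value <- predict_chol(systolic = 130, diastolic = 82)
fitted_value

# Residuals for their observed Cholesterol values of 550 and 500.
c(550, 500) - fitted_value
```

The predicted value is far below both observed values, which is consistent with those two rows being flagged as influential. The larger residual also agrees with the maximum residual reported in the `summary(lm.step)` output, which was fit on the full data.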