Due: Saturday, Nov 4 by 11:59pm.

You need to download heart.csv.

Below is a brief summary of variables in hear.csv.

  • Weight: subject’s weight.
  • Systolic: top number in a blood pressure reading, indicating the blood pressure level when the heart contracts.
  • Diastolic: bottom number in a blood pressure reading, indicating the blood pressure level when the heart is at rest or between beats.
  • Cholesterol: measured cholesterol.

  • The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure, and Cholesterol for alive subjects in the heart.csv.

    library(ggplot2)
    ## Warning: package 'ggplot2' was built under R version 4.3.2
    library(stats)
    library(car)
    ## Loading required package: carData
    library(tidyverse)
    ## Warning: package 'dplyr' was built under R version 4.3.2
    ## Warning: package 'stringr' was built under R version 4.3.2
    ## Warning: package 'lubridate' was built under R version 4.3.2
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## ✔ dplyr     1.1.4     ✔ readr     2.1.4
    ## ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
    ## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    ## ✖ dplyr::recode() masks car::recode()
    ## ✖ purrr::some()   masks car::some()
    ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    library(olsrr)
    ## Warning: package 'olsrr' was built under R version 4.3.2
    ## 
    ## Attaching package: 'olsrr'
    ## 
    ## The following object is masked from 'package:datasets':
    ## 
    ##     rivers
    .bordered{
      border-style: solid;
      border-color: teal;
      padding: 5px;
      background-color: #DCDCDC;
    }
    heart <- read.csv("heart.csv")

    Exercise 1

    The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.

    (A) Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut-off to be 0.015.

    • Since a few observations with of Cook’s distance greater than 0.015 are detected, we refit the model without them. Specifically they are observations 23 and 210.


    lm_heart = lm(Cholesterol ~ Weight, data = heart) 
    
    summary(lm_heart)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Weight, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -111.95  -29.59   -4.64   23.49  334.35 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
    ## Weight        0.10867    0.02786   3.901 9.78e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 43.62 on 3132 degrees of freedom
    ## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
    ## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05
    cor(heart$Weight, heart$Cholesterol, method="pearson")
    ## [1] 0.0695377
    cor(heart$Weight, heart$Cholesterol, method="spearman")
    ## [1] 0.1078544
    par(mfrow=c(2,2)) 
    plot(lm_heart, which=c(1:4))

    (B) Comment on the significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for predicting Cholesterol levels.

    • The final model is significant with very small p-value from F-test. In other words, the coefficient of weight is non-zero. The slight non-normal behavior is observed in QQ-plot but there are no obvious remaining diagnostic. No more observations with Cook’s D larger than 0.015. The model tells us that an increase of weight is associated with an increase of cholesterol. On average, the cholesterol is predicted to have an increase of 0.122 units when weight increases by 1 unit. Only around 0.63% of the variation of cholesterol is explained by weight, implying that it is not be a good-fit model for cholesterol.


    CD_lm_heart <- cooks.distance(lm_heart) 
    CD_lm_heart <- round(CD_lm_heart, 5) 
    CD_lm_heart <- sort(CD_lm_heart>0.015) 
    CD_lm_heart <- rev(sort(CD_lm_heart)) 
    head(CD_lm_heart, 5)
    ##   210    23  3134  3133  3132 
    ##  TRUE  TRUE FALSE FALSE FALSE
    CD_cleaned <- which(cooks.distance(lm_heart)>0.015) 
    CD_cleaned <- lm(Cholesterol ~ Weight, data=heart[-CD_cleaned,]) 
    summary(CD_cleaned)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Weight, data = heart[-CD_cleaned, 
    ##     ])
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -112.369  -29.395   -4.482   23.672  209.348 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 203.57605    4.18543  48.639  < 2e-16 ***
    ## Weight        0.12264    0.02745   4.469 8.16e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.92 on 3130 degrees of freedom
    ## Multiple R-squared:  0.006339,   Adjusted R-squared:  0.006022 
    ## F-statistic: 19.97 on 1 and 3130 DF,  p-value: 8.155e-06
    par(mfrow=c(2,2)) 
    plot(CD_cleaned, which=c(1:4))

    Exercise 2

    The medical director wants to know if blood pressure and weight can better predict cholesterol outcomes. Consider modeling cholesterol as a function of diastolic, systolic, and weight.

    (A) Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. For Cook’s distances, do not leave any points that have Cook’s distance greater than 0.015.

    • We again observe two observations (same data points in Exercise 1) with cook’s D greater than 0.015. The model is refitted without them. In terms of diagnostics plots, QQ-plot does not show a perfect line but it does not look like serious abnormality. For remedy of skewness, we can try transformation of Y, although it is not addressed here.


    heart_2 = lm(Cholesterol~ Diastolic+Systolic+Weight,, data = heart) 
    summary(heart_2)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -110.27  -29.58   -4.56   23.66  329.74 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
    ## Diastolic     0.25983    0.10838   2.397   0.0166 *  
    ## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
    ## Weight        0.02146    0.02903   0.739   0.4597    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.95 on 3130 degrees of freedom
    ## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
    ## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
    vif(heart_2)
    ## Diastolic  Systolic    Weight 
    ##  2.558682  2.454214  1.120375
    par(mfrow=c(2,2)) 
    plot(heart_2, which=c(1:4))

    CD_2 <- which(cooks.distance(heart_2) > 0.015) 
    heart[CD_2, ]
    ##     Weight Diastolic Systolic Cholesterol
    ## 23      90        82      130         550
    ## 210    100        82      130         500
    lm_heart_2=lm(Cholesterol~., data=heart[-CD_2, ])
    
    par(mfrow=c(2,2)) 
    plot(lm_heart_2, which=c(1:4))

    (B) Comment on the significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issues among predictors. Explain to the medical director whether this is a good model for predicting Cholesterol levels.

    • The weight is not significant but diastolic and systolic are both significant with p-values less than 0.05 in t-test. If the bottom number in the blood pressure (Diastolic) and weight are fixed, the increase of 1 unit in top number of blood pressure (Systolic ) would predict an increase of 0.30 units of cholesterol on average. Similarly, if the top number (Systolic) and weight are unchanged, 1 unit increase of bottom number (Diastolic) predicts an increase of 0.25 units in cholesterol on average. There is no multicollinearity issue among the predictors since the VIFs are all less than 10 and the scatterplot matrix does not indicate very strongly correlated predictors.

    • The R-square is 3.77%, meaning that only 3.77% variance of cholesterol is explained, which is pretty low. So it is not a good model for predicting cholesterol levels.


    summary(lm_heart_2)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ ., data = heart[-CD_2, ])
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -110.617  -29.371   -4.476   23.755  216.041 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
    ## Weight        0.03671    0.02860   1.284   0.1994    
    ## Diastolic     0.24922    0.10665   2.337   0.0195 *  
    ## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.26 on 3128 degrees of freedom
    ## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
    ## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
    vif(lm_heart_2)
    ##    Weight Diastolic  Systolic 
    ##  1.120631  2.558914  2.454207
    pairs(heart[,c(1:3)], pch = 19)

    Exercise 3

    Now consider stepwise model selection for the Cholesterol model. Before performing the model selection, we remove influential points detected in Exercise 2, which have a cook’s distance larger than 0.015.

    (A) Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

    • The stepwise selection is performed and the final model includes Systolic and Diastolic. Since all the observations are of Cook’s distance less than 0.015, there is not any highly influential point. Similarly, a bit skewed histogram and non-linear QQ plot are detected, but they don’t seem to be serious.


    stepwise_heart = ols_step_both_p(lm_heart_2, pent = 0.05, prem = 0.05, details = FALSE)
    
    stepwise_heart
    ## 
    ##                                Stepwise Selection Summary                                
    ## ----------------------------------------------------------------------------------------
    ##                       Added/                   Adj.                                         
    ## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
    ## ----------------------------------------------------------------------------------------
    ##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
    ##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
    ## ----------------------------------------------------------------------------------------
    plot(stepwise_heart)

    lm_stepwise_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart) 
    lm_stepwise_heart_cd <- which(cooks.distance(heart_2) > 0.015) 
    lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ]) 
    par(mfrow=c(2,2)) 
    plot(lm_step_heart, which=c(1:4))

    (B) Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercises 1 and 2.

    • An increase of 1 unit in Systopic predicts an increase of 0.3 units of cholesterol on average, when Diastolic is fixed. An increase of 1 unit in Diastolic predicts an increase of 0.28 units of cholesterol on average, when Systolic is fixed. The R-square is 3.50%, meaning that only 3.50% of the variation of the cholesterol variable is explained by this model, which is pretty low. It is not a good-fit model for cholesterol. This model’s percentage of variation explained is much larger than that of Exercise 1, but smaller than the model in Exercise 2 with less number of predictors.


    lm_step_heart <- lm(Cholesterol~Systolic+Diastolic, data=heart) 
    
    summary(lm_step_heart)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -109.52  -29.58   -4.57   23.79  328.47 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
    ## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
    ## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.94 on 3131 degrees of freedom
    ## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
    ## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

    Exercise 4

    Now consider the best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which has a cook’s distance larger than 0.015, before performing the model selection.

    (A) Find the best model based on adjusted-R square criteria and specify which predictors are selected.

    • Based on adjusted-R square criteria, model 3 is the final model with the largest measure of 0.0367. It is the model with all three variables, Weight, Diastolic, and Systolic.


    best_heart = ols_step_best_subset(lm_heart_2) 
    best_heart
    ##         Best Subsets Regression         
    ## ----------------------------------------
    ## Model Index    Predictors
    ## ----------------------------------------
    ##      1         Systolic                  
    ##      2         Diastolic Systolic        
    ##      3         Weight Diastolic Systolic 
    ## ----------------------------------------
    ## 
    ##                                                           Subsets Regression Summary                                                          
    ## ----------------------------------------------------------------------------------------------------------------------------------------------
    ##                        Adj.        Pred                                                                                                        
    ## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
    ## ----------------------------------------------------------------------------------------------------------------------------------------------
    ##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
    ##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
    ##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
    ## ----------------------------------------------------------------------------------------------------------------------------------------------
    ## AIC: Akaike Information Criteria 
    ##  SBIC: Sawa's Bayesian Information Criteria 
    ##  SBC: Schwarz Bayesian Criteria 
    ##  MSEP: Estimated error of prediction, assuming multivariate normality 
    ##  FPE: Final Prediction Error 
    ##  HSP: Hocking's Sp 
    ##  APC: Amemiya Prediction Criteria

    (B) Find the best model based on AIC criteria and specify which predictors are selected.

    • Based on AIC criteria, model 2 is the final model with the smallest AIC of 32344.73. It is the model with two variables, Diastolic and Systolic.


    (C) Compare the final models selected in a) and b). Also, compare the final models from the best subset approach with the final model from the stepwise selection.

    • Final models from adjusted-R square and AIC criterion are different. The model from adjusted-R square criteria is same for the final model based on stepwise selection.


    heart_4A=lm(Cholesterol ~ Systolic + Diastolic + Weight, data=heart) 
    summary(heart_4A)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -110.27  -29.58   -4.56   23.66  329.74 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
    ## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
    ## Diastolic     0.25983    0.10838   2.397   0.0166 *  
    ## Weight        0.02146    0.02903   0.739   0.4597    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.95 on 3130 degrees of freedom
    ## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
    ## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
    heart_4B=lm(Cholesterol ~ Systolic + Diastolic, data=heart)
    summary(heart_4B)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -109.52  -29.58   -4.57   23.79  328.47 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
    ## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
    ## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.94 on 3131 degrees of freedom
    ## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
    ## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16
    heart_4sys=lm(Cholesterol ~ Systolic, data=heart) 
    summary(heart_4sys)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Systolic, data = heart)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -107.80  -29.84   -4.36   23.60  328.40 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 165.57934    5.45892   30.33   <2e-16 ***
    ## Systolic      0.43092    0.04117   10.47   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.98 on 3132 degrees of freedom
    ## Multiple R-squared:  0.0338, Adjusted R-squared:  0.0335 
    ## F-statistic: 109.6 on 1 and 3132 DF,  p-value: < 2.2e-16
    lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ]) 
    summary(lm_step_heart)
    ## 
    ## Call:
    ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, 
    ##     ])
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -109.332  -29.399   -4.433   23.922  217.241 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
    ## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
    ## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 42.26 on 3129 degrees of freedom
    ## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
    ## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16