Due: Saturday, Nov 4 by 11:59pm.

You need to download heart.csv.

Below is a brief summary of variables in hear.csv.

Weight: subject’s weight.

Systolic: top number in a blood pressure reading, indicating the blood pressure level when the heart contracts.

Diastolic: bottom number in a blood pressure reading, indicating the blood pressure level when the heart is at rest or between beats.

Cholesterol: measured cholesterol.

The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure, and Cholesterol for alive subjects in the heart.csv.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

library(stats)
library(car)

## Loading required package: carData

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'stringr' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(olsrr)

## Warning: package 'olsrr' was built under R version 4.3.2

## 
## Attaching package: 'olsrr'
## 
## The following object is masked from 'package:datasets':
## 
##     rivers

.bordered{
  border-style: solid;
  border-color: teal;
  padding: 5px;
  background-color: #DCDCDC;
}

heart <- read.csv("heart.csv")

Exercise 1

The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.

(A) Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut-off to be 0.015.

Since a few observations with of Cook’s distance greater than 0.015 are detected, we refit the model without them. Specifically they are observations 23 and 210.

lm_heart = lm(Cholesterol ~ Weight, data = heart) 

summary(lm_heart)

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

cor(heart$Weight, heart$Cholesterol, method="pearson")

## [1] 0.0695377

cor(heart$Weight, heart$Cholesterol, method="spearman")

## [1] 0.1078544

par(mfrow=c(2,2)) 
plot(lm_heart, which=c(1:4))

(B) Comment on the significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for predicting Cholesterol levels.

The final model is significant with very small p-value from F-test. In other words, the coefficient of weight is non-zero. The slight non-normal behavior is observed in QQ-plot but there are no obvious remaining diagnostic. No more observations with Cook’s D larger than 0.015. The model tells us that an increase of weight is associated with an increase of cholesterol. On average, the cholesterol is predicted to have an increase of 0.122 units when weight increases by 1 unit. Only around 0.63% of the variation of cholesterol is explained by weight, implying that it is not be a good-fit model for cholesterol.

CD_lm_heart <- cooks.distance(lm_heart) 
CD_lm_heart <- round(CD_lm_heart, 5) 
CD_lm_heart <- sort(CD_lm_heart>0.015) 
CD_lm_heart <- rev(sort(CD_lm_heart)) 
head(CD_lm_heart, 5)

##   210    23  3134  3133  3132 
##  TRUE  TRUE FALSE FALSE FALSE

CD_cleaned <- which(cooks.distance(lm_heart)>0.015) 
CD_cleaned <- lm(Cholesterol ~ Weight, data=heart[-CD_cleaned,]) 
summary(CD_cleaned)

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-CD_cleaned, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.369  -29.395   -4.482   23.672  209.348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 203.57605    4.18543  48.639  < 2e-16 ***
## Weight        0.12264    0.02745   4.469 8.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared:  0.006339,   Adjusted R-squared:  0.006022 
## F-statistic: 19.97 on 1 and 3130 DF,  p-value: 8.155e-06

par(mfrow=c(2,2)) 
plot(CD_cleaned, which=c(1:4))

Exercise 2

The medical director wants to know if blood pressure and weight can better predict cholesterol outcomes. Consider modeling cholesterol as a function of diastolic, systolic, and weight.

(A) Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. For Cook’s distances, do not leave any points that have Cook’s distance greater than 0.015.

We again observe two observations (same data points in Exercise 1) with cook’s D greater than 0.015. The model is refitted without them. In terms of diagnostics plots, QQ-plot does not show a perfect line but it does not look like serious abnormality. For remedy of skewness, we can try transformation of Y, although it is not addressed here.

heart_2 = lm(Cholesterol~ Diastolic+Systolic+Weight,, data = heart) summary(heart_2)

## ## Call: ## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = heart) ## ## Residuals: ## Min 1Q Median 3Q Max ## -110.27 -29.58 -4.56 23.66 329.74 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 157.88394 6.37201 24.778 < 2e-16 *** ## Diastolic 0.25983 0.10838 2.397 0.0166 * ## Systolic 0.30106 0.06443 4.672 3.1e-06 *** ## Weight 0.02146 0.02903 0.739 0.4597 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.95 on 3130 degrees of freedom ## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513 ## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16

vif(heart_2)

## Diastolic Systolic Weight ## 2.558682 2.454214 1.120375

par(mfrow=c(2,2)) plot(heart_2, which=c(1:4))

CD_2 <- which(cooks.distance(heart_2) > 0.015) heart[CD_2, ]

## Weight Diastolic Systolic Cholesterol ## 23 90 82 130 550 ## 210 100 82 130 500

lm_heart_2=lm(Cholesterol~., data=heart[-CD_2, ]) par(mfrow=c(2,2)) plot(lm_heart_2, which=c(1:4))

(B) Comment on the significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issues among predictors. Explain to the medical director whether this is a good model for predicting Cholesterol levels.

The weight is not significant but diastolic and systolic are both significant with p-values less than 0.05 in t-test. If the bottom number in the blood pressure (Diastolic) and weight are fixed, the increase of 1 unit in top number of blood pressure (Systolic ) would predict an increase of 0.30 units of cholesterol on average. Similarly, if the top number (Systolic) and weight are unchanged, 1 unit increase of bottom number (Diastolic) predicts an increase of 0.25 units in cholesterol on average. There is no multicollinearity issue among the predictors since the VIFs are all less than 10 and the scatterplot matrix does not indicate very strongly correlated predictors.

The R-square is 3.77%, meaning that only 3.77% variance of cholesterol is explained, which is pretty low. So it is not a good model for predicting cholesterol levels.

summary(lm_heart_2)

## ## Call: ## lm(formula = Cholesterol ~ ., data = heart[-CD_2, ]) ## ## Residuals: ## Min 1Q Median 3Q Max ## -110.617 -29.371 -4.476 23.755 216.041 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 156.32618 6.27153 24.926 < 2e-16 *** ## Weight 0.03671 0.02860 1.284 0.1994 ## Diastolic 0.24922 0.10665 2.337 0.0195 * ## Systolic 0.30073 0.06340 4.743 2.2e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.26 on 3128 degrees of freedom ## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675 ## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16

vif(lm_heart_2)

## Weight Diastolic Systolic ## 1.120631 2.558914 2.454207

pairs(heart[,c(1:3)], pch = 19)

Exercise 3

Now consider stepwise model selection for the Cholesterol model. Before performing the model selection, we remove influential points detected in Exercise 2, which have a cook’s distance larger than 0.015.

(A) Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

The stepwise selection is performed and the final model includes Systolic and Diastolic. Since all the observations are of Cook’s distance less than 0.015, there is not any highly influential point. Similarly, a bit skewed histogram and non-linear QQ plot are detected, but they don’t seem to be serious.

stepwise_heart = ols_step_both_p(lm_heart_2, pent = 0.05, prem = 0.05, details = FALSE) stepwise_heart

## ## Stepwise Selection Summary ## ---------------------------------------------------------------------------------------- ## Added/ Adj. ## Step Variable Removed R-Square R-Square C(p) AIC RMSE ## ---------------------------------------------------------------------------------------- ## 1 Systolic addition 0.035 0.035 8.6850 32349.7666 42.3013 ## 2 Diastolic addition 0.037 0.037 3.6480 32344.7321 42.2606 ## ----------------------------------------------------------------------------------------

plot(stepwise_heart)

lm_stepwise_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart) lm_stepwise_heart_cd <- which(cooks.distance(heart_2) > 0.015) lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ]) par(mfrow=c(2,2)) plot(lm_step_heart, which=c(1:4))

(B) Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercises 1 and 2.

An increase of 1 unit in Systopic predicts an increase of 0.3 units of cholesterol on average, when Diastolic is fixed. An increase of 1 unit in Diastolic predicts an increase of 0.28 units of cholesterol on average, when Systolic is fixed. The R-square is 3.50%, meaning that only 3.50% of the variation of the cholesterol variable is explained by this model, which is pretty low. It is not a good-fit model for cholesterol. This model’s percentage of variation explained is much larger than that of Exercise 1, but smaller than the model in Exercise 2 with less number of predictors.

lm_step_heart <- lm(Cholesterol~Systolic+Diastolic, data=heart) summary(lm_step_heart)

## ## Call: ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart) ## ## Residuals: ## Min 1Q Median 3Q Max ## -109.52 -29.58 -4.57 23.79 328.47 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 159.63995 5.91244 27.001 < 2e-16 *** ## Systolic 0.30193 0.06442 4.687 2.89e-06 *** ## Diastolic 0.27609 0.10612 2.602 0.00932 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.94 on 3131 degrees of freedom ## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527 ## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16

Exercise 4

Now consider the best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which has a cook’s distance larger than 0.015, before performing the model selection.

(A) Find the best model based on adjusted-R square criteria and specify which predictors are selected.

Based on adjusted-R square criteria, model 3 is the final model with the largest measure of 0.0367. It is the model with all three variables, Weight, Diastolic, and Systolic.

best_heart = ols_step_best_subset(lm_heart_2) best_heart

## Best Subsets Regression ## ---------------------------------------- ## Model Index Predictors ## ---------------------------------------- ## 1 Systolic ## 2 Diastolic Systolic ## 3 Weight Diastolic Systolic ## ---------------------------------------- ## ## Subsets Regression Summary ## ---------------------------------------------------------------------------------------------------------------------------------------------- ## Adj. Pred ## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC ## ---------------------------------------------------------------------------------------------------------------------------------------------- ## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662 ## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647 ## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648 ## ---------------------------------------------------------------------------------------------------------------------------------------------- ## AIC: Akaike Information Criteria ## SBIC: Sawa's Bayesian Information Criteria ## SBC: Schwarz Bayesian Criteria ## MSEP: Estimated error of prediction, assuming multivariate normality ## FPE: Final Prediction Error ## HSP: Hocking's Sp ## APC: Amemiya Prediction Criteria

(B) Find the best model based on AIC criteria and specify which predictors are selected.

Based on AIC criteria, model 2 is the final model with the smallest AIC of 32344.73. It is the model with two variables, Diastolic and Systolic.

(C) Compare the final models selected in a) and b). Also, compare the final models from the best subset approach with the final model from the stepwise selection.

Final models from adjusted-R square and AIC criterion are different. The model from adjusted-R square criteria is same for the final model based on stepwise selection.

heart_4A=lm(Cholesterol ~ Systolic + Diastolic + Weight, data=heart) summary(heart_4A)

## ## Call: ## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart) ## ## Residuals: ## Min 1Q Median 3Q Max ## -110.27 -29.58 -4.56 23.66 329.74 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 157.88394 6.37201 24.778 < 2e-16 *** ## Systolic 0.30106 0.06443 4.672 3.1e-06 *** ## Diastolic 0.25983 0.10838 2.397 0.0166 * ## Weight 0.02146 0.02903 0.739 0.4597 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.95 on 3130 degrees of freedom ## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513 ## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16

heart_4B=lm(Cholesterol ~ Systolic + Diastolic, data=heart) summary(heart_4B)

## ## Call: ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart) ## ## Residuals: ## Min 1Q Median 3Q Max ## -109.52 -29.58 -4.57 23.79 328.47 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 159.63995 5.91244 27.001 < 2e-16 *** ## Systolic 0.30193 0.06442 4.687 2.89e-06 *** ## Diastolic 0.27609 0.10612 2.602 0.00932 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.94 on 3131 degrees of freedom ## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527 ## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16

heart_4sys=lm(Cholesterol ~ Systolic, data=heart) summary(heart_4sys)

## ## Call: ## lm(formula = Cholesterol ~ Systolic, data = heart) ## ## Residuals: ## Min 1Q Median 3Q Max ## -107.80 -29.84 -4.36 23.60 328.40 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 165.57934 5.45892 30.33 <2e-16 *** ## Systolic 0.43092 0.04117 10.47 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.98 on 3132 degrees of freedom ## Multiple R-squared: 0.0338, Adjusted R-squared: 0.0335 ## F-statistic: 109.6 on 1 and 3132 DF, p-value: < 2.2e-16

lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ]) summary(lm_step_heart)

## ## Call: ## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ## ]) ## ## Residuals: ## Min 1Q Median 3Q Max ## -109.332 -29.399 -4.433 23.922 217.241 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 159.3317 5.8186 27.383 < 2e-16 *** ## Systolic 0.3022 0.0634 4.767 1.95e-06 *** ## Diastolic 0.2770 0.1044 2.652 0.00803 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 42.26 on 3129 degrees of freedom ## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655 ## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16

Homework 3

Leonel Salazar

2023-11-04

Exercise 1

The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.

Exercise 2

The medical director wants to know if blood pressure and weight can better predict cholesterol outcomes. Consider modeling cholesterol as a function of diastolic, systolic, and weight.

Exercise 3

Now consider stepwise model selection for the Cholesterol model. Before performing the model selection, we remove influential points detected in Exercise 2, which have a cook’s distance larger than 0.015.

Exercise 4

Now consider the best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which has a cook’s distance larger than 0.015, before performing the model selection.