Due: Saturday, Nov 4 by 11:59pm.
You need to download heart.csv.
Below is a brief summary of variables in hear.csv.
The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure, and Cholesterol for alive subjects in the heart.csv.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(stats)
library(car)
## Loading required package: carData
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
##
## Attaching package: 'olsrr'
##
## The following object is masked from 'package:datasets':
##
## rivers
.bordered{
border-style: solid;
border-color: teal;
padding: 5px;
background-color: #DCDCDC;
}
heart <- read.csv("heart.csv")
(A) Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut-off to be 0.015.
lm_heart = lm(Cholesterol ~ Weight, data = heart)
summary(lm_heart)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
cor(heart$Weight, heart$Cholesterol, method="pearson")
## [1] 0.0695377
cor(heart$Weight, heart$Cholesterol, method="spearman")
## [1] 0.1078544
par(mfrow=c(2,2))
plot(lm_heart, which=c(1:4))
(B) Comment on the significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for predicting Cholesterol levels.
CD_lm_heart <- cooks.distance(lm_heart)
CD_lm_heart <- round(CD_lm_heart, 5)
CD_lm_heart <- sort(CD_lm_heart>0.015)
CD_lm_heart <- rev(sort(CD_lm_heart))
head(CD_lm_heart, 5)
## 210 23 3134 3133 3132
## TRUE TRUE FALSE FALSE FALSE
CD_cleaned <- which(cooks.distance(lm_heart)>0.015)
CD_cleaned <- lm(Cholesterol ~ Weight, data=heart[-CD_cleaned,])
summary(CD_cleaned)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-CD_cleaned,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.369 -29.395 -4.482 23.672 209.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.57605 4.18543 48.639 < 2e-16 ***
## Weight 0.12264 0.02745 4.469 8.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared: 0.006339, Adjusted R-squared: 0.006022
## F-statistic: 19.97 on 1 and 3130 DF, p-value: 8.155e-06
par(mfrow=c(2,2))
plot(CD_cleaned, which=c(1:4))
(A) Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. For Cook’s distances, do not leave any points that have Cook’s distance greater than 0.015.
heart_2 = lm(Cholesterol~ Diastolic+Systolic+Weight,, data = heart)
summary(heart_2)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## Weight 0.02146 0.02903 0.739 0.4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
vif(heart_2)
## Diastolic Systolic Weight
## 2.558682 2.454214 1.120375
par(mfrow=c(2,2))
plot(heart_2, which=c(1:4))
CD_2 <- which(cooks.distance(heart_2) > 0.015)
heart[CD_2, ]
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
lm_heart_2=lm(Cholesterol~., data=heart[-CD_2, ])
par(mfrow=c(2,2))
plot(lm_heart_2, which=c(1:4))
(B) Comment on the significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issues among predictors. Explain to the medical director whether this is a good model for predicting Cholesterol levels.
The weight is not significant but diastolic and systolic are both significant with p-values less than 0.05 in t-test. If the bottom number in the blood pressure (Diastolic) and weight are fixed, the increase of 1 unit in top number of blood pressure (Systolic ) would predict an increase of 0.30 units of cholesterol on average. Similarly, if the top number (Systolic) and weight are unchanged, 1 unit increase of bottom number (Diastolic) predicts an increase of 0.25 units in cholesterol on average. There is no multicollinearity issue among the predictors since the VIFs are all less than 10 and the scatterplot matrix does not indicate very strongly correlated predictors.
summary(lm_heart_2)
##
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-CD_2, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Weight 0.03671 0.02860 1.284 0.1994
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
vif(lm_heart_2)
## Weight Diastolic Systolic
## 1.120631 2.558914 2.454207
pairs(heart[,c(1:3)], pch = 19)
(A) Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.
stepwise_heart = ols_step_both_p(lm_heart_2, pent = 0.05, prem = 0.05, details = FALSE)
stepwise_heart
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Systolic addition 0.035 0.035 8.6850 32349.7666 42.3013
## 2 Diastolic addition 0.037 0.037 3.6480 32344.7321 42.2606
## ----------------------------------------------------------------------------------------
plot(stepwise_heart)
lm_stepwise_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart)
lm_stepwise_heart_cd <- which(cooks.distance(heart_2) > 0.015)
lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ])
par(mfrow=c(2,2))
plot(lm_step_heart, which=c(1:4))
(B) Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercises 1 and 2.
lm_step_heart <- lm(Cholesterol~Systolic+Diastolic, data=heart)
summary(lm_step_heart)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.52 -29.58 -4.57 23.79 328.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.63995 5.91244 27.001 < 2e-16 ***
## Systolic 0.30193 0.06442 4.687 2.89e-06 ***
## Diastolic 0.27609 0.10612 2.602 0.00932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527
## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16
(A) Find the best model based on adjusted-R square criteria and specify which predictors are selected.
best_heart = ols_step_best_subset(lm_heart_2)
best_heart
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662
## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647
## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
(B) Find the best model based on AIC criteria and specify which predictors are selected.
(C) Compare the final models selected in a) and b). Also, compare the final models from the best subset approach with the final model from the stepwise selection.
heart_4A=lm(Cholesterol ~ Systolic + Diastolic + Weight, data=heart)
summary(heart_4A)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Weight 0.02146 0.02903 0.739 0.4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
heart_4B=lm(Cholesterol ~ Systolic + Diastolic, data=heart)
summary(heart_4B)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.52 -29.58 -4.57 23.79 328.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.63995 5.91244 27.001 < 2e-16 ***
## Systolic 0.30193 0.06442 4.687 2.89e-06 ***
## Diastolic 0.27609 0.10612 2.602 0.00932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527
## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16
heart_4sys=lm(Cholesterol ~ Systolic, data=heart)
summary(heart_4sys)
##
## Call:
## lm(formula = Cholesterol ~ Systolic, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.80 -29.84 -4.36 23.60 328.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165.57934 5.45892 30.33 <2e-16 ***
## Systolic 0.43092 0.04117 10.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.98 on 3132 degrees of freedom
## Multiple R-squared: 0.0338, Adjusted R-squared: 0.0335
## F-statistic: 109.6 on 1 and 3132 DF, p-value: < 2.2e-16
lm_step_heart = lm(Cholesterol~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd, ])
summary(lm_step_heart)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-lm_stepwise_heart_cd,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16