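# Simple linear regression of Cholesterol on Weight
lm.heartBP <- lm(Cholesterol ~ Weight, data = heartBP)
# Scatterplot of the raw data with the fitted line overlaid in red
plot(heartBP$Weight, heartBP$Cholesterol,
     xlab = "Weight",
     ylab = "Cholesterol")
abline(lm.heartBP, col = "red")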
cor(heartBP$Weight, heartBP$Cholesterol, method = "spearman")
## [1] 0.1078544
Spearman correlation: rho is about 0.108, a weak positive monotonic association between Weight and Cholesterol.
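How strong a correlation is and whether it differs significantly from zero are separate questions. The sketch below uses simulated stand-in values, not the heartBP data, and the variable names are hypothetical. It shows that Spearman's rho is Pearson's correlation computed on ranks, and that `cor.test()` supplies the p-value that `cor()` alone does not report.

```r
# Simulated stand-in for two columns (hypothetical values, not the real heartBP data)
set.seed(1)
weight <- rnorm(200, mean = 150, sd = 25)
cholesterol <- 200 + 0.1 * weight + rnorm(200, sd = 40)

# Spearman's rho is Pearson's correlation computed on the ranks
rho_spearman <- cor(weight, cholesterol, method = "spearman")
rho_ranks <- cor(rank(weight), rank(cholesterol))

# cor.test() adds a p-value for H0: rho = 0
rho_test <- cor.test(weight, cholesterol, method = "spearman")
rho_test$estimate
rho_test$p.value
```

On the real data the same call would be `cor.test(heartBP$Weight, heartBP$Cholesterol, method = "spearman")`.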
par(mfrow = c(2,2))
plot(lm.heartBP, which = c(1:4))
Normal Q-Q plot: the normality assumption is not reasonable.
Standardized residuals: these also indicate that the normality assumption is not reasonable.
Equal variance (Scale-Location plot): the square root of the absolute standardized residuals shows a pattern, which points to heteroscedasticity.
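The visual checks above can be backed by formal tests. This is a minimal sketch on simulated data with deliberately non-constant variance. The data and variable names are assumptions for illustration only. It runs a Shapiro-Wilk test on the residuals and a hand-computed studentized Breusch-Pagan test, using base R only.

```r
# Simulated data (hypothetical) whose noise spread grows with x
set.seed(2)
n <- 300
x <- runif(n, 100, 250)
y <- 200 + 0.1 * x + rnorm(n, sd = 0.2 * x)
fit <- lm(y ~ x)

# Shapiro-Wilk test of H0: residuals are normal (accepts 3 to 5000 observations)
shapiro.test(residuals(fit))

# Studentized Breusch-Pagan test by hand: regress the squared residuals on the
# predictor; n * R^2 is approximately chi-squared with 1 df under H0: constant variance
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_pvalue)
```

With several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal the number of predictors.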
summary(lm.heartBP)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heartBP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
Model significance: the F-test p-value (9.778e-05) is below the 0.05 significance level, so we reject the null and conclude that the linear regression model is useful (at least one slope coefficient is not equal to 0).
Individual term significance: the t-test p-value for Weight (9.78e-05) is also below 0.05, so we again reject the null and conclude there is a linear relationship between Weight and Cholesterol.
Estimated regression line: Cholesterol = 205.868 + 0.109(Weight); Cholesterol is expected to increase by about 0.109 for each one-unit increase in Weight.
R-squared: only about 0.5% of the variation in Cholesterol is explained by Weight. The doctor should not use this model; it has low predictive power.
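With a single predictor, the F-test and the t-test for the slope are the same test, which is why the two p-values above agree. A short arithmetic check, using only numbers copied from the summary output, illustrates this and recovers R-squared from the F statistic:

```r
# Values copied from the summary(lm.heartBP) output
t_weight <- 3.901
f_stat   <- 15.22
df_resid <- 3132

# With one predictor, the overall F statistic is the square of the slope's t statistic
t_weight^2

# R-squared can be recovered from F and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)
r_squared
```

Both results match the reported F-statistic and Multiple R-squared up to rounding.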
pairs(heartBP)
lm.heartBP_2 <- lm(Cholesterol~., data = heartBP)
par(mfrow = c(2,2))
plot(lm.heartBP_2, which = c(1:4))
Normal Q-Q plot: the normality assumption is not reasonable.
Standardized residuals: these also indicate that the normality assumption is not reasonable.
Equal variance: the square root of the absolute standardized residuals shows a pattern, so heteroscedasticity is likely.
Cook's distance: two points appear unduly influential; we need to confirm that their Cook's distance is greater than 0.015.
ipoint <- which(cooks.distance(lm.heartBP_2) > 0.015)
heartBP[ipoint, ]
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
Both points have a Cook's distance greater than 0.015, so we refit the model without them.
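The flag-and-refit step can be sketched on simulated data with one planted outlier. The data, the names, and the 4/n rule-of-thumb cutoff are assumptions for illustration; the analysis above used 0.015, read from its own plot. The sketch drops rows with a logical mask, because negative indexing with an empty `which()` result would return zero rows rather than the full data.

```r
# Simulated data (hypothetical) with one planted outlier at a high-leverage x value
set.seed(3)
dat <- data.frame(x = 1:50)
dat$y <- 2 + 0.5 * dat$x + rnorm(50)
dat$y[50] <- 200

fit <- lm(y ~ x, data = dat)
cutoff <- 4 / nrow(dat)   # rule-of-thumb cutoff (an assumption, not the 0.015 used above)

# A logical mask stays correct even when no observation is flagged
influential <- cooks.distance(fit) > cutoff
flagged <- which(influential)
dat_clean <- dat[!influential, ]
fit_clean <- lm(y ~ x, data = dat_clean)

# Compare the coefficients before and after removing the flagged rows
coef(fit)
coef(fit_clean)
```

Comparing the two coefficient vectors shows how much the flagged rows were pulling the fit.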
lm.heartBP_2 <- lm(Cholesterol~., data = heartBP[-ipoint, ])
summary(lm.heartBP_2)
##
## Call:
## lm(formula = Cholesterol ~ ., data = heartBP[-ipoint, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Weight 0.03671 0.02860 1.284 0.1994
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
Model significance: the F-test p-value (< 2.2e-16) is below 0.05, so we reject the null and conclude that the linear regression model is useful (at least one slope coefficient is not equal to 0).
Individual term significance (t-tests for Weight, Diastolic and Systolic):
Weight: the p-value (0.1994) is greater than 0.05, so we do not reject the null; there is no evidence of a linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model.
Diastolic: the p-value (0.0195) is less than 0.05, so we reject the null; there is a linear relationship between Diastolic and Cholesterol.
Systolic: the p-value (2.2e-06) is also less than 0.05 and is the smallest of the three, so we reject the null; there is a linear relationship between Systolic and Cholesterol.
Estimated regression line: holding the other predictors constant, a one-unit increase in Diastolic is associated with an increase of about 0.249 in Cholesterol, and a one-unit increase in Systolic with an increase of about 0.301.
R-squared: only about 3.767% of the variation in Cholesterol is explained by Weight, Diastolic and Systolic together. The doctor should not use this model; it has low predictive power.
VIF(lm.heartBP_2)
## Weight Diastolic Systolic
## 1.120631 2.558914 2.454207
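None of the three VIF values exceeds the cutoff of 10, so there is no serious multicollinearity among the predictors. This does not mean the predictors are uncorrelated: the values near 2.5 for Diastolic and Systolic indicate moderate correlation, just not enough to destabilize the coefficient estimates.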
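A VIF is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below computes it with base R on simulated data; `vif_by_hand()` and the simulated columns are hypothetical stand-ins, not the `VIF()` function or data used above. It then reads the reported Diastolic VIF backwards to the R² it implies.

```r
# VIF for each predictor: regress it on the remaining predictors, then 1 / (1 - R^2)
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Simulated stand-in data (hypothetical): Diastolic is built to track Systolic
set.seed(4)
sim <- data.frame(Systolic = rnorm(200, 130, 20))
sim$Diastolic <- 30 + 0.4 * sim$Systolic + rnorm(200, sd = 8)
sim$Weight <- rnorm(200, 150, 25)
vif_by_hand(sim, c("Weight", "Diastolic", "Systolic"))

# Reading the reported value backwards: the R^2 implied by VIF = 2.558914
1 - 1 / 2.558914
```

The last line shows that about 61% of the variation in Diastolic is explained by the other two predictors, which is moderate overlap rather than independence.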
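library(olsrr)  # provides ols_step_both_p() and ols_step_best_subset()
# Stepwise selection: a variable enters only if its p-value is below pent
model.stepwise <- ols_step_both_p(lm.heartBP_2, pent = 0.05, details = FALSE)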
model.stepwise
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Systolic addition 0.035 0.035 8.6850 32349.7666 42.3013
## 2 Diastolic addition 0.037 0.037 3.6480 32344.7321 42.2606
## ----------------------------------------------------------------------------------------
plot(model.stepwise)
lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heartBP)
summary(lm.step)
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heartBP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.52 -29.58 -4.57 23.79 328.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.63995 5.91244 27.001 < 2e-16 ***
## Systolic 0.30193 0.06442 4.687 2.89e-06 ***
## Diastolic 0.27609 0.10612 2.602 0.00932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared: 0.03589, Adjusted R-squared: 0.03527
## F-statistic: 58.27 on 2 and 3131 DF, p-value: < 2.2e-16
Model significance: the F-test p-value (< 2.2e-16) is less than 0.05, so we reject the null and conclude that the linear regression model is useful (at least one slope coefficient is not equal to 0).
Individual term significance:
Systolic: the p-value (2.89e-06) is less than 0.05, so we reject the null; there is a linear relationship between Systolic and Cholesterol.
Diastolic: the p-value (0.00932) is less than 0.05, so we reject the null; there is a linear relationship between Diastolic and Cholesterol.
Estimated regression line: holding Diastolic constant, a one-unit increase in Systolic is associated with an increase of about 0.302 in Cholesterol; holding Systolic constant, a one-unit increase in Diastolic is associated with an increase of about 0.276.
R-squared: only about 3.589% of the variation in Cholesterol is explained by Systolic and Diastolic together. The doctor should not use this model; it has low predictive power.
Variation explained by each model:
Ex. 1: 0.4835% of the variation in Cholesterol explained by Weight; not a good model.
Ex. 2: 3.767% explained by Weight, Diastolic and Systolic (fit without the two influential points); not a good model.
Ex. 3: 3.589% explained by Systolic and Diastolic (fit on the full data, including the two points excluded in Ex. 2); not a good model.
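A worked prediction makes "low predictive power" concrete. The arithmetic below uses only the coefficients and residual standard error from the lm.step summary. The ±1.96 standard-error band is a rough approximation that ignores uncertainty in the coefficients. The chosen patient values, Systolic = 130 and Diastolic = 82, are those of the two influential rows listed earlier.

```r
# Coefficients and residual standard error copied from summary(lm.step)
b0 <- 159.63995
b_systolic <- 0.30193
b_diastolic <- 0.27609
resid_se <- 42.94

# Point prediction for a patient with Systolic = 130 and Diastolic = 82
pred <- b0 + b_systolic * 130 + b_diastolic * 82

# Rough 95% prediction band: +/- 1.96 residual standard errors
# (an approximation that ignores uncertainty in the estimated coefficients)
c(lower = pred - 1.96 * resid_se, fit = pred, upper = pred + 1.96 * resid_se)
```

The band is roughly 168 units wide, and the observed Cholesterol values for those two rows (550 and 500) still fall far above it. On the fitted object itself, `predict(lm.step, newdata = data.frame(Systolic = 130, Diastolic = 82), interval = "prediction")` would give the exact interval.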
model.best_subset <- ols_step_best_subset(lm.heartBP_2)
model.best_subset
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662
## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647
## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Best model by adjusted R-squared: Model 3 (highest adjusted R-squared, 0.0367). Selected predictors: Weight, Diastolic and Systolic.
Best model by AIC: Model 2 (smallest AIC, 32344.7321). Selected predictors: Diastolic and Systolic.
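The two criteria disagree only narrowly: Models 2 and 3 differ by about 0.35 AIC units and 0.0002 in adjusted R-squared. To see how such a table is built, here is a minimal base-R sketch that enumerates every predictor subset on simulated data. The data are hypothetical, and unlike the output above, which lists the best model of each size, this sketch lists all seven candidates. `AIC()` here is the base `stats` implementation, so its values may differ from other implementations by a constant.

```r
# Simulated stand-in data (hypothetical, not the heartBP data)
set.seed(5)
sim <- data.frame(Systolic = rnorm(300, 130, 20))
sim$Diastolic <- 30 + 0.4 * sim$Systolic + rnorm(300, sd = 8)
sim$Weight <- rnorm(300, 150, 25)
sim$Cholesterol <- 160 + 0.3 * sim$Systolic + 0.25 * sim$Diastolic + rnorm(300, sd = 40)

predictors <- c("Weight", "Diastolic", "Systolic")

# Every non-empty subset of the predictors (2^3 - 1 = 7 candidate models)
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate and record its adjusted R-squared and AIC
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Cholesterol"), data = sim)
  data.frame(model = paste(vars, collapse = " + "),
             adj_r2 = summary(fit)$adj.r.squared,
             aic = AIC(fit))
}))

results[order(results$aic), ]              # smallest AIC first
results$model[which.max(results$adj_r2)]   # best model by adjusted R-squared
```

Adjusted R-squared penalizes extra predictors more lightly than AIC does, so it is common for it to favor the larger model when the two criteria are this close.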
Summary of the selected models:
Best Model 1 (adjusted R-squared): Weight, Diastolic and Systolic.
Best Model 2 (AIC): Diastolic and Systolic.
Best Model 3 (stepwise selection): Diastolic and Systolic.
Best Models 1 and 2 both have overall p-values less than 0.05, so both are useful. In both, the Diastolic and Systolic predictors have p-values less than 0.05, which suggests a linear relationship with Cholesterol.
The AIC-based best subset and the stepwise selection (Best Model 3) agree on the predictors Diastolic and Systolic.
Final model: Cholesterol = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic) + ε