## 'data.frame': 3134 obs. of 4 variables:
## $ Weight : int 132 158 156 131 136 194 179 151 174 155 ...
## $ Diastolic : int 90 80 76 92 80 68 76 68 90 90 ...
## $ Systolic : int 170 128 110 176 112 132 128 108 142 130 ...
## $ Cholesterol: int 250 242 281 196 196 211 225 221 188 292 ...
Looking at Cook’s distance, we can see that the 23rd and 210th observations are dramatically higher than every other observation in the set. Using a cutoff of 0.015, we remove these two observations and refit the regression model.
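The removal step described above can be sketched in base R as follows. This is only a sketch under stated assumptions: the data frame `sim` and the two planted extreme rows are hypothetical stand-ins for the heart data, and `cutOff` simply mirrors the name that appears in the refit call.

```r
# Simulated stand-in for the heart data (hypothetical values, not the real data set)
set.seed(1)
n <- 200
sim <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, sd = 40)

# Plant two extreme observations so that something clearly exceeds the cutoff
sim$Weight[c(23, 110)] <- c(90, 100)
sim$Cholesterol[c(23, 110)] <- c(550, 500)

fit <- lm(Cholesterol ~ Weight, data = sim)

# Cook's distance for every observation; keep the row indices above the cutoff
cooks <- cooks.distance(fit)
cutOff <- which(cooks > 0.015)

# Refit without the flagged rows. Guard against an empty index,
# because sim[-integer(0), ] would drop every row.
refit <- if (length(cutOff) > 0) {
  lm(Cholesterol ~ Weight, data = sim[-cutOff, ])
} else {
  fit
}

sim[cutOff, ]    # inspect the removed observations
summary(refit)
```

The appropriate cutoff depends on the sample size, so the value 0.015 may flag more points in this small simulated sample than it does in the full data set.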
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-cutOff, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.369 -29.395 -4.482 23.672 209.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.57605 4.18543 48.639 < 2e-16 ***
## Weight 0.12264 0.02745 4.469 8.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared: 0.006339, Adjusted R-squared: 0.006022
## F-statistic: 19.97 on 1 and 3130 DF, p-value: 8.155e-06
After running the regression model, we see that Weight has a very small p-value, which tells us that its relationship with Cholesterol is statistically significant; the overall F-test (p-value = 8.155e-06) agrees. We can therefore conclude that there is a linear relationship between Weight and Cholesterol. However, the model explains very little of the variation in Cholesterol: R-squared is less than 1%, so its predictive power is very poor. Examining the diagnostic plots, the Q-Q plot follows the reference line fairly well until the upper right tail. The standardized residuals show a pattern, and several points exceed 2. The residual plot is also patterned, which leads me to believe that heteroscedasticity is present. Cook’s distance still shows several relatively large observations, but they are all well below our cutoff. While this model could still help predict Cholesterol level, it would not do a very good job, so I would not recommend it to the medical director.
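The quantities discussed above (the overall F-test p-value, R-squared, and the number of large standardized residuals) can be pulled directly from a fitted model. The sketch below assumes simulated data; `sim`, `p_value`, `r_squared`, and `n_large` are illustrative names, not objects from the original analysis.

```r
# Simulated stand-in data (hypothetical values)
set.seed(2)
n <- 500
sim <- data.frame(Weight = rnorm(n, mean = 150, sd = 25))
sim$Cholesterol <- 205 + 0.12 * sim$Weight + rnorm(n, sd = 40)

fit <- lm(Cholesterol ~ Weight, data = sim)
s <- summary(fit)

# Overall F-test: summary() stores the statistic and its degrees of freedom
f <- s$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Share of the variation in the response explained by the model
r_squared <- s$r.squared

# Standardized residuals beyond +/- 2 are the points that stand out in the plot
n_large <- sum(abs(rstandard(fit)) > 2)

c(p_value = p_value, r_squared = r_squared, n_large = n_large)
```

With a single predictor the F-test p-value equals the p-value of the slope's t-test. A significant p-value together with a tiny R-squared is common in large samples: the slope is detectably nonzero, yet it explains little of the variation.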
Looking at the diagnostic plots for this model, we see many similarities with those from exercise 1. The Q-Q plot follows the reference line until the right tail, where it suddenly diverges. The residual and standardized residual plots follow the same basic pattern, and Cook’s distance still shows the same two observations (23 and 210) that are much larger than any of the others. We will follow the same procedure to remove these observations.
##
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Weight 0.02146 0.02903 0.739 0.4597
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
##
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Weight 0.03671 0.02860 1.284 0.1994
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
Looking at the model, we first see that Weight is not significant because of its large p-value (0.1994), while both Diastolic and Systolic are significant at the 0.05 level (p-values of 0.0195 and 2.2e-06). R-squared is slightly larger in this model, coming out to almost 4%, but this is still small and the model lacks any real predictive power. We check for multicollinearity with the VIF function and see that all of our x-variables have a VIF below 10. This leads us to believe that none of the variables is highly correlated with the others. A low R-squared does not by itself tell us anything about correlation among the predictors, so we needed the VIF values to confirm it. Because of this, we won’t remove any of the variables to refit the model. With such a low R-squared, even though our x-variables are not highly correlated, I would not recommend this model for predicting Cholesterol level.
## [1] 0.03766896
## Weight Diastolic Systolic
## 1.120631 2.558914 2.454207
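To make the VIF values above more concrete, here is a minimal base R sketch of how a variance inflation factor can be computed by hand: each predictor is regressed on the others, and VIF = 1 / (1 - R-squared) of that regression. The data are simulated, and `vif_by_hand` is a hypothetical helper, not the VIF function used in the analysis.

```r
# Simulated predictors (hypothetical values); Diastolic is built to be
# moderately correlated with Systolic, Weight is independent of both
set.seed(3)
n <- 300
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, sd = 8)
Weight    <- rnorm(n, mean = 150, sd = 25)
sim <- data.frame(Weight, Diastolic, Systolic)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(sim)
vifs
```

A VIF of 1 means a predictor is uncorrelated with the others. Values around 2.5, like those for Diastolic and Systolic above, indicate some shared information between the two blood pressure readings, but they remain well under the usual threshold of 10.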
After running the stepwise model selection, we are left with only two variables, Systolic and Diastolic. Examining the plots across the two steps, we can see that R-squared, adjusted R-squared, and SBC increase when Diastolic is added, while C(p), AIC, and SBIC decrease.
##
## Stepwise Selection Summary
## --------------------------------------------------------------------------
##                  Added/                  Adj.
## Step  Variable   Removed   R-Square  R-Square    C(p)         AIC     RMSE
## --------------------------------------------------------------------------
##    1  Systolic   addition     0.035     0.035  8.6850  32349.7666  42.3013
##    2  Diastolic  addition     0.037     0.037  3.6480  32344.7321  42.2606
## --------------------------------------------------------------------------
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
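The stepwise selection above can be approximated with base R's `step()` function. Note that this is a swapped-in technique: `step()` adds and drops terms by AIC, which may differ from the criterion used by the routine that produced the summary above. The data are simulated, and `sim`, `null_fit`, and `step_fit` are hypothetical names.

```r
# Simulated stand-in data (hypothetical values)
set.seed(4)
n <- 400
sim <- data.frame(
  Weight   = rnorm(n, mean = 150, sd = 25),
  Systolic = rnorm(n, mean = 130, sd = 20)
)
sim$Diastolic <- 30 + 0.4 * sim$Systolic + rnorm(n, sd = 8)

# In this simulation Cholesterol depends on the two blood pressure readings only
sim$Cholesterol <- 100 + 0.5 * sim$Systolic + 0.8 * sim$Diastolic + rnorm(n, sd = 20)

# Start from the intercept-only model
null_fit <- lm(Cholesterol ~ 1, data = sim)

# Let terms enter and leave, choosing the move that lowers AIC the most
step_fit <- step(
  null_fit,
  scope = list(lower = ~ 1, upper = ~ Weight + Diastolic + Systolic),
  direction = "both",
  trace = 0
)

formula(step_fit)
```

Setting `trace = 1` instead prints the AIC of every candidate move at each step, which is the closest base R analogue to the step-by-step summary table.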
In the final model of Cholesterol on Systolic and Diastolic, R-squared is very small (about 3.7%), confirming that the model explains very little of the variation. Compared with exercise 2, the variation explained is almost identical, while the model from exercise 1 explains much less (under 1%).
For adjusted R-squared, we want the model with the largest value, which is model 3 (Weight, Diastolic, and Systolic).
For AIC, we want the model with the smallest value, so the best subset is model 2 (Diastolic and Systolic).
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ------------------------------------------------------------------------------------------------------------------------
##                      Adj.      Pred
## Model  R-Square  R-Square  R-Square    C(p)         AIC        SBIC         SBC          MSEP        FPE     HSP     APC
## ------------------------------------------------------------------------------------------------------------------------
##     1    0.0350    0.0347    0.0337  8.6847  32349.7666  23461.5297  32367.9149  5604396.2122  1790.5412  0.5719  0.9662
##     2    0.0372    0.0365    0.0352  3.6475  32344.7321  23456.5056  32368.9298  5593610.3978  1787.6653  0.5710  0.9647
##     3    0.0377    0.0367    0.0351  4.0000  32345.0829  23456.8621  32375.3300  5592453.6261  1787.8655  0.5710  0.9648
## ------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
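The two selection rules used above (largest adjusted R-squared, smallest AIC) can be checked by fitting each candidate model and tabulating both criteria. The following is a sketch in base R on simulated data; `candidates`, `comparison`, `best_adj_r2`, and `best_aic` are hypothetical names, and the three formulas simply mirror the model index in the output.

```r
# Simulated stand-in data (hypothetical values)
set.seed(5)
n <- 400
sim <- data.frame(
  Weight   = rnorm(n, mean = 150, sd = 25),
  Systolic = rnorm(n, mean = 130, sd = 20)
)
sim$Diastolic   <- 30 + 0.4 * sim$Systolic + rnorm(n, sd = 8)
sim$Cholesterol <- 100 + 0.5 * sim$Systolic + 0.8 * sim$Diastolic + rnorm(n, sd = 20)

# The three nested candidate models from the model index
candidates <- list(
  m1 = Cholesterol ~ Systolic,
  m2 = Cholesterol ~ Diastolic + Systolic,
  m3 = Cholesterol ~ Weight + Diastolic + Systolic
)
fits <- lapply(candidates, lm, data = sim)

# One row per model: adjusted R-squared (larger is better), AIC (smaller is better)
comparison <- data.frame(
  model  = names(fits),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
comparison

best_adj_r2 <- comparison$model[which.max(comparison$adj_r2)]
best_aic    <- comparison$model[which.min(comparison$aic)]
```

The two criteria penalize extra predictors differently, so they can disagree, as they do above where adjusted R-squared favors model 3 and AIC favors model 2. Adjusted R-squared follows 1 - (1 - R²)(n - 1)/(n - p - 1); plugging in R² = 0.03767, n = 3132, and p = 3 reproduces the reported 0.03675.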
Comparing all three of these together, it’s actually striking how similar the models are. They all share the same overall p-value (< 2.2e-16) and have nearly identical R-squared and adjusted R-squared values. In fact, the model chosen by AIC and the model from stepwise selection are exactly the same. I think this is to be expected when there are so few variables in the data set.
##
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Weight 0.03671 0.02860 1.284 0.1994
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = heart1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16