## 'data.frame':    3134 obs. of  4 variables:
##  $ Weight     : int  132 158 156 131 136 194 179 151 174 155 ...
##  $ Diastolic  : int  90 80 76 92 80 68 76 68 90 90 ...
##  $ Systolic   : int  170 128 110 176 112 132 128 108 142 130 ...
##  $ Cholesterol: int  250 242 281 196 196 211 225 221 188 292 ...

Exercise 1:

(a)

Looking at Cook’s distance, we can see that the values for observations 23 and 210 are dramatically higher than those of every other observation in the set. Using a cutoff of 0.015, we remove these two observations and refit the regression model.

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-cutOff, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.369  -29.395   -4.482   23.672  209.348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 203.57605    4.18543  48.639  < 2e-16 ***
## Weight        0.12264    0.02745   4.469 8.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared:  0.006339,   Adjusted R-squared:  0.006022 
## F-statistic: 19.97 on 1 and 3130 DF,  p-value: 8.155e-06
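
The removal step described above can be sketched roughly as follows. This is a minimal illustration, not the code that produced the output: the `heart` data are not reproduced here, so the sketch builds a simulated stand-in (`heart_sim`, a hypothetical name) with two planted extreme rows, and only the 0.015 cutoff and the `Cholesterol ~ Weight` formula come from the report.

```r
# Simulated stand-in for the `heart` data; the values are NOT the real data.
set.seed(1)
n <- 3000
heart_sim <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
heart_sim$Cholesterol <- round(205 + 0.1 * heart_sim$Weight + rnorm(n, sd = 40))

# Plant two extreme rows, loosely mimicking observations 23 and 210.
heart_sim$Weight[c(23, 210)]      <- c(90, 100)
heart_sim$Cholesterol[c(23, 210)] <- c(550, 500)

fit_all <- lm(Cholesterol ~ Weight, data = heart_sim)

# Cook's distance for every observation, compared against the 0.015 cutoff.
cd <- cooks.distance(fit_all)
threshold <- 0.015
cutOff <- which(cd > threshold)
print(heart_sim[cutOff, ])

# Logical indexing is used instead of heart_sim[-cutOff, ] because negative
# indexing with an empty index vector would drop every row.
heart_trim <- heart_sim[cd <= threshold, ]
fit_trim <- lm(Cholesterol ~ Weight, data = heart_trim)
print(coef(summary(fit_trim)))
```

In the simulated data other rows may also exceed the cutoff by chance; the point of the sketch is only the mechanics of flagging by Cook's distance and refitting on the remaining rows.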

Exercise 1:

(b)

After running the regression model, we see that Weight has a very small p-value (8.16e-06), which tells us that its relationship with Cholesterol is statistically significant; the overall F-statistic (19.97 on 1 and 3130 DF) leads to the same conclusion. So there is evidence of a linear relationship between Weight and Cholesterol. However, the variation in Cholesterol explained by the model is very small, less than 1% in fact (R-squared = 0.0063). This tells us that the predictive power of this model is very poor.

Examining the diagnostic plots, the Q-Q plot follows the reference line fairly well until the upper right tail. The standardized residuals show a pattern, and several points exceed 2. The residual plot is also patterned, which leads me to believe that these plots support the case for heteroscedasticity. In the Cook’s distance plot we still see several comparatively large observations, but they are all well below our cutoff.

While this model could still help predict Cholesterol level, it wouldn’t do a very good job, so I wouldn’t be able to recommend it to the medical director.
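
As a rough consistency check on the refitted model's output, two standard identities for a regression with a single predictor can be applied to the rounded values in the summary (t = 4.469, residual df = 3130). Small differences in the last digit come from rounding.

```latex
% With one predictor, the overall F-statistic is the square of the slope's t-statistic.
F = t^2 = 4.469^2 \approx 19.97

% R-squared can be recovered from F and the residual degrees of freedom.
R^2 = \frac{F}{F + \mathrm{df}_{\text{res}}}
    = \frac{19.97}{19.97 + 3130} \approx 0.0063
```

Both values agree with the reported F-statistic and Multiple R-squared, which illustrates how a highly significant slope can coexist with a model that explains well under 1% of the variation when the sample is this large.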

Exercise 2:

(a)

Looking at the diagnostic plots for this model, we see many similarities with those in Exercise 1. The Q-Q plot follows the reference line until the right tail, where it diverges sharply. The residual and standardized residual plots follow the same basic pattern, and Cook’s distance still flags the same two observations (23 and 210) as much larger than any of the others. We follow the same procedure to remove these observations and refit the model.

## 
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500
## 
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Exercise 2:

(b)

Looking at the model, we first see that Weight is not significant given its large p-value (0.1994), while both Diastolic (p = 0.0195) and Systolic (p = 2.2e-06) are significant at the 5% level. The proportion of variation explained is slightly larger in this model, coming out to almost 4% (R-squared = 0.0377), but this is still small and the model lacks any real predictive power. To check for multicollinearity we use the VIF function and see that all of our x-variables have a VIF below 10. This leads us to believe that there is no severe multicollinearity among the predictors, although the values near 2.5 for Diastolic and Systolic suggest that the two are moderately correlated with each other. A low R-squared does not by itself rule out multicollinearity, so we needed to check this directly. Since no VIF is a concern, we won’t remove any of the variables to refit the model. Even though our x-variables are not highly correlated, with such a low R-squared I would recommend not using this model for the prediction of Cholesterol level.

## [1] 0.03766896
##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207
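
To make the VIF values easier to interpret, the sketch below computes them by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. It is an illustration under stated assumptions: the data frame `sim` is simulated (not the `heart1` data), `vif_manual` is a hypothetical helper, and the report itself most likely used a packaged VIF function rather than this code.

```r
# Simulated stand-in data; Diastolic is deliberately correlated with Systolic.
set.seed(2)
n <- 500
Systolic    <- rnorm(n, mean = 130, sd = 20)
Diastolic   <- 30 + 0.4 * Systolic + rnorm(n, sd = 8)
Weight      <- rnorm(n, mean = 150, sd = 25)
Cholesterol <- 160 + 0.3 * Systolic + 0.25 * Diastolic + rnorm(n, sd = 40)
sim <- data.frame(Weight, Diastolic, Systolic, Cholesterol)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from the
# auxiliary regression of predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(sim, c("Weight", "Diastolic", "Systolic"))
print(round(vifs, 3))
```

Read this way, a VIF of about 2.5 corresponds to an auxiliary R-squared of roughly 0.6, which is noticeable correlation but well short of the common rule-of-thumb threshold of 10.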

Exercise 3:

(a)

After running the stepwise model selection and generating the plots, we are left with only two variables, Systolic and Diastolic. Examining the plots across the two steps, we can see that R-squared, adjusted R-squared, and SBC increase when Diastolic is added, while C(p), AIC, and SBIC decrease.

## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------
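
A stepwise search of this kind can be sketched with base R's `step()`. Note the swap: `step()` adds and drops terms by AIC, whereas the summary above appears to come from a p-value-based stepwise routine, so this is a stand-in for the idea rather than a reproduction of the report's method. The data frame `sim` is simulated, and its effect sizes are exaggerated so the example behaves predictably.

```r
# Simulated stand-in data (NOT the heart1 data); Weight has no true effect.
set.seed(3)
n <- 400
sim <- data.frame(
  Weight    = rnorm(n, mean = 150, sd = 25),
  Diastolic = rnorm(n, mean = 85,  sd = 12),
  Systolic  = rnorm(n, mean = 130, sd = 20)
)
sim$Cholesterol <- 160 + 0.3 * sim$Systolic + 0.25 * sim$Diastolic +
  rnorm(n, sd = 10)

# Start from the intercept-only model and allow moves up to the full model.
null_fit <- lm(Cholesterol ~ 1, data = sim)
full_fit <- lm(Cholesterol ~ Weight + Diastolic + Systolic, data = sim)

selected <- step(
  null_fit,
  scope     = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "both",  # terms may be added or removed at each step
  trace     = 0        # suppress the step-by-step log
)

chosen_terms <- attr(terms(selected), "term.labels")
print(chosen_terms)
```

Because AIC is more permissive than a strict p-value threshold, `step()` can occasionally retain a weak predictor that a p-value-based procedure would leave out.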

Exercise 3:

(b)

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

In the final model for Cholesterol, which contains both Systolic and Diastolic, the R-squared value is very small (0.0372), confirming that the model explains very little of the variation in Cholesterol. Comparing it to the Exercise 2 model (0.0377), we see that the variation explained is almost identical, while the Exercise 1 model explains much less (0.0063).

Exercise 4:

(a)

For adjusted R-squared we want the model with the largest value, which is model 3: Weight, Diastolic, and Systolic (0.0367).

(b)

For AIC we want the model with the smallest value, so the best subset is model 2: Diastolic and Systolic (32344.7321).

##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
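
As a rough check on the adjusted R-squared comparison, the usual formula can be applied to the rounded R-squared values from the model summaries, taking n = 3132 observations (3134 less the two removed) and p as the number of predictors. Small differences in the last digit come from rounding.

```latex
% Adjusted R-squared penalizes R-squared for the number of predictors p.
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

% Model 2 (Diastolic, Systolic): p = 2
\bar{R}^2_{(2)} = 1 - (1 - 0.03716)\,\frac{3131}{3129} \approx 0.0365

% Model 3 (Weight, Diastolic, Systolic): p = 3
\bar{R}^2_{(3)} = 1 - (1 - 0.03767)\,\frac{3131}{3128} \approx 0.0367
```

The two values differ only in the fourth decimal place, so the adjusted R-squared preference for model 3 is very slight.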

Exercise 4:

(c)

Comparing all three of these together, it is striking how similar the models are. They all share the same overall p-value (< 2.2e-16) and have nearly identical R-squared (0.03767 vs. 0.03716) and adjusted R-squared (0.03675 vs. 0.03655) values. In fact, the model chosen by AIC and the model from stepwise selection are exactly the same. I think this is to be expected when a data set has so few variables.

Final Model from Adjusted R-Squared.

## 
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Final Model from AIC

## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = heart1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

Final Model from Stepwise Selection

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16