Exercise 1

## 'data.frame':    3134 obs. of  4 variables:
##  $ Weight     : int  132 158 156 131 136 194 179 151 174 155 ...
##  $ Diastolic  : int  90 80 76 92 80 68 76 68 90 90 ...
##  $ Systolic   : int  170 128 110 176 112 132 128 108 142 130 ...
##  $ Cholesterol: int  250 242 281 196 196 211 225 221 188 292 ...

Fitting the linear model for Cholesterol as a function of Weight

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05
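
The output above is what `summary()` prints for a fitted `lm` object. As a minimal sketch of the call, assuming the data frame is named `heart` as in the `Call:` line: the data file itself is not shown here, so a simulated stand-in with the same column names is used, and its numbers will not match the output above.

```r
# Simulated stand-in for the `heart` data (hypothetical values, same column names).
set.seed(1)
n <- 200
heart <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
heart$Cholesterol <- round(205 + 0.1 * heart$Weight + rnorm(n, mean = 0, sd = 40))

# Simple linear regression of Cholesterol on Weight.
fit1 <- lm(Cholesterol ~ Weight, data = heart)

# Coefficient table, residual standard error, R-squared and F-test.
summary(fit1)
```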

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-inf.id, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.369  -29.395   -4.482   23.672  209.348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 203.57605    4.18543  48.639  < 2e-16 ***
## Weight        0.12264    0.02745   4.469 8.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared:  0.006339,   Adjusted R-squared:  0.006022 
## F-statistic: 19.97 on 1 and 3130 DF,  p-value: 8.155e-06

After the influential points are removed, the following graph shows the regression line fitted with and without the influential points.
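
One way the influential points could be flagged and the two lines drawn is sketched below. It assumes a Cook's distance cutoff of 0.015 (the value quoted in Exercise 2; whether the same cutoff was used here is an assumption) and uses a simulated stand-in for `heart` with two planted extreme observations, so the flagged rows are illustrative only.

```r
# Simulated stand-in for `heart` (hypothetical values).
set.seed(2)
n <- 200
heart <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
heart$Cholesterol <- round(205 + 0.1 * heart$Weight + rnorm(n, mean = 0, sd = 40))
# Plant two extreme observations so that something clearly exceeds the cutoff.
heart$Weight[1:2] <- c(90, 100)
heart$Cholesterol[1:2] <- c(550, 500)

fit_all <- lm(Cholesterol ~ Weight, data = heart)

# Flag observations whose Cook's distance exceeds the assumed cutoff.
cutoff <- 0.015
inf.id <- which(cooks.distance(fit_all) > cutoff)
heart[inf.id, ]

# Refit without the flagged rows.
# Note: heart[-inf.id, ] would drop every row if inf.id were empty.
fit_clean <- lm(Cholesterol ~ Weight, data = heart[-inf.id, ])

# Scatter plot with both fitted lines.
plot(Cholesterol ~ Weight, data = heart)
abline(fit_all, col = "red")
abline(fit_clean, col = "blue", lty = 2)
legend("topright",
       legend = c("All points", "Influential points removed"),
       col = c("red", "blue"), lty = c(1, 2))
```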

Comment on Model Significance:

**H_0**: All b's are zero (Model is not useful)

**H_a**: At least one b is nonzero (Model is useful)

The p-value from the F-test is 8.155e-06, which is very small. Hence we have enough evidence to reject the null hypothesis and conclude that the model is significant.

Comment on individual parameter’s significance:

Because the model contains only one predictor, Weight, the model F-test and the t-test for the Weight coefficient are equivalent (both give a p-value of about 8.16e-06). The Weight parameter is therefore significant as well.
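
The equivalence can be checked numerically: with a single predictor, the F-statistic equals the square of the slope's t value. The short sketch below uses the values printed in the refitted summary (t = 4.469, F = 19.97, 3130 residual degrees of freedom); the variable names are illustrative.

```r
t_weight <- 4.469   # t value for Weight in the refitted model
f_model  <- 19.97   # F-statistic from the same summary
df_resid <- 3130    # residual degrees of freedom

# The squared t value reproduces the F-statistic up to rounding.
t_weight^2

# Two-sided t-test p-value and F-test p-value are the same quantity.
p_t <- 2 * pt(abs(t_weight), df = df_resid, lower.tail = FALSE)
p_f <- pf(t_weight^2, df1 = 1, df2 = df_resid, lower.tail = FALSE)
c(p_t = p_t, p_f = p_f)
```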

Comment on variation explained by the model:

The proportion of variation in Cholesterol explained by Weight (i.e., R-squared) is 0.63%, which is very low.

Problems with the diagnostic plot

In the Q-Q plot, the points curve away from the normal line at the upper end. A number of data points also fall above 1.5 on the sqrt(|standardized residuals|) scale, which suggests that the normality assumption is not reasonable. The residuals vs. fitted plot shows a slight trend, with the points converging towards the right, which indicates there might be unequal variance, i.e., heteroscedasticity.
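
The plots referred to above are the standard `plot()` panels for an `lm` object. The sketch below shows how they could be produced and how the "above 1.5" count could be computed; it runs on a simulated stand-in for `heart`, so the resulting pictures and counts are illustrative only.

```r
# Simulated stand-in for `heart` (hypothetical values).
set.seed(3)
n <- 200
heart <- data.frame(Weight = round(rnorm(n, mean = 150, sd = 25)))
heart$Cholesterol <- round(205 + 0.1 * heart$Weight + rnorm(n, mean = 0, sd = 40))

fit <- lm(Cholesterol ~ Weight, data = heart)

# 1: residuals vs fitted, 2: normal Q-Q, 3: scale-location, 4: Cook's distance.
par(mfrow = c(2, 2))
plot(fit, which = 1:4)
par(mfrow = c(1, 1))

# Number of points above 1.5 on the scale-location axis.
n_large <- sum(sqrt(abs(rstandard(fit))) > 1.5)
n_large
```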

Relationship between Cholesterol and Weight

The equation of the regression line is Cholesterol = 203.576 + 0.123(Weight). This means that, on average, Cholesterol is predicted to increase by 0.123 units for each 1-unit increase in Weight (and to decrease by the same amount for each 1-unit decrease). The R-squared value is very low, which indicates that the model has low predictive power for cholesterol levels. It is therefore not a good fit, and we would not recommend it.
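
As a small worked example of the slope interpretation, the sketch below evaluates the fitted line at two weights one unit apart, using the coefficients from the refitted summary. The helper `predict_chol` and the weights 150 and 151 are hypothetical choices for illustration.

```r
# Coefficients copied from the refitted model summary.
b0 <- 203.57605
b1 <- 0.12264

# Hypothetical helper: predicted Cholesterol for a given Weight.
predict_chol <- function(weight) b0 + b1 * weight

pred_150 <- predict_chol(150)
pred_151 <- predict_chol(151)

# The difference between the two predictions is the slope.
pred_151 - pred_150
```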

Exercise 2

Cholesterol ~ Weight + Diastolic + Systolic

## 
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16

From the diagnostic plots, we can see that a few data points have a Cook's distance greater than 0.015. The data also do not appear normally distributed: the points drift away from the line on the right side of the Q-Q plot, and a few data points have a square root of standardized residuals greater than 1.5. Hence the normality assumption might not be reasonable. The Cook's distance of two points (observations 23 and 210) is greater than the cutoff of 0.015, so they need to be removed.

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

Refitting the model after removing the influential points

## 
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart[-inf.id2, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Check for multicollinearity issue

##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207

From the above result we can see that all the variables have a VIF less than 10, so no variable needs to be removed because of multicollinearity.
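
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it in base R on a simulated stand-in for `heart`. The function that produced the output above is not shown (it resembles `car::vif`, but that is an assumption), and the simulated values will not reproduce those numbers.

```r
# Simulated stand-in for `heart` with correlated blood pressure columns.
set.seed(4)
n <- 300
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, mean = 0, sd = 6)
Weight    <- rnorm(n, mean = 150, sd = 25)
heart <- data.frame(Weight, Diastolic, Systolic)
heart$Cholesterol <- 155 + 0.25 * Diastolic + 0.3 * Systolic + rnorm(n, mean = 0, sd = 40)

predictors <- c("Weight", "Diastolic", "Systolic")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = heart))$r.squared
  1 / (1 - r2)
})
vif_manual
```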

Significance of parameters

The model is significant and useful (F-test), as the p-value (< 2.2e-16) is very small. From the t-tests of the individual parameters, Weight (p-value = 0.1994) is not significant. Both Systolic and Diastolic pressure, with p-values less than 0.05, are significant parameters.

Relationship between cholesterol and statistically significant parameters

The fitted model is Cholesterol = 156.32618 + 0.03671 x Weight + 0.24922 x Diastolic + 0.30073 x Systolic. Cholesterol is predicted to increase by 0.24922 on average for each one-unit increase in Diastolic when Systolic and Weight are held fixed. Similarly, Cholesterol is predicted to increase by 0.30073 on average for each one-unit increase in Systolic when Diastolic and Weight are held fixed.

The R-squared value is 3.77%, which is quite low. Hence this model is also not a good model for the prediction of cholesterol levels.

Exercise 3

## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------

Based on the result of the step-wise selection process, Systolic and Diastolic will be included in the final model.
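
The summary table above looks like output from a p-value-based stepwise routine (possibly from the olsrr package, which is an assumption). As a rough illustration of the same idea, the sketch below swaps in base R's `step()`, which selects by AIC rather than by p-values, and runs it on a simulated stand-in for `heart`. It may not select the same variables as the procedure used above.

```r
# Simulated stand-in for `heart` in which Weight has no effect on Cholesterol.
set.seed(5)
n <- 1000
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, mean = 0, sd = 6)
Weight    <- rnorm(n, mean = 150, sd = 25)
heart <- data.frame(Weight, Diastolic, Systolic)
heart$Cholesterol <- 155 + 0.25 * Diastolic + 0.3 * Systolic + rnorm(n, mean = 0, sd = 15)

# Start from the intercept-only model and allow terms to enter or leave.
null_model <- lm(Cholesterol ~ 1, data = heart)
step_fit <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ Weight + Diastolic + Systolic),
                 direction = "both",
                 trace = 0)

# Formula of the selected model.
formula(step_fit)
selected <- attr(terms(step_fit), "term.labels")
selected
```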

Model without Weight

Comment on the Normality check and Equal Variance

Normality Check: In the Normal Q-Q plot, most of the points fall along the line for the majority of the graph. However, the points at the extremes of the graph curve away from the line, which shows that an assumption of normality is not reasonable. The sqrt(|Standardized Residuals|) plot supports this: a considerable number of observations fall above 1.5 on the Y-axis.

Equal Variance Check: The Standardized Residuals plot shows a pattern in the residuals, which suggests heteroscedasticity.

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-inf.id2, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

Comment on the Final Model

Final Model: Using automatic selection, specifically step-wise selection, we were able to conclude (based on the “Added/Removed” column) that only Systolic and Diastolic should be included in the final model (Cholesterol ~ Systolic + Diastolic).

Model Significance: The model p-value (< 2.2e-16) is below the significance level. Therefore, we can reject the null hypothesis and conclude that the multiple linear regression model is useful to explain the behavior of Cholesterol.

Individual Term Significance: The t-tests on the terms Diastolic and Systolic give the following results:

Diastolic: P-value of 0.00803 is below the significance level of 0.05 hence, we can reject the null hypothesis and conclude that there is a significant linear relationship between Diastolic and the behavior of Cholesterol.

Systolic: P-value of 1.95e-06 is below the significance level of 0.05 hence, we can reject the null hypothesis and conclude that there is a significant linear relationship between Systolic and the behavior of Cholesterol.

Estimated Regression Line (ŷ)

From the table above, the estimated regression line is Cholesterol = 159.3317 + 0.3022(Systolic) + 0.2770(Diastolic). In other words, for each one-unit increase in Systolic, Cholesterol is predicted to increase by 0.3022 units on average while Diastolic stays constant. For each one-unit increase in Diastolic, Cholesterol is predicted to increase by 0.2770 units on average while Systolic stays constant.

R-squared: In spite of the significant linear relationship, only 3.716% of the variation in Cholesterol can be explained by Systolic and Diastolic together. This can be seen from the table above (R-squared = 0.03716). Hence, the medical director should not use this model, as it has low predictive power.

Variation Explained by the Model Comparison

Exercise 1: Only 0.6339% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).

Exercise 2: Only 3.767% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).

Exercise 3: Only 3.716% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). This value is comparable to that of Exercise 2 but greater than the R-squared of Exercise 1.

Exercise 4

##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Best Model based on adjusted-R square criteria and Selected Predictors

Best Model: Based on the adjusted R-square criteria above, the best model is Model 3 as it has the highest adjusted R-square of 0.0367.

Selected Predictors: Model 3 includes Weight, Diastolic, and Systolic predictors.

Best Model based on AIC criteria and Selected Predictors

Best Model: Based on the AIC criteria, the best model is Model 2 as it has the lowest AIC of 32344.7321.

Selected Predictors: Model 2 includes Diastolic and Systolic predictors.
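
The two criteria can be compared by fitting every subset of the three predictors and tabulating adjusted R-squared and AIC. The sketch below does this in base R on a simulated stand-in for `heart`. It lists all seven subsets rather than only the best model of each size, and its numbers will not match the summary table above.

```r
# Simulated stand-in for `heart` (hypothetical values).
set.seed(6)
n <- 500
Systolic  <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, mean = 0, sd = 6)
Weight    <- rnorm(n, mean = 150, sd = 25)
heart <- data.frame(Weight, Diastolic, Systolic)
heart$Cholesterol <- 155 + 0.25 * Diastolic + 0.3 * Systolic + rnorm(n, mean = 0, sd = 40)

predictors <- c("Weight", "Diastolic", "Systolic")

# All non-empty subsets of the predictors (2^3 - 1 = 7 of them).
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and record adjusted R-squared and AIC.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Cholesterol"), data = heart)
  data.frame(predictors = paste(vars, collapse = " "),
             adj_r2 = summary(fit)$adj.r.squared,
             aic = AIC(fit))
}))
results

# Best model under each criterion: highest adjusted R-squared, lowest AIC.
best_adj_r2 <- results[which.max(results$adj_r2), ]
best_aic    <- results[which.min(results$aic), ]
best_adj_r2
best_aic
```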

Comparing the final models from the best subset approach with each other and with the final model from step-wise selection

The final models chosen by the adjusted R-squared criterion and by the AIC criterion are different, as they have different numbers of predictor variables. To choose the best model, we usually prefer the simpler model with fewer predictors. Hence the subset model selected by the AIC criterion, with the predictors Diastolic and Systolic, would be the better one.

The final model from the best subset approach (AIC criterion) and the final model from step-wise selection are the same and contain the predictors Diastolic and Systolic. The model is significant (F-statistic p-value < 2.2e-16) and the individual predictors are significant as well.