Exercise 1
## 'data.frame': 3134 obs. of 4 variables:
## $ Weight : int 132 158 156 131 136 194 179 151 174 155 ...
## $ Diastolic : int 90 80 76 92 80 68 76 68 90 90 ...
## $ Systolic : int 170 128 110 176 112 132 128 108 142 130 ...
## $ Cholesterol: int 250 242 281 196 196 211 225 221 188 292 ...
Fitting the linear model for Cholesterol as a function of Weight
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart[-inf.id, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.369 -29.395 -4.482 23.672 209.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.57605 4.18543 48.639 < 2e-16 ***
## Weight 0.12264 0.02745 4.469 8.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared: 0.006339, Adjusted R-squared: 0.006022
## F-statistic: 19.97 on 1 and 3130 DF, p-value: 8.155e-06
After removing the influential points, the following graph shows the regression line fitted with and without them.
Comment on Model Significance:
**H_0**: All slope coefficients (b's) are zero (the model is not useful)
**H_a**: At least one b is nonzero (the model is useful)
The p-value from the F-test is 8.155e-06, which is far below 0.05. Hence we have enough evidence to reject the null hypothesis and conclude that the model is significant.
Comment on individual parameter’s significance:
As our model has only one predictor variable, Weight, the test of model significance and the test of the individual parameter are equivalent.
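As a quick arithmetic check of this equivalence, the overall F-statistic can be rebuilt from the reported R-squared and compared with the squared t-value for Weight. The sketch below uses only numbers copied from the summary of the refitted model; the variable names are illustrative.

```r
# Values copied from the summary of the refitted model (influential points removed)
r2     <- 0.006339   # Multiple R-squared
k      <- 1          # number of predictors
df_res <- 3130       # residual degrees of freedom
t_w    <- 4.469      # t value for Weight

# Overall F-statistic rebuilt from R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_res)

# Upper-tail p-value of the F-test
p_f <- pf(f_stat, k, df_res, lower.tail = FALSE)

# With a single predictor, the squared t-statistic matches the F-statistic
c(F_from_R2 = f_stat, t_squared = t_w^2, p_value = p_f)
```

Both routes land on an F of about 19.97, up to rounding of the printed inputs.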
Comment on variation explained by the model:
The proportion of variation in Cholesterol explained by Weight (i.e., R-squared) is 0.63%, which is very low.
Problems with the diagnostic plot
From the Q-Q plot, we can observe that the points curve slightly away from the normal line at the upper end. In addition, a number of data points fall above 1.5 on the sqrt(|standardized residuals|) plot, which suggests that normality is not reasonable. There is also a slight trend in the residuals vs. fitted plot, with the points converging towards the right, which indicates that there might be unequal variance, i.e., heteroscedasticity.
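One way to put the 1.5 threshold in context: a point lies above 1.5 on the sqrt(|standardized residuals|) axis exactly when its absolute standardized residual exceeds 2.25. The sketch below assumes that standardized residuals behave roughly like standard normal draws when the model assumptions hold, and estimates how many such points would be expected anyway among the 3132 observations.

```r
n <- 3132                      # 3130 residual df + 2 estimated coefficients
cutoff_sqrt <- 1.5             # threshold on the sqrt(|standardized residual|) axis
cutoff_abs  <- cutoff_sqrt^2   # same threshold on the |standardized residual| scale

# Two-sided standard normal tail probability beyond the threshold
p_above <- 2 * pnorm(-cutoff_abs)

# Expected number of points above the threshold under normality
expected_n <- n * p_above

c(threshold = cutoff_abs, share = p_above, expected = expected_n)
```

Under this approximation a few dozen points above 1.5 would not be unusual on their own, so the Q-Q plot remains the more direct evidence about normality.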
Relationship between Cholesterol and Weight
The equation of the regression line is Cholesterol = 203.576 + 0.123(Weight). This means that, on average, Cholesterol is predicted to increase by 0.123 units for each 1-unit increase in Weight. Although the slope is statistically significant, the R-squared value is very low, so the model has little power to predict cholesterol levels, and we would not recommend it.
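To make the interpretation of the slope concrete, the fitted line can be evaluated directly. This is a minimal sketch using the coefficients from the refitted summary; the helper `predict_chol` and the weights 150 and 160 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the refitted model summary
b0 <- 203.57605
b1 <- 0.12264

# Hypothetical helper: predicted mean Cholesterol for a given Weight
predict_chol <- function(weight) b0 + b1 * weight

# Predicted mean Cholesterol at a hypothetical Weight of 150
predict_chol(150)

# Change in the prediction for a 10-unit difference in Weight
predict_chol(160) - predict_chol(150)
```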
Exercise 2
Cholesterol ~ Weight + Diastolic + Systolic
##
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Weight 0.02146 0.02903 0.739 0.4597
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
From the diagnostic plots, we can see that a few data points have a Cook’s distance greater than 0.015. The residuals also do not appear to be normally distributed: the Q-Q plot departs from the line towards the right side, and a few data points have a square root of the absolute standardized residual greater than 1.5. Hence the normality assumption might not be reasonable. The Cook’s distance of two points (observations 23 and 210) is greater than the cutoff of 0.015, so they are removed.
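The removal step described above can be sketched as follows. The heart data are not bundled with R, so the demonstration runs on the built-in `cars` data instead; the helpers `influential_ids` and `refit_without` are hypothetical names, and the 4 / n cutoff in the demonstration is a common rule of thumb rather than the 0.015 used in this report.

```r
# Hypothetical helper: positions of observations whose Cook's distance exceeds a cutoff
influential_ids <- function(fit, cutoff) {
  unname(which(cooks.distance(fit) > cutoff))
}

# Hypothetical helper: refit the same formula without those observations
refit_without <- function(fit, data, ids) {
  # data[-integer(0), ] would drop every row, so guard the empty case
  kept <- if (length(ids) > 0) data[-ids, , drop = FALSE] else data
  lm(formula(fit), data = kept)
}

# Demonstration on built-in data (cars), not the heart data
fit_demo  <- lm(dist ~ speed, data = cars)
ids_demo  <- influential_ids(fit_demo, cutoff = 4 / nrow(cars))
fit_clean <- refit_without(fit_demo, cars, ids_demo)

ids_demo          # flagged observations
nobs(fit_clean)   # observations left after removal
```

For the heart data, the analogous call would use `cutoff = 0.015` on the fitted three-predictor model, with the result playing the role of `inf.id2`.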
## Weight Diastolic Systolic Cholesterol
## 23 90 82 130 550
## 210 100 82 130 500
Refitting the model after removing the influential points
##
## Call:
## lm(formula = Cholesterol ~ Weight + Diastolic + Systolic, data = heart[-inf.id2,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Weight 0.03671 0.02860 1.284 0.1994
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
Check for multicollinearity issue
## Weight Diastolic Systolic
## 1.120631 2.558914 2.454207
From the above result, we can see that all the variables have a VIF below 10; hence we need not remove any variable because of multicollinearity.
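For reference, the VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the remaining predictors. The sketch below computes it by hand with a hypothetical helper `vif_manual`, demonstrated on the built-in `mtcars` data since the heart data are not available here, and then reads the reported values backwards.

```r
# Hypothetical helper: VIF of each column in a data frame of predictors,
# using VIF_j = 1 / (1 - R_j^2) from regressing predictor j on the others
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Demonstration on built-in data (mtcars), not the heart data
vif_manual(mtcars[, c("wt", "hp", "disp")])

# Reading the reported VIFs backwards: share of each predictor's
# variance explained by the other two predictors
1 - 1 / c(Weight = 1.120631, Diastolic = 2.558914, Systolic = 2.454207)
```

The second expression shows that roughly 60% of the variance in each blood pressure measure is explained by the other predictors, which is noticeable but well short of the level that a VIF of 10 would imply (90%).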
Significance of parameters
The model is significant and useful (F-test), as the p-value (< 2.2e-16) is very small. From the t-tests of the individual parameters, Weight (p-value = 0.1994) is not significant. Both Systolic and Diastolic pressure, with p-values less than 0.05, are significant parameters.
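The t-values and p-values quoted here can be reproduced from the estimates and standard errors in the refitted summary. This is a small sketch using only those printed numbers, so the results agree with the summary up to rounding.

```r
# Estimates and standard errors copied from the refitted model summary
est <- c(Weight = 0.03671, Diastolic = 0.24922, Systolic = 0.30073)
se  <- c(Weight = 0.02860, Diastolic = 0.10665, Systolic = 0.06340)
df_res <- 3128   # residual degrees of freedom

# t-statistic for each coefficient
t_val <- est / se

# Two-sided p-values from the t distribution
p_val <- 2 * pt(-abs(t_val), df_res)

cbind(t_val, p_val)
```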
Relationship between cholesterol and statistically significant parameters
The fitted equation is Cholesterol = 156.32618 + 0.03671 x Weight + 0.24922 x Diastolic + 0.30073 x Systolic, in which only Diastolic and Systolic are statistically significant. Cholesterol is predicted to increase by 0.24922 units on average for each 1-unit increase in Diastolic when Weight and Systolic are held fixed. Similarly, Cholesterol is predicted to increase by 0.30073 units on average for each 1-unit increase in Systolic when Weight and Diastolic are held fixed.
The R-squared value is 3.767%, which is quite low; hence this model is also not a good model for predicting cholesterol levels.
Exercise 3
##
## Stepwise Selection Summary
## ----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Systolic addition 0.035 0.035 8.6850 32349.7666 42.3013
## 2 Diastolic addition 0.037 0.037 3.6480 32344.7321 42.2606
## ----------------------------------------------------------------------------------------
Based on the result of the step-wise selection process, Systolic and Diastolic will be included in the final model.
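A comparable search can be sketched with base R's `step()`. Note that this swaps in a different technique: `step()` adds and drops terms by AIC, which may differ from the rule behind the summary table above. The heart data are not available here, so the demonstration uses the built-in `mtcars` data with an arbitrary set of candidate predictors.

```r
# Demonstration on built-in data (mtcars), not the heart data
null_fit <- lm(mpg ~ 1, data = mtcars)

# Stepwise search in both directions, scored by AIC
step_fit <- step(
  null_fit,
  scope = list(lower = ~ 1, upper = ~ wt + hp + disp),
  direction = "both",
  trace = 0
)

# Predictors retained by the search
attr(terms(step_fit), "term.labels")
```

For the heart data, the upper scope would be `~ Weight + Diastolic + Systolic`, fitted on the data with the influential points removed.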
Model without Weight
Comment on the Normality check and Equal Variance
Normality Check: Based on the Normal Q-Q Plot, most of the points fall along the line. However, the points at the extremes of the graph curve away from the line, which shows that an assumption of normality is not reasonable. The sqrt(Standardized Residuals) Plot supports this: a considerable number of observations fall above 1.5 on the Y-axis.
Equal Variance Check: Looking at the Standardized Residuals Plot, we see a pattern rather than a random scatter. This suggests unequal variance (heteroscedasticity).
##
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-inf.id2,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
Comment on the Final Model
Final Model: Using automatic selection, specifically step-wise selection, we were able to conclude (based on the “Added/Removed” column) that only Systolic and Diastolic should be included in the final model (Cholesterol ~ Systolic + Diastolic).
Model Significance: The model p-value (< 2.2e-16) is below the significance level of 0.05; therefore, we can reject the null hypothesis and conclude that the multiple linear regression model is useful for explaining the behavior of Cholesterol.
Individual Term Significance: The T-test on the terms Diastolic and Systolic signifies the following results:
Diastolic: P-value of 0.00803 is below the significance level of 0.05 hence, we can reject the null hypothesis and conclude that there is a significant linear relationship between Diastolic and the behavior of Cholesterol.
Systolic: P-value of 1.95e-06 is below the significance level of 0.05 hence, we can reject the null hypothesis and conclude that there is a significant linear relationship between Systolic and the behavior of Cholesterol.
Estimated Regression Line (ŷ)
From the table above, the estimated regression line is Cholesterol = 159.3317 + 0.3022(Systolic) + 0.2770(Diastolic). In other words, for a one-unit increase in Systolic, Cholesterol is predicted to increase by 0.3022 units on average while Diastolic stays constant. Likewise, for a one-unit increase in Diastolic, Cholesterol is predicted to increase by 0.2770 units on average while Systolic stays constant.
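To see how little dropping Weight changes the fitted values, the two refitted equations can be evaluated side by side. This is an illustrative sketch: the coefficients are copied from the two summaries, while the input values (Weight 150, Diastolic 80, Systolic 120) describe a hypothetical patient and are not taken from the data.

```r
# Coefficients copied from the two refitted model summaries
full    <- c(Intercept = 156.32618, Weight = 0.03671,
             Diastolic = 0.24922, Systolic = 0.30073)
reduced <- c(Intercept = 159.3317, Systolic = 0.3022, Diastolic = 0.2770)

# Hypothetical patient: values chosen only for illustration
x <- c(Weight = 150, Diastolic = 80, Systolic = 120)

# Prediction from Cholesterol ~ Weight + Diastolic + Systolic
pred_full <- unname(full["Intercept"] + sum(full[names(x)] * x))

# Prediction from Cholesterol ~ Systolic + Diastolic
keep <- c("Systolic", "Diastolic")
pred_reduced <- unname(reduced["Intercept"] + sum(reduced[keep] * x[keep]))

c(full = pred_full, reduced = pred_reduced, difference = pred_full - pred_reduced)
```

For these inputs the two predictions differ by about 0.1 units, which is negligible next to the residual standard error of 42.26.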
R-squared: Despite the statistically significant linear relationship, only 3.716% of the variation in Cholesterol can be explained by Systolic and Diastolic together (R-squared = 0.03716 in the table above). Hence, the medical director should not use this model, as it has low predictive power.
Variation Explained by the Model Comparison
Exercise 1: Only 0.6339% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).
Exercise 2: Only 3.767% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power).
Exercise 3: Only 3.716% of the variation in Cholesterol can be explained by the model. Therefore, this is not a good model for the prediction of Cholesterol level (low predictive power). This value is comparable to that of Exercise 2 but greater than the R-squared of Exercise 1.
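Because the three models have different numbers of predictors, adjusted R-squared gives a fairer comparison than R-squared alone. The sketch below recomputes it from the R-squared values quoted above; the helper `adj_r2` is a hypothetical name, and n = 3132 assumes the same two influential points were removed in every refit.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors k
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

n <- 3132   # 3134 observations minus the two removed points

adjusted <- c(
  exercise1 = adj_r2(0.006339, n, 1),
  exercise2 = adj_r2(0.03767,  n, 3),
  exercise3 = adj_r2(0.03716,  n, 2)
)
round(adjusted, 5)
```

These agree with the adjusted values printed in the three summaries (0.006022, 0.03675, and 0.03655) up to rounding, and the gap between Exercises 2 and 3 stays at about 0.0002.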
Exercise 4
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Weight Diastolic Systolic
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662
## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647
## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Best Model based on adjusted-R square criteria and Selected Predictors
Best Model: Based on the adjusted R-square criteria above, the best model is Model 3 as it has the highest adjusted R-square of 0.0367.
Selected Predictors: Model 3 includes Weight, Diastolic, and Systolic predictors.
Best Model based on AIC criteria and Selected Predictors
Best Model: Based on the AIC criteria, the best model is Model 2 as it has the lowest AIC of 32344.7321.
Selected Predictors: Model 2 includes Diastolic and Systolic predictors.
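The same selection can be read off the summary table programmatically. The sketch below copies four of the criteria from the Subsets Regression Summary into a data frame (the column names are illustrative) and reports which model each criterion prefers.

```r
# Criteria copied from the Subsets Regression Summary table
subsets <- data.frame(
  model   = 1:3,
  adj_r2  = c(0.0347, 0.0365, 0.0367),
  pred_r2 = c(0.0337, 0.0352, 0.0351),
  aic     = c(32349.7666, 32344.7321, 32345.0829),
  sbc     = c(32367.9149, 32368.9298, 32375.3300)
)

# Larger is better for the R-squared measures; smaller is better for AIC and SBC
best <- c(
  adj_r2  = subsets$model[which.max(subsets$adj_r2)],
  pred_r2 = subsets$model[which.max(subsets$pred_r2)],
  aic     = subsets$model[which.min(subsets$aic)],
  sbc     = subsets$model[which.min(subsets$sbc)]
)
best
```

Adjusted R-squared prefers Model 3 by a margin of only 0.0002 over Model 2, while SBC, which penalizes extra terms more heavily than AIC, prefers Model 1.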
Comparing the two best-subset models with each other and with the final model from step-wise selection
The final models under the adjusted R-squared and AIC criteria differ, as they have different numbers of predictor variables. When choosing between them, we usually prefer the simpler model with fewer predictors; hence the model selected by the AIC criterion, with predictors Diastolic and Systolic, would be the better one.
The final model from the best subset approach (AIC criterion) and the final model from step-wise selection are the same and contain the predictors Diastolic and Systolic. The model is significant (F-statistic p-value < 2.2e-16), and the individual predictors are significant as well.