Exercise 1:

Ex 1 Part A

Linear Regression Model: Cholesterol ~ Weight

lm.heartBP = lm(Cholesterol ~ Weight, heartBP)

Scatter Plot

plot(heartBP$Weight, heartBP$Cholesterol, 
     xlab = "Weight",
     ylab = "Cholesterol")
abline(lm.heartBP, col = "red")

Correlation

cor(heartBP$Weight, heartBP$Cholesterol, method = "spearman")
## [1] 0.1078544

Conclusion

The Spearman correlation (about 0.108) indicates a weak positive monotonic association between Weight and Cholesterol.
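The strength of the association and its statistical significance are separate questions, and `cor.test()` reports both. The following is a minimal sketch on synthetic data; the `weight` and `chol` vectors are made up for illustration and are not the heartBP data.

```r
# Synthetic stand-in data (assumption: NOT the heartBP dataset)
set.seed(1)
weight <- rnorm(500, mean = 150, sd = 25)
chol   <- 200 + 0.1 * weight + rnorm(500, sd = 40)

# Spearman rank correlation; exact = FALSE requests the asymptotic
# approximation for the p-value instead of the exact distribution
res <- cor.test(weight, chol, method = "spearman", exact = FALSE)
res$estimate  # rho: strength of the monotonic association
res$p.value   # evidence against rho = 0
```

In large samples a small rho can still come with a small p-value, which means the association is detectable but weak.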

Model Diagnostics

par(mfrow = c(2,2))
plot(lm.heartBP, which = c(1:4))

Conclusion

Normal Q-Q plot: the normality assumption is not reasonable.

Standardized residuals: these also show that the normality assumption is not reasonable.

Equal variance: the square root of the standardized residuals (Scale-Location plot) shows a pattern, which supports heteroscedasticity.
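The visual checks can be backed up with formal tests. The sketch below uses base R only and synthetic data (the variables `x` and `y` are assumptions, not heartBP): `shapiro.test()` for normality of the residuals, and a hand-computed studentized Breusch-Pagan statistic for equal variance.

```r
# Synthetic stand-in data with non-constant variance (assumption: not heartBP)
set.seed(2)
x <- runif(400, 100, 250)
y <- 200 + 0.1 * x + rnorm(400, sd = 0.2 * x)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk (accepts 3 <= n <= 5000)
sw <- shapiro.test(residuals(fit))

# Equal variance: studentized (Koenker) Breusch-Pagan statistic by hand.
# Regress squared residuals on the predictor; n * R^2 is approximately
# chi-square with 1 degree of freedom under homoscedasticity.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Small p-values point the same way as the plots: a small Shapiro-Wilk p-value argues against normality, and a small Breusch-Pagan p-value argues against constant variance.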

Ex 1 Part B

summary(lm.heartBP)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heartBP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

Conclusion

Model significance: the F-test p-value (9.778e-05) is below the 0.05 significance level, so we reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).

Individual term significance: the t-test p-value for Weight (9.78e-05) is also below 0.05, so we again reject the null hypothesis and conclude that there is a linear relationship between Weight and Cholesterol.

Estimated regression line: Cholesterol is expected to increase by about 0.109 for each one-unit increase in Weight.

R-squared: only about 0.5% of the variation in Cholesterol is explained by Weight. The doctor should not use this model because it has low predictive power.
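The numbers in the summary are tied together by simple identities, which gives a quick sanity check. The sketch below recomputes the t-statistic, the F-statistic, and R-squared from values copied out of the output above; the 95% confidence interval for the slope is an added illustration that does not appear in the original output.

```r
# Values copied from the summary(lm.heartBP) output above
slope <- 0.10867
se    <- 0.02786
df2   <- 3132      # residual degrees of freedom
f     <- 15.22     # overall F-statistic (1 numerator df)

t_value <- slope / se                 # t-statistic for Weight
f_check <- t_value^2                  # with one predictor, F = t^2
r2      <- f / (f + df2)              # R^2 recovered from F and its df
ci      <- slope + c(-1, 1) * qt(0.975, df2) * se  # 95% CI for the slope

round(c(t = t_value, F = f_check, R2 = r2), 4)
round(ci, 3)
```

The interval excludes 0, in line with the significant t-test, while the tiny R-squared shows how little of the variation in Cholesterol that slope accounts for.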

Exercise 2:

Ex 2 Part A

Scatter Plot

pairs(heartBP)

lm.heartBP_2 <- lm(Cholesterol~., data = heartBP)

Model Diagnostics

par(mfrow = c(2,2))
plot(lm.heartBP_2, which = c(1:4))

Conclusion

Normal Q-Q plot: the normality assumption is not reasonable.

Standardized residuals: these also show that the normality assumption is not reasonable.

Equal variance: the square root of the standardized residuals shows a pattern, so it is safe to assume heteroscedasticity.

Cook’s distance: two points appear unduly influential. We need to confirm that their Cook’s distance is greater than 0.015.

ipoint <- which(cooks.distance(lm.heartBP_2) > 0.015)
heartBP[ipoint, ]
##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

Conclusion

Both points have a Cook’s distance greater than 0.015, so we refit the model without them.

lm.heartBP_2 <- lm(Cholesterol~., data = heartBP[-ipoint, ])
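The flag-and-refit pattern can be sketched on synthetic data as follows. The data frame `d`, the planted outliers, and the 4/n cutoff (a common rule of thumb, used here in place of the 0.015 threshold above) are all assumptions for illustration. The sketch also guards against one pitfall of negative indexing: if no point is flagged, `d[-integer(0), ]` drops every row rather than none.

```r
# Synthetic stand-in data with two planted outliers (assumption: not heartBP)
set.seed(3)
d <- data.frame(x = rnorm(200, 150, 25))
d$y <- 200 + 0.1 * d$x + rnorm(200, sd = 40)
d$y[c(5, 50)] <- c(550, 500)          # extreme responses

fit  <- lm(y ~ x, data = d)
cd   <- cooks.distance(fit)
flag <- which(cd > 4 / nrow(d))       # 4/n is a common rule-of-thumb cutoff

# Guard against an empty index: d[-integer(0), ] would drop every row
d_clean <- if (length(flag) > 0) d[-flag, ] else d
refit   <- lm(y ~ x, data = d_clean)

rbind(original = coef(fit), refit = coef(refit))
```

Comparing the two coefficient rows shows how much the flagged points were pulling the fit.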

Ex 2 Part B

summary(lm.heartBP_2)
## 
## Call:
## lm(formula = Cholesterol ~ ., data = heartBP[-ipoint, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Conclusion

Model significance: the p-value of the F-statistic (< 2.2e-16) is below the 0.05 significance level, so we reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).

Individual term significance (t-tests for Weight, Diastolic, and Systolic):

Weight: the p-value (0.1994) is greater than 0.05, so we do not reject the null hypothesis. There is no evidence of a linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model.

Diastolic: the p-value (0.0195) is less than 0.05, so we reject the null hypothesis. There is a linear relationship between Diastolic and Cholesterol.

Systolic: the p-value (2.2e-06) is also less than 0.05 and is the smallest of the three, so we reject the null hypothesis. There is a linear relationship between Systolic and Cholesterol.

Estimated regression line: for a one-unit increase in Diastolic, Cholesterol increases by about 0.249 (with the other predictors held constant); for a one-unit increase in Systolic, Cholesterol increases by about 0.301 (with the other predictors held constant).

R-squared: only about 3.767% of the variation in Cholesterol is explained by Weight, Diastolic, and Systolic together. The doctor should not use this model because it has low predictive power.

Multicollinearity with VIF

VIF(lm.heartBP_2)
##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207

Conclusion

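None of the three predictors exceeds the VIF cutoff of 10, so there is no evidence of severe multicollinearity. Diastolic and Systolic (VIF about 2.5 each) are moderately correlated with the other predictors, but not enough to be a concern.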
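A VIF can be translated back into the R-squared obtained when one predictor is regressed on the others, which makes the cutoff easier to interpret. The short sketch below uses the standard identity VIF_j = 1 / (1 - R_j^2) with the VIF values copied from the output above; the variable names are illustrative.

```r
# VIF values copied from the output above
vif_vals <- c(Weight = 1.120631, Diastolic = 2.558914, Systolic = 2.454207)

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j, where R_j^2 comes from
# regressing predictor j on the other predictors
r2_j <- 1 - 1 / vif_vals
round(r2_j, 3)
```

By this identity, a VIF of 10 corresponds to R_j^2 = 0.9, so all three predictors sit well below the cutoff.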

Exercise 3:

Ex 3 Part A

model.stepwise = ols_step_both_p(lm.heartBP_2, pent = 0.05, details = FALSE)
model.stepwise
## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------
plot(model.stepwise)
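The entry step of a p-value-based stepwise search can be approximated with base R's `add1()`. This is a sketch on synthetic data (`d`, `x1`, `x2`, `x3` are assumptions, not heartBP) and shows only the first addition; it is not a reimplementation of `ols_step_both_p()`, which also tests variables for removal.

```r
# Synthetic stand-in data (assumption: not heartBP); x3 is pure noise
set.seed(4)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.4 * d$x2 + rnorm(n)

# Start from the intercept-only model and F-test each candidate term
null_fit <- lm(y ~ 1, data = d)
step1    <- add1(null_fit, scope = ~ x1 + x2 + x3, test = "F")
step1

# Pick the candidate with the smallest p-value (the first row is "<none>")
p_vals <- step1[["Pr(>F)"]][-1]
best   <- rownames(step1)[-1][which.min(p_vals)]

# Enter it only if its p-value is below the 0.05 entry threshold
fit1 <- if (min(p_vals) < 0.05) {
  update(null_fit, as.formula(paste(". ~ . +", best)))
} else {
  null_fit
}
formula(fit1)
```

Repeating the same test from `fit1` with the remaining candidates gives the second step, and the search stops when no candidate meets the threshold.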

Ex 3 Part B

lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heartBP)
summary(lm.step)
## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heartBP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.52  -29.58   -4.57   23.79  328.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

Conclusion

Model significance: the model p-value (< 2.2e-16) is less than 0.05, so we reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).

Individual term significance:

Systolic: the p-value (2.89e-06) is less than 0.05, so we reject the null hypothesis. There is a linear relationship between Systolic and Cholesterol.

Diastolic: the p-value (0.00932) is less than 0.05, so we reject the null hypothesis. There is a linear relationship between Diastolic and Cholesterol.

Estimated regression line: a one-unit increase in Systolic is associated with an increase of about 0.302 in Cholesterol (Diastolic held constant); a one-unit increase in Diastolic is associated with an increase of about 0.276 in Cholesterol (Systolic held constant).

R-squared: only about 3.589% of the variation in Cholesterol is explained by Systolic and Diastolic together. The doctor should not use this model because it has low predictive power.

Variation explained by each model:

Ex. 1: 0.4835% of the variation in Cholesterol is explained by Weight. Not a good model.

Ex. 2: 3.767% of the variation in Cholesterol is explained by Weight, Diastolic, and Systolic. Not a good model.

Ex. 3: 3.589% of the variation in Cholesterol is explained by Systolic and Diastolic. Not a good model.
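Because the three models use different numbers of predictors, adjusted R-squared is the fairer basis for comparison. The sketch below recomputes it from the R-squared values and degrees of freedom reported in the three summaries above; the helper name `adj_r2` is illustrative.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R^2, sample size, and predictor count read off the three summaries above
# (n = residual df + p + 1)
models <- data.frame(
  model = c("Ex 1", "Ex 2", "Ex 3"),
  r2    = c(0.004835, 0.03767, 0.03589),
  n     = c(3134, 3132, 3134),
  p     = c(1, 3, 2)
)
models$adj_r2 <- with(models, adj_r2(r2, n, p))
models
```

The recomputed values agree with the adjusted R-squared figures printed by `summary()` up to rounding. Note that the Ex. 2 model was fitted on two fewer observations than the other two.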

Exercise 4:

Ex 4 Part A

model.best_subset <- ols_step_best_subset(lm.heartBP_2)
model.best_subset
##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Conclusion

Best model: Model 3, which has the highest adjusted R-squared (0.0367).

Selected predictors: Weight, Diastolic, and Systolic.

Ex 4 Part B

Conclusion

Best model: Model 2, which has the smallest AIC value (32344.7321).

Selected predictors: Diastolic and Systolic.
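The same comparison can be made in base R without a selection package. This is a minimal sketch on synthetic data (`d`, `x1`, `x2`, `x3` are assumptions, not heartBP) that fits three nested candidates and tabulates adjusted R-squared and AIC.

```r
# Synthetic stand-in data (assumption: not heartBP); x3 is pure noise
set.seed(5)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.4 * d$x2 + rnorm(n)

# Nested candidates, analogous to Models 1-3 in the table above
fits <- list(
  m1 = lm(y ~ x1, data = d),
  m2 = lm(y ~ x1 + x2, data = d),
  m3 = lm(y ~ x1 + x2 + x3, data = d)
)

# Adjusted R^2 (larger is better) and AIC (smaller is better)
crit <- data.frame(
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
crit
rownames(crit)[which.min(crit$aic)]      # model preferred by AIC
rownames(crit)[which.max(crit$adj_r2)]   # model preferred by adjusted R^2
```

The two criteria can disagree because they penalize extra terms differently. Adjusted R-squared rises whenever the added term has |t| > 1, while AIC needs roughly |t| > 1.41 in large samples. Weight has t = 1.284 in the Ex 2 summary, which falls between the two thresholds.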

Ex 4 Part C

Final Conclusion

Best Model 1 | Adjusted R-squared: Weight, Diastolic, and Systolic

Best Model 2 | AIC: Diastolic and Systolic

Best Model 3 | Stepwise selection: Diastolic and Systolic

Best Models 1 and 2 have overall p-values less than 0.05, so both are useful. In both, the Diastolic and Systolic predictors have p-values less than 0.05, which suggests a linear relationship with Cholesterol.

The best subset approach (by AIC) and stepwise selection (Best Model 3) both select Diastolic and Systolic.

Final model: Y = 159.3317 + 0.2770(Diastolic) + 0.3022(Systolic) + E
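As an illustration of how the final model would be applied, the sketch below wraps the stated equation in a function. The function name `predict_chol` and the example patient are hypothetical; only the coefficients come from the final model above.

```r
# Coefficients copied from the final model stated above
predict_chol <- function(diastolic, systolic) {
  159.3317 + 0.2770 * diastolic + 0.3022 * systolic
}

# Hypothetical patient with Diastolic = 82 and Systolic = 130
predict_chol(82, 130)
```

With residual standard errors of about 42 to 43 in the fitted models, any single prediction of this kind carries wide uncertainty, in line with the low R-squared values reported above.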