Homework 3

library(MASS); library(car); library(olsrr)

## Loading required package: carData

## Warning: package 'olsrr' was built under R version 4.0.3

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:MASS':
## 
##     cement

## The following object is masked from 'package:datasets':
## 
##     rivers

Exercise 1:

We would like to investigate the relationships between cholesterol, weight and/or blood pressure. The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.

The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.

a) Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.

lm.heart <- lm(Cholesterol~Weight, data=heart)  
summary(lm.heart)

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))

cor(heart$Weight, heart$Cholesterol, method ="spearman")

## [1] 0.1078544

cook.d = cooks.distance(lm.heart)

inf.id=which(cooks.distance(lm.heart)>0.015)
heart[inf.id, ]

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

lm.heart2=lm(Cholesterol~Weight, data=heart[-inf.id, ])

with(heart, plot(Weight, Cholesterol))
abline(lm.heart, col="red")
abline(lm.heart2,col="blue")
legend("topright",col=c("red","blue"),legend=c("w/ all observations", "w/out all observations"), cex=0.8, title.adj=0.15, lty=1)

lm.heart2 <- lm(Cholesterol~Weight, data=heart)  
summary(lm.heart2)

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

Influential points: Looking more specifically at the normal qq plot and the Standardized residuals there are many points that stray from the line and that there are many points above 1.5. Specifically, these points are at the top 23 and 210. We can assume that normality is not reasonable.

Looking at the Standardized Residuals Plot, we see that there is a pattern in the residual plot. This supports heteroscedasticity

b) Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

Model Significance: Based on the F test the p value is 9.778e-05 which is very low. With this info we can support the Ha: Model is useful.

Individual Term Significance: Based off the T-test results the p-value for Weight is 9.78e-05. We can support the alternative hypothesis that there is a significant linear relationship between Weight and Cholesterol.

Estimated Regression line: On average, Cholesterol is predicted to have an increase .108, with one unit increase on weight.

R-squared: The prediction power is very low. 0.48 % of the variation of Cholesterol can be explained by the model (Weight). Very low prediction power.

Conclusion: Based on this information I would not suggest that weight alone can predict the outcome of Cholesterol.

Exercise 2:

The medical director wants to know if blood pressures and weight can better predict cholesterol outcome.

Consider modeling cholesterol as a function of diastolic, systolic, and weight.

a) Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.

lm.heart3 <- lm(Cholesterol~Diastolic+Systolic+Weight, data=heart)  
summary(lm.heart3)

## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lm.heart3, which=c(1:4))

inf.id2=which(cooks.distance(lm.heart3) > 0.015)
heart[inf.id2, ]

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

lm.heart4 =lm(Cholesterol~Systolic + Diastolic + Weight, data=heart[-inf.id2, ])
summary(lm.heart4)

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic + Weight, data = heart[-inf.id2, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Diagnostic observations: Looking at the plots specifically the standardized residual plot the influential points include 23, 210, 2035 are above the 2.0 mark. The Normal qq plot shows that the observations follow normal distribution, but you can see the influential points noted above.

b) Comment on significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

vif(lm.heart4)

##  Systolic Diastolic    Weight 
##  2.454207  2.558914  1.120631

Model Significance: Based on the F-test,the p-value is very low at 2.2e-16. With this we can support the alternative hypothesis and say there is linear relationship between some of x’s (Diastolic, Systolic, and Weight) and y (Cholesterol).

Individual term significance: Based on the t-test we determine the linear relationship based on the p-value. High -value (above the .05 significance level) support the null hypothesis that there is no linear relationship between x and y. Low p-value (below the .05 significant level) supports the alternative hypothesis where there is a linear relationship between x and y. - There is linear relationship between Diastolic and Cholesterol (2.2e-06) - There is linear relationship between Systolic and Cholesterol (0.0195) - There is no linear relationship between Weight and Cholesterol (0.1994)

R-squared: Looking at the R-squared we can interpret that only 3% of variation of Cholesterol can be explained by the model Diastolic, Systolic and Weight The prediction power is very low

Statistically Significant Predictor(s): Based off the T-test we would not include Weight as a statistically significant predictor since it has a high p-value.

Multicollinearity: Based of the VIF test of all the predictors (Diastolic, Systolic and Weight) they are all very small (under 10) meaning they all have very low correlation to Cholesterol

Conclusion: I would suggest to the medical director that this is not a good model for the prediction of Cholesterol level. Both predictors, Diastolic and Systolic, have a low correlation to Cholesterol.

Exercise 3:

Now consider stepwise model selection for the Cholesterol model. We remove influential points detected in Exercise 2, which has cook’s distance larger than 0.015, prior to performing the model selection.

a) Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

model.stepwise<-ols_step_both_p(lm.heart4, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise

## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------

plot(model.stepwise)

lm.final=lm(Cholesterol~ Systolic + Diastolic, data=heart[-inf.id2, ])
summary(lm.final)

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart[-inf.id2, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lm.final, which=c(1:4))

Diagnostic: Looking at the Normal qq plot it appears that the data follows normal distribution. The standardized residual plot the influential points include 23, 210, 2035 are above the 2.0 mark.

b) Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models of from Exercise 1 and 2.

Model Significance: Based on the F-test, p-value is very low 2.2e-16. Support Alternative hypothesis: There is linear relationship between some of x’s (Systolic and Diastolic) and y (Cholesterol)

Individual Term Significance: Based on the t-test we determine the linear relationship based on the p-value. Support Alternative hypothesis. - There is linear relationship between Diastolic and Cholesterol (2.2e-06) - There is linear relationship between Systolic and Cholesterol (0.0195)

Estimated Regression line: Y-hat = 159.3317 + 0.3022(Systolic) + 0.2770(Diastolic) + E

R-squared: Looking at the R-squared we can interpret that only 3% of variation of Cholesterol can be explained by the model Diastolic and Systolic. The prediction power is very low

Final Model: Using the automatic Selection, specifically using stepwise selection we were able to determine based on the “Added/Removed” that only Systolic and Diastolic should be included in the final model. Cholesterol ~ Systolic + Diastolic

Variations: The variations for both Exercise 1 (0.48 %) and 2 (3%) had very small prediction power.

Exercise 4:

Now consider best subset selection for the Cholesterol model. Again, we remove influential points detected in Exercise 2, which has cook’s distance larger than 0.015, prior to performing the model selection.

a) Find the best model based on adjusted-R square criteria and specify which predictors are selected.

model.best.subset<-ols_step_best_subset(lm.heart4)
model.best.subset

##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Systolic Diastolic        
##      3         Systolic Diastolic Weight 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

By looking at the Adjusted R-squared we want to find the biggest one: Model 3: Systolic Diastolic Weight

b) Find the best model based on AIC criteria and specify which predictors are selected.

For AIC the smaller the better: Model 2: Systolic Diastolic

c) Compare final models selected in a) and b). Also compare final models from best subset approach with the final model from stepwise selection.

Both Models 2&3 contain Systolic and Diastolic.

The final model approach in Exercise 3 and the best subset approach both used Systolic and Diastolic (Model 2)