Data Algorithms I - Homework 3

Exercise 1A

We would like to investigate the relationships between Cholesterol, Weight and/or Blood Pressure. The data set contains Weight, Diastolic blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv. The medical director at your company wants to know if Weight alone can predict Cholesterol outcome. Consider modeling Cholesterol as a function of Weight.

Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Consider Cook’s distance cut off to be 0.015.

Linear Regression Model (`Cholesterol` ~ `Weight`)

heart = read.csv("heart.csv")  # assumes heart.csv sits in the working directory
lm.heart = lm(Cholesterol ~ Weight, heart)

Scatter Plot and Correlation (`Cholesterol` ~ `Weight`)

plot(heart$Weight, heart$Cholesterol, xlab ="Weight", ylab ="Cholesterol")
abline(lm.heart, col ="red")

cor(heart$Weight, heart$Cholesterol, method ="pearson")

## [1] 0.0695377

Conclusion: As shown above, the Pearson correlation is 0.0695377. This indicates a very weak positive linear correlation between Weight and Cholesterol.
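For a simple regression with one predictor, the squared Pearson correlation equals the model's R-squared, which gives a quick consistency check between the correlation output and the regression summary. The sketch below illustrates this on simulated data; the variable names and simulated values are hypothetical stand-ins, not the real heart.csv.

```r
# Simulated stand-in for the heart data (hypothetical values, not the real file)
set.seed(1)
n <- 500
weight <- rnorm(n, mean = 150, sd = 25)
cholesterol <- 205 + 0.1 * weight + rnorm(n, sd = 40)

fit <- lm(cholesterol ~ weight)

# Pearson measures linear association; Spearman is rank-based and
# will generally give a different number on the same data
r_pearson <- cor(weight, cholesterol, method = "pearson")
r_spearman <- cor(weight, cholesterol, method = "spearman")

# With a single predictor, r_pearson^2 matches the model R-squared
r_squared <- summary(fit)$r.squared
all.equal(r_pearson^2, r_squared)

# The same identity applied to the correlation printed by cor() above
reported_r2 <- 0.0695377^2
```

Squaring the reported correlation gives roughly 0.0048, so Weight accounts for less than half a percent of the variation in Cholesterol.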

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart, which=c(1:4))

Conclusion:

Normality Check: In the Normal Q-Q plot, the majority of the points fall along the straight reference line. Toward the upper end of the line, however, the points pull away from it, which indicates a heavy right tail, so the normality assumption is not reasonable. The standardized residuals tell the same story: the positive residuals extend much further from zero than the negative ones, with a number of points above 1.5 on the y-axis.
Equal Variance Check: The Scale-Location plot (square root of the absolute standardized residuals against the fitted values) shows a pattern rather than a flat band. This suggests heteroscedasticity, so the equal variance assumption is questionable.
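The exercise also asks for a Cook’s distance check with a cutoff of 0.015. A minimal sketch of that step is shown here on simulated data with one planted extreme observation; the data frame `sim` and its values are hypothetical, and the results say nothing about which rows of the real data would be flagged.

```r
# Simulated stand-in for the heart data (hypothetical values)
set.seed(2)
n <- 200
sim <- data.frame(Weight = rnorm(n, mean = 150, sd = 25))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, sd = 40)

# Plant one extreme observation so that something exceeds the cutoff
sim$Weight[1] <- 300
sim$Cholesterol[1] <- 600

fit <- lm(Cholesterol ~ Weight, data = sim)

# Flag observations whose Cook's distance exceeds the assignment cutoff
cutoff <- 0.015
influential <- which(cooks.distance(fit) > cutoff)

# Build the rows to keep explicitly: negative indexing with an empty
# vector, sim[-integer(0), ], would drop every row instead of none
keep <- setdiff(seq_len(nrow(sim)), influential)
refit <- lm(Cholesterol ~ Weight, data = sim[keep, ])
```

Selecting the rows to keep, rather than negating the flagged indices, keeps the refit safe when no observation exceeds the cutoff.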

Exercise 1B

Comment on significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

summary(lm.heart)

## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

Conclusion:

Model Significance: We check the p-value of the F-statistic. The p-value for this model is 9.778e-05, which is below the significance level of 0.05. We therefore reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual term, we rely on the p-value from the t-test of Weight. It is 9.78e-05, which is also below the significance level of 0.05. We reject the null hypothesis and conclude that there is a statistically significant linear relationship between Weight and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, the estimated regression line is \(\hat{y}\) = 205.86763 + 0.10867x, where x is Weight. In other words, Cholesterol is expected to increase by 0.10867 on average for each one-unit increase in Weight.
R-squared: Although the relationship is statistically significant, only 0.4835% of the variation in Cholesterol is explained by Weight (\(R^2\) = 0.004835 in the table above). The medical director should not use this model for prediction because it has very low predictive power. The issues noted in the diagnostic plots (non-normal residuals and unequal variance) also remain.
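The test statistics in the summary table can be reproduced from the estimate and its standard error, which makes clear why the F-test and the t-test agree in a one-predictor model. The sketch below uses only the rounded numbers printed in the table, so the results match the table only up to rounding.

```r
# Values copied from the summary(lm.heart) table
estimate <- 0.10867
std_error <- 0.02786
df_resid <- 3132

# t statistic is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with the residual df
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, the F statistic is the square of the t statistic
f_value <- t_value^2
```

A tiny p-value alongside an R-squared below 1% is not a contradiction: with more than 3,000 observations, even a very weak slope is estimated precisely enough to be distinguished from zero.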

Exercise 2A

The medical director wants to know if blood pressures and weight can better predict Cholesterol outcome. Consider modeling Cholesterol as a function of Diastolic, Systolic, and Weight. Fit a linear regression model for Cholesterol as a function of Diastolic, Systolic, and Weight. Generate the diagnostics plots and comment on any issues that need to be noted. Then make any necessary adjustments for undue influence. For Cook’s distances, do not leave any points in the final model that have Cook’s distance greater than 0.015.

Scatter Plot (Diastolic, Systolic, and Weight)

pairs(heart)

lm.heart2 <- lm(Cholesterol~., data = heart)

Model Diagnostics

par(mfrow=c(2,2))
plot(lm.heart2, which=c(1:4))

Conclusion:

Normality Check: In the Normal Q-Q plot, the majority of the points fall along the straight reference line. Toward the upper end of the line, however, the points pull away from it, which indicates a heavy right tail, so the normality assumption is not reasonable. The standardized residuals support this, with a number of points above 1.5 on the y-axis.
Equal Variance Check: The Scale-Location plot (square root of the absolute standardized residuals against the fitted values) shows a pattern rather than a flat band. This suggests heteroscedasticity, so the equal variance assumption is questionable.
Cook’s Distance: The Cook’s distance plot shows 2 unduly influential points. Below, we confirm whether these two points have a Cook’s distance greater than 0.015.

ipoint <- which(cooks.distance(lm.heart2) > 0.015) 
heart[ipoint, ]

##     Weight Diastolic Systolic Cholesterol
## 23      90        82      130         550
## 210    100        82      130         500

Conclusion: Observations 23 and 210 have a Cook’s distance greater than 0.015, so we remove them and refit the model below.

lm.heart2 <- lm(Cholesterol~., data = heart[-ipoint, ])

Exercise 2B

Comment on significance of the parameters and how much variation in Cholesterol is described by the model. Comment on the relationship between Cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol level.

summary(lm.heart2)

## 
## Call:
## lm(formula = Cholesterol ~ ., data = heart[-ipoint, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Conclusion:

Model Significance: We check the p-value of the F-statistic. The p-value for this model is < 2.2e-16, which is far below the significance level of 0.05. We therefore reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual terms, we rely on the p-values from the t-tests of Weight, Diastolic, and Systolic.
- Weight: The p-value is 0.1994, which is larger than the significance level of 0.05. We fail to reject the null hypothesis: there is no evidence of a linear relationship between Weight and Cholesterol once Diastolic and Systolic are in the model.
- Diastolic: The p-value (0.0195) is below the significance level of 0.05. We reject the null hypothesis and conclude that there is a linear relationship between Diastolic and Cholesterol.
- Systolic: This variable has the smallest p-value (2.2e-06), so we reject the null hypothesis and conclude that there is a linear relationship between Systolic and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, the estimated regression line is \(\hat{y}\) = 156.32618 + 0.03671(Weight) + 0.24922(Diastolic) + 0.30073(Systolic). In other words, a one-unit increase in Diastolic is associated with an expected increase of 0.24922 in Cholesterol, holding Weight and Systolic constant, and a one-unit increase in Systolic is associated with an expected increase of 0.30073, holding Weight and Diastolic constant. The Weight coefficient is not statistically significant.
R-squared: Although Diastolic and Systolic are statistically significant, only 3.767% of the variation in Cholesterol is explained by Weight, Diastolic, and Systolic together (\(R^2\) = 0.03767 in the table above). The medical director should not use this model for prediction because it has low predictive power.

Checking Multicollinearity using VIF (Variance Inflation Factors)

library(car)  # provides vif()
vif(lm.heart2)

##    Weight Diastolic  Systolic 
##  1.120631  2.558914  2.454207

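Conclusion: From the VIF values above, none of the three predictors (Weight, Diastolic, and Systolic) exceeds the VIF cutoff of 10. Diastolic and Systolic have higher values (about 2.5) than Weight, which indicates that they are moderately correlated with the other predictors, but there is no serious multicollinearity issue in this model.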
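Each VIF is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor j on the remaining predictors. The sketch below computes this by hand in base R on simulated predictors; the data and the `vif_manual` name are hypothetical, and this illustrates the definition rather than the internals of `vif()`.

```r
# Simulated predictors (hypothetical values): Diastolic and Weight
# are both built to correlate with Systolic
set.seed(3)
n <- 300
Systolic <- rnorm(n, mean = 130, sd = 20)
Diastolic <- 30 + 0.4 * Systolic + rnorm(n, sd = 6)
Weight <- 100 + 0.3 * Systolic + rnorm(n, sd = 25)
predictors <- data.frame(Weight, Diastolic, Systolic)

# Regress each predictor on the others and convert R^2 to a VIF
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

# Reading a reported VIF backwards: the Diastolic VIF of 2.558914
# implies this R^2 when Diastolic is regressed on the other predictors
implied_r2 <- 1 - 1 / 2.558914
```

Read this way, the reported Diastolic VIF corresponds to about 61% of its variation being explained by the other two predictors, which is noticeable but well short of the level that the cutoff of 10 (90%) represents.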

Exercise 3A

Now consider stepwise model selection for the Cholesterol model. We remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Perform stepwise model selection with 0.05 criteria and address any issues in the diagnostics plots.

library(olsrr)  # provides ols_step_both_p() and ols_step_best_subset()
model.stepwise = ols_step_both_p(lm.heart2, pent = 0.05, prem = 0.05, details = FALSE)
model.stepwise

## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------

plot(model.stepwise)

Exercise 3B

Interpret the final model and comment on the variation in Cholesterol explained. Compare the variations explained by the models from Exercise 1 and 2.

lm.step = lm(Cholesterol~ Systolic + Diastolic, data = heart)
summary(lm.step)

## 
## Call:
## lm(formula = Cholesterol ~ Systolic + Diastolic, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.52  -29.58   -4.57   23.79  328.47 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.63995    5.91244  27.001  < 2e-16 ***
## Systolic      0.30193    0.06442   4.687 2.89e-06 ***
## Diastolic     0.27609    0.10612   2.602  0.00932 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.94 on 3131 degrees of freedom
## Multiple R-squared:  0.03589,    Adjusted R-squared:  0.03527 
## F-statistic: 58.27 on 2 and 3131 DF,  p-value: < 2.2e-16

Conclusion:

Model Significance: We check the p-value of the F-statistic. The p-value for this model is < 2.2e-16, which is far below the significance level of 0.05. We therefore reject the null hypothesis and conclude that the linear regression model is useful (at least one beta is not equal to 0).
Individual Term Significance: To test the individual terms, we rely on the p-values from the t-tests of Systolic and Diastolic.
- Systolic: This variable has the smallest p-value (2.89e-06), so we reject the null hypothesis and conclude that there is a linear relationship between Systolic and Cholesterol.
- Diastolic: The p-value (0.00932) is below the significance level of 0.05. We reject the null hypothesis and conclude that there is a linear relationship between Diastolic and Cholesterol.
Estimated Regression Line ( \(\hat{y}\) ): From the table above, the estimated regression line is \(\hat{y}\) = 159.63995 + 0.30193(Systolic) + 0.27609(Diastolic). In other words, a one-unit increase in Systolic is associated with an expected increase of 0.30193 in Cholesterol, holding Diastolic constant, and a one-unit increase in Diastolic is associated with an expected increase of 0.27609, holding Systolic constant.
R-squared: Although both predictors are statistically significant, only 3.589% of the variation in Cholesterol is explained by Systolic and Diastolic together (\(R^2\) = 0.03589 in the table above). The medical director should not use this model for prediction because it has low predictive power.
Note that lm.step above is fit on the full heart data (3131 residual degrees of freedom), so the two influential points removed in Exercise 2 (observations 23 and 210) are still included in this fit.
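To make the low predictive power concrete, the coefficients in the summary(lm.step) table can be turned into a point prediction together with a rough band of plus or minus two residual standard errors. This is only a sketch: the helper `predict_chol` and the input values (Systolic 130, Diastolic 82) are illustrative choices, and the band ignores uncertainty in the coefficients, so it is slightly narrower than a proper prediction interval.

```r
# Coefficients and residual standard error copied from summary(lm.step)
coefs <- c(intercept = 159.63995, Systolic = 0.30193, Diastolic = 0.27609)
rse <- 42.94

# Hypothetical helper: evaluate the fitted line for given blood pressures
predict_chol <- function(systolic, diastolic) {
  unname(coefs["intercept"] +
           coefs["Systolic"] * systolic +
           coefs["Diastolic"] * diastolic)
}

# Point prediction for an illustrative subject
pred <- predict_chol(systolic = 130, diastolic = 82)

# Rough band: point prediction plus or minus two residual standard errors
approx_range <- pred + c(-2, 2) * rse
```

The band spans more than 170 Cholesterol units, which is far too wide to be useful for an individual subject.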

Variation explained by the models from Exercises 1, 2, and 3:

Exercise 1: Only 0.4835% of the variation in Cholesterol is explained by the model with Weight alone. This is not a good model for the prediction of Cholesterol level, as it has very low predictive power.
Exercise 2: Only 3.767% of the variation in Cholesterol is explained by the model with Weight, Diastolic, and Systolic. Adding the blood pressure variables improves on Exercise 1, but this is still not a good model for prediction.
Exercise 3: Only 3.589% of the variation in Cholesterol is explained by the model with Systolic and Diastolic. Dropping Weight costs almost nothing in explained variation, but this is still not a good model for the prediction of Cholesterol level.

Exercise 4A

Now consider best subset selection for the Cholesterol model. Again, we remove the influential points detected in Exercise 2, which have Cook’s distance larger than 0.015, prior to performing the model selection. Find the best model based on the adjusted R-square criterion and specify which predictors are selected.

model.best.subset = ols_step_best_subset(lm.heart2)
model.best.subset

##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Weight Diastolic Systolic 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Conclusion:

Best Model: According to the adjusted R-square criterion, the best model for Cholesterol is Model 3, since it has the highest adjusted R-square of the three models (0.0367, against 0.0365 for Model 2 and 0.0347 for Model 1).
Selected Predictors: Weight, Diastolic, and Systolic
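Adjusted R-square penalizes R-square for the number of predictors: \(R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)\), with n observations and p predictors. The sketch below applies that formula to the rounded values printed in the earlier summary tables; the function name `adj_r2` is a hypothetical helper.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Exercise 2 model: 3128 residual df plus 4 coefficients gives n = 3132
adj_ex2 <- adj_r2(0.03767, n = 3132, p = 3)

# Exercise 1 model: 3132 residual df plus 2 coefficients gives n = 3134
adj_ex1 <- adj_r2(0.004835, n = 3134, p = 1)
```

Because n is large here, the penalty per extra predictor is tiny, which is why adjusted R-square still prefers the three-predictor model even though Weight adds very little.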

Exercise 4B

Find the best model based on AIC criteria and specify which predictors are selected.

Conclusion:

Best Model: According to the AIC criterion, the best model for Cholesterol is Model 2, since it has the smallest AIC value of the three models (32344.7321).
Selected Predictors: Diastolic and Systolic
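AIC trades goodness of fit against the number of estimated parameters, so a predictor has to improve the likelihood by enough to pay for itself; in the table above, adding Weight raises AIC by about 0.35. A minimal base R sketch of the comparison follows, on simulated data in which Weight has no true effect. The data and model names are hypothetical, and base `AIC()` is used here rather than olsrr.

```r
# Simulated stand-in for the heart data (hypothetical values)
set.seed(4)
n <- 400
sim <- data.frame(
  Systolic = rnorm(n, mean = 130, sd = 20),
  Diastolic = rnorm(n, mean = 82, sd = 12),
  Weight = rnorm(n, mean = 150, sd = 25)
)
# Hypothetical data-generating model: Weight plays no role
sim$Cholesterol <- 160 + 0.3 * sim$Systolic + 0.3 * sim$Diastolic +
  rnorm(n, sd = 40)

# The three candidate models from the best subsets table
fits <- list(
  m1 = lm(Cholesterol ~ Systolic, data = sim),
  m2 = lm(Cholesterol ~ Diastolic + Systolic, data = sim),
  m3 = lm(Cholesterol ~ Weight + Diastolic + Systolic, data = sim)
)
aic_values <- sapply(fits, AIC)

# AIC by hand for one model: -2 * log-likelihood + 2 * parameter count
ll <- logLik(fits$m2)
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")

# Model with the smallest AIC
best <- names(which.min(aic_values))
```

The parameter count returned by `logLik()` includes the error variance, so a model with two predictors has four estimated parameters.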

Exercise 4C

Compare final models selected in a) and b). Also compare final models from Best Subset approach with the final model from Stepwise Selection.

Final Conclusion:

Best Model (Adjusted R-Square): Weight, Diastolic, and Systolic
Best Model (AIC): Diastolic and Systolic
Best Model (Step-Wise Selection): Diastolic and Systolic

Both the AIC-based and the adjusted R-square-based models have overall p-values below the significance level of 0.05, so both can be considered useful. Both contain Diastolic and Systolic, whose individual p-values are also below 0.05, so we conclude that there is a significant linear relationship between each of these two predictors and Cholesterol. The two models differ only in Weight, which is not significant (p-value 0.1994) and raises the adjusted R-square only from 0.0365 to 0.0367.

Furthermore, the Best Subset approach under the AIC criterion and the Stepwise Selection both returned the Diastolic and Systolic predictors. Hence, the final selected model, as fitted in Exercise 3B, is \(\hat{y}\) = 159.63995 + 0.27609(Diastolic) + 0.30193(Systolic).

Adrianne Kristianto

11/3/2020
