Stepwise selection

We will use heart.csv dataset. Below is brief summary of variables in heart.csv.

The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.

# Read the CSV file
d1 <- read.csv("heart.csv") 

# Check data format
#str(d1)

consider stepwise model selection for the Cholesterol model. Before performing the model selection, which has a cook’s distance larger than 0.015.


Answer -:

Step 1) Run linear regressions with multicollinearity issue
e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight  , data=d1)
summary(e2.lr)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
Result:
  • P-Value of F-Test is significant. Thus, the model is useful.

In the other word, there exist linear regression for cholesterol as a function of diastolic, systolic, and weight.

Step 2) Perform Correlation to find Multicollinearity
  • Two methods:

        - 1) Pairwise Scatter Plot  
    
        - 2) Quantitative Method (VIF's)  
pairs(d1,pch=19, col="darkblue")

Result:
  • It seems, maybe there exist a small correlation between Diastolic and Systolic, to be sure we will perform a Quantitative test (VIF’s)
vif(e2.lr)
## Diastolic  Systolic    Weight 
##  2.558682  2.454214  1.120375
Result:
  • There is no VIF>10, thus there exist no correlation between predictors.
Step 3) Perform Diagnostic Plot
par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")

Result:
  • It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.

  • In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.

  • In the cook distance, there exist some pints greater than 0.015

Step 4) Influential points
cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)

  • Delete observations larger than criteria (0.015)
e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight  , data=d1[-e2.inf.id,])
summary(e2.lr2)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
Linear regression model:
    ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight    
R Square (R2) and Adj. R Square:

Calculate R2 and Adj.R2

summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
    R Square is very small, thus, "Goodness of fit" or "Predictive power" is very low.  
    
    Adj. R Square is very small too.  
Step 5) Plot scatter, with and without influential points
with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))

abline(e2.lr,col="red")
## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients
abline(e2.lr2,col="green")
## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients
legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)

Result:
  • Since dataset is very big (3134 observations) and we only remove 2 outliers, thus, the linear regression is closed together. The red line is under the green line.
Step 6) Diagnostic plot for without influential points
par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue") 

Conclusion:

Regression lines with/without influential points are almost the same.


Perform stepwise model selection

  • Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

Answer -:

We have removed influential points detected in Exercise 2, which has a cook’s distance larger than 0.015.
(We will use “data=d1[-e2.inf.id,]” dataset, which have deleted the influential points)

e3.model.stepwise <- ols_step_both_p(e2.lr2,pent = 0.05, prem = 0.05, details = F)
e3.model.stepwise
## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------
plot(e3.model.stepwise)

e3.lr.stepwise <- lm(Cholesterol ~ Diastolic+Systolic, data=d1[-e2.inf.id,])
summary(e3.lr.stepwise)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

Conclusion:

  • The model is significant and each individual variable is also significant.