linear regression, Stepwise Model selection

Stepwise selection

We will use heart.csv dataset. Below is brief summary of variables in heart.csv.

Weight: subject’s weight
Systolic: top number in a blood pressure reading, indicating the blood pressure level when the heart contracts
Diastolic: bottom number in a blood pressure reading, indicating the blood pressure level when the heart is at rest or between beats
Cholesterol: measured cholesterol

The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.

# Read the CSV file
d1 <- read.csv("heart.csv") 

# Check data format
#str(d1)

consider stepwise model selection for the Cholesterol model. Before performing the model selection, which has a cook’s distance larger than 0.015.

Find the best model based on adjusted-R square criteria and specify which predictors are selected.

Answer -:

Step 1) Run linear regressions with multicollinearity issue

e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight  , data=d1)
summary(e2.lr)

## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16

Result:

P-Value of F-Test is significant. Thus, the model is useful.

In the other word, there exist linear regression for cholesterol as a function of diastolic, systolic, and weight.

Step 2) Perform Correlation to find Multicollinearity

Two methods:

    - 1) Pairwise Scatter Plot  

    - 2) Quantitative Method (VIF's)

pairs(d1,pch=19, col="darkblue")

Result:

It seems, maybe there exist a small correlation between Diastolic and Systolic, to be sure we will perform a Quantitative test (VIF’s)

vif(e2.lr)

## Diastolic  Systolic    Weight 
##  2.558682  2.454214  1.120375

Result:

There is no VIF>10, thus there exist no correlation between predictors.

Step 3) Perform Diagnostic Plot

par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")

Result:

It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.
In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.
In the cook distance, there exist some pints greater than 0.015

Step 4) Influential points

cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)

Delete observations larger than criteria (0.015)

e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight  , data=d1[-e2.inf.id,])
summary(e2.lr2)

## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

Linear regression model:

    ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight

R Square (R2) and Adj. R Square:

Calculate R2 and Adj.R2

summary(e2.lr2)$r.squared

## [1] 0.03766896

summary(e2.lr2)$adj.r.squared

## [1] 0.03674601

    R Square is very small, thus, "Goodness of fit" or "Predictive power" is very low.  
    
    Adj. R Square is very small too.

Step 5) Plot scatter, with and without influential points

with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))

abline(e2.lr,col="red")

## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients

abline(e2.lr2,col="green")

## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients

legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)

Result:

Since dataset is very big (3134 observations) and we only remove 2 outliers, thus, the linear regression is closed together. The red line is under the green line.

Step 6) Diagnostic plot for without influential points

par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue")

Conclusion:

Regression lines with/without influential points are almost the same.

Perform stepwise model selection

Perform stepwise model selection with .05 criteria and address any issues in diagnostics plots.

Answer -:

We have removed influential points detected in Exercise 2, which has a cook’s distance larger than 0.015.
(We will use “data=d1[-e2.inf.id,]” dataset, which have deleted the influential points)

e3.model.stepwise <- ols_step_both_p(e2.lr2,pent = 0.05, prem = 0.05, details = F)
e3.model.stepwise

## 
##                                Stepwise Selection Summary                                
## ----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                         
## Step    Variable     Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ----------------------------------------------------------------------------------------
##    1    Systolic     addition       0.035       0.035    8.6850    32349.7666    42.3013    
##    2    Diastolic    addition       0.037       0.037    3.6480    32344.7321    42.2606    
## ----------------------------------------------------------------------------------------

plot(e3.model.stepwise)

e3.lr.stepwise <- lm(Cholesterol ~ Diastolic+Systolic, data=d1[-e2.inf.id,])
summary(e3.lr.stepwise)

## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

Conclusion:

The model is significant and each individual variable is also significant.

linear regression, Stepwise Model selection

Arash Hatamirad, ahatamirad@gmail.com

9/1/2021

Stepwise selection

Answer -:

Step 1) Run linear regressions with multicollinearity issue

Result:

Step 2) Perform Correlation to find Multicollinearity

Result:

Result:

Step 3) Perform Diagnostic Plot

Result:

Step 4) Influential points

Linear regression model:

R Square (R2) and Adj. R Square:

Step 5) Plot scatter, with and without influential points

Result:

Step 6) Diagnostic plot for without influential points

Conclusion:

Perform stepwise model selection

Answer -:

Conclusion: