Best Subset Model Selection

We will use the heart.csv dataset. Below is a brief summary of its variables.

The dataset contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for the surviving (alive) subjects in heart.csv.

# Read the CSV file
d1 <- read.csv("heart.csv")

# Load the packages used later: car for vif(), olsrr for ols_step_best_subset()
library(car)
library(olsrr)

# Check data format
#str(d1)

Consider best subset selection for the Cholesterol model. As before, remove observations whose Cook's distance is larger than 0.015 before performing the model selection.


Answer -:

Step 1) Fit the full linear regression (possibly with a multicollinearity issue)
e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight  , data=d1)
summary(e2.lr)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
Result:
  • The p-value of the F-test is significant, so the model is useful.

In other words, there is a significant linear relationship between Cholesterol and at least one of Diastolic, Systolic, and Weight.
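To see where that p-value comes from, the overall F-statistic can be pulled out of the summary object directly (a minimal sketch using base R):

# Extract the overall F-statistic and recompute its p-value
f2 <- summary(e2.lr)$fstatistic
pf(f2["value"], f2["numdf"], f2["dendf"], lower.tail = FALSE)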

Step 2) Check correlations to detect multicollinearity
  • Two methods:

        - 1) Pairwise scatter plot  
    
        - 2) Quantitative method (VIFs)  
pairs(d1,pch=19, col="darkblue")
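As a numeric companion to the scatter-plot matrix, the pairwise correlations can be printed as well (a small sketch; it assumes d1 contains the four columns used in the model):

# Pairwise correlations among the model variables
round(cor(d1[, c("Cholesterol", "Diastolic", "Systolic", "Weight")]), 3)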

Result:
  • The scatter plots suggest there may be some correlation between Diastolic and Systolic; to be sure, we perform the quantitative test (VIFs).
vif(e2.lr)
## Diastolic  Systolic    Weight 
##  2.558682  2.454214  1.120375
Result:
  • No VIF exceeds 10, so there is no serious multicollinearity among the predictors.
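For reference, each VIF equals 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. A quick hand check for Diastolic (it should reproduce vif(e2.lr)["Diastolic"] ≈ 2.56):

# VIF for Diastolic computed from first principles
r2.dia <- summary(lm(Diastolic ~ Systolic + Weight, data = d1))$r.squared
1 / (1 - r2.dia)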
Step 3) Perform diagnostic plots
par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")

Result:
  • The Normal Q-Q plot is approximately a straight line, and in the Scale-Location plot about 95% of the standardized residuals lie between 0 and 1.5, so the residuals look approximately normal.

  • The Residuals vs Fitted plot shows no pattern, so homoscedasticity holds.

  • In the Cook's distance plot, some points exceed 0.015.
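If a quantitative complement to the visual checks is wanted, base R offers a normality test of the residuals (the Breusch-Pagan test from the lmtest package is a common heteroscedasticity check; both are optional extras here, not part of the original analysis):

# Normality of residuals (Shapiro-Wilk accepts 3 <= n <= 5000)
shapiro.test(residuals(e2.lr))
# lmtest::bptest(e2.lr)   # optional heteroscedasticity test, if lmtest is installed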

Step 4) Influential points
cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)
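To make the cutoff visible on this plot, a horizontal reference line can be added and the flagged points counted (a small sketch):

# Mark the 0.015 cutoff and count observations above it
abline(h = 0.015, col = "red", lty = 2)
sum(cook.d2 > 0.015)   # 2 observations, consistent with the note in Step 5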

  • Delete observations with Cook's distance larger than the criterion (0.015)
e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight  , data=d1[-e2.inf.id,])
summary(e2.lr2)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
Linear regression model:
    ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight    
R Square (R2) and Adj. R Square:

Calculate R2 and Adj.R2

summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
    R-Square is very small, so the goodness of fit (predictive power) is very low.  
    
    Adj. R-Square is very small as well.  
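As a sanity check, both quantities can be recomputed from their definitions, R² = 1 - SSE/SST and Adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 3132 and p = 3 here (d2 below is a helper introduced just for this sketch):

# R-squared and adjusted R-squared from first principles
d2  <- d1[-e2.inf.id, ]
sse <- sum(residuals(e2.lr2)^2)
sst <- sum((d2$Cholesterol - mean(d2$Cholesterol))^2)
r2  <- 1 - sse / sst                                   # 0.03767
1 - (1 - r2) * (nrow(d2) - 1) / (nrow(d2) - 3 - 1)     # 0.03675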
Step 5) Scatter plots with and without influential points
with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))

abline(e2.lr,col="red")
## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients
abline(e2.lr2,col="green")
## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients
legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)

Result:
  • Since the dataset is large (3134 observations) and only 2 influential points were removed, the two regression lines almost coincide; the red line lies under the green line. (Note the warnings: abline() uses only the intercept and the first slope, so these lines are illustrative only.)
Step 6) Diagnostic plots for the model without influential points
par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue") 

Conclusion:

Regression lines with/without influential points are almost the same.


# Best subset regression: evaluate all subsets of the three predictors (olsrr)
e4.model.subset <- ols_step_best_subset(e2.lr2)
e4.model.subset
##         Best Subsets Regression         
## ----------------------------------------
## Model Index    Predictors
## ----------------------------------------
##      1         Systolic                  
##      2         Diastolic Systolic        
##      3         Diastolic Systolic Weight 
## ----------------------------------------
## 
##                                                           Subsets Regression Summary                                                          
## ----------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                        
## Model    R-Square    R-Square    R-Square     C(p)        AIC           SBIC          SBC            MSEP           FPE        HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0350      0.0347      0.0337    8.6847    32349.7666    23461.5297    32367.9149    5604396.2122    1790.5412    0.5719    0.9662 
##   2        0.0372      0.0365      0.0352    3.6475    32344.7321    23456.5056    32368.9298    5593610.3978    1787.6653    0.5710    0.9647 
##   3        0.0377      0.0367      0.0351    4.0000    32345.0829    23456.8621    32375.3300    5592453.6261    1787.8655    0.5710    0.9648 
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

For Adj. R-Square, bigger is better; the largest value, Adj. R-Square = 0.0367, belongs to Model #3, so the best selected model is:
Model #3: Diastolic Systolic Weight
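The same choice can be read off programmatically. In the olsrr version used here the result behaves like a data frame with predictors and adjr columns; these names vary between olsrr versions, so treat them as an assumption:

# Pick the subset with the largest adjusted R-squared
e4.model.subset$predictors[which.max(e4.model.subset$adjr)]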

e4.lr.subset.a <- lm(Cholesterol ~ Diastolic+Systolic+Weight,  data=d1[-e2.inf.id,])
summary(e4.lr.subset.a)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16

The final model - based on Adj. R-Square in the subset method - is:
ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight

summary(e4.lr.subset.a)$r.squared
## [1] 0.03766896
summary(e4.lr.subset.a)$adj.r.squared
## [1] 0.03674601

This is the proportion of variation in Cholesterol explained by the model (about 3.8%).


Best Subset Model Selection by AIC

  • Find the best model based on AIC criteria and specify which predictors are selected.

Answer -:

For AIC (and likewise SBIC/SBC), smaller is better. The smallest AIC, 32344.7321, belongs to Model #2, so the best selected model is:
Model #2: Diastolic Systolic
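This can be cross-checked with base R's AIC() without retyping the formula; absolute values can differ from a package's by an additive constant depending on convention, but the ranking agrees with the olsrr table:

# AIC comparison: dropping Weight lowers AIC
AIC(e2.lr2)                             # 3-predictor model
AIC(update(e2.lr2, . ~ . - Weight))     # Diastolic + Systolic (smaller AIC, preferred)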

e4.lr.subset.b <- lm(Cholesterol ~ Diastolic+Systolic, data=d1[-e2.inf.id,])
summary(e4.lr.subset.b)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.332  -29.399   -4.433   23.922  217.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.3317     5.8186  27.383  < 2e-16 ***
## Diastolic     0.2770     0.1044   2.652  0.00803 ** 
## Systolic      0.3022     0.0634   4.767 1.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared:  0.03716,    Adjusted R-squared:  0.03655 
## F-statistic: 60.38 on 2 and 3129 DF,  p-value: < 2.2e-16

The final Model - based on AIC in subset method - is:
ŷ = 159.3317 + 0.2770 * Diastolic + 0.3022 * Systolic

summary(e4.lr.subset.b)$r.squared
## [1] 0.0371621
summary(e4.lr.subset.b)$adj.r.squared
## [1] 0.03654667

This is the proportion of variation in Cholesterol explained by the model (about 3.7%).
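As a usage example, the selected model can produce a prediction interval for a new subject (the blood-pressure values below are made up purely for illustration):

# Predicted Cholesterol with a 95% prediction interval (hypothetical inputs)
predict(e4.lr.subset.b,
        newdata = data.frame(Diastolic = 80, Systolic = 120),
        interval = "prediction")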


  • Compare the final models selected in a) and b). Also, compare the final models from the best subset approach with the final model from the stepwise selection.

Answer -:

Subset selection with Adj. R-Square: result (a)

    1. Final model selected by subset selection with the Adj. R-Square criterion:
      ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight
      Prediction power (R-Square): 0.03767

Subset selection with AIC: result (b)

    1. Final model selected by subset selection with the AIC criterion:
      ŷ = 159.3317 + 0.2770 * Diastolic + 0.3022 * Systolic
      Prediction power (R-Square): 0.03716

Stepwise selection: result (c)

    1. Final model selected by stepwise selection:
      ŷ = 159.3317 + 0.2770 * Diastolic + 0.3022 * Systolic
      Prediction power (R-Square): 0.03716

Conclusion:

  • The results from stepwise selection and subset selection with the AIC criterion are identical: the same two predictors and the same prediction power. Subset selection with the Adj. R-Square criterion keeps three predictors and has a slightly higher prediction power.

      - Results (b) and (c) have a slightly lower goodness of fit, but greater model simplicity (only two predictors)    
      - Result (a) has a higher goodness of fit, but more complexity (three predictors) and includes an insignificant variable (Weight)    

Since choosing the final model depends on the application, the best model is not always the one with the maximum fit. If out-of-sample prediction is the goal, the AIC criterion is the more appropriate guide, so we would choose model (b) (or (c), which is the same).
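For a compact side-by-side view of the two candidate final models, a small sketch reusing the fitted objects from above:

# Side-by-side fit statistics for the two candidate final models
sapply(list(adj.r2.model = e4.lr.subset.a, aic.model = e4.lr.subset.b),
       function(m) c(R2     = summary(m)$r.squared,
                     adj.R2 = summary(m)$adj.r.squared,
                     AIC    = AIC(m)))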