We will use heart.csv dataset. Below is brief summary of variables in heart.csv.
The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.
# Read the CSV file
d1 <- read.csv("heart.csv")
# Check data format
#str(d1)
consider the best subset selection for the Cholesterol model. Again, which has cook’s distance larger than 0.015, before performing the model selection.
e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight , data=d1)
summary(e2.lr)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## Weight 0.02146 0.02903 0.739 0.4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
In the other word, there exist linear regression for cholesterol as a function of diastolic, systolic, and weight.
Two methods:
- 1) Pairwise Scatter Plot
- 2) Quantitative Method (VIF's) pairs(d1,pch=19, col="darkblue")
vif(e2.lr)
## Diastolic Systolic Weight
## 2.558682 2.454214 1.120375
par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")
It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.
In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.
In the cook distance, there exist some pints greater than 0.015
cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)
e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight , data=d1[-e2.inf.id,])
summary(e2.lr2)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## Weight 0.03671 0.02860 1.284 0.1994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight
Calculate R2 and Adj.R2
summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
R Square is very small, thus, "Goodness of fit" or "Predictive power" is very low.
Adj. R Square is very small too.
with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))
abline(e2.lr,col="red")
## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients
abline(e2.lr2,col="green")
## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients
legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)
par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue")
Regression lines with/without influential points are almost the same.
e4.model.subset <- ols_step_best_subset(e2.lr2)
e4.model.subset
## Best Subsets Regression
## ----------------------------------------
## Model Index Predictors
## ----------------------------------------
## 1 Systolic
## 2 Diastolic Systolic
## 3 Diastolic Systolic Weight
## ----------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0350 0.0347 0.0337 8.6847 32349.7666 23461.5297 32367.9149 5604396.2122 1790.5412 0.5719 0.9662
## 2 0.0372 0.0365 0.0352 3.6475 32344.7321 23456.5056 32368.9298 5593610.3978 1787.6653 0.5710 0.9647
## 3 0.0377 0.0367 0.0351 4.0000 32345.0829 23456.8621 32375.3300 5592453.6261 1787.8655 0.5710 0.9648
## ----------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
For Adj. R Square, bigger is better, thus Adj. R-Square=0.0367 is related to Model #3, thus the best selected model is:
Model #3: Diastolic Systolic Weight
e4.lr.subset.a <- lm(Cholesterol ~ Diastolic+Systolic+Weight, data=d1[-e2.inf.id,])
summary(e4.lr.subset.a)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## Weight 0.03671 0.02860 1.284 0.1994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
The final Model - based on Adj. R-Square in subset method - is:
ŷ = 157.88394 + 0.25983 * Diastolic + 0.30106 * Systolic + 0.02146 * Weight
summary(e4.lr.subset.a)$r.squared
## [1] 0.03766896
summary(e4.lr.subset.a)$adj.r.squared
## [1] 0.03674601
This is the value which is explained by model.
For the AIC/BIC/SBC, smaller is better, thus smaller value is: 32344.7321, therefore the best selected model is:
Model #2: Diastolic Systolic
e4.lr.subset.b <- lm(Cholesterol ~ Diastolic+Systolic, data=d1[-e2.inf.id,])
summary(e4.lr.subset.b)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic, data = d1[-e2.inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.332 -29.399 -4.433 23.922 217.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.3317 5.8186 27.383 < 2e-16 ***
## Diastolic 0.2770 0.1044 2.652 0.00803 **
## Systolic 0.3022 0.0634 4.767 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3129 degrees of freedom
## Multiple R-squared: 0.03716, Adjusted R-squared: 0.03655
## F-statistic: 60.38 on 2 and 3129 DF, p-value: < 2.2e-16
The final Model - based on AIC in subset method - is:
ŷ = 159.3317 + 0.2770 * Diastolic + 0.3022 * Systolic
summary(e4.lr.subset.b)$r.squared
## [1] 0.0371621
summary(e4.lr.subset.b)$adj.r.squared
## [1] 0.03654667
This is the value which is explained by model.
Subset Selection with Adj. R-Square result
Subset Selection with AIC result
stepwise Selection result
The result from stepwise and subset selection with AIC criteria are the same with two predictor. also the prediction power is same.
but for subset section with Adj. R-Squared, there are three predictors and the prediction power is bigger.
- Results (B) and (C), have less goodness of fit, but have model simplicity (only two predictors)
- Result of (A) has more goodness of fit, but more complexity (three predictors) and includes an insignificant variable (Weight) Since, Choosing the final model is dependent to our business, best fitting model is not always maximum fitting model, therefore if prediction is important for us, thus we must choose model B (or C, which is same).