We will use heart.csv dataset. Below is brief summary of variables in heart.csv.
The data set contains Weight, Diastolic Blood pressure, Systolic blood pressure and Cholesterol for alive subjects in the heart.csv.
# Read the CSV file
d1 <- read.csv("heart.csv")
# Check data format
#str(d1)
The medical director wants to know if blood pressures and weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of diastolic, systolic, and weight.
e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight , data=d1)
summary(e2.lr)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.27 -29.58 -4.56 23.66 329.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.88394 6.37201 24.778 < 2e-16 ***
## Diastolic 0.25983 0.10838 2.397 0.0166 *
## Systolic 0.30106 0.06443 4.672 3.1e-06 ***
## Weight 0.02146 0.02903 0.739 0.4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: 0.03513
## F-statistic: 39.03 on 3 and 3130 DF, p-value: < 2.2e-16
In the other word, there exist linear regression for cholesterol as a function of diastolic, systolic, and weight.
Two methods:
- 1) Pairwise Scatter Plot
- 2) Quantitative Method (VIF's) pairs(d1,pch=19, col="darkblue")
vif(e2.lr)
## Diastolic Systolic Weight
## 2.558682 2.454214 1.120375
par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")
It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.
In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.
In the cook distance, there exist some pints greater than 0.015
cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)
e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight , data=d1[-e2.inf.id,])
summary(e2.lr2)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## Weight 0.03671 0.02860 1.284 0.1994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
ลท = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight
Calculate R2 and Adj.R2
summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
R Square is very small, thus, "Goodness of fit" or "Predictive power" is very low.
Adj. R Square is very small too.
with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))
abline(e2.lr,col="red")
## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients
abline(e2.lr2,col="green")
## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients
legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)
par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue")
Regression lines with/without influential points are almost the same.
summary(e2.lr2)
##
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.617 -29.371 -4.476 23.755 216.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.32618 6.27153 24.926 < 2e-16 ***
## Diastolic 0.24922 0.10665 2.337 0.0195 *
## Systolic 0.30073 0.06340 4.743 2.2e-06 ***
## Weight 0.03671 0.02860 1.284 0.1994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03675
## F-statistic: 40.81 on 3 and 3128 DF, p-value: < 2.2e-16
summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
- Diastolic and Systolic have significant value (Pvalue < 0.05)
- Weight has not significant value (Pvalue > 0.05)
- Based on this results, linear regression model:
ลท = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight
in <step 2-part (a)> we checked the multidisciplinary with these two methods:
- 1) Pairwise Scatter Plot
- 2) Quantitative Method (VIF's)
There is no VIF>10, thus there exist no correlation between predictors.
- Since P-Value is significant, there exist a linear regression for cholesterol as function of diastolic and systolic and weight
although Pvalue of Weight is not significant, thus we must use a model selection to detect best model (Will do it in Exercise 3)
- R square and Adj. R Square are small, thus "Power of Prediction" or "Goodness of fit" is not good enough,
it means model is useful but is not good enough for prediction.