We will use the heart.csv dataset. Below is a brief summary of its variables.
The data set contains Weight, diastolic blood pressure, systolic blood pressure, and Cholesterol for the living subjects.
# Read the CSV file
d1 <- read.csv("heart.csv")
# Check data format
#str(d1)
The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.
plot(Cholesterol ~ Weight, data = d1, col = "darkblue")
cor(d1$Weight,d1$Cholesterol,method = "pearson")
## [1] 0.0695377
cor(d1$Cholesterol,d1$Weight,method = "spearman")
## [1] 0.1078544
Pearson correlation is sensitive to outliers, whereas Spearman correlation is robust to them. Both coefficients here are positive but small.
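The difference in outlier sensitivity can be illustrated with a small sketch. The data below are synthetic and purely hypothetical (they are not taken from heart.csv): a clean increasing trend, to which a single extreme point is then added.

```r
# Synthetic data (hypothetical, not from heart.csv): a clean increasing trend
set.seed(1)
x <- 1:30
y <- x + rnorm(30, sd = 2)

pearson_clean  <- cor(x, y, method = "pearson")
spearman_clean <- cor(x, y, method = "spearman")

# Add one extreme point far below the trend
x_out <- c(x, 31)
y_out <- c(y, -500)

pearson_out  <- cor(x_out, y_out, method = "pearson")
spearman_out <- cor(x_out, y_out, method = "spearman")

# How much each coefficient changes because of the single outlier
c(pearson_drop  = pearson_clean  - pearson_out,
  spearman_drop = spearman_clean - spearman_out)
```

In this setup the Pearson coefficient moves far more than the Spearman coefficient, because Spearman works on ranks and the outlier can shift each rank by at most one position.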
e1.lr <- lm( Cholesterol ~ Weight , data=d1)
summary(e1.lr)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.95 -29.59 -4.64 23.49 334.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.86763 4.24729 48.470 < 2e-16 ***
## Weight 0.10867 0.02786 3.901 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared: 0.004835, Adjusted R-squared: 0.004518
## F-statistic: 15.22 on 1 and 3132 DF, p-value: 9.778e-05
The F-statistic is significant (p-value = 9.778e-05, well below 0.05), so we reject the null hypothesis that the slope is zero; the model is therefore useful.
In other words, there is a statistically significant linear relationship between Cholesterol and Weight.
ŷ = 205.86763 + 0.10867 * Weight
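As a worked example, the fitted equation can be evaluated by hand. The sketch below copies the two coefficients from the output above; the helper name `predict_chol` and the example weights are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(e1.lr) output above
b0 <- 205.86763  # intercept
b1 <- 0.10867    # slope for Weight

# Hypothetical helper: predicted Cholesterol for a vector of weights
predict_chol <- function(weight) b0 + b1 * weight

# Example weights chosen for illustration only
pred <- predict_chol(c(100, 150, 200))
pred
```

With the fitted model in the session, `predict(e1.lr, newdata = data.frame(Weight = c(100, 150, 200)))` should give the same values up to the rounding of the printed coefficients.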
R-squared is very small (0.0048), so the goodness of fit, that is, the predictive power, is very low.
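The reported numbers can be cross-checked against each other. With a single predictor, R-squared equals the squared Pearson correlation, the F-statistic equals the squared t value of the slope, and R-squared equals F / (F + residual df). The sketch below only reuses values printed above, so small differences come from rounding in the output.

```r
# Values copied from the output above
r_pearson <- 0.0695377   # cor(Weight, Cholesterol), Pearson
f_stat    <- 15.22       # F-statistic
df_resid  <- 3132        # residual degrees of freedom
t_slope   <- 3.901       # t value for Weight

r2_from_r <- r_pearson^2                    # R^2 = r^2 with one predictor
r2_from_f <- f_stat / (f_stat + df_resid)   # R^2 = F / (F + df) with one predictor
f_from_t  <- t_slope^2                      # F = t^2 with one predictor

round(c(r2_from_r = r2_from_r, r2_from_f = r2_from_f, f_from_t = f_from_t), 6)
```

Both routes reproduce the reported R-squared of about 0.0048: Weight explains less than half a percent of the variation in Cholesterol.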
with(d1, plot(Cholesterol ~ Weight,col="darkblue"))
abline(e1.lr,col="red")
par(mfrow=c(2,2))
plot(e1.lr, which=c(1:4),col="darkblue") # default diagnostics plots
Homoscedasticity assumption:
The residuals-versus-fitted plot shows no clear pattern, so the homoscedasticity (constant variance) assumption appears reasonable.
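A visual check can be complemented by a formal test. The sketch below computes the studentized (Koenker) Breusch-Pagan statistic by hand in base R. This test is not part of the original analysis, and the data are synthetic, with error spread that grows with `x` by construction, so the test is expected to reject here.

```r
# Synthetic data (hypothetical): the error spread grows with x, so the
# constant-variance assumption is violated by construction
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

Applied to the cholesterol model, the same steps would use `resid(e1.lr)^2 ~ Weight` with `n <- nobs(e1.lr)`; a large p-value would be consistent with the visual impression of constant variance.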
Cook’s distance:
Some observations have a Cook’s distance larger than the chosen criterion (0.015).
# Keep full precision: rounding to 2 decimals would blur the 0.015 criterion
cook.d <- cooks.distance(e1.lr)
plot(cook.d,col="darkblue", pch=19, cex=1)
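The screening step can be sketched end to end on synthetic data. Everything below is hypothetical: the data contain one planted influential point, and the cutoff is the common 4/n rule of thumb rather than the 0.015 criterion used in this analysis. The sketch also guards against an empty index, since negative indexing with an empty vector would drop every row.

```r
# Synthetic data (hypothetical) with one planted influential point
set.seed(7)
n <- 100
x <- rnorm(n, mean = 150, sd = 20)
y <- 200 + 0.1 * x + rnorm(n, sd = 10)
x[n] <- 300   # far from the other x values (high leverage)
y[n] <- 500   # far from the trend (large residual)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
cd  <- cooks.distance(fit)

# 4/n is a common rule of thumb, not the 0.015 cutoff used in this analysis
threshold <- 4 / n
flagged   <- which(cd > threshold)

# Guard against an empty index: dat[-integer(0), ] would drop every row
dat_clean <- if (length(flagged) > 0) dat[-flagged, ] else dat
fit_clean <- lm(y ~ x, data = dat_clean)

rbind(all_points = coef(fit), without_flagged = coef(fit_clean))
```

The choice of cutoff is a judgment call; whichever value is used, it is worth inspecting the flagged rows before removing them.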
Remove the observations whose Cook’s distance exceeds the criterion (0.015) and refit the model.
inf.id <- which(cooks.distance(e1.lr) > 0.015)
#d1[inf.id, ]
e1.lr2 <- lm(Cholesterol ~ Weight, data=d1[-inf.id, ])
summary(e1.lr2)
##
## Call:
## lm(formula = Cholesterol ~ Weight, data = d1[-inf.id, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.369 -29.395 -4.482 23.672 209.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.57605 4.18543 48.639 < 2e-16 ***
## Weight 0.12264 0.02745 4.469 8.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared: 0.006339, Adjusted R-squared: 0.006022
## F-statistic: 19.97 on 1 and 3130 DF, p-value: 8.155e-06
ŷ = 203.57605 + 0.12264 * Weight
R-squared is still very small (0.0063), so the goodness of fit, that is, the predictive power, remains very low.
with(d1, plot(Cholesterol ~ Weight,col="darkblue"))
abline(e1.lr,col="red")
abline(e1.lr2,col="green")
legend("bottomright",col=c("red","green"),legend=c("w/ Inf. Points", "w/out Inf. Points"), cex=0.8, title.adj=0.15, lty=1)
par(mfrow=c(2,2))
plot(e1.lr2, which=c(1:4),col="darkblue")
After refitting, no observation has a Cook’s distance above the criterion (0.015).
The regression lines with and without the influential points are almost the same.
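How close the two lines are can be quantified by evaluating both fitted equations on a grid of weights. The coefficients below are copied from the two summaries above; the grid itself is an assumption made for illustration and is not taken from the data.

```r
# Coefficients copied from the two summaries above
coef_all   <- c(intercept = 205.86763, slope = 0.10867)  # all observations
coef_clean <- c(intercept = 203.57605, slope = 0.12264)  # influential points removed

# Illustrative weight grid (an assumption, not taken from the data)
weight <- seq(100, 250, by = 50)

pred_all   <- coef_all[["intercept"]]   + coef_all[["slope"]]   * weight
pred_clean <- coef_clean[["intercept"]] + coef_clean[["slope"]] * weight

cmp <- data.frame(weight, pred_all, pred_clean,
                  difference = pred_clean - pred_all)
cmp
```

Over this grid the two predictions differ by a little more than one unit of Cholesterol at most, which is small relative to the residual standard error of about 43.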
Based on step 2, the correlation coefficient (rho) is positive and non-zero, so there is a direct, although weak, relation between Weight and Cholesterol.
Based on step 3 and step 7 (with and without the influential points), the p-value is significant, so there is a linear relationship between Cholesterol and Weight.
Based on step 3 and step 7 (with and without the influential points), R-squared is very small (0.0048 and 0.0063), so the goodness of fit, that is, the predictive power, is poor: the model is useful but not good enough for prediction.
In other words, this is not a good model for predicting Cholesterol from Weight, even though the model is useful.
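The gap between "significant" and "good for prediction" can be made concrete with a rough calculation. The sketch below uses only values printed in the first summary and approximates the half-width of a 95% prediction interval by ignoring the uncertainty in the estimated coefficients, an assumption that is reasonable only because the sample is large.

```r
# Values copied from the summary(e1.lr) output
slope <- 0.10867   # change in predicted Cholesterol per unit of Weight
rse   <- 43.62     # residual standard error
df    <- 3132      # residual degrees of freedom

# Approximate half-width of a 95% prediction interval, ignoring the
# (small) uncertainty in the estimated coefficients
half_width <- qt(0.975, df) * rse

# Shift in the prediction for a 100-unit difference in Weight
shift_100 <- slope * 100

c(half_width = half_width, shift_100 = shift_100)
```

A 100-unit difference in Weight moves the prediction by about 11 units of Cholesterol, while an individual prediction is uncertain by roughly plus or minus 85 units, so Weight alone says little about any one subject.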