Linear Regression

We will use the heart.csv data set. Below is a brief summary of its variables.

The data set contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for the living subjects in heart.csv.

# Read the CSV file
d1 <- read.csv("heart.csv") 

# Check data format
#str(d1)

The medical director at your company wants to know if Weight alone can predict Cholesterol outcomes. Consider modeling Cholesterol as a function of Weight.

Question (a):

  • Fit a linear regression model for Cholesterol as a function of Weight. If any points are unduly influential, note those points, then remove them and refit the model. Use a Cook’s distance cutoff of 0.015.

Answer -:

Step 1) Scatter Plot
plot(Cholesterol ~ Weight, data=d1, col="darkblue")

Step 2) Find Correlation coefficient
cor(d1$Weight,d1$Cholesterol,method = "pearson")
## [1] 0.0695377
cor(d1$Cholesterol,d1$Weight,method = "spearman")
## [1] 0.1078544

“Pearson” correlation is sensitive to outliers, whereas “Spearman” (rank) correlation is robust to them.
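The difference in robustness can be illustrated with a minimal sketch. The data below are synthetic (an assumption for illustration only, not taken from heart.csv): a single extreme point is added to an otherwise clean linear relationship, and the change in each coefficient is compared.

```r
# Synthetic illustration only; these numbers are not from heart.csv.
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)

# Correlations on the clean data
pearson_clean  <- cor(x, y, method = "pearson")
spearman_clean <- cor(x, y, method = "spearman")

# Add one extreme point that runs against the trend
x_out <- c(x, 10)
y_out <- c(y, -10)

pearson_out  <- cor(x_out, y_out, method = "pearson")
spearman_out <- cor(x_out, y_out, method = "spearman")

# Change in each coefficient caused by the single outlier
c(pearson_shift  = pearson_clean - pearson_out,
  spearman_shift = spearman_clean - spearman_out)
```

In this sketch the Pearson coefficient moves far more than the Spearman coefficient, because Spearman only sees the ranks of the added point.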

Step 3) Fit Linear Regression
e1.lr <- lm( Cholesterol ~ Weight  , data=d1)
summary(e1.lr)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.95  -29.59   -4.64   23.49  334.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.86763    4.24729  48.470  < 2e-16 ***
## Weight        0.10867    0.02786   3.901 9.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.62 on 3132 degrees of freedom
## Multiple R-squared:  0.004835,   Adjusted R-squared:  0.004518 
## F-statistic: 15.22 on 1 and 3132 DF,  p-value: 9.778e-05

Result:

The F-statistic is significant (p-value = 9.778e-05 < 0.05), so we reject the null hypothesis that the Weight coefficient is zero; in that sense the model is useful.
In other words, there is a statistically significant linear relationship between Cholesterol and Weight.
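As a check on the reported test, the p-value can be recomputed from the F-statistic and its degrees of freedom. The sketch below only reuses the rounded values printed in the summary above, so the result agrees with the printed p-value only up to rounding.

```r
# Values copied from the summary(e1.lr) output above
f_stat   <- 15.22
df1      <- 1
df2      <- 3132
t_weight <- 3.901

# Upper-tail probability of the F distribution = overall model p-value
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value

# With a single predictor, F equals the squared t statistic of the slope
t_weight^2
```

With one predictor, the overall F-test and the t-test on the Weight coefficient are the same test, which is why both report the same p-value.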

Linear regression model:
    ŷ = 205.86763 + 0.10867 * Weight  

R Square (R2):

R² is very small (0.0048): Weight explains less than 0.5% of the variation in Cholesterol, so the goodness of fit, and hence the predictive power, is very low.
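In a regression with a single predictor, R² equals the squared Pearson correlation, so the value from step 2 can be used to cross-check the summary output. A minimal sketch using only the numbers printed above:

```r
# Values copied from the output above
r_pearson          <- 0.0695377   # cor(d1$Weight, d1$Cholesterol)
r_squared_reported <- 0.004835    # Multiple R-squared from summary(e1.lr)

r_pearson^2          # squared correlation
100 * r_pearson^2    # percent of variation in Cholesterol explained by Weight
```

The squared correlation matches the reported Multiple R-squared up to rounding.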

Step 4) Now draw a scatter plot again, include the regression line
with(d1, plot(Cholesterol ~ Weight,col="darkblue"))
abline(e1.lr,col="red")

Step 5) Diagnostic plots
par(mfrow=c(2,2))
plot(e1.lr, which=c(1:4),col="darkblue")    # default diagnostics plots

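Result:
  • Normality assumption:
  1. Based on the Normal Q-Q plot, the residuals appear to follow a normal distribution approximately, although the residual summary in step 3 (minimum -111.95, maximum 334.35) points to a longer right tail.
  2. Based on the Scale-Location plot, most points (over 95%) lie between 0 and 1.5 on the square-root scale, i.e. their absolute standardized residuals are below about 2.25, which is consistent with approximate normality.

  • Homoscedasticity assumption:

  • There is no pattern in the Residuals vs Fitted plot, so the constant-variance assumption appears reasonable.

  • Cook’s distance:

  • Some observations have a Cook’s distance above the cutoff (0.015).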
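The visual checks above can be backed by a few numbers. The sketch below is a minimal example on synthetic data (an assumption: the simulated Weight and Cholesterol values only mimic the scale of the fitted model and are not heart.csv); on the real data the same calls would be applied to e1.lr.

```r
# Synthetic stand-in for heart.csv (assumption): the real data are not used here.
set.seed(42)
n <- 500
sim <- data.frame(Weight = rnorm(n, mean = 150, sd = 25))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, sd = 40)

fit <- lm(Cholesterol ~ Weight, data = sim)
z <- rstandard(fit)                 # standardized residuals

# Share of standardized residuals within +/- 2 (about 95% under normality)
share_within_2 <- mean(abs(z) < 2)

# Rough symmetry check: compare the size of the two tails
tail_ratio <- abs(max(z)) / abs(min(z))

# Formal normality test on the residuals (Shapiro-Wilk, valid for n <= 5000)
sw <- shapiro.test(z)

c(share_within_2 = share_within_2, tail_ratio = tail_ratio, p_shapiro = sw$p.value)
```

A tail ratio far from 1, or a very small Shapiro-Wilk p-value, would indicate a departure from normality that the plots alone may understate.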

Step 6) Influential Points
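cook.d <- cooks.distance(e1.lr)   # keep full precision: rounding to 2 digits hides the 0.015 cutoff
plot(cook.d,col="darkblue", pch=19, cex=1)
abline(h=0.015, col="red", lty=2)   # cutoff line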

Step 7) Delete Influential Points

Delete the observations whose Cook’s distance exceeds the cutoff (0.015).

inf.id <- which(cooks.distance(e1.lr) > 0.015)
#d1[inf.id, ]
e1.lr2 <- lm(Cholesterol ~ Weight, data=d1[-inf.id, ])
summary(e1.lr2)
## 
## Call:
## lm(formula = Cholesterol ~ Weight, data = d1[-inf.id, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.369  -29.395   -4.482   23.672  209.348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 203.57605    4.18543  48.639  < 2e-16 ***
## Weight        0.12264    0.02745   4.469 8.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.92 on 3130 degrees of freedom
## Multiple R-squared:  0.006339,   Adjusted R-squared:  0.006022 
## F-statistic: 19.97 on 1 and 3130 DF,  p-value: 8.155e-06
Linear regression model:
    ŷ = 203.57605 + 0.12264 * Weight  

R Square (R2):

R² is still very small (0.0063), so the goodness of fit, and hence the predictive power, remains very low.

Step 8) Plot scatter, with and without influential points
with(d1, plot(Cholesterol ~ Weight,col="darkblue"))
abline(e1.lr,col="red")
abline(e1.lr2,col="green")
legend("bottomright",col=c("red","green"),legend=c("w/ Inf. Points", "w/out Inf. Points"), cex=0.8, title.adj=0.15, lty=1)

Result:
  • Because the data set is large (3134 observations) and only 2 influential points were removed,
    the two regression lines are very close together, with the red line largely covered by the green one.
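How close the two lines are can be checked directly from the reported coefficients. The sketch below uses only the estimates printed in steps 3 and 7; the weights at which the gap is evaluated are hypothetical values chosen for illustration.

```r
# Coefficients copied from summary(e1.lr) and summary(e1.lr2) above
b_with    <- c(intercept = 205.86763, slope = 0.10867)  # with influential points
b_without <- c(intercept = 203.57605, slope = 0.12264)  # without them

# Weight at which the two fitted lines cross
w_cross <- (b_with[["intercept"]] - b_without[["intercept"]]) /
           (b_without[["slope"]] - b_with[["slope"]])

# Gap between the lines (with minus without) at a few hypothetical weights
w <- c(100, 150, 200, 250)
gap <- (b_with[["intercept"]] + b_with[["slope"]] * w) -
       (b_without[["intercept"]] + b_without[["slope"]] * w)

w_cross
round(gap, 2)
```

The lines cross near Weight = 164, and over these weights they differ by less than 2 Cholesterol units, which is small relative to the residual standard error of about 43.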

Step 9) Diagnostic plots for the model without influential points
par(mfrow=c(2,2))
plot(e1.lr2, which=c(1:4),col="darkblue")   

No remaining observation has a Cook’s distance above the cutoff (0.015).
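The same check can be done numerically instead of by eye. Cook's distances are recomputed for the refitted model, so a point that was under the cutoff before may exceed it afterwards. The sketch below runs on synthetic data with two planted extreme points (an assumption, not heart.csv) and also guards against an empty index, since subsetting with `-integer(0)` would drop every row.

```r
# Synthetic data with two planted extreme points (assumption, not heart.csv)
set.seed(7)
n <- 300
sim <- data.frame(Weight = rnorm(n, 150, 25))
sim$Cholesterol <- 205 + 0.1 * sim$Weight + rnorm(n, sd = 40)
sim$Weight[1:2]      <- c(300, 310)
sim$Cholesterol[1:2] <- c(550, 560)

cutoff <- 0.015   # cutoff from the question
fit1 <- lm(Cholesterol ~ Weight, data = sim)
inf.id <- which(cooks.distance(fit1) > cutoff)

# Guard: sim[-integer(0), ] would return zero rows, so only subset when needed
keep <- if (length(inf.id) > 0) sim[-inf.id, ] else sim
fit2 <- lm(Cholesterol ~ Weight, data = keep)

# Re-check the cutoff on the refitted model
n_flagged_before <- length(inf.id)
n_flagged_after  <- sum(cooks.distance(fit2) > cutoff)
c(before = n_flagged_before, after = n_flagged_after)
```

On the real data, the equivalent check would be `sum(cooks.distance(e1.lr2) > 0.015)`.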

Conclusion:

Regression lines with/without influential points are almost the same.


Question (b):

  • Comment on the significance of the parameters, variation explained by the model, and any remaining issues noted in the diagnostics plots. What does this model tell us about the relationship between Cholesterol and Weight? Interpret the relationship specifically. Explain to the medical director whether this is a good model for the prediction of Cholesterol levels.

Answer -:

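  1. Based on step 2, the correlation coefficient is positive but close to zero (Pearson 0.07, Spearman 0.11), so Cholesterol and Weight have a weak positive association.

  2. Based on steps 3 and 7 (with and without influential points), the Weight coefficient is significant (p-value < 0.001 in both fits), so there is a statistically significant linear relationship between Cholesterol and Weight. In the refitted model, each one-unit increase in Weight is associated with an average increase of about 0.12 in Cholesterol.

  3. Based on steps 3 and 7 (with and without influential points), R² is very small (0.0048 and 0.0063), so Weight explains less than 1% of the variation in Cholesterol. The model is statistically useful, but its goodness of fit is too low for prediction.

In other words, Weight alone is not a good predictor of Cholesterol, even though the relationship is statistically significant.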
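To show the medical director what low predictive power means in practice, an approximate 95% prediction interval can be computed from the refitted model. The sketch below uses only the values printed in step 7; the weight of 150 is a hypothetical input, and the interval ignores the uncertainty in the estimated coefficients, which is an approximation that is reasonable only because the sample is large.

```r
# Values copied from summary(e1.lr2) above
b0        <- 203.57605   # intercept
b1        <- 0.12264     # slope for Weight
sigma_hat <- 42.92       # residual standard error
df_resid  <- 3130        # residual degrees of freedom

weight_new <- 150        # hypothetical weight, chosen only for illustration

fit_value <- b0 + b1 * weight_new

# Approximation: ignores uncertainty in b0 and b1, which is small for n > 3000
half_width <- qt(0.975, df_resid) * sigma_hat

c(fit = fit_value, lower = fit_value - half_width, upper = fit_value + half_width)
```

The predicted Cholesterol is about 222, but the interval spans roughly 84 units on either side, so the prediction for an individual subject is too imprecise to be useful. With the fitted model object available, `predict(e1.lr2, newdata, interval = "prediction")` gives the exact interval.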