Linear regression

We will use the heart.csv dataset. Below is a brief summary of the variables in heart.csv.

The dataset contains Weight, Diastolic blood pressure, Systolic blood pressure, and Cholesterol for the living subjects in heart.csv.

# Read the CSV file
d1 <- read.csv("heart.csv") 

# Check data format
#str(d1)

The medical director wants to know if blood pressures and weight can better predict cholesterol outcome. Consider modeling cholesterol as a function of diastolic, systolic, and weight.


Question (a):

  • Fit a linear regression model for cholesterol as a function of diastolic, systolic, and weight. Generate the diagnostics plots and comment on any issues that need to be noted. For Cook's distances, do not leave any points that have Cook's distance greater than 0.015.

Answer -:

Step 1) Fit the linear regression (multicollinearity is checked in Step 2)
e2.lr <- lm( Cholesterol ~ Diastolic+Systolic+Weight  , data=d1)
summary(e2.lr)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.27  -29.58   -4.56   23.66  329.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 157.88394    6.37201  24.778  < 2e-16 ***
## Diastolic     0.25983    0.10838   2.397   0.0166 *  
## Systolic      0.30106    0.06443   4.672  3.1e-06 ***
## Weight        0.02146    0.02903   0.739   0.4597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.95 on 3130 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  0.03513 
## F-statistic: 39.03 on 3 and 3130 DF,  p-value: < 2.2e-16
Result:
  • The p-value of the overall F-test is significant (< 2.2e-16). Thus, the model is useful.

In other words, there is a linear relationship between cholesterol and at least one of diastolic, systolic, and weight.
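The overall F-statistic can be recovered from the reported R-squared, which makes the link between the two explicit. The following is a minimal sketch using only the rounded figures printed in the summary above (R-squared 0.03606, 3134 observations, 3 predictors); the variable names are illustrative.

```r
# Values copied from the summary() output above (rounded as printed)
r2 <- 0.03606   # Multiple R-squared
n  <- 3134      # observations: 3130 residual df + 4 estimated coefficients
p  <- 3         # predictors: Diastolic, Systolic, Weight

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail p-value of the F distribution with (p, n - p - 1) degrees of freedom
p_val <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)
print(p_val)
```

With a sample this large, even a very small R-squared gives a highly significant F-test, so significance alone says little about predictive power.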

Step 2) Perform Correlation to find Multicollinearity
  • Two methods:

        - 1) Pairwise Scatter Plot  
    
        - 2) Quantitative Method (VIF's)  
pairs(d1,pch=19, col="darkblue")

Result:
  • There appears to be some correlation between Diastolic and Systolic. To be sure, we will perform a quantitative check (VIFs).
library(car)  # provides vif()
vif(e2.lr)
## Diastolic  Systolic    Weight 
##  2.558682  2.454214  1.120375
Result:
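  • No VIF exceeds 10, so there is no serious multicollinearity among the predictors. The VIFs of about 2.5 for Diastolic and Systolic do reflect the moderate correlation between the two blood pressures seen in the scatter plot.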
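To make the VIF values easier to interpret, the sketch below computes them by hand in base R as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. It runs on simulated data (the columns only mimic the names in heart.csv, and the numbers are made up), so it illustrates the definition rather than reproducing the output above. The last lines convert the VIFs reported above back into the implied R²_j.

```r
# Simulated stand-in data (assumption: NOT the heart.csv values)
set.seed(42)
n <- 500
systolic  <- rnorm(n, mean = 130, sd = 20)
diastolic <- 30 + 0.4 * systolic + rnorm(n, sd = 8)  # correlated with systolic
weight    <- rnorm(n, mean = 150, sd = 25)
X <- data.frame(Diastolic = diastolic, Systolic = systolic, Weight = weight)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2_j <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2_j)
})
print(round(vif_manual, 3))

# Share of each predictor's variance explained by the other predictors,
# implied by the VIFs reported in the output above
reported_vif <- c(Diastolic = 2.558682, Systolic = 2.454214, Weight = 1.120375)
implied_r2 <- 1 - 1 / reported_vif
print(round(implied_r2, 3))
```

A VIF near 2.5 therefore corresponds to roughly 60% of that predictor's variance being explained by the others: correlated, but below the usual cutoff of 10.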
Step 3) Perform Diagnostic Plot
par(mfrow=c(2,2))
plot(e2.lr,which=1:4,col="darkblue")

Result:
  • The Normal Q-Q plot looks approximately normal, and in the Scale-Location plot (square root of the absolute standardized residuals) about 95% of the points lie between 0 and 1.5, so the residuals appear to follow a normal distribution.

  • The Residuals vs Fitted plot shows no pattern, so the homoscedasticity assumption appears to hold.

  • In the Cook's distance plot, some points are greater than 0.015.

Step 4) Influential points
cook.d2 <- cooks.distance(e2.lr)
plot(cook.d2,col="darkblue",pch=19,cex=1)

  • Delete observations whose Cook's distance exceeds the cutoff (0.015)
e2.inf.id <- which(cooks.distance(e2.lr)>0.015)
e2.lr2 <- lm(Cholesterol ~ Diastolic+Systolic+Weight  , data=d1[-e2.inf.id,])
summary(e2.lr2)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
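One caveat about the removal step: negative indexing with which() silently returns an empty data frame when no point exceeds the cutoff, and the positions can be misaligned with the rows of the data frame if lm() dropped rows containing NA. The sketch below shows a more defensive variant that matches on row names. It uses simulated data with one planted influential point; the data and the names d, fit, and d_clean are hypothetical.

```r
# Simulated data with one planted influential observation (assumption: not heart.csv)
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)
d$x[5] <- 4    # high leverage
d$y[5] <- 60   # far from the trend

fit <- lm(y ~ x, data = d)
cd  <- cooks.distance(fit)   # named by the row names of the rows used in the fit
cutoff <- 0.015

# Flag by row name, then keep every row that is not flagged
flagged <- names(cd)[cd > cutoff]
d_clean <- d[!(rownames(d) %in% flagged), ]
fit_clean <- lm(y ~ x, data = d_clean)

# The pitfall: an empty index combined with "-" drops every row
nrow(d[-integer(0), ])
```

If nothing is flagged, the logical filter keeps all rows, whereas `d[-which(...), ]` would keep none.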
Linear regression model:
    ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight    
R Square (R2) and Adj. R Square:

Calculate R2 and Adj.R2

summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
    R-squared is very small (about 0.038), so the goodness of fit, and with it the predictive power, is very low.  
    
    Adjusted R-squared is very small too (about 0.037).  
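The adjusted value follows directly from R-squared, the sample size, and the number of predictors. As a quick check, this sketch recomputes it from the figures printed above (3132 observations after removing the influential points, 3 predictors); the variable names are illustrative.

```r
# Values copied from the output above
r2 <- 0.03766896   # summary(e2.lr2)$r.squared
n  <- 3132         # observations: 3128 residual df + 4 estimated coefficients
p  <- 3            # predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)
```

Because n is so much larger than p, the penalty is tiny and the two values are almost identical.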
Step 5) Plot scatter, with and without influential points
with(d1,plot(Cholesterol ~ Diastolic+Systolic+Weight, col="darkblue"))

abline(e2.lr,col="red")
## Warning in abline(e2.lr, col = "red"): only using the first two of 4 regression
## coefficients
abline(e2.lr2,col="green")
## Warning in abline(e2.lr2, col = "green"): only using the first two of 4
## regression coefficients
legend("bottomright",col=c("red","green"),legend = c("W/Inf. points","W/out Inf. points"),cex=0.8,title.adj = 0.15,lty=1)

Result:
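  • Because the dataset is large (3134 observations) and only 2 influential points were removed, the two fitted lines lie very close together, with the red line just below the green line. Note that, as the warnings indicate, abline() draws each line from only the intercept and the first slope (Diastolic), so the lines are a rough visual comparison rather than the full three-predictor fits.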
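Since abline() cannot represent a model with three predictors, one alternative is to draw the fitted line for a single predictor while holding the others at their means. The sketch below does this with predict() on simulated data. The data frame d, the grid, and the coefficients used to simulate are assumptions for illustration, not values from heart.csv.

```r
# Simulated stand-in data (assumption: NOT the heart.csv values)
set.seed(7)
n <- 300
d <- data.frame(
  Diastolic = rnorm(n, mean = 85,  sd = 12),
  Systolic  = rnorm(n, mean = 135, sd = 22),
  Weight    = rnorm(n, mean = 150, sd = 28)
)
d$Cholesterol <- 150 + 0.25 * d$Diastolic + 0.30 * d$Systolic + rnorm(n, sd = 40)

fit <- lm(Cholesterol ~ Diastolic + Systolic + Weight, data = d)

# Vary Systolic over its observed range; fix the other predictors at their means
grid <- data.frame(
  Systolic  = seq(min(d$Systolic), max(d$Systolic), length.out = 50),
  Diastolic = mean(d$Diastolic),
  Weight    = mean(d$Weight)
)
grid$fit <- predict(fit, newdata = grid)

# Scatter of the raw data with the partial fitted line on top
plot(Cholesterol ~ Systolic, data = d, col = "darkblue")
lines(grid$Systolic, grid$fit, col = "red", lwd = 2)
```

The slope of the drawn line equals the Systolic coefficient of the full model, so the plot shows the model's estimated effect of one predictor with no warning from abline().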
Step 6) Diagnostic plot for without influential points
par(mfrow=c(2,2))
plot(e2.lr2, which=c(1:4),col="darkblue") 

Conclusion:

Regression lines with/without influential points are almost the same.


Question (b):

  • Comment on the significance of the parameters and how much variation in cholesterol is described by the model. Comment on the relationship between cholesterol and statistically significant predictor(s). Check multicollinearity issue among predictors. Explain to the medical director whether this is a good model for the prediction of Cholesterol levels.

Answer -:

summary(e2.lr2)
## 
## Call:
## lm(formula = Cholesterol ~ Diastolic + Systolic + Weight, data = d1[-e2.inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.617  -29.371   -4.476   23.755  216.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.32618    6.27153  24.926  < 2e-16 ***
## Diastolic     0.24922    0.10665   2.337   0.0195 *  
## Systolic      0.30073    0.06340   4.743  2.2e-06 ***
## Weight        0.03671    0.02860   1.284   0.1994    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.26 on 3128 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03675 
## F-statistic: 40.81 on 3 and 3128 DF,  p-value: < 2.2e-16
summary(e2.lr2)$r.squared
## [1] 0.03766896
summary(e2.lr2)$adj.r.squared
## [1] 0.03674601
Result:
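    - Diastolic and Systolic are statistically significant (p-value < 0.05). Holding the other predictors fixed, a one-unit increase in Diastolic is associated with an increase of about 0.25 in predicted cholesterol, and a one-unit increase in Systolic with an increase of about 0.30.  
    - Weight is not statistically significant (p-value > 0.05).  

    - Based on these results, the linear regression model is:   

    ŷ = 156.32618 + 0.24922 * Diastolic + 0.30073 * Systolic + 0.03671 * Weight   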
Check the multicollinearity:
    In Step 2 of part (a) we checked multicollinearity with these two methods:  
    - 1) Pairwise Scatter Plot  
    - 2) Quantitative Method (VIFs)  
    No VIF exceeds 10, so there is no serious multicollinearity among the predictors.  

Conclusion:

    - Since the p-value of the overall F-test is significant, there is a linear relationship between cholesterol and diastolic, systolic, and weight. However, the p-value of Weight is not significant, so model selection should be used to find the best model (done in Exercise 3).  

    - R-squared and adjusted R-squared are both small (about 0.037), so the model explains only about 3.7% of the variation in cholesterol. The model is statistically useful, but its goodness of fit is too low for it to be a good model for predicting cholesterol.
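To give the medical director a concrete sense of what "not good enough for prediction" means, the residual standard error can be turned into an approximate 95% prediction margin. This is a rough sketch based on the printed residual standard error (42.26 on 3128 degrees of freedom); it assumes normally distributed errors and ignores the small extra uncertainty from the estimated coefficients.

```r
# Values copied from the summary(e2.lr2) output above
sigma_hat <- 42.26   # residual standard error
df_resid  <- 3128    # residual degrees of freedom

# Approximate half-width of a 95% prediction interval for a single subject
half_width <- qt(0.975, df = df_resid) * sigma_hat
print(half_width)
```

A predicted cholesterol value would therefore carry a margin of roughly plus or minus 83 units for an individual subject, which is wide relative to the fitted effects of the predictors.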