We set out to build a logistic regression model that accurately predicts heart disease in patients, based on data from the Cleveland Clinic Foundation. To do this, we first fit a logistic regression on multiple potential predictor variables and then distilled them into the “best” set of predictors through a model-selection process. The revised logistic regression considers only the following predictor variables: cp (chest pain), sex, trestbps (resting blood pressure), slope (the slope of the peak exercise ST segment), ca (the number of major blood vessels (0-3) colored by fluoroscopy), and thal (thallium stress test result).
Our results indicated that both the original and revised logistic regression models failed the Hosmer-Lemeshow goodness-of-fit test, suggesting that neither model fit the data adequately. Comparing the deviance residuals of the two models, we found that the second model’s median was closer to 0 and its minimum and maximum values were more symmetric than those of the first model. We also observed that the misclassification error rate for our second model (14.3%) was lower than the error rate for our first model (16.1%). Ultimately, while the revised model has reasonable predictive power (>85% prediction accuracy), the failed goodness-of-fit tests mean we must take our models’ reliability and predictions with a grain of salt.
We then put forth several suggestions that may increase the reliability of our logistic models. First, a larger heart disease dataset, as well as variables more directly linked to heart disease (e.g., patient smoking history, obesity, and diabetes), would help in building more accurate logistic models. We also noticed that partitioning our dataset left the testing set much smaller than the training set, which might explain the higher error rates on the testing data compared to the training data. In addition, adjusting the heart disease prediction threshold (50%) might yield more accurate predictions. Implementing these improvements may increase the predictive power of future heart disease models.
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 5 levels "?","0.0","1.0",..: 2 5 4 2 2 2 4 2 3 2 ...
## $ thal : Factor w/ 4 levels "?","3.0","6.0",..: 3 2 4 2 2 2 2 2 4 4 ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "2","3","4","5": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "2","3","4": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
## [1] 6
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 88 53 F 3 128 216 0 2 115 0 0.0 1 2 <NA>
## 167 52 M 3 138 223 0 0 169 0 0.0 1 <NA> 2
## 193 43 M 4 132 247 1 2 143 1 0.1 2 <NA> 4
## 267 52 M 4 128 204 1 0 156 1 1.0 2 2 <NA>
## 288 58 M 2 125 220 0 0 144 0 0.4 2 <NA> 4
## 303 38 M 3 138 175 0 0 173 0 0.0 1 <NA> 2
## hd
## 88 Healthy
## 167 Healthy
## 193 Unhealthy
## 267 Unhealthy
## 288 Healthy
## 303 Healthy
## [1] 303
## [1] 297
The objective of this project is to build a logistic regression model that accurately predicts heart disease in patients given a wide set of variables (chest pain, sex, cholesterol levels, resting ECG results, etc.). To build our model, we used the Heart Disease dataset from the UC Irvine Machine Learning Repository. Our null and alternative hypotheses for the logistic regression are:
Null Hypothesis: There is no relationship between the predictor variables and the response.
Alternative Hypothesis: There exists a relationship between the predictor variables and the response.
In our research, we built two logistic regression models and compared each model’s reliability and predictive power. The first model used all predictor variables, and the second model was refined, based on the first model’s results, to use only predictor variables that were statistically significant at the 5% level. Then, we used the Hosmer-Lemeshow goodness-of-fit test to assess each model’s fit. We expected the testing data results to closely mirror those of the training data, and further expected the second model to make more accurate predictions than the first model. Thus, we hypothesized that we would fail to reject the null hypothesis. Ultimately, we found that the refined logistic model yielded more accurate results compared to the all-inclusive model.
We consolidated the questions of interest we hoped to answer into the following:
How reliable are predictor variables such as cp (chest pain), sex, trestbps (resting blood pressure), slope (the slope of the peak exercise ST segment), ca (the number of major blood vessels (0-3) colored by fluoroscopy), and thal (thallium stress test result) from our heart disease dataset for predicting heart disease status with a logistic regression model?
How accurate are our models at predicting heart disease using the training data versus the testing data? We evaluate predictive power using misclassification error rates derived from confusion matrices.
Do our logistic regression models sufficiently fit our data? We evaluate this using the Hosmer-Lemeshow goodness-of-fit test.
Our method focuses on producing logistic regressions that best predict heart disease from various predictor variables. We started by putting our dataset into a suitable format for logistic regression: re-labeling columns, converting categorical variables to factors (e.g., recoding the numeric response hd, heart disease, so that 0 is healthy and 1 or greater is unhealthy), and partitioning the data into training (80%) and testing (20%) sets. To check our model assumptions, we used the xtabs() function to verify that both healthy and unhealthy patients appear at every level of each categorical predictor. For example, the command xtabs(~ hd + sex, data = data) determines whether healthy and unhealthy patients are both represented among female and male samples. We repeated this check for every categorical variable used to predict heart disease. Moreover, the deviance residuals reported by the summary() of each regression suggest valid assumptions for both models. For the first model, the minimum and maximum deviance residuals are nearly symmetric, at -3.0490 and 2.9086 respectively, with a median of -0.1213 that is close to 0. Similarly, the minimum and maximum for the second model are -2.9626 and 2.9125, with a median of -0.1170, also close to 0.
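To make these steps concrete, below is a minimal R sketch of the data preparation and xtabs() checks described above. The download URL is the standard UCI location for the processed Cleveland data; the object names and the exact order of the recoding steps are our illustrative assumptions, not the report's verbatim code.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")

# "?" marks missing values in the raw file; drop the 6 affected rows (303 -> 297)
data[data == "?"] <- NA
data <- na.omit(data)

# Convert categorical predictors to factors and relabel sex
data$sex <- factor(data$sex, levels = c(0, 1), labels = c("F", "M"))
for (v in c("cp", "fbs", "restecg", "exang", "slope", "ca", "thal")) {
  data[[v]] <- factor(data[[v]])
}

# hd is coded 0 (healthy) through 4; collapse it to a binary response
data$hd <- factor(ifelse(data$hd == 0, "Healthy", "Unhealthy"))

# Assumption check: every level of each categorical predictor should
# contain both healthy and unhealthy patients (no empty cells)
xtabs(~ hd + sex, data = data)
xtabs(~ hd + cp, data = data)   # repeated for each categorical predictor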
While we considered using a simple logistic regression model with one predictor variable, the statistically insignificant results from anticipated variables such as chol (cholesterol level) and age led us to consider multiple predictors instead. We therefore first predicted hd (heart disease) using all variables with the command mylogit <- glm(hd ~ ., data = data, family = "binomial"). Then, after a model-selection process to select the best set of predictors that are statistically significant (p-value less than 0.05 at the 5% significance level), we performed a revised logistic regression containing only cp (chest pain), sex, trestbps (resting blood pressure), slope (the slope of the peak exercise ST segment), ca (the number of major blood vessels (0-3) colored by fluoroscopy), and thal (thallium stress test result) as predictor variables: mylogit_revised <- glm(hd ~ cp + sex + trestbps + slope + ca + thal, data = data, family = "binomial"). Finally, for classification we chose a 0.5 threshold, so a patient is predicted to have heart disease only when the model's predicted probability is 0.5 or greater.
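Putting the two fits and the classification rule together, a sketch of the modeling commands might look like the following; the glm() calls match those printed in the summaries below, while the threshold step is our illustrative addition.

# Model 1: all 13 predictors
mylogit <- glm(hd ~ ., data = data, family = "binomial")

# Model 2: only the predictors significant at the 5% level in model 1
mylogit_revised <- glm(hd ~ cp + sex + trestbps + slope + ca + thal,
                       data = data, family = "binomial")

# Classify with the 0.5 probability threshold described above
probs <- predict(mylogit_revised, type = "response")
predicted_hd <- ifelse(probs >= 0.5, "Unhealthy", "Healthy")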
For our main hypothesis test, we implemented the Hosmer-Lemeshow goodness-of-fit test using the hoslem.test command. We used the resulting p-value to assess whether we fail to reject the test's null hypothesis that our logistic regression fits the data well. Failing to reject the null hypothesis (p-value greater than 0.05 at the 5% significance level) would thus be consistent with an adequate fit of our model to the data.
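A sketch of this test is below, assuming the ResourceSelection package. Note that hoslem.test() expects a numeric 0/1 outcome, so we convert the factor response first; the "Ops.factor" warnings in the output below suggest the factor may have been passed directly, which is worth double-checking. The g = 10 grouping is the function's default, written out here for clarity.

library(ResourceSelection)

# Observed outcomes as 0/1, compared against fitted probabilities
y_obs <- as.numeric(data$hd == "Unhealthy")
hoslem.test(y_obs, fitted(mylogit), g = 10)
hoslem.test(y_obs, fitted(mylogit_revised), g = 10)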
Lastly, the ggplot graph plots each patient's predicted probability of having heart disease against his or her actual disease status. The cluster of green dots in the top right corner of the graph shows that most patients with heart disease have a high predicted probability of having heart disease; in contrast, the orange dots show that most patients without heart disease have a low predicted probability of having heart disease.
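A sketch of how such a plot can be produced with ggplot2 is shown below; ranking patients by predicted probability and the specific colors are our assumptions about the report's figure.

library(ggplot2)

plot_df <- data.frame(prob = fitted(mylogit_revised), hd = data$hd)
plot_df <- plot_df[order(plot_df$prob), ]
plot_df$rank <- seq_len(nrow(plot_df))

ggplot(plot_df, aes(x = rank, y = prob, color = hd)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c(Healthy = "orange", Unhealthy = "forestgreen")) +
  labs(x = "Patients, ranked by predicted probability",
       y = "Predicted probability of heart disease",
       color = "Actual status")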
## sex
## hd F M
## Healthy 71 89
## Unhealthy 25 112
## cp
## hd 1 2 3 4
## Healthy 16 40 65 39
## Unhealthy 7 9 18 103
## fbs
## hd 0 1
## Healthy 137 23
## Unhealthy 117 20
## restecg
## hd 0 1 2
## Healthy 92 1 67
## Unhealthy 55 3 79
## exang
## hd 0 1
## Healthy 137 23
## Unhealthy 63 74
## slope
## hd 1 2 3
## Healthy 103 48 9
## Unhealthy 36 89 12
## ca
## hd 2 3 4 5
## Healthy 129 21 7 3
## Unhealthy 45 44 31 17
## thal
## hd 2 3 4
## Healthy 127 6 27
## Unhealthy 37 12 88
##
## Call:
## glm(formula = hd ~ ., family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0490 -0.4847 -0.1213 0.3039 2.9086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.253978 2.960399 -2.113 0.034640 *
## age -0.023508 0.025122 -0.936 0.349402
## sexM 1.670152 0.552486 3.023 0.002503 **
## cp2 1.448396 0.809136 1.790 0.073446 .
## cp3 0.393353 0.700338 0.562 0.574347
## cp4 2.373287 0.709094 3.347 0.000817 ***
## trestbps 0.027720 0.011748 2.359 0.018300 *
## chol 0.004445 0.004091 1.087 0.277253
## fbs1 -0.574079 0.592539 -0.969 0.332622
## restecg1 1.000887 2.638393 0.379 0.704424
## restecg2 0.486408 0.396327 1.227 0.219713
## thalach -0.019695 0.011717 -1.681 0.092781 .
## exang1 0.653306 0.447445 1.460 0.144267
## oldpeak 0.390679 0.239173 1.633 0.102373
## slope2 1.302289 0.486197 2.679 0.007395 **
## slope3 0.606760 0.939324 0.646 0.518309
## ca3 2.237444 0.514770 4.346 1.38e-05 ***
## ca4 3.271852 0.785123 4.167 3.08e-05 ***
## ca5 2.188715 0.928644 2.357 0.018428 *
## thal3 -0.168439 0.810310 -0.208 0.835331
## thal4 1.433319 0.440567 3.253 0.001141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 183.10 on 276 degrees of freedom
## AIC: 225.1
##
## Number of Fisher Scoring iterations: 6
## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: training$hd, predict.glm(mylogit, training, type = "response")
## X-squared = 241, df = 3, p-value < 2.2e-16
## Actual
## Predicted Healthy Unhealthy
## 0 119 18
## 1 12 92
## [1] 0.1244813
## Actual
## Predicted Healthy Unhealthy
## 0 27 7
## 1 2 20
## [1] 0.1607143
##
## Two-sample t test power calculation
##
## n = 65
## delta = 1
## sd = 1
## sig.level = 0.05
## power = 0.9999716
## alternative = one.sided
##
## NOTE: n is number in *each* group
##
## Call:
## glm(formula = hd ~ cp + sex + trestbps + slope + ca + thal, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9626 -0.4746 -0.1170 0.3944 2.9125
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.907350 1.873754 -4.754 2.00e-06 ***
## cp2 1.109543 0.780235 1.422 0.155008
## cp3 0.267203 0.698587 0.382 0.702097
## cp4 2.647832 0.681891 3.883 0.000103 ***
## sexM 1.515713 0.494127 3.067 0.002159 **
## trestbps 0.026831 0.010469 2.563 0.010382 *
## slope2 1.914643 0.430960 4.443 8.88e-06 ***
## slope3 1.492420 0.714306 2.089 0.036678 *
## ca3 2.207236 0.474859 4.648 3.35e-06 ***
## ca4 3.066687 0.686580 4.467 7.95e-06 ***
## ca5 2.288734 0.865330 2.645 0.008171 **
## thal3 0.002464 0.730261 0.003 0.997307
## thal4 1.563587 0.414943 3.768 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 197.85 on 284 degrees of freedom
## AIC: 223.85
##
## Number of Fisher Scoring iterations: 6
## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: training$hd, predict.glm(mylogit_revised, training, type = "response")
## X-squared = 241, df = 3, p-value < 2.2e-16
## Actual
## Predicted Healthy Unhealthy
## 0 116 15
## 1 15 95
## [1] 0.1244813
## Actual
## Predicted Healthy Unhealthy
## 0 27 6
## 1 2 21
## [1] 0.1428571
As stated, our data was partitioned so that 80% was used for training and 20% for testing. We built two logistic regression models: the first used all predictor variables, and the second used only the statistically significant variables, namely chest pain, sex, resting blood pressure, slope of the peak exercise ST segment, number of major blood vessels colored by fluoroscopy, and thallium stress test result.
When we ran a logistic regression using all predictor variables, we found that the minimum deviance residual was -3.0490, the median was -0.1213, and the maximum was 2.9086, indicating approximate symmetry about 0. Performing a Hosmer-Lemeshow goodness-of-fit test, we observed a large chi-squared statistic of 241 and a p-value smaller than 2.2e-16, which indicated evidence of poor fit. The first model's misclassification rates were relatively low: 12.4% on the training data and 16.1% on the testing data.
Running our second logistic regression using only the statistically significant variables, we found similarly symmetric deviance residuals: a minimum of -2.9626, a median of -0.1170, and a maximum of 2.9125. The Hosmer-Lemeshow goodness-of-fit test returned the same results (chi-squared statistic of 241, p-value smaller than 2.2e-16), once again indicating poor fit. The second model's misclassification rate was identical to the first model's on the training data (12.4%) and nearly 2 percentage points lower on the testing data (14.3%).
The goal of this project was to build a logistic regression model to accurately predict heart disease given a set of variables that included predictors like chest pain, sex, cholesterol levels, and resting ECG results. Our heart disease dataset had both categorical/binary and numerical variables, including a variable we aimed to use as our response (heart disease). Hence, we thought it was a dataset of sufficient quality to use for our project.
To test our logistic model assumptions, we constructed contingency tables for all of our categorical variables. Since no combination of a predictor level and heart disease status had 0 data points, we were justified in proceeding with our logistic regression. We did not test linear regression model assumptions (e.g., normality of residuals), since logistic regression does not make them. We partitioned our data randomly into 80% training data and 20% testing data.
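A minimal sketch of such a random partition is below; the seed and the exact sample() call are illustrative assumptions, since the report does not show them.

set.seed(1)  # seed chosen for reproducibility; the report's seed is unknown
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
training  <- data[train_idx, ]
testing   <- data[-train_idx, ]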
We then built our first logistic regression model using all 13 predictor variables and the training data, and analyzed the model summary. The median of the deviance residuals was approximately 0, and the minimum and maximum values were relatively symmetric about 0, indicating that our model was not biased in one direction. Next, we evaluated the significance of the coefficients and observed that six of the 13 predictor variables had p-values less than 0.05: sex, cp (chest pain), trestbps (resting blood pressure), slope (slope of the peak exercise ST segment), ca (number of major blood vessels colored by fluoroscopy), and thal (thallium stress test result).
Our second logistic regression used the six predictor variables that were significant. Analyzing the summary of the revised model, we found that the new deviance residuals median was closer to 0 than the previous model's, and the min and max values were also more symmetric.
Next, we compared the predictions made with the training and testing data across the two models. We first saw that the misclassification error rates for predicting heart disease using training data were the same (12.4%) for both models. However, using testing data, the misclassification error rate for our revised model (14.3%) was lower than the error rate for our first model (16.1%). Our revised model thus appeared to perform better.
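A sketch of this confusion-matrix comparison is below, assuming a small helper function; the helper and its name are ours, not the report's.

# Confusion matrix and misclassification error for a fitted model
misclass <- function(model, newdata, threshold = 0.5) {
  probs <- predict(model, newdata = newdata, type = "response")
  pred  <- factor(ifelse(probs >= threshold, "Unhealthy", "Healthy"),
                  levels = levels(newdata$hd))
  cm <- table(Predicted = pred, Actual = newdata$hd)
  list(confusion = cm, error = 1 - sum(diag(cm)) / sum(cm))
}

misclass(mylogit, testing)$error          # 16.1% in the report
misclass(mylogit_revised, testing)$error  # 14.3% in the report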
After performing the Hosmer-Lemeshow goodness-of-fit test on both models, we obtained identical outputs (X-squared = 241, df = 3, p-value < 2.2e-16). The p-values were highly significant, indicating poor fit. These outputs were unexpected, especially given how extreme the p-value was. Because both the original and revised models failed the goodness-of-fit test, we must take our models' reliability and predictions with a grain of salt. The failure was all the more surprising considering that our revised model had good predictive power according to its misclassification error rates (>85% prediction accuracy).
It should be noted, however, that while our revised model had a smaller misclassification error rate than our original model, the difference was only around 2 percentage points. Our second model may therefore have been only marginally better than the first, and we should not fully accept the predictive power of either model.
When we graph the predicted probabilities of the revised logistic regression model, we see that “unhealthy” patients (with heart disease) are shown to have a higher predicted probability of having heart disease; likewise, “healthy” patients (without heart disease) have a lower predicted probability of having heart disease. This would indicate that our model is somewhat accurate, even though it has poor fit.
To understand why we arrived at a less-than-ideal logistic model, we evaluated potential shortcomings both in our heart disease data and in our methodology. Regarding the dataset, some results seemed implausible. For instance, being asymptomatic for chest pain was a highly significant predictor of heart disease. This was confusing, because one might expect high levels of chest pain, not their absence, to be more strongly correlated with heart disease. Examples like these made us question at times whether our data was accurate. One improvement would be to increase the size of the dataset: our data contained only 297 observations from a single hospital (Cleveland Clinic Foundation), and accumulating heart disease data from many different hospitals would provide more data points for building more accurate models. The data could also include variables more directly correlated with heart disease, such as a history of smoking, diabetes, and obesity (CDC).
Regarding our own regression approach, we partitioned the heart disease data into 80% training data and 20% testing data. Because the testing set was considerably smaller than the training set, this may explain why misclassification error rates on the testing data were higher than those on the training data for both models. It is also possible that we chose the wrong set of predictor variables for our logistic models, though we consider this unlikely because we first built a model with all variables and then refined it to the most significant predictors; even so, both models produced the same Hosmer-Lemeshow output (indicating poor fit). Finally, we could have tried modifying our prediction threshold (0.5); results might have differed with a higher threshold for predicting heart disease such as 0.6 or 0.7, as sketched below.
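A quick way to explore such alternative thresholds is sketched here; the thresholds shown are the ones suggested above, and the objects reference our earlier illustrative code.

probs <- predict(mylogit_revised, newdata = testing, type = "response")
sapply(c(0.5, 0.6, 0.7), function(t) {
  pred <- ifelse(probs >= t, "Unhealthy", "Healthy")
  mean(pred != as.character(testing$hd))  # misclassification error at threshold t
})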
In conclusion, we recognize that both of our logistic regressions were flawed models for predicting heart disease. Our second model had satisfactory predictive power (>85% accuracy); however, both models failed the Hosmer-Lemeshow goodness-of-fit test, so we must take any predictions with a grain of salt. Finally, we have put forth both immediate and long-term improvements that could strengthen the accuracy of future models, including collecting data on patient smoking history, obesity, and diabetes, and exploring different predictor variables.