1 Case Study

The heart data set in this case study contains 1025 observations on 13 variables. The data set is available from the Kaggle data repository and is loaded from my GitHub repository.

1.1 Data and Variable Descriptions

There are 13 variables in the data set.

  1. age: Age (years)

  2. sex: (Male=1, Female=0)

  3. chest pain type: 4 values, increasing in pain

  4. resting blood pressure: Diastolic blood pressure (mm Hg)

  5. serum cholesterol: cholesterol in mg/dl

  6. fasting blood sugar: > 120 mg/dl

  7. resting electrocardiographic results: (values 0,1,2)

  8. maximum heart rate achieved: bpm

  9. exercise induced angina: (1 = yes, 0 = no)

  10. oldpeak: ST depression induced by exercise relative to rest

  11. the slope of the peak exercise ST segment

  12. number of major vessels (0-3) colored by fluoroscopy

  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

I load the data from GitHub in the following code. For convenience, I delete all records with missing values and keep only the complete records in this case study. The final analytic data set has 1025 observations.
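The complete-case filtering described above can be sketched as follows. This is a minimal illustration on a made-up toy data frame (`heart_toy` is a hypothetical stand-in, not the real file); with the real data, the data frame would first be read with `read.csv()` from the repository URL, which is not shown in the source.

```r
# Toy stand-in for the heart data (values are made up for illustration).
heart_toy <- data.frame(
  age    = c(52, 53, NA, 61, 62),
  chol   = c(212, 203, 174, NA, 294),
  target = c(0, 0, 0, 0, 1)
)

# complete.cases() is TRUE for rows with no missing value in any column.
heart_complete <- heart_toy[complete.cases(heart_toy), ]

nrow(heart_toy)       # number of rows before filtering
nrow(heart_complete)  # number of rows after filtering
```

Comparing the two row counts shows how many records were dropped; if they are equal, as in this case study, the data had no missing values.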

1.2 Research Question

The objective of this case study is to build a logistic regression model to predict heart disease using various risk factors associated with the individual patient.

2 Exploratory Analysis

We first make the following pairwise scatter plots to inspect the potential issues with predictor variables.
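As a numerical companion to the scatter plots, the pairwise correlations can be computed directly. The sketch below uses simulated values (the variable names echo the heart data, but the numbers are made up) and base R only.

```r
set.seed(4)

# Simulated numeric predictors; thalach is built to correlate negatively with age.
age     <- rnorm(200, mean = 54, sd = 9)
thalach <- 220 - age + rnorm(200, sd = 20)
chol    <- rnorm(200, mean = 246, sd = 50)
num_vars <- data.frame(age, thalach, chol)

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(num_vars), 2)
cor_mat

# pairs(num_vars) would draw the matching scatter-plot matrix.
```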

From the correlation matrix plot, we can see several patterns in the predictor variables.

  • All predictor variables are unimodal, but oldpeak and age are noticeably skewed. We discretize oldpeak and age in the following.

  • A moderate correlation is observed in several pairs of variables: age vs. trestbps, age vs. thalach, and chol vs. oldpeak. We will not drop any of these variables. Instead, we standardize them so that they are on a common scale. Because standardization is a linear rescaling, it does not change the pairwise Pearson correlations.
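Discretizing a skewed numeric variable can be sketched with `cut()`. The break points and labels below are hypothetical, since the source does not state which ones were used, and the input values are made up.

```r
age     <- c(29, 41, 48, 55, 63, 77)
oldpeak <- c(0, 0.5, 1.4, 2.0, 3.1, 0)

# Hypothetical break points; intervals are closed on the right, e.g. (45, 60].
age_grp <- cut(age, breaks = c(-Inf, 45, 60, Inf),
               labels = c("<=45", "46-60", ">60"))
oldpeak_grp <- cut(oldpeak, breaks = c(-Inf, 0, 2, Inf),
                   labels = c("0", "(0,2]", ">2"))

# Frequency of each category after discretization.
table(age_grp)
table(oldpeak_grp)
```

In practice the break points would be chosen from the observed distributions or from clinical conventions.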

3 Standardizing Numerical Predictor Variables

Since this is a predictive model, we don’t worry about the interpretation of the coefficients. The objective is to identify a model that has the best predictive performance.
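A minimal sketch of the standardization step, assuming it is done with base R's `scale()` (the source does not show which function was used). The data are simulated; only the variable names echo the heart data.

```r
set.seed(1)

# Simulated numeric predictors (made-up values).
x <- data.frame(
  age     = rnorm(100, mean = 54, sd = 9),
  chol    = rnorm(100, mean = 246, sd = 50),
  thalach = rnorm(100, mean = 149, sd = 23)
)

# scale() centers each column at 0 and rescales it to unit standard deviation.
x_std <- as.data.frame(scale(x))

round(colMeans(x_std), 10)  # column means after standardization
sapply(x_std, sd)           # column standard deviations after standardization
```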

3.1 Data Split - Training and Testing Data

We randomly split the data into two subsets. 70% of the data will be used as training data. We will use the training data to search the candidate models, validate them, and identify the final model using the cross-validation method. The remaining 30%, the hold-out sample, will be used for assessing the performance of the final model.
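The 70/30 split can be sketched as below. The data frame `toy` is a hypothetical stand-in with the same number of rows as the case study data, and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed, only so the split is reproducible

n   <- 1025                         # sample size reported in the case study
toy <- data.frame(id = seq_len(n))  # stand-in for the real data frame

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, , drop = FALSE]
test  <- toy[-train_idx, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
```

Sampling row indices without replacement guarantees that no observation appears in both subsets.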

4 Candidate Models

In the previous module, we introduced full and reduced models to set up the scope for searching for the final model. In this case study, we use the full model, the reduced model, and the model obtained from step-wise variable selection as the three candidate models.

For convenience, we use 0.5 as the common cut-off for all three models to define the predicted outcome. In a real application, you may want to find the optimal cut-off for each candidate model in the cross-validation process.
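To make the cut-off rule concrete, here is a small sketch on simulated data: a logistic regression is fitted, predicted probabilities above 0.5 are labelled as cases, and the prediction error is the share of misclassified observations. The predictor `x` and the coefficients used to simulate `y` are made up.

```r
set.seed(2)

# Simulated data: one standardized predictor and a binary outcome.
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.2 + 1.5 * x))

fit <- glm(y ~ x, family = binomial)

# Predicted probabilities, then class labels using the 0.5 cut-off.
p_hat <- predict(fit, type = "response")
y_hat <- as.integer(p_hat > 0.5)

# Prediction error = proportion of misclassified observations.
pred_error <- mean(y_hat != y)
pred_error
```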

  • 5-fold Cross-Validation

Since our training data is relatively small, I will use 5-fold cross-validation to ensure that each validation data set has enough heart disease cases.

Average of prediction errors of candidate models
PE1        PE2        PE3
0.2658537  0.2658537  0.2646341

The average prediction errors show that candidate models 1 and 2 have the same prediction error (0.2659), and that the error of model 3 (0.2646) is only marginally lower. Since model 2 is simpler than model 1, we choose model 2 as the final predictive model.
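The 5-fold cross-validation comparison can be sketched as follows. This is an illustration under stated assumptions, not the code behind the table above: the data are simulated, the predictors `x1` and `x2` are made up, and only two hypothetical candidate formulas are compared.

```r
set.seed(3)

# Simulated training data with two made-up predictors.
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(0.8 * dat$x1))

# Two hypothetical candidate models: a "full" one and a "reduced" one.
candidates <- list(full = y ~ x1 + x2, reduced = y ~ x1)

k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each row

cv_error <- sapply(candidates, function(f) {
  errs <- sapply(seq_len(k), function(i) {
    # Fit on the other k - 1 folds, predict on the held-out fold i.
    fit   <- glm(f, family = binomial, data = dat[fold != i, ])
    p_hat <- predict(fit, newdata = dat[fold == i, ], type = "response")
    mean(as.integer(p_hat > 0.5) != dat$y[fold == i])
  })
  mean(errs)  # average prediction error over the k folds
})

cv_error
```

Each candidate is evaluated on the same fold assignment, so differences in the averaged errors reflect the models rather than the random partition.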

The actual accuracy of the final model is given in the following table.

The actual accuracy of the final model
Accuracy
0.6682927

Therefore, the final model has an accuracy of about 66.8%.

  • Selecting the final model with ROC

We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models using the following R function.
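A minimal sketch of such a function is given here. The name `roc_points` and its arguments are hypothetical; it takes observed 0/1 labels and predicted probabilities and returns the TPR and FPR at each cut-off, classifying an observation as a case when its probability is at least the cut-off.

```r
# Hypothetical helper: TPR and FPR at each cut-off for observed labels y (0/1)
# and predicted probabilities p_hat.
roc_points <- function(y, p_hat, cutoffs = seq(0, 1, by = 0.01)) {
  # TPR: share of true cases (y == 1) predicted as cases.
  tpr <- sapply(cutoffs, function(th) mean(p_hat[y == 1] >= th))
  # FPR: share of true non-cases (y == 0) predicted as cases.
  fpr <- sapply(cutoffs, function(th) mean(p_hat[y == 0] >= th))
  data.frame(cutoff = cutoffs, TPR = tpr, FPR = fpr)
}

# Tiny made-up example.
y     <- c(1, 1, 1, 0, 0, 0)
p_hat <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.1)
res   <- roc_points(y, p_hat, cutoffs = c(0, 0.5, 1))
res
```

Plotting `TPR` against `FPR` over a fine grid of cut-offs traces the ROC curve for one model; repeating this for each candidate allows the curves to be overlaid.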

The ROC curves of the three candidate models are given below.

We can see from the ROC curves that candidate model 2 has a better global fit than the full and reduced models. Based on the area under the ROC curve, we still choose model 2 as the final working model.
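The area under an ROC curve can be approximated from its (FPR, TPR) points with the trapezoidal rule. The helper `auc_trapezoid` below is a hypothetical sketch, and the ROC points are made up for illustration.

```r
# Hypothetical helper: area under the ROC curve by the trapezoidal rule.
auc_trapezoid <- function(fpr, tpr) {
  ord <- order(fpr, tpr)  # sort the points from (0, 0) towards (1, 1)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  # Sum of trapezoid areas: width in FPR times average height in TPR.
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Made-up ROC points, including the end points (0, 0) and (1, 1).
fpr <- c(0, 0, 1/3, 1/3, 1)
tpr <- c(0, 2/3, 2/3, 1, 1)
auc <- auc_trapezoid(fpr, tpr)
auc
```

An area closer to 1 indicates better discrimination across all cut-offs, while 0.5 corresponds to random guessing.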

5 Summary and Conclusion

The case study focused on predicting heart disease. For illustrative purposes, we used three models as candidates and used both cross-validation and ROC curves to select the final working model. Both cross-validation and the ROC curves yielded the same result.