The heart data set in this case study contains 1025 observations on 13 variables. The data set is available in the Kaggle data repository and is loaded from my GitHub repository.
The 13 variables in the data set are:
age: Age (years)
sex: (Male=1, Female=0)
chest pain type: 4 values, increasing in pain
resting blood pressure: Diastolic blood pressure (mm Hg)
serum cholesterol: cholesterol in mg/dl
fasting blood sugar: > 120 mg/dl
resting electrocardiographic results: (values 0,1,2)
maximum heart rate achieved: bpm
exercise induced angina: (1 = yes, 0 = no)
oldpeak: ST depression induced by exercise relative to rest
the slope of the peak exercise ST segment
number of major vessels (0-3) colored by fluoroscopy
thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
I load the data from GitHub in the following code. For convenience, I delete all records with missing values and keep only the complete records in this case study. The final analytic data set has 1025 observations.
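The complete-case step described above can be sketched as follows. This is only an illustration on a small made-up data frame (`toy` and its values are assumptions, not the heart data), since the actual loading code and URL are not shown here.

```r
# Toy data frame standing in for the heart data (values are made up).
toy <- data.frame(
  age    = c(63, 54, NA, 47, 58),
  chol   = c(233, NA, 204, 256, 230),
  target = c(1, 0, 1, 0, 1)
)

# Keep only the rows with no missing value in any column.
toy_complete <- toy[complete.cases(toy), ]

# Number of records before and after removing incomplete rows.
n_before <- nrow(toy)
n_after  <- nrow(toy_complete)
```

If the real data set has no missing values, `n_before` and `n_after` coincide, which is consistent with the 1025 observations reported before and after cleaning.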
The objective of this case study is to build a logistic regression model to predict heart disease using various risk factors associated with the individual patient.
We first make the following pairwise scatter plots to inspect the potential issues with predictor variables.
From the correlation matrix plot, we can see several patterns in the predictor variables.
A moderate correlation is observed in several pairs of variables: age vs. trestbps and vs. thalach, and chol vs. oldpeak. We will not drop any of these variables. Instead, we will standardize them so that they are on a common scale. Note that standardization is a linear transformation, so it does not change the pairwise correlations themselves.
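A minimal sketch of the standardization step, assuming simulated stand-ins for two of the predictors (`sim`, its coefficients, and the seed are all hypothetical). It also computes the pairwise correlation before and after scaling so that the effect of standardization on it can be checked directly.

```r
set.seed(1)

# Simulated stand-ins for two correlated predictors (not the real data).
n        <- 200
age      <- rnorm(n, mean = 54, sd = 9)
trestbps <- 100 + 0.6 * age + rnorm(n, sd = 15)
sim      <- data.frame(age = age, trestbps = trestbps)

# scale() centers each column to mean 0 and rescales it to sd 1.
sim_std <- as.data.frame(scale(sim))

# Pairwise correlation before and after standardization.
cor_raw <- cor(sim$age, sim$trestbps)
cor_std <- cor(sim_std$age, sim_std$trestbps)
```

In this simulated example `cor_raw` and `cor_std` agree, because centering and rescaling are linear transformations of each variable.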
Since this is a predictive model, we don’t worry about the interpretation of the coefficients. The objective is to identify a model that has the best predictive performance.
We randomly split the data into two subsets. 70% of the data will be used as training data. We will use the training data to search the candidate models, validate them, and identify the final model using the cross-validation method. The remaining 30%, the hold-out sample, will be used for assessing the performance of the final model.
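The 70/30 split can be sketched as below. The data frame `sim`, the seed, and the use of `floor()` to size the training set are assumptions made for illustration; only the sample size of 1025 comes from the text.

```r
set.seed(123)

# Placeholder data frame with the same number of rows as the heart data.
n   <- 1025
sim <- data.frame(id = seq_len(n), x = rnorm(n))

# Randomly pick 70% of the row indices for training.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- sim[train_idx, ]    # used for model search and cross-validation
test  <- sim[-train_idx, ]   # hold-out sample for the final assessment
```

With 1025 rows this gives 717 training records and 308 hold-out records, and no record appears in both subsets.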
In the previous module, we introduced full and reduced models to set up the scope for searching for the final model. In this case study, we use the full, reduced, and final models obtained based on the step-wise variable selection as the three candidate models.
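One way to set up three such candidates is sketched below on simulated data. The predictors kept in the reduced model, the simulated coefficients, and the object names are hypothetical choices for illustration, not the models actually fitted in this case study.

```r
set.seed(2024)

# Simulated training data (stand-in for the heart training set).
n     <- 300
train <- data.frame(
  age = rnorm(n), chol = rnorm(n), thalach = rnorm(n), oldpeak = rnorm(n)
)
eta          <- -0.2 + 0.8 * train$age - 1.0 * train$thalach + 0.9 * train$oldpeak
train$target <- rbinom(n, size = 1, prob = plogis(eta))

# Full model: all available predictors.
full_model <- glm(target ~ age + chol + thalach + oldpeak,
                  family = binomial, data = train)

# Reduced model: a small set of predictors that is always kept (assumed here).
reduced_model <- glm(target ~ age, family = binomial, data = train)

# Step-wise search between the reduced (lower) and full (upper) models.
step_model <- step(reduced_model,
                   scope = list(lower = formula(reduced_model),
                                upper = formula(full_model)),
                   direction = "both", trace = 0)
```

The `scope` argument bounds the search, so the selected model always contains the reduced model's terms and never goes beyond the full model's terms.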
For convenience, we use 0.5 as the common cut-off for all three models to define the predicted class. In a real application, you may want to find the optimal cut-off for each candidate model in the cross-validation process.
Since our training data set is relatively small, I will use 5-fold cross-validation to ensure that each validation set has enough heart disease cases.
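A minimal sketch of 5-fold cross-validation with a 0.5 cut-off follows. The helper `cv_error()`, the simulated data, and the two formulas are assumptions for illustration; the predictive error is taken here to be the misclassification rate averaged over the folds.

```r
set.seed(42)

# Simulated training data with a binary outcome.
n     <- 500
dat   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.2 * dat$x1))

# Randomly assign each row to one of k = 5 folds of equal size.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Average misclassification rate over the folds for one candidate formula.
cv_error <- function(form, data, fold, cutoff = 0.5) {
  resp <- all.vars(form)[1]                      # name of the response variable
  errs <- sapply(sort(unique(fold)), function(i) {
    fit  <- glm(form, family = binomial, data = data[fold != i, ])
    p    <- predict(fit, newdata = data[fold == i, ], type = "response")
    pred <- as.integer(p > cutoff)               # predicted class at the cut-off
    mean(pred != data[[resp]][fold == i])        # error on the validation fold
  })
  mean(errs)
}

pe <- c(PE1 = cv_error(y ~ x1 + x2, dat, fold),
        PE2 = cv_error(y ~ x1, dat, fold))
```

Using the same `fold` vector for every candidate keeps the comparison fair, since each model is validated on identical subsets.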
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.2658537 | 0.2658537 | 0.2646341 |
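The average predictive errors show that candidate models 1 and 2 have the same predictive error (about 0.266), while model 3 is only marginally lower (about 0.265). Since the differences are negligible and model 2 is simpler than model 1, we choose model 2 as the final predictive model.
The actual accuracy of the final model, assessed on the hold-out sample, is given by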
| Accuracy |
|---|
| 0.6682927 |
Therefore, the final model has an accuracy of about 0.668, that is, it classifies roughly 66.8% of the observations correctly.
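For reference, the accuracy of a classifier at a given cut-off can be computed from its confusion matrix as in the sketch below. The vectors `actual` and `prob` are small made-up examples, not output from the fitted model.

```r
# Made-up true labels and predicted probabilities for eight patients.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# Classify with the common 0.5 cut-off.
predicted <- as.integer(prob > 0.5)

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf_mat <- table(Actual    = factor(actual,    levels = c(0, 1)),
                  Predicted = factor(predicted, levels = c(0, 1)))

# Accuracy = correctly classified cases / all cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

In this toy example six of the eight cases are classified correctly, so `accuracy` is 0.75.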
We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models using the following R function.
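A function of this kind might look like the sketch below. It is not the original function from this case study; the name `roc_points()`, the grid of cut-offs, and the toy inputs are assumptions for illustration.

```r
# TPR and FPR at each cut-off probability for one model.
roc_points <- function(actual, prob, cutoffs = seq(0, 1, by = 0.01)) {
  tpr <- sapply(cutoffs, function(ct) {
    sum(prob > ct & actual == 1) / sum(actual == 1)   # sensitivity
  })
  fpr <- sapply(cutoffs, function(ct) {
    sum(prob > ct & actual == 0) / sum(actual == 0)   # 1 - specificity
  })
  data.frame(cutoff = cutoffs, TPR = tpr, FPR = fpr)
}

# Made-up true labels and predicted probabilities.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

roc <- roc_points(actual, prob)
```

Plotting `roc$TPR` against `roc$FPR`, for example with `plot(roc$FPR, roc$TPR, type = "l")`, traces the ROC curve; calling the function once per candidate model gives the curves to overlay.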
The ROC curves of the three candidate models are given below.
We can see from the ROC curves that candidate model 2 has a better global fit than the full and reduced models. Based on the area under the ROC curve, we still choose model 2 as the final working model.
This case study focused on predicting heart disease. For illustrative purposes, we used three candidate models and used both cross-validation and the ROC curve to select the final working model. Both approaches yielded the same result.
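The area under the ROC curve can be approximated with the trapezoidal rule, as in the sketch below. The toy labels, probabilities, and cut-off grid are assumptions for illustration, and the calculation is one common approach rather than necessarily the one used for the curves above.

```r
# Made-up true labels and predicted probabilities.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# TPR and FPR over a grid of cut-off probabilities.
cutoffs <- seq(0, 1, by = 0.01)
tpr <- sapply(cutoffs, function(ct) sum(prob > ct & actual == 1) / sum(actual == 1))
fpr <- sapply(cutoffs, function(ct) sum(prob > ct & actual == 0) / sum(actual == 0))

# Order the points from (0, 0) to (1, 1) along the curve.
ord <- order(fpr, tpr)
fpr <- fpr[ord]
tpr <- tpr[ord]

# Trapezoidal rule: width in FPR times the average height in TPR.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```

For these toy values `auc` is 0.9375, which matches the proportion of (case, non-case) pairs in which the case receives the higher predicted probability (15 of 16). Computing this quantity for each candidate gives a single number per model to compare.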