Number of observations: Healthy and Heart Disease cases
Heart Disease is uniformly spread out across Age
No major difference in Rest ECG for Healthy and Heart Disease patients
More Heart Disease patients seem to have cholesterol between 200 and 250 mg/dl
Heart Disease patients have higher maximum heart rate than healthy patients
More Heart Disease patients have ST depression of 0.1
Almost all of the patients who have Heart Disease have 0 major vessels as observed by fluoroscopy
A larger share of females than males have Heart Disease
More Heart Disease patients have chest pain type 1 or 2
No difference in fasting blood sugar
Patients with Rest ECG 1 have more Heart Disease
Patients with no exercise induced angina have more Heart Disease
Patients with peak exercise ST slope 2 have more Heart Disease
Patients with fixed defect thalassemia have more Heart Disease
- Only a few of the parameters have a significant effect on Heart Disease
- Gender, Chest Pain Type, ST Depression and No. of vessels observed by fluoroscopy are significant at the 5% level in the full model; Exercise Induced Angina is marginal (p = 0.07) and is kept as well
- The rest of the parameters are dropped from our analysis
##
## Call:
## glm(formula = target ~ ., family = binomial, data = h)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7777 -0.3544 0.1525 0.5302 2.6007
##
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                        1.1005784  3.3647651   0.327  0.74360
## age                               -0.0005734  0.0235952  -0.024  0.98061
## sexMale                           -1.5149396  0.5212317  -2.906  0.00366 **
## cpChest Pain Type 1                0.9832302  0.5640531   1.743  0.08131 .
## cpChest Pain Type 2                1.9452318  0.4771939   4.076 4.57e-05 ***
## cpChest Pain Type 3                2.0159122  0.6506319   3.098  0.00195 **
## trestbps                          -0.0170729  0.0107004  -1.596  0.11059
## chol                              -0.0043317  0.0038894  -1.114  0.26539
## fbsFasting Blood Sugar > 120       0.1764007  0.5661856   0.312  0.75538
## restecgRest ECG 1                  0.5702065  0.3745081   1.523  0.12787
## restecgRest ECG 2                 -0.2767289  2.2672126  -0.122  0.90285
## thalach                            0.0171314  0.0107357   1.596  0.11055
## exangExercise Induced Angina      -0.7630837  0.4260285  -1.791  0.07327 .
## oldpeak                           -0.4892926  0.2258040  -2.167  0.03024 *
## slopePeak Excercise ST Slope 1    -0.7196641  0.8630729  -0.834  0.40437
## slopePeak Excercise ST Slope 2     0.2015612  0.9382445   0.215  0.82990
## ca                                -0.8331781  0.2043120  -4.078 4.54e-05 ***
## thalNormal Thalassemia             1.8146869  2.3786093   0.763  0.44551
## thalFixed Defect Thalassemia       1.8533188  2.2904818   0.809  0.41844
## thalReversible Defect Thalassemia  0.4732491  2.3013525   0.206  0.83707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.64 on 302 degrees of freedom
## Residual deviance: 201.69 on 283 degrees of freedom
## AIC: 241.69
##
## Number of Fisher Scoring iterations: 6
Taking only the significant variables and summarising
##     sex                      cp                              exang
##  Female: 96   Chest Pain Type 0:143   No Exercise Induced Angina:204
##  Male  :207   Chest Pain Type 1: 50   Exercise Induced Angina   : 99
##               Chest Pain Type 2: 87
##               Chest Pain Type 3: 23
##
##
##     oldpeak          ca                    target
##  Min.   :0.00   Min.   :0.0000   Healthy      :138
##  1st Qu.:0.00   1st Qu.:0.0000   Heart Disease:165
##  Median :0.80   Median :0.0000
##  Mean   :1.04   Mean   :0.7294
##  3rd Qu.:1.60   3rd Qu.:1.0000
##  Max.   :6.20   Max.   :4.0000
- We refit the logistic regression on the reduced data
- All variables are now significant at the 1% level (notice that the intercept is also significant now)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3277 -0.5202 0.2011 0.5714 2.5038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9614 0.4348 4.511 6.44e-06 ***
## sexMale -1.4117 0.3894 -3.625 0.000289 ***
## cpChest Pain Type 1 1.3498 0.4868 2.773 0.005560 **
## cpChest Pain Type 2 2.0905 0.4192 4.987 6.12e-07 ***
## cpChest Pain Type 3 2.0161 0.6086 3.313 0.000924 ***
## exangExercise Induced Angina -1.2217 0.3721 -3.283 0.001028 **
## oldpeak -0.8060 0.1810 -4.454 8.42e-06 ***
## ca -0.7635 0.1662 -4.595 4.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.64 on 302 degrees of freedom
## Residual deviance: 238.32 on 295 degrees of freedom
## AIC: 254.32
##
## Number of Fisher Scoring iterations: 5
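One way to check how much is lost by dropping the other predictors is a likelihood-ratio test on the two nested fits, using only the deviances and degrees of freedom printed in the two summaries. The following is a minimal base R sketch, not part of the original analysis; the variable names are made up for the illustration.

```r
# Deviances and residual degrees of freedom copied from the two summaries
dev_full <- 201.69; df_full <- 283   # glm on all predictors (data = h)
dev_red  <- 238.32; df_red  <- 295   # glm on the five kept predictors (data = d)

# Likelihood-ratio statistic: increase in deviance caused by dropping predictors
lr_stat <- dev_red - dev_full
lr_df   <- df_red - df_full

# Upper-tail chi-squared probability under the hypothesis that all
# dropped coefficients are zero
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df, p =", signif(p_value, 3), "\n")
```

The statistic is 36.63 on 12 degrees of freedom, and the AIC also rises from 241.69 to 254.32. This suggests the dropped variables carry some joint signal even though none is individually significant at the 5% level, so the reduced model trades a little fit for simplicity.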
- Plotting the coefficients
- If the patient has chest pain type 2 or 3 (and, to a lesser extent, type 1) rather than type 0, the reference level, the probability of Heart Disease rises
- The more major vessels visible by fluoroscopy, the lower the chances of Heart Disease
- Also, the higher the ST depression, the lower the chances of Heart Disease
- If a patient has exercise induced angina, the probability of Heart Disease falls
- Male patients also have lower chances of Heart Disease than female patients
As ST depression rises, the chance of Heart Disease falls
As the number of vessels observed by fluoroscopy rises, the probability of Heart Disease falls
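The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, and the inverse logit turns a linear predictor into a probability. The sketch below uses the estimates from the reduced-model summary; the shortened names (cp1, cp2, cp3, exang) and the example patient are assumptions made for the illustration.

```r
# Coefficients copied from the reduced-model summary (log-odds scale)
coefs <- c(
  "(Intercept)" =  1.9614,
  "sexMale"     = -1.4117,
  "cp1"         =  1.3498,
  "cp2"         =  2.0905,
  "cp3"         =  2.0161,
  "exang"       = -1.2217,
  "oldpeak"     = -0.8060,
  "ca"          = -0.7635
)

# Odds ratios: multiplicative change in the odds of Heart Disease
# per unit increase (or versus the reference level for factors)
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 2))

# Predicted probability for the reference patient:
# female, chest pain type 0, no exercise induced angina, oldpeak = 0, ca = 0
p_reference <- plogis(coefs[["(Intercept)"]])

# Same patient with an ST depression of 2 (hypothetical value)
p_oldpeak2 <- plogis(coefs[["(Intercept)"]] + 2 * coefs[["oldpeak"]])

print(round(c(reference = p_reference, oldpeak_2 = p_oldpeak2), 3))
```

Odds ratios above 1 (the chest pain types) raise the odds of Heart Disease; those below 1 (male sex, exercise induced angina, ST depression, number of vessels) lower them.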
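- Logistic Regression with the train function in the caret package, using 10-fold cross-validation repeated 10 times on the 242 training samples
- Cross-validated ROC (AUC) = 87%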
- Sensitivity = 77%
- Specificity = 84%
## Generalized Linear Model
##
## 242 samples
## 5 predictor
## 2 classes: 'Healthy', 'Heart.Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 218, 217, 218, 219, 218, 218, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8697208 0.7675758 0.8408974
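To make the resampling scheme concrete, here is a minimal base R sketch of 10-fold cross-validation repeated 10 times for a logistic regression, scored by ROC AUC. It runs on simulated stand-in data, not the heart data. It also assigns folds purely at random, whereas caret stratifies folds by class, so it illustrates the idea rather than reproducing caret's procedure.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; one predictor, binary outcome)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
sim <- data.frame(x = x, y = y)

# AUC via the rank (Mann-Whitney) formulation
auc <- function(score, label) {
  ranks <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(ranks[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

n_repeats <- 10
k <- 10
fold_auc <- numeric(0)

for (rep_i in seq_len(n_repeats)) {
  # Fresh random assignment of rows to k folds for every repeat
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    train_rows <- sim[fold_id != f, ]
    test_rows  <- sim[fold_id == f, ]
    fit <- glm(y ~ x, family = binomial, data = train_rows)
    prob <- predict(fit, newdata = test_rows, type = "response")
    fold_auc <- c(fold_auc, auc(prob, test_rows$y))
  }
}

# The reported ROC is the average over all 100 held-out folds
mean_auc <- mean(fold_auc)
print(mean_auc)
```

Each of the 100 models is fitted on about nine tenths of the rows, which is why the summary of sample sizes in the output shows values near 218 for 242 training samples.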
Variable importance. We see that ST Depression is the most important variable followed by Chest Pain Type and No. of Vessels
## glm variable importance
##
## Overall
## oldpeak 100.00
## `cpChest Pain Type 2` 91.83
## ca 72.94
## sexMale 63.47
## `cpChest Pain Type 3` 38.18
## `exangExercise Induced Angina` 29.97
## `cpChest Pain Type 1` 0.00
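For a glm, caret's variable importance is based on the absolute z statistic of each coefficient, rescaled by default so that the smallest value becomes 0 and the largest 100. The sketch below shows that rescaling in base R. It uses the z values from the reduced model fitted on all 303 rows, whereas the caret model was fitted on the 242 training rows, so its numbers and ordering differ from the table above. The shortened names are assumptions for the illustration.

```r
# z values copied from the reduced-model summary (all 303 rows)
z <- c(oldpeak = -4.454, cp2 = 4.987, ca = -4.595, sexMale = -3.625,
       cp3 = 3.313, exang = -3.283, cp1 = 2.773)

# Raw importance: absolute z statistic
raw_importance <- abs(z)

# Min-max rescaling to the 0-100 range
scaled <- 100 * (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance))

print(round(sort(scaled, decreasing = TRUE), 2))
```

A score of 0.00 for Chest Pain Type 1 therefore means it has the smallest absolute z statistic among the terms, not that it has no effect.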
- Now we predict on the Test Set.
- Accuracy = 90%
- Kappa = 79%
- p-Value = 2.8 x 10^-7 for the accuracy exceeding the no-information rate of 61% (highly significant)
## Confusion Matrix and Statistics
##
##
## pred Healthy Heart Disease
## Healthy 21 3
## Heart Disease 3 34
##
## Accuracy : 0.9016
## 95% CI : (0.7981, 0.963)
## No Information Rate : 0.6066
## P-Value [Acc > NIR] : 2.801e-07
##
## Kappa : 0.7939
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9189
## Specificity : 0.8750
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.8750
## Prevalence : 0.6066
## Detection Rate : 0.5574
## Detection Prevalence : 0.6066
## Balanced Accuracy : 0.8970
##
## 'Positive' Class : Heart Disease
##
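The headline statistics can be recomputed from the four counts in the table. The following base R sketch does this for accuracy, Cohen's kappa, the no-information rate and the one-sided binomial test against that rate. The variable names are illustrative, and the binomial test is assumed to correspond to the "P-Value [Acc > NIR]" line.

```r
# Counts copied from the logistic regression confusion matrix
cm <- matrix(c(21,  3,
                3, 34),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Healthy", "Heart Disease"),
                             c("Healthy", "Heart Disease")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# No-information rate: share of the larger class
nir <- max(colSums(cm)) / n

# One-sided exact binomial test of the accuracy against the no-information rate
p_value <- binom.test(sum(diag(cm)), n, p = nir,
                      alternative = "greater")$p.value

print(round(c(accuracy = accuracy, kappa = cohen_kappa, nir = nir), 4))
print(p_value)
```

With 55 of 61 test cases classified correctly, the accuracy is 0.9016, well above the 0.6066 that always predicting Heart Disease would achieve.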
Plotting the Confusion Matrix for Logistic Regression
- Training with random forest
- Best ROC = 88% (at mtry = 2)
- Sensitivity = 74%
- Specificity = 81%
## Random Forest
##
## 242 samples
## 5 predictor
## 2 classes: 'Healthy', 'Heart.Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 218, 218, 217, 217, 218, 219, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.8807964 0.7402273 0.8061538
## 4 0.8705818 0.7596212 0.7842308
## 7 0.8636490 0.7422727 0.7826282
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
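Variable importance of random forest. ST Depression is again the most important variable and No. of Vessels remains near the top, but Exercise Induced Angina now ranks above Chest Pain Type 2.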
## rf variable importance
##
## Overall
## oldpeak 100.000
## ca 70.330
## exangExercise Induced Angina 43.950
## cpChest Pain Type 2 20.794
## sexMale 18.268
## cpChest Pain Type 3 6.915
## cpChest Pain Type 1 0.000
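- Predicting on the Test Set
- The confusion matrix and the accuracy on the test set are output
- Accuracy = 91.8%
- p-Value is highly significant
- The rows of the table are the actual classes (24 Healthy and 37 Heart Disease, the same totals for every model) and the columns are the predictions (pred). caret's confusionMatrix reads the rows of a table as predictions, so the reported Sensitivity (0.9211) and Pos Pred Value (0.9459) are interchanged, as are Specificity and Neg Pred Value, and the Prevalence and No Information Rate are based on the predicted rather than the actual classes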
## Confusion Matrix and Statistics
##
## pred
## Healthy Heart Disease
## Healthy 21 3
## Heart Disease 2 35
##
## Accuracy : 0.918
## 95% CI : (0.819, 0.9728)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : 1.627e-07
##
## Kappa : 0.827
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9211
## Specificity : 0.9130
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.8750
## Prevalence : 0.6230
## Detection Rate : 0.5738
## Detection Prevalence : 0.6066
## Balanced Accuracy : 0.9170
##
## 'Positive' Class : Heart Disease
##
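The orientation of the table matters for every statistic except accuracy. The sketch below recomputes sensitivity for the Heart Disease class both ways, assuming the rows hold the actual classes. The evidence for this is that the column header is pred and the row totals are 24 and 37 in all three test-set tables. The dimension names and variable names are illustrative.

```r
# Counts copied from the random forest table (columns are labelled pred)
tab <- matrix(c(21,  3,
                 2, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("Healthy", "Heart Disease"),
                              pred   = c("Healthy", "Heart Disease")))

pos <- "Heart Disease"
true_pos <- tab[pos, pos]

# Columns treated as the reference: matches the printed Sensitivity
sens_columns_as_reference <- true_pos / sum(tab[, pos])   # 35 / 38

# Rows treated as the reference (assumption: rows are the actual classes)
sens_rows_as_reference <- true_pos / sum(tab[pos, ])      # 35 / 37

print(round(c(columns = sens_columns_as_reference,
              rows    = sens_rows_as_reference), 4))
```

Under the row-as-actual reading, the sensitivity is 35/37 (the value printed as Pos Pred Value) and the prevalence is 37/61 for every model.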
Plotting the Confusion Matrix for Random Forest
- Running neural net
- Max ROC = 88% (at size = 3, decay = 0.1)
- Sensitivity = 74%
- Specificity = 84%
## Neural Network
##
## 242 samples
## 5 predictor
## 2 classes: 'Healthy', 'Heart.Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 219, 218, 217, 218, 218, 217, ...
## Resampling results across tuning parameters:
##
## size decay ROC Sens Spec
## 1 0e+00 0.8253329 0.7371970 0.7973718
## 1 1e-04 0.8367104 0.7455303 0.7835897
## 1 1e-01 0.8606520 0.7200758 0.8532692
## 3 0e+00 0.8395350 0.7092424 0.8291667
## 3 1e-04 0.8425534 0.7068182 0.8097436
## 3 1e-01 0.8775925 0.7393182 0.8379487
## 5 0e+00 0.8053914 0.7181061 0.7803205
## 5 1e-04 0.8358202 0.7387879 0.7939103
## 5 1e-01 0.8758001 0.7457576 0.8314744
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.
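The selection rule stated in the output (largest ROC wins) can be replayed on the printed tuning table. A small base R sketch, with the grid typed in from the results above and illustrative variable names:

```r
# Resampling results copied from the tuning table
grid <- data.frame(
  size  = c(1, 1, 1, 3, 3, 3, 5, 5, 5),
  decay = c(0, 1e-04, 1e-01, 0, 1e-04, 1e-01, 0, 1e-04, 1e-01),
  ROC   = c(0.8253329, 0.8367104, 0.8606520,
            0.8395350, 0.8425534, 0.8775925,
            0.8053914, 0.8358202, 0.8758001)
)

# Row with the largest cross-validated ROC
best <- grid[which.max(grid$ROC), ]
print(best)

# Mean ROC for each decay value, averaged over the three sizes
roc_by_decay <- tapply(grid$ROC, grid$decay, mean)
print(round(roc_by_decay, 4))
```

In this grid a decay of 0.1 gives the highest ROC at every network size, so the weight penalty matters more than the number of hidden units.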
Variable importance differs for the neural network: Exercise Induced Angina ranks first and ST Depression last
## nnet variable importance
##
## Overall
## exangExercise Induced Angina 100.00
## cpChest Pain Type 3 56.58
## ca 47.47
## sexMale 40.05
## cpChest Pain Type 2 12.83
## cpChest Pain Type 1 11.14
## oldpeak 0.00
- Predicting in the Test Set
- Accuracy on Test Set = 91.8%
- P-Value is highly significant
## Confusion Matrix and Statistics
##
## pred
## Healthy Heart Disease
## Healthy 22 2
## Heart Disease 3 34
##
## Accuracy : 0.918
## 95% CI : (0.819, 0.9728)
## No Information Rate : 0.5902
## P-Value [Acc > NIR] : 1.172e-08
##
## Kappa : 0.8295
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9444
## Specificity : 0.8800
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.9167
## Prevalence : 0.5902
## Detection Rate : 0.5574
## Detection Prevalence : 0.6066
## Balanced Accuracy : 0.9122
##
## 'Positive' Class : Heart Disease
##
Plotting the Confusion Matrix for Neural Network
Running rpart to obtain Decision Tree for decision making
## user system elapsed
## 1.81 0.01 1.94
## CART
##
## 303 samples
## 5 predictor
## 2 classes: 'Healthy', 'Heart.Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 273, 274, 273, 272, 272, 272, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8192162 0.7413736 0.8274265
##
## Tuning parameter 'cp' was held constant at a value of 0.01
Similar variable importance
## rpart variable importance
##
## Overall
## oldpeak 100.00
## exangExercise Induced Angina 94.56
## cpChest Pain Type 2 69.69
## ca 61.07
## sexMale 55.04
## cpChest Pain Type 1 11.54
## `cpChest Pain Type 1` 0.00
## `exangExercise Induced Angina` 0.00
## `cpChest Pain Type 3` 0.00
## `cpChest Pain Type 2` 0.00
- Plotting the Decision Tree
- ca: Number of major vessels (0-3) colored by fluoroscopy
- oldpeak: ST depression induced by exercise relative to rest
- The deeper the red, the higher the probability of Heart Disease
- The deeper the green, the higher the chances of being healthy
- If no. of vessels >= 1 AND ST depression < 0.55 AND Chest Pain Type = 2, then there is a 93% chance of Heart Disease
- Similarly, doctors can use these parameters to judge whether there is a chance of Heart Disease in the future