Cardiovascular disease includes a number of conditions affecting the structure or function of the heart, including coronary artery disease and vascular (blood vessel) disease. Cardiovascular disease is by far the leading cause of death in the United States. Coronary artery disease (narrowing of the arteries supplying blood to the heart) causes about one million heart attacks each year. Even more worrisome, about 220,000 of these people die before even reaching a hospital.
Over the past decade, heart disease has been the leading cause of death across continents and countries, regardless of income level. According to a WHO report, heart disease is the leading cause of death worldwide, accounting for 7.2 million deaths, or 12.8% of all fatalities. My goal in this project is to investigate which factors lead to heart disease, using a dataset available from the UCI Machine Learning Repository.
This database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50% diameter narrowing (has heart disease))
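As a point of reference, here is a minimal R sketch of how the processed Cleveland file might be loaded and its 0–4 `num` field collapsed into this binary diagnosis. The file name, column order, and the `heart` object name are assumptions, not the exact code used in this report.

```r
# Hypothetical sketch: read the processed Cleveland data ('?' marks missing
# values) and recode the 0-4 'num' field to a binary diagnosis.
cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
heart <- read.csv("processed.cleveland.data", header = FALSE,
                  na.strings = "?", col.names = cols)
heart$num <- ifelse(heart$num > 0, 1, 0)   # 0 = no disease, 1 = disease
```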
##
## Variables sorted by number of missings:
## Variable Count
## ca 0.01320132
## thal 0.00660066
## age 0.00000000
## sex 0.00000000
## cp 0.00000000
## trestbps 0.00000000
## chol 0.00000000
## fbs 0.00000000
## restecg 0.00000000
## thalach 0.00000000
## exang 0.00000000
## oldpeak 0.00000000
## slope 0.00000000
## num 0.00000000
The above plot represents the proportion of missing data in the dataset. As we can see, there are a few missing values in the variables ‘ca’ and ‘thal’. The red bars denote the percentage of missing values. Next, we replace each missing value with the most frequent value of that variable, as sketched below.
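A minimal sketch of this mode imputation, assuming the data frame is called `heart`; the helper function is illustrative rather than the exact code used here.

```r
# Hypothetical sketch: replace missing values in 'ca' and 'thal' with the
# most frequent (modal) value of each variable.
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))        # most frequent value
  if (is.numeric(x)) mode_val <- as.numeric(mode_val)
  x[is.na(x)] <- mode_val
  x
}
heart$ca   <- impute_mode(heart$ca)
heart$thal <- impute_mode(heart$thal)
```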
## age trestbps chol thalach oldpeak
## age 1.0000000 0.28494592 0.208950270 -0.393805806 0.20380548
## trestbps 0.2849459 1.00000000 0.130120108 -0.045350879 0.18917097
## chol 0.2089503 0.13012011 1.000000000 -0.003431832 0.04656399
## thalach -0.3938058 -0.04535088 -0.003431832 1.000000000 -0.34308539
## oldpeak 0.2038055 0.18917097 0.046563989 -0.343085392 1.00000000
One of the assumptions of a linear model is that the predictor variables should be independent, so we check for linear relationships between our predictors. To check for multicollinearity, a correlation plot is used. The size and colour of the squares represent the magnitude of the correlation between two variables. We want our independent variables to have as little collinearity as possible. From the correlation matrix, we can see that the strongest correlation is between thalach and age, at -0.3938. Since its absolute value is well below 0.5, we can conclude that the predictors are at most weakly correlated.
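A minimal sketch of how such a correlation plot might be produced, assuming the `corrplot` package and the `heart` data frame from above.

```r
# Hypothetical sketch: correlation matrix of the numeric predictors,
# visualized so that square size and colour reflect the magnitude of r.
library(corrplot)
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
cor_mat  <- cor(heart[, num_vars])
corrplot(cor_mat, method = "square")
```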
As we can see from the plot, resting blood pressure increases with age for both males and females. This tells us that there is a positive relationship between resting blood pressure and age.
The age versus serum cholesterol plot shows a slight positive relationship: as age increases, the serum cholesterol level increases a little. There are almost no outliers except one, a female with a very high cholesterol level. We will remove any such outliers that might affect the results drastically.
The above plot shows people having heart disease in various age groups. The most common age range across all levels of heart disease severity is 56 to 61 years. There are also some outliers, where certain people with severity 1, 2, or 4 fall outside the common age range for that severity. This tells us that a few young people also have heart disease.
Here the value 0 refers to female and the value 1 refers to male. From the sex versus heart disease plot it is clear that men have a higher chance of having heart disease than women. Severity 1 accounts for the largest share of the diseased population, and the number of people with heart disease decreases as the severity increases.
## Mean Stdev Median Minimum Maximum 1. Quartile 3. Quartile
## age 54.438944 9.038662 56.0 29 77.0 48.0 61.0
## trestbps 131.689769 17.599748 130.0 94 200.0 120.0 140.0
## chol 246.693069 51.776918 241.0 126 564.0 211.0 275.0
## thalach 149.607261 22.875003 153.0 71 202.0 133.5 166.0
## oldpeak 1.039604 1.161075 0.8 0 6.2 0.0 1.6
Above is the summary table and histogram of the numeric variables in the dataset. For ‘age’, the mean is 54.44 and the median is 56.0, so the age distribution is slightly skewed to the left. ‘trestbps’ has a mean resting blood pressure of 131.69 and a median of 130.0; since the median is slightly smaller, the distribution is skewed to the right by a very small amount. For ‘chol’, the mean serum cholesterol level is 246.69 and the median is 241.0; the median is below the mean, so the distribution of cholesterol level is skewed to the right. ‘thalach’ has a mean maximum heart rate achieved of 149.61 and a median of 153.0; the median exceeds the mean, so the distribution of maximum heart rate achieved is skewed to the left. ‘oldpeak’ (ST depression induced by exercise) has a mean of 1.04 and a median of 0.8; since the median is less than the mean, the distribution is skewed to the right.
From the summary it is also clear that the standard deviations of the numeric variables differ widely. The standard deviation of cholesterol (chol) is 51.78, which is quite high and tells us the values are spread over a wide range. The standard deviation of maximum heart rate achieved (thalach) is also high, at 22.88, and that of resting blood pressure (trestbps) is high compared with age and oldpeak. In addition, the distribution of oldpeak is skewed to the right, so we will try transforming it and see whether the results change. The reason for such varying standard deviations is that the variables measure different quantities in different units, so we need to put them on a common scale before comparing them. We will therefore normalize the data to zero mean and unit standard deviation and then check the accuracy of the model.
While most of the variables loosely follow a normal distribution, ‘oldpeak’ is heavily skewed. A log transformation did not improve the skewness, but after a square-root transformation the variable does tend toward a normal distribution, except for the concentration of values at 0. The data are then cleaned and given proper levels: most of the variables are categorical, so we convert them into factors and assign levels, as sketched below.
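A minimal sketch of this preprocessing step. The factor labels follow the UCI attribute documentation; the exact labels and levels used in this report are assumptions.

```r
# Hypothetical sketch: square-root transform the right-skewed 'oldpeak' and
# convert the coded categorical fields to labelled factors.
heart$oldpeak <- sqrt(heart$oldpeak)
heart$sex   <- factor(heart$sex,   levels = c(0, 1), labels = c("female", "male"))
heart$cp    <- factor(heart$cp,    levels = 1:4,
                      labels = c("typical", "atypical", "nonanginal", "asymptomatic"))
heart$slope <- factor(heart$slope, levels = 1:3,
                      labels = c("upsloping", "flat", "downsloping"))
heart$thal  <- factor(heart$thal,  levels = c(3, 6, 7),
                      labels = c("normal", "fixed", "reversible"))
heart$fbs     <- factor(heart$fbs)
heart$restecg <- factor(heart$restecg)
heart$exang   <- factor(heart$exang)
heart$ca      <- factor(heart$ca)
heart$num     <- factor(heart$num, levels = c(0, 1))
```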
Since our response variable is binary, we use a logistic regression model to train and predict on the dataset. Initially we include all the variables to train and test the model. Here we used the non-normalized data, with the variables keeping their original means and standard deviations.
We create training and testing datasets, train the model on the training set, and then use the testing set to check the accuracy of the model and predict values. We split the dataset 75%/25%: the training dataset consists of 75% of the observations and the testing dataset of the remaining 25%.
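A minimal sketch of the split and the full logistic regression model. The seed, the `hrtdattest` name, and the use of `caret::createDataPartition` are assumptions; `hrtdattrain` matches the object name shown in the model output later in this report.

```r
# Hypothetical sketch: 75/25 split, full logistic regression model, and
# confusion matrices for the training and testing sets.
library(caret)
set.seed(123)                                   # assumed seed
idx         <- createDataPartition(heart$num, p = 0.75, list = FALSE)
hrtdattrain <- heart[idx, ]
hrtdattest  <- heart[-idx, ]

fit_full <- glm(num ~ ., data = hrtdattrain, family = "binomial")

pred_train <- factor(ifelse(predict(fit_full, hrtdattrain, type = "response") > 0.5, 1, 0),
                     levels = levels(hrtdattrain$num))
pred_test  <- factor(ifelse(predict(fit_full, hrtdattest, type = "response") > 0.5, 1, 0),
                     levels = levels(hrtdattest$num))
confusionMatrix(pred_train, hrtdattrain$num)
confusionMatrix(pred_test,  hrtdattest$num)
```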
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 25
## 1 6 79
##
## Accuracy : 0.8634
## 95% CI : (0.8118, 0.9053)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.721
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.9512
## Specificity : 0.7596
## Pos Pred Value : 0.8239
## Neg Pred Value : 0.9294
## Prevalence : 0.5419
## Detection Rate : 0.5154
## Detection Prevalence : 0.6256
## Balanced Accuracy : 0.8554
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The AIC of this model is 186.21. The accuracy on the testing dataset came out to be 84.21%, with a sensitivity of 0.9024, a specificity of 0.7714, and a p-value of 2.51e-08, which is good. Next we will normalize the data to zero mean and a common standard deviation and see how the accuracy of the model changes.
I normalized the data so that each variable has zero mean and unit standard deviation; in this way, variables measured in different units can be compared on the same scale. The dataset is then split into training and testing sets in a 75%/25% ratio. Next I created a model including all the variables, fit it with the generalized linear model function, and evaluated the predictions on both the training and testing datasets using a confusion matrix.
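A minimal sketch of this standardization, assuming the numeric columns of `heart` are scaled in place before re-splitting and refitting as above.

```r
# Hypothetical sketch: standardize the numeric variables to zero mean and
# unit standard deviation; the factor variables are left unchanged.
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
heart[num_vars] <- scale(heart[num_vars])
```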
Since our response variable is binary, we again use a logistic regression model to train and predict on the dataset, initially including all the variables. Here we used the normalized data, with zero mean and a common standard deviation across all variables.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 25
## 1 6 79
##
## Accuracy : 0.8634
## 95% CI : (0.8118, 0.9053)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.721
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.9512
## Specificity : 0.7596
## Pos Pred Value : 0.8239
## Neg Pred Value : 0.9294
## Prevalence : 0.5419
## Detection Rate : 0.5154
## Detection Prevalence : 0.6256
## Balanced Accuracy : 0.8554
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The accuracy on the training dataset came out to be 86.34%, which is quite good, and the accuracy on the testing dataset is 84.21%. The testing accuracy is only slightly below the training accuracy, which suggests the model is not badly overfitting. The sensitivity of the training model is 0.9512 and its specificity is 0.7596, while the sensitivity of the testing model is 0.9024 and its specificity is 0.7714. The p-values for both models are below 0.05. Since this model includes all the variables, including insignificant ones, we will try to improve the accuracy by removing insignificant variables with a stepwise procedure. Normalizing the data did not increase or decrease the accuracy, but previously the intercept had a p-value of 0.0677, which is greater than 0.05; after normalization the intercept's p-value is 0.00065, which is below 0.05, so the intercept is now statistically significant.
In the backward stepwise method we use the step function to eliminate, one by one, the variables with high p-values that are insignificant. We end up with the model that has the lowest AIC value; the lower the AIC, the better the model fit. We also want to make sure that the residual deviance does not increase while the AIC decreases.
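A minimal sketch of this backward elimination. The `back` and `hrtdattrain` names match the model call shown in the summary below; the rest is assumed.

```r
# Hypothetical sketch: backward stepwise selection by AIC, starting from the
# full logistic regression model.
full_model <- glm(num ~ ., data = hrtdattrain, family = "binomial")
back <- step(full_model, direction = "backward", trace = FALSE)
summary(glm(formula(back), family = "binomial", data = hrtdattrain))
```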
##
## Call:
## glm(formula = formula(back), family = "binomial", data = hrtdattrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7829 -0.4623 -0.1545 0.4342 2.6337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6818 0.8033 -3.339 0.000842 ***
## trestbps 0.3860 0.2048 1.885 0.059420 .
## thalach -0.4059 0.2881 -1.409 0.158821
## oldpeak 0.4697 0.2421 1.940 0.052400 .
## sex.male 1.4025 0.5596 2.506 0.012202 *
## cp.asymptomatic 2.2423 0.4544 4.934 8.04e-07 ***
## slope.flat 1.1644 0.4846 2.403 0.016277 *
## ca.1 1.3530 0.5432 2.491 0.012743 *
## ca.2 2.6611 0.7965 3.341 0.000834 ***
## thal.normal -1.1746 0.4588 -2.560 0.010467 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 226 degrees of freedom
## Residual deviance: 151.41 on 217 degrees of freedom
## AIC: 171.41
##
## Number of Fisher Scoring iterations: 6
From the summary we can see that the step function has given us a model with fewer variables. The lowest AIC obtained by the step function is 171.41, with a residual deviance of 151.41. We therefore select this lowest-AIC model, generate predictions on the training and testing datasets once again, and check whether the accuracy improves.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 116 23
## 1 7 81
##
## Accuracy : 0.8678
## 95% CI : (0.8167, 0.909)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7306
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.9431
## Specificity : 0.7788
## Pos Pred Value : 0.8345
## Neg Pred Value : 0.9205
## Prevalence : 0.5419
## Detection Rate : 0.5110
## Detection Prevalence : 0.6123
## Balanced Accuracy : 0.8610
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 10
## 1 4 25
##
## Accuracy : 0.8158
## 95% CI : (0.7103, 0.8955)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 4.281e-07
##
## Kappa : 0.6246
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.9024
## Specificity : 0.7143
## Pos Pred Value : 0.7872
## Neg Pred Value : 0.8621
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.6184
## Balanced Accuracy : 0.8084
##
## 'Positive' Class : 0
##
From the confusion matrix summaries we can see that although the accuracy of the training model has increased by a small margin, the accuracy of the testing model has actually decreased by about 3%. Hence this new model has lower accuracy and is no better than the first model at predicting the presence or absence of heart disease.
The receiver operating characteristic (ROC) curve is a graphical representation of classifier performance over all possible thresholds between 0 and 1. It plots sensitivity against 1 - specificity and gives us a range of cutoff (threshold) values that we can use when predicting with the model. Here we look for the optimal threshold value for predicting with the model.
## [,1]
## 0 vs. 1 0.912892
From the ROC plot we try to improve the model by choosing different threshold values. A cutoff of 0.4 gives an accuracy of 84.21%; a cutoff of 0.3 also gives 84.21%, the same as 0.4; a cutoff of 0.5 gives 84.21%; and a cutoff of 0.6 also gives about 84%. Hence we will use a cutoff of 0.5. While the accuracy at different thresholds is the same for the testing dataset, the accuracy on the training dataset does vary.
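A minimal sketch of the ROC curve, its AUC, and the test-set accuracy at a few candidate cutoffs, assuming the `pROC` package, the reduced model `back`, and a test set named `hrtdattest`.

```r
# Hypothetical sketch: ROC curve and AUC for the test-set predictions, plus
# test-set accuracy at several cutoff values.
library(pROC)
prob_test <- predict(back, hrtdattest, type = "response")
roc_obj   <- roc(hrtdattest$num, prob_test)
plot(roc_obj)
auc(roc_obj)

for (cut in c(0.3, 0.4, 0.5, 0.6)) {
  pred <- ifelse(prob_test > cut, 1, 0)
  acc  <- mean(pred == as.numeric(as.character(hrtdattest$num)))
  cat("cutoff", cut, ": accuracy", round(acc, 4), "\n")
}
```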
Since we are not able to push the testing accuracy above 84.21% by using different threshold values, we will try other methods to predict the presence of heart disease.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 19
## 1 9 85
##
## Accuracy : 0.8767
## 95% CI : (0.8267, 0.9164)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7497
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.9268
## Specificity : 0.8173
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9043
## Prevalence : 0.5419
## Detection Rate : 0.5022
## Detection Prevalence : 0.5859
## Balanced Accuracy : 0.8721
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The support vector machine classifier also predicted the testing dataset with an accuracy of 84.21%, the same as the logistic regression model, with a sensitivity of 0.9024 and a specificity of 0.7714. Among the kernels tried, the linear kernel performed best on this data.
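A minimal sketch of how such a linear-kernel SVM might be fit, assuming the `e1071` package and the same training/testing split as before.

```r
# Hypothetical sketch: support vector machine with a linear kernel,
# evaluated on the held-out test set.
library(e1071)
svm_fit  <- svm(num ~ ., data = hrtdattrain, kernel = "linear")
svm_pred <- predict(svm_fit, hrtdattest)
confusionMatrix(svm_pred, hrtdattest$num)
```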
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 120 9
## 1 3 95
##
## Accuracy : 0.9471366
## 95% CI : (0.9094795, 0.9723897)
## No Information Rate : 0.5418502
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8930506
## Mcnemar's Test P-Value : 0.1489147
##
## Sensitivity : 0.9756098
## Specificity : 0.9134615
## Pos Pred Value : 0.9302326
## Neg Pred Value : 0.9693878
## Prevalence : 0.5418502
## Detection Rate : 0.5286344
## Detection Prevalence : 0.5682819
## Balanced Accuracy : 0.9445356
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 8
## 1 7 27
##
## Accuracy : 0.8026316
## 95% CI : (0.6954487, 0.8851147)
## No Information Rate : 0.5394737
## P-Value [Acc > NIR] : 0.000001556411
##
## Kappa : 0.6019553
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8292683
## Specificity : 0.7714286
## Pos Pred Value : 0.8095238
## Neg Pred Value : 0.7941176
## Prevalence : 0.5394737
## Detection Rate : 0.4473684
## Detection Prevalence : 0.5526316
## Balanced Accuracy : 0.8003484
##
## 'Positive' Class : 0
##
I trained a neural network with 2 hidden layers and 2 neurons in each layer. The predictions on the training set were 94.7 percent accurate, whereas those on the testing set (unseen data) were only about 80 percent accurate. One reason might be that there are very few observations, so the neural network cannot capture all the complexity in the data. Hence we can conclude that the neural network tends to overfit the data, resulting in poorer predictions on new data.
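A minimal sketch of such a network with the `neuralnet` package, under stated assumptions: `neuralnet` works with numeric inputs, so the factor predictors are expanded with `model.matrix` first, and the seed, encoding, and object names are illustrative rather than the exact code used here.

```r
# Hypothetical sketch: neural network with two hidden layers of two neurons
# each. Factors are expanded to dummy columns with model.matrix first.
library(neuralnet)
x_train  <- model.matrix(num ~ . - 1, data = hrtdattrain)
x_test   <- model.matrix(num ~ . - 1, data = hrtdattest)
train_nn <- data.frame(x_train, num = as.numeric(as.character(hrtdattrain$num)))

# build the formula explicitly, since older neuralnet versions reject "y ~ ."
preds   <- setdiff(colnames(train_nn), "num")
nn_form <- as.formula(paste("num ~", paste(preds, collapse = " + ")))

set.seed(123)                                    # assumed seed
nn_fit <- neuralnet(nn_form, data = train_nn, hidden = c(2, 2),
                    linear.output = FALSE)

nn_prob <- compute(nn_fit, data.frame(x_test))$net.result
nn_pred <- factor(ifelse(nn_prob > 0.5, 1, 0), levels = levels(hrtdattest$num))
confusionMatrix(nn_pred, hrtdattest$num)
```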
We used 14 predictor variables from the UCI heart disease dataset to predict the presence of heart disease in patients, comparing several models. The first model, logistic regression with all variables, gave an accuracy of 84.21%. We then used a stepwise procedure to remove insignificant variables and refit the model, but the accuracy actually decreased. After the stepwise procedure failed to improve the model, we used the ROC curve to look for an optimal cutoff value, but different cutoffs (0.3, 0.4, 0.5) gave the same accuracy. Next we used a support vector machine, which also gave an accuracy of 84.21% on the testing dataset, with essentially the same Type I and Type II errors. Finally, I used a neural network with 2 hidden layers and 2 neurons in each layer; although the training accuracy increased to about 94%, the testing accuracy dropped to 80%, which is 4% lower than logistic regression and the support vector machine.
Hence we conclude that the highest accuracy we can achieve is 84.21%, and there is no further need to try additional models. From the heart data results we can also say that the best predictors of the presence of heart disease are maximum heart rate achieved (thalach), chest pain type (cp), number of major vessels (ca), and defect type (thal).