Cardiovascular disease includes a number of conditions affecting the structure or function of the heart, including coronary artery disease and vascular (blood vessel) disease. Cardiovascular disease is by far the leading cause of death in the United States. Coronary artery disease (narrowing of the arteries supplying blood to the heart) causes about one million heart attacks each year. Even more worrisome, about 220,000 of these people die before even reaching a hospital.
Over the past decade, heart disease has been the leading cause of death across continents and countries, regardless of income level. According to a WHO report, heart disease is the leading cause of death worldwide, accounting for 7.2 million deaths, or 12.8% of all fatalities. My goal in this project is to investigate which factors lead to heart disease, using a dataset available from the UCI Machine Learning Repository.
This database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50% diameter narrowing (has heart disease))
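As a point of reference, here is a minimal R sketch of how the processed Cleveland file might be loaded and its 0–4 `num` field collapsed into this binary diagnosis. The file name, column order, and the `heart` object name are assumptions, not the exact code used in this report.

```r
# Hypothetical sketch: read the processed Cleveland data ('?' marks missing
# values) and recode the 0-4 'num' field to a binary diagnosis.
cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
heart <- read.csv("processed.cleveland.data", header = FALSE,
                  na.strings = "?", col.names = cols)
heart$num <- ifelse(heart$num > 0, 1, 0)   # 0 = no disease, 1 = disease
```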
##
## Variables sorted by number of missings:
## Variable Count
## ca 0.01320132
## thal 0.00660066
## age 0.00000000
## sex 0.00000000
## cp 0.00000000
## trestbps 0.00000000
## chol 0.00000000
## fbs 0.00000000
## restecg 0.00000000
## thalach 0.00000000
## exang 0.00000000
## oldpeak 0.00000000
## slope 0.00000000
## num 0.00000000
The above plot represents the proportion of missing data in the dataset. As we can see, there are a few missing values in the variables ‘ca’ and ‘thal’. The red bars denote the percentage of missing values. Next, we replace each missing value with the most frequent value of that variable, as sketched below.
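A minimal sketch of this mode imputation, assuming the data frame is called `heart`; the helper function is illustrative rather than the exact code used here.

```r
# Hypothetical sketch: replace missing values in 'ca' and 'thal' with the
# most frequent (modal) value of each variable.
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))        # most frequent value
  if (is.numeric(x)) mode_val <- as.numeric(mode_val)
  x[is.na(x)] <- mode_val
  x
}
heart$ca   <- impute_mode(heart$ca)
heart$thal <- impute_mode(heart$thal)
```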
## age trestbps chol thalach oldpeak
## age 1.0000000 0.28494592 0.208950270 -0.393805806 0.20380548
## trestbps 0.2849459 1.00000000 0.130120108 -0.045350879 0.18917097
## chol 0.2089503 0.13012011 1.000000000 -0.003431832 0.04656399
## thalach -0.3938058 -0.04535088 -0.003431832 1.000000000 -0.34308539
## oldpeak 0.2038055 0.18917097 0.046563989 -0.343085392 1.00000000
One of the assumptions of a linear model is that the predictor variables should be independent, so we check for linear relationships between our predictors. To check for multicollinearity, a correlation plot is used. The size and colour of the squares represent the magnitude of the correlation between two variables. We want our independent variables to have as little collinearity as possible. From the correlation matrix, we can see that the strongest correlation is between thalach and age, at -0.3938. Since its absolute value is well below 0.5, we can conclude that the predictors are at most weakly correlated.
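A minimal sketch of how such a correlation plot might be produced, assuming the `corrplot` package and the `heart` data frame from above.

```r
# Hypothetical sketch: correlation matrix of the numeric predictors,
# visualized so that square size and colour reflect the magnitude of r.
library(corrplot)
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
cor_mat  <- cor(heart[, num_vars])
corrplot(cor_mat, method = "square")
```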
As we can see from the plot, resting blood pressure increases with age for both males and females. This tells us that there is a positive relationship between resting blood pressure and age.
The age versus serum cholesterol plot shows a slight positive relationship: as age increases, the serum cholesterol level increases a little. There are almost no outliers except one, a female with a very high cholesterol level. We will remove any such outliers that might affect the results drastically.
The above plot shows people having heart disease in various age groups. The most common age range across all levels of heart disease severity is 56 to 61 years. There are also some outliers, where certain people with severity 1, 2, or 4 fall outside the common age range for that severity. This tells us that a few young people also have heart disease.
Here the value 0 refers to female and the value 1 refers to male. From the sex versus heart disease plot it is clear that men have a higher chance of having heart disease than women. Severity 1 accounts for the largest share of the diseased population, and the number of people with heart disease decreases as the severity increases.
## Mean Stdev Median Minimum Maximum 1. Quartile 3. Quartile
## age 54.438944 9.038662 56.0 29 77.0 48.0 61.0
## trestbps 131.689769 17.599748 130.0 94 200.0 120.0 140.0
## chol 246.693069 51.776918 241.0 126 564.0 211.0 275.0
## thalach 149.607261 22.875003 153.0 71 202.0 133.5 166.0
## oldpeak 1.039604 1.161075 0.8 0 6.2 0.0 1.6
Above is the summary table and histogram of the numeric variables in the dataset. For ‘age’, the mean is 54.44 and the median is 56.0, so the age distribution is slightly skewed to the left. ‘trestbps’ has a mean resting blood pressure of 131.69 and a median of 130.0; since the median is slightly smaller, the distribution is skewed to the right by a very small amount. For ‘chol’, the mean serum cholesterol level is 246.69 and the median is 241.0; the median is below the mean, so the distribution of cholesterol level is skewed to the right. ‘thalach’ has a mean maximum heart rate achieved of 149.61 and a median of 153.0; the median exceeds the mean, so the distribution of maximum heart rate achieved is skewed to the left. ‘oldpeak’ (ST depression induced by exercise) has a mean of 1.04 and a median of 0.8; since the median is less than the mean, the distribution is skewed to the right.
From the summary it is also clear that the standard deviations of the numeric variables differ widely. The standard deviation of cholesterol (chol) is 51.78, which is quite high and tells us the values are spread over a wide range. The standard deviation of maximum heart rate achieved (thalach) is also high, at 22.88, and that of resting blood pressure (trestbps) is high compared with age and oldpeak. In addition, the distribution of oldpeak is skewed to the right, so we will try transforming it and see whether the results change. The reason for such varying standard deviations is that the variables measure different quantities in different units, so we need to put them on a common scale before comparing them. We will therefore normalize the data to zero mean and unit standard deviation and then check the accuracy of the model.
While most of the variables loosely follow a normal distribution, ‘oldpeak’ is heavily skewed. A log transformation did not improve the skewness, but after a square-root transformation the variable does tend toward a normal distribution, except for the concentration of values at 0. The data are then cleaned and given proper levels: most of the variables are categorical, so we convert them into factors and assign levels, as sketched below.
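A minimal sketch of this preprocessing step. The factor labels follow the UCI attribute documentation; the exact labels and levels used in this report are assumptions.

```r
# Hypothetical sketch: square-root transform the right-skewed 'oldpeak' and
# convert the coded categorical fields to labelled factors.
heart$oldpeak <- sqrt(heart$oldpeak)
heart$sex   <- factor(heart$sex,   levels = c(0, 1), labels = c("female", "male"))
heart$cp    <- factor(heart$cp,    levels = 1:4,
                      labels = c("typical", "atypical", "nonanginal", "asymptomatic"))
heart$slope <- factor(heart$slope, levels = 1:3,
                      labels = c("upsloping", "flat", "downsloping"))
heart$thal  <- factor(heart$thal,  levels = c(3, 6, 7),
                      labels = c("normal", "fixed", "reversible"))
heart$fbs     <- factor(heart$fbs)
heart$restecg <- factor(heart$restecg)
heart$exang   <- factor(heart$exang)
heart$ca      <- factor(heart$ca)
heart$num     <- factor(heart$num, levels = c(0, 1))
```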
Since our response variable is binary, we use a logistic regression model to train and predict on the dataset. Initially we include all the variables to train and test the model. Here we used the non-normalized data, with the variables keeping their original means and standard deviations.
We create training and testing datasets, train the model on the training set, and then use the testing set to check the accuracy of the model and predict values. We split the dataset 75%/25%: the training dataset consists of 75% of the observations and the testing dataset of the remaining 25%.
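A minimal sketch of the split and the full logistic regression model. The seed, the `hrtdattest` name, and the use of `caret::createDataPartition` are assumptions; `hrtdattrain` matches the object name shown in the model output later in this report.

```r
# Hypothetical sketch: 75/25 split, full logistic regression model, and
# confusion matrices for the training and testing sets.
library(caret)
set.seed(123)                                   # assumed seed
idx         <- createDataPartition(heart$num, p = 0.75, list = FALSE)
hrtdattrain <- heart[idx, ]
hrtdattest  <- heart[-idx, ]

fit_full <- glm(num ~ ., data = hrtdattrain, family = "binomial")

pred_train <- factor(ifelse(predict(fit_full, hrtdattrain, type = "response") > 0.5, 1, 0),
                     levels = levels(hrtdattrain$num))
pred_test  <- factor(ifelse(predict(fit_full, hrtdattest, type = "response") > 0.5, 1, 0),
                     levels = levels(hrtdattest$num))
confusionMatrix(pred_train, hrtdattrain$num)
confusionMatrix(pred_test,  hrtdattest$num)
```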
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 25
## 1 6 79
##
## Accuracy : 0.8634
## 95% CI : (0.8118, 0.9053)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.721
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.9512
## Specificity : 0.7596
## Pos Pred Value : 0.8239
## Neg Pred Value : 0.9294
## Prevalence : 0.5419
## Detection Rate : 0.5154
## Detection Prevalence : 0.6256
## Balanced Accuracy : 0.8554
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The AIC of this model is 186.21. The accuracy on the testing dataset came out to be 84.21%, with a sensitivity of 0.9024, a specificity of 0.7714, and a p-value of 2.51e-08, which is good. Next we will normalize the data to zero mean and a common standard deviation and see how the accuracy of the model changes.
I normalized the data so that each variable has zero mean and unit standard deviation; in this way, variables measured in different units can be compared on the same scale. The dataset is then split into training and testing sets in a 75%/25% ratio. Next I created a model including all the variables, fit it with the generalized linear model function, and evaluated the predictions on both the training and testing datasets using a confusion matrix.
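A minimal sketch of this standardization, assuming the numeric columns of `heart` are scaled in place before re-splitting and refitting as above.

```r
# Hypothetical sketch: standardize the numeric variables to zero mean and
# unit standard deviation; the factor variables are left unchanged.
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
heart[num_vars] <- scale(heart[num_vars])
```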
Since our response variable is binary, we again use a logistic regression model to train and predict on the dataset, initially including all the variables. Here we used the normalized data, with zero mean and a common standard deviation across all variables.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 25
## 1 6 79
##
## Accuracy : 0.8634
## 95% CI : (0.8118, 0.9053)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.721
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.9512
## Specificity : 0.7596
## Pos Pred Value : 0.8239
## Neg Pred Value : 0.9294
## Prevalence : 0.5419
## Detection Rate : 0.5154
## Detection Prevalence : 0.6256
## Balanced Accuracy : 0.8554
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The accuracy on the training dataset came out to be 86.34%, which is quite good, and the accuracy on the testing dataset is 84.21%. The testing accuracy is only slightly below the training accuracy, which suggests the model is not badly overfitting. The sensitivity of the training model is 0.9512 and its specificity is 0.7596, while the sensitivity of the testing model is 0.9024 and its specificity is 0.7714. The p-values for both models are below 0.05. Since this model includes all the variables, including insignificant ones, we will try to improve the accuracy by removing insignificant variables with a stepwise procedure. Normalizing the data did not increase or decrease the accuracy, but previously the intercept had a p-value of 0.0677, which is greater than 0.05; after normalization the intercept's p-value is 0.00065, which is below 0.05, so the intercept is now statistically significant.
In the backward stepwise method we use the step function to eliminate, one by one, the variables with high p-values that are insignificant. We end up with the model that has the lowest AIC value; the lower the AIC, the better the model fit. We also want to make sure that the residual deviance does not increase while the AIC decreases.
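A minimal sketch of this backward elimination. The `back` and `hrtdattrain` names match the model call shown in the summary below; the rest is assumed.

```r
# Hypothetical sketch: backward stepwise selection by AIC, starting from the
# full logistic regression model.
full_model <- glm(num ~ ., data = hrtdattrain, family = "binomial")
back <- step(full_model, direction = "backward", trace = FALSE)
summary(glm(formula(back), family = "binomial", data = hrtdattrain))
```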
##
## Call:
## glm(formula = formula(back), family = "binomial", data = hrtdattrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7829 -0.4623 -0.1545 0.4342 2.6337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6818 0.8033 -3.339 0.000842 ***
## trestbps 0.3860 0.2048 1.885 0.059420 .
## thalach -0.4059 0.2881 -1.409 0.158821
## oldpeak 0.4697 0.2421 1.940 0.052400 .
## sex.male 1.4025 0.5596 2.506 0.012202 *
## cp.asymptomatic 2.2423 0.4544 4.934 8.04e-07 ***
## slope.flat 1.1644 0.4846 2.403 0.016277 *
## ca.1 1.3530 0.5432 2.491 0.012743 *
## ca.2 2.6611 0.7965 3.341 0.000834 ***
## thal.normal -1.1746 0.4588 -2.560 0.010467 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 226 degrees of freedom
## Residual deviance: 151.41 on 217 degrees of freedom
## AIC: 171.41
##
## Number of Fisher Scoring iterations: 6
From the summary we can see that the step function has given us a model with fewer variables. The lowest AIC obtained by the step function is 171.41, with a residual deviance of 151.41. We therefore select this lowest-AIC model, generate predictions on the training and testing datasets once again, and check whether the accuracy improves.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 116 23
## 1 7 81
##
## Accuracy : 0.8678
## 95% CI : (0.8167, 0.909)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7306
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.9431
## Specificity : 0.7788
## Pos Pred Value : 0.8345
## Neg Pred Value : 0.9205
## Prevalence : 0.5419
## Detection Rate : 0.5110
## Detection Prevalence : 0.6123
## Balanced Accuracy : 0.8610
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 10
## 1 4 25
##
## Accuracy : 0.8158
## 95% CI : (0.7103, 0.8955)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 4.281e-07
##
## Kappa : 0.6246
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.9024
## Specificity : 0.7143
## Pos Pred Value : 0.7872
## Neg Pred Value : 0.8621
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.6184
## Balanced Accuracy : 0.8084
##
## 'Positive' Class : 0
##
From the confusion matrix summaries we can see that although the accuracy of the training model has increased by a small margin, the accuracy of the testing model has actually decreased by about 3%. Hence this new model has lower accuracy and is no better than the first model at predicting the presence or absence of heart disease.
The receiver operating characteristic (ROC) curve is a graphical representation of classifier performance over all possible thresholds between 0 and 1. It plots sensitivity against 1 - specificity and gives us a range of cutoff (threshold) values that we can use when predicting with the model. Here we look for the optimal threshold value for predicting with the model.
## [,1]
## 0 vs. 1 0.912892
From the ROC plot we try to improve the model by choosing different threshold values. A cutoff of 0.4 gives an accuracy of 84.21%; a cutoff of 0.3 also gives 84.21%, the same as 0.4; a cutoff of 0.5 gives 84.21%; and a cutoff of 0.6 also gives about 84%. Hence we will use a cutoff of 0.5. While the accuracy at different thresholds is the same for the testing dataset, the accuracy on the training dataset does vary.
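A minimal sketch of the ROC curve, its AUC, and the test-set accuracy at a few candidate cutoffs, assuming the `pROC` package, the reduced model `back`, and a test set named `hrtdattest`.

```r
# Hypothetical sketch: ROC curve and AUC for the test-set predictions, plus
# test-set accuracy at several cutoff values.
library(pROC)
prob_test <- predict(back, hrtdattest, type = "response")
roc_obj   <- roc(hrtdattest$num, prob_test)
plot(roc_obj)
auc(roc_obj)

for (cut in c(0.3, 0.4, 0.5, 0.6)) {
  pred <- ifelse(prob_test > cut, 1, 0)
  acc  <- mean(pred == as.numeric(as.character(hrtdattest$num)))
  cat("cutoff", cut, ": accuracy", round(acc, 4), "\n")
}
```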
Since we are not able to push the testing accuracy above 84.21% by using different threshold values, we will try other methods to predict the presence of heart disease.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 19
## 1 9 85
##
## Accuracy : 0.8767
## 95% CI : (0.8267, 0.9164)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7497
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.9268
## Specificity : 0.8173
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9043
## Prevalence : 0.5419
## Detection Rate : 0.5022
## Detection Prevalence : 0.5859
## Balanced Accuracy : 0.8721
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 8
## 1 4 27
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6796
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9024
## Specificity : 0.7714
## Pos Pred Value : 0.8222
## Neg Pred Value : 0.8710
## Prevalence : 0.5395
## Detection Rate : 0.4868
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
The support vector machine classifier also predicted the testing dataset with an accuracy of 84.21%, the same as the logistic regression model, with a sensitivity of 0.9024 and a specificity of 0.7714. Among the kernels tried, the linear kernel performed best on this data.
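A minimal sketch of how such a linear-kernel SVM might be fit, assuming the `e1071` package and the same training/testing split as before.

```r
# Hypothetical sketch: support vector machine with a linear kernel,
# evaluated on the held-out test set.
library(e1071)
svm_fit  <- svm(num ~ ., data = hrtdattrain, kernel = "linear")
svm_pred <- predict(svm_fit, hrtdattest)
confusionMatrix(svm_pred, hrtdattest$num)
```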
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 120 9
## 1 3 95
##
## Accuracy : 0.9471366
## 95% CI : (0.9094795, 0.9723897)
## No Information Rate : 0.5418502
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8930506
## Mcnemar's Test P-Value : 0.1489147
##
## Sensitivity : 0.9756098
## Specificity : 0.9134615
## Pos Pred Value : 0.9302326
## Neg Pred Value : 0.9693878
## Prevalence : 0.5418502
## Detection Rate : 0.5286344
## Detection Prevalence : 0.5682819
## Balanced Accuracy : 0.9445356
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 8
## 1 7 27
##
## Accuracy : 0.8026316
## 95% CI : (0.6954487, 0.8851147)
## No Information Rate : 0.5394737
## P-Value [Acc > NIR] : 0.000001556411
##
## Kappa : 0.6019553
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8292683
## Specificity : 0.7714286
## Pos Pred Value : 0.8095238
## Neg Pred Value : 0.7941176
## Prevalence : 0.5394737
## Detection Rate : 0.4473684
## Detection Prevalence : 0.5526316
## Balanced Accuracy : 0.8003484
##
## 'Positive' Class : 0
##
I trained a neural network with 2 hidden layers and 2 neurons in each layer. The predictions on the training set were 94.7 percent accurate, whereas those on the testing set (unseen data) were only about 80 percent accurate. One reason might be that there are very few observations, so the neural network cannot capture all the complexity in the data. Hence we can conclude that the neural network tends to overfit the data, resulting in poorer predictions on new data.
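A minimal sketch of such a network with the `neuralnet` package, under stated assumptions: `neuralnet` works with numeric inputs, so the factor predictors are expanded with `model.matrix` first, and the seed, encoding, and object names are illustrative rather than the exact code used here.

```r
# Hypothetical sketch: neural network with two hidden layers of two neurons
# each. Factors are expanded to dummy columns with model.matrix first.
library(neuralnet)
x_train  <- model.matrix(num ~ . - 1, data = hrtdattrain)
x_test   <- model.matrix(num ~ . - 1, data = hrtdattest)
train_nn <- data.frame(x_train, num = as.numeric(as.character(hrtdattrain$num)))

# build the formula explicitly, since older neuralnet versions reject "y ~ ."
preds   <- setdiff(colnames(train_nn), "num")
nn_form <- as.formula(paste("num ~", paste(preds, collapse = " + ")))

set.seed(123)                                    # assumed seed
nn_fit <- neuralnet(nn_form, data = train_nn, hidden = c(2, 2),
                    linear.output = FALSE)

nn_prob <- compute(nn_fit, data.frame(x_test))$net.result
nn_pred <- factor(ifelse(nn_prob > 0.5, 1, 0), levels = levels(hrtdattest$num))
confusionMatrix(nn_pred, hrtdattest$num)
```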
We used 14 predictor variables from the UCI heart disease dataset to predict the presence of heart disease in patients, comparing several models. The first model, logistic regression with all variables, gave an accuracy of 84.21%. We then used a stepwise procedure to remove insignificant variables and refit the model, but the accuracy actually decreased. After the stepwise procedure failed to improve the model, we used the ROC curve to look for an optimal cutoff value, but different cutoffs (0.3, 0.4, 0.5) gave the same accuracy. Next we used a support vector machine, which also gave an accuracy of 84.21% on the testing dataset, with essentially the same Type I and Type II errors. Finally, I used a neural network with 2 hidden layers and 2 neurons in each layer; although the training accuracy increased to about 94%, the testing accuracy dropped to 80%, which is 4% lower than logistic regression and the support vector machine.
Hence we conclude that the highest accuracy we can achieve is 84.21%, and there is no further need to try additional models. From the heart data results we can also say that the best predictors of the presence of heart disease are maximum heart rate achieved (thalach), chest pain type (cp), number of major vessels (ca), and defect type (thal).