Executive Summary

In this report, we analyzed United States Centers of Disease Control and Prevention (CDC) “Heart Disease” 2020 annual survey data in order to find key indicators of heart disease and do predictions. Other interesting applications we explored include causal analysis of related diseases, which helps the detection of secondary condition.

The original dataset consists of 319,795 respondents and 18 variables. We first performed the explanatory variable analysis (EDA) to see if any variables that were worth paying more attention to. The variables were divided into categorical ones (14 variables) and numerical ones (4 variables), on which we conducted univariate and bivariate analysis with our response variable HeartDisease. We thus reached the following insights (more details will be displayed in each section):

The dataset is quite unbalanced with only 8.56% of the observations having heart disease.
Among inveterate smokers, older people, and those with history of certain diseases (stroke, difficulty of walking, diabetic, kidney disease, skin cancer), the percentage of people who have heart disease is likely to be significantly higher.
Intuitively, based on the definition of variables, we may have multicollinearity issue. While, as the largest value from the correlation matrix is below 0.5, we would use PCA and Lasso for further investigation.

In the data cleaning and pre-processing step, we decided to use the under-sampling method to address the issue of unbalanced data, which could avoid poor models leaning towards the prediction of “No Heart Disease” and reduce the size of the dataset as the original one with 320k rows took forever long to run. Eventually, we end up having about 55k observations (17% of the original data). 50% of them are the ones with heart disease and the rest 50% do not have heart disease. With the new dataset, we built models on our training set (80% of the data), and compared their out-of-sample performance.

For the inference part, we applied 2 parametric models and 6 nonparametric models. Here, PCA analysis and Lasso did not suggest significant dimension reduction. Recursive Feature Elimination (RFE) and Genetic Algorithm (GA) were also applied for feature selection. In general, it turned out that Age, Stroke, Diabetic, and self-reported GenHealth are key indicators of heart disease.

For the secondary disease detection, our two-stage Lasso showed that stroke and kidney disease have significant causal effects on heart disease. More specifically, if a respondent was diagnosed with stroke, his/her odds of having heart disease was expected to be multiplied by 2.94, keeping everything else unchanged. Then doctors can give precautionary advice and medications, so that the risks of getting heart disease can be mitigated.

For the prediction performance, after running all the models, we selected one or several optimal models by comparing their prediction accuracy rates, false positive rates and false negative rates. Specifically, we first selected those with the highest accuracy rates, then compared their false negative rates. We put more emphasize on false negative rate since we prefer to diagnose as many subjects who actually have heart disease as possible and give them subsequent treatment. This requires the model to have the lowest possible false negative rate. Also, we will consider the computational complexity of the model. The size of the analyzed dataset can be used to decide whether to use a model with higher computational load but better prediction performance.

In conclusion, following the models comparison, we finally chose logistic regression as the optimal model. It has both high accuracy rate and low false negative rate. Moreover, the computational complexity of logistic regression is low, so it is a feasible method even when dealing with data that has a large number of observations.

Introduction

Background

Heart disease has been the leading cause of death in the United States for most racial and ethnic groups since the last century (CDC). It is reported that, each year in the US, approximately 1 in every 4 deaths is from heart diseases (Virani SS, 2021). Heart disease is multi-factorial. Risk factors include modifiable ones like cigarette smoking, alcohol consumption, physical inactivity, and ones that cannot be modified like age, gender, and heredity. In 2019, there was half of all Americans (47%) had at least one of the three key factors of heart disease: high blood pressure, high cholesterol level, and smoking (Fryar CD, 2012), and the percentage was still mounting. In addition to societal healthcare burdens, the direct medical cost and loss of productivity it causes are enormous. Many researches dedicate their efforts to study heart disease, one of the most prevalent and costly killers.

Dataset

The dataset for this project is the “Personal Key Indicators of Heart Disease” on Kaggle, which consists of 319,795 respondents and 18 variables. It was originally derived from CDC dataset released in February 2022 by discarding some irrelevant variables. The dataset is also a major part of the Behavioral Risk Factor Surveillance System (BRFSS), the world’s largest continuously conducted health survey system. The 18 variables (9 Booleans, 5 strings and 4 decimals) record the answers from respondents regarding their basic information, living habits and health conditions, which may have potential relationships with heart disease. Please refer to Appendix I for column descriptions.

Goal/Question of Interest

In this project, we mainly treated heart disease as the response variable, then did both inference and prediction. More specifically, we wondered

Which factor have a significant effect on the likelihood of heart disease? (Inference)
For a new person, can we predict his/her risk of heart disease? Is the prediction powerful? (Prediction)

Other interests or potential applications include how other diseases, e.g., kidney disease, skin cancer, affect heart disease? Can we detect any pattern of secondary condition or syndrome between them, then take actions beforehand?

EDA

Variable Description:

\(\bullet\) HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).

\(\bullet\) BMI: Body Mass Index (BMI).

\(\bullet\) Smoking: Have you smoked at least 100 cigarettes in your entire life?

\(\bullet\) AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) or not?

\(\bullet\) Stroke: (Ever told) (you had) a stroke?

\(\bullet\) PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days).

\(\bullet\) MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days).

\(\bullet\) DiffWalking: Do you have serious difficulty walking or climbing stairs?

\(\bullet\) Sex: Are you male or female?

\(\bullet\) AgeCategory: Fourteen-level age category.

\(\bullet\) Race: Imputed race/ethnicity value.

\(\bullet\) Diabetic: (Ever told) (you had) diabetes?

\(\bullet\) PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.

\(\bullet\) GenHealth: Would you say that in general your health is… (General Health Status)

\(\bullet\) SleepTime: On average, how many hours of sleep do you get in a 24-hour period?

\(\bullet\) Asthma: (Ever told) (you had) asthma?

\(\bullet\) KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?

\(\bullet\) SkinCancer: (Ever told) (you had) skin cancer?

Visualization

Column Statistics of Categorical Data

a) Data with two levels

Interpretation:

In the above pie charts, the red part represents the percentage of people that have certain type of diseases (Heart disease; Stroke; Difficulty of walking; Asthma; Kidney disease; Skin cancer) or certain type of habits(Smoking; Alcohol drinking; Doing physical activity) or their sex, while the blue part represents that they don’t have certain features. These plots vividly shows that:

\(\bullet\) The dataset is not balanced, there is way more respondents without heart disease (91.44%) than respondents who have heart disease (8.56%).

\(\bullet\) In terms of other health conditions: 3.77% people have ever had a stroke, 13.41% people have had asthma, 3.68% people have had kidney disease, 9.32% people have had skin cancer, and 13.89% people have serious difficulty walking or climbing stairs.

\(\bullet\) Inveterate smokers are more common than heavy drinkers, where two-fifths of the respondents have smoked at least 100 cigarettes in their entire life, and heavy drinkers count for 6.81%.

\(\bullet\) Other than their regular jobs, 77.54% people did physical activity/exercise in the past 30 days and 22.46% didn’t.

\(\bullet\) 47.53% people are male and 52.47% are females.

b) Data with more than two levels

Interpretation:

From multi-level pie charts above, we find that:

\(\bullet\) The 13 levels of age groups cover people from 18 to 80 or older. Among them, approximately 40% are 60-79 years old.

\(\bullet\) Race has 7 categories, where the white accounts for 76.68%.

\(\bullet\) As for diabetes (4 situations) and general health conditions (5 levels), the majority performs well.

Column Statistics of Numerical Data

a) Summary Statistics

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
BMI	12.02	24.03	27.34	28.325398	31.42	94.85
PhysicalHealth	0.00	0.00	0.00	3.371710	2.00	30.00
MentalHealth	0.00	0.00	0.00	3.898366	3.00	30.00
SleepTime	1.00	6.00	7.00	7.097075	8.00	24.00

b) Graphics

Interpretation:

The above plots shows the density of each continuous variables. In the first plot, it shows the density of the BMI. It has a roughly bell-shaped curved, though right skewed. The highest density occurs when BMI is around 27-28, and the 25% - 75% percentile of BMI is from 24.03 to 31.42.

The second group of plots shows the density of people’s PhysicalHealth and MentalHealth. Nearly half of the people think they’re physically/mentally healthy. None of the 30 days, they felt unwell. Among all, only a small portion of them felt unwell for the entire 30 days. And the rest people were in between, some days they felt well, while other days, they didn’t.

The third plot shows the distribution of people’s sleep time. Most people sleep 6 to 8 hours per day.

Exploratory Analysis

In this section, we performed exploratory analysis. Our potential response variable is HeartDisease and the rest of variables could be the predictors. Below, we performed analysis for each predictors based on whether they have heart disease or not.

Categorical predictors

Interpretation:

In the above stacked bar chart, it vividly shows the proportion of people who have heart disease within the level of each categorical variable. For example, in the first plot, among non-smoking people, approximately, 6.03% of them have heart disease. While among the smoking people, 12.16% of them have heart disease. There are some insights we got from these stacked plots.

\(\bullet\) The percentage of people who have heart disease among smokers is twice of the percentage among non-smokers.

\(\bullet\) The percentage of people who have heart disease among certain types of diseases (Stroke, Difficulty of walking, Diabetic, Kidney Disease, Skin cancer) is significantly higher than the ones who don’t have these diseases.

\(\bullet\) The percentage of people who have heart disease among males is 4% higher than the percentages among females.

\(\bullet\) In terms of different age groups, we can see that, as age grows older, there is a higher percentage of people getting heart disease.

\(\bullet\) Asians have the lowest heart disease rate (3.3%), and Whites have the highest (9.18%).

\(\bullet\) Among people who do physical exercise, there is nearly a half percentage lower of people getting heart disease than the ones who don’t do exercises.

Numerical predictors

Interpretation:

These overlapped plots indicate the density of each numerical variable among people who have heart disease verses the people who don’t have it. From these plots, it vividly shows that our data is unbalanced. The number of people who do not have heart disease is far more than it for people who have heart disease. In this situation, we will not end up with a good model when doing classification. The algorithm learns that a given class is more common, making it “natural” for there to be a greater tendency towards it. Then, it is prone to overfitting the majority class. Therefore, the unbalanced data is the first issue that we need to deal with during the data preparation process.

Data Cleaning and Preprocessing

In this part, first of all, we are going modify the the data types of some variables. Originally, except for the sex variable, all the other two-level variables are coded as “Yes” or “No”. It will be inconvenient for our later analysis, so we change “Yes” into 1 and “No” into 0. Thus, these categorical variables are transformed into binary variables. As for the multi-level categorical variables, we keep them as factor variables.

Then, we make a correlation plot to display the correlations between each variables.

Then, we deal with the imbalanced data. Our data set is really huge, which contains around 320k observations. Most of them (\(\frac{9}{10}\)) are observations with no heart diseases, and only (\(\frac{1}{10}\)) of them have heart diseases. Therefore, when we do classification, it will lean towards the result “No Heart Disease”. There’s a huge bias. In addition, our data set is so huge that it will take a long time (several hours) for R to run each model. We decided to apply the under sampling method to address the imbalance issue as well as reducing the size of the dataset.

Since we have so few observations for people with heart disease, we draw samples randomly from the this heart-disease group until we have about 50% of the observations in our dataset have heart disease. Then, we take 80% of the new dataset to be our training set and 20% to be the test set.

	Original Dataset	New Dataset
Num of Observations:	319795	54746 (17% of the original)
W/ Heart_Disease:	27373	27373
W/O Heart_Disease:	292422	27373

Models

In this section, we build models mainly from two perspectives, regression and classification methods.

Parametric: Logistic Regression, Lasso and Causal Lasso.

Non-parametric: KNN, SVM, Single Tree, Random Forest, Extreme Gradient Boosting Model (XGBoost) and Neural Network model.

Logistic Regression

## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9809  -0.7829   0.1245   0.8144   2.8975  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.351026   0.152076 -15.460  < 2e-16 ***
## BMI                                 0.011814   0.002031   5.816 6.02e-09 ***
## Smoking                             0.387523   0.024320  15.935  < 2e-16 ***
## AlcoholDrinking                    -0.212241   0.052980  -4.006 6.17e-05 ***
## Stroke                              1.109247   0.049545  22.389  < 2e-16 ***
## PhysicalHealth                      0.005251   0.001616   3.249 0.001157 ** 
## MentalHealth                        0.006541   0.001586   4.125 3.71e-05 ***
## DiffWalking                         0.242312   0.033702   7.190 6.48e-13 ***
## SexMale                             0.733439   0.024751  29.632  < 2e-16 ***
## AgeCategory25-29                    0.161946   0.150671   1.075 0.282450    
## AgeCategory30-34                    0.553744   0.136180   4.066 4.78e-05 ***
## AgeCategory35-39                    0.600682   0.131878   4.555 5.24e-06 ***
## AgeCategory40-44                    0.963836   0.125757   7.664 1.80e-14 ***
## AgeCategory45-49                    1.220958   0.122296   9.984  < 2e-16 ***
## AgeCategory50-54                    1.665175   0.117882  14.126  < 2e-16 ***
## AgeCategory55-59                    1.918945   0.115619  16.597  < 2e-16 ***
## AgeCategory60-64                    2.213667   0.114472  19.338  < 2e-16 ***
## AgeCategory65-69                    2.468989   0.114121  21.635  < 2e-16 ***
## AgeCategory70-74                    2.808239   0.114490  24.528  < 2e-16 ***
## AgeCategory75-79                    2.981080   0.116075  25.682  < 2e-16 ***
## AgeCategory80 or older              3.322867   0.115818  28.691  < 2e-16 ***
## RaceBlack                          -0.203105   0.048610  -4.178 2.94e-05 ***
## RaceAsian                          -0.297475   0.097553  -3.049 0.002293 ** 
## RaceAmerican Indian/Alaskan Native  0.053924   0.091544   0.589 0.555827    
## RaceOther                          -0.006137   0.067669  -0.091 0.927739    
## RaceHispanic                       -0.087259   0.050317  -1.734 0.082885 .  
## DiabeticNo, borderline diabetes     0.243780   0.074790   3.260 0.001116 ** 
## DiabeticYes (during pregnancy)      0.051897   0.161693   0.321 0.748239    
## DiabeticYes                         0.538971   0.031329  17.204  < 2e-16 ***
## PhysicalActivity                    0.008341   0.028202   0.296 0.767426    
## GenHealthFair                      -0.299472   0.060556  -4.945 7.60e-07 ***
## GenHealthGood                      -0.844701   0.062410 -13.535  < 2e-16 ***
## GenHealthVery good                 -1.366239   0.065368 -20.901  < 2e-16 ***
## GenHealthExcellent                 -1.843852   0.072342 -25.488  < 2e-16 ***
## SleepTime                          -0.027149   0.007728  -3.513 0.000443 ***
## Asthma                              0.344943   0.034510   9.996  < 2e-16 ***
## KidneyDisease                       0.654859   0.051815  12.638  < 2e-16 ***
## SkinCancer                          0.144184   0.035733   4.035 5.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60714  on 43795  degrees of freedom
## Residual deviance: 43460  on 43758  degrees of freedom
## AIC: 43536
## 
## Number of Fisher Scoring iterations: 5

From the logistic regression, we find that, for instance, being an inveterate smoker is expected to multiply odds of heart disease by exp(0.387523) \(\approx\) 1.47, keeping everything else fixed.

The regression summary shows that, on average, people with smoking habit, history of other diseases (especially the stroke), and higher age category will have higher odds of heart disease. In addition, odds of heart disease is expected to be multiplied by 2 for a male compare to that of a female, given other things the same. It is interesting to find that doing extra physical activity will not have significant benefit as we expected, while alcohol drinking slightly decreases the odds of heart disease. These might suggest some confounding effects which need further investigation.

Prediction:

Confusion Matrix:

##    
##     FALSE TRUE
##   0  4138 1353
##   1  1211 4248

## [1] "Accuracy rate:  0.766"

## [1] "False positive rate:  0.242"

## [1] "False negative rate:  0.226"

Applying the model on the test set, we get an accuracy rate of 76.58%. The false positive rate and false negative rate are 24.16% and 22.64%, respectively. To improve the result, and further consider the possibility of dimension reduction, PCA and Lasso is applied.

PCA

Cumulative Proportion of Variance for the first 10 PCs:

##  [1] 0.1000140 0.1755696 0.2418532 0.3057870 0.3664403 0.4260800 0.4849573
##  [8] 0.5426664 0.5995771 0.6560721

The first 8 principal components only explain 55% of variations, and it is hard to explain the meanings of each PC. Therefore, we will not expect a very effective dimension reduction.

Lasso

## [1] "1se lambda:  0.00369193455419403"

## [1] "Number of non-zero covariates:  34"

Coefficients:

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                           seg80
## intercept                          -1.366279721
## BMI                                 0.009556514
## Smoking                             0.391052307
## AlcoholDrinking                    -0.135539460
## Stroke                              1.044050818
## PhysicalHealth                      0.010254354
## MentalHealth                        0.002308579
## DiffWalking                         0.311906876
## SexMale                             0.644823043
## AgeCategory25-29                   -0.911747767
## AgeCategory30-34                   -0.611210249
## AgeCategory35-39                   -0.580408065
## AgeCategory40-44                   -0.263844206
## AgeCategory45-49                   -0.024496504
## AgeCategory50-54                    0.230858283
## AgeCategory55-59                    0.493756952
## AgeCategory60-64                    0.789455616
## AgeCategory65-69                    1.033373881
## AgeCategory70-74                    1.364280268
## AgeCategory75-79                    1.519118496
## AgeCategory80 or older              1.841148321
## RaceBlack                          -0.107337407
## RaceAsian                          -0.189512489
## RaceAmerican Indian/Alaskan Native  .          
## RaceOther                           .          
## RaceHispanic                       -0.056526987
## DiabeticNo, borderline diabetes     0.153552071
## DiabeticYes (during pregnancy)      .          
## DiabeticYes                         0.557305083
## PhysicalActivity                    .          
## GenHealthFair                       0.002218896
## GenHealthGood                      -0.432710420
## GenHealthVery good                 -0.945962262
## GenHealthExcellent                 -1.411680784
## SleepTime                          -0.010297808
## Asthma                              0.257901977
## KidneyDisease                       0.611018557
## SkinCancer                          0.152189359

The chosen lambda value from 1se rule is 0.003691935, quite close to 0. It means most of the covariates still have a non-zero estimated coefficient. Actually, 34 covariates are left. The coefficients have a similar pattern as those from the logistic regression with some shrinkage.

Prediction:

Confusion Matrix:

##    
##     FALSE TRUE
##   0  4162 1329
##   1  1270 4189

## [1] "Accuracy rate:  0.763"

## [1] "False positive rate:  0.241"

## [1] "False negative rate:  0.234"

The accuracy rate, false positive rate and false negative rate are comparable to the logistic regression model outcomes.

Causal Lasso

One interesting topic here is whether the diagnosis of certain diseases will have causal relationship with getting heart disease. If the causal relationship exists, doctors can give precautionary advice and medications to the patients who have such diseases, so that the risks of getting heart disease can be mitigated.

Examine if having stroke would cause heart disease

## [1] "In-sample R^2:  0.117"

The in-sample \(R^2\) value between \(d\) and \(\hat{d}\) is only 11.7%. The treatment still carries a large proportion of information that is independent of other covariates.

## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                          seg100
## intercept                          -1.551734485
## BMI                                 0.011090725
## Smoking                             0.421582897
## AlcoholDrinking                    -0.175329980
## PhysicalHealth                      0.009097748
## MentalHealth                        0.004001600
## DiffWalking                         0.284761425
## SexMale                             0.678448247
## AgeCategory25-29                   -0.811638185
## AgeCategory30-34                   -0.478290372
## AgeCategory35-39                   -0.443446374
## AgeCategory40-44                   -0.107025332
## AgeCategory45-49                    0.029243538
## AgeCategory50-54                    0.479354027
## AgeCategory55-59                    0.737529129
## AgeCategory60-64                    1.032566814
## AgeCategory65-69                    1.281170338
## AgeCategory70-74                    1.615669405
## AgeCategory75-79                    1.778949886
## AgeCategory80 or older              2.112569163
## RaceBlack                          -0.148448811
## RaceAsian                          -0.240665138
## RaceAmerican Indian/Alaskan Native  .          
## RaceOther                           .          
## RaceHispanic                       -0.073305930
## DiabeticNo, borderline diabetes     0.195499355
## DiabeticYes (during pregnancy)      .          
## DiabeticYes                         0.553922771
## PhysicalActivity                    .          
## GenHealthFair                       .          
## GenHealthGood                      -0.505205768
## GenHealthVery good                 -1.017264111
## GenHealthExcellent                 -1.486186016
## SleepTime                          -0.018476558
## Asthma                              0.293595704
## KidneyDisease                       0.635135908
## SkinCancer                          0.151779969
## seg99                               .          
## causal_d                            1.077015741

## [1] "The causal multiplied effect of stroke: 2.94"

The coefficient of treatment after excluding the confounding effect is 1.077. It means having stroke is expected to cause the odds of having heart disease multiplied by 2.94, keep everything else fixed. Similarly, we try for asthma, kidney disease, and skin cancer.

Try asthma:

## [1] "In-sample R^2:  0.066"

## [1] "The causal multiplied effect of asthma:  1.35"

Try kidney disease:

## [1] "In-sample R^2:  0.142"

## [1] "The causal multiplied effect of kidney disease:  1.89"

Try skin cancer:

## [1] "In-sample R^2:  0.146"

## [1] "The causal multiplied effect of skin cancer:  1.11"

In this part, we find that asthma and skin cancer have a relatively small causal relationship with heart disease, while the causal relationship between stroke and kidney disease with heart disease is notable. More specifically, keep everything else fixed, having stroke and kidney disease is expected to cause the odds of having heart disease multiplied by 2.94 and 1.89, respectively.

K-Nearest-Neighbour Model (Knn)

Tuning Parameters:

We first tune the parameters based on the accuracy rate. From this plot, we decide to use k = 9 as it has the highest accuracy rate.

## [1] "The final value used for the model was kmax = 9 , distance = 2 and kernel = optimal"

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4030 1485
##          1 1461 3974
##                                           
##                Accuracy : 0.731           
##                  95% CI : (0.7225, 0.7392)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4619          
##                                           
##  Mcnemar's Test P-Value : 0.6717          
##                                           
##             Sensitivity : 0.7339          
##             Specificity : 0.7280          
##          Pos Pred Value : 0.7307          
##          Neg Pred Value : 0.7312          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3680          
##    Detection Prevalence : 0.5037          
##       Balanced Accuracy : 0.7310          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.731"

## [1] "False positive rate:  0.272"

## [1] "False negative rate:  0.266"

Making prediction on the test set, from the above confusion matrix, by using KNN method, we got the accuracy rate is about 73.1%, false positive rate is about 27.2%, and false negative rate is about 26.6%.

Support Vector Machine (SVM)

Tuning Parameters:

## [1] "The final value used for the model was cost = 0.5 and Loss = L1"

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4068 1141
##          1 1423 4318
##                                           
##                Accuracy : 0.7658          
##                  95% CI : (0.7578, 0.7738)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5318          
##                                           
##  Mcnemar's Test P-Value : 2.866e-08       
##                                           
##             Sensitivity : 0.7408          
##             Specificity : 0.7910          
##          Pos Pred Value : 0.7810          
##          Neg Pred Value : 0.7521          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3715          
##    Detection Prevalence : 0.4757          
##       Balanced Accuracy : 0.7659          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.766"

## [1] "False positive rate:  0.209"

## [1] "False negative rate:  0.259"

Making prediction on the test set, from the above confusion matrix, by using SVM method, we got the accuracy rate is about 76.6%, false positive rate is about 20.9%, and false negative rate is about 25.9%.

Single Tree

For better performance, we move from parametric models to non-parametric ones. For completeness, we start from single tree, and then try random forest.

The single tree splits based on age category and general health status.

We prune the tree with K = 3 as the cross validation error rate stops decreasing significantly after this point. According to the pruned tree, if a patient has age category higher than “g” (50-54), and general health status lower than “d” (very good), he/she would have heart disease.

##    yhat_tree
##        0    1
##   0 4214 1277
##   1 1889 3570

## [1] "Accuracy rate:  0.711"

## [1] "False positive rate:  0.263"

## [1] "False negative rate:  0.31"

The pruned tree has worse prediction accuracy compared to previous models.

Random Forest - Grid Search

Tuning Parameters by using grid search:

## [1] "The final value used for the model was mtry = 2"

Importance Plot:

From this plot, variables with high importance are DiffWalking, Diabetic,GenHealth, and AgeCategory.

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4007 1129
##          1 1484 4330
##                                           
##                Accuracy : 0.7614          
##                  95% CI : (0.7533, 0.7693)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5228          
##                                           
##  Mcnemar's Test P-Value : 4.353e-12       
##                                           
##             Sensitivity : 0.7297          
##             Specificity : 0.7932          
##          Pos Pred Value : 0.7802          
##          Neg Pred Value : 0.7448          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3659          
##    Detection Prevalence : 0.4690          
##       Balanced Accuracy : 0.7615          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.761"

## [1] "False positive rate:  0.207"

## [1] "False negative rate:  0.27"

The accuracy rate is 76.1%, with false positive rate = 20.7%, false negative rate = 27%. As we expected, random forest has better performance compared to the single pruned tree. The results are also slightly better than Lasso.

Random Forest - Recursive Feature Elimination method (RFE)

Recursive Feature Elimination is an efficient feature selection algorithm that fits a model and eliminates the weakest features until the specified number of features is reached. We do feature selection by using REF method and then apply random forest again. Tuning parameters and then get our confusion matrix.

By using this method, the chosen variables are:

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3954 1122
##          1 1537 4337
##                                          
##                Accuracy : 0.7572         
##                  95% CI : (0.749, 0.7652)
##     No Information Rate : 0.5015         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5144         
##                                          
##  Mcnemar's Test P-Value : 9.857e-16      
##                                          
##             Sensitivity : 0.7201         
##             Specificity : 0.7945         
##          Pos Pred Value : 0.7790         
##          Neg Pred Value : 0.7383         
##              Prevalence : 0.5015         
##          Detection Rate : 0.3611         
##    Detection Prevalence : 0.4636         
##       Balanced Accuracy : 0.7573         
##                                          
##        'Positive' Class : 0              
##

## [1] "Accuracy rate:  0.757"

## [1] "False positive rate:  0.206"

## [1] "False negative rate:  0.28"

By using 10-fold cross-validation, we get the above confusion matrix. Its accuracy is similar as the previous method. The accuracy rate is about 75.7%, with false positive rate = 20.6% and false negative rate = 28%.

Random Forest - Genetic Algorithm (GA)

Genetic algorithm is inspired by Charles Darwin’s idea of natural selection. The fittest lives. This model uses the combinatory and optimization to choose the right number of variables in order to create a predictive model.

By using this method, the chosen variables are:

##  [1] "BMI"              "Smoking"          "AlcoholDrinking"  "PhysicalHealth"  
##  [5] "MentalHealth"     "DiffWalking"      "Sex"              "AgeCategory"     
##  [9] "Diabetic"         "PhysicalActivity" "GenHealth"        "SleepTime"       
## [13] "Asthma"           "KidneyDisease"    "SkinCancer"

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3968 1156
##          1 1523 4303
##                                           
##                Accuracy : 0.7553          
##                  95% CI : (0.7472, 0.7634)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5108          
##                                           
##  Mcnemar's Test P-Value : 1.536e-12       
##                                           
##             Sensitivity : 0.7226          
##             Specificity : 0.7882          
##          Pos Pred Value : 0.7744          
##          Neg Pred Value : 0.7386          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3624          
##    Detection Prevalence : 0.4679          
##       Balanced Accuracy : 0.7554          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.755"

## [1] "False positive rate:  0.212"

## [1] "False negative rate:  0.277"

Making prediction on the test set, from the above confusion matrix, by using Genetic Algorithm, we got the accuracy rate is about 75.5%, false positive rate is about 21.2%, and false negative rate is about 27.7%.

Extreme Gradient Boosting (XGBoost)

Extreme gradient boosting method is a both computationally efficient and highly effective ensemble machine learning algorithm that can be used for classification. Ensembles are construction by using decision trees. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. Below, we build the model, tune the parameters and then get the confusion matrix.

Tuning parameters:

We make a plot to compare the accuracy rate of using different values for parameters. We want to tune parameters such that we could have a highest accuracy rate. Then we got the following result.

## [1] "In this model, we use nrounds = 100 , max_depth = 3 , eta = 0.3 , gamma = 0"   
## [2] "colsample_bytree = 0.6 , Minimum Sum of Instance Weight = 1  and subsample = 1"

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4090 1134
##          1 1401 4325
##                                           
##                Accuracy : 0.7685          
##                  95% CI : (0.7605, 0.7764)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.537           
##                                           
##  Mcnemar's Test P-Value : 1.27e-07        
##                                           
##             Sensitivity : 0.7449          
##             Specificity : 0.7923          
##          Pos Pred Value : 0.7829          
##          Neg Pred Value : 0.7553          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3735          
##    Detection Prevalence : 0.4771          
##       Balanced Accuracy : 0.7686          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.768"

## [1] "False positive rate:  0.208"

## [1] "False negative rate:  0.255"

Making prediction on the test set, from the confusion matrix, we can know that, we have an accuracy about 76.8%, false positive rate = 20.8%, and false negative rate = 25.5%.

Neural Network

Tuning Parameters:

We make a plot to compare the accuracy rate of using different values for parameters. We want to tune parameters such that we could have a highest accuracy rate. Then we got the following result.

## [1] "The final value used for the model was size = 3 and decay = 0.01"

Prediction:

Confusion Matrix:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4051 1106
##          1 1440 4353
##                                           
##                Accuracy : 0.7675          
##                  95% CI : (0.7595, 0.7754)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5351          
##                                           
##  Mcnemar's Test P-Value : 4.124e-11       
##                                           
##             Sensitivity : 0.7378          
##             Specificity : 0.7974          
##          Pos Pred Value : 0.7855          
##          Neg Pred Value : 0.7514          
##              Prevalence : 0.5015          
##          Detection Rate : 0.3700          
##    Detection Prevalence : 0.4710          
##       Balanced Accuracy : 0.7676          
##                                           
##        'Positive' Class : 0               
##

## [1] "Accuracy rate:  0.767"

## [1] "False positive rate:  0.203"

## [1] "False negative rate:  0.262"

Making prediction on the test set, from the above confusion matrix, by using Neural Network method, the accuracy rate is about 76.7%, false positive rate is about 20.3%, and false negative rate is about 26.2%.

Model Comparison

We evaluate models by their out-of-sample performance, mainly on accuracy rate, false positive rate and false negative rate.

Here is the table containing the rates from all the models.

Model	Accuracy rate	False positive rate	False negative rate
Logistic	0.766	0.242	0.226
LASSO	0.763	0.241	0.234
KNN	0.731	0.272	0.266
SVM	0.766	0.209	0.259
Single Tree	0.711	0.263	0.310
Random Forest	0.761	0.207	0.270
XGBoost	0.768	0.208	0.255
RF-RFE	0.757	0.206	0.280
RF-GA	0.755	0.212	0.277
Neural Network	0.767	0.203	0.262

As we can see from the table, the prediction accuracy rates of most of the models are not very different, ranging from 71% to 77%. KNN and single tree are two models performing most poorly in prediction with an accuracy rate lower than 74%. Relatively speaking, logistic regression, SVM, XGBoost and neural network are the best models for prediction with an accuracy rate larger than 76.5%. Among these 4 models, logistic regression model has the lowest false negative rate 22.6%. Therefore, we finally choose logistic regression model as the optimal model. In this case, we can diagnose as many people who actually have heart disease as possible while ensuring high prediction accuracy rate. Meanwhile, logistic regression is a model with low computational complexity. This is also an advantage of logistic regression model.

One question that needs to be explained is why prediction accuracy rates of all models are below 80%, which is not high in absolute terms. That is because in the previous data pre-processing part, we only took 10% of the total data for training and testing in order to solve the problem of excessive computational complexity. We lost some useful information in this process.

Appendix - code

# EDA

library(ggplot2)
library(ggpubr)
library(cowplot)
library(plyr)
library(stringr)
library(gridExtra)
library(RColorBrewer)
library(psych)
library(tree)

data = read.csv("heart_2020_cleaned.csv")

# Categorical data
# a) Data with two levels   
chr.list = c()
for (name in colnames(data)) {
  if (is.character(data[,name])) {
    chr.list = append(chr.list, name)
  }
}
num.list = c()
for (name in colnames(data)) {
  if (is.numeric(data[,name])) {
    num.list = append(num.list, name)
  }
}

myplots <- vector("list",14)

for (i in 1:14) {
  name = chr.list[i]      
  df.plot = as.data.frame(count(data, name))
  mylabel = paste0(round(df.plot[,2]/nrow(data)*100, 2), "%")
  colnames(df.plot) = c("factor","number")
  df.plot$mylabel = mylabel
  
  if (i==9|i==11) {
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = str_wrap(factor, 20))) +   scale_fill_brewer(palette="Pastel1",direction = -1) + 
      geom_bar(stat="identity", width=0.8, color="white") +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void() + 
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5), 
            legend.key.size = unit(.2, 'cm')) + 
      geom_text(mapping = aes(x = 1.8, y = sum(number)-cumsum(number)+number/2, label = mylabel), size = 3)
  }else if(i==7){
    colors_palette = c("#FBB4AE","#B3CDE3","#CCEBC5","#DECBE4","#FED9A6",
                       "#FFFFCC","#E5D8BD", "#FDDAEC","#F8C3CD","#A8D8B9",
                       "#A5DEE4","#9B90C2","#FFBA84")
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = str_wrap(factor, 20))) +   scale_fill_manual(values = colors_palette) + 
      geom_bar(stat="identity", width=0.8, color="white") +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void() +
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5), 
            legend.key.size = unit(.2, 'cm')) + 
      geom_text(mapping = aes(x = 1.8, y = sum(number)-cumsum(number)+number/2, label = mylabel), size = 3)
  }else if (i==8) 
  {
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = str_wrap(factor, 20))) +   scale_fill_brewer(palette="Pastel1") +
      geom_bar(stat="identity", width=1, color="white") +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void()  +
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5), 
            legend.key.size = unit(.2, 'cm')) + 
      geom_text(mapping = aes(x = c(1.9,1.7,1.6,1.8,1.8,1.8), y = sum(number)-cumsum(number)+number/2, label = mylabel), size = 3)
  }else if(i == 5 | i==10 | i ==14) # contains legend, yes or no
  {
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = factor)) + 
      geom_bar(stat="identity", width=2.3, color="white") +
      scale_fill_brewer(palette="Pastel1", direction = -1) +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void() + 
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 1)) + 
      geom_text(mapping = aes(x = 3, label = mylabel), position = position_stack(vjust = .5), size = 3.5)
  }else if(i == 6) # contains legend, female or male
  {
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = factor)) + 
      geom_bar(stat="identity", width=2.5, color="white") +
      scale_fill_brewer(palette="Pastel1") +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void() + 
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 1)) + 
      geom_text(mapping = aes(x = 2.8,  y = sum(number)*0.85-cumsum(number)+number/2, label = mylabel), size = 3)
  }else
  {
    p = ggplot(data = df.plot, mapping = aes(x = "", y = number, fill = factor)) + 
      geom_bar(stat="identity", width=0.5, color="white") +
      scale_fill_brewer(palette="Pastel1", direction = -1) +
      coord_polar("y", start=0) +
      labs(title = name) + 
      theme_void()  + theme(legend.position="none") +
      theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5)) + 
      geom_text(mapping = aes(x = 1.5, y = sum(number)-cumsum(number)+number/2 , label = mylabel), size = 3.5)
  }
  
  
  
  
  myplots[[i]] = p
}


p1 = ggarrange(myplots[[1]], myplots[[4]],myplots[[5]], ncol = 3)
p2 = ggarrange(myplots[[2]], myplots[[3]], myplots[[10]], ncol = 3)
p3 = ggarrange( myplots[[12]], myplots[[13]], myplots[[14]], ncol = 3)
p4 = ggarrange(myplots[[6]], ncol = 3)

aligned <- align_plots(p1,p3,p2,p4, align = "v")

ggdraw(aligned[[1]])
ggdraw(aligned[[2]])
ggdraw(aligned[[3]])
ggdraw(aligned[[4]])



# b) Data with more than two levels
p6 = ggarrange(myplots[[7]], myplots[[8]], ncol = 2)
p7 = ggarrange(myplots[[9]], myplots[[11]], ncol = 2)
aligned <- align_plots(p6,p7,align = "v")
ggdraw(aligned[[1]])
ggdraw(aligned[[2]])


# Numerical Data 
# a) Summary Statistics
num.table = data.frame(a = rep(0,4), 
                       b = rep(0,4), 
                       c = rep(0,4), 
                       d = rep(0,4), 
                       e = rep(0,4), 
                       f = rep(0,4))
colnames(num.table) = names(summary(data[,num.list[1]]))
rownames(num.table) = num.list
for (i in 1:length(num.list)) {
  num.table[i,] = summary(data[,num.list[i]])
}
knitr::kable(num.table)


# b) Graphics

p1 = ggplot(data = data, mapping = aes(BMI)) + 
  geom_histogram(aes(y=..density..), color = "black", fill = "white", binwidth = 1, lwd = 0.35) + 
  geom_density(alpha=.2, fill="#7B90D2", lwd = 0.35) + theme_minimal()


p2 = ggplot(data = data, mapping = aes(PhysicalHealth)) +
  geom_histogram(aes(y=..density..), color = "black", fill = "#7B90D2",alpha=.2, binwidth = 1, lwd = 0.35) +  theme_minimal() 


p3 = ggplot(data = data, mapping = aes(MentalHealth)) + 
  geom_histogram(aes(y=..density..), color = "black", fill = "#7B90D2",alpha=.2, binwidth = 1, lwd = 0.35) +  theme_minimal() 


p4 = ggplot(data = data, mapping = aes(SleepTime)) + 
  geom_histogram(aes(y=..density..), color = "black", fill = "#7B90D2",alpha=.2, binwidth = 1, lwd = 0.35) +  theme_minimal() 


p1

grid.arrange(p2, p3, ncol=2)

p4


## Exploratory Analysis 

myplots <- vector("list",length(chr.list))
for (i in 1:length(chr.list)) {
  x.list = 1:length(levels(factor(data[,chr.list[i]])))
  y.list = c()
  prob.list = c()
  
  for (k in levels(factor(data[,chr.list[i]]))) {
    df.prob = data[,"HeartDisease"][data[,chr.list[i]] == k]
    prob = sum(df.prob == "Yes")/length(df.prob)
    y.list = append(y.list, length(df.prob))
    prob.list = append(prob.list, prob)
  }
  label.list = paste0(round(prob.list*100, 2), "%")
  levels = as.factor(data$HeartDisease)
  
  p = ggplot(data = data) + 
    geom_bar(aes_string(x = chr.list[i], fill = forcats::fct_rev(levels))) + 
    annotate("text", x=x.list, y=y.list + max(y.list)*0.025, label= label.list , size = 3)+
    guides(fill=guide_legend(title="Heart Disease")) + scale_fill_manual(values = colors_palette) + theme_minimal() + ylab("Count")
  myplots[[i]] = p
  
}


p2 = myplots[[2]]; p3 = myplots[[3]]
pie1 = grid.arrange(p2, p3, ncol=2)


p4 = myplots[[4]]; p5 = myplots[[5]]
pie2 = grid.arrange(p4, p5, ncol=2)


p6 = myplots[[6]]; p10 = myplots[[10]]
pie3 = grid.arrange(p6, p10, ncol=2)


p12 = myplots[[12]];p13 = myplots[[13]]
pie4 = grid.arrange(p12, p13, ncol=2)


p14 = myplots[[14]]
pie5 = grid.arrange(p14, ncol=2)


p7 = myplots[[7]]; p8 = myplots[[8]]; p9 = myplots[[9]]; p11 = myplots[[11]]
p7;p8;p9;p11


p1 = ggplot(data = data, aes(BMI, y=..count.., color = HeartDisease)) + 
  geom_histogram(alpha = .2, fill = "#DBE2E9", binwidth = 1) + 
  geom_density(alpha=.2, fill="#DBE2E9", lwd = 0.35)+ scale_color_brewer(palette="Pastel1",direction = -1) + theme_minimal()

p2 = ggplot(data = data, aes(PhysicalHealth, color = HeartDisease)) + 
  geom_histogram(binwidth = 1, alpha = .2, fill = "#DBE2E9") +
  scale_color_brewer(palette="Pastel1",direction = -1) + theme_minimal()


p3 = ggplot(data = data, aes(MentalHealth, color = HeartDisease)) + 
  geom_histogram(binwidth = 1, alpha = .2, fill = "#DBE2E9") + 
  scale_color_brewer(palette="Pastel1",direction = -1) + theme_minimal()

p4 = ggplot(data = data, aes(SleepTime, color = HeartDisease)) + 
  geom_histogram(binwidth = 1, alpha = .2, fill = "#DBE2E9") + 
  scale_color_brewer(palette="Pastel1",direction = -1) + theme_minimal() 


p1

grid.arrange(p2, p3, ncol=2)

p4

--------------------------------------------------------

# Data cleaning and pre-processing


library(doParallel)
library(caret)
library(tidyverse)
library(MASS)
library(glmnet)
library(gamlr)
library(dplyr)
library(readr)
library(corrplot)

set.seed(41201)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

hd = read_csv("heart_2020_cleaned.csv", col_types = cols(.default = "f", BMI = "d", SleepTime = "d", PhysicalHealth = "d", 
                                                         MentalHealth = "d", AgeCategory = "c"))
library(readr)
hd = read_csv("heart_2020_cleaned.csv", col_types = cols(.default = "f", BMI = "d", SleepTime = "d", PhysicalHealth = "d", 
                                                         MentalHealth = "d", AgeCategory = "c"))

library(doParallel)
library(caret)
library(tidyverse)
library(MASS)
library(glmnet)
library(gamlr)
library(dplyr)
library(corrplot)

# Factor
hd$GenHealth = fct_relevel(factor(hd$GenHealth), 
                           c("Poor", "Fair", "Good", "Very good", "Excellent")) 
hd$Diabetic = fct_relevel(factor(hd$Diabetic), 
                          c("No", "No, borderline diabetes", "Yes (during pregnancy)", "Yes") )
hd$AgeCategory = as.factor(hd$AgeCategory)

# Binary
hd$HeartDisease =ifelse(hd$HeartDisease =="Yes",1,0)
hd$Smoking=ifelse(hd$Smoking=="Yes",1,0)
hd$AlcoholDrinking=ifelse(hd$AlcoholDrinking=="Yes",1,0)
hd$Stroke=ifelse(hd$Stroke=="Yes",1,0)
hd$DiffWalking=ifelse(hd$DiffWalking=="Yes",1,0)
hd$PhysicalActivity=ifelse(hd$PhysicalActivity=="Yes",1,0)
hd$Asthma=ifelse(hd$Asthma=="Yes",1,0)
hd$KidneyDisease=ifelse(hd$KidneyDisease=="Yes",1,0)
hd$SkinCancer=ifelse(hd$SkinCancer=="Yes",1,0)

#hd_num = mutate_if(hd, is.factor, as.numeric)
palette = brewer.pal(50, "RdYlBu")
cols = colorRampPalette(palette)


library(corrplot)
hd_num = mutate_if(hd, is.factor, as.numeric)
corrplot(cor(hd_num[,-11]), type = 'lower', tl.col = "black",  tl.cex = 0.7)

set.seed(41201)
hd1=hd[(hd$HeartDisease==1),]
hd0=hd[(hd$HeartDisease==0),]
idx0 = sample(1:nrow(hd0), sum(hd$HeartDisease==1))
hd0=hd0[idx0,]
hd_undersample=rbind(hd1,hd0)

idx1 = sample(1:nrow(hd_undersample), nrow(hd_undersample)*0.8)
train = hd_undersample[idx1, ]
test = hd_undersample[-idx1, ]

--------------------------------------------------------

# logistic regression

library(dplyr)
logistic_mod = glm(HeartDisease ~., data = train, family = binomial)
summary(logistic_mod)

# predict response using test data
yhat_logistic = predict(logistic_mod, newdata = test, type = "response")
table(test$HeartDisease, (yhat_logistic>=0.5)) # classification table
accuracy_rate_lg = round((4138+4248)/10950,3) # accuracy rate
fp_lg = round(1353/(1353+4248),3) # false positive
fn_lg = round(1211/(1211+4138),3) # false negative

source("deviance.R")
D_lg = deviance(test$HeartDisease, yhat_logistic, family="binomial")
D0 = deviance(test$HeartDisease, mean(test$HeartDisease==1), family="binomial")
OOS_R_sq_lg = round(1-D_lg/D0,3) # oos r2 value

paste("Accuracy rate: ", accuracy_rate_lg)     
paste("False positive rate: ", fp_lg)
paste("False negative rate: ", fn_lg)


----------------------------------------------------------------

# PCA

pca_mod = prcomp(hd_num[,-1], scale = TRUE)
cumsum(pca_mod$sdev/sum(pca_mod$sdev))[1:10]

#pca_mod$rotation[,1:8]
plot(pca_mod$rotation[,1:2], type = "n")
text(pca_mod$rotation[,1:2],labels=(colnames(hd_num)[-1]))

----------------------------------------------------------------
# LASSO

library(gamlr)
# convert the design matrix into the appropriate form, drop the intercept column
X_train = model.matrix(~ ., train[,-1])[,-1]
#X_train[1,]

# response must be a vector instead of a data frame
Y_train = train$HeartDisease

set.seed(41201)
cvfit_lasso = cv.gamlr(X_train, Y_train, family = "binomial")
#cvfit_lasso = cv.glmnet(X_train, Y_train, alpha=1, family = "binomial")


plot(cvfit_lasso)
paste( "1se lambda: ", cvfit_lasso$lambda.1se) 
paste("Number of non-zero covariates: ", sum(coef(cvfit_lasso, s = "1se") !=0))

coef(cvfit_lasso, s = "1se")


# predict y using test data
yhat_lasso = predict(cvfit_lasso, model.matrix(~ ., test[,-1])[,-1], type = "response")
table(test$HeartDisease, (yhat_lasso>=0.5)) # classification table

accuracy_rate_lasso =  round((4162+4189)/10950,3) #accuracy rate
fp_lasso = round(1329/(1329+4189),3) #false positive
fn_lasso = round(1270/(1270+4162),3) #false negative

D_lasso = deviance(test$HeartDisease, yhat_lasso, family="binomial")
D0 = deviance(test$HeartDisease, mean(test$HeartDisease==1), family="binomial")
OOS_R_sq_lasso = round(1-D_lasso/D0,3) # oos r2 value


paste("Accuracy rate: ", accuracy_rate_lasso)     
paste("False positive rate: ", fp_lasso)
paste("False negative rate: ", fn_lasso)

----------------------------------------------------------------

# Casual Lasso 

# remove stroke from the design matrix
#X_train[1,]
causal_x = X_train[,-4]
causal_d = X_train[,4]

treat = gamlr(causal_x, causal_d, family = "binomial")
causal_dhat_cl1 = predict(treat, causal_x, type = "response")

#coef(treat)
D_cl1 = deviance(causal_d, causal_dhat_cl1, family="binomial")
D0_cl1 = deviance(causal_d, mean(causal_d), family="binomial")
In_R_sq_cl1 = round(1-D_cl1/D0_cl1 ,3)
paste("In-sample R^2: ", In_R_sq_cl1)


causal = gamlr(cbind(causal_x, causal_dhat_cl1, causal_d), Y_train, free = 2, family = "binomial")

coef(causal)

paste("The causal multiplied effect of stroke:", round(exp(1.077),2))

# Try asthma:

causal_x = X_train[,-35]
causal_d = X_train[,35]

treat = gamlr(causal_x, causal_d, family = "binomial")

causal_dhat_cl2 = predict(treat, causal_x, type = "response")
D_cl2 = deviance(causal_d, causal_dhat_cl2, family="binomial")
D0_cl2 = deviance(causal_d, mean(causal_d), family="binomial")
In_R_sq_cl2 = round(1-D_cl2/D0_cl2 ,3)
paste("In-sample R^2: ", In_R_sq_cl2)

causal = gamlr(cbind(causal_x, causal_dhat_cl2, causal_d), Y_train, free = 2, family = "binomial")
coef_d = coef(causal)["causal_d",]
paste("The causal multiplied effect of asthma: ", round(exp(coef_d),2))

# Try kidney disease: 

causal_x = X_train[,-36]
causal_d = X_train[,36]

treat = gamlr(causal_x, causal_d, family = "binomial")

causal_dhat_cl3 = predict(treat, causal_x, type = "response")
D_cl3 = deviance(causal_d, causal_dhat_cl3, family="binomial")
D0_cl3 = deviance(causal_d, mean(causal_d), family="binomial")
In_R_sq_cl3 = round(1-D_cl3/D0_cl3 ,3)
paste("In-sample R^2: ", In_R_sq_cl3)

causal = gamlr(cbind(causal_x, causal_dhat_cl3, causal_d), Y_train, free = 2, family = "binomial")

coef_d =  coef(causal)["causal_d",]
paste("The causal multiplied effect of kidney disease: ", round(exp(coef_d),2))


# Try skin cancer:

causal_x = X_train[,-37]
causal_d = X_train[,37]

treat = gamlr(causal_x, causal_d, family = "binomial")

causal_dhat_cl4 = predict(treat, causal_x, type = "response")
D_cl4 = deviance(causal_d, causal_dhat_cl4, family="binomial")
D0_cl4 = deviance(causal_d, mean(causal_d), family="binomial")
In_R_sq_cl4 = round(1-D_cl4/D0_cl4 ,3)
causal = gamlr(cbind(causal_x, causal_dhat_cl4, causal_d), Y_train, free = 2, family = "binomial")

paste("In-sample R^2: ", In_R_sq_cl4)
coef_d = coef(causal)["causal_d",]

paste("The causal multiplied effect of skin cancer: ", round(exp(coef_d),2))

----------------------------------------------------------------
  
## K-Nearest-Neighbour Model (Knn)
load("models.RData")

set.seed(41201)
train$HeartDisease =factor(train$HeartDisease)
test$HeartDisease =factor(test$HeartDisease)
model_knn <- caret::train(HeartDisease ~ .,
                          data = train,
                          method = "kknn",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                   number = 10, 
                                                   repeats = 3, 
                                                   savePredictions = TRUE, 
                                                   
                                                   verboseIter = FALSE))

plot(model_knn)

print(
  paste(
    "The final value used for the model was kmax =",
    model_knn[["finalModel"]][["tuneValue"]][["kmax"]],
    ", distance =",
    model_knn[["finalModel"]][["tuneValue"]][["distance"]],
    "and kernel =",
    model_knn[["finalModel"]][["tuneValue"]][["kernel"]]
  )
)

#predict
#model_knn_predict=predict(model_knn, test)
model_knn_cm=confusionMatrix(model_knn_predict, test$HeartDisease)

model_knn_cm

#accuracy
accuracy_rate_knn = round(model_knn_cm$overall[["Accuracy"]], 3)
fn_knn = round(1-model_knn_cm$byClass[['Sensitivity']],3)
fp_knn = round(1-model_knn_cm$byClass[['Specificity']],3)

paste("Accuracy rate: ", accuracy_rate_knn)     
paste("False positive rate: ", fp_knn)
paste("False negative rate: ", fn_knn)

----------------------------------------------------------------

## Support Vector Machine (SVM)  

set.seed(41201)
model_svm <- caret::train(HeartDisease ~ .,
                          data = train,
                          method = "svmLinear3",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                   number = 10, 
                                                   repeats = 3, 
                                                   savePredictions = TRUE, 
                                                   verboseIter = FALSE))


plot(model_svm)


print(
  paste(
    "The final value used for the model was cost =",
    model_svm[["finalModel"]][["tuneValue"]][["cost"]],
    "and Loss =",
    model_svm[["finalModel"]][["tuneValue"]][["Loss"]]
  )
)



#predict
#model_svm_predict=predict(model_svm, test)

#cm, oos
model_svm_cm=confusionMatrix(model_svm_predict, test$HeartDisease)
model_svm_cm



#accuracy
accuracy_rate_svm = model_svm_cm$overall[["Accuracy"]]
#false negative 
fn_svm = 1-model_svm_cm$byClass[['Sensitivity']]
#false positive
fp_svm = 1-model_svm_cm$byClass[['Specificity']]

paste("Accuracy rate: ", round(accuracy_rate_svm,3))     
paste("False positive rate: ", round(fp_svm,3))
paste("False negative rate: ", round(fn_svm,3))

----------------------------------------------------------------
# Single tree  
set.seed(41201)
tree_mod=tree(HeartDisease ~., data = train)

plot(tree_mod)
text(tree_mod,cex=0.5,digits=2)

# pruning trees
set.seed(41201)
cv_tree_mod=cv.tree(tree_mod,FUN=prune.misclass, K = 10)

plot(cv_tree_mod$size,cv_tree_mod$dev,type="b",cex=.5)


pruned_mod = prune.tree(tree_mod,best =3)
plot(pruned_mod)
text(pruned_mod)


#yhat_tree=predict(pruned_mod,newdata=test, type = "class")
tree_tab = table(test$HeartDisease, yhat_tree)
tree_tab

accuracy_rate_tree=(tree_tab[1,1]+tree_tab[2,2])/nrow(test) #accuracy rate
fn_tree=tree_tab[2,1]/(tree_tab[1,1]+tree_tab[2,1]) #false negative
fp_tree=tree_tab[1,2]/(tree_tab[1,2]+tree_tab[2,2]) #false positive

D_tree = deviance(test$HeartDisease, predict(pruned_mod,newdata=test, type = "vector"), family="binomial")
D0 = deviance(test$HeartDisease, mean(test$HeartDisease==1), family="binomial")
OOS_R_sq_tree=round(1-D_tree/D0,3) 

paste("Accuracy rate: ", round(accuracy_rate_tree,3))     
paste("False positive rate: ", round(fp_tree,3))
paste("False negative rate: ", round(fn_tree,3))

----------------------------------------------------------------
## Random Forest - Grid Search         
set.seed(41201)
model_rf <- caret::train(HeartDisease ~ .,
                         data = train,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 2, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "grid"),
                         tuneLength = 5)



plot(model_rf)
print(paste("The final value used for the model was mtry =", model_rf[["finalModel"]][["tuneValue"
]][["mtry"]]))


model_rf_importance <- varImp(model_rf, scale = TRUE)
plot(model_rf_importance)

#predict
#model_rf_predict=predict(model_rf, test)
#cm, oos
model_rf_cm=confusionMatrix(model_rf_predict, test$HeartDisease)
model_rf_cm


#accuracy
accuracy_rate_rf =  model_rf_cm$overall[["Accuracy"]]

#false negative
fn_rf = 1-model_rf_cm$byClass[['Sensitivity']]

#false positive
fp_rf = 1-model_rf_cm$byClass[['Specificity']]

paste("Accuracy rate: ", round(accuracy_rate_rf,3))  
paste("False positive rate: ", round(fp_rf,3))
paste("False negative rate: ", round(fn_rf,3))

----------------------------------------------------------------
## Random Forest - Recursive Feature Elimination method (RFE) 


set.seed(41201)

#Recursive Feature Elimination
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# run the RFE algorithm
rfe_results <- rfe(x = train[, -1], y = train$HeartDisease, sizes = c(1:9), rfeControl = control)

# chosen features
predictors(rfe_results)

set.seed(41201)
train_rfe <- train[, c(1, which(colnames(train) %in% predictors(rfe_results)))]

model_rf_rfe <- caret::train(HeartDisease ~ .,
                             data = train_rfe,
                             method = "rf",
                             preProcess = c("scale", "center"),
                             trControl = trainControl(method = "repeatedcv", 
                                                      number = 10, 
                                                      repeats = 3, 
                                                      savePredictions = TRUE, 
                                                      verboseIter = FALSE))


#predict
model_rf_rfe_predict=predict(model_rf_rfe, test)


# out of sample confusion matrix
model_rf_rfe_cm=confusionMatrix(model_rf_rfe_predict, test$HeartDisease)                    
model_rf_rfe_cm


#accuracy
accuracy_rate_rf_rfe = round(model_rf_rfe_cm$overall[["Accuracy"]],3)

#false negative 
fn_rf_rfe = round(1-model_rf_rfe_cm$byClass[['Sensitivity']],3)
#false positive
fp_rf_rfe = round(1-model_rf_rfe_cm$byClass[['Specificity']],3)


paste("Accuracy rate: ", accuracy_rate_rf_rfe)     
paste("False positive rate: ", fp_rf_rfe)
paste("False negative rate: ", fn_rf_rfe)


----------------------------------------------------------------

## Random Forest - Genetic Algorithm (GA)           
set.seed(41201)
ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel = TRUE, # Use parallel programming
                       allowParallel = TRUE)


lev <- c("0", "1")     # Set the levels
ga_results <- gafs(x = train[, -1], y = train$HeartDisease,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)

train_ga <- train[, c(1, which(colnames(train) %in% ga_results$ga$final))]


print(ga_results$ga$final)

#train rf on our new train data
model_rf_ga <- caret::train(HeartDisease ~ .,
                            data = train_ga,
                            method = "rf",
                            preProcess = c("scale", "center"),
                            trControl = trainControl(method = "repeatedcv", 
                                                     number = 10, 
                                                     repeats = 3, 
                                                     savePredictions = TRUE, 
                                                     verboseIter = FALSE))

#predict
model_rf_ga_predict=predict(model_rf_ga, test)


#cm, oos
model_rf_ga_cm=confusionMatrix(model_rf_ga_predict, test$HeartDisease)
model_rf_ga_cm

#accuracy
accuracy_rate_rf_ga =  round(model_rf_ga_cm$overall[["Accuracy"]],3)

#false negative 
fn_rf_ga = round(1-model_rf_ga_cm$byClass[['Sensitivity']],3)

#false positive
fp_rf_ga = round(1-model_rf_ga_cm$byClass[['Specificity']],3)

paste("Accuracy rate: ", accuracy_rate_rf_ga)     
paste("False positive rate: ", fp_rf_ga)
paste("False negative rate: ", fn_rf_ga)


----------------------------------------------------------------

## Extreme Gradient Boosting (XGBoost) 

set.seed(41201)
model_xgb <- caret::train(HeartDisease ~ .,
                          data = train,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                   number = 10, 
                                                   repeats = 3, 
                                                   savePredictions = TRUE, 
                                                   verboseIter = FALSE))


par(mfrow = c(1,2))
plot(model_xgb)


a1=paste(
  "In this model, we use nrounds =",
  model_xgb[["finalModel"]][["tuneValue"]][["nrounds"]],
  ", max_depth =",
  model_xgb[["finalModel"]][["tuneValue"]][["max_depth"]],
  ", eta =",
  model_xgb[["finalModel"]][["tuneValue"]][["eta"]],
  ", gamma =",
  model_xgb[["finalModel"]][["tuneValue"]][["gamma"]])
a2=paste("colsample_bytree =",
         model_xgb[["finalModel"]][["tuneValue"]][["colsample_bytree"]],
         ", Minimum Sum of Instance Weight =",
         model_xgb[["finalModel"]][["tuneValue"]][["min_child_weight"]],
         " and subsample =", model_xgb[["finalModel"]][["tuneValue"]][["subsample"]]
)
print(
  c(a1,a2)
)

set.seed(41201)
#predict
#model_xgb_predict=predict(model_xgb, test)
(model_xgb_cm=confusionMatrix(model_xgb_predict, test$HeartDisease))



#accuracy
accuracy_rate_xgb = round(model_xgb_cm$overall[["Accuracy"]],3)

#false negative 
fn_xgb =  round(1-model_xgb_cm$byClass[['Sensitivity']],3)

#false positive
fp_xgb = round(1-model_xgb_cm$byClass[['Specificity']],3)

paste("Accuracy rate: ", accuracy_rate_xgb)     
paste("False positive rate: ", fp_xgb)
paste("False negative rate: ", fn_xgb)


----------------------------------------------------------------

## Neural Network  

set.seed(41201)
model_nnet <- caret::train(HeartDisease ~ .,
                           data = train,
                           method = "nnet",
                           preProcess = c("scale", "center"), 
                           trControl = trainControl(method = "repeatedcv", 
                                                    number = 10, 
                                                    repeats = 3, 
                                                    savePredictions = TRUE, 
                                                    verboseIter = FALSE,
                                                    search = "grid"), 
                           tuneLength = 5)


plot(model_nnet)

print(
  paste(
    "The final value used for the model was size =",
    model_nnet[["finalModel"]][["tuneValue"]][["size"]], "and decay =",model_nnet[["finalModel"]][["tuneValue"]][["decay"]]
  )
)

#predict
#model_nnet_predict=predict(model_nnet, test)

#cm, oos
model_nnet_cm=confusionMatrix(model_nnet_predict, test$HeartDisease)
model_nnet_cm


#accuracy
accuracy_rate_nnet = model_nnet_cm$overall[["Accuracy"]]
#false negative 
fn_nnet = 1-model_nnet_cm$byClass[['Sensitivity']]
#false positive
fp_nnet = 1-model_nnet_cm$byClass[['Specificity']]

paste("Accuracy rate: ", round(accuracy_rate_nnet,3))     
paste("False positive rate: ", round(fp_nnet,3))
paste("False negative rate: ", round(fn_nnet,3))

Statistical Model Inference and Prediction on Heart-disease Big Data

Jiani Song, Zhuoying Huang, Zhenghao Jiang, Xuan Lin, Yin Chen

Executive Summary

Introduction

Background

Dataset

Goal/Question of Interest

EDA

Variable Description:

Visualization

Column Statistics of Categorical Data

Column Statistics of Numerical Data

Exploratory Analysis

Categorical predictors

Numerical predictors

Data Cleaning and Preprocessing

Models

Logistic Regression

PCA

Lasso

Causal Lasso

K-Nearest-Neighbour Model (Knn)

Support Vector Machine (SVM)

Single Tree

Random Forest - Grid Search

Random Forest - Recursive Feature Elimination method (RFE)

Random Forest - Genetic Algorithm (GA)

Extreme Gradient Boosting (XGBoost)

Neural Network

Model Comparison

Appendix - code