Heart disease refers to several types of heart conditions (1) and is the leading cause of death for men, women, and people of most racial and ethnic groups according to the CDC (2). One person dies every 33 seconds because of heart disease, and over 700,000 people died of heart disease in the United States in 2022, equivalent to about 1 in 5 deaths (2). Heart disease can lead to serious health events and conditions such as heart attacks, arrhythmia, and heart failure, and may be “silent” before one of these occurs (1). As such, the ability to predict heart disease is a major concern in the United States.
Several risk factors for heart disease are known, including high blood pressure, smoking, high cholesterol, diabetes, obesity, genetic factors, alcohol consumption, and physical inactivity (3).
The working dataset (4) compiles several known risk factors for heart disease from responses to the BRFSS, a survey administered by the CDC (5), and contains 10,000 observations. Half of the observations correspond to individuals who have been diagnosed with heart disease, and half to individuals who have not. This dataset will be used with a variety of statistical and machine learning techniques to create models for the prediction of heart disease.
The working dataset has 18 variables, listed below:
HeartDisease: Whether or not the individual has heart disease (binary response variable)
BMI: The BMI of the individual
Smoking: Whether or not the individual smokes (No, Yes)
AlcoholDrinking: Whether or not the individual drinks alcohol (No, Yes)
Stroke: Whether or not the individual has had a stroke (No, Yes)
PhysicalHealth: Integer between 0 and 30 (inclusive) that represents how many days in a month the individual has “not good” physical health status
MentalHealth: Integer between 0 and 30 (inclusive) that represents how many days in a month the individual has “not good” mental health status
DiffWalking: Whether or not the individual has difficulty walking (No, Yes)
Sex: Whether the individual is female or male
AgeCategory: The age of the individual, in 13 possible categories (18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+)
Race: The race of the individual, of six possible categories (American Indian or Alaskan Native, Asian, Black, Hispanic, White, Other)
Diabetic: The diabetes status of the individual, of four possible categories (No, No (borderline diabetes), Yes (during pregnancy), Yes)
PhysicalActivity: Whether or not the individual considers themselves physically active (No, Yes)
GenHealth: The general health of the individual, of five possible categories (Excellent, Very Good, Good, Fair, Poor)
SleepTime: The average number of hours the individual sleeps at night
Asthma: Whether or not the individual has asthma (No, Yes)
KidneyDisease: Whether or not the individual has kidney disease (No, Yes)
SkinCancer: Whether or not the individual has skin cancer (No, Yes)
The response variable for all techniques will be HeartDisease, which is a binary variable that records whether or not the individual has heart disease. Several candidate models will be created using various statistical and machine learning techniques with the goal of predicting whether or not the individual has heart disease based on the other predictors in the dataset. The models will then be considered against each other with an analysis of their ROC curves in order to determine which model is best for heart disease prediction.
We will begin by examining the explanatory and response variables in the dataset, checking for possible issues with the assumptions of a logistic model as well as sparse categories and missing information. We will examine the underlying structure of the data and identify relationships between different variables, checking for possible issues with multicollinearity. We will look for outliers or unusual observations that may affect the final model or be indicative of data entry errors. If possible, missing values will be imputed. Violations of assumptions will be handled through appropriate transformations. Sparse categories or nonnormal numeric explanatory variables may be handled through discretization and/or regrouping. Certain variables may be dropped or otherwise aggregated to account for possible issues with multicollinearity. Single-variable and pairwise distributions will be examined to check for the above.
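A rough sketch of these checks is given below; the data frame name f_heart follows its later use in this report, and the numeric variable names come from the list above.
# Sketch of the single-variable and pairwise checks described above;
# f_heart is assumed to be the working data frame
num_vars <- c("BMI", "PhysicalHealth", "MentalHealth", "SleepTime")
summary(f_heart[num_vars])   # single-variable distributions
cor(f_heart[num_vars])       # pairwise correlations (multicollinearity check)
pairs(f_heart[num_vars])     # pairwise scatter plots
colSums(is.na(f_heart))      # check for missing values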
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome measured by a binary response variable; it is therefore appropriate for situations where linear regression cannot be used because the response is dichotomous and categorical. The assumptions for a logistic regression include the following: the dependent variable should be binary, the independent variables should not be correlated, the log odds (the logit of the probability) should have a linear relationship with the independent variables, the sample size must be sufficiently large, and observations must be independent. Unlike multiple linear regression, logistic regression does not require a normally distributed response variable with constant variance, since the response is categorical. Outliers can have a significant effect on the results of both linear and logistic regression models, and should therefore be examined before and after the model’s creation.
An alpha of 0.05 and a 95% confidence level will be used for all statistical tests and confidence interval construction, respectively.
In the model building process, stepwise regression will be used to identify significant predictors and build the model through an automatic procedure. In each step, predictor variables are considered for addition or subtraction from the model based on some prespecified criterion. Bidirectional stepwise regression, or stepwise selection, is a combination of both forward selection (adding the most statistically significant predictor to the model) and backward elimination (removing the least significant predictor from the model) and will be used, with additional manual adjustments outside of the automatic process, to create the most fitting linear and logistic model while ensuring that the final model makes statistical sense.
A single-layer neural network, known as the perceptron, will be fit with the R package neuralnet to create a neural network model for classification. Backpropagation, the algorithm used to train the network, can improve the model’s predictive power and may allow it to outperform logistic regression in prediction.
Decision tree algorithms are another method of classification, generating rules expressed as conditional statements on the predictors, which may allow for greater interpretability. Through an iterative process of splitting, they can produce classifiers with a great deal of flexibility, including the ability to scale penalties for false positive and false negative classifications, and they can be combined through ensemble methods for more powerful predictions.
BAGGING, or bootstrap aggregation, can be used in combination with other algorithms such as neural networks or decision trees to create more powerful models by combining weaker ones, mitigating issues of under- and over-fitting the data. The resulting models may have more stable performance. Bagging fits models independently on bootstrap samples and aggregates them to decrease the overall variance. We will use it in combination with a decision tree algorithm to create another candidate model.
All models will be compared through ROC analysis. Different cutoff probabilities will be used to create different confusion matrices and generate different measures of sensitivity and specificity for each model. Sensitivity is the true positive rate, given by TP/(TP+FN), and specificity is the true negative rate, given by TN/(TN+FP). The plot of sensitivity against 1-specificity gives the receiver operating characteristic (ROC) curve. The area under the curve (AUC) will be computed for all candidate models to assess their overall performance.
We continue with a look at the distributions of the continuous explanatory variables as well as their pairwise relationships with other explanatory variables below.
# A tibble: 30 × 2
MentalHealth `n()`
<dbl> <int>
1 0 6578
2 1 267
3 2 464
4 3 254
5 4 153
6 5 387
7 6 56
8 7 162
9 8 25
10 9 6
# ℹ 20 more rows
# A tibble: 31 × 2
PhysicalHealth `n()`
<dbl> <int>
1 0 6265
2 1 280
3 2 435
4 3 279
5 4 136
6 5 252
7 6 52
8 7 164
9 8 35
10 9 8
# ℹ 21 more rows
# A tibble: 21 × 2
SleepTime `n()`
<dbl> <int>
1 1 20
2 2 37
3 3 92
4 4 309
5 5 642
6 6 2040
7 7 2728
8 8 3080
9 9 553
10 10 317
# ℹ 11 more rows
Despite significant p-values noted in the pairwise correlation plots, the scatter plots themselves do not show any obvious trends and the correlation values are all very low. As p-values are driven by the large sample size, we proceed with caution but without assuming any glaring issues of multicollinearity among the numerical predictors. However, a closer look at the histograms of MentalHealth, PhysicalHealth, and SleepTime shows that there are categories with very few observations. As such, we decide to discretize the variables in the following ways (a sketch of this recoding follows the list):
MentalHealth: 0-10, 11-20, and 21-30
PhysicalHealth: 0-10, 11-20, and 21-30
SleepTime: ≤4, 5, 6, 7, 8, 9, 10, 11+
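A minimal sketch of this recoding with cut() is shown below; the label strings follow the adjusted dataset summary that appears afterward.
# Discretize the three variables using the groupings listed above;
# labels match the adjusted dataset summary that follows
f_heart$MentalHealth   <- cut(f_heart$MentalHealth, breaks = c(-1, 10, 20, 30),
                              labels = c("0to10", "11to20", "21to30"))
f_heart$PhysicalHealth <- cut(f_heart$PhysicalHealth, breaks = c(-1, 10, 20, 30),
                              labels = c("0to10", "11to20", "21to30"))
f_heart$SleepTime      <- cut(f_heart$SleepTime,
                              breaks = c(0, 4, 5, 6, 7, 8, 9, 10, Inf),
                              labels = c("4orLess", "5", "6", "7", "8", "9", "10", "11orMore"))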
We follow with a look at the adjusted dataset.
HeartDisease BMI Smoking AlcoholDrinking Stroke
0:5000 Min. :12.21 No :5186 No :9453 No :9028
1:5000 1st Qu.:24.43 Yes:4814 Yes: 547 Yes: 972
Median :27.88
Mean :28.92
3rd Qu.:32.11
Max. :83.33
PhysicalHealth MentalHealth DiffWalking Sex AgeCategory
0to10 :8173 0to10 :8630 No :7591 Female:4735 70to74 :1307
11to20: 590 11to20: 593 Yes:2409 Male :5265 80orMore:1297
21to30:1237 21to30: 777 65to69 :1287
60to64 :1098
75to79 :1069
55to59 : 884
(Other) :3058
Race Diabetic
AmericanIndianOrAlaskanNative: 168 No :7484
Asian : 168 Borderline : 248
Black : 707 YesDuringPregnancy: 62
Hispanic : 714 Yes :2206
Other : 340
White :7903
PhysicalActivity GenHealth SleepTime Asthma KidneyDisease
No :2884 Poor : 835 8 :3080 No :8437 No :9251
Yes:7116 Fair :1796 7 :2728 Yes:1563 Yes: 749
Good :3203 6 :2040
VeryGood :2764 5 : 642
Excellent:1402 9 : 553
4orLess: 458
(Other): 499
SkinCancer
No :8637
Yes:1363
There are no more obvious sparse categories in the dataset. The dataset also does not include missing values. We continue by looking at pairwise relationships between the continuous and categorical predictors and the binary response variable.
The only remaining continuous variable is BMI. We will look at its relationship with HeartDisease through boxplots.
ggplot(f_heart) +
aes(x = BMI, y = HeartDisease, color = HeartDisease) +
geom_boxplot() +
theme(plot.title = element_text(hjust = 0.5),
legend.position="top") +
xlab("BMI") +
ylab("HeartDisease")
BMI is heavily skewed right, as seen by the number of outliers in both boxplots; we also observed this in the pairwise correlation plot created earlier. There appears to be a slight shift to the right for those with heart disease, indicating that higher values of BMI may be associated with higher instances of heart disease.
We will investigate the categorical predictors against the binary response through mosaic plots.
par(mfrow = c(2,2))
mosaicplot(Smoking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Smoking")
mosaicplot(AlcoholDrinking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Alcohol Consumption")
mosaicplot(Stroke ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Stroke")
mosaicplot(PhysicalHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Physical Health")
mosaicplot(MentalHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Mental Health")
mosaicplot(DiffWalking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Difficulty Walking")
mosaicplot(Sex ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Sex")
mosaicplot(AgeCategory ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Age Category")
mosaicplot(Race ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Race")
mosaicplot(Diabetic ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Diabetes")
mosaicplot(PhysicalActivity ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Physical Activity")
mosaicplot(GenHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="General Health")
mosaicplot(SleepTime ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Hours of Sleep")
mosaicplot(Asthma ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Asthma")
mosaicplot(KidneyDisease ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Kidney Disease")
mosaicplot(SkinCancer ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Skin Cancer")
The mosaic plots show clear relationships across the levels of many of the categorical predictors. AgeCategory shows a higher proportion of reported heart disease in older age groups. MentalHealth and PhysicalHealth both have positive relationships with heart disease, indicating that people who reported more days of poor mental and physical health also had higher instances of heart disease. GenHealth has the expected negative relationship with heart disease, with lower instances of heart disease reported among those with better self-reported general health. An almost U-shaped relationship is observed across the levels of SleepTime, where the lowest instances of heart disease were reported by those who slept about 7-8 hours a day. All of the variables have observable relationships with heart disease in the mosaic plots.
We will create and save a copy of the analytic dataset, available on github at https://github.com/xiang-a/sta551/blob/main/analytic_heart_disease_prediction.csv.
We begin with the full model, created based on all the predictors in the dataset.
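A minimal sketch of this fit, assuming the analytic data frame is named f_heart as used earlier, is:
# Sketch of the full logistic model fit on the analytic dataset; f_heart is
# assumed to hold the adjusted data with HeartDisease as a 0/1 factor
full_logit <- glm(HeartDisease ~ ., family = binomial, data = f_heart)
summary(full_logit)   # coefficient estimates, standard errors, z tests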
Term | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -2.2864198 | 0.3765370 | -6.0722307 | 0.0000000 |
BMI | 0.0113523 | 0.0041797 | 2.7160424 | 0.0066067 |
SmokingYes | 0.3107252 | 0.0513546 | 6.0505783 | 0.0000000 |
AlcoholDrinkingYes | -0.1159063 | 0.1107054 | -1.0469792 | 0.2951092 |
StrokeYes | 1.2218614 | 0.1063666 | 11.4872634 | 0.0000000 |
PhysicalHealth11to20 | 0.2473720 | 0.1151348 | 2.1485431 | 0.0316706 |
PhysicalHealth21to30 | -0.0556509 | 0.0993980 | -0.5598800 | 0.5755613 |
MentalHealth11to20 | 0.1649427 | 0.1128192 | 1.4620089 | 0.1437388 |
MentalHealth21to30 | 0.1839556 | 0.1024786 | 1.7950626 | 0.0726437 |
DiffWalkingYes | 0.2127672 | 0.0705836 | 3.0143978 | 0.0025749 |
SexMale | 0.6522694 | 0.0521361 | 12.5108938 | 0.0000000 |
AgeCategory25to29 | 0.2327302 | 0.3242666 | 0.7177125 | 0.4729346 |
AgeCategory30to34 | 0.7720262 | 0.2916128 | 2.6474358 | 0.0081105 |
AgeCategory35to39 | 0.8118005 | 0.2769518 | 2.9311973 | 0.0033766 |
AgeCategory40to44 | 1.1008981 | 0.2685478 | 4.0994490 | 0.0000414 |
AgeCategory45to49 | 1.3805680 | 0.2609638 | 5.2902660 | 0.0000001 |
AgeCategory50to54 | 1.8462775 | 0.2523216 | 7.3171593 | 0.0000000 |
AgeCategory55to59 | 2.1501536 | 0.2478033 | 8.6768561 | 0.0000000 |
AgeCategory60to64 | 2.4088013 | 0.2459236 | 9.7949156 | 0.0000000 |
AgeCategory65to69 | 2.6191750 | 0.2447222 | 10.7026470 | 0.0000000 |
AgeCategory70to74 | 2.9775692 | 0.2458826 | 12.1097203 | 0.0000000 |
AgeCategory75to79 | 3.1957537 | 0.2483269 | 12.8691423 | 0.0000000 |
AgeCategory80orMore | 3.5809135 | 0.2492594 | 14.3662141 | 0.0000000 |
RaceAsian | -0.0869705 | 0.2929339 | -0.2968945 | 0.7665470 |
RaceBlack | -0.1938231 | 0.2193996 | -0.8834249 | 0.3770067 |
RaceHispanic | 0.0343352 | 0.2212412 | 0.1551936 | 0.8766687 |
RaceOther | 0.0021654 | 0.2426190 | 0.0089250 | 0.9928790 |
RaceWhite | 0.0518221 | 0.2001993 | 0.2588525 | 0.7957490 |
DiabeticBorderline | 0.1901093 | 0.1535695 | 1.2379366 | 0.2157396 |
DiabeticYesDuringPregnancy | -0.1061314 | 0.3319201 | -0.3197497 | 0.7491580 |
DiabeticYes | 0.5579723 | 0.0652919 | 8.5458159 | 0.0000000 |
PhysicalActivityYes | -0.0260050 | 0.0598500 | -0.4345031 | 0.6639231 |
GenHealthFair | -0.2708652 | 0.1280054 | -2.1160448 | 0.0343410 |
GenHealthGood | -0.7980379 | 0.1304163 | -6.1191583 | 0.0000000 |
GenHealthVeryGood | -1.4683072 | 0.1367978 | -10.7334155 | 0.0000000 |
GenHealthExcellent | -1.8733588 | 0.1499485 | -12.4933437 | 0.0000000 |
SleepTime5 | -0.0751380 | 0.1573643 | -0.4774779 | 0.6330219 |
SleepTime6 | -0.3310479 | 0.1354076 | -2.4448250 | 0.0144922 |
SleepTime7 | -0.4470748 | 0.1350511 | -3.3104106 | 0.0009316 |
SleepTime8 | -0.3630545 | 0.1337120 | -2.7151975 | 0.0066236 |
SleepTime9 | -0.2464594 | 0.1649659 | -1.4940019 | 0.1351751 |
SleepTime10 | -0.1385529 | 0.1912473 | -0.7244699 | 0.4687773 |
SleepTime11orMore | -0.1266088 | 0.2323120 | -0.5449949 | 0.5857571 |
AsthmaYes | 0.3303882 | 0.0725389 | 4.5546371 | 0.0000052 |
KidneyDiseaseYes | 0.4392750 | 0.1067819 | 4.1137583 | 0.0000389 |
SkinCancerYes | 0.0793600 | 0.0744873 | 1.0654165 | 0.2866875 |
We choose the reduced model based on clinically important risk factors listed by the CDC (1). These include smoking, diabetes, overweight/obesity (which will be included through BMI), physical inactivity, and alcohol consumption.
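A minimal sketch of this reduced fit, using the risk factors named above, is:
# Sketch of the reduced model restricted to the CDC-listed risk factors
reduced_logit <- glm(HeartDisease ~ Smoking + Diabetic + BMI +
                       PhysicalActivity + AlcoholDrinking,
                     family = binomial, data = f_heart)
summary(reduced_logit)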
We will then use forward stepwise model selection to create a stepwise logistic regression model, using the step() function (stepAIC() from the R package MASS performs the same procedure).
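A minimal sketch of this selection step, assuming the reduced and full model objects sketched above define the lower and upper scope, is:
# Sketch of forward stepwise selection from the reduced model toward the
# full model; the scope specification is an assumption
step_logit <- step(reduced_logit,
                   scope = list(lower = formula(reduced_logit),
                                upper = formula(full_logit)),
                   direction = "forward", trace = 0)
summary(step_logit)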
Term | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -2.2838855 | 0.3190477 | -7.1584443 | 0.0000000 |
SmokingYes | 0.3164029 | 0.0511196 | 6.1894680 | 0.0000000 |
DiabeticBorderline | 0.1881642 | 0.1531270 | 1.2288115 | 0.2191425 |
DiabeticYesDuringPregnancy | -0.1013936 | 0.3306827 | -0.3066191 | 0.7591333 |
DiabeticYes | 0.5502487 | 0.0650853 | 8.4542647 | 0.0000000 |
BMI | 0.0110776 | 0.0041698 | 2.6566462 | 0.0078922 |
AlcoholDrinkingYes | -0.1064020 | 0.1105749 | -0.9622621 | 0.3359180 |
PhysicalActivityYes | -0.0246788 | 0.0597297 | -0.4131746 | 0.6794787 |
AgeCategory25to29 | 0.2295365 | 0.3241802 | 0.7080522 | 0.4789129 |
AgeCategory30to34 | 0.7778376 | 0.2915574 | 2.6678709 | 0.0076334 |
AgeCategory35to39 | 0.8116117 | 0.2768055 | 2.9320653 | 0.0033672 |
AgeCategory40to44 | 1.1060054 | 0.2683277 | 4.1218452 | 0.0000376 |
AgeCategory45to49 | 1.3835301 | 0.2607772 | 5.3054114 | 0.0000001 |
AgeCategory50to54 | 1.8524155 | 0.2520502 | 7.3493922 | 0.0000000 |
AgeCategory55to59 | 2.1663294 | 0.2474478 | 8.7546909 | 0.0000000 |
AgeCategory60to64 | 2.4236931 | 0.2454770 | 9.8734010 | 0.0000000 |
AgeCategory65to69 | 2.6429383 | 0.2439924 | 10.8320502 | 0.0000000 |
AgeCategory70to74 | 3.0110393 | 0.2448600 | 12.2969815 | 0.0000000 |
AgeCategory75to79 | 3.2350393 | 0.2471244 | 13.0907321 | 0.0000000 |
AgeCategory80orMore | 3.6246639 | 0.2477085 | 14.6327796 | 0.0000000 |
GenHealthFair | -0.2714171 | 0.1279868 | -2.1206636 | 0.0339501 |
GenHealthGood | -0.7969887 | 0.1303895 | -6.1123678 | 0.0000000 |
GenHealthVeryGood | -1.4564368 | 0.1365777 | -10.6637985 | 0.0000000 |
GenHealthExcellent | -1.8647902 | 0.1498180 | -12.4470359 | 0.0000000 |
StrokeYes | 1.2114785 | 0.1058748 | 11.4425568 | 0.0000000 |
SexMale | 0.6612408 | 0.0519906 | 12.7184620 | 0.0000000 |
AsthmaYes | 0.3284718 | 0.0724732 | 4.5323225 | 0.0000058 |
KidneyDiseaseYes | 0.4372834 | 0.1066397 | 4.1005664 | 0.0000412 |
SleepTime5 | -0.0725150 | 0.1573760 | -0.4607754 | 0.6449598 |
SleepTime6 | -0.3234478 | 0.1353787 | -2.3892081 | 0.0168847 |
SleepTime7 | -0.4329628 | 0.1349413 | -3.2085262 | 0.0013342 |
SleepTime8 | -0.3494227 | 0.1336298 | -2.6148568 | 0.0089265 |
SleepTime9 | -0.2291741 | 0.1647739 | -1.3908401 | 0.1642739 |
SleepTime10 | -0.1284684 | 0.1909686 | -0.6727199 | 0.5011255 |
SleepTime11orMore | -0.1505237 | 0.2313773 | -0.6505554 | 0.5153336 |
DiffWalkingYes | 0.2084680 | 0.0704829 | 2.9577116 | 0.0030993 |
PhysicalHealth11to20 | 0.2546074 | 0.1149893 | 2.2141835 | 0.0268162 |
PhysicalHealth21to30 | -0.0420861 | 0.0991250 | -0.4245756 | 0.6711461 |
MentalHealth11to20 | 0.1725802 | 0.1127830 | 1.5301976 | 0.1259678 |
MentalHealth21to30 | 0.1890748 | 0.1023611 | 1.8471346 | 0.0647276 |
We will assess the candidate logistic regression models through ROC analysis.
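A minimal sketch of this comparison, assuming the pROC package and the three model objects sketched above, is:
# Sketch of the ROC/AUC comparison for the three logistic candidates;
# pROC is one possible package choice (an assumption)
library(pROC)
roc_full <- roc(f_heart$HeartDisease, predict(full_logit,    type = "response"))
roc_red  <- roc(f_heart$HeartDisease, predict(reduced_logit, type = "response"))
roc_step <- roc(f_heart$HeartDisease, predict(step_logit,    type = "response"))
plot(roc_full)                    # ROC curve of the full model
lines(roc_red,  col = "red")      # overlay the reduced model
lines(roc_step, col = "blue")     # overlay the stepwise model
auc(roc_full); auc(roc_red); auc(roc_step)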
The reduced model performed the worst of the three, which is expected given that it includes the smallest number of predictors. The stepwise and full models performed very similarly, with AUCs of 0.8408 and 0.8413 respectively, so we choose the stepwise model on the principle of parsimony.
We continue by creating a candidate model based on the neural network technique described above. We will use min-max scaling for the continuous predictor BMI.
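Min-max scaling maps a continuous variable onto the interval [0, 1]; a minimal sketch for BMI, producing the BMIscale term that appears in the model formula below, is:
# Min-max scaling of BMI; the scaled variable BMIscale matches the term
# used in the model formula that follows
f_heart$BMIscale <- (f_heart$BMI - min(f_heart$BMI)) /
  (max(f_heart$BMI) - min(f_heart$BMI))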
We begin by creating the model formula with all necessary dummies.
HeartDisease1 ~ BMIscale + SmokingYes + AlcoholDrinkingYes +
StrokeYes + PhysicalHealth11to20 + PhysicalHealth21to30 +
MentalHealth11to20 + MentalHealth21to30 + DiffWalkingYes +
SexMale + AgeCategory25to29 + AgeCategory30to34 + AgeCategory35to39 +
AgeCategory40to44 + AgeCategory45to49 + AgeCategory50to54 +
AgeCategory55to59 + AgeCategory60to64 + AgeCategory65to69 +
AgeCategory70to74 + AgeCategory75to79 + AgeCategory80orMore +
RaceAsian + RaceBlack + RaceHispanic + RaceOther + RaceWhite +
DiabeticBorderline + DiabeticYesDuringPregnancy + DiabeticYes +
PhysicalActivityYes + GenHealthFair + GenHealthGood + GenHealthVeryGood +
GenHealthExcellent + SleepTime5 + SleepTime6 + SleepTime7 +
SleepTime8 + SleepTime9 + SleepTime10 + SleepTime11orMore +
AsthmaYes
The neuralnet() function can then be used to estimate the weights and produce a visual representation of the model.
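A minimal sketch of this step is shown below; the names nn_formula and nn_data are hypothetical stand-ins for the formula shown above and the dummy-coded, min-max scaled data frame.
# Sketch of the neuralnet fit; nn_formula and nn_data are hypothetical names
library(neuralnet)
set.seed(1)                                  # reproducibility of the random starting weights
nn_model <- neuralnet(nn_formula, data = nn_data,
                      hidden = 1,            # one hidden node, matching the 1layhid1 weights below
                      linear.output = FALSE) # logistic activation for classification
plot(nn_model)                               # visual representation of the network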
error | 805.6380133 |
reached.threshold | 0.0096540 |
steps | 23528.0000000 |
Intercept.to.1layhid1 | -0.5018073 |
BMIscale.to.1layhid1 | 0.5504230 |
SmokingYes.to.1layhid1 | 0.2061092 |
AlcoholDrinkingYes.to.1layhid1 | -0.0737902 |
StrokeYes.to.1layhid1 | 1.1430037 |
PhysicalHealth11to20.to.1layhid1 | 0.2358557 |
PhysicalHealth21to30.to.1layhid1 | 0.0023464 |
MentalHealth11to20.to.1layhid1 | 0.1556340 |
MentalHealth21to30.to.1layhid1 | 0.1430190 |
DiffWalkingYes.to.1layhid1 | 0.1536798 |
SexMale.to.1layhid1 | 0.4564541 |
AgeCategory25to29.to.1layhid1 | 0.2509221 |
AgeCategory30to34.to.1layhid1 | 0.5199641 |
AgeCategory35to39.to.1layhid1 | 0.5527978 |
AgeCategory40to44.to.1layhid1 | 0.6975788 |
AgeCategory45to49.to.1layhid1 | 0.9009559 |
AgeCategory50to54.to.1layhid1 | 1.1650248 |
AgeCategory55to59.to.1layhid1 | 1.3708861 |
AgeCategory60to64.to.1layhid1 | 1.5345545 |
AgeCategory65to69.to.1layhid1 | 1.6831910 |
AgeCategory70to74.to.1layhid1 | 1.9641980 |
AgeCategory75to79.to.1layhid1 | 2.0717678 |
AgeCategory80orMore.to.1layhid1 | 2.3998906 |
RaceAsian.to.1layhid1 | 0.0389476 |
RaceBlack.to.1layhid1 | -0.1221795 |
RaceHispanic.to.1layhid1 | 0.0799317 |
RaceOther.to.1layhid1 | 0.0814178 |
RaceWhite.to.1layhid1 | 0.1202088 |
DiabeticBorderline.to.1layhid1 | 0.0391239 |
DiabeticYesDuringPregnancy.to.1layhid1 | -0.0858710 |
DiabeticYes.to.1layhid1 | 0.4105826 |
PhysicalActivityYes.to.1layhid1 | -0.0026157 |
GenHealthFair.to.1layhid1 | -0.3332644 |
GenHealthGood.to.1layhid1 | -0.7282760 |
GenHealthVeryGood.to.1layhid1 | -1.1368055 |
GenHealthExcellent.to.1layhid1 | -1.3821466 |
SleepTime5.to.1layhid1 | -0.0825250 |
SleepTime6.to.1layhid1 | -0.2414696 |
SleepTime7.to.1layhid1 | -0.3209366 |
SleepTime8.to.1layhid1 | -0.2889483 |
SleepTime9.to.1layhid1 | -0.2605813 |
SleepTime10.to.1layhid1 | -0.1052293 |
SleepTime11orMore.to.1layhid1 | 0.0407236 |
AsthmaYes.to.1layhid1 | 0.2050571 |
Intercept.to.HeartDisease1 | -5.1592175 |
1layhid1.to.HeartDisease1 | 7.7006493 |
## ROC Analysis and Model Selection
We compare the neural network model with the logistic candidate models, assessing their predictive performance through ROC analysis.
The performance of the neural network model is very similar to that of the logistic models, with an almost identical ROC curve and AUC value. It therefore seems about equally appropriate for prediction on this dataset, and may be better suited than the logistic model to other scenarios.
Decision trees are another classification technique that can be used to generate predictive models. We will construct several decision trees using the Gini index and information gain impurity measures, in each case without penalization, penalizing false negatives, and penalizing false positives.
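A minimal sketch of these trees is given below; the rpart package (the report does not name the tree implementation), the 5:1 loss ratio for the penalized versions, and the factor level ordering ("0", "1") are assumptions.
# Sketch of decision trees with both impurity measures
library(rpart)
tree_gini <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                   parms = list(split = "gini"))
tree_info <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                   parms = list(split = "information"))
# Penalize false negatives: misclassifying a true "1" (heart disease) as "0"
# costs 5, versus 1 for a false positive (hypothetical 5:1 ratio)
tree_gini_fn <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                      parms = list(split = "gini",
                                   loss = matrix(c(0, 5, 1, 0), nrow = 2)))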
We continue by constructing the confusion matrices of each model and plotting the ROC curves to select the best candidate model.
We note that the decision trees formed with the Gini and information impurity measures performed similarly for the nonpenalized trees and for the trees where false negatives were penalized, but performed very differently for the trees where false positives were penalized. The best performing models were the nonpenalized trees, with the Gini index performing slightly better. However, the AUCs for all decision tree candidate models were smaller than those of the logistic and neural network models created earlier.
We will also look at the optimal cutoffs for the gini and information techniques respectively.
Both impurity measures show optimal performance with a cutoff probability between 0.32 and 0.53, centered at around 0.45.
BAGGING, or bootstrap aggregation, can be used in conjunction with predictive models to enhance their predictive power. By combining weaker models, it can reduce the variability of the final model and reduce the effects of over- and under-fitting the data. We will use bagging with decision trees to create a candidate model.
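A minimal sketch of this bootstrap-and-aggregate procedure with rpart trees is given below; the number of bootstrap samples and the use of in-sample predictions are illustrative assumptions (a held-out test set would normally be used).
# Sketch of bagging with decision trees via a manual bootstrap loop
library(rpart)
set.seed(1)
B <- 100
boot_prob <- replicate(B, {
  idx  <- sample(nrow(f_heart), replace = TRUE)           # bootstrap sample
  tree <- rpart(HeartDisease ~ ., data = f_heart[idx, ], method = "class")
  predict(tree, newdata = f_heart, type = "prob")[, "1"]  # P(heart disease)
})
bag_prob <- rowMeans(boot_prob)   # aggregate by averaging the predicted probabilities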
BAGGING produced an aggregate candidate model that performs slightly better than the single decision tree candidate models seen earlier. The optimal cutoff probabilities are smaller, centered at around 0.25. However, the AUC of the bagged model is also smaller than those of the models created through the other techniques used earlier.
We use the ROC curves and their respective AUC values to select the final models. Of all the techniques used, the full logistic, stepwise logistic, and neural network models produced the highest AUC values and performed very similarly, with AUCs of 0.8413, 0.8408, and 0.8409 respectively. The bagged decision tree model produced the next highest AUC, and the nonpenalized Gini decision tree performed the best of the decision tree algorithms but the worst of the methodologies used in this report.
A few issues were observed in the creation of these models. The bagged model and the neural network model both faced issues with processing time and would require much more time to run on a larger dataset. Bagging improves the predictive power of decision tree algorithms but reduces their interpretability. The bagged model also showed issues in the construction of its ROC curve, where the specificity did not appear to decrease past a value of 0.5.
For this dataset, we recommend the stepwise model on the principle of parsimony, its high AUC when all ROC analyses are considered together, and its interpretability. The coefficients of the logistic model can be used to calculate predicted odds and probabilities, giving a better idea of the specific influence of each individual predictor. The neural network model is our second recommendation, with very similar performance, but it differs in interpretation and may face more runtime issues with a much larger dataset.
The data used for this analysis were a subset of a much larger dataset with an unbalanced number of respondents with and without heart disease. Certain adjustments may be necessary for each of these methods to handle a larger volume of data and to enhance predictive power with an unbalanced design. Predictive models other than those in this brief look may be better suited to such a scenario.
Expansions on the techniques used in this report include multi-layer perceptrons, random forest algorithms, and boosting ensemble methods.