Introduction

Heart disease refers to several types of heart conditions (1) and, according to the CDC, is the leading cause of death for men, women, and people of most racial and ethnic groups (2). One person dies every 33 seconds from heart disease, and over 700,000 people died of heart disease in the United States in 2022, about 1 in 5 deaths (2). Heart disease can lead to serious health events and conditions such as heart attack, arrhythmia, and heart failure, and may be “silent” before one of these occurs (1). As such, the ability to predict heart disease is a major concern in the United States.

Several risk factors for heart disease are known, including high blood pressure, smoking, high cholesterol, diabetes, obesity, physical inactivity, excessive alcohol consumption, and genetic factors (3).

Description of the dataset

The working dataset (4) compiles several known risk factors for heart disease from the survey results of the BRFSS, a survey administered by the CDC (5). It contains 10,000 observations; half have been diagnosed with heart disease, and half have not. This dataset will be used with a variety of statistical and machine learning techniques to create models for the prediction of heart disease.

The working dataset has 18 variables, listed below:

HeartDisease: Whether or not the individual has heart disease (binary response variable)

BMI: The BMI of the individual

Smoking: Whether or not the individual smokes (No, Yes)

AlcoholDrinking: Whether or not the individual drinks alcohol (No, Yes)

Stroke: Whether or not the individual has had a stroke (No, Yes)

PhysicalHealth: Integer between 0 and 30 (inclusive) representing how many days in the past month the individual’s physical health was “not good”

MentalHealth: Integer between 0 and 30 (inclusive) representing how many days in the past month the individual’s mental health was “not good”

DiffWalking: Whether or not the individual has difficulty walking (No, Yes)

Sex: Whether the individual is female or male

AgeCategory: The age of the individual, in 13 possible categories (18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+)

Race: The race of the individual, of six possible categories (American Indian or Alaskan Native, Asian, Black, Hispanic, White, Other)

Diabetic: The diabetes status of the individual, of four possible categories (No, No (borderline diabetes), Yes (during pregnancy), Yes)

PhysicalActivity: Whether or not the individual considers themselves physically active (No, Yes)

GenHealth: The general health of the individual, of five possible categories (Excellent, Very Good, Good, Fair, Poor)

SleepTime: The average number of hours the individual sleeps at night

Asthma: Whether or not the individual has asthma (No, Yes)

KidneyDisease: Whether or not the individual has kidney disease (No, Yes)

SkinCancer: Whether or not the individual has skin cancer (No, Yes)

Research Questions

The response variable for all techniques will be HeartDisease, a binary variable recording whether or not the individual has heart disease. Several candidate models will be created using various statistical and machine learning techniques, with the goal of predicting heart disease from the other predictors in the dataset. The models will then be compared through an analysis of their ROC curves to determine which is best suited for heart disease prediction.

Methodology

We will begin by examining the explanatory and response variables in the dataset, checking for possible violations of the assumptions of a logistic model as well as for sparse categories and missing information. We will examine the underlying structure of the data and identify relationships between variables, checking for possible issues with multicollinearity. We will look for outliers or unusual observations that may affect the final model or indicate data entry errors. Missing values will be imputed where possible. Violations of assumptions will be handled through appropriate transformations. Sparse categories and non-normal numeric explanatory variables may be handled through discretization and/or regrouping. Certain variables may be dropped or otherwise aggregated to address multicollinearity. Single-variable and pairwise distributions will be examined to check for all of the above.

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a binary response variable, making the method appropriate where linear regression cannot be used because the response is dichotomous. The assumptions for logistic regression are: the dependent variable is binary; the independent variables are not highly correlated with one another; the log odds (the logit of the probability) is linearly related to the independent variables; the sample size is sufficiently large; and the observations are independent. Unlike multiple linear regression, because the response is categorical, logistic regression does not require a normally distributed response variable with constant variance. Outliers can have a significant effect on both linear and logistic regression models and should be examined before and after the model’s creation.
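In symbols, writing $p$ for the probability that an individual has heart disease and $x_1, \ldots, x_k$ for the predictors, the model is

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

so that $p = 1/\left(1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}\right)$, and each coefficient $\beta_j$ is the change in the log odds per unit change in $x_j$.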

An alpha of 0.05 will be used for all statistical tests, and a 95% confidence level for all confidence intervals.

In the model-building process, stepwise regression will be used to identify significant predictors and build the model through an automatic procedure. At each step, predictor variables are considered for addition to or removal from the model based on a prespecified criterion. Bidirectional stepwise regression, or stepwise selection, combines forward selection (adding the most statistically significant predictor to the model) and backward elimination (removing the least significant predictor from the model); it will be used, with additional manual adjustments outside of the automatic procedure, to create the best-fitting logistic model while ensuring that the final model makes statistical sense.

A single-layer neural network technique known as the perceptron will be fit with the R package neuralnet to create a classification model. Backpropagation, the algorithm used to estimate the network’s weights, can improve the model’s predictive power and may allow it to outperform logistic regression in prediction.

Decision tree algorithms are another method of classification, generating rules that may allow for greater interpretability through conditional statements based on the predictors in the dataset. Through an iterative splitting process, they produce classifiers with a large degree of flexibility, including the ability to scale penalties for false positive and false negative classifications, and they can be combined through ensemble methods for more powerful predictions.

BAGGING, or Bootstrap Aggregation, can be used in combination with other algorithms, such as neural networks or decision trees, to create more powerful models from weaker ones and thereby mitigate under- and over-fitting. These models may perform more stably. Bagging fits models independently on bootstrap samples and aggregates them to decrease the overall variance. We will use it in combination with a decision tree algorithm to create another candidate model.

All models will be compared through ROC analysis. Different cutoff probabilities produce different confusion matrices, and therefore different measures of sensitivity and specificity, for each model. Sensitivity is defined as the true positive rate, TP/(TP+FN), and specificity as the true negative rate, TN/(TN+FP). The plot of sensitivity against 1-specificity gives the Receiver Operating Characteristic, or ROC, curve. The area under the curve (AUC) will be computed for each candidate model to assess its overall performance.
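As a brief sketch of how this can be carried out in R (the pROC package is one common choice; fit and test_data are placeholder names for a glm-type fitted model and an evaluation data frame):

library(pROC)

# Predicted probabilities of heart disease from a fitted model
pred_probs <- predict(fit, newdata = test_data, type = "response")

# ROC curve: sensitivity against 1 - specificity over all cutoffs
roc_obj <- roc(test_data$HeartDisease, pred_probs)

plot(roc_obj)  # draw the ROC curve
auc(roc_obj)   # area under the curve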

Exploratory Data Analysis

Single Variable Distributions and Pairwise Variable Distributions

Continuous Predictors

We continue with a look at the distributions of the continuous explanatory variables as well as their pairwise relationships with other explanatory variables below.

# A tibble: 30 × 2
   MentalHealth `n()`
          <dbl> <int>
 1            0  6578
 2            1   267
 3            2   464
 4            3   254
 5            4   153
 6            5   387
 7            6    56
 8            7   162
 9            8    25
10            9     6
# ℹ 20 more rows

# A tibble: 31 × 2
   PhysicalHealth `n()`
            <dbl> <int>
 1              0  6265
 2              1   280
 3              2   435
 4              3   279
 5              4   136
 6              5   252
 7              6    52
 8              7   164
 9              8    35
10              9     8
# ℹ 21 more rows

# A tibble: 21 × 2
   SleepTime `n()`
       <dbl> <int>
 1         1    20
 2         2    37
 3         3    92
 4         4   309
 5         5   642
 6         6  2040
 7         7  2728
 8         8  3080
 9         9   553
10        10   317
# ℹ 11 more rows

Despite significant p-values noted in the pairwise correlation plots, the scatter plots themselves show no obvious trends, and the correlation values are all very low. Because large sample sizes can make even negligible correlations statistically significant, we proceed with caution but do not assume any glaring multicollinearity issues among the numerical predictors. However, a closer look at the histograms of MentalHealth, PhysicalHealth, and SleepTime shows categories with very few observations. As such, we make the decision to discretize these variables as follows (a sketch of the recoding follows the list):

MentalHealth: 0-10, 11-20, and 21-30

PhysicalHealth: 0-10, 11-20, and 21-30

SleepTime: ≤4, 5, 6, 7, 8, 9, 10, 11+
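A sketch of this recoding with dplyr and cut(), assuming the raw data frame is named heart and the recoded copy f_heart (the labels match the factor levels shown in the summary below):

library(dplyr)

f_heart <- heart %>%
  mutate(
    # Days of "not good" health: 0-10, 11-20, 21-30
    MentalHealth   = cut(MentalHealth,   breaks = c(-1, 10, 20, 30),
                         labels = c("0to10", "11to20", "21to30")),
    PhysicalHealth = cut(PhysicalHealth, breaks = c(-1, 10, 20, 30),
                         labels = c("0to10", "11to20", "21to30")),
    # Hours of sleep: collapse the sparse tails
    SleepTime = cut(SleepTime, breaks = c(0, 4, 5, 6, 7, 8, 9, 10, Inf),
                    labels = c("4orLess", "5", "6", "7", "8", "9", "10", "11orMore"))
  )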

We follow with a look at the adjusted dataset.

 HeartDisease      BMI        Smoking    AlcoholDrinking Stroke    
 0:5000       Min.   :12.21   No :5186   No :9453        No :9028  
 1:5000       1st Qu.:24.43   Yes:4814   Yes: 547        Yes: 972  
              Median :27.88                                        
              Mean   :28.92                                        
              3rd Qu.:32.11                                        
              Max.   :83.33                                        
                                                                   
 PhysicalHealth MentalHealth  DiffWalking     Sex         AgeCategory  
 0to10 :8173    0to10 :8630   No :7591    Female:4735   70to74  :1307  
 11to20: 590    11to20: 593   Yes:2409    Male  :5265   80orMore:1297  
 21to30:1237    21to30: 777                             65to69  :1287  
                                                        60to64  :1098  
                                                        75to79  :1069  
                                                        55to59  : 884  
                                                        (Other) :3058  
                            Race                    Diabetic   
 AmericanIndianOrAlaskanNative: 168   No                :7484  
 Asian                        : 168   Borderline        : 248  
 Black                        : 707   YesDuringPregnancy:  62  
 Hispanic                     : 714   Yes               :2206  
 Other                        : 340                            
 White                        :7903                            
                                                               
 PhysicalActivity     GenHealth      SleepTime    Asthma     KidneyDisease
 No :2884         Poor     : 835   8      :3080   No :8437   No :9251     
 Yes:7116         Fair     :1796   7      :2728   Yes:1563   Yes: 749     
                  Good     :3203   6      :2040                           
                  VeryGood :2764   5      : 642                           
                  Excellent:1402   9      : 553                           
                                   4orLess: 458                           
                                   (Other): 499                           
 SkinCancer
 No :8637  
 Yes:1363  
           
           
           
           
           

There are no more obvious sparse categories in the dataset. The dataset also does not include missing values. We continue by looking at pairwise relationships between the continuous and categorical predictors and the binary response variable.

The only remaining continuous variable is BMI. We will look at its relationship with HeartDisease through boxplots.

# Boxplots of BMI by heart disease status
ggplot(f_heart) +
  aes(x = BMI, y = HeartDisease, color = HeartDisease) +
  geom_boxplot() + 
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "top") +
  xlab("BMI") + 
  ylab("HeartDisease") 

BMI is heavily skewed right, as seen from the number of high outliers in both boxplots; we observed this as well in the pairwise correlation plot created earlier. There appears to be a slight rightward shift for those with heart disease, suggesting that higher values of BMI may be associated with higher instances of heart disease.

We will investigate the categorical predictors against the binary response through mosaic plots.

# Mosaic plots of each categorical predictor against HeartDisease, four per page
par(mfrow = c(2,2))
mosaicplot(Smoking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Smoking")
mosaicplot(AlcoholDrinking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Alcohol Consumption")
mosaicplot(Stroke ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Heart Disease vs Stroke")
mosaicplot(PhysicalHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Physical Health")

mosaicplot(MentalHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Mental Health")
mosaicplot(DiffWalking ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Difficulty Walking")
mosaicplot(Sex ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Sex")
mosaicplot(AgeCategory ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Age Category")

mosaicplot(Race ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Race")
mosaicplot(Diabetic ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Diabetes")
mosaicplot(PhysicalActivity ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Physical Activity")
mosaicplot(GenHealth ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="General Health")

mosaicplot(SleepTime ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Hours of Sleep")
mosaicplot(Asthma ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Asthma")
mosaicplot(KidneyDisease ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Kidney Disease")
mosaicplot(SkinCancer ~ HeartDisease, data=f_heart,col=c("Blue","Red"), main="Skin Cancer")

The mosaic plots show clear relationships across the levels of many of the categorical predictors. AgeCategory shows a higher proportion of heart disease among older age groups. PhysicalHealth and MentalHealth both have positive relationships with heart disease: people who reported more days of poor physical or mental health also had higher instances of heart disease. GenHealth has the expected negative relationship, with lower instances of heart disease reported among those with better self-reported general health. SleepTime shows a roughly U-shaped relationship, with the lowest instances of heart disease among those who slept about 7-8 hours a day. All of the variables have observable relationships with heart disease in the mosaic plots.

We will create and save a copy of the analytic dataset, available on github at https://github.com/xiang-a/sta551/blob/main/analytic_heart_disease_prediction.csv.

Logistic Regression

Creating Candidate Models

We begin with the full model, created based on all the predictors in the dataset.
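A minimal sketch of the fit, assuming the analytic dataset is loaded as f_heart:

# Full logistic model: every predictor in the analytic dataset
full_model <- glm(HeartDisease ~ ., data = f_heart, family = binomial)
summary(full_model)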

Significance tests of logistic regression model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2864198 0.3765370 -6.0722307 0.0000000
BMI 0.0113523 0.0041797 2.7160424 0.0066067
SmokingYes 0.3107252 0.0513546 6.0505783 0.0000000
AlcoholDrinkingYes -0.1159063 0.1107054 -1.0469792 0.2951092
StrokeYes 1.2218614 0.1063666 11.4872634 0.0000000
PhysicalHealth11to20 0.2473720 0.1151348 2.1485431 0.0316706
PhysicalHealth21to30 -0.0556509 0.0993980 -0.5598800 0.5755613
MentalHealth11to20 0.1649427 0.1128192 1.4620089 0.1437388
MentalHealth21to30 0.1839556 0.1024786 1.7950626 0.0726437
DiffWalkingYes 0.2127672 0.0705836 3.0143978 0.0025749
SexMale 0.6522694 0.0521361 12.5108938 0.0000000
AgeCategory25to29 0.2327302 0.3242666 0.7177125 0.4729346
AgeCategory30to34 0.7720262 0.2916128 2.6474358 0.0081105
AgeCategory35to39 0.8118005 0.2769518 2.9311973 0.0033766
AgeCategory40to44 1.1008981 0.2685478 4.0994490 0.0000414
AgeCategory45to49 1.3805680 0.2609638 5.2902660 0.0000001
AgeCategory50to54 1.8462775 0.2523216 7.3171593 0.0000000
AgeCategory55to59 2.1501536 0.2478033 8.6768561 0.0000000
AgeCategory60to64 2.4088013 0.2459236 9.7949156 0.0000000
AgeCategory65to69 2.6191750 0.2447222 10.7026470 0.0000000
AgeCategory70to74 2.9775692 0.2458826 12.1097203 0.0000000
AgeCategory75to79 3.1957537 0.2483269 12.8691423 0.0000000
AgeCategory80orMore 3.5809135 0.2492594 14.3662141 0.0000000
RaceAsian -0.0869705 0.2929339 -0.2968945 0.7665470
RaceBlack -0.1938231 0.2193996 -0.8834249 0.3770067
RaceHispanic 0.0343352 0.2212412 0.1551936 0.8766687
RaceOther 0.0021654 0.2426190 0.0089250 0.9928790
RaceWhite 0.0518221 0.2001993 0.2588525 0.7957490
DiabeticBorderline 0.1901093 0.1535695 1.2379366 0.2157396
DiabeticYesDuringPregnancy -0.1061314 0.3319201 -0.3197497 0.7491580
DiabeticYes 0.5579723 0.0652919 8.5458159 0.0000000
PhysicalActivityYes -0.0260050 0.0598500 -0.4345031 0.6639231
GenHealthFair -0.2708652 0.1280054 -2.1160448 0.0343410
GenHealthGood -0.7980379 0.1304163 -6.1191583 0.0000000
GenHealthVeryGood -1.4683072 0.1367978 -10.7334155 0.0000000
GenHealthExcellent -1.8733588 0.1499485 -12.4933437 0.0000000
SleepTime5 -0.0751380 0.1573643 -0.4774779 0.6330219
SleepTime6 -0.3310479 0.1354076 -2.4448250 0.0144922
SleepTime7 -0.4470748 0.1350511 -3.3104106 0.0009316
SleepTime8 -0.3630545 0.1337120 -2.7151975 0.0066236
SleepTime9 -0.2464594 0.1649659 -1.4940019 0.1351751
SleepTime10 -0.1385529 0.1912473 -0.7244699 0.4687773
SleepTime11orMore -0.1266088 0.2323120 -0.5449949 0.5857571
AsthmaYes 0.3303882 0.0725389 4.5546371 0.0000052
KidneyDiseaseYes 0.4392750 0.1067819 4.1137583 0.0000389
SkinCancerYes 0.0793600 0.0744873 1.0654165 0.2866875

We choose the reduced model based on clinically important risk factors listed by the CDC (3). These include smoking, diabetes, overweight/obesity (included through BMI), physical inactivity, and alcohol consumption.
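A sketch of the corresponding fit, using the variable names above:

# Reduced model: CDC-listed clinical risk factors only
reduced_model <- glm(HeartDisease ~ Smoking + Diabetic + BMI +
                       PhysicalActivity + AlcoholDrinking,
                     data = f_heart, family = binomial)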

We will then use forward stepwise model selection to create a stepwise logistic regression model. We do so with the step() function from base R (the MASS package provides the equivalent stepAIC()).
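A sketch of the selection call, assuming the full_model and reduced_model fits sketched above (step() is AIC-based by default):

# Forward stepwise search from the reduced model toward the full model
step_model <- step(reduced_model,
                   scope = list(lower = formula(reduced_model),
                                upper = formula(full_model)),
                   direction = "forward")
summary(step_model)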

Summary table of significant tests
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2838855 0.3190477 -7.1584443 0.0000000
SmokingYes 0.3164029 0.0511196 6.1894680 0.0000000
DiabeticBorderline 0.1881642 0.1531270 1.2288115 0.2191425
DiabeticYesDuringPregnancy -0.1013936 0.3306827 -0.3066191 0.7591333
DiabeticYes 0.5502487 0.0650853 8.4542647 0.0000000
BMI 0.0110776 0.0041698 2.6566462 0.0078922
AlcoholDrinkingYes -0.1064020 0.1105749 -0.9622621 0.3359180
PhysicalActivityYes -0.0246788 0.0597297 -0.4131746 0.6794787
AgeCategory25to29 0.2295365 0.3241802 0.7080522 0.4789129
AgeCategory30to34 0.7778376 0.2915574 2.6678709 0.0076334
AgeCategory35to39 0.8116117 0.2768055 2.9320653 0.0033672
AgeCategory40to44 1.1060054 0.2683277 4.1218452 0.0000376
AgeCategory45to49 1.3835301 0.2607772 5.3054114 0.0000001
AgeCategory50to54 1.8524155 0.2520502 7.3493922 0.0000000
AgeCategory55to59 2.1663294 0.2474478 8.7546909 0.0000000
AgeCategory60to64 2.4236931 0.2454770 9.8734010 0.0000000
AgeCategory65to69 2.6429383 0.2439924 10.8320502 0.0000000
AgeCategory70to74 3.0110393 0.2448600 12.2969815 0.0000000
AgeCategory75to79 3.2350393 0.2471244 13.0907321 0.0000000
AgeCategory80orMore 3.6246639 0.2477085 14.6327796 0.0000000
GenHealthFair -0.2714171 0.1279868 -2.1206636 0.0339501
GenHealthGood -0.7969887 0.1303895 -6.1123678 0.0000000
GenHealthVeryGood -1.4564368 0.1365777 -10.6637985 0.0000000
GenHealthExcellent -1.8647902 0.1498180 -12.4470359 0.0000000
StrokeYes 1.2114785 0.1058748 11.4425568 0.0000000
SexMale 0.6612408 0.0519906 12.7184620 0.0000000
AsthmaYes 0.3284718 0.0724732 4.5323225 0.0000058
KidneyDiseaseYes 0.4372834 0.1066397 4.1005664 0.0000412
SleepTime5 -0.0725150 0.1573760 -0.4607754 0.6449598
SleepTime6 -0.3234478 0.1353787 -2.3892081 0.0168847
SleepTime7 -0.4329628 0.1349413 -3.2085262 0.0013342
SleepTime8 -0.3494227 0.1336298 -2.6148568 0.0089265
SleepTime9 -0.2291741 0.1647739 -1.3908401 0.1642739
SleepTime10 -0.1284684 0.1909686 -0.6727199 0.5011255
SleepTime11orMore -0.1505237 0.2313773 -0.6505554 0.5153336
DiffWalkingYes 0.2084680 0.0704829 2.9577116 0.0030993
PhysicalHealth11to20 0.2546074 0.1149893 2.2141835 0.0268162
PhysicalHealth21to30 -0.0420861 0.0991250 -0.4245756 0.6711461
MentalHealth11to20 0.1725802 0.1127830 1.5301976 0.1259678
MentalHealth21to30 0.1890748 0.1023611 1.8471346 0.0647276

ROC Curves and Model Selection

We will assess the candidate logistic regression models through ROC analysis.

The reduced model performed the most poorly of the three, as expected given that it includes the smallest number of predictors. The stepwise and full models performed very similarly, with AUCs of 0.8408 and 0.8413 respectively, so we choose the stepwise model on the principle of parsimony.

Neural Network Model/Perceptron

We continue by creating a candidate model based on the neural network technique described above. We will use min-max scaling for the numeric inputs.
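Min-max scaling maps each numeric input onto [0, 1]; a sketch for BMI, with the scaled column named BMIscale to match the formula below:

# Min-max scaling: (x - min) / (max - min)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
f_heart$BMIscale <- minmax(f_heart$BMI)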

We begin by creating the model formula with all necessary dummies.

HeartDisease1 ~ BMIscale + SmokingYes + AlcoholDrinkingYes + 
    StrokeYes + PhysicalHealth11to20 + PhysicalHealth21to30 + 
    MentalHealth11to20 + MentalHealth21to30 + DiffWalkingYes + 
    SexMale + AgeCategory25to29 + AgeCategory30to34 + AgeCategory35to39 + 
    AgeCategory40to44 + AgeCategory45to49 + AgeCategory50to54 + 
    AgeCategory55to59 + AgeCategory60to64 + AgeCategory65to69 + 
    AgeCategory70to74 + AgeCategory75to79 + AgeCategory80orMore + 
    RaceAsian + RaceBlack + RaceHispanic + RaceOther + RaceWhite + 
    DiabeticBorderline + DiabeticYesDuringPregnancy + DiabeticYes + 
    PhysicalActivityYes + GenHealthFair + GenHealthGood + GenHealthVeryGood + 
    GenHealthExcellent + SleepTime5 + SleepTime6 + SleepTime7 + 
    SleepTime8 + SleepTime9 + SleepTime10 + SleepTime11orMore + 
    AsthmaYes

The model is fit with the neuralnet() function, which estimates the weights and can produce a visual representation of the network.
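A sketch of the call, assuming nn_formula holds the formula above and f_nn is a data frame of 0/1 dummy columns (e.g., built with model.matrix()) with BMI min-max scaled:

library(neuralnet)

nn_model <- neuralnet(nn_formula, data = f_nn,
                      hidden = 1,            # a single hidden unit (perceptron-style)
                      act.fct = "logistic",  # logistic activation function
                      linear.output = FALSE) # classification, not regression

plot(nn_model)  # network diagram with the estimated weights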

error 805.6380133
reached.threshold 0.0096540
steps 23528.0000000
Intercept.to.1layhid1 -0.5018073
BMIscale.to.1layhid1 0.5504230
SmokingYes.to.1layhid1 0.2061092
AlcoholDrinkingYes.to.1layhid1 -0.0737902
StrokeYes.to.1layhid1 1.1430037
PhysicalHealth11to20.to.1layhid1 0.2358557
PhysicalHealth21to30.to.1layhid1 0.0023464
MentalHealth11to20.to.1layhid1 0.1556340
MentalHealth21to30.to.1layhid1 0.1430190
DiffWalkingYes.to.1layhid1 0.1536798
SexMale.to.1layhid1 0.4564541
AgeCategory25to29.to.1layhid1 0.2509221
AgeCategory30to34.to.1layhid1 0.5199641
AgeCategory35to39.to.1layhid1 0.5527978
AgeCategory40to44.to.1layhid1 0.6975788
AgeCategory45to49.to.1layhid1 0.9009559
AgeCategory50to54.to.1layhid1 1.1650248
AgeCategory55to59.to.1layhid1 1.3708861
AgeCategory60to64.to.1layhid1 1.5345545
AgeCategory65to69.to.1layhid1 1.6831910
AgeCategory70to74.to.1layhid1 1.9641980
AgeCategory75to79.to.1layhid1 2.0717678
AgeCategory80orMore.to.1layhid1 2.3998906
RaceAsian.to.1layhid1 0.0389476
RaceBlack.to.1layhid1 -0.1221795
RaceHispanic.to.1layhid1 0.0799317
RaceOther.to.1layhid1 0.0814178
RaceWhite.to.1layhid1 0.1202088
DiabeticBorderline.to.1layhid1 0.0391239
DiabeticYesDuringPregnancy.to.1layhid1 -0.0858710
DiabeticYes.to.1layhid1 0.4105826
PhysicalActivityYes.to.1layhid1 -0.0026157
GenHealthFair.to.1layhid1 -0.3332644
GenHealthGood.to.1layhid1 -0.7282760
GenHealthVeryGood.to.1layhid1 -1.1368055
GenHealthExcellent.to.1layhid1 -1.3821466
SleepTime5.to.1layhid1 -0.0825250
SleepTime6.to.1layhid1 -0.2414696
SleepTime7.to.1layhid1 -0.3209366
SleepTime8.to.1layhid1 -0.2889483
SleepTime9.to.1layhid1 -0.2605813
SleepTime10.to.1layhid1 -0.1052293
SleepTime11orMore.to.1layhid1 0.0407236
AsthmaYes.to.1layhid1 0.2050571
Intercept.to.HeartDisease1 -5.1592175
1layhid1.to.HeartDisease1 7.7006493

ROC Analysis and Model Selection

We compare the predictive performance of the neural network model with that of the logistic candidate models through ROC analysis.

The performance of the neural network model is very similar to that of the logistic models, with an almost identical ROC curve and AUC value. It therefore appears about equally appropriate as a prediction method for this dataset, and may be better suited than the logistic model for other scenarios.

Decision Trees

Decision trees are another classification technique that can be used to generate predictive models. We will construct several decision trees using the “gini index” and “information gain” impurity measures, each without penalization, with false negatives penalized, and with false positives penalized.
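A sketch with the rpart package (the loss-matrix weights are illustrative values, not the ones used in the report):

library(rpart)

# Impurity measure is chosen through parms$split
tree_gini <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                   parms = list(split = "gini"))
tree_info <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                   parms = list(split = "information"))

# Loss matrix (rows = actual class, columns = predicted class):
# here a false negative costs 5 times as much as a false positive
tree_fn <- rpart(HeartDisease ~ ., data = f_heart, method = "class",
                 parms = list(split = "gini",
                              loss = matrix(c(0, 5, 1, 0), nrow = 2)))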

ROC Analysis and Model Building

We continue by constructing the confusion matrices of each model and plotting the ROC curves to select the best candidate model.

We note that the decision trees formed with the “gini” and “information” impurity measures performed similarly for the nonpenalized trees and for the trees penalizing false negatives, but very differently for the trees penalizing false positives. The best-performing models were the nonpenalized trees, with the gini index performing slightly better. However, the AUCs of all decision tree candidate models were smaller than those of the logistic and neural network models created earlier.

We will also look at the optimal cutoffs for the gini and information techniques respectively.
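A sketch of extracting an optimal cutoff with pROC, where roc_gini is a placeholder name for the ROC object of the gini tree:

# "Best" threshold by Youden's index (maximizes sensitivity + specificity - 1)
coords(roc_gini, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))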

Both impurity measures show optimal performance with a cutoff probability between 0.32 and 0.53, centered at around 0.45.

BAGGING

BAGGING, or bootstrap aggregation, can be used in conjunction with predictive models to enhance their predictive power. By combining weaker models fit on bootstrap resamples, it reduces the variability of the final model and mitigates the effects of over- and under-fitting. We will use BAGGING with decision trees to create a candidate model.
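A sketch with the ipred package (the number of resamples, nbagg, is illustrative):

library(ipred)

# Bagged classification trees: rpart fits on bootstrap resamples, aggregated
bag_model <- bagging(HeartDisease ~ ., data = f_heart,
                     nbagg = 100,   # number of bootstrap resamples
                     coob = TRUE)   # out-of-bag error estimate

bag_probs <- predict(bag_model, newdata = f_heart, type = "prob")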

BAGGING produced an aggregate candidate model that performs slightly better than the single decision tree candidate models seen earlier. The optimal cutoff probabilities are smaller, centered at around 0.25. However, the AUC of the BAGGING model is still smaller than those of the models created with the other techniques used earlier.

Summary and Discussion

We use the ROC curves and their respective AUC values to select the final models. Of all the techniques used, the full and stepwise logistic models and the neural network model produced the highest AUC values and performed very similarly, with AUCs of 0.8413, 0.8408, and 0.8409 respectively. Among the remaining techniques, the bagged decision tree model produced the next-highest AUC, and the nonpenalized gini decision tree performed the best of the decision tree algorithms but the worst of the methodologies used in this report.

A few issues were observed in the creation of these models. The bagged model and the neural network model both faced long processing times and would require much more time to run on a larger dataset. Bagging improves the predictive power of decision tree algorithms but reduces their interpretability. The bagged model in particular showed issues in the construction of its ROC curve, where the specificity did not appear to decrease past a value of 0.5.

For this dataset, we recommend the stepwise model based on the principle of parsimony, its high AUC across the ROC analyses, and its interpretability. The coefficients of the logistic model can be used to calculate predicted odds and probabilities, giving a better idea of the specific influence of each individual predictor. The neural network model, with very similar performance, is our second recommendation; it differs slightly in terms of interpretation and may face longer runtimes on a much larger dataset.

The data used for this analysis is a subset of a much larger dataset with an unbalanced number of respondents with and without heart disease. Each of these methods may need adjustments to handle a larger volume of data well and to maintain predictive power under an unbalanced design. Predictive models beyond those in this brief look may be better suited to such a scenario.

Expansions on the techniques used in this report include multi-layer perceptrons, random forest algorithms, and boosting ensemble methods.

References and Appendix

  1. https://www.cdc.gov/heart-disease/about/index.html
  2. https://www.cdc.gov/heart-disease/data-research/facts-stats/index.html
  3. https://www.cdc.gov/heart-disease/risk-factors/index.html
  4. https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data
  5. https://www.cdc.gov/brfss/annual_data/annual_2023.html