Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/FinalProject


Load Data & EDA


Information about the dataset

Before loading our dataset and walking through our data exploration process, we'll quickly summarize the dataset that we'll be using for our machine learning analysis. The dataset is part of UCI's Cleveland Heart Disease data file, which consists of 303 individuals and various health-related metrics. The version we'll be using includes 14 attributes retrieved from patient tests and is well suited to machine learning and classification analysis. Ultimately, these attributes will be examined to determine their effectiveness at predicting whether or not a particular patient has heart disease. You can find more information about these 14 attributes below:

Data Dictionary:

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
    • Value 0: upsloping
    • Value 1: flat
    • Value 2: downsloping
  • ca: number of major vessels (0-4) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • target: diagnosis of heart disease (angiographic disease status)
    • Value 0: < 50% diameter narrowing
    • Value 1: > 50% diameter narrowing

Exploratory Data Analysis

First, we can check the shape of the dataset. It looks like there are 303 rows and 14 columns in our imported data. We also show the first few rows of data just to sanity check that things look correct.

df <- read.csv("./data/heart.csv", header = TRUE)

dim(df)
## [1] 303  14
kable(head(df)) 
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
57 1 0 140 192 0 1 148 0 0.4 1 0 1 1

Missing values in the dataset

Missing values can be problematic if present, so we also check whether any data are missing.

plot_missing(df)

As we can see above, there aren’t any missing values across our dataset, which will make it easier for us as we start to prepare our dataset for machine learning analysis.
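As a quick numeric cross-check (a minimal base-R sketch), we can also count the missing values in each column:

# count NAs per column; all zeros confirms the plot above
colSums(is.na(df))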

As a next step, let's convert the known categorical features to factors, then show a summary of our dataframe.

df_viz <- df %>% 
  mutate(sex = if_else(sex == 1, "Male", "Female"),
         fbs = ordered(fbs, levels=c(0, 1), labels=c("<=120", ">120")),
         exang = if_else(exang == 1, "Yes" ,"No"),
         cp = if_else(cp == 1, "Atypical Angina",
                      if_else(cp == 2, "Non-anginal pain", "Asymptomatic")),
         restecg = if_else(restecg == 0, "Normal",
                           if_else(restecg == 1, "Abnormality", "probable or definite")),
         slope = ordered(slope, 
                         levels=c(2, 1, 0), 
                         labels=c('down', 'flat', 'up')),
         ca = ordered(ca, 
                      levels=c(0, 1, 2, 3, 4), 
                      labels=c(0, 1, 2, 3, 4)),
         thal = as.factor(thal),
         target = if_else(target == 1, "Has heart disease", "Does not have heart disease")) %>% 
  mutate_if(is.character, as.factor) %>% 
  dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())

Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:

skimr::skim(df_viz)
Data summary
Name df_viz
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
factor 9
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
target 0 1 FALSE 2 Has: 165, Doe: 138
sex 0 1 FALSE 2 Mal: 207, Fem: 96
fbs 0 1 TRUE 2 <=1: 258, >12: 45
exang 0 1 FALSE 2 No: 204, Yes: 99
cp 0 1 FALSE 3 Asy: 166, Non: 87, Aty: 50
restecg 0 1 FALSE 3 Abn: 152, Nor: 147, pro: 4
slope 0 1 TRUE 3 dow: 142, fla: 140, up: 21
ca 0 1 TRUE 5 0: 175, 1: 65, 2: 38, 3: 20
thal 0 1 FALSE 4 2: 166, 3: 117, 1: 18, 0: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁

Looking at our target, there is a reasonable class balance between the presence and absence of heart disease. However, we do see imbalance across the levels of several of the factor features: fbs, exang, cp, restecg, slope, ca, and thal.

While age looks normally distributed, there is apparent right-skew in trestbps, chol, and oldpeak, and thalach is left-skewed. Our continuous features are also on different scales, so we may want to transform and normalize them ahead of modeling.

Let’s visualize our target distribution.

# Bar plot for target (Heart disease) 
ggplot(df_viz, aes(x=target, fill=target)) + 
  geom_bar() +
  xlab("Heart Disease") +
  ylab("Count") +
  ggtitle("Analysis of Presence and Absence of Heart Disease") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence"))

prop.table(table(df_viz$target))
## 
## Does not have heart disease           Has heart disease 
##                   0.4554455                   0.5445545

We can see that there is roughly a 50/50 split between patients in this dataset who have heart disease and those who do not. This balance will come in handy later when we set up our machine learning analyses.

Additionally, the summary distributions show that our patient population skews older, with an average age of about 54 years. The oldest individual is 77 years old and the youngest is 29. We can take a look at the age frequency counts here:

# Histogram of patient ages

hist(df_viz$age, labels=TRUE, main="Histogram of Age", xlab="Age Class", 
     ylab="Frequency", col="steelblue")

df_viz %>%
  ggplot(aes(x=age, y=chol, size=chol, color=target)) +
  geom_density2d() + 
  geom_point(alpha=0.4) + 
  scale_color_manual(values = c("#2bbac0", "#f06e64")) + 
  xlab("Age") +
  ylab("Cholesterol")

Interestingly, when we look at the relationship between a patient's age, cholesterol, and heart disease status in the plot above, those with positive diagnoses tend to cluster between the ages of 40 and 60 with relatively high cholesterol levels. The patients without heart disease tend to be older (roughly 55-65) and show a wider range of cholesterol values.

We can also compare the mean resting blood pressure (trestbps) split by our target; interestingly, the mean is slightly lower in the heart disease group (about 129 vs. 134 mm Hg):

df %>% 
  group_by(target) %>% 
  summarise(mean=mean(trestbps))
## # A tibble: 2 x 2
##   target  mean
##    <int> <dbl>
## 1      0  134.
## 2      1  129.

Feature Plots

Next, we explore the relationships between each feature and the target. For our continuous features, we use a simple featurePlot from the caret package, which also lets us see how pairs of features relate to the target.

featurePlot(x=df_viz[,c(10, 11, 12, 13, 14)], y=df_viz[, 1], 
            plot = "ellipse", 
            auto.key = list(columns = 2)
            )

Here we see some patterns emerge. For example, oldpeak tends to cluster toward lower values across all other features in the heart disease group, while thalach tends toward higher values. chol and thalach together show tighter clustering and better separation between our target classes.

For the continuous variables, we can also examine the correlations, broken out by the target variable.

wo_hd <- df_viz %>% 
  filter(target == 'Does not have heart disease') %>% 
  select(age, trestbps, chol, thalach, oldpeak)

cor_heart_wo <- cor(wo_hd)

ggcorr(cor_heart_wo, label = T, label_round = 2) +
  scale_fill_gradient2(low='#ff0000', mid='#ffffff', high='#00ff00', 
                        midpoint=0, guide = "colourbar", aesthetics = "fill") +
  ggtitle('Correlation Plot: Patients without Heart Disease')

wi_hd <- df_viz %>% 
  filter(target == 'Has heart disease') %>% 
  select(age, trestbps, chol, thalach, oldpeak)

cor_heart_wi <- cor(wi_hd)

ggcorr(cor_heart_wi, label = T, label_round = 2) +
  scale_fill_gradient2(low='#ff0000', mid='#ffffff', high='#00ff00', 
                        midpoint=0, guide = "colourbar", aesthetics = "fill") +
  ggtitle('Correlation Plot: Patients with Heart Disease')

cor_heart_delta = cor_heart_wi - cor_heart_wo

delta_df <- as.data.frame(cor_heart_delta)

delta_dfl <- delta_df %>% 
  rownames_to_column(var="X") %>%
  pivot_longer(cols=-c(X), names_to='Y')
  
ggplot(delta_dfl, aes(X, Y, fill=value)) +
  geom_tile() + 
  coord_flip() +
  scale_fill_gradient2(low='#ff0000', mid='#ffffff', high='#00ff00', 
                        midpoint=0, guide = "colourbar", aesthetics = "fill") +
  ggtitle('Difference between Correlograms: With Disease - Without Disease')

We separate the correlograms by target class. The first chart shows the pairwise correlations between continuous variables in our healthy patients, and the second shows the same correlations in patients with heart disease. The third chart shows the difference between the two, to draw attention to how heart disease patients differ from healthy patients. For example, the age vs. thalach correlation is -0.55 in healthy patients but -0.94 in heart disease patients, a much stronger negative correlation, which appears as a bright red tile for age vs. thalach in the difference plot. As a counter-example, the age vs. chol correlation is higher in diseased patients than in healthy ones (0.23 vs. -0.12), which shows up as a light green tile in the difference plot (third chart).

We can use bar charts to visualize how our categorical features are distributed across the two target classes.

long_df <- df_viz %>% 
  select(c(target:thal)) %>%
  mutate(fbs = factor(as.character(fbs)),
         ca = factor(as.character(ca)),
         thal = factor(as.character(thal)),
         slope = factor(as.character(slope))) %>%
  pivot_longer(cols=-c(target), names_to='kpi')

ggplot(long_df, aes(x=value, fill=target)) + 
  geom_bar() + 
  facet_wrap(~kpi, scales="free_x") + 
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) + 
  ggtitle('Comparing Categorical Features and Target')

A few notable points:

  • Looking at sex, we see that a far higher proportion of the female patients in this study presented with heart disease than of the male patients.
  • exang = No is more common among patients with heart disease.
  • As we would expect, cp = Asymptomatic is far more common among the healthy patients.

Dummy Columns

We’ll need to dummy code our categorical variables. This process will create new columns for each value and assign a 0 or 1. Note that dummy encoding typically drops one value which becomes the baseline. So if we have a categorical feature with five unique values, we will have four columns. If all columns are 0, that represents the reference value. This helps reduce the number of necessary columns. With dummy columns in place, we need to remove our original variables from this dataset.

# dummy encode our categorical features
df_dummy <- dummyVars(~ ., fullRank=T, drop2nd=T, data=df_viz)
df_dummy <- data.frame(predict(df_dummy, newdata=df_viz))

df_dummy <- df_dummy %>%
  mutate(target = as.factor(target.Has.heart.disease)) %>%
  select(-target.Has.heart.disease)

Note: the dummyVars() step encodes our ordered factors (fbs, slope, ca) with polynomial contrasts, which is why we see columns such as fbs.L, slope.Q, and ca.C, but it does not transform the continuous features. We could center, scale, or apply per-feature transformations (e.g. using BoxCox.lambda) ahead of modeling, but the distributions we saw above didn't look severely skewed, so we proceed with the untransformed values.
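If we did want a common scale, a minimal sketch using caret's preProcess (not applied in the remainder of this report) would look like this:

# optional: center and scale the continuous columns
num_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")
pp <- preProcess(df_dummy[, num_cols], method = c("center", "scale"))
df_scaled <- df_dummy
df_scaled[, num_cols] <- predict(pp, df_dummy[, num_cols])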

Splitting the Dataset into Training and Testing

Our first step will be to separate the data into a training and test dataset. This way, we can test the accuracy by using our holdout test set later on. We decided to perform a conventional 80/20 training to testing data split on our dummied dataset.

set.seed(123)

index <- createDataPartition(df_dummy$target, p=0.8, list = FALSE)

df_train <- df_dummy[index,]
df_test  <- df_dummy[-index,]

With our dataset prepared and a training/test split in place, we are ready to proceed with trying different machine learning models to predict heart disease given our features.

Supervised Machine Learning

Random Forest Analysis

Random forest is a supervised learning algorithm that uses an ensemble of decision trees for classification and regression. It is a bagging technique rather than a boosting technique: the trees are built independently and in parallel, with no interaction between them during training. The algorithm constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' predictions (classification) or their mean prediction (regression).
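One detail worth noting before we fit: for classification, randomForest tries mtry = floor(sqrt(p)) randomly selected predictors at each split by default, which with our 21 dummy-encoded predictors works out to the 4 reported in the model summary below.

# default mtry for classification: floor(sqrt(number of predictors))
floor(sqrt(ncol(df_train) - 1))
## [1] 4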

set.seed(123)
fit.forest <- randomForest(target ~ ., data = df_train, importance=TRUE, ntree=2000)

# display model details
fit.forest
## 
## Call:
##  randomForest(formula = target ~ ., data = df_train, importance = TRUE,      ntree = 2000) 
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 16.87%
## Confusion matrix:
##    0   1 class.error
## 0 93  18   0.1621622
## 1 23 109   0.1742424
plot(fit.forest)

The plot above shows error as a function of the number of trees in our random forest. The error drops dramatically as more trees are added and averaged. For reference, the red line is the misclassification rate (MCR) for the class designated as not having heart disease in our training set, the green line is the MCR for the class designated as having heart disease, and the black line is the overall OOB error. The overall OOB error rate, which is what we are most interested in, settles at a reasonably low level. Interestingly, the error for the no-heart-disease class is quite high over the first few trees, but it drops quickly as more trees are added and averaged.
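The curves in that plot come directly from fit.forest$err.rate, whose columns are the overall OOB error followed by the per-class error rates; a quick sketch to make the colour coding explicit:

# inspect the error matrix behind plot(fit.forest) and label the curves
colnames(fit.forest$err.rate)
plot(fit.forest)
legend("topright", legend = colnames(fit.forest$err.rate), col = 1:3, lty = 1:3)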

# show which features were most important
importance(fit.forest)
##                                       0          1 MeanDecreaseAccuracy
## sex.Male                      9.0704704 25.0338607           24.4119583
## fbs.L                        -0.9344651  0.4164734           -0.3184235
## exang.Yes                    11.9277558 10.5587409           15.1351725
## cp.Atypical.Angina            5.2528497  0.4315867            4.0363058
## cp.Non.anginal.pain          17.2381156  7.3229993           16.9461636
## restecg.Normal                3.2540965  4.8975554            5.8390767
## restecg.probable.or.definite -1.2655302 -3.8634344           -4.0848817
## slope.L                      12.9774985  2.8493332           11.4015972
## slope.Q                       2.1566005  1.9906048            3.1049258
## ca.L                         21.8928226 20.1368599           26.8148332
## ca.Q                         24.6405370 27.4469281           34.6635740
## ca.C                         11.4421390 17.7668713           20.6168641
## ca.4                          8.6745184 15.6614353           16.6657480
## thal.1                       -1.5445887  5.3702696            2.6987890
## thal.2                       23.9738640 25.4875592           31.5974142
## thal.3                       20.4695216 22.4645270           27.7900762
## age                           5.5165381  6.3363017            8.5017674
## trestbps                      1.7600099  0.7345711            1.9188623
## chol                         -5.0859014  6.7756548            1.8362207
## thalach                      14.4557134 23.3814674           26.9555574
## oldpeak                      20.3630123 17.6957179           26.2933800
##                              MeanDecreaseGini
## sex.Male                            3.9605866
## fbs.L                               1.1401556
## exang.Yes                           5.3811409
## cp.Atypical.Angina                  1.3364486
## cp.Non.anginal.pain                 3.7908046
## restecg.Normal                      2.3829847
## restecg.probable.or.definite        0.1479234
## slope.L                             3.3942602
## slope.Q                             1.9938619
## ca.L                                7.9996782
## ca.Q                               10.6301479
## ca.C                                3.9869417
## ca.4                                3.2525415
## thal.1                              0.6340835
## thal.2                             11.2929543
## thal.3                              7.8552130
## age                                 7.9787206
## trestbps                            6.9632178
## chol                                7.5964484
## thalach                            14.2977390
## oldpeak                            11.1235862
varImpPlot(fit.forest,type=2)

Next, we took a look at variable importance in our Random Forest model. Mean Decrease in Gini is the average of a variable's total decrease in node impurity, weighted by the proportion of samples reaching each node, across all the trees in the forest, so it helps us gauge which features are most effective at predicting our target variable. From the table and Mean Decrease Gini plot above, thalach, thal.2, and oldpeak appear to be among the most effective features. We'll use this knowledge when running later models, and it suggests that certain exercise-induced cardiac measurements play a key role in our estimates.
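A quick way to pull out those leaders directly from the importance matrix:

# top features ranked by MeanDecreaseGini
sort(importance(fit.forest)[, "MeanDecreaseGini"], decreasing = TRUE)[1:5]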

After fitting our Random Forest model, we can now use it to create predictions on our holdout test set to evaluate its performance.

rf.pred <- predict(fit.forest, newdata=df_test, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, df_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  7
##          1  6 26
##                                          
##                Accuracy : 0.7833         
##                  95% CI : (0.658, 0.8793)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 0.0001472      
##                                          
##                   Kappa : 0.5638         
##                                          
##  Mcnemar's Test P-Value : 1.0000000      
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.7879         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 0.8125         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.4667         
##       Balanced Accuracy : 0.7828         
##                                          
##        'Positive' Class : 0              
## 
rf.perf <- c(forest.cm_train$overall[1], 
             forest.cm_train$byClass[1], 
             forest.cm_train$byClass[2], 
             forest.cm_train$byClass[5], 
             forest.cm_train$byClass[6], 
             forest.cm_train$byClass[7])

From the resulting confusion matrix and evaluation, our accuracy on the holdout test set is about 0.78, which is reasonably good. Still, 6 patients who did not have heart disease were misclassified as having it, and 7 patients who actually had heart disease were misclassified as healthy. Note that caret designated class 0 (no heart disease) as the 'Positive' class, so the reported sensitivity of 78% is the proportion of healthy patients correctly identified (21 of 27), and the specificity of 79% is the proportion of heart disease patients correctly identified (26 of 33). Either way, roughly 22% of patients were misclassified, which would leave some unhappy people if this model were deployed as-is. This is why it's important to think critically about the evaluation of our machine learning models and whether an accuracy of 78% would be effective enough to implement in the real world.
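If we prefer to read sensitivity as the disease detection rate, we can ask confusionMatrix() to treat class 1 as the positive class; the sketch below re-reports the same table with the roles of sensitivity and specificity swapped.

# same confusion matrix, but with "has heart disease" (class 1) as the positive class
confusionMatrix(rf.pred, df_test$target, positive = "1")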

Gradient Boosting

In order to build on our initial Random Forest model, we'll take a gradient boosting approach. Unlike Random Forest, which is a bagging approach, boosting builds trees sequentially: each iteration fits to the residuals left by the previous trees, so the trees build on one another until a stopping point is reached. We'll use the gbm() function on our training dataset to generate 2000 trees with 5-fold cross-validation.
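As an aside, here is a minimal alternative sketch (not the fit used in this report): gbm also offers a bernoulli (logistic) loss for binary outcomes, which expects a numeric 0/1 response and returns predicted probabilities.

# sketch: the same boosted model with a bernoulli loss on a numeric 0/1 target
df_train_bin <- df_train %>% mutate(target = as.numeric(as.character(target)))

fit.boost.bern <- gbm(target ~ ., data = df_train_bin,
                      distribution = "bernoulli",
                      n.trees = 2000, cv.folds = 5)

# probabilities on the test set at the CV-optimal number of trees; threshold at 0.5
p_hat <- predict(fit.boost.bern, df_test,
                 n.trees = gbm.perf(fit.boost.bern, method = "cv", plot.it = FALSE),
                 type = "response")

For the rest of this section we keep the gaussian fit as originally specified, thresholding its numeric predictions to recover class labels.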

set.seed(123)

# build gradient boosted model
fit.boost <- gbm(target~., df_train,
             distribution = "gaussian",
             n.trees = 2000,
             cv.folds = 5)

# display model details
fit.boost
## gbm(formula = target ~ ., distribution = "gaussian", data = df_train, 
##     n.trees = 2000, cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 2000 iterations were performed.
## The best cross-validation iteration was 82.
## There were 21 predictors of which 16 had non-zero influence.
sqrt(min(fit.boost$cv.error))
## [1] 0.3548503
summary(fit.boost)

##                                                       var    rel.inf
## chol                                                 chol 17.3985753
## thalach                                           thalach 15.5215930
## trestbps                                         trestbps 14.3672500
## age                                                   age 10.8354123
## oldpeak                                           oldpeak 10.0523903
## ca.Q                                                 ca.Q  5.8455820
## thal.2                                             thal.2  5.5081521
## exang.Yes                                       exang.Yes  3.2352595
## ca.C                                                 ca.C  2.3668716
## cp.Non.anginal.pain                   cp.Non.anginal.pain  1.9694880
## fbs.L                                               fbs.L  1.9521777
## restecg.Normal                             restecg.Normal  1.8773360
## slope.L                                           slope.L  1.4708664
## sex.Male                                         sex.Male  1.4682426
## cp.Atypical.Angina                     cp.Atypical.Angina  1.4047385
## thal.3                                             thal.3  1.3909945
## ca.4                                                 ca.4  1.2916909
## slope.Q                                           slope.Q  1.0296266
## ca.L                                                 ca.L  0.8271799
## thal.1                                             thal.1  0.1865729
## restecg.probable.or.definite restecg.probable.or.definite  0.0000000

From the model summary and the relative influence table above, the continuous features chol, thalach, and trestbps carry the most relative influence in this model, with ca.Q and thal.2 the most influential of the dummy-encoded categorical features.

Next, we’ll take a look at the partial dependence plots of a few of our variables, including chol and thalach. These plots are a bit noisy, but we can see that both thalach and chol seem to be somewhat directly related to our response variable (a patient having heart disease).

plot(fit.boost, i="chol")

plot(fit.boost, i="thalach")

# plot loss function as a result of n trees added to the ensemble
gbm.perf(fit.boost, method = "cv")

## [1] 82
gbm.perf(fit.boost, method = "OOB")
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.

## [1] 43
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 2000 
## Equivalent Number of Parameters: 39.99 
## Residual Standard Error: 0.0003449

The plots above indicate the optimal number of trees for each method. The green line shows the estimated out-of-sample error (the cross-validation error in the first plot), and the blue dotted line marks the optimal number of iterations. We can observe that beyond a certain point (82 iterations for the cross-validation method and 43 for the OOB method), the held-out error appears to increase because of overfitting.

Guided by these estimates, we use 42 trees (essentially the OOB optimum of 43) to predict the test set.

boostPre <- predict.gbm(fit.boost, 
                        df_test, 
                        n.trees = 42)
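# the gaussian fit returns predictions on the numeric scale of the encoded target,
# so a 1.5 cutoff maps them back to the 0/1 class labels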
boostPre <- ifelse(boostPre < 1.5, 0, 1)

gbm.pred <- as.factor(boostPre)

(boost.cm <- confusionMatrix(gbm.pred, df_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  3
##          1  6 30
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 8.068e-07      
##                                          
##                   Kappa : 0.6939         
##                                          
##  Mcnemar's Test P-Value : 0.505          
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.8750         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.8434         
##                                          
##        'Positive' Class : 0              
## 
gbm.perf <- c(boost.cm$overall[1],
              boost.cm$byClass[1],
              boost.cm$byClass[2],
              boost.cm$byClass[5],
              boost.cm$byClass[6],
              boost.cm$byClass[7])

From the confusion matrix for our gradient boosting method, accuracy is about 0.85, an improvement over the Random Forest model above. There were still 6 patients in our holdout test set without heart disease who were misclassified as having it, but this time only 3 patients who actually had heart disease were missed, versus 7 for the Random Forest. Recall that caret treats class 0 (no heart disease) as the positive class, so the sensitivity of 78% (healthy patients correctly identified) is essentially unchanged from the Random Forest, while the specificity of 91% (heart disease patients correctly identified) is a clear improvement. Put another way, the error we most want to minimize is predicting that a patient is healthy when in fact they are not; for health-related predictions, that mistake is far worse than telling a healthy patient they might have a disease, and the GBM makes fewer of them.

We can see that this gradient boosting technique was more effective at identifying individuals who have heart disease. Thinking about the ramifications of using this approach in the real world, we'd feel a bit more confident that this model could detect heart disease, but with only 78% of healthy patients correctly classified (6 false alarms), we'd still have to be careful that we weren't administering too many false positives.

Given this context, we’ll keep working through other machine learning approaches.

Support Vector Machine

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression; they are mostly used for classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces. Therefore, we’ll also use this technique to classify patients based on our target variable.

set.seed(123)

fit.svm <- svm(target~., 
               data=df_train, 
               kernel="radial", 
               cost=10, 
               scale = FALSE)
fit.svm
## 
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  242

We’ll initially start our svm model by fitting it to our training set. This is a baseline model, which we’ll tune later.

svm.pred <- predict(fit.svm, 
                    newdata = df_test)

(svm.cm_train <- confusionMatrix(svm.pred, df_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  2  1
##          1 25 32
##                                           
##                Accuracy : 0.5667          
##                  95% CI : (0.4324, 0.6941)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.4502          
##                                           
##                   Kappa : 0.0476          
##                                           
##  Mcnemar's Test P-Value : 6.462e-06       
##                                           
##             Sensitivity : 0.07407         
##             Specificity : 0.96970         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.56140         
##              Prevalence : 0.45000         
##          Detection Rate : 0.03333         
##    Detection Prevalence : 0.05000         
##       Balanced Accuracy : 0.52189         
##                                           
##        'Positive' Class : 0               
## 

When we apply this fitted model to our holdout test set, accuracy is quite low, around 57%. Sensitivity is very low, at about 7%, while specificity is very high, at 97%: this first SVM classified almost every patient as having heart disease, and only three patients in the holdout test set were classified as not having it. Obviously, this model would not be very effective in the real world. One likely contributor is that we fit with scale = FALSE even though our continuous features sit on very different scales; a quick diagnostic sketch follows. We'll then tune cost, kernel, and gamma to find an optimal configuration for our next iteration of SVM.
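A minimal diagnostic sketch (not part of the tuned model that follows): refit the same radial SVM but let e1071 scale the inputs internally, which is its default behaviour.

# baseline radial SVM again, this time with internal feature scaling enabled
fit.svm.scaled <- svm(target ~ ., data = df_train, kernel = "radial", cost = 10, scale = TRUE)
confusionMatrix(predict(fit.svm.scaled, newdata = df_test), df_test$target)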

svm.tune <- tune(e1071::svm, target~.,
                 data = df_train,
                 ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                 kernel=c("linear", "polynomial", "radial"),
                 gamma=c(0.5, 1, 2, 3, 4)))
summary(svm.tune)
## 
## Parameter tuning of 'e1071::svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost kernel gamma
##  0.01 linear   0.5
## 
## - best performance: 0.1568333 
## 
## - Detailed performance results:
##      cost     kernel gamma     error dispersion
## 1   1e-03     linear   0.5 0.2098333 0.06563296
## 2   1e-02     linear   0.5 0.1568333 0.07804893
## 3   1e-01     linear   0.5 0.1896667 0.07199880
## 4   1e+00     linear   0.5 0.1646667 0.05537638
## 5   5e+00     linear   0.5 0.1648333 0.05577983
## 6   1e+01     linear   0.5 0.1690000 0.05746282
## 7   1e+02     linear   0.5 0.1690000 0.05746282
## 8   1e-03 polynomial   0.5 0.1851667 0.06517521
## 9   1e-02 polynomial   0.5 0.2138333 0.07962121
## 10  1e-01 polynomial   0.5 0.2135000 0.08310506
## 11  1e+00 polynomial   0.5 0.2135000 0.08310506
## 12  5e+00 polynomial   0.5 0.2135000 0.08310506
## 13  1e+01 polynomial   0.5 0.2135000 0.08310506
## 14  1e+02 polynomial   0.5 0.2135000 0.08310506
## 15  1e-03     radial   0.5 0.4565000 0.09035407
## 16  1e-02     radial   0.5 0.4565000 0.09035407
## 17  1e-01     radial   0.5 0.4565000 0.09035407
## 18  1e+00     radial   0.5 0.3448333 0.09410872
## 19  5e+00     radial   0.5 0.3406667 0.06690790
## 20  1e+01     radial   0.5 0.3406667 0.06690790
## 21  1e+02     radial   0.5 0.3406667 0.06690790
## 22  1e-03     linear   1.0 0.2098333 0.06563296
## 23  1e-02     linear   1.0 0.1568333 0.07804893
## 24  1e-01     linear   1.0 0.1896667 0.07199880
## 25  1e+00     linear   1.0 0.1646667 0.05537638
## 26  5e+00     linear   1.0 0.1648333 0.05577983
## 27  1e+01     linear   1.0 0.1690000 0.05746282
## 28  1e+02     linear   1.0 0.1690000 0.05746282
## 29  1e-03 polynomial   1.0 0.1850000 0.05812396
## 30  1e-02 polynomial   1.0 0.2135000 0.08310506
## 31  1e-01 polynomial   1.0 0.2135000 0.08310506
## 32  1e+00 polynomial   1.0 0.2135000 0.08310506
## 33  5e+00 polynomial   1.0 0.2135000 0.08310506
## 34  1e+01 polynomial   1.0 0.2135000 0.08310506
## 35  1e+02 polynomial   1.0 0.2135000 0.08310506
## 36  1e-03     radial   1.0 0.4565000 0.09035407
## 37  1e-02     radial   1.0 0.4565000 0.09035407
## 38  1e-01     radial   1.0 0.4565000 0.09035407
## 39  1e+00     radial   1.0 0.4150000 0.09497563
## 40  5e+00     radial   1.0 0.4026667 0.10346527
## 41  1e+01     radial   1.0 0.4026667 0.10346527
## 42  1e+02     radial   1.0 0.4026667 0.10346527
## 43  1e-03     linear   2.0 0.2098333 0.06563296
## 44  1e-02     linear   2.0 0.1568333 0.07804893
## 45  1e-01     linear   2.0 0.1896667 0.07199880
## 46  1e+00     linear   2.0 0.1646667 0.05537638
## 47  5e+00     linear   2.0 0.1648333 0.05577983
## 48  1e+01     linear   2.0 0.1690000 0.05746282
## 49  1e+02     linear   2.0 0.1690000 0.05746282
## 50  1e-03 polynomial   2.0 0.2135000 0.08310506
## 51  1e-02 polynomial   2.0 0.2135000 0.08310506
## 52  1e-01 polynomial   2.0 0.2135000 0.08310506
## 53  1e+00 polynomial   2.0 0.2135000 0.08310506
## 54  5e+00 polynomial   2.0 0.2135000 0.08310506
## 55  1e+01 polynomial   2.0 0.2135000 0.08310506
## 56  1e+02 polynomial   2.0 0.2135000 0.08310506
## 57  1e-03     radial   2.0 0.4565000 0.09035407
## 58  1e-02     radial   2.0 0.4565000 0.09035407
## 59  1e-01     radial   2.0 0.4565000 0.09035407
## 60  1e+00     radial   2.0 0.4481667 0.09393802
## 61  5e+00     radial   2.0 0.4440000 0.09229315
## 62  1e+01     radial   2.0 0.4440000 0.09229315
## 63  1e+02     radial   2.0 0.4440000 0.09229315
## 64  1e-03     linear   3.0 0.2098333 0.06563296
## 65  1e-02     linear   3.0 0.1568333 0.07804893
## 66  1e-01     linear   3.0 0.1896667 0.07199880
## 67  1e+00     linear   3.0 0.1646667 0.05537638
## 68  5e+00     linear   3.0 0.1648333 0.05577983
## 69  1e+01     linear   3.0 0.1690000 0.05746282
## 70  1e+02     linear   3.0 0.1690000 0.05746282
## 71  1e-03 polynomial   3.0 0.2135000 0.08310506
## 72  1e-02 polynomial   3.0 0.2135000 0.08310506
## 73  1e-01 polynomial   3.0 0.2135000 0.08310506
## 74  1e+00 polynomial   3.0 0.2135000 0.08310506
## 75  5e+00 polynomial   3.0 0.2135000 0.08310506
## 76  1e+01 polynomial   3.0 0.2135000 0.08310506
## 77  1e+02 polynomial   3.0 0.2135000 0.08310506
## 78  1e-03     radial   3.0 0.4565000 0.09035407
## 79  1e-02     radial   3.0 0.4565000 0.09035407
## 80  1e-01     radial   3.0 0.4565000 0.09035407
## 81  1e+00     radial   3.0 0.4523333 0.09931618
## 82  5e+00     radial   3.0 0.4481667 0.09393802
## 83  1e+01     radial   3.0 0.4481667 0.09393802
## 84  1e+02     radial   3.0 0.4481667 0.09393802
## 85  1e-03     linear   4.0 0.2098333 0.06563296
## 86  1e-02     linear   4.0 0.1568333 0.07804893
## 87  1e-01     linear   4.0 0.1896667 0.07199880
## 88  1e+00     linear   4.0 0.1646667 0.05537638
## 89  5e+00     linear   4.0 0.1648333 0.05577983
## 90  1e+01     linear   4.0 0.1690000 0.05746282
## 91  1e+02     linear   4.0 0.1690000 0.05746282
## 92  1e-03 polynomial   4.0 0.2135000 0.08310506
## 93  1e-02 polynomial   4.0 0.2135000 0.08310506
## 94  1e-01 polynomial   4.0 0.2135000 0.08310506
## 95  1e+00 polynomial   4.0 0.2135000 0.08310506
## 96  5e+00 polynomial   4.0 0.2135000 0.08310506
## 97  1e+01 polynomial   4.0 0.2135000 0.08310506
## 98  1e+02 polynomial   4.0 0.2135000 0.08310506
## 99  1e-03     radial   4.0 0.4565000 0.09035407
## 100 1e-02     radial   4.0 0.4565000 0.09035407
## 101 1e-01     radial   4.0 0.4565000 0.09035407
## 102 1e+00     radial   4.0 0.4565000 0.09035407
## 103 5e+00     radial   4.0 0.4523333 0.09931618
## 104 1e+01     radial   4.0 0.4523333 0.09931618
## 105 1e+02     radial   4.0 0.4523333 0.09931618

From our tuning, we can see that the best parameters were a cost of 0.01 and a linear kernel (gamma has no effect on a linear kernel, which is why the results repeat across gamma values). We'll use the best tuned model in our next SVM iteration below:

bestmod <- svm.tune$best.model
bestmod
## 
## Call:
## best.tune(method = e1071::svm, train.x = target ~ ., data = df_train, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100), kernel = c("linear", 
##         "polynomial", "radial"), gamma = c(0.5, 1, 2, 3, 4)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  131
svm.pred2 <- predict(bestmod, newdata = df_test, type = "class")

(svm.cm_train2 <- confusionMatrix(svm.pred2, df_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20  3
##          1  7 30
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.6587          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8696          
##          Neg Pred Value : 0.8108          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3833          
##       Balanced Accuracy : 0.8249          
##                                           
##        'Positive' Class : 0               
## 
svm.perf <- c(svm.cm_train2$overall[1],
              svm.cm_train2$byClass[1],
              svm.cm_train2$byClass[2],
              svm.cm_train2$byClass[5],
              svm.cm_train2$byClass[6],
              svm.cm_train2$byClass[7])

According to the model output and confusion matrix, our tuned SVM accuracy of 0.8333 is slightly lower than our GBM. Sensitivity is lower still, at 74%: 7 patients from our holdout test set were misclassified as having heart disease when they actually did not. The tuned SVM is far more balanced than our initial SVM and much less inclined to label healthy patients as diseased, but we'll keep exploring other approaches to see whether we can raise sensitivity while keeping specificity above 90%.

Neural Network

We’ll also take a neural network approach to our heart disease data. Since neural networks are adaptive in nature, where the first layer of a neural network receives a raw input, processes it and passes the processed information to hidden layers, where the hidden layers then pass the information to the last layer to produce an output, we can see that this technique learns from information provided in past iterations. It then optimizing its weights to better predict a final outcome. We will try to use this technique to see if we can continue to improve our model performance.

set.seed(123)


## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
  .decay = c(0, 0.1, 1),
  .size = c(1:10),
  ## bag = FALSE: average networks fit from different random starts
  ## rather than using bagging
  .bag = FALSE)

ctrl <- trainControl(method='cv', number=10)

nnetModel <- train(df_train[, 1:21], df_train[, 22],
  method = "avNNet",
  tuneGrid = nnetGrid,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 243,
  maxit = 500)

# Model results
plot(nnetModel)

summary(nnetModel)
##             Length Class      Mode     
## model        5     -none-     list     
## repeats      1     -none-     numeric  
## bag          1     -none-     logical  
## seeds        5     -none-     numeric  
## names       21     -none-     character
## terms        3     terms      call     
## coefnames   21     -none-     character
## xlevels      0     -none-     list     
## xNames      21     -none-     character
## problemType  1     -none-     character
## tuneValue    3     data.frame list     
## obsLevels    2     -none-     character
## param        4     -none-     list
varImp(nnetModel)
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                              Importance
## thalach                         100.000
## ca.Q                             99.835
## thal.2                           99.466
## oldpeak                          90.758
## ca.L                             90.605
## thal.3                           87.796
## exang.Yes                        75.515
## slope.L                          64.086
## slope.Q                          58.162
## sex.Male                         53.280
## cp.Non.anginal.pain              52.670
## ca.C                             51.195
## age                              42.080
## restecg.Normal                   35.660
## cp.Atypical.Angina               31.960
## ca.4                             25.108
## trestbps                         18.370
## chol                             17.480
## thal.1                           11.632
## restecg.probable.or.definite      1.907
nn.pred <- predict(nnetModel, newdata = df_test[, 1:21])

(nn.cm_train <- confusionMatrix(nn.pred, df_test[, 22]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  6
##          1  6 27
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6767, 0.8922)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 4.67e-05        
##                                           
##                   Kappa : 0.596           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.7980          
##                                           
##        'Positive' Class : 0               
## 
nn.perf <- c(nn.cm_train$overall[1], 
             nn.cm_train$byClass[1], 
             nn.cm_train$byClass[2], 
             nn.cm_train$byClass[5], 
             nn.cm_train$byClass[6], 
             nn.cm_train$byClass[7])

When evaluating the performance of our neural network, we can first take a look at the variable importance table, where we again see thalach, thal.2, and oldpeak near the top of the list. Interestingly, ca.Q is also identified as highly important.

The neural network's accuracy was 80%, which is slightly lower than our SVM and GBM techniques. Sensitivity is still right around 78%, similar to our other approaches, and specificity is lower than the GBM's, at 82%. This approach therefore appears less effective than our GBM model.

Multivariate Adaptive Regression Splines (MARS)

One final model we will try is MARS. MARS builds its model from derived hinge-function features and prunes the resulting terms, so it has a built-in form of feature selection and dimensionality reduction.

set.seed(123)

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsModel <- train(df_train[, 1:21], df_train[, 22],
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv"))

# Model results
plot(marsModel)

summary(marsModel)
## Call: earth(x=data.frame[243,21], y=factor.object, keepxy=TRUE,
##             glm=list(family=function.object, maxit=100), degree=2, nprune=7)
## 
## GLM coefficients
##                                                 1
## (Intercept)                            -3.2990910
## thal.2                                  4.2933943
## h(ca.Q- -0.267261)                      5.0167956
## sex.Male * thal.2                      -2.2781910
## cp.Non.anginal.pain * h(0.119523-ca.4)  4.2489117
## thal.2 * h(oldpeak-1.2)                -1.4101835
## h(ca.Q- -0.267261) * h(157-thalach)    -0.0805919
## 
## GLM (family binomial, link logit):
##  nulldev  df       dev  df   devratio     AIC iters converged
##  335.052 242   159.831 236      0.523   173.8     6         1
## 
## Earth selected 7 of 33 terms, and 7 of 21 predictors (nprune=7)
## Termination condition: Reached nk 43
## Importance: thal.2, ca.Q, thalach, cp.Non.anginal.pain, ca.4, oldpeak, ...
## Number of terms at each degree of interaction: 1 2 4
## Earth GCV 0.1244706    RSS 26.39442    GRSq 0.5024916    RSq 0.5622548
varImp(marsModel)
## earth variable importance
## 
##                     Overall
## thal.2              100.000
## ca.Q                 61.770
## thalach              32.425
## ca.4                 22.818
## cp.Non.anginal.pain  22.818
## oldpeak               9.846
## sex.Male              0.000
mars.pred <- predict(marsModel, newdata = df_test[, 1:21])

mars.cm_train <- confusionMatrix(mars.pred, df_test[, 22])

(mars.perf <- c(mars.cm_train$overall[1], 
                mars.cm_train$byClass[1], 
                mars.cm_train$byClass[2], 
                mars.cm_train$byClass[5], 
                mars.cm_train$byClass[6], 
                mars.cm_train$byClass[7]))
##    Accuracy Sensitivity Specificity   Precision      Recall          F1 
##   0.7666667   0.7037037   0.8181818   0.7600000   0.7037037   0.7307692

From the model output above, we can see that the accuracy is lower than our other model approaches, at about 77%. The sensitivity and specificity are also lower, at 70% and 82% respectively. Because of this, other techniques appear to be more effective classifiers.

Ensemble Classifier

Lastly, let’s build an ensemble classifer that combines all of the models we’ve trained. The simplest approach is to predict the class with each model, then take a vote of the 5 models on which class should be selected. The hope is that the models made errors on different rows so by combining, models can help cover for each other where a given model might have struggled.

# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

esm.df <- data.frame(cbind(rf.pred, gbm.pred, svm.pred2, nn.pred, mars.pred))
esm.df$vote <- 0

esm.df <- esm.df %>%
  rowwise() %>%
  mutate(vote = getmode(c(rf.pred, gbm.pred, svm.pred2, nn.pred, mars.pred)))

esm.df <- esm.df %>%
  mutate(vote = factor(vote, levels=c(1,2), labels=c(0,1)))

esm.cm_train <- confusionMatrix(esm.df[['vote']], df_test[, 22])

(esm.perf <- c(esm.cm_train$overall[1], 
               esm.cm_train$byClass[1], 
               esm.cm_train$byClass[2], 
               esm.cm_train$byClass[5], 
               esm.cm_train$byClass[6], 
               esm.cm_train$byClass[7]))
##    Accuracy Sensitivity Specificity   Precision      Recall          F1 
##   0.8500000   0.8148148   0.8787879   0.8461538   0.8148148   0.8301887

Our Ensemble model gives the best overall balance of performance. It matches the GBM's \(Accuracy = 0.85\) (i.e. correctly predicting True Positives and True Negatives), and its \(Sensitivity = 0.8148\) improves on our best single model, the GBM with \(Sensitivity = 0.7778\), which also gives the ensemble the highest F1 of any model. Because caret treats class 0 (no heart disease) as the positive class, that gain corresponds to one fewer healthy patient being flagged as diseased; the GBM still misses the fewest patients who actually have heart disease (3 versus 4 for the ensemble), so the preferred model ultimately depends on which error is more costly for this use case.
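To see how much the individual models actually disagree (the diversity the vote relies on), here is a quick sketch tallying votes per test patient; it assumes the integer coding produced by cbind above, where 2 corresponds to the heart-disease class.

# count how many of the 5 models vote for the heart-disease class for each patient
esm.df %>%
  rowwise() %>%
  mutate(votes_for_disease = sum(c(rf.pred, gbm.pred, svm.pred2, nn.pred, mars.pred) == 2)) %>%
  ungroup() %>%
  count(votes_for_disease)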

Supervised Model Performance

Here are the tabulated results from all our models, including the Ensemble Model that combines all 5.

result <- rbind(rf.perf, gbm.perf, svm.perf, nn.perf, mars.perf, esm.perf)

kable(result)
Accuracy Sensitivity Specificity Precision Recall F1
rf.perf 0.7833333 0.7777778 0.7878788 0.7500000 0.7777778 0.7636364
gbm.perf 0.8500000 0.7777778 0.9090909 0.8750000 0.7777778 0.8235294
svm.perf 0.8333333 0.7407407 0.9090909 0.8695652 0.7407407 0.8000000
nn.perf 0.8000000 0.7777778 0.8181818 0.7777778 0.7777778 0.7777778
mars.perf 0.7666667 0.7037037 0.8181818 0.7600000 0.7037037 0.7307692
esm.perf 0.8500000 0.8148148 0.8787879 0.8461538 0.8148148 0.8301887

See discussion of performance in our conclusion below.

Conclusions

In the end, we worked through a fair bit of analysis on this heart disease dataset. From this, we’ve come up with a few interesting conclusions:

Many of the attributes available in the Cleveland Heart Disease dataset correspond to well-known risk factors for heart disease. Interestingly, though, some of the relationships in this particular sample ran counter to expectation: the patients diagnosed with heart disease were, on average, slightly younger and had slightly lower resting blood pressure than those without, while tending toward relatively high cholesterol levels.

Additionally, and somewhat counterintuitively, most of the individuals labeled as having heart disease in this dataset did not exhibit exercise-induced angina (exang = No), as we saw in the categorical bar charts. This is worth noting, as it affects how exang should be interpreted as a diagnostic signal.

Feature Importance Across our Models

Across many of our models, when we examined feature importance, we noticed that thalach, thal.2, and ca.Q were consistently identified as having the largest effects on classifying whether patients in our dataset had heart disease. From a real-world perspective, physicians can take these factors into consideration when evaluating their patients. For instance, thalach (maximum heart rate achieved) appears to be a useful predictor of cardiovascular health. thal.2, a characteristic of patients with a fixed defect, also showed high variable importance, so physicians may want to pay particular attention to patients with underlying fixed heart defects and their associated risk of heart disease. Finally, ca.Q, a contrast derived from ca (the number of major vessels colored by fluoroscopy), also had a strong influence on our classifications, so this test result can be weighed as a risk factor as well.