Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/FinalProject


Load Data & EDA


Information about the dataset

Before loading our dataset and discussing our data exploration process, we'll quickly summarize the data we'll be using for our machine learning analysis. The dataset is part of UCI's Cleveland Heart Disease database and consists of 303 individuals with various health-related measurements. The version we'll be using includes 14 attributes derived from patient tests and is suitable for machine learning and classification analysis. Ultimately, these attributes will be examined to determine how effectively they predict whether or not a particular patient has heart disease. You can find more information about these 14 attributes below:

Data Dictionary:

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
    • Value 1: upsloping
    • Value 2: flat
    • Value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • target: diagnosis of heart disease (angiographic disease status)
    • Value 0: < 50% diameter narrowing
    • Value 1: > 50% diameter narrowing

Given this context, we are now ready to load our data into R. Note that the CSV version of the dataset used below codes several of the categorical attributes starting at 0 (for example, cp takes the values 0-3 and slope the values 0-2), which we account for when recoding factors later:

df <- read.csv("https://raw.githubusercontent.com/djlofland/DATA622_S2021_Group2/main/FinalProject/data/heart.csv", header = TRUE)
#df <- df %>% rename(age = "ï..age")

Exploratory Data Analysis

First, we can check the shape of the dataset. It looks like there are 303 rows and 14 columns in our imported data.

dim(df)
## [1] 303  14

Missing values in the dataset

We can also check to see if there is missing data.

plot_missing(df)

As we can see above, there aren’t any missing values across our dataset, which will make it easier for us as we start to prepare our dataset for machine learning analysis. Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:

skimr::skim(df)
Data summary
Name df
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 ▃▁▁▁▇
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 ▇▃▁▅▁
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 ▇▁▁▁▂
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 ▇▁▇▁▁
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
exang 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
slope 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 ▁▁▇▁▇
ca 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 ▇▃▂▁▁
thal 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 ▁▁▁▇▆
target 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 ▇▁▁▁▇

# recode the categorical variables into labeled factors for visualization
# (note: in this CSV, cp is coded 0-3: 0 = typical angina, 1 = atypical angina,
#  2 = non-anginal pain, 3 = asymptomatic)
df_viz <- df %>% mutate(sex = if_else(sex == 1, "Male", "Female"),
         fbs = if_else(fbs == 1, ">120", "<=120"),
         exang = if_else(exang == 1, "Yes", "No"),
         cp = case_when(cp == 0 ~ "Typical angina",
                        cp == 1 ~ "Atypical angina",
                        cp == 2 ~ "Non-anginal pain",
                        TRUE    ~ "Asymptomatic"),
         restecg = if_else(restecg == 0, "Normal",
                           if_else(restecg == 1, "ST-T abnormality", "LV hypertrophy")),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = if_else(target == 1, "Has heart disease", "Does not have heart disease")
         ) %>% 
  mutate_if(is.character, as.factor) %>% 
  dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())

Interestingly, we can also see that there's a fairly even split between patients with and without heart disease:

# Bar plot for target (Heart disease) 
ggplot(df_viz, aes(x=target, fill=target)) + 
  geom_bar() +
  xlab("Heart Disease") +
  ylab("Count") +
  ggtitle("Analysis of Presence and Absence of Heart Disease") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence"))

prop.table(table(df_viz$target))
## 
## Does not have heart disease           Has heart disease 
##                   0.4554455                   0.5445545

We can see that roughly half of the patients in this dataset have heart disease and half do not. This near-balance will come in handy later when we set up our machine learning analyses, since we won't need to correct for class imbalance.

Additionally, the summary distributions show that our patient population skews older: the average age in this dataset is about 54 years, the oldest individual is 77, and the youngest is 29. We can take a look at the age frequency counts here:

# Counting the frequency of the values of the age
hist(df_viz$age,labels=TRUE,main="Histogram of Age",xlab="Age Class",ylab="Frequency",col="steelblue")

# scatter plot of age vs. cholesterol, sized by cholesterol and colored by diagnosis
df_viz %>%
  ggplot(aes(x=age, y=chol, size=chol, color=target)) +
  geom_point(alpha=0.7) + xlab("Age") +
  ylab("Cholesterol")

Interestingly, when we look at the relationship between a patient's age, cholesterol, and diagnosis in the plot above, we can see that those with positive diagnoses tend to cluster between the ages of 40 and 70 years, often at relatively high cholesterol levels.

We can also compare the two groups directly; for example, the mean resting blood pressure for patients with and without heart disease:

# mean resting blood pressure (trestbps) by diagnosis (0 = no disease, 1 = disease)
df %>% 
  group_by(target) %>% 
  summarise(mean=mean(trestbps))
## # A tibble: 2 x 2
##   target  mean
##    <int> <dbl>
## 1      0  134.
## 2      1  129.

Feature Plots

Interestingly, in this sample the group without heart disease has a slightly higher mean resting blood pressure (about 134 vs. 129 mm Hg). Next, for the continuous variables, we can examine how they correlate with one another and with the target variable.

# correlations among the continuous features (columns 4, 5, 8, 10 = trestbps, chol,
# thalach, oldpeak) and the target (column 14)
cor_heart <- cor(df[,c(4,5,8,10,14)])
cor_heart
##             trestbps         chol      thalach     oldpeak      target
## trestbps  1.00000000  0.123174207 -0.046697728  0.19321647 -0.14493113
## chol      0.12317421  1.000000000 -0.009939839  0.05395192 -0.08523911
## thalach  -0.04669773 -0.009939839  1.000000000 -0.34418695  0.42174093
## oldpeak   0.19321647  0.053951920 -0.344186948  1.00000000 -0.43069600
## target   -0.14493113 -0.085239105  0.421740934 -0.43069600  1.00000000
corrplot(cor_heart, method = "ellipse", type = "upper")

ggcorrplot(cor_heart,lab = T)

ggcorr(cor_heart, label = T, label_round = 2)

From these correlation plots, maximum heart rate achieved (thalach) shows a moderate positive correlation with the target (0.42), while exercise-induced ST depression (oldpeak) shows a moderate negative correlation (-0.43). Resting blood pressure and cholesterol are only weakly correlated with the diagnosis (-0.14 and -0.09, respectively), and none of the continuous predictors are strongly correlated with one another, so multicollinearity among these features should not be a major concern.

Dummy Columns

We'll need to dummy code our categorical variables. This process creates a new indicator column for each level of a factor and assigns it a 0 or 1. A full-rank dummy encoding drops one level as the baseline, so a categorical feature with five unique values yields four columns and a row of all zeros represents the reference level, which keeps the number of columns down. Here we keep an indicator column for every level (one-hot encoding), and the predict() call below returns the encoded data frame in place of the original factor columns.

# one-hot encode the categorical features (one indicator column per factor level)
df_dummy <- dummyVars(~ 0 + ., data = df_viz)
df_dummy <- data.frame(predict(df_dummy, newdata = df_viz))
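As a quick, hedged sanity check (not part of the original pipeline), we can compare this one-hot encoding with a full-rank encoding: setting fullRank = TRUE in caret's dummyVars() drops one level per factor, which is the baseline-coded scheme described above. The df_dummy_fr name is purely illustrative.

# sketch: compare the one-hot encoding used above with a full-rank (baseline) encoding
df_dummy_fr <- data.frame(predict(dummyVars(~ ., data = df_viz, fullRank = TRUE),
                                  newdata = df_viz))
# one column per factor level vs. k - 1 columns per factor
c(one_hot = ncol(df_dummy), full_rank = ncol(df_dummy_fr))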

Splitting the Dataset into Training and Testing

Our first step will be to separate the data into training and testing datasets so we can evaluate accuracy later on using the held-out test set. We decided to perform a conventional 80/20 train/test split on our original dataframe. Before splitting, we convert the categorical columns of df into labeled factors:

df$sex<-as.factor(df$sex)
levels(df$sex)<-c("Female","Male")

df$cp<-as.factor(df$cp)
levels(df$cp)<-c("typical","atypical","non-anginal","asymptomatic")

df$fbs<-as.factor(df$fbs)
levels(df$fbs)<-c("False","True")

df$restecg<-as.factor(df$restecg)
levels(df$restecg)<-c("normal","stt","hypertrophy")

df$exang<-as.factor(df$exang)
levels(df$exang)<-c("No","Yes")

df$slope<-as.factor(df$slope)
levels(df$slope)<-c("upsloping","flat","downsloping")

df$ca<-as.factor(df$ca)

df$thal<-as.factor(df$thal)

df$target<-as.factor(df$target)
levels(df$target)<-c("No", "Yes")
str(df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "typical","atypical",..: 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "normal","stt",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "upsloping","flat",..: 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
df2      <- data.frame(df)
index    <- createDataPartition(df2$target, p=0.8, list = FALSE)
df_train <- df2[index,]
df_test  <- df2[-index,]
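Since createDataPartition() samples within each level of the target, the split should preserve the roughly 54/46 class balance we saw earlier. A quick check (not shown in the original output) would look like this:

# sketch: confirm the stratified split preserved the class proportions
round(prop.table(table(df_train$target)), 2)
round(prop.table(table(df_test$target)), 2)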

Supervised Machine Learning

Random Forest Analysis

Random forest is a supervised learning algorithm that uses an ensemble of decision trees for classification and regression. It is a bagging technique rather than a boosting technique: the trees are grown independently (in parallel) on bootstrap samples of the training data, with no interaction between them during training. The forest predicts by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' predictions (classification) or the mean of their predictions (regression).

set.seed(123)
# fit a random forest with the default settings (500 trees, mtry = 3 for these 13 predictors)
fit.forest <- randomForest(target ~ ., data = df_train)

# display model details
fit.forest
## 
## Call:
##  randomForest(formula = target ~ ., data = df_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.87%
## Confusion matrix:
##     No Yes class.error
## No  87  24   0.2162162
## Yes 17 115   0.1287879
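To make the "majority vote" idea described above concrete, here is a small illustrative sketch (not part of the original analysis) that asks each tree in the fitted forest for its prediction on the holdout patients and then tallies the votes for the first patient; votes is a hypothetical name:

# sketch: inspect the individual tree votes behind the forest's aggregated prediction
votes <- predict(fit.forest, newdata = df_test, predict.all = TRUE)
votes$individual[1, 1:5]        # classes predicted by the first 5 trees for patient 1
table(votes$individual[1, ])    # vote counts across all 500 trees for patient 1
votes$aggregate[1]              # the forest's final majority-vote prediction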
plot(fit.forest)

In the plot above, the red line shows the misclassification rate (MCR) for the class without heart disease, the green line the MCR for the class with heart disease, and the black line the overall out-of-bag (OOB) error. The overall error rate is what we are most interested in, and it settles at a reasonably low level as trees are added.
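If we want the numbers behind those curves rather than the plot, the fitted model stores them in its err.rate matrix; a minimal sketch:

# sketch: the error-rate matrix behind the plot; columns are OOB, "No" and "Yes"
tail(fit.forest$err.rate, 3)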

# show which features were most important
importance(fit.forest)
##          MeanDecreaseGini
## age             9.2841291
## sex             4.3740045
## cp             15.0673508
## trestbps        7.9247197
## chol            8.7957958
## fbs             0.9508415
## restecg         2.5503213
## thalach        15.2103273
## exang           5.3227971
## oldpeak        13.0797417
## slope           4.1650562
## ca             16.8017252
## thal           15.6703627
varImpPlot(fit.forest,type=2)

After fitting our model, we can now use it to create predictions on our holdout test set to evaluate its performance.

forest.pred <- predict(fit.forest, newdata = df_test, type = "class")
# confusion matrix for the random forest predictions on the holdout test set
forest.cm_train <- confusionMatrix(forest.pred, df_test$target)
forest.cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   4
##        Yes  6  29
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.661           
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.8286          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.8283          
##                                           
##        'Positive' Class : No              
## 

The accuracy on the holdout set is 0.8333, which is a reasonable result. Note that caret treats 'No' as the positive class here, so the reported sensitivity (0.78) is the rate at which patients without heart disease are correctly identified, and the specificity (0.88) is the rate at which patients with heart disease are correctly identified. Of the 33 test patients who actually have heart disease, 4 were predicted as healthy; these false negatives are the most clinically concerning errors. A further 6 healthy patients were flagged as having disease.
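As a simple arithmetic check on the reported accuracy, we can recompute it directly from the confusion matrix table above:

# sketch: accuracy by hand, (21 + 29) correct out of 60 test patients
sum(diag(forest.cm_train$table)) / sum(forest.cm_train$table)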

Gradient Boosting

Gradient boosting is another tree-based ensemble method, but unlike the random forest it builds trees sequentially: each new tree is fit to the errors of the ensemble built so far. We fit a boosted model with the gbm package, using cross-validation to choose the number of trees.

set.seed(123)

# build gradient boosted model
# (gaussian loss yields continuous predictions, which we threshold back to No/Yes below)
fit.boost <- gbm(target~., df_train,
             distribution = "gaussian",
             n.trees = 2000,
             cv.folds = 3)

# display model details
fit.boost
## gbm(formula = target ~ ., distribution = "gaussian", data = df_train, 
##     n.trees = 2000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 2000 iterations were performed.
## The best cross-validation iteration was 42.
## There were 13 predictors of which 12 had non-zero influence.
# cross-validated RMSE at the best iteration
sqrt(min(fit.boost$cv.error))
## [1] 0.3594497
summary(fit.boost)

##               var   rel.inf
## chol         chol 15.776176
## thalach   thalach 14.392320
## trestbps trestbps 13.624019
## ca             ca 10.401630
## oldpeak   oldpeak 10.191402
## cp             cp 10.180335
## age           age 10.118679
## thal         thal  6.307101
## exang       exang  2.393067
## slope       slope  2.065584
## sex           sex  1.874257
## restecg   restecg  1.613538
## fbs           fbs  1.061891
# partial dependence of the boosted model's predictions on cholesterol
plot(fit.boost, i="chol")

# partial dependence on maximum heart rate achieved
plot(fit.boost, i="thalach")

# plot loss function as a result of n trees added to the ensemble
gbm.perf(fit.boost, method = "cv")

## [1] 42
gbm.perf(fit.boost, method = "OOB")
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.

## [1] 45
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 2000 
## Equivalent Number of Parameters: 39.99 
## Residual Standard Error: 0.0003441

These plots indicate the optimum number of trees for each technique. The green line shows the error on the held-out (validation) data, and the blue dotted line marks the optimum number of iterations. We can observe that beyond a certain point (42 iterations for the CV method and 45 for the OOB method), the held-out error begins to increase because of overfitting.

According to the CV method, we choose 42 as the number of trees for predicting the test set.
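As a small check (not in the original output), the best iteration reported above is simply the minimiser of the stored cross-validation error curve:

# sketch: the CV-optimal number of trees, matching gbm.perf(method = "cv")
which.min(fit.boost$cv.error)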

# predict on the holdout set using the CV-optimal 42 trees
boostPre <- predict.gbm(fit.boost, 
                        df_test, 
                        n.trees = 42)
# the gaussian model returns continuous predictions rather than classes, so we
# threshold at 1.5 (midway between the numeric codes for No and Yes) to recover labels
boostPre <- ifelse(boostPre < 1.5, 'No', 'Yes')

boostPre <- as.factor(boostPre)
boost.cm <- confusionMatrix(boostPre, df_test$target)
boost.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  20   3
##        Yes  7  30
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.6587          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8696          
##          Neg Pred Value : 0.8108          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3833          
##       Balanced Accuracy : 0.8249          
##                                           
##        'Positive' Class : No              
## 

Support Vector Machine

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression; they are most often used for classification. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces (the kernel trick).
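To make the hyperplane idea concrete, here is an illustrative sketch (not the model we fit below) that trains a linear-kernel SVM on just two continuous features; for a linear kernel, e1071 lets us recover the weight vector and intercept from the fitted support vectors. The toy_svm, w and b names are hypothetical:

# sketch: recover the separating hyperplane w.x + b = 0 from a linear-kernel SVM
toy_svm <- svm(target ~ thalach + oldpeak, data = df_train, kernel = "linear", scale = TRUE)
w <- t(toy_svm$coefs) %*% toy_svm$SV   # weight vector in the scaled feature space
b <- -toy_svm$rho                      # intercept
w; b                                   # predict() classifies by the sign of w.x + b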

set.seed(123)
# radial-kernel SVM with cost = 10 (note: the features are not scaled here)
fit.svm <- svm(target~., data=df_train, kernel="radial", cost=10, scale = FALSE)
fit.svm
## 
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  242
svm.pred <- predict(fit.svm, newdata = df_test, type = "class")
# confusion matrix for the SVM predictions on the holdout set
svm.cm_train <- confusionMatrix(svm.pred, df_test$target)
svm.cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   4
##        Yes  6  29
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.661           
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.8286          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.8283          
##                                           
##        'Positive' Class : No              
## 
# tune cost, kernel and gamma via 10-fold cross-validation
svm.tune <- tune(e1071::svm, target~.,
                 data = df_train,
                 ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100),
                 kernel=c("linear", "polynomial", "radial"),
                 gamma=c(0.5,1,2,3,4)))
summary(svm.tune)
## 
## Parameter tuning of 'e1071::svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost     kernel gamma
##     1 polynomial   0.5
## 
## - best performance: 0.1556667 
## 
## - Detailed performance results:
##      cost     kernel gamma     error dispersion
## 1   1e-03     linear   0.5 0.4565000 0.09035407
## 2   1e-02     linear   0.5 0.2138333 0.05730065
## 3   1e-01     linear   0.5 0.1806667 0.08135124
## 4   1e+00     linear   0.5 0.1766667 0.07442786
## 5   5e+00     linear   0.5 0.1685000 0.06255590
## 6   1e+01     linear   0.5 0.1685000 0.06255590
## 7   1e+02     linear   0.5 0.1770000 0.07846836
## 8   1e-03 polynomial   0.5 0.2633333 0.09427108
## 9   1e-02 polynomial   0.5 0.1891667 0.07329229
## 10  1e-01 polynomial   0.5 0.1763333 0.07644816
## 11  1e+00 polynomial   0.5 0.1556667 0.07546645
## 12  5e+00 polynomial   0.5 0.1556667 0.07546645
## 13  1e+01 polynomial   0.5 0.1556667 0.07546645
## 14  1e+02 polynomial   0.5 0.1556667 0.07546645
## 15  1e-03     radial   0.5 0.4565000 0.09035407
## 16  1e-02     radial   0.5 0.4565000 0.09035407
## 17  1e-01     radial   0.5 0.4565000 0.09035407
## 18  1e+00     radial   0.5 0.2061667 0.07676343
## 19  5e+00     radial   0.5 0.1853333 0.05978253
## 20  1e+01     radial   0.5 0.1853333 0.05978253
## 21  1e+02     radial   0.5 0.1853333 0.05978253
## 22  1e-03     linear   1.0 0.4565000 0.09035407
## 23  1e-02     linear   1.0 0.2138333 0.05730065
## 24  1e-01     linear   1.0 0.1806667 0.08135124
## 25  1e+00     linear   1.0 0.1766667 0.07442786
## 26  5e+00     linear   1.0 0.1685000 0.06255590
## 27  1e+01     linear   1.0 0.1685000 0.06255590
## 28  1e+02     linear   1.0 0.1770000 0.07846836
## 29  1e-03 polynomial   1.0 0.2055000 0.07478384
## 30  1e-02 polynomial   1.0 0.1763333 0.07644816
## 31  1e-01 polynomial   1.0 0.1556667 0.07546645
## 32  1e+00 polynomial   1.0 0.1556667 0.07546645
## 33  5e+00 polynomial   1.0 0.1556667 0.07546645
## 34  1e+01 polynomial   1.0 0.1556667 0.07546645
## 35  1e+02 polynomial   1.0 0.1556667 0.07546645
## 36  1e-03     radial   1.0 0.4565000 0.09035407
## 37  1e-02     radial   1.0 0.4565000 0.09035407
## 38  1e-01     radial   1.0 0.4565000 0.09035407
## 39  1e+00     radial   1.0 0.3455000 0.09543937
## 40  5e+00     radial   1.0 0.3086667 0.05259207
## 41  1e+01     radial   1.0 0.3086667 0.05259207
## 42  1e+02     radial   1.0 0.3086667 0.05259207
## 43  1e-03     linear   2.0 0.4565000 0.09035407
## 44  1e-02     linear   2.0 0.2138333 0.05730065
## 45  1e-01     linear   2.0 0.1806667 0.08135124
## 46  1e+00     linear   2.0 0.1766667 0.07442786
## 47  5e+00     linear   2.0 0.1685000 0.06255590
## 48  1e+01     linear   2.0 0.1685000 0.06255590
## 49  1e+02     linear   2.0 0.1770000 0.07846836
## 50  1e-03 polynomial   2.0 0.1805000 0.07181101
## 51  1e-02 polynomial   2.0 0.1556667 0.07546645
## 52  1e-01 polynomial   2.0 0.1556667 0.07546645
## 53  1e+00 polynomial   2.0 0.1556667 0.07546645
## 54  5e+00 polynomial   2.0 0.1556667 0.07546645
## 55  1e+01 polynomial   2.0 0.1556667 0.07546645
## 56  1e+02 polynomial   2.0 0.1556667 0.07546645
## 57  1e-03     radial   2.0 0.4565000 0.09035407
## 58  1e-02     radial   2.0 0.4565000 0.09035407
## 59  1e-01     radial   2.0 0.4565000 0.09035407
## 60  1e+00     radial   2.0 0.4440000 0.09836377
## 61  5e+00     radial   2.0 0.4398333 0.09040529
## 62  1e+01     radial   2.0 0.4398333 0.09040529
## 63  1e+02     radial   2.0 0.4398333 0.09040529
## 64  1e-03     linear   3.0 0.4565000 0.09035407
## 65  1e-02     linear   3.0 0.2138333 0.05730065
## 66  1e-01     linear   3.0 0.1806667 0.08135124
## 67  1e+00     linear   3.0 0.1766667 0.07442786
## 68  5e+00     linear   3.0 0.1685000 0.06255590
## 69  1e+01     linear   3.0 0.1685000 0.06255590
## 70  1e+02     linear   3.0 0.1770000 0.07846836
## 71  1e-03 polynomial   3.0 0.1556667 0.07546645
## 72  1e-02 polynomial   3.0 0.1556667 0.07546645
## 73  1e-01 polynomial   3.0 0.1556667 0.07546645
## 74  1e+00 polynomial   3.0 0.1556667 0.07546645
## 75  5e+00 polynomial   3.0 0.1556667 0.07546645
## 76  1e+01 polynomial   3.0 0.1556667 0.07546645
## 77  1e+02 polynomial   3.0 0.1556667 0.07546645
## 78  1e-03     radial   3.0 0.4565000 0.09035407
## 79  1e-02     radial   3.0 0.4565000 0.09035407
## 80  1e-01     radial   3.0 0.4565000 0.09035407
## 81  1e+00     radial   3.0 0.4523333 0.09931618
## 82  5e+00     radial   3.0 0.4481667 0.09393802
## 83  1e+01     radial   3.0 0.4481667 0.09393802
## 84  1e+02     radial   3.0 0.4481667 0.09393802
## 85  1e-03     linear   4.0 0.4565000 0.09035407
## 86  1e-02     linear   4.0 0.2138333 0.05730065
## 87  1e-01     linear   4.0 0.1806667 0.08135124
## 88  1e+00     linear   4.0 0.1766667 0.07442786
## 89  5e+00     linear   4.0 0.1685000 0.06255590
## 90  1e+01     linear   4.0 0.1685000 0.06255590
## 91  1e+02     linear   4.0 0.1770000 0.07846836
## 92  1e-03 polynomial   4.0 0.1556667 0.07546645
## 93  1e-02 polynomial   4.0 0.1556667 0.07546645
## 94  1e-01 polynomial   4.0 0.1556667 0.07546645
## 95  1e+00 polynomial   4.0 0.1556667 0.07546645
## 96  5e+00 polynomial   4.0 0.1556667 0.07546645
## 97  1e+01 polynomial   4.0 0.1556667 0.07546645
## 98  1e+02 polynomial   4.0 0.1556667 0.07546645
## 99  1e-03     radial   4.0 0.4565000 0.09035407
## 100 1e-02     radial   4.0 0.4565000 0.09035407
## 101 1e-01     radial   4.0 0.4565000 0.09035407
## 102 1e+00     radial   4.0 0.4565000 0.09035407
## 103 5e+00     radial   4.0 0.4523333 0.09931618
## 104 1e+01     radial   4.0 0.4523333 0.09931618
## 105 1e+02     radial   4.0 0.4523333 0.09931618
bestmod<-svm.tune$best.model
bestmod
## 
## Call:
## best.tune(method = e1071::svm, train.x = target ~ ., data = df_train, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100), kernel = c("linear", 
##         "polynomial", "radial"), gamma = c(0.5, 1, 2, 3, 4)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  134
svm.pred2 <- predict(bestmod, newdata = df_test, type = "class")
# confusion matrix for the tuned SVM's predictions on the holdout set
svm.cm_train2 <- confusionMatrix(svm.pred2, df_test$target)
svm.cm_train2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   4
##        Yes  6  29
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.661           
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.8286          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.8283          
##                                           
##        'Positive' Class : No              
## 

According to the confusion matrix, the tuned SVM reaches an accuracy of 0.8333 on the holdout set.

Supervised Model Performance

Random_Forests<-c(forest.cm_train$overall[1],forest.cm_train$byClass[1],forest.cm_train$byClass[2],forest.cm_train$byClass[5],forest.cm_train$byClass[6],forest.cm_train$byClass[7])

Gradient_Boosting<-c(boost.cm$overall[1],boost.cm$byClass[1],boost.cm$byClass[2],boost.cm$byClass[5],boost.cm$byClass[6],boost.cm$byClass[7])

SVM<-c(svm.cm_train2$overall[1],svm.cm_train2$byClass[1],svm.cm_train2$byClass[2],svm.cm_train2$byClass[5],svm.cm_train2$byClass[6],svm.cm_train2$byClass[7])

result<-rbind(Random_Forests,Gradient_Boosting,SVM)
kable(result)
Model               Accuracy   Sensitivity  Specificity  Precision  Recall     F1
Random_Forests      0.8333333  0.7777778    0.8787879    0.8400000  0.7777778  0.8076923
Gradient_Boosting   0.8333333  0.7407407    0.9090909    0.8695652  0.7407407  0.8000000
SVM                 0.8333333  0.7777778    0.8787879    0.8400000  0.7777778  0.8076923

Unsupervised Machine Learning

Principal Component Analysis

Principal Component Analysis (PCA) is one of the most commonly used unsupervised machine learning techniques, with applications in exploratory data analysis, dimensionality reduction, and information compression. It allows us to better visualize the variation present in a dataset with many variables. In our case, many of the variables are categorical and were expanded into dummy columns when we prepared the data, which adds features and makes analysis more difficult due to the increased number of dimensions. Therefore, using PCA to reduce the dimensionality of our dummy-encoded dataset and measure the amount of variance explained is beneficial. To do this, we'll use the prcomp() function:

# PCA on the dummy-encoded data, centering and scaling each column first
df_dummy.pca <- prcomp(df_dummy, center = TRUE, scale. = TRUE)
summary(df_dummy.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4969 1.59479 1.50965 1.36202 1.32718 1.27564 1.22439
## Proportion of Variance 0.2011 0.08204 0.07352 0.05984 0.05682 0.05249 0.04836
## Cumulative Proportion  0.2011 0.28316 0.35667 0.41652 0.47334 0.52583 0.57419
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.15506 1.1342 1.06882 1.03041 1.02755 0.98902 0.98278
## Proportion of Variance 0.04304 0.0415 0.03685 0.03425 0.03406 0.03155 0.03116
## Cumulative Proportion  0.61723 0.6587 0.69557 0.72982 0.76388 0.79544 0.82659
##                           PC15    PC16    PC17    PC18   PC19    PC20    PC21
## Standard deviation     0.94244 0.93529 0.88443 0.86181 0.8536 0.76794 0.64136
## Proportion of Variance 0.02865 0.02822 0.02523 0.02396 0.0235 0.01902 0.01327
## Cumulative Proportion  0.85524 0.88346 0.90870 0.93265 0.9562 0.97518 0.98845
##                           PC22     PC23      PC24      PC25     PC26      PC27
## Standard deviation     0.59839 5.87e-15 2.144e-15 1.336e-15 1.15e-15 8.766e-16
## Proportion of Variance 0.01155 0.00e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.00000 1.00e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                             PC28      PC29     PC30      PC31
## Standard deviation     8.174e-16 5.828e-16 3.23e-16 2.387e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00
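As a side note, because we centered and scaled, prcomp() here is equivalent to an eigendecomposition of the correlation matrix; a minimal check (not in the original) is that the eigenvalues match the squared standard deviations reported above:

# sketch: eigenvalues of the correlation matrix equal the squared component sdev values
round(head(eigen(cor(df_dummy))$values), 3)
round(head(df_dummy.pca$sdev^2), 3)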
fviz_eig(df_dummy.pca)

#compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev
#compute variance
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
round(prop_varex[1:10], 2)
##  [1] 0.20 0.08 0.07 0.06 0.06 0.05 0.05 0.04 0.04 0.04

The first principal component therefore explains about 20% of the variability and the second about 8%. Together, the first ten principal components explain roughly 69% of the variability. We can now plot the proportion of variance explained (PVE) and the cumulative PVE to produce scree plots, as we did with fviz_eig above.
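A one-line check of the figures quoted above (not in the original output):

# sketch: cumulative PVE after 1, 2 and 10 components (~0.20, 0.28 and 0.70)
round(cumsum(prop_varex)[c(1, 2, 10)], 3)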

#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

library("FactoMineR")
library("factoextra")

eig.val <- get_eigenvalue(df_dummy.pca)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.234534e+00     2.011140e+01                    20.11140
## Dim.2  2.543346e+00     8.204344e+00                    28.31574
## Dim.3  2.279033e+00     7.351718e+00                    35.66746
## Dim.4  1.855109e+00     5.984222e+00                    41.65168
## Dim.5  1.761418e+00     5.681994e+00                    47.33368
## Dim.6  1.627248e+00     5.249187e+00                    52.58286
## Dim.7  1.499129e+00     4.835900e+00                    57.41877
## Dim.8  1.334161e+00     4.303745e+00                    61.72251
## Dim.9  1.286410e+00     4.149711e+00                    65.87222
## Dim.10 1.142386e+00     3.685117e+00                    69.55734
## Dim.11 1.061749e+00     3.424996e+00                    72.98233
## Dim.12 1.055861e+00     3.406003e+00                    76.38834
## Dim.13 9.781525e-01     3.155331e+00                    79.54367
## Dim.14 9.658557e-01     3.115663e+00                    82.65933
## Dim.15 8.881858e-01     2.865115e+00                    85.52445
## Dim.16 8.747641e-01     2.821820e+00                    88.34627
## Dim.17 7.822181e-01     2.523284e+00                    90.86955
## Dim.18 7.427240e-01     2.395884e+00                    93.26543
## Dim.19 7.285658e-01     2.350212e+00                    95.61565
## Dim.20 5.897297e-01     1.902354e+00                    97.51800
## Dim.21 4.113489e-01     1.326932e+00                    98.84493
## Dim.22 3.580708e-01     1.155067e+00                   100.00000
## Dim.23 3.445532e-29     1.111462e-28                   100.00000
## Dim.24 4.596664e-30     1.482795e-29                   100.00000
## Dim.25 1.783936e-30     5.754634e-30                   100.00000
## Dim.26 1.323229e-30     4.268482e-30                   100.00000
## Dim.27 7.684005e-31     2.478711e-30                   100.00000
## Dim.28 6.681098e-31     2.155193e-30                   100.00000
## Dim.29 3.396772e-31     1.095733e-30                   100.00000
## Dim.30 1.043089e-31     3.364804e-31                   100.00000
## Dim.31 5.697784e-32     1.837995e-31                   100.00000
res.pca <- PCA(df_dummy, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)
corrplot(var$cos2, is.corr=FALSE)

corrplot(var$contrib, is.corr=FALSE)

fviz_contrib(df_dummy.pca, choice = "var", axes = 1, top = 15)

fviz_contrib(df_dummy.pca, choice = "var", axes = 2, top = 15)

fviz_pca_var(df_dummy.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The cos2 plot shows how well each dummy-encoded variable is represented by each principal component (its squared loading quality), while the contrib plot shows how strongly each variable contributes to each component. The contribution bar charts for the first two components highlight the variables driving most of the explained variance, and the variable correlation plot colors each variable by its contribution: variables pointing in similar directions are positively correlated, while variables on opposite sides of the origin are negatively correlated.
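If we want the numbers behind these plots, the contribution matrix computed above can be inspected directly; a brief sketch:

# sketch: percentage contribution of each dummy variable to the first two components
head(round(var$contrib[, 1:2], 2), 10)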

K-Means Analysis

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It partitions observations into k groups by repeatedly assigning each point to its nearest cluster centroid and then recomputing the centroids, and it can produce useful groupings even when little is known about the structure of the data.
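To make the objective concrete, here is a minimal sketch of what the elbow plot below tracks: the total within-cluster sum of squares (the quantity k-means minimises) for a range of k. fviz_nbclust() computes this for us; wss is an illustrative name.

# sketch: within-cluster sum of squares for k = 1..10
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(df_dummy, centers = k, nstart = 20)$tot.withinss)
round(wss)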

fviz_nbclust(df_dummy, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) + # add line for better visualization
  labs(subtitle = "Elbow method") # add subtitle

From the elbow plot above, the within-cluster sum of squares levels off at approximately k = 5; adding more clusters does not meaningfully improve within-segment variability. Therefore, we'll fit our initial K-means model with \(k=5\).

set.seed(42)
km_res <- kmeans(df_dummy, centers = 5, nstart = 20)
summary(km_res)
##              Length Class  Mode   
## cluster      303    -none- numeric
## centers      155    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   89          0.24
## 2       2   96          0.33
## 3       3   53          0.18
## 4       4   60          0.33
## 5       5    5          0.46

fviz_cluster(km_res, df_dummy, ellipse.type = "norm")
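The average silhouette widths above (roughly 0.18 to 0.46) suggest the clusters are only moderately well separated. As another rough check (not in the original output), we can look at the share of total variance captured between clusters:

# sketch: between-cluster sum of squares as a fraction of the total
round(km_res$betweenss / km_res$totss, 2)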

To provide some context around the clusters generated by the algorithm, we appended the five cluster assignments to the df_dummy dataset, then grouped by cluster and computed the mean of each feature to see whether any noticeable trends emerge.

k_means_analysis_df <- df_dummy %>%
  mutate(Cluster = km_res$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
k_means_analysis_df <- k_means_analysis_df %>% select('Cluster', 'age', 'trestbps', 'chol', 'thalach', 'oldpeak')
kable(k_means_analysis_df) %>% 
  kable_styling(bootstrap_options = "responsive")
Cluster age trestbps chol thalach oldpeak
1 52.69663 128.0674 192.2472 148.0337 1.0348315
2 51.39583 128.6354 240.7396 163.9583 0.7458333
3 59.86792 138.4340 258.3962 122.0189 1.5320755
4 56.05000 135.3167 308.5167 153.0500 1.0100000
5 62.60000 135.8000 438.2000 155.6000 1.9000000


Conclusions

In the end, we worked through a fair bit of analysis on this heart disease dataset. From this, we’ve come up with a few interesting conclusions:

Many of the attributes available in the Cleveland Heart Disease dataset correspond to well-known risk factors for heart disease, such as age, resting blood pressure, and cholesterol. Interestingly, in this particular sample those classic factors were only weakly related to the diagnosis (resting blood pressure and cholesterol showed small, slightly negative correlations with the target), while maximum heart rate achieved (thalach) and exercise-induced ST depression (oldpeak) were the strongest individual indicators.

Additionally, a large percentage of individuals that exhibited exercise-induced angina (chest pain) had heart disease (roughly 75% of the individuals in the dataset). This is something worth noting, as it may help physicians with diagnoses and helping to identify factors that could lead to this type of disease.

It was also very interesting to run several machine learning algorithms on this dataset. Support vector machines (SVM) and random forests both provided good classification performance for predicting whether or not an individual has heart disease. Implementing more models of this type in the future could help physicians make well-informed decisions and aid in diagnosis.

Finally, the SVM model we used appeared to be slightly more effective when applied to the test dataset. Although the random forest model did extraordinarily well on the training data (accuracy around 99%), when extended to the testing data it yielded a somewhat higher number of false negatives and false positives than the SVM model. If such a model were the only decision-making tool in a real-world setting, the consequences could be serious: a relatively large percentage of patients could be wrongly diagnosed with heart disease, while others could be wrongly told that they do not have heart disease when they actually do.