Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/FinalProject
Before loading our dataset and walking through our data exploration process, we will quickly summarize the data we'll be using for our machine learning analysis. The dataset comes from UCI's Cleveland Heart Disease database and consists of 303 individuals and a variety of health-related metrics. The version we use includes 14 attributes drawn from patient tests and is well suited to classification analysis. Ultimately, these attributes will be examined to determine how effectively they predict whether or not a particular patient has heart disease. More information about the 14 attributes is given below:
Data Dictionary:
age: age in years
sex: sex (1 = male; 0 = female)
cp: chest pain type
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic results
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
target: diagnosis of heart disease (angiographic disease status)
Given this context, we are now ready to load our data into R:
df <- read.csv("https://raw.githubusercontent.com/djlofland/DATA622_S2021_Group2/main/FinalProject/data/heart.csv", header = TRUE)
#df <- df %>% rename(age = "ï..age")
First, we can check the shape of the dataset. It looks like there are 303 rows and 14 columns in our imported data.
dim(df)
## [1] 303 14
We can also check to see if there is missing data.
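Before plotting, a quick base-R sanity check of the per-column missing counts is also handy (a minimal one-liner):

# count missing values per column; each should be zero for this dataset
colSums(is.na(df))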
plot_missing(df)
As we can see above, there aren't any missing values in our dataset, which simplifies preparation for the machine learning analysis. Next, we'll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(df)

| Name | df |
|---|---|
| Number of rows | 303 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁ |
| sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |
| cp | 0 | 1 | 0.97 | 1.03 | 0 | 0.0 | 1.0 | 2.0 | 3.0 | ▇▃▁▅▁ |
| trestbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |
| fbs | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| restecg | 0 | 1 | 0.53 | 0.53 | 0 | 0.0 | 1.0 | 1.0 | 2.0 | ▇▁▇▁▁ |
| thalach | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| exang | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
| oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
| slope | 0 | 1 | 1.40 | 0.62 | 0 | 1.0 | 1.0 | 2.0 | 2.0 | ▁▁▇▁▇ |
| ca | 0 | 1 | 0.73 | 1.02 | 0 | 0.0 | 0.0 | 1.0 | 4.0 | ▇▃▂▁▁ |
| thal | 0 | 1 | 2.31 | 0.61 | 0 | 2.0 | 2.0 | 3.0 | 3.0 | ▁▁▁▇▆ |
| target | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
df_viz <- df %>% mutate(sex = if_else(sex == 1, "Male", "Female"),
fbs = if_else(fbs == 1, ">120", "<=120"),
exang = if_else(exang == 1, "Yes" ,"No"),
cp = if_else(cp == 1, "Atypical Angina",
if_else(cp == 2, "Non-anginal pain", "Asymptomatic")),
restecg = if_else(restecg == 0, "Normal",
if_else(restecg == 1, "Abnormality", "probable or definite")),
slope = as.factor(slope),
ca = as.factor(ca),
thal = as.factor(thal),
target = if_else(target == 1, "Has heart disease", "Does not have heart disease")
) %>%
mutate_if(is.character, as.factor) %>%
  dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())
Interestingly, we can also see that there's a fairly even split between patients with and without heart disease:
# Bar plot for target (Heart disease)
ggplot(df_viz, aes(x=target, fill=target)) +
geom_bar() +
xlab("Heart Disease") +
ylab("Count") +
ggtitle("Analysis of Presence and Absence of Heart Disease") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence"))

prop.table(table(df_viz$target))
##
## Does not have heart disease Has heart disease
## 0.4554455 0.5445545
There is a roughly 50/50 split between patients with and without heart disease (about 54% vs. 46%). This near-balance will come in handy later when we set up our machine learning analyses, since we won't need to correct for class imbalance.
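Because the classes are nearly balanced, a stratified split (which we perform later with caret's createDataPartition) should preserve roughly this 45/55 ratio in both subsets; a quick check to run after the split is sketched here (df_train and df_test are created in the modeling section).

# sketch: verify the class ratio survives the stratified 80/20 split (run after splitting)
prop.table(table(df_train$target))
prop.table(table(df_test$target))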
Additionally, the summary distributions show that our patient population skews above 50 years, with an average age of about 54. The oldest individual is 77 years old, and the youngest is 29. We can take a look at the age frequency counts here:
# Count the frequency of age values
hist(df_viz$age, labels = TRUE, main = "Histogram of Age", xlab = "Age Class", ylab = "Frequency", col = "steelblue")

df_viz %>%
  ggplot(aes(x = age, y = chol, size = chol, color = target)) +
  geom_point(alpha = 0.7) +
  xlab("Age") +
  ylab("Cholesterol")
Interestingly, when we look at the relationship between a patient's age, cholesterol, and heart disease status in the plot above, those with positive diagnoses tend to cluster between the ages of 40 and 70, often with relatively high cholesterol levels.
We can explore this further by comparing the average resting blood pressure (trestbps) across the two target groups:
df %>%
group_by(target) %>%
  summarise(mean = mean(trestbps))
## # A tibble: 2 x 2
## target mean
## <int> <dbl>
## 1 0 134.
## 2 1 129.
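The same grouped summary can be extended to the other continuous features discussed above; a short sketch using dplyr's across():

# sketch: group means of the remaining continuous features by heart-disease status
df %>%
  group_by(target) %>%
  summarise(across(c(age, chol, thalach, oldpeak), mean))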
For the continuous variables, we can also examine how they correlate with one another and with the target variable.
cor_heart <- cor(df[,c(4,5,8,10,14)])
cor_heart
## trestbps chol thalach oldpeak target
## trestbps 1.00000000 0.123174207 -0.046697728 0.19321647 -0.14493113
## chol 0.12317421 1.000000000 -0.009939839 0.05395192 -0.08523911
## thalach -0.04669773 -0.009939839 1.000000000 -0.34418695 0.42174093
## oldpeak 0.19321647 0.053951920 -0.344186948 1.00000000 -0.43069600
## target -0.14493113 -0.085239105 0.421740934 -0.43069600 1.00000000
corrplot(cor_heart, method = "ellipse", type = "upper")
ggcorrplot(cor_heart, lab = T)
ggcorr(cor_heart, label = T, label_round = 2)
These correlation plots all visualize the matrix above: among the continuous features, maximum heart rate (thalach) has the strongest positive correlation with the target (0.42), while ST depression (oldpeak) has the strongest negative correlation (-0.43). Resting blood pressure and cholesterol are only weakly (and negatively) correlated with the target, and the predictors themselves are not highly intercorrelated (the largest pairwise correlation is thalach vs. oldpeak at -0.34), so multicollinearity among these continuous features is not a major concern.
We'll need to dummy code our categorical variables. This process creates a new indicator column for each level and assigns a 0 or 1. Note that dummy encoding typically drops one level, which becomes the baseline: a categorical feature with five unique values yields four columns, and a row of all zeros represents the reference level. This keeps the number of columns down. With the dummy columns in place, we remove the original categorical variables from the dataset.
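As an aside, caret can produce the reference-level (k-1 column) encoding described above via its fullRank argument; a minimal sketch is below, though the chunk that follows keeps the full one-hot encoding, which is what the later PCA and clustering steps use.

# sketch: reference-level dummy coding (one indicator column dropped per factor)
dmy_ref <- dummyVars(~ ., data = df_viz, fullRank = TRUE)
df_dummy_ref <- data.frame(predict(dmy_ref, newdata = df_viz))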
# dummy encode our categorical features
df_dummy <- dummyVars(~ 0 + ., drop2nd=TRUE, data = df_viz)
df_dummy <- data.frame(predict(df_dummy, newdata = df_viz))
Our first step in modeling will be to separate the data into training and test sets so we can assess accuracy on a holdout set later on. We perform a conventional 80/20 training-to-testing split on our original dataframe.
df$sex<-as.factor(df$sex)
levels(df$sex)<-c("Female","Male")
df$cp<-as.factor(df$cp)
levels(df$cp)<-c("typical","atypical","non-anginal","asymptomatic")
df$fbs<-as.factor(df$fbs)
levels(df$fbs)<-c("False","True")
df$restecg<-as.factor(df$restecg)
levels(df$restecg)<-c("normal","stt","hypertrophy")
df$exang<-as.factor(df$exang)
levels(df$exang)<-c("No","Yes")
df$slope<-as.factor(df$slope)
levels(df$slope)<-c("upsloping","flat","downsloping")
df$ca<-as.factor(df$ca)
df$thal<-as.factor(df$thal)
df$target<-as.factor(df$target)
levels(df$target) <- c("No", "Yes")

str(df)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "typical","atypical",..: 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "normal","stt",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "upsloping","flat",..: 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
df2 <- data.frame(df)
index <- createDataPartition(df2$target, p=0.8, list = FALSE)
df_train <- df2[index,]
df_test <- df2[-index,]
Random forest is a supervised ensemble learning algorithm used for both classification and regression. It is a bagging technique rather than a boosting technique: the trees are grown independently (in parallel), with no interaction between them during training. The model constructs a multitude of decision trees and outputs the class that is the mode of the individual trees' predictions (for classification) or their mean prediction (for regression).
set.seed(123)
fit.forest <- randomForest(target ~ ., data = df_train)
# display model details
fit.forest
##
## Call:
## randomForest(formula = target ~ ., data = df_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.87%
## Confusion matrix:
## No Yes class.error
## No 87 24 0.2162162
## Yes 17 115 0.1287879
plot(fit.forest)
In the plot above, the red line is the misclassification rate (MCR) for the class without heart disease, the green line is the MCR for the class with heart disease, and the black line is the overall MCR, i.e. the out-of-bag (OOB) error. The overall error rate is what we care about most, and it settles at a reasonably low level.
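Since the default plot does not include a legend, a small sketch that labels the three curves (using the column names of the model's err.rate matrix) can make it easier to read:

# sketch: replot the error curves with a legend for OOB, "No", and "Yes"
plot(fit.forest, main = "Random forest error rates")
legend("topright", legend = colnames(fit.forest$err.rate),
       col = 1:3, lty = 1:3, cex = 0.8)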
# show which features were most important
importance(fit.forest)
## MeanDecreaseGini
## age 9.2841291
## sex 4.3740045
## cp 15.0673508
## trestbps 7.9247197
## chol 8.7957958
## fbs 0.9508415
## restecg 2.5503213
## thalach 15.2103273
## exang 5.3227971
## oldpeak 13.0797417
## slope 4.1650562
## ca 16.8017252
## thal 15.6703627
varImpPlot(fit.forest, type = 2)
After fitting our model, we can now use it to make predictions on our holdout test set and evaluate its performance.
forest.pred <- predict(fit.forest, newdata = df_test, type = "class")
forest.cm_train <- confusionMatrix(forest.pred, df_test$target)
forest.cm_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 21 4
## Yes 6 29
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.661
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.7778
## Specificity : 0.8788
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8286
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : No
##
The random forest achieves an accuracy of 0.8333 on the test set, with a sensitivity of 0.78 and a specificity of 0.88 (with "No" treated as the positive class), comfortably beating the no-information rate of 0.55.
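Accuracy alone can hide threshold effects, so as a complementary check one could also compute an ROC curve from the forest's class probabilities; a hedged sketch, assuming the pROC package is available:

# sketch: AUC from the random forest's predicted probability of "Yes"
library(pROC)
forest.prob <- predict(fit.forest, newdata = df_test, type = "prob")[, "Yes"]
roc_rf <- roc(response = df_test$target, predictor = forest.prob, levels = c("No", "Yes"))
auc(roc_rf)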
Gradient boosting takes a different ensemble approach from the random forest: rather than growing independent trees in parallel and averaging them, it grows shallow trees sequentially, with each new tree fit to the residual errors of the ensemble built so far. We fit a gradient boosted model (GBM) with up to 2,000 trees and use 3-fold cross-validation to select the optimal number of boosting iterations.
set.seed(123)
# build gradient boosted model; note that with distribution = "gaussian" the two-level
# factor target is treated as a numeric 1/2 response, which is why predictions are
# thresholded at 1.5 further below
fit.boost <- gbm(target ~ ., df_train,
                 distribution = "gaussian",
                 n.trees = 2000,
                 cv.folds = 3)
# display model details
fit.boost
## gbm(formula = target ~ ., distribution = "gaussian", data = df_train,
## n.trees = 2000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 2000 iterations were performed.
## The best cross-validation iteration was 42.
## There were 13 predictors of which 12 had non-zero influence.
sqrt(min(fit.boost$cv.error))
## [1] 0.3594497
summary(fit.boost)
## var rel.inf
## chol chol 15.776176
## thalach thalach 14.392320
## trestbps trestbps 13.624019
## ca ca 10.401630
## oldpeak oldpeak 10.191402
## cp cp 10.180335
## age age 10.118679
## thal thal 6.307101
## exang exang 2.393067
## slope slope 2.065584
## sex sex 1.874257
## restecg restecg 1.613538
## fbs fbs 1.061891
plot(fit.boost, i = "chol")
plot(fit.boost, i = "thalach")

# plot loss function as a result of n trees added to the ensemble
gbm.perf(fit.boost, method = "cv")
## [1] 42
gbm.perf(fit.boost, method = "OOB")
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
## [1] 45
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 2000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 0.0003441
The plots indicate the optimal number of trees for each technique. The green line shows the error on held-out data, and the blue dotted line marks the optimal number of iterations. Beyond a certain point (42 iterations for the CV method, 45 for the OOB method), the held-out error begins to increase because of overfitting.
Following the CV method, we use 42 trees when predicting on the test set.
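Rather than hard-coding the 42, the CV-optimal iteration can also be captured programmatically; a small sketch (the chunk below keeps the explicit value from the plot):

# sketch: retrieve the CV-optimal number of trees without replotting
best_iter <- gbm.perf(fit.boost, method = "cv", plot.it = FALSE)
best_iter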
boostPre <- predict.gbm(fit.boost,
df_test,
n.trees = 42)
boostPre <- ifelse(boostPre < 1.5, 'No', 'Yes')
boostPre <- as.factor(boostPre)
boost.cm <- confusionMatrix(boostPre, df_test$target)
boost.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 20 3
## Yes 7 30
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.6587
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.7407
## Specificity : 0.9091
## Pos Pred Value : 0.8696
## Neg Pred Value : 0.8108
## Prevalence : 0.4500
## Detection Rate : 0.3333
## Detection Prevalence : 0.3833
## Balanced Accuracy : 0.8249
##
## 'Positive' Class : No
##
Support vector machines (SVMs) are supervised learning models used for both classification and regression, though they are most often applied to classification problems. Each observation is treated as a point in n-dimensional space (where n is the number of features), with each coordinate given by a feature value; classification is then performed by finding the hyperplane that best separates the two classes.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.
set.seed(123)
fit.svm<-svm(target~.,data=df_train,kernel="radial", cost=10,scale = FALSE)
fit.svm
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 242
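It is worth noting that 242 of the roughly 243 training rows end up as support vectors, which often hints that the unscaled features are dominating the radial kernel. A hedged sketch of the same fit with e1071's default scaling left on is shown below; it was not swapped in for the results that follow.

# sketch: same radial SVM, but let e1071 scale each feature (its default behavior)
fit.svm_scaled <- svm(target ~ ., data = df_train, kernel = "radial", cost = 10, scale = TRUE)
fit.svm_scaled$tot.nSV  # typically far fewer support vectors once features share a scale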
svm.pred <- predict(fit.svm, newdata = df_test, type = "class")
svm.cm_train <- confusionMatrix(svm.pred, df_test$target)
svm.cm_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 21 4
## Yes 6 29
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.661
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.7778
## Specificity : 0.8788
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8286
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : No
##
svm.tune <- tune(e1071::svm, target~.,
data = df_train,
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100),
kernel=c("linear", "polynomial", "radial"),
                 gamma=c(0.5,1,2,3,4)))

summary(svm.tune)
##
## Parameter tuning of 'e1071::svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel gamma
## 1 polynomial 0.5
##
## - best performance: 0.1556667
##
## - Detailed performance results:
## cost kernel gamma error dispersion
## 1 1e-03 linear 0.5 0.4565000 0.09035407
## 2 1e-02 linear 0.5 0.2138333 0.05730065
## 3 1e-01 linear 0.5 0.1806667 0.08135124
## 4 1e+00 linear 0.5 0.1766667 0.07442786
## 5 5e+00 linear 0.5 0.1685000 0.06255590
## 6 1e+01 linear 0.5 0.1685000 0.06255590
## 7 1e+02 linear 0.5 0.1770000 0.07846836
## 8 1e-03 polynomial 0.5 0.2633333 0.09427108
## 9 1e-02 polynomial 0.5 0.1891667 0.07329229
## 10 1e-01 polynomial 0.5 0.1763333 0.07644816
## 11 1e+00 polynomial 0.5 0.1556667 0.07546645
## 12 5e+00 polynomial 0.5 0.1556667 0.07546645
## 13 1e+01 polynomial 0.5 0.1556667 0.07546645
## 14 1e+02 polynomial 0.5 0.1556667 0.07546645
## 15 1e-03 radial 0.5 0.4565000 0.09035407
## 16 1e-02 radial 0.5 0.4565000 0.09035407
## 17 1e-01 radial 0.5 0.4565000 0.09035407
## 18 1e+00 radial 0.5 0.2061667 0.07676343
## 19 5e+00 radial 0.5 0.1853333 0.05978253
## 20 1e+01 radial 0.5 0.1853333 0.05978253
## 21 1e+02 radial 0.5 0.1853333 0.05978253
## 22 1e-03 linear 1.0 0.4565000 0.09035407
## 23 1e-02 linear 1.0 0.2138333 0.05730065
## 24 1e-01 linear 1.0 0.1806667 0.08135124
## 25 1e+00 linear 1.0 0.1766667 0.07442786
## 26 5e+00 linear 1.0 0.1685000 0.06255590
## 27 1e+01 linear 1.0 0.1685000 0.06255590
## 28 1e+02 linear 1.0 0.1770000 0.07846836
## 29 1e-03 polynomial 1.0 0.2055000 0.07478384
## 30 1e-02 polynomial 1.0 0.1763333 0.07644816
## 31 1e-01 polynomial 1.0 0.1556667 0.07546645
## 32 1e+00 polynomial 1.0 0.1556667 0.07546645
## 33 5e+00 polynomial 1.0 0.1556667 0.07546645
## 34 1e+01 polynomial 1.0 0.1556667 0.07546645
## 35 1e+02 polynomial 1.0 0.1556667 0.07546645
## 36 1e-03 radial 1.0 0.4565000 0.09035407
## 37 1e-02 radial 1.0 0.4565000 0.09035407
## 38 1e-01 radial 1.0 0.4565000 0.09035407
## 39 1e+00 radial 1.0 0.3455000 0.09543937
## 40 5e+00 radial 1.0 0.3086667 0.05259207
## 41 1e+01 radial 1.0 0.3086667 0.05259207
## 42 1e+02 radial 1.0 0.3086667 0.05259207
## 43 1e-03 linear 2.0 0.4565000 0.09035407
## 44 1e-02 linear 2.0 0.2138333 0.05730065
## 45 1e-01 linear 2.0 0.1806667 0.08135124
## 46 1e+00 linear 2.0 0.1766667 0.07442786
## 47 5e+00 linear 2.0 0.1685000 0.06255590
## 48 1e+01 linear 2.0 0.1685000 0.06255590
## 49 1e+02 linear 2.0 0.1770000 0.07846836
## 50 1e-03 polynomial 2.0 0.1805000 0.07181101
## 51 1e-02 polynomial 2.0 0.1556667 0.07546645
## 52 1e-01 polynomial 2.0 0.1556667 0.07546645
## 53 1e+00 polynomial 2.0 0.1556667 0.07546645
## 54 5e+00 polynomial 2.0 0.1556667 0.07546645
## 55 1e+01 polynomial 2.0 0.1556667 0.07546645
## 56 1e+02 polynomial 2.0 0.1556667 0.07546645
## 57 1e-03 radial 2.0 0.4565000 0.09035407
## 58 1e-02 radial 2.0 0.4565000 0.09035407
## 59 1e-01 radial 2.0 0.4565000 0.09035407
## 60 1e+00 radial 2.0 0.4440000 0.09836377
## 61 5e+00 radial 2.0 0.4398333 0.09040529
## 62 1e+01 radial 2.0 0.4398333 0.09040529
## 63 1e+02 radial 2.0 0.4398333 0.09040529
## 64 1e-03 linear 3.0 0.4565000 0.09035407
## 65 1e-02 linear 3.0 0.2138333 0.05730065
## 66 1e-01 linear 3.0 0.1806667 0.08135124
## 67 1e+00 linear 3.0 0.1766667 0.07442786
## 68 5e+00 linear 3.0 0.1685000 0.06255590
## 69 1e+01 linear 3.0 0.1685000 0.06255590
## 70 1e+02 linear 3.0 0.1770000 0.07846836
## 71 1e-03 polynomial 3.0 0.1556667 0.07546645
## 72 1e-02 polynomial 3.0 0.1556667 0.07546645
## 73 1e-01 polynomial 3.0 0.1556667 0.07546645
## 74 1e+00 polynomial 3.0 0.1556667 0.07546645
## 75 5e+00 polynomial 3.0 0.1556667 0.07546645
## 76 1e+01 polynomial 3.0 0.1556667 0.07546645
## 77 1e+02 polynomial 3.0 0.1556667 0.07546645
## 78 1e-03 radial 3.0 0.4565000 0.09035407
## 79 1e-02 radial 3.0 0.4565000 0.09035407
## 80 1e-01 radial 3.0 0.4565000 0.09035407
## 81 1e+00 radial 3.0 0.4523333 0.09931618
## 82 5e+00 radial 3.0 0.4481667 0.09393802
## 83 1e+01 radial 3.0 0.4481667 0.09393802
## 84 1e+02 radial 3.0 0.4481667 0.09393802
## 85 1e-03 linear 4.0 0.4565000 0.09035407
## 86 1e-02 linear 4.0 0.2138333 0.05730065
## 87 1e-01 linear 4.0 0.1806667 0.08135124
## 88 1e+00 linear 4.0 0.1766667 0.07442786
## 89 5e+00 linear 4.0 0.1685000 0.06255590
## 90 1e+01 linear 4.0 0.1685000 0.06255590
## 91 1e+02 linear 4.0 0.1770000 0.07846836
## 92 1e-03 polynomial 4.0 0.1556667 0.07546645
## 93 1e-02 polynomial 4.0 0.1556667 0.07546645
## 94 1e-01 polynomial 4.0 0.1556667 0.07546645
## 95 1e+00 polynomial 4.0 0.1556667 0.07546645
## 96 5e+00 polynomial 4.0 0.1556667 0.07546645
## 97 1e+01 polynomial 4.0 0.1556667 0.07546645
## 98 1e+02 polynomial 4.0 0.1556667 0.07546645
## 99 1e-03 radial 4.0 0.4565000 0.09035407
## 100 1e-02 radial 4.0 0.4565000 0.09035407
## 101 1e-01 radial 4.0 0.4565000 0.09035407
## 102 1e+00 radial 4.0 0.4565000 0.09035407
## 103 5e+00 radial 4.0 0.4523333 0.09931618
## 104 1e+01 radial 4.0 0.4523333 0.09931618
## 105 1e+02 radial 4.0 0.4523333 0.09931618
bestmod<-svm.tune$best.model
bestmod
##
## Call:
## best.tune(method = e1071::svm, train.x = target ~ ., data = df_train,
## ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100), kernel = c("linear",
## "polynomial", "radial"), gamma = c(0.5, 1, 2, 3, 4)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 134
svm.pred2 <- predict(bestmod, newdata = df_test, type = "class")
svm.cm_train2 <- confusionMatrix(svm.pred2, df_test$target)
svm.cm_train2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 21 4
## Yes 6 29
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.661
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.7778
## Specificity : 0.8788
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8286
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : No
##
According to the confusion matrix, the tuned SVM also reaches an accuracy of 0.8333 on the test set.
Random_Forests<-c(forest.cm_train$overall[1],forest.cm_train$byClass[1],forest.cm_train$byClass[2],forest.cm_train$byClass[5],forest.cm_train$byClass[6],forest.cm_train$byClass[7])
Gradient_Boosting<-c(boost.cm$overall[1],boost.cm$byClass[1],boost.cm$byClass[2],boost.cm$byClass[5],boost.cm$byClass[6],boost.cm$byClass[7])
SVM<-c(svm.cm_train2$overall[1],svm.cm_train2$byClass[1],svm.cm_train2$byClass[2],svm.cm_train2$byClass[5],svm.cm_train2$byClass[6],svm.cm_train2$byClass[7])
result<-rbind(Random_Forests,Gradient_Boosting,SVM)
kable(result)

|  | Accuracy | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Random_Forests | 0.8333333 | 0.7777778 | 0.8787879 | 0.8400000 | 0.7777778 | 0.8076923 |
| Gradient_Boosting | 0.8333333 | 0.7407407 | 0.9090909 | 0.8695652 | 0.7407407 | 0.8000000 |
| SVM | 0.8333333 | 0.7777778 | 0.8787879 | 0.8400000 | 0.7777778 | 0.8076923 |
Principal Component Analysis (PCA) is one of the most commonly used unsupervised machine learning techniques, applied for exploratory data analysis, dimensionality reduction, and information compression. It helps us visualize the variation present in a dataset with many variables. In our case, dummy encoding the categorical features expands the dataset into many additional indicator columns, which can make analysis harder because of the increased number of dimensions. Using PCA to reduce the dimensionality of the dummy-encoded dataset while measuring the amount of variance explained is therefore useful. To do this, we use the prcomp() function:
df_dummy.pca <- prcomp(df_dummy, center = TRUE,scale. = TRUE)
summary(df_dummy.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4969 1.59479 1.50965 1.36202 1.32718 1.27564 1.22439
## Proportion of Variance 0.2011 0.08204 0.07352 0.05984 0.05682 0.05249 0.04836
## Cumulative Proportion 0.2011 0.28316 0.35667 0.41652 0.47334 0.52583 0.57419
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.15506 1.1342 1.06882 1.03041 1.02755 0.98902 0.98278
## Proportion of Variance 0.04304 0.0415 0.03685 0.03425 0.03406 0.03155 0.03116
## Cumulative Proportion 0.61723 0.6587 0.69557 0.72982 0.76388 0.79544 0.82659
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.94244 0.93529 0.88443 0.86181 0.8536 0.76794 0.64136
## Proportion of Variance 0.02865 0.02822 0.02523 0.02396 0.0235 0.01902 0.01327
## Cumulative Proportion 0.85524 0.88346 0.90870 0.93265 0.9562 0.97518 0.98845
## PC22 PC23 PC24 PC25 PC26 PC27
## Standard deviation 0.59839 5.87e-15 2.144e-15 1.336e-15 1.15e-15 8.766e-16
## Proportion of Variance 0.01155 0.00e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion 1.00000 1.00e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
## PC28 PC29 PC30 PC31
## Standard deviation 8.174e-16 5.828e-16 3.23e-16 2.387e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.00e+00 1.000e+00
fviz_eig(df_dummy.pca)

# compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev
#compute variance
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
round(prop_varex[1:10], 2)
## [1] 0.20 0.08 0.07 0.06 0.06 0.05 0.05 0.04 0.04 0.04
The first principal component therefore explains about 20% of the variability and the second about 8%; together, the first ten principal components explain roughly 69%. We then plot the proportion of variance explained (PVE) and the cumulative PVE as scree plots, complementing the fviz_eig plot above.
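As a quick numerical check on the 69% figure, the cumulative proportions can be printed directly from the prop_varex vector computed above:

# cumulative proportion of variance explained by the first ten components
round(cumsum(prop_varex)[1:10], 2)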
#scree plot
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
     type = "b")

# cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
     type = "b")

library("FactoMineR")
library("factoextra")
eig.val <- get_eigenvalue(df_dummy.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.234534e+00 2.011140e+01 20.11140
## Dim.2 2.543346e+00 8.204344e+00 28.31574
## Dim.3 2.279033e+00 7.351718e+00 35.66746
## Dim.4 1.855109e+00 5.984222e+00 41.65168
## Dim.5 1.761418e+00 5.681994e+00 47.33368
## Dim.6 1.627248e+00 5.249187e+00 52.58286
## Dim.7 1.499129e+00 4.835900e+00 57.41877
## Dim.8 1.334161e+00 4.303745e+00 61.72251
## Dim.9 1.286410e+00 4.149711e+00 65.87222
## Dim.10 1.142386e+00 3.685117e+00 69.55734
## Dim.11 1.061749e+00 3.424996e+00 72.98233
## Dim.12 1.055861e+00 3.406003e+00 76.38834
## Dim.13 9.781525e-01 3.155331e+00 79.54367
## Dim.14 9.658557e-01 3.115663e+00 82.65933
## Dim.15 8.881858e-01 2.865115e+00 85.52445
## Dim.16 8.747641e-01 2.821820e+00 88.34627
## Dim.17 7.822181e-01 2.523284e+00 90.86955
## Dim.18 7.427240e-01 2.395884e+00 93.26543
## Dim.19 7.285658e-01 2.350212e+00 95.61565
## Dim.20 5.897297e-01 1.902354e+00 97.51800
## Dim.21 4.113489e-01 1.326932e+00 98.84493
## Dim.22 3.580708e-01 1.155067e+00 100.00000
## Dim.23 3.445532e-29 1.111462e-28 100.00000
## Dim.24 4.596664e-30 1.482795e-29 100.00000
## Dim.25 1.783936e-30 5.754634e-30 100.00000
## Dim.26 1.323229e-30 4.268482e-30 100.00000
## Dim.27 7.684005e-31 2.478711e-30 100.00000
## Dim.28 6.681098e-31 2.155193e-30 100.00000
## Dim.29 3.396772e-31 1.095733e-30 100.00000
## Dim.30 1.043089e-31 3.364804e-31 100.00000
## Dim.31 5.697784e-32 1.837995e-31 100.00000
res.pca <- PCA(df_dummy, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)
corrplot(var$cos2, is.corr = FALSE)
corrplot(var$contrib, is.corr = FALSE)
fviz_contrib(df_dummy.pca, choice = "var", axes = 1, top = 15)
fviz_contrib(df_dummy.pca, choice = "var", axes = 2, top = 15)
fviz_pca_var(df_dummy.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
             )
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
These plots summarize how the original variables relate to the principal components: the cos2 matrix shows how well each variable is represented by each component, the contribution matrix and the fviz_contrib bar charts show which variables contribute most to the first two components, and the variable map (fviz_pca_var) draws each variable as an arrow colored by its contribution, with arrows pointing in similar directions indicating positively correlated variables.
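To complement the plots, the top contributors to a given component can also be listed numerically from the same var object; a brief sketch for the first component:

# sketch: variables contributing most to the first principal component
head(sort(var$contrib[, 1], decreasing = TRUE), 10)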
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It is fast and efficient at grouping data points into clusters even when very little prior information about the data is available. To choose the number of clusters, we first apply the elbow method to the within-cluster sum of squares.
fviz_nbclust(df_dummy, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2) + # add line for better visualization
  labs(subtitle = "Elbow method") # add subtitle
We can see above that the elbow occurs at approximately 5 clusters; adding more clusters does not meaningfully reduce the within-segment variability. Therefore, we'll fit our initial K-means model with \(k=5\).
set.seed(42)
km_res <- kmeans(df_dummy, centers = 5, nstart = 20)
summary(km_res)
## Length Class Mode
## cluster 303 -none- numeric
## centers 155 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 89 0.24
## 2 2 96 0.33
## 3 3 53 0.18
## 4 4 60 0.33
## 5 5 5 0.46
fviz_cluster(km_res, df_dummy, ellipse.type = "norm")
To provide some context around the clusters generated by the algorithm, we attach the five cluster assignments back onto our df_dummy dataset and compute the mean of each feature within each cluster, to see whether any noticeable trends emerge.
k_means_analysis_df <- df_dummy %>%
mutate(Cluster = km_res$cluster) %>%
group_by(Cluster) %>%
  summarise_all("mean")

k_means_analysis_df <- k_means_analysis_df %>%
  select('Cluster', 'age', 'trestbps', 'chol', 'thalach', 'oldpeak')
kable(k_means_analysis_df) %>%
  kable_styling(bootstrap_options = "responsive")

| Cluster | age | trestbps | chol | thalach | oldpeak |
|---|---|---|---|---|---|
| 1 | 52.69663 | 128.0674 | 192.2472 | 148.0337 | 1.0348315 |
| 2 | 51.39583 | 128.6354 | 240.7396 | 163.9583 | 0.7458333 |
| 3 | 59.86792 | 138.4340 | 258.3962 | 122.0189 | 1.5320755 |
| 4 | 56.05000 | 135.3167 | 308.5167 | 153.0500 | 1.0100000 |
| 5 | 62.60000 | 135.8000 | 438.2000 | 155.6000 | 1.9000000 |
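The cluster means above are ordered largely by cholesterol, which is expected: k-means relies on Euclidean distance, so on unscaled data the largest-magnitude features dominate the clustering. A hedged sketch of a standardized rerun is below; it was not used for the results above.

# sketch: standardize the dummy-encoded features so each contributes equally to the distance
df_scaled <- scale(df_dummy)
km_res_scaled <- kmeans(df_scaled, centers = 5, nstart = 20)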
Note: a few of the findings below were also observed when running this same analysis in Python.
In the end, we worked through a fair bit of analysis on this heart disease dataset. From this, we’ve come up with a few interesting conclusions:
Many of the attributes available in the Cleveland Heart Disease dataset line up with known risk factors for heart disease. For instance, many individuals who did have heart disease exhibited higher resting blood pressure and higher cholesterol, and were older on average, than those who did not.
Additionally, a large percentage of individuals who exhibited exercise-induced angina (chest pain) had heart disease (roughly 75% of such individuals in the dataset). This is worth noting, as it may help physicians with diagnoses and with identifying factors that could lead to this type of disease.
It was also very interesting to run some machine learning algorithms on this dataset. Support Vector Machines (SVM) and Random Forest both provided pretty good classification outputs for predicting whether or not an individual has heart disease. Implementing more of these types of models in the future may be valuable to help physicians make proper decisions and can aid in diagnoses.
Finally, in the Python run the SVM model appeared to be slightly more effective on the test dataset. Although the Random Forest model did extraordinarily well on the training data there (accuracy around 99%), when extended to the test data it yielded a somewhat higher number of false negatives and false positives than the SVM model. If such a model were used in real-world scenarios as the only decision-making tool, the consequences could be serious: a sizable percentage of patients could be wrongly diagnosed with heart disease, while others would be wrongly told that they do not have heart disease when they actually do.