Diabetes Classification using Tree Based and Support Vector Machine Methods
library(tidyverse)
library(reshape2)
library(ggpubr)
library(skimr)
library(caret)
library(rpart.plot)
library(performanceEstimation)
library(e1071)
library(splitTools)
library(cluster)
library(factoextra)
library(ggfortify)
Data Exploration
Data Source
I used the ‘Diabetes Dataset’ freely available on Kaggle.com (https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset).
This data source originally came from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset focuses on female patients of Pima Indian heritage who are at least 21 years old. The objective of this data source is to predict whether a patient is likely to have diabetes, to aid in diagnosis. There are 9 variables in the dataset: 1 binary indicator variable for diabetes and 8 variables related to medical information about the patient.
Variable | Data Type | Description |
---|---|---|
Pregnancies | Numeric | To express the number of pregnancies |
Glucose | Numeric | To express the glucose level in blood |
BloodPressure | Numeric | To express the blood pressure measurement |
SkinThickness | Numeric | To express the thickness of the skin |
Insulin | Numeric | To express the insulin level in blood |
BMI | Numeric | To express the body mass index |
DiabetesPedigreeFunction | Numeric | To express the likelihood of diabetes based on family history |
Age | Numeric | To express age |
Outcome | Categorical | To express whether someone has diabetes: 1 is Yes and 0 is No |
Data Preparation
I decided to remove all rows that contained a 0 for BloodPressure, SkinThickness, or BMI, as it is not humanly possible to survive with those measures. I believe that those rows contained incomplete data and should be excluded from modeling.
diabetes <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA622/main/HW2/Data/diabetes.csv")
# adjust data types
diabetes$Pregnancies <- as.numeric(diabetes$Pregnancies)
diabetes$Glucose <- as.numeric(diabetes$Glucose)
diabetes$BloodPressure <- as.numeric(diabetes$BloodPressure)
diabetes$SkinThickness <- as.numeric(diabetes$SkinThickness)
diabetes$Insulin <- as.numeric(diabetes$Insulin)
diabetes$Age <- as.numeric(diabetes$Age)
diabetes$Outcome <- as.factor(diabetes$Outcome)
diabetes.new <- diabetes %>%
  filter(BloodPressure != 0) %>%
  filter(SkinThickness != 0) %>%
  filter(BMI != 0)
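As a quick added check (not part of the original output), the number of rows removed by this filter can be confirmed before summarizing the cleaned data:

# rows before vs. after removing physiologically impossible zeros;
# skim() below reports 537 rows remaining
nrow(diabetes)
nrow(diabetes.new)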
Summary Statistics
skim(diabetes.new)
Name | diabetes.new |
Number of rows | 537 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Outcome | 0 | 1 | FALSE | 2 | 0: 358, 1: 179 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Pregnancies | 0 | 1 | 3.51 | 3.30 | 0.00 | 1.00 | 2.00 | 5.00 | 17.00 | ▇▂▂▁▁ |
Glucose | 0 | 1 | 119.90 | 32.98 | 0.00 | 97.00 | 115.00 | 141.00 | 199.00 | ▁▁▇▅▂ |
BloodPressure | 0 | 1 | 71.47 | 12.30 | 24.00 | 64.00 | 72.00 | 80.00 | 110.00 | ▁▂▇▆▁ |
SkinThickness | 0 | 1 | 29.19 | 10.51 | 7.00 | 22.00 | 29.00 | 36.00 | 99.00 | ▅▇▁▁▁ |
Insulin | 0 | 1 | 113.96 | 122.89 | 0.00 | 0.00 | 90.00 | 165.00 | 846.00 | ▇▂▁▁▁ |
BMI | 0 | 1 | 32.89 | 6.88 | 18.20 | 27.80 | 32.80 | 36.90 | 67.10 | ▃▇▃▁▁ |
DiabetesPedigreeFunction | 0 | 1 | 0.50 | 0.34 | 0.09 | 0.26 | 0.42 | 0.66 | 2.42 | ▇▃▁▁▁ |
Age | 0 | 1 | 31.59 | 10.75 | 21.00 | 23.00 | 28.00 | 38.00 | 81.00 | ▇▂▁▁▁ |
Check the Target Variable Distribution
There is a class imbalance present in this dataset. Imbalanced data can prove quite problematic when classifying the minority class. I will oversample using SMOTE before modeling to give the models a better chance at classifying the minority class: women who have diabetes.
diabetes.new %>% ggplot(aes(Outcome)) +
  geom_bar(fill = "#04354F") +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
Data Distributions
The distributions appear to differ between the two target classes across all of the variables.
p1 <- diabetes.new %>%
  ggplot(aes(x = Pregnancies, fill = Outcome)) +
  geom_boxplot()

p2 <- diabetes.new %>%
  ggplot(aes(x = Glucose, fill = Outcome)) +
  geom_boxplot()

p3 <- diabetes.new %>%
  ggplot(aes(x = BloodPressure, fill = Outcome)) +
  geom_boxplot()

p4 <- diabetes.new %>%
  ggplot(aes(x = Insulin, fill = Outcome)) +
  geom_boxplot()

p5 <- diabetes.new %>%
  ggplot(aes(x = BMI, fill = Outcome)) +
  geom_boxplot()

p6 <- diabetes.new %>%
  ggplot(aes(x = DiabetesPedigreeFunction, fill = Outcome)) +
  geom_boxplot()

p7 <- diabetes.new %>%
  ggplot(aes(x = Age, fill = Outcome)) +
  geom_boxplot()

ggarrange(p1, p2, p3, p4, p5, p6, p7, nrow = 4, ncol = 2)
Unsupervised Methods
Prepare Data
As clustering algorithms utilize distance metrics, it is important to normalize the data.
cluster.data <- diabetes.new[,-9]

cluster.data <- as.data.frame(scale(cluster.data))
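As a quick sanity check (an addition, not in the original), each scaled feature should now have mean approximately 0 and standard deviation 1:

round(colMeans(cluster.data), 2)      # all ~0 after centering
round(apply(cluster.data, 2, sd), 2)  # all 1 after scaling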
K-means and Principal Component Analysis
K-means and Principal Component Analysis were performed to further explore the dataset. This dataset does not appear to be a strong candidate for clustering, as the PCA shows that most of the data is grouped in a single cohesive shape. Visualizing the first two principal components shows that there is some overlap between the two classes.
set.seed(12345)

# function to compute total within-cluster sum of squares
wss <- function(k) {
  kmeans(cluster.data, k, nstart = 20)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
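Since the elbow in the plot above is not sharply defined, a hedged cross-check is the average silhouette method from factoextra (already loaded above), which scores each candidate k by how well separated the resulting clusters are:

# alternative diagnostic for choosing k (sketch; uses factoextra's fviz_nbclust)
fviz_nbclust(cluster.data, kmeans, method = "silhouette")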
fviz_cluster(kmeans(cluster.data, centers = 2, nstart = 20), data = cluster.data)
pca <- prcomp(cluster.data,
              center = TRUE,
              scale. = TRUE)

cluster.data1 <- cluster.data
cluster.data1$Outcome <- diabetes.new$Outcome

autoplot(pca, data = cluster.data1, colour = "Outcome")
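To quantify how faithful this two-dimensional view is, the proportion of variance explained by each principal component can be read from the prcomp object (a quick added check, not shown in the original):

summary(pca)  # the "Proportion of Variance" row shows each PC's share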
Decision Tree and Random Forest
Modeling
The data was randomly split into a training set (80%) and a testing set (20%), with the testing set reserved for evaluating model performance. Because the data source is imbalanced, the training set underwent the SMOTE algorithm to synthetically create new minority-class data points and balance the classes. This should allow the models to more accurately predict whether a patient has diabetes. Repeated k-fold cross validation was used to better estimate the performance of the machine learning models; I selected k = 10 with 3 repeats.
set.seed(12345)
train_ind <- createDataPartition(diabetes.new[,"Outcome"], p = 0.8, list = FALSE)

train <- diabetes.new[train_ind, ]
test <- diabetes.new[-train_ind, ]

set.seed(12345)
train.balanced <- smote(Outcome ~ ., data = train, perc.over = 1)
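A quick class count before and after resampling (an added check, not in the original output) confirms what SMOTE did here: with perc.over = 1, one synthetic minority case is generated per original minority case, which together with the package's default under-sampling should leave the classes roughly balanced.

# class balance before vs. after SMOTE
table(train$Outcome)
table(train.balanced$Outcome)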
Decision Trees
I chose to build a simple decision tree and a much more complicated decision tree to see if there were vast differences in model performance. The models are:

- Model A: uses the variables Glucose, BMI, and Age. Hyperparameters: 10-fold cross validation with 3 repeats and a tune length of 5.
- Model B: uses all variables in the dataset. Hyperparameters: 10-fold cross validation with 3 repeats and a tune length of 200.
Model A: Decision Tree Using the variables: Glucose, BMI, and Age
This first model only used three variables to predict Outcome. The decision tree plot indicates that Glucose is the most important variable for determining diabetes, followed by BMI and lastly Age. The model is simple, with only 8 terminal nodes.
set.seed(12345)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

tree1 <- train(Outcome ~ Glucose + BMI + Age,
               method = "rpart",
               trControl = trctrl,
               tuneLength = 5,
               data = train.balanced)
rpart.plot(tree1$finalModel)
Model Results
The model achieved an accuracy of 0.74, though I would argue that overall accuracy is not the best metric for this type of problem because the data is imbalanced. Instead I will focus on Kappa, Precision, Sensitivity, and Specificity. The Kappa is 0.46, Precision is 0.57, Sensitivity is 0.83, and Specificity is 0.69. The model was very good at capturing positive cases, catching almost all of them, but its precision was poor because it produced a large number of false positive predictions.
pred <- predict(tree1, newdata = test)

cmA <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cmA
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 49 6
## 1 22 29
##
## Accuracy : 0.7358
## 95% CI : (0.6413, 0.8168)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.087957
##
## Kappa : 0.4648
##
## Mcnemar's Test P-Value : 0.004586
##
## Sensitivity : 0.8286
## Specificity : 0.6901
## Pos Pred Value : 0.5686
## Neg Pred Value : 0.8909
## Prevalence : 0.3302
## Detection Rate : 0.2736
## Detection Prevalence : 0.4811
## Balanced Accuracy : 0.7594
##
## 'Positive' Class : 1
##
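For reference, the metrics quoted above can be pulled directly from the caret confusion-matrix object; note that "Pos Pred Value" is caret's name for precision. A minimal extraction sketch:

cmA$byClass[c("Pos Pred Value", "Sensitivity", "Specificity")]
cmA$overall["Kappa"]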
Model B: Decision Tree Using all variables
This decision tree uses all the variables in the dataset to predict the Outcome variable. This model is much more complex than the basic 3-variable model, with a total of 22 terminal nodes and 21 decision nodes. Glucose and Age are still the most important factors in determining a patient's likelihood of being diabetic. Other important variables include SkinThickness and DiabetesPedigreeFunction.
set.seed(12345)
tree2 <- train(Outcome ~ .,
               method = "rpart",
               trControl = trctrl,
               tuneLength = 200,
               data = train.balanced)
rpart.plot(tree2$finalModel)
Model Results
This model achieved an overall accuracy of 0.72, Kappa of 0.43, Precision of 0.56, and Sensitivity of 0.77. This model is slightly worse at correctly classifying whether a patient is diabetic compared to the first model, due to the lower Kappa and sensitivity values.
pred <- predict(tree2, newdata = test)
result <- table(test$Outcome, pred)

cmB <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cmB
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 8
## 1 21 27
##
## Accuracy : 0.7264
## 95% CI : (0.6313, 0.8085)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.12716
##
## Kappa : 0.4347
##
## Mcnemar's Test P-Value : 0.02586
##
## Sensitivity : 0.7714
## Specificity : 0.7042
## Pos Pred Value : 0.5625
## Neg Pred Value : 0.8621
## Prevalence : 0.3302
## Detection Rate : 0.2547
## Detection Prevalence : 0.4528
## Balanced Accuracy : 0.7378
##
## 'Positive' Class : 1
##
Random Forest
Model C: Random Forest
The final model is a Random Forest using all variables, trained with 10-fold cross validation and ntree = 250. The model seems to perform best using only a handful of variables. The most important features in the Random Forest are Glucose, Age, BMI, DiabetesPedigreeFunction, and Pregnancies. This is similar to Model B.
set.seed(12345)
forest <- train(Outcome ~ .,
                method = "rf",
                trControl = trctrl,
                ntree = 250,
                data = train.balanced)

plot(forest)
plot(varImp(forest))
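The tuned value of mtry behind the first plot can also be read directly from the train object (a quick added check, not in the original output):

forest$bestTune  # mtry selected by repeated cross-validation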
Model Results
The Random Forest achieved the following metrics:
- Overall Accuracy of 0.82
- Kappa of 0.58
- Precision of 0.72
- Sensitivity of 0.74
The Random Forest performed best among the tree-based models at predicting whether a patient is diabetic.
pred <- predict(forest, newdata = test)
result <- table(test$Outcome, pred)

cmC <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cmC
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 9
## 1 10 26
##
## Accuracy : 0.8208
## 95% CI : (0.7343, 0.8885)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.0003989
##
## Kappa : 0.5977
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.7429
## Specificity : 0.8592
## Pos Pred Value : 0.7222
## Neg Pred Value : 0.8714
## Prevalence : 0.3302
## Detection Rate : 0.2453
## Detection Prevalence : 0.3396
## Balanced Accuracy : 0.8010
##
## 'Positive' Class : 1
##
Support Vector Machine
Modeling
I used the same variables to train the support vector machine models as were used for the decision tree and random forest models. Using the exact same variables should allow for a fair comparison between the algorithms on this dataset.
The data has undergone feature scaling to reduce bias. SVM calculates distances between data points to find the support vectors that lead to the best decision boundary. Using non-scaled data will negatively affect the model's ability to discover the true patterns in the data, since distances between observations can differ greatly when each variable is on a different scale.
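A toy illustration of that point (hypothetical numbers, not drawn from the dataset): before scaling, the distance between two observations is dominated by whichever feature has the largest units.

# two made-up patients differing in Insulin (large scale) and BMI (small scale)
toy <- data.frame(Insulin = c(100, 300), BMI = c(25, 30))
dist(toy)         # driven almost entirely by the 200-unit Insulin gap
dist(scale(toy))  # after scaling, both features contribute comparably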
diabetes.new.scale <- diabetes.new
diabetes.new.scale[,-9] <- scale(diabetes.new[,-9])

data_long1 <- melt(diabetes.new[,1:8])

## No id variables; using all as measure variables

data_long2 <- melt(diabetes.new.scale[,1:8])

## No id variables; using all as measure variables
p1 <- ggplot(data_long1, aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Non-Scaled Data")

p2 <- ggplot(data_long2, aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Scaled Data")

ggarrange(p1, p2, nrow = 1, ncol = 2)
set.seed(12345)
train_ind <- createDataPartition(diabetes.new[,"Outcome"], p = 0.8, list = FALSE)

train <- diabetes.new[train_ind, ]
test <- diabetes.new[-train_ind, ]

set.seed(12345)
train.balanced <- smote(Outcome ~ ., data = train, perc.over = 1)
Model 1: Linear SVM using the variables: Glucose, BMI, and Age
The first SVM model used only three variables to predict Outcome. The model underwent a grid search to identify the best value of the hyperparameter C given a linear kernel.
set.seed(12345)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid_linear <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5, 10))
set.seed(12345)
svm.model1 <- train(Outcome ~ Glucose + BMI + Age,
                    data = train.balanced,
                    method = "svmLinear",
                    trControl = trctrl,
                    tuneGrid = grid_linear,
                    tuneLength = 10)
Model Results
The Simple Linear SVM model achieved the following metrics:
- Overall Accuracy of 0.81
- Kappa of 0.597
- Precision of 0.67
- Sensitivity of 0.83
This model’s performance is comparable to the Random Forest model.
set.seed(12345)
pred <- predict(svm.model1, newdata = test)

cm1 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57 6
## 1 14 29
##
## Accuracy : 0.8113
## 95% CI : (0.7238, 0.8808)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.0008956
##
## Kappa : 0.5968
##
## Mcnemar's Test P-Value : 0.1175249
##
## Sensitivity : 0.8286
## Specificity : 0.8028
## Pos Pred Value : 0.6744
## Neg Pred Value : 0.9048
## Prevalence : 0.3302
## Detection Rate : 0.2736
## Detection Prevalence : 0.4057
## Balanced Accuracy : 0.8157
##
## 'Positive' Class : 1
##
Model 2: Radial SVM using the variables: Glucose, BMI, and Age
This second SVM model also uses three variables to determine the Outcome of a patient. This model used the radial basis function (RBF) kernel instead of a linear kernel to classify the data. Both C and sigma were tuned to find the optimal model with this kernel.
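For intuition, caret's svmRadial relies on kernlab's parameterization of the RBF kernel, k(x, y) = exp(-sigma * ||x - y||^2), so sigma controls how quickly similarity decays with distance while C penalizes margin violations. A minimal sketch of the kernel itself:

# RBF kernel for two numeric vectors (sketch of kernlab's rbfdot form)
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
rbf_kernel(c(1, 2), c(2, 4), sigma = 0.1)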
set.seed(12345)
grid_radial <- expand.grid(
  sigma = c(0.01, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
            0.1, 0.25, 0.5, 0.75, 0.9, 1),
  C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10))
set.seed(12345)
svm.model2 <- train(Outcome ~ Glucose + BMI + Age,
                    data = train.balanced,
                    method = "svmRadial",
                    trControl = trctrl,
                    tuneGrid = grid_radial,
                    tuneLength = 10)
Model Results
The Simple Radial SVM model reached the following metrics:
- Overall Accuracy of 0.73
- Kappa of 0.44
- Precision of 0.58
- Sensitivity of 0.71
The model does not outperform the Simple Linear SVM model. This model's performance is comparable to the Simple Decision Tree model.
set.seed(12345)
pred <- predict(svm.model2, newdata = test)

cm2 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 53 10
## 1 18 25
##
## Accuracy : 0.7358
## 95% CI : (0.6413, 0.8168)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.08796
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 0.18588
##
## Sensitivity : 0.7143
## Specificity : 0.7465
## Pos Pred Value : 0.5814
## Neg Pred Value : 0.8413
## Prevalence : 0.3302
## Detection Rate : 0.2358
## Detection Prevalence : 0.4057
## Balanced Accuracy : 0.7304
##
## 'Positive' Class : 1
##
Model 3: Polynomial SVM using the variables: Glucose, BMI, and Age
This SVM model uses three variables to determine the Outcome of a patient, leveraging the polynomial kernel. The polynomial kernel has three hyperparameters that need to be tuned to find the optimal model: degree, scale, and C. The model was only tuned up to the 4th degree.
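For reference, the polynomial kernel behind caret's svmPoly (kernlab's polydot) has the form k(x, y) = (scale * <x, y> + offset)^degree; to my understanding the offset is held fixed at 1, so only degree, scale, and C are tuned. A minimal sketch:

# polynomial kernel for two numeric vectors (sketch; offset assumed fixed at 1)
poly_kernel <- function(x, y, scale, degree) (scale * sum(x * y) + 1)^degree
poly_kernel(c(1, 2), c(2, 4), scale = 0.1, degree = 2)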
set.seed(12345)
grid_poly <- expand.grid(
  degree = c(2, 3, 4),
  scale = c(0.01, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
            0.1, 0.25, 0.5, 0.75, 0.9, 1),
  C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10))
set.seed(12345)
svm.model3 <- train(Outcome ~ Glucose + BMI + Age,
                    data = train.balanced,
                    method = "svmPoly",
                    trControl = trctrl,
                    tuneGrid = grid_poly,
                    tuneLength = 10)
Model Results
The Simple Polynomial SVM model reached the following metrics:
- Overall Accuracy of 0.80
- Kappa of 0.57
- Precision of 0.675
- Sensitivity of 0.77
The model does not outperform the Simple Linear SVM model, but comes close.
set.seed(12345)
pred <- predict(svm.model3, newdata = test)

cm3 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 8
## 1 13 27
##
## Accuracy : 0.8019
## 95% CI : (0.7132, 0.873)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.001898
##
## Kappa : 0.5678
##
## Mcnemar's Test P-Value : 0.382733
##
## Sensitivity : 0.7714
## Specificity : 0.8169
## Pos Pred Value : 0.6750
## Neg Pred Value : 0.8788
## Prevalence : 0.3302
## Detection Rate : 0.2547
## Detection Prevalence : 0.3774
## Balanced Accuracy : 0.7942
##
## 'Positive' Class : 1
##
Model 4: Linear SVM using all variables
This model uses all available variables in the dataset to predict Outcome. The same tuning procedure as Model 1 was conducted, as both use a linear kernel.
set.seed(12345)
svm.model4 <- train(Outcome ~ .,
                    data = train.balanced,
                    method = "svmLinear",
                    trControl = trctrl,
                    tuneGrid = grid_linear,
                    tuneLength = 10)
Model Results
The Complex Linear SVM model reached the following metrics:
- Overall Accuracy of 0.79
- Kappa of 0.54
- Precision of 0.67
- Sensitivity of 0.74
The model does not outperform the Simple Linear SVM model. It appears that adding more features hurt the model's performance.
set.seed(12345)
pred <- predict(svm.model4, newdata = test)

cm4 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 9
## 1 13 26
##
## Accuracy : 0.7925
## 95% CI : (0.7028, 0.8651)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.003808
##
## Kappa : 0.544
##
## Mcnemar's Test P-Value : 0.522431
##
## Sensitivity : 0.7429
## Specificity : 0.8169
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.8657
## Prevalence : 0.3302
## Detection Rate : 0.2453
## Detection Prevalence : 0.3679
## Balanced Accuracy : 0.7799
##
## 'Positive' Class : 1
##
Model 5: Radial SVM using all variables
This model uses all available variables in the dataset to predict Outcome. The same tuning procedure as Model 2 was conducted, as both use a radial kernel.
set.seed(12345)
svm.model5 <- train(Outcome ~ .,
                    data = train.balanced,
                    method = "svmRadial",
                    trControl = trctrl,
                    tuneGrid = grid_radial,
                    tuneLength = 10)
Model Results
The Complex Radial SVM model reached the following metrics:
- Overall Accuracy of 0.679
- Kappa of 0.35
- Precision of 0.51
- Sensitivity of 0.743
The model does not outperform the Simple Linear SVM model. It also does worse than the Simple Radial SVM model.
set.seed(12345)
pred <- predict(svm.model5, newdata = test)

cm5 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 9
## 1 25 26
##
## Accuracy : 0.6792
## 95% CI : (0.5816, 0.7666)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.4635
##
## Kappa : 0.3502
##
## Mcnemar's Test P-Value : 0.0101
##
## Sensitivity : 0.7429
## Specificity : 0.6479
## Pos Pred Value : 0.5098
## Neg Pred Value : 0.8364
## Prevalence : 0.3302
## Detection Rate : 0.2453
## Detection Prevalence : 0.4811
## Balanced Accuracy : 0.6954
##
## 'Positive' Class : 1
##
Model 6: Polynomial SVM using all variables
The last model applies the polynomial kernel to classify the Outcome using all available features in the dataset. This model underwent the same tuning procedure as the simple polynomial SVM (Model 3).
set.seed(12345)
svm.model6 <- train(Outcome ~ .,
                    data = train.balanced,
                    method = "svmPoly",
                    trControl = trctrl,
                    tuneGrid = grid_poly,
                    tuneLength = 10)
Model Results
The Complex Polynomial SVM model reached the following metrics:
- Overall Accuracy of 0.698
- Kappa of 0.33
- Precision of 0.54
- Sensitivity of 0.57
The model does not outperform the Simple Linear SVM model. It also does worse than its simple counterpart, the Simple Polynomial SVM model.
set.seed(12345)
pred <- predict(svm.model6, newdata = test)

cm6 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm6
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 15
## 1 17 20
##
## Accuracy : 0.6981
## 95% CI : (0.6013, 0.7835)
## No Information Rate : 0.6698
## P-Value [Acc > NIR] : 0.3059
##
## Kappa : 0.3273
##
## Mcnemar's Test P-Value : 0.8597
##
## Sensitivity : 0.5714
## Specificity : 0.7606
## Pos Pred Value : 0.5405
## Neg Pred Value : 0.7826
## Prevalence : 0.3302
## Detection Rate : 0.1887
## Detection Prevalence : 0.3491
## Balanced Accuracy : 0.6660
##
## 'Positive' Class : 1
##
Comparison of All Models for Tree Based and Support Vector Machines
The best tree-based model in all metrics except sensitivity is the Random Forest model (Model C). Interestingly, the simple decision tree (Model A) and complex decision tree (Model B) had very similar results, with the simple decision tree performing only a percentage point or two better in most metrics.
The SVM model with the best combination of precision and recall is the Simple Linear SVM model (Model 1). The Complex Linear SVM model (Model 4) had worse results across most of the metrics compared to the Simple Linear SVM model. The models using the radial (RBF) kernel (Model 2 and Model 5) underperformed compared to the models using a linear or polynomial kernel. The Simple Polynomial SVM (Model 3) came closest to outperforming the Simple Linear SVM (Model 1). The best Support Vector Machine model is Model 1.
It appears that adding more features made classifying harder for the algorithms as the simple models were slightly better than their complex model counterparts in both the decision tree and SVM algorithms.
The best model for diabetes prediction across both classes of algorithms is the Random Forest model (Model C), as it has the highest precision and specificity. The F1 scores of the two finalists are similar, meaning they have a similar harmonic mean of precision and sensitivity. The precision and specificity values for the Random Forest model are 4.8 and 5.6 percentage points higher than those of the Simple Linear SVM, respectively. This means that the Random Forest is superior at identifying diabetic patients while minimizing false positives. It should be noted that the Simple Linear SVM had superior sensitivity (8.6 percentage points higher than the Random Forest), but it was not as good at minimizing false positives.
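The percentage-point gaps cited above can be reproduced directly from the confusion-matrix objects computed earlier (a quick check; positive differences favor the Random Forest):

# Random Forest (cmC) minus Simple Linear SVM (cm1) on the key metrics
round(cmC$byClass[c("Pos Pred Value", "Sensitivity", "Specificity")] -
        cm1$byClass[c("Pos Pred Value", "Sensitivity", "Specificity")], 3)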
Tree Based Model Summaries
accuracy <- c(cmA$overall[1], cmB$overall[1], cmC$overall[1])
kappa.val <- c(cmA$overall["Kappa"], cmB$overall["Kappa"], cmC$overall["Kappa"])
precision.val <- c(cmA$byClass['Pos Pred Value'], cmB$byClass['Pos Pred Value'], cmC$byClass['Pos Pred Value'])
sensitivity.val <- c(cmA$byClass['Sensitivity'], cmB$byClass['Sensitivity'], cmC$byClass['Sensitivity'])
specificity.val <- c(cmA$byClass['Specificity'], cmB$byClass['Specificity'], cmC$byClass['Specificity'])

model.type <- c("Model A: Simple Decision Tree", "Model B: Complex Decision Tree", "Model C: Random Forest")

results <- data.frame(model.type,
                      accuracy,
                      kappa.val,
                      precision.val,
                      sensitivity.val,
                      specificity.val)

results$f1.score <- 2 * ((results$precision.val * results$sensitivity.val) / (results$precision.val + results$sensitivity.val))

results <- results %>%
  mutate_if(is.numeric, round, digits = 3)

kableExtra::kbl(results) %>%
  kableExtra::kable_classic()
model.type | accuracy | kappa.val | precision.val | sensitivity.val | specificity.val | f1.score |
---|---|---|---|---|---|---|
Model A: Simple Decision Tree | 0.736 | 0.465 | 0.569 | 0.829 | 0.690 | 0.674 |
Model B: Complex Decision Tree | 0.726 | 0.435 | 0.562 | 0.771 | 0.704 | 0.651 |
Model C: Random Forest | 0.821 | 0.598 | 0.722 | 0.743 | 0.859 | 0.732 |
Support Vector Machine Model Summaries
accuracy <- c(cm1$overall[1], cm2$overall[1], cm3$overall[1], cm4$overall[1], cm5$overall[1], cm6$overall[1])
kappa.val <- c(cm1$overall["Kappa"], cm2$overall["Kappa"], cm3$overall["Kappa"], cm4$overall["Kappa"], cm5$overall["Kappa"], cm6$overall["Kappa"])
precision.val <- c(cm1$byClass['Pos Pred Value'], cm2$byClass['Pos Pred Value'], cm3$byClass['Pos Pred Value'], cm4$byClass['Pos Pred Value'], cm5$byClass['Pos Pred Value'], cm6$byClass['Pos Pred Value'])
sensitivity.val <- c(cm1$byClass['Sensitivity'], cm2$byClass['Sensitivity'], cm3$byClass['Sensitivity'], cm4$byClass['Sensitivity'], cm5$byClass['Sensitivity'], cm6$byClass['Sensitivity'])
specificity.val <- c(cm1$byClass['Specificity'], cm2$byClass['Specificity'], cm3$byClass['Specificity'], cm4$byClass['Specificity'], cm5$byClass['Specificity'], cm6$byClass['Specificity'])

model.type <- c("Model 1: Simple Linear SVM", "Model 2: Simple Radial SVM", "Model 3: Simple Polynomial SVM", "Model 4: Complex Linear SVM", "Model 5: Complex Radial SVM", "Model 6: Complex Polynomial SVM")

results <- data.frame(model.type,
                      accuracy,
                      kappa.val,
                      precision.val,
                      sensitivity.val,
                      specificity.val)

results$f1.score <- 2 * ((results$precision.val * results$sensitivity.val) / (results$precision.val + results$sensitivity.val))

results <- results %>%
  mutate_if(is.numeric, round, digits = 3)

kableExtra::kbl(results) %>%
  kableExtra::kable_classic()
model.type | accuracy | kappa.val | precision.val | sensitivity.val | specificity.val | f1.score |
---|---|---|---|---|---|---|
Model 1: Simple Linear SVM | 0.811 | 0.597 | 0.674 | 0.829 | 0.803 | 0.744 |
Model 2: Simple Radial SVM | 0.736 | 0.436 | 0.581 | 0.714 | 0.746 | 0.641 |
Model 3: Simple Polynomial SVM | 0.802 | 0.568 | 0.675 | 0.771 | 0.817 | 0.720 |
Model 4: Complex Linear SVM | 0.792 | 0.544 | 0.667 | 0.743 | 0.817 | 0.703 |
Model 5: Complex Radial SVM | 0.679 | 0.350 | 0.510 | 0.743 | 0.648 | 0.605 |
Model 6: Complex Polynomial SVM | 0.698 | 0.327 | 0.541 | 0.571 | 0.761 | 0.556 |
Literature Review
Machine learning algorithms are being applied to a plethora of applications within the medical field, most notably image recognition and diagnosis prediction. The focus of this literature review is on the latter. Many of the academic articles that compared the predictive power of decision trees, random forests, and support vector machines in healthcare found that one algorithm can prove more effective than the others even given the same dataset and data preparation.
In the article, “A comparative analysis of machine learning classifiers for stroke prediction: A predictive analytics approach” (https://www.sciencedirect.com/science/article/pii/S2772442522000569), a machine learning approach is applied to diagnosing stroke. Many algorithms were applied, including SVM, random forest, and decision tree. The study used Random Over Sampling (ROS) to balance the data and 10-fold cross validation to gain a more accurate performance estimate. It found that SVM performed best with 99.99% accuracy, random forest performed second best with 99.87% accuracy, and decision tree performed worst with 96.9% accuracy.
The article, “Comparative Analysis of Classification Models for Healthcare Data Analysis” (https://www.ijcit.com/archives/volume7/issue4/IJCIT070404.pdf), developed classification models to predict heart disease. Many algorithms were tested to find the optimal classifier for the data. SVM (84% accurate) outperformed all other classifiers, including random forest (83% accurate) and decision tree (77.7% accurate). This study also tried other ensemble-learning methods (bagging and boosting) to try to improve accuracy. The decision tree's accuracy increased when paired with ensemble learning, but it did not outperform SVM.
The article, “A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Diabetes” (https://www.researchgate.net/publication/328020082_A_Comparative_Analysis_on_the_Evaluation_of_Classification_Algorithms_in_the_Prediction_of_Diabetes), aimed to predict Diabetes Mellitus using a multitude of machine learning algorithms on data from the National Institute of Diabetes and Digestive and Kidney Diseases. The study determined that logistic regression and gradient boosting were the best, with an accuracy of 79%. Random forest (76% accuracy) was the best among SVM, random forest, and decision tree. SVM with a linear kernel did not perform well in this study, achieving only 68% accuracy. This study does not appear to have balanced the imbalanced target class, which most likely had a negative impact on some algorithms' ability to learn the true patterns in the data.
Conclusion
Tree-based algorithms and Support Vector Machines are used in a wide array of applications. Depending on the task and data limitations, one type of algorithm may greatly outperform the other, as shown in the literature review. In the case of this data source, the Random Forest model slightly outperformed the Simple Linear SVM model.