In this homework, an analysis was performed using the SVM algorithm on the dataset from Homework 2: the diabetes health indicators dataset. This dataset was compiled from a telephone survey conducted by the CDC in 2015. The survey questions covered health-related risk behaviors, chronic health conditions, and the use of preventative services. Also included are demographic variables such as age, education, income, location, and race. Three .csv files are available for analysis; the one used in this homework was the diabetes_012_health_indicators_BRFSS2015.csv file.
This .csv file contains 253,680 survey responses (observations) and 21 features. The response variable is multiclass, containing 3 classes: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes. The author of this dataset points out that there is a class imbalance.
This dataset includes the following variables:

- Diabetes_012: 0 = no diabetes, 1 = prediabetes, 2 = diabetes
- HighBP: 0 = no high BP, 1 = high BP
- HighChol: 0 = no high cholesterol, 1 = high cholesterol
- CholCheck: 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years
- BMI: Body Mass Index
- Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
- Stroke: (Ever told) you had a stroke. 0 = no, 1 = yes
- HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI). 0 = no, 1 = yes
- PhysActivity: physical activity in past 30 days, not including job. 0 = no, 1 = yes
- Fruits: Consume fruit 1 or more times per day. 0 = no, 1 = yes
- Veggies: Consume vegetables 1 or more times per day. 0 = no, 1 = yes
- HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week). 0 = no, 1 = yes
- AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes
- NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes
- GenHlth: Would you say that in general your health is: scale 1-5 (level labels are given in the recoding below)
- MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
- PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes
- Sex: 0 = female, 1 = male
- Age: 13-level age category (level definitions are given in the recoding below)
- Education: Education level; scale 1-6 (level definitions are given in the recoding below)
- Income: Income scale; scale 1-8 (level definitions are given in the recoding below)
The goals of this analysis were to:
# Read the BRFSS 2015 file; all 22 columns are read as factors ("f")
# except BMI (the fifth column), which is read as numeric ("n").
diabetes_data <- read_csv(
file = "diabetes_012_health_indicators_BRFSS2015.csv",
col_types = "ffffnfffffffffffffffff")
A summary of the Diabetes Health Indicators Dataset is provided below:
summary(diabetes_data)
## Diabetes_012 HighBP HighChol CholCheck BMI
## 0.0:213703 1.0:108829 1.0:107591 1.0:244210 Min. :12.00
## 2.0: 35346 0.0:144851 0.0:146089 0.0: 9470 1st Qu.:24.00
## 1.0: 4631 Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits
## 1.0:112423 0.0:243388 0.0:229787 0.0: 61760 0.0: 92782
## 0.0:141257 1.0: 10292 1.0: 23893 1.0:191920 1.0:160898
##
##
##
##
##
## Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## 1.0:205841 0.0:239424 1.0:241263 0.0:232326 5.0:12081
## 0.0: 47839 1.0: 14256 0.0: 12417 1.0: 21354 3.0:75646
## 2.0:89084
## 4.0:31570
## 1.0:45299
##
##
## MentHlth PhysHlth DiffWalk Sex Age
## 0.0 :175680 0.0 :160052 1.0: 42675 0.0:141974 9.0 :33244
## 2.0 : 13054 30.0 : 19400 0.0:211005 1.0:111706 10.0 :32194
## 30.0 : 12088 2.0 : 14764 8.0 :30832
## 5.0 : 9030 1.0 : 11388 7.0 :26314
## 1.0 : 8538 3.0 : 8495 11.0 :23533
## 3.0 : 7381 5.0 : 7622 6.0 :19819
## (Other): 27909 (Other): 31959 (Other):87744
## Education Income
## 4.0: 62750 8.0 :90385
## 6.0:107325 7.0 :43219
## 3.0: 9478 6.0 :36470
## 5.0: 69910 5.0 :25883
## 2.0: 4043 4.0 :20135
## 1.0: 174 3.0 :15994
## (Other):21594
The factors above have been recoded for readability.
diabetes_data <- diabetes_data %>%
mutate(
Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
'6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
'11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
'4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
'5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
)
summary(diabetes_data)
## Diabetes_012 HighBP HighChol
## No Diabetes:213703 1.0:108829 1.0:107591
## Diabetes : 35346 0.0:144851 0.0:146089
## Prediabetes: 4631
##
##
##
##
## CholCheck BMI Smoker Stroke
## Yes Chol Check in 5 years:244210 Min. :12.00 1.0:112423 0.0:243388
## No Chol Check in 5 Years : 9470 1st Qu.:24.00 0.0:141257 1.0: 10292
## Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 0.0:229787 0.0: 61760 0.0: 92782 1.0:205841 0.0:239424
## 1.0: 23893 1.0:191920 1.0:160898 0.0: 47839 1.0: 14256
##
##
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Has Insurance:241263 0.0:232326 Poor :12081 0.0 :175680
## No Insurance : 12417 1.0: 21354 Good :75646 2.0 : 13054
## Very Good:89084 30.0 : 12088
## Fair :31570 5.0 : 9030
## Excellent:45299 1.0 : 8538
## 3.0 : 7381
## (Other): 27909
## PhysHlth DiffWalk Sex Age
## 0.0 :160052 1.0: 42675 0.0:141974 60-64 :33244
## 30.0 : 19400 0.0:211005 1.0:111706 65-69 :32194
## 2.0 : 14764 55-59 :30832
## 1.0 : 11388 50-54 :26314
## 3.0 : 8495 70-74 :23533
## 5.0 : 7622 45-49 :19819
## (Other): 31959 (Other):87744
## Education Income
## Grade 12/GED : 62750 Income>=75K :90385
## >= 4 Yrs College :107325 50K<=Income<75K:43219
## Grades 9 - 11 : 9478 35K<=Income<50K:36470
## 1-3 Yrs College : 69910 25K<=Income<35K:25883
## Grades 1-8 : 4043 20K<=Income<25K:20135
## No School/Kindergarten: 174 15K<=Income<20K:15994
## (Other) :21594
Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.
Figure 1: Histogram for BMI (the only continuous feature) in the Diabetes Health Indicators Dataset
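The plotting code is not shown in the report; a minimal sketch of how such a histogram could be produced with ggplot2 (the binwidth of 1 is an illustrative choice, not taken from the original):
# Sketch: histogram of the only continuous feature, BMI.
ggplot(diabetes_data, aes(x = BMI)) +
geom_histogram(binwidth = 1) +
labs(x = "BMI", y = "Count")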
Figure 1 shows that BMI is right-skewed. This right skewness could also have been deduced from the summary: for the BMI variable, the maximum is 98 while the mean is 28 and the minimum is 12, meaning the distribution's tail extends far to the right of center.
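As a quick numeric check of this observation, the sample skewness can be computed directly; a small sketch using the skewness function from the e1071 package (the same package that provides svm), where a positive value confirms the right skew:
# Positive skewness confirms the long right tail visible in the histogram.
e1071::skewness(diabetes_data$BMI)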
Figure 2: Boxplots for the Diabetes Health Indicators Dataset
Some findings that support the theoretical effects for some of the variables were discovered using the boxplots in Figure 2. For example, based on the Age boxplot, older respondents are more likely to have diabetes, and based on the BMI boxplot, respondents with diabetes tend toward higher BMI values; both patterns are consistent with age and BMI being well-established risk factors for type 2 diabetes.
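As with the histogram, the boxplot code is not shown; a minimal sketch of one such comparison (BMI by diabetes status), assuming ggplot2 is loaded:
# Sketch: compare the BMI distribution across the three diabetes classes.
ggplot(diabetes_data, aes(x = Diabetes_012, y = BMI)) +
geom_boxplot() +
labs(x = "Diabetes status", y = "BMI")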
Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. A correlation plot visualizes the relationships between features; note that correlations can only be computed for variables treated as numeric, so the factor columns must be represented by their underlying numeric codes.
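The corrplot call below references a diabetes_correlations object whose construction is not shown. A minimal sketch of one way it could have been built (an assumption: a fresh numeric copy of the raw file is used, since the factors in diabetes_data have already been recoded to text labels, and the matrix is stored in a list element to mirror the $correlations access below):
# Sketch (assumption): correlation matrix over a numeric copy of the raw data,
# wrapped in a list so that $correlations matches the usage below.
diabetes_numeric <- read_csv(
file = "diabetes_012_health_indicators_BRFSS2015.csv",
col_types = cols(.default = col_double())
)
diabetes_correlations <- list(correlations = cor(diabetes_numeric))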
corrplot(diabetes_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 3: Correlation (multicollinearity) plot for the numeric-coded predictor variables
Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The correlation with the largest magnitude has a value of 0.52; while this is above the upper bound of what would be considered a “low correlation”, it exceeds that bound by only 0.02. It is therefore reasonable to conclude that the features in this dataset are at most weakly correlated and that multicollinearity is not a major concern.
Interestingly, the correlation between Education and
Income is low, even though multiple studies have shown a strong correlation between education and income.
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes Prediabetes
## 0.84241170 0.13933302 0.01825528
The output above shows the percentage of respondents that do not have diabetes (84.24%), the percentage that have diabetes (13.9%), and the percentage that are prediabetic (1.8%). Note that in order to deal with the class imbalance using the approach applied later, only 2 classes can exist within the response variable. Since people with prediabetes make up only 1.8% of the observations, all observations where Diabetes_012 == Prediabetes were removed from the dataset.
The nature of this study changed slightly with the removal of a class. Instead of determining whether someone has no diabetes, prediabetes, or diabetes, the model now determines whether a person is at risk of diabetes or not. The modeling and prediction of the remaining classes still yielded valuable insights.
diabetes_data <- diabetes_data %>%
# Factor levels are ordered No Diabetes (1), Diabetes (2), Prediabetes (3),
# so converting to numeric and dropping code 3 removes the prediabetes class.
mutate(Diabetes_012 = as.numeric(Diabetes_012)) %>%
subset(Diabetes_012 != 3) %>%
mutate(Diabetes_012 = as.factor(Diabetes_012)) %>%
mutate(Diabetes_012 = dplyr::recode(Diabetes_012, '1' = 'No Diabetes', '2' = 'Diabetes'))
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
In its current form, the dataset is too large for the SVM to converge within an acceptable timeframe. Therefore, the dataset was reduced to 25% of its original size using stratified random sampling, which ensures that the distribution of the response classes within the sample matches their distribution in the full dataset. The sample.split function from the caTools package is employed to perform the stratified random sampling.
set.seed(123)
stratified_bool_vector <- sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.25)
diabetes_data_stratified <- subset(diabetes_data, stratified_bool_vector == TRUE)
diabetes_data_stratified %>%
select(Diabetes_012) %>%
table() %>%
prop.table()
## Diabetes_012
## No Diabetes Diabetes
## 0.8580836 0.1419164
The output above shows that the proportional distribution of values for the Diabetes_012 feature is close to that of the original dataset.
Handling outliers and noise and normalizing features, which are common preprocessing techniques, are not required when creating a decision tree model (Practical Machine Learning in R, page 301). The dataset is then split into training and testing sets.
set.seed(123)
original_split <- caTools::sample.split(diabetes_data_stratified$Diabetes_012, SplitRatio = 0.75)
diabetes_data_stratified_train <- subset(diabetes_data_stratified, original_split == TRUE)
diabetes_data_stratified_test <- subset(diabetes_data_stratified, original_split == FALSE)
prop.table(table(select(diabetes_data_stratified, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580836 0.1419164
prop.table(table(select(diabetes_data_stratified_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580851 0.1419149
prop.table(table(select(diabetes_data_stratified_test, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.858079 0.141921
The output above shows that there is a significant class imbalance. SMOTE from the DMwR package is applied only to the training dataset, so that the test set retains the original class distribution and the evaluation is not performed on synthetic observations.
# perc.over = 100 creates one synthetic case per original minority case
# (doubling the minority class); perc.under = 200 then samples two majority
# cases per synthetic case, which yields an exact 50/50 class balance.
diabetes_data_stratified_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_stratified_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_stratified_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.5 0.5
In order to create a fair comparison between models, the dataset that was reduced via stratified random sampling was also used to fit the decision tree algorithm. The results of this fitting are shown below.
diabetes_decision_tree_stratified <- rpart(
Diabetes_012 ~ .,
method = "class",
data = diabetes_data_stratified_train
)
rpart.plot(diabetes_decision_tree_stratified)
Figure 4: Decision tree for the diabetes health indicators dataset using all of the available features
varImp(diabetes_decision_tree_stratified) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 5: Variable importance plot for the decision tree model which uses all of the available features
diabetes_decision_tree_stratified_pred <- predict(diabetes_decision_tree_stratified, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_decision_tree_stratified_pred, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 9296 593
## Diabetes 4060 1616
##
## Accuracy : 0.7011
## 95% CI : (0.6938, 0.7082)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2584
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7316
## Specificity : 0.6960
## Pos Pred Value : 0.2847
## Neg Pred Value : 0.9400
## Prevalence : 0.1419
## Detection Rate : 0.1038
## Detection Prevalence : 0.3647
## Balanced Accuracy : 0.7138
##
## 'Positive' Class : Diabetes
##
roc_decision_tree <-
prediction(
predictions = predict(diabetes_decision_tree_stratified, diabetes_data_stratified_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_stratified_test$Diabetes_012
)
roc_perf_decision_tree <- performance(roc_decision_tree, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 6: ROC curve for the decision tree model using all of the available features
auc_decision_tree <- performance(roc_decision_tree, measure = "auc")
diabetes_decision_tree_auc <- unlist(slot(auc_decision_tree,"y.values"))
paste("Calculated AUC: ", diabetes_decision_tree_auc)
## [1] "Calculated AUC: 0.738143435923529"
The svm function in R allowed for the generation of an SVM model that uses all of the features in the training set to predict Diabetes_012. Several different kernels were used in order to compare performance across kernels, in addition to comparing against the results from the previous homework. These kernels were linear, polynomial, radial basis, and sigmoid, which are all of the kernels that can be used with the svm function in R.
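For reference, the four kernel functions, as defined in the e1071 documentation (with u and v denoting feature vectors and gamma, coef0, and degree denoting tuning parameters), are:

- linear: u'v
- polynomial: (gamma*u'v + coef0)^degree
- radial basis: exp(-gamma*|u - v|^2)
- sigmoid: tanh(gamma*u'v + coef0)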
diabetes_svm_linear <- svm(
Diabetes_012 ~ .,
kernel = "linear",
type = "C-classification",
data = diabetes_data_stratified_train
)
summary(diabetes_svm_linear)
##
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train,
## kernel = "linear", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 13470
##
## ( 6731 6739 )
##
##
## Number of Classes: 2
##
## Levels:
## No Diabetes Diabetes
diabetes_svm_pred_linear <- predict(diabetes_svm_linear, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_linear, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 10332 745
## Diabetes 3024 1464
##
## Accuracy : 0.7579
## 95% CI : (0.751, 0.7646)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.305
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.66274
## Specificity : 0.77358
## Pos Pred Value : 0.32620
## Neg Pred Value : 0.93274
## Prevalence : 0.14192
## Detection Rate : 0.09406
## Detection Prevalence : 0.28834
## Balanced Accuracy : 0.71816
##
## 'Positive' Class : Diabetes
##
roc_pred_linear <-
prediction(
predictions = as.numeric(diabetes_svm_pred_linear),
labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_linear <- performance(roc_pred_linear, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 7: ROC curve for the linear kernel SVM model using all of the available features
auc_perf_linear <- performance(roc_pred_linear, measure = "auc")
diabetes_auc_linear <- unlist(slot(auc_perf_linear,"y.values"))
paste("Calculated AUC: ", diabetes_auc_linear)
## [1] "Calculated AUC: 0.718164114215431"
diabetes_svm_polynomial <- svm(
Diabetes_012 ~ .,
kernel = "polynomial",
type = "C-classification",
data = diabetes_data_stratified_train
)
summary(diabetes_svm_polynomial)
##
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train,
## kernel = "polynomial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 23631
##
## ( 11812 11819 )
##
##
## Number of Classes: 2
##
## Levels:
## No Diabetes Diabetes
diabetes_svm_pred_polynomial <- predict(diabetes_svm_polynomial, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_polynomial, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 6766 248
## Diabetes 6590 1961
##
## Accuracy : 0.5607
## 95% CI : (0.5528, 0.5685)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1794
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8877
## Specificity : 0.5066
## Pos Pred Value : 0.2293
## Neg Pred Value : 0.9646
## Prevalence : 0.1419
## Detection Rate : 0.1260
## Detection Prevalence : 0.5494
## Balanced Accuracy : 0.6972
##
## 'Positive' Class : Diabetes
##
roc_pred_polynomial <-
prediction(
predictions = as.numeric(diabetes_svm_pred_polynomial),
labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_polynomial <- performance(roc_pred_polynomial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 8: ROC curve for the polynomial kernel SVM model using all of the available features
auc_perf_polynomial <- performance(roc_pred_polynomial, measure = "auc")
diabetes_auc_polynomial <- unlist(slot(auc_perf_polynomial,"y.values"))
paste("Calculated AUC: ", diabetes_auc_polynomial)
## [1] "Calculated AUC: 0.697160402236976"
diabetes_svm_radial <- svm(
Diabetes_012 ~ .,
kernel = "radial",
type = "C-classification",
data = diabetes_data_stratified_train
)
summary(diabetes_svm_radial)
##
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train,
## kernel = "radial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 13699
##
## ( 6847 6852 )
##
##
## Number of Classes: 2
##
## Levels:
## No Diabetes Diabetes
diabetes_svm_pred_radial <- predict(diabetes_svm_radial, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_radial, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 10262 693
## Diabetes 3094 1516
##
## Accuracy : 0.7567
## 95% CI : (0.7499, 0.7634)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3128
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6863
## Specificity : 0.7683
## Pos Pred Value : 0.3289
## Neg Pred Value : 0.9367
## Prevalence : 0.1419
## Detection Rate : 0.0974
## Detection Prevalence : 0.2962
## Balanced Accuracy : 0.7273
##
## 'Positive' Class : Diabetes
##
roc_pred_radial <-
prediction(
predictions = as.numeric(diabetes_svm_pred_radial),
labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_radial <- performance(roc_pred_radial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 9: ROC curve for the radial kernel SVM model using all of the available features
auc_perf_radial <- performance(roc_pred_radial, measure = "auc")
diabetes_auc_radial <- unlist(slot(auc_perf_radial,"y.values"))
paste("Calculated AUC: ", diabetes_auc_radial)
## [1] "Calculated AUC: 0.727313600830602"
diabetes_svm_sigmoid <- svm(
Diabetes_012 ~ .,
kernel = "sigmoid",
type = "C-classification",
data = diabetes_data_stratified_train
)
summary(diabetes_svm_sigmoid)
##
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train,
## kernel = "sigmoid", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 14048
##
## ( 7018 7030 )
##
##
## Number of Classes: 2
##
## Levels:
## No Diabetes Diabetes
diabetes_svm_pred_sigmoid <- predict(diabetes_svm_sigmoid, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_sigmoid, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 10318 749
## Diabetes 3038 1460
##
## Accuracy : 0.7567
## 95% CI : (0.7499, 0.7634)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3026
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6609
## Specificity : 0.7725
## Pos Pred Value : 0.3246
## Neg Pred Value : 0.9323
## Prevalence : 0.1419
## Detection Rate : 0.0938
## Detection Prevalence : 0.2890
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : Diabetes
##
roc_pred_sigmoid <-
prediction(
predictions = as.numeric(diabetes_svm_pred_sigmoid),
labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 10: ROC curve for the sigmoid kernel SVM model using all of the available features
auc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "auc")
diabetes_auc_sigmoid <- unlist(slot(auc_perf_sigmoid,"y.values"))
paste("Calculated AUC: ", diabetes_auc_sigmoid)
## [1] "Calculated AUC: 0.716734618147791"
data1 <- tribble(
~"Model",~"Accuracy",~"Kappa",~"Sensitivity",~"Specificity",~"AUC",
"Decision Tree (HW 2)", "0.7083", "0.2711", "0.7429", "0.7026", "0.7468",
"Decision Tree (Stratified)", "0.6996","0.2613","0.7438","0.6923","0.7378",
"Linear Kernel SVM","0.7606","0.3193","0.6886","0.7725","0.7305",
"Polynomial Kernel SVM", "0.5440","0.1642","0.8791","0.4886","0.6839",
"Radial Kernel SVM", "0.7589","0.3242","0.7076","0.7674","0.7375",
"Sigmoid Kernel SVM", "0.7593","0.3159","0.6849","0.7716","0.7282"
)
knitr::kable((data1), booktabs = TRUE)
| Model | Accuracy | Kappa | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| Decision Tree (HW 2) | 0.7083 | 0.2711 | 0.7429 | 0.7026 | 0.7468 |
| Decision Tree (Stratified) | 0.6996 | 0.2613 | 0.7438 | 0.6923 | 0.7378 |
| Linear Kernel SVM | 0.7606 | 0.3193 | 0.6886 | 0.7725 | 0.7305 |
| Polynomial Kernel SVM | 0.5440 | 0.1642 | 0.8791 | 0.4886 | 0.6839 |
| Radial Kernel SVM | 0.7589 | 0.3242 | 0.7076 | 0.7674 | 0.7375 |
| Sigmoid Kernel SVM | 0.7593 | 0.3159 | 0.6849 | 0.7716 | 0.7282 |
Table 1: Metrics for different model types
The metrics shown in Table 1 differ slightly from the results shown in the “Fitting the Decision Tree Model with the Reduced Dataset” and “Fitting the Support Vector Machine Model” sections because stratified random sampling involves a random draw; each run of the markdown file therefore produced slightly different metric values.
To create a fair comparison between the decision tree model and the SVMs, the decision tree model was refit to the stratified randomly sampled dataset. However, as expected, this sampling method introduced variations in the resulting decision tree model compared to the model generated using the entire dataset. The tree that was generated made splits on 5 of the 21 available features, including GenHlth, DiffWalk, HighBP, and HeartDiseaseorAttack. In Homework 2, only 3 splits were made, based on GenHlth, DiffWalk, and HighBP. The variable importance plot generated from the decision tree model shows that GenHlth, DiffWalk, HighBP, and BMI are all significant features; in Homework 2, when using all of the data, HeartDiseaseorAttack was significant in addition to the features mentioned previously. Table 1 also shows a slightly lower accuracy score (0.6996 vs 0.7083) and a slightly lower AUC value (0.7378 vs 0.7468) for the refit model. Taken together, the decision tree constructed from the sampled dataset exhibited structural differences compared to the model derived from the complete dataset: the sampling process influenced the decision paths and node splits within the tree, and the importance attributed to the predictor variables shifted between the two models. While certain variables were deemed significant in the full-dataset model, their importance diminished or shifted in the sampled-dataset model, highlighting the potential impact of sampling on variable selection and model interpretation.
Based on the metrics for the different SVM models in Table 1, several conclusions can be drawn. The linear kernel SVM has the highest accuracy among the SVM models, followed closely by the sigmoid and radial kernel SVMs, with the polynomial kernel SVM having the lowest accuracy. The radial kernel SVM has the highest Kappa, indicating the best agreement between predicted and observed values among the SVM models, while the polynomial kernel SVM has the lowest Kappa. The polynomial kernel SVM has the highest sensitivity, meaning it is best at correctly identifying individuals with diabetes, but it also has by far the lowest specificity, so it misclassifies a large share of individuals without diabetes; the linear kernel SVM has the highest specificity. The radial kernel SVM has the highest AUC among the SVM models, indicating that it does a reasonably good job at discriminating between whether or not someone has diabetes, while the polynomial kernel SVM has the lowest AUC.

Overall, the linear kernel SVM performs well across accuracy, specificity, Kappa, and AUC, and the radial kernel SVM also performs competitively, particularly in terms of Kappa and AUC. The polynomial kernel SVM shows comparatively lower performance across most metrics, suggesting it may not be well suited for diabetes prediction. The sigmoid kernel SVM exhibits performance similar to the linear and radial kernel SVMs, but slightly lower on most metrics. In summary, the linear and radial kernel SVM models appear to be the most promising based on the provided metrics. With that said, the linear kernel SVM model was used to compare against the decision tree model.
In order to create a fair comparison, only the SVM and the decision tree model that were fit to the stratified randomly sampled dataset are compared in this section. Sampling introduces biases due to the selection of specific samples, and comparing two models fit to what are essentially two different datasets introduces inconsistency in evaluation, making it challenging to draw meaningful conclusions.
Based on the metrics in Table 1, the linear kernel SVM model has a higher accuracy (0.7606) than the decision tree model with stratified sampling (0.6996). The linear kernel SVM also has a higher Kappa (0.3193 vs 0.2613), indicating better agreement between predicted and observed values. The decision tree model has a higher sensitivity (0.7438 vs 0.6886), showing that it performs better at correctly identifying individuals with diabetes, while the linear kernel SVM has a higher specificity (0.7725 vs 0.6923), showing that it performs better at correctly identifying individuals without diabetes. The AUC for the linear kernel SVM (0.7305) is slightly lower than that of the decision tree model (0.7378); the two values are comparable, which shows that both models are similarly capable of discerning whether or not someone has diabetes. In summary, the linear kernel SVM exhibits better overall performance than the decision tree model with stratified sampling, particularly in terms of accuracy, Kappa, and specificity, while the decision tree model retains a slight edge in sensitivity.
The analysis compares various machine learning models for diabetes prediction based on health indicators. The decision tree model exhibits structural differences and lower performance when applied to the stratified randomly sampled dataset compared to its original counterpart. Among the SVM models, the linear kernel SVM demonstrates superior accuracy and specificity, while the radial kernel SVM performs best in terms of Kappa and AUC. In a direct comparison, the linear kernel SVM outperforms the decision tree model from stratified sampling in accuracy, Kappa, and specificity, although the decision tree model displays slightly higher sensitivity. Both models show comparable AUC values, indicating similar effectiveness at determining whether or not someone has diabetes. Additionally, when using a stratified sampling dataset, one must be aware of factors such as generalization, potential biases, shifts in variable importance, and consistency in evaluation. Overall, the linear kernel SVM emerges as the most promising model for diabetes prediction, offering a balanced mix of accuracy, interpretability, and performance. Further validation and exploration are recommended to ensure robust predictions suitable for clinical applications.
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
In this study, an SVM was fit to a dataset containing COVID-19 patient data in order to create a model capable of predicting whether or not a person was infected with COVID-19. The response variable contained 3 levels: not infected, mildly infected, and severely infected, and in order to create a highly accurate model with very few missed predictions, the cost parameter was set to 10. When the performance of the SVM model was compared to all other supervised models, the SVM model was the most accurate. Accuracy was very important in this research because of the severity of COVID-19, and because of the evolving nature of the disease, the results will also evolve as new features are derived from new symptoms. Just as COVID-19 patient data contains various factors such as symptoms, demographics, and medical history, building energy consumption data can include numerous variables like weather conditions, building size, occupancy patterns, and HVAC system efficiency; SVMs can effectively analyze such diverse datasets to make predictions. In the study, the SVM model was optimized for accuracy, aiming to minimize missed predictions, especially given the severity of COVID-19. Similarly, in building science, accurate prediction of energy consumption is important for optimizing building operations, ensuring energy efficiency, and minimizing costs.
Link: http://infor.seaninstitute.org/index.php/infokum/article/view/402/333
Predicting pneumonia accurately is crucial because it can lead to better patient care and outcomes. Early detection and treatment of pneumonia can prevent complications and reduce the severity of the illness, and by using machine learning models, healthcare providers can make faster and more accurate diagnoses, leading to timely treatment and improved patient outcomes. The goal of this study was to find the best way to predict and classify pneumonia cases using four different machine learning models: Support Vector Machine (SVM), Decision Tree, Logistic Regression, and Random Forest. Even though the SVM model was more accurate than the decision tree model, the logistic regression model ended up being the most accurate of the four. Just as early detection of pneumonia can lead to timely treatment and better patient outcomes, early detection of energy inefficiencies or anomalies in building energy consumption can lead to timely optimization and cost savings. Machine learning models can help detect these inefficiencies by analyzing patterns in energy consumption data. Accurate predictions are crucial for effective energy management in buildings; the study's focus on accuracy and consistency highlights the importance of using reliable models for energy consumption prediction. A model that can predict energy usage accurately can help in budgeting, planning, and optimizing energy resources effectively.
Link: https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0
Metformin is a popular diabetes drug that has a fatality rate of 30 to 50% if unsafe levels are taken. Therefore, this research was undertaken in order to generate a data-driven model for early prognosis prediction, with the goal of reducing the death rate. In this study, a decision tree model and an SVM model were fit to a dataset consisting of poisoning records collected by the American Association of Poison Control Centers. For the decision tree model, the ranking of the features showed that acidosis was the most important feature, followed by hypoglycemia and electrolyte abnormality. While both algorithms were found to be powerful in determining whether or not someone was poisoned by unsafe levels of metformin, the SVM model more precisely predicted the correct outcome when the models were evaluated on a test set (70-30 split). From a building science perspective, the SVM algorithm can also be used for regression, and based on the results from this research, an SVM model could be used to accurately predict energy consumption, allowing building managers and engineers to implement more efficient strategies, reducing energy costs and environmental impact in NYC buildings. The decision tree model was able to display the most important features; similarly, in analyzing building energy consumption, we might find that factors like building insulation quality, HVAC system efficiency, and occupancy patterns are key predictors. Using machine learning, we could develop models with high accuracy in predicting energy consumption and identifying potential energy-saving opportunities for buildings in NYC.
Link: https://versita.com/menuscript/index.php/Versita/article/view/832/911
The goal of this study was to build a system that can recognize handwritten digits. The decision tree and SVM algorithms were both applied to a handwritten digits dataset accessed through Sklearn, a popular machine learning library in Python (the study reports roughly 70,000 sample images of handwritten digits). Between the two algorithms, the support vector machine had a higher average accuracy than the decision tree, and the difference was statistically significant when tested with a t-test (p < 0.001) at a 95% confidence level. This relates to my area of expertise because many energy consumption meters within NYC are traditional analog meters. This is problematic because newer meters are connected to a building automation system from which energy consumption can be downloaded onto a computer for analysis, while older analog meters are not connected to any such system and have no IoT (Internet of Things) capabilities. Therefore, as a test to track the readouts from an analog meter, I installed a time-lapse camera facing the meter, which takes a picture every minute. The results from this research can therefore be used to determine which algorithm would be most accurate for converting the images from the time-lapse camera into meter readout values.
Based on the research I have done regarding decision trees vs. SVMs, I want to conclude with several key distinctions between the two algorithms as they pertain to my area of expertise.
SVMs would be beneficial for my area of expertise because they are effective at handling complex, high-dimensional data, which is common in building energy consumption datasets that may include numerous variables such as weather conditions, building characteristics, and occupancy patterns. In my field, a year's worth of energy consumption data is needed to construct a model because New York goes through four seasons of weather changes, which have a significant impact on energy consumption. SVMs can be applied to predict energy consumption patterns in buildings by analyzing historical data and extracting patterns related to factors like weather conditions, building occupancy, and HVAC system efficiency. They can also be used for anomaly detection, identifying unusual energy usage patterns that may indicate equipment malfunction or inefficient operation. However, SVMs are computationally expensive and require careful tuning of hyperparameters, such as the choice of kernel and the regularization (cost) parameter, as sketched below.
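A minimal sketch of such tuning with e1071's tune.svm helper (the grid values are illustrative assumptions, not recommendations):
# Sketch: cross-validated grid search over cost and gamma for the radial
# kernel; tune.svm reports the best parameter combination found.
svm_tuning <- tune.svm(
Diabetes_012 ~ .,
data = diabetes_data_stratified_train,
kernel = "radial",
cost = c(0.1, 1, 10),
gamma = c(0.01, 0.1, 1)
)
summary(svm_tuning)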
Decision trees are easy to interpret and understand, making them suitable for explaining energy consumption prediction models to building stakeholders and decision-makers. Decision trees can also automatically handle feature interactions and nonlinear relationships, which is advantageous for capturing complex relationships between input features and energy consumption. They can also be applied to predictive maintenance by identifying decision rules, based on historical data, that predict equipment failures or energy inefficiencies. However, they are prone to overfitting, and their simplicity means they may fail to capture very complex relationships, which would probably warrant the use of a neural network.
In summary, one must be cognizant of the specific characteristics of the dataset, the need for model interpretability, and the computational resources available before deciding whether to use a decision tree or an SVM.