In this homework, we explored the performance of two different
decision trees and a random forest on a diabetes health indicators
dataset. This dataset was compiled from a telephone survey conducted by
the CDC in 2015. Questions asked in the survey covered health-related
risk behaviors, chronic health conditions, and the use of preventive
services. Demographic variables such as age, education, income, location,
and race are also included. There are 3 .csv files that can be used for
analysis; the one used in this homework was the
diabetes_012_health_indicators_BRFSS2015.csv file.
This .csv file contains 253,680 survey responses (observations) and 21
features. The response variable is multiclass, in that it contains 3
different classes: 0 for no diabetes, 1 for prediabetes, and 2 for
diabetes. The author of this dataset points out that there is a class
imbalance.
This dataset includes the following variables:
- Diabetes_012: 0 = no diabetes, 1 = prediabetes, 2 = diabetes
- HighBP: 0 = no high BP, 1 = high BP
- HighChol: 0 = no high cholesterol, 1 = high cholesterol
- CholCheck: 0 = no cholesterol check in 5 years, 1 = cholesterol check in 5 years
- BMI: Body Mass Index
- Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
- Stroke: (Ever told) you had a stroke. 0 = no, 1 = yes
- HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI). 0 = no, 1 = yes
- PhysActivity: physical activity in the past 30 days, not including job. 0 = no, 1 = yes
- Fruits: Consume fruit 1 or more times per day. 0 = no, 1 = yes
- Veggies: Consume vegetables 1 or more times per day. 0 = no, 1 = yes
- HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week). 0 = no, 1 = yes
- AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes
- NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes
- GenHlth: Would you say that in general your health is: scale 1-5
- MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
- PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes
- Sex: 0 = female, 1 = male
- Age: 13-level age category
- Education: Education level; scale 1-6
- Income: Income scale; scale 1-8
Decision trees have pros and cons, some of which are discussed here. This homework will explain how the decision trees generated for this analysis can counter the negative perception of decision trees by addressing a real-world problem.
# In col_types, "f" = factor and "n" = numeric; BMI is the only numeric column.
diabetes_data <- read_csv(
file = "diabetes_012_health_indicators_BRFSS2015.csv",
col_types = "ffffnfffffffffffffffff")
A summary of the Diabetes Health Indicators Dataset is provided below:
summary(diabetes_data)
## Diabetes_012 HighBP HighChol CholCheck BMI
## 0.0:213703 1.0:108829 1.0:107591 1.0:244210 Min. :12.00
## 2.0: 35346 0.0:144851 0.0:146089 0.0: 9470 1st Qu.:24.00
## 1.0: 4631 Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits
## 1.0:112423 0.0:243388 0.0:229787 0.0: 61760 0.0: 92782
## 0.0:141257 1.0: 10292 1.0: 23893 1.0:191920 1.0:160898
##
##
##
##
##
## Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## 1.0:205841 0.0:239424 1.0:241263 0.0:232326 5.0:12081
## 0.0: 47839 1.0: 14256 0.0: 12417 1.0: 21354 3.0:75646
## 2.0:89084
## 4.0:31570
## 1.0:45299
##
##
## MentHlth PhysHlth DiffWalk Sex Age
## 0.0 :175680 0.0 :160052 1.0: 42675 0.0:141974 9.0 :33244
## 2.0 : 13054 30.0 : 19400 0.0:211005 1.0:111706 10.0 :32194
## 30.0 : 12088 2.0 : 14764 8.0 :30832
## 5.0 : 9030 1.0 : 11388 7.0 :26314
## 1.0 : 8538 3.0 : 8495 11.0 :23533
## 3.0 : 7381 5.0 : 7622 6.0 :19819
## (Other): 27909 (Other): 31959 (Other):87744
## Education Income
## 4.0: 62750 8.0 :90385
## 6.0:107325 7.0 :43219
## 3.0: 9478 6.0 :36470
## 5.0: 69910 5.0 :25883
## 2.0: 4043 4.0 :20135
## 1.0: 174 3.0 :15994
## (Other):21594
The factor levels above are recoded below for readability.
diabetes_data <- diabetes_data %>%
mutate(
Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
'6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
'11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
'4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
'5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
)
summary(diabetes_data)
## Diabetes_012 HighBP HighChol
## No Diabetes:213703 1.0:108829 1.0:107591
## Diabetes : 35346 0.0:144851 0.0:146089
## Prediabetes: 4631
##
##
##
##
## CholCheck BMI Smoker Stroke
## Yes Chol Check in 5 years:244210 Min. :12.00 1.0:112423 0.0:243388
## No Chol Check in 5 Years : 9470 1st Qu.:24.00 0.0:141257 1.0: 10292
## Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 0.0:229787 0.0: 61760 0.0: 92782 1.0:205841 0.0:239424
## 1.0: 23893 1.0:191920 1.0:160898 0.0: 47839 1.0: 14256
##
##
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Has Insurance:241263 0.0:232326 Poor :12081 0.0 :175680
## No Insurance : 12417 1.0: 21354 Good :75646 2.0 : 13054
## Very Good:89084 30.0 : 12088
## Fair :31570 5.0 : 9030
## Excellent:45299 1.0 : 8538
## 3.0 : 7381
## (Other): 27909
## PhysHlth DiffWalk Sex Age
## 0.0 :160052 1.0: 42675 0.0:141974 60-64 :33244
## 30.0 : 19400 0.0:211005 1.0:111706 65-69 :32194
## 2.0 : 14764 55-59 :30832
## 1.0 : 11388 50-54 :26314
## 3.0 : 8495 70-74 :23533
## 5.0 : 7622 45-49 :19819
## (Other): 31959 (Other):87744
## Education Income
## Grade 12/GED : 62750 Income>=75K :90385
## >= 4 Yrs College :107325 50K<=Income<75K:43219
## Grades 9 - 11 : 9478 35K<=Income<50K:36470
## 1-3 Yrs College : 69910 25K<=Income<35K:25883
## Grades 1-8 : 4043 20K<=Income<25K:20135
## No School/Kindergarten: 174 15K<=Income<20K:15994
## (Other) :21594
Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.
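The absence of missing values can also be verified directly; the one-line check below was not part of the original output and is included only as an illustration. Given the summary above, it should return FALSE.
# anyNA() returns TRUE if any value in the data frame is NA.
anyNA(diabetes_data)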
Figure 1: Histogram of BMI (the only continuous feature) in the Diabetes Health Indicators Dataset
Figure 1 shows that BMI is right-skewed. This skewness could
also have been deduced from the summary: for the BMI variable,
the maximum is 98, while the mean is roughly 28 and the minimum is 12.
Figure 2: Boxplots for the Diabetes Health Indicators Dataset
The boxplots in Figure 2 revealed some findings that support the
theoretical effects of some of the variables. Based on the age
boxplot, older people are, theoretically, more likely to have heart
disease. Likewise, when viewing the MaxHR variable, people with a
lower maximum heart rate are, on average, more likely to have heart
disease. Finally, based on the boxplot, people with a higher
Oldpeak are, on average, more likely to have heart disease, which
makes sense given that an Oldpeak of ±1 is indicative of a serious
health condition.
Finally, it is important to understand which features are correlated with one another in order to detect and avoid multicollinearity in the models. A correlation plot visualizes these relationships, although it can only be computed for continuous (numeric) variables.
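The corrplot() call below uses an object named diabetes_correlations that is created outside the code shown here. The sketch below is a hypothetical reconstruction only: it assumes the correlations were computed with cor() on the raw numeric encoding of the predictor columns and then wrapped in a list so that diabetes_correlations$correlations works.
# Hypothetical reconstruction of `diabetes_correlations` (the original
# computation is not shown): pairwise Pearson correlations of the raw
# numeric encoding of the predictor columns, wrapped in a list.
library(readr)
raw_numeric <- read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
predictors <- raw_numeric[, setdiff(names(raw_numeric), "Diabetes_012")]
diabetes_correlations <- list(correlations = cor(predictors))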
corrplot(diabetes_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 3: Multicollinearity plot for continuous predictor variables
Calkins indicates that "…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation". The correlation with the largest magnitude has a value of 0.52; while this is above the upper bound of what would be considered a "low correlation", it exceeds that bound by only 0.02. Therefore, it is reasonable to say that the features in this dataset have low pairwise correlation.
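For reference, the largest off-diagonal correlation magnitude can be pulled directly from the correlation matrix. This small check is an addition, assuming diabetes_correlations$correlations is the matrix plotted above.
# lower.tri() excludes the diagonal of 1s and the duplicated upper triangle.
cor_mat <- diabetes_correlations$correlations
max(abs(cor_mat[lower.tri(cor_mat)]))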
Interestingly, the correlation between Education and
Income is low, even though multiple studies have shown a strong
correlation between education and income.
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes Prediabetes
## 0.84241170 0.13933302 0.01825528
The output above shows the percentage of respondents that do not have
diabetes (84.24%), the percentage that have diabetes (13.93%), and the
percentage that are prediabetic (1.83%). Note that to deal with the class
imbalance using SMOTE later on, only 2 classes can exist within the
response variable. Since people with prediabetes make up only 1.83% of
the total number of observations, all of the observations where
Diabetes_012 == Prediabetes were removed from the dataset.
The nature of this study changed slightly with the removal of a class:
instead of determining whether someone is not at risk of having diabetes,
is at risk of being prediabetic, or is at risk of being diabetic, the
model will now only determine whether a person is at risk or not at risk
of getting diabetes. Modeling and predicting the remaining two classes
still yielded valuable insights.
# The factor levels of Diabetes_012 are, in order: No Diabetes (1),
# Diabetes (2), Prediabetes (3), so dropping level 3 removes the
# prediabetes observations before relabelling the remaining two classes.
diabetes_data <- diabetes_data %>%
mutate(Diabetes_012 = as.numeric(Diabetes_012)) %>%
subset(Diabetes_012 != 3) %>%
mutate(Diabetes_012 = as.factor(Diabetes_012)) %>%
mutate(Diabetes_012 = dplyr::recode(Diabetes_012, '1' = 'No Diabetes', '2' = 'Diabetes'))
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
Outlier removal, noise handling, and normalization, which are common preprocessing steps, do not need to be applied when creating a decision tree model (Practical Machine Learning in R, page 301). The dataset is then split into training and test sets.
set.seed(123)
# sample.split() performs a stratified split, preserving the relative
# frequency of each Diabetes_012 class in both the training and test sets.
original_split <- caTools::sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.75)
diabetes_data_train <- subset(diabetes_data, original_split == TRUE)
diabetes_data_test <- subset(diabetes_data, original_split == FALSE)
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580736 0.1419264
prop.table(table(select(diabetes_data_test, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580836 0.1419164
The output above shows that there is still a significant class
imbalance. SMOTE from the DMwR package is applied only to the
training dataset, so that the test set retains the original class
distribution.
# perc.over = 100 doubles the minority (Diabetes) class with synthetic
# examples, and perc.under = 200 keeps two majority cases per synthetic
# minority case, which yields the 50/50 balance shown below.
diabetes_data_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.5 0.5
The rpart function in R allowed for the generation of a
decision tree model that uses all of the features in the training set to
build a model that predicts Diabetes_012:
diabetes_mod1 <- rpart(
Diabetes_012 ~ .,
method = "class",
data = diabetes_data_train
)
rpart.plot(diabetes_mod1)
Figure 4: Decision tree for the diabetes health indicators dataset using all of the available features
The decision tree that was generated made several splits based on 3
of the 21 available features: GenHlth,
DiffWalk, and HighBP. While the exact cause of
diabetes has not been discovered, researchers point to "a
combination of genetics, lifestyle, and environmental factors that can
contribute to its onset". Therefore, GenHlth makes
sense as a significant factor in determining whether someone is at risk of
developing diabetes. High blood pressure (HighBP) commonly
accompanies diabetes, but it does not cause diabetes by itself.
Diabetes also affects gait patterns, which is probably why the
DiffWalk variable was also identified as a significant feature.
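The splits can also be read as plain-language rules using rpart.rules() from the rpart.plot package; this optional step was not part of the original analysis and is shown only as a sketch.
# Print each terminal node of the fitted tree as a readable rule;
# cover = TRUE adds the percentage of training observations per rule.
rpart.rules(diabetes_mod1, cover = TRUE)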
varImp(diabetes_mod1) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 5: Variable importance plot for the decision tree model which uses all of the available features
The variable importance plot generated from the decision tree model
shows that GenHlth, DiffWalk,
HeartDiseaseorAttack, BMI, and
HighBP are all significant features. These features make
sense for determining whether or not someone has diabetes: people
who are overweight or who have cardiovascular problems tend to be at
higher risk for diabetes.
diabetes_pred1 <- predict(diabetes_mod1, diabetes_data_test, type = "class")
confusionMatrix(diabetes_pred1, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 37536 2272
## Diabetes 15890 6564
##
## Accuracy : 0.7083
## 95% CI : (0.7047, 0.7119)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2711
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7429
## Specificity : 0.7026
## Pos Pred Value : 0.2923
## Neg Pred Value : 0.9429
## Prevalence : 0.1419
## Detection Rate : 0.1054
## Detection Prevalence : 0.3606
## Balanced Accuracy : 0.7227
##
## 'Positive' Class : Diabetes
##
The confusion matrix reveals an accuracy of 0.7083. Note that when a multiple logistic regression model and a naive Bayes model were fit to this same dataset in Homework 1, the accuracies were 0.7737 and 0.7678, respectively, indicating that the decision tree model performs worse on the test dataset.
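For reference, caret can also report precision, recall, and F1 for the positive class directly by passing mode = "prec_recall" to confusionMatrix(); this optional variant was not part of the original analysis.
# Same confusion matrix, but reported with precision/recall/F1 statistics.
confusionMatrix(diabetes_pred1, diabetes_data_test$Diabetes_012, positive = "Diabetes", mode = "prec_recall")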
roc_pred1 <-
prediction(
predictions = predict(diabetes_mod1, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
roc_perf1 <- performance(roc_pred1, measure = "tpr", x.measure = "fpr")
plot(roc_perf1, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 6: ROC curve for the decision tree model using all of the available features
auc_perf1 <- performance(roc_pred1, measure = "auc")
diabetes_auc1 <- unlist(slot(auc_perf1,"y.values"))
paste("Calculated AUC: ", diabetes_auc1)
## [1] "Calculated AUC: 0.746777432125331"
The AUC values for the multiple logistic regression and naive Bayes models fit to this same dataset in Homework 1 were 0.815 and 0.798, respectively, indicating that the decision tree also performs worse by this metric.
Several key features were selected from the Diabetes Health
Indicators Dataset based on their association with whether or not someone
actually has diabetes. In this second decision tree, 5 of the most
important features shown in Figure 5 (GenHlth, DiffWalk,
HeartDiseaseorAttack, BMI, and
HighBP) were omitted. This was done to compare the outputs
and metrics of two very different decision trees. The features selected
instead, along with external evidence for each, are listed below:
- HighChol: WebMD indicates that high cholesterol tends to increase the risk of diabetes.
- Smoker: The CDC points out that people who smoke are 30% to 40% more likely to develop diabetes.
- PhysActivity: The CDC points out that people who do not get enough physical activity are at a higher risk of developing diabetes.
- Veggies: UCLA Health indicates that people whose diets are rich in vegetables have a lower risk of developing diabetes.
- HvyAlcoholConsump: Drinkaware indicates that heavy alcohol consumption can contribute to diabetes by reducing the body's sensitivity to insulin.
- Sex: The CDC points out that men store more fat in their bellies than women, and this is a risk factor for diabetes.
- Age: The American Diabetes Association points out that "older adults are at high risk for both diabetes and prediabetes".
- Education: MedStar Health points out that people with "lower income and less education are two to four times more likely to develop diabetes".
The rpart function in R allowed for the generation of a
decision tree model that uses HighChol,
Smoker, PhysActivity, Veggies,
HvyAlcoholConsump, Sex, Age,
and Education in the training set to build a model that
predicts Diabetes_012:
diabetes_mod2 <- rpart(
Diabetes_012 ~ HighChol + Smoker + PhysActivity + Veggies + HvyAlcoholConsump + Sex + Age + Education,
method = "class",
data = diabetes_data_train
)
rpart.plot(diabetes_mod2)
Figure 7: Decision tree for the diabetes health indicators
dataset using just HighChol, Smoker,
PhysActivity, Veggies,
HvyAlcoholConsump, Sex, Age,
and Education
The decision tree shown in Figure 7 uses 4 different features to make
the splits: HighChol, PhysActivity,
Age, and Education.
varImp(diabetes_mod2) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 8: Variable importance plot for the decision tree shown in Figure 7
The variable importance plot in Figure 8 shows that PhysActivity,
Age, Education, Veggies, and
HighChol contribute the most to determining whether or not
someone has diabetes.
diabetes_pred2 <- predict(diabetes_mod2, diabetes_data_test, type = "class")
confusionMatrix(diabetes_pred2, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 35550 3194
## Diabetes 17876 5642
##
## Accuracy : 0.6616
## 95% CI : (0.6579, 0.6653)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1795
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63852
## Specificity : 0.66541
## Pos Pred Value : 0.23990
## Neg Pred Value : 0.91756
## Prevalence : 0.14192
## Detection Rate : 0.09062
## Detection Prevalence : 0.37773
## Balanced Accuracy : 0.65197
##
## 'Positive' Class : Diabetes
##
The statistics above show an accuracy of 0.6616, which is worse than that of the decision tree generated using all of the available features.
roc_pred2 <-
prediction(
predictions = predict(diabetes_mod2, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
roc_perf2 <- performance(roc_pred2, measure = "tpr", x.measure = "fpr")
plot(roc_perf2, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 9: ROC curve for the decision tree shown in Figure 7
auc_perf2 <- performance(roc_pred2, measure = "auc")
diabetes_auc2 <- unlist(slot(auc_perf2,"y.values"))
paste("Calculated AUC: ", diabetes_auc2)
## [1] "Calculated AUC: 0.70366657692332"
Similarly, the AUC is worse than the AUC calculated when all of the features were used.
The randomForest package will be used to generate a
random forest model. This model requires the user to input a value for
mtry, which is the number of features randomly selected as
split candidates at each node.
Practical Machine Learning in R explains the following:
“Based on the documentation provided by the randomForest
package, the default value for mtry is the square root of
the number of features in the dataset when working on a classification
problem.”
Therefore, mtry will be set to 4 for the diabetes health
indicators dataset.
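As a quick check of that default, assuming the post-SMOTE training set still contains Diabetes_012 plus the 21 predictors:
# Default mtry for classification: floor of the square root of the
# number of predictors. Diabetes_012 is the response, leaving 21 predictors.
p <- ncol(diabetes_data_train) - 1
floor(sqrt(p))
## [1] 4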
rf_mod <- train(
Diabetes_012 ~ .,
data = diabetes_data_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 4)
)
plot(varImp(rf_mod), top = 10)
Figure 11: Variable importance plot for the random forest model generated using all of the available features
Figure 11 reveals that BMI was very important in
predicting whether a person has diabetes, followed by a significant
drop in importance for the next variable, DiffWalk.
rf_pred <- predict(rf_mod, diabetes_data_test)
confusionMatrix(rf_pred, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 41417 2779
## Diabetes 12009 6057
##
## Accuracy : 0.7625
## 95% CI : (0.7591, 0.7658)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3208
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68549
## Specificity : 0.77522
## Pos Pred Value : 0.33527
## Neg Pred Value : 0.93712
## Prevalence : 0.14192
## Detection Rate : 0.09728
## Detection Prevalence : 0.29016
## Balanced Accuracy : 0.73036
##
## 'Positive' Class : Diabetes
##
The statistics shown above reveal an accuracy of 0.7625, and while this is better than both of the decision trees that were generated for this homework, it is still worse than that of the naive Bayes and multiple logistic regression models from Homework 1. Note that it took almost 13 minutes to fit this model, while runtimes for the models generated for Homework 1 were much shorter.
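The 13-minute figure was observed informally; if the runtime needed to be recorded explicitly, the train() call could be wrapped in system.time(), as sketched below with the same caret call as above (this wrapper is an assumption, not part of the original code).
# system.time() reports the elapsed seconds needed to fit the random forest.
rf_timing <- system.time(
rf_mod <- train(
Diabetes_012 ~ .,
data = diabetes_data_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 4)
)
)
rf_timing["elapsed"]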
rf_pred <-
prediction(
predictions = predict(rf_mod, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
rf_perf <- performance(rf_pred, measure = "tpr", x.measure = "fpr")
plot(rf_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 12: ROC curve for the random forest model
auc_perf_rf <- performance(rf_pred, measure = "auc")
diabetes_auc_rf <- unlist(slot(auc_perf_rf,"y.values"))
paste("Calculated AUC: ", diabetes_auc_rf)
## [1] "Calculated AUC: 0.804684760085056"
The AUC for the random forest model is slightly better than the naive Bayes model that was generated in Homework 1, but still worse than the multiple logistic regression model.
The decision tree model that was generated using all of the features performed better than the decision tree model that was generated using a subset of the features. Interestingly, both trees have the same number of splits, only 5 in total, which makes them easily readable to a non-technical audience. For example, in the second decision tree, it is easy to tell that respondents below the age of 44 without high cholesterol will probably not have, or be at risk for, diabetes. In the DeciZone article, one of the "ugly" points was usability; because both decision trees have so few splits, they are not large and are easy to navigate from a usability standpoint. Both decision trees also took only seconds to fit, so if new data were fed in, the models could be refit in seconds.

The DeciZone article also lists complexity as one of the "bad" points, but that was not much of a problem for either of the generated trees. One could imagine that with more features and more observations, a decision tree could have too many branches to navigate; that is not a problem for this particular dataset. In a business setting, it would be beneficial to show the results of this analysis to healthcare stakeholders and policymakers. According to the American Diabetes Association, the annual healthcare cost of diabetes is $412.9 billion, which suggests that models like these could generate significant savings by identifying and mitigating factors that contribute to diabetes.
With that being said, none of the models generated in this homework
fared better than the multiple logistic regression model. The models
themselves, however, offer valuable insight into which features were
important, and the importance of those features can be interpreted as
factors that contribute to whether or not someone is at risk of, or has,
diabetes. Note that while the second decision tree has the lowest
accuracy, each of its features was linked to an article or study
indicating that feature's relevance to people with diabetes, so there is
external evidence supporting the results of that particular tree and
conclusions can be drawn from it (e.g., PhysActivity having a high
importance for the response variable, Diabetes_012,
indicates that those who are physically active are less likely to
develop diabetes). Note also that each model had a different set of
important predictors: BMI, the most important feature in the random
forest model, is only the 4th most important feature for the decision
tree model. The random forest model is more accurate than both decision
tree models, but it also had the longest runtime of all the models and is
less interpretable. Therefore, in terms of interpretability, it would be
best to go with a decision tree model for this particular homework, but
overall it would probably be best to go with a multiple logistic
regression model for both interpretability and predictive accuracy.