In this homework, we explored the performance of two different
decision trees and a random forest on a diabetes health indicators
dataset. This dataset was compiled from a telephone survey conducted by
the CDC in 2015. Questions asked in the survey covered health-related
risk behaviors, chronic health conditions, and the use of preventive
services. Demographic variables such as age, education, income, location,
and race are also included. There are 3 .csv files that can be used for
analysis; the one used in this homework was the
diabetes_012_health_indicators_BRFSS2015.csv file.
This .csv file contains 253,680 survey responses (observations) and 21
features. The response variable is multiclass, in that it contains 3
different classes: 0 for no diabetes, 1 for prediabetes, and 2 for
diabetes. The author of this dataset points out that there is a class
imbalance.
This dataset includes the following variables:
- Diabetes_012: 0 = no diabetes, 1 = prediabetes, 2 = diabetes
- HighBP: 0 = no high BP, 1 = high BP
- HighChol: 0 = no high cholesterol, 1 = high cholesterol
- CholCheck: 0 = no cholesterol check in 5 years, 1 = cholesterol check in 5 years
- BMI: Body Mass Index
- Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
- Stroke: (Ever told) you had a stroke. 0 = no, 1 = yes
- HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI). 0 = no, 1 = yes
- PhysActivity: physical activity in the past 30 days, not including job. 0 = no, 1 = yes
- Fruits: Consume fruit 1 or more times per day. 0 = no, 1 = yes
- Veggies: Consume vegetables 1 or more times per day. 0 = no, 1 = yes
- HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week). 0 = no, 1 = yes
- AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes
- NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes
- GenHlth: Would you say that in general your health is: scale 1-5
- MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
- PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes
- Sex: 0 = female, 1 = male
- Age: 13-level age category
- Education: Education level; scale 1-6
- Income: Income scale; scale 1-8
Decision trees have pros and cons, some of which are discussed here. This homework will explain how the decision trees generated for this analysis can counter the negative perception of decision trees by addressing a real-world problem.
# In col_types, "f" = factor and "n" = numeric; BMI is the only numeric column.
diabetes_data <- read_csv(
file = "diabetes_012_health_indicators_BRFSS2015.csv",
col_types = "ffffnfffffffffffffffff")
A summary of the Diabetes Health Indicators Dataset is provided below:
summary(diabetes_data)
## Diabetes_012 HighBP HighChol CholCheck BMI
## 0.0:213703 1.0:108829 1.0:107591 1.0:244210 Min. :12.00
## 2.0: 35346 0.0:144851 0.0:146089 0.0: 9470 1st Qu.:24.00
## 1.0: 4631 Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits
## 1.0:112423 0.0:243388 0.0:229787 0.0: 61760 0.0: 92782
## 0.0:141257 1.0: 10292 1.0: 23893 1.0:191920 1.0:160898
##
##
##
##
##
## Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## 1.0:205841 0.0:239424 1.0:241263 0.0:232326 5.0:12081
## 0.0: 47839 1.0: 14256 0.0: 12417 1.0: 21354 3.0:75646
## 2.0:89084
## 4.0:31570
## 1.0:45299
##
##
## MentHlth PhysHlth DiffWalk Sex Age
## 0.0 :175680 0.0 :160052 1.0: 42675 0.0:141974 9.0 :33244
## 2.0 : 13054 30.0 : 19400 0.0:211005 1.0:111706 10.0 :32194
## 30.0 : 12088 2.0 : 14764 8.0 :30832
## 5.0 : 9030 1.0 : 11388 7.0 :26314
## 1.0 : 8538 3.0 : 8495 11.0 :23533
## 3.0 : 7381 5.0 : 7622 6.0 :19819
## (Other): 27909 (Other): 31959 (Other):87744
## Education Income
## 4.0: 62750 8.0 :90385
## 6.0:107325 7.0 :43219
## 3.0: 9478 6.0 :36470
## 5.0: 69910 5.0 :25883
## 2.0: 4043 4.0 :20135
## 1.0: 174 3.0 :15994
## (Other):21594
The factor levels above are recoded below for readability.
diabetes_data <- diabetes_data %>%
mutate(
Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
'6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
'11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
'4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
'5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
)
summary(diabetes_data)
## Diabetes_012 HighBP HighChol
## No Diabetes:213703 1.0:108829 1.0:107591
## Diabetes : 35346 0.0:144851 0.0:146089
## Prediabetes: 4631
##
##
##
##
## CholCheck BMI Smoker Stroke
## Yes Chol Check in 5 years:244210 Min. :12.00 1.0:112423 0.0:243388
## No Chol Check in 5 Years : 9470 1st Qu.:24.00 0.0:141257 1.0: 10292
## Median :27.00
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
##
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 0.0:229787 0.0: 61760 0.0: 92782 1.0:205841 0.0:239424
## 1.0: 23893 1.0:191920 1.0:160898 0.0: 47839 1.0: 14256
##
##
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Has Insurance:241263 0.0:232326 Poor :12081 0.0 :175680
## No Insurance : 12417 1.0: 21354 Good :75646 2.0 : 13054
## Very Good:89084 30.0 : 12088
## Fair :31570 5.0 : 9030
## Excellent:45299 1.0 : 8538
## 3.0 : 7381
## (Other): 27909
## PhysHlth DiffWalk Sex Age
## 0.0 :160052 1.0: 42675 0.0:141974 60-64 :33244
## 30.0 : 19400 0.0:211005 1.0:111706 65-69 :32194
## 2.0 : 14764 55-59 :30832
## 1.0 : 11388 50-54 :26314
## 3.0 : 8495 70-74 :23533
## 5.0 : 7622 45-49 :19819
## (Other): 31959 (Other):87744
## Education Income
## Grade 12/GED : 62750 Income>=75K :90385
## >= 4 Yrs College :107325 50K<=Income<75K:43219
## Grades 9 - 11 : 9478 35K<=Income<50K:36470
## 1-3 Yrs College : 69910 25K<=Income<35K:25883
## Grades 1-8 : 4043 20K<=Income<25K:20135
## No School/Kindergarten: 174 15K<=Income<20K:15994
## (Other) :21594
Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.
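The absence of missing values can also be verified directly; the one-line check below was not part of the original output and is included only as an illustration. Given the summary above, it should return FALSE.
# anyNA() returns TRUE if any value in the data frame is NA.
anyNA(diabetes_data)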
Figure 1: Histogram of BMI (the only continuous feature) in the Diabetes Health Indicators Dataset
Figure 1 shows that BMI is right-skewed. This skewness could
also have been deduced from the summary: for the BMI variable,
the maximum is 98, while the mean is roughly 28 and the minimum is 12.
Figure 2: Boxplots for the Diabetes Health Indicators Dataset
The boxplots in Figure 2 revealed some findings that support the
theoretical effects of some of the variables. Based on the age
boxplot, older people are, theoretically, more likely to have heart
disease. Likewise, when viewing the MaxHR variable, people with a
lower maximum heart rate are, on average, more likely to have heart
disease. Finally, based on the boxplot, people with a higher
Oldpeak are, on average, more likely to have heart disease, which
makes sense given that an Oldpeak of ±1 is indicative of a serious
health condition.
Finally, it is important to understand which features are correlated with one another in order to detect and avoid multicollinearity in the models. A correlation plot visualizes these relationships, although it can only be computed for continuous (numeric) variables.
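The corrplot() call below uses an object named diabetes_correlations that is created outside the code shown here. The sketch below is a hypothetical reconstruction only: it assumes the correlations were computed with cor() on the raw numeric encoding of the predictor columns and then wrapped in a list so that diabetes_correlations$correlations works.
# Hypothetical reconstruction of `diabetes_correlations` (the original
# computation is not shown): pairwise Pearson correlations of the raw
# numeric encoding of the predictor columns, wrapped in a list.
library(readr)
raw_numeric <- read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
predictors <- raw_numeric[, setdiff(names(raw_numeric), "Diabetes_012")]
diabetes_correlations <- list(correlations = cor(predictors))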
corrplot(diabetes_correlations$correlations,
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 1,
tl.cex = 1)
Figure 3: Multicollinearity plot for continuous predictor variables
Calkins indicates that "…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation". The correlation with the largest magnitude has a value of 0.52; while this is above the upper bound of what would be considered a "low correlation", it exceeds that bound by only 0.02. Therefore, it is reasonable to say that the features in this dataset have low pairwise correlation.
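For reference, the largest off-diagonal correlation magnitude can be pulled directly from the correlation matrix. This small check is an addition, assuming diabetes_correlations$correlations is the matrix plotted above.
# lower.tri() excludes the diagonal of 1s and the duplicated upper triangle.
cor_mat <- diabetes_correlations$correlations
max(abs(cor_mat[lower.tri(cor_mat)]))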
Interestingly, the correlation between Education and
Income is low, even though multiple studies have shown a strong
correlation between education and income.
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes Prediabetes
## 0.84241170 0.13933302 0.01825528
The output above shows the percentage of respondents that do not have
diabetes (84.24%), the percentage that have diabetes (13.93%), and the
percentage that are prediabetic (1.83%). Note that to deal with the class
imbalance using SMOTE later on, only 2 classes can exist within the
response variable. Since people with prediabetes make up only 1.83% of
the total number of observations, all of the observations where
Diabetes_012 == Prediabetes were removed from the dataset.
The nature of this study changed slightly with the removal of a class:
instead of determining whether someone is not at risk of having diabetes,
is at risk of being prediabetic, or is at risk of being diabetic, the
model will now only determine whether a person is at risk or not at risk
of getting diabetes. Modeling and predicting the remaining two classes
still yielded valuable insights.
# The factor levels of Diabetes_012 are, in order: No Diabetes (1),
# Diabetes (2), Prediabetes (3), so dropping level 3 removes the
# prediabetes observations before relabelling the remaining two classes.
diabetes_data <- diabetes_data %>%
mutate(Diabetes_012 = as.numeric(Diabetes_012)) %>%
subset(Diabetes_012 != 3) %>%
mutate(Diabetes_012 = as.factor(Diabetes_012)) %>%
mutate(Diabetes_012 = dplyr::recode(Diabetes_012, '1' = 'No Diabetes', '2' = 'Diabetes'))
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
Outlier removal, noise handling, and normalization, which are common preprocessing steps, do not need to be applied when creating a decision tree model (Practical Machine Learning in R, page 301). The dataset is then split into training and test sets.
set.seed(123)
# sample.split() performs a stratified split, preserving the relative
# frequency of each Diabetes_012 class in both the training and test sets.
original_split <- caTools::sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.75)
diabetes_data_train <- subset(diabetes_data, original_split == TRUE)
diabetes_data_test <- subset(diabetes_data, original_split == FALSE)
prop.table(table(select(diabetes_data, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580761 0.1419239
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580736 0.1419264
prop.table(table(select(diabetes_data_test, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.8580836 0.1419164
The output above shows that there is still a significant class
imbalance. SMOTE from the DMwR package is applied only to the
training dataset, so that the test set retains the original class
distribution.
# perc.over = 100 doubles the minority (Diabetes) class with synthetic
# examples, and perc.under = 200 keeps two majority cases per synthetic
# minority case, which yields the 50/50 balance shown below.
diabetes_data_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_train, Diabetes_012)))
## Diabetes_012
## No Diabetes Diabetes
## 0.5 0.5
The rpart function in R allowed for the generation of a
decision tree model that uses all of the features in the training set to
build a model that predicts Diabetes_012:
diabetes_mod1 <- rpart(
Diabetes_012 ~ .,
method = "class",
data = diabetes_data_train
)
rpart.plot(diabetes_mod1)
Figure 4: Decision tree for the diabetes health indicators dataset using all of the available features
The decision tree that was generated made several splits based on 3
of the 21 available features: GenHlth,
DiffWalk, and HighBP. While the exact cause of
diabetes has not been discovered, researchers point to "a
combination of genetics, lifestyle, and environmental factors that can
contribute to its onset". Therefore, GenHlth makes
sense as a significant factor in determining whether someone is at risk of
developing diabetes. High blood pressure (HighBP) commonly
accompanies diabetes, but it does not cause diabetes by itself.
Diabetes also affects gait patterns, which is probably why the
DiffWalk variable was also identified as a significant feature.
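The splits can also be read as plain-language rules using rpart.rules() from the rpart.plot package; this optional step was not part of the original analysis and is shown only as a sketch.
# Print each terminal node of the fitted tree as a readable rule;
# cover = TRUE adds the percentage of training observations per rule.
rpart.rules(diabetes_mod1, cover = TRUE)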
varImp(diabetes_mod1) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 5: Variable importance plot for the decision tree model which uses all of the available features
The variable importance plot generated from the decision tree model
shows that GenHlth, DiffWalk,
HeartDiseaseorAttack, BMI, and
HighBP are all significant features. These features make
sense for determining whether or not someone has diabetes: people
who are overweight or who have cardiovascular problems tend to be at
higher risk for diabetes.
diabetes_pred1 <- predict(diabetes_mod1, diabetes_data_test, type = "class")
confusionMatrix(diabetes_pred1, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 37536 2272
## Diabetes 15890 6564
##
## Accuracy : 0.7083
## 95% CI : (0.7047, 0.7119)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2711
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7429
## Specificity : 0.7026
## Pos Pred Value : 0.2923
## Neg Pred Value : 0.9429
## Prevalence : 0.1419
## Detection Rate : 0.1054
## Detection Prevalence : 0.3606
## Balanced Accuracy : 0.7227
##
## 'Positive' Class : Diabetes
##
The confusion matrix reveals an accuracy of 0.7083. Note that when a multiple logistic regression model and a naive Bayes model were fit to this same dataset in Homework 1, the accuracies were 0.7737 and 0.7678, respectively, indicating that the decision tree model performs worse on the test dataset.
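For reference, caret can also report precision, recall, and F1 for the positive class directly by passing mode = "prec_recall" to confusionMatrix(); this optional variant was not part of the original analysis.
# Same confusion matrix, but reported with precision/recall/F1 statistics.
confusionMatrix(diabetes_pred1, diabetes_data_test$Diabetes_012, positive = "Diabetes", mode = "prec_recall")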
roc_pred1 <-
prediction(
predictions = predict(diabetes_mod1, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
roc_perf1 <- performance(roc_pred1, measure = "tpr", x.measure = "fpr")
plot(roc_perf1, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 6: ROC curve for the decision tree model using all of the available features
auc_perf1 <- performance(roc_pred1, measure = "auc")
diabetes_auc1 <- unlist(slot(auc_perf1,"y.values"))
paste("Calculated AUC: ", diabetes_auc1)
## [1] "Calculated AUC: 0.746777432125331"
The AUC values for the multiple logistic regression and naive Bayes models fit to this same dataset in Homework 1 were 0.815 and 0.798, respectively, indicating that the decision tree also performs worse by this metric.
Several key features were selected from the Diabetes Health
Indicators Dataset based on their association with whether or not someone
actually has diabetes. In this second decision tree, 5 of the most
important features shown in Figure 5 (GenHlth, DiffWalk,
HeartDiseaseorAttack, BMI, and
HighBP) were omitted. This was done to compare the outputs
and metrics of two very different decision trees. The features selected
instead, along with external evidence for each, are listed below:
- HighChol: WebMD indicates that high cholesterol tends to increase the risk of diabetes.
- Smoker: The CDC points out that people who smoke are 30% to 40% more likely to develop diabetes.
- PhysActivity: The CDC points out that people who do not get enough physical activity are at a higher risk of developing diabetes.
- Veggies: UCLA Health indicates that people whose diets are rich in vegetables have a lower risk of developing diabetes.
- HvyAlcoholConsump: Drinkaware indicates that heavy alcohol consumption can contribute to diabetes by reducing the body's sensitivity to insulin.
- Sex: The CDC points out that men store more fat in their bellies than women, and this is a risk factor for diabetes.
- Age: The American Diabetes Association points out that "older adults are at high risk for both diabetes and prediabetes".
- Education: MedStar Health points out that people with "lower income and less education are two to four times more likely to develop diabetes".
The rpart function in R allowed for the generation of a
decision tree model that uses HighChol,
Smoker, PhysActivity, Veggies,
HvyAlcoholConsump, Sex, Age,
and Education in the training set to build a model that
predicts Diabetes_012:
diabetes_mod2 <- rpart(
Diabetes_012 ~ HighChol + Smoker + PhysActivity + Veggies + HvyAlcoholConsump + Sex + Age + Education,
method = "class",
data = diabetes_data_train
)
rpart.plot(diabetes_mod2)
Figure 7: Decision tree for the diabetes health indicators
dataset using just HighChol, Smoker,
PhysActivity, Veggies,
HvyAlcoholConsump, Sex, Age,
and Education
The decision tree shown in Figure 7 uses 4 different features to make
the splits: HighChol, PhysActivity,
Age, and Education.
varImp(diabetes_mod2) %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(Overall) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
filter(Overall > 0) %>%
ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()
Figure 8: Variable importance plot for the decision tree shown in Figure 7
The variable importance plot in Figure 8 shows that PhysActivity,
Age, Education, Veggies, and
HighChol contribute the most to determining whether or not
someone has diabetes.
diabetes_pred2 <- predict(diabetes_mod2, diabetes_data_test, type = "class")
confusionMatrix(diabetes_pred2, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 35550 3194
## Diabetes 17876 5642
##
## Accuracy : 0.6616
## 95% CI : (0.6579, 0.6653)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1795
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63852
## Specificity : 0.66541
## Pos Pred Value : 0.23990
## Neg Pred Value : 0.91756
## Prevalence : 0.14192
## Detection Rate : 0.09062
## Detection Prevalence : 0.37773
## Balanced Accuracy : 0.65197
##
## 'Positive' Class : Diabetes
##
The statistics above show an accuracy of 0.6616, which is worse than that of the decision tree generated using all of the available features.
roc_pred2 <-
prediction(
predictions = predict(diabetes_mod2, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
roc_perf2 <- performance(roc_pred2, measure = "tpr", x.measure = "fpr")
plot(roc_perf2, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 9: ROC curve for the decision tree shown in Figure 7
auc_perf2 <- performance(roc_pred2, measure = "auc")
diabetes_auc2 <- unlist(slot(auc_perf2,"y.values"))
paste("Calculated AUC: ", diabetes_auc2)
## [1] "Calculated AUC: 0.70366657692332"
Similarly, the AUC is worse than the AUC calculated when all of the features were used.
The randomForest package will be used to generate a
random forest model. This model requires the user to input a value for
mtry, which is the number of features randomly selected as
split candidates at each node.
Practical Machine Learning in R explains the following:
“Based on the documentation provided by the randomForest
package, the default value for mtry is the square root of
the number of features in the dataset when working on a classification
problem.”
Therefore, mtry will be set to 4 for the diabetes health
indicators dataset.
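As a quick check of that default, assuming the post-SMOTE training set still contains Diabetes_012 plus the 21 predictors:
# Default mtry for classification: floor of the square root of the
# number of predictors. Diabetes_012 is the response, leaving 21 predictors.
p <- ncol(diabetes_data_train) - 1
floor(sqrt(p))
## [1] 4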
rf_mod <- train(
Diabetes_012 ~ .,
data = diabetes_data_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 4)
)
plot(varImp(rf_mod), top = 10)
Figure 11: Variable importance plot for the random forest model generated using all of the available features
Figure 11 reveals that BMI was very important in
predicting whether a person has diabetes, followed by a significant
drop in importance for the next variable, DiffWalk.
rf_pred <- predict(rf_mod, diabetes_data_test)
confusionMatrix(rf_pred, diabetes_data_test$Diabetes_012, positive = "Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Diabetes Diabetes
## No Diabetes 41417 2779
## Diabetes 12009 6057
##
## Accuracy : 0.7625
## 95% CI : (0.7591, 0.7658)
## No Information Rate : 0.8581
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3208
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68549
## Specificity : 0.77522
## Pos Pred Value : 0.33527
## Neg Pred Value : 0.93712
## Prevalence : 0.14192
## Detection Rate : 0.09728
## Detection Prevalence : 0.29016
## Balanced Accuracy : 0.73036
##
## 'Positive' Class : Diabetes
##
The statistics shown above reveal an accuracy of 0.7625, and while this is better than both of the decision trees that were generated for this homework, it is still worse than that of the naive Bayes and multiple logistic regression models from Homework 1. Note that it took almost 13 minutes to fit this model, while runtimes for the models generated for Homework 1 were much shorter.
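The 13-minute figure was observed informally; if the runtime needed to be recorded explicitly, the train() call could be wrapped in system.time(), as sketched below with the same caret call as above (this wrapper is an assumption, not part of the original code).
# system.time() reports the elapsed seconds needed to fit the random forest.
rf_timing <- system.time(
rf_mod <- train(
Diabetes_012 ~ .,
data = diabetes_data_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 4)
)
)
rf_timing["elapsed"]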
rf_pred <-
prediction(
predictions = predict(rf_mod, diabetes_data_test, type = "prob")[, "No Diabetes"],
labels = diabetes_data_test$Diabetes_012
)
rf_perf <- performance(rf_pred, measure = "tpr", x.measure = "fpr")
plot(rf_perf, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Figure 12: ROC curve for the random forest model
auc_perf_rf <- performance(rf_pred, measure = "auc")
diabetes_auc_rf <- unlist(slot(auc_perf_rf,"y.values"))
paste("Calculated AUC: ", diabetes_auc_rf)
## [1] "Calculated AUC: 0.804684760085056"
The AUC for the random forest model is slightly better than the naive Bayes model that was generated in Homework 1, but still worse than the multiple logistic regression model.
The decision tree model that was generated using all of the features performed better than the decision tree model that was generated using a subset of the features. Interestingly, both trees have the same number of splits, only 5 in total, which makes them easily readable to a non-technical audience. For example, in the second decision tree, it is easy to tell that respondents below the age of 44 without high cholesterol will probably not have, or be at risk for, diabetes. In the DeciZone article, one of the "ugly" points was usability; because both decision trees have so few splits, they are not large and are easy to navigate from a usability standpoint. Both decision trees also took only seconds to fit, so if new data were fed in, the models could be refit in seconds.

The DeciZone article also lists complexity as one of the "bad" points, but that was not much of a problem for either of the generated trees. One could imagine that with more features and more observations, a decision tree could have too many branches to navigate; that is not a problem for this particular dataset. In a business setting, it would be beneficial to show the results of this analysis to healthcare stakeholders and policymakers. According to the American Diabetes Association, the annual healthcare cost of diabetes is $412.9 billion, which suggests that models like these could generate significant savings by identifying and mitigating factors that contribute to diabetes.
With that being said, none of the models generated in this homework
fared better than the multiple logistic regression model. The models
themselves, however, offer valuable insight into which features were
important, and the importance of those features can be interpreted as
factors that contribute to whether or not someone is at risk of, or has,
diabetes. Note that while the second decision tree has the lowest
accuracy, each of its features was linked to an article or study
indicating that feature's relevance to people with diabetes, so there is
external evidence supporting the results of that particular tree and
conclusions can be drawn from it (e.g., PhysActivity having a high
importance for the response variable, Diabetes_012,
indicates that those who are physically active are less likely to
develop diabetes). Note also that each model had a different set of
important predictors: BMI, the most important feature in the random
forest model, is only the 4th most important feature for the decision
tree model. The random forest model is more accurate than both decision
tree models, but it also had the longest runtime of all the models and is
less interpretable. Therefore, in terms of interpretability, it would be
best to go with a decision tree model for this particular homework, but
overall it would probably be best to go with a multiple logistic
regression model for both interpretability and predictive accuracy.