DATA622: Homework 2

library(tidyverse)              # data wrangling and ggplot2
library(ggpubr)                 # ggarrange() for multi-panel plots
library(skimr)                  # skim() summary statistics
library(caret)                  # partitioning, model training, confusionMatrix()
library(rpart.plot)             # plotting rpart decision trees
library(performanceEstimation)  # smote() for class balancing

Exploratory Analysis

Data Source

For this assignment, I used the ‘Diabetes Dataset’ freely available on Kaggle.com (https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset). The data originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. All patients in the dataset are females of Pima Indian heritage who are at least 21 years old. The objective of this data source is to predict whether a patient is likely to have diabetes, to aid in diagnosis. There are 9 variables in the dataset: 1 binary indicator variable for diabetes and 8 variables containing medical information about the patient.

Variable                   Data Type    Description
Pregnancies                Numeric      Number of pregnancies
Glucose                    Numeric      Glucose level in blood
BloodPressure              Numeric      Blood pressure measurement
SkinThickness              Numeric      Thickness of the skin
Insulin                    Numeric      Insulin level in blood
BMI                        Numeric      Body mass index
DiabetesPedigreeFunction   Numeric      Likelihood of diabetes based on family history
Age                        Numeric      Age in years
Outcome                    Categorical  Whether someone has diabetes (1 = Yes, 0 = No)
diabetes <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA622/main/HW2/Data/diabetes.csv")
# adjust data types: cast the integer columns to numeric and the target to a factor
# (BMI and DiabetesPedigreeFunction are already read in as numeric)
diabetes$Pregnancies <- as.numeric(diabetes$Pregnancies)
diabetes$Glucose <- as.numeric(diabetes$Glucose)
diabetes$BloodPressure <- as.numeric(diabetes$BloodPressure)
diabetes$SkinThickness <- as.numeric(diabetes$SkinThickness)
diabetes$Insulin <- as.numeric(diabetes$Insulin)
diabetes$Age <- as.numeric(diabetes$Age)
diabetes$Outcome <- as.factor(diabetes$Outcome)

Summary Statistics

Before Cleaning Data

At a quick glance, the summary statistics show:

  • There are no missing values present in the dataset
  • 25% of the women have a glucose level of 140 - 199 mg/dL, indicating prediabetes (https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451#:~:text=A%20blood%20sugar%20level%20less,L)%20means%20you%20have%20prediabetes.)
  • More than 75% of women have a BMI higher than the “normal” range of 18.5-24.9.
  • There is a wide range of ages in the dataset, from 21 to 81 years old.
  • Blood Pressure has values of 0, which should be investigated. A zero blood pressure reading can be attributed to monitor malfunction, so I will remove all rows with a zero BloodPressure. Many of these rows also had 0s for SkinThickness or Glucose, which is why I elected to drop them entirely (a quick numeric check of these points follows below).
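A quick numeric check of these bullet points (a minimal sketch against the raw data):

# sketch: verify the summary-statistic claims above
mean(diabetes$Glucose >= 140 & diabetes$Glucose <= 199)  # share of women with glucose 140-199 mg/dL
mean(diabetes$BMI > 24.9)                                # share above the "normal" BMI range
range(diabetes$Age)                                      # youngest and oldest ages
sum(diabetes$BloodPressure == 0)                         # rows with a zero blood pressure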
skim(diabetes)
Data summary
Name diabetes
Number of rows 768
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Outcome 0 1 FALSE 2 0: 500, 1: 268

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 0 1 120.89 31.97 0.00 99.00 117.00 140.25 199.00 ▁▁▇▆▂
BloodPressure 0 1 69.11 19.36 0.00 62.00 72.00 80.00 122.00 ▁▁▇▇▁
SkinThickness 0 1 20.54 15.95 0.00 0.00 23.00 32.00 99.00 ▇▇▂▁▁
Insulin 0 1 79.80 115.24 0.00 0.00 30.50 127.25 846.00 ▇▁▁▁▁
BMI 0 1 31.99 7.88 0.00 27.30 32.00 36.60 67.10 ▁▃▇▂▁
DiabetesPedigreeFunction 0 1 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
Age 0 1 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁
# drop rows with a physiologically implausible zero blood pressure
diabetes.new <- diabetes %>% filter(BloodPressure != 0)

After Cleaning Data

As 35 rows containing zeros were removed, the means of Glucose, BloodPressure, and SkinThickness increased. The other variables remained similar to the pre-cleaning summary.
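As a sanity check on the number of rows dropped (a minimal sketch):

nrow(diabetes) - nrow(diabetes.new)  # 768 - 733 = 35 rows removed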

skim(diabetes.new)
Data summary
Name diabetes.new
Number of rows 733
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Outcome 0 1 FALSE 2 0: 481, 1: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 3.86 3.36 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 0 1 121.04 32.18 0.00 99.00 117.00 141.00 199.00 ▁▁▇▆▂
BloodPressure 0 1 72.41 12.38 24.00 64.00 72.00 80.00 122.00 ▁▃▇▂▁
SkinThickness 0 1 21.44 15.71 0.00 0.00 24.00 33.00 99.00 ▇▇▂▁▁
Insulin 0 1 83.61 116.61 0.00 0.00 45.00 130.00 846.00 ▇▁▁▁▁
BMI 0 1 32.29 7.27 0.00 27.40 32.30 36.60 67.10 ▁▃▇▂▁
DiabetesPedigreeFunction 0 1 0.48 0.33 0.08 0.24 0.38 0.63 2.42 ▇▃▁▁▁
Age 0 1 33.36 11.84 21.00 24.00 29.00 41.00 81.00 ▇▃▂▁▁

Check the Target Variable Distribution

There is a class imbalance present in this dataset. Imbalanced data can be quite problematic for classifying the minority class. I will oversample using SMOTE before modeling the data to give the models a better chance at classifying the minority class: women who have diabetes.

diabetes.new %>% ggplot(aes(Outcome)) +
  geom_bar(fill = "#04354F") +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
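The raw class proportions behind the bar chart (a quick check):

prop.table(table(diabetes.new$Outcome))  # roughly 66% non-diabetic vs. 34% diabetic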

Data Distributions

Aggregate

  • Most women are in their 20s and early 30s
  • After removing the rows with zeros, BloodPressure is approximately normally distributed
  • BMI is also approximately normally distributed
  • Most women have a small likelihood of diabetes based on family history (low DiabetesPedigreeFunction)
  • Most women have a normal glucose level
  • Insulin and SkinThickness have a large number of 0 values
  • Pregnancies is right-skewed
diabetes.new %>% 
  gather(-c(Outcome), key = variable, value = value) %>%
  ggplot(., aes(x = value)) +
  geom_histogram(aes(x=value, y = ..density..), bins = 30, fill="#69b3a2", color="#e9ecef") +
  geom_density(aes(x=value), color='red', lwd = 1.75) +
  facet_wrap(~variable, scales ="free", ncol = 4)

p1 <- diabetes.new %>% 
  ggplot(aes(x = Pregnancies, fill = Outcome)) +
  geom_density(alpha = 0.5)

p2 <- diabetes.new %>% 
  ggplot(aes(x = Glucose, fill = Outcome)) +
  geom_density(alpha = 0.5)

p3 <- diabetes.new %>% 
  ggplot(aes(x = BloodPressure, fill = Outcome)) +
  geom_density(alpha = 0.5)


p4 <- diabetes.new %>% 
  ggplot(aes(x = Insulin, fill = Outcome)) +
  geom_density(alpha = 0.5)


p5 <- diabetes.new %>% 
  ggplot(aes(x = BMI, fill = Outcome)) +
  geom_density(alpha = 0.5)

p6 <- diabetes.new %>% 
  ggplot(aes(x = DiabetesPedigreeFunction, fill = Outcome)) +
  geom_density(alpha = 0.5)


p7 <- diabetes.new %>% 
  ggplot(aes(x = Age, fill = Outcome)) +
  geom_density(alpha = 0.5)

ggarrange(p1, p2, p3,
          p4, p5, p6,
          p7, nrow = 4, ncol = 2)

By Class: Outcome

Both the density plots and boxplots indicate that Glucose, BloodPressure, BMI, Pregnancies, and Age should be strong predictors for classifying a patient's outcome (a per-class numeric summary follows the plots below).

p1 <- diabetes.new %>% 
  ggplot(aes(x = Pregnancies, fill = Outcome)) +
  geom_boxplot()

p2 <- diabetes.new %>% 
  ggplot(aes(x = Glucose, fill = Outcome)) +
  geom_boxplot()

p3 <- diabetes.new %>% 
  ggplot(aes(x = BloodPressure, fill = Outcome)) +
  geom_boxplot()


p4 <- diabetes.new %>% 
  ggplot(aes(x = Insulin, fill = Outcome)) +
  geom_boxplot()


p5 <- diabetes.new %>% 
  ggplot(aes(x = BMI, fill = Outcome)) +
  geom_boxplot()

p6 <- diabetes.new %>% 
  ggplot(aes(x = DiabetesPedigreeFunction, fill = Outcome)) +
  geom_boxplot()


p7 <- diabetes.new %>% 
  ggplot(aes(x = Age, fill = Outcome)) +
  geom_boxplot()

ggarrange(p1, p2, p3,
          p4, p5, p6,
          p7, nrow = 4, ncol = 2)
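As a numeric complement to the plots, per-class means make the separation explicit. This is a minimal sketch using dplyr (loaded with the tidyverse):

# sketch: mean of each predictor within each Outcome class
diabetes.new %>%
  group_by(Outcome) %>%
  summarise(across(where(is.numeric), mean))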

Modeling

The goal of this assignment is to build two different decision tree models and a random forest model and compare the results. To achieve this, it is important to first partition the data into training and testing sets.

Create Data Partition

The training set contains 80% of the dataset; the remaining 20% will be used to test model performance.

set.seed(123)

train_ind <- createDataPartition(diabetes.new[,"Outcome"], p = 0.8, list = FALSE)

train <- diabetes.new[train_ind, ]
test <- diabetes.new[-train_ind, ]
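Because createDataPartition samples within each level of Outcome, the class ratio should be approximately preserved in both splits; a quick check:

prop.table(table(train$Outcome))  # should mirror the full data (~66% / 34%)
prop.table(table(test$Outcome))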

Oversample Training Data to Balance Class using SMOTE

As mentioned above, the data source is imbalanced. To account for this, the SMOTE algorithm will be applied to the training dataset to synthetically create new minority-class observations and balance the classes. This should allow the models to more accurately predict whether a patient has diabetes. Note that SMOTE is applied only to the training set, so the test set keeps the original class distribution.

train.balanced <- smote(Outcome ~ ., data = train, perc.over = 1)
p1 <- train %>% ggplot(aes(Outcome)) +
  geom_bar(fill = "#133ECE") +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white") +
  ggtitle("Before Oversampling")

p2 <- train.balanced %>% ggplot(aes(Outcome)) +
  geom_bar(fill = "#CE8D13") +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black") +
  ggtitle("After Oversampling")

ggarrange(p1, p2, nrow = 1, ncol = 2)
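For reference, and if I am reading the performanceEstimation documentation correctly, perc.over = 1 generates one synthetic case per existing minority case, while the default perc.under = 2 keeps two majority cases per synthetic minority case, which works out to a roughly 1:1 class ratio. The counts can be confirmed directly:

table(train$Outcome)           # imbalanced original training data
table(train.balanced$Outcome)  # classes roughly equal after SMOTE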

Decision Trees

I elected to build a simple decision tree and a much more complex decision tree to see whether there were vast differences in model performance. The models are:

  • Using the variables Glucose, BMI, and Age.
    • Hyperparameters: 10-fold cross-validation and a tune length of 5
  • Using all variables in the dataset.
    • Hyperparameters: 10-fold cross-validation and a tune length of 200

Model 1: Decision Tree Using the variables: Glucose, BMI, and Age

This first model uses only three variables to predict Outcome. The decision tree plot indicates that Glucose is the most important variable for determining diabetes, followed by BMI and lastly Age. The model is simple, with only 5 terminal nodes.

set.seed(123)

trctrl <- trainControl(method = "cv",
                       number = 10) 

tree1 <- train(Outcome ~ Glucose + BMI + Age,
               method = "rpart",
               trControl = trctrl,
               tuneLength = 5,
               data = train.balanced)

rpart.plot(tree1$finalModel)
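With method = "rpart", tuneLength = 5 has caret evaluate five candidate values of the complexity parameter cp; the value selected by cross-validation can be inspected directly:

tree1$bestTune  # cp chosen by 10-fold cross-validation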

Model Results

The model achieved an accuracy of 0.64. However, overall accuracy is not the best metric for this type of problem because the data is imbalanced; instead, I will focus on Kappa, precision, and sensitivity. The Kappa is 0.352, precision is 0.49, and sensitivity is 0.94. The model was very good at capturing almost all positive cases, but its precision was poor because it produced a large number of false positives.

pred <- predict(tree1, newdata = test)

cm1 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47  3
##          1 49 47
##                                           
##                Accuracy : 0.6438          
##                  95% CI : (0.5604, 0.7213)
##     No Information Rate : 0.6575          
##     P-Value [Acc > NIR] : 0.6712          
##                                           
##                   Kappa : 0.352           
##                                           
##  Mcnemar's Test P-Value : 4.365e-10       
##                                           
##             Sensitivity : 0.9400          
##             Specificity : 0.4896          
##          Pos Pred Value : 0.4896          
##          Neg Pred Value : 0.9400          
##              Prevalence : 0.3425          
##          Detection Rate : 0.3219          
##    Detection Prevalence : 0.6575          
##       Balanced Accuracy : 0.7148          
##                                           
##        'Positive' Class : 1               
## 
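For reference, the metrics quoted above come straight from the confusionMatrix object:

cm1$overall["Kappa"]                             # 0.352
cm1$byClass[c("Sensitivity", "Pos Pred Value")]  # Pos Pred Value is precision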

Model 2: Decision Tree Using all variables

This decision tree uses all the variables in the dataset to predict the Outcome variable and is much more complex than the basic 3-variable model, with a total of 20 terminal nodes and 19 decision nodes. Glucose, BMI, and Age are still the most important factors in determining a patient's likelihood of having diabetes. Other important variables include Pregnancies, DiabetesPedigreeFunction, and BloodPressure. Insulin and SkinThickness contribute the least to the model.

set.seed(123)

trctrl <- trainControl(method = "cv",
                       number = 10) 

tree2 <- train(Outcome ~ .,
               method = "rpart",
               trControl = trctrl,
               tuneLength = 200,
               data = train.balanced)

rpart.plot(tree2$finalModel)
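The variable-importance claims above can be double-checked with caret's varImp, which supports rpart models (a one-line sketch):

plot(varImp(tree2))  # relative importance of each predictor in the full tree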

Model Results

This model achieved an overall accuracy of 0.75, Kappa of 0.48, precision of 0.60, and sensitivity of 0.80. Because of its higher precision (far fewer false positives), this model is better at correctly classifying whether a patient is diabetic than the first model, though its sensitivity is lower (0.80 vs. 0.94).

pred <- predict(tree2, newdata = test)

cm2 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 69 10
##          1 27 40
##                                          
##                Accuracy : 0.7466         
##                  95% CI : (0.668, 0.8149)
##     No Information Rate : 0.6575         
##     P-Value [Acc > NIR] : 0.013142       
##                                          
##                   Kappa : 0.4797         
##                                          
##  Mcnemar's Test P-Value : 0.008529       
##                                          
##             Sensitivity : 0.8000         
##             Specificity : 0.7188         
##          Pos Pred Value : 0.5970         
##          Neg Pred Value : 0.8734         
##              Prevalence : 0.3425         
##          Detection Rate : 0.2740         
##    Detection Prevalence : 0.4589         
##       Balanced Accuracy : 0.7594         
##                                          
##        'Positive' Class : 1              
## 

Random Forest

Model 3: Random Forest

The final model for this assignment is a Random Forest model using all variables, with 10-fold cross-validation and ntree = 250 as hyperparameters. The tuning results suggest the model performs best when only a handful of variables are considered at each split. The most important features in the Random Forest are Glucose, BMI, Age, DiabetesPedigreeFunction, and BloodPressure, similar to Model 2.

set.seed(123)

forest <- train(Outcome ~ .,
                method = "rf",
                trControl = trctrl,
                ntree = 250,
                data = train.balanced)
plot(forest)

plot(varImp(forest))
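The "handful of variables" observation corresponds to the tuned mtry value, i.e., the number of predictors sampled at each split; it can be read off directly:

forest$bestTune  # mtry selected by cross-validation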

Model Results

The Random Forest achieved the following metrics:

  • Overall accuracy of 0.79
  • Kappa of 0.57
  • Precision of 0.63
  • Sensitivity of 0.90
  • Specificity of 0.73

The Random Forest performed the best of the three models at predicting the Outcome variable, i.e., determining whether a patient is diabetic.

pred <- predict(forest, newdata = test)

cm3 <- confusionMatrix(data = pred, reference = test$Outcome, positive = "1")

cm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 70  5
##          1 26 45
##                                           
##                Accuracy : 0.7877          
##                  95% CI : (0.7124, 0.8509)
##     No Information Rate : 0.6575          
##     P-Value [Acc > NIR] : 0.0004157       
##                                           
##                   Kappa : 0.5716          
##                                           
##  Mcnemar's Test P-Value : 0.0003280       
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.7292          
##          Pos Pred Value : 0.6338          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.3425          
##          Detection Rate : 0.3082          
##    Detection Prevalence : 0.4863          
##       Balanced Accuracy : 0.8146          
##                                           
##        'Positive' Class : 1               
## 

Model Comparisons

The best model in all metrics except sensitivity is the Random Forest (Model 3). Model 1 is the simplest of the models, which is best if interpretability is desired, but it is outperformed by the other models. Model 2 has greater predictive power than Model 1 but does not outperform Model 3. Model 3 has the best predictive power but is the least interpretable of the models. As the objective of the assignment is to predict the Outcome variable as accurately as possible, I believe Model 3 is the best model for this classification task.

accuracy <- c(cm1$overall[1], cm2$overall[1], cm3$overall[1])
kappa.val <- c(cm1$overall["Kappa"], cm2$overall["Kappa"], cm3$overall["Kappa"])
precision.val <- c(cm1$byClass['Pos Pred Value'], cm2$byClass['Pos Pred Value'], cm3$byClass['Pos Pred Value'] )
sensitivity.val <- c(cm1$byClass['Sensitivity'], cm2$byClass['Sensitivity'], cm3$byClass['Sensitivity'])

specificity.val <- c(cm1$byClass['Specificity'], cm2$byClass['Specificity'], cm3$byClass['Specificity'])

model.type <- c("Model 1: Simple Decision Tree", "Model 2: Complex Decision Tree", "Model 3: Random Forest")
results <- data.frame(model.type,
           accuracy,
           kappa.val,
           precision.val,
           sensitivity.val,
           specificity.val)

results$f1.score <- 2 * ((results$precision.val * results$sensitivity.val) / (results$precision.val + results$sensitivity.val))

results <- results %>%
  mutate_if(is.numeric,
            round,
            digits = 3)

kableExtra::kbl(results) %>%
  kableExtra::kable_classic()
model.type accuracy kappa.val precision.val sensitivity.val specificity.val f1.score
Model 1: Simple Decision Tree 0.644 0.352 0.490 0.94 0.490 0.644
Model 2: Complex Decision Tree 0.747 0.480 0.597 0.80 0.719 0.684
Model 3: Random Forest 0.788 0.572 0.634 0.90 0.729 0.744

Conclusion

This assignment showcased some of the pros and cons of using one algorithm over another. For example, the decision trees produced results extremely fast but at a cost in predictive power. The random forest, on the other hand, produced a model with much greater predictive power at the cost of both computational speed and model interpretability. There is also a stark difference in interpretability between the simple (Model 1) and the complex (Model 2) decision trees. Model 1 used only 4 decision nodes to generate the classifier, which is easy to follow; Model 2 used 19 decision nodes (almost 5 times as many) to generate an outcome. The more decision nodes a tree has, the more easily a person can get lost when attempting to understand the model.

As mentioned in the article, there are good, bad, and ugly sides to using a decision tree. Imbalanced data is not great for tree-based models, so to give the models a fighting chance at predicting the positive class, the training data was resampled with SMOTE. This adjustment, paired with 10-fold cross-validation, helped overcome these weaknesses and steadily improved the models.