Data Overview and Initial Analysis

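The analysis below assumes the following packages are installed and attached; skim() comes from skimr, createDataPartition() and confusionMatrix() from caret, and the plotting and modeling functions from ggplot2, rpart, rpart.plot, and randomForest.

library(skimr)         # skim(): compact data summaries
library(ggplot2)       # stacked bar charts
library(caret)         # createDataPartition(), confusionMatrix()
library(rpart)         # classification trees
library(rpart.plot)    # rpart.plot(), rpart.rules()
library(randomForest)  # randomForest(), importance(), varImpPlot()
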
drug_data <- read.csv('https://raw.githubusercontent.com/Kingtilon1/MachineLearning-BigData/refs/heads/main/DecisionTree/drug200.csv')
head(drug_data)
##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX
str(drug_data)
## 'data.frame':    200 obs. of  6 variables:
##  $ Age        : int  23 47 47 28 61 22 49 41 60 43 ...
##  $ Sex        : chr  "F" "M" "M" "F" ...
##  $ BP         : chr  "HIGH" "LOW" "LOW" "NORMAL" ...
##  $ Cholesterol: chr  "HIGH" "HIGH" "HIGH" "HIGH" ...
##  $ Na_to_K    : num  25.4 13.1 10.1 7.8 18 ...
##  $ Drug       : chr  "drugY" "drugC" "drugC" "drugX" ...
skim(drug_data)
Data summary

Name                    drug_data
Number of rows          200
Number of columns       6
Column type frequency:
  character             4
  numeric               2
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Sex                    0              1    1    1      0         2           0
BP                     0              1    3    6      0         3           0
Cholesterol            0              1    4    6      0         2           0
Drug                   0              1    5    5      0         5           0

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd    p0    p25    p50    p75   p100  hist
Age                    0              1  44.31  16.54  15.00  31.00  45.00  58.00  74.00  ▆▆▇▆▆
Na_to_K                0              1  16.08   7.22   6.27  10.45  13.94  19.38  38.25  ▇▆▂▂▁

The dataset contains records for 200 patients along with the drug each was prescribed. The skim summary shows a complete dataset with no missing values, which simplifies modeling since we won't need imputation or surrogate splits. Patient age ranges from 15 to 74 years, and the Na_to_K ratio spans 6.27 to 38.25 with a right-skewed distribution; the remaining predictors (Sex, BP, Cholesterol) are low-cardinality categorical variables. This clean, well-structured dataset is a good fit for tree-based classification.
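As a quick sanity check on completeness (a minimal sketch, assuming drug_data is loaded as above):

# Count missing values per column; all zeros agrees with the skim report
colSums(is.na(drug_data))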

Understanding Target Variable Distribution

table(drug_data$Drug)
## 
## drugA drugB drugC drugX drugY 
##    23    16    16    54    91
ggplot(drug_data, aes(x = Drug, fill = BP)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  labs(title = "Drug Distribution by Blood Pressure Levels")

ggplot(drug_data, aes(x = Drug, fill = Cholesterol)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  labs(title = "Drug Distribution by Cholesterol Levels")

The class distribution is noticeably imbalanced: drugY accounts for 91 of the 200 records (about 46%) and drugX for 54, while drugB and drugC appear only 16 times each. Examining how prescriptions vary with patient characteristics hints at the decision boundaries our trees might find; the stacked bar charts show that drug choice is far from uniform across blood pressure and cholesterol levels. These distributions will help us interpret the splitting decisions the trees make.
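To quantify what the charts show, a simple base-R cross-tabulation works as well (a minimal sketch):

# Counts of each drug within each blood pressure level
table(drug_data$BP, drug_data$Drug)

# Class proportions, highlighting the imbalance toward drugY
round(prop.table(table(drug_data$Drug)), 3)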

Data Preprocessing

drug_data$Sex <- as.factor(drug_data$Sex)
drug_data$BP <- as.factor(drug_data$BP)
drug_data$Cholesterol <- as.factor(drug_data$Cholesterol)
drug_data$Drug <- as.factor(drug_data$Drug)

set.seed(123)
train_index <- createDataPartition(drug_data$Drug, p = 0.8, list = FALSE)
train_data <- drug_data[train_index, ]
test_data <- drug_data[-train_index, ]

We convert the categorical variables to factors so rpart and randomForest treat them as discrete levels rather than character strings. createDataPartition() performs a stratified 80/20 split on Drug, so the training and test sets preserve the class proportions, including the rarer drugB and drugC classes.
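A quick check that the stratification worked (a sketch, assuming the split above has been run):

# Class proportions should be nearly identical across the two sets
round(prop.table(table(train_data$Drug)), 2)
round(prop.table(table(test_data$Drug)), 2)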

First Decision Tree Model

tree1 <- rpart(Drug ~ ., 
               data = train_data,
               method = "class",
               control = rpart.control(minsplit = 5))

rpart.plot(tree1, extra = 1)

pred_tree1 <- predict(tree1, test_data, type = "class")
conf_matrix1 <- confusionMatrix(pred_tree1, test_data$Drug)
print(conf_matrix1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX drugY
##      drugA     4     0     0     0     0
##      drugB     0     3     0     0     0
##      drugC     0     0     3     0     0
##      drugX     0     0     0     9     0
##      drugY     0     0     0     1    18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.8619, 0.9993)
##     No Information Rate : 0.4737          
##     P-Value [Acc > NIR] : 2.015e-11       
##                                           
##                   Kappa : 0.9611          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity                1.0000      1.00000      1.00000       0.9000
## Specificity                1.0000      1.00000      1.00000       1.0000
## Pos Pred Value             1.0000      1.00000      1.00000       1.0000
## Neg Pred Value             1.0000      1.00000      1.00000       0.9655
## Prevalence                 0.1053      0.07895      0.07895       0.2632
## Detection Rate             0.1053      0.07895      0.07895       0.2368
## Detection Prevalence       0.1053      0.07895      0.07895       0.2368
## Balanced Accuracy          1.0000      1.00000      1.00000       0.9500
##                      Class: drugY
## Sensitivity                1.0000
## Specificity                0.9500
## Pos Pred Value             0.9474
## Neg Pred Value             1.0000
## Prevalence                 0.4737
## Detection Rate             0.4737
## Detection Prevalence       0.5000
## Balanced Accuracy          0.9750

Our first decision tree performs strongly, classifying 37 of the 38 test cases correctly (97.37% accuracy). The tree plot shows the Na_to_K ratio as the root split, with blood pressure and age appearing in deeper nodes. The model predicts drugs A, B, and C perfectly; the single error is a drugX case predicted as drugY. The Kappa of 0.9611 indicates agreement well beyond chance, which is reassuring given the imbalance toward drugY.
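To see the splits as explicit rules rather than a diagram, rpart and rpart.plot provide standard accessors (a sketch; the exact output depends on the fitted tree):

# Complexity table used by rpart for pruning decisions
printcp(tree1)

# Express each leaf as a human-readable rule (from the rpart.plot package)
rpart.rules(tree1)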

Second Decision Tree Model

# Pass the relaxed settings to rpart.control() at fit time; assigning to
# tree2$control after fitting would not change the already-grown tree
tree2 <- rpart(Drug ~ .,
               data = train_data,
               method = "class",
               control = rpart.control(minsplit = 4,
                                       minbucket = 2,
                                       maxdepth = 5))

rpart.plot(tree2, extra = 1)

pred_tree2 <- predict(tree2, test_data, type = "class")
conf_matrix2 <- confusionMatrix(pred_tree2, test_data$Drug)
print(conf_matrix2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX drugY
##      drugA     4     0     0     0     0
##      drugB     0     3     0     0     0
##      drugC     0     0     3     0     0
##      drugX     0     0     0     9     0
##      drugY     0     0     0     1    18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.8619, 0.9993)
##     No Information Rate : 0.4737          
##     P-Value [Acc > NIR] : 2.015e-11       
##                                           
##                   Kappa : 0.9611          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity                1.0000      1.00000      1.00000       0.9000
## Specificity                1.0000      1.00000      1.00000       1.0000
## Pos Pred Value             1.0000      1.00000      1.00000       1.0000
## Neg Pred Value             1.0000      1.00000      1.00000       0.9655
## Prevalence                 0.1053      0.07895      0.07895       0.2632
## Detection Rate             0.1053      0.07895      0.07895       0.2368
## Detection Prevalence       0.1053      0.07895      0.07895       0.2368
## Balanced Accuracy          1.0000      1.00000      1.00000       0.9500
##                      Class: drugY
## Sensitivity                1.0000
## Specificity                0.9500
## Pos Pred Value             0.9474
## Neg Pred Value             1.0000
## Prevalence                 0.4737
## Detection Rate             0.4737
## Detection Prevalence       0.5000
## Balanced Accuracy          0.9750

Even with looser stopping rules (smaller minsplit and minbucket, plus a depth cap), rpart grows the same tree as before and produces an identical confusion matrix. The default complexity parameter (cp = 0.01) prunes away any extra splits the relaxed settings would permit, and the Na_to_K ratio is such a strong predictor that the first tree's splits already leave nearly pure nodes. The stability of this structure across parameterizations suggests a robust classification pattern rather than an artifact of one setting.
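One way to confirm the two fitted trees are structurally identical (a sketch; frame holds the node table of an rpart object):

# TRUE if both trees chose exactly the same splitting variables
identical(tree1$frame$var, tree2$frame$var)

# Compares the full node tables, including counts and fitted classes
all.equal(tree1$frame, tree2$frame)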

Random Forest Model

rf_model <- randomForest(Drug ~ ., 
                        data = train_data,
                        ntree = 500,
                        importance = TRUE)

pred_rf <- predict(rf_model, test_data)
conf_matrix_rf <- confusionMatrix(pred_rf, test_data$Drug)

importance(rf_model)
##                  drugA     drugB     drugC      drugX      drugY
## Age         25.7784655 27.968929 -2.387378  0.9064577 -0.1068566
## Sex         -0.2856645  1.242276 -2.935560 -0.2799703 -1.1826257
## BP          45.7892176 32.264481 32.953346 56.7800461  0.8282940
## Cholesterol -1.6517704 -1.587697 25.276727 22.5869738  0.9409756
## Na_to_K     33.4600218 24.874377 21.817845 56.0749104 85.2003892
##             MeanDecreaseAccuracy MeanDecreaseGini
## Age                    24.951425        14.492225
## Sex                    -1.557419         1.595065
## BP                     65.913196        27.435654
## Cholesterol            25.601394         5.864343
## Na_to_K                92.623196        58.051314
varImpPlot(rf_model)

print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX drugY
##      drugA     4     0     0     0     0
##      drugB     0     3     0     0     0
##      drugC     0     0     3     0     0
##      drugX     0     0     0     9     0
##      drugY     0     0     0     1    18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.8619, 0.9993)
##     No Information Rate : 0.4737          
##     P-Value [Acc > NIR] : 2.015e-11       
##                                           
##                   Kappa : 0.9611          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity                1.0000      1.00000      1.00000       0.9000
## Specificity                1.0000      1.00000      1.00000       1.0000
## Pos Pred Value             1.0000      1.00000      1.00000       1.0000
## Neg Pred Value             1.0000      1.00000      1.00000       0.9655
## Prevalence                 0.1053      0.07895      0.07895       0.2632
## Detection Rate             0.1053      0.07895      0.07895       0.2368
## Detection Prevalence       0.1053      0.07895      0.07895       0.2368
## Balanced Accuracy          1.0000      1.00000      1.00000       0.9500
##                      Class: drugY
## Sensitivity                1.0000
## Specificity                0.9500
## Pos Pred Value             0.9474
## Neg Pred Value             1.0000
## Prevalence                 0.4737
## Detection Rate             0.4737
## Detection Prevalence       0.5000
## Balanced Accuracy          0.9750

The Random Forest model confirms our findings from the individual decision trees. The variable importance measures put the Na_to_K ratio first (MeanDecreaseAccuracy 92.6, MeanDecreaseGini 58.1), with blood pressure a clear second; Sex is the least influential feature, with importance scores near zero. The forest matches the single tree's 97.37% accuracy, suggesting the decision boundaries in this data are simple enough that an ensemble adds no measurable benefit on this small test set.
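The out-of-bag (OOB) estimate offers a complementary check that does not depend on our particular 38-case test split (a sketch using standard randomForest accessors):

# OOB confusion matrix accumulated during training
rf_model$confusion

# OOB error rates after the final (500th) tree
tail(rf_model$err.rate, 1)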

Model Comparison

results <- data.frame(
  Model = c("Decision Tree 1", "Decision Tree 2", "Random Forest"),
  Accuracy = c(conf_matrix1$overall["Accuracy"],
               conf_matrix2$overall["Accuracy"],
               conf_matrix_rf$overall["Accuracy"]),
  Kappa = c(conf_matrix1$overall["Kappa"],
            conf_matrix2$overall["Kappa"],
            conf_matrix_rf$overall["Kappa"])
)

print(results)
##             Model  Accuracy     Kappa
## 1 Decision Tree 1 0.9736842 0.9611452
## 2 Decision Tree 2 0.9736842 0.9611452
## 3   Random Forest 0.9736842 0.9611452

All three models achieve identical accuracy and Kappa values, and all make the same single error: one drugX patient predicted as drugY. With only 38 test cases, this persistence suggests a genuinely ambiguous case near the drugX/drugY boundary rather than a modeling shortcoming. Since even the Random Forest cannot improve on a single decision tree here, the simpler, fully interpretable tree is arguably the most practical choice for this medical classification task.
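To inspect that ambiguous case directly (a sketch, assuming the predictions above are still in scope):

# Pull the test record(s) where the first tree's prediction disagrees
# with the true label; for this data it should be a single drugX patient
test_data[pred_tree1 != test_data$Drug, ]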