library(skimr)        # dataset summaries
library(ggplot2)      # visualization
library(caret)        # data partitioning and model evaluation
library(rpart)        # decision trees
library(rpart.plot)   # tree visualization
library(randomForest) # random forest ensembles

drug_data <- read.csv('https://raw.githubusercontent.com/Kingtilon1/MachineLearning-BigData/refs/heads/main/DecisionTree/drug200.csv')
head(drug_data)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 drugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 drugY
## 6 22 F NORMAL HIGH 8.607 drugX
str(drug_data)
## 'data.frame': 200 obs. of 6 variables:
## $ Age : int 23 47 47 28 61 22 49 41 60 43 ...
## $ Sex : chr "F" "M" "M" "F" ...
## $ BP : chr "HIGH" "LOW" "LOW" "NORMAL" ...
## $ Cholesterol: chr "HIGH" "HIGH" "HIGH" "HIGH" ...
## $ Na_to_K : num 25.4 13.1 10.1 7.8 18 ...
## $ Drug : chr "drugY" "drugC" "drugC" "drugX" ...
skim(drug_data)
| Name | drug_data |
|---|---|
| Number of rows | 200 |
| Number of columns | 6 |
| Column type frequency: | |
| character | 4 |
| numeric | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| BP | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
| Cholesterol | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Drug | 0 | 1 | 5 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 44.31 | 16.54 | 15.00 | 31.00 | 45.00 | 58.00 | 74.00 | ▆▆▇▆▆ |
| Na_to_K | 0 | 1 | 16.08 | 7.22 | 6.27 | 10.45 | 13.94 | 19.38 | 38.25 | ▇▆▂▂▁ |
The medical dataset contains records for 200 patients and the drug each was prescribed. A quick examination shows a complete dataset with no missing values, which simplifies preprocessing for tree-based models. The data include patient age (15 to 74 years), sex, blood pressure level, cholesterol level, and the sodium-to-potassium ratio (Na_to_K). This clean, well-structured dataset provides a solid foundation for the classification task.
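As a quick programmatic sanity check (not part of the original pipeline), counting missing values per column confirms the completeness claim:

# Count NA values per column; all zeros confirms a complete dataset
colSums(is.na(drug_data))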
table(drug_data$Drug)
##
## drugA drugB drugC drugX drugY
## 23 16 16 54 91
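drugY accounts for 91 of the 200 records (45.5%), which is why the No Information Rate reported later sits near 0.47: it is simply the drugY share of the test set. Expressing the counts as proportions (a supplementary view) makes the imbalance explicit:

# Class proportions; drugY dominates at roughly 45% of records
round(prop.table(table(drug_data$Drug)), 3)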
ggplot(drug_data, aes(x = Drug, fill = BP)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  labs(title = "Drug Distribution by Blood Pressure Levels")

ggplot(drug_data, aes(x = Drug, fill = Cholesterol)) +
  geom_bar(position = "stack") +
  theme_minimal() +
  labs(title = "Drug Distribution by Cholesterol Levels")
Understanding how drugs are prescribed across patient characteristics hints at the decision boundaries our trees might identify. The stacked bar charts show clear patterns in prescriptions by blood pressure and cholesterol level, and these distributions will help us interpret the splitting decisions the trees make.
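Since Na_to_K later emerges as the dominant splitting variable, one additional view worth examining (not part of the original analysis) is the distribution of the ratio by drug:

# Na_to_K distribution by drug; drugY should separate cleanly at high ratios
ggplot(drug_data, aes(x = Drug, y = Na_to_K, fill = Drug)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Na_to_K Ratio by Drug")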
drug_data$Sex <- as.factor(drug_data$Sex)
drug_data$BP <- as.factor(drug_data$BP)
drug_data$Cholesterol <- as.factor(drug_data$Cholesterol)
drug_data$Drug <- as.factor(drug_data$Drug)
set.seed(123)
train_index <- createDataPartition(drug_data$Drug, p = 0.8, list = FALSE)
train_data <- drug_data[train_index, ]
test_data <- drug_data[-train_index, ]
Before building the trees, we convert the categorical variables to factors so that rpart and randomForest treat them as nominal predictors. createDataPartition() performs a stratified 80/20 split, maintaining proportional representation of the drug classes between training and test sets, so the trees learn from and are evaluated on representative samples.
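To verify that the stratification worked, one can compare class proportions across the two sets (a quick check, not in the original code):

# Compare class proportions in the training and test sets;
# near-identical rows confirm the stratified split
round(rbind(train = prop.table(table(train_data$Drug)),
            test  = prop.table(table(test_data$Drug))), 3)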
tree1 <- rpart(Drug ~ .,
               data = train_data,
               method = "class",
               control = rpart.control(minsplit = 5))
rpart.plot(tree1, extra = 1)
pred_tree1 <- predict(tree1, test_data, type = "class")
conf_matrix1 <- confusionMatrix(pred_tree1, test_data$Drug)
print(conf_matrix1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX drugY
## drugA 4 0 0 0 0
## drugB 0 3 0 0 0
## drugC 0 0 3 0 0
## drugX 0 0 0 9 0
## drugY 0 0 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.9737
## 95% CI : (0.8619, 0.9993)
## No Information Rate : 0.4737
## P-Value [Acc > NIR] : 2.015e-11
##
## Kappa : 0.9611
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.0000 1.00000 1.00000 0.9000
## Specificity 1.0000 1.00000 1.00000 1.0000
## Pos Pred Value 1.0000 1.00000 1.00000 1.0000
## Neg Pred Value 1.0000 1.00000 1.00000 0.9655
## Prevalence 0.1053 0.07895 0.07895 0.2632
## Detection Rate 0.1053 0.07895 0.07895 0.2368
## Detection Prevalence 0.1053 0.07895 0.07895 0.2368
## Balanced Accuracy 1.0000 1.00000 1.00000 0.9500
## Class: drugY
## Sensitivity 1.0000
## Specificity 0.9500
## Pos Pred Value 0.9474
## Neg Pred Value 1.0000
## Prevalence 0.4737
## Detection Rate 0.4737
## Detection Prevalence 0.5000
## Balanced Accuracy 0.9750
Our first decision tree achieves 97.37% accuracy on the test set. The tree structure reveals that the Na_to_K ratio serves as the primary splitting criterion, with blood pressure and age as secondary decision nodes. The model predicts drugs A, B, and C perfectly; the only error is a single drugX case misclassified as drugY. The high Kappa value of 0.9611 confirms the model's performance is far above what chance agreement would produce.
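To confirm which variables the tree actually uses, rpart stores the split information and a variable-importance vector on the fitted object; inspecting them (a supplementary check) should show Na_to_K at the top:

# Variables used in tree construction plus the complexity table,
# and the surrogate-adjusted variable importance scores
printcp(tree1)
tree1$variable.importance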
tree2 <- rpart(Drug ~ .,
               data = train_data,
               method = "class",
               control = rpart.control(minsplit = 4,
                                       minbucket = 2,
                                       maxdepth = 5))
rpart.plot(tree2, extra = 1)
pred_tree2 <- predict(tree2, test_data, type = "class")
conf_matrix2 <- confusionMatrix(pred_tree2, test_data$Drug)
print(conf_matrix2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX drugY
## drugA 4 0 0 0 0
## drugB 0 3 0 0 0
## drugC 0 0 3 0 0
## drugX 0 0 0 9 0
## drugY 0 0 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.9737
## 95% CI : (0.8619, 0.9993)
## No Information Rate : 0.4737
## P-Value [Acc > NIR] : 2.015e-11
##
## Kappa : 0.9611
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.0000 1.00000 1.00000 0.9000
## Specificity 1.0000 1.00000 1.00000 1.0000
## Pos Pred Value 1.0000 1.00000 1.00000 1.0000
## Neg Pred Value 1.0000 1.00000 1.00000 0.9655
## Prevalence 0.1053 0.07895 0.07895 0.2632
## Detection Rate 0.1053 0.07895 0.07895 0.2368
## Detection Prevalence 0.1053 0.07895 0.07895 0.2368
## Balanced Accuracy 1.0000 1.00000 1.00000 0.9500
## Class: drugY
## Sensitivity 1.0000
## Specificity 0.9500
## Pos Pred Value 0.9474
## Neg Pred Value 1.0000
## Prevalence 0.4737
## Detection Rate 0.4737
## Detection Prevalence 0.5000
## Balanced Accuracy 0.9750
Even with looser stopping rules (minsplit = 4, minbucket = 2, maxdepth = 5), the algorithm grows the same tree as before. Because the default complexity parameter (cp = 0.01) already prunes away splits that add little, and the Na_to_K ratio separates the classes so cleanly, alternative splitting arrangements do not improve classification performance. The stability of this structure across parameterizations indicates we have found a robust classification pattern.
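One way to probe how heavily the model leans on Na_to_K (an exploratory sketch, not part of the original analysis) is to refit the tree without it and compare test accuracy:

# Refit without Na_to_K to see what structure the remaining
# predictors support; accuracy will likely drop, especially for drugY
tree_no_nak <- rpart(Drug ~ . - Na_to_K,
                     data = train_data,
                     method = "class",
                     control = rpart.control(minsplit = 5))
pred_no_nak <- predict(tree_no_nak, test_data, type = "class")
confusionMatrix(pred_no_nak, test_data$Drug)$overall["Accuracy"]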
rf_model <- randomForest(Drug ~ .,
                         data = train_data,
                         ntree = 500,
                         importance = TRUE)
pred_rf <- predict(rf_model, test_data)
conf_matrix_rf <- confusionMatrix(pred_rf, test_data$Drug)
importance(rf_model)
## drugA drugB drugC drugX drugY
## Age 25.7784655 27.968929 -2.387378 0.9064577 -0.1068566
## Sex -0.2856645 1.242276 -2.935560 -0.2799703 -1.1826257
## BP 45.7892176 32.264481 32.953346 56.7800461 0.8282940
## Cholesterol -1.6517704 -1.587697 25.276727 22.5869738 0.9409756
## Na_to_K 33.4600218 24.874377 21.817845 56.0749104 85.2003892
## MeanDecreaseAccuracy MeanDecreaseGini
## Age 24.951425 14.492225
## Sex -1.557419 1.595065
## BP 65.913196 27.435654
## Cholesterol 25.601394 5.864343
## Na_to_K 92.623196 58.051314
varImpPlot(rf_model)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX drugY
## drugA 4 0 0 0 0
## drugB 0 3 0 0 0
## drugC 0 0 3 0 0
## drugX 0 0 0 9 0
## drugY 0 0 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.9737
## 95% CI : (0.8619, 0.9993)
## No Information Rate : 0.4737
## P-Value [Acc > NIR] : 2.015e-11
##
## Kappa : 0.9611
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.0000 1.00000 1.00000 0.9000
## Specificity 1.0000 1.00000 1.00000 1.0000
## Pos Pred Value 1.0000 1.00000 1.00000 1.0000
## Neg Pred Value 1.0000 1.00000 1.00000 0.9655
## Prevalence 0.1053 0.07895 0.07895 0.2632
## Detection Rate 0.1053 0.07895 0.07895 0.2368
## Detection Prevalence 0.1053 0.07895 0.07895 0.2368
## Balanced Accuracy 1.0000 1.00000 1.00000 0.9500
## Class: drugY
## Sensitivity 1.0000
## Specificity 0.9500
## Pos Pred Value 0.9474
## Neg Pred Value 1.0000
## Prevalence 0.4737
## Detection Rate 0.4737
## Detection Prevalence 0.5000
## Balanced Accuracy 0.9750
The Random Forest confirms our findings from the individual decision trees. The importance measures rank Na_to_K as the strongest predictor (MeanDecreaseAccuracy 92.6, MeanDecreaseGini 58.1), with blood pressure a clear second. Sex contributes essentially nothing; its permutation importance is even slightly negative. The ensemble matches the single trees' 97.37% accuracy, suggesting the relationships in the data are simple enough that ensemble methods provide no additional benefit.
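Random forests also provide an out-of-bag (OOB) error estimate that does not depend on our single hold-out split; checking it (a supplementary diagnostic) guards against an overly optimistic test set:

# OOB confusion matrix with per-class error, and the final
# OOB error rates after all 500 trees
rf_model$confusion
tail(rf_model$err.rate, 1)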
results <- data.frame(
  Model = c("Decision Tree 1", "Decision Tree 2", "Random Forest"),
  Accuracy = c(conf_matrix1$overall["Accuracy"],
               conf_matrix2$overall["Accuracy"],
               conf_matrix_rf$overall["Accuracy"]),
  Kappa = c(conf_matrix1$overall["Kappa"],
            conf_matrix2$overall["Kappa"],
            conf_matrix_rf$overall["Kappa"])
)
print(results)
## Model Accuracy Kappa
## 1 Decision Tree 1 0.9736842 0.9611452
## 2 Decision Tree 2 0.9736842 0.9611452
## 3 Random Forest 0.9736842 0.9611452
The identical accuracy and Kappa values across all three models suggest we have captured the dominant patterns in the data. The same single drugX-to-drugY misclassification persists in every model, which may reflect a genuinely ambiguous case near a decision boundary. Since even the Random Forest cannot improve on the single decision tree, the simpler, interpretable tree is arguably the most practical choice for this medical classification task.
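One caveat: with only 38 test cases, the 95% confidence interval on accuracy is wide (0.862 to 0.999). A repeated cross-validation estimate, sketched below with caret (fold and repeat counts chosen for illustration), would give a steadier measure before settling on the single tree:

# 10-fold CV repeated 5 times gives a more stable accuracy estimate
# than a single 38-case hold-out split; caret tunes cp automatically
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
cv_tree <- train(Drug ~ ., data = drug_data,
                 method = "rpart",
                 trControl = ctrl)
cv_tree$results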