Comparison of 4 Supervised Models for Drug Classification (Multiclass Dataset)

Muhammad Athanabil Andi Fawazdhia

2023-03-23

Goal of Study

This study trains four supervised models, Decision Tree, K-Nearest Neighbors, Random Forest, and Support Vector Machine, to classify the Drug data (a multiclass dataset), and then identifies the best model based on its metric scores.

Import Libraries

# for data manipulation
library(tidyverse)
library(reshape2)

# for data visualization
library(ggplot2) 
library(dplyr)

# for train test split
library(caTools)

# for decision tree model
library(rpart)
library(rpart.plot)

# for Random Forest
library(randomForest)

# for knn model
library(class)

# for svm model
library(e1071)

# for confusion matrix
library(caret)

Import Dataset

data <- read.csv('C:/Users/ASUS/Downloads/Nabil/Projek/drug200.csv')
data <- data.frame(data)
head(data)
##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 DrugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 DrugY
## 6  22   F NORMAL        HIGH   8.607 drugX

Exploratory Data Analysis

Feature Information

Age : Age of the Patient

Sex : Gender of the Patient

BP : Blood Pressure Levels

Cholesterol : Cholesterol Levels

Na_to_K : Sodium to Potassium Ratio in Blood

Drug : Drug Type

Data Dimensions

dim(data)
## [1] 200   6

The data has 200 rows and 6 columns.

Checking Data Types

str(data)
## 'data.frame':    200 obs. of  6 variables:
##  $ Age        : int  23 47 47 28 61 22 49 41 60 43 ...
##  $ Sex        : chr  "F" "M" "M" "F" ...
##  $ BP         : chr  "HIGH" "LOW" "LOW" "NORMAL" ...
##  $ Cholesterol: chr  "HIGH" "HIGH" "HIGH" "HIGH" ...
##  $ Na_to_K    : num  25.4 13.1 10.1 7.8 18 ...
##  $ Drug       : chr  "DrugY" "drugC" "drugC" "drugX" ...

The data frame has 4 character columns, 1 numeric column, and 1 integer column.

Label Encoding

The data has several character variables that serve as independent variables. Because character-typed independent variables cannot be processed directly by the models, we convert them to numeric codes. The target variable (Drug) is encoded the same way.

data$Sex <- as.numeric(factor(data$Sex))
data$BP <- as.numeric(factor(data$BP))
data$Cholesterol <- as.numeric(factor(data$Cholesterol))
data$Drug <- as.numeric(factor(data$Drug))
head(data)
##   Age Sex BP Cholesterol Na_to_K Drug
## 1  23   1  1           1  25.355    5
## 2  47   2  2           1  13.093    3
## 3  47   2  2           1  10.114    3
## 4  28   1  3           1   7.798    4
## 5  61   1  2           1  18.043    5
## 6  22   1  3           1   8.607    4

The character variables were successfully converted to numeric type.
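as.numeric(factor(x)) assigns the codes in the sorted order of the factor levels, so it helps to inspect levels() to see which number stands for which category. A minimal sketch on a toy vector (bp_example and mapping are illustrative names, not part of the dataset):

```r
# Toy vector with the same categories as the BP column (illustrative only)
bp_example <- c("HIGH", "LOW", "LOW", "NORMAL")

bp_factor <- factor(bp_example)

# Levels are sorted, and the numeric code is the position in this vector
levels(bp_factor)

# Numeric codes, equivalent to as.numeric(factor(data$BP)) in the document
bp_codes <- as.numeric(bp_factor)

# Named lookup from category to code
mapping <- setNames(seq_along(levels(bp_factor)), levels(bp_factor))
mapping
```

Note that the codes are only labels: with this encoding NORMAL receives a larger code than LOW, which does not reflect a measured order.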

Checking Missing Values

sum(is.na(data))
## [1] 0

The data has no missing values.

Checking Correlation with Spearman Method

Spearman correlation is used to measure the strength and direction of the monotonic relationship between each X variable and the Y variable.

# creating correlation matrix
corr_mat <- round(cor(data,method="spearman"),2)
 
# reshape the correlation matrix to long format
melted_corr_mat <- melt(corr_mat)
 
# plotting the correlation heatmap
ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2,
                                   fill=value)) +
geom_tile() +
geom_text(aes(label = value),
          color = "white", size = 4)

The Spearman correlation shows that the Na_to_K variable has a correlation of 0.78 with the Y variable (Drug), which indicates a strong positive correlation. The Sex variable has a correlation of -0.09 with Y (Drug): the sign is negative, but the value is close to zero, so it indicates a very weak relationship rather than a strong negative one. Because Drug is a label-encoded category, these values also depend on the order of the codes.
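Spearman correlation is the Pearson correlation computed on the ranks of the values. A minimal sketch of that equivalence, using the six Na_to_K and encoded Drug values printed by head(data) above (the variable names here are illustrative):

```r
# First six rows of Na_to_K and encoded Drug, copied from head(data)
na_to_k <- c(25.355, 13.093, 10.114, 7.798, 18.043, 8.607)
drug_code <- c(5, 3, 3, 4, 5, 4)

# Built-in Spearman correlation
rho_builtin <- cor(na_to_k, drug_code, method = "spearman")

# Same value obtained as the Pearson correlation of the ranks
# (rank() averages tied ranks, which matches cor()'s handling of ties)
rho_manual <- cor(rank(na_to_k), rank(drug_code))

c(builtin = rho_builtin, manual = rho_manual)
```

Six rows are far too few to say anything about the dataset; the sketch only shows how the statistic is computed.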

Checking Outliers

boxplot(data,main="Data Outliers", xlab="Column Names", ylab="Values")

There are many outliers in Na_to_K.
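The points that boxplot() draws as outliers are those lying more than 1.5 times the interquartile range beyond the box, and boxplot.stats() returns them directly. A minimal sketch on a toy vector (the values are made up for illustration):

```r
# Toy values with one obviously extreme point (illustrative only)
values <- c(10, 11, 12, 13, 14, 15, 40)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses
outliers <- boxplot.stats(values)$out
outliers

# With the document's data this would be, for example:
# boxplot.stats(data$Na_to_K)$out
```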

Preprocessing

Train Test Split

set.seed(123)
split = sample.split(data$Drug, SplitRatio = 0.8)

training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)

cat('Len Data Before Train Test Split:', nrow(data))
## Len Data Before Train Test Split: 200
cat('Len Data Training (80%):',nrow(training_set))
## Len Data Training (80%): 160
cat('Len Data Testing (20%):',nrow(test_set))
## Len Data Testing (20%): 40
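sample.split() splits on the label vector so that the class proportions are preserved in both subsets. The idea can be sketched in base R as follows; stratified_split is a hypothetical helper written for illustration, not the caTools implementation:

```r
# Hypothetical helper: mark roughly `ratio` of each class as training rows
stratified_split <- function(y, ratio = 0.8) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # sample.int() avoids the surprising behavior of sample() on length-1 input
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(123)
y_demo <- rep(c("a", "b"), times = c(10, 40))  # toy imbalanced labels
split_demo <- stratified_split(y_demo, ratio = 0.8)

# Each class keeps the same 80/20 proportion
table(y_demo, split_demo)
```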

Modelling

Decision Tree

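A decision tree is a flowchart-like tree structure in which each internal node tests a feature, each branch represents an outcome of that test, and each leaf node assigns a class. Internal nodes are conventionally denoted by rectangles and leaf nodes by ovals.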

#train decision tree
dct <- rpart(Drug~., 
             data = training_set, 
             method = 'class')

rpart.plot(dct)

#predict with decision tree
y_pred_dct = predict(dct, test_set, type = 'class')

Making Confusion Matrix Plot

# confusionMatrix() expects the predictions first, then the actual labels
cm <- confusionMatrix(factor(y_pred_dct), factor(test_set$Drug), dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
# reverse the actual classes so that Class 1 appears at the top of the y-axis
plt$Actual <- factor(plt$Actual, levels=rev(levels(plt$Actual)))
ggplot(plt, aes(Prediction,Actual, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="blue") +
        labs(x = "Prediction",y = "Actual") +
        scale_x_discrete(labels=c("Class 1","Class 2","Class 3", "Class 4","Class 5")) +
        scale_y_discrete(labels=c("Class 5","Class 4","Class 3","Class 2","Class 1"))

Evaluation Metrics for Decision Tree

cm_dct <- table(test_set$Drug, y_pred_dct)
n_dct= sum(cm_dct) 
nc_dct = nrow(cm_dct) 
diag_dct = diag(cm_dct)  
rowsums_dct= apply(cm_dct, 1, sum) 
colsums_dct = apply(cm_dct, 2, sum) 
accuracy_dct = sum(diag_dct) / n_dct 
precision_dct = diag_dct / colsums_dct 
recall_dct = diag_dct / rowsums_dct 
f1_dct = 2 * precision_dct * recall_dct / (precision_dct + recall_dct) 
metrics_dct <- data.frame(accuracy_dct,precision_dct, recall_dct, f1_dct) 
metrics_dct
##   accuracy_dct precision_dct recall_dct f1_dct
## 1            1             1          1      1
## 2            1             1          1      1
## 3            1             1          1      1
## 4            1             1          1      1
## 5            1             1          1      1

Macro-Averaged Evaluation Metrics for Decision Tree

macroPrecision_dct = mean(precision_dct)
macroRecall_dct = mean(recall_dct)
macroF1_dct = mean(f1_dct)
macro_metrics_dct <-  data.frame(accuracy_dct,macroPrecision_dct, 
                                 macroRecall_dct, macroF1_dct)
macro_metrics_dct
##   accuracy_dct macroPrecision_dct macroRecall_dct macroF1_dct
## 1            1                  1               1           1

The Decision Tree classifies all 40 test samples correctly: accuracy, macro precision, macro recall, and macro F1 are all 100%.

K-Nearest Neighbors

The K-nearest neighbors (KNN) algorithm is simple to implement, and its outcome is easy to justify because each prediction comes from the closest training examples. The algorithm is also considered robust, with a small error ratio.

K-Nearest Neighbors Algorithm

  1. Choose k, the number of nearest neighbors, and let D be the set of training examples.

  2. For each test example xi, calculate the distance d(xi, yi) to every training example yi in D, based on the chosen distance measure.

  3. Select the k training examples yi that are closest to the test example xi.

  4. Use majority voting among those k neighbors to classify the test example.

  5. End for.
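The steps above can be sketched from scratch in base R. This is a minimal illustration assuming Euclidean distance and toy data; knn_predict is a hypothetical helper, not the class::knn implementation used below:

```r
# Hypothetical helper implementing the listed steps with Euclidean distance
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(x) {
    # Step 2: distance from this test example to every training example
    d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
    # Step 3: indices of the k closest training examples
    nearest <- order(d)[seq_len(k)]
    # Step 4: majority vote (ties go to the first label in sorted order)
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Toy data: two well-separated groups (illustrative only)
train_x <- matrix(c(1, 1,  1, 2,  2, 1,
                    8, 8,  8, 9,  9, 8), ncol = 2, byrow = TRUE)
train_y <- c("a", "a", "a", "b", "b", "b")
test_x <- matrix(c(1.5, 1.5,
                   8.5, 8.5), ncol = 2, byrow = TRUE)

toy_pred <- knn_predict(train_x, train_y, test_x, k = 3)
toy_pred
```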

knn_ <- knn(train = training_set,
                      test = test_set,
                      cl = training_set$Drug,
                      k = 3)
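The call above passes the full data frames, so the Drug column itself takes part in the distance calculation, and the unscaled Age and Na_to_K columns dominate the distances. A possible variant is sketched below, assuming the target should be excluded and the features standardized; knn_without_target is a hypothetical helper, and its results will differ from the KNN output reported in this document:

```r
library(class)  # provides knn()

# Hypothetical helper: drop the target column and standardize the features
# with the mean and standard deviation of the training set
knn_without_target <- function(train, test, target, k = 3) {
  features <- setdiff(names(train), target)
  train_x <- scale(train[, features, drop = FALSE])
  test_x <- scale(test[, features, drop = FALSE],
                  center = attr(train_x, "scaled:center"),
                  scale = attr(train_x, "scaled:scale"))
  knn(train = train_x, test = test_x, cl = factor(train[[target]]), k = k)
}

# Possible usage with the objects defined in this document:
# knn_alt <- knn_without_target(training_set, test_set, "Drug", k = 3)
```

Standardizing with the training set's statistics keeps information from the test set out of the preprocessing step.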

Making Confusion Matrix Plot

# confusionMatrix() expects the predictions first, then the actual labels
cm <- confusionMatrix(factor(knn_), factor(test_set$Drug), dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
# reverse the actual classes so that Class 1 appears at the top of the y-axis
plt$Actual <- factor(plt$Actual, levels=rev(levels(plt$Actual)))
ggplot(plt, aes(Prediction,Actual, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="blue") +
        labs(x = "Prediction",y = "Actual") +
        scale_x_discrete(labels=c("Class 1","Class 2","Class 3", "Class 4","Class 5")) +
        scale_y_discrete(labels=c("Class 5","Class 4","Class 3","Class 2","Class 1"))

Evaluation Metrics for K-Nearest Neighbors

cm_knn <- table(test_set$Drug, knn_)
n_knn= sum(cm_knn) 
nc_knn = nrow(cm_knn) 
diag_knn = diag(cm_knn)  
rowsums_knn= apply(cm_knn, 1, sum) 
colsums_knn = apply(cm_knn, 2, sum) 
accuracy_knn = sum(diag_knn) / n_knn 
precision_knn = diag_knn / colsums_knn 
recall_knn = diag_knn / rowsums_knn 
f1_knn = 2 * precision_knn * recall_knn / (precision_knn + recall_knn) 
metrics_knn <- data.frame(accuracy_knn,precision_knn, recall_knn, f1_knn) 
metrics_knn
##   accuracy_knn precision_knn recall_knn    f1_knn
## 1         0.85     1.0000000  1.0000000 1.0000000
## 2         0.85     0.7500000  1.0000000 0.8571429
## 3         0.85     0.0000000  0.0000000       NaN
## 4         0.85     0.8000000  0.7272727 0.7619048
## 5         0.85     0.9473684  1.0000000 0.9729730

Macro-Averaged Evaluation Metrics for K-Nearest Neighbors

macroPrecision_knn = mean(precision_knn)
macroRecall_knn = mean(recall_knn)
macroF1_knn = mean(f1_knn)
macro_metrics_knn <-  data.frame(accuracy_knn,macroPrecision_knn, 
                                 macroRecall_knn, macroF1_knn)
macro_metrics_knn
##   accuracy_knn macroPrecision_knn macroRecall_knn macroF1_knn
## 1         0.85          0.6994737       0.7454545         NaN

The accuracy is 85%. The value of macroF1_knn is NaN because class 3 has a precision and recall of 0, so its F1 score is 0/0: KNN did not correctly predict any test sample of that class.
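One common convention is to count an undefined per-class F1 as 0 before averaging, so that the macro F1 stays a number. A minimal sketch under that assumption, on a made-up 3-class confusion matrix with actual classes in rows and predictions in columns (macro_f1 and toy_cm are illustrative names):

```r
# Hypothetical helper: macro F1 with undefined (0/0) class scores counted as 0
macro_f1 <- function(cm) {
  tp <- diag(cm)
  precision <- tp / colSums(cm)  # columns hold the predicted classes
  recall <- tp / rowSums(cm)     # rows hold the actual classes
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0            # assumption: undefined F1 is treated as 0
  mean(f1)
}

# Toy confusion matrix: the second class is never predicted correctly
toy_cm <- matrix(c(2, 0, 0,
                   0, 0, 1,
                   0, 1, 3), nrow = 3, byrow = TRUE)

toy_macro_f1 <- macro_f1(toy_cm)
toy_macro_f1
```

Counting the class as 0 lowers the average, which reflects that the model failed on that class instead of hiding it.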

Random Forest

A random forest is built by aggregating many decision trees, and it can be used for both classification and regression. One of its major advantages is that it reduces the risk of overfitting compared with a single tree.
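For classification, the aggregation step is a majority vote over the predictions of the individual trees. A minimal sketch of that step only, with made-up votes from three trees (majority_vote and tree_votes are illustrative; this is not how the randomForest package is implemented internally):

```r
# Hypothetical helper: one row per sample, one column per tree
majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Made-up predictions of three trees for three samples
tree_votes <- matrix(c("DrugX", "DrugX", "DrugY",
                       "DrugY", "DrugY", "DrugY",
                       "DrugA", "DrugB", "DrugA"), ncol = 3, byrow = TRUE)

forest_pred <- majority_vote(tree_votes)
forest_pred
```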

# randomForest() has no 'method' argument; a factor response selects classification
RF = randomForest(factor(Drug)~.,data = training_set)
y_pred_RF = predict(RF, test_set, type = 'class')

Making Confusion Matrix Plot

# confusionMatrix() expects the predictions first, then the actual labels
cm <- confusionMatrix(factor(y_pred_RF), factor(test_set$Drug), dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
# reverse the actual classes so that Class 1 appears at the top of the y-axis
plt$Actual <- factor(plt$Actual, levels=rev(levels(plt$Actual)))
ggplot(plt, aes(Prediction,Actual, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="blue") +
        labs(x = "Prediction",y = "Actual") +
        scale_x_discrete(labels=c("Class 1","Class 2","Class 3", "Class 4","Class 5")) +
        scale_y_discrete(labels=c("Class 5","Class 4","Class 3","Class 2","Class 1"))

Evaluation Metrics for Random Forest

cm_rf<- table(test_set$Drug, y_pred_RF)
n_rf= sum(cm_rf) 
nc_rf = nrow(cm_rf) 
diag_rf = diag(cm_rf)  
rowsums_rf= apply(cm_rf, 1, sum) 
colsums_rf = apply(cm_rf, 2, sum) 
accuracy_rf = sum(diag_rf) / n_rf 
precision_rf = diag_rf / colsums_rf 
recall_rf = diag_rf / rowsums_rf 
f1_rf = 2 * precision_rf * recall_rf / (precision_rf + recall_rf) 
metrics_rf <- data.frame(accuracy_rf,precision_rf, recall_rf, f1_rf) 
metrics_rf
##   accuracy_rf precision_rf recall_rf f1_rf
## 1           1            1         1     1
## 2           1            1         1     1
## 3           1            1         1     1
## 4           1            1         1     1
## 5           1            1         1     1

Macro-Averaged Evaluation Metrics for Random Forest

macroPrecision_rf = mean(precision_rf)
macroRecall_rf = mean(recall_rf)
macroF1_rf = mean(f1_rf)
macro_metrics_rf <-  data.frame(accuracy_rf,macroPrecision_rf, 
                                 macroRecall_rf, macroF1_rf)
macro_metrics_rf
##   accuracy_rf macroPrecision_rf macroRecall_rf macroF1_rf
## 1           1                 1              1          1

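The Random Forest also classifies all 40 test samples correctly: accuracy, macro precision, macro recall, and macro F1 are all 100%.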

Support Vector Machine

SVM can classify the examples in a training set into categories using either a linear or a non-linear model. Whether the classifier is linear or non-linear is determined by its kernel function, e.g. linear, polynomial, radial basis function (RBF), or sigmoid. With the kernel trick, SVM can therefore also perform non-linear classification.
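The radial (RBF) kernel used in the call below measures similarity as K(x, y) = exp(-gamma * ||x - y||^2): it equals 1 for identical points and decays toward 0 as they move apart. A minimal sketch of that formula (rbf_kernel is an illustrative helper and the gamma values are arbitrary, not the ones chosen by svm()):

```r
# Hypothetical helper: RBF kernel between two numeric vectors
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

k_same <- rbf_kernel(c(0, 0), c(0, 0), gamma = 0.5)  # identical points
k_near <- rbf_kernel(c(1, 0), c(0, 0), gamma = 1)    # squared distance 1
k_far  <- rbf_kernel(c(3, 4), c(0, 0), gamma = 1)    # squared distance 25

c(same = k_same, near = k_near, far = k_far)
```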

SVM = svm(Drug ~ .,data = training_set,type = 'C-classification',
                 kernel = 'radial')
y_pred_svm = predict(object = SVM, test_set, type='class')

Making Confusion Matrix Plot

# confusionMatrix() expects the predictions first, then the actual labels
cm <- confusionMatrix(factor(y_pred_svm), factor(test_set$Drug), dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
# reverse the actual classes so that Class 1 appears at the top of the y-axis
plt$Actual <- factor(plt$Actual, levels=rev(levels(plt$Actual)))
ggplot(plt, aes(Prediction,Actual, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="blue") +
        labs(x = "Prediction",y = "Actual") +
        scale_x_discrete(labels=c("Class 1","Class 2","Class 3", "Class 4","Class 5")) +
        scale_y_discrete(labels=c("Class 5","Class 4","Class 3","Class 2","Class 1"))

Evaluation Metrics for Support Vector Machine

cm_<- table(test_set$Drug, y_pred_svm)
n_= sum(cm_) 
nc_ = nrow(cm_) 
diag_ = diag(cm_)  
rowsums_= apply(cm_, 1, sum) 
colsums_ = apply(cm_, 2, sum) 
accuracy_ = sum(diag_) / n_ 
precision_ = diag_ / colsums_ 
recall_ = diag_ / rowsums_ 
f1_ = 2 * precision_ * recall_ / (precision_ + recall_) 
metrics_ <- data.frame(accuracy_,precision_, recall_, f1_) 
metrics_
##   accuracy_ precision_   recall_       f1_
## 1      0.95  0.8333333 1.0000000 0.9090909
## 2      0.95  1.0000000 0.3333333 0.5000000
## 3      0.95  1.0000000 1.0000000 1.0000000
## 4      0.95  1.0000000 1.0000000 1.0000000
## 5      0.95  0.9473684 1.0000000 0.9729730

Macro-Averaged Evaluation Metrics for Support Vector Machine

macroPrecision_ = mean(precision_)
macroRecall_ = mean(recall_)
macroF1_ = mean(f1_)
macro_metrics_ <-  data.frame(accuracy_,macroPrecision_, 
                                 macroRecall_, macroF1_)
macro_metrics_
##   accuracy_ macroPrecision_ macroRecall_  macroF1_
## 1      0.95       0.9561404    0.8666667 0.8764128

The accuracy is 95%, with a macro F1 of about 0.88. Class 2 has the lowest recall (0.33).

Store Classification Results to Dataframe

test_set$Drug <- factor(test_set$Drug,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$Dct <- factor(y_pred_dct,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$KNN <- factor(knn_,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$RF <- factor(y_pred_RF,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$SVM <- factor(y_pred_svm,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
head(test_set,20)
##     Age Sex BP Cholesterol Na_to_K  Drug   Dct   KNN    RF   SVM
## 9    60   2  3           1  15.171 DrugY DrugY DrugY DrugY DrugY
## 10   43   2  2           2  19.368 DrugY DrugY DrugY DrugY DrugY
## 14   74   1  2           1  20.942 DrugY DrugY DrugY DrugY DrugY
## 17   69   2  2           2  11.455 DrugX DrugX DrugC DrugX DrugX
## 21   57   2  2           2  19.128 DrugY DrugY DrugY DrugY DrugY
## 26   28   1  1           2  18.809 DrugY DrugY DrugY DrugY DrugY
## 33   49   2  2           2  11.014 DrugX DrugX DrugX DrugX DrugX
## 34   65   1  1           2  31.876 DrugY DrugY DrugY DrugY DrugY
## 39   39   1  3           2   9.709 DrugX DrugX DrugX DrugX DrugX
## 41   73   1  3           1  19.221 DrugY DrugY DrugY DrugY DrugY
## 50   28   1  2           1  19.796 DrugY DrugY DrugY DrugY DrugY
## 63   67   2  2           2  20.693 DrugY DrugY DrugY DrugY DrugY
## 65   60   1  1           1  13.303 DrugB DrugB DrugB DrugB DrugY
## 66   68   1  3           2  27.050 DrugY DrugY DrugY DrugY DrugY
## 68   17   2  3           2  10.832 DrugX DrugX DrugX DrugX DrugX
## 77   36   1  1           1  11.198 DrugA DrugA DrugA DrugA DrugA
## 99   20   2  1           2  35.639 DrugY DrugY DrugY DrugY DrugY
## 104  56   2  3           1   8.966 DrugX DrugX DrugX DrugX DrugX
## 107  22   2  3           1  11.953 DrugX DrugX DrugX DrugX DrugX
## 109  72   2  1           2   9.677 DrugB DrugB DrugB DrugB DrugB
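The accuracies reported in the sections above can be collected into one table for a side-by-side comparison. A minimal sketch, with the values copied by hand from the outputs above (model_scores and ranked_scores are illustrative names):

```r
# Accuracies copied from the metric outputs of each model above
model_scores <- data.frame(
  model = c("Decision Tree", "K-Nearest Neighbors",
            "Random Forest", "Support Vector Machine"),
  accuracy = c(1.00, 0.85, 1.00, 0.95),
  stringsAsFactors = FALSE
)

# Sort from best to worst; tied models keep their original order
ranked_scores <- model_scores[order(-model_scores$accuracy), ]
ranked_scores
```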

Conclusion

Based on the evaluation metrics of the four models, it can be concluded that in this study the Decision Tree and Random Forest produce the best metric scores (100% accuracy on the test set) for classifying this multiclass (5-class) dataset, followed by the Support Vector Machine (95%) and K-Nearest Neighbors (85%).