Goal of Study
This study trains Decision Tree, K-Nearest Neighbors, Random Forest, and Support Vector Machine classifiers on the Drug dataset (a multiclass dataset); the best model is then selected based on its evaluation metric scores.
Import Library
# for data manipulation
library(tidyverse)
library(reshape2)
# for data visualization (ggplot2 and dplyr are also attached by tidyverse)
library(ggplot2)
library(dplyr)
# for train test split
library(caTools)
# for decision tree model
library(rpart)
library(rpart.plot)
# for Random Forest
library(randomForest)
# for knn model
library(class)
# for svm model
library(e1071)
# for confusion matrix
library(caret)
Import Dataset
# read.csv() already returns a data.frame
data <- read.csv('C:/Users/ASUS/Downloads/Nabil/Projek/drug200.csv')
head(data)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 DrugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 DrugY
## 6 22 F NORMAL HIGH 8.607 drugX
Exploratory Data Analysis
Feature Information
Age : Age of the patient
Sex : Gender of the patient
BP : Blood pressure level
Cholesterol : Cholesterol level
Na_to_K : Sodium-to-potassium ratio in blood
Drug : Drug type (the target variable)
Data Dimensions
dim(data)
## [1] 200 6
The data has 200 rows and 6 columns.
Checking Data Types
str(data)
## 'data.frame': 200 obs. of 6 variables:
## $ Age : int 23 47 47 28 61 22 49 41 60 43 ...
## $ Sex : chr "F" "M" "M" "F" ...
## $ BP : chr "HIGH" "LOW" "LOW" "NORMAL" ...
## $ Cholesterol: chr "HIGH" "HIGH" "HIGH" "HIGH" ...
## $ Na_to_K : num 25.4 13.1 10.1 7.8 18 ...
## $ Drug : chr "DrugY" "drugC" "drugC" "drugX" ...
The data frame has four character columns, one numeric column, and one integer column.
Label Encoding
The data has several character variables that will serve as independent variables. Since the models cannot process character columns directly, we convert them to numeric codes.
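Note that factor() assigns integer codes in alphabetical order of the distinct values. As a quick sanity check, the level-to-code mapping can be printed before the conversion (a minimal sketch):
# print the level-to-code mapping for each character column;
# factor() numbers the levels alphabetically: 1, 2, 3, ...
for (col in c("Sex", "BP", "Cholesterol", "Drug")) {
  lv <- levels(factor(data[[col]]))
  cat(col, ":", paste(seq_along(lv), lv, sep = " = ", collapse = ", "), "\n")
}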
data$Sex <- as.numeric(factor(data$Sex))
data$BP <- as.numeric(factor(data$BP))
data$Cholesterol <- as.numeric(factor(data$Cholesterol))
data$Drug <- as.numeric(factor(data$Drug))
head(data)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 1 1 1 25.355 5
## 2 47 2 2 1 13.093 3
## 3 47 2 2 1 10.114 3
## 4 28 1 3 1 7.798 4
## 5 61 1 2 1 18.043 5
## 6 22 1 3 1 8.607 4
The character variables have been successfully converted to numeric codes.
Checking Missing Values
sum(is.na(data))
## [1] 0
The data has no missing values.
Checking Correlation with the Spearman Method
Spearman correlation measures the monotonic relationship between two variables; here it is used to assess how each feature relates to the target variable Drug.
# creating correlation matrix
corr_mat <- round(cor(data,method="spearman"),2)
# reduce the size of correlation matrix
melted_corr_mat <- melt(corr_mat)
# plotting the correlation heatmap
ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2,
fill=value)) +
geom_tile() +
geom_text(aes(Var2, Var1, label = value),
color = "white", size = 4)
The Spearman heatmap shows that Na_to_K has a correlation of 0.78 with the target Drug, the strongest positive association among the features, while Sex has a correlation of -0.09, which indicates only a very weak negative association with Drug.
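To check whether an individual correlation is statistically meaningful, base R's cor.test() can be used (a minimal sketch; with tied ranks R reports an approximate p-value):
# test the Spearman correlation between Na_to_K and the encoded Drug label
cor.test(data$Na_to_K, data$Drug, method = "spearman")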
Checking Outliers
boxplot(data,main="Data Outliers", xlab="Column Names", ylab="Values")
The boxplot shows that Na_to_K contains many outliers.
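The outliers flagged by the boxplot can be counted with Tukey's 1.5 * IQR rule (a minimal sketch):
# count Na_to_K values outside the 1.5 * IQR whiskers
q <- quantile(data$Na_to_K, c(0.25, 0.75))
iqr <- q[2] - q[1]
sum(data$Na_to_K < q[1] - 1.5 * iqr | data$Na_to_K > q[2] + 1.5 * iqr)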
Preprocessing
Train Test Split
set.seed(123)
split = sample.split(data$Drug, SplitRatio = 0.8)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
cat('Len Data Before Train Test Split:', nrow(data))
## Len Data Before Train Test Split: 200
cat('Len Data Training (80%):',nrow(training_set))
## Len Data Training (80%): 160
cat('Len Data Testing (20%):',nrow(test_set))
## Len Data Testing (20%): 40
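sample.split() stratifies on the label it is given, so the class proportions should be roughly the same in both subsets; a quick check (a sketch):
# compare class proportions between the training and test sets
round(prop.table(table(training_set$Drug)), 2)
round(prop.table(table(test_set$Drug)), 2)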
Modelling
Decision Tree
A decision tree is a flowchart-like tree structure, where each internal node (conventionally drawn as a rectangle) tests an attribute and each leaf node (drawn as an oval) holds a class label.
#train decision tree
dct <- rpart(Drug~.,
data = training_set,
method = 'class')
rpart.plot(dct)
#predict with decision tree
y_pred_dct = predict(dct, test_set, type = 'class')
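rpart performs 10-fold cross-validation internally while growing the tree; the complexity-parameter table can be inspected to verify the fit (a sketch):
# cross-validated error (xerror) for each complexity-parameter value
printcp(dct)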
Making Confusion Matrix Plot
cm <- confusionMatrix(factor(y_pred_dct, levels = 1:5),
                      factor(test_set$Drug, levels = 1:5),
                      dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels = rev(levels(plt$Prediction)))
ggplot(plt, aes(Actual, Prediction, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Actual", y = "Prediction") +
scale_x_discrete(labels = c("Class 1","Class 2","Class 3","Class 4","Class 5")) +
scale_y_discrete(labels = c("Class 5","Class 4","Class 3","Class 2","Class 1"))
Evaluation Metric Scores: Decision Tree
cm_dct <- table(test_set$Drug, y_pred_dct)
n_dct= sum(cm_dct)
nc_dct = nrow(cm_dct)
diag_dct = diag(cm_dct)
rowsums_dct= apply(cm_dct, 1, sum)
colsums_dct = apply(cm_dct, 2, sum)
accuracy_dct = sum(diag_dct) / n_dct
precision_dct = diag_dct / colsums_dct
recall_dct = diag_dct / rowsums_dct
f1_dct = 2 * precision_dct * recall_dct / (precision_dct + recall_dct)
metrics_dct <- data.frame(accuracy_dct,precision_dct, recall_dct, f1_dct)
metrics_dct
## accuracy_dct precision_dct recall_dct f1_dct
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
Macro-Averaged Evaluation Metrics: Decision Tree
macroPrecision_dct = mean(precision_dct)
macroRecall_dct = mean(recall_dct)
macroF1_dct = mean(f1_dct)
macro_metrics_dct <- data.frame(accuracy_dct,macroPrecision_dct,
macroRecall_dct, macroF1_dct)
macro_metrics_dct
## accuracy_dct macroPrecision_dct macroRecall_dct macroF1_dct
## 1 1 1 1 1
The Decision Tree scores 100% on every metric on the test set.
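Since the same per-class computation is repeated for every model below, it could be wrapped in a helper function (a sketch that mirrors the code above; not part of the original pipeline):
# helper: accuracy plus per-class and macro-averaged precision/recall/F1
evaluate_model <- function(actual, predicted, classes = 1:5) {
  cm <- table(factor(actual, levels = classes),
              factor(predicted, levels = classes))
  diag_ <- diag(cm)
  precision <- diag_ / colSums(cm)
  recall <- diag_ / rowSums(cm)
  f1 <- 2 * precision * recall / (precision + recall)
  list(accuracy = sum(diag_) / sum(cm),
       per_class = data.frame(precision, recall, f1),
       macro = c(precision = mean(precision),
                 recall = mean(recall),
                 f1 = mean(f1)))
}
# usage: evaluate_model(test_set$Drug, y_pred_dct)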
K-Nearest Neighbors
The k-nearest neighbors algorithm is simple to implement and its predictions are easy to justify. The algorithm is robust and tends to achieve a low error rate.
K-Nearest Neighbors Algorithm
Let D be the set of training examples and k the number of neighbors.
for each test example xi do
1. Compute the distance d(xi, yi) to every training example yi in D, based on the distance measure.
2. Select the k training examples yi closest to xi.
3. Classify xi by majority vote among these k neighbors.
end for
# use only the feature columns; including the Drug label (column 6)
# in the distance computation would leak the answer
knn_ <- knn(train = training_set[, -6],
            test = test_set[, -6],
            cl = training_set$Drug,
            k = 3)
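Because KNN is distance-based, features on very different scales (Age and Na_to_K versus the small factor codes) dominate the distance computation; standardizing the features first is a common refinement (a sketch; the results below were obtained without scaling):
# optional: standardize features using the training set's statistics
feat_cols <- setdiff(names(training_set), "Drug")
train_scaled <- scale(training_set[, feat_cols])
test_scaled <- scale(test_set[, feat_cols],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
knn_scaled <- knn(train = train_scaled, test = test_scaled,
                  cl = training_set$Drug, k = 3)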
Making Confusion Matrix Plot
cm <- confusionMatrix(factor(knn_, levels = 1:5),
                      factor(test_set$Drug, levels = 1:5),
                      dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels = rev(levels(plt$Prediction)))
ggplot(plt, aes(Actual, Prediction, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Actual", y = "Prediction") +
scale_x_discrete(labels = c("Class 1","Class 2","Class 3","Class 4","Class 5")) +
scale_y_discrete(labels = c("Class 5","Class 4","Class 3","Class 2","Class 1"))
Evaluation Metric Scores: K-Nearest Neighbors
cm_knn <- table(test_set$Drug, knn_)
n_knn= sum(cm_knn)
nc_knn = nrow(cm_knn)
diag_knn = diag(cm_knn)
rowsums_knn= apply(cm_knn, 1, sum)
colsums_knn = apply(cm_knn, 2, sum)
accuracy_knn = sum(diag_knn) / n_knn
precision_knn = diag_knn / colsums_knn
recall_knn = diag_knn / rowsums_knn
f1_knn = 2 * precision_knn * recall_knn / (precision_knn + recall_knn)
metrics_knn <- data.frame(accuracy_knn,precision_knn, recall_knn, f1_knn)
metrics_knn
## accuracy_knn precision_knn recall_knn f1_knn
## 1 0.85 1.0000000 1.0000000 1.0000000
## 2 0.85 0.7500000 1.0000000 0.8571429
## 3 0.85 0.0000000 0.0000000 NaN
## 4 0.85 0.8000000 0.7272727 0.7619048
## 5 0.85 0.9473684 1.0000000 0.9729730
Macro-Averaged Evaluation Metrics: K-Nearest Neighbors
macroPrecision_knn = mean(precision_knn)
macroRecall_knn = mean(recall_knn)
macroF1_knn = mean(f1_knn)
macro_metrics_knn <- data.frame(accuracy_knn,macroPrecision_knn,
macroRecall_knn, macroF1_knn)
macro_metrics_knn
## accuracy_knn macroPrecision_knn macroRecall_knn macroF1_knn
## 1 0.85 0.6994737 0.7454545 NaN
The average accuracy is 85%. The NaN in macroF1_knn arises because class 3 is never predicted by KNN (its precision and recall are both 0), so its F1 score is undefined and the macro average propagates the NaN.
Random Forest
Random forest builds an ensemble of decision trees and aggregates their predictions; it can be used for both classification and regression. One of its major advantages is that averaging over many trees helps it avoid overfitting.
# randomForest infers classification from the factor response,
# so no method argument is needed
RF <- randomForest(factor(Drug) ~ ., data = training_set)
y_pred_RF <- predict(RF, test_set)
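A useful by-product of the random forest is its feature importance, which can be inspected directly (a sketch):
# mean decrease in Gini impurity per feature
importance(RF)
varImpPlot(RF)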
Making Confusion Matrix Plot
cm <- confusionMatrix(factor(y_pred_RF, levels = 1:5),
                      factor(test_set$Drug, levels = 1:5),
                      dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels = rev(levels(plt$Prediction)))
ggplot(plt, aes(Actual, Prediction, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Actual", y = "Prediction") +
scale_x_discrete(labels = c("Class 1","Class 2","Class 3","Class 4","Class 5")) +
scale_y_discrete(labels = c("Class 5","Class 4","Class 3","Class 2","Class 1"))
Evaluation Metric Scores: Random Forest
cm_rf<- table(test_set$Drug, y_pred_RF)
n_rf= sum(cm_rf)
nc_rf = nrow(cm_rf)
diag_rf = diag(cm_rf)
rowsums_rf= apply(cm_rf, 1, sum)
colsums_rf = apply(cm_rf, 2, sum)
accuracy_rf = sum(diag_rf) / n_rf
precision_rf = diag_rf / colsums_rf
recall_rf = diag_rf / rowsums_rf
f1_rf = 2 * precision_rf * recall_rf / (precision_rf + recall_rf)
metrics_rf <- data.frame(accuracy_rf,precision_rf, recall_rf, f1_rf)
metrics_rf
## accuracy_rf precision_rf recall_rf f1_rf
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
Macro-Averaged Evaluation Metrics: Random Forest
macroPrecision_rf = mean(precision_rf)
macroRecall_rf = mean(recall_rf)
macroF1_rf = mean(f1_rf)
macro_metrics_rf <- data.frame(accuracy_rf,macroPrecision_rf,
macroRecall_rf, macroF1_rf)
macro_metrics_rf
## accuracy_rf macroPrecision_rf macroRecall_rf macroF1_rf
## 1 1 1 1 1
The Random Forest also scores 100% on every metric on the test set.
Support Vector Machine
SVM classifies the observations in a training set into categories using either a linear or a non-linear model. The linearity of the classifier is determined by the kernel function, e.g. linear, polynomial, radial basis function (RBF), or sigmoid. Through this kernel trick, SVM can therefore also perform non-linear classification.
SVM <- svm(Drug ~ ., data = training_set, type = 'C-classification',
           kernel = 'radial')
# predict.svm returns the predicted class labels directly
y_pred_svm <- predict(SVM, test_set)
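The RBF kernel has cost and gamma hyperparameters; e1071's tune.svm() runs a cross-validated grid search over them (a sketch with illustrative grid values):
# grid-search cost and gamma by cross-validation on the training set
tuned <- tune.svm(factor(Drug) ~ ., data = training_set,
                  cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1))
tuned$best.parameters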
Making Confusion Matrix Plot
cm <- confusionMatrix(factor(y_pred_svm, levels = 1:5),
                      factor(test_set$Drug, levels = 1:5),
                      dnn = c("Prediction", "Actual"))
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels = rev(levels(plt$Prediction)))
ggplot(plt, aes(Actual, Prediction, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Actual", y = "Prediction") +
scale_x_discrete(labels = c("Class 1","Class 2","Class 3","Class 4","Class 5")) +
scale_y_discrete(labels = c("Class 5","Class 4","Class 3","Class 2","Class 1"))
Evaluation Metric Scores: Support Vector Machine
cm_<- table(test_set$Drug, y_pred_svm)
n_= sum(cm_)
nc_ = nrow(cm_)
diag_ = diag(cm_)
rowsums_= apply(cm_, 1, sum)
colsums_ = apply(cm_, 2, sum)
accuracy_ = sum(diag_) / n_
precision_ = diag_ / colsums_
recall_ = diag_ / rowsums_
f1_ = 2 * precision_ * recall_ / (precision_ + recall_)
metrics_ <- data.frame(accuracy_,precision_, recall_, f1_)
metrics_
## accuracy_ precision_ recall_ f1_
## 1 0.95 0.8333333 1.0000000 0.9090909
## 2 0.95 1.0000000 0.3333333 0.5000000
## 3 0.95 1.0000000 1.0000000 1.0000000
## 4 0.95 1.0000000 1.0000000 1.0000000
## 5 0.95 0.9473684 1.0000000 0.9729730
Macro-Averaged Evaluation Metrics: Support Vector Machine
macroPrecision_ = mean(precision_)
macroRecall_ = mean(recall_)
macroF1_ = mean(f1_)
macro_metrics_ <- data.frame(accuracy_,macroPrecision_,
macroRecall_, macroF1_)
macro_metrics_
## accuracy_ macroPrecision_ macroRecall_ macroF1_
## 1 0.95 0.9561404 0.8666667 0.8764128
The SVM reaches an average accuracy of 95%; its weakest point is recall on class 2 (0.33).
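For a side-by-side comparison, the macro scores computed above can be collected into a single data frame (a sketch using the objects already defined):
# combine the macro-averaged scores of all four models
comparison <- data.frame(
  model = c("Decision Tree", "KNN", "Random Forest", "SVM"),
  accuracy = c(accuracy_dct, accuracy_knn, accuracy_rf, accuracy_),
  precision = c(macroPrecision_dct, macroPrecision_knn,
                macroPrecision_rf, macroPrecision_),
  recall = c(macroRecall_dct, macroRecall_knn, macroRecall_rf, macroRecall_),
  f1 = c(macroF1_dct, macroF1_knn, macroF1_rf, macroF1_)
)
comparison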
Storing Classification Results in a Data Frame
test_set$Drug <- factor(test_set$Drug,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$Dct <- factor(y_pred_dct,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$KNN <- factor(knn_,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$RF <- factor(y_pred_RF,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
test_set$SVM <- factor(y_pred_svm,levels=c(1,2,3,4,5),labels=c('DrugA','DrugB','DrugC','DrugX','DrugY'))
head(test_set,20)
## Age Sex BP Cholesterol Na_to_K Drug Dct KNN RF SVM
## 9 60 2 3 1 15.171 DrugY DrugY DrugY DrugY DrugY
## 10 43 2 2 2 19.368 DrugY DrugY DrugY DrugY DrugY
## 14 74 1 2 1 20.942 DrugY DrugY DrugY DrugY DrugY
## 17 69 2 2 2 11.455 DrugX DrugX DrugC DrugX DrugX
## 21 57 2 2 2 19.128 DrugY DrugY DrugY DrugY DrugY
## 26 28 1 1 2 18.809 DrugY DrugY DrugY DrugY DrugY
## 33 49 2 2 2 11.014 DrugX DrugX DrugX DrugX DrugX
## 34 65 1 1 2 31.876 DrugY DrugY DrugY DrugY DrugY
## 39 39 1 3 2 9.709 DrugX DrugX DrugX DrugX DrugX
## 41 73 1 3 1 19.221 DrugY DrugY DrugY DrugY DrugY
## 50 28 1 2 1 19.796 DrugY DrugY DrugY DrugY DrugY
## 63 67 2 2 2 20.693 DrugY DrugY DrugY DrugY DrugY
## 65 60 1 1 1 13.303 DrugB DrugB DrugB DrugB DrugY
## 66 68 1 3 2 27.050 DrugY DrugY DrugY DrugY DrugY
## 68 17 2 3 2 10.832 DrugX DrugX DrugX DrugX DrugX
## 77 36 1 1 1 11.198 DrugA DrugA DrugA DrugA DrugA
## 99 20 2 1 2 35.639 DrugY DrugY DrugY DrugY DrugY
## 104 56 2 3 1 8.966 DrugX DrugX DrugX DrugX DrugX
## 107 22 2 3 1 11.953 DrugX DrugX DrugX DrugX DrugX
## 109 72 2 1 2 9.677 DrugB DrugB DrugB DrugB DrugB
Conclusion
Based on the evaluation metric scores of the four models, this study concludes that the Decision Tree and Random Forest produce the best scores for classifying this multiclass (5-class) dataset.
References
Drug Dataset: https://www.kaggle.com/datasets/prathamtripathi/drug-classification
Train Test Split (Holdout Method): https://www.researchgate.net/profile/Paola-Galdi/publication/322179244_Data_Mining_Accuracy_and_Error_Measures_for_Classification_and_Prediction/links/5a6f5a6b458515015e615bb9/Data-Mining-Accuracy-and-Error-Measures-for-Classification-and-Prediction.pdf
Outlier Detection: https://eprints.whiterose.ac.uk/767/1/
Decision Tree: https://www.academia.edu/download/46836935/0a85e52e8e3c7d5f99000000.pdf
KNN: https://www.academia.edu/download/51726402/SUB156942.pdf
Random Forest: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Support Vector Machine: https://medium.com/0xcode/svm-classification-algorithms-in-r-ced0ee73821
Confusion Matrix: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9091880