Machine Learning For Drug Selection Based on Certain Conditions

Introduction

Hello everyone! In this R Markdown report I am trying a different approach: an informal structure that guides readers as if I were telling a story. I made this report to learn how to build prediction models with Naive Bayes, Decision Tree, and Random Forest, as an assignment for Algoritma Data Science School. While browsing Kaggle.com for suitable data, I stumbled upon a dataset about drugs that were given to a number of people with certain conditions. From this data I would like to build a model that can suggest which type of drug suits a person with a given set of conditions. Let us begin with our data exploration.

Data Exploration

Data pre-processing and EDA

When we obtain data, we have to understand its contents before we can process it toward the outcome that the business case asks for. In this case, the business case is to build a program that can decide which type of drug should be given to a person, based on the conditions recorded as variables in the data.

I like to start my report by loading all of the libraries at the beginning of the data exploration.

Libraries

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(caret)

## Loading required package: lattice

library(e1071)
library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

After that, I read the data and take a look at its structure.

Read data

drug <- read.csv("data/drug/drug200.csv")
glimpse(drug)

## Rows: 200
## Columns: 6
## $ Age         <int> 23, 47, 47, 28, 61, 22, 49, 41, 60, 43, 47, 34, 43, 74, 50~
## $ Sex         <chr> "F", "M", "M", "F", "F", "F", "F", "M", "M", "M", "F", "F"~
## $ BP          <chr> "HIGH", "LOW", "LOW", "NORMAL", "LOW", "NORMAL", "NORMAL",~
## $ Cholesterol <chr> "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "H~
## $ Na_to_K     <dbl> 25.355, 13.093, 10.114, 7.798, 18.043, 8.607, 16.275, 11.0~
## $ Drug        <chr> "DrugY", "drugC", "drugC", "drugX", "DrugY", "drugX", "Dru~

Our data contains 6 variables: 5 are possible predictors, and 1, named “Drug”, is our target, since we want a machine learning model that can recommend a specific type of drug for a certain type of person. As the variable names are quite self-explanatory, I will not explain them one by one. However, here are explanations of some of them:

Cholesterol: Cholesterol levels
BP: Blood pressure levels
Na_to_K: Sodium-to-potassium ratio in blood

If we look at the data type of each variable, we can see that the character columns are better stored as factors, since they hold a small set of repeating values.

Change data type

drug <- drug %>% 
  mutate_if(is.character, as.factor)

I also recommend always checking for missing values.

Check any missing values

colSums(is.na(drug))

##         Age         Sex          BP Cholesterol     Na_to_K        Drug 
##           0           0           0           0           0           0

It appears that we are safe to continue, as our data does not contain missing values.

Because the Naive Bayes method is “naive” (it assumes that the predictors are independent of each other), we want to check the correlation between our variables. However, since most of our variables are categorical, a correlation plot can only compare the 2 numeric variables in our data, Age and Na_to_K.

Correlation plot

ggcorr(drug, label = T)

## Warning in ggcorr(drug, label = T): data in column(s) 'Sex', 'BP',
## 'Cholesterol', 'Drug' are not numeric and were ignored

The graph above shows that the two numerical variables in our dataset are not correlated with each other, so on this point the data looks safe for a Naive Bayes model. Keep in mind that this check says nothing about the categorical predictors, which ggcorr ignored.
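
Since the correlation plot skips the categorical columns, one possible complement is Cramér's V, an association measure for two factors that runs from 0 (no association) to 1 (perfect association). The sketch below is not part of the original analysis: `cramers_v` is a helper written for this illustration, and `toy` is made-up data standing in for the categorical columns of `drug`.

```r
# Cramér's V: 0 = no association, 1 = perfect association between two factors.
# `cramers_v` is a hypothetical helper written for this illustration.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Made-up toy data standing in for the categorical columns of `drug`.
set.seed(126)
toy <- data.frame(
  BP          = factor(sample(c("HIGH", "LOW", "NORMAL"), 200, replace = TRUE)),
  Cholesterol = factor(sample(c("HIGH", "NORMAL"), 200, replace = TRUE))
)

# The two columns were drawn independently, so this value should be small.
v_independent <- cramers_v(toy$BP, toy$Cholesterol)

# A variable compared with itself is perfectly associated.
v_identical <- cramers_v(toy$BP, toy$BP)
```

On the real data, the same helper could be called on pairs such as `drug$BP` and `drug$Cholesterol`; values close to 0 would support the independence assumption for those predictors.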

Cross validation

We want to split our pre-processed data into train and test data. The train data is used to create our model, while the test data is held back and then predicted with that model to see how it performs on unseen data. Strictly speaking, a single split like this is a hold-out validation rather than a k-fold cross validation.

Random Sampling

RNGkind(sample.kind = "Rounding")
set.seed(126)

index <- sample(nrow(drug), nrow(drug) * 0.7)
train_drug <- drug[index,] %>% 
  mutate_if(is.character, as.factor)
test_drug <- drug[-index,] %>% 
  mutate_if(is.character, as.factor)

Sometimes, when we split our data into train and test sets, the categories of a categorical variable do not end up in both sets. We want to make sure that every unique value of each categorical variable is present in both the train and the test data. We can check this by previewing the unique values of each categorical variable in both sets.
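
One way to reduce the chance of a category going missing from one side is to sample within each class of the target instead of over the whole table. The following is a minimal base-R sketch of that idea on made-up data; `toy`, `rows_by_class` and `all_levels_present` are names invented for this illustration, and the report itself keeps the simple random split shown above.

```r
set.seed(126)
# Made-up toy data with an imbalanced target, standing in for `drug`.
toy <- data.frame(
  Drug = factor(rep(c("drugA", "drugB", "DrugY"), times = c(10, 20, 70))),
  Age  = sample(15:74, 100, replace = TRUE)
)

# Sample 70% of the row indices inside each class, then combine them.
rows_by_class <- split(seq_len(nrow(toy)), toy$Drug)
train_idx <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), round(length(i) * 0.7))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# TRUE when every level of the factor shows up in both sets.
all_levels_present <- function(train_col, test_col) {
  all(levels(train_col) %in% unique(train_col)) &&
    all(levels(test_col) %in% unique(test_col))
}
all_levels_present(toy_train$Drug, toy_test$Drug)
```

Because every class contributes 70% of its own rows, the class proportions in the train set match those of the full data, and no class can be left out of either set as long as it has at least a few rows.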

Sex category

unique(train_drug$Sex)

## [1] M F
## Levels: F M

unique(test_drug$Sex)

## [1] F M
## Levels: F M

BP category

unique(train_drug$BP)

## [1] LOW    NORMAL HIGH  
## Levels: HIGH LOW NORMAL

unique(test_drug$BP)

## [1] HIGH   NORMAL LOW   
## Levels: HIGH LOW NORMAL

Cholesterol category

unique(train_drug$Cholesterol)

## [1] NORMAL HIGH  
## Levels: HIGH NORMAL

unique(test_drug$Cholesterol)

## [1] HIGH   NORMAL
## Levels: HIGH NORMAL

Drug category

unique(train_drug$Drug)

## [1] DrugY drugX drugA drugB drugC
## Levels: drugA drugB drugC drugX DrugY

unique(test_drug$Drug)

## [1] DrugY drugX drugC drugB drugA
## Levels: drugA drugB drugC drugX DrugY

It appears that every category is present in both our train and test data, so we do not have to resample. The next thing we want to check is the class proportion of the target in our train data.

prop.table(table(train_drug$Drug))

## 
##      drugA      drugB      drugC      drugX      DrugY 
## 0.13571429 0.07142857 0.10000000 0.26428571 0.42857143

It is important to have a balanced target, because an imbalanced one can leave our model unable to predict properly for classes that have only a few samples. Here DrugY makes up about 43% of the train data while drugB makes up only about 7%. Upsampling can be used to balance data with imbalanced proportions such as this.

train_drug <- upSample(x = train_drug %>% select(-Drug),
                       y = train_drug$Drug,
                       yname = "Drug")

prop.table(table(train_drug$Drug))

## 
## drugA drugB drugC drugX DrugY 
##   0.2   0.2   0.2   0.2   0.2
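
Conceptually, up-sampling keeps every original row and adds randomly chosen copies from the smaller classes until each class is as large as the biggest one. The base-R sketch below illustrates that idea on made-up data; it is not caret's implementation, and `toy`, `target_n` and `up_idx` are names invented here.

```r
set.seed(126)
# Made-up imbalanced toy data; the counts are for illustration only.
toy <- data.frame(
  Drug = factor(rep(c("drugA", "drugB", "DrugY"), times = c(19, 10, 60))),
  id   = seq_len(89)
)

target_n <- max(table(toy$Drug))  # size of the largest class

# For each class, keep every original row and add random copies until the
# class reaches target_n rows.
rows_by_class <- split(seq_len(nrow(toy)), toy$Drug)
up_idx <- unlist(lapply(rows_by_class, function(i) {
  c(i, i[sample.int(length(i), target_n - length(i), replace = TRUE)])
}))
toy_up <- toy[up_idx, ]

table(toy_up$Drug)
```

The added rows are exact duplicates, so up-sampling changes the class balance without adding any new information about the smaller classes.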

Now that we have balanced train data, we can finally make our machine learning model.

Modeling

Naive Bayes

Naive Bayes is a classification algorithm based on Bayes' theorem of probability. It assumes that the predictor variables are all independent of each other, which is why the method is called “naive”. We have to keep in mind that strongly related predictors may lead to inaccurate modeling. A Naive Bayes model is also prone to bias due to data scarcity: if a category never occurs together with a class in the train data, its estimated probability for that class becomes 0.

In our case, because the data has numerical variables with many distinct values, we will certainly find combinations that never occur. To make sure, we can cross-tabulate the target against each predictor variable.

Proportion of numerical variables

table(train_drug$Drug, train_drug$Age)

##        
##         15 16 17 18 19 20 22 23 24 25 26 28 29 30 31 32 33 34 35 36 37 38 39 40
##   drugA  0  0  0  0  7  4  0  5  2  0  4  0  0  0  6  6  0  0  3  1  1  1  4  0
##   drugB  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   drugC  0  4  0  0  0  0  3  0  0  0  3  2  0  0  0  4  0  0  0  0  0  0  0  0
##   drugX  0  0  1  2  0  0  5  2  1  0  0  3  0  2  0  4  0  1  2  1  1  0  1  3
##   DrugY  2  2  0  1  1  1  1  0  0  1  1  3  1  0  0  0  1  1  0  2  2  2  3  1
##        
##         41 42 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 64 65 66
##   drugA  0  5  4  2  0  2  1  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   drugB  0  0  0  0  0  0  0  0  0  3  0  7  0  0  0  5  0  7 15  0  0  0  0  0
##   drugC  6  0  0  0  0 22  0  2  0  0  0  0  0  0  0  0  0  6  0  0  0  0  0  0
##   drugX  0  0  2  1  2  1  0  0  1  1  2  0  0  2  2  1  0  1  0  4  0  0  0  2
##   DrugY  2  2  2  1  0  2  0  1  2  2  0  0  1  0  0  2  2  0  0  2  2  2  1  1
##        
##         67 68 69 70 72 73 74
##   drugA  0  0  0  0  0  0  0
##   drugB  0 10  0  8  5  0  0
##   drugC  0  4  0  0  4  0  0
##   drugX  4  0  1  0  0  0  4
##   DrugY  1  1  1  0  1  2  1

table(train_drug$Drug, train_drug$Na_to_K)

##        
##         6.683 6.769 7.261 7.285 7.34 7.477 7.798 7.845 8.011 8.107 8.151 8.607
##   drugA     0     0     0     0    0     0     0     0     5     0     0     0
##   drugB     0     0     0     0    0     0     0     0     0     0     0     0
##   drugC     0     4     0     0    0     0     0     0     0     0     3     0
##   drugX     1     0     2     2    2     1     1     1     0     2     0     2
##   DrugY     0     0     0     0    0     0     0     0     0     0     0     0
##        
##         8.621 8.7 8.75 8.966 8.968 9.17 9.443 9.445 9.475 9.514 9.664 9.677
##   drugA     0   2    0     0     0    0     0     6     2     0     4     0
##   drugB     3   0    0     0     0    0     0     0     0     0     0     5
##   drugC     0   0    0     0     0    0     0     0     0     0     0     0
##   drugX     0   0    2     2     1    1     2     0     0     2     0     0
##   DrugY     0   0    0     0     0    0     0     0     0     0     0     0
##        
##         9.709 9.712 9.894 9.945 10.017 10.065 10.067 10.103 10.114 10.189
##   drugA     0     0     0     0      0      0      0      0      0      0
##   drugB     0     0     0     5      0      0      0      0      0     10
##   drugC     0     4     0     0      0      0      7      0      5      0
##   drugX     1     0     2     0      1      1      0      1      0      0
##   DrugY     0     0     0     0      0      0      0      0      0      0
##        
##         10.291 10.403 10.443 10.444 10.446 10.537 10.605 10.832 10.84 10.898
##   drugA      0      2      0      0      1      0      0      0     0      0
##   drugB      0      0      0      0      0      0      0      0     0      0
##   drugC      4      0      0      6      0      2      0      0     0      0
##   drugX      0      0      2      0      0      0      1      1     3      2
##   DrugY      0      0      0      0      0      0      0      0     0      0
##        
##         11.037 11.198 11.227 11.262 11.326 11.343 11.349 11.424 11.767 11.871
##   drugA      0      1      2      4      1      0      0      0      0      4
##   drugB      0      0      0      0      0      3      0      0      0      0
##   drugC      6      0      0      0      0      0      0      0      6      0
##   drugX      0      0      0      0      0      0      2      1      0      0
##   DrugY      0      0      0      0      0      0      0      0      0      0
##        
##         11.939 11.953 12.006 12.26 12.295 12.307 12.495 12.766 12.854 12.859
##   drugA      0      0      0     0      0      4      0      5      2      0
##   drugB      0      0      0     0      0      0      7      0      0      0
##   drugC      0      0      4     0      0      0      0      0      0      0
##   drugX      4      3      0     2      1      0      0      0      0      2
##   DrugY      0      0      0     0      0      0      0      0      0      0
##        
##         12.879 12.894 12.923 13.091 13.093 13.127 13.303 13.313 13.597 13.884
##   drugA      0      3      0      1      0      0      0      7      0      0
##   drugB      0      0      0      0      0      0      5      0      0      0
##   drugC      0      0      0      0      4      2      0      0      0      0
##   drugX      2      0      1      0      0      0      0      0      1      1
##   DrugY      0      0      0      0      0      0      0      0      0      0
##        
##         13.934 13.935 13.967 13.972 14.16 14.216 15.156 15.376 15.436 15.478
##   drugA      0      0      0      4     0      0      0      0      0      0
##   drugB      7      7      8      0     0      0      0      0      0      0
##   drugC      0      0      0      0     3      0      0      0      0      0
##   drugX      0      0      0      0     0      1      0      0      0      0
##   DrugY      0      0      0      0     0      0      1      1      1      1
##        
##         15.49 15.516 15.79 15.891 15.969 16.275 16.31 16.347 16.594 16.724
##   drugA     0      0     0      0      0      0     0      0      0      0
##   drugB     0      0     0      0      0      0     0      0      0      0
##   drugC     0      0     0      0      0      0     0      0      0      0
##   drugX     0      0     0      0      0      0     0      0      0      0
##   DrugY     1      1     1      1      1      1     1      1      1      1
##        
##         16.725 16.753 17.206 17.211 17.225 17.951 18.043 18.295 18.348 18.809
##   drugA      0      0      0      0      0      0      0      0      0      0
##   drugB      0      0      0      0      0      0      0      0      0      0
##   drugC      0      0      0      0      0      0      0      0      0      0
##   drugX      0      0      0      0      0      0      0      0      0      0
##   DrugY      1      1      1      1      1      1      1      2      1      1
##        
##         18.991 19.007 19.011 19.128 19.161 19.221 19.368 19.675 20.013 20.932
##   drugA      0      0      0      0      0      0      0      0      0      0
##   drugB      0      0      0      0      0      0      0      0      0      0
##   drugC      0      0      0      0      0      0      0      0      0      0
##   drugX      0      0      0      0      0      0      0      0      0      0
##   DrugY      1      1      1      1      1      1      1      1      1      1
##        
##         21.036 22.456 22.697 22.905 23.003 23.091 24.658 25.475 25.741 25.893
##   drugA      0      0      0      0      0      0      0      0      0      0
##   drugB      0      0      0      0      0      0      0      0      0      0
##   drugC      0      0      0      0      0      0      0      0      0      0
##   drugX      0      0      0      0      0      0      0      0      0      0
##   DrugY      1      1      1      1      1      1      1      1      1      1
##        
##         25.969 27.05 27.064 27.183 27.826 28.294 29.45 29.875 30.568 31.876
##   drugA      0     0      0      0      0      0     0      0      0      0
##   drugB      0     0      0      0      0      0     0      0      0      0
##   drugC      0     0      0      0      0      0     0      0      0      0
##   drugX      0     0      0      0      0      0     0      0      0      0
##   DrugY      1     1      1      1      1      1     1      1      1      1
##        
##         33.486 33.542 35.639 37.188 38.247
##   drugA      0      0      0      0      0
##   drugB      0      0      0      0      0
##   drugC      0      0      0      0      0
##   drugX      0      0      0      0      0
##   DrugY      1      1      1      1      1

Proportion of categorical variables

table(train_drug$Drug, train_drug$Sex)

##        
##          F  M
##   drugA 23 37
##   drugB 27 33
##   drugC 28 32
##   drugX 29 31
##   DrugY 30 30

table(train_drug$Drug, train_drug$BP)

##        
##         HIGH LOW NORMAL
##   drugA   60   0      0
##   drugB   60   0      0
##   drugC    0  60      0
##   drugX    0  16     44
##   DrugY   24  19     17

table(train_drug$Drug, train_drug$Cholesterol)

##        
##         HIGH NORMAL
##   drugA   33     27
##   drugB   27     33
##   drugC   60      0
##   drugX   25     35
##   DrugY   29     31

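The result shows that there is indeed data scarcity in our data: several combinations, such as BP = LOW with drugA or Cholesterol = NORMAL with drugC, have a count of 0. To tackle this problem, we can apply Laplace smoothing, which adds a small count to every cell of the categorical tables so that no probability is exactly 0. The numerical predictors, Age and Na_to_K, are instead summarized per class by their mean and standard deviation, as the model output below shows.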
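
To see what the smoothing does to the zero cells, the arithmetic can be done by hand with counts taken from the BP and Cholesterol tables above. This is a minimal sketch; `laplace_prob` is a helper written for this illustration, not a function from e1071.

```r
# Laplace-smoothed conditional probabilities for one class:
# (count + laplace) / (class total + laplace * number of categories)
laplace_prob <- function(counts, laplace = 1) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

# Counts taken from the BP and Cholesterol tables above.
bp_drugA   <- c(HIGH = 60, LOW = 0, NORMAL = 0)
chol_drugC <- c(HIGH = 60, NORMAL = 0)

laplace_prob(bp_drugA, laplace = 0)  # unsmoothed: LOW and NORMAL are exactly 0
laplace_prob(bp_drugA)               # smoothed: 61/63, 1/63, 1/63
laplace_prob(chol_drugC)             # smoothed: 61/62, 1/62
```

With laplace = 1 the zero cells become small but non-zero probabilities, so a single unseen category can no longer wipe out a class on its own.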

Modeling

naive_model <- naiveBayes(train_drug %>% select(-Drug), train_drug$Drug, laplace = 1)

naive_model

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train_drug %>% select(-Drug), y = train_drug$Drug, 
##     laplace = 1)
## 
## A-priori probabilities:
## train_drug$Drug
## drugA drugB drugC drugX DrugY 
##   0.2   0.2   0.2   0.2   0.2 
## 
## Conditional probabilities:
##                Age
## train_drug$Drug     [,1]      [,2]
##           drugA 32.36667  9.513535
##           drugB 62.03333  6.574132
##           drugC 44.73333 15.242522
##           drugX 44.38333 17.261785
##           DrugY 44.63333 16.971528
## 
##                Sex
## train_drug$Drug         F         M
##           drugA 0.3870968 0.6129032
##           drugB 0.4516129 0.5483871
##           drugC 0.4677419 0.5322581
##           drugX 0.4838710 0.5161290
##           DrugY 0.5000000 0.5000000
## 
##                BP
## train_drug$Drug       HIGH        LOW     NORMAL
##           drugA 0.96825397 0.01587302 0.01587302
##           drugB 0.96825397 0.01587302 0.01587302
##           drugC 0.01587302 0.96825397 0.01587302
##           drugX 0.01587302 0.26984127 0.71428571
##           DrugY 0.39682540 0.31746032 0.28571429
## 
##                Cholesterol
## train_drug$Drug       HIGH     NORMAL
##           drugA 0.54838710 0.45161290
##           drugB 0.45161290 0.54838710
##           drugC 0.98387097 0.01612903
##           drugX 0.41935484 0.58064516
##           DrugY 0.48387097 0.51612903
## 
##                Na_to_K
## train_drug$Drug     [,1]     [,2]
##           drugA 11.33518 1.825444
##           drugB 12.01152 1.886540
##           drugC 10.70453 1.726071
##           drugX 10.25760 1.957759
##           DrugY 21.86553 6.183802
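
The output above holds everything needed to score a patient by hand: a prior per class, a probability table per categorical predictor, and a mean and standard deviation per numerical predictor. The sketch below multiplies these together (on the log scale) for a patient aged 23, female, with HIGH blood pressure, HIGH cholesterol and Na_to_K = 25.355. It mirrors the idea of Naive Bayes under a normal-distribution assumption for the numerical predictors and is not e1071's exact internal code; all object names are invented here.

```r
# Per-class parameters copied from the model output above.
classes <- c("drugA", "drugB", "drugC", "drugX", "DrugY")
prior   <- setNames(rep(0.2, 5), classes)
age_mu  <- setNames(c(32.36667, 62.03333, 44.73333, 44.38333, 44.63333), classes)
age_sd  <- setNames(c(9.513535, 6.574132, 15.242522, 17.261785, 16.971528), classes)
nak_mu  <- setNames(c(11.33518, 12.01152, 10.70453, 10.25760, 21.86553), classes)
nak_sd  <- setNames(c(1.825444, 1.886540, 1.726071, 1.957759, 6.183802), classes)
p_sex_F     <- setNames(c(0.3870968, 0.4516129, 0.4677419, 0.4838710, 0.5000000), classes)
p_bp_high   <- setNames(c(0.96825397, 0.96825397, 0.01587302, 0.01587302, 0.39682540), classes)
p_chol_high <- setNames(c(0.54838710, 0.45161290, 0.98387097, 0.41935484, 0.48387097), classes)

# Patient: Age 23, Sex F, BP HIGH, Cholesterol HIGH, Na_to_K 25.355.
# Sum of log prior and log likelihoods, one value per class.
log_score <- log(prior) +
  dnorm(23, age_mu, age_sd, log = TRUE) +
  log(p_sex_F) + log(p_bp_high) + log(p_chol_high) +
  dnorm(25.355, nak_mu, nak_sd, log = TRUE)

# Normalise the scores into posterior probabilities.
posterior <- exp(log_score - max(log_score))
posterior <- posterior / sum(posterior)
names(which.max(posterior))
```

For this patient the Na_to_K value lies far outside the range seen for every class except DrugY, so that single predictor dominates the result.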

Now that our model is ready, we can predict the test data and evaluate the model's performance based on those predictions.

Prediction and evaluation

naive_pred <- predict(naive_model, test_drug, type = "class")

confusionMatrix(naive_pred,
                test_drug$Drug,
                positive = "drugA")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX DrugY
##      drugA     4     0     0     0     0
##      drugB     0     6     0     0     0
##      drugC     0     0     2     0     2
##      drugX     0     0     0    16     0
##      DrugY     0     0     0     1    29
## 
## Overall Statistics
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8608, 0.9896)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 1.837e-13       
##                                           
##                   Kappa : 0.923           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity               1.00000          1.0      1.00000       0.9412
## Specificity               1.00000          1.0      0.96552       1.0000
## Pos Pred Value            1.00000          1.0      0.50000       1.0000
## Neg Pred Value            1.00000          1.0      1.00000       0.9773
## Prevalence                0.06667          0.1      0.03333       0.2833
## Detection Rate            0.06667          0.1      0.03333       0.2667
## Detection Prevalence      0.06667          0.1      0.06667       0.2667
## Balanced Accuracy         1.00000          1.0      0.98276       0.9706
##                      Class: DrugY
## Sensitivity                0.9355
## Specificity                0.9655
## Pos Pred Value             0.9667
## Neg Pred Value             0.9333
## Prevalence                 0.5167
## Detection Rate             0.4833
## Detection Prevalence       0.5000
## Balanced Accuracy          0.9505

The model built with the Naive Bayes method seems promising: it reaches 95% accuracy and, averaged over the five classes, around 97% sensitivity and around 89% precision. However, keep in mind that our data is still small, and some classes have only a handful of test cases. As more data comes in over time, we might need to adjust our model to better suit the future data.
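
The averaged sensitivity and precision figures quoted above can be recomputed from the confusion matrix itself. The sketch below re-enters the matrix by hand and takes the plain (macro) average of the per-class values; the object names are invented for this illustration.

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "DrugY")

# Naive Bayes confusion matrix copied from the output above
# (rows = prediction, columns = reference).
cm_nb <- matrix(c(4, 0, 0,  0,  0,
                  0, 6, 0,  0,  0,
                  0, 0, 2,  0,  2,
                  0, 0, 0, 16,  0,
                  0, 0, 0,  1, 29),
                nrow = 5, byrow = TRUE,
                dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm_nb)) / sum(cm_nb)  # correct predictions / all cases
sensitivity <- diag(cm_nb) / colSums(cm_nb)   # per class: correct / actual
precision   <- diag(cm_nb) / rowSums(cm_nb)   # per class: correct / predicted

round(c(accuracy = accuracy,
        macro_sensitivity = mean(sensitivity),
        macro_precision = mean(precision)), 3)
```

The low average precision comes almost entirely from drugC, where 2 of the 4 predictions are wrong; with only 2 real drugC cases in the test data, a couple of mistakes move that number a lot.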

Decision tree

The decision tree, one of the most popular and widely used machine learning models, is a tree-based model that is quite simple yet quite robust in terms of prediction performance. Its major benefits are its low computational load and its interpretability, since the result can be visualized in the form of a tree.

Modeling

tree_model <- ctree(Drug ~.,
                    train_drug)

tree_model

## 
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
## 
## Fitted party:
## [1] root
## |   [2] BP in HIGH
## |   |   [3] Age <= 49
## |   |   |   [4] Na_to_K <= 13.972: drugA (n = 60, err = 0.0%)
## |   |   |   [5] Na_to_K > 13.972: DrugY (n = 16, err = 0.0%)
## |   |   [6] Age > 49
## |   |   |   [7] Na_to_K <= 13.967: drugB (n = 60, err = 0.0%)
## |   |   |   [8] Na_to_K > 13.967: DrugY (n = 8, err = 0.0%)
## |   [9] BP in LOW, NORMAL
## |   |   [10] Na_to_K <= 14.216
## |   |   |   [11] BP in LOW
## |   |   |   |   [12] Cholesterol in HIGH: drugC (n = 60, err = 0.0%)
## |   |   |   |   [13] Cholesterol in NORMAL: drugX (n = 16, err = 0.0%)
## |   |   |   [14] BP in NORMAL: drugX (n = 44, err = 0.0%)
## |   |   [15] Na_to_K > 14.216: DrugY (n = 36, err = 0.0%)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8

For a better and easier understanding, we can create a visualization out of our decision tree model.

Plot model

plot(tree_model, type = "simple")

In the visualization above, we can see that the tree uses a total of 7 splits (inner nodes) to decide which drug is best based on a person's condition. Moreover, it is such a simple model that even non-data analysts can understand it. For example, by following the pattern “BP = LOW, NORMAL” and “Na_to_K > 14.216”, our model recommends DrugY.
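
Because the fitted tree is just a set of nested rules, it can also be written out as plain if/else statements. The function below is a hand transcription of the printed tree above, not something produced by partykit; `recommend_drug` is a name invented for this illustration. Sex is not an argument because the tree never splits on it.

```r
# Hand-written version of the printed tree; thresholds copied from the output.
recommend_drug <- function(age, bp, cholesterol, na_to_k) {
  if (bp == "HIGH") {
    if (age <= 49) {
      if (na_to_k <= 13.972) "drugA" else "DrugY"
    } else {
      if (na_to_k <= 13.967) "drugB" else "DrugY"
    }
  } else {
    # BP is LOW or NORMAL
    if (na_to_k > 14.216) return("DrugY")
    if (bp == "NORMAL") return("drugX")
    if (cholesterol == "HIGH") "drugC" else "drugX"
  }
}

# First, second and fourth patients from the glimpse() output.
recommend_drug(23, "HIGH",   "HIGH", 25.355)
recommend_drug(47, "LOW",    "HIGH", 13.093)
recommend_drug(28, "NORMAL", "HIGH", 7.798)
```

For these three patients the rules return DrugY, drugC and drugX, which matches the Drug column shown when the data was first read.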

Prediction and evaluation

tree_pred <- predict(tree_model,
                     test_drug,
                     type = "response")

confusionMatrix(tree_pred,
                test_drug$Drug,
                positive = "drugA")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX DrugY
##      drugA     3     0     0     0     0
##      drugB     1     5     0     0     0
##      drugC     0     0     2     0     0
##      drugX     0     0     0    16     0
##      DrugY     0     1     0     1    31
## 
## Overall Statistics
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8608, 0.9896)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 1.837e-13       
##                                           
##                   Kappa : 0.9201          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity               0.75000      0.83333      1.00000       0.9412
## Specificity               1.00000      0.98148      1.00000       1.0000
## Pos Pred Value            1.00000      0.83333      1.00000       1.0000
## Neg Pred Value            0.98246      0.98148      1.00000       0.9773
## Prevalence                0.06667      0.10000      0.03333       0.2833
## Detection Rate            0.05000      0.08333      0.03333       0.2667
## Detection Prevalence      0.05000      0.10000      0.03333       0.2667
## Balanced Accuracy         0.87500      0.90741      1.00000       0.9706
##                      Class: DrugY
## Sensitivity                1.0000
## Specificity                0.9310
## Pos Pred Value             0.9394
## Neg Pred Value             1.0000
## Prevalence                 0.5167
## Detection Rate             0.5167
## Detection Prevalence       0.5500
## Balanced Accuracy          0.9655

With the default decision tree, we have managed to achieve 95% accuracy and, averaged over the five classes, around 90% sensitivity and around 95% precision. Although the default settings already produce a well-performing model, if the performance were unsatisfactory for our needs, we could tune the model and try to find a better-performing one.

Before tuning the model, I compare each variable with the target to get a brief overview of how our variables influence it.

Observing variables vs target

drug %>% 
  select(BP, Drug) %>% 
  arrange(BP) %>% 
  unique() %>% 
  table()

##         Drug
## BP       drugA drugB drugC drugX DrugY
##   HIGH       1     1     0     0     1
##   LOW        0     0     1     1     1
##   NORMAL     0     0     0     1     1

drug %>% 
  select(Na_to_K, Drug) %>% 
  mutate(Na_to_K = ifelse(Na_to_K >= 16.08449, "HIGH", "LOW")) %>% 
  arrange(Na_to_K) %>% 
  unique() %>% 
  table()

##        Drug
## Na_to_K drugA drugB drugC drugX DrugY
##    HIGH     0     0     0     0     1
##    LOW      1     1     1     1     1

drug %>% 
  select(Age, Drug) %>% 
  mutate(Age = ifelse(Age >= 40, "OLD", "YOUNG")) %>% 
  arrange(Age) %>% 
  unique() %>% 
  table()

##        Drug
## Age     drugA drugB drugC drugX DrugY
##   OLD       1     1     1     1     1
##   YOUNG     1     0     1     1     1

drug %>% 
  select(Cholesterol, Drug) %>% 
  arrange(Cholesterol) %>% 
  unique() %>% 
  table()

##            Drug
## Cholesterol drugA drugB drugC drugX DrugY
##      HIGH       1     1     1     1     1
##      NORMAL     1     1     0     1     1

drug %>% 
  select(Sex, Drug) %>% 
  arrange(Sex) %>% 
  unique() %>% 
  table()

##    Drug
## Sex drugA drugB drugC drugX DrugY
##   F     1     1     1     1     1
##   M     1     1     1     1     1

Each matrix above marks with a 1 every combination that occurs at least once in the data. Every drug appears for both sexes, so Sex does not help to tell the drugs apart, which is consistent with there being no split on Sex in our visualized model. This alone does not tell us anything about a medicine's effectiveness or safety. Blood pressure and the sodium-to-potassium ratio in blood, however, seem to have a big impact on which drug a person receives: drugA and drugB only occur with HIGH blood pressure, and only DrugY occurs with a high Na_to_K. With this knowledge, we can try to tune our decision tree model to be more accurate.

Tuning Decision Tree model

tree_model_tuned <- ctree(Drug ~.,
                    train_drug,
                    control = ctree_control(mincriterion=0.95,
                                             minsplit=0,
                                             minbucket=10))

plot(tree_model_tuned, type = "simple")

When we tune our model, the structure of the decision tree changes depending on the values of mincriterion, minsplit, and minbucket that we pass to the model. We can see a difference in the box with the number 8 on top: its prediction changed from DrugY to drugB after tuning. In the default tree that node held only 8 observations, fewer than the minbucket = 10 we now require, so the tuned tree can no longer keep such a small terminal node.

Prediction and evaluation of the tuned model

tree_pred_tuned <- predict(tree_model_tuned,
                     test_drug,
                     type = "response")

confusionMatrix(tree_pred_tuned,
                test_drug$Drug,
                positive = "drugA")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX DrugY
##      drugA     3     0     0     0     0
##      drugB     1     6     0     0     3
##      drugC     0     0     2     0     0
##      drugX     0     0     0    16     0
##      DrugY     0     0     0     1    28
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8161, 0.9724)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 2.677e-11       
##                                           
##                   Kappa : 0.8725          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity               0.75000       1.0000      1.00000       0.9412
## Specificity               1.00000       0.9259      1.00000       1.0000
## Pos Pred Value            1.00000       0.6000      1.00000       1.0000
## Neg Pred Value            0.98246       1.0000      1.00000       0.9773
## Prevalence                0.06667       0.1000      0.03333       0.2833
## Detection Rate            0.05000       0.1000      0.03333       0.2667
## Detection Prevalence      0.05000       0.1667      0.03333       0.2667
## Balanced Accuracy         0.87500       0.9630      1.00000       0.9706
##                      Class: DrugY
## Sensitivity                0.9032
## Specificity                0.9655
## Pos Pred Value             0.9655
## Neg Pred Value             0.9032
## Prevalence                 0.5167
## Detection Rate             0.4667
## Detection Prevalence       0.4833
## Balanced Accuracy          0.9344

Although it is possible to tune a decision tree, in this case the default model still gives the best result: its accuracy (95%) is about 3.33 percentage points higher than that of the slightly tuned model (91.67%). The tuned tree predicts drugB for 3 patients who actually received DrugY, so here the tuning makes the accuracy worse.
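
The comparison between the two trees can be checked directly from their confusion matrices. The sketch below re-enters both matrices by hand; `as_cm` and `accuracy` are small helpers invented for this illustration.

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "DrugY")

# Build a labelled confusion matrix (rows = prediction, columns = reference).
as_cm <- function(values) {
  matrix(values, nrow = 5, byrow = TRUE,
         dimnames = list(Prediction = classes, Reference = classes))
}

# Values copied from the two decision tree outputs above.
cm_default <- as_cm(c(3, 0, 0,  0,  0,
                      1, 5, 0,  0,  0,
                      0, 0, 2,  0,  0,
                      0, 0, 0, 16,  0,
                      0, 1, 0,  1, 31))
cm_tuned   <- as_cm(c(3, 0, 0,  0,  0,
                      1, 6, 0,  0,  3,
                      0, 0, 2,  0,  0,
                      0, 0, 0, 16,  0,
                      0, 0, 0,  1, 28))

accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy_gap <- accuracy(cm_default) - accuracy(cm_tuned)
round(100 * accuracy_gap, 2)  # difference in percentage points

cm_tuned["drugB", ]           # what the tuned tree's drugB predictions really were
```

The row for drugB shows where the accuracy was lost: the tuned tree gets all 6 real drugB cases right but also labels 1 drugA case and 3 DrugY cases as drugB.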

Random Forest

Random forest is a machine learning model that consists of many decision trees. The “forest” is trained through bagging (bootstrap aggregating): many random samples are drawn from the whole data with replacement, so duplicated rows are allowed, and one tree is grown on each sample. The algorithm then combines the predictions of the trees, by majority vote for classification or by averaging for regression. Growing more trees generally makes the outcome more stable, although the gain levels off after a certain point.
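
The two ingredients described above, bootstrap sampling and combining the trees' answers, can be sketched in a few lines of base R. This is only an illustration of the mechanics with made-up votes, not the randomForest implementation; the object names are invented here.

```r
set.seed(126)
n <- 300  # number of rows in the training data

# One bootstrap sample: n row numbers drawn with replacement.
in_bag <- sample.int(n, n, replace = TRUE)

# Rows that were never drawn are "out of bag" for this tree.
oob <- setdiff(seq_len(n), in_bag)

# Share of distinct rows that made it into the sample (usually around 63%).
in_bag_share <- length(unique(in_bag)) / n

# Made-up predictions from five trees for a single patient.
votes <- c("drugA", "DrugY", "DrugY", "drugX", "DrugY")

# Classification: the class with the most votes wins.
majority <- names(which.max(table(votes)))
majority
```

Each tree sees a slightly different version of the data, and the rows it did not see are what the out-of-bag (OOB) error estimate is computed on.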

Modeling

RNGkind(sample.kind = "Rounding")
set.seed(126)

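# The argument is `trControl`. The output recorded below came from a run that
# misspelled it as `tfControl`, so it reports the default bootstrap resampling.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
forest <- train(Drug ~ ., train_drug, method = "rf", trControl = ctrl)
saveRDS(forest, "forest_model_LBB.RDS")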

Since generating a random forest model is computationally heavy, and therefore time-consuming, especially with big samples of data, we want to generate it only once and store the result using saveRDS(), so that we can load it back later without having to train the model from the beginning.

Reading model

forest_model <- readRDS("forest_model_LBB.RDS")
forest_model

## Random Forest 
## 
## 300 samples
##   5 predictor
##   5 classes: 'drugA', 'drugB', 'drugC', 'drugX', 'DrugY' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 300, 300, 300, 300, 300, 300, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9988571  0.9985603
##   4     0.9988571  0.9985603
##   6     0.9988571  0.9985603
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The result is rather surprising: all three values of mtry give exactly the same, very high accuracy (about 99.9%) across the resamples. This made me suspect that the model might be overfitting. Since the accuracy is tied, caret selects the first of the best candidates, mtry = 2. Note that mtry is the number of variables randomly considered at each split, not the number of splits; trying more variables per split does not improve the model performance any further.

Final model

forest_model$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), tfControl = ..1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##       drugA drugB drugC drugX DrugY class.error
## drugA    60     0     0     0     0           0
## drugB     0    60     0     0     0           0
## drugC     0     0    60     0     0           0
## drugX     0     0     0    60     0           0
## DrugY     0     0     0     0    60           0

If we look at the OOB (Out-Of-Bag) estimate, the error rate is 0%, which means the model achieved 100% accuracy on the out-of-bag samples. This raised my suspicion even more.
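For context, the out-of-bag estimate works because each tree is grown on a bootstrap sample and can be checked on the rows that the sample left out. A rough sketch of how many rows that is, assuming the 300 training rows reported by caret above (the seed is arbitrary):

```r
# Probability that a given row is left out of one bootstrap sample of size n
n <- 300                       # number of training rows reported by caret above
p_oob <- (1 - 1 / n)^n         # close to exp(-1), i.e. a bit over a third

# Quick simulation check with a single bootstrap draw (seed is arbitrary)
set.seed(42)
boot_idx <- sample(n, size = n, replace = TRUE)
share_oob <- mean(!(seq_len(n) %in% boot_idx))

c(theory = p_oob, simulated = share_oob)
```

So each tree is evaluated on roughly a third of the training rows that it never saw, which is why the OOB error can serve as a built-in validation estimate.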

varImp(forest_model)

## rf variable importance
## 
##                   Overall
## Na_to_K            100.00
## BPLOW               63.28
## Age                 63.26
## BPNORMAL            52.14
## CholesterolNORMAL   18.43
## SexM                 0.00

Unfortunately, a random forest is hard to interpret, because the final prediction combines hundreds of trees, each built on a different random sample. However, we can at least see which variables have the most influence on our model. Na_to_K has by far the highest importance of all our variables. The numbers are the overall importance of each variable, based on the Gini importance calculation and scaled so that the most important variable scores 100. The gap between the most and the second most influential variables (Na_to_K and BPLOW) is quite large, at 36.72 points. This may help explain why, going back to our random forest result, trying only 2 variables at each split (mtry = 2) is already enough to reach an accuracy of around 99%.
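The gap between the two most important variables can be checked directly from the printed scores. The values below are copied by hand from the varImp() output above, so this is only a sanity check on the arithmetic, not a new calculation of importance.

```r
# Scaled importance values copied from the varImp() output above
imp <- c(Na_to_K = 100.00, BPLOW = 63.28, Age = 63.26,
         BPNORMAL = 52.14, CholesterolNORMAL = 18.43, SexM = 0.00)

imp_sorted <- sort(imp, decreasing = TRUE)
top_two <- names(imp_sorted)[1:2]                      # the two most important variables
gap_top_two <- unname(imp_sorted[1] - imp_sorted[2])   # difference in scaled points

top_two
gap_top_two
```

Because the scores are rescaled to the range 0 to 100, the difference is in scaled points rather than a percentage of anything in the data.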

Prediction and evaluation

forest_pred <- predict(forest_model,
               test_drug,
               type = "raw")

confusionMatrix(forest_pred,
                test_drug$Drug,
                positive = "drugA")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX DrugY
##      drugA     4     0     0     0     0
##      drugB     0     6     0     0     0
##      drugC     0     0     2     0     0
##      drugX     0     0     0    17     0
##      DrugY     0     0     0     0    31
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9404, 1)
##     No Information Rate : 0.5167     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity               1.00000          1.0      1.00000       1.0000
## Specificity               1.00000          1.0      1.00000       1.0000
## Pos Pred Value            1.00000          1.0      1.00000       1.0000
## Neg Pred Value            1.00000          1.0      1.00000       1.0000
## Prevalence                0.06667          0.1      0.03333       0.2833
## Detection Rate            0.06667          0.1      0.03333       0.2833
## Detection Prevalence      0.06667          0.1      0.03333       0.2833
## Balanced Accuracy         1.00000          1.0      1.00000       1.0000
##                      Class: DrugY
## Sensitivity                1.0000
## Specificity                1.0000
## Pos Pred Value             1.0000
## Neg Pred Value             1.0000
## Prevalence                 0.5167
## Detection Rate             0.5167
## Detection Prevalence       0.5167
## Balanced Accuracy          1.0000

Although I was suspicious of the Random Forest model at the beginning, it predicts our test data with 100% accuracy. On this data set, the model perfectly distinguishes which drug to use for a person with a given set of conditions. In a real-world situation, however, such a model is almost impossible to achieve, because many unknown factors affect which drug should be used. The perfect score here might be because the data comes from lab testing in which no factors other than the recorded variables determine the drug, or because other factors that are not recorded in the data were simply ignored.
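Even with every test case predicted correctly, the test set has only 60 rows, so the confidence interval in the output above does not collapse to a single point. As far as I know, confusionMatrix() gets this interval from an exact binomial test, so a small sketch with base R's binom.test() should give a lower bound close to the 0.9404 printed above. The counts are taken from the confusion matrix.

```r
# Test-set result from the confusion matrix above: all 60 predictions correct
correct <- 4 + 6 + 2 + 17 + 31
total <- 60

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int
round(ci, 4)

# No-information rate: share of the most frequent class (DrugY) in the test set
nir <- 31 / total
round(nir, 4)
```

In other words, with 60 test rows the data is still compatible with a true accuracy of roughly 94%, which is a useful reminder when reading a perfect score.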

Conclusion

All of our models performed well at predicting which drug is best for a person with a certain condition, based on the variables contained in the data. Naive Bayes and Decision Tree performed well, with the same accuracy. Although Random Forest was the best model of the three, all three perform at a satisfactory level, so I recommend the Decision Tree as the model for our prediction because it is interpretable, easy to understand, and can be adjusted to our preferences or needs.

Reference

https://www.kaggle.com/prathamtripathi/drug-classification?select=drug200.csv

Machine Learning For Drug Selection Based on Certain Conditions

by Mohammad Bagus Dwi Putra

06 September 2021
