library(tidyverse)
library(ggplot2)
library(caret)
library(rsample)
library(ROCR)
library(tm)
library(SnowballC)
library(partykit)
library(animation)
library(randomForest)

1 Intro

1.1 Choosing Target

In this project, I want to predict which drug each patient needs, so that we know which drugs to keep in stock over time.

1.2 Choosing Predictors

The predictors are Age, Sex, Blood Pressure, Cholesterol level, and the sodium-to-potassium ratio (Na_to_K). They are suitable predictors because each of them is recorded for every patient.

2 Import Data

drugs <- read.csv("drug200.csv", stringsAsFactors = TRUE)

3 EDA

Take a peek at the data:

drugs
##     Age Sex     BP Cholesterol Na_to_K  Drug
## 1    23   F   HIGH        HIGH  25.355 drugY
## 2    47   M    LOW        HIGH  13.093 drugC
## 3    47   M    LOW        HIGH  10.114 drugC
## 4    28   F NORMAL        HIGH   7.798 drugX
## 5    61   F    LOW        HIGH  18.043 drugY
## 6    22   F NORMAL        HIGH   8.607 drugX
## 7    49   F NORMAL        HIGH  16.275 drugY
## 8    41   M    LOW        HIGH  11.037 drugC
## 9    60   M NORMAL        HIGH  15.171 drugY
## 10   43   M    LOW      NORMAL  19.368 drugY
## 11   47   F    LOW        HIGH  11.767 drugC
## 12   34   F   HIGH      NORMAL  19.199 drugY
## 13   43   M    LOW        HIGH  15.376 drugY
## 14   74   F    LOW        HIGH  20.942 drugY
## 15   50   F NORMAL        HIGH  12.703 drugX
## 16   16   F   HIGH      NORMAL  15.516 drugY
## 17   69   M    LOW      NORMAL  11.455 drugX
## 18   43   M   HIGH        HIGH  13.972 drugA
## 19   23   M    LOW        HIGH   7.298 drugC
## 20   32   F   HIGH      NORMAL  25.974 drugY
## 21   57   M    LOW      NORMAL  19.128 drugY
## 22   63   M NORMAL        HIGH  25.917 drugY
## 23   47   M    LOW      NORMAL  30.568 drugY
## 24   48   F    LOW        HIGH  15.036 drugY
## 25   33   F    LOW        HIGH  33.486 drugY
## 26   28   F   HIGH      NORMAL  18.809 drugY
## 27   31   M   HIGH        HIGH  30.366 drugY
## 28   49   F NORMAL      NORMAL   9.381 drugX
## 29   39   F    LOW      NORMAL  22.697 drugY
## 30   45   M    LOW        HIGH  17.951 drugY
## 31   18   F NORMAL      NORMAL   8.750 drugX
## 32   74   M   HIGH        HIGH   9.567 drugB
## 33   49   M    LOW      NORMAL  11.014 drugX
## 34   65   F   HIGH      NORMAL  31.876 drugY
## 35   53   M NORMAL        HIGH  14.133 drugX
## 36   46   M NORMAL      NORMAL   7.285 drugX
## 37   32   M   HIGH      NORMAL   9.445 drugA
## 38   39   M    LOW      NORMAL  13.938 drugX
## 39   39   F NORMAL      NORMAL   9.709 drugX
## 40   15   M NORMAL        HIGH   9.084 drugX
## 41   73   F NORMAL        HIGH  19.221 drugY
## 42   58   F   HIGH      NORMAL  14.239 drugB
## 43   50   M NORMAL      NORMAL  15.790 drugY
## 44   23   M NORMAL        HIGH  12.260 drugX
## 45   50   F NORMAL      NORMAL  12.295 drugX
## 46   66   F NORMAL      NORMAL   8.107 drugX
## 47   37   F   HIGH        HIGH  13.091 drugA
## 48   68   M    LOW        HIGH  10.291 drugC
## 49   23   M NORMAL        HIGH  31.686 drugY
## 50   28   F    LOW        HIGH  19.796 drugY
## 51   58   F   HIGH        HIGH  19.416 drugY
## 52   67   M NORMAL      NORMAL  10.898 drugX
## 53   62   M    LOW      NORMAL  27.183 drugY
## 54   24   F   HIGH      NORMAL  18.457 drugY
## 55   68   F   HIGH      NORMAL  10.189 drugB
## 56   26   F    LOW        HIGH  14.160 drugC
## 57   65   M   HIGH      NORMAL  11.340 drugB
## 58   40   M   HIGH        HIGH  27.826 drugY
## 59   60   M NORMAL      NORMAL  10.091 drugX
## 60   34   M   HIGH        HIGH  18.703 drugY
## 61   38   F    LOW      NORMAL  29.875 drugY
## 62   24   M   HIGH      NORMAL   9.475 drugA
## 63   67   M    LOW      NORMAL  20.693 drugY
## 64   45   M    LOW      NORMAL   8.370 drugX
## 65   60   F   HIGH        HIGH  13.303 drugB
## 66   68   F NORMAL      NORMAL  27.050 drugY
## 67   29   M   HIGH        HIGH  12.856 drugA
## 68   17   M NORMAL      NORMAL  10.832 drugX
## 69   54   M NORMAL        HIGH  24.658 drugY
## 70   18   F   HIGH      NORMAL  24.276 drugY
## 71   70   M   HIGH        HIGH  13.967 drugB
## 72   28   F NORMAL        HIGH  19.675 drugY
## 73   24   F NORMAL        HIGH  10.605 drugX
## 74   41   F NORMAL      NORMAL  22.905 drugY
## 75   31   M   HIGH      NORMAL  17.069 drugY
## 76   26   M    LOW      NORMAL  20.909 drugY
## 77   36   F   HIGH        HIGH  11.198 drugA
## 78   26   F   HIGH      NORMAL  19.161 drugY
## 79   19   F   HIGH        HIGH  13.313 drugA
## 80   32   F    LOW      NORMAL  10.840 drugX
## 81   60   M   HIGH        HIGH  13.934 drugB
## 82   64   M NORMAL        HIGH   7.761 drugX
## 83   32   F    LOW        HIGH   9.712 drugC
## 84   38   F   HIGH      NORMAL  11.326 drugA
## 85   47   F    LOW        HIGH  10.067 drugC
## 86   59   M   HIGH        HIGH  13.935 drugB
## 87   51   F NORMAL        HIGH  13.597 drugX
## 88   69   M    LOW        HIGH  15.478 drugY
## 89   37   F   HIGH      NORMAL  23.091 drugY
## 90   50   F NORMAL      NORMAL  17.211 drugY
## 91   62   M NORMAL        HIGH  16.594 drugY
## 92   41   M   HIGH      NORMAL  15.156 drugY
## 93   29   F   HIGH        HIGH  29.450 drugY
## 94   42   F    LOW      NORMAL  29.271 drugY
## 95   56   M    LOW        HIGH  15.015 drugY
## 96   36   M    LOW      NORMAL  11.424 drugX
## 97   58   F    LOW        HIGH  38.247 drugY
## 98   56   F   HIGH        HIGH  25.395 drugY
## 99   20   M   HIGH      NORMAL  35.639 drugY
## 100  15   F   HIGH      NORMAL  16.725 drugY
## 101  31   M   HIGH      NORMAL  11.871 drugA
## 102  45   F   HIGH        HIGH  12.854 drugA
## 103  28   F    LOW        HIGH  13.127 drugC
## 104  56   M NORMAL        HIGH   8.966 drugX
## 105  22   M   HIGH      NORMAL  28.294 drugY
## 106  37   M    LOW      NORMAL   8.968 drugX
## 107  22   M NORMAL        HIGH  11.953 drugX
## 108  42   M    LOW        HIGH  20.013 drugY
## 109  72   M   HIGH      NORMAL   9.677 drugB
## 110  23   M NORMAL        HIGH  16.850 drugY
## 111  50   M   HIGH        HIGH   7.490 drugA
## 112  47   F NORMAL      NORMAL   6.683 drugX
## 113  35   M    LOW      NORMAL   9.170 drugX
## 114  65   F    LOW      NORMAL  13.769 drugX
## 115  20   F NORMAL      NORMAL   9.281 drugX
## 116  51   M   HIGH        HIGH  18.295 drugY
## 117  67   M NORMAL      NORMAL   9.514 drugX
## 118  40   F NORMAL        HIGH  10.103 drugX
## 119  32   F   HIGH      NORMAL  10.292 drugA
## 120  61   F   HIGH        HIGH  25.475 drugY
## 121  28   M NORMAL        HIGH  27.064 drugY
## 122  15   M   HIGH      NORMAL  17.206 drugY
## 123  34   M NORMAL        HIGH  22.456 drugY
## 124  36   F NORMAL        HIGH  16.753 drugY
## 125  53   F   HIGH      NORMAL  12.495 drugB
## 126  19   F   HIGH      NORMAL  25.969 drugY
## 127  66   M   HIGH        HIGH  16.347 drugY
## 128  35   M NORMAL      NORMAL   7.845 drugX
## 129  47   M    LOW      NORMAL  33.542 drugY
## 130  32   F NORMAL        HIGH   7.477 drugX
## 131  70   F NORMAL        HIGH  20.489 drugY
## 132  52   M    LOW      NORMAL  32.922 drugY
## 133  49   M    LOW      NORMAL  13.598 drugX
## 134  24   M NORMAL        HIGH  25.786 drugY
## 135  42   F   HIGH        HIGH  21.036 drugY
## 136  74   M    LOW      NORMAL  11.939 drugX
## 137  55   F   HIGH        HIGH  10.977 drugB
## 138  35   F   HIGH        HIGH  12.894 drugA
## 139  51   M   HIGH      NORMAL  11.343 drugB
## 140  69   F NORMAL        HIGH  10.065 drugX
## 141  49   M   HIGH      NORMAL   6.269 drugA
## 142  64   F    LOW      NORMAL  25.741 drugY
## 143  60   M   HIGH      NORMAL   8.621 drugB
## 144  74   M   HIGH      NORMAL  15.436 drugY
## 145  39   M   HIGH        HIGH   9.664 drugA
## 146  61   M NORMAL        HIGH   9.443 drugX
## 147  37   F    LOW      NORMAL  12.006 drugX
## 148  26   F   HIGH      NORMAL  12.307 drugA
## 149  61   F    LOW      NORMAL   7.340 drugX
## 150  22   M    LOW        HIGH   8.151 drugC
## 151  49   M   HIGH      NORMAL   8.700 drugA
## 152  68   M   HIGH        HIGH  11.009 drugB
## 153  55   M NORMAL      NORMAL   7.261 drugX
## 154  72   F    LOW      NORMAL  14.642 drugX
## 155  37   M    LOW      NORMAL  16.724 drugY
## 156  49   M    LOW        HIGH  10.537 drugC
## 157  31   M   HIGH      NORMAL  11.227 drugA
## 158  53   M    LOW        HIGH  22.963 drugY
## 159  59   F    LOW        HIGH  10.444 drugC
## 160  34   F    LOW      NORMAL  12.923 drugX
## 161  30   F NORMAL        HIGH  10.443 drugX
## 162  57   F   HIGH      NORMAL   9.945 drugB
## 163  43   M NORMAL      NORMAL  12.859 drugX
## 164  21   F   HIGH      NORMAL  28.632 drugY
## 165  16   M   HIGH      NORMAL  19.007 drugY
## 166  38   M    LOW        HIGH  18.295 drugY
## 167  58   F    LOW        HIGH  26.645 drugY
## 168  57   F NORMAL        HIGH  14.216 drugX
## 169  51   F    LOW      NORMAL  23.003 drugY
## 170  20   F   HIGH        HIGH  11.262 drugA
## 171  28   F NORMAL        HIGH  12.879 drugX
## 172  45   M    LOW      NORMAL  10.017 drugX
## 173  39   F NORMAL      NORMAL  17.225 drugY
## 174  41   F    LOW      NORMAL  18.739 drugY
## 175  42   M   HIGH      NORMAL  12.766 drugA
## 176  73   F   HIGH        HIGH  18.348 drugY
## 177  48   M   HIGH      NORMAL  10.446 drugA
## 178  25   M NORMAL        HIGH  19.011 drugY
## 179  39   M NORMAL        HIGH  15.969 drugY
## 180  67   F NORMAL        HIGH  15.891 drugY
## 181  22   F   HIGH      NORMAL  22.818 drugY
## 182  59   F NORMAL        HIGH  13.884 drugX
## 183  20   F    LOW      NORMAL  11.686 drugX
## 184  36   F   HIGH      NORMAL  15.490 drugY
## 185  18   F   HIGH        HIGH  37.188 drugY
## 186  57   F NORMAL      NORMAL  25.893 drugY
## 187  70   M   HIGH        HIGH   9.849 drugB
## 188  47   M   HIGH        HIGH  10.403 drugA
## 189  65   M   HIGH      NORMAL  34.997 drugY
## 190  64   M   HIGH      NORMAL  20.932 drugY
## 191  58   M   HIGH        HIGH  18.991 drugY
## 192  23   M   HIGH        HIGH   8.011 drugA
## 193  72   M    LOW        HIGH  16.310 drugY
## 194  72   M    LOW        HIGH   6.769 drugC
## 195  46   F   HIGH        HIGH  34.686 drugY
## 196  56   F    LOW        HIGH  11.567 drugC
## 197  16   M    LOW        HIGH  12.006 drugC
## 198  52   M NORMAL        HIGH   9.894 drugX
## 199  23   M NORMAL      NORMAL  14.020 drugX
## 200  40   F    LOW      NORMAL  11.349 drugX

What each column contains:

  • Age: Age of the patient
  • Sex: Sex of the patient
  • BP: Blood pressure level
  • Cholesterol: Cholesterol level
  • Na_to_K: Sodium-to-potassium ratio
  • Drug: The drug that works for the patient

Let’s see if there are any missing values:

anyNA(drugs)
## [1] FALSE

As there are no missing values, let’s now check the data types:

glimpse(drugs)
## Rows: 200
## Columns: 6
## $ Age         <int> 23, 47, 47, 28, 61, 22, 49, 41, 60, 43, 47, 34, 43, 74, 50…
## $ Sex         <fct> F, M, M, F, F, F, F, M, M, M, F, F, M, F, F, F, M, M, M, F…
## $ BP          <fct> HIGH, LOW, LOW, NORMAL, LOW, NORMAL, NORMAL, LOW, NORMAL, …
## $ Cholesterol <fct> HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, NORM…
## $ Na_to_K     <dbl> 25.355, 13.093, 10.114, 7.798, 18.043, 8.607, 16.275, 11.0…
## $ Drug        <fct> drugY, drugC, drugC, drugX, drugY, drugX, drugY, drugC, dr…

Everything seems to be in order. Now let’s move on to the next phase.

4 Cross Validation

4.1 Data Split

Let’s split the data into a training set and a test set:

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# randomly pick 80% of the row indices for training; the rest become the test set
index <- sample(nrow(drugs), nrow(drugs)*0.8)

drugs_train <- drugs[index,]
drugs_test <- drugs[-index,]
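A plain random split does not guarantee that every drug is represented in the test set in the same proportion as in the full data. A stratified split samples within each class instead. The sketch below shows the idea in base R on a made-up label vector (the class sizes are invented for illustration, not taken from drug200.csv); packaged alternatives such as caret's createDataPartition() exist as well.

```r
# Toy stand-in for the Drug column; these class sizes are invented.
set.seed(100)
drug <- factor(rep(c("drugA", "drugB", "drugY"), times = c(10, 5, 25)))

# Group the row indices by class, then keep 80% of the indices in each group.
rows_by_class <- split(seq_along(drug), drug)
train_index <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.8 * length(rows)))]
}), use.names = FALSE)

train_drug <- drug[train_index]
test_drug  <- drug[-train_index]

# Both parts keep the same class mix as the original vector.
table(train_drug)
table(test_drug)
```

With this approach each class contributes 80% of its rows to training, so small classes are not left to chance.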

4.2 Check Data Proportion

Now let’s check whether the training data is balanced:

prop.table(table(drugs_train$Drug))
## 
##   drugA   drugB   drugC   drugX   drugY 
## 0.12500 0.08125 0.08125 0.25625 0.45625

Since the data looks imbalanced, let’s balance it using downsampling. Downsampling (caret's downSample()) balances the classes by randomly discarding rows from the larger classes until every class has as many rows as the smallest one. Upsampling (upSample()) instead resamples rows from the smaller classes, with replacement, until every class has as many rows as the largest one.
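To make the difference concrete, here is a minimal base-R sketch of both ideas applied to a label vector. The class counts are the training proportions printed above multiplied by the 160 training rows. This is an illustration of the technique only; caret's own implementation may differ in detail.

```r
# Class counts in the training data: printed proportions times 160 rows.
class_counts <- c(drugA = 20, drugB = 13, drugC = 13, drugX = 41, drugY = 73)
labels <- factor(rep(names(class_counts), times = class_counts))

set.seed(444)
rows_by_class <- split(seq_along(labels), labels)

# Downsampling: keep only as many rows per class as the smallest class has.
n_min <- min(class_counts)
down_index <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}), use.names = FALSE)

# Upsampling: keep every row, then add resampled rows (with replacement)
# until each class is as large as the largest class.
n_max <- max(class_counts)
up_index <- unlist(lapply(rows_by_class, function(rows) {
  extra <- rows[sample.int(length(rows), n_max - length(rows), replace = TRUE)]
  c(rows, extra)
}), use.names = FALSE)

table(labels[down_index])  # 13 rows per class, 65 in total
table(labels[up_index])    # 73 rows per class, 365 in total
```

Downsampling throws away most of the training rows here, while upsampling keeps all of them and repeats rows from the smaller classes.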

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(444)

drugs_train_1 <- downSample(x=drugs_train |>  select(-Drug),
                         y=drugs_train$Drug, 
                         yname="Drug")

Re-check the proportions:

prop.table(table(drugs_train_1$Drug))
## 
## drugA drugB drugC drugX drugY 
##   0.2   0.2   0.2   0.2   0.2

Now the data is balanced. Great! Let’s move on to the next phase.

5 Model Fitting

Fit a conditional inference tree on the downsampled training data:

drugs_tree <- ctree(Drug~., drugs_train_1)

See the structure of the tree:

drugs_tree
## 
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
## 
## Fitted party:
## [1] root
## |   [2] BP in HIGH
## |   |   [3] Age <= 50: drugA (n = 17, err = 23.5%)
## |   |   [4] Age > 50: drugB (n = 13, err = 0.0%)
## |   [5] BP in LOW, NORMAL
## |   |   [6] Na_to_K <= 14.16
## |   |   |   [7] Cholesterol in HIGH: drugC (n = 16, err = 18.8%)
## |   |   |   [8] Cholesterol in NORMAL: drugX (n = 10, err = 0.0%)
## |   |   [9] Na_to_K > 14.16: drugY (n = 9, err = 0.0%)
## 
## Number of inner nodes:    4
## Number of terminal nodes: 5

See the structure as a plot:

plot(drugs_tree, type="simple")
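One way to read the printed tree is as a set of nested rules. The sketch below transcribes the splits by hand into a hypothetical helper, predict_tree_1() (my own name, not a partykit function), and applies it to rows 1, 2, and 4 of the data printed earlier. It is only a reading aid; predict() on the fitted tree remains the real prediction path.

```r
# Hand-written transcription of the splits printed for drugs_tree.
# predict_tree_1() is a hypothetical helper, not part of partykit.
predict_tree_1 <- function(age, bp, cholesterol, na_to_k) {
  if (bp == "HIGH") {
    if (age <= 50) "drugA" else "drugB"
  } else if (na_to_k <= 14.16) {
    if (cholesterol == "HIGH") "drugC" else "drugX"
  } else {
    "drugY"
  }
}

# Rows 1, 2 and 4 of the printed data, with their true Drug labels.
patients <- data.frame(
  Age         = c(23, 47, 28),
  BP          = c("HIGH", "LOW", "NORMAL"),
  Cholesterol = c("HIGH", "HIGH", "HIGH"),
  Na_to_K     = c(25.355, 13.093, 7.798),
  Drug        = c("drugY", "drugC", "drugX"),
  stringsAsFactors = FALSE
)

patients$Predicted <- mapply(
  predict_tree_1,
  patients$Age, patients$BP, patients$Cholesterol, patients$Na_to_K
)
patients
```

Row 2 is classified correctly, but row 1 (drugY) comes out as drugA because the HIGH blood pressure branch never looks at Na_to_K, and row 4 (drugX) comes out as drugC because the tree does not separate LOW from NORMAL blood pressure under high cholesterol.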

6 Model Evaluation

pred_drugs <- predict(drugs_tree, drugs_test, type="response")

confusionMatrix(pred_drugs, drugs_test$Drug)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX drugY
##      drugA     3     0     0     0     6
##      drugB     0     3     0     0     4
##      drugC     0     0     3     7     0
##      drugX     0     0     0     4     0
##      drugY     0     0     0     2     8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.525           
##                  95% CI : (0.3613, 0.6849)
##     No Information Rate : 0.45            
##     P-Value [Acc > NIR] : 0.213           
##                                           
##                   Kappa : 0.4109          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity                1.0000       1.0000       1.0000       0.3077
## Specificity                0.8378       0.8919       0.8108       1.0000
## Pos Pred Value             0.3333       0.4286       0.3000       1.0000
## Neg Pred Value             1.0000       1.0000       1.0000       0.7500
## Prevalence                 0.0750       0.0750       0.0750       0.3250
## Detection Rate             0.0750       0.0750       0.0750       0.1000
## Detection Prevalence       0.2250       0.1750       0.2500       0.1000
## Balanced Accuracy          0.9189       0.9459       0.9054       0.6538
##                      Class: drugY
## Sensitivity                0.4444
## Specificity                0.9091
## Pos Pred Value             0.8000
## Neg Pred Value             0.6667
## Prevalence                 0.4500
## Detection Rate             0.2000
## Detection Prevalence       0.2500
## Balanced Accuracy          0.6768
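As a check on the output above, the headline numbers can be reproduced by hand from the confusion matrix: accuracy is the share of test rows on the diagonal, and each class's sensitivity is its diagonal cell divided by its column total. The sketch below re-enters the printed matrix manually rather than reusing the caret object.

```r
drug_names <- c("drugA", "drugB", "drugC", "drugX", "drugY")

# Confusion matrix for model 1, copied from the output above
# (rows are predictions, columns are the true classes).
cm <- matrix(
  c(3, 0, 0, 0, 6,
    0, 3, 0, 0, 4,
    0, 0, 3, 7, 0,
    0, 0, 0, 4, 0,
    0, 0, 0, 2, 8),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = drug_names, Reference = drug_names)
)

accuracy    <- sum(diag(cm)) / sum(cm)    # 21 correct out of 40 = 0.525
sensitivity <- diag(cm) / colSums(cm)     # per-class recall
no_info     <- max(colSums(cm)) / sum(cm) # always guessing drugY: 18/40 = 0.45

accuracy
round(sensitivity, 4)
no_info
```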

Since we want every prediction in every class to be correct, we look at accuracy. The model performs poorly: its accuracy is 0.525 (52.5%), well below the 90% we assume as the minimum acceptable threshold. Most of the errors are drugY patients predicted as drugA or drugB, and drugX patients predicted as drugC. So we need to try another model. Let’s repeat the process for model 2.

7 Model 2

7.1 Balance the Data

This time, let’s balance the training data using upsampling:

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(444)

drugs_train_2 <- upSample(x=drugs_train |>  select(-Drug),
                         y=drugs_train$Drug, 
                         yname="Drug")

7.2 Check Data Proportion

Check that the proportions are balanced:

prop.table(table(drugs_train_2$Drug))
## 
## drugA drugB drugC drugX drugY 
##   0.2   0.2   0.2   0.2   0.2

Now the target is balanced. Let’s move on.

7.3 Model Fitting

Fit a conditional inference tree on the upsampled training data:

drugs_tree_2 <- ctree(Drug~., drugs_train_2)

The structure of the model 2 tree:

drugs_tree_2
## 
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
## 
## Fitted party:
## [1] root
## |   [2] BP in HIGH
## |   |   [3] Na_to_K <= 13.972
## |   |   |   [4] Age <= 50: drugA (n = 73, err = 0.0%)
## |   |   |   [5] Age > 50: drugB (n = 73, err = 0.0%)
## |   |   [6] Na_to_K > 13.972: drugY (n = 28, err = 0.0%)
## |   [7] BP in LOW, NORMAL
## |   |   [8] Na_to_K <= 14.16
## |   |   |   [9] Cholesterol in HIGH
## |   |   |   |   [10] BP in LOW: drugC (n = 73, err = 0.0%)
## |   |   |   |   [11] BP in NORMAL: drugX (n = 19, err = 0.0%)
## |   |   |   [12] Cholesterol in NORMAL: drugX (n = 54, err = 0.0%)
## |   |   [13] Na_to_K > 14.16: drugY (n = 45, err = 0.0%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

See the structure as a plot:

plot(drugs_tree_2, type="simple")

7.4 Model 2 Evaluation

pred_drugs_2 <- predict(drugs_tree_2, drugs_test, type="response")

confusionMatrix(pred_drugs_2, drugs_test$Drug)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction drugA drugB drugC drugX drugY
##      drugA     3     0     0     0     0
##      drugB     0     2     0     0     0
##      drugC     0     0     3     0     0
##      drugX     0     0     0    11     0
##      drugY     0     1     0     2    18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.925           
##                  95% CI : (0.7961, 0.9843)
##     No Information Rate : 0.45            
##     P-Value [Acc > NIR] : 0.0000000002588 
##                                           
##                   Kappa : 0.8863          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity                 1.000       0.6667        1.000       0.8462
## Specificity                 1.000       1.0000        1.000       1.0000
## Pos Pred Value              1.000       1.0000        1.000       1.0000
## Neg Pred Value              1.000       0.9737        1.000       0.9310
## Prevalence                  0.075       0.0750        0.075       0.3250
## Detection Rate              0.075       0.0500        0.075       0.2750
## Detection Prevalence        0.075       0.0500        0.075       0.2750
## Balanced Accuracy           1.000       0.8333        1.000       0.9231
##                      Class: drugY
## Sensitivity                1.0000
## Specificity                0.8636
## Pos Pred Value             0.8571
## Neg Pred Value             1.0000
## Prevalence                 0.4500
## Detection Rate             0.4500
## Detection Prevalence       0.5250
## Balanced Accuracy          0.9318

Model 2 performs better than model 1 and clears the 90% threshold, with an accuracy of 0.925 (92.5%). Based on that threshold, we can use this model.
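With only 40 test rows, the accuracy estimate is fairly uncertain, which is what the 95% CI line in the output reflects. As far as I know, caret obtains that interval from binom.test(), so it can be reproduced directly from the counts on the diagonal of the confusion matrix. The sketch below does that and compares the lower bound with the 90% threshold; treat it as a sanity check rather than a formal acceptance test.

```r
# Correct predictions are the diagonal of the model 2 confusion matrix.
correct <- 3 + 2 + 3 + 11 + 18   # 37
total   <- 40                    # rows in the test set

# Exact binomial test; its conf.int element is a 95% interval by default.
acc_test <- binom.test(correct, total)

acc_test$estimate   # point estimate: 37 / 40 = 0.925
acc_test$conf.int   # should line up with the 95% CI printed above

# Does the lower end of the interval also clear the 90% threshold?
threshold <- 0.90
acc_test$conf.int[1] >= threshold
```

The point estimate is above 90%, but the lower end of the interval is not, so a larger test set would be needed to be confident that the model really meets the threshold.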

8 Conclusion and Suggestion

As we can see, balancing the training data with upsampling (model 2) gives a better model than balancing it with downsampling (model 1). Downsampling cut the training set from 160 rows to 65 (13 per class), whereas upsampling kept every original row. So when the initial training data is imbalanced, we need to be careful about how we rebalance it, because the choice can strongly affect the model we build.

Model 1 clearly cannot be used to predict which drugs patients will need, as its accuracy is only 52.5%. Model 2 performs above the 90% minimum threshold, so we can apply it to new data to predict which drug each patient needs.