library(tidyverse)
library(ggplot2)
library(caret)
library(rsample)
library(ROCR)
library(tm)
library(SnowballC)
library(partykit)
library(animation)
library(randomForest)

In this project, I want to predict which drug each patient needs, so that we know which drugs to keep in stock over time.
The predictors are Age, Sex, Blood Pressure, Cholesterol level, and the Sodium-Potassium level (Na_to_K). They are suitable predictor variables because this information is available for every patient.
drugs <- read.csv("drug200.csv", stringsAsFactors = TRUE)

Take a peek into the data

drugs
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 drugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 drugY
## 6 22 F NORMAL HIGH 8.607 drugX
## 7 49 F NORMAL HIGH 16.275 drugY
## 8 41 M LOW HIGH 11.037 drugC
## 9 60 M NORMAL HIGH 15.171 drugY
## 10 43 M LOW NORMAL 19.368 drugY
## 11 47 F LOW HIGH 11.767 drugC
## 12 34 F HIGH NORMAL 19.199 drugY
## 13 43 M LOW HIGH 15.376 drugY
## 14 74 F LOW HIGH 20.942 drugY
## 15 50 F NORMAL HIGH 12.703 drugX
## 16 16 F HIGH NORMAL 15.516 drugY
## 17 69 M LOW NORMAL 11.455 drugX
## 18 43 M HIGH HIGH 13.972 drugA
## 19 23 M LOW HIGH 7.298 drugC
## 20 32 F HIGH NORMAL 25.974 drugY
## 21 57 M LOW NORMAL 19.128 drugY
## 22 63 M NORMAL HIGH 25.917 drugY
## 23 47 M LOW NORMAL 30.568 drugY
## 24 48 F LOW HIGH 15.036 drugY
## 25 33 F LOW HIGH 33.486 drugY
## 26 28 F HIGH NORMAL 18.809 drugY
## 27 31 M HIGH HIGH 30.366 drugY
## 28 49 F NORMAL NORMAL 9.381 drugX
## 29 39 F LOW NORMAL 22.697 drugY
## 30 45 M LOW HIGH 17.951 drugY
## 31 18 F NORMAL NORMAL 8.750 drugX
## 32 74 M HIGH HIGH 9.567 drugB
## 33 49 M LOW NORMAL 11.014 drugX
## 34 65 F HIGH NORMAL 31.876 drugY
## 35 53 M NORMAL HIGH 14.133 drugX
## 36 46 M NORMAL NORMAL 7.285 drugX
## 37 32 M HIGH NORMAL 9.445 drugA
## 38 39 M LOW NORMAL 13.938 drugX
## 39 39 F NORMAL NORMAL 9.709 drugX
## 40 15 M NORMAL HIGH 9.084 drugX
## 41 73 F NORMAL HIGH 19.221 drugY
## 42 58 F HIGH NORMAL 14.239 drugB
## 43 50 M NORMAL NORMAL 15.790 drugY
## 44 23 M NORMAL HIGH 12.260 drugX
## 45 50 F NORMAL NORMAL 12.295 drugX
## 46 66 F NORMAL NORMAL 8.107 drugX
## 47 37 F HIGH HIGH 13.091 drugA
## 48 68 M LOW HIGH 10.291 drugC
## 49 23 M NORMAL HIGH 31.686 drugY
## 50 28 F LOW HIGH 19.796 drugY
## 51 58 F HIGH HIGH 19.416 drugY
## 52 67 M NORMAL NORMAL 10.898 drugX
## 53 62 M LOW NORMAL 27.183 drugY
## 54 24 F HIGH NORMAL 18.457 drugY
## 55 68 F HIGH NORMAL 10.189 drugB
## 56 26 F LOW HIGH 14.160 drugC
## 57 65 M HIGH NORMAL 11.340 drugB
## 58 40 M HIGH HIGH 27.826 drugY
## 59 60 M NORMAL NORMAL 10.091 drugX
## 60 34 M HIGH HIGH 18.703 drugY
## 61 38 F LOW NORMAL 29.875 drugY
## 62 24 M HIGH NORMAL 9.475 drugA
## 63 67 M LOW NORMAL 20.693 drugY
## 64 45 M LOW NORMAL 8.370 drugX
## 65 60 F HIGH HIGH 13.303 drugB
## 66 68 F NORMAL NORMAL 27.050 drugY
## 67 29 M HIGH HIGH 12.856 drugA
## 68 17 M NORMAL NORMAL 10.832 drugX
## 69 54 M NORMAL HIGH 24.658 drugY
## 70 18 F HIGH NORMAL 24.276 drugY
## 71 70 M HIGH HIGH 13.967 drugB
## 72 28 F NORMAL HIGH 19.675 drugY
## 73 24 F NORMAL HIGH 10.605 drugX
## 74 41 F NORMAL NORMAL 22.905 drugY
## 75 31 M HIGH NORMAL 17.069 drugY
## 76 26 M LOW NORMAL 20.909 drugY
## 77 36 F HIGH HIGH 11.198 drugA
## 78 26 F HIGH NORMAL 19.161 drugY
## 79 19 F HIGH HIGH 13.313 drugA
## 80 32 F LOW NORMAL 10.840 drugX
## 81 60 M HIGH HIGH 13.934 drugB
## 82 64 M NORMAL HIGH 7.761 drugX
## 83 32 F LOW HIGH 9.712 drugC
## 84 38 F HIGH NORMAL 11.326 drugA
## 85 47 F LOW HIGH 10.067 drugC
## 86 59 M HIGH HIGH 13.935 drugB
## 87 51 F NORMAL HIGH 13.597 drugX
## 88 69 M LOW HIGH 15.478 drugY
## 89 37 F HIGH NORMAL 23.091 drugY
## 90 50 F NORMAL NORMAL 17.211 drugY
## 91 62 M NORMAL HIGH 16.594 drugY
## 92 41 M HIGH NORMAL 15.156 drugY
## 93 29 F HIGH HIGH 29.450 drugY
## 94 42 F LOW NORMAL 29.271 drugY
## 95 56 M LOW HIGH 15.015 drugY
## 96 36 M LOW NORMAL 11.424 drugX
## 97 58 F LOW HIGH 38.247 drugY
## 98 56 F HIGH HIGH 25.395 drugY
## 99 20 M HIGH NORMAL 35.639 drugY
## 100 15 F HIGH NORMAL 16.725 drugY
## 101 31 M HIGH NORMAL 11.871 drugA
## 102 45 F HIGH HIGH 12.854 drugA
## 103 28 F LOW HIGH 13.127 drugC
## 104 56 M NORMAL HIGH 8.966 drugX
## 105 22 M HIGH NORMAL 28.294 drugY
## 106 37 M LOW NORMAL 8.968 drugX
## 107 22 M NORMAL HIGH 11.953 drugX
## 108 42 M LOW HIGH 20.013 drugY
## 109 72 M HIGH NORMAL 9.677 drugB
## 110 23 M NORMAL HIGH 16.850 drugY
## 111 50 M HIGH HIGH 7.490 drugA
## 112 47 F NORMAL NORMAL 6.683 drugX
## 113 35 M LOW NORMAL 9.170 drugX
## 114 65 F LOW NORMAL 13.769 drugX
## 115 20 F NORMAL NORMAL 9.281 drugX
## 116 51 M HIGH HIGH 18.295 drugY
## 117 67 M NORMAL NORMAL 9.514 drugX
## 118 40 F NORMAL HIGH 10.103 drugX
## 119 32 F HIGH NORMAL 10.292 drugA
## 120 61 F HIGH HIGH 25.475 drugY
## 121 28 M NORMAL HIGH 27.064 drugY
## 122 15 M HIGH NORMAL 17.206 drugY
## 123 34 M NORMAL HIGH 22.456 drugY
## 124 36 F NORMAL HIGH 16.753 drugY
## 125 53 F HIGH NORMAL 12.495 drugB
## 126 19 F HIGH NORMAL 25.969 drugY
## 127 66 M HIGH HIGH 16.347 drugY
## 128 35 M NORMAL NORMAL 7.845 drugX
## 129 47 M LOW NORMAL 33.542 drugY
## 130 32 F NORMAL HIGH 7.477 drugX
## 131 70 F NORMAL HIGH 20.489 drugY
## 132 52 M LOW NORMAL 32.922 drugY
## 133 49 M LOW NORMAL 13.598 drugX
## 134 24 M NORMAL HIGH 25.786 drugY
## 135 42 F HIGH HIGH 21.036 drugY
## 136 74 M LOW NORMAL 11.939 drugX
## 137 55 F HIGH HIGH 10.977 drugB
## 138 35 F HIGH HIGH 12.894 drugA
## 139 51 M HIGH NORMAL 11.343 drugB
## 140 69 F NORMAL HIGH 10.065 drugX
## 141 49 M HIGH NORMAL 6.269 drugA
## 142 64 F LOW NORMAL 25.741 drugY
## 143 60 M HIGH NORMAL 8.621 drugB
## 144 74 M HIGH NORMAL 15.436 drugY
## 145 39 M HIGH HIGH 9.664 drugA
## 146 61 M NORMAL HIGH 9.443 drugX
## 147 37 F LOW NORMAL 12.006 drugX
## 148 26 F HIGH NORMAL 12.307 drugA
## 149 61 F LOW NORMAL 7.340 drugX
## 150 22 M LOW HIGH 8.151 drugC
## 151 49 M HIGH NORMAL 8.700 drugA
## 152 68 M HIGH HIGH 11.009 drugB
## 153 55 M NORMAL NORMAL 7.261 drugX
## 154 72 F LOW NORMAL 14.642 drugX
## 155 37 M LOW NORMAL 16.724 drugY
## 156 49 M LOW HIGH 10.537 drugC
## 157 31 M HIGH NORMAL 11.227 drugA
## 158 53 M LOW HIGH 22.963 drugY
## 159 59 F LOW HIGH 10.444 drugC
## 160 34 F LOW NORMAL 12.923 drugX
## 161 30 F NORMAL HIGH 10.443 drugX
## 162 57 F HIGH NORMAL 9.945 drugB
## 163 43 M NORMAL NORMAL 12.859 drugX
## 164 21 F HIGH NORMAL 28.632 drugY
## 165 16 M HIGH NORMAL 19.007 drugY
## 166 38 M LOW HIGH 18.295 drugY
## 167 58 F LOW HIGH 26.645 drugY
## 168 57 F NORMAL HIGH 14.216 drugX
## 169 51 F LOW NORMAL 23.003 drugY
## 170 20 F HIGH HIGH 11.262 drugA
## 171 28 F NORMAL HIGH 12.879 drugX
## 172 45 M LOW NORMAL 10.017 drugX
## 173 39 F NORMAL NORMAL 17.225 drugY
## 174 41 F LOW NORMAL 18.739 drugY
## 175 42 M HIGH NORMAL 12.766 drugA
## 176 73 F HIGH HIGH 18.348 drugY
## 177 48 M HIGH NORMAL 10.446 drugA
## 178 25 M NORMAL HIGH 19.011 drugY
## 179 39 M NORMAL HIGH 15.969 drugY
## 180 67 F NORMAL HIGH 15.891 drugY
## 181 22 F HIGH NORMAL 22.818 drugY
## 182 59 F NORMAL HIGH 13.884 drugX
## 183 20 F LOW NORMAL 11.686 drugX
## 184 36 F HIGH NORMAL 15.490 drugY
## 185 18 F HIGH HIGH 37.188 drugY
## 186 57 F NORMAL NORMAL 25.893 drugY
## 187 70 M HIGH HIGH 9.849 drugB
## 188 47 M HIGH HIGH 10.403 drugA
## 189 65 M HIGH NORMAL 34.997 drugY
## 190 64 M HIGH NORMAL 20.932 drugY
## 191 58 M HIGH HIGH 18.991 drugY
## 192 23 M HIGH HIGH 8.011 drugA
## 193 72 M LOW HIGH 16.310 drugY
## 194 72 M LOW HIGH 6.769 drugC
## 195 46 F HIGH HIGH 34.686 drugY
## 196 56 F LOW HIGH 11.567 drugC
## 197 16 M LOW HIGH 12.006 drugC
## 198 52 M NORMAL HIGH 9.894 drugX
## 199 23 M NORMAL NORMAL 14.020 drugX
## 200 40 F LOW NORMAL 11.349 drugX
What the data tells us:
Let's see if there are any missing values.

anyNA(drugs)
## [1] FALSE
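`anyNA()` only reports whether any value is missing at all. Had it returned TRUE, a per-column count would show where the gaps are. A minimal sketch on a small stand-in data frame: the rows are the first five shown above, with one `Na_to_K` value deliberately set to NA for illustration, and `toy` and `na_counts` are hypothetical names.

```r
# Stand-in for `drugs`: the first five rows shown above, with one Na_to_K
# value deliberately replaced by NA (an assumption, for illustration only).
toy <- data.frame(
  Age = c(23L, 47L, 47L, 28L, 61L),
  Sex = factor(c("F", "M", "M", "F", "F")),
  BP = factor(c("HIGH", "LOW", "LOW", "NORMAL", "LOW")),
  Cholesterol = factor(c("HIGH", "HIGH", "HIGH", "HIGH", "HIGH")),
  Na_to_K = c(25.355, 13.093, NA, 7.798, 18.043),
  Drug = factor(c("drugY", "drugC", "drugC", "drugX", "drugY"))
)

# Is anything missing at all?
anyNA(toy)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

On the real data every count would be zero, which is consistent with the FALSE above.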
As there are no missing values, let's now check the data types.

glimpse(drugs)
## Rows: 200
## Columns: 6
## $ Age <int> 23, 47, 47, 28, 61, 22, 49, 41, 60, 43, 47, 34, 43, 74, 50…
## $ Sex <fct> F, M, M, F, F, F, F, M, M, M, F, F, M, F, F, F, M, M, M, F…
## $ BP <fct> HIGH, LOW, LOW, NORMAL, LOW, NORMAL, NORMAL, LOW, NORMAL, …
## $ Cholesterol <fct> HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, NORM…
## $ Na_to_K <dbl> 25.355, 13.093, 10.114, 7.798, 18.043, 8.607, 16.275, 11.0…
## $ Drug <fct> drugY, drugC, drugC, drugX, drugY, drugX, drugY, drugC, dr…
Everything seems to be in order. Now let's take it to the next phase.
Let's split the data into a training set and a test set.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# Randomly pick 80% of the row indices for training; the rest form the test set
index <- sample(nrow(drugs), nrow(drugs) * 0.8)
drugs_train <- drugs[index, ]
drugs_test <- drugs[-index, ]

Now let's check whether the training data is balanced.

prop.table(table(drugs_train$Drug))
##
## drugA drugB drugC drugX drugY
## 0.12500 0.08125 0.08125 0.25625 0.45625
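The split and the balance check above can be reproduced on a stand-in data frame to see what the index trick does. This is only a sketch: `toy` is synthetic (its labels are drawn at random, not taken from `drug200.csv`), and the seed is reused purely for reproducibility.

```r
# Synthetic stand-in for `drugs`: 200 rows with random class labels
# (an assumption; the real data would be used in practice).
set.seed(100)
toy <- data.frame(
  id = seq_len(200),
  Drug = factor(sample(c("drugA", "drugB", "drugC", "drugX", "drugY"),
                       200, replace = TRUE))
)

# Same pattern as above: 80% of the row indices go to training.
index <- sample(nrow(toy), nrow(toy) * 0.8)
toy_train <- toy[index, ]
toy_test <- toy[-index, ]

# The two parts are disjoint and together cover every row.
nrow(toy_train)                                # 160
nrow(toy_test)                                 # 40
length(intersect(toy_train$id, toy_test$id))   # 0

# Class shares in each part; a plain random split does not guarantee
# that the two sets have the same shares.
round(prop.table(table(toy_train$Drug)), 3)
round(prop.table(table(toy_test$Drug)), 3)
```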
Since the classes are imbalanced, let's balance them using downSample(). Down-sampling randomly drops rows from the larger classes until every class has as many rows as the smallest one (here 13 rows per class, 65 in total). Up-sampling, by contrast, samples rows from the smaller classes with replacement until every class has as many rows as the largest one.
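To make the two strategies concrete, here is a base R sketch of the idea. It is not caret's implementation, only an illustration under the assumption that both methods work per class on row indices. The class counts are the training proportions above multiplied by 160 rows, and `down_idx` and `up_idx` are hypothetical names.

```r
# Class counts implied by the training proportions above (160 rows in total).
counts <- c(drugA = 20, drugB = 13, drugC = 13, drugX = 41, drugY = 73)
drug <- factor(rep(names(counts), times = counts))

set.seed(444)
idx_by_class <- split(seq_along(drug), drug)

# Down-sampling: keep only as many rows per class as the rarest class has.
n_min <- min(lengths(idx_by_class))
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))

# Up-sampling: keep every row, then draw extra rows with replacement
# until each class is as large as the most common class.
n_max <- max(lengths(idx_by_class))
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}))

table(drug[down_idx])   # 13 rows per class, 65 in total
table(drug[up_idx])     # 73 rows per class, 365 in total
```

Down-sampling discards information from the larger classes, while up-sampling repeats rows from the smaller ones, so the two balanced training sets differ a lot in size.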
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(444)
drugs_train_1 <- downSample(x = drugs_train |> select(-Drug),
                            y = drugs_train$Drug,
                            yname = "Drug")

Re-check the proportions.

prop.table(table(drugs_train_1$Drug))
##
## drugA drugB drugC drugX drugY
## 0.2 0.2 0.2 0.2 0.2
Now the data is balanced. Great! Let's move on to the next phase.
Modeling

drugs_tree <- ctree(Drug ~ ., drugs_train_1)

See the structure of the tree.

drugs_tree
##
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
##
## Fitted party:
## [1] root
## | [2] BP in HIGH
## | | [3] Age <= 50: drugA (n = 17, err = 23.5%)
## | | [4] Age > 50: drugB (n = 13, err = 0.0%)
## | [5] BP in LOW, NORMAL
## | | [6] Na_to_K <= 14.16
## | | | [7] Cholesterol in HIGH: drugC (n = 16, err = 18.8%)
## | | | [8] Cholesterol in NORMAL: drugX (n = 10, err = 0.0%)
## | | [9] Na_to_K > 14.16: drugY (n = 9, err = 0.0%)
##
## Number of inner nodes: 4
## Number of terminal nodes: 5
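The printed tree can be read as a set of nested if/else rules. The sketch below copies the splits above by hand into a plain R function, to show how a single patient is routed to a terminal node. `tree_1_rule` is a hypothetical helper written for illustration, not something partykit provides.

```r
# Hand-written copy of the splits printed above (model 1).
# `tree_1_rule` is a hypothetical helper, not part of partykit.
tree_1_rule <- function(age, bp, cholesterol, na_to_k) {
  if (bp == "HIGH") {
    # Node 2: only Age is consulted for HIGH blood pressure
    if (age <= 50) "drugA" else "drugB"
  } else if (na_to_k <= 14.16) {
    # Node 6: LOW and NORMAL blood pressure are not told apart here
    if (cholesterol == "HIGH") "drugC" else "drugX"
  } else {
    "drugY"
  }
}

# Row 1 of the data: Age 23, BP HIGH, Cholesterol HIGH, Na_to_K 25.355 (actual: drugY)
tree_1_rule(23, "HIGH", "HIGH", 25.355)    # "drugA"

# Row 4 of the data: Age 28, BP NORMAL, Cholesterol HIGH, Na_to_K 7.798 (actual: drugX)
tree_1_rule(28, "NORMAL", "HIGH", 7.798)   # "drugC"
```

Both rows end up in the wrong node: the HIGH branch never looks at Na_to_K, and the other branch never separates LOW from NORMAL blood pressure.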
See the structure using a visualization.

plot(drugs_tree, type = "simple")

pred_drugs <- predict(drugs_tree, drugs_test, type = "response")
confusionMatrix(pred_drugs, drugs_test$Drug)
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX drugY
## drugA 3 0 0 0 6
## drugB 0 3 0 0 4
## drugC 0 0 3 7 0
## drugX 0 0 0 4 0
## drugY 0 0 0 2 8
##
## Overall Statistics
##
## Accuracy : 0.525
## 95% CI : (0.3613, 0.6849)
## No Information Rate : 0.45
## P-Value [Acc > NIR] : 0.213
##
## Kappa : 0.4109
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.0000 1.0000 1.0000 0.3077
## Specificity 0.8378 0.8919 0.8108 1.0000
## Pos Pred Value 0.3333 0.4286 0.3000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 0.7500
## Prevalence 0.0750 0.0750 0.0750 0.3250
## Detection Rate 0.0750 0.0750 0.0750 0.1000
## Detection Prevalence 0.2250 0.1750 0.2500 0.1000
## Balanced Accuracy 0.9189 0.9459 0.9054 0.6538
## Class: drugY
## Sensitivity 0.4444
## Specificity 0.9091
## Pos Pred Value 0.8000
## Neg Pred Value 0.6667
## Prevalence 0.4500
## Detection Rate 0.2000
## Detection Prevalence 0.2500
## Balanced Accuracy 0.6768
Since we want the predictions for every class to be correct, we will look at the accuracy metric. This model performs poorly, with an accuracy of 0.525 (52.5%), far below the minimum of 90% that we assume as our threshold. More than half of the drugX and drugY test cases are misclassified. So we need to try another model. Let's repeat the process to build model 2.
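The headline numbers can be recomputed by hand from the confusion matrix, which makes it clearer what they measure. A small sketch, with the counts typed in from the table above (rows are predictions, columns are the reference labels); `cm_1` and the other names are hypothetical.

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "drugY")

# Confusion matrix of model 1, copied from the output above.
cm_1 <- matrix(
  c(3, 0, 0, 0, 6,
    0, 3, 0, 0, 4,
    0, 0, 3, 7, 0,
    0, 0, 0, 4, 0,
    0, 0, 0, 2, 8),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy_1 <- sum(diag(cm_1)) / sum(cm_1)

# Sensitivity per class: correct predictions over actual cases of that class.
sensitivity_1 <- diag(cm_1) / colSums(cm_1)

# Positive predictive value per class: correct predictions over all
# predictions of that class.
ppv_1 <- diag(cm_1) / rowSums(cm_1)

accuracy_1                 # 0.525
round(sensitivity_1, 4)
round(ppv_1, 4)
```

The per-class values show where the accuracy is lost: drugX and drugY have low sensitivity, while drugA, drugB, and drugC have low positive predictive value because they absorb those misclassified cases.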
Let's balance the data using upSample().

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(444)
drugs_train_2 <- upSample(x = drugs_train |> select(-Drug),
                          y = drugs_train$Drug,
                          yname = "Drug")

Check whether the proportions are balanced.

prop.table(table(drugs_train_2$Drug))
##
## drugA drugB drugC drugX drugY
## 0.2 0.2 0.2 0.2 0.2
Now the target is balanced. Let's move on.
Modeling

drugs_tree_2 <- ctree(Drug ~ ., drugs_train_2)

The structure of the model 2 tree:

drugs_tree_2
##
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
##
## Fitted party:
## [1] root
## | [2] BP in HIGH
## | | [3] Na_to_K <= 13.972
## | | | [4] Age <= 50: drugA (n = 73, err = 0.0%)
## | | | [5] Age > 50: drugB (n = 73, err = 0.0%)
## | | [6] Na_to_K > 13.972: drugY (n = 28, err = 0.0%)
## | [7] BP in LOW, NORMAL
## | | [8] Na_to_K <= 14.16
## | | | [9] Cholesterol in HIGH
## | | | | [10] BP in LOW: drugC (n = 73, err = 0.0%)
## | | | | [11] BP in NORMAL: drugX (n = 19, err = 0.0%)
## | | | [12] Cholesterol in NORMAL: drugX (n = 54, err = 0.0%)
## | | [13] Na_to_K > 14.16: drugY (n = 45, err = 0.0%)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
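As a reading aid, the splits of the second tree can also be written out as plain rules. This is a hand-made transcription of the printout above, not output from partykit, and `tree_2_rule` is a hypothetical helper.

```r
# Hand-written copy of the splits printed above (model 2).
# `tree_2_rule` is a hypothetical helper, not part of partykit.
tree_2_rule <- function(age, bp, cholesterol, na_to_k) {
  if (bp == "HIGH") {
    # Node 3: Na_to_K is checked first, then Age
    if (na_to_k > 13.972) return("drugY")
    if (age <= 50) "drugA" else "drugB"
  } else {
    # Node 8: Na_to_K, then Cholesterol, then LOW vs NORMAL blood pressure
    if (na_to_k > 14.16) return("drugY")
    if (cholesterol == "NORMAL") return("drugX")
    if (bp == "LOW") "drugC" else "drugX"
  }
}

# Row 1 of the data: Age 23, BP HIGH, Cholesterol HIGH, Na_to_K 25.355 (actual: drugY)
tree_2_rule(23, "HIGH", "HIGH", 25.355)    # "drugY"

# Row 4 of the data: Age 28, BP NORMAL, Cholesterol HIGH, Na_to_K 7.798 (actual: drugX)
tree_2_rule(28, "NORMAL", "HIGH", 7.798)   # "drugX"
```

Compared with the first tree, this one adds a Na_to_K split under HIGH blood pressure and a LOW versus NORMAL split under HIGH cholesterol.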
See the structure using a visualization.

plot(drugs_tree_2, type = "simple")

pred_drugs_2 <- predict(drugs_tree_2, drugs_test, type = "response")
confusionMatrix(pred_drugs_2, drugs_test$Drug)
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX drugY
## drugA 3 0 0 0 0
## drugB 0 2 0 0 0
## drugC 0 0 3 0 0
## drugX 0 0 0 11 0
## drugY 0 1 0 2 18
##
## Overall Statistics
##
## Accuracy : 0.925
## 95% CI : (0.7961, 0.9843)
## No Information Rate : 0.45
## P-Value [Acc > NIR] : 0.0000000002588
##
## Kappa : 0.8863
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.000 0.6667 1.000 0.8462
## Specificity 1.000 1.0000 1.000 1.0000
## Pos Pred Value 1.000 1.0000 1.000 1.0000
## Neg Pred Value 1.000 0.9737 1.000 0.9310
## Prevalence 0.075 0.0750 0.075 0.3250
## Detection Rate 0.075 0.0500 0.075 0.2750
## Detection Prevalence 0.075 0.0500 0.075 0.2750
## Balanced Accuracy 1.000 0.8333 1.000 0.9231
## Class: drugY
## Sensitivity 1.0000
## Specificity 0.8636
## Pos Pred Value 0.8571
## Neg Pred Value 1.0000
## Prevalence 0.4500
## Detection Rate 0.4500
## Detection Prevalence 0.5250
## Balanced Accuracy 0.9318
Model 2 performs better than model 1 and clears the 90% threshold, with an accuracy of 0.925 (92.5%). So we can use this model, given the minimum required performance of 90%.
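The accuracy and Kappa reported for model 2 can be checked by hand, together with the comparison against the threshold. A sketch with the counts typed in from the confusion matrix above; the Kappa formula used here is the usual Cohen's kappa (observed agreement corrected for chance agreement), and `cm_2` and `threshold` are hypothetical names.

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "drugY")

# Confusion matrix of model 2, copied from the output above
# (rows are predictions, columns are reference labels).
cm_2 <- matrix(
  c(3, 0, 0,  0,  0,
    0, 2, 0,  0,  0,
    0, 0, 3,  0,  0,
    0, 0, 0, 11,  0,
    0, 1, 0,  2, 18),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm_2)

# Observed agreement (accuracy).
accuracy_2 <- sum(diag(cm_2)) / n

# Agreement expected by chance, from the row and column totals.
expected_2 <- sum(rowSums(cm_2) * colSums(cm_2)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
kappa_2 <- (accuracy_2 - expected_2) / (1 - expected_2)

threshold <- 0.90   # minimum accuracy assumed in this project
accuracy_2               # 0.925
round(kappa_2, 4)        # 0.8863
accuracy_2 >= threshold  # TRUE
```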
As we can see, balancing the training data with up-sampling (model 2) gives a better model than balancing it with down-sampling (model 1). Down-sampling left only 65 training rows, while up-sampling produced 365, and the second tree found more splits (6 inner nodes instead of 4). We therefore need to be careful about how we rebalance an imbalanced training set, as the choice can strongly affect the model we build.
Model 1 clearly cannot be used to predict which drugs patients will need, as its accuracy is only 52.5%. We can instead use model 2, which performs above the required minimum of 90%. So, to predict which drug each patient needs, we can apply model 2 to new data.