Hello everyone! In this R Markdown report I am trying a different approach: an informal style that guides the reader through the analysis like a story. I made this report as an assignment for Algoritma Data Science School, to learn how to build prediction models with Naive Bayes, Decision Tree, and Random Forest. While browsing Kaggle.com for suitable data, I stumbled upon a data set about drugs that were given to a number of people with certain conditions. From this data I would like to build a model that can suggest which type of drug is best for a person with a given set of conditions. Let us now begin with our data exploration.
When we obtain data, we have to understand its contents before we can process it to reach the outcome required by the business case. Here, the business case is to build a program that can distinguish which type of drug should be given to a person, based on the conditions recorded as variables in the data.
I like to start my report by loading all of the libraries at the beginning of the data exploration.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
library(e1071)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
After that, I read my data and take a look at its structure.
drug <- read.csv("data/drug/drug200.csv")
glimpse(drug)
## Rows: 200
## Columns: 6
## $ Age <int> 23, 47, 47, 28, 61, 22, 49, 41, 60, 43, 47, 34, 43, 74, 50~
## $ Sex <chr> "F", "M", "M", "F", "F", "F", "F", "M", "M", "M", "F", "F"~
## $ BP <chr> "HIGH", "LOW", "LOW", "NORMAL", "LOW", "NORMAL", "NORMAL",~
## $ Cholesterol <chr> "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "H~
## $ Na_to_K <dbl> 25.355, 13.093, 10.114, 7.798, 18.043, 8.607, 16.275, 11.0~
## $ Drug <chr> "DrugY", "drugC", "drugC", "drugX", "DrugY", "drugX", "Dru~
Our data contains 6 variables: 5 possible predictors and 1 target, “Drug”, since we want a machine learning model that can recommend a specific type of drug for a given type of person. Most of the variable names are self-explanatory, so I will not explain them one by one; the two abbreviations are BP (blood pressure) and Na_to_K (the sodium-to-potassium ratio in the blood).
If we look at the data type of each variable, we can see that the character columns are better stored as factors, as they hold a small set of repeating values.
drug <- drug %>%
mutate_if(is.character, as.factor)
I would also recommend always checking for missing values.
colSums(is.na(drug))
## Age Sex BP Cholesterol Na_to_K Drug
## 0 0 0 0 0 0
It appears that we are safe to continue, as our data does not contain missing values.
Because the Naive Bayes method is “naive”, meaning that it assumes the predictors are independent of each other, we want to check the correlation between our variables. However, since most of our variables are categorical, we can only check the 2 numeric variables in our data, Age and Na_to_K.
ggcorr(drug, label = T)
## Warning in ggcorr(drug, label = T): data in column(s) 'Sex', 'BP',
## 'Cholesterol', 'Drug' are not numeric and were ignored
The graph above shows that the two numerical variables in our dataset are not correlated with each other, which suggests that, at least for these two predictors, our data is suitable for a Naive Bayes model.
We want to split our pre-processed data into train and test data. The train data is used to create our model, while the test data is what we try to predict with the model built from the train data.
RNGkind(sample.kind = "Rounding")
set.seed(126)
index <- sample(nrow(drug), nrow(drug) * 0.7)
train_drug <- drug[index,] %>%
mutate_if(is.character, as.factor)
test_drug <- drug[-index,] %>%
mutate_if(is.character, as.factor)
When we split our data into train and test data for validation, the values of the categorical variables do not always end up in both sets. We want to make sure that every unique value of each categorical variable is present in both the train and the test data. We can check this by listing the unique categorical values of each set.
unique(train_drug$Sex)
## [1] M F
## Levels: F M
unique(test_drug$Sex)
## [1] F M
## Levels: F M
unique(train_drug$BP)
## [1] LOW NORMAL HIGH
## Levels: HIGH LOW NORMAL
unique(test_drug$BP)
## [1] HIGH NORMAL LOW
## Levels: HIGH LOW NORMAL
unique(train_drug$Cholesterol)
## [1] NORMAL HIGH
## Levels: HIGH NORMAL
unique(test_drug$Cholesterol)
## [1] HIGH NORMAL
## Levels: HIGH NORMAL
unique(train_drug$Drug)
## [1] DrugY drugX drugA drugB drugC
## Levels: drugA drugB drugC drugX DrugY
unique(test_drug$Drug)
## [1] DrugY drugX drugC drugB drugA
## Levels: drugA drugB drugC drugX DrugY
It appears that every categorical value is present in both our train and test data, so we do not have to resample them. The next thing we want to check is the class proportion of the target in our train data.
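The same check can be done for all factor columns at once instead of calling unique() pair by pair. Below is a minimal sketch in base R; the helper levels_in_both() and the toy data are made up for illustration and are not part of the report's pipeline.

```r
# Hypothetical helper: TRUE only if every level of the factor
# actually occurs in both the train and the test column.
levels_in_both <- function(train_col, test_col) {
  all(levels(train_col) %in% unique(train_col)) &&
    all(levels(test_col) %in% unique(test_col))
}

# Toy data standing in for the drug data (values invented for illustration).
toy <- data.frame(
  BP   = factor(c("HIGH", "LOW", "NORMAL", "HIGH", "LOW", "NORMAL")),
  Drug = factor(c("drugA", "drugC", "drugX", "drugA", "drugC", "drugX"))
)
toy_train <- toy[1:4, ]
toy_test  <- toy[5:6, ]  # deliberately lacks BP = HIGH and Drug = drugA

# Apply the check to every factor column.
factor_cols <- names(toy)[sapply(toy, is.factor)]
coverage <- sapply(factor_cols, function(col) {
  levels_in_both(toy_train[[col]], toy_test[[col]])
})
print(coverage)
```

In this toy split the check fails for both columns, because the test rows miss one level each; on a split like ours it would return TRUE for every column.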
prop.table(table(train_drug$Drug))
##
## drugA drugB drugC drugX DrugY
## 0.13571429 0.07142857 0.10000000 0.26428571 0.42857143
It is important to have a balanced target, as an imbalanced one can leave our model unable to properly predict the classes that have only a small number of samples. Upsampling can be used to balance data with imbalanced proportions such as this.
train_drug <- upSample(x = train_drug %>% select(-Drug),
y = train_drug$Drug,
yname = "Drug")
prop.table(table(train_drug$Drug))
##
## drugA drugB drugC drugX DrugY
## 0.2 0.2 0.2 0.2 0.2
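To make the idea of upsampling concrete, here is a minimal base R sketch that resamples each class with replacement until it is as large as the biggest class. It is an illustration of the technique only, not caret's actual upSample() code; the helper upsample_by_class() and the toy values are assumptions.

```r
set.seed(126)  # same seed as the report; any value works

# Toy imbalanced data (values invented for illustration).
toy <- data.frame(
  Na_to_K = c(25.4, 13.1, 10.1, 7.8, 18.0, 8.6, 16.3),
  Drug    = factor(c("DrugY", "drugC", "drugC", "drugX",
                     "DrugY", "DrugY", "DrugY"))
)

# Hypothetical helper: keep the original rows of each class and add
# randomly repeated rows until the class matches the largest class.
upsample_by_class <- function(data, target) {
  target_n <- max(table(data[[target]]))
  pieces <- lapply(split(data, data[[target]]), function(d) {
    extra <- sample(nrow(d), target_n - nrow(d), replace = TRUE)
    rbind(d, d[extra, , drop = FALSE])
  })
  do.call(rbind, pieces)
}

balanced <- upsample_by_class(toy, "Drug")
table(balanced$Drug)
```

Every class ends up with as many rows as the largest one, at the price of repeated rows in the smaller classes.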
Now that we have balanced train data, we can finally make our machine learning model.
Naive Bayes is a classification algorithm based on Bayes' theorem of probability. It assumes that the predictor variables are all independent of each other given the class, which is why the method is called “naive”. We have to keep in mind that strongly correlated predictors may result in inaccurate modeling. A Naive Bayes model is also prone to bias due to data scarcity: when a value never occurs together with a class in the train data, its estimated probability for that class is zero.
In our case, since our data includes numerical variables with many distinct values, we will certainly find combinations that never occur. To make sure of it, we can tabulate the target against each of the predictor variables.
table(train_drug$Drug, train_drug$Age)
##
## 15 16 17 18 19 20 22 23 24 25 26 28 29 30 31 32 33 34 35 36 37 38 39 40
## drugA 0 0 0 0 7 4 0 5 2 0 4 0 0 0 6 6 0 0 3 1 1 1 4 0
## drugB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## drugC 0 4 0 0 0 0 3 0 0 0 3 2 0 0 0 4 0 0 0 0 0 0 0 0
## drugX 0 0 1 2 0 0 5 2 1 0 0 3 0 2 0 4 0 1 2 1 1 0 1 3
## DrugY 2 2 0 1 1 1 1 0 0 1 1 3 1 0 0 0 1 1 0 2 2 2 3 1
##
## 41 42 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 64 65 66
## drugA 0 5 4 2 0 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 3 0 7 0 0 0 5 0 7 15 0 0 0 0 0
## drugC 6 0 0 0 0 22 0 2 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0
## drugX 0 0 2 1 2 1 0 0 1 1 2 0 0 2 2 1 0 1 0 4 0 0 0 2
## DrugY 2 2 2 1 0 2 0 1 2 2 0 0 1 0 0 2 2 0 0 2 2 2 1 1
##
## 67 68 69 70 72 73 74
## drugA 0 0 0 0 0 0 0
## drugB 0 10 0 8 5 0 0
## drugC 0 4 0 0 4 0 0
## drugX 4 0 1 0 0 0 4
## DrugY 1 1 1 0 1 2 1
table(train_drug$Drug, train_drug$Na_to_K)
##
## 6.683 6.769 7.261 7.285 7.34 7.477 7.798 7.845 8.011 8.107 8.151 8.607
## drugA 0 0 0 0 0 0 0 0 5 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0 0 0
## drugC 0 4 0 0 0 0 0 0 0 0 3 0
## drugX 1 0 2 2 2 1 1 1 0 2 0 2
## DrugY 0 0 0 0 0 0 0 0 0 0 0 0
##
## 8.621 8.7 8.75 8.966 8.968 9.17 9.443 9.445 9.475 9.514 9.664 9.677
## drugA 0 2 0 0 0 0 0 6 2 0 4 0
## drugB 3 0 0 0 0 0 0 0 0 0 0 5
## drugC 0 0 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 2 2 1 1 2 0 0 2 0 0
## DrugY 0 0 0 0 0 0 0 0 0 0 0 0
##
## 9.709 9.712 9.894 9.945 10.017 10.065 10.067 10.103 10.114 10.189
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 5 0 0 0 0 0 10
## drugC 0 4 0 0 0 0 7 0 5 0
## drugX 1 0 2 0 1 1 0 1 0 0
## DrugY 0 0 0 0 0 0 0 0 0 0
##
## 10.291 10.403 10.443 10.444 10.446 10.537 10.605 10.832 10.84 10.898
## drugA 0 2 0 0 1 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 4 0 0 6 0 2 0 0 0 0
## drugX 0 0 2 0 0 0 1 1 3 2
## DrugY 0 0 0 0 0 0 0 0 0 0
##
## 11.037 11.198 11.227 11.262 11.326 11.343 11.349 11.424 11.767 11.871
## drugA 0 1 2 4 1 0 0 0 0 4
## drugB 0 0 0 0 0 3 0 0 0 0
## drugC 6 0 0 0 0 0 0 0 6 0
## drugX 0 0 0 0 0 0 2 1 0 0
## DrugY 0 0 0 0 0 0 0 0 0 0
##
## 11.939 11.953 12.006 12.26 12.295 12.307 12.495 12.766 12.854 12.859
## drugA 0 0 0 0 0 4 0 5 2 0
## drugB 0 0 0 0 0 0 7 0 0 0
## drugC 0 0 4 0 0 0 0 0 0 0
## drugX 4 3 0 2 1 0 0 0 0 2
## DrugY 0 0 0 0 0 0 0 0 0 0
##
## 12.879 12.894 12.923 13.091 13.093 13.127 13.303 13.313 13.597 13.884
## drugA 0 3 0 1 0 0 0 7 0 0
## drugB 0 0 0 0 0 0 5 0 0 0
## drugC 0 0 0 0 4 2 0 0 0 0
## drugX 2 0 1 0 0 0 0 0 1 1
## DrugY 0 0 0 0 0 0 0 0 0 0
##
## 13.934 13.935 13.967 13.972 14.16 14.216 15.156 15.376 15.436 15.478
## drugA 0 0 0 4 0 0 0 0 0 0
## drugB 7 7 8 0 0 0 0 0 0 0
## drugC 0 0 0 0 3 0 0 0 0 0
## drugX 0 0 0 0 0 1 0 0 0 0
## DrugY 0 0 0 0 0 0 1 1 1 1
##
## 15.49 15.516 15.79 15.891 15.969 16.275 16.31 16.347 16.594 16.724
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 0 0 0 0 0 0 0 0
## DrugY 1 1 1 1 1 1 1 1 1 1
##
## 16.725 16.753 17.206 17.211 17.225 17.951 18.043 18.295 18.348 18.809
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 0 0 0 0 0 0 0 0
## DrugY 1 1 1 1 1 1 1 2 1 1
##
## 18.991 19.007 19.011 19.128 19.161 19.221 19.368 19.675 20.013 20.932
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 0 0 0 0 0 0 0 0
## DrugY 1 1 1 1 1 1 1 1 1 1
##
## 21.036 22.456 22.697 22.905 23.003 23.091 24.658 25.475 25.741 25.893
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 0 0 0 0 0 0 0 0
## DrugY 1 1 1 1 1 1 1 1 1 1
##
## 25.969 27.05 27.064 27.183 27.826 28.294 29.45 29.875 30.568 31.876
## drugA 0 0 0 0 0 0 0 0 0 0
## drugB 0 0 0 0 0 0 0 0 0 0
## drugC 0 0 0 0 0 0 0 0 0 0
## drugX 0 0 0 0 0 0 0 0 0 0
## DrugY 1 1 1 1 1 1 1 1 1 1
##
## 33.486 33.542 35.639 37.188 38.247
## drugA 0 0 0 0 0
## drugB 0 0 0 0 0
## drugC 0 0 0 0 0
## drugX 0 0 0 0 0
## DrugY 1 1 1 1 1
table(train_drug$Drug, train_drug$Sex)
##
## F M
## drugA 23 37
## drugB 27 33
## drugC 28 32
## drugX 29 31
## DrugY 30 30
table(train_drug$Drug, train_drug$BP)
##
## HIGH LOW NORMAL
## drugA 60 0 0
## drugB 60 0 0
## drugC 0 60 0
## drugX 0 16 44
## DrugY 24 19 17
table(train_drug$Drug, train_drug$Cholesterol)
##
## HIGH NORMAL
## drugA 33 27
## drugB 27 33
## drugC 60 0
## drugX 25 35
## DrugY 29 31
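The result shows that there is in fact data scarcity in our data: several combinations have a count of 0. To tackle this problem, we can apply Laplace smoothing, which adds a constant (here 1) to every cell count so that no conditional probability is exactly 0. In naiveBayes() from e1071 this applies to the categorical predictors; the numeric predictors, Age and Na_to_K, are summarized per class by a mean and a standard deviation, which are the [,1] and [,2] columns in the model output.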
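The arithmetic behind Laplace smoothing is short enough to check by hand. The sketch below uses the BP counts for drugA from the table above; the helper smooth_probs() is hypothetical and only illustrates the formula (count + laplace) / (total + laplace × number of categories).

```r
# Raw counts of BP (HIGH, LOW, NORMAL) for drugA, copied from the table above.
bp_counts <- c(HIGH = 60, LOW = 0, NORMAL = 0)

# Hypothetical helper: add `laplace` to every cell before normalising.
smooth_probs <- function(counts, laplace = 1) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

unsmoothed <- smooth_probs(bp_counts, laplace = 0)  # contains exact zeros
smoothed   <- smooth_probs(bp_counts, laplace = 1)  # 61/63, 1/63, 1/63
print(round(smoothed, 4))
```

These smoothed values agree with the BP row for drugA in the conditional probabilities printed by the fitted model.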
naive_model <- naiveBayes(train_drug %>% select(-Drug), train_drug$Drug, laplace = 1)
naive_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = train_drug %>% select(-Drug), y = train_drug$Drug,
## laplace = 1)
##
## A-priori probabilities:
## train_drug$Drug
## drugA drugB drugC drugX DrugY
## 0.2 0.2 0.2 0.2 0.2
##
## Conditional probabilities:
## Age
## train_drug$Drug [,1] [,2]
## drugA 32.36667 9.513535
## drugB 62.03333 6.574132
## drugC 44.73333 15.242522
## drugX 44.38333 17.261785
## DrugY 44.63333 16.971528
##
## Sex
## train_drug$Drug F M
## drugA 0.3870968 0.6129032
## drugB 0.4516129 0.5483871
## drugC 0.4677419 0.5322581
## drugX 0.4838710 0.5161290
## DrugY 0.5000000 0.5000000
##
## BP
## train_drug$Drug HIGH LOW NORMAL
## drugA 0.96825397 0.01587302 0.01587302
## drugB 0.96825397 0.01587302 0.01587302
## drugC 0.01587302 0.96825397 0.01587302
## drugX 0.01587302 0.26984127 0.71428571
## DrugY 0.39682540 0.31746032 0.28571429
##
## Cholesterol
## train_drug$Drug HIGH NORMAL
## drugA 0.54838710 0.45161290
## drugB 0.45161290 0.54838710
## drugC 0.98387097 0.01612903
## drugX 0.41935484 0.58064516
## DrugY 0.48387097 0.51612903
##
## Na_to_K
## train_drug$Drug [,1] [,2]
## drugA 11.33518 1.825444
## drugB 12.01152 1.886540
## drugC 10.70453 1.726071
## drugX 10.25760 1.957759
## DrugY 21.86553 6.183802
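To see how these tables turn into a prediction, the sketch below scores one patient by hand, using the first row shown by glimpse() (Age 23, Sex F, BP HIGH, Cholesterol HIGH, Na_to_K 25.355). It assumes the usual Naive Bayes recipe: the prior, times a Gaussian density for each numeric predictor, times the table probability for each categorical one. All numbers are copied from the printed model; the variable names are made up for illustration.

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "DrugY")
prior   <- rep(0.2, 5)

# Per-class mean and standard deviation, copied from the printed model.
age_mean <- c(32.36667, 62.03333, 44.73333, 44.38333, 44.63333)
age_sd   <- c(9.513535, 6.574132, 15.242522, 17.261785, 16.971528)
nak_mean <- c(11.33518, 12.01152, 10.70453, 10.25760, 21.86553)
nak_sd   <- c(1.825444, 1.886540, 1.726071, 1.957759, 6.183802)

# Smoothed conditional probabilities of the observed categories.
p_sex_f   <- c(0.3870968, 0.4516129, 0.4677419, 0.4838710, 0.5000000)
p_bp_high <- c(0.96825397, 0.96825397, 0.01587302, 0.01587302, 0.39682540)
p_chol_hi <- c(0.54838710, 0.45161290, 0.98387097, 0.41935484, 0.48387097)

# Work on the log scale to avoid multiplying very small numbers.
log_score <- log(prior) +
  dnorm(23, age_mean, age_sd, log = TRUE) +
  dnorm(25.355, nak_mean, nak_sd, log = TRUE) +
  log(p_sex_f) + log(p_bp_high) + log(p_chol_hi)

# Normalise the scores into posterior probabilities.
posterior <- exp(log_score - max(log_score))
posterior <- posterior / sum(posterior)
names(posterior) <- classes
print(round(posterior, 4))
names(which.max(posterior))
```

The high Na_to_K value lies far from the means of the other four classes, so DrugY wins, which is also the drug recorded for this patient in the data.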
Once our model is ready, we can predict our test data and evaluate the performance of the model based on those predictions.
naive_pred <- predict(naive_model, test_drug, type = "class")
confusionMatrix(naive_pred,
test_drug$Drug,
positive = "drugA")
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX DrugY
## drugA 4 0 0 0 0
## drugB 0 6 0 0 0
## drugC 0 0 2 0 2
## drugX 0 0 0 16 0
## DrugY 0 0 0 1 29
##
## Overall Statistics
##
## Accuracy : 0.95
## 95% CI : (0.8608, 0.9896)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 1.837e-13
##
## Kappa : 0.923
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.00000 1.0 1.00000 0.9412
## Specificity 1.00000 1.0 0.96552 1.0000
## Pos Pred Value 1.00000 1.0 0.50000 1.0000
## Neg Pred Value 1.00000 1.0 1.00000 0.9773
## Prevalence 0.06667 0.1 0.03333 0.2833
## Detection Rate 0.06667 0.1 0.03333 0.2667
## Detection Prevalence 0.06667 0.1 0.06667 0.2667
## Balanced Accuracy 1.00000 1.0 0.98276 0.9706
## Class: DrugY
## Sensitivity 0.9355
## Specificity 0.9655
## Pos Pred Value 0.9667
## Neg Pred Value 0.9333
## Prevalence 0.5167
## Detection Rate 0.4833
## Detection Prevalence 0.5000
## Balanced Accuracy 0.9505
The model built with the Naive Bayes method looks promising: it reaches 95% accuracy, with around 97% sensitivity and around 89% precision when averaged over the five classes. However, keep in mind that our data still has a small number of samples (the test data has only 60 rows, and drugC appears in it just twice). As more data arrives over time, we might need to adjust our model to better suit the future data.
The decision tree, one of the most popular and widely used machine learning models, is a tree-based model that is quite simple yet quite powerful in terms of predictive performance. Its major benefits are that the result is interpretable, as it can be visualized in the form of a tree, and that its computational load is low.
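The summary figures for sensitivity and precision are averages of the per-class values in the confusion matrix. A minimal base R sketch of that calculation, with the matrix typed in from the Naive Bayes output above (rows are predictions, columns are the reference):

```r
classes <- c("drugA", "drugB", "drugC", "drugX", "DrugY")
cm <- matrix(
  c(4, 0, 0,  0,  0,
    0, 6, 0,  0,  0,
    0, 0, 2,  0,  2,
    0, 0, 0, 16,  0,
    0, 0, 0,  1, 29),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
precision   <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

round(c(accuracy          = accuracy,
        macro_sensitivity = mean(sensitivity),
        macro_precision   = mean(precision)), 3)
```

The low average precision comes almost entirely from drugC, where two of the four predictions are wrong.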
tree_model <- ctree(Drug ~.,
train_drug)
tree_model
##
## Model formula:
## Drug ~ Age + Sex + BP + Cholesterol + Na_to_K
##
## Fitted party:
## [1] root
## | [2] BP in HIGH
## | | [3] Age <= 49
## | | | [4] Na_to_K <= 13.972: drugA (n = 60, err = 0.0%)
## | | | [5] Na_to_K > 13.972: DrugY (n = 16, err = 0.0%)
## | | [6] Age > 49
## | | | [7] Na_to_K <= 13.967: drugB (n = 60, err = 0.0%)
## | | | [8] Na_to_K > 13.967: DrugY (n = 8, err = 0.0%)
## | [9] BP in LOW, NORMAL
## | | [10] Na_to_K <= 14.216
## | | | [11] BP in LOW
## | | | | [12] Cholesterol in HIGH: drugC (n = 60, err = 0.0%)
## | | | | [13] Cholesterol in NORMAL: drugX (n = 16, err = 0.0%)
## | | | [14] BP in NORMAL: drugX (n = 44, err = 0.0%)
## | | [15] Na_to_K > 14.216: DrugY (n = 36, err = 0.0%)
##
## Number of inner nodes: 7
## Number of terminal nodes: 8
For a better and easier understanding, we can create a visualization of our decision tree model.
plot(tree_model, type = "simple")
In the visualization above, we can see that the tree makes a total of 7 splits, which lead to 8 terminal nodes, before deciding which drug is the best based on the condition of a person. Moreover, it is such a simple model that even a non-data analyst can understand it. For example, by following the pattern of “BP = LOW, NORMAL” and “Na_to_K > 14.216”, our model recommends DrugY.
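Because every path through the tree is a chain of simple conditions, the printed model can be rewritten as nested if/else rules. The function below is a hand-written sketch that mirrors the splits printed by tree_model; the name recommend_drug() is hypothetical, and in practice predict() on the fitted tree does this work.

```r
# Hypothetical helper mirroring the splits of the printed tree.
recommend_drug <- function(age, bp, cholesterol, na_to_k) {
  if (bp == "HIGH") {
    if (age <= 49) {
      if (na_to_k <= 13.972) "drugA" else "DrugY"   # nodes 4 and 5
    } else {
      if (na_to_k <= 13.967) "drugB" else "DrugY"   # nodes 7 and 8
    }
  } else if (na_to_k > 14.216) {
    "DrugY"                                         # node 15
  } else if (bp == "NORMAL") {
    "drugX"                                         # node 14
  } else if (cholesterol == "HIGH") {
    "drugC"                                         # node 12
  } else {
    "drugX"                                         # node 13
  }
}

# Second patient shown by glimpse(): Age 47, BP LOW, Cholesterol HIGH, Na_to_K 13.093.
recommend_drug(age = 47, bp = "LOW", cholesterol = "HIGH", na_to_k = 13.093)
```

For this patient the rules lead to drugC, the same drug recorded in the data. Sex does not appear among the arguments because the tree never splits on it.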
tree_pred <- predict(tree_model,
test_drug,
type = "response")
confusionMatrix(tree_pred,
test_drug$Drug,
positive = "drugA")
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX DrugY
## drugA 3 0 0 0 0
## drugB 1 5 0 0 0
## drugC 0 0 2 0 0
## drugX 0 0 0 16 0
## DrugY 0 1 0 1 31
##
## Overall Statistics
##
## Accuracy : 0.95
## 95% CI : (0.8608, 0.9896)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 1.837e-13
##
## Kappa : 0.9201
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 0.75000 0.83333 1.00000 0.9412
## Specificity 1.00000 0.98148 1.00000 1.0000
## Pos Pred Value 1.00000 0.83333 1.00000 1.0000
## Neg Pred Value 0.98246 0.98148 1.00000 0.9773
## Prevalence 0.06667 0.10000 0.03333 0.2833
## Detection Rate 0.05000 0.08333 0.03333 0.2667
## Detection Prevalence 0.05000 0.10000 0.03333 0.2667
## Balanced Accuracy 0.87500 0.90741 1.00000 0.9706
## Class: DrugY
## Sensitivity 1.0000
## Specificity 0.9310
## Pos Pred Value 0.9394
## Neg Pred Value 1.0000
## Prevalence 0.5167
## Detection Rate 0.5167
## Detection Prevalence 0.5500
## Balanced Accuracy 0.9655
With the default decision tree we managed to achieve 95% accuracy, around 90% sensitivity, and around 95% precision (the latter two averaged over the five classes). Although the default settings already produce a well-performing model, if the performance were unsatisfactory for our needs, we could tune the model and try to find the best performance possible.
Before tuning the model, I compare each variable with the target to get a rough overview of its influence on the target. Each table below only marks with a 1 whether a combination occurs at least once in the data.
drug %>%
select(BP, Drug) %>%
arrange(BP) %>%
unique() %>%
table()
## Drug
## BP drugA drugB drugC drugX DrugY
## HIGH 1 1 0 0 1
## LOW 0 0 1 1 1
## NORMAL 0 0 0 1 1
drug %>%
select(Na_to_K, Drug) %>%
mutate(Na_to_K = ifelse(Na_to_K >= 16.08449, "HIGH", "LOW")) %>%
arrange(Na_to_K) %>%
unique() %>%
table()
## Drug
## Na_to_K drugA drugB drugC drugX DrugY
## HIGH 0 0 0 0 1
## LOW 1 1 1 1 1
drug %>%
select(Age, Drug) %>%
mutate(Age = ifelse(Age >= 40, "OLD", "YOUNG")) %>%
arrange(Age) %>%
unique() %>%
table()
## Drug
## Age drugA drugB drugC drugX DrugY
## OLD 1 1 1 1 1
## YOUNG 1 0 1 1 1
drug %>%
select(Cholesterol, Drug) %>%
arrange(Cholesterol) %>%
unique() %>%
table()
## Drug
## Cholesterol drugA drugB drugC drugX DrugY
## HIGH 1 1 1 1 1
## NORMAL 1 1 0 1 1
drug %>%
select(Sex, Drug) %>%
arrange(Sex) %>%
unique() %>%
table()
## Drug
## Sex drugA drugB drugC drugX DrugY
## F 1 1 1 1 1
## M 1 1 1 1 1
From the matrices above, every drug is given to both sexes, so Sex does not appear to help in telling the drugs apart. This is consistent with our visualized model, which has no split on Sex. Blood pressure and the sodium-to-potassium ratio in the blood, however, seem to have a big impact on which drug a person receives. With this knowledge, we can try to tune our decision tree model to be more accurate.
tree_model_tuned <- ctree(Drug ~.,
train_drug,
control = ctree_control(mincriterion=0.95,
minsplit=0,
minbucket=10))
plot(tree_model_tuned, type = "simple")
When we tune our model, the structure of the decision tree changes depending on the values of mincriterion, minsplit, and minbucket that we pass to ctree_control(). Here we can see a difference in the node with the number 8 on top of its box: its prediction changed from DrugY to drugB after tuning.
tree_pred_tuned <- predict(tree_model_tuned,
test_drug,
type = "response")
confusionMatrix(tree_pred_tuned,
test_drug$Drug,
positive = "drugA")
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX DrugY
## drugA 3 0 0 0 0
## drugB 1 6 0 0 3
## drugC 0 0 2 0 0
## drugX 0 0 0 16 0
## DrugY 0 0 0 1 28
##
## Overall Statistics
##
## Accuracy : 0.9167
## 95% CI : (0.8161, 0.9724)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 2.677e-11
##
## Kappa : 0.8725
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 0.75000 1.0000 1.00000 0.9412
## Specificity 1.00000 0.9259 1.00000 1.0000
## Pos Pred Value 1.00000 0.6000 1.00000 1.0000
## Neg Pred Value 0.98246 1.0000 1.00000 0.9773
## Prevalence 0.06667 0.1000 0.03333 0.2833
## Detection Rate 0.05000 0.1000 0.03333 0.2667
## Detection Prevalence 0.05000 0.1667 0.03333 0.2667
## Balanced Accuracy 0.87500 0.9630 1.00000 0.9706
## Class: DrugY
## Sensitivity 0.9032
## Specificity 0.9655
## Pos Pred Value 0.9655
## Neg Pred Value 0.9032
## Prevalence 0.5167
## Detection Rate 0.4667
## Detection Prevalence 0.4833
## Balanced Accuracy 0.9344
Although it is possible to tune our decision tree model, the default model still gives the best result: its accuracy of 95% is 3.33 percentage points higher than the 91.67% of the slightly tuned model. In this case, the tuning made the accuracy worse.
Random forest is a machine learning model that consists of many decision trees. The “forest” is trained through bagging (bootstrap aggregating): each tree is fitted on a random sample drawn from the train data with replacement, so duplicated rows are allowed, and only a random subset of the predictors is considered at each split. The algorithm then combines the predictions of the individual trees, by majority vote for classification or by taking the mean for regression. Generating more trees generally makes the outcome more stable, although the gain levels off after a point.
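For a classification forest such as ours, the trees' predictions are typically combined by majority vote. The sketch below shows only that last step, on invented votes from five trees for three patients; majority_vote() is a hypothetical helper, not a randomForest function.

```r
# Invented votes: each row is one tree, each column one patient.
votes <- matrix(
  c("DrugY", "drugX", "drugA",
    "DrugY", "drugX", "drugB",
    "DrugY", "drugC", "drugA",
    "drugX", "drugX", "drugA",
    "DrugY", "drugX", "drugB"),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("tree", 1:5), paste0("patient", 1:3))
)

# Hypothetical helper: the most frequent label among the trees wins.
majority_vote <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]
}

forest_vote <- apply(votes, 2, majority_vote)
print(forest_vote)
```

A single tree that disagrees, such as tree4 for the first patient, is simply outvoted, which is why a forest tends to be more stable than any one of its trees.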
RNGkind(sample.kind = "Rounding")
set.seed(126)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# Pass the resampling settings as trControl. The saved output shown in this report
# came from a run that spelled it tfControl, which train() does not recognize, so
# that run fell back to caret's default bootstrap resampling.
forest <- train(Drug ~ ., train_drug, method = "rf", trControl = ctrl)
saveRDS(forest, "forest_model_LBB.RDS")
Since generating a random forest model is computationally heavy and therefore time consuming, especially with big samples of data, we want to generate it only once and store the result with saveRDS(), so that we can read it back later without having to train the model from the beginning.
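The train-once, load-later pattern can be wrapped in a small helper so that the expensive call only runs when no saved file exists yet. This is a minimal base R sketch: fit_or_load() and the file name are hypothetical, and a quick lm() fit on the built-in mtcars data stands in for the slow random forest training.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run `fit_fun`, save its result, and return it.
fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fun()
  saveRDS(model, path)
  model
}

cache_path <- file.path(tempdir(), "example_model.RDS")
unlink(cache_path)  # start from a clean state for this demonstration

# First call trains and saves; the stand-in model is a simple linear fit.
first <- fit_or_load(cache_path, function() lm(mpg ~ wt, data = mtcars))

# Second call reads the file, so the fitting function is never executed.
second <- fit_or_load(cache_path, function() stop("not reached: cache exists"))
```

One caveat of any cache like this: if the training code or the data changes, the saved file has to be deleted by hand, otherwise the old model is loaded again.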
forest_model <- readRDS("forest_model_LBB.RDS")
forest_model
## Random Forest
##
## 300 samples
## 5 predictor
## 5 classes: 'drugA', 'drugB', 'drugC', 'drugX', 'DrugY'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 300, 300, 300, 300, 300, 300, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9988571 0.9985603
## 4 0.9988571 0.9985603
## 6 0.9988571 0.9985603
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The result of our modeling is rather surprising: every value of mtry (the number of predictors tried at each split) gives the same, very high accuracy. This makes me suspect overfitting. One possible reason is that the train data was upsampled before resampling, so copies of the same row can end up on both sides of a resample and make the estimate optimistic. Because the accuracy is tied, the program selects mtry = 2, the first of the tied values, as trying more predictors per split brings no further gain in model performance. Note also that the printout reports “Bootstrapped (25 reps)”, which is caret's default resampling rather than the repeated 5-fold cross-validation defined in ctrl; the output comes from a run in which the control object was passed as tfControl, a name that train() does not recognize.
forest_model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), tfControl = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## drugA drugB drugC drugX DrugY class.error
## drugA 60 0 0 0 0 0
## drugB 0 60 0 0 0 0
## drugC 0 0 60 0 0 0
## drugX 0 0 0 60 0 0
## DrugY 0 0 0 0 60 0
If we look at the OOB (out-of-bag) estimate, the model has an error rate of 0%, that is, 100% accuracy, which raised my suspicion even more.
varImp(forest_model)
## rf variable importance
##
## Overall
## Na_to_K 100.00
## BPLOW 63.28
## Age 63.26
## BPNORMAL 52.14
## CholesterolNORMAL 18.43
## SexM 0.00
Unfortunately, a random forest is hard to interpret, because its prediction combines 500 trees that were each grown on a different random sample. However, we can at least see which variables have the most influence in our model. Na_to_K has the highest influence of all our variables. The numbers are the overall importance values, derived from the Gini importance and scaled so that the most important variable scores 100. There is quite a big gap between the most and the second most influential variables, around 36.72 points. This may explain why, when we go back to our random forest result, trying only 2 predictors at each split is already enough to obtain an accuracy of around 99%.
forest_pred <- predict(forest_model,
test_drug,
type = "raw")
confusionMatrix(forest_pred,
test_drug$Drug,
positive = "drugA")
## Confusion Matrix and Statistics
##
## Reference
## Prediction drugA drugB drugC drugX DrugY
## drugA 4 0 0 0 0
## drugB 0 6 0 0 0
## drugC 0 0 2 0 0
## drugX 0 0 0 17 0
## DrugY 0 0 0 0 31
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9404, 1)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: drugA Class: drugB Class: drugC Class: drugX
## Sensitivity 1.00000 1.0 1.00000 1.0000
## Specificity 1.00000 1.0 1.00000 1.0000
## Pos Pred Value 1.00000 1.0 1.00000 1.0000
## Neg Pred Value 1.00000 1.0 1.00000 1.0000
## Prevalence 0.06667 0.1 0.03333 0.2833
## Detection Rate 0.06667 0.1 0.03333 0.2833
## Detection Prevalence 0.06667 0.1 0.03333 0.2833
## Balanced Accuracy 1.00000 1.0 1.00000 1.0000
## Class: DrugY
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.5167
## Detection Rate 0.5167
## Detection Prevalence 0.5167
## Balanced Accuracy 1.0000
Although I was suspicious of the Random Forest model at the beginning, it managed to predict our test data with an accuracy of 100%. This indicates that, for this data, the model can perfectly distinguish which type of drug to use for a person with a given condition. In a real-world situation, however, such a model is almost impossible to achieve, because many unknown factors influence which type of drug to use. In this case, the result might come from lab-like data in which no factors other than the recorded variables determine the drug, or in which other factors were simply ignored.
All of our models performed well at predicting which type of drug is best for a person with a certain condition, based on the variables contained in the data. The Naive Bayes and Decision Tree models reached the same accuracy of 95%. Although the Random Forest was the best model of all, the performance of all three is at a satisfactory level, so I recommend the Decision Tree as the model for our prediction, because it is interpretable, easy to understand, and can be adjusted to our preferences or needs.