In this post, I will try to predict whether diabetic patients in a hospital are sick or not, based on several supporting variables. The algorithms I will use are Naive Bayes and Decision Tree, both of which belong to supervised learning.
Load the required libraries.
library(caret)
library(e1071)
library(randomForest)
library(ggplot2)
library(class)
library(tidyr)
library(dplyr)
library(partykit)
Load the data to build our classification models.
diabetes <- read.csv(file="data_input/diabetes.csv", stringsAsFactors = F)
After successfully importing the data, we inspect it to get a sense of its contents. We could use the View() function to look at the entire dataset, but that would take time, so we use head() to see only the first few rows.
head(diabetes)
Variable descriptions:
pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
mass: Body mass index (weight in kg/(height in m)^2)
pedigree: Diabetes pedigree function
age: Age (years)
diabetes: Test for diabetes
Check the structure of the data. Are there any columns whose data types do not match?
str(diabetes)
#> 'data.frame': 768 obs. of 9 variables:
#> $ pregnant: int 6 1 8 1 0 5 3 10 2 8 ...
#> $ glucose : int 148 85 183 89 137 116 78 115 197 125 ...
#> $ pressure: int 72 66 64 66 40 74 50 0 70 96 ...
#> $ triceps : int 35 29 0 23 35 0 32 0 45 0 ...
#> $ insulin : int 0 0 0 94 168 0 88 0 543 0 ...
#> $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#> $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
#> $ age : int 50 31 32 21 33 30 26 29 53 54 ...
#> $ diabetes: chr "pos" "neg" "pos" "neg" ...
Target: diabetes (pos, neg)
Columns whose data types do not match:
diabetes -> factor
diabetes <- diabetes %>%
  mutate(diabetes = as.factor(diabetes))
glimpse(diabetes)
#> Rows: 768
#> Columns: 9
#> $ pregnant <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
#> $ glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
#> $ pressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
#> $ triceps <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
#> $ insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
#> $ mass <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
#> $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
#> $ age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
#> $ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~
Check for missing (blank) values in our dataset.
colSums(is.na(diabetes))
#> pregnant glucose pressure triceps insulin mass pedigree age
#> 0 0 0 0 0 0 0 0
#> diabetes
#> 0
From this exploration, we find that there are no NA or blank values in the data.
Cross validation is the stage where we divide the data into two parts: a training set (data train) and a test set (data test). The purpose of cross validation is to find out how well the model predicts unseen data.
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(diabetes),
                size = nrow(diabetes) * 0.8)
# splitting
diab_train <- diabetes[index, ]
diab_test <- diabetes[-index, ]
Before modeling, we first need to look at the class proportions of the target variable in the target column.
prop.table(table(diab_train$diabetes))
#>
#> neg pos
#> 0.6563518 0.3436482
The proportions of the two classes are not well balanced, so we need an additional pre-processing step to balance the two classes of the target variable.
Downsampling reduces the number of observations in the majority class until it equals the minority class. Disadvantage: it discards information from the data we hold. It is usually used when the minority class still has plenty of observations.
RNGkind(sample.kind = "Rounding")
set.seed(100)
diab_train <- downSample(x = diab_train %>% select(-diabetes),
                         y = diab_train$diabetes,
                         yname = "diabetes")
prop.table(table(diab_train$diabetes))
#>
#> neg pos
#> 0.5 0.5
After downsampling, the class proportions in the training data are balanced, and we can proceed to the modeling process.
The next step is to build classification models with different algorithms, namely Naive Bayes and Decision Tree, and compare the accuracy of all the models we make.
Naive Bayes is a machine learning model that applies Bayes' Theorem for classification, relating each predictor to the target variable. It is called "naive" because each predictor is assumed to be independent of the others (not related to each other) and to carry the same weight (the same importance or influence) when making predictions. This simplifies the calculation (the formula becomes much simpler) and reduces the computational burden.
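To make the idea concrete, here is a toy sketch of Bayes' theorem applied to a single discretized predictor. This is only an illustration: naiveBayes() itself models numeric predictors with Gaussian densities rather than a hard cutoff, and the glucose threshold of 120 below is an arbitrary choice, not part of the original analysis.
# Toy illustration of Bayes' theorem: P(pos | high glucose)
# (assumes diab_train from the downsampling step above; 120 is an
#  arbitrary illustrative cutoff)
high_glucose <- diab_train$glucose > 120
prior_pos <- mean(diab_train$diabetes == "pos")                # P(pos)
lik_pos   <- mean(high_glucose[diab_train$diabetes == "pos"])  # P(high | pos)
evidence  <- mean(high_glucose)                                # P(high)
prior_pos * lik_pos / evidence                                 # P(pos | high)
With several predictors, Naive Bayes simply multiplies the per-predictor likelihoods together, which is exactly where the independence assumption does its work.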
# Model with Naive Bayes
model_naive <- naiveBayes(formula = diabetes ~ ., data = diab_train)
We will make predictions with type = "class" to return the class label (default threshold 0.5).
# Prediction
preds_naive <- predict(model_naive, newdata = diab_test, type = "class")
After making predictions, we evaluate the model with a confusion matrix and see how it performs.
# Confusion Matrix
confusionMatrix(data = preds_naive, reference = diab_test$diabetes, positive = "pos")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction neg pos
#> neg 81 15
#> pos 16 42
#>
#> Accuracy : 0.7987
#> 95% CI : (0.7266, 0.8589)
#> No Information Rate : 0.6299
#> P-Value [Acc > NIR] : 0.000004461
#>
#> Kappa : 0.5698
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.7368
#> Specificity : 0.8351
#> Pos Pred Value : 0.7241
#> Neg Pred Value : 0.8438
#> Prevalence : 0.3701
#> Detection Rate : 0.2727
#> Detection Prevalence : 0.3766
#> Balanced Accuracy : 0.7859
#>
#> 'Positive' Class : pos
#>
The confusion matrix shows that the Naive Bayes model correctly classifies 81 non-diabetic patients and incorrectly labels 16 of them as diabetic. Similarly, the model correctly predicts 42 diabetic patients and misses 15. From this output, the model's accuracy is 79.87% and its recall (sensitivity) is 73.68%.
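As a quick sanity check, the headline accuracy and recall can be recomputed by hand from the four cells of the confusion matrix above:
# Recompute accuracy and recall from the confusion matrix cells
TP <- 42; TN <- 81; FP <- 16; FN <- 15
(TP + TN) / (TP + TN + FP + FN) # accuracy = 0.7987
TP / (TP + FN)                  # recall (sensitivity) = 0.7368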
Previously we made predictions on the test data. What if we predict on the training data instead? Will the results be better?
preds_naive <- predict(model_naive, newdata = diab_train, type = "class")
After making predictions, we again evaluate the model with a confusion matrix.
confusionMatrix(data = preds_naive, reference = diab_train$diabetes, positive = "pos")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction neg pos
#> neg 165 70
#> pos 46 141
#>
#> Accuracy : 0.7251
#> 95% CI : (0.6799, 0.7672)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.4502
#>
#> Mcnemar's Test P-Value : 0.03272
#>
#> Sensitivity : 0.6682
#> Specificity : 0.7820
#> Pos Pred Value : 0.7540
#> Neg Pred Value : 0.7021
#> Prevalence : 0.5000
#> Detection Rate : 0.3341
#> Detection Prevalence : 0.4431
#> Balanced Accuracy : 0.7251
#>
#> 'Positive' Class : pos
#>
Referring to the accuracy and recall values, the model achieves 72.51% accuracy and 66.82% recall on the training data, compared with 79.87% accuracy and 73.68% recall on the test data. From this we can conclude that the model is not entirely suitable: it shows signs of underfitting, since its performance on the training data is no better than on the test data.
Decision Tree is a fairly simple tree-based model with robust, powerful performance for prediction. It produces a visualization in the form of a decision tree that can be interpreted easily. Let's build the tree model:
# Model with Decision Tree
model_tree <- ctree(formula = diabetes ~ . , data = diab_train)
plot(model_tree, type = "simple")
From the plot we can read the number of splits/leaves (width) and the number of layers/levels (depth), where:
[1] is the Root Node. [2], [3], [5], [8], and [11] are Internal Nodes, or branches; each branch has an arrow pointing to it and an arrow pointing away from it. [4], [6], [7], [9], [10], [12], and [13] are Leaf Nodes, or leaves; each leaf has an arrow pointing to it but no arrow pointing away from it.
We will make predictions with type = "response" to return the class label (default threshold 0.5).
# Prediction
preds_tree <- predict(model_tree, newdata = diab_test, type = "response")
After making predictions, we evaluate the model with a confusion matrix and see how it performs.
# Confusion Matrix
confusionMatrix(data = preds_tree, reference = diab_test$diabetes, positive = "pos")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction neg pos
#> neg 61 15
#> pos 36 42
#>
#> Accuracy : 0.6688
#> 95% CI : (0.5885, 0.7425)
#> No Information Rate : 0.6299
#> P-Value [Acc > NIR] : 0.179643
#>
#> Kappa : 0.3399
#>
#> Mcnemar's Test P-Value : 0.005101
#>
#> Sensitivity : 0.7368
#> Specificity : 0.6289
#> Pos Pred Value : 0.5385
#> Neg Pred Value : 0.8026
#> Prevalence : 0.3701
#> Detection Rate : 0.2727
#> Detection Prevalence : 0.5065
#> Balanced Accuracy : 0.6829
#>
#> 'Positive' Class : pos
#>
The confusion matrix shows that the Decision Tree model correctly classifies 61 non-diabetic patients and incorrectly labels 36 of them as diabetic. Similarly, the model correctly predicts 42 diabetic patients and misses 15. From this output, the model's accuracy is 66.88% and its recall is 73.68%.
Previously we made predictions on the test data. What if we predict on the training data instead? Will the results be better?
preds_tree <- predict(model_tree, newdata = diab_train, type = "response")
After making predictions, we again evaluate the model with a confusion matrix.
confusionMatrix(data = preds_tree, reference = diab_train$diabetes, positive = "pos")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction neg pos
#> neg 154 42
#> pos 57 169
#>
#> Accuracy : 0.7654
#> 95% CI : (0.722, 0.805)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.5308
#>
#> Mcnemar's Test P-Value : 0.1594
#>
#> Sensitivity : 0.8009
#> Specificity : 0.7299
#> Pos Pred Value : 0.7478
#> Neg Pred Value : 0.7857
#> Prevalence : 0.5000
#> Detection Rate : 0.4005
#> Detection Prevalence : 0.5355
#> Balanced Accuracy : 0.7654
#>
#> 'Positive' Class : pos
#>
Referring to the accuracy and recall values, the model achieves 76.54% accuracy and 80.09% recall on the training data, compared with 66.88% accuracy and 73.68% recall on the test data. From this we can conclude that the model is not entirely suitable because it is overfitting: the results on the training data are good, but performance on the test data is noticeably worse.
The disadvantage of the Decision Tree is that it is prone to overfitting, because it can keep splitting the data in very fine detail, even to the point where a leaf node holds only a single observation. The model then merely memorizes the training data by creating overly complex rules instead of learning the underlying patterns. As a result, it generalizes poorly to the test data, and its performance there is much worse.
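One common way to restrain this behavior is to limit how far the tree is allowed to grow. Below is a minimal sketch using partykit's ctree_control(); the specific values are illustrative starting points, not tuned choices from the original analysis.
# Restrain tree growth to reduce overfitting (illustrative, untuned values)
model_tree_small <- ctree(formula = diabetes ~ ., data = diab_train,
                          control = ctree_control(mincriterion = 0.99, # stricter test for splitting
                                                  minsplit = 40,       # min obs needed to try a split
                                                  minbucket = 20))     # min obs allowed in a leaf
preds_small <- predict(model_tree_small, newdata = diab_test, type = "response")
confusionMatrix(data = preds_small, reference = diab_test$diabetes, positive = "pos")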
To get the best performance out of each model, in particular the Naive Bayes and Decision Tree models, performance can still be improved by changing and searching for the most suitable cutoff value. In this analysis we naturally want to minimize false negatives, so we need to find a cutoff value that yields a high recall without changing the accuracy too much.
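As a rough sketch of what such a cutoff search could look like for the Naive Bayes model, we can ask predict() for class probabilities and apply our own threshold; the 0.4 cutoff below is an arbitrary example, not a tuned value.
# Get class probabilities instead of hard labels
prob_naive <- predict(model_naive, newdata = diab_test, type = "raw")
# Apply a custom cutoff to the "pos" probability (0.4 is illustrative)
preds_cutoff <- factor(ifelse(prob_naive[, "pos"] > 0.4, "pos", "neg"),
                       levels = levels(diab_test$diabetes))
confusionMatrix(data = preds_cutoff, reference = diab_test$diabetes, positive = "pos")
Lowering the cutoff trades some accuracy for higher recall, so the final choice depends on how costly a missed diabetes case is relative to a false alarm.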
Based on accuracy and recall, the Naive Bayes model is the best classifier, with an accuracy of 79.87% and recall of 73.68%, versus the Decision Tree (accuracy = 66.88%, recall = 73.68%). However, we can still change the cutoff value of each model and re-validate it.