Hepatitis C Diagnosis using Naive Bayes, Decision Tree, and Random Forest
Intro
In this analysis, we will use supervised machine learning to diagnose hepatitis C infection from blood laboratory values. The dataset can be downloaded here. We will use three supervised machine learning models: Naive Bayes, Decision Tree, and Random Forest.
The Naive Bayes model is a supervised machine learning method based on Bayes' Theorem of Probability. It assumes that the predictors are related to the target variable but independent of one another. This means that every predictor is given the same weight when predicting the target.
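To make the independence assumption concrete, here is a minimal base R sketch of how a Gaussian Naive Bayes classifier could score one observation. The priors, means, and standard deviations are made-up illustrative values, not estimates from the hepatitis data, and `nb_posterior` is a hypothetical helper name.

```r
# Hypothetical class parameters for two predictors (illustrative values only)
priors <- c(Donor = 0.5, Hepatitis = 0.5)
means  <- rbind(Donor = c(AST = 25, ALT = 25), Hepatitis = c(AST = 80, ALT = 20))
sds    <- rbind(Donor = c(AST = 8,  ALT = 10), Hepatitis = c(AST = 40, ALT = 15))

# Posterior is proportional to prior * product of per-predictor likelihoods;
# the product is where the "naive" independence assumption enters.
nb_posterior <- function(x, priors, means, sds) {
  scores <- sapply(names(priors), function(cl) {
    priors[[cl]] * prod(dnorm(x, mean = means[cl, ], sd = sds[cl, ]))
  })
  scores / sum(scores)  # normalise so the posteriors sum to 1
}

new_patient <- c(AST = 70, ALT = 18)  # made-up observation
post <- nb_posterior(new_patient, priors, means, sds)
print(round(post, 3))
names(which.max(post))  # predicted class
```

Multiplying the per-predictor densities is only valid if the predictors are independent within each class, which is why the correlation between predictors matters for this model.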
A decision tree is a simple tree-based model that can nonetheless give robust predictions. The resulting tree is easy to interpret and apply, and the model is tolerant of outliers.
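Random forest is a classification model made up of many decision trees. Each tree has its own characteristics because it is built with the Bagging (Bootstrap Aggregation) concept: every tree is trained on a different bootstrap sample of the data, and the trees' predictions are then aggregated.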
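The bootstrap half of bagging can be sketched in base R. Each tree is trained on rows drawn with replacement, so some rows are left out ("out of bag") and can later be used to estimate error. The row count, number of trees, and seed below are arbitrary choices for illustration.

```r
set.seed(1)           # arbitrary seed, for reproducibility of the sketch only
n_rows  <- 1000       # hypothetical training set size
n_trees <- 200        # hypothetical number of trees

# For each tree, draw a bootstrap sample and record the share of rows left out
oob_share <- replicate(n_trees, {
  in_bag <- sample(n_rows, size = n_rows, replace = TRUE)
  1 - length(unique(in_bag)) / n_rows
})

# On average about exp(-1), roughly 37%, of rows are out of bag for a given tree
mean(oob_share)
```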
Data Import
data <- read.csv("HepatitisCdata.csv")
rmarkdown::paged_table(data)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(data)
## Rows: 615
## Columns: 14
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ Category <chr> "0=Blood Donor", "0=Blood Donor", "0=Blood Donor", "0=Blood D…
## $ Age <int> 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 3…
## $ Sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "…
## $ ALB <dbl> 38.5, 38.5, 46.9, 43.2, 39.2, 41.6, 46.3, 42.2, 50.9, 42.4, 4…
## $ ALP <dbl> 52.5, 70.3, 74.7, 52.0, 74.1, 43.3, 41.3, 41.9, 65.5, 86.3, 5…
## $ ALT <dbl> 7.7, 18.0, 36.2, 30.6, 32.6, 18.5, 17.5, 35.8, 23.2, 20.3, 21…
## $ AST <dbl> 22.1, 24.7, 52.6, 22.6, 24.8, 19.7, 17.8, 31.1, 21.2, 20.0, 2…
## $ BIL <dbl> 7.5, 3.9, 6.1, 18.9, 9.6, 12.3, 8.5, 16.1, 6.9, 35.2, 17.2, 5…
## $ CHE <dbl> 6.93, 11.17, 8.84, 7.33, 9.15, 9.92, 7.01, 5.82, 8.69, 5.46, …
## $ CHOL <dbl> 3.23, 4.80, 5.20, 4.74, 4.32, 6.05, 4.79, 4.60, 4.10, 4.45, 3…
## $ CREA <dbl> 106, 74, 86, 80, 76, 111, 70, 109, 83, 81, 78, 79, 78, 65, 63…
## $ GGT <dbl> 12.1, 15.6, 33.2, 33.8, 29.9, 91.0, 16.9, 21.5, 13.7, 15.9, 2…
## $ PROT <dbl> 69.0, 76.5, 79.3, 75.7, 68.7, 74.0, 74.5, 67.1, 71.3, 69.9, 7…
The data contains laboratory values of blood donors and hepatitis C patients, plus demographic data such as Age and Sex. Our target for classification is Category: blood donors vs hepatitis C patients. The hepatitis C stages (Hepatitis C, Fibrosis, and Cirrhosis) will be merged into a single Hepatitis category, so the target classes are Hepatitis (positive) and Donor (negative).
Data Wrangling
We will merge hepatitis, fibrosis, and cirrhosis into a single Hepatitis category and store Hepatitis vs Donor in a new Diagnosis variable. We also convert Sex to a factor and drop the X and Category variables.
library(stringr)
data <- data %>%
  mutate(Diagnosis = if_else(str_detect(Category, "Donor"), "Donor", "Hepatitis"),
         Diagnosis = as.factor(Diagnosis),
         Sex = as.factor(Sex)) %>%
  select(-c(X, Category))
rmarkdown::paged_table(data)
Check missing values.
colSums(is.na(data))
## Age Sex ALB ALP ALT AST BIL CHE
## 0 0 1 18 1 0 0 0
## CHOL CREA GGT PROT Diagnosis
## 10 0 0 1 0
Fill missing values.
Fill the missing values with the median of each column, because there are not too many missing values.
data_clean <- data %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))
colSums(is.na(data_clean))
## Age Sex ALB ALP ALT AST BIL CHE
## 0 0 0 0 0 0 0 0
## CHOL CREA GGT PROT Diagnosis
## 0 0 0 0 0
Now, our data is clean!
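As a small illustration of what the imputation step does, the sketch below applies the same median rule to a toy vector. The values are invented, and `impute_median` is a hypothetical helper, not part of the analysis above.

```r
# Replace NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

alp_toy    <- c(52.5, NA, 74.7, 52.0, NA, 43.3)  # made-up values
alp_filled <- impute_median(alp_toy)
alp_filled
```

Computing the median on the full data before splitting lets a little information from the test rows into training. With this few missing values the effect is probably small, but a stricter workflow would compute the medians on the training rows only.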
Exploratory Data Analysis
Check the target variable proportion
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.2.2
data_clean %>%
  select(Diagnosis) %>%
  inspect_cat()
data_clean %>%
  select(Diagnosis) %>%
  inspect_cat() %>%
  show_plot()
The target classes are imbalanced: the Donor class makes up 87.80% of the data. We will balance the classes later by upsampling.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(data_clean, label = T)
## Warning in ggcorr(data_clean, label = T): data in column(s) 'Sex', 'Diagnosis'
## are not numeric and were ignored
Based on the plot above, the predictors have low correlation with each other. This suggests the data may suit Naive Bayes, which assumes the predictors are independent, although low linear correlation does not guarantee independence.
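Low correlation is a useful sign but not proof of independence, because the correlation coefficient only measures linear association. A minimal base R sketch with made-up values shows two variables that are completely dependent yet uncorrelated:

```r
x <- seq(-1, 1, by = 0.1)
y <- x^2              # y is fully determined by x
cor(x, y)             # yet the Pearson correlation is essentially zero
```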
Cross-Validation
Cross-validation here is the step where we split our data into training data (80%) and testing data (20%). We use the training data to train the models and the testing data to check whether they classify new, unseen data correctly.
library(rsample)
## Warning: package 'rsample' was built under R version 4.2.2
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index_hepa <- sample(nrow(data_clean), nrow(data_clean)*0.8)
hepa_train <- data_clean[index_hepa,]
hepa_test <- data_clean[-index_hepa,]
#Check target proportion
prop.table(table(hepa_train$Diagnosis))
##
## Donor Hepatitis
## 0.8821138 0.1178862
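With an imbalanced target, a purely random split can shift the class proportions between the training and test sets. One option is a stratified split, sampled within each class. The sketch below uses base R on a toy data frame; `toy` and `stratified_index` are hypothetical names, and this is an alternative to the split used above, not a description of it.

```r
set.seed(100)
# Toy data with roughly the same imbalance as the hepatitis data
toy <- data.frame(
  value     = rnorm(200),
  Diagnosis = factor(rep(c("Donor", "Hepatitis"), times = c(176, 24)))
)

# Sample a fixed proportion of the row indices within each class
stratified_index <- function(y, prop = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(length(idx) * prop))]
  }), use.names = FALSE)
}

idx       <- stratified_index(toy$Diagnosis)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class proportions stay close to each other in both sets
prop.table(table(toy_train$Diagnosis))
prop.table(table(toy_test$Diagnosis))
```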
Data Pre-Processing
# Upsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
hepa_train <- upSample(x = hepa_train %>% select(-Diagnosis),
                       y = hepa_train$Diagnosis,
                       yname = "Diagnosis") # name of the target column
prop.table(table(hepa_train$Diagnosis))
##
## Donor Hepatitis
## 0.5 0.5
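For readers curious about what upsampling does, here is a base R sketch of the idea: rows of the smaller class are resampled with replacement until every class has as many rows as the largest one. It mirrors the concept only and is not the caret implementation; `upsample_sketch` is a hypothetical name and the data is made up.

```r
upsample_sketch <- function(df, target) {
  groups <- split(df, df[[target]])
  n_max  <- max(vapply(groups, nrow, integer(1)))
  balanced <- lapply(groups, function(g) {
    # keep all original rows, then add resampled duplicates up to n_max
    extra <- sample.int(nrow(g), size = n_max - nrow(g), replace = TRUE)
    g[c(seq_len(nrow(g)), extra), , drop = FALSE]
  })
  do.call(rbind, balanced)
}

set.seed(100)
toy <- data.frame(
  AST       = c(22, 25, 19, 31, 21, 20, 95, 110),  # made-up values
  Diagnosis = factor(c(rep("Donor", 6), rep("Hepatitis", 2)))
)
toy_up <- upsample_sketch(toy, "Diagnosis")
table(toy_up$Diagnosis)
```

Upsampling only the training set, as done above, keeps duplicated rows out of the test data.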
Modelling
Naive Bayes
Model Training
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
model_nb <- naiveBayes(Diagnosis~., hepa_train)
Model Evaluation
hepa_predict_nb <- predict(model_nb, newdata = hepa_test, type = "class")
confusionMatrix(data = hepa_predict_nb, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor Hepatitis
## Donor 104 3
## Hepatitis 2 14
##
## Accuracy : 0.9593
## 95% CI : (0.9077, 0.9867)
## No Information Rate : 0.8618
## P-Value [Acc > NIR] : 0.0003444
##
## Kappa : 0.825
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8235
## Specificity : 0.9811
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.9720
## Prevalence : 0.1382
## Detection Rate : 0.1138
## Detection Prevalence : 0.1301
## Balanced Accuracy : 0.9023
##
## 'Positive' Class : Hepatitis
##
Based on the confusion matrix, the Naive Bayes model predicted 107 test observations as Donor, 104 of them correctly, and 16 as Hepatitis, 14 of them correctly. The model's accuracy is 95.93%, with 82.35% sensitivity and 98.11% specificity.
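The reported metrics follow directly from the four cells of the confusion matrix. As a check, the sketch below recomputes them in base R from the Naive Bayes counts, with Hepatitis as the positive class:

```r
# Counts taken from the Naive Bayes confusion matrix above
tp <- 14   # predicted Hepatitis, actually Hepatitis
fn <- 3    # predicted Donor, actually Hepatitis
fp <- 2    # predicted Hepatitis, actually Donor
tn <- 104  # predicted Donor, actually Donor

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of Hepatitis cases that were detected
specificity <- tn / (tn + fp)   # share of Donors correctly classified
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output

nb_metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                      specificity = specificity, precision = precision), 4)
nb_metrics
```

In a diagnostic setting, sensitivity deserves particular attention, since a false negative means a hepatitis case goes undetected.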
Decision Tree
Model Training
library(partykit)
## Warning: package 'partykit' was built under R version 4.2.2
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.2.2
## Loading required package: mvtnorm
model_dt <- ctree(Diagnosis~., hepa_train)
model_dt
##
## Model formula:
## Diagnosis ~ Age + Sex + ALB + ALP + ALT + AST + BIL + CHE + CHOL +
## CREA + GGT + PROT
##
## Fitted party:
## [1] root
## | [2] AST <= 35.7
## | | [3] CREA <= 127
## | | | [4] ALP <= 36.7
## | | | | [5] ALT <= 7.4: Hepatitis (n = 13, err = 0.0%)
## | | | | [6] ALT > 7.4: Hepatitis (n = 16, err = 37.5%)
## | | | [7] ALP > 36.7
## | | | | [8] AST <= 33: Donor (n = 361, err = 0.0%)
## | | | | [9] AST > 33
## | | | | | [10] AST <= 34
## | | | | | | [11] ALP <= 56.3: Hepatitis (n = 12, err = 8.3%)
## | | | | | | [12] ALP > 56.3: Hepatitis (n = 14, err = 42.9%)
## | | | | | [13] AST > 34: Donor (n = 14, err = 0.0%)
## | | [14] CREA > 127: Hepatitis (n = 8, err = 0.0%)
## | [15] AST > 35.7
## | | [16] AST <= 52.6
## | | | [17] ALT <= 13.3: Hepatitis (n = 75, err = 2.7%)
## | | | [18] ALT > 13.3
## | | | | [19] PROT <= 76.2: Donor (n = 31, err = 0.0%)
## | | | | [20] PROT > 76.2
## | | | | | [21] CREA <= 81: Hepatitis (n = 12, err = 41.7%)
## | | | | | [22] CREA > 81: Hepatitis (n = 22, err = 4.5%)
## | | [23] AST > 52.6
## | | | [24] PROT <= 54.2: Hepatitis (n = 9, err = 33.3%)
## | | | [25] PROT > 54.2: Hepatitis (n = 281, err = 1.4%)
##
## Number of inner nodes: 12
## Number of terminal nodes: 13
Model Visualisation
#Model Visualisation
plot(model_dt, type = "simple")
Model Evaluation
hepa_predict_dt <- predict(model_dt, newdata = hepa_test, type="response")
confusionMatrix(data = hepa_predict_dt, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor Hepatitis
## Donor 97 1
## Hepatitis 9 16
##
## Accuracy : 0.9187
## 95% CI : (0.8556, 0.9603)
## No Information Rate : 0.8618
## P-Value [Acc > NIR] : 0.03807
##
## Kappa : 0.715
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9412
## Specificity : 0.9151
## Pos Pred Value : 0.6400
## Neg Pred Value : 0.9898
## Prevalence : 0.1382
## Detection Rate : 0.1301
## Detection Prevalence : 0.2033
## Balanced Accuracy : 0.9281
##
## 'Positive' Class : Hepatitis
##
Based on the confusion matrix, the decision tree model predicted 98 test observations as Donor, 97 of them correctly, and 25 as Hepatitis, 16 of them correctly. The model's accuracy is 91.87%, with 94.12% sensitivity and 91.51% specificity.
Random Forest
Model Training
set.seed(417)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# build the random forest model
model_rf <- train(form = Diagnosis~., data = hepa_train, method = "rf",
                  trainControl = control)
model_rf
## Random Forest
##
## 868 samples
## 12 predictor
## 2 classes: 'Donor', 'Hepatitis'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 868, 868, 868, 868, 868, 868, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9973928 0.9947677
## 7 0.9886703 0.9772482
## 12 0.9806837 0.9612491
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
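Based on the model summary, 12 predictors are used to predict the two classes. caret tried three values of mtry (2, 7, and 12) and obtained the best accuracy with mtry = 2. Note that the summary reports "Resampling: Bootstrapped (25 reps)" rather than repeated 5-fold cross-validation: the control object was passed as trainControl, whereas train() expects the argument trControl, so the default bootstrap resampling was used.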
Out of Bag Error
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## Donor Hepatitis class.error
## Donor 434 0 0
## Hepatitis 0 434 0
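The random forest model builds 500 trees and tries 2 variables at each split. The OOB estimate of the error rate is 0%, meaning the model classified every out-of-bag observation correctly. Because the training data was upsampled, duplicated Hepatitis rows can appear both in a tree's bootstrap sample and among its out-of-bag rows, so this estimate is likely optimistic; the test-set evaluation below is the more reliable check.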
varImp(model_rf) %>% plot()
The most important predictor in this model is AST.
Model Evaluation
hepa_predict_rf <- predict(model_rf, newdata = hepa_test)
confusionMatrix(data = hepa_predict_rf, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor Hepatitis
## Donor 105 1
## Hepatitis 1 16
##
## Accuracy : 0.9837
## 95% CI : (0.9425, 0.998)
## No Information Rate : 0.8618
## P-Value [Acc > NIR] : 2.422e-06
##
## Kappa : 0.9317
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9412
## Specificity : 0.9906
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.9906
## Prevalence : 0.1382
## Detection Rate : 0.1301
## Detection Prevalence : 0.1382
## Balanced Accuracy : 0.9659
##
## 'Positive' Class : Hepatitis
##
Conclusion
Based on the confusion matrices, the Random Forest model gave the best result at separating the Hepatitis and Donor classes. It has the highest accuracy (98.37%), with sensitivity, specificity, and precision all above 90%.
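To put the three models side by side, the sketch below recomputes the headline metrics from the confusion-matrix counts reported in each Model Evaluation section. The helper `metrics_from_counts` is a hypothetical name introduced here.

```r
# Hepatitis is the positive class in every confusion matrix
metrics_from_counts <- function(tp, fn, fp, tn) {
  c(accuracy    = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

comparison <- rbind(
  naive_bayes   = metrics_from_counts(tp = 14, fn = 3, fp = 2, tn = 104),
  decision_tree = metrics_from_counts(tp = 16, fn = 1, fp = 9, tn = 97),
  random_forest = metrics_from_counts(tp = 16, fn = 1, fp = 1, tn = 105)
)
round(comparison, 4)
```

The decision tree matches the random forest on sensitivity, but its precision is only 0.64, which is why it falls behind overall. With just 17 Hepatitis cases in the test set, a single misclassification moves sensitivity by about six percentage points, so the ranking should be read with that in mind.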