Hepatitis C Diagnosis using Naive Bayes, Decision Tree, and Random Forest

Intro

In this analysis, we use supervised machine learning to diagnose hepatitis C infection from blood test results. The dataset can be downloaded here. We will compare three supervised machine learning models: Naive Bayes, Decision Tree, and Random Forest.

Naive Bayes is a supervised machine learning method based on Bayes' theorem of probability. It assumes that the target depends on the predictors, while the predictors are independent of one another within each class. As a consequence, every predictor is given the same weight when predicting the target.
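To make the independence assumption concrete, here is a minimal base R sketch of a Gaussian Naive Bayes classifier. The toy values and the function name `nb_posterior` are assumptions made for illustration only; the analysis itself relies on `naiveBayes()` from e1071.

```r
# Toy training data (hypothetical values, two numeric predictors)
train <- data.frame(
  ast   = c(20, 25, 22, 30, 80, 95, 70, 110),
  alt   = c(18, 22, 30, 25, 10, 60, 8, 90),
  label = factor(c("Donor", "Donor", "Donor", "Donor",
                   "Hepatitis", "Hepatitis", "Hepatitis", "Hepatitis"))
)

# Posterior P(class | x) under the naive assumption:
# P(class | x) is proportional to P(class) * prod_j P(x_j | class),
# with each P(x_j | class) modelled as a normal density.
nb_posterior <- function(train, newx, target = "label") {
  classes <- levels(train[[target]])
  predictors <- setdiff(names(train), target)
  scores <- sapply(classes, function(cl) {
    rows <- train[train[[target]] == cl, , drop = FALSE]
    prior <- nrow(rows) / nrow(train)
    lik <- sapply(predictors, function(p) {
      dnorm(newx[[p]], mean = mean(rows[[p]]), sd = sd(rows[[p]]))
    })
    prior * prod(lik)  # independence: per-predictor likelihoods are multiplied
  })
  scores / sum(scores)  # normalise so the posteriors sum to 1
}

post <- nb_posterior(train, list(ast = 85, alt = 40))
print(round(post, 3))
```

The product inside `nb_posterior` is where the "naive" part lives: each predictor contributes its own likelihood, with no term for interactions between predictors.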

A decision tree is a simple tree-based model that can still predict well. It produces a tree of decision rules that is easy to interpret and apply, and it copes reasonably well with outliers.

A random forest is a classification model made up of many decision trees. Each tree has its own characteristics because it is built on a different bootstrap sample of the data, and the trees' predictions are then combined. This concept is called bagging (bootstrap aggregation).
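The bagging idea can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions: the data are simulated, and the base learner is a one-split "stump" (the hypothetical helpers `fit_stump` and `predict_stump`) rather than a full tree. A real random forest also tries a random subset of predictors at each split.

```r
set.seed(1)

# Toy data (hypothetical): one predictor, two classes
x <- c(rnorm(50, mean = 25, sd = 5), rnorm(50, mean = 60, sd = 15))
y <- rep(c("Donor", "Hepatitis"), each = 50)

# Base learner: a one-split stump whose threshold is the midpoint
# between the two class means in the sample it is fitted on
fit_stump <- function(x, y) {
  mean(c(mean(x[y == "Donor"]), mean(x[y == "Hepatitis"])))
}
predict_stump <- function(threshold, x) {
  ifelse(x > threshold, "Hepatitis", "Donor")
}

# Bootstrap: fit each stump on a resample drawn with replacement
n_trees <- 25
thresholds <- replicate(n_trees, {
  idx <- sample(length(x), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregation: every stump votes, the majority class wins
bagged_predict <- function(thresholds, newx) {
  votes <- sapply(thresholds, predict_stump, x = newx)
  votes <- matrix(votes, nrow = length(newx))  # rows = observations, cols = stumps
  apply(votes, 1, function(v) names(which.max(table(v))))
}

pred <- bagged_predict(thresholds, c(20, 90))
pred
```

Because every stump sees a slightly different resample, the thresholds differ, and the vote smooths out the quirks of any single learner.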

Data Import

data <- read.csv("HepatitisCdata.csv")
rmarkdown::paged_table(data)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(data)
## Rows: 615
## Columns: 14
## $ X        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ Category <chr> "0=Blood Donor", "0=Blood Donor", "0=Blood Donor", "0=Blood D…
## $ Age      <int> 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 3…
## $ Sex      <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "…
## $ ALB      <dbl> 38.5, 38.5, 46.9, 43.2, 39.2, 41.6, 46.3, 42.2, 50.9, 42.4, 4…
## $ ALP      <dbl> 52.5, 70.3, 74.7, 52.0, 74.1, 43.3, 41.3, 41.9, 65.5, 86.3, 5…
## $ ALT      <dbl> 7.7, 18.0, 36.2, 30.6, 32.6, 18.5, 17.5, 35.8, 23.2, 20.3, 21…
## $ AST      <dbl> 22.1, 24.7, 52.6, 22.6, 24.8, 19.7, 17.8, 31.1, 21.2, 20.0, 2…
## $ BIL      <dbl> 7.5, 3.9, 6.1, 18.9, 9.6, 12.3, 8.5, 16.1, 6.9, 35.2, 17.2, 5…
## $ CHE      <dbl> 6.93, 11.17, 8.84, 7.33, 9.15, 9.92, 7.01, 5.82, 8.69, 5.46, …
## $ CHOL     <dbl> 3.23, 4.80, 5.20, 4.74, 4.32, 6.05, 4.79, 4.60, 4.10, 4.45, 3…
## $ CREA     <dbl> 106, 74, 86, 80, 76, 111, 70, 109, 83, 81, 78, 79, 78, 65, 63…
## $ GGT      <dbl> 12.1, 15.6, 33.2, 33.8, 29.9, 91.0, 16.9, 21.5, 13.7, 15.9, 2…
## $ PROT     <dbl> 69.0, 76.5, 79.3, 75.7, 68.7, 74.0, 74.5, 67.1, 71.3, 69.9, 7…

The data contains laboratory values of blood donors and hepatitis C patients, plus demographic data such as Age and Sex. Our classification target is `Category`: blood donors versus hepatitis C patients. The hepatitis C levels (Hepatitis C, Fibrosis, and Cirrhosis) will be merged into a single Hepatitis category, so our target classes are Hepatitis (positive) and Donor (negative).

Data Wrangling

We merge fibrosis and cirrhosis into the hepatitis C category and store the resulting Hepatitis/Donor label in a new Diagnosis variable. We also convert Sex to a factor and drop the X and Category variables.

library(stringr)
data <- data %>% 
  mutate(Diagnosis = if_else(str_detect(Category, "Donor"), "Donor", "Hepatitis"),
         Diagnosis = as.factor(Diagnosis),
         Sex = as.factor(Sex)) %>% 
  select(-c(X, Category))

rmarkdown::paged_table(data)

Check missing values.

colSums(is.na(data))
##       Age       Sex       ALB       ALP       ALT       AST       BIL       CHE 
##         0         0         1        18         1         0         0         0 
##      CHOL      CREA       GGT      PROT Diagnosis 
##        10         0         0         1         0

Fill missing values.

We fill the missing values with each column's median because there are not too many of them.

data_clean <- data %>% 
  mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

colSums(is.na(data_clean))
##       Age       Sex       ALB       ALP       ALT       AST       BIL       CHE 
##         0         0         0         0         0         0         0         0 
##      CHOL      CREA       GGT      PROT Diagnosis 
##         0         0         0         0         0

Now, our data is clean!
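As a small illustration of why the median is a reasonable fill value for lab measurements, the base R sketch below imputes a toy vector that contains one extreme value. The numbers and the helper name `impute_median` are made up for this example.

```r
# Hypothetical lab values with one gap and one extreme measurement
values <- c(20, 22, NA, 25, 300)

# Replace each NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

filled <- impute_median(values)
filled

# The mean would be pulled towards the extreme value, the median is not
c(median = median(values, na.rm = TRUE), mean = mean(values, na.rm = TRUE))
```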

Exploratory Data Analysis

Check the target variable proportion

library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.2.2
data_clean %>% 
  select(Diagnosis) %>%
  inspect_cat()
data_clean %>% 
  select(Diagnosis) %>%
  inspect_cat() %>%
  show_plot()

We can see that the target classes are imbalanced: the Donor class makes up 87.80% of the data. We will balance the classes by upsampling later.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(data_clean, label = T)
## Warning in ggcorr(data_clean, label = T): data in column(s) 'Sex', 'Diagnosis'
## are not numeric and were ignored

Based on the plot above, we can see that our predictors have low correlation with each other. This suggests that the data is reasonably suited to Naive Bayes, which assumes that the predictors are independent, although low correlation alone does not guarantee independence.

Cross-Validation

Cross-validation is the step where we split our data into training data and testing data. Here we use a single hold-out split, with 80% of the rows for training and the remaining 20% for testing. We use the training data to train our model, and the testing data to check whether the model classifies new, unseen data correctly.

library(rsample)
## Warning: package 'rsample' was built under R version 4.2.2
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

index_hepa <- sample(nrow(data_clean), nrow(data_clean)*0.8)
hepa_train <- data_clean[index_hepa,]
hepa_test <- data_clean[-index_hepa,]

#Check target proportion
prop.table(table(hepa_train$Diagnosis))
## 
##     Donor Hepatitis 
## 0.8821138 0.1178862
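A purely random split can shift the class proportions slightly, as the output above shows (88.2% Donor in training versus 87.8% overall). One common alternative is a stratified split, which samples within each class. The sketch below uses base R on hypothetical toy data; it is an illustration of the idea, not the split used in this analysis.

```r
set.seed(100)

# Toy target with a similar imbalance (hypothetical data)
toy <- data.frame(
  id = 1:200,
  Diagnosis = factor(rep(c("Donor", "Hepatitis"), times = c(175, 25)))
)

# Sample 80% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$Diagnosis)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions are preserved in both parts
prop.table(table(toy_train$Diagnosis))
prop.table(table(toy_test$Diagnosis))
```

The rsample package loaded above offers the same idea through the `strata` argument of `initial_split()`.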

Data Pre-Processing

# Upsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
hepa_train <- upSample(x = hepa_train %>% select(-Diagnosis),
                       y = hepa_train$Diagnosis,
                       yname = "Diagnosis") # name of the target column

prop.table(table(hepa_train$Diagnosis))
## 
##     Donor Hepatitis 
##       0.5       0.5
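Conceptually, upsampling duplicates randomly chosen minority rows until every class is as large as the biggest one. The base R sketch below shows that idea on toy data; `upsample_sketch` is a hypothetical helper written for this illustration and is not the implementation of `upSample()`.

```r
set.seed(100)

# Toy imbalanced training set (hypothetical values)
toy_train <- data.frame(
  AST = c(22, 25, 19, 31, 24, 27, 21, 90, 120),
  Diagnosis = factor(c(rep("Donor", 7), rep("Hepatitis", 2)))
)

# Keep all original rows and add resampled copies (with replacement)
# until each class reaches the size of the largest class
upsample_sketch <- function(df, target) {
  n_max <- max(table(df[[target]]))
  parts <- lapply(split(df, df[[target]]), function(part) {
    extra <- sample.int(nrow(part), size = n_max - nrow(part), replace = TRUE)
    rbind(part, part[extra, , drop = FALSE])
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

balanced <- upsample_sketch(toy_train, "Diagnosis")
table(balanced$Diagnosis)
```

Note that only the training data is balanced. The test data keeps its original class proportions, so the evaluation reflects the imbalance a model would face in practice.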

Modelling

Naive Bayes

Model Training

library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
## 
##     permutations
model_nb <- naiveBayes(Diagnosis~., hepa_train)

Model Evaluation

hepa_predict_nb <- predict(model_nb, newdata = hepa_test, type = "class")

confusionMatrix(data = hepa_predict_nb, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Donor Hepatitis
##   Donor       104         3
##   Hepatitis     2        14
##                                           
##                Accuracy : 0.9593          
##                  95% CI : (0.9077, 0.9867)
##     No Information Rate : 0.8618          
##     P-Value [Acc > NIR] : 0.0003444       
##                                           
##                   Kappa : 0.825           
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8235          
##             Specificity : 0.9811          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.9720          
##              Prevalence : 0.1382          
##          Detection Rate : 0.1138          
##    Detection Prevalence : 0.1301          
##       Balanced Accuracy : 0.9023          
##                                           
##        'Positive' Class : Hepatitis       
## 

Based on the confusion matrix, the Naive Bayes model predicted 107 test observations as Donor, of which 104 were correct, and 16 as Hepatitis, of which 14 were correct. The model's accuracy is 95.93%, with 82.35% sensitivity and 98.11% specificity.
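The figures quoted above can be reproduced by hand from the four cells of the confusion matrix. The short base R calculation below uses the counts from the Naive Bayes output, with Hepatitis as the positive class; the variable names are chosen for this illustration.

```r
# Counts taken from the Naive Bayes confusion matrix above
tp <- 14   # predicted Hepatitis, actually Hepatitis
fn <- 3    # predicted Donor, actually Hepatitis
fp <- 2    # predicted Hepatitis, actually Donor
tn <- 104  # predicted Donor, actually Donor

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of true Hepatitis cases that were detected
specificity <- tn / (tn + fp)   # share of true Donors that were classified as Donor
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output

nb_metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                      specificity = specificity, precision = precision), 4)
nb_metrics
```

For a diagnostic task, sensitivity deserves particular attention: the 3 false negatives are Hepatitis cases that the model labelled as Donor.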

Decision Tree

Model Training

library(partykit)
## Warning: package 'partykit' was built under R version 4.2.2
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.2.2
## Loading required package: mvtnorm
model_dt <- ctree(Diagnosis~., hepa_train)
model_dt
## 
## Model formula:
## Diagnosis ~ Age + Sex + ALB + ALP + ALT + AST + BIL + CHE + CHOL + 
##     CREA + GGT + PROT
## 
## Fitted party:
## [1] root
## |   [2] AST <= 35.7
## |   |   [3] CREA <= 127
## |   |   |   [4] ALP <= 36.7
## |   |   |   |   [5] ALT <= 7.4: Hepatitis (n = 13, err = 0.0%)
## |   |   |   |   [6] ALT > 7.4: Hepatitis (n = 16, err = 37.5%)
## |   |   |   [7] ALP > 36.7
## |   |   |   |   [8] AST <= 33: Donor (n = 361, err = 0.0%)
## |   |   |   |   [9] AST > 33
## |   |   |   |   |   [10] AST <= 34
## |   |   |   |   |   |   [11] ALP <= 56.3: Hepatitis (n = 12, err = 8.3%)
## |   |   |   |   |   |   [12] ALP > 56.3: Hepatitis (n = 14, err = 42.9%)
## |   |   |   |   |   [13] AST > 34: Donor (n = 14, err = 0.0%)
## |   |   [14] CREA > 127: Hepatitis (n = 8, err = 0.0%)
## |   [15] AST > 35.7
## |   |   [16] AST <= 52.6
## |   |   |   [17] ALT <= 13.3: Hepatitis (n = 75, err = 2.7%)
## |   |   |   [18] ALT > 13.3
## |   |   |   |   [19] PROT <= 76.2: Donor (n = 31, err = 0.0%)
## |   |   |   |   [20] PROT > 76.2
## |   |   |   |   |   [21] CREA <= 81: Hepatitis (n = 12, err = 41.7%)
## |   |   |   |   |   [22] CREA > 81: Hepatitis (n = 22, err = 4.5%)
## |   |   [23] AST > 52.6
## |   |   |   [24] PROT <= 54.2: Hepatitis (n = 9, err = 33.3%)
## |   |   |   [25] PROT > 54.2: Hepatitis (n = 281, err = 1.4%)
## 
## Number of inner nodes:    12
## Number of terminal nodes: 13
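To show how the printed tree is read, the rules above can be rewritten as nested conditions. The function below is a hand transcription for illustration (the name `classify_by_tree` is an assumption); thresholds are copied from the output, and sibling leaves that share a label are merged.

```r
# The printed ctree rules as nested conditions.
# `p` is a list (or one-row data frame) with AST, CREA, ALP, ALT and PROT.
classify_by_tree <- function(p) {
  if (p$AST <= 35.7) {
    if (p$CREA > 127) return("Hepatitis")   # node 14
    if (p$ALP <= 36.7) return("Hepatitis")  # nodes 5 and 6
    if (p$AST <= 33) return("Donor")        # node 8
    if (p$AST <= 34) return("Hepatitis")    # nodes 11 and 12
    return("Donor")                         # node 13
  }
  if (p$AST > 52.6) return("Hepatitis")     # nodes 24 and 25
  if (p$ALT <= 13.3) return("Hepatitis")    # node 17
  if (p$PROT <= 76.2) return("Donor")       # node 19
  "Hepatitis"                               # nodes 21 and 22
}

# First row of the dataset, as shown by glimpse()
first_row <- list(AST = 22.1, CREA = 106, ALP = 52.5, ALT = 7.7, PROT = 69.0)
classify_by_tree(first_row)
```

Only five of the twelve predictors appear in the rules, and AST is tested first, so a single AST value above 52.6 is already enough for a Hepatitis prediction.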

Model Visualisation

#Model Visualisation
plot(model_dt, type = "simple")

Model Evaluation

hepa_predict_dt <- predict(model_dt, newdata = hepa_test, type="response")

confusionMatrix(data = hepa_predict_dt, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Donor Hepatitis
##   Donor        97         1
##   Hepatitis     9        16
##                                           
##                Accuracy : 0.9187          
##                  95% CI : (0.8556, 0.9603)
##     No Information Rate : 0.8618          
##     P-Value [Acc > NIR] : 0.03807         
##                                           
##                   Kappa : 0.715           
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.9412          
##             Specificity : 0.9151          
##          Pos Pred Value : 0.6400          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.1382          
##          Detection Rate : 0.1301          
##    Detection Prevalence : 0.2033          
##       Balanced Accuracy : 0.9281          
##                                           
##        'Positive' Class : Hepatitis       
## 

Based on the confusion matrix, the decision tree model predicted 98 test observations as Donor, of which 97 were correct, and 25 as Hepatitis, of which 16 were correct. The model's accuracy is 91.87%, with 94.12% sensitivity and 91.51% specificity.

Random Forest

Model Training

set.seed(417)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

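# build the random forest model
# note: train() names this argument `trControl`; spelled `trainControl` as here,
# `control` is not applied, which is why the output below reports bootstrap resampling
model_rf <- train(form = Diagnosis~., data = hepa_train, method = "rf",
                   trainControl = control)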
model_rf
## Random Forest 
## 
## 868 samples
##  12 predictor
##   2 classes: 'Donor', 'Hepatitis' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 868, 868, 868, 868, 868, 868, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9973928  0.9947677
##    7    0.9886703  0.9772482
##   12    0.9806837  0.9612491
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

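Based on the model summary, 12 predictors are used to predict the two classes. The model tried three values of mtry (2, 7, and 12) and reached its best accuracy with mtry = 2. Note that the summary reports bootstrap resampling (25 reps) rather than the repeated 5-fold cross-validation defined in control, because train() expects that object under the argument name trControl.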
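The `control` object above describes 5-fold cross-validation repeated 3 times. The base R sketch below shows what that resampling scheme amounts to for the 868 training rows. `repeated_cv_folds` is a hypothetical helper for illustration, and unlike caret's fold creation it does not balance the classes within folds.

```r
set.seed(417)

# Build the hold-out row indices for k-fold CV repeated several times
repeated_cv_folds <- function(n, k = 5, repeats = 3) {
  folds <- list()
  for (r in seq_len(repeats)) {
    # Shuffle the fold labels 1..k across the n rows for this repeat
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      folds[[sprintf("Rep%d.Fold%d", r, f)]] <- which(fold_id == f)
    }
  }
  folds
}

folds <- repeated_cv_folds(n = 868, k = 5, repeats = 3)
length(folds)          # number of train/validate rounds per mtry value
range(lengths(folds))  # hold-out size per round
```

Each candidate mtry would be trained and validated once per fold, and the reported accuracy would be the average over all rounds.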

Out of Bag Error

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##           Donor Hepatitis class.error
## Donor       434         0           0
## Hepatitis     0       434           0

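The random forest model grows 500 trees and tries 2 variables at each split. The OOB estimate of the error rate is 0%, which means the model classifies the out-of-bag data with 100% accuracy. Because the training data was upsampled, out-of-bag rows can be exact duplicates of rows a tree was trained on, so this estimate is likely optimistic.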
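To see what "out of bag" means, and how duplicated rows interact with it, consider one bootstrap sample in base R. The sizes below are made up for illustration: 10 original minority rows, each copied 10 times, stand in for an upsampled class.

```r
set.seed(417)

# Toy upsampled class: 10 unique rows, each present 10 times
# (hypothetical sizes, chosen only to make the effect visible)
original_id <- sample(rep(1:10, each = 10))  # original row behind each copy
n <- length(original_id)

# One bootstrap sample, as drawn for a single tree in the forest
in_bag <- sample(n, replace = TRUE)
oob    <- setdiff(seq_len(n), in_bag)

# Share of rows left out of this tree (about 1/e = 0.368 in expectation)
oob_share <- length(oob) / n
oob_share

# Share of out-of-bag rows whose exact duplicate is still in the bag
dup_share <- mean(original_id[oob] %in% original_id[in_bag])
dup_share
```

When nearly every out-of-bag row has an identical copy in the bag, the tree has in effect already seen it, so the held-out test set is the more trustworthy measure of performance.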

varImp(model_rf) %>% plot()

The most important predictor in this model is AST.

Model Evaluation

hepa_predict_rf <- predict(model_rf, newdata = hepa_test)

confusionMatrix(data = hepa_predict_rf, reference = hepa_test$Diagnosis, positive = "Hepatitis")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Donor Hepatitis
##   Donor       105         1
##   Hepatitis     1        16
##                                          
##                Accuracy : 0.9837         
##                  95% CI : (0.9425, 0.998)
##     No Information Rate : 0.8618         
##     P-Value [Acc > NIR] : 2.422e-06      
##                                          
##                   Kappa : 0.9317         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9412         
##             Specificity : 0.9906         
##          Pos Pred Value : 0.9412         
##          Neg Pred Value : 0.9906         
##              Prevalence : 0.1382         
##          Detection Rate : 0.1301         
##    Detection Prevalence : 0.1382         
##       Balanced Accuracy : 0.9659         
##                                          
##        'Positive' Class : Hepatitis      
## 

Conclusion

Based on the confusion matrices, the Random Forest model gave the best result in classifying the Hepatitis and Donor classes. It has the highest accuracy (98.37%), with sensitivity, specificity, and precision all above 90%.
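For a side-by-side view, the test-set counts from the three confusion matrices can be collected in one table. The base R snippet below only rearranges numbers already shown above (positive class = Hepatitis); the data frame layout is an assumption made for this summary.

```r
# Test-set counts copied from the three confusion matrices above
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  tp = c(14, 16, 16),
  fn = c(3, 1, 1),
  fp = c(2, 9, 1),
  tn = c(104, 97, 105)
)

results$accuracy    <- with(results, (tp + tn) / (tp + tn + fp + fn))
results$sensitivity <- with(results, tp / (tp + fn))
results$specificity <- with(results, tn / (tn + fp))
results$precision   <- with(results, tp / (tp + fp))

# Round only the metric columns for display, best accuracy first
metric_cols <- c("accuracy", "sensitivity", "specificity", "precision")
results[metric_cols] <- round(results[metric_cols], 4)
results[order(-results$accuracy), c("model", metric_cols)]
```

The table also shows the trade-off behind the ranking: the decision tree matches the random forest on sensitivity but produces 9 false positives, which lowers its precision to 64%.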