Naive bayes classification is used on categorical data to study the correlation between two variables.
<h2> P(A|B)=P(B|A)P(A)/P(B) </h2>
Lets apply naive bayes classification on Naive bayes on “Head and neck cancer medication case study”.
** Data set Meta data PID:** Patient ID ENC_ID: coded encounter ID Seer Stage : Cancer stages 0 - In - situ, 1 - Localized,2 - Regional to direct extension, 3 - Regional to Lymph nodes, 4 - Both 2 & 3,5 - Regional, 7 - systematic disease, 8 - not applicable,9 - Unknown Medication desc : Description of medication Mediction summary : brief description of medication and usage Dose: dosage of medication Unit: unit of dosage Frequency: Frequency of use in medication summary Total dosage count: Total count of dosage
Lets look into the data first
df <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1",stringsAsFactors = F)
dim(df) #dimension: 662 X 9
## [1] 662 9
summary(df)
## PID ENC_ID seer_stage MEDICATION_DESC
## Min. : 22 Min. : 42 Min. :0.000 Length:662
## 1st Qu.: 3818 1st Qu.:15572 1st Qu.:1.000 Class :character
## Median : 7600 Median :33116 Median :2.000 Mode :character
## Mean : 6941 Mean :31301 Mean :3.396
## 3rd Qu.:10282 3rd Qu.:48372 3rd Qu.:7.000
## Max. :12340 Max. :57893 Max. :9.000
## MEDICATION_SUMMARY DOSE UNIT FREQUENCY
## Length:662 Length:662 Length:662 Length:662
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## TOTAL_DOSE_COUNT
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 3.000
## Mean : 6.337
## 3rd Qu.: 7.000
## Max. :110.000
str(df)
## 'data.frame': 662 obs. of 9 variables:
## $ PID : int 10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
## $ ENC_ID : int 46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
## $ seer_stage : int 1 1 4 1 9 1 1 1 9 1 ...
## $ MEDICATION_DESC : chr "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
## $ MEDICATION_SUMMARY: chr "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol" ...
## $ DOSE : chr "150" "5000" "1.5" "50" ...
## $ UNIT : chr "mg" "unit" "g" "microgram" ...
## $ FREQUENCY : chr "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
## $ TOTAL_DOSE_COUNT : int 5 3 11 2 1 2 2 6 15 1 ...
sapply(df,mode)
## PID ENC_ID seer_stage MEDICATION_DESC
## "numeric" "numeric" "numeric" "character"
## MEDICATION_SUMMARY DOSE UNIT FREQUENCY
## "character" "character" "character" "character"
## TOTAL_DOSE_COUNT
## "numeric"
head(df,4)
## PID ENC_ID seer_stage MEDICATION_DESC
## 1 10000 46836 1 ranitidine
## 2 10008 46886 1 heparin injection
## 3 10029 47034 4 ampicillin/sulbactam IVPB UH
## 4 10063 47240 1 fentaNYL injection UH
## MEDICATION_SUMMARY
## 1 (Zantac) 150 mg tablet oral two times a day
## 2 5,000 unit subcutaneous three times a day
## 3 (Unasyn) 15 g IV every 6 hours
## 4 25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol
## DOSE UNIT FREQUENCY TOTAL_DOSE_COUNT
## 1 150 mg two times a day 5
## 2 5000 unit three times a day 3
## 3 1.5 g every 6 hours 11
## 4 50 microgram every 5 minutes 2
head(df$MEDICATION_SUMMARY,4)
## [1] "(Zantac) 150 mg tablet oral two times a day"
## [2] "5,000 unit subcutaneous three times a day"
## [3] "(Unasyn) 15 g IV every 6 hours"
## [4] "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol"
head(df$MEDICATION_DESC,4)
## [1] "ranitidine" "heparin injection"
## [3] "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH"
head(df$seer_stage)
## [1] 1 1 4 1 9 1
The columns of interest for analysis is “seer_stage” and “medication_summary”. seer_stage column has catergorical number: 0,1,2,3,4,5,7,8,9*. So the first step would be to change the datatype to factor
df$seer_stage <- as.factor(df$seer_stage)
class(df$seer_stage)
## [1] "factor"
table(df$seer_stage) # use table to check the count in each category
##
## 0 1 2 3 4 5 7 8 9
## 21 265 53 90 46 18 87 14 68
round(prop.table(table(df$seer_stage))*100,2) # 3,7 and 9 have more cases
##
## 0 1 2 3 4 5 7 8 9
## 3.17 40.03 8.01 13.60 6.95 2.72 13.14 2.11 10.27
From the seer_stage data, a new column can be created with two levels - early and advanced. Lets name the column “stage”. Lets check the seer_stage column for values 5 to 7. The answer will be in “True” or “False”. The next step is to change the boolean value to “early” and “advanced”.
df$stage <- df$seer_stage %in% c(5:9)
df$stage <- factor(df$stage,levels = c(F,T),labels = c("early","advanced"))
head(df$stage)
## [1] early early early early advanced early
## Levels: early advanced
str(df$stage)
## Factor w/ 2 levels "early","advanced": 1 1 1 1 2 1 1 1 2 1 ...
table(df$stage)
##
## early advanced
## 475 187
prop.table(table(df$stage))*100
##
## early advanced
## 71.75227 28.24773
Lets focus on the other column of interest - Medication summary. As Medication summary contains sentences, the words should be tokenized for analysis. Document Term Matrix is used for tokenization of document. In order to do that, the data should be changed to Corpus.
library(caTools)
library(tm)
## Loading required package: NLP
library(gmodels)
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
## The following object is masked from 'package:utils':
##
## View
library(e1071)
Text mining package (tm) is used to cleanse the data.
df_corpus <- Corpus(VectorSource(df$MEDICATION_SUMMARY))
df_corpus[2]$content
## [1] "5,000 unit subcutaneous three times a day"
df_corpus <- tm::tm_map(df_corpus,tolower)
## Warning in tm_map.SimpleCorpus(df_corpus, tolower): transformation drops
## documents
df_corpus <- tm::tm_map(df_corpus,removeNumbers)
## Warning in tm_map.SimpleCorpus(df_corpus, removeNumbers): transformation drops
## documents
df_corpus <- tm::tm_map(df_corpus,removePunctuation)
## Warning in tm_map.SimpleCorpus(df_corpus, removePunctuation): transformation
## drops documents
df_corpus <- tm::tm_map(df_corpus,stripWhitespace)
## Warning in tm_map.SimpleCorpus(df_corpus, stripWhitespace): transformation drops
## documents
df_corpus <- tm::tm_map(df_corpus,removeWords,stopwords('en'))
## Warning in tm_map.SimpleCorpus(df_corpus, removeWords, stopwords("en")):
## transformation drops documents
df_corpus[2]$content
## [1] " unit subcutaneous three times day"
Create Document term matrix
df_dtm <- DocumentTermMatrix(df_corpus)
df_dtm
## <<DocumentTermMatrix (documents: 662, terms: 392)>>
## Non-/sparse entries: 4742/254762
## Sparsity : 98%
## Maximal term length: 20
## Weighting : term frequency (tf)
inspect(df_dtm)
## <<DocumentTermMatrix (documents: 662, terms: 392)>>
## Non-/sparse entries: 4742/254762
## Sparsity : 98%
## Maximal term length: 20
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs blood day every glucose hours oral pain prn tablet times
## 132 8 1 0 8 0 0 0 0 0 1
## 195 32 0 0 33 0 0 0 0 0 0
## 196 32 0 0 33 0 0 0 0 0 0
## 227 8 1 0 8 0 0 0 0 0 1
## 269 8 1 0 8 0 0 0 0 0 1
## 277 8 1 0 8 0 0 0 0 0 1
## 335 8 1 0 8 0 0 0 0 0 1
## 336 8 1 0 8 0 0 0 0 0 1
## 39 8 1 0 8 0 0 0 0 0 1
## 67 8 1 0 8 0 0 0 0 0 1
From the above result, we know that the Document term matrix has 662 documents/rows and 392 terms/columns. Sparsity percentage is 98%, so data is present in only (1-0.98 = 0.12) i.e. 12%. To reduce sparsity, we can use quanteda library and filter out the terms that are present in atleast five documents. note: quanteda has all functions required for working with Corpus and Document Term matrix.As I haven’t tried the other functions, I used tm package for data cleansing.
df_dfm <- quanteda::as.dfm(df_dtm) # change the dataset to frequency matrix to calculate frequency
df_dfm <- quanteda::dfm_trim(df_dfm,min_docfreq = 5) # extract data that is available in atleast 5 rows
head(df_dfm) # Sparsity reduced to 94.4% and features reduced to 116 from 392
## Document-feature matrix of: 6 documents, 116 features (95.1% sparse).
## features
## docs day oral tablet times two zantac subcutaneous three unit every
## 1 1 1 1 1 1 1 0 0 0 0
## 2 1 0 0 1 0 0 1 1 1 0
## 3 0 0 0 0 0 0 0 0 0 1
## 4 0 0 0 0 0 0 0 0 0 1
## 5 0 1 1 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 1
## [ reached max_nfeat ... 106 more features ]
df_dtm <- quanteda::as.DocumentTermMatrix(df_dfm) # change dfm to dtm
inspect(df_dtm)
## <<DocumentTermMatrix (documents: 662, terms: 116)>>
## Non-/sparse entries: 4276/72516
## Sparsity : 94%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs blood day every glucose hours oral pain prn tablet times
## 132 8 1 0 8 0 0 0 0 0 1
## 195 32 0 0 33 0 0 0 0 0 0
## 196 32 0 0 33 0 0 0 0 0 0
## 227 8 1 0 8 0 0 0 0 0 1
## 269 8 1 0 8 0 0 0 0 0 1
## 39 8 1 0 8 0 0 0 0 0 1
## 423 8 1 0 8 0 0 0 0 0 1
## 473 8 1 0 8 0 0 0 0 0 1
## 477 8 1 0 8 0 0 0 0 0 1
## 67 8 1 0 8 0 0 0 0 0 1
quanteda::textplot_wordcloud(df_dfm,min_size=0.5,max_size=4)
As naive bayes works on categorical data, the values of above Document term matrix should be changed to factor of “yes” or “no”. A function should be created to assign ‘yes’ to values equal to or greater than 1 and ‘no’ to values equal to 0.
count_fct <- function(value){
value <- factor(ifelse(value>0,1,0),levels = c(0,1),labels = c("no","yes"))
return(value)
}
df_dtm1 <- apply(df_dtm,MARGIN = 2,count_fct)
The data is split into 80-20%. sample that has 80% data is assigned to training set and the remaining is assigned to testing set.
set.seed(1234)
df_sample <- caTools::sample.split(df,SplitRatio = 0.8)
df_train <- df[df_sample, ]
df_test <- df[!df_sample, ]
round(prop.table(table(df_train$stage))*100,1)
##
## early advanced
## 71.7 28.3
round(prop.table(table(df_test$stage))*100,1)
##
## early advanced
## 72 28
dtm1_train <- df_dtm1[df_sample, ]
dtm1_test <- df_dtm1[!df_sample, ]
str(dtm1_train)
## chr [1:530, 1:116] "yes" "yes" "no" "no" "no" "yes" "no" "no" "no" "yes" ...
## - attr(*, "dimnames")=List of 2
## ..$ Docs : chr [1:530] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:116] "day" "oral" "tablet" "times" ...
str(dtm1_test)
## chr [1:132, 1:116] "no" "no" "yes" "yes" "no" "no" "no" "yes" "no" "yes" ...
## - attr(*, "dimnames")=List of 2
## ..$ Docs : chr [1:132] "5" "9" "15" "19" ...
## ..$ Terms: chr [1:116] "day" "oral" "tablet" "times" ...
table(dtm1_test)
## dtm1_test
## no yes
## 14463 849
table(dtm1_train)
## dtm1_train
## no yes
## 58053 3427
corpus_train <- df_corpus[df_sample]
corpus_test <- df_corpus[!df_sample]
inspect(corpus_train[1:4])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 4
##
## [1] zantac mg tablet oral two times day
## [2] unit subcutaneous three times day
## [3] unasyn g iv every hours
## [4] microgram iv every minutes prn severe pain maximum dose mcg per pacu protocol
Lets apply naive bayes classifier
classifier <- e1071::naiveBayes(dtm1_train,df_train$stage)
df_predict <- predict(classifier,df_test$stage)
table(df_predict) # it did not predict any advanced stage
## df_predict
## early advanced
## 132 0
table(df_test$stage)
##
## early advanced
## 95 37
Cross table to check the accuracy of the model
gmodels::CrossTable(df_predict,df_test$stage)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 132
##
##
## | df_test$stage
## df_predict | early | advanced | Row Total |
## -------------|-----------|-----------|-----------|
## early | 95 | 37 | 132 |
## | 0.720 | 0.280 | |
## -------------|-----------|-----------|-----------|
## Column Total | 95 | 37 | 132 |
## -------------|-----------|-----------|-----------|
##
##
**Accuracy = 95/132 = 72%
Lets change the laplace value and then check the data
classifier <- e1071::naiveBayes(dtm1_train,df_train$stage,laplace=2)
df_predict <- predict(classifier,df_test$stage)
table(df_predict) # it did not predict any advanced stage
## df_predict
## early advanced
## 132 0
table(df_test$stage)
##
## early advanced
## 95 37
A change in laplace value did not change the result.
Note: Try removing seer_stage 8 and 9 from dataset