R Markdown

Naive bayes classification is used on categorical data to study the correlation between two variables.

Naive bayes formula

   <h2>   P(A|B)=P(B|A)P(A)/P(B) </h2>

Lets apply naive bayes classification on Naive bayes on “Head and neck cancer medication case study”.

** Data set Meta data PID:** Patient ID ENC_ID: coded encounter ID Seer Stage : Cancer stages 0 - In - situ, 1 - Localized,2 - Regional to direct extension, 3 - Regional to Lymph nodes, 4 - Both 2 & 3,5 - Regional, 7 - systematic disease, 8 - not applicable,9 - Unknown Medication desc : Description of medication Mediction summary : brief description of medication and usage Dose: dosage of medication Unit: unit of dosage Frequency: Frequency of use in medication summary Total dosage count: Total count of dosage

Lets look into the data first

df <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1",stringsAsFactors = F)
dim(df)        #dimension: 662 X 9

## [1] 662   9

summary(df)

##       PID            ENC_ID        seer_stage    MEDICATION_DESC   
##  Min.   :   22   Min.   :   42   Min.   :0.000   Length:662        
##  1st Qu.: 3818   1st Qu.:15572   1st Qu.:1.000   Class :character  
##  Median : 7600   Median :33116   Median :2.000   Mode  :character  
##  Mean   : 6941   Mean   :31301   Mean   :3.396                     
##  3rd Qu.:10282   3rd Qu.:48372   3rd Qu.:7.000                     
##  Max.   :12340   Max.   :57893   Max.   :9.000                     
##  MEDICATION_SUMMARY     DOSE               UNIT            FREQUENCY        
##  Length:662         Length:662         Length:662         Length:662        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  TOTAL_DOSE_COUNT 
##  Min.   :  1.000  
##  1st Qu.:  1.000  
##  Median :  3.000  
##  Mean   :  6.337  
##  3rd Qu.:  7.000  
##  Max.   :110.000

str(df)

## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...

sapply(df,mode)

##                PID             ENC_ID         seer_stage    MEDICATION_DESC 
##          "numeric"          "numeric"          "numeric"        "character" 
## MEDICATION_SUMMARY               DOSE               UNIT          FREQUENCY 
##        "character"        "character"        "character"        "character" 
##   TOTAL_DOSE_COUNT 
##          "numeric"

head(df,4)

##     PID ENC_ID seer_stage              MEDICATION_DESC
## 1 10000  46836          1                   ranitidine
## 2 10008  46886          1            heparin injection
## 3 10029  47034          4 ampicillin/sulbactam IVPB UH
## 4 10063  47240          1        fentaNYL injection UH
##                                                                              MEDICATION_SUMMARY
## 1                                                   (Zantac) 150 mg tablet oral two times a day
## 2                                                     5,000 unit subcutaneous three times a day
## 3                                                                (Unasyn) 15 g IV every 6 hours
## 4 25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol
##   DOSE      UNIT         FREQUENCY TOTAL_DOSE_COUNT
## 1  150        mg   two times a day                5
## 2 5000      unit three times a day                3
## 3  1.5         g     every 6 hours               11
## 4   50 microgram   every 5 minutes                2

head(df$MEDICATION_SUMMARY,4)

## [1] "(Zantac) 150 mg tablet oral two times a day"                                                  
## [2] "5,000 unit subcutaneous three times a day"                                                    
## [3] "(Unasyn) 15 g IV every 6 hours"                                                               
## [4] "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol"

head(df$MEDICATION_DESC,4)

## [1] "ranitidine"                   "heparin injection"           
## [3] "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH"

head(df$seer_stage)

## [1] 1 1 4 1 9 1

Data cleansing

The columns of interest for analysis is “seer_stage” and “medication_summary”. seer_stage column has catergorical number: 0,1,2,3,4,5,7,8,9*. So the first step would be to change the datatype to factor

df$seer_stage <- as.factor(df$seer_stage)
class(df$seer_stage)

## [1] "factor"

table(df$seer_stage)                        # use table to check the count in each category

## 
##   0   1   2   3   4   5   7   8   9 
##  21 265  53  90  46  18  87  14  68

round(prop.table(table(df$seer_stage))*100,2)   # 3,7 and 9 have more cases

## 
##     0     1     2     3     4     5     7     8     9 
##  3.17 40.03  8.01 13.60  6.95  2.72 13.14  2.11 10.27

Data preparation

From the seer_stage data, a new column can be created with two levels - early and advanced. Lets name the column “stage”. Lets check the seer_stage column for values 5 to 7. The answer will be in “True” or “False”. The next step is to change the boolean value to “early” and “advanced”.

df$stage <- df$seer_stage %in% c(5:9)
df$stage <- factor(df$stage,levels = c(F,T),labels = c("early","advanced"))
head(df$stage)

## [1] early    early    early    early    advanced early   
## Levels: early advanced

str(df$stage)

##  Factor w/ 2 levels "early","advanced": 1 1 1 1 2 1 1 1 2 1 ...

table(df$stage)

## 
##    early advanced 
##      475      187

prop.table(table(df$stage))*100

## 
##    early advanced 
## 71.75227 28.24773

Lets focus on the other column of interest - Medication summary. As Medication summary contains sentences, the words should be tokenized for analysis. Document Term Matrix is used for tokenization of document. In order to do that, the data should be changed to Corpus.

Loading the required libraries

library(caTools)
library(tm)

## Loading required package: NLP

library(gmodels)
library(quanteda)

## Package version: 2.1.2

## Parallel computing: 2 of 4 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords

## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

## The following object is masked from 'package:utils':
## 
##     View

library(e1071)

Creation of Corpus and Text mining

Text mining package (tm) is used to cleanse the data.

df_corpus <- Corpus(VectorSource(df$MEDICATION_SUMMARY))
df_corpus[2]$content

## [1] "5,000 unit subcutaneous three times a day"

df_corpus <- tm::tm_map(df_corpus,tolower)

## Warning in tm_map.SimpleCorpus(df_corpus, tolower): transformation drops
## documents

df_corpus <- tm::tm_map(df_corpus,removeNumbers)

## Warning in tm_map.SimpleCorpus(df_corpus, removeNumbers): transformation drops
## documents

df_corpus <- tm::tm_map(df_corpus,removePunctuation)

## Warning in tm_map.SimpleCorpus(df_corpus, removePunctuation): transformation
## drops documents

df_corpus <- tm::tm_map(df_corpus,stripWhitespace)

## Warning in tm_map.SimpleCorpus(df_corpus, stripWhitespace): transformation drops
## documents

df_corpus <- tm::tm_map(df_corpus,removeWords,stopwords('en'))

## Warning in tm_map.SimpleCorpus(df_corpus, removeWords, stopwords("en")):
## transformation drops documents

df_corpus[2]$content

## [1] " unit subcutaneous three times  day"

Create Document term matrix

df_dtm <- DocumentTermMatrix(df_corpus)
df_dtm

## <<DocumentTermMatrix (documents: 662, terms: 392)>>
## Non-/sparse entries: 4742/254762
## Sparsity           : 98%
## Maximal term length: 20
## Weighting          : term frequency (tf)

inspect(df_dtm)

## <<DocumentTermMatrix (documents: 662, terms: 392)>>
## Non-/sparse entries: 4742/254762
## Sparsity           : 98%
## Maximal term length: 20
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  blood day every glucose hours oral pain prn tablet times
##   132     8   1     0       8     0    0    0   0      0     1
##   195    32   0     0      33     0    0    0   0      0     0
##   196    32   0     0      33     0    0    0   0      0     0
##   227     8   1     0       8     0    0    0   0      0     1
##   269     8   1     0       8     0    0    0   0      0     1
##   277     8   1     0       8     0    0    0   0      0     1
##   335     8   1     0       8     0    0    0   0      0     1
##   336     8   1     0       8     0    0    0   0      0     1
##   39      8   1     0       8     0    0    0   0      0     1
##   67      8   1     0       8     0    0    0   0      0     1

From the above result, we know that the Document term matrix has 662 documents/rows and 392 terms/columns. Sparsity percentage is 98%, so data is present in only (1-0.98 = 0.12) i.e. 12%. To reduce sparsity, we can use quanteda library and filter out the terms that are present in atleast five documents. note: quanteda has all functions required for working with Corpus and Document Term matrix.As I haven’t tried the other functions, I used tm package for data cleansing.

df_dfm <- quanteda::as.dfm(df_dtm)  # change the dataset to frequency matrix to calculate frequency
df_dfm <- quanteda::dfm_trim(df_dfm,min_docfreq = 5)     # extract data that is available in atleast 5 rows
head(df_dfm)                   # Sparsity reduced to 94.4% and features reduced to 116 from 392

## Document-feature matrix of: 6 documents, 116 features (95.1% sparse).
##     features
## docs day oral tablet times two zantac subcutaneous three unit every
##    1   1    1      1     1   1      1            0     0    0     0
##    2   1    0      0     1   0      0            1     1    1     0
##    3   0    0      0     0   0      0            0     0    0     1
##    4   0    0      0     0   0      0            0     0    0     1
##    5   0    1      1     0   0      0            0     0    0     0
##    6   0    0      0     0   0      0            0     0    0     1
## [ reached max_nfeat ... 106 more features ]

df_dtm <- quanteda::as.DocumentTermMatrix(df_dfm)       # change dfm to dtm
inspect(df_dtm)

## <<DocumentTermMatrix (documents: 662, terms: 116)>>
## Non-/sparse entries: 4276/72516
## Sparsity           : 94%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  blood day every glucose hours oral pain prn tablet times
##   132     8   1     0       8     0    0    0   0      0     1
##   195    32   0     0      33     0    0    0   0      0     0
##   196    32   0     0      33     0    0    0   0      0     0
##   227     8   1     0       8     0    0    0   0      0     1
##   269     8   1     0       8     0    0    0   0      0     1
##   39      8   1     0       8     0    0    0   0      0     1
##   423     8   1     0       8     0    0    0   0      0     1
##   473     8   1     0       8     0    0    0   0      0     1
##   477     8   1     0       8     0    0    0   0      0     1
##   67      8   1     0       8     0    0    0   0      0     1

quanteda::textplot_wordcloud(df_dfm,min_size=0.5,max_size=4)

As naive bayes works on categorical data, the values of above Document term matrix should be changed to factor of “yes” or “no”. A function should be created to assign ‘yes’ to values equal to or greater than 1 and ‘no’ to values equal to 0.

count_fct <- function(value){
  value <- factor(ifelse(value>0,1,0),levels = c(0,1),labels = c("no","yes"))
  return(value)
}

df_dtm1 <- apply(df_dtm,MARGIN = 2,count_fct)

Train and test data

The data is split into 80-20%. sample that has 80% data is assigned to training set and the remaining is assigned to testing set.

set.seed(1234)
df_sample <- caTools::sample.split(df,SplitRatio = 0.8)
df_train <- df[df_sample, ]
df_test <- df[!df_sample, ]
round(prop.table(table(df_train$stage))*100,1)

## 
##    early advanced 
##     71.7     28.3

round(prop.table(table(df_test$stage))*100,1)

## 
##    early advanced 
##       72       28

dtm1_train <- df_dtm1[df_sample, ]
dtm1_test <- df_dtm1[!df_sample, ]
str(dtm1_train)

##  chr [1:530, 1:116] "yes" "yes" "no" "no" "no" "yes" "no" "no" "no" "yes" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:530] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:116] "day" "oral" "tablet" "times" ...

str(dtm1_test)

##  chr [1:132, 1:116] "no" "no" "yes" "yes" "no" "no" "no" "yes" "no" "yes" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:132] "5" "9" "15" "19" ...
##   ..$ Terms: chr [1:116] "day" "oral" "tablet" "times" ...

table(dtm1_test)

## dtm1_test
##    no   yes 
## 14463   849

table(dtm1_train)

## dtm1_train
##    no   yes 
## 58053  3427

corpus_train <- df_corpus[df_sample]
corpus_test <- df_corpus[!df_sample]
inspect(corpus_train[1:4])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 4
## 
## [1] zantac mg tablet oral two times  day                                          
## [2]  unit subcutaneous three times  day                                           
## [3] unasyn g iv every hours                                                       
## [4]  microgram iv every minutes prn severe pain maximum dose mcg per pacu protocol

Final steps

Lets apply naive bayes classifier

classifier <- e1071::naiveBayes(dtm1_train,df_train$stage)
df_predict <- predict(classifier,df_test$stage)
table(df_predict)                         # it did not predict any advanced stage

## df_predict
##    early advanced 
##      132        0

table(df_test$stage)

## 
##    early advanced 
##       95       37

Cross table to check the accuracy of the model

gmodels::CrossTable(df_predict,df_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  132 
## 
##  
##              | df_test$stage 
##   df_predict |     early |  advanced | Row Total | 
## -------------|-----------|-----------|-----------|
##        early |        95 |        37 |       132 | 
##              |     0.720 |     0.280 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        95 |        37 |       132 | 
## -------------|-----------|-----------|-----------|
## 
##

**Accuracy = 95/132 = 72%

Lets change the laplace value and then check the data

classifier <- e1071::naiveBayes(dtm1_train,df_train$stage,laplace=2)
df_predict <- predict(classifier,df_test$stage)
table(df_predict)                         # it did not predict any advanced stage

## df_predict
##    early advanced 
##      132        0

table(df_test$stage)

## 
##    early advanced 
##       95       37

A change in laplace value did not change the result.

Note: Try removing seer_stage 8 and 9 from dataset

Naive Bayes analysis on cancer data

Thrishna Balakrishnan

4/5/2021