If you are a company that deals with a lot of health-care related data, one of the challenges at hand is that the data you work with is complex and unstructured. Health-care data is collected from various sources such as medical prescriptions, diagnostic test reports, insurance claims etc., and it has to be digitized before the power of data science can be fully applied to it.

Ekincare is a health-care analytics company which primarily deals with customers' health records and helps them manage and track their health. A huge amount of raw data is available in the form of the customers' diagnostic test reports, such as blood profiles, X-ray scans, CT scans, MRIs, body organ function tests etc.

Without reading what a diagnostic report says, it is difficult to know which diagnostic test it is, and it would take forever to read each and every description and tag the report with its corresponding diagnostic test procedure's name. We can either do it the hard way, where the descriptions of the diagnostic reports are manually read and tagged, or the easy way, where we let the machine do the job for us.

We use a bag-of-words approach to extract features from the text documents (descriptions of test results), which are then used to train machine learning algorithms such as GBM.
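As a toy illustration of the bag-of-words idea (the two descriptions below are made up for illustration and are not taken from the project data), each piece of text is reduced to counts of the terms it contains:

library(tm)

docs <- c("serum urea nitrogen within normal limits",
          "no evidence of malignancy in the sampled cells")  # made-up descriptions

toy_corpus <- Corpus(VectorSource(docs))  # build a corpus from the two texts

toy_dtm <- DocumentTermMatrix(toy_corpus)  # rows = documents, columns = terms

inspect(toy_dtm)  # cells hold how often each term occurs in each document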

Machine Learning Pipeline

Below is a high-level overview of the machine learning pipeline for this study.

DATA PREPARATION

Brown Box: Since we are going to use supervised learning methods, we randomly select some data and add labels manually. In this case we have six labels, making this a multiclass classification problem.

Violet Boxes: Once we have the labels added to the respective descriptions, the next step is to split each description into meaningful words/terms and measure the frequency of occurrence of each tokenized term. Terms which are rare and specific to a particular document are upweighted, and terms which are common across all documents are downweighted. The top 20 terms with the highest TF-IDF (Term Frequency-Inverse Document Frequency) for each class label are selected, since these are the terms most useful for identifying the right class label.

Blue Boxes: Next, a word vector is built from the tokenized terms, followed by a corpus. Pre-processing techniques such as removing numbers, punctuation, whitespace and stopwords, and converting terms to lower case, are applied to the corpus. From the corpus, a document term matrix is built and converted to a count matrix, where the columns are terms and the rows contain the frequency of their occurrence in each sentence/document. The count matrix is then transformed into a binary instance matrix which indicates whether or not a word is present in a document.

MODEL BUILDING AND EVALUATION

A GBM is trained on the above data and evaluated for its performance.

Data Preparation

library(readr)
notes <- read_csv("C:/Users/welcome/Downloads/train_notes (1).csv")

notes$title <- as.factor(notes$title) # convert title to factor

notes <- notes[,-c(1,4)] # remove the first and fourth columns

Check for missing values

sum(is.na(notes$description))  # check for missing values
## [1] 910
missing_values <- which(is.na(notes$description))  # row indices of missing descriptions

notes <- notes[-missing_values,]  # remove missing values from the original dataset
library(tidytext)
library(tidyverse)
# count how often each word occurs under each label
words <- notes %>% unnest_tokens(word, description) %>% 
  count(title, word, sort = TRUE) %>% ungroup()


# total number of words for each label

total_words <- words %>% group_by(title) %>% summarize(total = sum(n)) 


# join words and total_words d.f's
notes_words <- left_join(words, total_words)

# visualize words distribution
ggplot(notes_words, aes(n/total, fill = title)) + 
  geom_histogram(show.legend = FALSE) + 
  facet_wrap(~ title, ncol = 3, scales = "free") +
  ggtitle("Term frequency distribution") +
  theme(plot.title = element_text(hjust = 0.5))  # centre the plot title

In the term frequency distributions above, the long tails on the right correspond to the few words which occur very frequently, while the peaks near zero correspond to the many words which occur rarely. Diagnostic Imaging is dominated by rarely occurring words, whereas Patient Related has a longer tail, indicating more words which occur frequently.

freq_by_rank <- notes_words %>% 
  group_by(title) %>% # group by title
  mutate(rank = row_number(), term_frequency = n/total) # rank words within each label by frequency and compute their term frequency
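freq_by_rank captures the familiar Zipf-like relationship between a word's rank and its frequency. A quick way to visualize it (a small optional sketch using the columns computed above) is:

# rank vs. term frequency on log-log scales, one line per label
ggplot(freq_by_rank, aes(rank, term_frequency, colour = title)) +
  geom_line(show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()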


words_tfidf <- notes_words %>% 
  bind_tf_idf(word, title,n)  # weights for words according to term frequency- inverse document frequency



words_tfidf <- words_tfidf %>% 
  select(-total) %>% # remove total column
  arrange(desc(tf_idf)) # arrange weights in descending order


words_tfidf %>% 
  arrange(desc(tf_idf)) %>%  # arrange weights in descending order
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(title) %>%  # group by title
  top_n(20) %>%  # top 20 words/terms
  ungroup %>% 
  ggplot(aes(word, tf_idf, fill = title)) +  # plot weights for the top 20 words
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf_idf") +
  facet_wrap(~ title, ncol = 2, scales = "free") + # one panel per label
  coord_flip() + # horizontal bars
  ggtitle("Highest tf-idf words") +
  theme(plot.title = element_text(hjust = 0.5))  # centre the plot title

Term Frequency-Inverse Document Frequency upweights the rare words and downweights the words which occur frequently across all labels, so these plots are essentially the opposite of the term frequency plots.

names_list <- words_tfidf %>% 
  arrange(desc(tf_idf)) %>% # arrange weights in descending order
  mutate(word = factor(word, levels = rev(unique(word)))) %>% # each word as a factor
  group_by(title) %>%  # group by title
  top_n(20) %>%  # get the top 20 words with the highest weights
  ungroup %>% # ungroup titles
  select(word)  # select the words/terms


d <- names_list %>% apply(., 2, function(x) nchar(x) >= 3) # keep terms with at least 3 characters

listnames <- names_list[which(d),] %>% filter(!word %in% c('0.30', '0.40', '0.00')) # drop numeric artefacts from the names list

#listnames <- as.list(listnames)

#lapply(listnames$word, function(x) cat(shQuote(x), "\n"))[0] %>% unlist()  # get the names of the top 20 words for each class label

Terms with high TF-IDF weights

names_list <- c("urea", "nitrogen", "delta", "hpf", "epithelial", "bilirubin",
                "conjugated", "direct", "pain", "urobilinogen", "exercise",
                  "diet", "cells", "specimen", "pus", "adequacy", "cold", 
                "intraepithelial","ear","bladder", "percentage", "slide", 
                "provided","transformation", "covering", "sampling", "interpretability",
                "sampled", "composition", "information", "identification", "categorization" ,
                "wax", "lipid", "headache", "pains", "malignancy", "cough", "size", "squamous",
                "fat", "kidneys", "measures", "throat" ,"identified" , "cycles", "prep", 
                "cellular", "calculi", "months", "zone", "preserved", "cast", "evidence",
                "daily", "appears", "repeat", "ache", "spleen", "tab" , "pancreas" , "intake",
                "wbc", "colour", "gall", "echopattern", "echoanatomy", "water", "complaints",
                "weight", "weakness", "appear" , "thickness", "discharge" , "echotexture",
                "yellow", "prostate", "absent", "vitamin", "joint" , "brisk" , "round", 
                "fit" , "pathological", "leucocytes" , "walking", "occasional", "bun", "regular", 
                "colourless", "itching", "acne", "discomfort", "vision", "medically" , "urine" )

require(tm)  # load text mining package

sd <- VectorSource(notes$description) # words vector

corpus <- Corpus(sd)  # build corpus

corpus <- tm_map(corpus, removeNumbers)  # remove numbers

corpus <- tm_map(corpus, removePunctuation) # remove punctuation

corpus <- tm_map(corpus, stripWhitespace) # remove extra whitespace

corpus <- tm_map(corpus, removeWords, c(stopwords('english'), "and", "are", "the",
                                        "both", "appears", "within", "appear",
                                        "others", "clear", "right", "seen", 
                                        "well")) # remove stopwords

corpus <- tm_map(corpus, content_transformer(tolower))  # change to lower case

tdm <- DocumentTermMatrix(corpus)  # build document term matrix which is 100 % sparse

tdm_dm <- as.data.frame(as.matrix(tdm)) # count matrix

tdm_df <- as.matrix((tdm_dm > 0) + 0) # binary instance matrix

tdm_df <- as.data.frame(tdm_df)  # convert to data frame

tdm_df <- cbind(tdm_df, notes$title) # append the label column from the original dataset

final <- tdm_df %>% select(all_of(names_list)) %>% cbind(notes$title)  # keep only the 96 keyword columns, then append the label column
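Before modelling, a quick sanity check on the assembled data (a small optional sketch; by construction the label is the last, 97th column) confirms the shape and the class distribution:

dim(final)  # expect 97 columns: 96 keyword indicators plus the label

table(final[, 97])  # how many descriptions fall under each class label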

Model Building

s <- sample(1:nrow(final), nrow(final)*(0.70), replace = FALSE) # random sampling

train <- final[s,] # training set

test <- final[-s,] # testing set

library(h2o)  # load h2o for the GBM

h2o.init()  # start a local H2O cluster

train.h2o <- as.h2o(train)  # convert the training set to an H2O frame
test.h2o <- as.h2o(test)    # convert the testing set to an H2O frame

y.dep <- 97 # dependent variable (the label column)

x.indep <- c(1:96) # independent variables (the keyword columns)

gbm <- h2o.gbm(y = y.dep, x = x.indep, training_frame = train.h2o, 
               ntrees = 200, learn_rate = 0.1, stopping_rounds = 5, seed = 1234)

Model Evaluation

Confusion Matrix

h2o.confusionMatrix(gbm, test.h2o)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                     Body Fluid Analysis Cytology Test Diagnostic Imaging
## Body Fluid Analysis                1066             6                106
## Cytology Test                         6           409                 50
## Diagnostic Imaging                    0             0               1677
## Doctors Advice                        5             0                116
## Organ Function Test                   0             0                 16
## Patient Related                       1             0                 19
## Totals                             1078           415               1984
##                     Doctors Advice Organ Function Test Patient Related
## Body Fluid Analysis              0                   0               0
## Cytology Test                    0                   0               1
## Diagnostic Imaging               1                   0               2
## Doctors Advice                 212                   0               3
## Organ Function Test              0                 172               0
## Patient Related                  5                   0              19
## Totals                         218                 172              25
##                      Error          Rate
## Body Fluid Analysis 0.0951 = 112 / 1,178
## Cytology Test       0.1223 =    57 / 466
## Diagnostic Imaging  0.0018 =   3 / 1,680
## Doctors Advice      0.3690 =   124 / 336
## Organ Function Test 0.0851 =    16 / 188
## Patient Related     0.5682 =     25 / 44
## Totals              0.0866 = 337 / 3,892

Out of the six classes, Patient Related and Doctors Advice have a higher misclassification error than the other classes, while Diagnostic Imaging was almost perfectly classified, with a misclassification error of just 0.18%. From the term frequency distribution plots, we had seen that Diagnostic Imaging had more unique words, whereas Patient Related had more frequently occurring words than unique words, which possibly explains these results. The overall misclassification error is about 8.7%, which is not bad. However, we can further improve the model accuracy by tuning the hyperparameters of the GBM.

Some key points about this study

Why GBM?

As this is a multiclass problem, there were two ways to go about it. One was to use a one-vs-all approach, where a classification model is fitted for each class label and, based on the resulting probabilities, the class with the highest probability is chosen for a particular case. The other approach was to use an algorithm which handles multiple classes and class imbalance efficiently. Since GBM is a boosting technique and supports multinomial targets directly, it takes care of both of these concerns.

Why a 100% sparse document term matrix?

The reason for going with a fully sparse matrix is to be able to include all the terms from the documents, so that we can select the terms with the highest weights based on the TF-IDF. Adjusting the sparsity (i.e. dropping sparse terms) can result in losing some of the important terms.
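To see what adjusting sparsity would do, here is a small illustrative sketch using tm's removeSparseTerms on the tdm built earlier (the 0.99 threshold is an arbitrary example value): terms that appear in very few documents are dropped, and many of those are exactly the rare, high TF-IDF terms we want to keep.

tdm_trimmed <- removeSparseTerms(tdm, sparse = 0.99)  # drop terms absent from more than 99% of documents

dim(tdm)          # documents x all terms

dim(tdm_trimmed)  # far fewer terms survive the trimming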

Why TF-IDF (Term frequency-Inverse Document frequency) over TF (Term Frequency)?

TF-IDF gives weights to all the terms in the documents. Words which are unique and rare are upweighted, making them important for correctly classifying a label, while words which are frequent everywhere, such as stopwords, are downweighted and treated as less important. TF, on the other hand, only shows how often a term appears in a document, which can be misleading: a term can be frequent simply because it appears under every label, which makes it useless for discriminating between them.
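A small worked example with toy numbers (not from the project data): treating the six class labels as documents, a term that appears under only one label gets idf = ln(6/1) ≈ 1.79, while a term that appears under all six gets idf = ln(6/6) = 0, so its tf-idf is zero no matter how often it occurs.

tf <- 0.02                # term frequency within one label's text (toy value)

idf_rare   <- log(6 / 1)  # term present under 1 of 6 labels
idf_common <- log(6 / 6)  # term present under all 6 labels

tf * idf_rare    # ~0.036: upweighted, useful for telling labels apart
tf * idf_common  # 0: downweighted, carries no class information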

Why choose only 20 words for each class label?

Twenty words for each class means we end up with close to 100 variables (columns). These words are unique and rare for their respective classes and carry higher weights than words which occur frequently. The dimension of the data is therefore neither too complex, with a lot of features, nor too simple, with too few.

How to improve the model further?

Tuning hyperparameters such as the depth of the trees, the number of trees, the learning rate etc. can further improve the model's performance.
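A minimal sketch of what such tuning could look like with h2o's grid search (the parameter values below are illustrative, not tuned recommendations):

hyper_params <- list(max_depth = c(3, 5, 7),
                     ntrees = c(100, 200, 300),
                     learn_rate = c(0.05, 0.1))  # candidate values to try

grid <- h2o.grid("gbm", grid_id = "gbm_grid",
                 x = x.indep, y = y.dep,
                 training_frame = train.h2o,
                 hyper_params = hyper_params,
                 seed = 1234)

h2o.getGrid("gbm_grid", sort_by = "logloss", decreasing = FALSE)  # models ranked from best to worst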