If you are a company that deals with a lot of health-care related data, one of the challenges at hand is that the data is complex and unstructured. Health-care data is collected from various sources such as medical prescriptions, diagnostic test reports and insurance claims, and it then needs to be digitized in order to make full use of it through the power of data science.
Ekincare is a health-care analytics company which primarily deals with customers' health records and helps them manage and track their health. A huge amount of raw data is available in the form of the customers' diagnostic test reports, such as blood profiles, X-ray scans, CT scans, MRIs, organ function tests etc.
Without reading what a diagnostic report says, it is difficult to know which diagnostic test it is, and it would take forever to read each and every description and tag the report with the name of its corresponding diagnostic test procedure. We can either do it the hard way, where the descriptions of the diagnostic reports are manually read and tagged, or the easy way: let the machine do the job for us.
We use a bag-of-words approach to extract features from the text documents (descriptions of test results), which are then used to train machine learning algorithms such as GBM (gradient boosting machine).
Below is a high-level overview of the machine learning pipeline for this study.
DATA PREPARATION
Brown Box: Since we are going to use supervised learning methods, we randomly select some data and add labels manually. In this case we have six labels, making this a multiclass classification problem.
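By way of illustration, the manually labelled data could look like the toy table below; the descriptions here are made-up examples, not actual records, although the class labels are the ones used in this study.
library(tibble)
# hypothetical labelled examples: a free-text description and its assigned class
labelled_sample <- tribble(
  ~description,                                      ~title,
  "pus cells 2-3 /hpf, few epithelial cells seen",   "Body Fluid Analysis",
  "both kidneys are normal in size and echotexture", "Diagnostic Imaging",
  "advised low fat diet and regular exercise",       "Doctors Advice"
)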
Violet Boxes: Once we have the labels added to the respective descriptions, the next step is to split each description into meaningful words/terms and measure the frequency of occurrence of each tokenized term. Terms which are rare and specific to a particular document are upweighted, while terms which are common to all documents are downweighted. The top 20 terms with the highest TF-IDF (term frequency-inverse document frequency) for each class label are selected, in order to identify the terms which matter most for predicting the right class label.
Blue Boxes: Next, a word vector is built from the tokenized terms, followed by a corpus. Pre-processing steps such as removing numbers, punctuation, whitespace and stopwords, and lower-casing the terms, are applied to the corpus. From the corpus, a document-term matrix is built and converted to a count matrix, where the columns are terms and the rows contain the frequency of their occurrence in a sentence/document. The count matrix is then transformed into a binary instance matrix which indicates whether or not a word is present in a document.
MODEL BUILDING AND EVALUATION
GBM is trained on the above data and evaluated for its performance.
library(readr)
notes <- read_csv("C:/Users/welcome/Downloads/train_notes (1).csv")
notes$title <- as.factor(notes$title) # convert title to factor
notes <- notes[,-c(1,4)] # remove the first and fourth columns
Check for missing values
sum(is.na(notes$description)) # check for missing values
## [1] 910
missing_values <- which(is.na(notes$description)) # filter missing values
notes <- notes[-missing_values,] # remove missing values from the original dataset
library(tidytext)
library(tidyverse)
# word count of each word in corresponding labels
words <- notes %>% unnest_tokens(word, description) %>%
count(title, word, sort = TRUE) %>% ungroup()
# total number of words for each label
total_words <- words %>% group_by(title) %>% summarize(total = sum(n))
# join words and total_words d.f's
notes_words <- left_join(words, total_words)
# visualize words distribution
ggplot(notes_words, aes(n/total, fill = title)) +
geom_histogram(show.legend = FALSE) +
facet_wrap(~ title, ncol = 3, scales = "free") +
ggtitle(" Term frequency distribution")
The tails of the above term frequency distributions are the few words which occur very frequently, and the peaks are the many words which occur rarely. Diagnostic Imaging has a higher proportion of rarely occurring words, while the Patient Related term frequency distribution has a longer tail, indicating words which occur frequently.
freq_by_rank <- notes_words %>%
group_by(title) %>% # group by title
mutate(rank = row_number(), term_frequency = n/total) # rank frequency of each word in document
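freq_by_rank is not plotted in this write-up; if you want to visualise it, a quick Zipf-style plot of term frequency against rank on log-log scales is one option:
# Zipf's law: term frequency falls off roughly as a power of rank
freq_by_rank %>%
  ggplot(aes(rank, term_frequency, color = title)) +
  geom_line(show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Zipf's law for diagnostic report descriptions")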
words_tfidf <- notes_words %>%
bind_tf_idf(word, title,n) # weights for words according to term frequency- inverse document frequency
words_tfidf <- words_tfidf %>%
select(-total) %>% # remove total column
arrange(desc(tf_idf)) # arrange weights in descending order
words_tfidf %>%
arrange(desc(tf_idf)) %>% # arrange weights in descending order
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(title) %>% # group by title
top_n(20) %>% # top 20 words/terms
ungroup %>%
ggplot(aes(word, tf_idf, fill = title)) + # plot weights for the top 20 words
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf_idf") +
facet_wrap(~ title, ncol = 2, scales = "free") + # one panel per label
coord_flip() + # horizontal bars
ggtitle(" Highest tf-idf words")
Term frequency-inverse document frequency upweights the rare words and downweights the words which occur frequently, so these plots are roughly the opposite of the term frequency plots.
names_list <- words_tfidf %>%
arrange(desc(tf_idf)) %>% # arrange weights in descending order
mutate(word = factor(word, levels = rev(unique(word)))) %>% # each words as a factor
group_by(title) %>% # group by titles
top_n(20) %>% # get top 20 words with highest weights
ungroup %>% # ungroup titles
select(word) # select words/Terms
d <- names_list %>% apply(., 2, function(x) nchar(x) >= 3) # keep terms with at least 3 characters
listnames <- names_list[which(d),] %>% filter(!word %in% c('0.30', '0.40', '0.00')) # drop numeric strings from the names list
#listnames <- as.list(listnames)
#lapply(listnames$word, function(x) cat(shQuote(x), "\n"))[0] %>% unlist() # get the names of top 25 words for each class label
Terms with high TF-IDF weights
names_list <- c("urea", "nitrogen", "delta", "hpf", "epithelial", "bilirubin",
"conjugated", "direct", "pain", "urobilinogen", "exercise",
"diet", "cells", "specimen", "pus", "adequacy", "cold",
"intraepithelial","ear","bladder", "percentage", "slide",
"provided","transformation", "covering", "sampling", "interpretability",
"sampled", "composition", "information", "identification", "categorization" ,
"wax", "lipid", "headache", "pains", "malignancy", "cough", "size", "squamous",
"fat", "kidneys", "measures", "throat" ,"identified" , "cycles", "prep",
"cellular", "calculi", "months", "zone", "preserved", "cast", "evidence",
"daily", "appears", "repeat", "ache", "spleen", "tab" , "pancreas" , "intake",
"wbc", "colour", "gall", "echopattern", "echoanatomy", "water", "complaints",
"weight", "weakness", "appear" , "thickness", "discharge" , "echotexture",
"yellow", "prostate", "absent", "vitamin", "joint" , "brisk" , "round",
"fit" , "pathological", "leucocytes" , "walking", "occasional", "bun", "regular",
"colourless", "itching", "acne", "discomfort", "vision", "medically" , "urine" )
require(tm) # load text mining package
sd <- VectorSource(notes$description) # words vector
corpus <- Corpus(sd) # build corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # remove white spaces
corpus <- tm_map(corpus, removeWords, c(stopwords('english'), "and", "are", "the",
"both", "appears", "within", "appear",
"others", "clear", "right", "seen",
"well")) # remove stopwords
corpus <- tm_map(corpus, content_transformer(tolower)) # change to lower case
tdm <- DocumentTermMatrix(corpus) # build the document-term matrix (highly sparse)
tdm_dm <- as.data.frame(as.matrix(tdm)) # count matrix
tdm_df <- as.matrix((tdm_dm > 0) + 0) # binary instance matrix
tdm_df <- as.data.frame(tdm_df) # convert to data frame
tdm_df <- cbind(tdm_df, notes$title) # append label column from original dataset
final <- tdm_df %>% select(names_list) %>% cbind(notes$title) # select the columns corresponding to the keywords for all class labels
s <- sample(1:nrow(final), nrow(final)*(0.70), replace = FALSE) # random sampling
train <- final[s,] # training set
test <- final[-s,] # testing set
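Before handing the data over to the model, a couple of quick sanity checks are worthwhile: the feature matrix should have 97 columns (the 96 term columns plus the label, which ends up as the last column), and the class distribution of the random split is worth a look given the imbalance discussed later:
dim(final) # expect 97 columns: 96 term columns plus the label
table(final[, ncol(final)]) # class counts for the full dataset
prop.table(table(train[, ncol(train)])) # class proportions in the training set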
library(h2o) # load h2o for GBM
h2o.init() # start a local h2o cluster
train.h2o <- as.h2o(train) # convert the train/test data frames to h2o frames
test.h2o <- as.h2o(test)
y.dep <- 97 # dependent variable (the label column)
x.indep <- c(1:96) # independent variables (the 96 term columns)
gbm <- h2o.gbm(y = y.dep, x = x.indep, training_frame = train.h2o,
               ntrees = 200, learn_rate = 0.1, stopping_rounds = 5, seed = 1234)
Confusion Matrix
h2o.confusionMatrix(gbm, test.h2o)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                      Body Fluid  Cytology  Diagnostic  Doctors  Organ Function  Patient                  Error
##                        Analysis      Test     Imaging   Advice            Test  Related                   Rate
## Body Fluid Analysis        1066         6         106        0               0        0  0.0951 =  112 / 1,178
## Cytology Test                 6       409          50        0               0        1  0.1223 =     57 / 466
## Diagnostic Imaging            0         0        1677        1               0        2  0.0018 =    3 / 1,680
## Doctors Advice                5         0         116      212               0        3  0.3690 =    124 / 336
## Organ Function Test           0         0          16        0             172        0  0.0851 =     16 / 188
## Patient Related               1         0          19        5               0       19  0.5682 =      25 / 44
## Totals                     1078       415        1984      218             172       25  0.0866 =  337 / 3,892
Out of the six classes, Patient Related and Doctors Advice have a higher misclassification error than the other classes. Diagnostic Imaging was classified almost perfectly, with a misclassification error of only 0.18% (3 out of 1,680). From the term frequency distribution plots, we had seen that Diagnostic Imaging had more unique words while Patient Related had more frequently occurring words than unique words, which possibly explains these results. The overall misclassification error is about 8.7%, which is not bad. However, we can further improve the model accuracy by tuning the hyperparameters of the GBM.
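Given the class imbalance (1,680 Diagnostic Imaging cases against only 44 Patient Related cases in the test set), the overall error understates how poorly the smaller classes are handled: the mean of the six per-class error rates above is roughly 21%, well above the 8.7% overall error. A quick way to check this directly in h2o, assuming the gbm and test.h2o objects built above:
perf <- h2o.performance(gbm, newdata = test.h2o) # metrics on the hold-out set
h2o.mean_per_class_error(perf) # average of the per-class error rates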
As this is a multiclass problem, there were two ways to go about it. One was to use a one-vs-all approach, where a classification model is fitted for each class label and, based on the resulting probabilities, we vote for the class with the highest probability for a particular case. The other approach was to use an algorithm which handles multiple classes and class imbalance efficiently. Since GBM is a boosting technique, and h2o's implementation handles multinomial classification directly, it takes care of both of the above concerns.
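For completeness, here is a minimal sketch of the one-vs-all alternative using plain logistic regression (glm) on the same binary term matrix; it is illustrative only and was not part of this study:
labels <- levels(train[, ncol(train)]) # the class labels
# fit one binary classifier per class and collect the predicted probabilities
probs <- sapply(labels, function(lbl) {
  y <- as.numeric(train[, ncol(train)] == lbl) # 1 for this class, 0 for the rest
  fit <- glm(y ~ ., data = cbind(train[, -ncol(train)], y), family = binomial)
  predict(fit, newdata = test[, -ncol(test)], type = "response")
})
# vote for the class with the highest probability for each test case
pred_ova <- labels[max.col(probs)]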
The reason for going with a sparse matrix is to be able to include all the terms from the documents, so that we can select the terms with the highest TF-IDF weights. Adjusting the sparsity can result in losing some of the important terms.
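For comparison, tm's removeSparseTerms() is the usual way to shrink a document-term matrix by dropping terms that appear in very few documents; even a mild threshold can discard rare but highly discriminative terms, which is why the full sparse matrix was kept here. The 0.99 threshold below is only an illustration:
tdm_small <- removeSparseTerms(tdm, 0.99) # keep only terms present in at least ~1% of documents
dim(tdm) # documents x terms before pruning
dim(tdm_small) # far fewer terms survive the pruning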
TF-IDF gives weights for all the terms in the documents. Words which are unique and rare are upweighted, making them important for correctly classifying a label, while frequent words such as stopwords are downweighted and treated as less important. TF, on the other hand, only shows how frequently a term appears in a document, which can be a bit misleading.
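A tiny made-up example shows why raw term frequency can mislead: a filler word like "normal" can have the highest TF in every report, but because it appears in every document its IDF (and hence its TF-IDF) is zero, while rarer terms keep their weight:
library(tidytext)
library(dplyr)
toy <- tibble::tibble(doc  = c("report1", "report1", "report2", "report2"),
                      word = c("normal", "bilirubin", "normal", "kidneys"),
                      n    = c(10, 2, 8, 3))
toy %>% bind_tf_idf(word, doc, n)
# "normal" occurs in both documents, so idf = log(2/2) = 0 and tf_idf = 0;
# "bilirubin" and "kidneys" occur in only one document each, so idf = log(2/1) > 0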
Selecting 20 words for each class gives close to 100 variables (columns); after filtering out short and numeric strings, 96 terms remain as features. We can say that these words are unique and rare for their respective classes and carry higher weights than words which occur frequently. The dimension of the data is neither too complex, with a lot of features, nor too simple, with very few features.
Tuning hyperparameters such as the depth of trees, the number of trees, the learning rate etc. can further improve the model performance.
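For example, a grid search over a few GBM hyperparameters could be set up with h2o.grid; the parameter values below are arbitrary starting points for illustration, not tuned choices:
hyper_params <- list(max_depth = c(3, 5, 7), # tree depth
                     learn_rate = c(0.05, 0.1), # shrinkage
                     ntrees = c(100, 200)) # number of trees
gbm_grid <- h2o.grid("gbm", x = x.indep, y = y.dep,
                     training_frame = train.h2o,
                     hyper_params = hyper_params, seed = 1234)
# rank the grid models by logloss and pick the best one
grid_sorted <- h2o.getGrid(gbm_grid@grid_id, sort_by = "logloss", decreasing = FALSE)
best_gbm <- h2o.getModel(grid_sorted@model_ids[[1]])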