Data description:

Number of obs = 12973

Number of columns = 2

column names = “title” and “description”

title contains the test names and description contains the corresponding description of the test outcome

Objective:

Predict the test name from the description

Approach:

Data import:

The dataset, a CSV file, was imported into the R environment. “title”, which has 6 class labels, was converted from character to factor. Rows with missing values were removed from the dataset.

Data preparation:

A. A document term matrix was constructed for each class label to identify key terms in the document that will help in classifying class labels

  1. Create a word vector for description.

  2. Build a corpus using the word vector.

  3. Apply pre-processing tasks: remove numbers, punctuation, whitespace and stopwords, and convert to lower case.

  4. Build a document term matrix (dtm).

  5. Remove sparse words from the above dtm.

  6. The result is a count frequency matrix, with one column per word holding its frequency in each document.

  7. Transform the count frequency matrix to a binary instance matrix, which records whether a word occurs in a document: 1 if present, 0 if absent.

  8. Append the label column from the original notes dataset with the transformed dtm. The label column has 6 labels.

A total of six document term matrices are constructed, one for each class. The sparsity threshold for each dtm was adjusted to identify more key words for each class label.
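Steps 1 to 5 for a single class can be sketched as below. This is only an illustration, since the per-class code is not shown in this report: the toy data frame `notes_toy`, the helper `class_dtm()` and the `sparse = 0.5` threshold are all assumptions made for the example.

```r
library(tm)

# Hypothetical toy data standing in for the real notes (assumption).
notes_toy <- data.frame(
  title = c("Organ Function Test", "Organ Function Test", "Patient Related"),
  description = c("Urea 20 mgdl, creatinine normal",
                  "Blood urea nitrogen and creatinine",
                  "Complaints of cough and cold"),
  stringsAsFactors = FALSE
)

# Build a document term matrix for one class label and drop sparse terms.
class_dtm <- function(data, label, sparse = 0.5) {
  docs <- data$description[data$title == label]
  corpus <- VCorpus(VectorSource(docs))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  dtm <- DocumentTermMatrix(corpus)
  # Keep only terms that are absent from less than a `sparse` share of documents.
  removeSparseTerms(dtm, sparse = sparse)
}

dtm_oft <- class_dtm(notes_toy, "Organ Function Test")
Terms(dtm_oft)  # candidate key words for this class
```

On this toy data only the terms that occur in both documents of the class survive the sparsity filter. Raising `sparse` towards 1 keeps rarer terms, which is one way to surface more key words for a small class.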

Key words from first dtm for class label “Diagnostic Imaging”

bladder, kidneys, liver, lesion, pancreas, spleen, urinary, echotexture, calculi, prostate, intrahepatic, fluid, gall, differentiation

Key words from second dtm for class label “Doctors Advice”

diet, excercise, lipid, fat, profile, water, weight, regular, months, low, repeat

Key words from third dtm for class label “Cytology Test”

negative, smear, cervical, cells, sampled, slide, sampling, lesion, categorization, intraepithelial, malignancy, satisfactory, information, composition, identification, identified, area, adequacy, covering, patient, history

Key words from fourth dtm for class label “Body Fluid Analysis”

epithelial, pus, hpf, rbc, blood, wbcpus, nil, tcsqctransitionalsquamous, urine, absent, cells, cellshpf

Key words from fifth dtm for class label “Organ Function Test”

nitrogen, urea, mgdl, bilirubin, conjugated, delta, creatinine

Key words from sixth dtm for class label “Patient Related”

headache, throat, complaints, cold, pain, ear, cough, discharge

B. A document term matrix with all class labels included is constructed, and the key words from the respective dtm of each class are selected as its columns

C. A gradient boosting model is fit on the above data
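Step B, together with the binary transformation from step 7, can be illustrated on a small count matrix. The matrix, the key word list and the labels below are made up for the example and are not taken from the dataset.

```r
# Hypothetical count matrix: 3 documents x 4 terms (assumption).
counts <- matrix(c(2, 0, 1, 0,
                   0, 3, 0, 1,
                   1, 0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("urea", "cough", "liver", "said")))

# Key words picked from the per-class matrices (illustrative subset).
keywords <- c("urea", "cough", "liver")

# Keep the key word columns only, then turn counts into 0/1 presence flags.
binary <- (counts[, keywords, drop = FALSE] > 0) + 0L

# Attach the class label so the result can be used as model input.
features <- as.data.frame(binary)
features$label <- factor(c("Organ Function Test", "Patient Related",
                           "Organ Function Test"))
features
```

Restricting the matrix to the key words before converting it keeps the model input small: here the non-key word `said` is dropped and every remaining column holds only 0 or 1.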

library(readr)
notes <- read_csv("C:/Users/welcome/Downloads/train_notes (1).csv")

notes$title <- as.factor(notes$title) # convert title from character to factor

notes <- notes[,-1] # remove first column

notes <- notes[complete.cases(notes),]  # remove rows with missing values, if any
library(tm)  # load text mining package

src <- VectorSource(notes$description) # words vector

corpus <- Corpus(src)  # build corpus

corpus <- tm_map(corpus, content_transformer(tolower))  # change to lower case first, so the lower-case stopword list matches

corpus <- tm_map(corpus, removeNumbers)  # remove numbers

corpus <- tm_map(corpus, removePunctuation) # remove punctuation

corpus <- tm_map(corpus, removeWords, c(stopwords('english'), "and", "are", "the",
                                        "both", "appears", "within", "appear",
                                        "others", "clear", "right", "seen", 
                                        "well")) # remove stopwords

corpus <- tm_map(corpus, stripWhitespace) # collapse the extra white space left by the steps above

tdm <- DocumentTermMatrix(corpus)  # build document term matrix

tdm_dm <- as.data.frame(as.matrix(tdm)) # count matrix

tdm_df <- as.matrix((tdm_dm > 0) + 0) # binary instance matrix: 1 if the word occurs in the document, 0 otherwise

tdm_df <- as.data.frame(tdm_df)

tdm_df <- cbind(tdm_df, notes$title) # append label column from original dataset; cbind names the new column "notes$title"


# List of key words for all class labels (words shared by several classes are listed once),
# followed by the label column
namelist <- c("bladder", "kidneys", "liver", "lesion", "pancreas", "spleen", "urinary", "echotexture", "calculi", "prostate",
              "intrahepatic", "fluid", "gall", "differentiation",
              "diet", "lipid", "fat", "profile", "water", "weight",
              "regular", "months", "low", "repeat", "negative", "smear", "cervical", "cells",
              "sampled", "slide", "sampling", "categorization", "intraepithelial", "malignancy",
              "satisfactory", "information", "composition", "identification",
              "identified", "area", "adequacy", "covering", "patient", "history",
              "epithelial", "pus", "hpf", "rbc", "blood", "wbcpus", "nil", "tcsqctransitionalsquamous",
              "urine", "absent", "cellshpf", "nitrogen", "urea", "mgdl", "bilirubin", "conjugated",
              "delta", "creatinine", "headache", "throat", "complaints", "cold", "pain", "ear", "cough", "discharge", "notes$title")
library(tidyverse)


final <- tdm_df %>% select(all_of(namelist))  # select the 70 key word columns plus the label column


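set.seed(1234) # make the random split repeatable (seed value chosen arbitrarily)

s <- sample(1:nrow(final), nrow(final)*(0.70), replace = FALSE) # random sample of 70% of the row indices

train <- final[s,] # training set

test <- final[-s,] # testing set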
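A quick check of the split sizes, assuming all 12973 observations reported in the data description survive the missing-value filter. The variable names here are illustrative and not part of the original script; the sizes do not depend on which rows are drawn.

```r
n_obs <- 12973                                   # number of observations reported above

# Draw 70% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n_obs), size = floor(0.70 * n_obs))
test_idx  <- setdiff(seq_len(n_obs), train_idx)

length(train_idx)  # 9081
length(test_idx)   # 3892
```

These sizes agree with the row totals of the confusion matrices reported for the train and test sets.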
library(h2o)

localH2O <- h2o.init(nthreads = -1) # start or connect to a local H2O cluster, using all available cores
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 19 minutes 
##     H2O cluster version:        3.14.0.3 
##     H2O cluster version age:    4 months and 14 days !!! 
##     H2O cluster name:           H2O_started_from_R_Karthik_hst051 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.70 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.1 (2017-06-30)
train.h2o <- as.h2o(train) # Train set converted to h2o dataframe
test.h2o <- as.h2o(test) # Test set converted to h2o dataframe
y.dep <- 71 # column index of the dependent variable (the label column)

x.indep <- c(1:70) # column indices of the independent variables (the 70 key word columns)

gbm <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, 
               ntrees=200, learn_rate=0.1, stopping_rounds = 5, seed = 1234)

Model performance on train set:

h2o.performance(gbm)
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.0817278
## RMSE: (Extract with `h2o.rmse`) 0.2858807
## Logloss: (Extract with `h2o.logloss`) 0.2572066
## Mean Per-Class Error: 0.2010196
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                     Body Fluid Analysis Cytology Test Diagnostic Imaging
## Body Fluid Analysis                2468            24                253
## Cytology Test                        13           979                 59
## Diagnostic Imaging                    1             0               3935
## Doctors Advice                        7             8                274
## Organ Function Test                   2             0                 27
## Patient Related                       0             0                 54
## Totals                             2491          1011               4602
##                     Doctors Advice Organ Function Test Patient Related
## Body Fluid Analysis              0                   2               0
## Cytology Test                    0                   0               0
## Diagnostic Imaging               2                   0               1
## Doctors Advice                 469                   1               5
## Organ Function Test              0                 449               0
## Patient Related                  6                   0              42
## Totals                         477                 452              48
##                      Error          Rate
## Body Fluid Analysis 0.1016 = 279 / 2,747
## Cytology Test       0.0685 =  72 / 1,051
## Diagnostic Imaging  0.0010 =   4 / 3,939
## Doctors Advice      0.3861 =   295 / 764
## Organ Function Test 0.0607 =    29 / 478
## Patient Related     0.5882 =    60 / 102
## Totals              0.0814 = 739 / 9,081
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-6 Hit Ratios: 
##   k hit_ratio
## 1 1  0.918621
## 2 2  0.956172
## 3 3  0.984803
## 4 4  0.991301
## 5 5  0.997027
## 6 6  1.000000

Model performance on test set:

h2o.performance(gbm, test.h2o)
## H2OMultinomialMetrics: gbm
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.08398827
## RMSE: (Extract with `h2o.rmse`) 0.2898073
## Logloss: (Extract with `h2o.logloss`) 0.2778274
## Mean Per-Class Error: 0.1784454
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                     Body Fluid Analysis Cytology Test Diagnostic Imaging
## Body Fluid Analysis                1076            10                 92
## Cytology Test                         7           418                 37
## Diagnostic Imaging                    0             0               1653
## Doctors Advice                        4             2                124
## Organ Function Test                   2             0                 11
## Patient Related                       0             0                 20
## Totals                             1089           430               1937
##                     Doctors Advice Organ Function Test Patient Related
## Body Fluid Analysis              0                   1               0
## Cytology Test                    0                   0               2
## Diagnostic Imaging               1                   0               1
## Doctors Advice                 209                   0               6
## Organ Function Test              0                 179               0
## Patient Related                  4                   0              33
## Totals                         214                 180              42
##                      Error          Rate
## Body Fluid Analysis 0.0874 = 103 / 1,179
## Cytology Test       0.0991 =    46 / 464
## Diagnostic Imaging  0.0012 =   2 / 1,655
## Doctors Advice      0.3942 =   136 / 345
## Organ Function Test 0.0677 =    13 / 192
## Patient Related     0.4211 =     24 / 57
## Totals              0.0832 = 324 / 3,892
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-6 Hit Ratios: 
##   k hit_ratio
## 1 1  0.916752
## 2 2  0.956321
## 3 3  0.981501
## 4 4  0.991521
## 5 5  0.997174
## 6 6  1.000000
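The error figures in the test-set output can be recomputed from the confusion matrix alone. The sketch below copies the test-set counts from the output above into a plain R matrix; the object names are illustrative.

```r
classes <- c("Body Fluid Analysis", "Cytology Test", "Diagnostic Imaging",
             "Doctors Advice", "Organ Function Test", "Patient Related")

# Test-set confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
cm <- matrix(c(1076,  10,   92,   0,   1,  0,
                  7, 418,   37,   0,   0,  2,
                  0,   0, 1653,   1,   0,  1,
                  4,   2,  124, 209,   0,  6,
                  2,   0,   11,   0, 179,  0,
                  0,   0,   20,   4,   0, 33),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Share of each actual class that was predicted as some other class.
per_class_error <- 1 - diag(cm) / rowSums(cm)

# Share of all test documents that were misclassified.
overall_error <- 1 - sum(diag(cm)) / sum(cm)

# Unweighted average of the per-class errors.
mean_per_class <- mean(per_class_error)

round(per_class_error, 4)
round(overall_error, 4)   # 0.0832
round(mean_per_class, 4)  # 0.1784
```

The mean per-class error (0.1784) is more than twice the overall error (0.0832) because the two weakest classes are small: 124 of 345 Doctors Advice notes and 20 of 57 Patient Related notes are predicted as Diagnostic Imaging, yet they carry little weight in the overall rate.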