Number of observations = 12973
Number of columns = 2
Column names = “title” and “description”
“title” holds the test name, and “description” holds the corresponding free-text description of the test outcome.
Goal: predict the test name from the description.
Data import:
The dataset, a CSV file, was imported into the R environment. “title”, which has 6 class labels, was converted from character to factor. Rows with missing values were removed from the dataset.
Data preparation:
A. A document term matrix was constructed for each class label to identify the key terms that help distinguish that class from the others.
Create a word vector from the description column.
Build a corpus using the word vector.
Apply pre-processing: remove numbers, punctuation, extra whitespace and stopwords, and convert to lower case.
Build a document term matrix (dtm).
Remove sparse words from the above dtm.
The result is a count matrix with one row per document and one column per term, where each cell holds the number of times the term occurs in the document.
Transform the count matrix into a binary instance matrix, which records each word in a document as 1 (present) or 0 (absent).
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels.
A total of six document term matrices were constructed, one for each class. The sparsity threshold for each dtm was adjusted to surface more key words for each class label.
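As a small illustration of the count-to-binary step, the sketch below builds a toy count matrix by hand and converts it with the same comparison the script uses later. The documents and counts are made up for the example and are not taken from the notes dataset.

```r
# Toy count matrix: 3 documents (rows) x 4 terms (columns).
# The values are invented for illustration only.
counts <- matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 0,
    1, 1, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("kidneys", "urea", "liver", "cough"))
)

# (counts > 0) is a logical matrix; adding 0 coerces TRUE/FALSE to 1/0
# while keeping the dimensions and the row/column names.
binary <- (counts > 0) + 0

print(binary)
```

A term that occurs four times in a document and a term that occurs once both become 1, so the model only sees whether a key word is present.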
Key words from first dtm for class label “Diagnostic Imaging”
bladder, kidneys, liver, lesion, pancreas, spleen, urinary, echotexture, calculi, prostate, intrahepatic, fluid, gall, differentiation
Key words from second dtm for class label “Doctors Advice”
diet, exercise, lipid, fat, profile, water, weight, regular, months, low, repeat
Key words from third dtm for class label “Cytology Test”
negative, smear, cervical, cells, sampled, slide, sampling, lesion, categorization, intraepithelial, malignancy, satisfactory, information, composition, identification, identified, area, adequacy, covering, patient, history
Key words from fourth dtm for class label “Body Fluid Analysis”
epithelial, pus, hpf, rbc, blood, wbcpus, nil, tcsqctransitionalsquamous, urine, absent, cells, cellshpf
Key words from fifth dtm for class label “Organ Function Test”
nitrogen, urea, mgdl, bilirubin, conjugated, delta, creatinine
Key words from sixth dtm for class label “Patient Related”
headache, throat, complaints, cold, pain, ear, cough, discharge
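One way to picture how such per-class lists arise: restrict the documents to a single class, count in how many of them each term occurs, and keep the terms above a threshold. The sketch below does this in base R on made-up sentences. The `doc_freq_terms` helper and the `min_share` threshold are hypothetical names for this illustration. It is a simplified stand-in for the sparse-term removal in tm, not the author's exact procedure, and the thresholds actually used are not given.

```r
# Hypothetical mini-dataset: two classes, three short descriptions each.
notes_toy <- data.frame(
  title = c("Diagnostic Imaging", "Diagnostic Imaging", "Diagnostic Imaging",
            "Patient Related", "Patient Related", "Patient Related"),
  description = c("liver normal size", "liver spleen normal",
                  "kidneys normal liver enlarged",
                  "cough cold fever", "cough throat pain", "ear pain cough"),
  stringsAsFactors = FALSE
)

# Return the terms that occur in at least `min_share` of the documents.
doc_freq_terms <- function(docs, min_share) {
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  # unique() per document, so each term counts at most once per document
  doc_freq <- table(unlist(lapply(tokens, unique)))
  doc_freq <- doc_freq[names(doc_freq) != ""]
  sort(names(doc_freq)[doc_freq / length(docs) >= min_share])
}

# One keyword list per class label.
keywords <- lapply(split(notes_toy$description, notes_toy$title),
                   doc_freq_terms, min_share = 0.6)
print(keywords)
```

Lowering `min_share` for a class admits rarer terms, which is the same trade-off as loosening the sparsity threshold to find more key words for that class.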
B. A single document term matrix covering all class labels is constructed, and only the columns for the key words identified in each class's dtm are kept.
C. A gradient boosting model is fit on the above data
library(readr)
notes <- read_csv("C:/Users/welcome/Downloads/train_notes (1).csv")
notes$title <- as.factor(notes$title) # convert title to factor
notes <- notes[,-1] # remove first column (presumably a row index, since title and description are both used below)
notes <- notes[complete.cases(notes),] # remove missing values if any
library(tm) # load text mining package
sd <- VectorSource(notes$description) # one document per description
corpus <- Corpus(sd) # build corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # collapse repeated whitespace
# Remove stopwords. removeWords is case-sensitive and runs before lowercasing here,
# so capitalized stopwords (e.g. "The") are not removed by this step.
corpus <- tm_map(corpus, removeWords, c(stopwords('english'), "and", "are", "the",
"both", "appears", "within", "appear",
"others", "clear", "right", "seen",
"well"))
corpus <- tm_map(corpus, content_transformer(tolower)) # change to lower case
tdm <- DocumentTermMatrix(corpus) # build document term matrix
tdm_dm <- as.data.frame(as.matrix(tdm)) # count matrix
tdm_df <- as.matrix((tdm_dm > 0) + 0) # binary instance matrix
tdm_df <- as.data.frame(tdm_df)
tdm_df <- cbind(tdm_df, notes$title) # append label column from original dataset
# List of key words for all class labels
namelist <- (c("bladder", "kidneys", "liver", "lesion", "pancreas", "spleen", "urinary", "echotexture", "calculi", "prostate",
"urinary", "intrahepatic", "fluid", "gall", "differentiation",
"diet", "lipid", "fat", "profile", "water", "weight",
"regular", "months", "low", "repeat", "negative", "smear", "cervical", "cells",
"sampled", "slide", "sampling", "lesion", "categorization", "intraepithelial", "malignancy",
"satisfactory" , "information" , "composition", "identification",
"identified", "area", "adequacy" , "covering", "patient" , "history",
"epithelial", "pus", "hpf", "rbc", "blood", "wbcpus" , "nil", "tcsqctransitionalsquamous",
"urine", "absent", "cells", "cellshpf", "nitrogen", "urea", "mgdl", "bilirubin", "conjugated" ,
"delta" , "creatinine", "headache", "throat", "complaints", "cold", "pain", "ear", "cough", "discharge", "notes$title"))
library(tidyverse)
final <- tdm_df %>% select(namelist) # keep only the key word columns and the label column
s <- sample(1:nrow(final), nrow(final)*(0.70), replace = FALSE) # random sampling
train <- final[s,] # training set
test <- final[-s,] # testing set
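The split sizes reported in the model output follow directly from this call: `sample()` truncates a fractional size, so 70% of 12973 rows gives 9081 training rows and leaves 3892 for testing. A minimal sketch using row indices only. The seed value is an arbitrary choice for this example, and the original script does not set one, so its exact split cannot be reproduced.

```r
n <- 12973                      # number of observations in the notes dataset
set.seed(42)                    # arbitrary seed, only to make this example repeatable

# n * 0.70 is 9081.1; sample() truncates the size to 9081
s <- sample(1:n, n * 0.70, replace = FALSE)

train_rows <- s                     # indices of the training rows
test_rows  <- setdiff(1:n, s)       # every row not sampled goes to the test set

length(train_rows)  # 9081
length(test_rows)   # 3892
```

Because the sampling is not stratified, small classes such as Patient Related can end up with slightly different shares in the two sets.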
require(h2o)
localH2O <- h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 hours 19 minutes
## H2O cluster version: 3.14.0.3
## H2O cluster version age: 4 months and 14 days !!!
## H2O cluster name: H2O_started_from_R_Karthik_hst051
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.70 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.1 (2017-06-30)
train.h2o <- as.h2o(train) # Train set converted to h2o dataframe
test.h2o <- as.h2o(test) # Test set converted to h2o dataframe
y.dep <- 71 # Dependent Variable
x.indep <- c(1:70) # Independent variables
gbm <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o,
ntrees=200, learn_rate=0.1, stopping_rounds = 5, seed = 1234)
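The hard-coded positions 71 and 1:70 can be cross-checked against the key word vector: it has 74 entries, three of them repeats (urinary, lesion, cells), which leaves 70 distinct predictors plus the label. The sketch below does that check in base R. `label` and `features` are names introduced for this example only.

```r
# The key word vector as written in the script (label column included).
namelist <- c("bladder", "kidneys", "liver", "lesion", "pancreas", "spleen", "urinary",
              "echotexture", "calculi", "prostate", "urinary", "intrahepatic", "fluid",
              "gall", "differentiation", "diet", "lipid", "fat", "profile", "water",
              "weight", "regular", "months", "low", "repeat", "negative", "smear",
              "cervical", "cells", "sampled", "slide", "sampling", "lesion",
              "categorization", "intraepithelial", "malignancy", "satisfactory",
              "information", "composition", "identification", "identified", "area",
              "adequacy", "covering", "patient", "history", "epithelial", "pus", "hpf",
              "rbc", "blood", "wbcpus", "nil", "tcsqctransitionalsquamous", "urine",
              "absent", "cells", "cellshpf", "nitrogen", "urea", "mgdl", "bilirubin",
              "conjugated", "delta", "creatinine", "headache", "throat", "complaints",
              "cold", "pain", "ear", "cough", "discharge", "notes$title")

label    <- "notes$title"
features <- setdiff(unique(namelist), label)   # drop repeats and the label

length(namelist)                  # 74
namelist[duplicated(namelist)]    # "urinary" "lesion" "cells"
length(features)                  # 70
```

`h2o.gbm()` accepts column names as well as indices for `x` and `y`, so names like these could stand in for the numeric positions, assuming the column names survive the conversion to an H2O frame unchanged (not verified here).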
Model performance on train set
h2o.performance(gbm)
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.0817278
## RMSE: (Extract with `h2o.rmse`) 0.2858807
## Logloss: (Extract with `h2o.logloss`) 0.2572066
## Mean Per-Class Error: 0.2010196
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                     Body Fluid Analysis Cytology Test Diagnostic Imaging
## Body Fluid Analysis                2468            24                253
## Cytology Test                        13           979                 59
## Diagnostic Imaging                    1             0               3935
## Doctors Advice                        7             8                274
## Organ Function Test                   2             0                 27
## Patient Related                       0             0                 54
## Totals                             2491          1011               4602
##                     Doctors Advice Organ Function Test Patient Related
## Body Fluid Analysis              0                   2               0
## Cytology Test                    0                   0               0
## Diagnostic Imaging               2                   0               1
## Doctors Advice                 469                   1               5
## Organ Function Test              0                 449               0
## Patient Related                  6                   0              42
## Totals                         477                 452              48
##                      Error Rate
## Body Fluid Analysis 0.1016 = 279 / 2,747
## Cytology Test       0.0685 = 72 / 1,051
## Diagnostic Imaging  0.0010 = 4 / 3,939
## Doctors Advice      0.3861 = 295 / 764
## Organ Function Test 0.0607 = 29 / 478
## Patient Related     0.5882 = 60 / 102
## Totals              0.0814 = 739 / 9,081
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-6 Hit Ratios:
## k hit_ratio
## 1 1 0.918621
## 2 2 0.956172
## 3 3 0.984803
## 4 4 0.991301
## 5 5 0.997027
## 6 6 1.000000
Model performance on test set
h2o.performance(gbm, test.h2o)
## H2OMultinomialMetrics: gbm
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.08398827
## RMSE: (Extract with `h2o.rmse`) 0.2898073
## Logloss: (Extract with `h2o.logloss`) 0.2778274
## Mean Per-Class Error: 0.1784454
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                     Body Fluid Analysis Cytology Test Diagnostic Imaging
## Body Fluid Analysis                1076            10                 92
## Cytology Test                         7           418                 37
## Diagnostic Imaging                    0             0               1653
## Doctors Advice                        4             2                124
## Organ Function Test                   2             0                 11
## Patient Related                       0             0                 20
## Totals                             1089           430               1937
##                     Doctors Advice Organ Function Test Patient Related
## Body Fluid Analysis              0                   1               0
## Cytology Test                    0                   0               2
## Diagnostic Imaging               1                   0               1
## Doctors Advice                 209                   0               6
## Organ Function Test              0                 179               0
## Patient Related                  4                   0              33
## Totals                         214                 180              42
##                      Error Rate
## Body Fluid Analysis 0.0874 = 103 / 1,179
## Cytology Test       0.0991 = 46 / 464
## Diagnostic Imaging  0.0012 = 2 / 1,655
## Doctors Advice      0.3942 = 136 / 345
## Organ Function Test 0.0677 = 13 / 192
## Patient Related     0.4211 = 24 / 57
## Totals              0.0832 = 324 / 3,892
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-6 Hit Ratios:
## k hit_ratio
## 1 1 0.916752
## 2 2 0.956321
## 3 3 0.981501
## 4 4 0.991521
## 5 5 0.997174
## 6 6 1.000000
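The summary figures in the test-set output can be recomputed from the confusion matrix itself, which makes clear what each one measures. The sketch below re-enters the test-set counts by hand in base R. The variable names are chosen for this example and are not part of the H2O output.

```r
classes <- c("Body Fluid Analysis", "Cytology Test", "Diagnostic Imaging",
             "Doctors Advice", "Organ Function Test", "Patient Related")

# Test-set confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
cm <- matrix(
  c(1076,  10,   92,   0,   1,  0,
       7, 418,   37,   0,   0,  2,
       0,   0, 1653,   1,   0,  1,
       4,   2,  124, 209,   0,  6,
       2,   0,   11,   0, 179,  0,
       0,   0,   20,   4,   0, 33),
  nrow = 6, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

per_class_error <- 1 - diag(cm) / rowSums(cm)   # misclassified share of each actual class
overall_error   <- 1 - sum(diag(cm)) / sum(cm)  # 324 errors out of 3892 rows
mean_per_class  <- mean(per_class_error)        # unweighted average over the six classes

# Misclassified documents that were predicted as "Diagnostic Imaging"
wrongly_imaging <- sum(cm[, "Diagnostic Imaging"]) -
  cm["Diagnostic Imaging", "Diagnostic Imaging"]

round(per_class_error, 4)
round(overall_error, 4)    # 0.0832
round(mean_per_class, 4)   # 0.1784
wrongly_imaging            # 284
```

The overall error (about 8%) is much lower than the mean per-class error (about 18%) because the two classes with error rates near 40%, Doctors Advice and Patient Related, contribute few rows. Of the 324 test errors, 284 are documents predicted as Diagnostic Imaging.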