Homework assignment for “The Analytics Edge” MITx

knitr::opts_chunk$set(echo = TRUE)
library(tm) # text analytics
library(caTools) # create training and test datasets
library(rpart) # Classificaiton and Regression Tree
library(rpart.plot)
library(ROCR) # receiver operator characteristic

The medical literature is enormous. Pubmed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.

The large size and fast-changing nature of the medical literature has increased the need for reviews, which search databases like Pubmed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this is tedious and time consuming. In this problem, we will see how text analytics can be used to automate the process of information retrieval.

The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a Pubmed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.

data

trials <- read.csv("clinical_trial.csv", stringsAsFactors = FALSE)
str(trials)
## 'data.frame':    1860 obs. of  3 variables:
##  $ title   : chr  "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breas"| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for "| __truncated__ ...
##  $ abstract: chr  "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population bef"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with un"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a co"| __truncated__ ...
##  $ trial   : int  1 0 1 1 1 0 1 0 0 0 ...
cat("\n")
summary(trials)
##     title             abstract             trial       
##  Length:1860        Length:1860        Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :0.0000  
##                                        Mean   :0.4392  
##                                        3rd Qu.:1.0000  
##                                        Max.   :1.0000
cat("\nNumber of characters in longest abstract : ", max(nchar(trials$abstract)))
## 
## Number of characters in longest abstract :  3708
cat("\nNumber of search results with no abstract : ", sum(nchar(trials$abstract)==0))
## 
## Number of search results with no abstract :  112
cat("\nTitle of observation with shortest title : ", trials$title[which.min(nchar(trials$title))])
## 
## Title of observation with shortest title :  A decade of letrozole: FACE.

pre-process titles and abstracts

corpusTitle <- VCorpus(VectorSource(trials$title))
corpusAbstract <- VCorpus(VectorSource(trials$abstract))

corpusTitle = tm_map(corpusTitle, content_transformer(tolower)) # lower-case
corpusTitle = tm_map(corpusTitle, removePunctuation)            # remove punctuation
corpusTitle = tm_map(corpusTitle, removeWords, stopwords("english")) # remove stop words
corpusTitle = tm_map(corpusTitle, stemDocument)                 # reduce words to 'stems'

corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower)) # lower-case
corpusAbstract = tm_map(corpusAbstract, removePunctuation)            # remove punctuation
corpusAbstract = tm_map(corpusAbstract, removeWords, stopwords("english")) # remove stop words
corpusAbstract = tm_map(corpusAbstract, stemDocument)                 # reduce words to 'stems'

bag of words

dtmTitle <- as.data.frame(as.matrix(removeSparseTerms(DocumentTermMatrix(corpusTitle), 0.95)))
dtmAbstract <- as.data.frame(as.matrix(removeSparseTerms(DocumentTermMatrix(corpusAbstract), 0.95)))
# create document term matrices. remove terms not present in at least 5% of documents

str(dtmTitle)
## 'data.frame':    1860 obs. of  31 variables:
##  $ adjuv          : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ advanc         : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ breast         : num  0 0 1 1 1 1 0 1 1 1 ...
##  $ cancer         : num  1 0 1 1 1 1 0 1 1 2 ...
##  $ chemotherapi   : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ clinic         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combin         : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ compar         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cyclophosphamid: num  0 0 0 1 0 0 0 0 0 0 ...
##  $ docetaxel      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ doxorubicin    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ earli          : num  0 0 1 1 0 0 0 1 0 0 ...
##  $ effect         : num  0 0 0 0 1 0 0 1 0 1 ...
##  $ group          : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ iii            : num  0 0 1 0 0 0 0 0 0 1 ...
##  $ metastat       : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ patient        : num  0 0 0 0 1 0 1 0 1 1 ...
##  $ phase          : num  0 0 1 1 0 0 0 0 0 1 ...
##  $ plus           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ postmenopaus   : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ random         : num  0 0 1 1 1 0 0 0 0 1 ...
##  $ randomis       : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ respons        : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ result         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ studi          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ tamoxifen      : num  0 0 0 0 0 0 0 2 1 0 ...
##  $ therapi        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ treatment      : num  1 0 0 0 1 0 0 0 0 1 ...
##  $ trial          : num  0 0 1 1 1 0 0 1 1 1 ...
##  $ versus         : num  0 0 1 0 0 0 0 1 0 0 ...
##  $ women          : num  0 0 0 0 0 0 0 1 0 0 ...
str(dtmAbstract)
## 'data.frame':    1860 obs. of  335 variables:
##  $ 0001           : num  0 0 0 0 0 7 0 0 0 0 ...
##  $ 001            : num  0 0 0 1 0 0 0 1 0 0 ...
##  $ 005            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 100            : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ 500            : num  0 0 1 0 2 0 0 0 0 0 ...
##  $ 5fluorouracil  : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ accord         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ achiev         : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ activ          : num  0 2 0 1 0 0 1 0 0 0 ...
##  $ addit          : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ adjuv          : num  0 0 0 2 0 1 0 2 4 0 ...
##  $ administ       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ administr      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ advanc         : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ advers         : num  0 0 0 0 0 0 0 0 0 4 ...
##  $ age            : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ agent          : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ aim            : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ alon           : num  0 0 0 0 0 0 0 0 3 0 ...
##  $ also           : num  0 0 0 1 0 0 0 0 2 0 ...
##  $ although       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ among          : num  0 0 2 4 0 0 0 0 1 0 ...
##  $ analys         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ analysi        : num  0 0 0 2 0 0 0 1 0 0 ...
##  $ analyz         : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ andor          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anthracyclin   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appear         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ arm            : num  0 0 7 4 2 0 0 1 0 1 ...
##  $ aromatas       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ assess         : num  0 2 1 2 0 1 0 0 0 3 ...
##  $ assign         : num  0 0 2 1 0 0 0 1 1 1 ...
##  $ associ         : num  0 0 1 3 0 2 0 1 2 0 ...
##  $ avail          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ axillari       : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ background     : num  0 0 1 1 1 0 0 1 1 0 ...
##  $ base           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ baselin        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ benefit        : num  0 0 0 0 2 0 0 1 1 0 ...
##  $ better         : num  0 0 1 0 1 1 0 0 0 0 ...
##  $ bone           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ breast         : num  0 2 3 3 3 4 2 2 2 3 ...
##  $ can            : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ cancer         : num  0 0 2 3 3 3 2 2 3 0 ...
##  $ carcinoma      : num  0 5 0 0 0 0 0 0 0 2 ...
##  $ case           : num  0 11 0 0 1 0 0 0 0 0 ...
##  $ caus           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ cell           : num  0 3 0 0 0 1 0 0 0 0 ...
##  $ chang          : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ characterist   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ chemotherapi   : num  0 0 1 2 3 5 2 0 1 0 ...
##  $ clinic         : num  0 1 0 1 0 0 0 0 0 0 ...
##  $ cmf            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combin         : num  0 0 0 0 6 0 2 0 1 0 ...
##  $ common         : num  0 0 0 0 0 0 0 0 0 3 ...
##  $ compar         : num  0 0 1 2 0 0 0 1 1 1 ...
##  $ comparison     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ complet        : num  0 0 3 0 1 1 1 2 0 0 ...
##  $ conclus        : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ conduct        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ confid         : num  0 0 1 1 0 0 0 0 0 0 ...
##  $ confirm        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ consid         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ consist        : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ continu        : num  0 0 2 0 1 0 0 2 0 0 ...
##  $ control        : num  0 1 0 0 0 0 0 0 1 0 ...
##  $ correl         : num  0 0 0 0 0 2 0 0 0 0 ...
##  $ cours          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cycl           : num  0 0 6 0 2 0 1 0 0 0 ...
##  $ cyclophosphamid: num  0 0 1 1 1 1 3 0 0 0 ...
##  $ daili          : num  0 0 0 0 0 0 1 0 1 2 ...
##  $ data           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ day            : num  0 0 1 0 0 0 3 0 0 0 ...
##  $ death          : num  0 0 0 2 0 0 0 0 1 0 ...
##  $ decreas        : num  0 0 1 0 0 0 0 0 1 0 ...
##  $ defin          : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ demonstr       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ design         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ detect         : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ determin       : num  0 0 0 0 1 1 1 0 0 0 ...
##  $ develop        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dfs            : num  0 0 0 3 0 0 0 0 0 0 ...
##  $ differ         : num  0 1 2 1 3 0 0 1 0 1 ...
##  $ diseas         : num  0 1 0 1 3 0 0 1 0 0 ...
##  $ diseasefre     : num  0 0 0 1 0 2 0 0 3 0 ...
##  $ distant        : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ docetaxel      : num  0 0 1 0 0 0 3 0 0 0 ...
##  $ dose           : num  0 0 0 0 0 0 4 0 0 0 ...
##  $ doubleblind    : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ doxorubicin    : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ drug           : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ due            : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ durat          : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ earli          : num  0 0 0 0 0 0 0 2 0 0 ...
##  $ effect         : num  0 0 2 0 0 1 0 2 0 1 ...
##  $ efficaci       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ eight          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ either         : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ elig           : num  0 0 0 1 0 0 0 0 0 0 ...
##   [list output truncated]
cat("\nMost frequent word stem across all abstracts :", colnames(dtmAbstract)[which.max(colSums(dtmAbstract))])
## 
## Most frequent word stem across all abstracts : patient

Building model

We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names.

colnames(dtmTitle) <-  paste0("T", colnames(dtmTitle))
colnames(dtmAbstract) <-  paste0("A", colnames(dtmAbstract))
# Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.

dtm <- cbind(dtmTitle, dtmAbstract)
# Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:
dtm$trial <- trials$trial

paste("Number of columns in 'dtm' datta-frame", ncol(dtm))
## [1] "Number of columns in 'dtm' datta-frame 367"

Create training and test datasets

set.seed(144)

split <- sample.split(dtm$trial, SplitRatio = 0.7)
train <- subset(dtm, split == TRUE)
test <- subset(dtm, split == FALSE)

summary(train$trial)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4393  1.0000  1.0000
cat("\nBaseline accuracty of most common outcome of train$trial : ", 1-mean(train$trial))
## 
## Baseline accuracty of most common outcome of train$trial :  0.5606759

create Classification and Regression Tree (CART) model

trialCART <- rpart(trial ~ . ,
                   data = train, method = "class")
prp(trialCART)

training set predictions for the model

predictCARTtrain <- predict(trialCART)
cat("\nMaximum predicted probability for any result\nwhen applying model to training set :", max(predictCARTtrain[,2]))
## 
## Maximum predicted probability for any result
## when applying model to training set : 0.8718861
# probability of '1' is in second column of predictCarttrain

Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability when applying the model to the testing set compared to the training set.

model accuracy, sensitivity and specificity on training set

predictCARTtrainconfusion <- table(train$trial, predictCARTtrain[,2] >= 0.5)
predictCARTtrainconfusion
##    
##     FALSE TRUE
##   0   631   99
##   1   131  441
cat("\nAccuracy : ", sum(diag(predictCARTtrainconfusion))/nrow(train))
## 
## Accuracy :  0.8233487
cat("\nSensitivity : ", predictCARTtrainconfusion["1","TRUE"]/sum(predictCARTtrainconfusion["1",]))
## 
## Sensitivity :  0.770979
cat("\nSpecificity : ", predictCARTtrainconfusion["0","FALSE"]/sum(predictCARTtrainconfusion["0",]))
## 
## Specificity :  0.8643836

evaluating the model on the testing set

predTest =  predict(trialCART,
                    newdata = test)

predTestconfusion <- table(test$trial, predTest[,2] >= 0.5)
predTestconfusion
##    
##     FALSE TRUE
##   0   261   52
##   1    83  162
cat("\nAccuracy : ", sum(diag(predTestconfusion))/nrow(test))
## 
## Accuracy :  0.7580645
cat("\nSensitivity : ", predTestconfusion["1","TRUE"]/sum(predTestconfusion["1",]))
## 
## Sensitivity :  0.6612245
cat("\nSpecificity : ", predTestconfusion["0","FALSE"]/sum(predTestconfusion["0",]))
## 
## Specificity :  0.8338658
predictROCR <- prediction(predTest[,2], test$trial)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.8371063

decision-maker tradeoffs

The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:

  1. For all of the papers retreived in the PubMed Search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)

  2. Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study’s detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.

  3. Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.

A false negative might negatively affect the results of the literature review and analysis, while a false positive is a nuisance (one additional paper that needs to be manually checked). As a result, the cost of a false negative is much higher than the cost of a false positive, so much so that many studies actually use no machine learning (aka no Step 1) and have two people manually review each search result in Step 2. As always, we prefer a lower threshold in cases where false negatives are more costly than false positives, since we will make fewer negative predictions.