Background Information on the Dataset

The medical literature is enormous. PubMed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Furthermore, the rate of medical publication has increased over time, and there are now nearly 1 million new publications in the field each year, or more than one per minute.
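
As a quick back-of-the-envelope check on that rate (an illustrative calculation, not part of the dataset):

# Spread roughly 1 million annual publications over the minutes in a year
1e6 / (365 * 24 * 60)
# roughly 1.9 publications per minute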

The large size and fast-changing nature of the medical literature has increased the need for reviews, which search databases like PubMed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this process is tedious and time-consuming. In this problem, we will see how text analytics can be used to automate the process of information retrieval.

The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a PubMed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.

R Exercises

Loading the Dataset

Load clinical_trial.csv into a data frame called trials (remembering to add the argument stringsAsFactors=FALSE), and investigate the data frame with summary() and str().

How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)

# Load the knitr package (used for the kable tables below) and the dataset
library(knitr)
trials = read.csv("clinical_trial.csv", stringsAsFactors=FALSE)
# Number of characters in the longest abstract
max(nchar(trials$abstract))
## [1] 3708

3708 characters are in the longest abstract.

How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)

# Tabulate the number of results with no abstract
z = table(nchar(trials$abstract) == 0)
kable(z)
Var1    Freq
FALSE   1748
TRUE     112

112 search results have no abstract.
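
An equivalent one-liner gives the same count by summing the logical vector directly:

# Count search results whose abstract has zero characters
sum(nchar(trials$abstract) == 0)
## [1] 112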

Find the observation with the minimum number of characters in the title (the variable “title”) out of all of the observations in this dataset. What is the text of the title of this article? Include capitalization and punctuation in your response, but don’t include the quotes.

# Find the observation with the minimum number of characters
which.min(nchar(trials$title))
## [1] 1258
a = which.min(nchar(trials$title))
z = trials$title[a]
kable(z)
x
A decade of letrozole: FACE.

The title of this article is: A decade of letrozole: FACE.

Preparing the Corpus

Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.

Following the commands from lecture, perform the following tasks (you might need to load the “tm” package first if it isn’t already loaded). Make sure to perform them in this order.

  1. Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.

  2. Convert corpusTitle and corpusAbstract to lowercase.

  3. Remove the punctuation in corpusTitle and corpusAbstract.

  4. Remove the English language stop words from corpusTitle and corpusAbstract.

  5. Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).

  6. Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.

  7. Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).

  8. Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code below, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")).
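
A minimal way to check which stop word list applies on your installation (sw is the vector defined just below; run this after loading the tm package):

# How many English stop words does this tm installation provide?
length(stopwords("english"))
# If the result is not 174, remove stop words with sw instead, e.g.:
# corpusTitle = tm_map(corpusTitle, removeWords, sw)
# corpusAbstract = tm_map(corpusAbstract, removeWords, sw)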

How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?

sw = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", "when's", "where's", "why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very")

# Creating the Corpus
library(tm)
# Convert the title variable to corpusTitle
corpusTitle = VCorpus(VectorSource(trials$title))
# Convert corpusTitle to lowercase
corpusTitle = tm_map(corpusTitle, content_transformer(tolower))
# Remove the punctuation in corpusTitle
corpusTitle = tm_map(corpusTitle, removePunctuation)
# Remove the English language stop words from corpusTitle
corpusTitle = tm_map(corpusTitle, removeWords, stopwords("english"))
# Stem the words in corpusTitle (stemming might take a few minutes)
corpusTitle = tm_map(corpusTitle, stemDocument)
# Build a document term matrix called dtmTitle from corpusTitle
dtmTitle = DocumentTermMatrix(corpusTitle)
# Limit dtmTitle to terms with sparseness of at most 95% (terms that appear in at least 5% of documents)
dtmTitle = removeSparseTerms(dtmTitle, 0.95)
# Convert dtmTitle to a data frame with syntactically valid column names
dtmTitle = as.data.frame(as.matrix(dtmTitle))
colnames(dtmTitle) = make.names(colnames(dtmTitle))
ncol(dtmTitle)
## [1] 31
# Convert the abstract variable to corpusAbstract
corpusAbstract = VCorpus(VectorSource(trials$abstract))
# Convert corpusAbstract to lowercase
corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower))
# Remove the punctuation in corpusAbstract
corpusAbstract = tm_map(corpusAbstract, removePunctuation)
# Remove the English language stop words from corpusAbstract
corpusAbstract = tm_map(corpusAbstract, removeWords, stopwords("english"))
# Stem the words in corpusAbstract (stemming might take a few minutes)
corpusAbstract = tm_map(corpusAbstract, stemDocument)
# Build a document term matrix called dtmAbstract from corpusAbstract
dtmAbstract = DocumentTermMatrix(corpusAbstract)
# Limit dtmAbstract to terms with sparseness of at most 95% (terms that appear in at least 5% of documents)
dtmAbstract = removeSparseTerms(dtmAbstract, 0.95)
# Convert dtmAbstract to a data frame with syntactically valid column names
dtmAbstract = as.data.frame(as.matrix(dtmAbstract))
colnames(dtmAbstract) = make.names(colnames(dtmAbstract))
ncol(dtmAbstract)
## [1] 335

31 terms remain in dtmTitle and 335 terms remain in dtmAbstract.
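As an optional sanity check (not required by the exercise), each retained title term should appear in at least roughly 5% of documents; this sketch assumes dtmTitle is the data frame built above and introduces docFreq only for illustration:

# Fraction of documents containing each retained title term
docFreq = colSums(dtmTitle > 0) / nrow(dtmTitle)
# The smallest document frequency should be roughly 0.05 or higher
min(docFreq)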

What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.

# Obtain the frequency of each word stem across all abstracts
frequency <- colSums(dtmAbstract)
which.max(frequency)
## patient 
##     212

Patient is the most frequent word stem across all abstracts.
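
For context, the most frequent stems can be listed with sort(); the exact counts are omitted here since they are not needed for the question, but "patient" should appear first:

# Five most frequent word stems across all abstracts
sort(frequency, decreasing = TRUE)[1:5]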

Building a model

We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:

colnames(dtmTitle) = paste0("T", colnames(dtmTitle))

colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))

# Prepend "T" to the title column names and "A" to the abstract column names
colnames(dtmTitle) = paste0("T", colnames(dtmTitle))
colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))

How many columns are in this combined data frame?

# Combine the two data frames
dtm = cbind(dtmTitle, dtmAbstract)
ncol(dtm)
## [1] 366
# Add the dependent variable
dtm$trial = trials$trial

366 columns are in the combined data frame (367 after adding the dependent variable trial).

Baseline Model

Now that we have prepared our data frame, it’s time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named “train” and “test”, putting 70% of the data in the training set.

What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)

# Split the dataset into training and testing sets
library(caTools)
set.seed(144)
spl = sample.split(dtm$trial, SplitRatio = 0.7)
train = subset(dtm, spl==TRUE)
test = subset(dtm, spl==FALSE)
# Tabulate the outcome in the training set (the baseline predicts the most frequent outcome)
table(train$trial)
## 
##   0   1 
## 730 572
a = table(train$trial)
kable(a)
Var1   Freq
0       730
1       572
a[1]/(sum(a))
##         0 
## 0.5606759

The accuracy of the baseline model on the training set is 0.5606759.

CART Model #1 - Tree

Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don’t add a minbucket or cp value). Remember to add the method=“class” argument, since this is a classification problem.

What is the name of the first variable the model split on?

# Implement the CART Model
library(rpart)
library(rpart.plot)
trialCART = rpart(trial  ~ ., data=train, method="class")
prp(trialCART)

The first variable is Tphase.
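
The root split can also be read off programmatically: rpart stores one row per node in the model's frame, with the root node first.

# Variable used at the first (root) split; should match the top of the plotted tree (Tphase)
as.character(trialCART$frame$var[1])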

CART Model #1 - Training (Maximum Predicted Probability)

What is the maximum predicted probability for any result in the training set?

# Predicting on the training set
predTrain <- predict(trialCART)
summary(predTrain)
##        0                1          
##  Min.   :0.1281   Min.   :0.05455  
##  1st Qu.:0.2177   1st Qu.:0.13636  
##  Median :0.7125   Median :0.28750  
##  Mean   :0.5607   Mean   :0.43932  
##  3rd Qu.:0.8636   3rd Qu.:0.78231  
##  Max.   :0.9455   Max.   :0.87189

The maximum predicted probability on the training set is 0.872.
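
The same value can be read directly from the predictions instead of from summary() (column 2 of predTrain holds the predicted probability that trial = 1):

# Maximum predicted probability of being a trial on the training set (about 0.872)
max(predTrain[,2])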

CART Model #1 - Training

# Obtain the training set confusion matrix using a probability threshold of 0.5
t = table(train$trial, predTrain[,2] >= 0.5)
kable(t)
      FALSE   TRUE
0       631     99
1       131    441

(t[1] + t[4]) / nrow(train)
## [1] 0.8233487
t[4] / (t[4] + t[2])
## [1] 0.770979
t[1] / (t[1] + t[3])
## [1] 0.8643836

Training Set Accuracy = 0.8233487; Training Set Sensitivity = 0.770979; Training Set Specificity = 0.8643836.
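
The index-based formulas above are easy to misread, so here is an equivalent version with named cells (same table t, same three results; TN, FP, FN, and TP are introduced only for illustration):

# Rows of t are the actual outcomes, columns are the predictions at threshold 0.5
TN = t["0", "FALSE"]; FP = t["0", "TRUE"]
FN = t["1", "FALSE"]; TP = t["1", "TRUE"]
(TP + TN) / sum(t)  # accuracy, 0.8233487
TP / (TP + FN)      # sensitivity, 0.770979
TN / (TN + FP)      # specificity, 0.8643836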

CART Model #1 - Test

# Predicting on the test set
testPredictCART = predict(trialCART, newdata=test, type="class")
# Tabulate the test set predictions against the actual outcomes
a = table(test$trial, testPredictCART)
kable(a)
        0     1
0     261    52
1      83   162
sum(diag(a))/(sum(a))
## [1] 0.7580645

Testing Set Accuracy = 0.7580645

ROCR testing set AUC

# Compute the test set AUC with ROCR
library(ROCR)
predTest = predict(trialCART, newdata=test)[,2]
pred = prediction(predTest, test$trial)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8371063

Testing Set AUC = 0.8371063
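
The same prediction object can also be used to draw the ROC curve, which is helpful when thinking about thresholds in the next part (perfROC is introduced only for illustration):

# Plot the test set ROC curve (true positive rate vs. false positive rate)
perfROC = performance(pred, "tpr", "fpr")
plot(perfROC)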

Decision-maker tradeoffs

The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:

  1. For all of the papers retrieved in the PubMed search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials.

  2. Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study’s detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.

  3. Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.


What is the cost associated with the model in Step 1 making a false negative prediction?

By definition, a false negative is a paper that should have been included in Set A but was missed by the model. This means a study that should have been included in Step 3 was missed, affecting the results.

What is the cost associated with the model in Step 1 making a false positive prediction?

By definition, a false positive is a paper that should not have been included in Set A but that was actually included. However, because the manual review in Step 2 is assumed to be 100% effective, this extra paper will not make it into the more limited set of papers, and therefore this mistake will not affect the analysis in Step 3.

Given the costs associated with false positives and false negatives, which of the following is most accurate?

A false negative might negatively affect the results of the literature review and analysis, while a false positive is a nuisance (one additional paper that needs to be manually checked). As a result, the cost of a false negative is much higher than the cost of a false positive, so much so that many studies actually use no machine learning (aka no Step 1) and have two people manually review each search result in Step 2. As always, we prefer a lower threshold in cases where false negatives are more costly than false positives, since we will make fewer negative predictions.
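
To illustrate the threshold argument concretely, here is a short sketch (not part of the graded questions) that reuses predTest from the AUC calculation above; the 0.2 cutoff is an arbitrary example. Lowering the cutoff flags more papers for Set A, trading extra manual review in Step 2 for fewer missed trials:

# Test set confusion matrices at the default and at a lower threshold
table(test$trial, predTest >= 0.5)   # default cutoff of 0.5
table(test$trial, predTest >= 0.2)   # lower cutoff: fewer false negatives, more false positives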