Biomedical Text Mining: Cancer

Summary

The findAssocs function shows that the term “histolog” has a 0.85 association with “tumor”. The term “poor” has a 0.92 association with “prognosi”, and the term “rich” has an association of 1 with several terms, including “affect”, “metastas”, and “intraepitheli”.

Load libraries

library(wordcloud)
library(tm)
library(RISmed)
library(cluster)

Input of query term: Cancer

query <- "cancer"

The EUtilsSummary function retrieves summary information on the results of a query against a database of the National Center for Biotechnology Information (NCBI). Here the query is restricted to 2016 and capped at 100 articles. EUtilsGet downloads the matching records, which are stored in query_level3.

query_level2 <- EUtilsSummary(query, retmax=100, mindate=2016, maxdate=2016)
query_level3 <- EUtilsGet(query_level2)
class(query_level3)

## [1] "Medline"
## attr(,"package")
## [1] "RISmed"

AbstractText extracts the abstracts from the Medline object (RISmed). data.frame creates a data frame with “Abstract” as its only column.

pubmed_data <-data.frame("Abstract" = AbstractText(query_level3))

file.path builds the path of each output file inside a folder called “corpus”. write.table then writes each abstract to its own text file, named 1.txt through 9.txt. Only the first nine abstracts are written.

for(Abs in 1:9)
{
  doc1 <- data.frame(pubmed_data[Abs,])
  doc2 <- file.path("C:/Users/narce/OneDrive/Documents/GitHub/Biomedical/Cancer/corpus", paste0(Abs, ".txt"))
  write.table(doc1, file = doc2, sep = "", row.names = FALSE, col.names = FALSE, quote = FALSE,
              append = FALSE)
}
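The loop above hard-codes nine abstracts and an absolute path. A minimal sketch of the same step, looping over every row and writing into a temporary folder, might look like the following. The data frame demo_data, its two invented abstracts, and the folder corpus_dir are illustrative assumptions, not the data used in this report.

```r
# Hypothetical stand-in for pubmed_data; the abstracts are invented.
demo_data <- data.frame(
  Abstract = c("Tumor cells express stem cell markers.",
               "Poor prognosis was linked to tumor stage."),
  stringsAsFactors = FALSE
)

# Write into a temporary folder instead of a fixed absolute path.
corpus_dir <- file.path(tempdir(), "corpus")
dir.create(corpus_dir, showWarnings = FALSE, recursive = TRUE)

# seq_len(nrow(...)) covers every abstract, however many were returned.
for (i in seq_len(nrow(demo_data))) {
  out_file <- file.path(corpus_dir, paste0(i, ".txt"))
  writeLines(demo_data$Abstract[i], out_file)
}

# List the files that were written.
written <- list.files(corpus_dir, pattern = "\\.txt$")
print(written)
```

Using seq_len(nrow(...)) rather than 1:9 keeps the loop in step with the number of abstracts actually returned by the query.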

DirSource sets the directory and Corpus creates the corpus.

source <- DirSource("C:/Users/narce/OneDrive/Documents/GitHub/Biomedical/Cancer/corpus")
testdoc <- Corpus(source)

The tm_map function applies removeWords to each document, removing a custom word list together with the English stop words, and the result is stored in testdoc1.

testdoc1 <- tm_map(testdoc, removeWords, c("may", "are", "use", "can", "the", "then", "this", "is", "a", "well", stopwords("english")))
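To make the word-removal step concrete, here is a base R sketch of the general idea: whole-word deletion with a regular expression. It is not tm's implementation. The sentence and the word list drop_words are invented for illustration.

```r
# Illustrative sentence and word list (not taken from the corpus).
text <- "The tumor is small and the cells are well defined"
drop_words <- c("the", "is", "are", "and", "well")

# Build one pattern that matches any listed word at word boundaries.
pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")

# Matching is case-sensitive, so lower-case the text first.
cleaned <- gsub(pattern, "", tolower(text), perl = TRUE)

# Collapse the gaps left behind by the removed words.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

Because the match is case-sensitive in this sketch, a capitalized “The” would survive without the tolower step.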

The TermDocumentMatrix function creates a term-document matrix: it converts the text into tokens, removes stop words, punctuation, white space, and numbers, and stems words to their root form.

testdoc2 <- TermDocumentMatrix(testdoc1, control = list(tokenize = scan_tokenizer, stopwords = TRUE, 
            removePunctuation = TRUE, stripWhitespace = TRUE, stemming = TRUE, removeNumbers = TRUE))
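As a rough picture of what a term-document matrix holds, the following base R sketch counts terms per document by hand. The two documents d1 and d2 are invented, and the sketch skips stemming and stop word removal, so it only illustrates the layout (terms as rows, documents as columns).

```r
# Two invented documents standing in for abstracts.
docs <- c(d1 = "tumor cell tumor", d2 = "cell stem cell")

# Tokenize on whitespace.
tokens <- strsplit(docs, "\\s+")

# Vocabulary: every distinct term, sorted.
vocab <- sort(unique(unlist(tokens)))

# Count each term in each document; rows are terms, columns are documents.
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = vocab)))
print(tdm)

# Overall term frequencies, most frequent first.
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(freq)
```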

as.matrix converts the term-document matrix into a conventional matrix. rowSums and sort calculate the frequency of each term and place the terms in decreasing order, and data.frame stores them in a new data frame called testdoc5.

testdoc3 <- as.matrix(testdoc2)
testdoc4 <- sort(rowSums(testdoc3), decreasing = TRUE)
testdoc5 <- data.frame(word = names(testdoc4),freq=testdoc4)
head(testdoc5, 10)

##              word freq
## cell         cell   35
## tumor       tumor   27
## cancer     cancer   25
## express   express   21
## stem         stem   16
## ehsp         ehsp   10
## abt           abt    9
## factor     factor    9
## patient   patient    9
## prognosi prognosi    9

The findAssocs function searches for associations between the term “tumor” and other terms in the corpus, keeping those at or above the corlimit of 0.6. Here “histolog” has a 0.85 association with “tumor”.

findAssocs(x=testdoc2, term="tumor", corlimit = 0.6)

## $tumor
##        histolog           immun          absenc            anti 
##            0.85            0.81            0.80            0.80 
##            head            neck          affect         aggress 
##            0.80            0.80            0.74            0.74 
##      background          better            ctla            cxcr 
##            0.74            0.74            0.74            0.74 
##         dendrit         densiti            good         granzym 
##            0.74            0.74            0.74            0.74 
##          higher immunohistochem   immunotherapi         infiltr 
##            0.74            0.74            0.74            0.74 
##   intraepitheli            less           lower        lympocyt 
##            0.74            0.74            0.74            0.74 
##       macrophag        metastas    microenviron            most 
##            0.74            0.74            0.74            0.74 
##        neoadjuv         overcom             pdl        prognost 
##            0.74            0.74            0.74            0.74 
##            rest            rich           scchn         stromal 
##            0.74            0.74            0.74            0.74 
##         respect         respons            high    chemotherapi 
##            0.70            0.70            0.67            0.65 
##        squamous 
##            0.65

The term “poor” has a 0.92 association with “prognosi”.

findAssocs(x=testdoc2, term="poor", corlimit = 0.6)

## $poor
##        prognosi          method            serv         various 
##            0.92            0.87            0.87            0.87 
## clinicopatholog      differenti           stage         respect 
##            0.66            0.66            0.66            0.62 
##    chemotherapi          correl        histolog           immun 
##            0.61            0.61            0.61            0.61 
##           total 
##            0.61

The term “rich” has an association of 1 with several terms, including “affect”, “metastas”, and “intraepitheli”.

findAssocs(x=testdoc2, term="rich", corlimit = 0.6)

## $rich
##          affect         aggress      background          better 
##            1.00            1.00            1.00            1.00 
##            ctla            cxcr         dendrit         densiti 
##            1.00            1.00            1.00            1.00 
##            good         granzym          higher immunohistochem 
##            1.00            1.00            1.00            1.00 
##   immunotherapi         infiltr   intraepitheli            less 
##            1.00            1.00            1.00            1.00 
##           lower        lympocyt       macrophag        metastas 
##            1.00            1.00            1.00            1.00 
##    microenviron            most        neoadjuv         overcom 
##            1.00            1.00            1.00            1.00 
##             pdl        prognost            rest           scchn 
##            1.00            1.00            1.00            1.00 
##         stromal           immun         respect    chemotherapi 
##            1.00            0.98            0.95            0.88 
##        histolog          system         respons           tumor 
##            0.88            0.80            0.78            0.74 
##          factor          absenc            anti         conclus 
##            0.71            0.66            0.66            0.66 
##         distant            head          induct          method 
##            0.66            0.66            0.66            0.66 
##            neck         pathway          result            serv 
##            0.66            0.66            0.66            0.66 
##         suffici             use         various          wherea 
##            0.66            0.66            0.66            0.66 
##        prognosi 
##            0.65
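The association scores above are, as far as I understand findAssocs, correlations between the count vectors of two terms across documents. The base R sketch below reproduces that idea on invented counts. The matrix tdm and the helper assoc_with are hypothetical and not part of tm. It also suggests one plausible reason for the many scores of 1.00: in a small corpus, terms that occur only in the same single document have identical count vectors and therefore correlate perfectly.

```r
# Invented term-document counts: 4 terms (rows) across 5 documents (columns).
tdm <- rbind(
  tumor    = c(3, 0, 2, 0, 1),
  histolog = c(2, 0, 1, 0, 1),
  rich     = c(0, 0, 1, 0, 0),
  stromal  = c(0, 0, 1, 0, 0)
)
colnames(tdm) <- paste0("doc", 1:5)

# Hypothetical helper: correlate one term's counts with every other term's.
assoc_with <- function(m, term) {
  others <- setdiff(rownames(m), term)
  scores <- sapply(others, function(o) cor(m[term, ], m[o, ]))
  round(sort(scores, decreasing = TRUE), 2)
}

# "rich" and "stromal" occur only in doc3, so their vectors are identical.
print(assoc_with(tdm, "rich"))
print(assoc_with(tdm, "tumor"))
```

If that reading is right, a score of 1.00 here says that two stems always appear together in this small sample, not that they are strongly related in the literature at large.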

With set.seed set to 1234, a word cloud is constructed in which larger fonts represent words that occur more frequently in the corpus.

set.seed(1234)
wordcloud(words = testdoc5$word, freq = testdoc5$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.2,
          colors=brewer.pal(8, "Dark2"))

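The removeSparseTerms function removes sparse terms, here those absent from 70% or more of the documents, so that the hierarchical clustering works on a smaller set of terms. Note that this reassigns testdoc5, which previously held the frequency data frame.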

testdoc5 <- removeSparseTerms(testdoc2, 0.70)
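A minimal base R sketch of the sparsity filter, on invented counts, may help show what the 0.70 threshold does. The matrix tdm and the name max_sparse are assumptions for illustration. The rule shown (keep terms whose share of empty documents is below the threshold) is my reading of removeSparseTerms, not a copy of its code.

```r
# Invented counts: 3 terms (rows) across 5 documents (columns).
tdm <- rbind(
  cell    = c(2, 1, 0, 3, 1),  # absent from 1 of 5 documents (20% sparse)
  tumor   = c(1, 0, 0, 2, 0),  # absent from 3 of 5 documents (60% sparse)
  granzym = c(0, 0, 1, 0, 0)   # absent from 4 of 5 documents (80% sparse)
)

# Share of documents in which each term does not occur.
sparsity <- rowMeans(tdm == 0)

# Keep terms whose sparsity is below the threshold.
max_sparse <- 0.70
dense_tdm <- tdm[sparsity < max_sparse, , drop = FALSE]
print(rownames(dense_tdm))
```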

as.matrix converts the reduced term-document matrix into a conventional matrix.

c1 <- as.matrix(testdoc5)

The dist function computes and returns the distance matrix from the distances between the rows of the data matrix. The hclust function performs hierarchical cluster analysis on those dissimilarities.

c2 <- dist(c1)
c3 <- hclust(c2, method = "ward.D")
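The two calls above can be tried on a small example. In this sketch the matrix m and its counts are invented, and cutree is added only to show which terms end up grouped together; it is not part of the original analysis.

```r
# Invented counts: 4 terms (rows) across 3 documents (columns).
m <- rbind(
  cell    = c(5, 4, 6),
  tumor   = c(4, 5, 5),
  granzym = c(0, 1, 0),
  stromal = c(1, 0, 0)
)

# Euclidean distance between every pair of rows (terms).
d <- dist(m)

# Agglomerative clustering; "ward.D" mirrors the call in the text.
hc <- hclust(d, method = "ward.D")

# Cut the tree into two groups and show which term landed where.
groups <- cutree(hc, k = 2)
print(groups)
```

Terms with similar counts across documents sit close together in the distance matrix and are merged early in the tree.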

Creates a dendrogram.

plot(c3, hang = -1, asp=-1)

Creates a k-means clustering with the number of clusters set to 2, and plots it with clusplot.

km1 <- kmeans(c2, 2)
clusplot(as.matrix(c2), km1$cluster, color = T, shade = T, labels =2, lines = 0)
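The call above passes the distance matrix to kmeans. A minimal sketch of a different but common setup, clustering the term rows directly, is given here under stated assumptions: the matrix m and its counts are invented, and nstart = 10 is an addition to make the result less dependent on the random starting centers.

```r
# Invented counts: 4 terms (rows) across 3 documents (columns).
m <- rbind(
  cell    = c(5, 4, 6),
  tumor   = c(4, 5, 5),
  granzym = c(0, 1, 0),
  stromal = c(1, 0, 0)
)

# Fix the seed so the random starting centers are reproducible.
set.seed(1234)

# Partition the terms into two clusters, trying 10 random starts.
km <- kmeans(m, centers = 2, nstart = 10)

# Cluster label assigned to each term.
print(km$cluster)
```

The cluster labels themselves (1 or 2) are arbitrary; what matters is which terms share a label.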

Creates a document-term matrix.

doctest <- DocumentTermMatrix(testdoc1, control = list(tokenize=scan_tokenizer, stopwords = TRUE,
           removePunctuation = TRUE,
           stemming = TRUE,
           stripWhitespace = TRUE,
           removeNumbers = TRUE))

c1 <-as.matrix(testdoc5)
c2 <- dist(c1)
c3 <- hclust(c2, method = "ward.D")

Narcel Reedus - Data Analyst

November 12, 2017
