The findAssocs function reveals that the term “histolog” has a 0.85 association with “tumor”. The term “poor” has a 0.92 association with “prognosi”, and the term “rich” has an association of 1 with several terms, including “affect”, “metastas”, and “intraepitheli”.
Load libraries
library(wordcloud)
library(tm)
library(RISmed)
library(cluster)
The EUtilsSummary function retrieves summary information on the results of a query against any database of the National Center for Biotechnology Information (NCBI). Here the query is restricted to articles dated 2016, with a maximum of 100 results. EUtilsGet then downloads the records matching the query and stores them in query_level3.
query_level2 <- EUtilsSummary(query, retmax=100, mindate=2016, maxdate=2016)
query_level3 <- EUtilsGet(query_level2)
class(query_level3)
## [1] "Medline"
## attr(,"package")
## [1] "RISmed"
AbstractText extracts the abstracts from the Medline object (RISmed). data.frame then creates a data frame with “Abstract” as its only column.
pubmed_data <- data.frame("Abstract" = AbstractText(query_level3))
DirSource points to the directory that holds the documents, and Corpus builds the corpus from them.
source <- DirSource("C:/Users/narce/OneDrive/Documents/GitHub/Biomedical/Cancer/corpus")
testdoc <- Corpus(source)
The tm_map function removes a custom list of words together with the standard English stop words, and the result is stored in testdoc1.
testdoc1 <- tm_map(testdoc, removeWords, c("may", "are", "use", "can", "the", "then", "this", "is", "a", "well", stopwords("english")))
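To see what this step does, the following minimal sketch applies the same kind of tm_map call to two made-up sentences. The sentences, the shortened word list, and the toy_ names are illustrative assumptions and are not part of the original corpus.

```r
library(tm)

# Hypothetical toy documents standing in for the abstracts stored on disk
toy_text <- c("the tumor may respond to this therapy",
              "cells can express the marker")
toy_corpus <- VCorpus(VectorSource(toy_text))

# Same call shape as above: custom words plus the built-in English list
toy_clean <- tm_map(toy_corpus, removeWords,
                    c("may", "can", stopwords("english")))

# Pull the cleaned text back out of the corpus for inspection
toy_result <- sapply(toy_clean, as.character)
print(toy_result)
```

The removed words leave extra spaces behind, which is why white space is stripped in a later step.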
The TermDocumentMatrix function creates a term-document matrix. It splits the text into tokens, removes stop words, punctuation, numbers, and white space, and stems words to their root form.
testdoc2 <- TermDocumentMatrix(testdoc1, control = list(tokenize = scan_tokenizer, stopwords = TRUE,
                               removePunctuation = TRUE, stripWhitespace = TRUE, stemming = TRUE, removeNumbers = TRUE))
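As a rough illustration of the resulting structure, the sketch below builds a term-document matrix from two made-up sentences. The sentences and the toy_ names are assumptions for illustration, and stemming is left out so that the example does not depend on an additional stemming package.

```r
library(tm)

# Two hypothetical documents with punctuation, a number, and a stop word
toy_corpus2 <- VCorpus(VectorSource(c("Tumor cells, tumor growth!",
                                      "Stem cells and 3 markers")))

# A subset of the control options used above
toy_tdm <- TermDocumentMatrix(toy_corpus2,
                              control = list(removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             stopwords = TRUE))

# Rows are terms, columns are documents, cells are counts
inspect(toy_tdm)
```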
as.matrix converts the term-document matrix into a conventional matrix. rowSums then calculates the frequency of each term, sort places the frequencies in decreasing order, and data.frame stores them in a new data frame called testdoc5.
testdoc3 <- as.matrix(testdoc2)
testdoc4 <- sort(rowSums(testdoc3), decreasing = TRUE)
testdoc5 <- data.frame(word = names(testdoc4), freq = testdoc4)
head(testdoc5, 10)
##              word freq
## cell         cell   35
## tumor       tumor   27
## cancer     cancer   25
## express   express   21
## stem         stem   16
## ehsp         ehsp   10
## abt           abt    9
## factor     factor    9
## patient   patient    9
## prognosi prognosi    9
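The frequency calculation above can be traced by hand on a small matrix. This sketch uses base R only; the counts and the toy_ names are made up for illustration.

```r
# Hypothetical term-by-document counts (rows = terms, columns = documents)
toy_counts <- matrix(c(3, 0, 2,
                       1, 1, 0,
                       0, 4, 4),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("cell", "stem", "tumor"),
                                     c("doc1", "doc2", "doc3")))

# Total occurrences per term across all documents, largest first
toy_totals <- sort(rowSums(toy_counts), decreasing = TRUE)

# Same layout as the data frame above: one row per term
toy_freq <- data.frame(word = names(toy_totals), freq = toy_totals)
print(toy_freq)
```

Here “tumor” comes first with 8 occurrences, followed by “cell” with 5 and “stem” with 2.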
The findAssocs function searches for associations between the term “tumor” and other terms in the document corpus, returning every term whose correlation with “tumor” is at least corlimit (0.6 here). Here “histolog” has a 0.85 association with “tumor”.
findAssocs(x = testdoc2, terms = "tumor", corlimit = 0.6)
## $tumor
##        histolog           immun          absenc            anti
##            0.85            0.81            0.80            0.80
##            head            neck          affect         aggress
##            0.80            0.80            0.74            0.74
##      background          better            ctla            cxcr
##            0.74            0.74            0.74            0.74
##         dendrit         densiti            good         granzym
##            0.74            0.74            0.74            0.74
##          higher immunohistochem   immunotherapi         infiltr
##            0.74            0.74            0.74            0.74
##   intraepitheli            less           lower        lympocyt
##            0.74            0.74            0.74            0.74
##       macrophag        metastas    microenviron            most
##            0.74            0.74            0.74            0.74
##        neoadjuv         overcom             pdl        prognost
##            0.74            0.74            0.74            0.74
##            rest            rich           scchn         stromal
##            0.74            0.74            0.74            0.74
##         respect         respons            high    chemotherapi
##            0.70            0.70            0.67            0.65
##        squamous
##            0.65
The term “poor”, in turn, has a 0.92 association with “prognosi”.
findAssocs(x = testdoc2, terms = "poor", corlimit = 0.6)
## $poor
##        prognosi          method            serv         various
##            0.92            0.87            0.87            0.87
## clinicopatholog      differenti           stage         respect
##            0.66            0.66            0.66            0.62
##    chemotherapi          correl        histolog           immun
##            0.61            0.61            0.61            0.61
##           total
##            0.61
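The association scores above are correlations between term counts across documents. The base R sketch below reproduces the idea on a made-up matrix, assuming a plain Pearson correlation; the counts, the 0.6 cutoff applied by hand, and the toy_ names are illustrative and do not come from the corpus.

```r
# Hypothetical counts: rows are terms, columns are documents
toy_m <- matrix(c(2, 0, 1, 0,    # tumor
                  4, 0, 2, 0,    # histolog: rises and falls with tumor
                  0, 3, 0, 1),   # stem: appears where tumor is absent
                nrow = 3, byrow = TRUE,
                dimnames = list(c("tumor", "histolog", "stem"),
                                paste0("doc", 1:4)))

# cor() works on columns, so transpose to correlate terms with each other
toy_cor <- cor(t(toy_m))

# Correlations of every other term with "tumor"
toy_assoc <- toy_cor["tumor", setdiff(rownames(toy_cor), "tumor")]

# Keep only terms at or above the chosen lower limit
toy_assoc_kept <- round(toy_assoc[toy_assoc >= 0.6], 2)
print(toy_assoc_kept)
```

Only “histolog” passes the cutoff, because its counts move together with those of “tumor”, while “stem” is negatively correlated and is dropped.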
With the random seed fixed at 1234 through set.seed so that the layout is reproducible, a word cloud is constructed in which a larger font represents a word that occurs more frequently in the corpus.
set.seed(1234)
wordcloud(words = testdoc5$word, freq = testdoc5$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.2,
          colors = brewer.pal(8, "Dark2"))

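The removeSparseTerms function removes sparse terms from the term-document matrix in preparation for hierarchical clustering. With a threshold of 0.70, only terms that appear in more than 30% of the documents are kept. Note that the result overwrites testdoc5, which previously held the frequency data frame.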
testdoc5 <- removeSparseTerms(testdoc2, 0.70)
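The effect of the sparsity threshold can be sketched in base R. The example assumes that a term's sparsity is the share of documents in which it does not occur and that terms at or above the threshold are dropped; the counts and the toy_ names are made up.

```r
# Hypothetical counts: rows are terms, columns are five documents
toy_tdm_m <- matrix(c(1, 2, 1, 3, 1,   # "cell": present in 5 of 5 documents
                      0, 1, 0, 2, 0,   # "tumor": present in 2 of 5 documents
                      0, 0, 0, 1, 0),  # "cxcr": present in 1 of 5 documents
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("cell", "tumor", "cxcr"),
                                    paste0("doc", 1:5)))

# Share of documents in which each term is absent
toy_sparsity <- rowMeans(toy_tdm_m == 0)
print(toy_sparsity)

# Keep terms whose sparsity is below 0.70
toy_kept <- toy_tdm_m[toy_sparsity < 0.70, , drop = FALSE]
print(rownames(toy_kept))
```

“cxcr” is absent from 80% of the documents and is removed, while “tumor”, absent from 60%, is kept.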
as.matrix converts the reduced term-document matrix into a conventional matrix.
c1 <- as.matrix(testdoc5)
The dist function computes the distance matrix from the distances between the rows (terms) of the data matrix, using Euclidean distance by default. The hclust function then performs hierarchical cluster analysis on these dissimilarities, here with the ward.D method.
c2 <- dist(c1)
c3 <- hclust(c2, method = "ward.D")
The plot function draws the dendrogram.
plot(c3, hang = -1, asp = -1)
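The distance and clustering steps can be tried on a small made-up matrix. In this base R sketch the counts and the toy_ names are assumptions, and cutree is added only to read the groups off the tree without plotting it.

```r
# Hypothetical counts: two terms share documents 1-2, two share documents 3-4
toy_terms <- matrix(c(5, 4, 0, 0,
                      4, 5, 0, 1,
                      0, 0, 6, 5,
                      0, 1, 5, 6),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("tumor", "cancer", "stem", "cell"),
                                    paste0("doc", 1:4)))

# Euclidean distances between the term rows
toy_dist <- dist(toy_terms)

# Hierarchical clustering with the same method as above
toy_hc <- hclust(toy_dist, method = "ward.D")

# Cut the tree into two groups and list the group of each term
toy_groups <- cutree(toy_hc, k = 2)
print(toy_groups)
```

Terms that occur in the same documents end up close together, so “tumor” and “cancer” fall into one group and “stem” and “cell” into the other.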

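The kmeans function performs k-means clustering with the number of clusters set to 2, and clusplot plots the resulting clusters. Because kmeans is given the distance object c2, each term is clustered on its distances to all other terms.
km1 <- kmeans(c2, 2)
clusplot(as.matrix(c2), km1$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)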
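A minimal k-means sketch on made-up data follows. Unlike the call above, it clusters the raw count rows rather than a distance matrix, and it adds a seed and nstart = 10 for stable results; these choices, the counts, and the toy_ names are assumptions for illustration.

```r
set.seed(1234)

# Hypothetical term rows: two pairs of terms with similar document profiles
toy_points <- rbind(tumor  = c(5, 4, 0, 0),
                    cancer = c(4, 5, 0, 1),
                    stem   = c(0, 0, 6, 5),
                    cell   = c(0, 1, 5, 6))

# Two clusters, best of ten random starts
toy_km <- kmeans(toy_points, centers = 2, nstart = 10)

# Cluster label assigned to each term
print(toy_km$cluster)
```

The cluster numbers themselves are arbitrary; what matters is which terms share a label.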

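The DocumentTermMatrix function creates a document-term matrix, with documents as rows and terms as columns, using the same control options as before.
doctest <- DocumentTermMatrix(testdoc1, control = list(tokenize = scan_tokenizer, stopwords = TRUE,
                              removePunctuation = TRUE,
                              stemming = TRUE,
                              stripWhitespace = TRUE,
                              removeNumbers = TRUE))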
c1 <- as.matrix(testdoc5)
c2 <- dist(c1)
c3 <- hclust(c2, method = "ward.D")
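A document-term matrix holds the same counts as a term-document matrix with rows and columns swapped. The sketch below checks this on two made-up sentences; the sentences and the toy_ names are illustrative assumptions.

```r
library(tm)

# Two hypothetical documents
toy_docs <- VCorpus(VectorSource(c("tumor cells grow",
                                   "stem cells divide")))

toy_tdm2 <- TermDocumentMatrix(toy_docs)  # terms as rows
toy_dtm2 <- DocumentTermMatrix(toy_docs)  # documents as rows

# The dimensions are reversed and the counts match after transposing
print(dim(toy_tdm2))
print(dim(toy_dtm2))
toy_same <- all(as.matrix(toy_dtm2) == t(as.matrix(toy_tdm2)))
print(toy_same)
```

Which orientation to use depends on what is being clustered: rows are the items compared by dist, so a term-document matrix groups terms and a document-term matrix groups documents.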