In this document the reader will find the code to build a search engine that determines which documents in a collection are most relevant to a specific query. This is similar to what Google does when a user types a search term and the engine returns the websites most relevant to those words.
This can be applied in many situations, including academic research, where multiple documents are needed to gather sufficient information about a topic; the user can build a corpus (a collection of documents) and see which papers are relevant to a specific part of the research. The same process can also be part of a business project where the analyst wants to find emails that contain specific words.
Let’s say that, after applying clustering methods, the company discovered that salespeople who mentioned the word “discount” in their emails landed more sales; which seller is using this magic word?
First, import the documents saved in a particular folder.
# install.packages("tm")
# install.packages("dplyr")
library(tm)
library(dplyr)
##Create corpus
my.corpus<- Corpus(DirSource("~/Google Drive/Data Analysis/Proyectos/varios/word mining/papers"))
print(my.corpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 13
# extract the text of each document into a named list
doc.list <- lapply(seq_len(length(my.corpus)), function(i) content(my.corpus[[i]]))
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", 1:N.docs)
#create the search query
query <- "Female Genital Mutilation socioeconomic model"
my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")
my.corpus <- Corpus(my.docs)
my.corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 14
#remove punctuation
my.corpus <- tm_map(my.corpus, removePunctuation)
#remove numbers, uppercase, additional spaces
my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace)
#create the term-document matrix in an efficient (sparse) format
term.doc.matrix.stm <- TermDocumentMatrix(my.corpus)
colnames(term.doc.matrix.stm) <- c(names(doc.list), "query")
inspect(term.doc.matrix.stm[0:14, ])
## <<TermDocumentMatrix (terms: 14, documents: 14)>>
## Non-/sparse entries: 45/151
## Sparsity : 77%
## Maximal term length: 13
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms doc1 doc11 doc12 doc2 doc3 doc4 doc5 doc6 doc7 doc8
## abandoned 1 0 0 0 0 0 0 0 0 0
## abandonment 1 1 1 0 2 1 0 0 0 0
## abdominal 1 0 0 0 0 0 0 4 0 0
## abnormalities 1 0 0 0 0 0 0 0 0 0
## about 6 5 5 1 16 5 0 1 3 2
## abscesses 3 0 0 0 0 0 0 2 0 0
## absence 1 0 0 0 0 0 1 5 1 0
## absolute 1 0 0 0 0 0 6 0 0 0
## abstract 2 1 1 1 1 1 0 3 8 0
## abusahlieh 3 0 0 0 0 0 0 0 0 1
This part simply demonstrates that the simple triplet (sparse) format in which the matrix is stored is more memory-efficient than a regular dense matrix.
#compare number of bytes of normal matrix against triple matrix format
term.doc.matrix <- as.matrix(term.doc.matrix.stm)
cat("Dense matrix representation costs", object.size(term.doc.matrix), "bytes.\n",
"Simple triplet matrix representation costs", object.size(term.doc.matrix.stm),
"bytes.")
## Dense matrix representation costs 1237552 bytes.
## Simple triplet matrix representation costs 706016 bytes.
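The difference comes from the triplet format only storing the non-zero cells. A quick way to see this, assuming the i/j/v slots that tm (via the slam package) exposes on a TermDocumentMatrix:
# each stored cell is a (row index, column index, value) triplet
head(term.doc.matrix.stm$i)
head(term.doc.matrix.stm$j)
head(term.doc.matrix.stm$v)
# cells actually stored vs. cells a dense matrix would allocate
length(term.doc.matrix.stm$v)
prod(dim(term.doc.matrix.stm))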
This table shows the most frequent words in the corpus.
#show word ranking
most_frecuent_matrix<- addmargins(term.doc.matrix, margin = 2)
most_frecuent_matrix<- most_frecuent_matrix[order(most_frecuent_matrix[,15], decreasing = TRUE),]
most_frecuent_matrix_top50<- head(most_frecuent_matrix[,c(1,15)], n = 50)
print(most_frecuent_matrix_top50)
## Docs
## Terms doc1 Sum
## the 316 2967
## and 208 2180
## fgm 41 768
## female 46 629
## health 25 619
## for 38 598
## genital 14 594
## – 0 567
## women 42 556
## mutilation 16 438
## with 39 426
## that 32 386
## from 35 359
## were 3 344
## are 43 336
## this 16 318
## was 8 261
## circumcision 44 251
## study 1 249
## type 10 246
## complications 15 241
## have 27 241
## practice 31 233
## not 18 210
## more 14 195
## social 4 184
## studies 1 178
## who 18 177
## their 13 156
## consequences 4 153
## socioeconomic 3 142
## reproductive 0 137
## countries 6 136
## circumcised 10 131
## had 6 131
## girls 12 130
## also 6 128
## can 12 126
## among 15 125
## sexual 17 124
## which 11 121
## years 6 120
## been 10 119
## data 2 114
## african 7 111
## infections 12 111
## infection 6 109
## during 9 107
## prevalence 1 107
## education 1 104
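Unsurprisingly, the ranking is dominated by common function words such as “the” and “and”, because stop words were not removed during preprocessing. If that is undesirable, tm can filter them out; the line below is an optional variation that is not applied in the rest of this document:
# optional: drop common English stop words before building the term-document matrix
corpus.nostop <- tm_map(my.corpus, removeWords, stopwords("english"))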
A word cloud can be created to get a visual representation of the words with the most appearances.
# install.packages("wordcloud")
library(wordcloud)
wordTable<- head(most_frecuent_matrix[,15], n = 150)
wordcloud(words = names(wordTable),
freq = as.numeric(wordTable),
scale = c(2, 1),
min.freq = 1)
There is a lot of theory behind this part (TF-IDF weighting and the vector space model); it will not be explained in this document, but the reader can always take the code and adapt it to any case.
#constructing the Vector Space Model
get.tf.idf.weights <- function(tf.vec) {
  # Compute tf-idf weights from a term frequency vector
  n.docs <- length(tf.vec)
  doc.frequency <- length(tf.vec[tf.vec > 0])
  weights <- rep(0, length(tf.vec))
  weights[tf.vec > 0] <- (1 + log2(tf.vec[tf.vec > 0])) * log2(n.docs/doc.frequency)
  return(weights)
}
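As a quick sanity check of the weighting, consider a made-up term frequency vector (not taken from the corpus): for a term appearing in 2 of 4 documents, the non-zero entries get (1 + log2(tf)) * log2(4/2).
# term frequencies 3, 0, 1, 0 across four hypothetical documents
get.tf.idf.weights(c(3, 0, 1, 0))
# expected: (1 + log2(3)) * 1 = 2.585 for the first, 1 for the third, 0 elsewhere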
tfidf.matrix <- t(apply(term.doc.matrix, 1,
FUN = function(row) {get.tf.idf.weights(row)}))
colnames(tfidf.matrix) <- colnames(term.doc.matrix)
tfidf.matrix[0:3, ]
##
## Terms doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9
## abandoned 3.807355 0 0.000000 0.000000 0 0.000000 0 0 0
## abandonment 1.222392 0 2.444785 1.222392 0 0.000000 0 0 0
## abdominal 2.807355 0 0.000000 0.000000 0 8.422065 0 0 0
##
## Terms doc10 doc11 doc12 doc13 query
## abandoned 0.000000 0.000000 0.000000 0 0
## abandonment 1.222392 1.222392 1.222392 0 0
## abdominal 0.000000 0.000000 0.000000 0 0
tfidf.matrix <- scale(tfidf.matrix, center = FALSE,
scale = sqrt(colSums(tfidf.matrix^2)))
tfidf.matrix[0:3, ]
##
## Terms doc1 doc2 doc3 doc4 doc5 doc6 doc7
## abandoned 0.026521074 0 0.00000000 0.00000000 0 0.00000000 0
## abandonment 0.008514877 0 0.01258962 0.01377428 0 0.00000000 0
## abdominal 0.019555326 0 0.00000000 0.00000000 0 0.06007484 0
##
## Terms doc8 doc9 doc10 doc11 doc12 doc13 query
## abandoned 0 0 0.000000 0.00000000 0.00000000 0 0
## abandonment 0 0 0.007443 0.01377428 0.01377428 0 0
## abdominal 0 0 0.000000 0.00000000 0.00000000 0 0
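After the call to scale(), every column has unit length, so the dot product between the query column and a document column is their cosine similarity. A quick optional check:
# every column should now have squared length (approximately) 1
round(colSums(tfidf.matrix^2), 6)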
query.vector <- tfidf.matrix[, (N.docs + 1)]
tfidf.matrix <- tfidf.matrix[, 1:N.docs]
doc.scores <- t(query.vector) %*% tfidf.matrix
results.df <- data.frame(doc = names(doc.list), score = t(doc.scores),
text = unlist(doc.list))
results.df <- results.df[order(results.df$score, decreasing = TRUE), ]
The model scores the indicated query against every document in the corpus; the result is a ranking of the documents by relevance to that query.
The higher the score, the more relevant the document is.
#print scores of documents by the query
results<- results.df[,c(1,2)]
print(results, row.names = FALSE, right = FALSE, digits = 2)
## doc score
## doc5 0.0467
## doc10 0.0351
## doc3 0.0287
## doc9 0.0195
## doc6 0.0103
## doc4 0.0042
## doc11 0.0042
## doc12 0.0042
## doc13 0.0027
## doc2 0.0026
## doc7 0.0021
## doc1 0.0017
## doc8 0.0017
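Because the text of each document was kept in results.df, the content of the top-ranked document can be inspected directly; for example, its first few hundred characters:
# peek at the beginning of the highest-scoring document
substr(results.df$text[1], 1, 300)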
The three highest-ranked documents are:
Doc5: Estimating the obstetric costs of Female Genital Mutilation in six African countries
Doc10: Forecasting High-Priority Infectious Disease Surveillance Regions: A Socioeconomic Model
Doc3: Exploration of pathways related to the decline in female circumcision in Egypt
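Finally, the whole pipeline can be wrapped in a small helper so that different queries can be scored against the same doc.list. This is only a minimal sketch that reuses get.tf.idf.weights() and the preprocessing steps shown above; the function name and the example query are made up for illustration.
search.corpus <- function(query, doc.list) {
  # build one corpus containing the documents plus the query
  corpus <- Corpus(VectorSource(c(doc.list, query)))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  # term-document matrix, tf-idf weighting and cosine normalization
  tdm <- as.matrix(TermDocumentMatrix(corpus))
  tfidf <- t(apply(tdm, 1, get.tf.idf.weights))
  tfidf <- scale(tfidf, center = FALSE, scale = sqrt(colSums(tfidf^2)))
  # dot product of the query column with each document column
  n <- length(doc.list)
  scores <- as.numeric(t(tfidf[, n + 1]) %*% tfidf[, 1:n])
  data.frame(doc = names(doc.list), score = scores)[order(scores, decreasing = TRUE), ]
}
# example: search.corpus("socioeconomic consequences of circumcision", doc.list)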
For a more in-depth explanation of the process and theory, click here.