Data Retrieval Report

Using a Vector Space Model


In this document the reader will find the code to build a search engine capable of determining the most relevant documents for a specific query. This is similar to what Google does when a user enters a search term and the engine returns websites relevant to those words.

This can be applied in many situations, including academic research, where multiple documents are needed to collect sufficient information about a topic: the user can build a corpus (a collection of documents) and see which papers are relevant to a specific part of the research. The process can also be part of a business project where an analyst wants to find emails that contain specific words.

Let’s say that, after applying clustering methods, the company discovered that vendors who mentioned the word “discount” in emails landed more sales; which seller is using this magic word?

Preparation and Inspection of Corpus

Import the documents saved in a particular folder.

# install.packages("tm")
# install.packages("dplyr")
library(tm)
library(dplyr)

# create corpus from all files in the folder
my.corpus <- Corpus(DirSource("~/Google Drive/Data Analysis/Proyectos/varios/word mining/papers"))
print(my.corpus)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 13
# extract the text of each document and collect it in a named list
doc.list <- lapply(seq_len(length(my.corpus)),
                   function(i) content(my.corpus[[i]]))
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", seq_len(N.docs))

#create the search query
query <- "Female Genital Mutilation socioeconomic model"

# append the query as one more "document" so it receives the same
# preprocessing and weighting as the corpus
my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")

my.corpus <- Corpus(my.docs)
my.corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 14
#remove punctuation
my.corpus <- tm_map(my.corpus, removePunctuation)

#remove numbers, convert to lowercase, strip extra whitespace
my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace)
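
# optionally, common English stopwords could also be removed; this sketch is
# NOT applied in this report, which is why words like "the" and "and" still
# dominate the frequency table further below
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))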

# create the term-document matrix (tm stores it in a memory-efficient
# simple triplet format)
term.doc.matrix.stm <- TermDocumentMatrix(my.corpus)
colnames(term.doc.matrix.stm) <- c(names(doc.list), "query")
inspect(term.doc.matrix.stm[1:14, ])
## <<TermDocumentMatrix (terms: 14, documents: 14)>>
## Non-/sparse entries: 45/151
## Sparsity           : 77%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## Sample             :
##                Docs
## Terms           doc1 doc11 doc12 doc2 doc3 doc4 doc5 doc6 doc7 doc8
##   abandoned        1     0     0    0    0    0    0    0    0    0
##   abandonment      1     1     1    0    2    1    0    0    0    0
##   abdominal        1     0     0    0    0    0    0    4    0    0
##   abnormalities    1     0     0    0    0    0    0    0    0    0
##   about            6     5     5    1   16    5    0    1    3    2
##   abscesses        3     0     0    0    0    0    0    2    0    0
##   absence          1     0     0    0    0    0    1    5    1    0
##   absolute         1     0     0    0    0    0    6    0    0    0
##   abstract         2     1     1    1    1    1    0    3    8    0
##   abusahlieh       3     0     0    0    0    0    0    0    0    1


Verification of Storage Efficiency

This part simply demonstrates that the simple triplet format in which the term-document matrix is stored is more memory-efficient than a regular dense matrix.

# compare the size in bytes of the dense matrix against the simple
# triplet format
term.doc.matrix <- as.matrix(term.doc.matrix.stm)

cat("Dense matrix representation costs", object.size(term.doc.matrix), "bytes.\n", 
    "Simple triplet matrix representation costs", object.size(term.doc.matrix.stm), 
    "bytes.")
## Dense matrix representation costs 1237552 bytes.
##  Simple triplet matrix representation costs 706016 bytes.
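
The saving comes from storing only the non-zero cells as (row, column, value) triples. A minimal sketch of the idea, using the slam package that tm builds on:

library(slam)
# a 3 x 2 matrix with two non-zero cells is stored as just two triples
m <- simple_triplet_matrix(i = c(1, 3), j = c(2, 2), v = c(5, 1),
                           nrow = 3, ncol = 2)
as.matrix(m)  # the dense form must also store the four zeros explicitly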


Most Frequent Words

This table shows the 50 most frequent words in the corpus.

# rank words by total frequency; addmargins appends a Sum column
# (column 15 = 13 documents + query + Sum)
most_frequent_matrix <- addmargins(term.doc.matrix, margin = 2)
most_frequent_matrix <- most_frequent_matrix[order(most_frequent_matrix[, 15], decreasing = TRUE), ]
most_frequent_matrix_top50 <- head(most_frequent_matrix[, c(1, 15)], n = 50)
print(most_frequent_matrix_top50)
##                Docs
## Terms           doc1  Sum
##   the            316 2967
##   and            208 2180
##   fgm             41  768
##   female          46  629
##   health          25  619
##   for             38  598
##   genital         14  594
##   –                0  567
##   women           42  556
##   mutilation      16  438
##   with            39  426
##   that            32  386
##   from            35  359
##   were             3  344
##   are             43  336
##   this            16  318
##   was              8  261
##   circumcision    44  251
##   study            1  249
##   type            10  246
##   complications   15  241
##   have            27  241
##   practice        31  233
##   not             18  210
##   more            14  195
##   social           4  184
##   studies          1  178
##   who             18  177
##   their           13  156
##   consequences     4  153
##   socioeconomic    3  142
##   reproductive     0  137
##   countries        6  136
##   circumcised     10  131
##   had              6  131
##   girls           12  130
##   also             6  128
##   can             12  126
##   among           15  125
##   sexual          17  124
##   which           11  121
##   years            6  120
##   been            10  119
##   data             2  114
##   african          7  111
##   infections      12  111
##   infection        6  109
##   during           9  107
##   prevalence       1  107
##   education        1  104


A word cloud can be created to give a visual representation of the most frequent words.

# install.packages("wordcloud")
library(wordcloud)
wordTable <- head(most_frequent_matrix[, 15], n = 150)

wordcloud(words = names(wordTable),
          freq = as.numeric(wordTable),
          scale = c(2, 1),
          min.freq = 1)
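
Here scale sets the range of word sizes in the cloud, and min.freq is the minimum frequency a word needs in order to be plotted.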


Application of Vector Space Model

There is a lot of theory behind this part; it won’t be covered in this document, but the reader can take the code and adapt it to any case.
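
In short, the code below implements tf-idf weighting with cosine-similarity scoring. Each term t in document d with term frequency tf > 0 receives the weight

$$ w_{t,d} = \left(1 + \log_2 \mathrm{tf}_{t,d}\right)\,\log_2\!\left(\frac{N}{\mathrm{df}_t}\right), $$

and 0 otherwise, where df_t is the number of documents containing the term and N is the total number of documents (14 here, counting the query). Once every column is scaled to unit length, the dot product between the query column and a document column equals their cosine similarity.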

#constructing the Vector Space Model
get.tf.idf.weights <- function(tf.vec) {
  # Compute tfidf weights from term frequency vector
  n.docs <- length(tf.vec)
  doc.frequency <- length(tf.vec[tf.vec > 0])
  weights <- rep(0, length(tf.vec))
  weights[tf.vec > 0] <- (1 + log2(tf.vec[tf.vec > 0])) * log2(n.docs/doc.frequency)
  return(weights)
}
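
# quick check of the weighting on a toy term-frequency vector (hypothetical
# numbers): a term appearing in 2 of 4 documents, with counts 3 and 1
get.tf.idf.weights(c(3, 0, 1, 0))
# expected: (1 + log2(3)) * log2(4/2) = 2.585 for the first entry,
# 1 for the third, and 0 elsewhere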

# apply the weighting to every term (row) of the term-document matrix
tfidf.matrix <- t(apply(term.doc.matrix, 1, FUN = get.tf.idf.weights))
colnames(tfidf.matrix) <- colnames(term.doc.matrix)
tfidf.matrix[1:3, ]
##              
## Terms             doc1 doc2     doc3     doc4 doc5     doc6 doc7 doc8 doc9
##   abandoned   3.807355    0 0.000000 0.000000    0 0.000000    0    0    0
##   abandonment 1.222392    0 2.444785 1.222392    0 0.000000    0    0    0
##   abdominal   2.807355    0 0.000000 0.000000    0 8.422065    0    0    0
##              
## Terms            doc10    doc11    doc12 doc13 query
##   abandoned   0.000000 0.000000 0.000000     0     0
##   abandonment 1.222392 1.222392 1.222392     0     0
##   abdominal   0.000000 0.000000 0.000000     0     0
# normalize each column (document) to unit Euclidean length so that dot
# products between columns correspond to cosine similarity
tfidf.matrix <- scale(tfidf.matrix, center = FALSE,
                      scale = sqrt(colSums(tfidf.matrix^2)))
tfidf.matrix[1:3, ]
##              
## Terms                doc1 doc2       doc3       doc4 doc5       doc6 doc7
##   abandoned   0.026521074    0 0.00000000 0.00000000    0 0.00000000    0
##   abandonment 0.008514877    0 0.01258962 0.01377428    0 0.00000000    0
##   abdominal   0.019555326    0 0.00000000 0.00000000    0 0.06007484    0
##              
## Terms         doc8 doc9    doc10      doc11      doc12 doc13 query
##   abandoned      0    0 0.000000 0.00000000 0.00000000     0     0
##   abandonment    0    0 0.007443 0.01377428 0.01377428     0     0
##   abdominal      0    0 0.000000 0.00000000 0.00000000     0     0
# separate the query vector from the document vectors
query.vector <- tfidf.matrix[, (N.docs + 1)]
tfidf.matrix <- tfidf.matrix[, 1:N.docs]

# score each document: since every column has unit length, this dot
# product equals the cosine similarity between query and document
doc.scores <- t(query.vector) %*% tfidf.matrix

results.df <- data.frame(doc = names(doc.list), score = t(doc.scores),
                         text = unlist(doc.list))
# sort the documents from most to least relevant
results.df <- results.df[order(results.df$score, decreasing = TRUE), ]


Relevance Scoring

The model scores every document in the corpus against the indicated query; the result ranks the documents from most to least relevant.

Query: “Female Genital Mutilation socioeconomic model”

The higher the score, the more relevant the document is.

# print document scores for the query, highest first
results <- results.df[, c(1, 2)]
print(results, row.names = FALSE, right = FALSE, digits = 2)
##  doc   score 
##  doc5  0.0467
##  doc10 0.0351
##  doc3  0.0287
##  doc9  0.0195
##  doc6  0.0103
##  doc4  0.0042
##  doc11 0.0042
##  doc12 0.0042
##  doc13 0.0027
##  doc2  0.0026
##  doc7  0.0021
##  doc1  0.0017
##  doc8  0.0017


Top 3 Documents

Doc5: Estimating the obstetric costs of Female Genital Mutilation in six African countries

Doc10: Forecasting High-Priority Infectious Disease Surveillance Regions: A Socioeconomic Model

Doc3: Exploration of pathways related to the decline in female circumcision in Egypt
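
As a convenience, the whole pipeline could be wrapped into a single function so that new queries can be scored without re-running the steps by hand. This is a minimal sketch, not code from the report: search.corpus is a hypothetical helper that reuses get.tf.idf.weights() and the same preprocessing as above.

# hypothetical helper: score a new query against the document list
search.corpus <- function(query, doc.list) {
  corp <- Corpus(VectorSource(c(doc.list, query)))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, stripWhitespace)
  tdm <- as.matrix(TermDocumentMatrix(corp))
  w <- t(apply(tdm, 1, get.tf.idf.weights))
  # unit-length columns so the dot product below is cosine similarity
  w <- scale(w, center = FALSE, scale = sqrt(colSums(w^2)))
  n <- length(doc.list)
  scores <- as.vector(t(w[, n + 1]) %*% w[, 1:n])
  sort(setNames(scores, names(doc.list)), decreasing = TRUE)
}

# e.g., with a new (hypothetical) query:
# search.corpus("female circumcision prevalence", doc.list)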



For a more in-depth explanation of the process and theory, click here