Corpus
Doc1 : “So far in this course, we have studied association rule mining and we learned about frequent itemsets”
Doc2 : “During the past couple of weeks, we started learning about Text Mining”
Doc3 : “Data Mining and Text Mining are fun topics!”
Doc4 : “Holidays are fun! I can’t wait till my next holiday…”
Step.1
- Convert the documents into a vector of strings, where each element holds one document's text.
docs <- c("So far in this course, we have studied association rule mining and we learned about frequent itemsets",
"During the past couple of weeks, we started learning about Text Mining",
"Data Mining and Text Mining are fun topics!" ,
"Holidays are fun! I can't wait till my next holiday...")
Step.2
- Construct the corpus from the docs vector we created in Step 1.
- Inspect the corpus to verify its contents.
docCorpus <- Corpus(VectorSource(docs))
inspect(docCorpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 4
##
## [1] So far in this course, we have studied association rule mining and we learned about frequent itemsets
## [2] During the past couple of weeks, we started learning about Text Mining
## [3] Data Mining and Text Mining are fun topics!
## [4] Holidays are fun! I can't wait till my next holiday...
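Individual documents can also be pulled out of the corpus by index; for example (a quick illustration, output not shown in the original):
#retrieve the text of the third document
as.character(docCorpus[[3]])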
Step.3
- Visualize the corpus words using a word cloud.
wordcloud(docCorpus, min.freq = 1, random.order = FALSE, colors = "red")
Step.4
- Apply the following preprocessing steps:
1. Case folding to lowercase.
2. Remove stopwords.
3. Remove punctuation.
4. Word stemming.
#case folding to lowercase (wrapped in content_transformer so the corpus structure is preserved)
corpus.clean <- tm_map(docCorpus, content_transformer(tolower))
#remove stopwords
corpus.clean <- tm_map(corpus.clean, removeWords, stopwords())
#remove punctuation
corpus.clean <- tm_map(corpus.clean, removePunctuation)
#apply stemming to the corpus
corpus.clean <- tm_map(corpus.clean, stemDocument, language = "english")
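As a quick sanity check (not part of the original output), we can inspect the cleaned corpus and redraw the word cloud; after stopword removal and stemming it should be dominated by stems such as "mine", "text", and "fun".
#inspect the cleaned corpus and visualize it again
inspect(corpus.clean)
wordcloud(corpus.clean, min.freq = 1, random.order = FALSE, colors = "red")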
Step.5
- Construct the document-term matrix using three weighting schemes:
1. TF matrix (the default).
2. TF-IDF matrix.
3. Binary matrix.
#the default method is TF
tf.docMatrix <- DocumentTermMatrix(corpus.clean)
#use the tf-idf method to represent the corpus
tf.idf.docMatrix <- DocumentTermMatrix(corpus.clean, control = list(weighting = weightTfIdf))
#using the binary method to represent the corpus
bin.docMatrix <- DocumentTermMatrix(corpus.clean, control = list(weighting = weightBin))
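To see what these weightings actually produce (an optional check; output not reproduced here), each matrix can be converted to an ordinary matrix: TF holds raw term counts, TF-IDF holds term frequencies scaled by inverse document frequency (and, by tm's default, normalized by document length), and the binary matrix holds 0/1 presence flags.
#compare the three weighting schemes side by side
as.matrix(tf.docMatrix)
as.matrix(tf.idf.docMatrix)
as.matrix(bin.docMatrix)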
Step.6
- Calculate the cosine similarity between each document in the corpus and the query "fun topics", sorted in descending order. First, we construct the query vector.
#calculating similarity based on the tf-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(tf.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(tf.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.50 0.25 0.00 0.00
#calculating similarity based on the tf-idf-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(tf.idf.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(tf.idf.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.6488394 0.1313064 0.0000000 0.0000000
#calculating similarity based on the binary-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(bin.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(bin.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.6324555 0.3162278 0.0000000 0.0000000
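As a quick check on what cosine() actually computes, the same score can be reproduced from first principles: the dot product of the two vectors divided by the product of their norms. The sketch below (cos.sim is our own helper, not part of lsa) reuses the q built in the binary block above; document 3 shares 2 of its 5 distinct stems with the 2-term query, giving 2/sqrt(5 * 2) ≈ 0.632, matching the output.
#cosine similarity from scratch: dot product over the product of the vector norms
cos.sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
#should reproduce the binary-matrix score for document 3 (~0.632)
cos.sim(as.matrix(bin.docMatrix)[3, ], as.vector(q))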
Conclusion
- As the rankings show, document 3 is the most relevant document to the query, followed by document 4, under all three weighting schemes.
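As a final note, the whole ranking step can be wrapped in a small convenience function (rank.docs is our own hypothetical helper, not part of tm or lsa) so that any query string can be scored against any of the three matrices:
#rank all corpus documents against a query string for a given document-term matrix
rank.docs <- function(qtext, dtm) {
  q <- query(qtext, colnames(as.matrix(dtm)), stemming = TRUE, language = "english")
  sort(apply(as.matrix(dtm), 1, cosine, as.vector(q)), decreasing = TRUE)
}
rank.docs("fun topics", tf.idf.docMatrix)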