Corpus
Doc1 : “So far in this course, we have studied association rule mining and we learned about frequent itemsets”
Doc2 : “During the past couple of weeks, we started learning about Text Mining”
Doc3 : “Data Mining and Text Mining are fun topics!”
Doc4 : “Holidays are fun! I can’t wait till my next holiday…”
Step.1
- Convert the documents into a vector of strings, where each element holds one document's text.
docs <- c("So far in this course, we have studied association rule mining and we learned about frequent itemsets",
"During the past couple of weeks, we started learning about Text Mining",
"Data Mining and Text Mining are fun topics!" ,
"Holidays are fun! I can't wait till my next holiday...")
Step.2
- Construct the corpus from the docs vector we created in Step 1.
- Inspect the corpus to verify its contents.
docCorpus <- Corpus(VectorSource(docs))
inspect(docCorpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 4
##
## [1] So far in this course, we have studied association rule mining and we learned about frequent itemsets
## [2] During the past couple of weeks, we started learning about Text Mining
## [3] Data Mining and Text Mining are fun topics!
## [4] Holidays are fun! I can't wait till my next holiday...
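Individual documents can also be pulled out of the corpus by index; for example (a quick illustration, output not shown in the original):
#retrieve the text of the third document
as.character(docCorpus[[3]])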
Step.3
- Visualize the corpus words using a word cloud.
wordcloud(docCorpus, min.freq = 1, random.order = FALSE, colors = "red")
Step.4
- Apply the following preprocessing steps:
1. Case folding to lowercase.
2. Remove stopwords.
3. Remove punctuation.
4. Word stemming.
#case folding to lowercase (wrapped in content_transformer so the corpus structure is preserved)
corpus.clean <- tm_map(docCorpus, content_transformer(tolower))
#remove stopwords
corpus.clean <- tm_map(corpus.clean, removeWords, stopwords())
#remove punctuation
corpus.clean <- tm_map(corpus.clean, removePunctuation)
#apply stemming to the corpus
corpus.clean <- tm_map(corpus.clean, stemDocument, language = "english")
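As a quick sanity check (not part of the original output), we can inspect the cleaned corpus and redraw the word cloud; after stopword removal and stemming it should be dominated by stems such as "mine", "text", and "fun".
#inspect the cleaned corpus and visualize it again
inspect(corpus.clean)
wordcloud(corpus.clean, min.freq = 1, random.order = FALSE, colors = "red")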
Step.5
- Construct the document-term matrix using three weighting schemes:
1. TF matrix (the default).
2. TF-IDF matrix.
3. Binary matrix.
#the default method is TF
tf.docMatrix <- DocumentTermMatrix(corpus.clean)
#use the tf-idf method to represent the corpus
tf.idf.docMatrix <- DocumentTermMatrix(corpus.clean, control = list(weighting = weightTfIdf))
#using the binary method to represent the corpus
bin.docMatrix <- DocumentTermMatrix(corpus.clean, control = list(weighting = weightBin))
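To see what these weightings actually produce (an optional check; output not reproduced here), each matrix can be converted to an ordinary matrix: TF holds raw term counts, TF-IDF holds term frequencies scaled by inverse document frequency (and, by tm's default, normalized by document length), and the binary matrix holds 0/1 presence flags.
#compare the three weighting schemes side by side
as.matrix(tf.docMatrix)
as.matrix(tf.idf.docMatrix)
as.matrix(bin.docMatrix)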
Step.6
- Calculate the cosine similarity between each document in the corpus and the query "fun topics", sorted in descending order. First, we construct the query vector.
#calculating similarity based on the tf-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(tf.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(tf.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.50 0.25 0.00 0.00
#calculating similarity based on the tf-idf-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(tf.idf.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(tf.idf.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.6488394 0.1313064 0.0000000 0.0000000
#calculating similarity based on the binary-matrix; the output is ranked by most relevant document
q <- query("fun topics", colnames(as.matrix(bin.docMatrix)), stemming = TRUE, language = "english")
sort(apply(as.matrix(bin.docMatrix), 1, cosine, as.vector(q)), decreasing = T)
## 3 4 1 2
## 0.6324555 0.3162278 0.0000000 0.0000000
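As a quick check on what cosine() actually computes, the same score can be reproduced from first principles: the dot product of the two vectors divided by the product of their norms. The sketch below (cos.sim is our own helper, not part of lsa) reuses the q built in the binary block above; document 3 shares 2 of its 5 distinct stems with the 2-term query, giving 2/sqrt(5 * 2) ≈ 0.632, matching the output.
#cosine similarity from scratch: dot product over the product of the vector norms
cos.sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
#should reproduce the binary-matrix score for document 3 (~0.632)
cos.sim(as.matrix(bin.docMatrix)[3, ], as.vector(q))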
Conclusion
- As the rankings show, document 3 is the most relevant document to the query, followed by document 4, under all three weighting schemes.
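As a final note, the whole ranking step can be wrapped in a small convenience function (rank.docs is our own hypothetical helper, not part of tm or lsa) so that any query string can be scored against any of the three matrices:
#rank all corpus documents against a query string for a given document-term matrix
rank.docs <- function(qtext, dtm) {
  q <- query(qtext, colnames(as.matrix(dtm)), stemming = TRUE, language = "english")
  sort(apply(as.matrix(dtm), 1, cosine, as.vector(q)), decreasing = TRUE)
}
rank.docs("fun topics", tf.idf.docMatrix)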