This is the continuation of the TF-IDF presentation.
We previously examined the TF-IDF algorithm and worked through a basic example to highlight the concept. This part of the presentation focuses on a more practical application of TF-IDF in R. I modified a fairly intuitive example I found, which is cited in the PowerPoint.
We need to load the tm and dplyr libraries (installing them first with install.packages() if they are not already available).
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.4.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let's create a corpus that consists of mini documents:
documents <- c("Data science is fun",
               "machine learning is cool. I should document more of my work.",
               "this is a Document",
               "hello darkness my old friend",
               "I am good at saving data science notes as word documents",
               "the science data machine learning hello data")
We need to calculate the term frequency. The tm library lets us form a matrix where each row corresponds to a word in the corpus and each column corresponds to a document. Remember we made 6 mini documents total, so there should be 6 columns.
Each entry in our matrix is the number of times a word appears in a document. For example, the word "data" appears once in document 1, once in document 5, and twice in document 6.
corpus <- Corpus( VectorSource(documents) )
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(corpus, control = control_list)
# print
( tf <- as.matrix(tdm) )
## Docs
## Terms 1 2 3 4 5 6
## data 1 0 0 0 1 2
## fun 1 0 0 0 0 0
## science 1 0 0 0 1 1
## cool 0 1 0 0 0 0
## document 0 1 1 0 0 0
## learning 0 1 0 0 0 1
## machine 0 1 0 0 0 1
## work 0 1 0 0 0 0
## darkness 0 0 0 1 0 0
## friend 0 0 0 1 0 0
## hello 0 0 0 1 0 1
## old 0 0 0 1 0 0
## documents 0 0 0 0 1 0
## good 0 0 0 0 1 0
## notes 0 0 0 0 1 0
## saving 0 0 0 0 1 0
## word 0 0 0 0 1 0
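As a quick sanity check of the counts described above, we can index the matrix directly (this sketch only uses the tf object we just created):
# Row of the term-frequency matrix for "data": once in documents 1 and 5,
# twice in document 6, and zero everywhere else
tf["data", ]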
Why do words such as "is" not appear in our matrix? And weren't there uppercase letters? Notice that I set tolower = TRUE and stopwords = TRUE. These options make every word lower case and discard common stop words such as "the" or "is." removePunctuation = TRUE strips out any commas or periods.
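For the curious, tm ships with a built-in English stop word list, which is what stopwords = TRUE effectively filters against for English text; a small sketch to peek at it:
# First few entries of tm's English stop word list; "is" is on it, which is
# why it never appears as a row in our term-document matrix
head(stopwords("english"), 10)
"is" %in% stopwords("english")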
We need to compute the inverse document frequency (recall our example with the word dog). The expression tf != 0 turns the counts into TRUE/FALSE, so rowSums(tf != 0) counts, for each term, the number of documents that contain it (its document frequency) rather than its total number of occurrences. Because every term in the matrix appears in at least one document, this denominator can never be zero.
( idf <- log( ncol(tf) / ( rowSums(tf != 0) ) ) )
## data fun science cool document learning machine
## 0.6931472 1.7917595 0.6931472 1.7917595 1.0986123 1.0986123 1.0986123
## work darkness friend hello old documents good
## 1.7917595 1.7917595 1.7917595 1.0986123 1.7917595 1.7917595 1.7917595
## notes saving word
## 1.7917595 1.7917595 1.7917595
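As a quick check of the formula, "data" appears in 3 of our 6 documents, so its IDF should be log(6/3); the sketch below just confirms this against the vector we computed:
# "data" occurs in 3 of the 6 documents, so its IDF is log(6/3)
log(6 / 3)
idf["data"]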
We now have the inverse document frequency for each of the words in our term-frequency matrix. We can place these values on the diagonal of a matrix. This is necessary to compute the full weight of each word and obtain the complete tf-idf metric, since we will use a cross product with this diagonal matrix to scale every term count by its IDF.
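To see why a diagonal matrix does the job, here is a tiny toy sketch (the numbers are made up, not from our corpus): multiplying a matrix by diag(w) on the right rescales each column j by w[j], which is exactly the per-term weighting we need.
m <- matrix(c(1, 0, 2, 1), nrow = 2)  # toy matrix: rows play the role of documents, columns of terms
w <- c(0.5, 2)                        # toy per-term weights
m %*% diag(w)                         # column 1 is scaled by 0.5, column 2 by 2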
( idf <- diag(idf) )
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.6931472 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.0000000 1.791759 0.0000000 0.000000 0.000000 0.000000 0.000000
## [3,] 0.0000000 0.000000 0.6931472 0.000000 0.000000 0.000000 0.000000
## [4,] 0.0000000 0.000000 0.0000000 1.791759 0.000000 0.000000 0.000000
## [5,] 0.0000000 0.000000 0.0000000 0.000000 1.098612 0.000000 0.000000
## [6,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 1.098612 0.000000
## [7,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 1.098612
## [8,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [9,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [10,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [11,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [12,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [13,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [14,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [15,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [16,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [17,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [7,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [8,] 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [9,] 0.000000 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000
## [10,] 0.000000 0.000000 1.791759 0.000000 0.000000 0.000000 0.000000
## [11,] 0.000000 0.000000 0.000000 1.098612 0.000000 0.000000 0.000000
## [12,] 0.000000 0.000000 0.000000 0.000000 1.791759 0.000000 0.000000
## [13,] 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759 0.000000
## [14,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759
## [15,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [16,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [17,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [,15] [,16] [,17]
## [1,] 0.000000 0.000000 0.000000
## [2,] 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000
## [6,] 0.000000 0.000000 0.000000
## [7,] 0.000000 0.000000 0.000000
## [8,] 0.000000 0.000000 0.000000
## [9,] 0.000000 0.000000 0.000000
## [10,] 0.000000 0.000000 0.000000
## [11,] 0.000000 0.000000 0.000000
## [12,] 0.000000 0.000000 0.000000
## [13,] 0.000000 0.000000 0.000000
## [14,] 0.000000 0.000000 0.000000
## [15,] 1.791759 0.000000 0.000000
## [16,] 0.000000 1.791759 0.000000
## [17,] 0.000000 0.000000 1.791759
We compute the cross product. crossprod(tf, idf) is equivalent to t(tf) %*% idf, so the result has one row per document and one column per term, with each entry equal to a term count scaled by that term's IDF.
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)
tf_idf
##
## Docs data fun science cool document learning machine
## 1 0.6931472 1.791759 0.6931472 0.000000 0.000000 0.000000 0.000000
## 2 0.0000000 0.000000 0.0000000 1.791759 1.098612 1.098612 1.098612
## 3 0.0000000 0.000000 0.0000000 0.000000 1.098612 0.000000 0.000000
## 4 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## 5 0.6931472 0.000000 0.6931472 0.000000 0.000000 0.000000 0.000000
## 6 1.3862944 0.000000 0.6931472 0.000000 0.000000 1.098612 1.098612
##
## Docs work darkness friend hello old documents good
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 4 0.000000 1.791759 1.791759 1.098612 1.791759 0.000000 0.000000
## 5 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759 1.791759
## 6 0.000000 0.000000 0.000000 1.098612 0.000000 0.000000 0.000000
##
## Docs notes saving word
## 1 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000
## 4 0.000000 0.000000 0.000000
## 5 1.791759 1.791759 1.791759
## 6 0.000000 0.000000 0.000000
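As an aside, the diagonal matrix is not the only way to get here. The sketch below recomputes the IDF values as a plain vector (idf_vec is a name introduced only for this check) and scales the rows of tf element-wise; transposing gives the same document-by-term matrix.
# R recycles idf_vec down the columns, so row i of tf is multiplied by idf_vec[i]
idf_vec <- log(ncol(tf) / rowSums(tf != 0))
all.equal(unname(t(tf * idf_vec)), unname(tf_idf))  # should be TRUE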
We still need to normalize our results. To normalize, we take each document vector and divide it by its norm (length). This is required to eliminate any bias arising from document length: a longer document has a better chance of containing a given word multiple times, which would make that word look important even if it is not particularly distinctive within the corpus.
tf_idf / sqrt( rowSums( tf_idf^2 ) )
##
## Docs data fun science cool document learning machine
## 1 0.3393824 0.8772908 0.3393824 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.0000000 0.0000000 0.0000000 0.5654278 0.3466905 0.3466905 0.3466905
## 3 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 5 0.1680491 0.0000000 0.1680491 0.0000000 0.0000000 0.0000000 0.0000000
## 6 0.5648654 0.0000000 0.2824327 0.0000000 0.0000000 0.4476453 0.4476453
##
## Docs work darkness friend hello old documents good
## 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.5654278 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 4 0.0000000 0.5442545 0.5442545 0.3337081 0.5442545 0.0000000 0.0000000
## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.4344005 0.4344005
## 6 0.0000000 0.0000000 0.0000000 0.4476453 0.0000000 0.0000000 0.0000000
##
## Docs notes saving word
## 1 0.0000000 0.0000000 0.0000000
## 2 0.0000000 0.0000000 0.0000000
## 3 0.0000000 0.0000000 0.0000000
## 4 0.0000000 0.0000000 0.0000000
## 5 0.4344005 0.4344005 0.4344005
## 6 0.0000000 0.0000000 0.0000000
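A quick way to confirm the normalization did what we wanted (this sketch stores the result under a new name, tf_idf_norm, which the code above did not do): every document row should now have unit length.
# Keep the normalized matrix and check that each document vector has norm 1
tf_idf_norm <- tf_idf / sqrt(rowSums(tf_idf^2))
rowSums(tf_idf_norm^2)  # each entry should be (numerically) 1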
How does it compare to our original corpus?
documents
## [1] "Data science is fun"
## [2] "machine learning is cool. I should document more of my work."
## [3] "this is a Document"
## [4] "hello darkness my old friend"
## [5] "I am good at saving data science notes as word documents"
## [6] "the science data machine learning hello data"
The word "data" appears in 3 of our documents, and the scores assigned to it are all under 0.6. If we compare this to the word "fun", which appears only once, we can see that "fun" has a much higher score. This is an example of the inverse relationship in tf-idf.
With tf-idf, a user can perform k-means clustering to gain even further insight. In practice, a corpus is usually a folder in your working directory, and it may contain as many documents as you need. From reading Stack Overflow threads, plain-text (.txt) files seem to be the preferred format for the documents.
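As a rough sketch of that idea on our toy data (the seed and the choice of two clusters are arbitrary and purely for illustration; with a real corpus you would first build the Corpus from a folder, e.g. Corpus(DirSource("path/to/folder")), and then repeat the steps above):
set.seed(42)                                       # arbitrary seed, for reproducibility only
tfidf_normed <- tf_idf / sqrt(rowSums(tf_idf^2))   # normalized tf-idf, as computed earlier
km <- kmeans(tfidf_normed, centers = 2)            # 2 clusters chosen only for illustration
km$cluster                                         # cluster label for each of the 6 documents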