This is the continuation of the TF-IDF presentation.
We previously examined the TF-IDF algorithm and worked through a basic example to highlight the concept. This part of the presentation focuses on a more practical application of TF-IDF in R. I modified a fairly intuitive example I found, which is cited in the PowerPoint.
We need to load the tm and dplyr libraries (installing them first with install.packages() if they are not already available).
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.4.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let's create a corpus that consists of mini documents:
documents <- c("Data science is fun",
               "machine learning is cool. I should document more of my work.",
               "this is a Document",
               "hello darkness my old friend",
               "I am good at saving data science notes as word documents",
               "the science data machine learning hello data")
We need to calculate the term frequency. The tm library lets us form a matrix where each row corresponds to a word in the corpus and each column corresponds to a document. Remember we made 6 mini documents total, so there should be 6 columns.
Each entry in our matrix is the number of times a word appears in a document. For example, the word "data" appears once in document 1, once in document 5, and twice in document 6.
corpus <- Corpus( VectorSource(documents) )
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(corpus, control = control_list)
# print
( tf <- as.matrix(tdm) )
## Docs
## Terms 1 2 3 4 5 6
## data 1 0 0 0 1 2
## fun 1 0 0 0 0 0
## science 1 0 0 0 1 1
## cool 0 1 0 0 0 0
## document 0 1 1 0 0 0
## learning 0 1 0 0 0 1
## machine 0 1 0 0 0 1
## work 0 1 0 0 0 0
## darkness 0 0 0 1 0 0
## friend 0 0 0 1 0 0
## hello 0 0 0 1 0 1
## old 0 0 0 1 0 0
## documents 0 0 0 0 1 0
## good 0 0 0 0 1 0
## notes 0 0 0 0 1 0
## saving 0 0 0 0 1 0
## word 0 0 0 0 1 0
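As a quick sanity check of the counts described above, we can index the matrix directly (this sketch only uses the tf object we just created):
# Row of the term-frequency matrix for "data": once in documents 1 and 5,
# twice in document 6, and zero everywhere else
tf["data", ]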
Why do words such as "is" not appear in our matrix? And weren't there uppercase letters? Notice that I set tolower = TRUE and stopwords = TRUE. These options make every word lower case and discard common stop words such as "the" or "is." removePunctuation = TRUE strips out any commas or periods.
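For the curious, tm ships with a built-in English stop word list, which is what stopwords = TRUE effectively filters against for English text; a small sketch to peek at it:
# First few entries of tm's English stop word list; "is" is on it, which is
# why it never appears as a row in our term-document matrix
head(stopwords("english"), 10)
"is" %in% stopwords("english")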
We need to compute the inverse document frequency (recall our example with the word dog). The expression tf != 0 turns the counts into TRUE/FALSE, so rowSums(tf != 0) counts, for each term, the number of documents that contain it (its document frequency) rather than its total number of occurrences. Because every term in the matrix appears in at least one document, this denominator can never be zero.
( idf <- log( ncol(tf) / ( rowSums(tf != 0) ) ) )
## data fun science cool document learning machine
## 0.6931472 1.7917595 0.6931472 1.7917595 1.0986123 1.0986123 1.0986123
## work darkness friend hello old documents good
## 1.7917595 1.7917595 1.7917595 1.0986123 1.7917595 1.7917595 1.7917595
## notes saving word
## 1.7917595 1.7917595 1.7917595
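As a quick check of the formula, "data" appears in 3 of our 6 documents, so its IDF should be log(6/3); the sketch below just confirms this against the vector we computed:
# "data" occurs in 3 of the 6 documents, so its IDF is log(6/3)
log(6 / 3)
idf["data"]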
We now have the inverse document frequency for each of the words in our term-frequency matrix. We can place these values on the diagonal of a matrix. This is necessary to compute the full weight of each word and obtain the complete tf-idf metric, since we will use a cross product with this diagonal matrix to scale every term count by its IDF.
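To see why a diagonal matrix does the job, here is a tiny toy sketch (the numbers are made up, not from our corpus): multiplying a matrix by diag(w) on the right rescales each column j by w[j], which is exactly the per-term weighting we need.
m <- matrix(c(1, 0, 2, 1), nrow = 2)  # toy matrix: rows play the role of documents, columns of terms
w <- c(0.5, 2)                        # toy per-term weights
m %*% diag(w)                         # column 1 is scaled by 0.5, column 2 by 2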
( idf <- diag(idf) )
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.6931472 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.0000000 1.791759 0.0000000 0.000000 0.000000 0.000000 0.000000
## [3,] 0.0000000 0.000000 0.6931472 0.000000 0.000000 0.000000 0.000000
## [4,] 0.0000000 0.000000 0.0000000 1.791759 0.000000 0.000000 0.000000
## [5,] 0.0000000 0.000000 0.0000000 0.000000 1.098612 0.000000 0.000000
## [6,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 1.098612 0.000000
## [7,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 1.098612
## [8,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [9,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [10,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [11,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [12,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [13,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [14,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [15,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [16,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [17,] 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [7,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [8,] 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [9,] 0.000000 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000
## [10,] 0.000000 0.000000 1.791759 0.000000 0.000000 0.000000 0.000000
## [11,] 0.000000 0.000000 0.000000 1.098612 0.000000 0.000000 0.000000
## [12,] 0.000000 0.000000 0.000000 0.000000 1.791759 0.000000 0.000000
## [13,] 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759 0.000000
## [14,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759
## [15,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [16,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [17,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [,15] [,16] [,17]
## [1,] 0.000000 0.000000 0.000000
## [2,] 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000
## [6,] 0.000000 0.000000 0.000000
## [7,] 0.000000 0.000000 0.000000
## [8,] 0.000000 0.000000 0.000000
## [9,] 0.000000 0.000000 0.000000
## [10,] 0.000000 0.000000 0.000000
## [11,] 0.000000 0.000000 0.000000
## [12,] 0.000000 0.000000 0.000000
## [13,] 0.000000 0.000000 0.000000
## [14,] 0.000000 0.000000 0.000000
## [15,] 1.791759 0.000000 0.000000
## [16,] 0.000000 1.791759 0.000000
## [17,] 0.000000 0.000000 1.791759
We compute the cross product. crossprod(tf, idf) is equivalent to t(tf) %*% idf, so the result has one row per document and one column per term, with each entry equal to a term count scaled by that term's IDF.
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)
tf_idf
##
## Docs data fun science cool document learning machine
## 1 0.6931472 1.791759 0.6931472 0.000000 0.000000 0.000000 0.000000
## 2 0.0000000 0.000000 0.0000000 1.791759 1.098612 1.098612 1.098612
## 3 0.0000000 0.000000 0.0000000 0.000000 1.098612 0.000000 0.000000
## 4 0.0000000 0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## 5 0.6931472 0.000000 0.6931472 0.000000 0.000000 0.000000 0.000000
## 6 1.3862944 0.000000 0.6931472 0.000000 0.000000 1.098612 1.098612
##
## Docs work darkness friend hello old documents good
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 1.791759 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 4 0.000000 1.791759 1.791759 1.098612 1.791759 0.000000 0.000000
## 5 0.000000 0.000000 0.000000 0.000000 0.000000 1.791759 1.791759
## 6 0.000000 0.000000 0.000000 1.098612 0.000000 0.000000 0.000000
##
## Docs notes saving word
## 1 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000
## 4 0.000000 0.000000 0.000000
## 5 1.791759 1.791759 1.791759
## 6 0.000000 0.000000 0.000000
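As an aside, the diagonal matrix is not the only way to get here. The sketch below recomputes the IDF values as a plain vector (idf_vec is a name introduced only for this check) and scales the rows of tf element-wise; transposing gives the same document-by-term matrix.
# R recycles idf_vec down the columns, so row i of tf is multiplied by idf_vec[i]
idf_vec <- log(ncol(tf) / rowSums(tf != 0))
all.equal(unname(t(tf * idf_vec)), unname(tf_idf))  # should be TRUE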
We still need to normalize our results. To normalize, we take each document vector and divide it by its norm (length). This is required to eliminate any bias arising from document length: a longer document has a better chance of containing a given word multiple times, which would make that word look important even if it is not particularly distinctive within the corpus.
tf_idf / sqrt( rowSums( tf_idf^2 ) )
##
## Docs data fun science cool document learning machine
## 1 0.3393824 0.8772908 0.3393824 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.0000000 0.0000000 0.0000000 0.5654278 0.3466905 0.3466905 0.3466905
## 3 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 5 0.1680491 0.0000000 0.1680491 0.0000000 0.0000000 0.0000000 0.0000000
## 6 0.5648654 0.0000000 0.2824327 0.0000000 0.0000000 0.4476453 0.4476453
##
## Docs work darkness friend hello old documents good
## 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.5654278 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 4 0.0000000 0.5442545 0.5442545 0.3337081 0.5442545 0.0000000 0.0000000
## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.4344005 0.4344005
## 6 0.0000000 0.0000000 0.0000000 0.4476453 0.0000000 0.0000000 0.0000000
##
## Docs notes saving word
## 1 0.0000000 0.0000000 0.0000000
## 2 0.0000000 0.0000000 0.0000000
## 3 0.0000000 0.0000000 0.0000000
## 4 0.0000000 0.0000000 0.0000000
## 5 0.4344005 0.4344005 0.4344005
## 6 0.0000000 0.0000000 0.0000000
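A quick way to confirm the normalization did what we wanted (this sketch stores the result under a new name, tf_idf_norm, which the code above did not do): every document row should now have unit length.
# Keep the normalized matrix and check that each document vector has norm 1
tf_idf_norm <- tf_idf / sqrt(rowSums(tf_idf^2))
rowSums(tf_idf_norm^2)  # each entry should be (numerically) 1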
How does it compare to our original corpus?
documents
## [1] "Data science is fun"
## [2] "machine learning is cool. I should document more of my work."
## [3] "this is a Document"
## [4] "hello darkness my old friend"
## [5] "I am good at saving data science notes as word documents"
## [6] "the science data machine learning hello data"
The word "data" appears in 3 of our documents, and the scores assigned to it are all under 0.6. If we compare this to the word "fun", which appears only once, we can see that "fun" has a much higher score. This is an example of the inverse relationship in tf-idf.
With tf-idf, a user can perform k-means clustering to gain even further insight. In practice, a corpus is usually a folder in your working directory, and it may contain as many documents as you need. From reading Stack Overflow threads, plain-text (.txt) files seem to be the preferred format for the documents.
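As a rough sketch of that idea on our toy data (the seed and the choice of two clusters are arbitrary and purely for illustration; with a real corpus you would first build the Corpus from a folder, e.g. Corpus(DirSource("path/to/folder")), and then repeat the steps above):
set.seed(42)                                       # arbitrary seed, for reproducibility only
tfidf_normed <- tf_idf / sqrt(rowSums(tf_idf^2))   # normalized tf-idf, as computed earlier
km <- kmeans(tfidf_normed, centers = 2)            # 2 clusters chosen only for illustration
km$cluster                                         # cluster label for each of the 6 documents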