Problem Setup

Text mining (text analytics) is a set of analytic techniques used to learn from collections of text data, such as social media, books, newspapers, and emails.

This paper explores text mining with R. Messages from Twitter, news articles, and blogs are acquired and analyzed using R's tm package. The final objective is to generate n-gram probability models for text prediction.

Three data sets (news, blogs, Twitter) were provided. Command-line operations (e.g., cat en_US.twitter.txt | wc -l) were used to quickly explore the basic dimensions of the data sets. The Twitter data set contains 2,360,148 lines and 30,374,206 words; the news data set contains 1,010,242 lines and 34,372,720 words; the blogs data set contains 899,288 lines and 37,334,690 words.
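
The same counts can be reproduced in R. A minimal sketch, assuming the raw files sit in the working directory:

count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Approximate word count: split each line on runs of whitespace.
  words <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
  c(lines = length(lines), words = words)
}
count_file("en_US.twitter.txt")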

Required Libraries
library(tm)               # Framework for text mining.
library(qdap)             # Quantitative discourse analysis of transcripts.
library(qdapDictionaries) # Word lists used by qdap.
library(dplyr)            # Data wrangling, pipe operator %>%.
library(RColorBrewer)     # Generate palettes of colours for plots.
library(ggplot2)          # Plot word frequencies.
library(scales)           # Include commas in numbers.
library(wordcloud)        # Plot word clouds.

Getting Started: The Corpus

For illustration purposes, this report is limited to the Twitter data. The methodology applied in the following sections can be applied equally to the blogs and news data if necessary.

my_corpus <- file.path(".", "corpus", "txt")
length(dir(my_corpus))
## [1] 1
dir(my_corpus)
## [1] "twitter_trial.txt"

After loading the tm (Feinerer and Hornik, 2015) package, we point DirSource() at the directory holding the files that make up the corpus. The source object is passed to Corpus(), which loads the documents. We save the resulting collection of documents in memory, stored in a variable called docs.

docs <- Corpus(DirSource(my_corpus))

Transformation

We generally need to perform some pre-processing of the text data to prepare it for analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transforms are all available within tm.

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

We start with some custom transformations we may want to apply. For example, we might want to replace "/", sometimes used to separate alternative words, with a space. This avoids the two words being run together into one string of characters by the later transformations. We might also replace "@" and "|" with a space, for the same reason.

To create a custom transformation we make use of content_transformer() to create a function that achieves the transformation, and then apply it to the corpus using tm_map().

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

We now apply the custom transformation, together with several of tm's built-in transformations:

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
length(stopwords("english"))
## [1] 174
stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
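
One example transformation mentioned earlier but not applied above is synonym replacement. A minimal sketch using the same content_transformer() mechanism (the word pair is a hypothetical example, and in practice this would run before stemming):

replaceWord <- content_transformer(function(x, from, to)
  gsub(paste0("\\b", from, "\\b"), to, x))
docs_syn <- tm_map(docs, replaceWord, "u", "you")  # hypothetical synonym pair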

Creating a Document Term Matrix

A document term matrix is simply a matrix with documents as the rows, terms as the columns, and the frequency of each term in each document as the cells of the matrix. We use DocumentTermMatrix() to create the matrix:

dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 2732)>>
## Non-/sparse entries: 2732/0
## Sparsity           : 0%
## Maximal term length: 28
## Weighting          : term frequency (tf)
inspect(dtm[1,1:10])
## <<DocumentTermMatrix (documents: 1, terms: 10)>>
## Non-/sparse entries: 10/0
## Sparsity           : 0%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## 
##                    Terms
## Docs                aaiight abil abl abound absolut abus academi accent
##   twitter_trial.txt       1    1   2      1       3    1       2      1
##                    Terms
## Docs                accentu accept
##   twitter_trial.txt       1      2
class(dtm)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
dim(dtm)
## [1]    1 2732

Exploring the Document Term Matrix

tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 2732
# ---- order terms by frequency ----
ord <- order(freq)
freq[head(ord)]
## aaiight    abil  abound    abus  accent accentu 
##       1       1       1       1       1       1
freq[tail(ord)]
##  get  one  day love just like 
##   52   52   53   55   62   67

Distribution of Term Frequencies

head(table(freq), 15)
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1761  407  164  106   63   36   35   26   19   21   15    6    5    9    3
tail(table(freq), 15)
## freq
## 32 34 35 36 37 38 44 47 49 50 52 53 55 62 67 
##  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1

Conversion to Matrix and Save to CSV

We can convert the document term matrix to a simple matrix for writing to a CSV file, for example to load the data into other software. To write to CSV we first convert the data structure into a simple matrix:

m <- as.matrix(dtm)
dim(m)
## [1]    1 2732
write.csv(m, file="dtm.csv")

Removing Sparse Terms

We are often not interested in infrequent terms in our documents. Such "sparse" terms can be removed from the document term matrix quite easily using removeSparseTerms(). Because our single-document matrix has no sparsity, the call below removes nothing:

dtms <- removeSparseTerms(dtm, 0.1)
freq <- colSums(as.matrix(dtms))
table(freq)
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1761  407  164  106   63   36   35   26   19   21   15    6    5    9    3 
##   16   17   18   19   20   21   23   24   26   28   30   31   32   34   35 
##    5    7    5    3    2    4    3    3    5    1    1    1    1    1    1 
##   36   37   38   44   47   49   50   52   53   55   62   67 
##    1    1    1    1    1    1    1    2    1    1    1    1

Identifying Frequent Items and Associations

findFreqTerms() lists the terms that occur at least lowfreq times, while findAssocs() reports terms whose occurrence is correlated with a given term across documents.

#findFreqTerms(dtm, lowfreq=2)
#findFreqTerms(dtm, lowfreq=1)
findAssocs(dtm, "data", corlimit=0.50)
## $data
## numeric(0)

With a single-document corpus there is no across-document variation from which to compute correlations, so findAssocs() returns numeric(0).
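
With a multi-document corpus, findAssocs() can report terms that co-occur with a query term. A minimal sketch that rebuilds the corpus with one document per tweet (the query term and correlation limit are illustrative assumptions):

tweets <- readLines(file.path(my_corpus, "twitter_trial.txt"),
                    encoding = "UTF-8", skipNul = TRUE)
docs_by_tweet <- Corpus(VectorSource(tweets))  # one document per tweet
dtm_by_tweet  <- DocumentTermMatrix(docs_by_tweet)
findAssocs(dtm_by_tweet, "love", corlimit = 0.2)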

Plotting Word Frequencies

We can generate the frequency count of all words in a corpus:

  • In decreasing order
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 14)
##  like  just  love   day   get   one   can  know  dont thank  will  good 
##    67    62    55    53    52    52    50    49    47    44    38    37 
##  time   new 
##    36    35
  • In data frame format
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
##      word freq
## like like   67
## just just   62
## love love   55
## day   day   53
## get   get   52
## one   one   52

We can then plot the frequency of those words that occur more than 10 times in the corpus:

subset(wf, freq>10)                                                  %>%
  ggplot(aes(word, freq))                                              +
  geom_bar(stat="identity")                                            +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Word Cloud

We can generate a word cloud as an effective alternative, providing a quick visual overview of the frequency of words in a corpus.

set.seed(135)
wordcloud(names(freq), freq, min.freq=10, colors=brewer.pal(6, "Dark2"))

Quantitative Analysis of Text

We can obtain simple summaries of a list of words, and to do so we illustrate with the terms from our document term matrix dtm. We first extract the shorter terms from each of our documents into one long word list: we convert dtm into a matrix, extract the column names (the terms), and retain those shorter than 20 characters.

words <- dtm                                                          %>%
  as.matrix                                                           %>%
  colnames                                                            %>%
  (function(x) x[nchar(x) < 20])

We can then summarise the word list. Notice, in particular, the use of dist_tab() from qdap to generate frequencies and percentages.

dist_tab(nchar(words))
##    interval freq cum.freq percent cum.percent
## 1         3  281      281   10.32       10.32
## 2         4  605      886   22.22       32.54
## 3         5  605     1491   22.22       54.76
## 4         6  477     1968   17.52       72.27
## 5         7  344     2312   12.63       84.91
## 6         8  179     2491    6.57       91.48
## 7         9  106     2597    3.89       95.37
## 8        10   58     2655    2.13       97.50
## 9        11   23     2678    0.84       98.35
## 10       12   15     2693    0.55       98.90
## 11       13   10     2703    0.37       99.27
## 12       14    6     2709    0.22       99.49
## 13       15    5     2714    0.18       99.67
## 14       16    2     2716    0.07       99.74
## 15       17    3     2719    0.11       99.85
## 16       18    3     2722    0.11       99.96
## 17       19    1     2723    0.04      100.00

Word Length Counts

A simple plot is then effective in showing the distribution of the word lengths. Here we create a single column data frame that is passed on to ggplot() to generate a histogram, with a vertical line to show the mean length of words.

data.frame(nletters=nchar(words))                                     %>%
  ggplot(aes(x=nletters))                                              +
  geom_histogram(binwidth=1)                                           +
  geom_vline(xintercept=mean(nchar(words)),
             colour="green", size=1, alpha=.5)                         +
  labs(x="Number of Letters", y="Number of Words")

Application Design

The basic methodology for the n-gram text prediction is as follows:

  • Clean and tokenize the corpus using the transformations described above.
  • Count the frequencies of unigrams, bigrams, and trigrams.
  • Estimate the probability of a candidate next word as the relative frequency of the n-gram formed by the preceding (n-1) words plus the candidate.
  • Back off to a lower-order model when the preceding words have not been observed.
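
A minimal sketch of a bigram version of this pipeline in base R (the tokenisation and back-off rule are simplified assumptions, and tweet boundaries are ignored):

tweets <- tolower(readLines(file.path(my_corpus, "twitter_trial.txt"),
                            encoding = "UTF-8", skipNul = TRUE))
tokens <- unlist(strsplit(gsub("[^a-z' ]", " ", tweets), "\\s+"))
tokens <- tokens[tokens != ""]

# Bigram counts: each token paired with its successor.
bigrams     <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Predict the next word as the most frequent bigram completion,
# backing off to the overall most frequent unigram.
predict_next <- function(word) {
  cand <- bigram_freq[grepl(paste0("^", word, " "), names(bigram_freq))]
  if (length(cand) > 0) sub(".* ", "", names(cand)[1])
  else names(sort(table(tokens), decreasing = TRUE))[1]
}
predict_next("love")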

References:

Feinerer, I. and Hornik, K. (2015). tm: Text Mining Package. R package. https://CRAN.R-project.org/package=tm