Text Mining (Text Analytics) refers to a set of analytic tools used to learn from collections of text data, such as social media, books, newspapers, and emails.
This paper explores text mining with R. Messages from Twitter, news and blogs are acquired and analyzed using R's tm package. The final objective is to generate n-gram probability models for the purpose of text prediction.
Three data sets (news, blogs, twitter) were provided. Command line operations (e.g., cat en_US.twitter.txt | wc -l) were used to quickly explore the basic dimensions of the data sets. The twitter data set contains 2,360,148 lines and 30,374,206 words. The news data set contains 1,010,242 lines and 34,372,720 words. The blogs data set contains 899,288 lines and 37,334,690 words.
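The same counts can also be approximated directly in R. The snippet below is a minimal sketch, assuming the raw en_US.twitter.txt file sits in the working directory; skipNul = TRUE guards against embedded nul characters, which readLines() would otherwise warn about.
# Sketch: line and word counts for one raw file, computed in R.
twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(twitter_lines)                                  # number of lines
sum(sapply(strsplit(twitter_lines, "\\s+"), length))   # approximate word count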
library(tm) # Framework for text mining.
library(qdap) # Quantitative discourse analysis of transcripts.
library(qdapDictionaries)
library(dplyr) # Data wrangling, pipe operator %>%().
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(scales) # Include commas in numbers.
library(wordcloud)
For illustration purposes, this report is limited to Twitter data. The basic methodology applied in the following section could be easily applied to blogs and news if necessary.
my_corpus <- file.path(".", "corpus", "txt")
length(dir(my_corpus))
## [1] 1
dir(my_corpus)
## [1] "twitter_trial.txt"
After loading the tm package (Feinerer and Hornik, 2015), we are ready to read the files from the directory as the source of the corpus, using DirSource(). The source object is passed on to Corpus(), which loads the documents. We save the resulting collection of documents in memory, in a variable called docs.
docs <- Corpus(DirSource(my_corpus))
We generally need to perform some pre-processing of the text data to prepare for the text analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming and identifying synonyms. The basic transforms are all available within tm.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
We start with some custom transformations we may want to perform. For example, we might want to replace “/”, sometimes used to separate alternative words, with a space. This avoids the two words being run into one string of characters by the later transformations. We might also replace “@” and “|” with a space, for the same reason.
To create a custom transformation we use content_transformer() to build a function that achieves the transformation, and then apply it to the corpus using tm_map().
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
We then apply the custom transformation, followed by several of the built-in functions listed above:
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
length(stopwords("english"))
## [1] 174
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix. We use DocumentTermMatrix() to create the matrix:
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 2732)>>
## Non-/sparse entries: 2732/0
## Sparsity : 0%
## Maximal term length: 28
## Weighting : term frequency (tf)
inspect(dtm[1,1:10])
## <<DocumentTermMatrix (documents: 1, terms: 10)>>
## Non-/sparse entries: 10/0
## Sparsity : 0%
## Maximal term length: 7
## Weighting : term frequency (tf)
##
## Terms
## Docs aaiight abil abl abound absolut abus academi accent
## twitter_trial.txt 1 1 2 1 3 1 2 1
## Terms
## Docs accentu accept
## twitter_trial.txt 1 2
class(dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
dim(dtm)
## [1] 1 2732
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 2732
#------order --------
ord <- order(freq)
freq[head(ord)]
## aaiight abil abound abus accent accentu
## 1 1 1 1 1 1
freq[tail(ord)]
## get one day love just like
## 52 52 53 55 62 67
head(table(freq), 15)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1761 407 164 106 63 36 35 26 19 21 15 6 5 9 3
tail(table(freq), 15)
## freq
## 32 34 35 36 37 38 44 47 49 50 52 53 55 62 67
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
We can write the document term matrix to a CSV file, for example for loading the data into other software. To do so we first convert the data structure into a simple matrix:
m <- as.matrix(dtm)
dim(m)
## [1] 1 2732
write.csv(m, file="dtm.csv")
We are often not interested in infrequent terms in our documents. Such “sparse” terms can be removed from the document term matrix quite easily using removeSparseTerms():
dtms <- removeSparseTerms(dtm, 0.1)
freq <- colSums(as.matrix(dtms))
table(freq)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1761 407 164 106 63 36 35 26 19 21 15 6 5 9 3
## 16 17 18 19 20 21 23 24 26 28 30 31 32 34 35
## 5 7 5 3 2 4 3 3 5 1 1 1 1 1 1
## 36 37 38 44 47 49 50 52 53 55 62 67
## 1 1 1 1 1 1 1 2 1 1 1 1
#findFreqTerms(dtm, lowfreq= 2)
#findFreqTerms(dtm, lowfreq=1)
findAssocs(dtm, "data", corlimit=0.50)
## $data
## numeric(0)
We can generate the frequency count of all words in a corpus:
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 14)
## like just love day get one can know dont thank will good
## 67 62 55 53 52 52 50 49 47 44 38 37
## time new
## 36 35
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## like like 67
## just just 62
## love love 55
## day day 53
## get get 52
## one one 52
We can then plot the frequency of those words that occur more than 10 times in the corpus:
subset(wf, freq>10) %>%
ggplot(aes(word, freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
A word cloud is an effective alternative for providing a quick visual overview of the frequency of words in a corpus.
set.seed(135)
wordcloud(names(freq), freq, min.freq=10, colors=brewer.pal(6, "Dark2"))
We can obtain simple summaries of a list of words; to illustrate, we use the terms from our document term matrix dtm. We first extract the shorter terms into one long word list: we convert dtm into a matrix, extract the column names (the terms), and retain those shorter than 20 characters.
words <- dtm %>%
as.matrix %>%
colnames %>%
(function(x) x[nchar(x) < 20])
We can then summarise the word list. Notice, in particular, the use of dist_tab() from qdap to generate frequencies and percentages.
dist_tab(nchar(words))
## interval freq cum.freq percent cum.percent
## 1 3 281 281 10.32 10.32
## 2 4 605 886 22.22 32.54
## 3 5 605 1491 22.22 54.76
## 4 6 477 1968 17.52 72.27
## 5 7 344 2312 12.63 84.91
## 6 8 179 2491 6.57 91.48
## 7 9 106 2597 3.89 95.37
## 8 10 58 2655 2.13 97.50
## 9 11 23 2678 0.84 98.35
## 10 12 15 2693 0.55 98.90
## 11 13 10 2703 0.37 99.27
## 12 14 6 2709 0.22 99.49
## 13 15 5 2714 0.18 99.67
## 14 16 2 2716 0.07 99.74
## 15 17 3 2719 0.11 99.85
## 16 18 3 2722 0.11 99.96
## 17 19 1 2723 0.04 100.00
A simple plot is then effective in showing the distribution of the word lengths. Here we create a single column data frame that is passed on to ggplot() to generate a histogram, with a vertical line to show the mean length of words.
data.frame(nletters=nchar(words)) %>%
ggplot(aes(x=nletters)) +
geom_histogram(binwidth=1) +
geom_vline(xintercept=mean(nchar(words)),
colour="green", size=1, alpha=.5) +
labs(x="Number of Letters", y="Number of Words")
The basic methodology for the n-gram text prediction is as follows:
Match an n-word string against the corresponding (n+1)-gram entries in the n-gram frequency table. For example, a two-word string should be matched against its corresponding entries in a tri-gram table.
If there is a match, propose the highest-frequency words to the user. Continuing the previous example, the proposed word is the last word of each matching tri-gram, as sketched below.
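To make these two steps concrete, the following is a minimal sketch rather than the final model: it builds a tri-gram frequency table from the cleaned corpus using base R, and the helpers build_ngrams() and predict_next() are illustrative names introduced here, not functions from tm or qdap. Because docs has been lower-cased, stopword-filtered and stemmed, any prefix looked up must be in the same form; for the actual prediction model the tables would normally be built from text that keeps stop words and inflections, since these are exactly the words a user is likely to type next.
# Sketch: n-gram frequency table and next-word lookup from the cleaned corpus.
text <- unlist(lapply(docs, as.character))    # cleaned documents as plain text
tokens <- unlist(strsplit(text, "\\s+"))
tokens <- tokens[tokens != ""]
build_ngrams <- function(tokens, n) {
  grams <- sapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  sort(table(grams), decreasing = TRUE)
}
trigrams <- build_ngrams(tokens, 3)
# Match an (n-1)-word prefix against the n-gram table and return the last
# word of the highest-frequency matching entries.
predict_next <- function(prefix, ngram_table, top = 3) {
  key <- paste0(prefix, " ")
  hits <- ngram_table[substr(names(ngram_table), 1, nchar(key)) == key]
  if (length(hits) == 0) return(character(0))
  head(sapply(strsplit(names(hits), " "), tail, 1), top)
}
predict_next("happi birthday", trigrams)      # example stemmed two-word prefix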