Text mining requires considerable preparation: assembling a list of words, eliminating common words (like “and” or “the”), and then compiling frequencies of the remaining words and of word sequences (like “Big Data”).
In this illustration, we take the text of Prof. Tom Davenport’s article on Competing on Analytics and create a graphic commonly known as a Word Cloud.
The raw text file for this assignment is on GitHub, called simply “Davenport.txt”.
The particular code uses the following packages, all of which are loaded before the code that appears below:
tm – the core text-mining framework (corpora, transformations, term document matrices)
ggplot2 – general-purpose graphics
SnowballC – to “stem” terms (group related word roots)
wordcloud – to draw word clouds
RColorBrewer – to enhance graphs with color
(The hierarchical clustering at the end uses hclust from base R, so no separate clustering package is required.)
library(tm)
library(ggplot2)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
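If any of these packages are not yet installed, a one-time install (shown commented out so it does not run on every knit) takes care of that:
# install.packages(c("tm", "ggplot2", "SnowballC", "wordcloud", "RColorBrewer"))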
The initial code chunk reads in the body of data using the readLines function, which reads the file as a character vector with one element per line of text. The data is then converted to a Corpus – a collection of text documents.
options(header=FALSE, stringsAsFactors = FALSE,FileEncoding="latin1")
text <- readLines("Davenport.txt")
corpus <- Corpus(VectorSource(text))
# for later random activities:
set.seed(1234) # for reproducible results
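As a quick sanity check (not strictly required), we can confirm how many lines were read and preview the first few documents in the corpus:
length(text) # number of lines read from Davenport.txt
inspect(corpus[1:3]) # preview the first three documents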
Next comes the phase of cleaning up the corpus. The following code chunk produces no output directly, but converts all text to lowercase, removes punctuation, and removes numerals.
#Clean-up
corpus <- tm_map(corpus, content_transformer(tolower)) # make all text lower case; content_transformer keeps the result a valid tm corpus
corpus <- tm_map(corpus, removePunctuation) #remove all punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
The next recommended step is to apply stopwords – specifying terms that should be treated as non-informative in the mining stages to follow. These may be common English words (articles, pronouns, etc.) or words that appear often in this particular corpus but carry no useful information here.
To see the default set of English stopwords, type stopwords("en") in the console. For this article about analytics, “big” is a word we may want to analyze, but it is one of the standard stopwords. Hence, we remove it from myStopwords.
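For example, the first part of that default list can be shown with:
head(stopwords("en"), 20) # the first 20 built-in English stopwords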
# apply standard English stopwords, and add "harrah" and "harvard"
special <- c("harrah","harvard")
myStopwords <- c(stopwords('english'),special)
# remove "big" from English stop words
myStopwords <- setdiff(myStopwords, c("big"))
# now remove stopwords from the corpus
cleanset <- tm_map(corpus, removeWords, myStopwords)
cleanset <- tm_map(cleanset, stripWhitespace) # purge extra white space
In any text, we often find plurals and other word forms that we’d prefer to treat as one word. “Stemming” refers to truncating words so that related terms are counted together. This is where package SnowballC does its work. After stemming, we create a term document matrix, omitting very short words. Lastly, we list frequent words – we may want to revise our stopwords list to remove repeated but unremarkable words.
#stemming to treat related terms alike
cleanset <-tm_map(cleanset,stemDocument)
#Build term document matrix
cleanset <- tm_map(cleanset, PlainTextDocument)
tdm <- TermDocumentMatrix(cleanset, control=list(wordLengths=c(3, Inf))) # ignore 1- and 2-letter words
# inspect frequent words
findFreqTerms(tdm, lowfreq=8)
## [1] "also" "analysis" "analysts" "analyt"
## [5] "analytical" "analytics" "best" "business"
## [9] "can" "capital" "companies" "company"
## [13] "competing" "competitor" "competitors" "customer"
## [17] "customers" "data" "every" "example"
## [21] "group" "know" "like" "must"
## [25] "new" "one" "organizations" "part"
## [29] "people" "process" "products" "quantitative"
## [33] "research" "supply" "tools" "use"
## [37] "way" "will" "years"
# NOTE: At this point, might choose to alter stop words list
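If we did decide to drop further words, the additions might look like the sketch below (not run here, so the results that follow still reflect the original stopword list); the extra words chosen – “also”, “can”, “will” – appear often in the list above but say little about the article’s content:
myStopwords <- c(myStopwords, "also", "can", "will")   # add unremarkable words
cleanset <- tm_map(cleanset, removeWords, myStopwords) # strip them from the corpus
tdm <- TermDocumentMatrix(cleanset, control=list(wordLengths=c(3, Inf))) # rebuild the matrix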
A simple list is helpful, but gives little insight into how often each word occurs. Let’s make a graph:
#Bar plot
termFrequency <- rowSums(as.matrix(tdm))
termFrequency <- subset(termFrequency, termFrequency>=12)
barplot(termFrequency, las=2) # las makes axis labels perpendicular
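Since ggplot2 is already loaded, the same frequencies can also be plotted with it; here is a minimal sketch (the data frame name freqDF is just a placeholder):
freqDF <- data.frame(term = names(termFrequency), freq = termFrequency)
ggplot(freqDF, aes(x = reorder(factor(term), freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +                  # horizontal bars keep the term labels readable
  labs(x = NULL, y = "Frequency")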
Again, we might want to go back and alter the stopwords list. Once we are satisfied with the list, we’re ready to make a word cloud, using package wordcloud.
m<- as.matrix(tdm)
wordFreq <- sort(rowSums(m), decreasing=TRUE)
grayLevels <- gray( (wordFreq+10) / (max(wordFreq)+10) ) # compute a gray shade for each word based on its frequency
# the command wordcloud is the main function:
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, random.order=F, colors=grayLevels)
The word cloud above includes any word that occurs 3 times or more (min.freq=3); this creates an overcrowded cloud that “spills over” the allocated space.
The next few commands show different ways to modify the size and appearance of a word cloud:
#Use same number of words, but re-scale the size
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=3, scale = c(2, 0.2), random.order=F, colors=grayLevels)
#limit the cloud to the 100 most common words
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=100,scale = c(2, 0.2), random.order=F, colors=grayLevels)
# increase the minimum frequency to 10 occurrences, further re-scale the size.
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=10, scale = c(3, 0.2), random.order=F, colors=grayLevels)
The package RColorBrewer adds a wide variety of color palettes. You can experiment with this endlessly!
# Add Color
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=100, scale = c(3, 0.2),random.order=F, colors=brewer.pal(6, "Dark2"))
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=100, scale = c(3, 0.2), random.order=F, colors=brewer.pal(9,"Reds"))
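To browse the available palettes before choosing one, RColorBrewer provides a display function:
display.brewer.all() # show every RColorBrewer palette; brewer.pal(n, name) then picks n colors from one of them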
We can explore associations in various ways. One simple way is a correlational approach: seeing which other terms tend to occur in the same lines (documents) as terms we specify. In the following code, the corlimit argument sets a lower bound on the correlations reported.
myterms <- c("data", "analytics","business")
findAssocs(tdm,myterms, corlimit=0.3)
## $data
## numeric(0)
##
## $analytics
## competitor competitors
## 0.43 0.33
##
## $business
## january page review
## 0.43 0.43 0.39
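No term reached the 0.3 threshold for “data” (hence the numeric(0)); lowering corlimit may surface weaker associations, for example:
findAssocs(tdm, "data", corlimit=0.2) # a looser threshold for this one term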
Finally, one further valuable step in text mining is to explore associations among words, that is, to find clusters of words that often occur together (like “big” and “data”).
Searching for such associations (a topic for next week, related to the prior week) is computationally demanding and operates on the term document matrix, which is very large:
In this example, tdm has 1679 rows and 824 columns.
For efficient processing, we want to reduce the dimensions of the matrix by purging sparse terms – terms that appear in very few of the documents:
lim <- 0.99 # keep only terms that appear in more than (1-lim) = 1% of the lines of text (documents)
tdmss <- removeSparseTerms(tdm, lim)
# inspect(tdmss) would show the reduced matrix, but produces more output than we probably want!
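A less verbose check is to look at the dimensions of the reduced matrix, or simply print the object for a short summary:
dim(tdmss) # rows = retained terms, columns = lines of text (documents)
tdmss      # printing gives a brief summary, including sparsity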
Just as we saw last week, hierarchical clustering can be helpful. We’ll apply the method and plot a simple dendrogram. NOTE: Be careful with this technique – if the number of items grows too large, the dendrogram quickly becomes unreadable.
d <- dist(t(tdmss), method="euclidean")
fit <- hclust(d=d, method="ward.D")
fit
##
## Call:
## hclust(d = d, method = "ward.D")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 824
plot(fit, hang=-1)
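To make the groupings easier to read, we can cut the tree into a chosen number of clusters and outline them on the dendrogram; the choice of 6 clusters below is arbitrary, purely for illustration:
groups <- cutree(fit, k=6)            # assign each object to one of 6 clusters
rect.hclust(fit, k=6, border="red")   # outline the 6 clusters on the dendrogram just plotted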