To mine text properly, it is essential to use appropriate tools. Below is a list of the packages I used to process the text-based data. To start, we install them and then load their libraries in RStudio.
require(stringi)
require(knitr)
require(tm)
require(RWeka)
require(SnowballC)
require(ggplot2)
require(wordcloud)
require(cluster)
require(fpc)
The data for this capstone come from a corpus called HC Corpora (http://www.corpora.heliohost.org/). The corpora available on that website are listed HERE; some of them contain foreign-language text. The dataset for this capstone is available HERE. It contains files for four locales (en_US, de_DE, ru_RU and fi_FI), each drawn from three sources: twitter, blogs and news. For this capstone, the English (en_US) dataset has been selected for analysis.
dataset.dir <- "./dataset/en_US"
dataset.file <- dir(dataset.dir)
dataset.path <- paste0(dataset.dir, "/", dataset.file)
dataset.size <- file.size(dataset.path)
blogs.data <- readLines(dataset.path[1], encoding = "latin1", skipNul=TRUE)
news.data <- readLines(dataset.path[2], encoding = "latin1", skipNul=TRUE)
twitter.data <- readLines(dataset.path[3], encoding = "latin1", skipNul=TRUE)
# count the non-whitespace tokens (words) in a character vector
WordCounter <- function(x) {sum(sapply(gregexpr("\\S+", x), length))}
blogs.words <- WordCounter(blogs.data)
news.words <- WordCounter(news.data)
twitter.words <- WordCounter(twitter.data)
dataset.words <- c(blogs.words, news.words, twitter.words)
dataset.summary.blogs <- stri_stats_general(blogs.data)
dataset.summary.news <- stri_stats_general(news.data)
dataset.summary.twitter <- stri_stats_general(twitter.data)
dataset.summary <- rbind(dataset.summary.blogs,
dataset.summary.news,
dataset.summary.twitter)
dataset.summary <- data.frame(dataset.summary, words = dataset.words,
size_Mb = dataset.size/1024^2)
rownames(dataset.summary) <- dataset.file
kable(dataset.summary)
|  | Lines | LinesNEmpty | Chars | CharsNWhite | words | size_Mb |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 899288 | 208361438 | 171925775 | 37334441 | 200.4242 |
| en_US.news.txt | 1010242 | 1010242 | 203791401 | 170428696 | 34372597 | 196.2775 |
| en_US.twitter.txt | 2360148 | 2360148 | 162385035 | 134370242 | 30373832 | 159.3641 |
The datasets for this project are fairly large, so to speed up this step it is advisable to work with a smaller subset of the data. A sample of 5000 lines from each source has been selected for further analysis.
sample.news <- news.data[sample(1:length(news.data), 5000)]
sample.twitter <- twitter.data[sample(1:length(twitter.data), 5000)]
sample.blogs <- blogs.data[sample(1:length(blogs.data), 5000)]
sample.data <- c(sample.news,sample.twitter,sample.blogs)
First we convert the sampled data to a corpus, which is the base format for text analysis in the tm package.
corpus <- VCorpus(VectorSource(sample.data))
Now we can go on to preprocess the text data. In this step we remove numbers, capitalization, common words and punctuation, and otherwise prepare the text for analysis. This is a time-consuming job, but at the end of it we have high-quality data for analysis. To see the content of the corpus at any point, we can use the inspect command in the R console.
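As a small illustration (this call is not part of the original run), the first two documents can be examined with:
# peek at the metadata and content of the first two documents
inspect(corpus[1:2])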
R cannot read like a human and does not handle punctuation and other special characters the way we do. The following chunk removes punctuation and numbers from the text.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
The code chunk below removes URLs completely (punctuation was already stripped above, so a URL is left as a single alphanumeric token starting with "http"). [:alnum:] matches any alphanumeric character, i.e. letters and digits, and [:punct:] matches punctuation characters. See the details by running ?regex in R or searching for "regular expression".
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
The code chunk below removes anything other than English letters or spaces.
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
We want a word to appear in exactly the same form every time it occurs, so we convert everything to lowercase.
corpus <- tm_map(corpus, content_transformer(tolower))
Stop words are common words in any language that usually carry no analytic value, so we remove them from the dataset.
corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))
If you would like to see some of the English stop words, the chunk below will help:
length(stopwords("english"))
## [1] 174
stopwords("english")[1:20]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
The data contain words with offensive and profane meanings. A list of these bad words (profanity) was therefore collected and used to filter them out of the dataset; they will not be used in the prediction model.
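The object data.profanity used below is not defined elsewhere in this report; as an illustration only, it could be read from a plain-text word list (the file path here is hypothetical):
# hypothetical example: read a one-word-per-line profanity list into a character vector
data.profanity <- readLines("./dataset/profanity.txt", skipNul = TRUE)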
corpus <- tm_map(corpus, removeWords, data.profanity)
Some words appear with a variety of endings in the original text. Stemming refers to removing common word endings (e.g., "ing", "es", "s") so that the different forms are recognized as the same word by the computer. For this we use the SnowballC package.
corpus <- tm_map(corpus, stemDocument)
The preprocessing above leaves a lot of white space where words were deleted. These extra white spaces are removed in this step.
corpus <- tm_map(corpus, stripWhitespace)
To finish the preprocessing, we tell R to treat the preprocessed documents as plain text documents.
corpus <- tm_map(corpus, PlainTextDocument)
To continue, we need to create a document-term matrix (or term-document matrix), a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms (Wikipedia).
dtm <- DocumentTermMatrix(corpus)
dtm
To inspect it, we can use inspect(dtm); this will, however, fill up the console quickly, so you may prefer to view a subset, e.g. inspect(dtm[1:5, 1:20]) to view the first 5 documents and first 20 terms, or dim(dtm) to display the number of documents and terms (in that order). We will also need the transpose of this matrix:
tdm <- TermDocumentMatrix(corpus)
tdm
The code chunk below organizes terms by their frequency.
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 26055
Here we check out some of the most and least frequently occurring words.
ord <- order(freq)
head(freq[1:10])
## aaa âââ ââââââââââââââââ aaaaaaar
## 3 1 1 1
## aaah aaahahahah
## 2 1
freq[tail(ord)]
## get just like will one said
## 1206 1215 1262 1303 1351 1526
Check out the frequency of frequencies.
head(table(freq), 20)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12
## 14583 3448 1651 973 652 465 377 304 259 226 169 159
## 13 14 15 16 17 18 19 20
## 171 124 117 101 90 86 81 68
tail(table(freq), 20)
## freq
## 632 635 648 659 686 689 704 723 763 781 824 960 1075 1124 1206
## 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## 1215 1262 1303 1351 1526
## 1 1 1 1 1
Another way to display the result:
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
freq[1:20 ]
## said one will like just get time can year make day new
## 1526 1351 1303 1262 1215 1206 1124 1075 960 824 781 763
## work know now peopl say love also want
## 723 704 689 689 686 659 648 635
wf <- data.frame(word=names(freq), freq=freq)
wf[1:20, ]
## word freq
## said said 1526
## one one 1351
## will will 1303
## like like 1262
## just just 1215
## get get 1206
## time time 1124
## can can 1075
## year year 960
## make make 824
## day day 781
## new new 763
## work work 723
## know know 704
## now now 689
## peopl peopl 689
## say say 686
## love love 659
## also also 648
## want want 635
Here we plot the words that appear at least 500 times.
p <- ggplot(subset(wf, freq>500), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + coord_flip()
p
If you have a term in mind that is particularly meaningful to your analysis, you may find it helpful to identify the words that correlate most highly with it. If two words always appear together, their correlation is 1.0.
# specifying a correlation limit of 0.20
findAssocs(dtm, "year", corlimit=0.20)
## $year
## ago last
## 0.24 0.23
A word cloud gives a quick visual impression of the most frequent words; setting a seed makes its layout reproducible.
set.seed(2015)
wordcloud(names(freq), freq, min.freq=100, scale=c(5, 0.1),
colors=brewer.pal(6, "Dark2"))
Before clustering, it pays to remove many of the uninteresting or infrequent words first. We can do so with the following code, which produces a matrix that is at most 96.5% empty space.
dtm.new <- removeSparseTerms(dtm, 0.965)
dtm.new
## <<DocumentTermMatrix (documents: 15000, terms: 28)>>
## Non-/sparse entries: 20961/399039
## Sparsity : 95%
## Maximal term length: 5
## Weighting : term frequency (tf)
Now we can calculate the distances between words and then cluster them according to similarity.
d <- dist(t(dtm.new), method="euclidean")
fit <- hclust(d=d, method="ward") # recent versions of hclust map "ward" to "ward.D" (with a warning)
fit
##
## Call:
## hclust(d = d, method = "ward")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 28
plot(fit, hang = -1)
groups <- cutree(fit, k=5)
rect.hclust(fit, k=5, border="red")
The k-means clustering method attempts to cluster the words into a specified number of groups (in this case 2), such that the sum of squared distances between individual words and their group centers is minimized. You can change the number of groups by changing the number given in the kmeans() call, as shown in the sketch after the plot below.
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
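For instance (an illustrative re-run, not part of the original analysis), asking for five groups instead of two would look like this:
# same k-means clustering of the distance matrix, but with k = 5 groups
kfit5 <- kmeans(d, 5)
clusplot(as.matrix(d), kfit5$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)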
Tokenization is the act of breaking a sequence of strings into pieces, such as words, keywords, phrases, symbols and other elements, called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters such as punctuation marks are discarded. The tokens become the input for further processing such as parsing and text mining.
Tokenization is widely used in computer science, where it plays a large part in lexical analysis; a bigram tokenizer is sketched below.
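As a minimal sketch (not part of the original analysis), RWeka's NGramTokenizer can be used to tokenize the corpus into bigrams and build a term-document matrix of two-word phrases:
# bigram tokenizer based on RWeka; min and max control the n-gram length
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# term-document matrix whose terms are two-word phrases
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))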
Finally, the helper function below draws a word cloud directly from a term-document matrix.
wordCloudPlot <- function(tdm, n) {
  m <- as.matrix(tdm)
  # calculate the frequency of each word and sort in decreasing order
  word.freq <- sort(rowSums(m), decreasing = TRUE)
  df <- data.frame(word = names(word.freq), freq = word.freq)
  # plot the word cloud, keeping words that occur at least n times
  wordcloud(df$word, df$freq, scale = c(5, 0.3), min.freq = n,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}
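As an illustrative call (not shown in the original report), the function can be applied to the term-document matrix built earlier:
# hypothetical usage: draw a word cloud of the words occurring at least 100 times in tdm
wordCloudPlot(tdm, 100)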