Synopsis

This report summarizes text mining with R on a given set of text files. It starts by loading all the text files and doing a basic statistical analysis. The data is then loaded into a corpus from the tm package and explored. Data exploration includes the following data cleaning and summary steps:

  1. The text is transformed and used to build a term-document matrix.
  2. Frequent words and associations are then found from the matrix.
  3. A word cloud is used to present the important words in the documents.
  4. In the end, words and tweets are clustered to find groups of words and also groups of tweets.

Once the basic counts and occurrences of words are available, the text is tokenized into n-grams using the NGramTokenizer function from the RWeka package.

The report ends with a brief description of the predictive algorithm for word prediction.

Data Processing

Load the required libraries

library(rJava)
library(SnowballC)
library(RWeka)
library(tm)
## Loading required package: NLP
library(openNLP)
suppressMessages(library(ggplot2))
suppressMessages(library(wordcloud))
library(stringi)

Data Loading and summary of files

Load all three text files

  1. en_US.news.txt
  2. en_US.blogs.txt
  3. en_US.twitter.txt
##         Size in byte Number of sentences
## News        20111392               77259
## Blog       260564320              899288
## Twitter    316037344             2360148
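
The code that produced this summary is not shown in the report. A minimal sketch of one way to get a size and a line count per file follows; the helper name summarise_file is hypothetical, and taking the size from file.info() is an assumption about how the figures above were obtained.

```r
# Hypothetical helper: size on disk and number of lines for one text file.
summarise_file <- function(path) {
  # Binary mode avoids stopping early on stray control characters.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(size_bytes = file.info(path)$size, n_lines = length(lines))
}

# Demo on a temporary file so the sketch runs without the course data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_summary <- summarise_file(tmp)
print(demo_summary)
```

For the three files listed above, the same helper could be applied to each path and the rows combined with rbind().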

Because the original files are too big to work with in full, random lines are selected from all three files to create a training data set. The rbinom function is used to select roughly 5% of the lines at random from each text file, using the function below:

# sample a fraction `rate` of the lines from each data set, then combine the samples
SampleData <- function(dataset, rate)
{
  return(dataset[as.logical(rbinom(length(dataset),1,rate))])
}
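
As a quick illustration of how this sampling behaves, the sketch below applies the same function to a stand-in vector of lines. The seed and the stand-in data are assumptions made only so the example is reproducible.

```r
# Same sampling function as in the report.
SampleData <- function(dataset, rate) {
  dataset[as.logical(rbinom(length(dataset), 1, rate))]
}

set.seed(1234)                        # assumed seed, for reproducibility only
lines <- paste("line", 1:10000)       # stand-in for the lines of one text file
sampled <- SampleData(lines, 0.05)

# The kept fraction is close to the requested rate, but not exactly equal,
# because every line is kept independently with probability `rate`.
length(sampled) / length(lines)
```

If an exact sample size were needed, sample() with a fixed size would be the usual alternative; the Bernoulli approach keeps the original line order and needs no size calculation.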

All sample files are first loaded into a corpus, which is a collection of text documents. After that, the corpus can be processed with the functions provided in package tm.

corpus <- Corpus(DirSource("C:\\Courseera\\Data_Science\\Capstone_Project\\final\\temp",encoding="UTF-8"), readerControl = list(language="en_US"))
summary(corpus)
##                   Length Class             Mode
## selectBlog.txt    2      PlainTextDocument list
## selectNews.txt    2      PlainTextDocument list
## selectTwitter.txt 2      PlainTextDocument list

Transforming Text

The corpus needs a couple of transformations, including changing letters to lower case and removing punctuation, numbers and stop words. Junk (non-letter) characters are also removed in the example below.

# remove junk characters (anything that is not a letter, whitespace or an apostrophe)
onlyAlpha <- content_transformer(function(x) stri_replace_all_regex(x, "[^\\p{L}\\s[']]+", ""))
corpus <- tm_map(corpus, onlyAlpha)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# remove stop words from the corpus
# (removeWords is case-sensitive and runs before lower-casing, so capitalized stop words such as "The" are kept)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# second punctuation pass to remove any remaining special characters
corpus <- tm_map(corpus, removePunctuation)

In the above code, tm_map() is an interface for applying transformations (mappings) to corpora. A list of available transformations can be obtained with getTransformations(); the most commonly used ones are as.PlainTextDocument(), removeNumbers(), removePunctuation() and removeWords(). The function onlyAlpha is defined above to remove junk characters, such as “•œö09d”, “081öœ” and “â–”. It uses the stri_replace_all_regex function from the stringi package. The pattern is specified as a regular expression; details can be found by running ?regex in R.
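
To see what the junk-character pattern does to a string, here is a small sketch that uses base R's gsub() with perl = TRUE as a stand-in for stri_replace_all_regex. The helper name strip_junk is hypothetical, and the character class is written without the nested brackets of the original pattern, on the assumption that the intent is to keep letters, whitespace and apostrophes.

```r
# Base-R stand-in for the onlyAlpha transformation: drop everything that is
# not a letter, whitespace or an apostrophe.
strip_junk <- function(x) gsub("[^\\p{L}\\s']+", "", x, perl = TRUE)

strip_junk("abc123!")            # digits and punctuation are dropped
strip_junk("don't stop 2day")    # the apostrophe and the spaces survive
```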

Building a Term-Document Matrix

A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.

corpus.tdm <- TermDocumentMatrix(corpus)
corpus.tdm
## <<TermDocumentMatrix (terms: 84037, documents: 3)>>
## Non-/sparse entries: 117060/135051
## Sparsity           : 54%
## Maximal term length: 165
## Weighting          : term frequency (tf)

As we can see from the above result, the term-document matrix is composed of 84037 terms and 3 documents. It is fairly sparse, with 54% of the entries being zero.
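
The structure of such a matrix, and how the sparsity figure is derived, can be sketched in base R on three toy documents. The documents and variable names below are made up for illustration; the last two lines redo the sparsity calculation with the entry counts printed above.

```r
docs <- c(doc1 = "the cat sat", doc2 = "the dog sat down", doc3 = "a cat")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, entries are occurrence counts.
toy_tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(toy_tdm) <- terms
toy_tdm

# Sparsity is the share of zero entries.
toy_sparsity <- sum(toy_tdm == 0) / length(toy_tdm)

# Same calculation with the non-sparse/sparse counts printed for the report's matrix.
report_sparsity <- 135051 / (117060 + 135051)
round(report_sparsity, 2)
```

Note that TermDocumentMatrix() in tm drops words shorter than three characters by default, so a term such as "a" would not appear in a matrix built with the package.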

Frequent Terms

The function findFreqTerms() is used to find the frequent terms, here those that occur at least 5000 times.

To show the top frequent words visually, we next make a barplot for them.

[Figure: bar plot of the most frequent terms]
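
The code for this step is not reproduced in the report. A minimal base-R sketch of the idea follows; the term counts are made up for illustration and are not the report's results.

```r
# Made-up term frequencies (row sums of a term-document matrix).
term_freq <- c(just = 9100, like = 8700, love = 6400, will = 5200, cat = 120)

# Terms occurring at least 5000 times. On the report's matrix this would
# correspond to findFreqTerms(corpus.tdm, lowfreq = 5000).
frequent <- names(term_freq)[term_freq >= 5000]

# Bar plot of the frequent terms, most frequent first.
top <- sort(term_freq[frequent], decreasing = TRUE)
barplot(top, las = 2, ylab = "Frequency", main = "Most frequent terms")
```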

Word Cloud

After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud), which can be easily produced with package wordcloud. To build it, we first convert the term-document matrix to a normal matrix and then calculate word frequencies. After that, we set gray levels based on word frequency and use wordcloud() to make a plot for it.

[Figure: word cloud of the most frequent terms]

The above word cloud clearly shows again that “Just”, “Like”, “the” and “Love” are among the top words.
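
The word cloud code itself is not shown in the report. The sketch below follows the steps described in this section on a made-up matrix; the gray-level formula is an assumption (darker means more frequent), and the plot is only drawn if the wordcloud package is installed.

```r
# Made-up term-document matrix; with the report's data this would be
# as.matrix(corpus.tdm).
toy_tdm <- matrix(c(5, 3, 1,
                    4, 0, 2,
                    1, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("just", "like", "love"),
                                  c("blog", "news", "twitter")))

# Word frequencies are the row sums, most frequent first.
word_freq <- sort(rowSums(toy_tdm), decreasing = TRUE)

# Gray levels: 0 is black, so the most frequent word is the darkest.
gray_levels <- gray(0.8 * (1 - word_freq / max(word_freq)))

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(word_freq), freq = word_freq,
                       min.freq = 1, random.order = FALSE,
                       colors = gray_levels, ordered.colors = TRUE)
}
```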

Tokenization

The RWeka package is used to tokenize the corpus into single words (unigrams), bi-grams and tri-grams for further exploratory analysis, with each set kept separately for now.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_t2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) 
tdm_t3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
tdm_t1 <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
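
To make the output of the tokenizers concrete, here is a base-R sketch of what bi-gram and tri-gram tokenization produces for one sentence. The helper ngrams is hypothetical and splits on whitespace only; RWeka's NGramTokenizer has its own delimiter settings, which this sketch does not reproduce.

```r
# Hypothetical helper: all n-grams of a whitespace-tokenized string.
ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams("thanks for the follow", 2)   # "thanks for" "for the" "the follow"
ngrams("thanks for the follow", 3)   # "thanks for the" "for the follow"
```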

Distribution of word frequencies for single-word, two-word and three-word combinations: the n-grams are transformed to data frames and ordered by frequency for charting.

[Figures: frequency bar plots of the top unigrams, bi-grams and tri-grams]
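
The data preparation for these charts is not shown. A minimal sketch follows, assuming the n-gram counts are available as an ordinary matrix (for the report's data, for example as.matrix(tdm_t2)); the helper name freq_table and the counts are made up for illustration, and base barplot() stands in for the report's plotting code.

```r
# Hypothetical helper: turn a term-by-document count matrix into a data frame
# of terms ordered by total frequency.
freq_table <- function(m) {
  freq <- sort(rowSums(m), decreasing = TRUE)
  data.frame(term = names(freq), freq = as.vector(freq),
             stringsAsFactors = FALSE)
}

# Made-up bi-gram counts per document.
toy <- matrix(c(2, 1, 0,
                7, 4, 3,
                0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("right now", "thanks for", "last night"),
                              c("blog", "news", "twitter")))

bigram_freq <- freq_table(toy)
bigram_freq
barplot(bigram_freq$freq, names.arg = bigram_freq$term, las = 2,
        ylab = "Frequency")
```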

Final Algorithm details

The final prediction algorithm will be built on an n-gram Markov model. The algorithm steps are as follows:

  1. First, 2-gram, 3-gram, and 4-gram frequency tables (the first two are calculated above) will be converted to an n-gram frequency matrix (Markov model).

  2. Many n-grams are never seen in the training data. The probabilities of n-grams that are seen in the corpus are estimated from their counts (maximum likelihood estimation). For unseen n-grams, different smoothing techniques will be applied to these estimates.

  3. Finally, a function examines a text string entered by a user, compares it to the smoothed n-gram frequency matrix, and tries to match the last three words of the string to the start of an existing 4-gram. If there is a match, the function looks for the most probable word completing the 4-gram. If there is no match, it checks the 3-grams, and then the 2-grams. At each step, the function looks for a match and, if there is one, identifies the word with the highest probability of occurring next, using the smoothed frequency matrix. This is a backoff model approach.
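
The backoff lookup in the steps above can be sketched in base R on a toy training set. This is an illustration under stated assumptions, not the final algorithm: the helper names and training sentences are made up, and the sketch uses raw (unsmoothed) counts, so the smoothing of step 2 is left out.

```r
# Hypothetical helper: count all n-grams of order n in a set of sentences.
ngram_counts <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Hypothetical helper: back off from 4-grams to 3-grams to 2-grams.
predict_next <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in sort(as.integer(names(tables)), decreasing = TRUE)) {
    if (length(words) < n - 1) next          # not enough context for this order
    tab <- tables[[as.character(n)]]
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    hits <- tab[startsWith(as.character(names(tab)), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]   # most frequent continuation
      return(sub(".* ", "", best))           # keep only the last word
    }
  }
  NA_character_                              # no match at any order
}

# Made-up training sentences.
train <- c("thanks for the follow",
           "thanks for the support",
           "thanks for the follow back",
           "looking forward to the weekend")

tables <- lapply(4:2, function(n) ngram_counts(train, n))
names(tables) <- 4:2

predict_next("Thanks for the", tables)   # matched at the 4-gram level
predict_next("see you to", tables)       # backs off to the 2-grams
predict_next("hello world", tables)      # no match at any level
```

With real data, the count tables would be replaced by smoothed probabilities, and ties between equally frequent continuations would need an explicit rule; here which.max() simply takes the first one.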