This report summarizes text mining with R on a given set of text files. It starts by loading all the text files and performing a basic statistical analysis. The data is then loaded into a corpus from the tm package and explored. Data exploration includes the following cleaning and summary steps:
1. The text is transformed to build a document-term matrix.
2. Frequent words and associations are found from the matrix.
3. A word cloud is used to present important words in the documents.
4. Finally, words and tweets are clustered to find groups of words and groups of tweets.
After obtaining the basic counts and occurrences of words, an n-gram model is used to tokenize the text, using functions from the RWeka package.
The report ends with a brief description of the predictive algorithm for word prediction.
Load the required libraries
library(rJava)
library(SnowballC)
library(RWeka)
library(tm)
## Loading required package: NLP
library(openNLP)
suppressMessages(library(ggplot2))
suppressMessages(library(wordcloud))
library(stringi)
Load all three text files
## Size in bytes Number of sentences
## News 20111392 77259
## Blog 260564320 899288
## Twitter 316037344 2360148
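The loading step itself is not shown in the report; a minimal sketch, assuming the standard Capstone file names and using stringi for the sentence counts, might look like this:
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# basic statistics, shown here for the news file: size in bytes and sentence count
file.size("en_US.news.txt")
sum(stri_count_boundaries(news, opts_brkiter = stri_opts_brkiter(type = "sentence")))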
As the original files are too big to load in full, random lines are selected from all three files to create a training data set. The rbinom function is used to take a 5% random sample of lines from each text file, using the function below:
#sample a fraction (rate) of the data from each data set then combine together
SampleData <- function(dataset, rate)
{
return(dataset[as.logical(rbinom(length(dataset),1,rate))])
}
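A hedged usage sketch (the input vectors come from the loading step above, the 5% rate follows the text, and the output file names match the corpus listing below):
set.seed(123) #fix the seed so the random sample is reproducible
selectNews <- SampleData(news, 0.05)
selectBlog <- SampleData(blog, 0.05)
selectTwitter <- SampleData(twitter, 0.05)
writeLines(selectNews, "C:\\Courseera\\Data_Science\\Capstone_Project\\final\\temp\\selectNews.txt")
writeLines(selectBlog, "C:\\Courseera\\Data_Science\\Capstone_Project\\final\\temp\\selectBlog.txt")
writeLines(selectTwitter, "C:\\Courseera\\Data_Science\\Capstone_Project\\final\\temp\\selectTwitter.txt")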
All sample files are first loaded into a corpus, which is a collection of text documents. After that, the corpus can be processed with functions provided in the tm package.
corpus <- Corpus(DirSource("C:\\Courseera\\Data_Science\\Capstone_Project\\final\\temp",encoding="UTF-8"), readerControl = list(language="en_US"))
summary(corpus)
## Length Class Mode
## selectBlog.txt 2 PlainTextDocument list
## selectNews.txt 2 PlainTextDocument list
## selectTwitter.txt 2 PlainTextDocument list
The corpus needs a couple of transformations, including changing letters to lower case and removing punctuation, numbers, and stop words. Non-alphabetic junk characters are also removed in the example below.
#remove junk characters
onlyAlpha <- content_transformer(function(x) stri_replace_all_regex(x,"[^\\p{L}\\s[']]+",""))
corpus <- tm_map(corpus, onlyAlpha)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
#remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# convert to lower case
corpus <- tm_map(corpus,content_transformer(tolower))
In the above code, tm_map() is an interface to apply transformations (mappings) to corpora. A list of available transformations can be obtained with getTransformations(), and the most commonly used ones are as.PlainTextDocument(), removeNumbers(), removePunctuation() and removeWords(). The function onlyAlpha defined above removes junk characters, such as “ö09d”, “081ö” and “â”; it uses the stri_replace_all_regex API from the stringi package. The pattern above is specified as a regular expression, and details can be found by running ?regex in R.
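As a quick check, the built-in transformations can be listed directly:
getTransformations()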
A term-document matrix represents the relationship between terms and documents: each row stands for a term, each column for a document, and each entry is the number of occurrences of the term in the document.
corpus.tdm <- TermDocumentMatrix(corpus)
corpus.tdm
## <<TermDocumentMatrix (terms: 84037, documents: 3)>>
## Non-/sparse entries: 117060/135051
## Sparsity : 54%
## Maximal term length: 165
## Weighting : term frequency (tf)
As we can see from the above result, the term-document matrix is composed of 84037 terms and 3 documents. It is sparse, with 54% of the entries being zero.
The function findFreqTerms() finds frequent terms with frequency no less than a given threshold; here we look for terms occurring at least 5000 times.
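A minimal sketch of that call (freq.terms is a hypothetical name):
freq.terms <- findFreqTerms(corpus.tdm, lowfreq = 5000)
freq.terms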
To show the most frequent words visually, we next make a barplot of them.
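A minimal sketch of the barplot, assuming the top 20 terms are wanted (note that as.matrix() on a large term-document matrix can be memory-hungry):
term.freq <- sort(rowSums(as.matrix(corpus.tdm)), decreasing = TRUE)
top.terms <- data.frame(term = names(term.freq)[1:20], freq = unname(term.freq[1:20]))
ggplot(top.terms, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Term") + ylab("Frequency")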
After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud), which can be easily produced with package wordcloud. In the code below, we first convert the term-document matrix to a normal matrix, and then calculate word frequencies. After that, we set gray levels based on word frequency and use wordcloud() to make a plot for it.
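A minimal sketch of that sequence (the min.freq cutoff is an assumption):
m <- as.matrix(corpus.tdm)
word.freq <- sort(rowSums(m), decreasing = TRUE)
#gray levels scaled by word frequency
gray.levels <- gray((word.freq + 10) / (max(word.freq) + 10))
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 2000,
          random.order = FALSE, colors = gray.levels)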
The above word cloud clearly shows again that “just”, “like”, “the” and “love” are the most frequent words.
The RWeka package is used to build single-word (unigram), bigram and trigram tokenizers for further exploratory analysis, keeping each n-gram set in a separate term-document matrix for now.
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_t1 <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm_t2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_t3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
Distribution of word frequencies for single-word, two-word and three-word combinations: the data is prepared for charting by transforming the n-grams to data frames and ordering them by frequency, as sketched below.
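A minimal sketch of that transformation (freq_df is a hypothetical helper applied to the three matrices built above):
freq_df <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(ngram = names(freq), freq = unname(freq), stringsAsFactors = FALSE)
}
unigram_df <- freq_df(tdm_t1)
bigram_df <- freq_df(tdm_t2)
trigram_df <- freq_df(tdm_t3)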
The final prediction algorithm will be built on an n-gram Markov model. The algorithm steps are as follows:
First, the 2-gram, 3-gram, and 4-gram frequency tables (built in the same way as above) will be converted to an n-gram frequency matrix (the Markov model).
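A minimal sketch of that conversion for the bigram table (markov_table is a hypothetical helper; it splits each n-gram into its prefix and next word, which is the transition structure of the Markov model):
markov_table <- function(df) {
  parts <- strsplit(df$ngram, " ")
  data.frame(prefix = sapply(parts, function(w) paste(head(w, -1), collapse = " ")),
             nextword = sapply(parts, function(w) tail(w, 1)),
             freq = df$freq, stringsAsFactors = FALSE)
}
bigram_mm <- markov_table(bigram_df)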
Many n-grams are not seen in the training data. Probabilities of n-grams that are seen in the corpus can be estimated directly from their counts (Maximum Likelihood Estimation); for unseen n-grams, a smoothing technique such as Good-Turing discounting will be used, since pure Maximum Likelihood Estimation assigns them zero probability.
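A minimal sketch of the MLE estimate for a seen bigram, P(w2 | w1) = count(w1 w2) / count(w1), using the frequency data frames built earlier (mle_prob is a hypothetical name):
mle_prob <- function(w1, w2) {
  big <- bigram_df$freq[bigram_df$ngram == paste(w1, w2)]
  uni <- unigram_df$freq[unigram_df$ngram == w1]
  if (length(big) == 0 || length(uni) == 0) return(0) #unseen: handled by smoothing
  big / uni
}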
Finally, a function examines a text string entered by a user, compares it to the smoothed n-gram frequency matrix, and tries to match the string to the existing 4-grams. If there is a match, the function looks for the maximum-probability word that follows the 4-gram. If there is no match, it checks the 3-grams, and then the 2-grams. At each step the function looks for a match, and when one is found it identifies the word with the highest probability of occurring next, using the smoothed frequency matrix. This is a backoff model approach.
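A minimal sketch of that backoff lookup (predict_next and ngram_tables are hypothetical names; ngram_tables is assumed to be a list whose nth element is the n-gram frequency data frame built above, extended to 4-grams):
predict_next <- function(input, ngram_tables) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  for (n in 4:2) { #try 4-grams first, then back off to 3-grams and 2-grams
    if (length(words) < n - 1) next #not enough context for this order
    prefix <- paste(tail(words, n - 1), collapse = " ")
    ngrams <- ngram_tables[[n]]$ngram
    matches <- which(startsWith(ngrams, paste0(prefix, " ")))
    if (length(matches) > 0) {
      best <- ngrams[matches[which.max(ngram_tables[[n]]$freq[matches])]]
      return(tail(unlist(strsplit(best, " ")), 1)) #last word of the best n-gram
    }
  }
  NA_character_ #no match at any order
}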