This is the milestone report for the Data Science Capstone project. The project involves developing an algorithm that predicts the next word in real time as the user types. The algorithm learns from a corpus of news articles, blog posts, and Twitter data. This report presents some exploratory analysis of the text data and briefly outlines the next steps of the project: building the prediction algorithm and the data product.
First, we analyze the given datasets to explore the number of lines, the number of words, and the length of the longest line in each file. A helper function is written to return these statistics for a dataset.
As the first step, we load each dataset:
blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
stat_summary <- function(x){
  num_lines <- length(x)                                 ## number of lines
  num_words <- sum(sapply(gregexpr("\\S+", x), length))  ## total word count
  maxchar_line <- max(nchar(x))                          ## length of the longest line
  tot_stats <- cbind(num_lines, num_words, maxchar_line)
  return(tot_stats)
}
Using the function, we get the summary statistics for each dataset.
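A minimal sketch of how the summary table below might be assembled (the use of rbind and the row/column labels are assumptions):
all_stats <- rbind(stat_summary(blog), stat_summary(twitter), stat_summary(news))
rownames(all_stats) <- c("Blog", "Twitter", "News")
colnames(all_stats) <- c("Lines", "Words", "Max_Chars")
all_stats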
##             Lines      Words  Max_Chars
## Blog       899288   37334131      40833
## Twitter   2360148   30373583        140
## News        77259    2643969       5760
Since we are assessing the three text documents together, we combine them into a single dataset. Because the subsequent steps would be time-consuming with the full dataset, a small sample of about 0.1% of the combined data is drawn.
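A minimal sketch of how such a sample might be drawn (the 0.1% rate and the sample_corpora name match the code below; the use of sample() with a fixed seed is an assumption):
set.seed(1234)                          ## assumed seed, for reproducibility
full_corpora <- c(blog, news, twitter)  ## combine the three datasets
sample_corpora <- sample(full_corpora, size = round(0.001 * length(full_corpora)))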
The next step is to clean the dataset by removing punctuation, numbers, profanity, and so on. This example also removes stop words such as “the”, “you”, and “his”, purely so that the word cloud shows the most common words other than these very frequent ones. The stop words will be put back into the dataset for the next step of generating n-grams and building the prediction algorithm, as they are quite important in predicting the next word.
library(tm)

sample_corpora <- gsub("[^\x20-\x7E]", "", sample_corpora)  ## remove non-ASCII characters
review_source <- VectorSource(sample_corpora)
corpus <- Corpus(review_source)
bad_words <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt") ## profanity list
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeWords, bad_words)
This cleaned-up dataset is used to visualize the most common words as a word cloud. To do that, a document-term matrix is generated, which provides the frequency of each word.
library(wordcloud)
library(RColorBrewer)

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
freq <- colSums(dtm2)                  ## frequency of each term across documents
freq <- sort(freq, decreasing = TRUE)
f <- data.frame(freq)
f$word <- row.names(f)
pal2 <- brewer.pal(8, "Dark2")
wordcloud(f$word, f$freq, scale = c(4, 0.2), min.freq = 50,
          random.order = FALSE, rot.per = 0.15, colors = pal2)
The word cloud shows the high-frequency words in the sample of the text (minus the stop words). Next, we put the stop words back and generate N-grams, which are contiguous sequences of N words. The RWeka package is used to set up tokenizer functions for 1-gram, 2-gram, 3-gram, and 4-gram tokens.
oneGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=1, max=1))}
biGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))}
triGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))}
quadGram <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=4, max=4))}
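For example, applying the bigram tokenizer to a short sentence produces roughly the following (illustrative only; exact print formatting may differ):
biGram("the quick brown fox jumps")
## "the quick" "quick brown" "brown fox" "fox jumps"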
Then we use the TermDocumentMatrix function with each tokenizer to obtain the frequency of each N-gram, converting the results to sparse matrices so the counts can be summed efficiently.
tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize=oneGram))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize=biGram))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize=triGram))
tdm4 <- TermDocumentMatrix(corpus, control = list(tokenize=quadGram))
library(Matrix)
library(dplyr)

## Helper: convert a TermDocumentMatrix to a frequency table sorted by count.
## dims is supplied explicitly so every term is kept, even if trailing rows or
## columns of the sparse matrix would otherwise be dropped.
ngram_freq <- function(tdm) {
  m <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v, dims = c(tdm$nrow, tdm$ncol))
  freq_table <- data.frame(word = tdm$dimnames$Terms, freq = rowSums(m),
                           stringsAsFactors = FALSE)
  arrange(freq_table, desc(freq))
}

gram1freq <- ngram_freq(tdm1)
gram2freq <- ngram_freq(tdm2)
gram3freq <- ngram_freq(tdm3)
gram4freq <- ngram_freq(tdm4)
From these tables, we can find the top 10 word chains in the sample dataset.
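A minimal sketch of how the table below might be assembled from the frequency tables (the head/data.frame combination is an assumption):
top10 <- data.frame(`1-Gram` = head(gram1freq$word, 10),
                    `2-Gram` = head(gram2freq$word, 10),
                    `3-Gram` = head(gram3freq$word, 10),
                    `4-Gram` = head(gram4freq$word, 10),
                    check.names = FALSE)
top10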
##     1-Gram  2-Gram   3-Gram              4-Gram
##  1  the     of the   thanks for the      thanks for the follow
##  2  and     in the   one of the          the rest of the
##  3  you     to the   a lot of            for the first time
##  4  for     for the  i have to           i was going to
##  5  that    to be    cant wait to        i cant wait to
##  6  with    on the   the rest of         of a trade mark
##  7  was     it was   going to be         thanks for the rt
##  8  this    at the   looking forward to  to be able to
##  9  are     in a     to be a             as much as i
## 10  have    for a    im going to         cant wait to see
We can plot the frequency distributions of the N-grams to see how often the most common word chains occur.
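A minimal sketch of one such frequency plot for the bigrams, assuming ggplot2 is used (the original plots may have been produced differently):
library(ggplot2)

## Bar chart of the 15 most frequent bigrams in the sample
top_bigrams <- head(gram2freq, 15)
ggplot(top_bigrams, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Most frequent 2-grams in the sample")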
It can be observed that the frequency of word chains drops for higher N-grams: the highest-frequency word occurs about 3,000 times in the 1-gram table, whereas the highest-frequency word chain in the 4-gram table occurs only about 8 times. This is an expected result.
The next step of this project is to use the generated N-grams to develop a prediction algorithm: given the last few words typed, the model should predict the most probable next word. As observed, the longer the word chain, the lower its frequency of occurrence in the dataset. I plan to explore higher N-grams on the full dataset to see how frequent those word chains are. Backoff models such as Katz's backoff will be used to obtain the conditional probability of a word given the preceding word chain.
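As a rough illustration of the backoff idea (this is a simplified lookup, not Katz's backoff itself; the predict_next name and the string handling are assumptions), the longest available N-gram table is consulted first and shorter tables are used only when no match is found:
## Look up the last (n-1) typed words in the n-gram tables, longest first
predict_next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in 4:2) {
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      tbl <- list(gram2freq, gram3freq, gram4freq)[[n - 1]]
      hits <- tbl[startsWith(tbl$word, paste0(prefix, " ")), ]
      if (nrow(hits) > 0) {
        ## return the final word of the most frequent matching n-gram
        best <- hits$word[which.max(hits$freq)]
        return(tail(unlist(strsplit(best, " ")), 1))
      }
    }
  }
  gram1freq$word[1]  ## fall back to the most common unigram
}

predict_next("thanks for the")  ## expected to suggest "follow" on this sample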
We also need to account for word combinations that do not occur in this text corpus, so a smoothing technique will be used to assign them a non-zero probability.
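As one simple example of such a technique (add-one/Laplace smoothing is an assumption here; other methods such as Kneser-Ney could be used instead), the smoothed probability of word w following word u is (count(u w) + 1) / (count(u) + V), where V is the vocabulary size:
laplace_bigram_prob <- function(u, w) {
  V <- nrow(gram1freq)                                   ## vocabulary size
  c_uw <- gram2freq$freq[gram2freq$word == paste(u, w)]  ## bigram count
  c_u  <- gram1freq$freq[gram1freq$word == u]            ## unigram count
  if (length(c_uw) == 0) c_uw <- 0
  if (length(c_u)  == 0) c_u  <- 0
  (c_uw + 1) / (c_u + V)
}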
The Shiny product will be constructed so that the predicted next word appears in real time as the user types.
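A minimal sketch of what such a Shiny app might look like (the layout and the reuse of the hypothetical predict_next function above are assumptions):
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("next_word")
)

server <- function(input, output) {
  output$next_word <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    paste("Predicted next word:", predict_next(input$phrase))
  })
}

shinyApp(ui = ui, server = server)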
The main issue I have been experiencing is the size of the corpus. In order to construct the actual model, I would need to use most of the text dataset, which is huge for my machine! I would appreciate any suggestions for handling this issue.