The aim of this project is to build a Shiny application that uses an algorithm to predict the word most likely to follow one or more words typed by a user. This problem is a language modeling task in the field of Natural Language Processing (NLP). We are provided with text data from social media (Twitter), news and blogs in three languages: English, German and French. We use the English data for our project. The basic size summary below shows that we are dealing with large data sets.
library(tm); library(koRpus); library(data.table); library(ggplot2); library(wordcloud); library(cowplot); library(gridExtra)
## fileName size_in_MB lineCount
## 1: en_US.blogs.txt 210.2 899288
## 2: en_US.news.txt 205.8 1010242
## 3: en_US.twitter.txt 167.1 2360148
Cleaning and Sampling. We first remove profane words using regular expressions, since we do not want the prediction algorithm to return such words. We then choose a random subset of the data as the training set: we use the n.readLines function from the reader package to read the data in chunks and select lines at random. For the purpose of this report we work with a relatively small sample of \(1\%\) of the cleaned data.
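The cleaning and sampling steps can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the code used for this report: it reads with base `readLines` on an open connection instead of `n.readLines`, it keeps each line independently with probability `frac` (so the sample is only approximately \(1\%\)), and the names `remove_profanity`, `sample_lines`, `bad_words`, the chunk size and the seed are all hypothetical.

```r
# Hypothetical helper: delete listed words (case-insensitive, whole words only).
# Assumes `bad_words` contains plain alphabetic words without regex metacharacters.
remove_profanity <- function(lines, bad_words) {
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
}

# Hypothetical helper: keep each line with probability `frac`,
# reading the file in chunks so it never has to fit in memory at once.
sample_lines <- function(path, frac = 0.01, chunk_size = 10000L, seed = 1234) {
  set.seed(seed)                       # make the random subset reproducible
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break     # end of file reached
    keep <- runif(length(chunk)) < frac
    kept <- c(kept, chunk[keep])
  }
  kept
}
```

Sampling line by line keeps memory use bounded by the chunk size, which matters for files of the sizes listed above.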
Tokenization. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. We tokenize the data to understand the distribution of these tokens, such as words, numbers and punctuation marks (commas, full stops etc.). The Twitter data has more full stops and fewer commas than the other two sources, and the news data contains the most numbers. This reflects a fundamental difference in how language is used in the data sets: the news data is more formal and reports more facts and figures, while blogs and Twitter are more informal. The distribution of word lengths also shows very long words in the Twitter data that are absent from the others, probably due to web links.
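# Tokenize the training file with koRpus
tokenized <- koRpus::tokenize(txt = train.file.path, format = "file", lang = "en")
tokenDesc <- slot(tokenized, name = "desc")    # descriptive statistics (line, word, sentence counts)
tokensdf <- slot(tokenized, name = "TT.res")   # data frame with one row per token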
## fileName lineCount wordCount sentCount punctCount
## 1 train.en_US.blog.txt 8993 380472 23997 65396
## 2 train.en_US.news.txt 10102 353717 22089 71202
## 3 train.en_US.twitter.txt 23601 310737 37325 81145
Document Term Matrix
A Document Term Matrix (dtm) is a matrix that describes the frequency of terms in a collection of documents: rows correspond to documents in the collection and columns correspond to terms. Creating such a matrix from the corpus is an important step in text analysis. Using the tm package, we read the corpus, convert the text to lower case, and remove punctuation, numbers and extra whitespace before creating the dtm.
# Use tm package
docs <- Corpus(DirSource(trainData.path)) # Load the data
# Preprocessing
docs <- tm_map(docs, content_transformer(tolower)) # wrap base functions so documents stay TextDocuments
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument) # treat your preprocessed documents as text documents
dtm <- DocumentTermMatrix(docs)
inspect(dtm[, 100:106])
## <<DocumentTermMatrix (documents: 3, terms: 7)>>
## Non-/sparse entries: 9/12
## Sparsity : 57%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Terms
## Docs ably abnormal abny aboard abode abodes abolish
## character(0) 0 3 0 5 3 1 0
## character(0) 2 0 0 7 0 0 1
## character(0) 0 0 1 5 0 0 0
Word Frequency. The dtm allows us to analyse the data by word frequency. We first look at the most frequently occurring words in all documents combined. A more attractive visualization is a word cloud of the 100 most frequent words. “The” and “and” are the two most frequent words.
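freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE) # assumed definition of freq: total count of each term over all documents
wordcloud(names(freq), freq, min.freq=500, max.words=100, colors=brewer.pal(8, "Dark2"), random.order = FALSE, rot.per=.15, scale=c(8,.9))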
Coverage. By coverage we mean the number of unique words, out of the 56519 unique words in the data, needed to cover a given percentage of all word occurrences.
## coverage.percent cutoff.index uniqword.percent cutoff.freq
## 5 0.50 313 0.5537961 299
## 9 0.90 9807 17.3516870 6
## 10 0.95 21313 37.7094428 2
The table and graphs above show that \(50 \%\) coverage requires only 313 unique words, each occurring at least 299 times, while \(90 \%\) coverage requires 9807 unique words, each occurring at least 6 times. So if we remove the terms that occur only once or twice in the data, we still retain more than \(90 \%\) coverage and hence a robust model. This will also remove most foreign-language and unknown words.
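A coverage table of this kind can be computed from a named vector of word counts. The following is a minimal sketch rather than the code behind the table above: the function name `coverage_table` and the toy counts are hypothetical, and only the column names are taken from the report's output.

```r
# Hypothetical helper: for each target coverage, find how many of the most
# frequent words are needed and the frequency of the last word included.
coverage_table <- function(freq, targets = c(0.5, 0.9, 0.95)) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)   # coverage reached by the k most frequent words
  idx <- vapply(targets,
                function(p) unname(which(cum_share >= p)[1]),
                integer(1))
  data.frame(
    coverage.percent = targets,
    cutoff.index     = idx,
    uniqword.percent = 100 * idx / length(freq),
    cutoff.freq      = unname(freq[idx])
  )
}

# Toy counts, invented for illustration only
toy_freq <- c(the = 50, and = 30, cat = 10, sat = 5, mat = 3, abny = 2)
coverage_table(toy_freq, targets = c(0.6, 0.85))
```

Sorting once and taking a cumulative sum makes the cost independent of the number of coverage targets requested.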
Model: We plan to use an n-gram language model for prediction. An n-gram is a contiguous sequence of \(n\) items from a given sequence of text. We will build document term matrices of 2-grams and 3-grams and compute the n-gram probabilities. Given the words typed so far, the model will output the candidate word with the highest probability.
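To make the idea concrete, here is a minimal base R sketch of a bigram model using maximum-likelihood estimates, \(P(w_2 \mid w_1) = \mathrm{count}(w_1 w_2) / \sum_{w} \mathrm{count}(w_1 w)\). It is an illustration only: the planned model is built from document term matrices, and the names `make_ngrams`, `bigram_probs` and `predict_next` as well as the toy sentence are hypothetical. The sketch assumes at least two tokens and tokens that contain no spaces.

```r
# Hypothetical helper: all n-grams of a token vector, as space-separated strings
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
         character(1))
}

# Maximum-likelihood bigram probabilities, one row per distinct bigram
bigram_probs <- function(tokens) {
  bigrams <- table(make_ngrams(tokens, 2L))
  parts   <- do.call(rbind, strsplit(names(bigrams), " ", fixed = TRUE))
  counts  <- as.vector(bigrams)
  prefix_total <- tapply(counts, parts[, 1], sum)   # bigram count per first word
  data.frame(
    prefix = parts[, 1],
    word   = parts[, 2],
    prob   = counts / as.vector(prefix_total[parts[, 1]]),
    stringsAsFactors = FALSE
  )
}

# Return the most probable next word, or NA if the prefix was never seen
predict_next <- function(model, prefix) {
  cand <- model[model$prefix == prefix, ]
  if (nrow(cand) == 0L) return(NA_character_)
  cand$word[which.max(cand$prob)]
}

tokens <- strsplit("the cat sat on the mat and the cat slept", " ", fixed = TRUE)[[1]]
model  <- bigram_probs(tokens)
predict_next(model, "the")
```

A prefix that never occurs in the training tokens yields `NA` here, which is the gap that smoothing or backoff is meant to close.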
Predicting on unseen n-grams: To let the model handle n-grams that do not occur in the training data, we will use interpolation or stupid backoff. Both fall back on lower-order n-grams so that an unseen n-gram still receives a non-zero probability (or, for stupid backoff, a non-zero score).
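The stupid backoff idea can be sketched as follows: use the relative frequency of the full n-gram if it was seen, otherwise drop the oldest context word and multiply the lower-order score by a fixed factor. This is a minimal sketch under stated assumptions, not the final implementation: the helper names `count_ngrams`, `get_count` and `sb_score` and the toy sentence are hypothetical, and the backoff factor `alpha = 0.4` is the value commonly quoted for this method rather than one tuned on our data.

```r
# Count n-grams of order n; returns a named integer vector of counts
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(setNames(integer(0), character(0)))
  starts <- seq_len(length(tokens) - n + 1L)
  tab <- table(vapply(starts,
                      function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
                      character(1)))
  setNames(as.vector(tab), names(tab))
}

# Look up a count, returning 0 for n-grams never seen in training
get_count <- function(counts, key) {
  if (key %in% names(counts)) as.numeric(counts[[key]]) else 0
}

# Stupid backoff score of `word` given the preceding tokens in `context`.
# `counts` is a list in which counts[[k]] holds the k-gram counts.
sb_score <- function(word, context, counts, alpha = 0.4) {
  if (length(context) == 0L) {
    return(get_count(counts[[1]], word) / sum(counts[[1]]))  # unigram relative frequency
  }
  k <- length(context)
  full <- get_count(counts[[k + 1L]], paste(c(context, word), collapse = " "))
  if (full > 0) {
    return(full / get_count(counts[[k]], paste(context, collapse = " ")))
  }
  alpha * sb_score(word, context[-1], counts, alpha)  # drop the oldest word and back off
}

tokens <- strsplit("the cat sat on the mat and the cat slept", " ", fixed = TRUE)[[1]]
counts <- lapply(1:3, function(n) count_ngrams(tokens, n))
sb_score("sat", c("the", "cat"), counts)   # trigram seen in training
sb_score("mat", c("on", "cat"), counts)    # unseen trigram and bigram: backs off twice
```

The scores are not normalized probabilities, which is what keeps the method cheap. A word that never appears in the training data still scores zero in this sketch, so unknown words would need separate handling.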
Evaluating the model: To evaluate the performance of our model intrinsically we will use perplexity, which can be interpreted as the average branching factor. The lower the perplexity on held-out text, the better the model.
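For a test sequence of \(N\) words, perplexity is \(PP(W) = P(w_1 \dots w_N)^{-1/N}\), which is equivalent to the exponential of the negative mean log probability per word. The sketch below computes it from the probabilities a model assigned to each test word; the function name `perplexity` and the example values are hypothetical.

```r
# Perplexity from the probability the model assigned to each of the N test words:
# exp(-(1/N) * sum(log p_i)), computed in log space to avoid numeric underflow
perplexity <- function(word_probs) {
  stopifnot(length(word_probs) > 0, all(word_probs > 0))  # a zero probability gives infinite perplexity
  exp(-mean(log(word_probs)))
}

# A model that always spreads its probability uniformly over 8 candidate words
perplexity(rep(1 / 8, 20))
```

In this uniform case the perplexity equals the number of equally likely choices, which is why it is read as an average branching factor.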
Further considerations: An important part of this project is to make the model fast and efficient enough to run on small devices such as smartphones. Our next task is therefore to improve the speed and memory efficiency of the model.