Natural language processing (NLP) is a rapidly growing field of study with an expanding number of applications. One task in NLP is text prediction. In a text prediction task, a sentence, or series of words, is presented to a prediction model, and the model presents to the user the word or words with the highest probability of following that series. For this project, a corpus of three documents is evaluated and used to train a text prediction model. This report presents summary and exploratory corpus statistics along with a high-level road map for model development.
The training corpus provided for this project comprises three text documents. The documents come from three different sources in order to vary sentence structure and capture a wide range of uses of the English language: blogs (899,288 lines of text), news websites (1,010,242 lines of text), and Twitter (2,360,148 lines of text).
Document | Number of Lines | Number of Unique Words |
---|---|---|
blogs | 899,288 | 832,520 |
news | 1,010,242 | 661,750 |
twitter | 2,360,148 | 1,000,223 |
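As a quick sanity check, counts like those in the table can be recomputed in a few lines of R. The sketch below is illustrative only: the `data/` directory and file names are assumptions, not part of this report.

```r
# Sketch: recompute line and unique-word counts per document.
# The data/ directory and file names are assumptions for illustration.
files <- c(blogs   = "data/blogs.txt",
           news    = "data/news.txt",
           twitter = "data/twitter.txt")

stats <- sapply(files, function(f) {
  lines  <- readLines(f, skipNul = TRUE, warn = FALSE)
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  c(lines = length(lines), unique_words = length(unique(tokens[tokens != ""])))
})
t(stats)   # one row per document: line count and unique-word count
```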
The Twitter document has the most lines and also the most unique words. The high line count is likely due to the 140-character cap on tweets, and the high number of unique words could be due to the misspellings and abbreviations used to fit messages into that cap. These words will likely be uncommon across the entire document or the entire corpus and will need to be dealt with before model training begins.
The plot above shows that 95% of the word instances in these three documents come from fewer than 10% of the unique words observed in each document. This will have an important impact on the model development phase. Below, some of the most common words in the corpus are presented.
The plot above offers some key insights into the distribution of words in the corpus. Some words appear more frequently in some documents than in others. This information may be useful during model tuning to reweight an n-gram's probability.
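As an illustration of the coverage claim above, the minimal sketch below computes the fraction of the vocabulary needed to cover 95% of all word instances; it assumes `tokens` is a character vector holding every word from one document.

```r
# Sketch: what fraction of unique words accounts for 95% of word instances?
# Assumes `tokens` holds all words from a single document.
word_freq  <- sort(table(tokens), decreasing = TRUE)
cum_share  <- cumsum(word_freq) / sum(word_freq)
n_covering <- which(cum_share >= 0.95)[1]   # words needed for 95% coverage
n_covering / length(word_freq)              # share of the unique vocabulary
```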
A text prediction model will be built using the documents described above. The development process will consist of the following phases.
As the data above suggest, the entire dataset need not be used to capture most word uses in this corpus. Preprocessing of the text will include the following steps, sketched in code below:

* Stem all words to reduce the number of forms in which each word appears in the corpus
* Add part-of-speech tags to all words
* Replace infrequent words with an infrequent-word tag (e.g., `<UNK>`)
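A minimal sketch of these cleaning steps is shown below, assuming `lines` is a character vector of raw text. The frequency cutoff of 5 and the `<UNK>` token are illustrative choices, and part-of-speech tagging (e.g., with a package such as udpipe) is omitted here.

```r
library(SnowballC)   # wordStem() for Porter stemming

# Sketch of the planned cleaning, assuming `lines` holds raw text.
# The cutoff (5) and the <UNK> tag are illustrative; POS tagging is omitted.
tokens <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(lines)), "\\s+"))
tokens <- tokens[tokens != ""]
tokens <- wordStem(tokens, language = "english")     # collapse word forms

freq   <- table(tokens)
rare   <- names(freq)[freq < 5]
tokens <- ifelse(tokens %in% rare, "<UNK>", tokens)  # tag infrequent words
```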
Lists of n-grams will be built from the cleaned data. Unigrams up to four-grams will be created initially, and the necessity of four-grams versus trigrams will be evaluated during model training. Backoff logic will be evaluated as a means of handling previously unseen n-grams, and smoothing will be used to handle previously unseen words, with different smoothing methods evaluated during training. To train the model, Markov chains will be used to evaluate the probabilities of sentences. Probabilities will be stored as log-probability tables for each of the different n-grams to avoid underflow.
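The sketch below illustrates the general idea with bigrams only: maximum-likelihood log probabilities stored in a lookup table, with a crude back-off to unigram frequencies when a history is unseen. The function name `predict_next` and the back-off rule are placeholders, not the final design.

```r
# Sketch: bigram log-probability table with a crude unigram back-off.
# Assumes `tokens` is the cleaned token vector from the preprocessing step.
uni_count <- table(tokens)
bi_count  <- table(paste(head(tokens, -1), tail(tokens, -1)))

hist_word  <- sub(" .*", "", names(bi_count))   # history word of each bigram
bi_logprob <- setNames(
  log(as.numeric(bi_count) / as.numeric(uni_count[hist_word])),  # log P(w2 | w1)
  names(bi_count)
)

predict_next <- function(w1, k = 3) {
  cand <- bi_logprob[startsWith(names(bi_logprob), paste0(w1, " "))]
  if (length(cand) == 0)                        # back off to unigram frequencies
    return(names(sort(uni_count, decreasing = TRUE))[seq_len(k)])
  sub(".* ", "", names(sort(cand, decreasing = TRUE))[seq_len(min(k, length(cand)))])
}
```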
To allow users to interact with the model, a simple text prediction application will be built on top of the model using Shiny. Users will submit a word or series of words to the server, which will process the input and return the three most likely next words.
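A minimal sketch of such an interface is shown below; it assumes a `predict_next()` function like the one sketched above, and the layout and labels are placeholders for the eventual application.

```r
library(shiny)

# Minimal Shiny sketch, assuming predict_next() from the n-gram sketch above.
# Layout and labels are placeholders for the eventual application.
ui <- fluidPage(
  textInput("phrase", "Enter a word or phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    words <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0) return(NULL)
    data.frame(next_word = predict_next(tail(words, 1), k = 3))
  })
}

shinyApp(ui = ui, server = server)
```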