I was presented the task to create a text prediction algorithm similar to what is seen everywhere from smartphones to google search and beyond. The key to predicting the next word in a sentence is understanding the words preceding the one about to be predicted. This prediction method is known as a back off or Katz Back Off Method. The high level overview of the method is to find an instance where the previous words are observed somewhere in a text document, then returns the next word in the sentence as a recommendation. Here is an example:
I have the book Alice In Wonderland as a reference document and I want to find all the possible words that follow the phrase “Alice has never seen such a Large”. If we search for all the instances where that particular phrase appears, we will get a list of words that appear after the phrase. Such a list would look like this:
| Word | Frequency |
|---|---|
| Cat | 8 |
| Chess | 2 |
| Tree | 3 |
| Castle | 3 |
From the output list, the most likely recommended next word should be “Cat” because it is the most frequently seen.
Like I said this method is widly used for convenience in user interfaces. I will only be scratching the surface of the application, but please explore some of the links provided for a deeper understanding or more widespread use. This project is only a subset of a larger data analytics topic known as Natural Language Processing. The methods in the hyperlink below help explain how organizations can use text data as a quantifiable tool.
The reference text was scrubbed from news sources, twitter, and various blogs. These collectively were put together into one large text table.
The text files range from 167 to 210 Mb with well over 4 million cumulative lines of text. Twitter does have the fewest characters per document because of Twitter’s 140 character format, but it makes up for that with the largest number of documents in it’s dataset. Whereas the inverse is true for the Blogs, where the maximum document length is almost half a million words, but it has far fewer documents.
Because of their magnitude, a sample of only 500,000 documents will be taken. This will prove to be more than enough to satisfy an efficient and accurate predition algorithm.
| Source | Size..Mb. | Length | Max.Characters |
|---|---|---|---|
| 167.1053 | 2360148 | 26 | |
| US Blog | 210.1600 | 899288 | 483415 |
| News | 205.8119 | 1010242 | 123628 |
The three text files are combined then sampled to help with computation efficiency. A subset of 500 documents are taken out as a test set and set aside for later. This will make sure the prediction algorithm can be assessed for accuracy without influence from the training dataset.
Corpus: A collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Tokenization: The way structured text is broken into it’s most basic elements; punctuation, numbers, symbols, words, and spaces.
Ngram: a sequence of words only N words long
The Corpus is tokenized into 1 word, 2 word, 3 word, and 4 word Ngrams. This gives a basis for the back off method. We can add all the same Ngrams together for a frequency. The top 20 words are expressed in the bar graphs below The top unigrams, bigrams, trigrams, and quadgrams give a perspective of the common phrases or words used in these text documents. It should be noted that as the n-grams get larger, the frequency of repetition will go down.
From the test set the top 20 most frequent Ngrams are observed:
In the tokenized corpus it is aparent that some words will be very common, and others very rare. The method below explores the uniqueness of the corpus by finding how many words make up 50%, 75%, 90%, and 95% of the total number of words in the corpus.
| Words | Cumulative Percent |
|---|---|
| 142 | 50.04 |
| 1540 | 75.01 |
| 8187 | 90.01 |
| 20887 | 95.01 |
From the results above it shows how many words are needed to make up different quartiles of the unigram corpus. Only 142 different words are needed to make up 50% of the corpus dictionary. This makes sense because some words are much more common in the english language than others. Further exploration can be done to remove the stopwords such as “the”, “a”, or “for” to see how much more unique the corpus would be.
From the different Ngram dataframes, the frequency of each ngram was found so the most probable prediction could be selected later. The Ngram was broken into a base column and target column where the base is used as the prior words or phrase that will be searched to match an input string. Example:
After appropriate cleaning these are the backbone to quickly finding matching base Ngrams, then selecting the most probable target words.
A testing set of 500 randomly selected documents were selected from the original corpus. These were set aside to test the accuracy of the prediction algorithm. Each sentence has a randomly selected word the algorithm will try to predict. The input base works it’s way through the backoff part of the algorithm to output the top 5 most frequent matches. The accuracy of the algorithm is observed by how many times the first returned word matches the target word. The algorithm is also tested by finding if the target word is predicted in the top 5 most frequent predicted words.
The results from the test are as follows:
While this result is very good for a local machine, the web app it’s developed for lags slightly becuse of the size of the Ngram files. The average computational time was 0.934s which is a little slow for a typing app. Collectively the ngram files are 400mb so size reduction and usability was made at the cost of accuracy. All Ngrams of frequency less than 3 were removed from the data set. This reduced the results to the following:
The app is shown below where the user is able to input a string and click a input button to output the predicted word. A bar graph of predicted words is created with the table of the probability of predicting this word over other words.
Feel free to try out the app at the hyperlink provided: Link to app