Using a basic word counter that treated each space-separated token as a word, the Twitter dataset contained 28,343,285 words, the news dataset contained 2,643,187 words, and the blogs dataset contained 37,205,236 words.
In terms of lines, the Twitter dataset contained 2,302,307 lines, news contained 77,258, and blogs contained 898,384.
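As a concrete illustration, a minimal Python version of that naive counter might look like the sketch below; the file names are illustrative placeholders, and splitting on single spaces deliberately mirrors the basic counter described above rather than proper tokenization.

```python
# Naive counter: every space is treated as a word separator,
# mirroring the basic word counter described above.
def count_words_and_lines(path):
    words, lines = 0, 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            lines += 1
            words += len(line.split(" "))
    return words, lines

# File names are illustrative placeholders for the three datasets.
for path in ["en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"]:
    words, lines = count_words_and_lines(path)
    print(f"{path}: {words:,} words, {lines:,} lines")
```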
Once profanity was filtered out, the combined dataset was reduced to a smaller subsample via random sampling, leaving us with 50,000 lines.
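A sketch of the filtering and sampling step, assuming the three files are read into a single list and that a plain-text profanity word list is available; the file name `bad_words.txt` and the random seed are both illustrative assumptions:

```python
import random

# Illustrative profanity list: one lowercase word per line.
with open("bad_words.txt", encoding="utf-8") as f:
    profanity = {word.strip().lower() for word in f if word.strip()}

all_lines = []
for path in ["en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"]:
    with open(path, encoding="utf-8", errors="ignore") as f:
        all_lines.extend(line.rstrip("\n") for line in f)

# Drop any line containing a profane token, then draw the subsample.
clean_lines = [line for line in all_lines
               if not any(tok.lower() in profanity for tok in line.split())]

random.seed(42)                             # illustrative seed, for reproducibility
sample = random.sample(clean_lines, 50_000) # keep 50,000 lines
```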
Punctuation, numbers, and extra whitespace were removed from the dataset, and all words were converted to lowercase. Now let’s look at the top 10 most common terms, as shown in the table below.
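Continuing the sketch above, the cleaning and frequency count could be done as follows; regular expressions are just one of several reasonable ways to implement it:

```python
import re
from collections import Counter

def clean(line):
    line = line.lower()                        # lowercase everything
    line = re.sub(r"[^a-z\s]", " ", line)      # strip punctuation and numbers
    return re.sub(r"\s+", " ", line).strip()   # collapse extra whitespace

# Tally term frequencies across the 50,000-line sample.
tokens = [tok for line in sample for tok in clean(line).split()]
for term, freq in Counter(tokens).most_common(10):
    print(f"{term}\t{freq:,}")
```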
The same was done for bigrams (combinations of two words).
And for trigrams (combinations of three words).
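A small sketch of how the bigram and trigram counts could be produced from the cleaned sample; note that n-grams are built one line at a time, so combinations never span line breaks:

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigram_counts, trigram_counts = Counter(), Counter()
for line in sample:
    toks = clean(line).split()
    bigram_counts.update(ngrams(toks, 2))
    trigram_counts.update(ngrams(toks, 3))

print(bigram_counts.most_common(10))
print(trigram_counts.most_common(10))
```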
One approach could be to:
1. For each of the words, bigrams, and trigrams, create a table containing the top N most common words/combinations. The value of N will be determined by trial and error, using a function to time how long the algorithm takes to run and judging what seems acceptable.
2. For each row in the table, scan through the corpus and find the most common next word for that particular word/bigram/trigram, and add it as a new variable in the table.
3. When a word is typed into the interface, search the lookup table described above for the most likely next words and their associated probabilities, using a back-off model: fall back from trigrams to bigrams, or from bigrams to single words, where necessary. A sketch of these three steps follows this list.
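To make the plan concrete, here is a hedged sketch of steps 1–3, reusing `sample` and `clean` from the earlier sketches. The table size N, the three-candidate cutoff, and the use of raw counts as a stand-in for probabilities are all illustrative choices to be tuned:

```python
from collections import Counter, defaultdict

N = 10_000  # illustrative table size, to be tuned by timing the build

def build_table(lines, n):
    """Map each (n-1)-token prefix to its most common next words."""
    nexts = defaultdict(Counter)
    for line in lines:
        toks = clean(line).split()
        for i in range(len(toks) - n + 1):
            prefix = " ".join(toks[i:i + n - 1])
            nexts[prefix][toks[i + n - 1]] += 1
    # Keep only the N most frequent prefixes to bound lookup size.
    top = sorted(nexts, key=lambda p: sum(nexts[p].values()), reverse=True)[:N]
    return {p: nexts[p].most_common(3) for p in top}

# Prefixes of length 3, 2, and 1 (trigram, bigram, and single word).
tables = {n: build_table(sample, n) for n in (4, 3, 2)}
top_unigrams = Counter(
    tok for line in sample for tok in clean(line).split()).most_common(3)

def predict(text):
    toks = clean(text).split()
    for n in (4, 3, 2):  # back off: trigram -> bigram -> single-word prefix
        prefix = " ".join(toks[-(n - 1):])
        if prefix in tables[n]:
            return tables[n][prefix]  # (word, count) pairs; counts proxy probabilities
    return top_unigrams               # final fallback: most frequent words overall

print(predict("thanks for the"))
```

A usage note: because each table only keeps the N most frequent prefixes, the back-off loop is what keeps predictions available for rarer inputs; timing `build_table` for different values of N is the trial-and-error step described above.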