Daniel Meyer
2018-02-20
This slide deck accompanies a word prediction app (an interactive website) that was developed during the Capstone Project of the Coursera Data Science Specialization by Johns Hopkins University. The task was to build a simple word prediction app that takes user text input and presents the most probable next word following that input.
The project involved the following steps: cleaning and sampling the source texts, building pruned n-gram lookup tables from training samples, evaluating prediction accuracy on a held-out test set, and wrapping the chosen algorithm in the interactive app.
The algorithm to predict the next word works as follows (a sketch of this backoff logic appears after the list):
1. read in the user input, clean it, and determine the number of words in it (i)
2. check the respective n-gram lookup table (where n = i + 1) for an exact predictor match and return the outcome
3. if no exact match is found, remove the first word of the user input and match against the (n-1)-gram table (and so on)
4. if no exact match is found at all down to the 2-gram table, perform a fuzzy string match in the 2-gram table and return the outcome of the most similar predictor
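To make the backoff concrete, here is a minimal Python sketch of that logic. The app itself was not built this way; the function name, the `lookup_tables` structure (a dict mapping n to predictor-to-outcome maps), the `clean_text` helper, and the use of `difflib` for the fuzzy step are all illustrative assumptions.

```python
import difflib

def predict_next_word(user_input, lookup_tables, clean_text):
    """lookup_tables maps n (2, 3, 4) to a dict {predictor: outcome}."""
    words = clean_text(user_input).split()
    if not words:
        return None, "empty input"
    last_word = words[-1]
    # Only the last three words can matter, since the largest table is a 4-gram.
    words = words[-3:]

    # Steps 2-3: try exact matches, backing off from the highest n-gram table.
    while words:
        n = len(words) + 1
        predictor = " ".join(words)
        outcome = lookup_tables[n].get(predictor)
        if outcome is not None:
            return outcome, f"exact match in {n}-gram table"
        words = words[1:]  # drop the first word and fall back to the (n-1)-gram table

    # Step 4: no exact match anywhere -- fuzzy-match the single-word predictor
    # against the 2-gram table and return the outcome of the closest predictor.
    closest = difflib.get_close_matches(last_word, lookup_tables[2].keys(), n=1)
    if closest:
        return lookup_tables[2][closest[0]], "fuzzy match in 2-gram table"
    return None, "no match"
```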
The word predictor shows the cleaned input and the matching lookup table alongside the actual prediction (the next word). Extensive testing showed that exact matches with i = 3 are already very unlikely with the pruned lookup tables, so I decided to use n-gram lookup tables only up to n = 4. In most cases, the predictor match is found in the bigram (2-gram) lookup table.
Text cleaning is a crucial step in developing the text prediction model. First, I converted all characters to ASCII and lower case. Then, I replaced common internet slang, emojis, contractions, numbers, and symbols with their written-out text representations. Finally, I replaced all common names with <<NAME>>. The input observations were whole paragraphs or messages, and it turned out to be quite important to split them into single sentences before building the actual n-grams. This way, words that sit next to each other only across a sentence boundary (often common stop words) are not converted into n-grams.
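The sketch below illustrates the kind of cleaning pipeline described above. The actual replacement dictionaries (slang, emojis, contractions, numbers, names) are far larger than shown; the entries here are placeholders, and the helper names are my own.

```python
import re
import unicodedata

# Placeholder replacement maps -- the real lists were much more extensive.
REPLACEMENTS = {"u": "you", "lol": "laughing out loud", "&": "and", "2": "two"}
COMMON_NAMES = {"john", "mary"}

def clean_text(text):
    # Convert to ASCII and lower case.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace slang, symbols, and numbers with written-out forms,
    # and common names with the <<NAME>> placeholder.
    tokens = []
    for token in text.split():
        token = REPLACEMENTS.get(token, token)
        if token in COMMON_NAMES:
            token = "<<NAME>>"
        tokens.append(token)
    return " ".join(tokens)

def split_sentences(paragraph):
    # Split a cleaned paragraph into sentences so that n-grams
    # never span a sentence boundary.
    return [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]
```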
I sampled 75k lines from each source (news, blogs, twitter) into one dataset and set aside 100 lines as a test dataset. Then I cleaned the texts in the test and the remaining training dataset and split the lines into single sentences. The full training dataset consisted of roughly 370k sentences. I sampled it again to get smaller training datasets with 200k, 100k, and 50k sentences. From each training dataset I built n-gram lookup tables with n = 2, 3, and 4, keeping only the most common outcome (the last word) per predictor (the other words in the given order). From the 50k-sentence sample I built two versions of the lookup tables: one keeping all predictors, and one dropping predictors that occurred only once. From the other samples I built only the pruned versions.
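One way such a lookup table can be built is sketched below: count (predictor, outcome) pairs, keep only the most common outcome per predictor, and optionally drop predictors seen only once (the pruned "_min2" style). This is an assumption about the construction, not the original code.

```python
from collections import Counter, defaultdict

def build_lookup_table(sentences, n, min_count=2):
    """Build a predictor -> most common outcome map from n-grams.
    min_count=2 drops predictors observed only once; min_count=1 keeps all."""
    counts = defaultdict(Counter)  # predictor -> Counter of outcomes
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            predictor = " ".join(gram[:-1])
            counts[predictor][gram[-1]] += 1

    table = {}
    for predictor, outcomes in counts.items():
        if sum(outcomes.values()) >= min_count:
            table[predictor] = outcomes.most_common(1)[0][0]
    return table

# Example usage: one table per n, as used by the backoff sketch above.
# tables = {n: build_lookup_table(train_sentences, n) for n in (2, 3, 4)}
```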
From the test dataset I sampled 1k sentences and built a complete 4-gram table, assigning the last word to 'outcome' and the rest to 'predictor', as I did with the lookup tables. Then I fed each predictor into my word-prediction algorithm and ran it with every lookup variant (050k_min1, 050k_min2, 100k_min2, 200k_min2, and full_min2). The results are plotted to the right. The first graph shows that the algorithm could apply higher-order n-grams much more often with the _min1 variant, because rare higher n-grams were kept in that variant (no pruning). The share of higher n-grams also rises with larger sample sizes, as more higher n-grams are kept in them.
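A rough sketch of that evaluation loop, under the same assumptions as the earlier snippets (a `predict_fn` such as the backoff sketch above, and plain string sentences), might look like this:

```python
def evaluate(test_sentences, predict_fn):
    """For every 4-gram in the test sentences, predict the last word from the
    first three and record which lookup table answered and whether it was right."""
    results = []
    for sentence in test_sentences:
        words = sentence.split()
        for i in range(len(words) - 3):
            predictor = " ".join(words[i:i + 3])
            true_outcome = words[i + 3]
            prediction, matched_table = predict_fn(predictor)
            results.append((matched_table, prediction == true_outcome))
    return results
```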
The second plot shows that, overall, the algorithm has a success rate of about 10% (a rough average of the total success rates across the different samples). Higher n-grams yield more accurate results, and the total success rate (n-gram share * accuracy) tends to be higher when the lookup tables are built on a larger sample. I chose the full_min2 variant of my lookup tables, as it yielded the highest overall success rate of 12%. This success rate might seem disappointing, but a success rate of only about 10% is common for this kind of task.