15 November, 2020

About this application

This app takes the English dataset from SwiftKey, sourced from Twitter and from various news sites and blogs (each source containing roughly 70,000 to over 2 million lines), calculates a modified probability of words in the context of the words preceding them, and uses that probability to predict the next word following a user-input sentence.

The dataset is summarized in my Milestone Report; each source easily contains millions of words. Although more data is generally better, analyzing the entire dataset, while feasible, would be undesirable: it would be time- and memory-consuming and would make an online application unwieldy. Therefore, 1% of the total number of lines is randomly sampled, and this subset serves as the basis of the application.
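For concreteness, here is a minimal sketch of how that 1% sampling step could look in R; the file names, the seed, and the helper name sample_lines are illustrative assumptions rather than the app's exact code.

```r
# Minimal sketch of the 1% sampling step.
# Assumptions: SwiftKey file names as distributed, a fixed seed for
# reproducibility, and the 1% fraction described above.
set.seed(2020)

sample_lines <- function(path, fraction = 0.01) {
  all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(all_lines, size = floor(length(all_lines) * fraction))
}

corpus_sample <- c(
  sample_lines("en_US.twitter.txt"),
  sample_lines("en_US.news.txt"),
  sample_lines("en_US.blogs.txt")
)
```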

Text mining and preprocessing

The preprocessing is fairly standard, with a few caveats. The dataset is text-mined and cleaned of numbers, symbols, punctuation, URLs, hyphenations, and profanity (the profanity list is referenced from Luis von Ahn's group at Carnegie Mellon University).
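As a rough illustration, this cleaning could be done with a chain of regular expressions like the one below; the patterns and the clean_text helper are my own sketch, not necessarily the exact rules the app applies, and `profanity.txt` is a placeholder for the CMU word list.

```r
# Rough sketch of the cleaning steps; the regular expressions are
# illustrative guesses at the kinds of patterns removed, and `profanity`
# is assumed to be a character vector read from the CMU word list.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x)   # URLs
  x <- gsub("[0-9]+", " ", x)                # numbers
  x <- gsub("-", " ", x)                     # break up hyphenations
  x <- gsub("[^a-z' ]", " ", x)              # remaining symbols and punctuation
  if (length(profanity) > 0) {
    x <- gsub(paste0("\\b(", paste(profanity, collapse = "|"), ")\\b"), " ", x)
  }
  gsub("\\s+", " ", trimws(x))               # collapse extra whitespace
}

clean_corpus <- clean_text(corpus_sample, profanity = readLines("profanity.txt"))
```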

Stemming is a process that reduces inflected or derived words to their word stem, a common base form. For example, the words accounts, accountable, accountability, accounting, and accountants would all be reduced to account. The idea, of course, is to lower redundancy when calculating probabilities, but this leads to a problem when displaying prediction results: the app either shows the stemmed words, which may not be readable for humans, or, when I convert stemmed words back to their full forms, produces a much longer list of candidate words, many of which are grammatically incompatible with the sentence.

Therefore, I wrote stemming in as an option to my function, compared the results with and without it, and decided to go without. While this sacrifices a more compact dataset and somewhat more accurate probabilities, the prediction results are more user-friendly, at an acceptable cost in probability.
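A hypothetical sketch of that optional-stemming idea is below, using the Porter stemmer from the SnowballC package; the tokenize function and its stem argument are illustrative, not the app's actual interface.

```r
# Hypothetical sketch of stemming as an option; the real function in the app
# may look different.  SnowballC provides the Porter stemmer used here.
library(SnowballC)

tokenize <- function(x, stem = FALSE) {
  words <- unlist(strsplit(x, "\\s+"))
  if (stem) words <- wordStem(words, language = "english")
  words
}

tokenize("accounts accountable accounting", stem = TRUE)   # stemmed base forms
tokenize("accounts accountable accounting", stem = FALSE)  # words left as typed
```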

Modified Kneser-Ney Smoothing

In my research into ways to train a model, or simply to calculate an appropriate probability once and reference it, multiple sources suggested that modified Kneser-Ney smoothing (\(p_{KN}\)), introduced by Chen and Goodman (1999), is the optimal approach:

\[ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i}) - D,\ 0)}{c(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, p_{continuation}(w_i|w_{i-n+2}^{i-1}) \]

  • \(p_{KN}(w_i|w_{i-n+1}^{i-1})\): the modified Kneser-Ney smoothed probability of word \(w_i\) given the preceding word(s) \(w_{i-n+1}^{i-1}\) (e.g. \(p_{KN}(see|Today\ I\ would\ like\ to)\)).

  • \(c(w_{i-n+1}^{i})\): the \(count\) of \((see|Today\ I\ would\ like\ to)\) at this particular n-gram level.

  • \(D\): the discount factor; the essence of modified \(p_{KN}\) is that \(D\) takes one of three values depending on whether the above n-gram count is 1, 2, or 3+. This shifts some probability mass away from observed n-grams towards novel words not seen in the training set, in the form of the interpolation weight \({\lambda}\).

  • \(c(w_{i-n+1}^{i-1})\): the \(count\) of \((*|Today\ I\ would\ like\ to)\), where \(*\) is a wildcard meaning any word that follows.

Modified Kneser-Ney Smoothing, continued

  • \({\lambda}(w_{i-n+1}^{i-1})\): the interpolation weight, obtained by normalizing \(D\), i.e. dividing it by the \(c(w_{i-n+1}^{i-1})\) seen above, and then multiplying this normalized \(D\) by \(|\{w : c(w_{i-1}, w) > 0\}|\), the number of distinct word types \(w\) that can follow \(w_{i-1}\).
  • \(p_{continuation}(w_i|w_{i-n+2}^{i-1})\): the continuation probability of \(w_i\), which covers the case where \(w_i\) was not seen after this context in the training set (i.e. a novel word, which would have had a probability of 0 if probability mass had not been shifted).

Admittedly, this was rather difficult for me to comprehend at first. These resources helped me a lot in calculating \(p_{KN}\).
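To make the formula concrete, here is a minimal, self-contained bigram-level sketch of the calculation on a toy corpus. It uses a single discount \(D = 0.75\) for brevity (the modified version would pick \(D\) from three values based on the n-gram count, as described above), and the sentences, names, and function p_kn_bigram are illustrative only, not the app's implementation.

```r
# Bigram-level sketch of the Kneser-Ney calculation on a toy corpus.
# Assumptions: a single discount D = 0.75 instead of the three discounts of
# the modified version, and illustrative sentences and names throughout.
corpus <- c("today i would like to see",
            "i would like to eat",
            "today i would like to see you")

sentences <- strsplit(corpus, " ")
bigrams   <- unlist(lapply(sentences, function(w) paste(head(w, -1), tail(w, -1))))
bi_count  <- table(bigrams)

p_kn_bigram <- function(prev, word, D = 0.75) {
  bi       <- paste(prev, word)
  c_bigram <- if (bi %in% names(bi_count)) bi_count[[bi]] else 0
  # c(prev): sum of the counts of all bigrams starting with `prev`
  starts_prev <- startsWith(names(bi_count), paste0(prev, " "))
  c_prev      <- sum(bi_count[starts_prev])
  # |{w : c(prev, w) > 0}|: number of distinct word types following `prev`
  follows_prev <- sum(starts_prev)
  # continuation probability: share of distinct bigram types ending in `word`
  ends_in_word <- sum(endsWith(names(bi_count), paste0(" ", word)))
  p_cont       <- ends_in_word / length(bi_count)
  lambda       <- D / c_prev * follows_prev
  max(c_bigram - D, 0) / c_prev + lambda * p_cont
}

p_kn_bigram("to", "see")   # discounted count term plus lambda * continuation
```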

Here is a screenshot of the working app