MyWord

Noelene Noone
23 August 2019

When the application is opened for the first time, the welcome screen displays for 30 seconds while the n-grams load.

The app has a simple interface with a WordPredictor tab and an Instructions tab. On the WordPredictor tab, the user types an incomplete sentence in the text box and clicks PREDICT.

Within a second, the incomplete sentence appears on the left, followed by the most likely next word.

MyWord - the simple predictor

  • The prediction algorithm is based on the Katz Back-Off language model, using 1-, 2-, 3-, 4- and 5-grams

  • The bi-gram maximum likelihood estimates are calculated from the 1-gram and 2-gram counts, then discounted to reserve probability for unobserved bi-grams

  • Each 1-gram not observed as a bi-gram completion is allocated a portion of the discount in proportion to its 1-gram count

  • The same logic builds the observed and unobserved 3-gram probabilities, using the 2-grams as the source for the unobserved ones

  • Extending to 4-gram and 5-gram probabilities effectively carries the back-off logic through to the final predictions
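
The bi-gram step above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the app's R code: the function names, the toy sentence, and the fixed discount of 0.5 are all hypothetical, and a fixed absolute discount is swapped in where the classic Katz model uses Good-Turing discounting.

```python
from collections import Counter


def train_counts(tokens):
    """Count 1-grams and 2-grams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_bigram_probs(prev, unigrams, bigrams, discount=0.5):
    """Return {word: P(word | prev)} with a fixed discount (assumed value)."""
    prev_count = unigrams[prev]
    observed = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    if prev_count == 0 or not observed:
        # Nothing is known about prev: fall back to plain 1-gram estimates.
        total = sum(unigrams.values())
        return {w: c / total for w, c in unigrams.items()}

    # Observed bi-grams: discounted maximum likelihood estimate.
    probs = {w: (c - discount) / prev_count for w, c in observed.items()}

    # Probability mass freed by the discount.
    leftover = 1.0 - sum(probs.values())

    # Unobserved completions share the leftover mass by 1-gram count.
    unobserved = {w: c for w, c in unigrams.items() if w not in observed}
    unobserved_total = sum(unobserved.values())
    if unobserved_total == 0:
        # Every word was observed after prev: renormalise instead.
        scale = sum(probs.values())
        return {w: p / scale for w, p in probs.items()}
    for w, c in unobserved.items():
        probs[w] = leftover * c / unobserved_total
    return probs


def predict_next(prev, unigrams, bigrams, k=1):
    """Return the k most probable next words after prev."""
    probs = backoff_bigram_probs(prev, unigrams, bigrams)
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train_counts(tokens)
probs = backoff_bigram_probs("the", unigrams, bigrams)
best = predict_next("the", unigrams, bigrams)
```

In the toy corpus "the" is followed by "cat" twice and "mat" once, so those two completions keep their discounted estimates and the remaining words split what is left. The same pattern would repeat one level up for 3-grams, with the bi-gram probabilities standing in for the 1-gram counts.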

Keeping it real

  • The algorithm learns from source text, not pre-defined rules

  • Grammar rules are not included in the algorithm but are inferred from the text

  • All words are included: profane, English and non-English

Excluded to stay within the size limit and ensure speed

  • Snowball's English stop words and single occurrences

  • Non-alphabetical characters and upper case
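
The exclusions above could look roughly like this. It is a Python sketch rather than the app's R code: `STOP_WORDS` is a hypothetical, abbreviated stand-in for Snowball's English list, the function names are invented for illustration, and splitting on apostrophes is an assumption about tokenisation that the slides do not describe.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stand-in for Snowball's English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lower-case, replace non-letters with spaces, and drop stop words."""
    text = text.lower()
    # \W, digits and underscore are removed; Unicode letters are kept,
    # so non-English words survive.
    text = re.sub(r"[\W\d_]+", " ", text)
    return [t for t in text.split() if t not in stop_words]


def ngram_counts(tokens, n, min_count=2):
    """Count n-grams and discard those seen fewer than min_count times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {g: c for g, c in counts.items() if c >= min_count}


cleaned = clean_tokens("The Cat sat on the mat, 2 times!")
kept = ngram_counts(["a", "b", "a", "b", "c"], 2)
```

Here `cleaned` holds only the lower-cased content words, and `kept` retains the single bi-gram that occurs twice. Dropping n-grams seen once shrinks the tables considerably, at the accuracy cost noted on the next slide.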

Accuracy

Accuracy was negatively impacted by:

  • Not including the stop words

  • Removing single occurrences

  • The Markov assumption, which excludes long-distance dependencies
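
A small Python illustration of the Markov assumption, using a hypothetical helper name and example sentences: with 5-grams as the highest order, only the last four words are available as context, so anything earlier cannot influence the prediction.

```python
def markov_context(tokens, max_order=5):
    """Return the last max_order - 1 tokens, the only context an n-gram model sees."""
    n = max_order - 1
    return tuple(tokens[-n:]) if n > 0 else ()


# The subject ("keys" vs "key") decides whether "are" or "is" comes next,
# but it lies outside the four-word window.
plural = "the keys that i left on the kitchen table this morning".split()
singular = "the key that i left on the kitchen table this morning".split()

same_context = markov_context(plural) == markov_context(singular)
```

Both sentences reduce to the same four-word context, so the model has to give them the same prediction.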

Accuracy measured

  • Actual next word matches predicted next word = 9.4%

  • Actual next word is within the top three predictions = 17.3%
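
The two measures above are top-1 and top-3 accuracy. A minimal Python sketch of how such figures can be computed follows; the helper name, the toy predictor and the four test cases are hypothetical, and the slides do not say how the test sentences were chosen.

```python
def top_k_accuracy(test_cases, predict, k=1):
    """Share of cases whose actual next word is among the first k predictions."""
    hits = sum(
        1 for context, actual in test_cases if actual in predict(context)[:k]
    )
    return hits / len(test_cases)


# Toy predictor: ranked next words keyed on the last word of the context.
toy_model = {
    ("thank",): ["you", "god", "goodness"],
    ("happy",): ["birthday", "new", "day"],
    ("good",): ["morning", "luck", "night"],
}


def toy_predict(context):
    return toy_model.get(tuple(context[-1:]), [])


cases = [
    (["thank"], "you"),
    (["happy"], "new"),
    (["good"], "day"),
    (["very"], "much"),
]

top1 = top_k_accuracy(cases, toy_predict, k=1)
top3 = top_k_accuracy(cases, toy_predict, k=3)
```

On this toy data one case in four is an exact match and two in four fall within the top three. These toy values only show the calculation and are unrelated to the 9.4% and 17.3% reported above.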

Beyond the prototype

  • A prototype implementing the Katz Back-Off model in R

  • Not English-specific; the language depends on the source text

  • Adapt the model to learn from user-generated text

Using a web server, the interface can be expanded so that:

  • the user selects the number of predictions displayed

  • the user selects the language of the predictions

  • the user interactively picks from multiple predictions