Noelene Noone
23 August 2019
When the application is opened for the first time, the welcome screen displays for 30 seconds while the n-grams load.
The app has a simple interface with a WordPredictor tab and an Instructions tab. On the WordPredictor tab the user types the incomplete sentence in the open text box and clicks PREDICT.
Within a second the given incomplete sentence appears on the left, followed by the most likely next word.
The prediction algorithm is based on the Katz Back-Off language model, using 1-, 2-, 3-, 4- and 5-grams
The bi-gram Maximum Likelihood estimates are calculated from the 1-gram and 2-gram counts and discounted to leave probability for unobserved bi-grams
The 1-grams not observed as bi-gram completions are each allocated a portion of the discount according to their 1-gram count
The same logic is used to build the observed and unobserved 3-grams, using the 2-grams as the source for the unobserved
Extending to 4-gram and 5-gram probabilities effectively builds the back-off logic into the final observations
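The bi-gram step described above can be sketched as follows. This is a minimal illustration in Python, not the R code of the prototype. The function names, the absolute discount of 0.5, and the proportional split of the leftover probability are all assumptions made for the example; the source does not state the discount value used.

```python
from collections import Counter

def train(tokens):
    """Count 1-grams and 2-grams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def backoff_probs(prev, unigrams, bigrams, discount=0.5):
    """Return a dict of P(word | prev) with back-off to 1-grams.

    discount is a hypothetical absolute discount, chosen for illustration.
    """
    # Counts of every word observed directly after prev
    observed = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    total = sum(observed.values())
    if total == 0:
        # prev never starts a bi-gram: use plain 1-gram estimates
        n = sum(unigrams.values())
        return {w: c / n for w, c in unigrams.items()}
    # Discounted estimates for the observed bi-grams
    probs = {w: (c - discount) / total for w, c in observed.items()}
    # Probability mass freed by the discount
    leftover = 1.0 - sum(probs.values())
    # Share the leftover among unobserved words by 1-gram count
    unseen = {w: c for w, c in unigrams.items() if w not in observed}
    unseen_total = sum(unseen.values())
    for w, c in unseen.items():
        probs[w] = leftover * c / unseen_total
    return probs
```

For the toy text "the cat sat on the mat the cat ran", calling `backoff_probs("the", *train(text.split()))` ranks "cat" first, and the probabilities of observed and unobserved words together sum to 1. The 3-, 4- and 5-gram levels would repeat the same pattern, each backing off to the level below.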
The algorithm learns from source text, not from pre-defined rules
Grammar rules are not included in the algorithm but are inferred from the text
All words are included; profane, English and non-English.
Excluded to stay within the size limit and ensure speed:
Snowball's English stop words and single occurrences
Non-alphabetical characters and upper case
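The exclusions listed above can be sketched as a small cleaning step. This is an illustration in Python, not the prototype's R code. The names `clean_tokens` and `prune_singletons` are hypothetical, and `STOP_WORDS` is a short stand-in for Snowball's much longer English list.

```python
from collections import Counter

# Illustrative subset only; Snowball's English stop word list is longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def clean_tokens(text):
    """Lower-case, drop non-alphabetical characters, remove stop words."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    return [w for w in text.split() if w not in STOP_WORDS]

def prune_singletons(counts):
    """Keep only the n-grams that occur more than once."""
    return Counter({gram: c for gram, c in counts.items() if c > 1})
```

Using `str.isalpha` rather than an ASCII-only pattern keeps accented letters, which fits a corpus that retains non-English words. Whether the prototype treats such characters the same way is not stated in the source.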
Accuracy was negatively impacted by:
Not including the stop words
Removing single occurrences
The Markov assumption, which excludes long-distance dependencies
Accuracy measured
Actual next word matches predicted next word = 9.4%
Actual next word is within the top three = 17.3%
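The two measures above correspond to top-1 and top-3 accuracy. A minimal sketch of how such figures could be computed, in Python, is given below. The `predict` argument is a hypothetical function that returns candidate words ranked from most to least likely; the source does not describe the test set or the evaluation code actually used.

```python
def top_k_accuracy(cases, predict, k):
    """Share of cases whose actual next word is among the top k predictions.

    cases   -- list of (context, actual_next_word) pairs
    predict -- hypothetical function: context -> ranked list of words
    """
    hits = sum(1 for context, actual in cases if actual in predict(context)[:k])
    return hits / len(cases)
```

With k=1 this gives the share of exact matches, and with k=3 the share of cases where the actual word is within the top three.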
Prototype implementing the Katz Back-Off model in R
Not English-specific; the language depends on the source text
Adapt the model to learn from user-generated text
Using a web server, the interface can be expanded so that the:
user selects the number of predictions displayed
user selects the language of the predictions
user interactively picks from multiple predictions