A. Paul
June 2019
Many people all over the world use a smartphone to communicate with others via social media, SMS or e-mail. Typing can be tedious, so a suggestion list with matching words is helpful. The web app 'My_Next_Word' supports the user with a list of candidate next words and their probabilities.
The aim of the Coursera project was to build an application that predicts the next word using a clever algorithm and a corpus from the company SwiftKey. Since language is very diverse and only very few word phrases are unique, the app offers a word list with calculated probabilities of occurrence.
On the other hand, the message 'Your vocabulary is beyond our experience.' is displayed if no word can be predicted for the combination of words, e.g. because the corpus was built from only 5 percent of the SwiftKey dataset.
For example, after the words “many thanks for” a text could continue with “your help…”, “helping…” or “the help…”, so that even the word tag (“pronoun possessive”, “verb or gerund”, “noun”) of the next word cannot be predicted with certainty. Nevertheless, the app tries to do this using a hidden Markov model.
The web app often provides a suggestion, and the suggestion is useful in more than 20% of cases. A higher rate is hard to reach because language is so diverse. The average response time is less than one second, although individual predictions can take longer.
The web app 'My Next Word' can be accessed via the following URL: https://apwi.shinyapps.io/my_next_word/. The user enters a sentence in the input field 'Enter a sentence' and the app tries to automatically predict the next word together with its probability. The last three words of the sentence are used. If there are fewer than three words, the prediction is based on one or two words. As a rule, the next word cannot be predicted unambiguously because the corpus is too diverse. Therefore, the six words with the highest probabilities are shown in a plot, sorted by probability. The number of suggestions can be changed with the parameter 'max. suggestions'. If no suitable word is found in the corpus, the message 'Pardon. Your vocabulary is beyond our experience.' is displayed.
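The lookup described above can be illustrated with a minimal Python sketch. It is not the app's actual code. The function names, the toy corpus and the plain relative-frequency estimate are assumptions made for illustration. Falling back to a shorter context when the longer one was never seen is also an assumption; the text above only says that fewer words are used when the sentence is shorter than three words.

```python
from collections import Counter, defaultdict

# Message taken from the app description above.
FALLBACK_MESSAGE = "Pardon. Your vocabulary is beyond our experience."


def build_ngram_counts(sentences, max_context=3):
    """Count which word follows each context of 1..max_context words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            for n in range(1, max_context + 1):
                if i - n >= 0:
                    context = tuple(tokens[i - n:i])
                    counts[context][tokens[i]] += 1
    return counts


def predict_next(counts, sentence, max_suggestions=6, max_context=3):
    """Return up to max_suggestions (word, probability) pairs, best first.

    Uses the last three words if available, otherwise two or one.
    Assumption: if a context was never seen, a shorter one is tried.
    An empty list means that no suggestion could be made.
    """
    tokens = sentence.lower().split()
    for n in range(min(max_context, len(tokens)), 0, -1):
        followers = counts.get(tuple(tokens[-n:]))
        if followers:
            total = sum(followers.values())
            return [(word, count / total)
                    for word, count in followers.most_common(max_suggestions)]
    return []


# Hypothetical toy corpus; the real app uses a sample of the SwiftKey data.
toy_corpus = [
    "many thanks for your help",
    "many thanks for helping me",
    "many thanks for the help",
    "thanks for your time",
]
toy_counts = build_ngram_counts(toy_corpus)

suggestions = predict_next(toy_counts, "many thanks for")
print(suggestions if suggestions else FALLBACK_MESSAGE)
```

With this toy corpus, the three continuations of “many thanks for” are equally likely, which mirrors the ambiguity described above. A production model would typically also apply smoothing, which is omitted here.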
HMMs transition through a sequence of hidden tags, and each tag produces a token. One can view a tag of an HMM in the same way as one views a hidden latent component that generates a point from a given categorical distribution. The hidden tags are not independent of one another, and this must be taken into account in the estimation process. Each transition from one tag to another generates some type of data, which in its most basic form is a word.
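A single prediction step of such a model can be sketched in Python as follows. This is only an illustration under stated assumptions, not the app's implementation. The tag "preposition", the example words and all probability values are made up, and the function name is hypothetical. The sketch combines the probability of moving from the current tag to each possible next tag with the probability that the next tag emits a given word.

```python
# All numbers are made-up illustration values, not estimates
# from the SwiftKey corpus.

# P(next tag | current tag); the current tag is assumed to be known here.
TRANSITION = {
    "preposition": {
        "pronoun possessive": 0.5,
        "verb or gerund": 0.2,
        "noun": 0.3,
    },
}

# P(word | tag): each hidden tag emits words from a categorical distribution.
EMISSION = {
    "pronoun possessive": {"your": 0.7, "my": 0.3},
    "verb or gerund": {"helping": 0.6, "coming": 0.4},
    "noun": {"help": 0.5, "support": 0.5},
}


def next_word_distribution(current_tag, transition, emission):
    """P(word | current tag) = sum over tags of P(tag | current tag) * P(word | tag)."""
    dist = {}
    for tag, p_tag in transition[current_tag].items():
        for word, p_word in emission[tag].items():
            dist[word] = dist.get(word, 0.0) + p_tag * p_word
    # Sort by probability, highest first.
    return dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True))


for word, prob in next_word_distribution("preposition", TRANSITION, EMISSION).items():
    print(f"{word}: {prob:.2f}")
```

In a full HMM the current tag is itself hidden and has to be inferred from the preceding words, for example with the forward or Viterbi algorithm. The sketch skips that step to keep the dependence between neighbouring tags visible.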