The corpus was built by combining and sampling three English-language corpora (Twitter, news and blogs) provided by the capstone project.
N-grams (2-grams, 3-grams and 4-grams) were generated from the combined corpus, with a profanity check applied.
Features within each n-gram set were ranked by frequency. The top 100,000 ranked features were included in the final n-gram database.
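The counting and ranking steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the tokenizer, the placeholder profanity list, the function names, and the choice to apply the 100,000 cut-off separately to each n-gram order are all assumptions.

```python
from collections import Counter
import re

# Hypothetical placeholder; the profanity list actually used is not stated.
PROFANITY = {"badword"}

def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())

def count_ngrams(lines, n):
    """Count n-grams line by line, skipping any n-gram that contains a listed word."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if PROFANITY.isdisjoint(gram):
                counts[gram] += 1
    return counts

def build_database(lines, orders=(2, 3, 4), top_k=100_000):
    """For each order, keep the top_k n-grams ranked by frequency (assumed per order)."""
    return {n: count_ngrams(lines, n).most_common(top_k) for n in orders}
```

Calling build_database on a list of text lines returns, for each order, a list of (n-gram, count) pairs sorted from most to least frequent.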
User input is cleaned and, if necessary, trimmed down to a 3-gram before searching for a match.
The search proceeds from the highest-order n-grams to the lowest until a hit is found. If no hit is found, the app returns “Sorry, we need more data to process your request”.
The top predicted phrases are provided, and the predicted word is displayed.
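The cleaning and high-to-low search described above might look something like the sketch below. It is illustrative Python only: the lookup tables hold made-up data, the names are hypothetical, and keeping the last three words of the input is an assumption about how the trimming works.

```python
import re

# Toy lookup tables (made-up data): context words -> next words, most frequent first.
# An n-gram table is keyed by its first n-1 words.
TABLES = {
    4: {("thanks", "for", "the"): ["follow", "support"]},
    3: {("for", "the"): ["first", "follow"]},
    2: {("the",): ["same", "first"]},
}

NO_MATCH = "Sorry, we need more data to process your request"

def clean(text, max_words=3):
    """Lowercase, drop punctuation and digits, and keep the last max_words words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens[-max_words:]

def predict(text, tables=TABLES):
    """Search from the highest-order table down; return (best word, alternatives)."""
    context = clean(text)
    for n in sorted(tables, reverse=True):
        k = n - 1  # number of context words an n-gram table needs
        if len(context) < k:
            continue
        candidates = tables[n].get(tuple(context[-k:]))
        if candidates:
            return candidates[0], candidates
    return NO_MATCH, []
```

With these toy tables, predict("Thanks for the") is answered by the 4-gram table, while an input whose longer contexts are unseen falls through to the 3-gram and then the 2-gram table before the fallback message is returned.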
User interface
WordSage uses a slick design in which the input and output are clear and eye-catching.
It echoes the user's input, returns the word with the highest frequency rank and provides the top predicted features as alternatives.
It provides a progress indicator when loading the database.
Acknowledgement
I would like to thank all the instructors and the learning community for their kind guidance and support throughout the specialisation.