Chuk Yong
21 October 2018
NextWord is an app that predicts the word following the phrase or sentence you type. It has the following features:
The datasets are based on SwiftKey's collection of blogs, news and Twitter posts. They are large and cover a wide range of subjects. The amount of data generated can be very taxing on a normal laptop's processor and memory. One of the challenges is to break down the datasets into manageable chunks. Extra consideration is given to intelligently reducing the final search table size for deployment on shinyapps.io.
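As an illustration of the chunking idea, here is a minimal sketch in Python (not the R/quanteda code used for the app). It counts n-grams one chunk of lines at a time, so only one chunk's worth of tokens is held in memory at once. The function names, the whitespace tokenizer and the default chunk size are assumptions made for this example.

```python
from collections import Counter
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_ngrams(lines, n, chunk_size=10000):
    """Count n-grams chunk by chunk and merge the partial counts."""
    totals = Counter()
    for chunk in iter_chunks(lines, chunk_size):
        partial = Counter()
        for line in chunk:
            # Assumption: lowercase + whitespace split stands in for real cleaning.
            tokens = line.lower().split()
            for i in range(len(tokens) - n + 1):
                partial[tuple(tokens[i:i + n])] += 1
        totals.update(partial)  # Counter.update adds counts together
    return totals
```

Because `lines` can be any iterable, an open file handle can be passed directly, which avoids loading a whole corpus file at once.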
Quanteda, a package for managing and analyzing textual data developed by Kenneth Benoit and other contributors, was used exclusively for data exploration, cleaning, and creating tokens, n-grams and document-feature matrices (DFMs). Many thanks to the many contributors for this easy and convenient package for natural language processing.
The search database consists of unigram, bigram and trigram data. The full data set had some 70M rows; it was ranked and trimmed to 1M rows for final deployment.
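The rank-and-trim step might look like the following Python sketch. It assumes the ranking is by raw frequency, which the text does not state explicitly, and `trim_table` is a hypothetical helper rather than part of the app.

```python
def trim_table(counts, max_rows):
    """Keep the max_rows most frequent n-grams.

    counts maps an n-gram (tuple of words) to its frequency.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_rows])
```

For the app's scale, this would correspond to calling `trim_table(counts, 1_000_000)` on a table of roughly 70M rows.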
In choosing an algorithm for text prediction, one of the most common options is a probabilistic approach using a Markov chain with backoff. The other is a frequency approach. In our testing, we preferred the flow generated by the frequency approach. Neither is very “accurate”, because everyone has a different writing style and mood.
Lastly, we build a data frame consisting of bigrams and trigrams as the search database. In consideration of speed and memory usage, a dataset of 1 million elements was chosen out of 70 million.
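A frequency-based lookup over such a table can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's actual R code: it tries the last two words against the trigram data first, falls back to the last word against the bigram data if nothing matches, and returns the most frequent continuations. The names `build_table` and `predict_next`, the fallback order and the whitespace tokenizer are all assumptions.

```python
from collections import Counter, defaultdict


def build_table(ngram_counts):
    """Map each prefix (all words but the last) to a Counter of next words."""
    table = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        table[ngram[:-1]][ngram[-1]] += count
    return table


def predict_next(text, table, max_order=3, limit=20):
    """Return up to `limit` candidate next words, most frequent first."""
    tokens = text.lower().split()
    # Try the longest prefix first (2 words for trigrams), then shorter ones.
    for k in range(max_order - 1, 0, -1):
        if len(tokens) < k:
            continue
        candidates = table.get(tuple(tokens[-k:]))
        if candidates:
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:limit]]
    return []  # no matching prefix in the table
```

The default `limit=20` mirrors the number of choices the app offers. Ranking uses raw counts only, so no probabilities or discounting are computed.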
The Shiny app was created with a single input interface to make it simple and intuitive. It allows the user to enter part of a sentence or phrase and hit the 'Enter' key; the predicted word or words are then shown in the box below. Up to 20 choices are provided.
NextWord App can be found here: https://chukyong.shinyapps.io/SwiftkeyShinyApp/