NextWord App Presentation

Chuk Yong
21 October 2018

Data Science Capstone Word Prediction App Slide 1

NextWord is an app that predicts the word following the phrase or sentence you type. It has the following features:

  • Contains the one million most common phrases from modern blogs, news and Twitter
  • An intuitive and simple user interface
  • Quick search time

Data Science Capstone Word Prediction App Slide 2

The datasets are based on Swiftkey's collection of blogs, news and Twitter text. They are large and cover a wide range of subjects. The amount of data generated can be very taxing on a normal laptop's processing power and memory. One of the challenges is to break the datasets down into manageable chunks. Extra care was taken to reduce the final search table size for deployment on shinyapps.io.
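The chunking idea can be illustrated with a minimal sketch. It is written in Python purely for illustration; the app itself was built in R with Quanteda, and the function names, the whitespace tokenizer and the chunk size here are all assumptions, not the author's code. Only one chunk of lines is held in memory at a time, and the per-chunk counts are merged into a running total.

```python
from collections import Counter
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_ngrams(lines, n):
    """Count n-grams over lowercased, whitespace-split tokens (a simplification)."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def count_in_chunks(lines, n, chunk_size=10000):
    """Process the corpus chunk by chunk and merge the partial counts."""
    total = Counter()
    for chunk in iter_chunks(lines, chunk_size):
        total.update(count_ngrams(chunk, n))
    return total
```

Because n-grams never span two lines in this sketch, counting in chunks gives the same totals as counting the whole corpus at once.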

Data Science Capstone Word Prediction App Slide 3

Quanteda, a package for managing and analyzing textual data developed by Kenneth Benoit and other contributors, was used exclusively for data exploration, cleaning, and creating tokens, n-grams and DFMs (document-feature matrices). Many thanks to the contributors for this easy and convenient package for natural language processing.

The search database consists of unigram, bigram and trigram data. The full data set had some 70M rows, which were ranked and trimmed to 1M rows for final deployment.
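As a rough illustration of the rank-and-trim step, the Python sketch below sorts n-gram counts from most to least frequent and keeps only the top rows. Ranking by raw frequency, the alphabetical tie-break and the name `rank_and_trim` are assumptions made for this example; the slides do not state the exact ranking criterion.

```python
def rank_and_trim(counts, max_rows):
    """Return the max_rows most frequent n-grams as (ngram, count) pairs.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch).
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_rows]


# Toy counts standing in for the full 70M-row table.
counts = {
    ("of", "the"): 5,
    ("in", "the"): 3,
    ("on", "the"): 3,
    ("a", "b"): 1,
}
table = rank_and_trim(counts, max_rows=2)
```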

Data Science Capstone Word Prediction App Slide 4

In choosing the algorithm for text prediction, one of the most common options is a probabilistic approach using a Markov chain with backoff. The other is a frequency approach. In our testing, we preferred the flow generated by the frequency approach. Neither is very “accurate”, because everyone has a different writing style and mood.

Lastly, we built a data frame of bigrams and trigrams as the search database. For speed and memory usage, a dataset of 1 million elements was chosen out of 70 million.
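A frequency approach can be sketched as a simple lookup: take the last words typed, find the n-grams that start with them, and return the most frequent next words. The Python code below is an illustrative sketch only, not the app's R implementation. The names `build_index` and `predict`, and the fallback from a two-word context to a one-word context when nothing matches, are assumptions of this example.

```python
from collections import defaultdict


def build_index(ngram_counts):
    """Map each context (all words but the last) to [(next_word, count), ...],
    sorted from most to least frequent."""
    index = defaultdict(list)
    for ngram, count in ngram_counts.items():
        index[ngram[:-1]].append((ngram[-1], count))
    for candidates in index.values():
        candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return dict(index)


def predict(index, text, max_context=2, top_k=20):
    """Return up to top_k next-word candidates for the typed text.

    Tries the longest context first (two words, matching trigrams),
    then a shorter one (one word, matching bigrams).
    """
    tokens = text.lower().split()
    for size in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        if context in index:
            return [word for word, _ in index[context][:top_k]]
    return []


# Toy bigram and trigram counts (made up for illustration).
counts = {
    ("thanks", "for", "the"): 4,
    ("thanks", "for", "your"): 2,
    ("for", "the"): 9,
    ("for", "a"): 7,
}
index = build_index(counts)
suggestions = predict(index, "Thanks for")
```

No probabilities are estimated here: candidates are ordered by raw count alone, which is what distinguishes this sketch from a Markov chain model with backoff weights.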

Data Science Capstone Word Prediction App Slide 5

The Shiny app was created with a single input field to keep it simple and intuitive. The user enters part of a sentence or phrase and hits the 'Enter' key, and the predicted word or words are shown in the box below. Up to 20 choices are provided.

NextWord App can be found here: https://chukyong.shinyapps.io/SwiftkeyShinyApp/
