Next Word Generator

February 16, 2021

The Project

This project is part of the Coursera / Johns Hopkins University Data Science Specialization. The task was to develop a simple Shiny app that predicts the next word from text entered by the user.

Although the task seems simple, building the prediction algorithm involves reading and analyzing a large volume of text, cleaning it, applying Natural Language Processing techniques and training a model, drawing on everything learned throughout the courses.

See more about the specialization at https://www.coursera.org/specializations/jhu-data-science.

Training Data

The algorithm was developed using a large collection of Twitter, blog and news posts from various sources. A sample of 1M sentences was drawn, preprocessed and cleaned, with numbers, times, interjections and profanities tagged and removed. A few stopwords were also removed.
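
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the app (which is a Shiny app); the word lists, the `<time>` and `<num>` tag names, and the function name `clean_sentence` are all assumptions made for the example.

```python
import re

# Hypothetical, tiny word lists for illustration only; the lists used for
# the original training data are not given in this document.
STOPWORDS = {"the", "a", "an"}
PROFANITIES = {"darn"}
INTERJECTIONS = {"oh", "wow", "ugh"}

def clean_sentence(text):
    """Lowercase, tag times and numbers, then drop tags and unwanted words."""
    # Remove stray angle brackets so they cannot collide with the tags below.
    text = re.sub(r"[<>]", " ", text.lower())
    # Tag times first (e.g. 10:30 or 10:30pm) so they are not read as numbers.
    text = re.sub(r"\b\d{1,2}:\d{2}\s?(?:am|pm)?\b", " <time> ", text)
    # Tag remaining numbers, including decimals and thousands separators.
    text = re.sub(r"\b\d+(?:[.,]\d+)*\b", " <num> ", text)
    # Replace everything except letters, apostrophes and tag brackets.
    text = re.sub(r"[^a-z'<>\s]", " ", text)
    drop = STOPWORDS | PROFANITIES | INTERJECTIONS | {"<time>", "<num>"}
    return [tok for tok in text.split() if tok not in drop]
```

Tagging before removal keeps the two steps separate, so the same tags could instead be kept in the text if that turned out to help prediction.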

Here are some stats:

The Algorithm

After cleanup, the text data was tokenized and split into n-grams of 2, 3, 4, 5 and 6 words, and the frequency of each n-gram was calculated. N-grams are sequences of consecutive words. Knowing their frequencies, one can predict which word is most likely to follow a given sequence in that sample.
The final word of each n-gram was identified and separated. The remaining text was then stemmed. Stemming reduces each word to its root, which saves space and improves efficiency.
The input text is cleaned up the same way as the training data. It is then stemmed and compared to the n-gram frequency tables.
If a match is found in the 6-gram table, the final word with the highest frequency is returned. If there is no match, the stemmed input is compared successively to the 5-, 4-, 3- and 2-gram tables.
If there is no match even in the bigram table, a random word from the 500 most common words in the training sample is returned.
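
The table lookup with back-off described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's actual code: the function names `stem`, `build_tables` and `predict_next` are hypothetical, and `stem` is a crude suffix stripper standing in for whatever real stemmer was used.

```python
import random
from collections import Counter, defaultdict

def stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_tables(sentences, max_n=6):
    """Map n -> {stemmed prefix tuple -> Counter of unstemmed final words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # Separate the final word; only the prefix is stemmed.
                *prefix, last = tokens[i : i + n]
                key = tuple(stem(w) for w in prefix)
                tables[n][key][last] += 1
    return tables, unigrams

def predict_next(tokens, tables, unigrams, max_n=6, top_k=500, rng=random):
    """Try the longest table first, back off to bigrams, then pick at random."""
    stems = [stem(w) for w in tokens]
    for n in range(max_n, 1, -1):
        if len(stems) < n - 1:
            continue  # input too short for this table
        key = tuple(stems[-(n - 1):])
        candidates = tables[n].get(key)
        if candidates:
            return candidates.most_common(1)[0][0]
    # No match even in the bigram table: random word among the most common.
    # Assumes the training sample is not empty.
    common = [w for w, _ in unigrams.most_common(top_k)]
    return rng.choice(common)
```

Because only the prefix is stemmed, inputs such as "keep walked" and "keep walking" hit the same table entry, while the returned word keeps its original, unstemmed form.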

The Next Word Generator

How to Use

Enter some text in the text field and click Submit.
To clear everything, click Clear.
The next word will appear in the blue ValueBox below.
The main app function runs fast: a median of 6.9 ms for a 10-word input (measured with the microbenchmark package).