NextWord Prediction Tool

Andrés Durán C.
2021-09-19

Coursera Data Science Specialization Capstone Project, Johns Hopkins University

About this Project

This presentation introduces a data product that showcases the prediction algorithm I built and provides an interface that others can use.

A Shiny app, “NextWord Application”, can be accessed at:

The source code files can be found on GitHub:

Predictive Model Used

Several steps must be completed before the model can be built:

  • The raw data used to build the model is located at: Coursera-SwiftKey
  • To build the prediction algorithm, text from the blogs, Twitter and news files in the en_US folder was used.
  • A 1% sample of the datasets was taken and cleaned: numbers and punctuation were stripped out, all text was changed to lowercase and extra whitespace was removed. Offensive words were removed using the Profanities list.
  • N-grams were created. An N-gram is a sequence of items collected from a corpus; the “N” refers to the number of words in the sequence. For this project, bigrams, trigrams and quadgrams were used.
  • The N-grams were sorted and the metadata saved as an .RData file.
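
The cleaning and N-gram steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's R code: the function names, the two-word profanity set and the sample sentences are all made up for the example, the 1% sampling step is omitted, and sorting by frequency is an assumption about how the N-grams were sorted.

```python
import re
from collections import Counter

# Hypothetical stand-in for the profanity list used in the project.
PROFANITY = {"darn", "heck"}

def clean(line):
    """Lowercase, strip numbers and punctuation, drop profanity, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)  # replaces digits and punctuation with spaces
    return [w for w in line.split() if w not in PROFANITY]

def count_ngrams(lines, n):
    """Count every run of n consecutive words within each cleaned line."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up sample text, standing in for the 1% sample of the corpus.
sample = ["The cat sat on the mat in 2021!", "The cat sat down, darn it."]
bigrams = count_ngrams(sample, 2)

# Assumed ordering: most frequent N-grams first.
top_bigrams = bigrams.most_common(2)
```

The same `count_ngrams` call with `n = 3` or `n = 4` would give the trigram and quadgram counts. Note that this simplified cleaner also splits contractions such as "don't" into two tokens, which a real pipeline may want to handle differently.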

The information is stored in the GitHub repository.

How the Shiny Server Works

To predict the next word of the user's input sentence, the server performs the following steps:

  1. The Quadgram is used first (the first three words of the Quadgram are the last three words of the user-provided sentence).
  2. If no Quadgram is found, back off to the Trigram (the first two words of the Trigram are the last two words of the sentence).
  3. If no Trigram is found, back off to the Bigram (the first word of the Bigram is the last word of the sentence).
  4. If no Bigram is found, back off to the most frequent word: 'the' is returned.

All the n-gram data frame files (Quadgram, Trigram and Bigram) are loaded when the app starts.
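
The back-off steps above can be sketched as a short lookup loop. This is a minimal Python illustration under stated assumptions, not the app's R server code: `predict_next` is a hypothetical name, each table is assumed to map the leading words of an n-gram to its most frequent final word, and the toy entries are invented for the example.

```python
def predict_next(sentence, quadgrams, trigrams, bigrams, default="the"):
    """Return (predicted_word, source) using the longest matching n-gram."""
    words = sentence.lower().split()
    # Try the longest context first: 3 words (quadgram), then 2, then 1.
    for table, k, name in ((quadgrams, 3, "quadgram"),
                           (trigrams, 2, "trigram"),
                           (bigrams, 1, "bigram")):
        if len(words) >= k:
            context = tuple(words[-k:])
            if context in table:
                return table[context], name
    # Nothing matched: fall back to the most frequent word.
    return default, "default"

# Toy tables (made-up data): context words -> most frequent following word.
quadgrams = {("thanks", "for", "the"): "follow"}
trigrams = {("for", "the"): "first"}
bigrams = {("the",): "same"}

result = predict_next("Thanks for the", quadgrams, trigrams, bigrams)
```

With these toy tables, "Thanks for the" matches the quadgram table, "waiting for the" falls through to the trigram table, "see the" to the bigram table, and an unseen word such as "hello" returns the default 'the'.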

How Application User Interface Works

The NextWord app takes the words typed in the text field and, after a moment, sends them to the server for processing. Just type a few words in the field, and the app will show the predicted word and the n-gram used.

Shiny App

More information in the “About” tab panel.