Capstone Final

Caroline Ines Lisevski
October 9, 2020

Predicting Text using Ngrams

Project Goals

The goal of the project was to develop an application that predicts the next word of a phrase entered by the user. The application should be:

  • lightweight, to preserve device resources
  • based on principles of Natural Language Processing
  • intuitive to use, with a simple interface design

Data

Four data sets were available, but this application uses only the English one. Its text was sampled at random from news articles, blog posts, and Twitter feeds.

For use in this project, the data was cleaned by removing punctuation, excess whitespace, and other non-text elements. A portion of the cleaned data was then tokenized into ngram tables.
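The cleaning and tokenizing steps can be illustrated with a minimal Python sketch. It is not the project's code: the function names clean_text and ngram_table and the sample sentences are hypothetical, and the cleaning rule (keep only the letters a-z) is an assumed simplification that also splits contractions such as "don't".

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    # Assumed rule: anything that is not a-z or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_table(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample input, standing in for the news, blog, and Twitter text.
sample = ["The cat sat on the mat.", "The cat   sat down!"]
bigrams = ngram_table(sample, 2)
trigrams = ngram_table(sample, 3)
```

Each table maps a tuple of words to the number of times it occurs, so a table for any ngram size comes from the same function by changing n.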

Algorithm

Given an input text, the algorithm estimates how likely each candidate word is to come next by comparing the end of the input against sets of ngrams - quadgrams, trigrams, and bigrams.

The last three words of the input are first compared against the quadgrams. If no match is found, the last two words are compared against the trigrams, and then the last word against the bigrams. In essence, the algorithm keeps “backing off” to a shorter context until a suitable next-word prediction is found.
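The backing-off process can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: build_tables, predict_next, and the toy corpus are hypothetical, and candidates are ranked by raw ngram count as a simple stand-in for a probability estimate.

```python
from collections import Counter

def build_tables(words, max_n=4):
    """Return {n: Counter of n-word tuples} for n = 2..max_n."""
    tables = {}
    for n in range(2, max_n + 1):
        tables[n] = Counter(
            tuple(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return tables

def predict_next(phrase, tables, k=3):
    """Try the longest context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this ngram size
        matches = Counter()
        for gram, count in tables[n].items():
            if gram[:-1] == context:
                matches[gram[-1]] += count
        if matches:
            # Assumed ranking: most frequent continuation first.
            return [word for word, _ in matches.most_common(k)]
    return []  # no ngram of any size matched

# Hypothetical toy corpus standing in for the real ngram tables.
corpus = "i want to go home i want to go out i want to sleep".split()
tables = build_tables(corpus)

print(predict_next("I want to", tables))     # matched at the quadgram level
print(predict_next("they need to", tables))  # backs off to bigrams on "to"
```

The parameter k plays the role of the number of suggestions the user asks for. A linear scan over each table keeps the sketch short; a real application would more likely index the tables by context to keep lookups fast.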

The complete code for this project is available in my GitHub repository: https://github.com/caroline-lisevski/datasciencecoursera/tree/master

User Interface

The user enters a word or phrase in the text box and selects how many predicted words to display. The suggested next words then appear below the text box. Instructions are provided in the left sidebar to ensure a smooth user experience.

Please try it out at: https://caroline-lisevski.shinyapps.io/Predict_Words/