Capstone Project - NLP text prediction Shiny Application

Aled Evans
March 2017

Introduction

  • The Shiny application ('app') suggests the next word following text input from the user.
  • The app works across devices, but is optimised for use in a desktop web browser.
  • The app is hosted on shinyapps.io.
  • The code and documentation for the app, project and this R presentation can be found on GitHub.

Data Processing

  • The project data set is obtained from here.
  • A 1% sample of the source files is used to construct the 'corpus', so that it can be processed more swiftly.
  • The corpus also undergoes text processing: all non-English characters, numbers, punctuation and whitespace are removed, and all text is converted to lowercase.
  • Profane words are also removed. The project uses Carnegie Mellon University's resource: Offensive/Profane Word List.
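
The sampling and cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the app's own R code. The function names, the fixed random seed, and the choice to treat 'non-English characters' as non-ASCII characters are assumptions made for the example, and the profanity set is a placeholder for the word list named above.

```python
import random
import re

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random.
    The seed is an assumption, used only to make the sample repeatable."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_line(line, profanity=frozenset()):
    """Lowercase the text, drop non-ASCII characters, numbers and
    punctuation, collapse whitespace, and remove listed profane words."""
    line = line.lower()
    # Assumption: 'non-English characters' is approximated as non-ASCII.
    line = line.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so contractions stay as one token ("don't" -> "dont").
    line = line.replace("'", "")
    # Every remaining character that is not a letter or whitespace becomes a space.
    line = re.sub(r"[^a-z\s]", " ", line)
    # split() collapses runs of whitespace; filter out profane words.
    words = [w for w in line.split() if w not in profanity]
    return " ".join(words)
```

For example, `clean_line("Hello, World! 123")` returns `"hello world"`, and a line is dropped entirely from the sample about 99 times out of 100 when `fraction=0.01`.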

Prediction Algorithm

  • Tokenization is used to find the frequency of five types of n-gram: unigrams (single words), bigrams (two words), trigrams (three words), quadgrams (four words) and quintgrams (five words).
  • N-grams indicate which words appear together in the text. (The more often a given n-gram occurs in the corpus, the more likely its final word is to follow the words before it.)
  • The predictive algorithm uses the n-gram frequencies to predict the next word based on the user's input. The model checks the phrase length and starts with the quintgrams, then moves on to the quadgrams and so on. The model is a version of a 'back-off' model.
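
The counting and back-off steps above can be sketched as follows. This is a minimal Python illustration, not the app's own R code. It assumes the simplest form of back-off: pick the most frequent continuation of the longest matching context, with no discounting or scoring. The function names and the lookup-table layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=5):
    """Count n-grams for n = 1..max_n.
    Each context (a tuple of the first n-1 words) maps to a Counter
    of the words that follow it. The empty context holds unigram counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, next_word = words[i:i + n]
                tables[tuple(context)][next_word] += 1
    return tables

def predict_next(text, tables, max_n=5):
    """Try the longest context first (up to max_n - 1 words),
    then back off to shorter contexts until one has been seen."""
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        if context in tables:
            # Most frequent word observed after this context.
            return tables[context].most_common(1)[0][0]
    return None  # nothing to suggest (empty tables)
```

A short usage example on a toy corpus:

```python
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
]
tables = build_ngram_tables(corpus)
print(predict_next("the cat", tables))       # matches a trigram context
print(predict_next("a dog sat on", tables))  # backs off to the context "sat on"
```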

Weblinks & References