Final Project - Natural Language Processing

Bernard NK
August 20, 2015

Description of the prediction algorithm

This project was done in association with SwiftKey, a company that develops smart prediction technology for easier mobile typing. The next-word predictor was implemented in R using the following steps:

  • Get a corpus and identify appropriate tokens such as words, punctuation, and numbers.

  • Build a model from the corpus that captures the distribution of words, tokens, and phrases and the relationships between them.

  • Use n-gram frequency as the predictor variable to determine the word that a user is most likely to type next.

  • Match the n-gram character string typed so far to the (n+1)-gram entries in the n-gram frequency table that begin with it.
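
The steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (tokenize, ngram_counts, predict_next), the letters-and-apostrophes tokenizer, and the toy corpus are all hypothetical, and punctuation and numbers are simply dropped here for brevity.

```r
# Hypothetical helpers for illustration; base R only.

# Lowercase the text and split it into word tokens (letters and apostrophes).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count every run of n consecutive tokens in a token vector.
ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  table(grams)
}

# Take the last n - 1 words of the phrase as the context, find the n-grams
# that begin with that context, and return the final word of the most
# frequent one. Returns NA when the phrase is too short or nothing matches.
predict_next <- function(phrase, counts, n = 3) {
  tokens <- tokenize(phrase)
  if (length(counts) == 0 || length(tokens) < n - 1) return(NA_character_)
  prefix <- paste0(paste(tail(tokens, n - 1), collapse = " "), " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]  # ties go to the first entry
  substring(best, nchar(prefix) + 1)
}

# Toy corpus (an assumption, not the HC Corpora data).
corpus <- "The cat sat on the mat. The cat sat on the sofa. The cat ran away."
counts <- ngram_counts(tokenize(corpus), 3)
print(predict_next("I saw the cat", counts))
print(predict_next("purple elephant", counts))
```

In this sketch the trigram table is searched with the last two words of the input, and an unmatched context yields NA, which mirrors the "NA" output described for the application.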

Description of the Shiny application

How to use the predictive application:

  • Click on this link: https://bernardnk.shinyapps.io/FinalProject
  • Input on the left: Enter a phrase in the edit box and click “Predict!”
  • Output on the right: Observe the predicted next word, expected to follow the phrase you entered.
  • “NA”: If the next word cannot be predicted, then “NA” will be displayed in the output.

The dataset

The data is from a corpus called HC Corpora (www.corpora.heliohost.org), which is composed of a large number of tweets, blog posts, and news publications. We used this corpus to identify appropriate tokens such as words, punctuation, and numbers. The same dataset is used in the Shiny R application.

  • When comparing the highest-frequency results, we did not find that 4-grams helped in predicting the next word. Trigrams were therefore used in our model.
  • A major tradeoff is corpus size versus analysis time: the more data analyzed, the longer the analysis takes.
  • Adding more lines of text from the corpus did not always improve accuracy. The model was therefore based on qualitative rather than quantitative n-gram criteria.
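
One common way to manage the corpus size versus analysis time tradeoff is to build the n-gram table from a random sample of lines. The sketch below is an assumption about how that could be done, not a description of what the project did; sample_lines, fraction, and seed are hypothetical names.

```r
# Hypothetical helper: keep a random fraction of the corpus lines so that
# n-gram counting stays fast. The original line order is preserved.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  stopifnot(fraction > 0, fraction <= 1)
  if (length(lines) == 0) return(lines)
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  keep <- max(1, floor(length(lines) * fraction))
  lines[sort(sample.int(length(lines), keep))]
}

# Stand-in for lines read from a corpus file (an assumption).
all_lines <- paste("line", 1:200)
subset <- sample_lines(all_lines, fraction = 0.25)
print(length(subset))
```

Raising fraction step by step and rechecking the predictions is one way to see where extra lines stop improving accuracy.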

Applicability to other predictions

This application could be extended to other language processing predictions, including:

  • Determining a word in a speech-to-text application when a word or phrase was missed.
  • Determining whether a text is computer-generated by identifying the presence of high-probability next-word predictions.
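
The second idea can be illustrated with a simple score: the share of words in a text that equal the top prediction for their preceding context. This is a rough sketch under assumptions, not part of the application; predictability and top_next (a named character vector mapping a two-word context to its most likely next word) are hypothetical, and no threshold for "computer-generated" is implied.

```r
# Hypothetical score: fraction of words that match the top next-word
# prediction for the n - 1 words before them. Returns NA for texts with
# fewer than n words.
predictability <- function(text, top_next, n = 3) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(NA_real_)
  hits <- vapply(n:length(tokens), function(i) {
    context <- paste(tokens[(i - n + 1):(i - 1)], collapse = " ")
    predicted <- unname(top_next[context])  # NA when the context is unknown
    !is.na(predicted) && predicted == tokens[i]
  }, logical(1))
  mean(hits)
}

# Toy lookup table (an assumption, not derived from the real corpus).
top_next <- c("the cat" = "sat", "cat sat" = "on")
score <- predictability("The cat sat on the mat", top_next)
print(score)
```

A higher score means more of the text follows the model's most likely continuations; how high is suspicious would have to be calibrated on real data.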

References: