Final Project - Natural Language Processing

Bernard NK
August 20, 2015

Description of the prediction algorithm

This project was done in association with SwiftKey, a company that develops smart prediction technology for easier mobile typing. To predict the next word, the following algorithm was implemented in R:

Get a corpus and identify appropriate tokens such as words, punctuation, and numbers.
Build a model with the corpus to understand the distribution and relationship between the words, tokens, and phrases.
Use the n-gram frequency as the predictor variable to determine the next word that a user is most likely to type.
Match the n-gram character string typed by the user against the (n+1)-gram entries in the n-gram frequency table; the final word of the matching entry is the predicted next word.
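
The matching step can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code behind the application: the toy corpus, the simplified word splitter, and the helper names words_of, ngrams_of, and predict_next are all hypothetical. It takes the last two words of the phrase and looks up the most frequent trigram that starts with them.

```r
# Hypothetical toy corpus; the real model was trained on HC Corpora text.
corpus <- c("thank you for the follow",
            "thank you for the help",
            "thank you for your time",
            "see you for the game")

# Split one line into lowercase word tokens (simplified: letters and apostrophes only).
words_of <- function(line) {
  w <- strsplit(tolower(line), "[^a-z']+")[[1]]
  w[nzchar(w)]
}

# Collect every run of n consecutive words as a space-joined string.
ngrams_of <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Trigram frequency table, most frequent first.
trigrams <- unlist(lapply(corpus, function(l) ngrams_of(words_of(l), 3)))
freq <- sort(table(trigrams), decreasing = TRUE)

# Match the last two words of the phrase against the trigram table and
# return the final word of the most frequent match, or NA if there is none.
predict_next <- function(phrase, freq) {
  w <- words_of(phrase)
  if (length(w) < 2) return(NA_character_)
  prefix <- paste(tail(w, 2), collapse = " ")
  hits <- freq[startsWith(names(freq), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)
}

predict_next("thank you", freq)
predict_next("purple elephant", freq)
```

In this sketch a phrase with no matching trigram yields NA, which mirrors the “NA” output of the application.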

Description of the Shiny application

How to use the predictive application:

Click on this link: https://bernardnk.shinyapps.io/FinalProject
Input on the left: Enter a phrase in the edit box and click “Predict!”
Output on the right: Observe the predicted next word, expected to follow the phrase you entered.
“NA”: If the next word cannot be predicted, then “NA” will be displayed in the output.

The dataset

The data is from a corpus called HC Corpora (www.corpora.heliohost.org). It is composed of a large number of tweets, blogs and news publications. We used this corpus to identify appropriate tokens such as words, punctuation, and numbers. This dataset is used in the Shiny R application.
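
Token identification of this kind might look like the sketch below. The tokenize function and its regular expression are assumptions made for illustration, not the project's actual preprocessing. The sketch handles a single string, lowercases it, and keeps only ASCII letters, digits, and punctuation marks.

```r
# Hypothetical tokenizer: returns word, number, and punctuation tokens in order.
tokenize <- function(text) {
  text <- tolower(text)
  # A word (optionally with one internal apostrophe), a run of digits,
  # or a single punctuation character.
  pattern <- "[a-z]+(?:'[a-z]+)?|[0-9]+|[[:punct:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

tokens <- tokenize("I'm here, at 5!")

# Classify each token so that numbers and punctuation can be kept or dropped.
is_number <- grepl("^[0-9]+$", tokens)
is_punct  <- grepl("^[[:punct:]]$", tokens)
words     <- tokens[!is_number & !is_punct]
```

Whether punctuation and numbers are kept as tokens or discarded before counting n-grams is a modeling choice that the text above does not specify.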

When comparing the highest-frequency results, we did not find 4-grams helpful for predicting the next word, so trigrams were used in our model.
A major tradeoff is between corpus size (the amount of data analyzed) and analysis time.
Adding more lines of text from the corpus did not always improve accuracy. The model was therefore built on qualitative rather than quantitative n-gram criteria.
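
One common way to manage the size versus time tradeoff is to train on a reproducible random subset of lines. The sketch below assumes that approach; the source does not state how lines were selected, and the helper sample_lines, the fraction, and the seed are hypothetical.

```r
# Hypothetical helper: keep a random fraction of the lines of a character vector.
sample_lines <- function(lines, fraction, seed = 1234) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)  # make the subset reproducible
  keep <- sample(length(lines), size = ceiling(fraction * length(lines)))
  lines[sort(keep)]  # preserve the original line order
}

# Stand-in for lines read from a corpus file with readLines().
lines <- sprintf("line number %d", 1:1000)
small <- sample_lines(lines, 0.25)
length(small)
```

Rebuilding the frequency table at several fractions and comparing the predictions is one way to check whether more lines still improve the results.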

Applicability to other predictions

This application could be extended for other language processing predictions, including:

Determine a word in a speech-to-text application when a word or phrase was missed.
Determine whether a text is computer-generated by identifying the presence of high-probability next-word predictions.

References:

HC Corpora (www.corpora.heliohost.org)
Johns Hopkins Data Science Capstone, https://www.coursera.org/course/dsscapstone
Generalized Linear Models, http://www.statmethods.net/advstats/glm.html