Predictive Text Model - DS Capstone

Adrián Álvarez del Castillo
Jun 4, 2016

Background

The final product of the Data Science Capstone Project is an application that predicts the next word in a sentence based on a natural language model. The data provided by SwiftKey comes from three different sources:

  • Blogs
  • News
  • Twitter

n-gram generation

The following process was applied to the information in order to generate n-grams:

These steps can be considered a general methodology.
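
As an illustration of n-gram generation, the following is a minimal sketch in Python. It assumes a simple pipeline (lowercasing, keeping only letters and apostrophes, counting consecutive word tuples); the function names `tokenize` and `ngram_counts` and the sample sentences are hypothetical and are not taken from the project itself.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # Slide a window of width n across the tokens of this line only,
        # so n-grams never span two separate lines.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample data standing in for the blog, news and Twitter text.
sample = ["The cat sat on the mat.", "The cat ate."]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

Counting per line rather than over the whole corpus keeps unrelated sentences from contributing spurious n-grams; a real pipeline would likely also filter profanity, numbers and very rare terms.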

Predictive text model algorithm

Prediction Algorithm

Probability calculation

To take the diversity of histories (the word sequences that precede each candidate word) into account, modified Kneser-Ney smoothing was adopted for calculating the probabilities.
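
For reference, the standard interpolated form of modified Kneser-Ney smoothing, as given by Chen and Goodman, can be written as follows. The notation is an assumption introduced here, not taken from the project: $c(\cdot)$ is an n-gram count, $w_{i-n+1}^{i-1}$ is the history, $N_k(w_{i-n+1}^{i-1}\,\bullet)$ is the number of distinct words that follow the history exactly $k$ times, and $n_k$ is the total number of n-grams with count exactly $k$.

```latex
p_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{c(w_{i-n+1}^{i}) - D\big(c(w_{i-n+1}^{i})\big)}
       {\sum_{w_i} c(w_{i-n+1}^{i})}
  + \gamma(w_{i-n+1}^{i-1}) \; p_{KN}(w_i \mid w_{i-n+2}^{i-1})

D(c) =
  \begin{cases}
    0      & \text{if } c = 0 \\
    D_1    & \text{if } c = 1 \\
    D_2    & \text{if } c = 2 \\
    D_{3+} & \text{if } c \ge 3
  \end{cases}

\gamma(w_{i-n+1}^{i-1}) =
  \frac{D_1 N_1(w_{i-n+1}^{i-1}\,\bullet)
      + D_2 N_2(w_{i-n+1}^{i-1}\,\bullet)
      + D_{3+} N_{3+}(w_{i-n+1}^{i-1}\,\bullet)}
       {\sum_{w_i} c(w_{i-n+1}^{i})}

Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_1 = 1 - 2Y \frac{n_2}{n_1}, \qquad
D_2 = 2 - 3Y \frac{n_3}{n_2}, \qquad
D_{3+} = 3 - 4Y \frac{n_4}{n_3}
```

The weight $\gamma$ redistributes exactly the probability mass removed by the discounts, so a history followed by many different words passes more weight to the lower-order model. In the lower-order distributions, the raw counts are usually replaced by continuation counts (the number of distinct contexts in which a word appears).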

Shiny App

The application can be accessed from https://gaacs.shinyapps.io/Predictive_Text/

Basic usage of the application:

  1. Enter a phrase in the text field
  2. Press the Predict button
  3. The “Next word prediction” is displayed

Conclusion

This project is a first approach to natural language processing. There are many possibilities for improvement, such as:

  • Parts of speech tagging to better understand the context of the user's words
  • More data sources, such as books and the Reuters Corpus
  • Parallel processing of data to increase volume and speed (Hadoop implementation)
  • Feedback loop into the model