Predictive Text Model - DS Capstone

adrián álvarez del castillo
Jun 4, 2016

Background

The final product of the Data Science Capstone Project is an application that predicts the next word in a sentence based on a natural language model. The data provided by SwiftKey comes from three different sources:

Blogs
News
Twitter

n-gram generation

The following process was applied to the information in order to generate n-grams:

These steps can be considered a general methodology.
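The individual steps are not reproduced in this text, so the following is only a minimal sketch of what such a pipeline might look like. It assumes three steps (lowercasing, tokenizing on letters and apostrophes, and counting n-grams up to order 3); the function names `tokenize`, `ngrams`, and `count_ngrams` are hypothetical, and the project's actual cleaning rules may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, max_n=3):
    """Count n-grams of order 1..max_n, never crossing line boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


# Toy input standing in for lines from the blogs, news, and Twitter files.
sample = ["I love data science", "I love natural language models"]
counts = count_ngrams(sample)
print(counts[2].most_common(1))
```

Counting per line keeps n-grams from spanning unrelated sentences, which matters for short texts such as tweets.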

Predictive text model algorithm

[Figure: Prediction algorithm]

Probability calculation

To take the diversity of histories into account, modified Kneser-Ney smoothing was adopted for calculating the probabilities.
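To illustrate the idea, here is a minimal sketch of interpolated Kneser-Ney smoothing for bigrams. It is a simplification and not the project's implementation: it uses a single assumed discount of 0.75 and one n-gram order, whereas the modified variant uses separate discounts for counts of 1, 2, and 3 or more and is applied recursively across orders. The names `train_kn_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from tokenized sentences."""
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(zip(tokens, tokens[1:]))

    history_total = Counter()     # how often each word is followed by anything
    followers = defaultdict(set)  # distinct words seen after a history
    preceders = defaultdict(set)  # distinct histories seen before a word
    for (history, word), count in bigrams.items():
        history_total[history] += count
        followers[history].add(word)
        preceders[word].add(history)
    n_types = len(bigrams)        # number of distinct bigram types

    def prob(word, history):
        # Continuation probability: in how many distinct contexts the word appears.
        p_cont = len(preceders.get(word, ())) / n_types
        total = history_total.get(history, 0)
        if total == 0:
            return p_cont  # unseen history: rely on the continuation probability
        discounted = max(bigrams.get((history, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[history]) / total
        return discounted + backoff_weight * p_cont

    return prob, set(preceders)


def predict_next(prob, vocabulary, history):
    """Return the most probable next word; ties are broken alphabetically."""
    return max(sorted(vocabulary), key=lambda word: prob(word, history))


corpus = [["i", "love", "data"], ["i", "love", "language"], ["we", "love", "data"]]
prob, vocabulary = train_kn_bigram(corpus)
print(predict_next(prob, vocabulary, "love"))
```

The continuation probability is what reflects the diversity of histories: a word that follows many different words receives more of the redistributed mass than a word that is frequent in only one context.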

Shiny App

The application can be accessed from https://gaacs.shinyapps.io/Predictive_Text/

Basic usage of the application:

Enter a phrase in the text field
Press the Predict button
The “Next word prediction” is displayed

Conclusion

This project is a first approach to natural language processing. There are many possibilities for improvement, such as:

Parts of speech tagging to better understand the context of the user's words
More data sources, such as books and the Reuters Corpus
Parallel processing of data to increase volume and speed (e.g., a Hadoop implementation)
Feedback loop into the model