wordPrediction

RDSN
January 24th, 2016

Johns Hopkins University
Coursera Data Science Specialization - Capstone Project

Context

This project builds a Shiny application that predicts the next word as the user types. Smart keyboards make it easier for people to type on their devices, and one cornerstone of a smart keyboard is a predictive text model.

Application

Data: we use the HC Corpora dataset.

Preprocessing Data: data acquisition, sampling, corpus creation, corpus transformations (removing punctuation, removing numbers, converting to lower case, stripping extra whitespace), tokenization and N-gram creation.
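
The transformation, tokenization and N-gram steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the application's own code (the application is a Shiny app). The function names `clean`, `ngrams` and `build_counts` are hypothetical, and splitting on whitespace is an assumed tokenization rule.

```python
import re
from collections import Counter

def clean(text):
    """Apply the corpus transformations: lower case, drop numbers and punctuation."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(lines, max_n=3):
    """Count 1-grams up to max_n-grams over an iterable of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line).split()  # assumed tokenization: split on whitespace
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

Counting per line keeps N-grams from spanning two unrelated lines of the corpus, which is a design choice of this sketch rather than something the project states.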

Loading Data: the N-gram dictionaries (1-grams, 2-grams, 3-grams) are loaded.

Prediction of the Next Word: Katz's back-off algorithm is applied to the input text.

Algorithm

Katz's back-off model is used here to predict the Next Word.

This model estimates the probability of a word given its history, that is, the words that precede it in an n-gram. When no match is found for a given history, the model backs off to a model with a shorter history.
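
For reference, the usual textbook form of Katz's back-off estimate is given below. The symbols are not taken from this project: C(.) is an n-gram count, k is a count threshold (often 0), d is a discount factor (typically from Good-Turing estimation) and alpha is the back-off weight that keeps the probabilities summing to one. Which values the application uses is not stated here.

```latex
P_{bo}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
  d_{w_{i-n+1}^{i}} \, \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}
    & \text{if } C(w_{i-n+1}^{i}) > k \\[2ex]
  \alpha_{w_{i-n+1}^{i-1}} \, P_{bo}(w_i \mid w_{i-n+2}^{i-1})
    & \text{otherwise}
\end{cases}
```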

The application considers 3-grams, 2-grams, and 1-grams.

How the application works

The user enters text in the input text box.

The application first looks for a match among the 3-grams. If none is found, it backs off to the 2-grams, and then to the 1-grams.

The application suggests 5 possible Next Words, displayed as buttons below the input text box.

The most likely Next Word is colored in light blue, the others in dark blue.

The user can then either type the Next Word by hand or click one of the buttons to append that word to the sentence typed so far.
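
The lookup order described above can be sketched roughly as follows, in Python rather than the application's own code. This is a simplified illustration: it ranks candidates by raw counts and leaves out the discounting and back-off weights of the full Katz model. The name `predict_next`, the layout of `counts` (a dict mapping n to a Counter of n-gram tuples) and the toy data are all assumptions made for the example.

```python
from collections import Counter

def predict_next(text, counts, k=5):
    """Return up to k candidate next words, trying the longest history first."""
    tokens = text.lower().split()
    for n in (3, 2, 1):
        # An n-gram match needs the last n-1 typed words as history.
        history = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(history) < n - 1:
            continue  # not enough words typed for this order
        matches = Counter({gram[-1]: c for gram, c in counts[n].items()
                           if gram[:-1] == history})
        if matches:
            # Most frequent continuation first; stop at the first order that matches.
            return [word for word, _ in matches.most_common(k)]
    return []

# Toy counts, invented for illustration only.
toy_counts = {
    1: Counter({("the",): 5, ("cat",): 3, ("sat",): 1}),
    2: Counter({("the", "cat"): 3, ("the", "dog"): 1, ("cat", "sat"): 2}),
    3: Counter({("the", "cat", "sat"): 2, ("the", "cat", "ran"): 1}),
}

predict_next("the cat", toy_counts)  # -> ['sat', 'ran'] (3-gram match)
predict_next("a cat", toy_counts)    # -> ['sat'] (backs off to 2-grams)
```

The first word in the returned list would correspond to the light-blue button. Because this sketch stops at the first order that yields a match, it can return fewer than 5 words; filling the remaining slots from lower orders is one possible variation.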

Links

ShinyApp
https://rdsn.shinyapps.io/wordPrediction/

Coursera Data Science Capstone
https://www.coursera.org/learn/data-science-project