wordPrediction

RDSN
January 24th, 2016

Johns Hopkins University
Coursera Data Science Specialization - Capstone Project

Context

This project builds a Shiny application that predicts the next word as the user types. A smart keyboard makes it easier for people to type on their devices, and a predictive text model is one of its cornerstones.

Application

  • Data: we use the HC Corpora dataset.
  • Preprocessing Data: data acquisition, sampling, corpus creation, corpus transformations (removing punctuation, removing numbers, converting to lower case, stripping extra white space), tokenization and N-gram creation.
  • Loading Data: the N-gram dictionaries (1-grams, 2-grams, 3-grams) are loaded.
  • Prediction of the Next Word: Katz's back-off algorithm is applied to the input text.
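
The preprocessing steps listed above can be sketched as follows. This is an illustrative Python sketch, not the application's own code (which is not shown in this document); the function names `clean`, `ngrams` and `build_ngram_counts` are hypothetical, and punctuation is assumed to mean the ASCII punctuation characters.

```python
import string
from collections import Counter

# Assumption: "punctuation" and "numbers" are the ASCII sets below.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean(text):
    """Lower-case, delete punctuation and digits, collapse white space."""
    text = text.lower().translate(_DROP)
    return " ".join(text.split())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_counts(lines, max_n=3):
    """Count 1-grams up to max_n-grams over an iterable of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line).split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

if __name__ == "__main__":
    demo = build_ngram_counts(["I like tea.", "I like coffee!"])
    print(demo[2].most_common(1))
```

The resulting dictionary maps each N-gram order to a frequency table, which plays the role of the N-gram dictionaries loaded by the application.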

Algorithm

Katz's back-off model is used here to predict the Next Word.

  • This model estimates the probability of a word given its history, that is, the words that precede it in an n-gram. If the full history has no match, the model backs off to models with shorter histories.
  • The application considers 3-grams, 2-grams, and 1-grams.
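
For reference, the textbook form of Katz's back-off estimate for the 3-gram case is given below. The symbols are not from this document: C(.) is an N-gram count, d is a discount factor, k is a count threshold (often 0), and α is the back-off weight that redistributes the discounted probability mass. Whether the application applies discounting exactly in this form is not stated here.

```latex
P_{bo}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
  d \, \dfrac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
    & \text{if } C(w_{i-2}\, w_{i-1}\, w_i) > k \\[2ex]
  \alpha(w_{i-2}, w_{i-1}) \, P_{bo}(w_i \mid w_{i-1})
    & \text{otherwise}
\end{cases}
```

The same rule is applied again to P_bo(w_i | w_{i-1}), which backs off from 2-grams to 1-grams.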

How the application works

  • The user enters text in the input text box.
  • The application first looks for a match among the 3-grams. If none is found, it falls back to the 2-grams, and then to the 1-grams.
  • The application gives you 5 possible Next Words, as buttons displayed below the input text box.
  • The most likely Next Word is colored in light blue, the others in dark blue.
  • You can then either type the Next Word yourself or click one of the buttons to add that word to the sentence typed so far.
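
The lookup described in these steps can be sketched as below. This is a simplified illustration in Python, not the application's code: it ranks candidates by raw count at the longest matching history and fills any remaining slots from shorter histories, without the discounting of a full Katz model. The function `suggest` and the toy `counts` table are hypothetical.

```python
from collections import Counter

def suggest(text, counts, k=5):
    """Return up to k next-word candidates, longest history first."""
    tokens = text.lower().split()
    suggestions = []
    for n in range(max(counts), 0, -1):
        # An n-gram lookup needs the last n-1 words as history.
        history = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(history) < n - 1:
            continue  # not enough words typed for this order
        candidates = Counter()
        for gram, c in counts[n].items():
            if gram[:-1] == history:
                candidates[gram[-1]] += c
        for word, _ in candidates.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

# Toy N-gram tables, invented for this example only.
counts = {
    1: Counter({("the",): 5, ("of",): 3, ("tea",): 2}),
    2: Counter({("like", "tea"): 3, ("like", "coffee"): 1}),
    3: Counter({("i", "like", "cake"): 2}),
}

if __name__ == "__main__":
    print(suggest("I like", counts))
```

In this sketch the first element of the returned list corresponds to the light-blue button, and the remaining elements to the dark-blue ones.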

Links