Word Prediction

Andrea Alberto
05.08.2018

Coursera final Data Science project (SwiftKey)

Main Objective

The main goal of this work is to study an algorithm that predicts the next word a user will type, based on the previously typed words. The study of the algorithm is followed by its implementation in a Shiny application for demonstration purposes.

The training of the algorithm is based on a Corpus of text provided with the assignment, i.e. a repository of sentences. Based on the statistical association of words in the Corpus, the algorithm is trained to estimate the most probable next word.

Process

The steps performed to produce the algorithm are the following:

cleaning and normalization of the Corpus: removing punctuation and special characters
sentence boundary definition: the Corpus is divided into sentences, so that a word is only correlated with the following one when both belong to the same sentence
stemming: extracting the stem of every word
ngram table construction: a table of the Corpus ngrams has been built (ngrams up to the fourth order) and a frequency of appearance is associated with each ngram
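
The steps above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself is a Shiny application in R), and the function names, the sentence-splitting rule, and the toy text are all assumptions rather than the project's actual code. Stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter

def split_sentences(text):
    """Split raw text on '.', '!' or '?' (a simplifying assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def normalize(sentence):
    """Lowercase, replace anything but letters, apostrophes and spaces, then tokenize."""
    cleaned = re.sub(r"[^a-z' ]+", " ", sentence.lower())
    return cleaned.split()

def build_ngram_table(text, max_order=4):
    """Count ngrams of order 1..max_order without crossing sentence boundaries."""
    table = Counter()
    for sentence in split_sentences(text):
        tokens = normalize(sentence)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                table[tuple(tokens[i:i + n])] += 1
    return table

# Toy corpus (hypothetical), three short sentences.
table = build_ngram_table("I like tea. I like coffee! Do you like tea?")
print(table[("i", "like")])   # bigram seen in two sentences
print(table[("tea", "i")])    # spans a sentence boundary, so never counted
```

Sentences are split before punctuation is removed, since the boundary markers are needed to decide which words may be paired.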

Algorithm

The algorithm studied and implemented is based on the Katz Back-Off Model, which estimates the probability of the next word by backing off to progressively shorter histories whenever the longer ngram has not been observed often enough in the Corpus.
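
In its standard textbook form, the Katz back-off estimate can be written as below. The symbols d (discount factor), α (back-off weight) and k (count threshold) belong to the general model and are assumptions here; the values used in this project are not stated.

```latex
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})}
  & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[1ex]
\alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1})
  & \text{otherwise}
\end{cases}
```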

where
C(x) = number of times x appears in training
wi = ith word in the given context
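
The back-off idea can be sketched in a few lines of Python. This is a deliberately simplified back-off lookup, not the full Katz model: it applies no discounting and no back-off weights, and simply falls back to a shorter history when the longer one has no observed continuation. The function name, the table layout (word tuples mapped to counts) and the toy counts are assumptions made for illustration.

```python
from collections import Counter

def predict_next(table, history, max_order=4, top=3):
    """Return up to `top` (word, relative frequency) pairs for the next word.

    Tries the longest usable history first and drops its oldest word
    each time no ngram in `table` continues it.
    """
    history = tuple(history)[-(max_order - 1):] if max_order > 1 else ()
    while True:
        candidates = Counter()
        for ngram, count in table.items():
            if len(ngram) == len(history) + 1 and ngram[:-1] == history:
                candidates[ngram[-1]] = count
        if candidates:
            total = sum(candidates.values())
            return [(w, c / total) for w, c in candidates.most_common(top)]
        if not history:
            return []          # empty table: nothing to predict
        history = history[1:]  # back off to a shorter history

# Toy ngram counts (hypothetical).
table = Counter({
    ("i",): 2, ("like",): 3, ("tea",): 2, ("coffee",): 1,
    ("i", "like"): 2, ("like", "tea"): 2, ("like", "coffee"): 1,
})

# ("you", "like") was never seen, so the lookup backs off to ("like",).
result = predict_next(table, ["you", "like"])
print(result)
```

The returned frequencies are relative to the history level at which a match was found, so they are useful for ranking candidates but are not properly normalized Katz probabilities.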

How to use the application

In the app the user can type a sentence in the “Input Sentence” area. When a space follows the last word typed, the app computes a prediction for the next word and displays it in green after the original sentence.

Alternative predictions are displayed in the “Other Possible Predictions” area, sorted by decreasing probability.

Follow this link to access the application.

App Code and Accuracy

The Shiny application code can be reviewed on GitHub at this link, and consists of just two files: ui.R (client) and server.R.

Accuracy measured on an out-of-sample test set is about 8%.
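
How the figure above was computed is not described here. One common definition is top-1 next-word accuracy: the fraction of positions in held-out sentences where the single best prediction equals the word that actually follows. A minimal Python sketch under that assumption, with a hypothetical toy predictor and toy test sentences:

```python
def top1_accuracy(predict, sentences):
    """Fraction of positions where predict(previous_words) equals the actual next word."""
    hits = 0
    total = 0
    for tokens in sentences:
        for i in range(1, len(tokens)):
            total += 1
            if predict(tokens[:i]) == tokens[i]:
                hits += 1
    return hits / total if total else 0.0

def always_the(history):
    """Toy predictor (hypothetical): ignores the history and always guesses 'the'."""
    return "the"

# Held-out sentences (hypothetical), already tokenized.
held_out = [["see", "the", "cat"], ["on", "the", "mat"]]
score = top1_accuracy(always_the, held_out)
print(score)
```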