Capstone Presentation

Word Prediction

Veronika Rauch

Coursera Data Science Capstone Project

The following presentation gives an insight into the creation of a word prediction tool built as part of the Coursera Data Science Capstone, with inspiration and support from SwiftKey. The objective of the tool is to correctly predict the next word, given an input phrase.

Objective and Background

Natural Language Processing

Natural language processing is concerned with the interaction between humans and computers; in other words, the understanding of human (natural) language by computers and the ability to process this information.

SwiftKey

Many of you might know SwiftKey from your mobile phone or tablet, where it is a leading keyboard input method. It uses natural language processing and artificial intelligence to predict the next words the user intends to write, learning from previous input. As one of the most advanced technologies in this area, it was very much an inspiration for this project.

Method

The data

The data used as the basis for the prediction model comes from a corpus named HC Corpora, which can be found here, and is made up of Twitter, blog and news entries.

Cleaning the data

The text files, containing several million lines of text, were read into R and first needed to be cleaned and formatted. For this purpose the text processing package quanteda was used. The data was cleaned by, among other things, removing numbers, punctuation and special characters, and by transforming all words to lower case.
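A minimal sketch of this step, assuming the standard HC Corpora file layout (the file paths are illustrative). Note that in quanteda the cleaning options are passed to the tokenizer, so this sketch already performs the tokenization described on the next slide:

```r
library(quanteda)

# Read the raw text files (paths are illustrative)
blogs  <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

corp <- corpus(c(blogs, news, tweets))

# Clean while tokenizing: drop numbers, punctuation and symbols,
# then transform everything to lower case
toks <- tokens(corp,
               remove_numbers = TRUE,
               remove_punct   = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)
```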

Tokenization

Tokenization (the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens) was then used to split the text corpus into separate words for further processing.
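On a small example phrase, the word tokenization looks like this (output shown as printed by quanteda v2 and later):

```r
tokens("You are the best", remove_punct = TRUE)
## Tokens consisting of 1 document.
## text1 :
## [1] "You"  "are"  "the"  "best"
```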

Model

N-grams

The tokenized text was then used to construct n-grams (contiguous sequences of n items from a given sequence of text or speech), which are used frequently in the fields of computational linguistics and probability. The bi-, tri- and quadgrams form the basis of a frequency matrix that allows the most common word sequences to be identified.
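A sketch of this step using quanteda's built-in n-gram constructor, continuing from the cleaned tokens above:

```r
# Build 2-, 3- and 4-grams from the cleaned tokens
bigrams   <- tokens_ngrams(toks, n = 2)
trigrams  <- tokens_ngrams(toks, n = 3)
quadgrams <- tokens_ngrams(toks, n = 4)

# Frequency matrix: the most common bigrams across the corpus
topfeatures(dfm(bigrams), 10)
```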

Methodology

The n-grams were then used as the input for a model based on "stupid back-off", following the method described in this paper from Columbia University. The method is easy to train on large data sets such as this one and gives good results, although it is important to mention that there are certainly ways to improve the approach.
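A minimal sketch of a stupid back-off scorer, assuming the n-gram frequencies have been collected into a list of named count vectors (the data structure and names are illustrative, not the app's actual code):

```r
# counts is a list of named numeric vectors, where counts[[n]]["w1_w2_..._wn"]
# holds the corpus frequency of that n-gram (quanteda's "_" concatenator).
stupid_backoff <- function(context, word, counts, alpha = 0.4) {
  n <- length(context) + 1
  if (n == 1) {
    # Base case: unigram relative frequency (0 for unseen words)
    p <- counts[[1]][word] / sum(counts[[1]])
    return(ifelse(is.na(p), 0, p))
  }
  num <- counts[[n]][paste(c(context, word), collapse = "_")]
  den <- counts[[n - 1]][paste(context, collapse = "_")]
  if (!is.na(num) && !is.na(den) && den > 0) {
    num / den  # observed n-gram: score by relative frequency
  } else {
    # Unseen n-gram: back off to a shorter context, discounted by alpha
    alpha * stupid_backoff(context[-1], word, counts, alpha)
  }
}

# Example: score "the" as a continuation of "thanks for"
# stupid_backoff(c("thanks", "for"), "the", counts)
```

Unlike Katz back-off, the scores are not normalized probabilities; the fixed discount (alpha = 0.4, as in the original paper) is what makes the method cheap to train on large data sets.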

The Prediction Tool

The tool itself is hosted on shinyapps.io and can be accessed here. Simply enter a phrase, and the model will present suggestions for the most likely next words in your sentence.
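A sketch of how such an app can be structured in Shiny; predict_next() stands in for a hypothetical wrapper that cleans the input phrase and ranks candidate next words with the back-off model above:

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(input$phrase)
    # predict_next() is illustrative: it would tokenize the phrase and
    # return a small data frame of the top-ranked next-word candidates
    predict_next(input$phrase)
  })
}

shinyApp(ui, server)
```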

Halfway through the construction process I also created a milestone report, which describes the first stages of this project in more detail. If you are interested, you can find it here.
