Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.
Predictive text is an input technology that facilitates typing on a mobile device by suggesting words the end user may wish to insert in a text field. Predictions are based on the context of the other words in the message and the first letters typed. Because the end user simply taps on a word instead of typing it out on a soft keyboard, predictive text can significantly speed up the input process.
The goal of this project is to build a predictive text application, which takes a phrase of one or more words as input and predicts the next word as output. For example, if the user types “I went to the”, the application should output the 3 most likely candidates for the next word. In this case, we could expect the output to be “gym”, “store”, “restaurant”.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). The corpora are collected from publicly available sources by a web crawler. More information can be found at http://www.corpora.heliohost.org/aboutcorpus.html.
The data has been collected from three sources (Twitter, blogs and news), each with its own data file.
A brief summary of the data is presented below.
The Blogs document contains 899,288 lines, 37,334,131 words, and 208,361,438 characters.
The length of the longest line in the Blogs document is 40,835.
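For reference, a minimal sketch of how such summary figures can be computed in R. The file path reflects the standard HC Corpora layout and is an assumption, as is the whitespace-based word count.

```r
# Read one corpus file and compute basic summary statistics
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

length(blogs)                                  # number of lines
sum(sapply(strsplit(blogs, "\\s+"), length))   # rough word count (whitespace-split)
sum(nchar(blogs))                              # total number of characters
max(nchar(blogs))                              # length of the longest line
```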
A good way to understand the data better is to visualise it. Word clouds give greater prominence to words that appear more frequently in the source text, which makes them a useful first look at the frequency and variety of words in each dataset. Below are word clouds for each dataset.
Blogs Word Cloud:
News Word Cloud:
Tweets Word Cloud:
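A sketch of how these word clouds could be produced with the wordcloud package. The input object `blogs_sample` (a character vector of sampled lines), the tokenization rule and the colour palette are assumptions.

```r
library(wordcloud)
library(RColorBrewer)

# Split the sampled text into lower-case words and count their frequencies
words <- unlist(strsplit(tolower(blogs_sample), "[^a-z']+"))
words <- words[nchar(words) > 0]
freq  <- sort(table(words), decreasing = TRUE)

# Draw the word cloud, showing at most the 100 most frequent words
wordcloud(names(freq), as.numeric(freq),
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```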
Considering the size of the 3 data files, 5,000 lines were randomly sampled from each of them. These 3 samples were then combined into a single “training” file containing 15,000 lines.
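The sampling step might look like the following sketch; the file names and the random seed are assumptions.

```r
set.seed(1234)  # assumed seed, for reproducibility

# Draw a random sample of lines from one corpus file
sample_lines <- function(path, n = 5000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

# Combine the three samples into one training set and write it out
training <- c(sample_lines("final/en_US/en_US.blogs.txt"),
              sample_lines("final/en_US/en_US.news.txt"),
              sample_lines("final/en_US/en_US.twitter.txt"))

writeLines(training, "training.txt")
```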
Following this, a function was created to clean up the data set. The cleaned data was then tokenized into one-gram, two-gram and three-gram tokens, which were sorted by frequency; a sketch of these steps appears below.
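A minimal sketch of the cleaning and tokenization steps. The exact cleaning rules used in the project are not listed here, so the ones below (lower-casing, stripping everything except letters and apostrophes, collapsing whitespace) are assumptions, and `training` is the sampled vector from the previous step.

```r
# Clean the sampled text (assumed steps: lower-case, strip punctuation and numbers, trim spaces)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Build a sorted n-gram frequency table from a character vector of cleaned lines
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " "), function(w) {
    w <- w[nchar(w) > 0]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

training_clean <- clean_text(training)
unigrams <- ngram_freq(training_clean, 1)
bigrams  <- ngram_freq(training_clean, 2)
trigrams <- ngram_freq(training_clean, 3)

head(unigrams, 20)  # top 20 one-gram tokens; analogous calls give the bigram and trigram tables
```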
Top 20 One Gram Tokens:
Top 20 Two Gram Tokens:
Top 20 Three Gram Tokens:
A function was created to analyze the minimum number of unique words needed to cover a certain percentage of all word occurrences in the language. Following is the graph depicting the results:
Therefore, 3 unique words are needed to cover 10% of all word occurrences in the language, 9 unique words to cover 20%, and so on. In general, the minimum number of unique words needed grows by a factor of roughly 2 to 3 for each additional 10 percentage points of coverage.
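These coverage figures can be derived directly from the unigram frequency table, as in the following sketch (it assumes the sorted `unigrams` table built above).

```r
# Minimum number of unique words needed to cover a given share of all word occurrences
coverage <- function(freq, target) {
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_share >= target)[1]
}

# Coverage requirements at 10%, 20%, ..., 90%
sapply(seq(0.1, 0.9, by = 0.1), function(p) coverage(unigrams, p))
```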
I plan to do modelling using the following 3 combination techniques:
Following this, prediction will be performed to evaluate the accuracy of the resulting model.
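As an illustration of the kind of prediction step involved (not necessarily the final modelling approach), a simple frequency-based backoff from trigrams to bigrams might look like this; `clean_text`, `trigrams` and `bigrams` are the assumed objects sketched earlier.

```r
# Hypothetical next-word prediction: look up the last two words in the trigram table,
# fall back to the last word in the bigram table, and return the top k candidates.
# Because the tables are sorted by frequency, earlier matches are more likely words.
predict_next <- function(phrase, trigrams, bigrams, k = 3) {
  w <- strsplit(clean_text(phrase), " ")[[1]]
  candidates <- character(0)

  if (length(w) >= 2) {
    prefix <- paste(tail(w, 2), collapse = " ")
    hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    candidates <- sub(".* ", "", names(hits))   # keep only the predicted last word
  }
  if (length(candidates) < k && length(w) >= 1) {
    prefix <- tail(w, 1)
    hits <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
    candidates <- c(candidates, sub(".* ", "", names(hits)))
  }
  head(unique(candidates), k)
}

predict_next("I went to the", trigrams, bigrams)
```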
Finally, a Shiny app will be created that takes a phrase (one or more words) as input in a text box and outputs a prediction of the next word.
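A minimal skeleton of such an app is sketched below; `predict_next`, `trigrams` and `bigrams` are the hypothetical objects from the earlier sketches, not the final implementation.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # Show the top candidate next words for the phrase typed by the user
  output$prediction <- renderText({
    paste(predict_next(input$phrase, trigrams, bigrams), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)
```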