Predictive Text Application - Milestone Report

Introduction

We are developing an application that predicts the next word from the words that precede it, similar to mobile keyboard software such as SwiftKey. The end product will be a web application that takes an incomplete phrase from the user and predicts the next word. Building it requires a suitable text corpus; here we use the English-language datasets from HC Corpora. This milestone report describes our initial exploratory analysis of the data and our goals for the remaining work.

Raw Data Summary

The HC Corpora English dataset consists of three text files with one entry per line: Blogs, News and Twitter. Each file contains text collected from its respective type of source across the Internet. Table 1 summarizes the raw data:

Table 1: Raw dataset summary

Dataset	Size (bytes)	Line Count	Word Count	Average Words/Line
Blogs	210160014	899288	38154238	42.4
News	205811889	1010242	35010782	34.7
Twitter	167105338	2360148	30218125	12.8
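
The statistics in Table 1 can be computed in a single pass over each file. The following Python sketch shows one way to do it; the function name is hypothetical, and it assumes UTF-8 text with words separated by whitespace, so its word counts may differ slightly from those produced by the tool used for the table.

```python
import os


def summarize_text_file(path):
    """Return size in bytes, line count, word count and mean words per line."""
    line_count = 0
    word_count = 0
    # errors="ignore" skips undecodable bytes instead of aborting (an assumption).
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())  # whitespace-delimited words
    average = round(word_count / line_count, 1) if line_count else 0.0
    return {
        "size_bytes": os.path.getsize(path),
        "line_count": line_count,
        "word_count": word_count,
        "avg_words_per_line": average,
    }
```

Reading line by line keeps memory use low, which matters for files of roughly 200 MB each.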

Figure 1 shows how the number of words per line varies within each dataset.

Figure 1: Distribution of words per line of each individual dataset

Exploratory Data Analysis

Sampling

Because the datasets are very large and our hardware resources are limited, we take a random 10% sample of each dataset (Blogs, News and Twitter). The three samples are then combined into a single corpus.
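
The sampling step can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the function names and the fixed seed are assumptions, and it expects each dataset to be already loaded as a list of lines.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Draw a random sample of the given fraction of lines, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_size = int(len(lines) * fraction)
    return rng.sample(lines, sample_size)


def build_sample_corpus(datasets, fraction=0.10, seed=42):
    """Sample each dataset separately, then combine the samples into one corpus."""
    corpus = []
    for offset, name in enumerate(sorted(datasets)):
        # Use a different seed per dataset so the samples are independent.
        corpus.extend(sample_lines(datasets[name], fraction, seed + offset))
    return corpus
```

Sampling each dataset separately keeps the three sources represented in proportion to their line counts.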

Cleaning

We removed profane words from the corpus using the word list from pattern-for-python. We also removed punctuation, numbers, excess whitespace and foreign characters, and converted all text to lowercase. These steps give us the clean, tokenized corpus needed for the next step, n-grams.
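
The cleaning steps can be sketched in Python as below. This is an illustration under stated assumptions rather than the pipeline used here: the function name is hypothetical, "foreign characters" is interpreted as any non-ASCII character, and `banned_words` stands in for the profanity list, which is not reproduced.

```python
import re


def clean_line(line, banned_words=frozenset()):
    """Lowercase a line, strip unwanted characters and return its tokens."""
    text = line.lower()
    # Assumption: treat every non-ASCII character as foreign and drop it.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so that contractions stay as one token ("it's" -> "its").
    text = text.replace("'", "")
    # Replace remaining punctuation and digits with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses runs of whitespace.
    tokens = text.split()
    return [token for token in tokens if token not in banned_words]
```

For example, `clean_line("Hello, World! It's 2015 darn", {"darn"})` yields the tokens `hello`, `world` and `its`.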

N-Grams

An n-gram is a contiguous sequence of n items from a given sequence of text or speech (definition from Wikipedia). For our application, we use unigrams, bigrams and trigrams (1-, 2- and 3-grams). The corpus is split into three n-gram data structures, each sorted by n-gram frequency. N-grams are central to our model: the phrase the user enters in the final application will be segmented and compared against these data structures to predict the next word. The frequency tables also let us see the distribution of words and word sequences. The following are the most frequent n-grams in our sample corpus.

Figure 2: Top 15 n-grams by their frequency
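
Building an n-gram frequency table amounts to sliding a window of n tokens over each cleaned line and counting the results. The Python sketch below illustrates the idea; the function names are hypothetical, and n-grams are not allowed to span line boundaries, which is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(tokenized_lines, n):
    """Count how often each n-gram occurs across all lines."""
    counts = Counter()
    for tokens in tokenized_lines:
        counts.update(ngrams(tokens, n))
    return counts


# Toy corpus of two already-cleaned lines.
toy_lines = [["i", "love", "data"], ["i", "love", "r"]]
bigram_counts = ngram_frequencies(toy_lines, 2)
top_bigrams = bigram_counts.most_common(15)  # sorted by descending frequency
```

In this toy corpus the most frequent bigram is `("i", "love")` with a count of 2. `Counter.most_common` returns the entries already sorted by frequency, which is the form a top-15 chart needs.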

The sample corpus contains 29045630 unigram (single-word) instances, but most of these are repeats of a small number of unique words. Table 2 shows how many unique words are needed to cover a given percentage of all word instances in the sample corpus. The required vocabulary grows sharply with coverage: 104 unique words cover 50% of all instances and 6681 cover 90%, while covering 100% requires 234522. We can use this information to make our n-gram data structures smaller and more efficient for the final application while still maintaining reasonable accuracy.

Table 2: Unique words needed to cover a given percentage of word instances in the sample corpus

Coverage (% of Word Instances)	Unique Word Count	Word Instances Covered	Ratio (Unique Words / Instances Covered)
50	104	14522815	0.0000072
60	263	17427378	0.0000151
70	722	20331941	0.0000355
80	1992	23236504	0.0000857
90	6681	26141067	0.0002556
100	234522	29045630	0.0080743
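
The unique word counts in Table 2 follow from accumulating unigram frequencies from most to least frequent until the target share of instances is reached. A minimal Python sketch, with a hypothetical function name and a toy vocabulary in place of the real corpus:

```python
from collections import Counter


def words_needed_for_coverage(unigram_counts, target_fraction):
    """Return how many of the most frequent words cover the target share."""
    total = sum(unigram_counts.values())
    needed = target_fraction * total
    covered = 0
    for rank, (_, count) in enumerate(unigram_counts.most_common(), start=1):
        covered += count
        if covered >= needed:
            return rank
    return len(unigram_counts)


# Toy vocabulary: 10 word instances, 4 unique words.
toy_counts = Counter({"the": 5, "a": 3, "cat": 1, "dog": 1})
half = words_needed_for_coverage(toy_counts, 0.5)  # "the" alone covers 50%
full = words_needed_for_coverage(toy_counts, 1.0)  # every word is needed
```

The Ratio column is then the unique word count divided by the number of word instances covered, for example 104 / 14522815 for the 50% row.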

Further Goals

With the n-gram data structures complete, the next step is to build the prediction model using an appropriate algorithm. We must then implement the final Shiny web application, which will take an incomplete phrase from the user and predict the next word. A presentation slide deck will also be prepared.
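
The prediction algorithm has not been chosen yet. Purely as an illustration of how n-gram tables can drive a prediction, the Python sketch below uses a simple back-off lookup: it tries the trigram table first, then the bigram table, then falls back to the most frequent unigram. The function name, the table layout (tuples of words mapped to counts) and the back-off strategy itself are all assumptions, not the final design.

```python
def predict_next_word(phrase, trigrams, bigrams, unigrams):
    """Predict the next word by backing off from trigrams to bigrams to unigrams."""
    tokens = phrase.lower().split()
    for n, table in ((3, trigrams), (2, bigrams)):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase too short for this n-gram order
        # Collect every word that follows the context, with its count.
        candidates = {
            gram[-1]: count
            for gram, count in table.items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
    if not unigrams:
        return None
    # Last resort: the most frequent single word.
    return max(unigrams, key=unigrams.get)[0]


# Toy tables keyed by word tuples (hypothetical data).
toy_trigrams = {("i", "love", "data"): 2, ("i", "love", "r"): 1}
toy_bigrams = {("i", "love"): 3, ("love", "data"): 2, ("love", "r"): 1}
toy_unigrams = {("i",): 3, ("love",): 3, ("data",): 2, ("r",): 1}
guess = predict_next_word("I love", toy_trigrams, toy_bigrams, toy_unigrams)
```

Here `guess` is `data`, the most frequent continuation of "i love" in the toy trigram table. The linear scan over each table is kept for clarity; an index keyed by context would be needed for acceptable speed on real data.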

Along the way, we will need to explore optimizations, since the Shiny server has limited computing resources. The n-gram data structures will need to be reduced in size, and the prediction model must be fast. We will also consider stemming the data and varying the sample size, and evaluate their effects on coverage, speed and accuracy.