Introduction

We are developing an application that predicts the next word of a phrase from the words that precede it, similar to mobile keyboard software such as SwiftKey. The end product will be a web application that takes an incomplete phrase from the user and predicts the next word. Building the application requires a suitable text corpus; here we use the English-language data sets from HC Corpora. This milestone report summarizes our initial exploratory analysis of the data and our goals for the remaining work.

Raw Data Summary

The HC Corpora English dataset consists of three line-separated text files: Blogs, News and Twitter. Each file contains text collected from its respective type of source across the Internet. Table 1 summarizes the raw data:

Table 1: Raw dataset summary

Dataset    Size (bytes)    Line Count    Word Count    Average Words/Line
Blogs      210160014       899288        38154238      42.4
News       205811889       1010242       35010782      34.7
Twitter    167105338       2360148       30218125      12.8
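
The per-file figures in Table 1 can be derived in a single pass over each file. The following Python sketch is illustrative only: the function name `summarize_lines` is hypothetical, words are assumed to be whitespace-delimited, and the report's own tooling may count words differently. The file size in bytes would come from the file system (for example `os.path.getsize`) rather than from this function.

```python
def summarize_lines(lines):
    """Count lines and whitespace-delimited words in an iterable of text lines."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())  # assumption: words are split on whitespace
    mean_words = word_count / line_count if line_count else 0.0
    return {
        "lines": line_count,
        "words": word_count,
        "words_per_line": round(mean_words, 1),
    }


# Tiny in-memory stand-in for a file; a real run would pass an open file handle.
sample = ["the quick brown fox", "jumps over", "the lazy dog today"]
summary = summarize_lines(sample)
```

Because the function only iterates over its input, a file object can be passed directly, so a file never has to be loaded into memory in full.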

Figure 1 shows how the number of words per line varies within each dataset.

Figure 1: Distribution of words per line of each individual dataset

Exploratory Data Analysis

Sampling

Because the datasets are very large and our hardware resources are limited, we take a random 10% sample of each dataset (Blogs, News, Twitter). The three samples are then combined into a single corpus.
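
One way to draw such a sample is to keep each line independently with probability 0.10, which works on a stream and avoids holding a whole file in memory. The Python sketch below assumes this approach; the report does not state the exact sampling method, and the function names, the fixed seed and the dictionary layout are all hypothetical.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def build_sample_corpus(datasets, fraction=0.10, seed=42):
    """Sample every dataset and concatenate the samples into one corpus.

    `datasets` maps a dataset name (e.g. "blogs") to an iterable of lines.
    """
    corpus = []
    for offset, name in enumerate(sorted(datasets)):
        # A different seed per dataset avoids reusing the same random stream.
        corpus.extend(sample_lines(datasets[name], fraction, seed + offset))
    return corpus
```

With this approach the sample size is only approximately 10% of the input; drawing exactly 10% would instead require knowing the line count in advance.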

Cleaning

We removed profanity from the corpus using the pattern-for-python word list. We also removed punctuation, numbers, extra whitespace and foreign characters, and converted all text to lowercase. The result is a clean, tokenized corpus, which we need for the next step: building n-grams.
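
The cleaning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the report: `BLOCKED_WORDS` is a hypothetical two-word placeholder for the real profanity list, "foreign characters" is interpreted as anything outside ASCII, and apostrophes are dropped so that contractions stay in one token.

```python
import re

# Hypothetical placeholder; the report uses the pattern-for-python word list.
BLOCKED_WORDS = {"darn", "heck"}


def clean_line(line, blocked_words=BLOCKED_WORDS):
    """Return a list of lowercase alphabetic tokens with blocked words removed."""
    text = line.lower()
    # Assumption: "foreign characters" means non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so "it's" becomes "its" rather than "it" and "s".
    text = text.replace("'", "")
    # Replace digits, punctuation and any other non-letters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    return [token for token in text.split() if token not in blocked_words]


tokens = clean_line("Hello, World!! It's 2015, heck yes, café.")
```

Note that dropping non-ASCII characters truncates accented words ("café" becomes "caf"); transliterating them instead would be a reasonable alternative.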

N-Grams

An n-gram is a contiguous sequence of n items from a given sequence of text or speech (definition from Wikipedia). For our application, we use unigrams, bigrams and trigrams (1-, 2- and 3-grams). We split the corpus into three n-gram data structures, one per n-gram size, each sorted by n-gram frequency. The n-grams are central to our model: the phrase the user enters in the final application will be segmented and compared against these structures to help predict the next word. The frequency tables also let us see the distribution of single words and of short word sequences. Figure 2 shows the most frequent n-grams in our sample corpus.
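
As an illustration of how such frequency tables can be built, the Python sketch below counts n-grams with `collections.Counter`. The function names are hypothetical, and it assumes n-grams are counted within each line so that they never span a line boundary; the report does not say how line boundaries were handled.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokenized_lines, n):
    """Count n-grams over an iterable of token lists, line by line."""
    counts = Counter()
    for tokens in tokenized_lines:
        counts.update(ngrams(tokens, n))
    return counts


lines = [["i", "love", "new", "york"], ["i", "love", "pizza"]]
bigram_counts = count_ngrams(lines, 2)
top_bigram = bigram_counts.most_common(1)  # [(('i', 'love'), 2)]
```

`Counter.most_common()` returns the entries sorted by descending frequency, which corresponds to the sorted tables described above.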

Figure 2: Top 15 n-grams by their frequency

The sample corpus contains 29045630 unigram (single-word) instances in total, but most of them are repeats of a small number of distinct words. Table 2 shows how many unique words are needed to cover a given percentage of all word instances in the sample corpus. The required vocabulary grows sharply with coverage: 104 unique words account for 50% of all word instances, while 6681 are needed for 90% and 234522 for 100%. We can use this information to make our n-gram data structures smaller and more efficient for the final application while still maintaining reasonable accuracy.

Table 2: Unique words needed to cover a given percentage of word instances in the sample corpus

Coverage (%)    Unique Words    Word Instances Covered    Ratio (Unique Words / Instances Covered)
50              104             14522815                  0.0000072
60              263             17427378                  0.0000151
70              722             20331941                  0.0000355
80              1992            23236504                  0.0000857
90              6681            26141067                  0.0002556
100             234522          29045630                  0.0080743
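
The unique word counts in Table 2 can be obtained by walking the unigram table from the most to the least frequent word and accumulating counts until the target share is reached. The Python sketch below shows the idea on toy data; the function name and the example counts are hypothetical, and the coverage is passed as an integer percentage so the comparison stays in exact integer arithmetic.

```python
from collections import Counter


def words_needed_for_coverage(unigram_counts, percent):
    """Return how many of the most frequent words cover `percent` of all instances."""
    total = sum(unigram_counts.values())
    covered = 0
    for rank, (_, count) in enumerate(unigram_counts.most_common(), start=1):
        covered += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total:
            return rank
    return len(unigram_counts)


# Toy frequency table with 10 word instances in total.
toy_counts = Counter({"the": 5, "of": 3, "cat": 1, "dog": 1})
coverage = {p: words_needed_for_coverage(toy_counts, p) for p in (50, 80, 90, 100)}
```

Even in this toy example the pattern from Table 2 appears: one word covers half of the instances, while full coverage requires the entire vocabulary.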

Further Goals

With the n-gram data structures complete, the next step is to build the prediction model using an appropriate algorithm. We must then implement the final Shiny web application, which will take an incomplete phrase from the user and predict the next word. We will also prepare a presentation slide deck.
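
The prediction algorithm has not been chosen yet. Purely as an illustration of how the n-gram structures could be used, the Python sketch below assumes a simple backoff lookup: try the last two words of the phrase against the trigrams, then the last word against the bigrams, then fall back to the most frequent unigram. All names and the toy data are hypothetical, and the final model may differ.

```python
from collections import Counter, defaultdict


def build_next_word_tables(tokenized_lines, max_context=2):
    """Map each context of 0..max_context preceding words to next-word counts."""
    tables = {k: defaultdict(Counter) for k in range(max_context + 1)}
    for tokens in tokenized_lines:
        for i, word in enumerate(tokens):
            for k in range(max_context + 1):
                if i - k >= 0:
                    context = tuple(tokens[i - k:i])  # the k words before `word`
                    tables[k][context][word] += 1
    return tables


def predict_next_word(tables, phrase):
    """Return the most frequent continuation, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    for k in sorted(tables, reverse=True):  # longest context first
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = tables[k].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty


toy_lines = [
    ["i", "love", "new", "york"],
    ["i", "love", "pizza"],
    ["we", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
]
tables = build_next_word_tables(toy_lines)
prediction = predict_next_word(tables, "I love")
```

In this toy corpus "i love" is followed by "new" twice and by "pizza" once, so the trigram level answers the query without backing off.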

Along the way, we will need to explore optimizations, since the Shiny server has limited computing resources. The n-gram data structures will need to be reduced in size, and the prediction model must be fast. We will also consider stemming the raw data and trying different sample sizes, weighing coverage, speed and accuracy.
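
One possible size reduction, shown here only as a sketch, is to drop n-grams that occur fewer than a minimum number of times. The Python function below is hypothetical, and the threshold of 2 is an arbitrary illustrative value rather than a tuned setting.

```python
from collections import Counter


def prune_ngrams(ngram_counts, min_count=2):
    """Keep only n-grams that occur at least `min_count` times."""
    return Counter({gram: n for gram, n in ngram_counts.items() if n >= min_count})


counts = Counter({("i", "love"): 3, ("love", "pizza"): 1, ("new", "york"): 2})
pruned = prune_ngrams(counts)
# Share of n-gram instances that survive pruning.
retained_share = sum(pruned.values()) / sum(counts.values())
```

Comparing the retained share with the reduction in table size gives a quick way to judge how much coverage is traded for memory at a given threshold.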