Introduction

Next word prediction (NWP) is the task of suggesting the most probable word a user will type next. This is a part of natural language processing (NLP) which can be applied in many areas where text is inserted using a keyboard. An example of NWP application is the SwiftKey smart keyboard [1] which aims at improving the typing experience on small mobile phone. Beyond improving keystroke efficiency, recent literature points out other benefits of word prediction including the improvement of the quality and quantity of written work, enhancement in the development of written literacy skills, along with spelling assistance to people with various levels of spelling disorder [2,3].

The primary objective of the present project is to implement a next word prediction algorithm given a raw text dataset. The main deliveries are:

A milestone report presenting an exploratory analysis of the data along with an outlook of the application development plans (due by 27/07/2015). This document
A data product consisting of a Shiny web application highliting the prediction algoritm (due by 24/08/2015).

The first part of this milestone report focuses on the data that will be used throughout the project. It includes a brief description of the source files, the pre-processing steps and exploratory plots illustrating the main features of the different datasets.

The second part of this report is an outlook of my plans for creating the prediction algorithm and the Shiny application.

Part I. Data pre-processing and exploratory analysis

The data is originally from a corpus called HC Corpora www.corpora.heliohost.org. It was collected by a web crawler from three types of publicly available sources: newpapers, personal blogs and twitter. The source files we are using for this project have already been preprocessed and consist of raw text files with one document per line and no tag (available here. The dataset is available in different languages including german, english, finish and russian. I will be using the english data consisting in three files whose properties are summarized in Tab. 1.

**Table 1.** Dataset used throughout the project: filenames, sizes and number of document per file.
file	size	documents
E:/Capstone Data/raw/en_US.blogs.txt	200.4242	899288
E:/Capstone Data/raw/en_US.news.txt	196.2775	1010242
E:/Capstone Data/raw/en_US.twitter.txt	159.3641	2360148

a. pre-processing

An intuitive approach to predict the next word occuring in a sentence is to rely on a few preceding words, which is known as the Markov Assumption [4]. For example, considering the sentence " … a lot …“, we may intuitively assume”of" to have a relatively high probability to appear as the next word, this independently of the words preceding “a lot”. Setting aside our ability as humans to perform such predictions, the first step in this project is to anticipate the type of algorithm we will be using and which will dictate the data processing strategy.

At a practical level, a straightforward way to make a prediction for the above example would be to retrieve from a corpus all sequences of three words starting with “a lot …” and rank the results according to their number of occurence in the corpus (or count). This strategy can be applied considering either 1, 2 or N preceding words and is the basis of what is commonly called N-gram Language Models. A N-gram consists in the pair of a word sequence containing N words and its according count. After building N-gram Language Models from a corpus, one can perform word prediction by simple look-up of N-gram tables or by implementing more advanced techniques such as smoothing methods on top of the N-gram models (we will not get into the details at this stage of the project).

The objective of creating N-gram language models for our prediction algorithm allows us to define the pre-processing steps to be applied to our different source files. These different steps are listed and motivated below:

Read the raw text file (results in a list of documents)
Split each document into sentences
For each sentence, separate words from punctuation
Add start and end of sentence tags (we will use “<s>” and “</s>”)
Tokenize into N-grams
Create the language model by counting the occurence of each N-gram

The spliting of documents into sentences with the addition of special tokens marking the star/end of the sentences is performed by most current NLP toolkits such as MeTa, Kylm or SRILM. This adds an additional level of information and for example enables to attribute different probabilities to words depending either their occur at the beginning or end of a sentence.

I decided to keep punctuation marks and not perform stop word removal/stemming, in order to not impact the grammatical meaning. We will still have the option to filter the N-gram models and/or not predict punctuation marks if it appear usefull latter on in the project. I also keep upper/lowercase letters since, as for punctuation, it could impact the syntax, and some words with a first upper case have higher probability to occurs at the beginning of a sentence or after punctuation.Additionally I do not filter profane words, instead they will be filtered out during the prediction step.

In a first time, I will use N-grams models of order 1 to 3 (i.e. sequences of 1, 2 and 3 words with corresponding counts). Evaluation of the prediction algorithm will help latter to decide if higher order models could be usefull.

b. sub-sampling of the datasets

Considering the relatively large size of the source datasets, the present exploratory analysis considers a sub-sample consisting of the first 1000 documents of each file. In this way the preprocessing can be handled directly in R using the tm and RWeka packages for the tokenization steps. Also for the sake of clarity the code is not shown in the present report.

A detailled summary of the sub-sampled datasets is shown in Tab. 2. As one would expect, documents from news articles or blog posts contain in average more sentences and words compared to tweets. As a data sanity check, we compute the number of characters in the longest document for each file which we find equal to 140 for tweets (tweets are limited to 140 characters). Fig. 1 shows the number of words per sentence which is expected to follow lognormal distributions [5]. We observe a clear difference between the twitter dataset, which exhibits a mean of about 10 words/sentence and high skewness, and the two other datasets exhibiting means twice as large and distributions more spread out. This first observation suggests more syntactically complex sentences in the blog/news datasets compared to twitter data, that we might want to keep in mind for our prediction algorithm. Indeed it might be more difficult to make good prediction on long sentence if we have used twitter data to train our model. Inversely, building our model from blog/news sentences may not be optimal to perform prediction on short sentences from twitter.

**Table 2.** Statistics of sub-sampled datasets used for exploratory analysis: filenames, total number of documents, sentences, words and characters, along with the number of characters in the longuest sentence (max_characters)
file	documents	sentences	words	characters	max_characters
E:/Capstone Data/raw/en_US.blogs.txt	1000	2605	50360	230757	1912
E:/Capstone Data/raw/en_US.news.txt	1000	1857	41893	198003	982
E:/Capstone Data/raw/en_US.twitter.txt	1000	1556	16392	68538	140

Figure 1. Distribution of the number of words per sentence for the three sample datasets. Vertical dashed lines represent the mean of each distribution.

c. most frequent N-grams

Following the pre-processing and creation of the N-gram language models, we first look at the most frequent N-grams for each dataset taken separatly and for the three datasets merged togethers (referred to as “all” in the following).

Tabs. 2, 3 and 4 show the top 5 N-grams by descending order of their count in each dataset. We can already notice some differences between datasets. For example the word “I” is ranked in second position in tweets, fourth position in blog articles, and not present in the top 5 for news articles. The top 5 bigrams are in relatively good agreement, while for trigrams specific sequences emerge due to common word sequences like “Thanks for the RT” (RT = Re-Tweet) in the twitter dataset.

Fig. 2 compares N-gram frequencies with a higher level of granularity. It depicts the probability histograms for the top 10 N-grams, with N-grams ranked by decreasing probability in the merged dataset. From this we can confirm our previous observation that the word “I” shows very different frequencies between the twitter/blog and the news datasets. In the bigrams, sequences “in a” and “and the” exhibit a much lower probability in the tweeter dataset. This could be explained by considering that such bigrams are used as links in relatively long sentences. Finally some trigrams referring to the document author like “I had a” or “I want to” have much lower probabilities in the news dataset, as one could expect since news are made of more objective text. At this point it is important to note that the merged dataset (“all”) is biased in the current analysis since the three sample datasets have different total number of words (we sample a fixed number of documents). Consequently twitter dataset has a lower weight which explains that some of its frequent trigrams are not ranked in the top 10.

Beside validating the pre-processing steps by observing that most frequent N-grams correspond to common english word sequences, this analysis highlight the fact that there are noticeable differences between different sources of data. As a consequence, it may be of interest in the future steps of this project to investigate in more details the impact of the choice of the training data (using either a single source of data of mixing different sources).

**Table 2**: Top 5 most frequent unigrams in the different sample datasets.
blog	news	tweet	all
the	the	the	the
to	to	I	to
and	a	to	and
I	and	a	a
a	of	you	of

**Table 3**: Top 5 most frequent bigrams in the different sample datasets.
blog	news	tweet	all
of the	of the	in the	of the
in the	in the	of the	in the
to the	to the	for the	to the
to be	for the	to be	to be
on the	on the	on the	for the

**Table 4**: Top 5 most frequent trigrams in the different sample datasets.
blog	news	tweet	all
a lot of	a lot of	Thanks for the	a lot of
one of the	said in a	a lot of	one of the
I wanted to	a couple of	for the RT	a couple of
I had a	be able to	I had a	I had a
I had to	going to be	thanks for the	as much as

Figure 2. Probability histograms of the top 10 N-grams in the sample datasets. From top to bottom: unigrams, bigrams, trigrams. N-grams are ordered according to their probability in the merged dataset (“all”). Because the three sample datasets do not have the same length in term of words (we sample a fixed number of documents), frequencies of N-grams are compared by considering their probability (e.g. count divided by number of N-gram in each dataset).

d. n-gram frequency curves and vocabulary coverage

We further investigate the properties of our datasets by plotting the frequencies of N-grams as a function of their rank order (i.e. how many N-grams appear once in the dataset, how many appear twice, etc.). Such representation has two main interests:

this can be used as a data sanity check as we know that standard linguistic texts are expected to follow Zipf’s law for the frequency distribution. The latter predict the N-gram frequencies to decay with a power law of the rank order, with an exponent close to -1 [6]. By comparing our data to Zipf law we can thereby confirm the validity of the pre-processing and N-grams aggregation method.
the frequency curves can be used to optimize the coverage of the vocabulary. As we will see, many N-grams occur only once or a few times in the datasets. Such N-gram might not be very usefull to the prediction algorithm and we may choose to discard them for memory consumption / storage purposes. However, we may not want to blindly discard N-grams appearing let say below 5 times. Instead, a better approach is to set a threshold so that our language model cover 95% of the vocabulary. N-gram frequency curves can help us to make such choices.

Frequency curves for unigrams, bigrams and trigrams are shown in Fig. 3. We observe that unigrams follow a power law with an exponent between -1 and -2 which is in agreement with Zipf’s law. Bigrams and trigrams also appear to follow power laws with slightly lower exponents. Note that our actual sample dataset may be too small to expect a perfect agreement with Zipf’s law, the latter being usually demonstrated on relatively large corpora. It may be of interest to come back to this representation latter on in this project when we will process more data.

Figure 3. N-gram frequencies curves showing how many N-grams appear k times in the dataset, with k the rank order. Left part of the curves correspond to rare words appearing only once or a few times, while the right part corresponds to common words. From left to right: unigrams, bigrams, trigrams. For unigrams, black lines indicates power law decays with exponents -1 and -2, for comparison with Zipf’s law.

Let us now look at how we can use the above frequency curves to optimize the vocabulary coverage, i.e. how to choose a cutt-off value such that we can store all our N-grams while not loosing too much informations and prediction power. First, Tab. 5 shows the number of N-grams in each sample datasets. We see the number of N-grams grows very quickly with the order. Tab. 5 also displays the required memory size to store the three N-gram tables with their counts, for each dataset. Although the size is still moderate in the present case, about 10 Mb in total, it is important to remember that we loaded only 1000 documents of each set, over a total of 4 million, so less than 0.1% !. From this we can readily anticipate that we will some accross memory issues when trying to handle all the documents. An important feature however is that many N-grams appear only once or a few times in the datasets. For example Tab. 6 shows the percentage of unique N-grams which impressively is above 60% for unigrams and up to more than 95% for trigrams! [8] By discarding N-grams appearing only once we will strongly reduce the storage requirement, however with such high percentage of unique features we will also drastically decrease the vocabulary coverage. From these results it appears that out sample datasets may be too small to draw clear conclusion and optimize this aspect of the processing.

**Table 5.** Number of unique unigrams, bigrams and trigrams in each sample dataset. The last column is the size in Mb requiered to store the three N-gram tables with their count in a R dataframe.
file	unigrams	bigrams	trigrams	storage
E:/Capstone Data/raw/en_US.blogs.txt	8654	25178	28766	3.8
E:/Capstone Data/raw/en_US.news.txt	8213	21325	21925	3.4
E:/Capstone Data/raw/en_US.twitter.txt	3976	8267	7572	1.4

**Table 6.** Percentage of unigrams, bigrams and trigrams appearing only once.
file	unigrams	bigrams	trigrams
E:/Capstone Data/raw/en_US.blogs.txt	61.5	85.8	96.6
E:/Capstone Data/raw/en_US.news.txt	61.7	88.7	97.6
E:/Capstone Data/raw/en_US.twitter.txt	69.4	90.2	98.3

e. conclusion

From the present exploratory analysis we can draw the following conclusions:

The different datasets have distinct features, both in term of number of words per sentence, reflecting syntax complexity, and in term of vocabulary (differences observed for the most frequent N-grams). From this one can expect variations in the N-gram language model and prediction accuracy depending on which datasets are used for training and which are used for testing.
Although we have not investigated this aspect in details, we expect the twitter dataset to be more ‘dirty’ in the sence that it might contains more abbreviations and misspelled words, compared to blog or news articles. This can be for observed indirectly by for example noting the higher percentages of unique N-grams in Tab. 6.
It is important to keep in mind that, because the number of sentences per document, and words per sentence are different among datasets, a language model mixing the datasets on the basis of a fixed number of documents will be biased (i.e. more influenced by the news/blog datasets compared to tweets). If we come to build a model mixing the datasets, it may be usefull to use the present statistics to recompute the number of sampled documents, so that the merged dataset is balanced.
The frequency analysis of N-gram distribution appears globally in agreement with Zipf’s law, although the obseved exponent in Fig. 3 is closer to -2 while -1 would be expected for an english corpus. Besides, the number of unique N-grams observed in our sample datasets (Tab. 6) appear relatively high and at this point does not enable to draw clear conclusions regarding an eventual filtering of N-grams based on their counts to achieve a better tradeoff between memory usage and vocabulary coverage. I plan to do this analysis again when I will be able to process more data.
In this analysis we have fixed the number of input documents and thereby have not investigated the impact of the sample size. This might be an important issue for the following of the project in order to decide what is an acceptable number of documents to process to cover a sufficiently large vocabulary. This could be decided by for example plotting the number of unique N-grams as a function of the number of document processed (outcome is known as Heaps’ law).

Part II. Application development plans.

The exploratory analysis performed in part I gave me a good overview of the data and enabled to pinpoint some potentially critical aspect of the project linked to (i) the intrinsic differences between the datasets, (ii) the necessity to explore external tools to efficiently tokenize more data and (iii) potential memory / storage issues. Alongside with this exploratory analysis I spent time studying language models theory in order to get a better understanding of the architecture of the application we are to build and define a progressive implementation (starting from a simple predictor and going to more advanced models depending on the time available).

From my actual understanding of the project I anticipateThe the final application to be built ona three-tier architecture with (i) N-gram look-up tables with counts/probabilities loaded in memory or stored on disk depending on their size (ii) The prediction algorithm and (iii) the user interface.

Below I define and motivate the main upcoming steps:

Implementation of a prediction algorithm based on Maximum Likelihood estimates and simple Back-off strategy. This consists in predicting the most frequent N-gram whose N-1 first words match the input sentence. If the input sentence is not matched in the N-gram table, we back of to a lower and ultimatly predict the most frequent unigram. To implement the prediction algorithm I plan to use a “toy corpus” containing only a few sentences and which will allow to check the calculations.
Use a SQL database to store the N-gram tables and their count. This anticipates memory usage limits.
Develop a first version of the Shiny application to validate the whole process (from data processing to the user interface)
Setup a Validation procedure by splitting the input data into training/testing sets and estimating the prediction efficiency (a most common measure for word prediction accuracy is the so-called perplexity).
Investigate alternative ways to process the input data in order to increase the vocabulary coverage. Languages like Python or C++ are known to be more efficient for text processing. If the time allows it, I plan to implement the task with Apache Spark, which is a highly efficient distributed computing framework allowing programming with either R, Python, Scala or Julia. Using the SparkR module would actually be the most straightforward way to scale up my pre-processing R code to tokenize the whole dataset.
Implement a prediction algorithm based on Kneyser-Ney Smoothing.
Increase storage capacity by compressing N-gram tables, using either SQL trees or encoding.

The order of steps 5 to 7 may change depending on which tasks appear easier to implement and more valuable to improve the prediction.

References and notes

[1] http://swiftkey.com
[2] J. Klund and M. Novak, “If Word Prediction Can Help, Which Program Do You Choose?”, available at http://trace.wisc.edu
[3] A. Newell, J. Arnott, L. Booth, W. Beattie, B. Brophy and I. Ricketts. “Effect of the”pal" word prediction system on the quality and quantity of text generation.“, Augmentative and Alternative Communication, 8(4), 304–311, 1992
[4] J. R. Norris, Cambridge University Press, 1998
[5] P. Sobkowicz1, M. Thelwall2, K. Buckley, G. Paltoglou and A. Sobkowicz3, Lognormal distributions of user post lengths in Internet discussions - a consequence of the Weber-Fechner law? EPJ Data Science 2013, 2:2
[6] Zipf G. K. “Human Behaviour and the Principles of Least Effort”, Addison-Wesley, 1949.
[7] Ramon Ferrer i Cancho , Ricard V. Solé “Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf’s Law Revisited”, J. Quant. Linguistics, Vol. 8, Iss. 3, 2001
[8] These values are slightly higher compared to what one would expect from the litterature. Same apply for the power law observed previously which was closer to -2 while -1 is expected for Zipf law. These deviations maybe be due either to the small size of our sample datasets, and/or to the fact that we do not perform lowercasing or punctuation removal (which increases the number of unique tokens)

Milestone Report

Renaud Dufour

July 25, 2015