Introduction

Globally, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Much effort has been put into developing smart keyboards that make it easier for people to type on their mobile devices. One major feature of a smart keyboard is predicting the next word typed in order to make input more efficient. This can be achieved by developing a predictive text model. Word prediction is also an essential subtask of speech recognition, handwriting recognition, augmentative communication for the disabled, and spelling error detection. An important step in building a predictive model is looking at the previous words, which give important insight into what the next ones are going to be.

This report focuses on the first step towards building a predictive model: the exploratory analysis of a text corpus of tweets, news feeds, and blog entries. The results of this analysis will assist in the construction of the final predictive model.

Downloading the data

The data for the Capstone Project were downloaded from the course website (original site: http://www.corpora.heliohost.org). The dataset contains text files of news and blog entries as well as tweets in four different languages: English, German, Russian and Finnish. We opted to use the English version for our project. Three text files (Twitter, news, and blogs) were used, with each line of a file representing one document.
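
A minimal sketch of reading the three English files into R, one document (line) per element; the file paths are illustrative and depend on where the archive was unzipped:

    # Read the three English text files line by line (one document per line).
    # The file paths are illustrative.
    twitter <- readLines("data/final/en_US/en_US.twitter.txt",
                         encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("data/final/en_US/en_US.news.txt",
                         encoding = "UTF-8", skipNul = TRUE)
    blogs   <- readLines("data/final/en_US/en_US.blogs.txt",
                         encoding = "UTF-8", skipNul = TRUE)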

First exploratory analysis

Raw statistics of the US corpus

As a first step of our analysis we extract some basic but useful information. Table 1 shows the following statistics:

Table 1: Basic statistics of the US corpora

  Source    Lines        Characters     Max. characters per line   Structure
  Twitter   2,360,148    162,096,031    213                        Short sentences
  News      1,010,242    203,223,159    40,835                     Contains paragraphs with multiple sentences per article
  Blogs     899,288      206,824,505    11,384                     Contains paragraphs with multiple sentences per entry
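
The figures in Table 1 can be reproduced with a few base R calls, using the character vectors from the reading sketch above:

    # Basic per-source statistics: number of lines, total characters,
    # and the longest line in characters.
    corpus_stats <- function(x) {
      c(lines = length(x),
        characters = sum(nchar(x)),
        max_chars_per_line = max(nchar(x)))
    }
    rbind(Twitter = corpus_stats(twitter),
          News    = corpus_stats(news),
          Blogs   = corpus_stats(blogs))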

The center of the analysis is the corpus. Statistical processing of natural language is usually based on corpora (singular: corpus), which are collections of text and speech. For this project we use the corpus data structure provided by the text-mining package tm. The corpus data is loaded fully into memory.

Since the text documents are rather large, it would not be feasible to load the whole content of the files into the Corpus function. Instead, we create a sample of the data via random sampling, drawing from each text a sample of approximately 2 million characters. Since this is roughly 10% of the total number of characters in each file, the sample should be sufficiently representative.
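
A sketch of the sampling step, assuming the full files are loaded as character vectors as above (the 2-million-character target is approximate):

    # Draw a random sample of lines from a text source until roughly
    # `target_chars` characters are collected (~10% of each file).
    set.seed(1234)                       # for reproducibility
    sample_lines <- function(x, target_chars = 2e6) {
      idx  <- sample(seq_along(x))       # random order of line indices
      keep <- idx[cumsum(nchar(x[idx])) <= target_chars]
      x[keep]
    }
    twitter_sample <- sample_lines(twitter)
    news_sample    <- sample_lines(news)
    blogs_sample   <- sample_lines(blogs)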

Cleaning the data and creating the corpus

Before we continue to build the corpus structure, we conduct a preliminary data cleaning. The Twitter text body in particular is likely to contain many character sequences that are not desirable for our corpora. Therefore, we apply a custom function that uses regular expressions to filter out things like ellipses, escapes, URLs, control characters, Twitter tags and user names, email addresses, repeated characters (more than 2 times), repeated words, non-ASCII characters and emoticons.
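
The custom cleaning function is essentially a chain of gsub() calls; a simplified sketch (the patterns shown are illustrative, the full function covers more cases):

    # Simplified pre-cleaning with regular expressions.
    pre_clean <- function(x) {
      x <- gsub("http[s]?://\\S+", " ", x)               # URLs
      x <- gsub("[@#]\\S+", " ", x)                      # Twitter handles and hashtags
      x <- gsub("\\S+@\\S+", " ", x)                     # email addresses
      x <- gsub("[[:cntrl:]]", " ", x)                   # control characters
      x <- gsub("(.)\\1{2,}", "\\1\\1", x, perl = TRUE)  # characters repeated > 2 times
      x <- iconv(x, "UTF-8", "ASCII", sub = " ")         # non-ASCII chars, most emoticons
      x
    }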

In the second step we apply the built-in functions of the tm package to normalize the text and split it into tokens, i.e. words separated by whitespace. The individual transformations are listed below; a code sketch follows the list.

  • Case folding: Convert all of the text to lowercase, so that different capitalizations of the same word are treated as a single type.

  • Removing punctuation: We remove all punctuation from the corpus. This is a common procedure to avoid treating the same word with different adjacent punctuation as distinct types.

  • Removing numbers: One can remove all the numbers from a given corpus. Specific numbers in text are not comparable, since there is no context to decide whether a number is being used in the same fashion from one instance to another.

  • Removing stop words: The idea is to remove all the short English words that bear no additional meaning to an analysis. Stop words are words like “the” and “and”. If one is purely interested in analyzing the meaning of a text, they add no additional value.

  • Removing whitespace: Stripping extra whitespace is not essential for the analysis within the tm package, but it cleans up one's intermediate results for better presentation.

  • Removing profanity: Swear words should be removed completely, since they do not help the prediction, are offensive, and are not part of proper English. A possible source for a list of profanity words: https://code.google.com/p/badwordslist/downloads/detail?name=badwords.txt.

  • Removing other words: Besides profanity, one may want to exclude further special words or phrases such as company or institutional names, brands, or measurement units (e.g. kg, g, km).

  • Word stemming: We can reduce the corpus to word stems. A word stem is the base or root of a word, regardless of its current inflection or usage. For example, the words "wait", "waiting", "waits", and "waited" all have the same stem: "wait".
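
A sketch of how these transformations can be chained with tm, assuming the sampled and pre-cleaned text from the previous steps and a character vector badwords holding the profanity list:

    library(tm)

    # Build the corpus from the sampled, pre-cleaned text and apply the
    # transformations described above; `badwords` is assumed to be a
    # character vector with the profanity list.
    corpus <- VCorpus(VectorSource(pre_clean(c(twitter_sample, news_sample, blogs_sample))))
    corpus <- tm_map(corpus, content_transformer(tolower))      # case folding
    corpus <- tm_map(corpus, removePunctuation)                 # punctuation
    corpus <- tm_map(corpus, removeNumbers)                     # numbers
    corpus <- tm_map(corpus, removeWords, stopwords("english")) # stop words
    corpus <- tm_map(corpus, removeWords, badwords)             # profanity
    corpus <- tm_map(corpus, stripWhitespace)                   # whitespace
    corpus <- tm_map(corpus, stemDocument)                      # word stemming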

Not all of the above cleaning operations have to be applied; it depends on the intended use of the corpus. For the exploratory analysis, we include the removal of stop words and the word stemming, since we are interested in the differences between the three text bodies (Twitter, news and blogs) and it is therefore more meaningful to normalize the text as much as possible. However, when building the prediction model, we will keep stop words and not apply word stemming.

Analysis of the three documents used for the corpus

We adopt the following notation: types are the distinct words in a corpus (their number is the size of the vocabulary), and tokens are the total number of running words in a corpus.
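
Types and tokens can be counted from a term-document matrix of the cleaned corpus; a small sketch:

    # Term-document matrix of the cleaned corpus; rows are types (distinct
    # words), the row sums are token counts.
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
    n_types  <- length(freq)   # vocabulary size
    n_tokens <- sum(freq)      # total number of running words
    head(freq, 20)             # the 20 most frequent types (figure 1)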

In figure 1 we show the frequency of types across the corpus. It shows that all three texts (Twitter, news, and blogs) have a different distribution of at least the top 20 most frequent types. We will later have a look at the frequency of word combinations (n-grams).

Figure 1: Top 20 word (type) frequency distribution across the corpus

Quantitative analysis

In this chapter, we conduct a quantitative analysis focusing on the character distribution across the corpus. We are particularly interested in how similar or different the three document sources (Twitter, news, and blogs) are on a normalized level. Since we will use the corpus to construct a model for next-word prediction, it is important that the text sources are diverse enough and do not show unwanted biases.

Firstly, figure 2a shows the word (type) length distribution across the corpus. Although the maximum number of characters per line differs strongly between the three text sources (see Table 1), the general distribution of word lengths is very similar and approximately normal. Figure 2b shows the letter (alphabet) frequency across the corpus; all three text bodies show a distribution typical of English text. Lastly, figure 2c shows the distribution of letter positions within words (types) across the corpus as a heat map.
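
The distributions in figures 2a and 2b can be derived directly from the vocabulary; a brief sketch using the freq vector from above:

    # Word (type) length distribution and letter frequency across the vocabulary.
    types       <- names(freq)
    length_dist <- table(nchar(types))                          # figure 2a
    letters_all <- unlist(strsplit(types, ""))
    letter_freq <- sort(table(letters_all), decreasing = TRUE)  # figure 2b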

Figure 2a: Word (type) length distribution across the corpus

Figure 2b: Alphabet frequency across the corpus

Figure 2c: Distribution of letters positioned within words (types) across the corpus.

N-gram analysis of corpus

The above analysis has shown that although the text sources (Twitter, news, and blogs) have different document structures and cover different topics, they show close similarities at the token level once normalized. In order to build a predictive model for word prediction, one needs to consider n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. For our analysis we only consider unigrams (1-gram, one token/type), bigrams (2-gram, two tokens/types), trigrams (3-gram, three tokens/types), and quadgrams (4-gram, four tokens/types).

We use the NGramTokenizer() function of the RWeka package to do the hard work, tokenizing the corpus into unigrams, bigrams, trigrams, and quadgrams. By calculating the frequency of each n-gram within the corpus and the three different text bodies, we obtain the results presented in figures 3a - 3d. We find that although the frequency of types (see figure 1) is very similar across the text bodies (Twitter, news, and blogs), the n-gram distributions are not: each text source contributes different n-grams to the corpus. This is important for the model creation, since the corpus should be as representative and diverse as possible.
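
A sketch of the n-gram tokenization with RWeka, shown here for bigrams (trigrams and quadgrams only change the min and max settings):

    library(RWeka)

    # Tokenizer producing bigrams; for trigrams or quadgrams set min/max to 3 or 4.
    BigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2))
    }
    tdm_bigram  <- TermDocumentMatrix(corpus,
                                      control = list(tokenize = BigramTokenizer))
    bigram_freq <- sort(slam::row_sums(tdm_bigram), decreasing = TRUE)
    head(bigram_freq, 20)   # the most frequent bigrams (figure 3b)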

Figure 3a: Unigram frequency across the corpus

Figure 3b: Bigram frequency across the corpus

Figure 3c: Trigram frequency across the corpus

Figure 3d: Quadgram frequency across the corpus

Word coverage and minimal word representation

A major concern in building a predictive model is the potential size of the prediction matrix and the related performance and memory problems. It is therefore important to find a minimal representation of the corpus and use it as the basis of the model. A fruitful way is to look at the so-called Zipf's law: given some corpus, the frequency of any word is inversely proportional to its rank in the frequency table (f ∝ 1/r). Figure 5a shows that the whole corpus fulfills Zipf's law nicely with respect to the distribution of word types.

Figure 5a: Confirming Zipf's law.

In fact, looking at figure 5b, we can estimate the minimal word cover for 50% (red lines) and 90% (green lines) of the corpus. It turns out that one needs ca. 1% of the corpus vocabulary to cover 50% of the corpus and ca. 25% to cover 90%. We will utilize this fact in building the model.
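
The coverage estimates can be computed directly from the sorted type-frequency vector freq from the earlier analysis; a short sketch:

    # Fraction of the vocabulary needed to cover a given share of all tokens.
    coverage <- function(freq, share = 0.5) {
      cum <- cumsum(freq) / sum(freq)          # cumulative token coverage
      which(cum >= share)[1] / length(freq)    # fraction of types required
    }
    coverage(freq, 0.5)   # ~1% of the vocabulary covers 50% of the tokens
    coverage(freq, 0.9)   # ~25% of the vocabulary covers 90% of the tokens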

Figure 5b: Approximate word cover for 50% of the corpus sample (red lines) and for 90% of the corpus sample (green lines)

Training the model - backoff and smoothing

A first approach to word prediction is the backoff method. Backoff n-gram modeling was introduced by Katz (1987). In this model an n-gram model is based on an (n-1)-gram model: if one has non-zero trigram counts, one relies solely on the trigram counts and does not interpolate the bigram and unigram counts. One only 'backs off' to a lower-order n-gram if one has no evidence for a higher-order n-gram. We applied this method for quiz 2 of the course project. The simplest backoff model looks up all n-grams whose first (n-1) terms match the typed input, orders them by probability, and shows the first few candidate words as the prediction. If no matching n-gram is found, the (n-1)-gram model is used, and so on until the unigram model is reached, which simply returns the most common words in the corpus. The key principle behind this model is the Markov process or Markov chain: the next state (word) depends only on a finite history, so n-grams are a representation of Markov chains. For our further study, we will consider n-gram models up to size 4.
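
A minimal sketch of such a backoff lookup, assuming the n-gram frequency tables have been converted to data frames quadgrams, trigrams and bigrams with columns prefix, word and count (and unigrams with word and count); these names are illustrative:

    # Simple backoff lookup: try the highest-order table first and fall back
    # to lower orders when the prefix has not been seen.
    # `quadgrams`, `trigrams`, `bigrams`, `unigrams` are assumed data frames.
    predict_next <- function(input, n = 3) {
      words  <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)
      tables <- list(quadgrams, trigrams, bigrams)
      for (i in seq_along(tables)) {
        prefix <- paste(tail(words, 4 - i), collapse = " ")
        hits   <- tables[[i]][tables[[i]]$prefix == prefix, ]
        if (nrow(hits) > 0) {
          return(head(hits[order(-hits$count), "word"], n))
        }
      }
      head(unigrams[order(-unigrams$count), "word"], n)  # back off to unigrams
    }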

Making the model better

Although the simple backoff algorithm is quite effective, we need to further refine the model. Here are some possible ways to do so:

  • Find a suitable model reduction
  • Add smoothing, e.g. Add-One smoothing or Witten-Bell discounting (see the sketch after this list)
  • Use deleted interpolation
  • Combine backoff with discounting
  • Test the performance of the model, e.g. via the Shannon test or by comparing perplexity
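
As an illustration of the smoothing step, a sketch of Add-One (Laplace) smoothing for bigram probabilities, assuming named count vectors bigram_count and unigram_count and the vocabulary size V:

    # Add-One (Laplace) smoothed bigram probability:
    # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
    # `bigram_count` and `unigram_count` are assumed named numeric vectors,
    # V is the vocabulary size (number of types).
    add_one_prob <- function(w1, w2, bigram_count, unigram_count, V) {
      c12 <- bigram_count[paste(w1, w2)]
      c1  <- unigram_count[w1]
      c12 <- ifelse(is.na(c12), 0, c12)   # unseen bigram
      c1  <- ifelse(is.na(c1), 0, c1)     # unseen unigram
      unname((c12 + 1) / (c1 + V))
    }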

Developing the Shiny app

Once we have constructed and chosen the best performing model, we will construct a Shiny app to predict the next word. The application should be easy to use, fast to load, and should give a reasonable selection of possible next words. The challenge will be to optimize memory usage while maintaining the desired responsiveness of the application; we expect performance to be a key factor.