Summary

This report aims to show and explain the main features of the data that will be used to build a next-word prediction algorithm. The data were taken from a corpus called HC Corpora and contain English text from social media. All information was anonymized.

First look at the data

As the first and most important part of processing the data, all sources were tokenized. Here a “token” means an English word consisting only of alphabetic characters. It could be useful to also handle numbers, abbreviations and compound words, but this report focuses only on the core features of the data, which keeps the analysis simple. The data were gathered from three sources: Twitter, blog posts and news articles. Since these sources differ in form and content, it is natural to compare them side by side:

Source    Size (MB)   Lines       Tokens       Mean tokens per line   Vocabulary size
Twitter   159.4       2,360,148   29,574,899   12.53                  302,628
Blogs     200.4         899,288   37,177,343   41.34                  253,018
News      196.3       1,010,242   33,828,027   33.49                  212,203

The sources look fairly similar in their basic summaries, except for the mean number of tokens per line and the size of the vocabulary. Each line in the files is a single post, so the Twitter file has far fewer words per entry than the news and blog files, which sounds reasonable. An interesting fact is that although the blogs data set is bigger in terms of tokens than the Twitter one, its vocabulary is noticeably smaller. This may be explained by the wider use of slang, typos and abbreviations by Twitter users.
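To make the tokenization step described above concrete, here is a minimal sketch in R (the function name, variable names and the exact regular expression are illustrative assumptions, not the code actually used to build the table):

    # Minimal tokenization sketch: keep only alphabetic "words", lower-cased.
    tokenize <- function(lines) {
      lines <- tolower(lines)                        # normalise case
      tokens <- unlist(strsplit(lines, "[^a-z]+"))   # split on non-alphabetic runs
      tokens[tokens != ""]                           # drop empty strings
    }

    tokens <- tokenize(c("I love pizza!", "Next word prediction, soon..."))
    length(tokens)           # count of tokens
    length(unique(tokens))   # size of vocabulary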

Statistics

In order to build a word prediction algorithm one needs to know the probabilities of single words, bigrams and trigrams that may occur in a text. These probabilities can be estimated from the frequencies of those n-grams in the training data sets. Below are visualisations of the frequencies for the three sources, shown on a logarithmic scale:

Single words (or unigrams)

Bigrams

Trigrams

The plots show that even on a log scale there is a huge gap between the number of words that occur in the texts only once and all words with larger frequencies. This is especially noticeable in the case of Twitter, which is further evidence of the diverse vocabulary of Twitter users. The longer the n-grams are, the smaller the differences between the plots for the different sources become, while the gap between the number of unique n-grams and all others grows. This is related to the fact that in a simplified language model, where the occurrence of each word in a text is an independent event, the probability of an n-gram equals the product of the probabilities of its words, so longer n-grams are far less likely to be repeated.
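For illustration, the frequency estimation described above can be sketched in R as follows (continuing with the hypothetical tokens vector from the previous sketch; the real implementation may rely on dedicated text-mining packages):

    # Count n-grams of length n in a token vector and sort by frequency.
    count_ngrams <- function(tokens, n) {
      idx <- seq_len(length(tokens) - n + 1)
      ngrams <- sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
      sort(table(ngrams), decreasing = TRUE)
    }

    unigrams <- count_ngrams(tokens, 1)
    bigrams  <- count_ngrams(tokens, 2)
    trigrams <- count_ngrams(tokens, 3)
    head(trigrams)   # raw counts; normalising them gives probability estimates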

Prediction algorithm and application

The main idea of the algorithm is based on the concept of Markov chains: each sequence of words in a text has its own probability of occurrence. If these probabilities are estimated, one can predict the next word in a phrase by picking the most probable sequence that starts with the last several words of that phrase. For example, if a phrase starts with “I love” and the algorithm knows that the sequence “I love pizza” is more probable than the sequence “I love greens”, it will suggest the word “pizza” as its prediction.
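A minimal sketch of this lookup, assuming a trigram frequency table like the one built above (the function and variable names are hypothetical):

    # Pick the most frequent trigram starting with the given two-word prefix
    # and return its last word as the prediction.
    predict_next <- function(prefix, trigrams) {
      candidates <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
      if (length(candidates) == 0) return(NA_character_)
      best <- names(candidates)[which.max(candidates)]
      tail(strsplit(best, " ")[[1]], 1)
    }

    predict_next("i love", trigrams)   # e.g. "pizza" if that trigram is the most frequent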

Accordingly, a simple version of the algorithm estimates the probabilities of n-grams by their frequencies in the corpora and then uses this estimation to predict the next word in a phrase entered by the user. The first step of this approach will be performed beforehand, and the final application will use a precalculated dictionary of the most probable n-grams, which will speed it up. A lightweight user interface will be implemented as a Shiny application.
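As a rough illustration of the planned interface (a sketch only, assuming the hypothetical predict_next helper and a precalculated trigrams table from the previous sketches are available in the app):

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Type a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        # take the last two words of the entered phrase and look them up
        words <- tail(strsplit(tolower(input$phrase), "[^a-z]+")[[1]], 2)
        predict_next(paste(words, collapse = " "), trigrams)
      })
    }

    shinyApp(ui, server)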

From a business point of view, this algorithm may be used in text editors and on mobile phones to enhance the user experience.

For the future

There are many enhancement opportunities that could make the application more accurate, smart and user-friendly. Here are some of them: