Executive summary

The context of this report is the development of an application able to predict (propose options for) the next word, based on the previous three words typed.

The purpose of this report is to prepare the modeling work with an exploratory analysis of the provided data.

The data are records from three different sources: blogs, news, and Twitter.

The report shows basic summaries of the three data sets provided. These are obtained after tokenizing the sentences of the texts into N-tuples of the form word1 word2 word3 … wordN, called N-grams. These N-grams will be used for the prediction.

It also gives information on processing times and memory sizes.

Finally, the planned next steps are detailed.

Data summary

Data summary results

Data set   Documents   Max words per document   Total words   Mean words per document
Blog       899388      40833                    37334131      41.51
News       1010242     11384                    34372530      34.02
Twitter    2360148     140                      30373543      12.87

As we can see, the three data sets have a similar total number of words, although the Twitter data set has many more documents, each of them much shorter.
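These statistics can be obtained with base R. Below is a minimal sketch for one data set, assuming the raw file is named en_US.blogs.txt (the actual file names are not given in this report):

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)  # one document per line
words_per_doc <- sapply(strsplit(blogs, "\\s+"), length)                   # word count of each document
c(documents   = length(blogs),
  max_words   = max(words_per_doc),
  total_words = sum(words_per_doc),
  mean_words  = round(mean(words_per_doc), 2))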

Data sampling

For the further calculations of this exploratory analysis, we use smaller data sets in order to limit computation time.

We randomly sample 10,000 records from each data set.
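A minimal sketch of the sampling step, reusing the blogs vector from the sketch above; the seed is arbitrary, set only for reproducibility:

set.seed(1234)                               # arbitrary seed, for reproducibility
blogs_sample <- sample(blogs, size = 10000)  # 10,000 randomly chosen documents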

Calculation times

During the computations above, we tracked the elapsed times.

The results, in seconds, are shown below.

Data set   Loading   Longest document   Number of words   Sampling
Blog       24.063    72.357             21.573            4.571
News       33.722    44.426             19.429            6.553
Twitter    66.124    56.136             88.097            11.712

As we can see, the computation times for the per-document statistics (number of characters, number of words) increase with the number of documents; the other times are similar across data sets.
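How these timings were collected is not shown in the report; one common way is the system.time pattern sketched below for the loading step (the other steps are timed the same way):

t_load <- system.time(
  blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
)["elapsed"]                                 # elapsed time, in seconds
t_load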

Data processing

We eliminate from the corpus the non-significant “words”, such as numbers, URLs, hashtags, and special characters.

We keep in the corpus the word delimiters ( \r\n\t.,;:“()?!), which are used for tokenization, and the single quotes (’) so that words such as i’m and don’t are preserved. We do not discard stopwords in this particular case, as they are important for the prediction.

We can also discard words not contained in a dictionary (to exclude fragments of foreign languages), as well as profanity. The dictionary is a list in an external text file, with every variant of each word (e.g. show, shows, shown). The profanity words are also listed in an external text file.
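A minimal sketch of this cleaning with the tm package, assuming a corpus object named corp; the regular expressions are only illustrative, and the file name profanity.txt is an assumption:

library(tm)
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, to_space, "http[^[:space:]]+")      # URLs
corp <- tm_map(corp, to_space, "#[^[:space:]]+")         # hashtags
corp <- tm_map(corp, to_space, "[0-9]+")                 # numbers
corp <- tm_map(corp, to_space, "[^[:alnum:]' .,;:()?!]") # other special characters
profanity <- readLines("profanity.txt")                  # assumed external word list
corp <- tm_map(corp, removeWords, profanity)             # profanity filtering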

Natural language processing

The approach is to use a Markov chain model based on n-grams.

We set up three corpora (Blog, News, Twitter), each containing 10,000 documents sampled from the original data.

The main packages used for this are tm, RWeka and slam.

Further tests may lead us to use the ngram package (first released in 2014), which is meant to be faster at tokenization.
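A minimal sketch of the 3-gram tokenization with these packages, reusing the cleaned corpus corp from the previous sketch; row_sums (from slam) totals the sparse counts without densifying the matrix:

library(RWeka)
library(slam)
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3  <- TermDocumentMatrix(corp, control = list(tokenize = trigram_tok))  # sparse 3-gram counts
freq3 <- sort(row_sums(tdm3), decreasing = TRUE)                           # total frequency of each 3-gram
head(freq3, 10)                                                            # most frequent 3-grams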

Data plots and object sizes

As mentioned above, the following results were computed with the sampled data (10,000 records per data set).

For each case (value of n, data set), we first give the size of the n-gram table, in bytes. As you will see, for a sample size of 10,000, the memory sizes of the n-gram sets vary from about 1 MB to 32 MB. We will likely have to shrink these sets (discard n-grams with very low frequencies) in order to keep the RAM used under the 1 GB threshold, as the full data sets have nearly 100 times more records.

We then plot the most frequent n-grams for every data set, for n = 1, 2, 3, 4; we need a 4-gram model in order to predict the next word from the (at most) 3 previous words.

Finally, we plot a word cloud of the most frequent n-grams for every data set, for n = 1, 2.

In these plots, we notice that the distribution of n-gram frequencies is more balanced for Twitter.
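A minimal sketch of how the sizes and plots below can be produced, reusing tdm3 and freq3 from the previous sketch; the wordcloud package and the barplot call are assumptions, as the report does not state which plotting functions were used:

object.size(tdm3)                                # size of the n-gram table, in bytes
top <- head(freq3, 20)
barplot(top, las = 2, cex.names = 0.7,
        main = "Most frequent 3-grams")          # frequency histogram
library(wordcloud)
wordcloud(names(freq3), freq3, max.words = 50)   # word cloud of the most frequent 3-grams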

Most frequent n-grams: histograms

1-grams

Data set   Size (bytes)
Blog       2203576
News       2150208
Twitter    1033888

2-grams

Data set   Size (bytes)
Blog       14851272
News       13485000
Twitter    5003032

3-grams

Data set   Size (bytes)
Blog       26851960
News       22077736
Twitter    7356320

4-grams

Data set   Size (bytes)
Blog       31904864
News       24755848
Twitter    7856920

Most frequent n-grams: word clouds

1-grams - Blog

1-grams - News

1-grams - Twitter

2-grams - Blog

2-grams - News

2-grams - Twitter

Prediction algorithm preview

Overview

In order to fit a model, we will need to consider the following points:

  • merge the data from the 3 sources into a single corpus (the results of the exploratory analysis being quite different across sources, merging should improve the coverage of the n-gram sets)
  • split the corpus into several sets, at least one for training the model and another to estimate its accuracy by testing on data not used for training
  • manage the zero-probability words (options are: using an external dictionary / treating words below a frequency threshold as unknown / add-one smoothing)
  • choose a backoff model / smoothing strategy (options include a linear combination of n-gram predictors, etc.; a simple backoff sketch is given after this list)
  • balance prediction accuracy against computation time (goal: a couple of seconds per prediction)
  • keep the RAM used by the application under 1 GB, so that it can run on an R Shiny server with a free account
  • once an optimized model has been fitted, compute the final n-gram sets (n = 1, 2, 3, 4) and save them as text files to be used by the prediction application
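As mentioned in the backoff point above, one simple option is a stupid-backoff style fallback from the longest matching prefix to shorter ones. The sketch below assumes data.table objects with columns prefix, word and freq; these names, and the data.table representation itself, are assumptions rather than the final design:

library(data.table)

# tables[[3]] holds the 4-grams split into (3-word prefix, next word, frequency),
# tables[[2]] the 3-grams, tables[[1]] the 2-grams
predict_backoff <- function(words, tables, k = 5) {
  for (n in length(tables):1) {
    p    <- paste(tail(words, n), collapse = " ")   # last n typed words
    hits <- tables[[n]][prefix == p]                # candidate continuations
    if (nrow(hits) > 0)
      return(head(hits[order(-freq)], k))           # most frequent continuations first
  }
  data.table(prefix = character(), word = character(), freq = integer())  # no match at any level
}

# hypothetical call: predict_backoff(c("i", "don't"), tables)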

Simple example of prediction

We use the 3-gram dictionary to compute a simple prediction (no smoothing, no zero-probability management, no backoff).

Data set   Size (bytes)
All        55597272

We decompose the 3-grams into 2-gram / 1-gram pairs. The prediction function looks up the typed 2-gram in the first column and returns the corresponding records of the 1-gram column, ordered by the frequency of the original 3-grams (before the split into 2-gram / 1-gram pairs).
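The actual lookup code is not shown here; below is a minimal sketch of this decomposition, assuming the 3-gram frequency vector freq3 from the earlier sketch (the output that follows comes from the report's own computation):

library(data.table)
parts  <- strsplit(names(freq3), " ")                                   # split each "w1 w2 w3"
grams3 <- data.table(
  grams.2 = sapply(parts, function(w) paste(w[1:2], collapse = " ")),   # 2-gram prefix
  grams.1 = sapply(parts, function(w) w[3]),                            # following word
  freq    = as.integer(freq3)
)
setkey(grams3, grams.2)

predict_next <- function(input) grams3[grams.2 == input][order(-freq)][1:10]  # top 10 continuations
predict_next("i don't")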

## [1] "Find the next word for input : i don't"
## [1] "First predictions : "
##     grams.2 grams.1 freq
##  1: i don't    know   72
##  2: i don't   think   52
##  3: i don't    have   32
##  4: i don't    want   24
##  5: i don't    like   21
##  6: i don't  really    9
##  7: i don't     see    9
##  8: i don't    feel    8
##  9: i don't     get    8
## 10: i don't    need    7
## [1] "Time for predicting : 1.228"

Even for this simple example, based on sampled data, the prediction takes more than 1 second.

This highlights the need for optimization in the next steps.