Executive summary

The context of this report is the development of an application able to predict (propose options for) the next word, based on the previous three words typed.

The purpose of this report is to prepare the modeling work with an exploratory analysis of the provided data.

The data are records from three different sources: blogs, news, and Twitter.

The report shows basic summaries of the three data sets provided. These are obtained after tokenizing the sentences of the texts into N-tuples of the form word1 word2 word3 … wordN, called N-grams. These N-grams will be used for the prediction.

It also gives information on processing times and memory sizes.

Finally, the planned next steps are detailed.

Data summary

Data summary results

Data set   Documents   Max words per document   Total words   Mean words per document
Blog       899388      40833                    37334131      41.51
News       1010242     11384                    34372530      34.02
Twitter    2360148     140                      30373543      12.87

As we can see, the three data sets have a similar total number of words, although the Twitter data set has many more documents, each of them much shorter.
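These statistics can be obtained with base R. Below is a minimal sketch for one data set, assuming the raw file is named en_US.blogs.txt (the actual file names are not given in this report):

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)  # one document per line
words_per_doc <- sapply(strsplit(blogs, "\\s+"), length)                   # word count of each document
c(documents   = length(blogs),
  max_words   = max(words_per_doc),
  total_words = sum(words_per_doc),
  mean_words  = round(mean(words_per_doc), 2))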

Data sampling

For the further calculations of this exploratory analysis, we use smaller data sets in order to limit computation time.

We randomly sample 10,000 records from each data set.
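A minimal sketch of the sampling step, reusing the blogs vector from the sketch above; the seed is arbitrary, set only for reproducibility:

set.seed(1234)                               # arbitrary seed, for reproducibility
blogs_sample <- sample(blogs, size = 10000)  # 10,000 randomly chosen documents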

Calculation times

During the computations above, we tracked the elapsed times.

The results, in seconds, are shown below.

Data set   Loading   Longest document   Number of words   Sampling
Blog       24.063    72.357             21.573            4.571
News       33.722    44.426             19.429            6.553
Twitter    66.124    56.136             88.097            11.712

As we can see, the computation times for the per-document statistics (number of characters, number of words) increase with the number of documents; the other times are similar across data sets.
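How these timings were collected is not shown in the report; one common way is the system.time pattern sketched below for the loading step (the other steps are timed the same way):

t_load <- system.time(
  blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
)["elapsed"]                                 # elapsed time, in seconds
t_load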

Data processing

We eliminate from the corpus the non-significant “words”, such as numbers, URLs, hashtags, and special characters.

We keep in the corpus the word delimiters ( \r\n\t.,;:“()?!), which are used for tokenization, and the single quotes (’) so that words such as i’m and don’t are preserved. We do not discard stopwords in this particular case, as they are important for the prediction.

We can also discard words not contained in a dictionary (to exclude fragments of foreign languages), as well as profanity. The dictionary is a list in an external text file, with every variant of each word (e.g. show, shows, shown). The profanity words are also listed in an external text file.
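A minimal sketch of this cleaning with the tm package, assuming a corpus object named corp; the regular expressions are only illustrative, and the file name profanity.txt is an assumption:

library(tm)
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, to_space, "http[^[:space:]]+")      # URLs
corp <- tm_map(corp, to_space, "#[^[:space:]]+")         # hashtags
corp <- tm_map(corp, to_space, "[0-9]+")                 # numbers
corp <- tm_map(corp, to_space, "[^[:alnum:]' .,;:()?!]") # other special characters
profanity <- readLines("profanity.txt")                  # assumed external word list
corp <- tm_map(corp, removeWords, profanity)             # profanity filtering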

Natural language processing

The approach is to use a Markov chain model based on n-grams.

We set up three corpora (Blog, News, Twitter), each containing 10,000 documents sampled from the original data.

The main packages used for this are tm, RWeka and slam.

Further tests may lead us to use the ngram package (first released in 2014), which is meant to be faster at tokenization.
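A minimal sketch of the 3-gram tokenization with these packages, reusing the cleaned corpus corp from the previous sketch; row_sums (from slam) totals the sparse counts without densifying the matrix:

library(RWeka)
library(slam)
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3  <- TermDocumentMatrix(corp, control = list(tokenize = trigram_tok))  # sparse 3-gram counts
freq3 <- sort(row_sums(tdm3), decreasing = TRUE)                           # total frequency of each 3-gram
head(freq3, 10)                                                            # most frequent 3-grams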

Data plots and object sizes

As mentioned above, the following results were computed with the sampled data (10,000 records per data set).

For each case (value of n, data set), we first give the size of the n-gram table, in bytes. As you will see, for a sample size of 10,000, the memory sizes of the n-gram sets vary from about 1 MB to 32 MB. We will likely have to shrink these sets (discard n-grams with very low frequencies) in order to keep the RAM used under the 1 GB threshold, as the full data sets have nearly 100 times more records.

We then plot the most frequent n-grams for every data set, for n = 1, 2, 3, 4; we need a 4-gram model in order to predict the next word from the (at most) 3 previous words.

Finally, we plot a word cloud of the most frequent n-grams for every data set, for n = 1, 2.

In these plots, we notice that the distribution of n-gram frequencies is more balanced for Twitter.
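A minimal sketch of how the sizes and plots below can be produced, reusing tdm3 and freq3 from the previous sketch; the wordcloud package and the barplot call are assumptions, as the report does not state which plotting functions were used:

object.size(tdm3)                                # size of the n-gram table, in bytes
top <- head(freq3, 20)
barplot(top, las = 2, cex.names = 0.7,
        main = "Most frequent 3-grams")          # frequency histogram
library(wordcloud)
wordcloud(names(freq3), freq3, max.words = 50)   # word cloud of the most frequent 3-grams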

Most frequent n-grams: histograms

1-grams

Data set   Size (bytes)
Blog       2203576
News       2150208
Twitter    1033888

2-grams

Data set   Size (bytes)
Blog       14851272
News       13485000
Twitter    5003032

3-grams

Data set   Size (bytes)
Blog       26851960
News       22077736
Twitter    7356320

4-grams

Data set   Size (bytes)
Blog       31904864
News       24755848
Twitter    7856920

Most frequent n-grams: word clouds

1-grams - Blog

1-grams - News

1-grams - Twitter

2-grams - Blog

2-grams - News

2-grams - Twitter

Prediction algorithm preview

Overview

In order to fit a model, we will need to consider the following points:

  • merge the data from the 3 sources into a single corpus (the results of the exploratory analysis being quite different across sources, merging should improve the coverage of the n-gram sets)
  • split the corpus into several sets, at least one for training the model and another to estimate its accuracy by testing on data not used for training
  • manage the zero-probability words (options are: using an external dictionary / treating words below a frequency threshold as unknown / add-one smoothing)
  • choose a backoff model / smoothing strategy (options include a linear combination of n-gram predictors, etc.; a simple backoff sketch is given after this list)
  • balance prediction accuracy against computation time (goal: a couple of seconds per prediction)
  • keep the RAM used by the application under 1 GB, so that it can run on an R Shiny server with a free account
  • once an optimized model has been fitted, compute the final n-gram sets (n = 1, 2, 3, 4) and save them as text files to be used by the prediction application
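As mentioned in the backoff point above, one simple option is a stupid-backoff style fallback from the longest matching prefix to shorter ones. The sketch below assumes data.table objects with columns prefix, word and freq; these names, and the data.table representation itself, are assumptions rather than the final design:

library(data.table)

# tables[[3]] holds the 4-grams split into (3-word prefix, next word, frequency),
# tables[[2]] the 3-grams, tables[[1]] the 2-grams
predict_backoff <- function(words, tables, k = 5) {
  for (n in length(tables):1) {
    p    <- paste(tail(words, n), collapse = " ")   # last n typed words
    hits <- tables[[n]][prefix == p]                # candidate continuations
    if (nrow(hits) > 0)
      return(head(hits[order(-freq)], k))           # most frequent continuations first
  }
  data.table(prefix = character(), word = character(), freq = integer())  # no match at any level
}

# hypothetical call: predict_backoff(c("i", "don't"), tables)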

Simple example of prediction

We use the 3-gram dictionary to compute a simple prediction (no smoothing, no zero-probability management, no backoff).

Data set   Size (bytes)
All        55597272

We decompose the 3-grams into 2-gram / 1-gram pairs. The prediction function looks up the typed 2-gram in the first column and returns the corresponding records of the 1-gram column, ordered by the frequency of the original 3-grams (before the split into 2-gram / 1-gram pairs).
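The actual lookup code is not shown here; below is a minimal sketch of this decomposition, assuming the 3-gram frequency vector freq3 from the earlier sketch (the output that follows comes from the report's own computation):

library(data.table)
parts  <- strsplit(names(freq3), " ")                                   # split each "w1 w2 w3"
grams3 <- data.table(
  grams.2 = sapply(parts, function(w) paste(w[1:2], collapse = " ")),   # 2-gram prefix
  grams.1 = sapply(parts, function(w) w[3]),                            # following word
  freq    = as.integer(freq3)
)
setkey(grams3, grams.2)

predict_next <- function(input) grams3[grams.2 == input][order(-freq)][1:10]  # top 10 continuations
predict_next("i don't")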

## [1] "Find the next word for input : i don't"
## [1] "First predictions : "
##     grams.2 grams.1 freq
##  1: i don't    know   72
##  2: i don't   think   52
##  3: i don't    have   32
##  4: i don't    want   24
##  5: i don't    like   21
##  6: i don't  really    9
##  7: i don't     see    9
##  8: i don't    feel    8
##  9: i don't     get    8
## 10: i don't    need    7
## [1] "Time for predicting : 1.228"

Even for this simple example, based on sampled data, the prediction takes more than 1 second.

This highlights the need for optimization in the next steps.