The context of this report is the development of an application able to predict (propose options for) the next word based on the previous 3 words typed.
The purpose of this report is to prepare the modeling phase through an exploratory analysis of the provided data.
These are records from 3 different sources: blogs, news and Twitter.
It shows basic summaries of the three data sets provided. These are obtained after tokenizing the sentences in the texts into N-tuples of the form word1 word2 word3 … wordN, called N-grams. These N-grams will be used for the prediction.
It also gives information on processing times and memory sizes.
Finally, the planned next steps are detailed.
| Data set | Number of documents | Max number of words per document | Total number of words | Mean number of words per document |
|---|---|---|---|---|
| Blog | 899388 | 40833 | 37334131 | 41.51 |
| News | 1010242 | 11384 | 34372530 | 34.02 |
| Twitter | 2360148 | 140 | 30373543 | 12.87 |
As we can see, the 3 data sets have a similar total number of words, although the Twitter data has many more lines (of shorter length).
For the further calculations in this exploratory analysis, we use smaller data sets to keep computation times reasonable.
We randomly sample 10000 records from each data set.
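For illustration, here is a minimal sketch of the sampling step; the file names and the random seed are assumptions, not necessarily those used for this report.

```r
# Minimal sampling sketch -- file names and seed are illustrative assumptions
set.seed(1234)

sample_corpus <- function(path, size = 10000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)  # load the full data set
  sample(lines, size)                                           # keep a random subset of records
}

blog_sample    <- sample_corpus("en_US.blogs.txt")
news_sample    <- sample_corpus("en_US.news.txt")
twitter_sample <- sample_corpus("en_US.twitter.txt")
```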
During the computations above, we tracked the processing times.
The results below are the elapsed times, in seconds, for each step.
| Data set | Loading | Longest document | Number of words | Sample |
|---|---|---|---|---|
| Blog | 24.063 | 72.357 | 21.573 | 4.571 |
| News | 33.722 | 44.426 | 19.429 | 6.553 |
| Twitter | 66.124 | 56.136 | 88.097 | 11.712 |
As we can see, the computation times for the statistics (number of characters, number of words) grow with the number of lines; the other times are similar across data sets.
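Timings like these can be collected by wrapping each step in `system.time()`; below is a minimal sketch of that pattern, with illustrative variable and file names (not necessarily those used for the table above).

```r
# Illustrative timing pattern -- variable and file names are assumptions
timings <- list()

timings$loading <- system.time(
  blog_lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
)["elapsed"]

timings$longest_document <- system.time(
  longest <- max(nchar(blog_lines))
)["elapsed"]

timings$number_of_words <- system.time(
  n_words <- sum(lengths(strsplit(blog_lines, "\\s+")))
)["elapsed"]

timings$sample <- system.time(
  blog_sample <- sample(blog_lines, 10000)
)["elapsed"]

unlist(timings)  # elapsed times in seconds
```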
We eliminate from the corpus the non-significant “words”, such as numbers, URLs, hashtags and special characters.
We keep in the corpus the word delimiters ( \r\n\t.,;:“()?!), which are used for tokenization, and the single quotes (’’), so that words such as i’m and don’t are kept intact. We do not discard stopwords in this particular case, as they are important for prediction.
We can also discard words not contained in a dictionary (to exclude fragments of foreign languages), as well as profanity words. The dictionary is a list in an external text file, containing every variant of each word (e.g. show, shows, shown, etc.). The profanity words are also listed in an external text file.
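A minimal sketch of this cleaning step with base R regular expressions follows; the exact patterns and the names of the external word lists are assumptions, not the ones used in this report.

```r
# Illustrative cleaning -- patterns and file names are assumptions
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x)       # remove URLs
  x <- gsub("[#@]\\S+", " ", x)                  # remove hashtags and mentions
  x <- gsub("[0-9]+", " ", x)                    # remove numbers
  x <- gsub("[^a-z' .,;:()?!\r\n\t-]", " ", x)   # keep letters, quotes and word delimiters
  x
}

# Optional filtering against external word lists (file names are hypothetical)
dictionary <- readLines("dictionary.txt")
profanity  <- readLines("profanity.txt")
keep_word  <- function(w) w %in% dictionary & !(w %in% profanity)
```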
The approach is to use Markov chains, here in the form of n-grams.
We set up 3 corpora (Blog, News, Twitter), each containing 10000 documents (samples of the original data).
The main packages used for this are tm, RWeka and slam.
Further tests may lead us to use the ngram package (first released in 2014), which is meant to be faster at tokenization.
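Here is a hedged sketch of how the n-gram frequency tables can be built with these packages; the control options and object names are illustrative, not necessarily those used for the results below.

```r
library(tm)
library(RWeka)
library(slam)

# Corpus built from the 10000 sampled documents (blog_sample from the sampling sketch)
corpus <- VCorpus(VectorSource(blog_sample))

# RWeka tokenizer producing 3-grams; change min/max for n = 1, 2, 4
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrix of 3-grams
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Frequencies summed across documents with slam, sorted in decreasing order
freq_3 <- sort(slam::row_sums(tdm_3), decreasing = TRUE)
head(freq_3, 10)
```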
As mentioned above, the following results were computed with sampled data (10000 records).
For each case (n / data set), the first thing we provide is the size of the n-gram table (in bytes). As you will see, for a sampling size of 10000, the memory size of the n-gram sets varies from 1 MB to 32 MB. We will likely have to shorten these sets (discard n-grams with very low frequencies) to keep the RAM used under the 1 GB threshold, since the full data have nearly 100 times more records.
We then plot the most frequent n-grams for every data set, for n = 1, 2, 3, 4, since we need a 4-gram model in order to predict the next word from the (at most) 3 previous words.
Finally, we plot a word cloud of the most frequent n-grams for every data set, for n = 1, 2.
In these plots, we notice that for Twitter, the distribution of n-gram frequencies is more balanced.
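The table sizes reported below can be measured with `object.size()`; here is a minimal sketch, together with the kind of plotting and low-frequency trimming mentioned above. The frequency threshold and plot parameters are assumptions, and `freq_3` refers to the hypothetical frequency table from the tokenization sketch.

```r
library(wordcloud)

# Memory footprint of an n-gram frequency table, in bytes
object.size(freq_3)

# Bar plot of the 20 most frequent 3-grams
barplot(head(freq_3, 20), las = 2, cex.names = 0.7)

# Word cloud of the most frequent terms (in the report, used for n = 1, 2)
wordcloud(names(freq_3), freq_3, max.words = 50)

# Possible trimming: discard n-grams with very low frequency (threshold is an assumption)
freq_3_trimmed <- freq_3[freq_3 > 1]
object.size(freq_3_trimmed)
```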
| dataset | 1-gram table size (bytes) |
|---|---|
| Blog | 2203576 |
| News | 2150208 |
| Twitter | 1033888 |
| dataset | 2-gram table size (bytes) |
|---|---|
| Blog | 14851272 |
| News | 13485000 |
| Twitter | 5003032 |
| dataset | 3-gram table size (bytes) |
|---|---|
| Blog | 26851960 |
| News | 22077736 |
| Twitter | 7356320 |
| dataset | 4-gram table size (bytes) |
|---|---|
| Blog | 31904864 |
| News | 24755848 |
| Twitter | 7856920 |
In order to fit a model, we will need to consider the following points:
We use the 3-gram dictionary to compute a simple prediction (no smoothing, no zero-probability management, no backoff).
| dataset | size (bytes) |
|---|---|
| All | 55597272 |
We decompose the 3-grams into 2-gram / 1-gram pairs. The function looks up the typed 2-gram in the first column and returns the records of the 1-gram column, ordered by the frequency of the original 3-grams (before splitting into 2-gram / 1-gram pairs).
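A hedged sketch of such a lookup with data.table is given below; the column names follow the output shown after it, while the table name, its construction and the function details are assumptions.

```r
library(data.table)

# ngrams_3: hypothetical data.table with one row per 3-gram "w1 w2 w3", split into
#   grams.2 = "w1 w2", grams.1 = "w3", freq = frequency of the original 3-gram
predict_next <- function(input, dt, n = 10) {
  setkey(dt, grams.2)               # index the 2-gram column for fast lookup
  hits <- dt[J(tolower(input))]     # rows whose 2-gram matches the typed words
  head(hits[order(-freq)], n)       # most frequent continuations first
}

# Example call (ngrams_3 is assumed to have been built from the combined corpora):
# predict_next("i don't", ngrams_3)
```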
## [1] "Find the next word for input : i don't"
## [1] "First predictions : "
## grams.2 grams.1 freq
## 1: i don't know 72
## 2: i don't think 52
## 3: i don't have 32
## 4: i don't want 24
## 5: i don't like 21
## 6: i don't really 9
## 7: i don't see 9
## 8: i don't feel 8
## 9: i don't get 8
## 10: i don't need 7
## [1] "Time for predicting : 1.228"
Even for this simple example, based on the sampled data, the prediction takes more than 1 second.
This shows the need for optimization in the next steps.