Investigation of text corpora used for building an n-gram predictive text model

Synopsis

This report explores 3 text corpora obtained from HC Corpora (www.corpora.Helios.org) and collected from different sources (Blogs, News, and twitter). The corpora will be used at a later stage to build a predictive text model. The model will be implemented as an online shiny application (i.e. available on the web) that suggests a list of next-to-be-typed words after a user enters a phrase. The scope of the investigation reported here was to do some basic pre-processing and analysis on the corpora in order to get acquainted with the data, and to develop a strategy for building the predictive model.

1. Data Sampling

The data comprise 3 text documents, en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, with respective lengths of 899288, 1010242, and 2360148 lines. In order to avoid computational performance issues, subsequent analyses were based on a subset of the corpora. Specifically, 20000 lines were randomly sampled from each corpus.
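
For illustration, the sampling step could look roughly as follows (a minimal sketch in R; the file names and the 20000-line sample size come from the report, while the random seed is an arbitrary assumption):

set.seed(1234)                     # arbitrary seed, assumed for reproducibility

# Read a corpus file and draw a random sample of n lines from it.
sample_corpus <- function(path, n = 20000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blogs   <- sample_corpus("en_US.blogs.txt")
news    <- sample_corpus("en_US.news.txt")
twitter <- sample_corpus("en_US.twitter.txt")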

An analysis was performed to investigate the number of unique words retained as a function of the number of lines sampled from the News corpus. The result is shown in the plot below. As a reference point, the entire body of work by William Shakespeare contains about 18000 unique words. The analysis suggests that a small subset of the corpora would be sufficient to cover the most common words in English.

[Figure: number of unique words retained vs. number of lines sampled from the News corpus]
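
A sketch of how such a curve could be computed is given below; the crude regular-expression tokenizer is an assumption and may differ from the processing actually used for the plot.

# Number of unique words retained as the number of sampled lines grows.
unique_word_growth <- function(lines, step = 1000) {
  words_per_line <- strsplit(tolower(lines), "[^a-z']+")
  steps <- seq(step, length(lines), by = step)
  counts <- sapply(steps, function(n) length(unique(unlist(words_per_line[1:n]))))
  data.frame(lines = steps, unique_words = counts)
}

# growth <- unique_word_growth(news)
# plot(growth, type = "l", xlab = "lines sampled", ylab = "unique words retained")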

2. Sentence Tokenization

In the first step of the analysis, text lines were segmented into individual sentences and some basic cleaning of the data was done (both before and after segmentation). The reason behind segmentation into sentences is to avoid pairing the last word(s) of a sentence with the first word(s) of the following sentence when calculating the frequency of occurrence of combinations of words (also known as n-gram frequency counts). The strategy for splitting into sentences is simple: it is based on sentence-ending punctuation marks. This strategy ignores disambiguation problems. For example, a period does not always denote the end of a sentence, as it is also used as a decimal point and in abbreviations. Since the impact of this issue on the predictive model is not clear at the moment, it was not explored at length. However, we did partially address some of the most obvious disambiguation cases, as described below.

For data cleaning, we opted for a simple strategy that strips out any non-alphanumeric characters from the text after it has been segmented into sentences (except for the apostrophe) and converts all remaining characters to lower case. Some cleaning was also done before segmentation, in cases where it was deemed to reduce ambiguity surrounding sentence-ending punctuation. For example, a period falling within digits was removed (e.g. 2.36 mapped to 236), and consecutive sentence-ending punctuation marks were replaced by a single period (e.g. !?!?! mapped to .).
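
A minimal sketch of this cleaning and splitting strategy is shown below; the exact regular expressions used in the report are not documented, so the ones here are illustrative assumptions.

# Clean a vector of text lines and split them into sentences:
# drop periods inside numbers, collapse runs of sentence-ending punctuation,
# split on periods, strip non-alphanumeric characters (keeping apostrophes),
# and convert to lower case.
clean_and_split <- function(lines) {
  x <- gsub("(?<=[0-9])\\.(?=[0-9])", "", lines, perl = TRUE)   # 2.36 -> 236
  x <- gsub("[.!?]+", ".", x)                                   # !?!?! -> .
  sentences <- unlist(strsplit(x, "\\."))
  sentences <- tolower(gsub("[^A-Za-z0-9' ]", " ", sentences))
  sentences[grepl("[a-z0-9]", sentences)]                       # drop empty pieces
}

clean_and_split("I don't know. Maybe they're getting too much sun.")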

The following is a representative line from the Blog corpus, before sentence tokenization:

## [1] "“I don't know. Maybe they're getting too much sun. I think I'm going to cut them way back.” I replied."

And here’s the same line after sentence tokenization and cleaning (includes conversion to lower case):

## [1] "i don't know"                          
## [2] "maybe they're getting too much sun"    
## [3] "i think i'm going to cut them way back"
## [4] " i replied"

3. Single-word Tokenization

Next, words were extracted and their frequencies of occurrence were calculated for the 3 corpora. In addition to the data cleaning performed in the previous step, for this task we removed stop words before generating the word frequency count. Stop words are commonly occurring words such as “a”, “the”, “is”, “at”, “i’m”, and others. Removing stop words allows us to get a better idea of the differences between the text corpora (otherwise the picture would be dominated by stop words). Note that stop words will not be removed when building the predictive model, as they have important predictive value.
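
A sketch of the word-frequency count with stop-word removal is shown below; the English stop-word list from the tm package is used here as an assumed stand-in for the list actually employed.

library(tm)   # provides stopwords("english")

# Tokenize cleaned sentences into words, drop stop words, and count occurrences.
word_freq <- function(sentences) {
  words <- unlist(strsplit(sentences, "[^a-z']+"))
  words <- words[nchar(words) > 0 & !(words %in% stopwords("english"))]
  sort(table(words), decreasing = TRUE)
}

# head(word_freq(clean_and_split(blogs)), 20)   # 20 most frequent Blog words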

Word statistics are displayed in two different formats below. First, we show a table with the 20 most frequent words for each corpus.

##    Blog.word Blog.freq News.word News.freq twitter.word twitter.freq
## 1        one      2851      said      4938         just         1247
## 2       will      2615      will      2174         like         1052
## 3       just      2241       one      1699          get          964
## 4       like      2234      year      1527         love          918
## 5        can      2205       new      1456         good          864
## 6       time      2027       two      1306         will          841
## 7        get      1594      also      1216          day          789
## 8        now      1393       can      1145          now          747
## 9     people      1273      just      1101          can          732
## 10      know      1263     first      1080         know          677
## 11      also      1242      time      1079       thanks          676
## 12      back      1210      last      1047          one          665
## 13       new      1210     state      1018        great          648
## 14      even      1183      like       972         time          635
## 15       day      1164    people       932        today          607
## 16       see      1126       get       909          new          576
## 17      well      1117     years       889          see          552
## 18     first      1116      city       761         back          490
## 19     think      1110      back       740          got          482
## 20      much      1098     three       737        going          458

Next, we visualize the 150 most frequent words for each corpus in the form of a “word cloud”, as shown below. In such a word cloud, the size of each printed word corresponds to its frequency of occurrence. There are some noticeable differences between the twitter data and the data from the other two corpora. For instance, twitter contains slang words such as lol, haha, luv, etc.

[Figure: word clouds of the 150 most frequent words in each corpus]
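
Such clouds can be drawn with the wordcloud package along the following lines (a sketch; the colour palette and ordering options are arbitrary choices, not necessarily those used for the figure):

library(wordcloud)
library(RColorBrewer)

# Plot the 150 most frequent words; word size reflects frequency of occurrence.
plot_cloud <- function(freq) {
  wordcloud(words = names(freq), freq = as.numeric(freq),
            max.words = 150, random.order = FALSE,
            colors = brewer.pal(8, "Dark2"))
}

# plot_cloud(word_freq(clean_and_split(twitter)))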

Another interesting analysis, shown in the figure below, is the percentage of text covered as a function of the number of unique words (ranked by frequency of occurrence). The x-axis can be thought of as the size of the dictionary and the y-axis as the fraction of the language that dictionary covers.

[Figure: percentage of text covered vs. number of unique words, ranked by frequency of occurrence]

The result is somewhat surprising, as it shows that only the 1058 most frequent words are needed to cover 50% of the sampled body of text. This is an important finding when considering reducing the size of the vocabulary in order to limit the use of computational resources.
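
The coverage curve, and the 50% figure quoted above, follow from the cumulative word frequencies. A sketch, reusing the hypothetical word_freq helper from earlier:

# Fraction of all word occurrences covered by the k most frequent words,
# for k = 1, 2, ... (frequencies are assumed sorted in decreasing order).
coverage <- function(freq) {
  cumsum(as.numeric(freq)) / sum(freq)
}

# cov <- coverage(word_freq(clean_and_split(c(blogs, news, twitter))))
# which(cov >= 0.5)[1]   # number of unique words needed to cover 50% of the text
# plot(cov, type = "l", xlab = "unique words (ranked by frequency)",
#      ylab = "fraction of text covered")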

4. n-gram Tokenization

The previous exercise was extended to the most common 2-grams and 3-grams. For this analysis, stop words were not removed. The first table below lists the 20 most frequent 2-grams for each corpus, and the second the 20 most frequent 3-grams. Note that both the Blog and twitter language seem to be dominated by the first person (i.e. usage of I and its derivatives).

##    Blog.word Blog.freq News.word News.freq twitter.word twitter.freq
## 1     of the      4147    of the      3703       in the          652
## 2     in the      3546    in the      3514      for the          620
## 3     to the      1949    to the      1724       of the          442
## 4     on the      1648    on the      1409       on the          419
## 5      to be      1523   for the      1381        to be          416
## 6    and the      1321    at the      1112       to the          378
## 7    for the      1305   and the      1023   thanks for          374
## 8     at the      1151      in a      1021       i love          314
## 9      it is      1075     to be       900       at the          310
## 10     and i      1071  with the       843    thank you          306
## 11     i was      1067  from the       710     going to          288
## 12    i have      1065    with a       669       have a          288
## 13    it was      1053      of a       654       i have          272
## 14  with the      1021   he said       644       if you          270
## 15      in a      1001      as a       627        for a          257
## 16      is a       986    by the       566      will be          236
## 17      i am       931   will be       562       to get          234
## 18  from the       891     for a       559         i am          232
## 19    that i       814    it was       554         is a          232
## 20    with a       791  that the       537      i think          228
##        Blog.word Blog.freq         News.word News.freq       twitter.word twitter.freq
## 1     one of the       327        one of the       300     thanks for the          201
## 2       a lot of       281          a lot of       223      thank you for           80
## 3     as well as       160  according to the       121         i love you           76
## 4     the end of       156       part of the       118 looking forward to           70
## 5        to be a       147        as well as       108      can't wait to           68
## 6     out of the       146       some of the       106        going to be           68
## 7    some of the       142        out of the       105     for the follow           64
## 8    a couple of       134        the end of       102           a lot of           62
## 9     be able to       134           to be a       101          i want to           56
## 10      it was a       134       going to be        99          i need to           52
## 11 the fact that       132      in the first        99         to see you           52
## 12     i want to       131          it was a        90            to be a           51
## 13      i have a       123 the united states        82          i have to           49
## 14   the rest of       120       the rest of        78           i have a           48
## 15   i have been       117         said in a        71        is going to           45
## 16   going to be       115       most of the        70       have a great           44
## 17     i have to       114    the first time        69           i wish i           44
## 18   part of the       110        be able to        66         you have a           42
## 19    there is a       110       of the year        66           to go to           41
## 20     this is a       110 the university of        64         one of the           38
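
For reference, a plain-R sketch of how such n-gram counts might be produced; the tokenization used for the tables above is not specified in the report, so this is only illustrative:

# Count n-grams within sentences so that a gram never straddles a sentence
# boundary (sentences as produced by the clean_and_split sketch above).
ngram_freq <- function(sentences, n = 2) {
  grams <- unlist(lapply(strsplit(sentences, "[^a-z']+"), function(w) {
    w <- w[nchar(w) > 0]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

# head(ngram_freq(clean_and_split(blogs), n = 2), 20)   # 20 most frequent 2-grams
# head(ngram_freq(clean_and_split(blogs), n = 3), 20)   # 20 most frequent 3-grams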

5. Conclusion

Based on the data investigation done so far, the plan is to build a predictive text algorithm and accompanying shiny application that rely on the frequency of occurrence of n-grams. The following is a rough sketch of the modeling strategy:

  1. Randomly sample part of the corpora (e.g. 20000 lines from each corpus).
  2. Calculate frequency of occurrence of n-grams for n=1,2,3 and 4 (combining together results from Blog, News, and twitter).
  3. Build a model that attempts to predict the next word based on the largest-order n-gram available (i.e. n=4).
  4. Supplement the prediction with the one from the next largest n-gram when the largest yields no results (a rough sketch of this back-off lookup is given after this list).
  5. Test the prediction accuracy out-of-sample and consider adding more of the corpora to the training sample if deemed necessary to improve accuracy.
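
A rough sketch of the back-off lookup described in steps 3 and 4 is given below; the data layout (a list of n-gram frequency tables indexed by n, with the gram text as names) and all function names are illustrative assumptions, not the final implementation.

# ngrams: a list where ngrams[[n]] is a named frequency table of n-grams,
# e.g. names(ngrams[[3]]) contains strings such as "thanks for the".
# Predict candidate next words for a phrase by matching the longest available
# (n-1)-word context, backing off to shorter contexts when there is no match.
predict_next <- function(phrase, ngrams, max_n = 4, top = 3) {
  words <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  if (length(words) == 0) return(character(0))
  for (n in seq(min(max_n, length(words) + 1), 2)) {
    context <- paste(tail(words, n - 1), collapse = " ")
    tab <- ngrams[[n]]
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      candidates <- sub(".* ", "", names(sort(hits, decreasing = TRUE)))
      return(head(unique(candidates), top))
    }
  }
  character(0)   # no context matched at any n-gram order
}

# s <- clean_and_split(c(blogs, news, twitter))
# ngrams <- list(NULL, ngram_freq(s, 2), ngram_freq(s, 3), ngram_freq(s, 4))
# predict_next("thanks for", ngrams)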

The strategy outlined above reflects the current understanding of the problem and of the provided training data. Note that we expect to build more complexity into the model as our thinking continues to evolve around natural language processing in general, and the problem at hand in particular.

Last update: 2014-11-15 15:21:11