This report explores 3 text corpora obtained from HC Corpora (www.corpora.Helios.org) and collected from different sources (Blogs, News, and Twitter). The corpora will be used at a later stage to build a predictive text model. The model will be implemented as an online Shiny application (i.e. available on the web) that suggests a list of next-to-be-typed words after a user enters a phrase. The scope of the investigation reported here was to perform some basic pre-processing and analysis on the corpora in order to get acquainted with the data, and to develop a strategy for building the predictive model.
The data comprise 3 text documents, en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, with respective lengths of 899288, 1010242, and 2360148 lines. In order to avoid computational performance issues, subsequent analyses were based on a subset of the corpora. Specifically, 20000 lines were randomly sampled from each corpus.
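The sampling step can be sketched as follows (a minimal sketch in base R; the helper name and seed value are illustrative, not taken from the report):

```r
# Read each corpus in full and draw a reproducible random sample of 20000 lines.
set.seed(1234)  # arbitrary seed, for reproducibility only

sample_lines <- function(path, n = 20000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blogs   <- sample_lines("en_US.blogs.txt")
news    <- sample_lines("en_US.news.txt")
twitter <- sample_lines("en_US.twitter.txt")
```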
An analysis was performed to investigate the number of retained unique words as a function of the number of lines sampled from the News corpus. The result is shown in the plot below. As a reference point, the entire body of work by William Shakespeare contains about 18000 unique words. The analysis suggests that a small subset of the corpora would be sufficient to cover the most common words in English.
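A rough version of this analysis, assuming `news` holds the sampled News lines from the step above and using a simple letters-and-apostrophes tokenization (the report's exact tokenization may differ):

```r
# Unique words retained as a function of the number of News lines considered.
steps  <- seq(1000, length(news), by = 1000)
counts <- sapply(steps, function(n) {
  words <- unlist(strsplit(tolower(news[1:n]), "[^a-z']+"))
  length(unique(words[words != ""]))
})
plot(steps, counts, type = "l",
     xlab = "Lines sampled", ylab = "Unique words retained")
```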
In the first step of the analysis, text lines were segmented into individual sentences and some basic cleaning of the data was done (both before and after segmentation). The reason for segmenting into sentences is to avoid pairing the last word(s) of a sentence with the first word(s) of the following sentence when calculating the frequency of occurrence of word combinations (also known as n-gram frequency counts). The strategy for splitting into sentences is simple: it is based on sentence-ending punctuation marks. This strategy ignores disambiguation problems. For example, a period does not always denote the end of a sentence, as it is also used as a decimal point and in abbreviations. Since the impact of this issue on the predictive model is not clear at the moment, it was not explored at length. However, we did partially address some of the most obvious disambiguation cases, as described below.
For data cleaning, we opted for a simple strategy that strips out any non-alphanumeric characters from the text after it has been segmented into sentences (except for the apostrophe) and converts all remaining characters to lower case. Some cleaning was also done before segmentation, in cases where it was deemed to reduce ambiguity surrounding sentence-ending punctuation. For example, a period falling within digits was removed (e.g. 2.36 mapped to 236), and consecutive sentence-ending punctuation marks were replaced by a single period (e.g. !?!?! mapped to .).
The following is a representative line from a Blog source taken before sentence tokenization:
## [1] "I don't know. Maybe they're getting too much sun. I think I'm going to cut them way back. I replied."
And here’s the same line after sentence tokenization and cleaning (including conversion to lower case):
## [1] "i don't know"
## [2] "maybe they're getting too much sun"
## [3] "i think i'm going to cut them way back"
## [4] " i replied"
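The segmentation and cleaning described above can be approximated in base R as follows (a sketch only; the function names are ours and the exact cleaning rules used for the report may differ slightly):

```r
# Pre-segmentation cleanup: drop periods inside numbers and collapse runs of
# sentence-ending punctuation into a single period.
clean_before <- function(x) {
  x <- gsub("([0-9])\\.([0-9])", "\\1\\2", x)  # 2.36 -> 236
  x <- gsub("[.!?]{2,}", ".", x)               # !?!?! -> .
  x
}

# Split on sentence-ending punctuation, strip everything except letters,
# digits, apostrophes and spaces, and convert to lower case.
segment_and_clean <- function(x) {
  sentences <- unlist(strsplit(clean_before(x), "[.!?]"))
  sentences <- tolower(gsub("[^A-Za-z0-9' ]", "", sentences))
  sentences[trimws(sentences) != ""]
}
```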
Next, words were extracted and their frequencies of occurrence were calculated for the 3 corpora. In addition to the data cleaning performed in the previous step, for this task we removed stop words before generating the word frequency counts. Stop words are commonly occurring words such as “a”, “the”, “is”, “at”, “i’m”, and others. Removing stop words gives a better idea of the differences between the text corpora (otherwise the picture would be dominated by stop words). Note that stop words will not be removed when building the predictive model, as they have important predictive value.
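One possible implementation of the word frequency count, assuming the tm package's English stop word list (the exact list used for the report may differ) and the `segment_and_clean()` helper sketched earlier:

```r
library(tm)  # provides stopwords("english")

# Word frequency table for one cleaned corpus, with stop words removed.
word_freq <- function(sentences) {
  words <- unlist(strsplit(sentences, "\\s+"))
  words <- words[words != "" & !(words %in% stopwords("english"))]
  sort(table(words), decreasing = TRUE)
}

head(word_freq(segment_and_clean(blogs)), 20)  # 20 most frequent Blog words
```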
Word statistics are displayed in two different formats below. First, we show a table with the 20 most frequent words for each corpus.
## Blog.word Blog.freq News.word News.freq twitter.word twitter.freq
## 1 one 2851 said 4938 just 1247
## 2 will 2615 will 2174 like 1052
## 3 just 2241 one 1699 get 964
## 4 like 2234 year 1527 love 918
## 5 can 2205 new 1456 good 864
## 6 time 2027 two 1306 will 841
## 7 get 1594 also 1216 day 789
## 8 now 1393 can 1145 now 747
## 9 people 1273 just 1101 can 732
## 10 know 1263 first 1080 know 677
## 11 also 1242 time 1079 thanks 676
## 12 back 1210 last 1047 one 665
## 13 new 1210 state 1018 great 648
## 14 even 1183 like 972 time 635
## 15 day 1164 people 932 today 607
## 16 see 1126 get 909 new 576
## 17 well 1117 years 889 see 552
## 18 first 1116 city 761 back 490
## 19 think 1110 back 740 got 482
## 20 much 1098 three 737 going 458
Next, we visualize the 150 most frequent words for each corpus in the form of a “word cloud”, shown below. In a word cloud, the size of each printed word corresponds to its frequency of occurrence. There are some noticeable differences between the Twitter data and the data from the other two corpora. For instance, Twitter contains slang words such as lol, haha, luv, etc.
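The word clouds can be produced with the wordcloud package, for example as follows (reusing the `word_freq()` sketch above; the plotting parameters are illustrative):

```r
library(wordcloud)

# Word cloud of the 150 most frequent Twitter words, sized by frequency.
freqs <- word_freq(segment_and_clean(twitter))
wordcloud(names(freqs), as.numeric(freqs),
          max.words = 150, random.order = FALSE)
```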
Another interesting analysis explored here, shown in the figure below, is the percentage of text covered as a function of the number of unique words (ranked by frequency of occurrence). The x-axis can be thought of as words in the dictionary and the y-axis as how much of the language they cover.
The result is somewhat surprising: only the 1058 most frequent words are needed to cover 50% of the sampled body of text. This is an important finding when considering reducing the size of the vocabulary in order to limit the use of computational resources.
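The coverage curve follows directly from a sorted frequency table such as `freqs` above (whether stop words were included in the report's version of this figure is not stated; the computation is the same either way):

```r
# Cumulative share of all word occurrences covered by the N most frequent words.
coverage <- cumsum(as.numeric(freqs)) / sum(freqs)
min(which(coverage >= 0.5))  # number of words needed to cover 50% of the text
plot(seq_along(coverage), coverage * 100, type = "l",
     xlab = "Number of unique words (ranked by frequency)",
     ylab = "Percentage of text covered")
```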
The previous exercise was extended to the most common 2-grams and 3-grams. For this analysis, stop words were not removed. The tables below show the results for each corpus; a sketch of the n-gram counting step follows the tables. Note that both the Blog and Twitter corpora seem to be dominated by the first person (i.e. usage of I and its derivatives).
## Blog.word Blog.freq News.word News.freq twitter.word twitter.freq
## 1 of the 4147 of the 3703 in the 652
## 2 in the 3546 in the 3514 for the 620
## 3 to the 1949 to the 1724 of the 442
## 4 on the 1648 on the 1409 on the 419
## 5 to be 1523 for the 1381 to be 416
## 6 and the 1321 at the 1112 to the 378
## 7 for the 1305 and the 1023 thanks for 374
## 8 at the 1151 in a 1021 i love 314
## 9 it is 1075 to be 900 at the 310
## 10 and i 1071 with the 843 thank you 306
## 11 i was 1067 from the 710 going to 288
## 12 i have 1065 with a 669 have a 288
## 13 it was 1053 of a 654 i have 272
## 14 with the 1021 he said 644 if you 270
## 15 in a 1001 as a 627 for a 257
## 16 is a 986 by the 566 will be 236
## 17 i am 931 will be 562 to get 234
## 18 from the 891 for a 559 i am 232
## 19 that i 814 it was 554 is a 232
## 20 with a 791 that the 537 i think 228
## Blog.word Blog.freq News.word News.freq twitter.word twitter.freq
## 1 one of the 327 one of the 300 thanks for the 201
## 2 a lot of 281 a lot of 223 thank you for 80
## 3 as well as 160 according to the 121 i love you 76
## 4 the end of 156 part of the 118 looking forward to 70
## 5 to be a 147 as well as 108 can't wait to 68
## 6 out of the 146 some of the 106 going to be 68
## 7 some of the 142 out of the 105 for the follow 64
## 8 a couple of 134 the end of 102 a lot of 62
## 9 be able to 134 to be a 101 i want to 56
## 10 it was a 134 going to be 99 i need to 52
## 11 the fact that 132 in the first 99 to see you 52
## 12 i want to 131 it was a 90 to be a 51
## 13 i have a 123 the united states 82 i have to 49
## 14 the rest of 120 the rest of 78 i have a 48
## 15 i have been 117 said in a 71 is going to 45
## 16 going to be 115 most of the 70 have a great 44
## 17 i have to 114 the first time 69 i wish i 44
## 18 part of the 110 be able to 66 you have a 42
## 19 there is a 110 of the year 66 to go to 41
## 20 this is a 110 the university of 64 one of the 38
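The n-gram counts behind these tables can be computed along the following lines (a sketch using the helpers defined earlier; the report may have used a dedicated tokenization package instead):

```r
# Frequency table of n-grams (e.g. n = 2 or 3) for a set of cleaned sentences;
# stop words are kept, matching the analysis above.
ngram_freq <- function(sentences, n = 2) {
  grams <- unlist(lapply(strsplit(sentences, "\\s+"), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

head(ngram_freq(segment_and_clean(blogs), n = 3), 20)  # top Blog 3-grams
```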
Based on the data investigations done so far, the plan is to build a predictive text algorithm and an accompanying Shiny application based on the frequency of occurrence of n-grams. The following is a rough sketch of the modeling strategy:
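As an illustration of the general n-gram idea only (not the report's finalized algorithm), the sketch below suggests the next word by looking up the user's last words in precomputed n-gram tables, backing off from trigrams to bigrams when no longer match is found; the function name and backoff rule are ours:

```r
# tri, bi: frequency tables from ngram_freq() with n = 3 and n = 2.
predict_next <- function(phrase, tri, bi, k = 3) {
  w <- tail(strsplit(tolower(trimws(phrase)), "\\s+")[[1]], 2)

  # Candidate lookups: the last two words against the trigram table,
  # then the last word alone against the bigram table.
  lookups <- list()
  if (length(w) == 2)
    lookups <- c(lookups, list(list(tab = tri, ctx = paste(w, collapse = " "))))
  lookups <- c(lookups, list(list(tab = bi, ctx = tail(w, 1))))

  for (l in lookups) {
    hits <- l$tab[grepl(paste0("^", l$ctx, " "), names(l$tab))]
    if (length(hits) > 0) {
      top <- names(sort(hits, decreasing = TRUE))[seq_len(min(k, length(hits)))]
      return(sapply(strsplit(top, " "), tail, 1))  # last word of each match
    }
  }
  character(0)  # no suggestion found
}
```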
The strategy outlined above reflects the current understanding of the problem and of the provided training data. Note that we expect to build more complexity into the model as our thinking continues to evolve around natural language processing in general, and the problem at hand in particular.
Last update: 2014-11-15 15:21:11