The purpose of this document is to present the basic analysis we performed on three different text data sources.
Three text data sources were loaded.
Since this represents a very large amount of text that would be tedious to work with in full, we sample 10,000 random lines from each data source and store everything in a single variable.
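A minimal sketch of this sampling step in Python, assuming the three sources are plain-text files read line by line (the file names and the random seed below are placeholders, not the actual values used):

```python
import random

# Placeholder paths -- substitute the actual files of the three text data sources.
source_files = ["source_1.txt", "source_2.txt", "source_3.txt"]

random.seed(42)  # arbitrary seed, only for reproducibility of the sample

sampled_lines = []
for path in source_files:
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Keep 10,000 random lines per source and pool everything in one variable.
    sampled_lines.extend(random.sample(lines, 10000))
```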
By splitting at each space character we can extract the individual words; this yields 880,688 words. However, we also need to separate punctuation marks from words. To do that, we split off any punctuation mark or parenthesis found at the beginning or end of a word, which brings the total to 987,758 tokens.
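One way to implement this tokenization is sketched below in Python; the exact set of punctuation marks that are detached is an assumption, and the original analysis may have used a different tool.

```python
import re

# Punctuation marks and parentheses to detach (the exact set is an assumption).
PUNCT = r'[.,;:!?"\'()\[\]]'
TOKEN_RE = re.compile(rf'^({PUNCT}*)(.*?)({PUNCT}*)$')

def tokenize(line):
    """Split on spaces, then peel leading/trailing punctuation into separate tokens."""
    tokens = []
    for raw in line.split():
        lead, core, trail = TOKEN_RE.match(raw).groups()
        tokens.extend(lead)        # each leading mark becomes its own token
        if core:
            tokens.append(core)
        tokens.extend(trail)       # each trailing mark becomes its own token
    return tokens

words = [token for line in sampled_lines for token in tokenize(line)]
```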
We then identify the most frequent words and the most frequent sequences of 2, 3, 4, 5, or 6 consecutive words. To do so, we create a data frame with one entry for each of the 987,758 tokens. Five additional fields are then added to hold the five words that follow each token. The fields are finally concatenated to construct unigrams, bigrams, trigrams, four-grams, five-grams, and six-grams.
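A sketch of this n-gram construction under the same assumptions, here using a pandas data frame (shifting the word column produces the fields of following words; the column names are illustrative and the original work may have used other tools):

```python
import pandas as pd

# One row per token; w2..w6 hold the five tokens that follow w1.
df = pd.DataFrame({"w1": words})
for i in range(2, 7):
    df[f"w{i}"] = df["w1"].shift(-(i - 1))

# Concatenate the fields to build the n-grams of each order (1 to 6).
ngrams = {}
for n in range(1, 7):
    cols = [f"w{i}" for i in range(1, n + 1)]
    ngrams[n] = df[cols].dropna().apply(" ".join, axis=1)

# Ten most frequent n-grams of each order, as reported in the tables below.
top10 = {n: series.value_counts().head(10) for n, series in ngrams.items()}
```

The tables below list the ten most frequent n-grams of each order.
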
| Unigram | Count |
|---|---|
| . | 45227 |
| , | 40781 |
| the | 38697 |
| to | 23655 |
| and | 21276 |
| a | 20225 |
| of | 18435 |
| in | 13740 |
| I | 11969 |
| is | 8946 |

| Bigram | Count |
|---|---|
| of the | 4031 |
| in the | 3494 |
| , and | 3260 |
| . The | 3161 |
| . I | 2878 |
| , but | 1941 |
| , the | 1913 |
| to the | 1885 |
| on the | 1666 |
| for the | 1622 |

| Trigram | Count |
|---|---|
| one of the | 247 |
| , and the | 246 |
| a lot of | 246 |
| , but I | 217 |
| . It was | 213 |
| he said . | 205 |
| , and I | 188 |
| . It is | 182 |
| . This is | 179 |
| . I have | 169 |

| Four-gram | Count |
|---|---|
| " he said . | 134 |
| the rest of the | 69 |
| . In fact , | 64 |
| for the first time | 63 |
| " she said . | 57 |
| , as well as | 56 |
| the end of the | 53 |
| , according to the | 48 |
| at the end of | 44 |
| . It was a | 42 |

| Five-gram | Count |
|---|---|
| in the middle of the | 26 |
| at the end of the | 23 |
| for the first time since | 18 |
| " he said . “I | 16 |
| said in a statement . | 16 |
| for the first time in | 13 |
| . In fact , the | 12 |
| for the rest of the | 12 |
| . In other words , | 11 |
| . In fact , I | 11 |

| Six-gram | Count |
|---|---|
| . CLEVELAND , Ohio - - | 7 |
| , cake , cake , cake | 7 |
| cake , cake , cake , | 7 |
| . At the same time , | 6 |
| , on the other hand , | 6 |
| in the middle of the night | 6 |
| . On the other hand , | 5 |
| . For those of you who | 5 |
| the end of the day , | 5 |
| of Chicago , Chicago , Illinois | 5 |
Using this data frame, we can compute various statistics, such as the frequency of appearance of each n-gram or the number of unique words required to represent a certain percentage of the whole text data.
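For the coverage statistic, a minimal sketch (continuing the hypothetical pandas code above) of how the number of unique words needed to cover a given fraction of the text can be derived from the unigram counts:

```python
def words_for_coverage(word_counts, fraction):
    """Smallest number of most-frequent unique words whose counts reach `fraction` of all tokens."""
    cumulative = word_counts.sort_values(ascending=False).cumsum()
    threshold = fraction * word_counts.sum()
    # Count the words whose cumulative total is still below the threshold, then add one.
    return int((cumulative < threshold).sum()) + 1

unigram_counts = ngrams[1].value_counts()
print(words_for_coverage(unigram_counts, 0.50))  # unique words needed to cover 50% of the text
print(words_for_coverage(unigram_counts, 0.90))  # and to cover 90%
```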