Introduction

This document presents the basic analysis we performed on three different text data sources.

Preliminary analysis

Three text data sources were loaded:

Since the three sources together represent a mass of text too large to work with conveniently as a whole, we sample 10000 random lines from each data source and store the combined sample in a single variable.
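The sampling step can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis: the function names `sample_lines` and `build_corpus`, the fixed seed, and the assumption that each source is already loaded as a list of lines are all hypothetical.

```python
import random


def sample_lines(lines, n=10000, seed=42):
    """Return up to n lines drawn at random, without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    if len(lines) <= n:
        return list(lines)
    return rng.sample(lines, n)


def build_corpus(sources, n=10000, seed=42):
    """Sample each source separately and pool the results in one list."""
    corpus = []
    for lines in sources:
        corpus.extend(sample_lines(lines, n, seed))
    return corpus
```

Sampling each source separately, rather than sampling from the pooled text, keeps the three sources equally represented by line count regardless of their original sizes.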

By splitting at each space character, we can extract each individual word; 880688 words are counted this way. However, we also need to separate punctuation signs from words. To do that, we split off any punctuation sign or parenthesis found at the beginning or at the end of a word. Doing that yields 987758 words.
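The two-stage splitting described above could look like the sketch below. It assumes that "punctuation sign" means the ASCII characters in Python's `string.punctuation` and that each leading or trailing sign becomes its own token; the original analysis may have used a different character set, and the names `split_token` and `tokenize` are hypothetical.

```python
import string

# Assumption: only ASCII punctuation (which includes parentheses) is split off.
PUNCT = set(string.punctuation)


def split_token(token):
    """Peel punctuation off both ends of a token, one character at a time."""
    leading, trailing = [], []
    while token and token[0] in PUNCT:
        leading.append(token[0])
        token = token[1:]
    while token and token[-1] in PUNCT:
        trailing.append(token[-1])
        token = token[:-1]
    core = [token] if token else []
    # trailing was collected from the outside in, so reverse it
    return leading + core + trailing[::-1]


def tokenize(line):
    """Split a line at space characters, then separate edge punctuation."""
    tokens = []
    for raw in line.split(" "):
        if raw:  # skip empty strings produced by consecutive spaces
            tokens.extend(split_token(raw))
    return tokens
```

Punctuation inside a word, such as the apostrophe in "don't", is left untouched because only the ends of each word are examined.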

Statistics

We then identify the most frequent words and the most frequent sequences of 2, 3, 4, 5, or 6 words. To do so, we create a data frame with one entry for each of the 987758 words isolated. Five additional fields are then added to hold, respectively, the five words that follow the first one. The words in the different fields are finally concatenated to construct unigrams, bigrams, trigrams, four-grams, five-grams, and six-grams.
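The shifted-column construction can be illustrated with plain Python lists instead of a data frame. This is only a sketch of the idea, with hypothetical names `ngrams` and `top_ngrams`. One difference from a data frame is that positions near the end of the text, which lack enough following words, are dropped here rather than kept as incomplete rows.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join each token with the n - 1 tokens that follow it."""
    # Column i holds the token list shifted left by i positions.
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest column, so incomplete n-grams are dropped.
    return [" ".join(group) for group in zip(*shifted)]


def top_ngrams(tokens, n, k=10):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)
```

Calling `top_ngrams(tokens, n)` for each n from 1 to 6 would give rankings in the same shape as the six tables that follow.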

10 Most Frequent Words

word  Count
.     45227
,     40781
the   38697
to    23655
and   21276
a     20225
of    18435
in    13740
I     11969
is     8946

10 Most Frequent Bigrams

gram_2   Count
of the    4031
in the    3494
, and     3260
. The     3161
. I       2878
, but     1941
, the     1913
to the    1885
on the    1666
for the   1622

10 Most Frequent Trigrams

gram_3      Count
one of the    247
, and the     246
a lot of      246
, but I       217
. It was      213
he said .     205
, and I       188
. It is       182
. This is     179
. I have      169

10 Most Frequent Four-grams

gram_4              Count
" he said .           134
the rest of the        69
. In fact ,            64
for the first time     63
" she said .           57
, as well as           56
the end of the         53
, according to the     48
at the end of          44
. It was a             42

10 Most Frequent Five-grams

gram_5                    Count
in the middle of the         26
at the end of the            23
for the first time since     18
" he said . “I               16
said in a statement .        16
for the first time in        13
. In fact , the              12
for the rest of the          12
. In other words ,           11
. In fact , I                11

10 Most Frequent Six-grams

gram_6                           Count
. CLEVELAND , Ohio - -               7
, cake , cake , cake                 7
cake , cake , cake ,                 7
. At the same time ,                 6
, on the other hand ,                6
in the middle of the night           6
. On the other hand ,                5
. For those of you who               5
the end of the day ,                 5
of Chicago , Chicago , Illinois      5

Using this data frame, we can compute various statistics, such as the frequency of appearance of each n-gram or the number of unique words required to represent a certain percentage of the whole text data.
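As an illustration of the second statistic, the sketch below counts how many distinct words, taken from most to least frequent, are needed to cover a given fraction of all word occurrences. The function name `words_for_coverage` and the use of a plain token list in place of the data frame are assumptions made for this example.

```python
from collections import Counter


def words_for_coverage(tokens, fraction):
    """Number of most-frequent unique words needed to cover `fraction` of the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating counts.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return rank
    return len(counts)
```

For instance, `words_for_coverage(tokens, 0.5)` would give the size of the smallest vocabulary, built from the most frequent words, that accounts for half of the word occurrences in the sample.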