Synopsis

This report presents an exploratory analysis of text data from a corpus obtained from HC Corpora. The corpus of interest contains three text documents (“en_US.blogs.txt”, “en_US.news.txt” and “en_US.twitter.txt”) collected from three different sources (blogs, news feeds and a Twitter feed). The aim of this report is to provide a basic understanding of the distribution of words, of the relationships between words, and of the variation in the frequencies of words, word pairs and word triples in the corpus. As this report is intended for a non-technical audience, we keep code to a minimum and show only small illustrative sketches; any reader interested in reproducibility may find the entire R code used to generate this report on GitHub.

1. Data processing

1.1 Summary of corpora

We begin our analysis by counting the number of lines and the number of words in each of the three text files.

##      File Count_of_lines Count_of_words
## 1   Blogs         899288       37334131
## 2    News        1010242       34372530
## 3 Twitter        2360148       30373545
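
As an illustration, these counts (and the mean line lengths discussed below) could be computed with a few lines of base R along the lines of the sketch below; the file paths and the exact word-splitting rule are assumptions, and the report's actual code may differ.

    # Count lines, whitespace-separated words and mean line length per file
    files <- c(Blogs   = "en_US.blogs.txt",
               News    = "en_US.news.txt",
               Twitter = "en_US.twitter.txt")
    stats <- lapply(files, function(f) {
      lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      data.frame(Count_of_lines   = length(lines),
                 Count_of_words   = sum(lengths(strsplit(lines, "\\s+"))),
                 Mean_line_length = mean(nchar(lines)))
    })
    do.call(rbind, stats)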

To better visualise these basic statistics, we also provide a plot of the number of lines and words in the three text files.

We may conclude that, as expected, the words-per-line ratio is highest for “en_US.blogs.txt”, slightly lower for “en_US.news.txt” and lowest for “en_US.twitter.txt”. More precisely, the mean line length is 229.99 characters for “en_US.blogs.txt”, 201.16 characters for “en_US.news.txt” and 68.68 characters for “en_US.twitter.txt”. We may note that, on average, lines in the “en_US.twitter.txt” file are much shorter than in the other two files. This is due to the character limit imposed by Twitter.

1.2 Data cleaning and preprocessing

As the files are quite large, a 10% sample of each file was selected to avoid performance issues. This sampled text corpus was then processed in six steps (a small R sketch of these steps follows the list below).

  • Step 1: Replace the most frequent abbreviations with their dot-free equivalent (for example, “a.m.” is replaced by “am”).
  • Step 2: Replace email addresses and URLs with their domain name (for example, “www.amazon.com” is replaced by “amazon”).
  • Step 3: Replace every sentence-breaking punctuation mark with a period (for example, the characters “:”, “?”, “!” and “|” are replaced by “.”).
  • Step 4: Remove quotes, quotation marks, dashes and other non-sentence-breaking punctuation marks.
  • Step 5: Split the text into sentences on the period mark.
  • Step 6: Remove special characters, convert all characters to lower case (except the word “I”), strip whitespace and remove duplicate sentences.
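
The base R sketch below illustrates one possible implementation of these six steps; clean_text is a hypothetical helper with deliberately simplified regular expressions, not the report's actual code (which is available on GitHub).

    clean_text <- function(x) {
      # Step 1: replace common abbreviations with their dot-free equivalent
      x <- gsub("\\ba\\.m\\.", "am", x, ignore.case = TRUE)
      x <- gsub("\\bp\\.m\\.", "pm", x, ignore.case = TRUE)
      # Step 2: reduce URLs and email addresses to a bare domain-like name (simplified)
      x <- gsub("(https?://)?(www\\.)?([a-z0-9-]+)\\.[a-z]{2,}\\S*", "\\3", x,
                ignore.case = TRUE)
      # Step 3: turn sentence-breaking punctuation into periods
      x <- gsub("[:?!|]+", ".", x)
      # Step 4: drop quotes, dashes and other non-sentence-breaking punctuation
      x <- gsub("[\"\u201c\u201d()\u2013\u2014-]", " ", x)
      # Step 5: split into sentences on the period
      s <- unlist(strsplit(x, "\\."))
      # Step 6: lower case (keeping "I" capitalised), remove special characters,
      #         strip whitespace and drop empty or duplicate sentences
      s <- tolower(s)
      s <- gsub("\\bi\\b", "I", s)
      s <- gsub("[^a-zA-Z0-9' ]", " ", s)
      s <- trimws(gsub("\\s+", " ", s))
      unique(s[s != ""])
    }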

To illustrate this process, we apply these six steps to the sentence below.

## [1] "In the Buckeye district, Superintendent Dennis Honkala said: \"We're pretty frustrated and disappointed. We haven't passed anything in 16 years. I can't explain it. I don't understand.\""

We then obtain the processed sentences below.

## [1] "in the buckeye district superintendent dennis honkala said"
## [2] "we're pretty frustrated and disappointed"                  
## [3] "we haven't passed anything in 16 years"                    
## [4] "I can't explain it"                                        
## [5] "I don't understand"

2. N-gram tokenization

2.1 Unigram tokenization

We now take a look at the distribution of unigram (word) frequencies, using our cleaned corpora.

Next, we use a custom function to tokenize the unigrams in the corpora.

Then, we build the term-document matrix of the unigrams: a matrix with one column per corpus and one row per unigram, whose entries are the frequencies of the unigrams.
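
As a hedged sketch of how this could be done with the tm package (assuming blogs_clean, news_clean and twitter_clean hold the cleaned sentences, e.g. as returned by the clean_text sketch above; the report's own tokenization code may differ):

    library(tm)
    # One document per source, rebuilt from the cleaned sentences
    docs <- c(Blogs   = paste(blogs_clean,   collapse = " "),
              News    = paste(news_clean,    collapse = " "),
              Twitter = paste(twitter_clean, collapse = " "))
    corpus <- VCorpus(VectorSource(docs))
    # Term-document matrix: one row per unigram, one column per corpus
    tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
    m <- as.matrix(tdm)
    colnames(m) <- names(docs)
    # 20 most frequent unigrams in the Blogs corpus
    head(sort(m[, "Blogs"], decreasing = TRUE), 20)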

We use this term-document matrix to build a table with the 20 most frequent unigrams for each corpus.

##    Blogs-1gram Freq     News-1gram Freq     Twitter-1gram Freq   
## 1  "the"       "184571" "the"      "197119" "the"         "90734"
## 2  "and"       "110235" "to"       "90945"  "to"          "77702"
## 3  "to"        "106559" "and"      "89753"  "i"           "73277"
## 4  "a"         "89776"  "a"        "87467"  "a"           "59674"
## 5  "of"        "86722"  "of"       "77197"  "and"         "52519"
## 6  "i"         "80867"  "in"       "67780"  "you"         "49704"
## 7  "in"        "59112"  "for"      "35240"  "in"          "37705"
## 8  "that"      "45486"  "that"     "34641"  "for"         "36277"
## 9  "is"        "42804"  "is"       "28306"  "of"          "35469"
## 10 "it"        "39585"  "on"       "26851"  "is"          "35103"
## 11 "for"       "35763"  "with"     "25393"  "my"          "28712"
## 12 "you"       "29531"  "said"     "24970"  "it"          "27535"
## 13 "with"      "28412"  "was"      "22814"  "on"          "27479"
## 14 "on"        "27304"  "he"       "22802"  "that"        "22608"
## 15 "was"       "27116"  "it"       "21953"  "me"          "18605"
## 16 "my"        "26582"  "i"        "21589"  "at"          "18398"
## 17 "this"      "25729"  "at"       "21033"  "with"        "18374"
## 18 "as"        "22520"  "as"       "18970"  "be"          "18141"
## 19 "have"      "21829"  "his"      "15732"  "your"        "16954"
## 20 "be"        "20729"  "be"       "15386"  "have"        "16281"

For an easier visualisation of these frequencies, we may also use a barplot of the 50 most frequent unigrams.

Another way of visualising these frequencies is to use word clouds.
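
For instance, reusing the matrix m from the sketch above, a barplot and a word cloud of the combined unigram frequencies could be produced roughly as follows (a sketch assuming the wordcloud and RColorBrewer packages):

    freq <- rowSums(m)  # combined unigram frequencies across the three corpora

    # Barplot of the 50 most frequent unigrams
    top50 <- head(sort(freq, decreasing = TRUE), 50)
    barplot(top50, las = 2, cex.names = 0.7,
            main = "50 most frequent unigrams", ylab = "Frequency")

    # Word cloud of the most frequent unigrams
    library(wordcloud)
    library(RColorBrewer)
    wordcloud(words = names(freq), freq = freq, max.words = 100,
              random.order = FALSE, colors = brewer.pal(8, "Dark2"))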

2.2 Bigram tokenization

Next, we take a look at the distribution of bigram frequencies. We again use our entire cleaned corpora and our custom function to tokenize the bigrams.

Then, we build the term-document matrix of the bigrams: a matrix with one column per corpus and one row per bigram, whose entries are the frequencies of the bigrams.
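
One common way to build such a bigram tokenizer is RWeka's NGramTokenizer, sketched below; the report's custom function may be implemented differently. Setting min = max = 3 instead yields the trigram tokenizer used in the next subsection.

    library(tm)
    library(RWeka)
    # Bigram tokenizer passed to TermDocumentMatrix through its control list
    bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))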

We use this term-document matrix to build a table with the 20 most frequent bigrams for each corpus.

##    Blogs-2gram Freq    News-2gram Freq    Twitter-2gram Freq  
## 1  "of the"    "18675" "of the"   "18512" "in the"      "7808"
## 2  "in the"    "15044" "in the"   "17611" "for the"     "5926"
## 3  "to the"    "8413"  "to the"   "8288"  "of the"      "5509"
## 4  "on the"    "7482"  "on the"   "7361"  "on the"      "4849"
## 5  "to be"     "6877"  "for the"  "6994"  "to be"       "4509"
## 6  "and the"   "5953"  "at the"   "5634"  "to the"      "4336"
## 7  "for the"   "5692"  "and the"  "5132"  "at the"      "3633"
## 8  "and i"     "4866"  "in a"     "5049"  "going to"    "3348"
## 9  "i was"     "4789"  "to be"    "4682"  "if you"      "3201"
## 10 "it is"     "4754"  "with the" "4340"  "i have"      "3174"
## 11 "i have"    "4725"  "from the" "3746"  "have a"      "3046"
## 12 "at the"    "4579"  "with a"   "3348"  "i love"      "3023"
## 13 "it was"    "4556"  "he said"  "3346"  "i am"        "2918"
## 14 "in a"      "4529"  "of a"     "3311"  "for a"       "2907"
## 15 "is a"      "4489"  "as a"     "3215"  "to get"      "2736"
## 16 "with the"  "4245"  "for a"    "3010"  "thanks for"  "2674"
## 17 "i am"      "4235"  "will be"  "2852"  "to see"      "2624"
## 18 "from the"  "3772"  "that the" "2796"  "is a"        "2602"
## 19 "that i"    "3577"  "is a"     "2781"  "will be"     "2566"
## 20 "with a"    "3449"  "by the"   "2767"  "and i"       "2508"

For an easier visualisation of these frequencies, we may also use a barplot of the 50 most frequent bigrams.

Here again, we may also use word clouds.

2.3 Trigram tokenization

Next, we take a look at the distribution of trigram frequencies. We once again use our cleaned corpora and our custom function to tokenize the trigrams.

Then, we build the term-document matrix of the trigrams: a matrix with one column per corpus and one row per trigram, whose entries are the frequencies of the trigrams.

We use this term-document matrix once more to build a table with the 20 most frequent trigrams for each corpus.

##    Blogs-3gram     Freq   News-3gram          Freq   Twitter-3gram        Freq  
## 1  "one of the"    "1495" "one of the"        "1438" "thanks for the"     "1215"
## 2  "a lot of"      "1251" "a lot of"          "1198" "looking forward to" "816" 
## 3  "to be a"       "707"  "as well as"        "617"  "i want to"          "689" 
## 4  "as well as"    "685"  "part of the"       "598"  "can't wait to"      "684" 
## 5  "it was a"      "660"  "according to the"  "581"  "going to be"        "684" 
## 6  "some of the"   "658"  "out of the"        "564"  "thank you for"      "657" 
## 7  "out of the"    "650"  "to be a"           "548"  "a lot of"           "636" 
## 8  "the end of"    "626"  "some of the"       "534"  "i love you"         "600" 
## 9  "a couple of"   "605"  "the end of"        "528"  "i need to"          "583" 
## 10 "i want to"     "593"  "in the first"      "512"  "one of the"         "565" 
## 11 "be able to"    "590"  "going to be"       "487"  "i have a"           "564" 
## 12 "the fact that" "530"  "the united states" "454"  "to be a"            "546" 
## 13 "this is a"     "527"  "the first time"    "436"  "i have to"          "532" 
## 14 "it is a"       "508"  "be able to"        "408"  "i'm going to"       "511" 
## 15 "there is a"    "504"  "it was a"          "398"  "to see you"         "492" 
## 16 "i have a"      "502"  "said in a"         "351"  "to go to"           "450" 
## 17 "going to be"   "490"  "of the year"       "350"  "is going to"        "446" 
## 18 "i have to"     "488"  "end of the"        "349"  "i wish i"           "393" 
## 19 "the rest of"   "484"  "for the first"     "344"  "i think i"          "387" 
## 20 "i have been"   "482"  "most of the"       "341"  "you have a"         "377"

For an easier visualisation of these frequencies, we may also use a barplot of the 50 most frequent trigrams.

Here again, we may use word clouds.

3. Unique words needed to cover a percentage of all word instances

To compute how many words we need to cover a given percentage of all word instances, we use some custom functions. In this section, we treat the three corpora as one large combined corpus.

Using a custom function that takes word frequencies and a percentage threshold of word instances as input and returns the number of words needed to reach that coverage, we find that only the 150 (resp. 7863) most frequent words are needed to cover 50% (resp. 90%) of all word instances.

We may also use another custom function that takes word frequencies and a number of unique words as input and returns the percentage of word instances covered by that many words; this lets us plot the percentage of text covered against the number of unique words. Hedged sketches of both helpers are given below.
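
The base R sketches below illustrate the idea behind both helpers (word_freq is assumed to be a numeric vector of word frequencies for the combined corpus; the report's actual functions may differ):

    # Number of most frequent words needed to cover a given share of word instances
    words_for_coverage <- function(word_freq, threshold) {
      freq <- sort(word_freq, decreasing = TRUE)
      which(cumsum(freq) / sum(freq) >= threshold)[1]
    }

    # Share of word instances covered by the n most frequent words (n may be a vector)
    coverage_for_words <- function(word_freq, n) {
      freq <- sort(word_freq, decreasing = TRUE)
      cumsum(freq)[n] / sum(freq)
    }

    # Examples: words_for_coverage(word_freq, 0.5)
    #           plot(coverage_for_words(word_freq, seq_along(word_freq)), type = "l",
    #                xlab = "Number of unique words", ylab = "Share of word instances covered")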

4. Zipf’s law

Zipf’s law is an empirical law stating that, given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. This means that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. To see how well our corpora fit this law, we plot the frequency of each word against the frequency of frequencies of words (i.e. how many distinct words share a given frequency) in log-log coordinates.

We may see that, as Zipf’s law predicts, many words occur only rarely while a few words appear very frequently.
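
A minimal sketch of the classic Zipf check, plotting word frequency against rank on log-log axes (a slightly different view of the same phenomenon), again assuming word_freq holds the combined word frequencies:

    freq <- sort(word_freq, decreasing = TRUE)
    # Under Zipf's law this curve is approximately a straight line in log-log space
    plot(seq_along(freq), freq, log = "xy", type = "l",
         xlab = "Rank of word (log scale)", ylab = "Frequency of word (log scale)",
         main = "Zipf's law")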

5. Conclusion

Following this exploratory data analysis, we plan to build a predictive model based on an n-gram language model under the Markov assumption. The smoothing algorithms we will use are the Simple Good-Turing method and the Kneser-Ney method, combined with Katz backoff.