============================================================

Preparation

The data is a corpus called HC Corpora, which we obtained through the link provided on the Coursera course website. The goal is to use this corpus to build a language model, so that we can predict the next word from the input typed so far. This kind of technique has been widely applied on mobile devices, where a smart input method is extremely valuable on a size-limited device.

The data include text files in four languages: English, Russian, French, and Dutch. For each language there are three files, drawn from blogs, news, and Twitter respectively. I will focus on the English texts because of the limits of my own language capability. These files are very large (~9 million lines in en_US.blogs.txt, ~10 million lines in en_US.news.txt, and ~24 million lines in en_US.twitter.txt), which makes exploratory analysis on the complete sample slow. I will also need to hold out some data for parameter tuning and performance evaluation later. Therefore, I first split the files into smaller chunks using splitraw.R.

> # Split the files randomly into 10 chunks each.
> source(paste(dirScript,'splitraw.R',sep='')) 
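
For reference, a minimal sketch of what splitraw.R might do is shown below. This is only an assumption based on how the chunks are named later (the .Nsub10 suffix); the actual script is not reproduced here, and dirUS is assumed to point at the folder holding the raw en_US files.

# Sketch only: randomly assign each line of a raw text file to one of 10 chunks.
splitraw <- function(filename, nchunks = 10, seed = 1234) {
  set.seed(seed)
  lines <- readLines(file.path(dirUS, filename), encoding = "UTF-8", skipNul = TRUE)
  chunk <- sample(seq_len(nchunks), length(lines), replace = TRUE)
  base  <- sub("\\.txt$", "", filename)
  for (k in seq_len(nchunks)) {
    # e.g. en_US.blogs.3sub10 holds chunk 3 of 10
    writeLines(lines[chunk == k], file.path(dirUS, paste0(base, ".", k, "sub10")))
  }
}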

Basic Word Counts

To understand the contents of the text files, I did some basic counts on the three types of texts. Since a subsample is still quite big, I only read the first 5,000 lines of the first subsample. The processing steps are packaged into basicCounting.R.

> source(list.files(path=dirScripts,pattern='basicCounting.R',full.names=TRUE))
> files <- list.files(path=dirUS, pattern = '.10sub10')
> for (i in seq(files)) {
+   if (i==1) {verbose=TRUE} else verbose = FALSE
+   statsData<-basicCounting(dataname=files[i],verbose=verbose,nlines=5000)
+   if (i==1) { 
+     tableStats <- statsData
+   } else {
+     tableStats <- rbind(tableStats, statsData)
+   }
+ }
## [1] "Loading text file:  en_US.blogs.10sub10"
## [1] "Counting lines and words (excluding numbers) before any filtering"
## [1] "Summary of words in each line:"
## [1] "Apply URL filter ..."
## [1] "Apply Email filter ..."
## [1] "Apply emoticons filter ..."
## [1] "More cleaning: remove escape (like \\t, \n, \\r), quote (\"), space before comma..."
## [1] "More cleaning: remove numbers ..."
## [1] "Counting urls, emails, and emoticons"
## [1] "Apply profanity words filter ... (This is slow...)"
## [1] "More cleaning: adds a space after comma"
## [1] "Replace abbreviations using replace_abbreviation in qdap, ...."
## [1] "Use bracketX to remove contents within brackets ...."
## [1] "Split sentences useing sent_detect... "
## [1] "Cleanning extra space or something again!"
## [1] "Counting sentences and words after filtering."
## [1] "number of sentences:  9617"
## [1] "Summary of number of words in each line (after filtering:"
## [1] "Save text to temporary file: ~/workspace//coursera_DS_cap//en_US.blogs.10sub10.RData"

I also repeated the same procedure to pre-process another subsample, so that I can check whether the results are consistent. Here, I used the first 20,000 lines of the second subsample.

# Summary of the basic counts
source(list.files(path=dirScripts,pattern='organizeStatTable.R',full.names=TRUE))
newtable <- organizeStatTable()
show(newtable$allTable)
                        blogs   news twitter blogs.sample2 news.sample2 twitter.sample2
totNumLines              5000   5000    5000         20000        20000           20000
totNumWordsRaw         202342 164737   61977        814633       661164          247973
totNumEmails                7      6       0            18           41               0
totNumWeb                  16     16      23            58           69              87
totNumEmoticons            73     25     415           257           71            1609
totNumDigitals           1247   1979     774          4968         7681            3077
totBadWords               127     53     178           512          149             753
totNumSentences          9617   6351    1395         37561        23252            3294
totNumWordsAF          158344 112690   12162        624578       410566           27723
avgNumWordsInLines      40.53  32.95   12.40         40.77        33.10           12.40
avgNumWordsInSentence  16.760 17.790   9.546        16.910       17.680           9.741
avgSentencesInLines    1.9234 1.2702  0.2790       1.87805      1.16260         0.16470
maxNumWordsInLines        467    298      31           615          361              33
maxNumWordsInSentence     106    140      81           143          242              92
ratio.1wordsSentence     0.02   0.03    0.05          0.03         0.03            0.06
ratio.lt4wordsSentence   0.08   0.07    0.19          0.07         0.07            0.18

Word Frequencies

So, I will convert the texts into a corpus, and then into a DocumentTermMatrix (DTM) using the tm package, which lets me examine phrase frequencies. There are still several options to choose from when converting the text to the DTM. As a reminder, we now have cleaned text files, where URLs, emails, emoticons, and numbers have been removed and each line contains a single sentence. The next possible steps are: A. remove punctuation, B. convert everything to lower case, and C. remove stop words. The first two steps are performed automatically if we use the word frequency matrix function (wfm) in qdap. So here I will first examine the difference caused by option C.

Note that the word frequency matrix (wfm) is similar to the DTM and is the form used in qdap. I find the interface of qdap easier to use, so I will do the word frequency counts with wfm first. However, I have not yet figured out how to calculate n-gram frequencies with qdap, so I will convert the result to tm later. The good news is that qdap provides convenient tools and a guide for converting its data frame into the corpus format used by tm.

In the plots below, I show the top 25 most frequent words in the three source files: without the stop word filter on the top, and with the stop word filter on the bottom. Stop words are the most commonly used words, which carry little information on their own. Here I use the default stop word list of the tm package, whose top three entries are "i", "me", and "my".
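
The head of that list can be checked directly (a quick sanity check, with tm loaded):

library(tm)
head(stopwords("english"), 3)
## [1] "i"  "me" "my"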

> source(list.files(path=dirScripts,pattern='useWFM.R',full.names=TRUE))
> source(list.files(path=dirScripts,pattern='organizeWordFreqTable.R',full.names=TRUE))
> files <- list.files(path=dirWork, pattern = '.10sub10.RData')
> res1<-useWFM(dataname=files,verbose=TRUE,sentLenCut = 4, savefilename='wfm-sample1')
## [1] "Combine texts into data.frame for qdap..."
## [1] "Load data in en_US.blogs.10sub10.RData"
## [1] "Load data in en_US.news.10sub10.RData"
## [1] "Load data in en_US.twitter.10sub10.RData"
## [1] "further filtering: replace_symbol ==> remove symbols @, %, #, @, &, w/ ..."
## [1] "calculate word frequency using wfm ... "
## [1] "calculate word frequency using wfm removing stop words ... "
## [1] "find most frequent terms using freq_term ... "
## [1] "     This will automatically ignore the punctuations and case."
## [1] "find most frequent terms using freq_term (ignore the top 200 stop words ... "
## [1] "save result to  ~/workspace//coursera_DS_cap//wfm-sample1.RData  for later use..."
> files <- list.files(path=dirWork, pattern = '.2sub10.RData')
> res2<-useWFM(dataname=files,verbose=FALSE,sentLenCut = 4, savefilename='wfm-sample2')
> pltTable = organizeWordFreqTable(res1,res2)

# The list of the top 25 most frequent words in the "blogs" text of sample 1.
print(as.vector(res1$most25Freq$WORD[1:25]))
##  [1] "the"  "and"  "to"   "a"    "of"   "i"    "in"   "is"   "that" "num"  "it"   "for"  "with" "on"   "my"   "was"  "you"  "this" "but"  "as"   "have" "be"   "are" 
## [24] "we"   "at"
# The same list with the stop word filter applied.
print(as.vector(res1$most25FreqExcSW$WORD[1:25]))
##  [1] "num"    "one"    "will"   "just"   "can"    "time"   "like"   "get"    "people" "also"   "new"    "know"   "now"    "first"  "us"     "think"  "well"   "day"   
## [19] "back"   "little" "good"   "way"    "make"   "even"   "going"

Phrase Frequencies

We use an n-gram tokenizer to count the phrases that appear in the texts. For a quick reference, an n-gram is a way to pick up sets of adjacent words in a sentence, where n controls how many adjacent words are picked up. If the sentence is "To be or not to be.", then a 2-gram tokenizer gives us 7 tokens (or phrases): "<s> To", "To be", "be or", "or not", "not to", "to be", and "be </s>", where <s> and </s> mark the sentence boundaries. Accordingly, the 3-gram tokens are "<s> <s> To", "<s> To be", "To be or", "be or not", "or not to", "not to be", "to be </s>", and "be </s> </s>". For more details, see the Wikipedia article on n-grams or the book by Jurafsky & Martin (Ch. I.6).
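
As a toy illustration in plain base R (not the tokenizer used below), the padded bigrams of that sentence can be generated like this:

# Toy n-gram generator with sentence-boundary padding (illustration only).
ngrams <- function(words, n) {
  padded <- c(rep("<s>", n - 1), words, rep("</s>", n - 1))
  sapply(seq_len(length(padded) - n + 1),
         function(i) paste(padded[i:(i + n - 1)], collapse = " "))
}
ngrams(c("To", "be", "or", "not", "to", "be"), 2)
## [1] "<s> To"  "To be"   "be or"   "or not"  "not to"  "to be"   "be </s>"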

Since we already saw the differences between the texts from the three source files in the word frequency counts, there is no need to compare them again here. I can foresee that choosing the right corpus for the language model will be difficult later on, but let's not worry about that now. For the exploratory stage of the n-gram model, I will simply use a small sample from a single source: the 20,000 lines of subsample 2 of the blogs text.

Technically, I will convert the data frame used by qdap into a corpus for the functions in tm. Then I will use an n-gram tokenizer to build the DocumentTermMatrix for the 2-gram and 3-gram models.
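
A minimal sketch of how this step could look, assuming the RWeka n-gram tokenizer and a character vector of cleaned sentences called sentences (the actual work is wrapped in makeDTF below, whose code is not reproduced here):

library(tm)
library(RWeka)

# Sketch: build 2- and 3-gram document-term matrices from the cleaned sentences.
corpus <- VCorpus(VectorSource(sentences))
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Overall term frequencies, sorted for the log-log frequency plot.
freq2 <- sort(slam::col_sums(dtm2), decreasing = TRUE)
freq3 <- sort(slam::col_sums(dtm3), decreasing = TRUE)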

# makeDTF() lives in its own script (assumed here to be makeDTF.R, following the naming pattern of the other helpers)
source(list.files(path=dirScripts,pattern='makeDTF.R',full.names=TRUE))
files <- list.files(path=dirWork, pattern = '.blogs.2sub10.RData')
res1 <- makeDTF(dataname=files, verbose=TRUE, sentLenCut = 4, savefilename='dtf-blogs-sample2')
# plotting
source(list.files(path=dirScripts,pattern='organizeNGramPlot.R',full.names=TRUE))
resplt <- organizeNGramPlot('dtf-blogs-sample2')
ggplot(resplt$plt, aes(x=logindex, y=logFreq, color=ngram)) + geom_point() +
    labs(title='Phrase Frequency Distribution', y='log(Freq)', x='log(Index)')

The plot above shows the frequency distributions of the bigram and trigram phrases, with both axes in log scale. The trigram frequencies drop more slowly than the bigram curve, and both drop by a factor of ~100 over the most frequent 10,000 terms. The total number of distinct phrases is ~289,000 for bigrams and ~52,500 for trigrams. So, at least for this 20,000-line subsample, we can reduce the size of the n-gram model to less than 3% and still keep the meaningful information.
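
A quick way to quantify that kind of cutoff is a cumulative coverage curve over the sorted frequencies, e.g. with the freq2 vector from the sketch above:

# Fraction of all bigram occurrences covered by the k most frequent bigram types.
coverage <- cumsum(freq2) / sum(freq2)
head(which(coverage >= 0.9), 1)   # number of bigram types needed for 90% coverage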

Below is the list of the top 25 most frequent phrases in the blogs text:

## [1] "Bigram"
##  [1] "of the"   "in the"   "to the"   "on the"   "to be"    "and the"  "for the"  "and i"    "i was"    "is a"     "it is"    "it was"   "i have"   "at the"   "in a"    
## [16] "with the" "that i"   "i am"     "it s"     "from the" "with a"   "i m"      "of a"     "for a"    "num num"
## [1] "Trigram:"
##  [1] "num  num"      "one of the"    "a lot of"      "i don t"       "out of the"    "some of the"   "the end of"    "it was a"      "to be a"       "as well as"   
## [11] "i have to"     "this is a"     "i have been"   "the fact that" "be able to"    "it is a"       "i didn t"      "a couple of"   "it s a"        "na na na"     
## [21] "i had to"      "it would be"   "most of the"   "part of the"   "a bit of"

Some Thoughts for the Language Model

Notes and References