============================================================

Preparation

The data is a corpus called HC Corpora, which we obtained through the link provided on the Coursera course website. The goal is to use this corpus to build a language model, so that we can predict the next word from the input typed so far. This kind of technique has been widely applied on mobile devices, where a smart input method is extremely valuable on a size-limited device.

The data include text files in four languages: English, Russian, French, and Dutch. For each language there are three files, drawn from blogs, news, and Twitter respectively. I will focus on the English texts because of the limits of my own language capability. These files are very large (~9 million lines in en_US.blogs.txt, ~10 million lines in en_US.news.txt, and ~24 million lines in en_US.twitter.txt), which makes exploratory analysis on the complete sample slow. I will also need to hold out some data for parameter tuning and performance evaluation later. Therefore, I first split the files into smaller chunks using splitraw.R.

> # Split the files randomly into 10 chunks each.
> source(paste(dirScript,'splitraw.R',sep='')) 
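
For reference, a minimal sketch of what splitraw.R might do is shown below. This is only an assumption based on how the chunks are named later (the .Nsub10 suffix); the actual script is not reproduced here, and dirUS is assumed to point at the folder holding the raw en_US files.

# Sketch only: randomly assign each line of a raw text file to one of 10 chunks.
splitraw <- function(filename, nchunks = 10, seed = 1234) {
  set.seed(seed)
  lines <- readLines(file.path(dirUS, filename), encoding = "UTF-8", skipNul = TRUE)
  chunk <- sample(seq_len(nchunks), length(lines), replace = TRUE)
  base  <- sub("\\.txt$", "", filename)
  for (k in seq_len(nchunks)) {
    # e.g. en_US.blogs.3sub10 holds chunk 3 of 10
    writeLines(lines[chunk == k], file.path(dirUS, paste0(base, ".", k, "sub10")))
  }
}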

Basic Word Counts

To understand the contents of the text files, I did some basic counts on the three types of texts. Since a subsample is still quite big, I only read the first 5,000 lines of the first subsample. The processing steps are packaged into basicCounting.R.

> source(list.files(path=dirScripts,pattern='basicCounting.R',full.names=TRUE))
> files <- list.files(path=dirUS, pattern = '.10sub10')
> for (i in seq(files)) {
+   if (i==1) {verbose=TRUE} else verbose = FALSE
+   statsData<-basicCounting(dataname=files[i],verbose=verbose,nlines=5000)
+   if (i==1) { 
+     tableStats <- statsData
+   } else {
+     tableStats <- rbind(tableStats, statsData)
+   }
+ }
## [1] "Loading text file:  en_US.blogs.10sub10"
## [1] "Counting lines and words (excluding numbers) before any filtering"
## [1] "Summary of words in each line:"
## [1] "Apply URL filter ..."
## [1] "Apply Email filter ..."
## [1] "Apply emoticons filter ..."
## [1] "More cleaning: remove escape (like \\t, \n, \\r), quote (\"), space before comma..."
## [1] "More cleaning: remove numbers ..."
## [1] "Counting urls, emails, and emoticons"
## [1] "Apply profanity words filter ... (This is slow...)"
## [1] "More cleaning: adds a space after comma"
## [1] "Replace abbreviations using replace_abbreviation in qdap, ...."
## [1] "Use bracketX to remove contents within brackets ...."
## [1] "Split sentences useing sent_detect... "
## [1] "Cleanning extra space or something again!"
## [1] "Counting sentences and words after filtering."
## [1] "number of sentences:  9617"
## [1] "Summary of number of words in each line (after filtering:"
## [1] "Save text to temporary file: ~/workspace//coursera_DS_cap//en_US.blogs.10sub10.RData"

I also repeated the same procedure to pre-process another subsample, so that I can check whether the results are consistent. Here, I used the first 20,000 lines of the second subsample.

# Summary of the basic counts
source(list.files(path=dirScripts,pattern='organizeStatTable.R',full.names=TRUE))
newtable <- organizeStatTable()
show(newtable$allTable)
                        blogs   news twitter blogs.sample2 news.sample2 twitter.sample2
totNumLines              5000   5000    5000         20000        20000           20000
totNumWordsRaw         202342 164737   61977        814633       661164          247973
totNumEmails                7      6       0            18           41               0
totNumWeb                  16     16      23            58           69              87
totNumEmoticons            73     25     415           257           71            1609
totNumDigitals           1247   1979     774          4968         7681            3077
totBadWords               127     53     178           512          149             753
totNumSentences          9617   6351    1395         37561        23252            3294
totNumWordsAF          158344 112690   12162        624578       410566           27723
avgNumWordsInLines      40.53  32.95   12.40         40.77        33.10           12.40
avgNumWordsInSentence  16.760 17.790   9.546        16.910       17.680           9.741
avgSentencesInLines    1.9234 1.2702  0.2790       1.87805      1.16260         0.16470
maxNumWordsInLines        467    298      31           615          361              33
maxNumWordsInSentence     106    140      81           143          242              92
ratio.1wordsSentence     0.02   0.03    0.05          0.03         0.03            0.06
ratio.lt4wordsSentence   0.08   0.07    0.19          0.07         0.07            0.18

Word Frequencies

So, I will convert the texts into a corpus, and then into a DocumentTermMatrix (DTM) using the tm package, which lets me examine phrase frequencies. There are still several options to choose from when converting the text to the DTM. As a reminder, we now have cleaned text files, where URLs, emails, emoticons, and numbers have been removed and each line contains a single sentence. The next possible steps are: A. remove punctuation, B. convert everything to lower case, and C. remove stop words. The first two steps are performed automatically if we use the word frequency matrix function (wfm) in qdap. So here I will first examine the difference caused by option C.

Note that the word frequency matrix (wfm) is similar to the DTM and is the form used in qdap. I find the interface of qdap easier to use, so I will do the word frequency counts with wfm first. However, I have not yet figured out how to calculate n-gram frequencies with qdap, so I will convert the result to tm later. The good news is that qdap provides convenient tools and a guide for converting its data frame into the corpus format used by tm.

In the plots below, I show the top 25 most frequent words in the three source files: without the stop word filter on the top, and with the stop word filter on the bottom. Stop words are the most commonly used words, which carry little information on their own. Here I use the default stop word list of the tm package, whose top three entries are "i", "me", and "my".
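
The head of that list can be checked directly (a quick sanity check, with tm loaded):

library(tm)
head(stopwords("english"), 3)
## [1] "i"  "me" "my"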

> source(list.files(path=dirScripts,pattern='useWFM.R',full.names=TRUE))
> source(list.files(path=dirScripts,pattern='organizeWordFreqTable.R',full.names=TRUE))
> files <- list.files(path=dirWork, pattern = '.10sub10.RData')
> res1<-useWFM(dataname=files,verbose=TRUE,sentLenCut = 4, savefilename='wfm-sample1')
## [1] "Combine texts into data.frame for qdap..."
## [1] "Load data in en_US.blogs.10sub10.RData"
## [1] "Load data in en_US.news.10sub10.RData"
## [1] "Load data in en_US.twitter.10sub10.RData"
## [1] "further filtering: replace_symbol ==> remove symbols @, %, #, @, &, w/ ..."
## [1] "calculate word frequency using wfm ... "
## [1] "calculate word frequency using wfm removing stop words ... "
## [1] "find most frequent terms using freq_term ... "
## [1] "     This will automatically ignore the punctuations and case."
## [1] "find most frequent terms using freq_term (ignore the top 200 stop words ... "
## [1] "save result to  ~/workspace//coursera_DS_cap//wfm-sample1.RData  for later use..."
> files <- list.files(path=dirWork, pattern = '.2sub10.RData')
> res2<-useWFM(dataname=files,verbose=FALSE,sentLenCut = 4, savefilename='wfm-sample2')
> pltTable = organizeWordFreqTable(res1,res2)

# The list of the top 25 most frequent words in the "blogs" text of sample 1.
print(as.vector(res1$most25Freq$WORD[1:25]))
##  [1] "the"  "and"  "to"   "a"    "of"   "i"    "in"   "is"   "that" "num"  "it"   "for"  "with" "on"   "my"   "was"  "you"  "this" "but"  "as"   "have" "be"   "are" 
## [24] "we"   "at"
# The same list with the stop word filter applied.
print(as.vector(res1$most25FreqExcSW$WORD[1:25]))
##  [1] "num"    "one"    "will"   "just"   "can"    "time"   "like"   "get"    "people" "also"   "new"    "know"   "now"    "first"  "us"     "think"  "well"   "day"   
## [19] "back"   "little" "good"   "way"    "make"   "even"   "going"

Phrase Frequencies

We use an n-gram tokenizer to count the phrases that appear in the texts. For a quick reference, an n-gram is a way to pick up sets of adjacent words in a sentence, where n controls how many adjacent words are picked up. If the sentence is "To be or not to be.", then a 2-gram tokenizer gives us 7 tokens (or phrases): "<s> To", "To be", "be or", "or not", "not to", "to be", and "be </s>", where <s> and </s> mark the sentence boundaries. Accordingly, the 3-gram tokens are "<s> <s> To", "<s> To be", "To be or", "be or not", "or not to", "not to be", "to be </s>", and "be </s> </s>". For more details, see the Wikipedia article on n-grams or the book by Jurafsky & Martin (Ch. I.6).
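
As a toy illustration in plain base R (not the tokenizer used below), the padded bigrams of that sentence can be generated like this:

# Toy n-gram generator with sentence-boundary padding (illustration only).
ngrams <- function(words, n) {
  padded <- c(rep("<s>", n - 1), words, rep("</s>", n - 1))
  sapply(seq_len(length(padded) - n + 1),
         function(i) paste(padded[i:(i + n - 1)], collapse = " "))
}
ngrams(c("To", "be", "or", "not", "to", "be"), 2)
## [1] "<s> To"  "To be"   "be or"   "or not"  "not to"  "to be"   "be </s>"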

Since we already saw the differences between the texts from the three source files in the word frequency counts, there is no need to compare them again here. I can foresee that choosing the right corpus for the language model will be difficult later on, but let's not worry about that now. For the exploratory stage of the n-gram model, I will simply use a small sample from a single source: the 20,000 lines of subsample 2 of the blogs text.

Technically, I will convert the data frame used by qdap into a corpus for the functions in tm. Then I will use an n-gram tokenizer to build the DocumentTermMatrix for the 2-gram and 3-gram models.
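
A minimal sketch of how this step could look, assuming the RWeka n-gram tokenizer and a character vector of cleaned sentences called sentences (the actual work is wrapped in makeDTF below, whose code is not reproduced here):

library(tm)
library(RWeka)

# Sketch: build 2- and 3-gram document-term matrices from the cleaned sentences.
corpus <- VCorpus(VectorSource(sentences))
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Overall term frequencies, sorted for the log-log frequency plot.
freq2 <- sort(slam::col_sums(dtm2), decreasing = TRUE)
freq3 <- sort(slam::col_sums(dtm3), decreasing = TRUE)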

# makeDTF() lives in its own script (assumed here to be makeDTF.R, following the naming pattern of the other helpers)
source(list.files(path=dirScripts,pattern='makeDTF.R',full.names=TRUE))
files <- list.files(path=dirWork, pattern = '.blogs.2sub10.RData')
res1 <- makeDTF(dataname=files, verbose=TRUE, sentLenCut = 4, savefilename='dtf-blogs-sample2')
# plotting
source(list.files(path=dirScripts,pattern='organizeNGramPlot.R',full.names=TRUE))
resplt <- organizeNGramPlot('dtf-blogs-sample2')
ggplot(resplt$plt, aes(x=logindex, y=logFreq, color=ngram)) + geom_point() +
    labs(title='Phrase Frequency Distribution', y='log(Freq)', x='log(Index)')

The plot above shows the frequency distributions of the bigram and trigram phrases, with both axes in log scale. The trigram frequencies drop more slowly than the bigram curve, and both drop by a factor of ~100 over the most frequent 10,000 terms. The total number of distinct phrases is ~289,000 for bigrams and ~52,500 for trigrams. So, at least for this 20,000-line subsample, we can reduce the size of the n-gram model to less than 3% and still keep the meaningful information.
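
A quick way to quantify that kind of cutoff is a cumulative coverage curve over the sorted frequencies, e.g. with the freq2 vector from the sketch above:

# Fraction of all bigram occurrences covered by the k most frequent bigram types.
coverage <- cumsum(freq2) / sum(freq2)
head(which(coverage >= 0.9), 1)   # number of bigram types needed for 90% coverage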

Below is the list of the top 25 most frequent phrases in the blogs text:

## [1] "Bigram"
##  [1] "of the"   "in the"   "to the"   "on the"   "to be"    "and the"  "for the"  "and i"    "i was"    "is a"     "it is"    "it was"   "i have"   "at the"   "in a"    
## [16] "with the" "that i"   "i am"     "it s"     "from the" "with a"   "i m"      "of a"     "for a"    "num num"
## [1] "Trigram:"
##  [1] "num  num"      "one of the"    "a lot of"      "i don t"       "out of the"    "some of the"   "the end of"    "it was a"      "to be a"       "as well as"   
## [11] "i have to"     "this is a"     "i have been"   "the fact that" "be able to"    "it is a"       "i didn t"      "a couple of"   "it s a"        "na na na"     
## [21] "i had to"      "it would be"   "most of the"   "part of the"   "a bit of"

Some Thoughts for the Language Model

Notes and References