The data is a corpus called HC Corpora, which we obtained through the link provided on the Coursera course website. The goal is to use this corpus to build a language model, so that we can predict the next word from the user's input. This kind of technique is widely applied on mobile devices, where a smart input method is extremely valuable for a size-limited device.
The data include text files in 4 languages: English, Russian, French and Dutch. For each language, there are three files collected from blogs, news, and Twitter, respectively. I will focus on the English texts because of the limitation of my language capability. These files are large (~9 million lines in en_US.blogs.txt, ~10 million lines in en_US.news.txt, and ~24 million lines in en_US.twitter.txt), which would make data exploration slow if I used the complete sample. I also need to hold out some data for parameter tuning and performance evaluation later. Therefore, I first split the files into smaller chunks using splitraw.R.
> # Split each file randomly into 10 chunks.
> source(paste(dirScripts,'splitraw.R',sep=''))
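splitraw.R itself is not shown in this report; the sketch below is a minimal, hypothetical version of the idea, assuming the goal is simply to shuffle each raw file's lines and write them back out as 10 roughly equal chunks (the function name and output naming are placeholders, not the actual script).

# Hypothetical sketch of the splitting step (not the actual splitraw.R)
splitRawFile <- function(infile, nchunks = 10, seed = 1234) {
  set.seed(seed)                                   # make the shuffle reproducible
  txt   <- readLines(infile, encoding = 'UTF-8', skipNul = TRUE)
  txt   <- sample(txt)                             # random permutation of the lines
  chunk <- cut(seq_along(txt), nchunks, labels = FALSE)
  base  <- sub('\\.txt$', '', infile)
  for (i in seq_len(nchunks)) {
    # e.g. en_US.blogs.1sub10, ..., en_US.blogs.10sub10
    writeLines(txt[chunk == i], paste0(base, '.', i, 'sub', nchunks))
  }
}
splitRawFile('en_US.blogs.txt')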
To understand the contents of the text files, I did some basic counting on the three types of texts. Since a subsample is still too big for quick exploration, I only read the first 5,000 lines of the first subsample. The processing steps are packed into basicCounting.R.
> source(list.files(path=dirScripts,pattern='basicCounting.R',full.names=TRUE))
> files <- list.files(path=dirUS, pattern = '.10sub10')
> for (i in seq(files)) {
+   if (i==1) {verbose=TRUE} else verbose=FALSE
+   statsData <- basicCounting(dataname=files[i], verbose=verbose, nlines=5000)
+   if (i==1) {
+     tableStats <- statsData
+   } else {
+     tableStats <- rbind(tableStats, statsData)
+   }
+ }
## [1] "Loading text file: en_US.blogs.10sub10"
## [1] "Counting lines and words (excluding numbers) before any filtering"
## [1] "Summary of words in each line:"
## [1] "Apply URL filter ..."
## [1] "Apply Email filter ..."
## [1] "Apply emoticons filter ..."
## [1] "More cleaning: remove escape (like \\t, \n, \\r), quote (\"), space before comma..."
## [1] "More cleaning: remove numbers ..."
## [1] "Counting urls, emails, and emoticons"
## [1] "Apply profanity words filter ... (This is slow...)"
## [1] "More cleaning: adds a space after comma"
## [1] "Replace abbreviations using replace_abbreviation in qdap, ...."
## [1] "Use bracketX to remove contents within brackets ...."
## [1] "Split sentences useing sent_detect... "
## [1] "Cleanning extra space or something again!"
## [1] "Counting sentences and words after filtering."
## [1] "number of sentences: 9617"
## [1] "Summary of number of words in each line (after filtering:"
## [1] "Save text to temporary file: ~/workspace//coursera_DS_cap//en_US.blogs.10sub10.RData"
I also repeated the same procedure to pre-process another subsample, so that I could check whether the results are consistent. Here, I tried the first 20,000 lines of the second subsample.
# Summary of the basic counts
source(list.files(path=dirScripts,pattern='organizeStatTable.R',full.names=TRUE))
newtable <- organizeStatTable()
show(newtable$allTable)
blogs news twitter blogs.sample2 news.sample2 twitter.sample2
totNumLines 5000 5000 5000 20000 20000 20000
totNumWordsRaw 202342 164737 61977 814633 661164 247973
totNumEmails 7 6 0 18 41 0
totNumWeb 16 16 23 58 69 87
totNumEmoticons 73 25 415 257 71 1609
totNumDigitals 1247 1979 774 4968 7681 3077
totBadWords 127 53 178 512 149 753
totNumSentences 9617 6351 1395 37561 23252 3294
totNumWordsAF 158344 112690 12162 624578 410566 27723
avgNumWordsInLines 40.53 32.95 12.40 40.77 33.10 12.40
avgNumWordsInSentence 16.760 17.790 9.546 16.910 17.680 9.741
avgSentencesInLines 1.9234 1.2702 0.2790 1.87805 1.16260 0.16470
maxNumWordsInLines 467 298 31 615 361 33
maxNumWordsInSentence 106 140 81 143 242 92
ratio.1wordsSentence 0.02 0.03 0.05 0.03 0.03 0.06
ratio.lt4wordsSentence 0.08 0.07 0.19 0.07 0.07 0.18
So, I will convert the texts into a corpus, and then into a DocumentTermMatrix (DTM), using the tm package. The DTM will let me examine phrase frequencies. There are still several options for converting the text to the DTM. As a reminder, we now have cleaned text files, where URLs, emails, emoticons, and numbers are removed, and there is only one sentence per line. The next possible steps are: A. remove punctuation, B. convert everything to lower case, and C. remove stop words. The first two steps are performed automatically if we use the word frequency matrix function (wfm) in qdap. So, here, I will first examine the difference caused by option C.
Note that the word frequency matrix (wfm) is similar to a DTM and is used in qdap. I find the interface of qdap easier to use, so I will do the word frequency counts with wfm first. However, I have not yet figured out how to calculate n-gram frequencies with qdap, so I will convert the result to tm later. The good news is that qdap provides convenient tools, and a guide, for converting its data frame into the corpus format used by tm.
In the plots below, I show the top 25 most frequent words in the three source files: without the stop words filter on the top, and with the stop words filter on the bottom. Stop words are the most commonly used words, which carry little information on their own. Here I use the default stop word list of the tm package, whose first three entries are "i", "me", and "my".
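For reference, these come straight from tm and can be inspected directly:

library(tm)
length(stopwords('english'))    # size of tm's default English stop word list
head(stopwords('english'), 3)   # "i" "me" "my"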
> source(list.files(path=dirScripts,pattern='useWFM.R',full.names=TRUE))
> source(list.files(path=dirScripts,pattern='organizeWordFreqTable.R',full.names=TRUE))
> files <- list.files(path=dirWork, pattern = '.10sub10.RData')
> res1<-useWFM(dataname=files,verbose=TRUE,sentLenCut = 4, savefilename='wfm-sample1')
## [1] "Combine texts into data.frame for qdap..."
## [1] "Load data in en_US.blogs.10sub10.RData"
## [1] "Load data in en_US.news.10sub10.RData"
## [1] "Load data in en_US.twitter.10sub10.RData"
## [1] "further filtering: replace_symbol ==> remove symbols @, %, #, @, &, w/ ..."
## [1] "calculate word frequency using wfm ... "
## [1] "calculate word frequency using wfm removing stop words ... "
## [1] "find most frequent terms using freq_term ... "
## [1] " This will automatically ignore the punctuations and case."
## [1] "find most frequent terms using freq_term (ignore the top 200 stop words ... "
## [1] "save result to ~/workspace//coursera_DS_cap//wfm-sample1.RData for later use..."
> files <- list.files(path=dirWork, pattern = '.2sub10.RData')
> res2<-useWFM(dataname=files,verbose=FALSE,sentLenCut = 4, savefilename='wfm-sample2')
> pltTable = organizeWordFreqTable(res1,res2)
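useWFM.R is only a wrapper; its core presumably boils down to qdap calls like the ones sketched below (the object name `sentences` and the exact arguments are my assumptions, not the script's code).

library(qdap)

# Word frequency matrix over the cleaned sentences; wfm lowercases the text
# and strips punctuation automatically.
wordFreq <- wfm(text.var = sentences)

# Most frequent terms, with and without a stop word filter. The log above
# suggests the script actually ignores qdap's Top200Words list instead.
top25        <- freq_terms(sentences, top = 25)
top25_noStop <- freq_terms(sentences, top = 25, stopwords = tm::stopwords('english'))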
# The top 25 most frequent words of the "blogs" text in sample 1.
print(as.vector(res1$most25Freq$WORD[1:25]))
## [1] "the" "and" "to" "a" "of" "i" "in" "is" "that" "num" "it" "for" "with" "on" "my" "was" "you" "this" "but" "as" "have" "be" "are"
## [24] "we" "at"
# With the stop words filter.
print(as.vector(res1$most25FreqExcSW$WORD[1:25]))
## [1] "num" "one" "will" "just" "can" "time" "like" "get" "people" "also" "new" "know" "now" "first" "us" "think" "well" "day"
## [19] "back" "little" "good" "way" "make" "even" "going"
We use an n-gram tokenizer to count the phrases that appear in the texts. For a quick reference, an n-gram is a sequence of adjacent words picked from a sentence, where n controls how many adjacent words are taken at a time. For example, for the sentence "To be or not to be.", a 2-gram (bigram) tokenizer gives us 5 tokens (or phrases): "to be", "be or", "or not", "not to", and "to be".
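To make that concrete, here is a tiny base-R illustration of what a 2-gram tokenizer produces for that sentence:

words   <- tolower(strsplit('To be or not to be.', '\\W+')[[1]])
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
## [1] "to be"  "be or"  "or not" "not to" "to be"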
Since we have already seen, in the word frequency counts, how the texts from the three source files differ, there is no need to compare them again here. I expect it will be difficult to choose the right corpus for the language model later on, but let's not worry about that now. For the data exploration stage of the n-gram model, I will simply use a small sample from a single source: the 20,000-line second subsample of the blogs text.
Technically, I will convert the data frame used by qdap into a corpus for the functions in tm. Then, I will use the n-gram tokenizer to make the DocumentTermMatrix for the 2-gram and 3-gram models.
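The makeDTF helper called below is not reproduced in this report; as a rough sketch, the bigram and trigram DocumentTermMatrix objects could be built with tm plus the RWeka n-gram tokenizer as follows (object names such as `sentences`, `dtm2`, and `dtm3` are placeholders, not the ones used in the actual script).

library(tm)
library(RWeka)

corpus <- VCorpus(VectorSource(sentences))   # one cleaned sentence per document

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Total count of each phrase across all documents, sorted by frequency
freq2 <- sort(slam::col_sums(dtm2), decreasing = TRUE)
freq3 <- sort(slam::col_sums(dtm3), decreasing = TRUE)
head(names(freq2), 25)   # the 25 most frequent bigrams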
# makeDTF is assumed to be defined in makeDTF.R under dirScripts
source(list.files(path=dirScripts,pattern='makeDTF.R',full.names=TRUE))
files <- list.files(path=dirWork, pattern = '.blogs.2sub10.RData')
res1<-makeDTF(dataname=files,verbose=TRUE,sentLenCut = 4, savefilename='dtf-blogs-sample2')
#plotting
source(list.files(path=dirScripts,pattern='organizeNGramPlot.R',full.names=TRUE))
resplt<-organizeNGramPlot('dtf-blogs-sample2')
ggplot(resplt$plt, aes(x=logindex, y=logFreq, color=ngram)) + geom_point() +
  labs(title='Phrase Frequency Distribution', y='log(Freq)', x='log(Index)')
The plot above shows the frequency distributions of the bigram and trigram phrases. Both axes are on a log scale. The trigram frequencies drop more slowly than the bigram curve, and both drop by a factor of ~100 over the most frequent 10,000 terms. The total number of distinct phrases is ~289,000 for bigrams and ~52,500 for trigrams. So, at least for this 20,000-line subsample, we can reduce the size of the n-gram model to less than 3% and still keep the meaningful information.
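One way to quantify that claim is cumulative coverage. With a sorted phrase frequency vector such as `freq2` from the sketch above (again a placeholder name), it takes only a few lines:

coverage <- cumsum(freq2) / sum(freq2)
coverage[10000]             # fraction of all bigram occurrences covered by the top 10,000 phrases
which(coverage >= 0.9)[1]   # number of phrases needed to cover 90% of occurrences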
Below is the list of the top 25 most frequent phrases in the blogs text:
## [1] "Bigram"
## [1] "of the" "in the" "to the" "on the" "to be" "and the" "for the" "and i" "i was" "is a" "it is" "it was" "i have" "at the" "in a"
## [16] "with the" "that i" "i am" "it s" "from the" "with a" "i m" "of a" "for a" "num num"
## [1] "Trigram:"
## [1] "num num" "one of the" "a lot of" "i don t" "out of the" "some of the" "the end of" "it was a" "to be a" "as well as"
## [11] "i have to" "this is a" "i have been" "the fact that" "be able to" "it is a" "i didn t" "a couple of" "it s a" "na na na"
## [21] "i had to" "it would be" "most of the" "part of the" "a bit of"