We first download and unzip the data from the link given for the Capstone project. The data files are in separate folders for each language; we will use the English language files for this project (folder /final/en_US). There are 3 files: one for blog data, one for news data and one for Twitter data. The file sizes (in bytes) are as follows:
210160014 Sep 18 21:29 en_US.blogs.txt
205811889 Sep 18 21:29 en_US.news.txt
167105331 Sep 20 20:05 en_US.twitter.txt
Counting the number of lines of text in each file:
$ wc -l en_US.blogs.txt
899288 en_US.blogs.txt
$ wc -l en_US.news.txt
1010242 en_US.news.txt
$ wc -l en_US.twitter.txt
2360148 en_US.twitter.txt
Counting the number of words in each file:
$ wc -w en_US.twitter.txt
30341028 en_US.twitter.txt
$ wc -w en_US.news.txt
34309642 en_US.news.txt
$ wc -w en_US.blogs.txt
37272578 en_US.blogs.txt
Each line of text in these files is a single tweet, blog post or news item, and therefore corresponds to a single document in the corpus.
The longest line in the twitter file is as follows:
$ cat ./en_US.twitter.txt | awk ' { if ( length > x ) { x = length; y = $0 } }END{ print y }'
It's time for you to give me a little bit of lovin'(さぁちょっとはあなたの愛をちょうだい)Baby, hold me tight and do what I tell you!(ベイビー抱きしめて私が言うように!)
There are lots of foreign-language characters, which will have to be removed in the pre-processing steps.
The above tweet has 1 line, 21 words and 214 characters.
The longest line in the blogs file has 6630 words and 40836 characters.
The longest line in the news file has 1792 words and 11385 characters.
The first step is to read in the data from the files. We use a binomial split of the lines read, keeping 80% for training, 10% for validation and 10% for testing. The data was read in with the readLines function and fed into a tm::VCorpus object via a VectorSource. Reading 80% of 10,000 lines from each file gives roughly 24,000 documents in the training corpus.
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 24030
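A minimal sketch of this reading and sampling step, assuming the files sit under final/en_US/ and a 10,000-line read from each file; the report describes a binomial draw, which is approximated here with sample(), and the variable names are illustrative:

library(tm)

set.seed(1234)   # reproducible sampling

# Read the first 10,000 lines of each file
read_sample <- function(path, n = 10000) {
  readLines(path, n = n, skipNul = TRUE)
}

files <- file.path("final", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
lines <- unlist(lapply(files, read_sample))

# Randomly assign each line: ~80% train, 10% validation, 10% test
split <- sample(c("train", "valid", "test"), length(lines),
                replace = TRUE, prob = c(0.8, 0.1, 0.1))
train_lines <- lines[split == "train"]

# Each line (tweet, blog post or news item) becomes one document
corpus <- VCorpus(VectorSource(train_lines))
corpus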
The data was then pre-processed by doing the following (see the sketch after this list):
Strip extra whitespace
Convert the text to lower case
Remove “non-printable” characters and non-ASCII characters
Remove English stopwords (common words)
Remove punctuation and numbers
Remove profane or swear words based on a downloaded list
Stemming or lemmatization of the words was not done on the text.
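The cleaning could be expressed with tm_map roughly as below; the profanity list file name is an assumption, and the exact transformations and their order may differ from what was actually run:

library(tm)

# Assumed: a downloaded profanity list, one word per line (file name hypothetical)
profanity <- readLines("bad-words.txt", warn = FALSE)

# Drop non-ASCII and non-printable characters
to_ascii <- content_transformer(function(x) {
  x <- iconv(x, from = "", to = "ASCII", sub = "")
  gsub("[^[:print:]]", " ", x)
})

corpus_clean <- corpus
corpus_clean <- tm_map(corpus_clean, to_ascii)
corpus_clean <- tm_map(corpus_clean, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, profanity)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)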
Viewing the contents of the first document of the “cleaned” corpus:
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 106
##
## st louis plant close die old age workers making cars since onset mass automotive production s
We will next tokenize the words in each document and create a Document Term Matrix. The tokenizer used was MC_tokenizer from the tm package.
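A sketch of this step, assuming the cleaned corpus object is named corpus_clean (the actual call may differ):

library(tm)

# Build the Document Term Matrix using tm's MC_tokenizer
dtm <- DocumentTermMatrix(corpus_clean,
                          control = list(tokenize = MC_tokenizer))
inspect(dtm)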
## <<DocumentTermMatrix (documents: 24030, terms: 40442)>>
## Non-/sparse entries: 435174/971386086
## Sparsity : 100%
## Maximal term length: 91
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs also can get just like new one people said time
## 1320 0 0 0 1 2 0 0 0 0 1
## 15384 2 0 0 4 3 0 1 1 2 0
## 17934 0 0 0 1 1 1 4 0 0 0
## 18822 1 0 0 0 0 0 1 0 0 0
## 20299 0 0 0 0 0 0 0 0 0 0
## 20300 0 0 0 0 0 0 0 0 0 0
## 21678 1 1 0 0 0 1 1 0 0 0
## 2886 0 2 1 1 2 1 0 0 0 1
## 4593 3 0 0 0 0 4 0 0 0 0
## 7608 1 0 0 0 0 0 1 0 1 0
And look at some of the most frequently used terms (those appearing more than 1,000 times):
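One way to pull these out is tm's findFreqTerms; a sketch:

# Terms occurring at least 1,000 times across the whole corpus
freq_terms <- findFreqTerms(dtm, lowfreq = 1000)
freq_terms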
## [1] "also" "can" "first" "get" "just" "last" "like"
## [8] "new" "now" "one" "people" "said" "time" "two"
## [15] "year" "years"
To get a better idea of what is in the data, we will first create a wordcloud showing the top 50 most frequent words in the corpus.
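A possible way to build such a wordcloud with the wordcloud package (the plotting code is an assumption, not the report's exact code):

library(slam)
library(wordcloud)
library(RColorBrewer)

# Total frequency of each term across all documents
term_freq <- sort(col_sums(dtm), decreasing = TRUE)

# Word cloud of the 50 most frequent words
wordcloud(names(term_freq), term_freq, max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))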
The word “said” was by far the most frequently used word.
Next we will also create a bar plot of the most frequently found words (those appearing more than 1,000 times).
We will then do a density plot of the frequency distribution of words occurring more than 1,000 times.
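A possible ggplot2 sketch for these two frequency plots (assumed, not the report's exact code):

library(ggplot2)
library(slam)

# Terms occurring more than 1,000 times, with their total frequencies
term_freq <- col_sums(dtm)
top_df <- data.frame(term = names(term_freq), freq = as.numeric(term_freq))
top_df <- subset(top_df, freq > 1000)

# Bar plot of the high-frequency terms
ggplot(top_df, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency")

# Density plot of the frequency distribution of those terms
ggplot(top_df, aes(x = freq)) +
  geom_density() +
  labs(x = "Term frequency", y = "Density")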
The next steps are to build an n-gram model that will predict the next word based on 1, 2 or 3 input words. The n-gram method is outlined in Chapter 3 of the Jurafsky and Martin book, and video lectures from the Coursera NLP course are also available on YouTube. If an n-gram is not found in the data, the model will use the “stupid backoff” method. The backoff terminates in the unigram, which is scored as \[\lambda \cdot S(w), \quad \text{where } S(w) = \frac{\mathrm{count}(w)}{N}.\] “Brants et al. (2007) find that a value of 0.4 worked well for lambda.”
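For illustration, a stupid-backoff scorer over keyed data.table count tables might look like the sketch below; the table layouts, column names and toy counts are hypothetical, not the final implementation:

library(data.table)

# Hypothetical n-gram count tables with toy counts, keyed for fast lookup
uni <- data.table(w1 = c("the", "cat", "sat"), count = c(50L, 10L, 5L), key = "w1")
bi  <- data.table(w1 = c("the", "cat"), w2 = c("cat", "sat"), count = c(8L, 4L),
                  key = c("w1", "w2"))
tri <- data.table(w1 = "the", w2 = "cat", w3 = "sat", count = 3L,
                  key = c("w1", "w2", "w3"))
N <- sum(uni$count)   # total number of unigram tokens
lambda <- 0.4         # backoff weight suggested by Brants et al. (2007)

# Count of an exact n-gram, or 0 if it is not in the table
cnt <- function(dt, ...) {
  x <- dt[list(...), count]
  if (is.na(x[1])) 0 else x[1]
}

# Stupid-backoff score of w3 following the context (w1, w2)
sb_score <- function(w1, w2, w3) {
  c123 <- cnt(tri, w1, w2, w3); c12 <- cnt(bi, w1, w2)
  if (c123 > 0 && c12 > 0) return(c123 / c12)        # trigram evidence
  c23 <- cnt(bi, w2, w3); c2 <- cnt(uni, w2)
  if (c23 > 0 && c2 > 0) return(lambda * c23 / c2)   # back off to bigram
  lambda^2 * cnt(uni, w3) / N                        # terminate in unigram
}

sb_score("the", "cat", "sat")   # 3/8 = 0.375, found at the trigram level
sb_score("big", "cat", "sat")   # backs off: 0.4 * 4/10 = 0.16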
So far I have noticed that converting the document term matrix with as.matrix is not a viable strategy, as it cannot handle a large corpus (more than 10,000 documents). Instead it is better to convert the Document Term Matrix to a tidytext format, load it into a data.table and then do the aggregations. The data.table structure is quite efficient, and lookup of words is fast using indexed keys. This strategy can handle large corpora, although I have not been successful in loading all the documents in the data: I tried a million records and it did not finish in 14 hours on a laptop with 12 GB of RAM and a 1.7 GHz CPU! I am also worried that creating such large data tables may reduce the efficiency and responsiveness of the actual Shiny app. Any suggestions on handling this are welcome.
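A sketch of that conversion path (tidytext::tidy on the document-term matrix, then aggregation and keyed lookup in data.table); the variable names are illustrative:

library(tidytext)
library(data.table)

# One row per (document, term, count) triplet instead of a huge dense matrix
tidy_dtm <- tidy(dtm)

word_counts <- as.data.table(tidy_dtm)

# Aggregate term counts across all documents
term_totals <- word_counts[, .(count = sum(count)), by = term]
setkey(term_totals, term)   # indexed key makes single-word lookups fast

term_totals["said"]         # keyed lookup of one word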