Exploratory Analysis of the training dataset

Initial look at the data

We first download and unzip the data from the link given for the Capstone project. The data files are in separate folders for separate languages; we will use the English language files for this project (folder /final/en_US). There are three files: one for blog data, one for news data, and one for Twitter data. The file sizes are as follows:

210160014 Sep 18 21:29 en_US.blogs.txt
205811889 Sep 18 21:29 en_US.news.txt
167105331 Sep 20 20:05 en_US.twitter.txt

Counting the number of lines of text in each file:

$ wc -l en_US.blogs.txt
899288 en_US.blogs.txt

$ wc -l en_US.news.txt
1010242 en_US.news.txt

$ wc -l en_US.twitter.txt
2360148 en_US.twitter.txt

Counting the number of words in each file:

$ wc -w en_US.twitter.txt
30341028 en_US.twitter.txt

$ wc -w en_US.news.txt
34309642 en_US.news.txt

$ wc -w en_US.blogs.txt
37272578 en_US.blogs.txt

Each line of text in these files is a single tweet, blog or news item and therefore corresponds to a single document in the corpus.

The longest line in the twitter file is as follows:

$ cat ./en_US.twitter.txt | awk '{ if ( length > x ) { x = length; y = $0 } } END { print y }'
It's time for you to give me a little bit of lovin'(さぁちょっとはあなたの愛をちょうだい)Baby, hold me tight and do what I tell you!(ベイビー抱きしめて私が言うように!)

It contains many foreign-language words, which will have to be removed in the pre-processing steps.

The above tweet is 1 line containing 21 words and 214 characters.

The longest line in the blogs file has 6,630 words and 40,836 characters.

The longest line in the news file has 1,792 words and 11,385 characters.

Reading in the data

The first step is to read in the data from the files. We sample the lines using binomial draws, keeping roughly 80% of the data read for training, 10% for validation, and 10% for testing. The data was read in using the readLines function and fed into a tm::VCorpus object as a VectorSource. Reading roughly 80% of 10,000 lines from each of the three files gives about 24,000 documents in the training corpus.

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 24030
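
For reference, a minimal sketch of how this read-and-split step might look (the file paths, the seed, and the use of a single uniform draw per line for the three-way split are illustrative assumptions):

# Sketch: read 10,000 lines from each file and split them roughly 80/10/10.
library(tm)

set.seed(1234)                                   # reproducible split
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

lines <- unlist(lapply(files, function(f) {
  con <- file(f, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  readLines(con, n = 10000, skipNul = TRUE)      # first 10,000 lines of each file
}))

# One uniform draw per line stands in for the binomial sampling described above:
# ~80% training, ~10% validation, ~10% testing.
u        <- runif(length(lines))
train    <- lines[u < 0.8]
validate <- lines[u >= 0.8 & u < 0.9]
test     <- lines[u >= 0.9]

corpus <- VCorpus(VectorSource(train))           # one document per line of text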

Pre-processing the data

The data was then pre-processed with the following steps (a code sketch follows the list):

  • Stripping extra whitespace

  • Converting the text to lower case

  • Removing “non-printable” characters and non-ASCII characters

  • Removing English stopwords (common words)

  • Removing punctuation and numbers

  • Removing profane or swear words based on a downloaded list

Stemming and lemmatization were not applied to the text.
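
A sketch of these cleaning steps using tm_map transformations follows (the profanity file name and the exact ordering of the transformations are assumptions):

# Sketch: apply the cleaning transformations listed above to the corpus.
library(tm)

# Drop non-ASCII and non-printable characters
to_printable_ascii <- content_transformer(function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  gsub("[[:cntrl:]]", " ", x)
})

profanity <- readLines("profanity.txt", skipNul = TRUE)   # downloaded profanity list

clean_corpus <- corpus
clean_corpus <- tm_map(clean_corpus, to_printable_ascii)
clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("english"))
clean_corpus <- tm_map(clean_corpus, removeWords, profanity)
clean_corpus <- tm_map(clean_corpus, removePunctuation)
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, stripWhitespace)

Note that the stopword removal is applied before punctuation removal so that contractions such as “don’t” still match the entries in the stopword list.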

Viewing contents of the first document of the “cleaned” corpus:

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 106
## 
##  st louis plant   close   die  old age workers   making cars  since  onset  mass automotive production   s

We will next tokenize the words in each document and create a Document Term Matrix, using the MC_tokenizer from the tm package.

## <<DocumentTermMatrix (documents: 24030, terms: 40442)>>
## Non-/sparse entries: 435174/971386086
## Sparsity           : 100%
## Maximal term length: 91
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    also can get just like new one people said time
##   1320     0   0   0    1    2   0   0      0    0    1
##   15384    2   0   0    4    3   0   1      1    2    0
##   17934    0   0   0    1    1   1   4      0    0    0
##   18822    1   0   0    0    0   0   1      0    0    0
##   20299    0   0   0    0    0   0   0      0    0    0
##   20300    0   0   0    0    0   0   0      0    0    0
##   21678    1   1   0    0    0   1   1      0    0    0
##   2886     0   2   1    1    2   1   0      0    0    1
##   4593     3   0   0    0    0   4   0      0    0    0
##   7608     1   0   0    0    0   0   1      0    1    0
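
For reference, a sketch of how the Document Term Matrix above might be built (the control options shown are assumptions):

# Sketch: tokenize with tm's MC_tokenizer and build the Document Term Matrix.
library(tm)

dtm <- DocumentTermMatrix(clean_corpus,
                          control = list(tokenize = MC_tokenizer))
inspect(dtm)                          # prints a summary and sample like the one above
findFreqTerms(dtm, lowfreq = 1000)    # terms appearing at least 1,000 times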

And a look at some of the most frequently used terms (those occurring more than 1,000 times):

##  [1] "also"   "can"    "first"  "get"    "just"   "last"   "like"  
##  [8] "new"    "now"    "one"    "people" "said"   "time"   "two"   
## [15] "year"   "years"

Creating some plots of the data

To get a better idea of what is in the data we will first create a wordcloud. This will show the top 50 most frequent words in the corpus.

The word “said” was by far the most frequently used word.

Next we will also create a bar plot of the most frequent words (those occurring more than 1,000 times).

We will then do a density plot of the frequency distribution of words occurring more than 1,000 times.
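
A sketch of how these three plots might be produced from the DTM (the slam::col_sums aggregation and the plotting parameters are assumptions):

# Sketch: aggregate term frequencies without as.matrix() and draw the plots.
library(wordcloud)

term_freq <- sort(slam::col_sums(dtm), decreasing = TRUE)   # total count of each term

# Word cloud of the 50 most frequent words
wordcloud(names(term_freq), term_freq, max.words = 50, random.order = FALSE)

# Bar plot of words occurring more than 1,000 times
freq1000 <- term_freq[term_freq > 1000]
barplot(freq1000, las = 2, main = "Words occurring more than 1,000 times")

# Density plot of the frequency distribution of those words
plot(density(freq1000), main = "Frequency of words occurring more than 1,000 times")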

Next steps

The next steps are to build an n-gram model that will predict the next word based on 1, 2, or 3 input words. The n-gram modelling method is outlined in chapter 3 of the Jurafsky and Martin book, and video lectures from the Coursera NLP course are also available on YouTube. If an n-gram is not found in the data, the model will use the “stupid backoff” method. The backoff terminates at the unigram, which has score \(S(w) = \mathrm{count}(w)/N\), with each backoff step weighted by \(\lambda\). “Brants et al. (2007) find that a value of 0.4 worked well for lambda.”
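
For reference, the full stupid backoff recursion (as given in chapter 3 of Jurafsky and Martin) is:

\[
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\[1ex]
\lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\]

with the recursion terminating at the unigram score \(S(w) = \mathrm{count}(w)/N\), where \(N\) is the total number of word tokens and \(\lambda = 0.4\) following Brants et al. (2007).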

Some findings

So far I have noticed that converting the Document Term Matrix with as.matrix() is not a viable strategy, as it cannot handle a large corpus (more than 10,000 documents). Instead it is better to convert the Document Term Matrix to a tidytext format, load it into a data.table, and then do the aggregations there. The data.table structure is quite efficient, and lookup of words is fast using indexed keys. This strategy can handle large corpora, although I have not yet been successful in loading all the documents in the data: I tried a million records and it did not finish in 14 hours on a 1.7 GHz laptop with 12 GB of RAM! I am also worried that creating such large data tables may reduce efficiency and response times in the actual Shiny app. Any suggestions on handling this are welcome.
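
For reference, a sketch of that tidytext/data.table strategy (the column names follow what tidytext's tidy() returns for a DocumentTermMatrix; the aggregation shown is an illustration):

# Sketch: convert the sparse DTM to a long table and aggregate in data.table.
library(tidytext)
library(data.table)

tidy_dtm <- tidy(dtm)                        # one row per (document, term) pair:
                                             # columns document, term, count
dt <- as.data.table(tidy_dtm)
setkey(dt, term)                             # indexed key makes word lookup fast

# Total frequency per term, aggregated inside data.table
term_totals <- dt[, .(freq = sum(count)), by = term][order(-freq)]

# Fast keyed lookup of a single word
dt["said", sum(count)]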