The purpose of this document is to show a brief exploratory text analysis of the documents provided for the Data Science Capstone project. The project’s goal is to build a predictive text model. The documents can be downloaded from the Course’s page here, but they originally come from the HC Corpora site.
Initially, the plan was to use R packages for this exploratory analysis, but I found them too slow to work with and ran into memory consumption issues (even with sampling). Instead, the Linux command line tools proved to be enough for the job, along with GNU parallel to speed up processing when needed and The SRI Language Modeling Toolkit to get the n-gram counts. More information may be found at the given links. R will still be used occasionally, for example for plotting.
Let’s look at the basic properties of these documents:
$ wc -cmlLw en_US.blogs.txt en_US.news.txt en_US.twitter.txt
| lines | words | characters | bytes | longest line | file |
|---|---|---|---|---|---|
| 899288 | 37334114 | 208623081 | 210160014 | 40833 | en_US.blogs.txt |
| 1010242 | 34365936 | 205243643 | 205811889 | 11384 | en_US.news.txt |
| 2360148 | 30359804 | 166816544 | 167105338 | 173 | en_US.twitter.txt |
| 4269678 | 102059854 | 580683268 | 583077241 | 40833 | total |
Even though the number of lines differs considerably among the documents, the word counts show they are much closer in size. Having more bytes than characters tells us that some characters are encoded with more than one byte, such as non-Latin or non-English alphabet characters, or symbols like emoticons.
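One quick way to confirm this is to count the lines that contain non-ASCII (multi-byte) characters, for example with GNU grep’s Perl-regex mode (counts omitted here):
$ grep -cP '[^\x00-\x7F]' en_US.blogs.txt en_US.news.txt en_US.twitter.txt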
Reviewing the first lines of each document:
$ head -3 en_US.blogs.txt en_US.news.txt en_US.twitter.txt
==> en_US.blogs.txt <==
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
==> en_US.news.txt <==
He wasn't home alone, apparently.
The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
==> en_US.twitter.txt <==
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
they've decided its more fun if I don't.
By looking at the text, we immediately see some of the clean-up challenges we are going to face: abbreviations (e.g. Mr., St.), English contractions (e.g. wasn’t, you’ll), acronyms (e.g. Btw) that may also appear with dots like B.t.w., identifying sentences within a paragraph, and so on.
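As a rough first look at one of these issues, the most common contraction forms can be surveyed straight from the command line (illustrative only; output not shown):
$ grep -ioE "[a-z]+'(s|t|re|ll|ve|d|m)\b" en_US.blogs.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head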
Before cleaning the data, let’s count the distinct words to get an initial corpus, sorted by count in descending order so the most frequent words come first:
$ tr 'A-Z' 'a-z' < en_US.blogs.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.blogs.corpus.txt
$ tr 'A-Z' 'a-z' < en_US.news.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.news.corpus.txt
$ tr 'A-Z' 'a-z' < en_US.twitter.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.twitter.corpus.txt
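As mentioned earlier, GNU parallel can run this same pipeline on all three files concurrently; a possible invocation (the {.} placeholder strips the .txt extension):
$ parallel "tr 'A-Z' 'a-z' < {} | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > {.}.corpus.txt" ::: en_US.blogs.txt en_US.news.txt en_US.twitter.txt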
$ wc -l en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
| lines | file |
|---|---|
| 253042 | en_US.blogs.corpus.txt |
| 212227 | en_US.news.corpus.txt |
| 302652 | en_US.twitter.corpus.txt |
| 767921 | total |
Top words:
$ head -n 10 en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
==> en_US.blogs.corpus.txt <==
1860686 the
1094859 and
1069565 to
906917 i
904594 a
876848 of
598782 in
485385 it
484198 that
432768 is
==> en_US.news.corpus.txt <==
1975163 the
906198 to
894899 a
889612 and
774525 of
679242 in
457198 s
371864 that
353967 for
286646 it
==> en_US.twitter.corpus.txt <==
937970 the
918858 i
788952 to
617520 a
601330 you
438745 and
385492 for
383745 it
380805 in
359757 of
The lists are quite similar; the main differences come from the nature of each source. The news corpus lacks I (i) among its top words because news articles rarely use the first person, while in blogs and Twitter we usually give our own opinion about a subject. The stray s in the news list is an artifact of splitting contractions and possessives (e.g. it’s, year’s) at the apostrophe.
What about the words at the bottom of the list, i.e. the least frequent ones:
$ tail -n 5 en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
==> en_US.blogs.corpus.txt <==
1 aaaaaaaaaaaaaaaaaaaaaaaarrrrrrrrrrrrrrrrggggggggghhhhhhhhhh
1 aaaaaaaaaaaaaaaaa
1 aaaaaaaaaaaaaaa
1 aaaaaaaaaaaaa
1 aaaaaaaaa
==> en_US.news.corpus.txt <==
1 aaaaaand
1 aaaaaahhhh
1 aaaaaahhh
1 aaaaaahh
1 aaaaa
==> en_US.twitter.corpus.txt <==
1 aaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhh
1 aaaaaaaaa
This is clearly a problem. Let’s see how many low-frequency words we have, taking fewer than 100 occurrences as an initial threshold for a low-frequency word.
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.blogs.corpus.txt | wc -l
236933
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.news.corpus.txt | wc -l
195640
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.twitter.corpus.txt | wc -l
290412
Most of the entries in our initial corpus are low-frequency words, particularly from Twitter (where people write more freely). These should certainly not be included in our predictive text model.
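To put this in perspective, a quick awk pass over a corpus file can report what share of all tokens these rare words actually account for (a sketch; figures not reproduced here):
$ awk '{ total += $1; if ($1 < 100) low += $1 } END { printf "words seen <100 times account for %.1f%% of all tokens\n", 100 * low / total }' en_US.blogs.corpus.txt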
Now let’s look at the distribution of the most frequent words in the documents, say those with more than 10,000 occurrences, to get some visibility into the words with the highest counts.
# Load the word-count tables produced above (V1 = count, V2 = word)
bc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.blogs.corpus.txt")
nc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.news.corpus.txt")
tc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.twitter.corpus.txt")
# Histograms of the counts for words appearing at least 10,000 times
par(mfrow=c(1,3))
hist(bc[bc$V1>=10000,"V1"]/10000, main="Hist of frequent blog words", xlab="Word frequency in 10,000s")
hist(nc[nc$V1>=10000,"V1"]/10000, main="Hist of frequent news words", xlab="Word frequency in 10,000s")
hist(tc[tc$V1>=10000,"V1"]/10000, main="Hist of frequent twitter words", xlab="Word frequency in 10,000s")
We see that English relies on a relatively small set of words for the majority of day-to-day expression. This means there is a small set of words that should be kept more accessible than the rest, which could be important for faster lookups in our predictive model.
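Since each corpus file is already sorted by count in descending order, the same idea can be checked by estimating how many of the most frequent word types are needed to cover, say, 90% of all tokens (a sketch; output omitted):
$ awk '{ total += $1; count[NR] = $1 } END { cum = 0; for (i = 1; i <= NR; i++) { cum += count[i]; if (cum >= 0.9 * total) { print i, "word types cover 90% of all tokens"; break } } }' en_US.blogs.corpus.txt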
Let’s take a closer look at one of the potential issues we will have to clean up in the documents.
Acronyms found (with dots):
$ egrep -io '([a-z][\.]){2,}' en_US.blogs.txt | wc -l
14067
$ egrep -io '([a-z][\.]){2,}' en_US.news.txt | wc -l
72628
$ egrep -io '([a-z][\.]){2,}' en_US.twitter.txt | wc -l
11514
These can be replaced by the same acronym without dots, which will be important when splitting each line into sentences. The same applies to abbreviations.
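One possible way to do this replacement (not necessarily the exact clean-up used later) is a perl one-liner that strips the dots from every dotted acronym it finds, reusing the pattern from the search above and writing to a hypothetical output file:
$ perl -pe 's/\b((?:[A-Za-z]\.){2,})/ (my $a = $1) =~ s!\.!!g; $a /ge' en_US.news.txt > en_US.news.nodots.txt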
The source documents are quite messy. In order to build a proper predictive model, an initial clean-up will be made to address the issues identified above (dotted acronyms, abbreviations, contractions, and so on).
Later, only alphabetic characters will be kept and n-grams up to order 4 will be built to create the backoff model. This is a command to get n-grams of order one to three using The SRI Language Modeling Toolkit:
$ ngram-count -order 3 -no-sos -no-eos -sort -text en_US.twitter.txt -write1 en_US.twitter.1gram.txt -write2 en_US.twitter.2gram.txt -write3 en_US.twitter.3gram.txt
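The -writeN files use SRILM’s plain “ngram<TAB>count” format, so the most frequent bigrams, for instance, can be peeked at with a simple sort (output not shown):
$ sort -t$'\t' -k2,2nr en_US.twitter.2gram.txt | head -5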
Following the clean-up, and after building n-grams up to order four and a backoff model, the resulting predictions were tried on the first NLP quiz. The model is not perfect, but it works to some extent.
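For illustration, here is a minimal sketch of how a “stupid backoff” style lookup could work directly on the count files produced above; the predict helper below is hypothetical and only meant to show the backoff idea (the real model should precompute and index these tables):
# Given the last two words, prefer the most frequent trigram continuation,
# then back off to bigrams, then to the single most frequent unigram.
predict() {
  local w1=$1 w2=$2 hit
  hit=$(grep "^$w1 $w2 " en_US.twitter.3gram.txt | sort -t$'\t' -k2,2nr | head -1 | awk '{print $3}')
  [ -z "$hit" ] && hit=$(grep "^$w2 " en_US.twitter.2gram.txt | sort -t$'\t' -k2,2nr | head -1 | awk '{print $2}')
  [ -z "$hit" ] && hit=$(sort -t$'\t' -k2,2nr en_US.twitter.1gram.txt | head -1 | awk '{print $1}')
  echo "$hit"
}
# Example usage: suggest the word most likely to follow "thanks for"
predict thanks for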