The purpose of this document is to show a brief exploratory text analysis of the documents provided for the Data Science Capstone project. The project’s goal is to build a predictive text model. The documents can be downloaded from the Course’s page here, but they originally come from the HC Corpora site.
Initially, the plan was to use R packages for this exploratory analysis, but I found them too slow to work with and ran into memory consumption issues (even with sampling). Instead, the Linux command line tools proved to be enough for the job, along with GNU parallel to speed up processing when needed and The SRI Language Modeling Toolkit to get the n-gram counts. More information may be found at the given links. R will still be used occasionally, for example for plotting.
Let’s look at the basic properties of these documents:
$ wc -cmlLw en_US.blogs.txt en_US.news.txt en_US.twitter.txt
| lines | words | characters | bytes | longest line | file |
|---|---|---|---|---|---|
| 899288 | 37334114 | 208623081 | 210160014 | 40833 | en_US.blogs.txt |
| 1010242 | 34365936 | 205243643 | 205811889 | 11384 | en_US.news.txt |
| 2360148 | 30359804 | 166816544 | 167105338 | 173 | en_US.twitter.txt |
| 4269678 | 102059854 | 580683268 | 583077241 | 40833 | total |
Even though the number of lines differs considerably among the documents, the word counts show they are much closer in size. Having more bytes than characters tells us that some characters are encoded with more than one byte, such as non-Latin or non-English alphabet characters, or symbols like emoticons.
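One quick way to confirm this is to count the lines that contain non-ASCII (multi-byte) characters, for example with GNU grep’s Perl-regex mode (counts omitted here):
$ grep -cP '[^\x00-\x7F]' en_US.blogs.txt en_US.news.txt en_US.twitter.txt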
Reviewing the first lines of each document:
$ head -3 en_US.blogs.txt en_US.news.txt en_US.twitter.txt
==> en_US.blogs.txt <==
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
==> en_US.news.txt <==
He wasn't home alone, apparently.
The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
==> en_US.twitter.txt <==
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
they've decided its more fun if I don't.
By looking at the text, we immediately see some of the clean-up challenges we are going to face: abbreviations (e.g. Mr., St.), English contractions (e.g. wasn’t, you’ll), acronyms (e.g. Btw) that may also appear with dots like B.t.w., identifying sentences within a paragraph, and so on.
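As a rough first look at one of these issues, the most common contraction forms can be surveyed straight from the command line (illustrative only; output not shown):
$ grep -ioE "[a-z]+'(s|t|re|ll|ve|d|m)\b" en_US.blogs.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head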
Before cleaning the data, let’s count the distinct words to get an initial corpus, sorted by count in descending order so the most frequent words come first:
$ tr 'A-Z' 'a-z' < en_US.blogs.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.blogs.corpus.txt
$ tr 'A-Z' 'a-z' < en_US.news.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.news.corpus.txt
$ tr 'A-Z' 'a-z' < en_US.twitter.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > en_US.twitter.corpus.txt
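As mentioned earlier, GNU parallel can run this same pipeline on all three files concurrently; a possible invocation (the {.} placeholder strips the .txt extension):
$ parallel "tr 'A-Z' 'a-z' < {} | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r > {.}.corpus.txt" ::: en_US.blogs.txt en_US.news.txt en_US.twitter.txt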
$ wc -l en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
| lines | file |
|---|---|
| 253042 | en_US.blogs.corpus.txt |
| 212227 | en_US.news.corpus.txt |
| 302652 | en_US.twitter.corpus.txt |
| 767921 | total |
Top words:
$ head -n 10 en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
==> en_US.blogs.corpus.txt <==
1860686 the
1094859 and
1069565 to
906917 i
904594 a
876848 of
598782 in
485385 it
484198 that
432768 is
==> en_US.news.corpus.txt <==
1975163 the
906198 to
894899 a
889612 and
774525 of
679242 in
457198 s
371864 that
353967 for
286646 it
==> en_US.twitter.corpus.txt <==
937970 the
918858 i
788952 to
617520 a
601330 you
438745 and
385492 for
383745 it
380805 in
359757 of
The lists are quite similar; the main differences come from the nature of each source. The news corpus lacks I (i) among its top words because news articles rarely use the first person, while in blogs and Twitter we usually give our own opinion about a subject. The stray s in the news list is an artifact of splitting contractions and possessives (e.g. it’s, year’s) at the apostrophe.
What about the words at the bottom of the list, i.e. the least frequent ones:
$ tail -n 5 en_US.blogs.corpus.txt en_US.news.corpus.txt en_US.twitter.corpus.txt
==> en_US.blogs.corpus.txt <==
1 aaaaaaaaaaaaaaaaaaaaaaaarrrrrrrrrrrrrrrrggggggggghhhhhhhhhh
1 aaaaaaaaaaaaaaaaa
1 aaaaaaaaaaaaaaa
1 aaaaaaaaaaaaa
1 aaaaaaaaa
==> en_US.news.corpus.txt <==
1 aaaaaand
1 aaaaaahhhh
1 aaaaaahhh
1 aaaaaahh
1 aaaaa
==> en_US.twitter.corpus.txt <==
1 aaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhh
1 aaaaaaaaa
This is clearly a problem. Let’s see how many low-frequency words we have, taking fewer than 100 occurrences as an initial threshold for a low-frequency word.
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.blogs.corpus.txt | wc -l
236933
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.news.corpus.txt | wc -l
195640
$ egrep '^[[:space:]]*([0-9]){1,2}[[:space:]]' en_US.twitter.corpus.txt | wc -l
290412
Most of the entries in our initial corpus are low-frequency words, particularly from Twitter (where people write more freely). These should certainly not be included in our predictive text model.
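To put this in perspective, a quick awk pass over a corpus file can report what share of all tokens these rare words actually account for (a sketch; figures not reproduced here):
$ awk '{ total += $1; if ($1 < 100) low += $1 } END { printf "words seen <100 times account for %.1f%% of all tokens\n", 100 * low / total }' en_US.blogs.corpus.txt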
Now let’s look at the distribution of the most frequent words in the documents, say those with more than 10,000 occurrences, to get some visibility into the words with the highest counts.
# Load the word-count tables produced above (V1 = count, V2 = word)
bc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.blogs.corpus.txt")
nc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.news.corpus.txt")
tc <- read.table("/media/luisluevano/Data/DS/10CP/final/en_US/en_US.twitter.corpus.txt")
# Histograms of the counts for words appearing at least 10,000 times
par(mfrow=c(1,3))
hist(bc[bc$V1>=10000,"V1"]/10000, main="Hist of frequent blog words", xlab="Word frequency in 10,000s")
hist(nc[nc$V1>=10000,"V1"]/10000, main="Hist of frequent news words", xlab="Word frequency in 10,000s")
hist(tc[tc$V1>=10000,"V1"]/10000, main="Hist of frequent twitter words", xlab="Word frequency in 10,000s")
We see that English relies on a relatively small set of words for the majority of day-to-day expression. This means there is a small set of words that should be kept more accessible than the rest, which could be important for faster lookups in our predictive model.
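Since each corpus file is already sorted by count in descending order, the same idea can be checked by estimating how many of the most frequent word types are needed to cover, say, 90% of all tokens (a sketch; output omitted):
$ awk '{ total += $1; count[NR] = $1 } END { cum = 0; for (i = 1; i <= NR; i++) { cum += count[i]; if (cum >= 0.9 * total) { print i, "word types cover 90% of all tokens"; break } } }' en_US.blogs.corpus.txt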
Let’s take a closer look at one of the potential issues we will have to clean up in the documents.
Acronyms found (with dots):
$ egrep -io '([a-z][\.]){2,}' en_US.blogs.txt | wc -l
14067
$ egrep -io '([a-z][\.]){2,}' en_US.news.txt | wc -l
72628
$ egrep -io '([a-z][\.]){2,}' en_US.twitter.txt | wc -l
11514
These can be replaced by the same acronym without dots, which will be important when splitting each line into sentences. The same applies to abbreviations.
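One possible way to do this replacement (not necessarily the exact clean-up used later) is a perl one-liner that strips the dots from every dotted acronym it finds, reusing the pattern from the search above and writing to a hypothetical output file:
$ perl -pe 's/\b((?:[A-Za-z]\.){2,})/ (my $a = $1) =~ s!\.!!g; $a /ge' en_US.news.txt > en_US.news.nodots.txt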
The source documents are quite messy. In order to build a proper predictive model, an initial clean-up will be made to address the issues identified above (dotted acronyms, abbreviations, contractions, and so on).
Later, only alphabetic characters will be kept and n-grams up to order 4 will be built to create the backoff model. This is a command to get n-grams of order one to three using The SRI Language Modeling Toolkit:
$ ngram-count -order 3 -no-sos -no-eos -sort -text en_US.twitter.txt -write1 en_US.twitter.1gram.txt -write2 en_US.twitter.2gram.txt -write3 en_US.twitter.3gram.txt
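The -writeN files use SRILM’s plain “ngram<TAB>count” format, so the most frequent bigrams, for instance, can be peeked at with a simple sort (output not shown):
$ sort -t$'\t' -k2,2nr en_US.twitter.2gram.txt | head -5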
Following the clean-up, and after building n-grams up to order four and a backoff model, the resulting predictions were tried on the first NLP quiz. The model is not perfect, but it works to some extent.
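For illustration, here is a minimal sketch of how a “stupid backoff” style lookup could work directly on the count files produced above; the predict helper below is hypothetical and only meant to show the backoff idea (the real model should precompute and index these tables):
# Given the last two words, prefer the most frequent trigram continuation,
# then back off to bigrams, then to the single most frequent unigram.
predict() {
  local w1=$1 w2=$2 hit
  hit=$(grep "^$w1 $w2 " en_US.twitter.3gram.txt | sort -t$'\t' -k2,2nr | head -1 | awk '{print $3}')
  [ -z "$hit" ] && hit=$(grep "^$w2 " en_US.twitter.2gram.txt | sort -t$'\t' -k2,2nr | head -1 | awk '{print $2}')
  [ -z "$hit" ] && hit=$(sort -t$'\t' -k2,2nr en_US.twitter.1gram.txt | head -1 | awk '{print $1}')
  echo "$hit"
}
# Example usage: suggest the word most likely to follow "thanks for"
predict thanks for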