Reproducible Work

Just as in any science, reproducibility is important for many reasons, and most importantly for us it means this analysis can be rerun and verified from the code alone.

To see more of the code used to create this report, please see the GitHub repo.

Downloading the Data

The data comes from a corpus called HC Corpora. You can see the readme file for more information about the corpora.
To keep the original data separate from what we make of it, we will create a directory for each.
If you would like to see the details of this portion, please visit the linked GitHub page.
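A minimal sketch of that setup is below; the download URL and exact folder layout live in the repo, so dataUrl here is only a placeholder.

dataUrl <- "<corpus zip URL>"   # placeholder; the real link is in the GitHub repo
if (!dir.exists("data/final")) dir.create("data/final", recursive = TRUE)   # original data
if (!dir.exists("data/Rdata")) dir.create("data/Rdata", recursive = TRUE)   # derived data
download.file(dataUrl, "data/corpus.zip")
unzip("data/corpus.zip", exdir = "data")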


Quantifying the Data

We can capture summary data by extracting basic information from the files.
Using the R command file.info(filePath)$size we can retrieve each file's size.
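A small sketch, assuming the files sit under data/final/en_US/ as in the wc output below:

files <- c("data/final/en_US/en_US.twitter.txt",
           "data/final/en_US/en_US.news.txt",
           "data/final/en_US/en_US.blogs.txt")
sizesMB <- file.info(files)$size / 1024^2   # convert bytes to megabytes
round(sizesMB, 2)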

Using the wc utility with the options -l, -w, and -m, we capture the line, word, and character counts respectively:

wc -lwm data/final/en_US/en_US.twitter.txt
##  2360148 30373603 166816544 data/final/en_US/en_US.twitter.txt

Employing the same wc utility with the -lwm options on the rest of the files, we get:

##  1010242 34372530 205243643 data/final/en_US/en_US.news.txt
##   899288 37334147 208623081 data/final/en_US/en_US.blogs.txt

We have highlighted a basic summary of the three files.
Now we can arrange this data to obtain some insight into the differences that might identify these sources, as well as some similarities which may be underlying facets of the language they are written in.
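A minimal sketch of how such a comparison table can be assembled in R, using the counts reported above (the pairwise ratio rows follow the same idea):

counts <- data.frame(
  Lines      = c(2360148, 1010242, 899288),
  Words      = c(30373603, 34372530, 37334147),
  Characters = c(166816544, 205243643, 208623081),
  row.names  = c("twitter", "news", "blogs")
)
counts$`Words per Line` <- round(counts$Words / counts$Lines, 2)
counts$`Char per Word`  <- round(counts$Characters / counts$Words, 2)
round(counts["twitter", ] / counts["news", ], 2)   # e.g. the twitter/news ratio row
counts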

##                   Lines       Words   Characters Words per Line Char per Word
## twitter         2360148    30373603    166816544          12.87          5.49
## news            1010242    34372530    205243643          34.02          5.97
## blogs            899288    37334147    208623081          41.52          5.59
## twitter/news        2.3        0.88         0.81           0.38          0.92
## twitter/blogs       2.6        0.81          0.8           0.31          0.98
## news/blogs          1.1        0.92         0.98           0.82          1.07

The table above highlights the similarities and differences in the corpora so that we might begin to glean some information about the files, their contents, and even the language itself.


Analysis

Twitter and News Files
While the twitter file has more than twice as many lines as the news file, the news file contains more words and characters.
The blogs and news files have similar line, word, and character counts, with the news file containing 92% as many words as the blogs file.

The contrast between the twitter and news files makes sense, since twitter is composed of tweets: short lines of at most 140 characters. This could be considered an identifying trait of a twitter corpus.

Blog and News Files

While the blogs and news files did not deviate from each other as much as they did from the twitter file’s average of 12.87 words per line, the news file still had 18% fewer words per line than the blogs file. This was the only one of these factors that differed by more than 10% between the news and blogs files.

It makes sense that the blogs and news files have more in common with each other than the twitter file since both blogs and the news are made up of articles and not short phrases.
The largest variance between the two being words per line also makes sense. Blogs are written in more of a narrative voice, often producing run-on sentences similar to everyday speech, while news articles are written to be concise and to the point.


Conclusion on file traits

  • Twitter has 2.3x as many lines as news and 2.6x as many as blogs, while still containing at least 81% as many words
    • This is indicative of the nature of twitter
      • twitter is composed of tweets: lines of 140 characters or fewer
    • The news and blogs files, however, consist of articles
      • Articles follow standard grammatical structure to explain a topic
  • All files have 5.49-5.97 characters per word on average
    • This makes sense as all the text files are in the same language: English.
      • Each word is being pulled from the same ultimate repository: the complete list of English words
        • One could even say that at such large volumes the law of large numbers comes into play: the average length of any large set of words drawn from the same pool, the collection of English words, converges on the average length of all English words (weighted by usage or frequency, of course)
  • Blogs have more words per line than news articles
    • Indicative of the nature and purpose of both blogs and news articles
      • Blogs tend to follow a narrative and are written closer to common speech, with run-on, compound, complex, and longer sentences
      • The purpose of news is to convey information in an easy to understand and straightforward manner, leading to simpler and shorter sentences

Further Analysis

Here we use a profanity list found at bannedwordlist.com:

badUrl <- c("http://www.bannedwordlist.com/lists/swearWords.txt")
download.file(badUrl, "./data/Rdata/badWords.txt")
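
A small sketch of loading that list for later filtering; this assumes the file is newline- or comma-separated, so both cases are handled:

badWords <- readLines("./data/Rdata/badWords.txt", warn = FALSE)
badWords <- unique(trimws(unlist(strsplit(badWords, ","))))   # split in case the list is comma-separated
head(badWords)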

Tokenization

Tokenization is the task of chopping up a body of text, or corpus, into pieces we call tokens.
For example: “We went to the mall and had a great time, then we went back home”
Could be tokenized into: |We| |went| |to| |the| |mall| |and| |had| |a| |great| |time| |then| |we| |went| |back| |home|
Notice that the comma was removed from the text.
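As a small illustration (not the exact method used on the full corpora, which is described below), the same sentence can be tokenized in R with strsplit:

sentence <- "We went to the mall and had a great time, then we went back home"
tokens <- unlist(strsplit(tolower(sentence), "[^a-z]+"))   # split on anything that is not a letter
tokens <- tokens[tokens != ""]                             # drop empties left by punctuation
table(tokens)                                              # |we| and |went| each appear twice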

  • There is no single way to tokenize a corpus; in fact there are many methods, all with benefits and weaknesses, which can be employed at the discretion of the analyst. How a text is tokenized often depends on the goal of the analysis.
  • With a goal of creating a predictive text model, word frequency is one of the most important features of our tokenization process.
  • In the previous example, the token |went| would have an undisputed frequency value of 2. |we|, however, could be given a value of 2 if we ignore capitalization, or |we| and |We| could be tokenized separately.
  • Since the tokens will be used in an n-gram model, we can assume some input text already exists, which removes the need to predict capitalized sentence-initial words.
    • Additionally, capitalized proper nouns should be rare enough in this case to be ignored.

In any case, a robust model will be created which can be refined over time if a larger scope is adopted. Acronyms, slang, purposefully misspelled words, and hashtags in the twitter file come to mind as immediate candidates for later attention.

Using the Unix utilities tr, sort, uniq, and cat, along with the pipe operator |, we will take our downloaded raw text files and tokenize them by word.

Tokenization Steps:

  1. Use tr to change all capital letters to lowercase letters
    • tr 'A-Z' 'a-z' < filePath
  2. Put each word on its own line by replacing every run of non-letter characters with a newline \n, using tr with the options s (squeeze repeats) and c (complement the character set)
    • tr -sc 'A-Za-z' '\n'
  3. Sort the words alphabetically so that identical words end up on adjacent lines
    • sort
  4. Use uniq to collapse repeated words while keeping a count of each with option c, and disregarding capitalization with option i
    • uniq -ci
  5. Sort again with sort using the option n to order by count and r to reverse order (so that highest frequency words are on top of the list)
    • sort -nr
  6. Finally, we write the new data into a text file using the cat utility
    • cat > destinationPath.txt

We can chain all of these utilities and their options together using the pipe operator |.
The pipe hands whatever one command produces to the next command as its input: do this to make that | do something with that.
Our final tokenization script will look like this:
tr 'A-Z' 'a-z' < filePath | tr -sc 'A-Za-z' '\n' | sort | uniq -ci | sort -nr | cat > destinationPath
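
As a sketch, the pipeline can be run from R with system(); the output file name under data/Rdata/ is an assumed choice for where derived data is kept.

system(paste(
  "tr 'A-Z' 'a-z' < data/final/en_US/en_US.twitter.txt",
  "| tr -sc 'A-Za-z' '\\n'",
  "| sort | uniq -ci | sort -nr",
  "| cat > data/Rdata/twitterTokens.txt"   # assumed output location
))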

Number of Tokens (Unique Words)
Now that we are working with much smaller files of 4.18 MB, 2.77 MB, and 3.33 MB for the twitter, news, and blogs token lists respectively, we can use the R function read.table(filePath) to create a data frame of the tokens for further inspection.
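For example, assuming the pipeline above wrote the twitter tokens to data/Rdata/twitterTokens.txt, the counts can be loaded like this:

twitterTokens <- read.table("data/Rdata/twitterTokens.txt",
                            col.names = c("count", "word"),
                            stringsAsFactors = FALSE)
nrow(twitterTokens)   # number of unique words in the twitter file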

  • Using nrow(data.frame) we find:
    • The twitter file has 302652 unique words
    • The news file has 212227 unique words
    • The blogs file has 253042 unique words

Most Common Unigrams

A unigram is a single-word phrase, so in this case we are simply listing the highest-frequency words from each corpus. Next we will create statistics from n-grams, or multi-word phrases, to help in building our prediction model.
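A quick sketch of one such exploratory bar chart, assuming the twitter token data frame from the previous step (count and word columns):

top20 <- head(twitterTokens[order(-twitterTokens$count), ], 20)   # 20 most frequent unigrams
barplot(top20$count, names.arg = top20$word, las = 2,
        main = "Most common twitter unigrams", ylab = "Frequency")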

Quickly looking over these exploratory bar charts we can see a few important things:

  • There needs to be filtering of certain letters from the token set
    • most likely from contractions like |wasn’t| -> |wasn| |t|
  • A vast majority of the corpora is composed of stop words such as the, and, to, a, of, etc.
    • This is difficult because while these statistically skew our data, they are obviously the most commonly typed words and thus important for predictive text modeling
      • n-gram modeling should significantly help in identifying the distribution of these words
    • The fact that they are universal across all three files is a valuable insight into what some of the most commonly predicted words will be
    • This likely also says something about the English language itself

Notes

  • Each Unix or bash command can be used in the R Console by typing the command in the system() function
    • system("Shell.Command")
    • example using wc utility to count lines of a file located at filePath
      • system("wc -l filePath") will return the number of lines along with the file path
  • Unix and many other shell commands can also be run in their own code chunks, as described below
  • To pass variables from R to bash use Sys.setenv(Var = "value")
    • Then you can use the variable Var in a bash code chunk
    • example extending the previous one (see the sketch after this list):
      • In an R code chunk write Sys.setenv(path = "./data/final/en_US/en_US.twitter.txt")
      • Then use the path variable in a bash code chunk: wc -l $path
        • this is equivalent to writing wc -l ./data/final/en_US/en_US.twitter.txt in a bash code chunk
      • makes code and reports easier to read and understand, particularly in a knitr Rmd file
      • improves access to variables and outputs across different environments, and gives a more flexible workflow for reusing code
  • Knitr can handle almost any language which can be called by command line
    • Examples include, but are not limited to: Python, Awk, Ruby, Haskell, Bash, Perl, Graphviz, TikZ, SAS, Scala, and CoffeeScript
    • To tell knitr which language is in a code chunk, set the chunk's engine (the available engines are stored in the knit_engines object)
      • e.g. {r use-bash, engine = 'bash'}
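
A minimal sketch of the Sys.setenv pattern as it would appear in an Rmd file (chunk names here are arbitrary):

{r set-path}
Sys.setenv(path = "./data/final/en_US/en_US.twitter.txt")

{r count-lines, engine = 'bash'}
wc -l $path

The bash chunk inherits the environment of the R session, so $path expands to the file path set above.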

Next Steps

TODO

  • Filter unwanted tokens
    • from bad word list
    • letters
    • consider what to do with stop words
  • Finish basic n-gram model
    • Make it able to handle unseen n-grams
    • be able to efficiently handle data and storage issues
  • Build a Predictive Model on previous data modeling
    • Evaluate the model for efficiency and accuracy
  • Explore new models and data to improve predictive model
  • Create Shiny App
    • Accepts n-gram and predicts the next word
  • Slide Deck
    • Something that you could pitch to an investor
  • Touch Ups
    • use Sys.setenv() to export variables, like file paths, to bash code chunks