Reproducibility
Just as in any science, reproducibility is important for many reasons, and most importantly for us: to see more of the code used to create this report, please see the GitHub repo.
The data is from a corpus called HC Corpora. You can see the readme file for more information about the corpora
To keep the data organized, we will create directories for the original data separate from what we make of it.
If you would like to see the details of this portion please visit the linked GitHub page
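For reference, a minimal sketch of that setup, assuming the directory layout used throughout this report (data/final/en_US for the raw files and data/Rdata for derived files). The download URL and zip name here are placeholders, not the actual links used:
# Create a directory for the original data and one for what we derive from it
if (!dir.exists("./data/final/en_US")) dir.create("./data/final/en_US", recursive = TRUE)
if (!dir.exists("./data/Rdata")) dir.create("./data/Rdata", recursive = TRUE)
# Placeholder URL: substitute the actual corpus download link
corpusUrl <- "https://example.com/corpus.zip"
zipPath <- "./data/corpus.zip"
if (!file.exists(zipPath)) {
  download.file(corpusUrl, destfile = zipPath)
  unzip(zipPath, exdir = "./data")
}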
We can capture summary data by extracting basic information from the files.
Using the R command file.info(filePath)$size for file size, we find:
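A minimal sketch of how those sizes can be gathered in R (the paths match those used with wc below; reporting the result in megabytes is an assumption about how the figures were summarized):
filePaths <- c(twitter = "data/final/en_US/en_US.twitter.txt",
               news    = "data/final/en_US/en_US.news.txt",
               blogs   = "data/final/en_US/en_US.blogs.txt")
# File size in bytes, converted to megabytes
sizesMB <- sapply(filePaths, function(f) file.info(f)$size / 1024^2)
round(sizesMB, 2)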
Using the wc utility with the options -l, -w, and -m, we capture the line, word, and character counts respectively:
wc -lwm data/final/en_US/en_US.twitter.txt
## 2360148 30373603 166816544 data/final/en_US/en_US.twitter.txt
Employing the same wc utility with the -lwm options on the rest of the files, we get:
## 1010242 34372530 205243643 data/final/en_US/en_US.news.txt
## 899288 37334147 208623081 data/final/en_US/en_US.blogs.txt
We have highlighted a basic summary of the three files.
Now we can arrange this data to gain some insight into the differences that might identify these sources, as well as some similarities that may be underlying facets of the language they are written in.
##                  Lines     Words  Characters Words per Line Char per Word
## twitter        2360148  30373603   166816544          12.87          5.49
## news           1010242  34372530   205243643          34.02          5.97
## blogs           899288  37334147   208623081          41.52          5.59
## twitter/news       2.3      0.88        0.81           0.38          0.92
## twitter/blogs      2.6      0.81        0.80           0.31          0.98
## news/blogs         1.1      0.92        0.98           0.82          1.07
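For reference, a sketch of how such a table could be assembled in R from the wc counts above; the derived rows simply divide one file's values by another's, and the exact rounding is an assumption:
# Raw counts from wc -lwm on the three files
m <- rbind(twitter = c(2360148, 30373603, 166816544),
           news    = c(1010242, 34372530, 205243643),
           blogs   = c( 899288, 37334147, 208623081))
colnames(m) <- c("Lines", "Words", "Characters")
# Derived per-line and per-word averages
m <- cbind(m,
           "Words per Line" = m[, "Words"] / m[, "Lines"],
           "Char per Word"  = m[, "Characters"] / m[, "Words"])
# Pairwise ratios highlight relative differences between the sources
ratios <- rbind("twitter/news"  = m["twitter", ] / m["news", ],
                "twitter/blogs" = m["twitter", ] / m["blogs", ],
                "news/blogs"    = m["news", ]    / m["blogs", ])
round(rbind(m, ratios), 2)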
The above table has been created to highlight the similarities and differences in the corpora so that we might begin to glean some information about the files, their contents, and even the language they are written in.
Twitter and News Files
While the twitter file has more than twice as many lines as the news file, the news file contains more words and more characters.
The blogs and news files have similar line, word, and character counts, with 92% similarity in word count.
The twitter and news file contrasts make sense since twitter is composed of tweets, which are short lines of at most 140 characters. This could be considered an identifying trait of a twitter corpus.
Blog and News Files
While the blogs and news files did not deviate from each other as much as they did from the twitter file's average of 12.87 words per line, the news file had about 18% fewer words per line than the blogs file. This was the only one of these factors that differed by more than 10% between the news and blogs files.
It makes sense that the blogs and news files have more in common with each other than with the twitter file, since both blogs and the news are made up of articles rather than short phrases.
The largest variance between the two being words per line also makes sense. Blogs are written in more of a narrative voice, often producing run-on sentences similar to everyday speech, while news articles are written to be concise and to the point.
Here we use a profanity list found at bannedwordlist.com.
# Download the profanity list into the Rdata directory
badUrl <- "http://www.bannedwordlist.com/lists/swearWords.txt"
download.file(badUrl, destfile = "./data/Rdata/badWords.txt")
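One way such a list might be applied, sketched here as an assumption rather than the exact filtering step used in this report, is to drop any token that appears in the list:
# Read the profanity list; split on commas in case the list is comma separated
badWords <- readLines("./data/Rdata/badWords.txt", warn = FALSE)
badWords <- tolower(trimws(unlist(strsplit(badWords, ","))))
# Hypothetical example: 'tokens' is a character vector of words from one corpus
tokens <- c("we", "went", "to", "the", "mall")
cleanTokens <- tokens[!(tolower(tokens) %in% badWords)]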
Tokenization is the task of chopping up a body of text, or corpus, into pieces we call tokens.
For example: “We went to the mall and had a great time, then we went back home”
Could be tokenized into: |We| |went| |to| |the| |mall| |and| |had| |a| |great| |time| |then| |we| |went| |back| |home|
Notice that the comma was removed from the text.
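As an illustration, a small R sketch that reproduces this example by stripping punctuation and splitting on whitespace (this is only for the example sentence, not the pipeline used on the full files):
sentence <- "We went to the mall and had a great time, then we went back home"
# Remove punctuation, then split the sentence on whitespace to get tokens
cleaned <- gsub("[[:punct:]]", "", sentence)
tokens <- unlist(strsplit(cleaned, "\\s+"))
tokens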
In any case, a robust model will be created that can be revised over time if a larger scope is taken on later. Acronyms, slang, and purposefully misspelled words, as well as hashtags in the twitter file, come to mind as items to address in a later pass.
Using the Unix utilities tr, sort, uniq, and cat, along with the pipe command |, we will take our downloaded raw data text files and tokenize them by word.
Tokenization Steps:
1. tr to change all capital letters to lowercase letters: tr 'A-Z' 'a-z' < filePath
2. tr with options s and c to place each word on its own line: tr -sc 'A-Za-z' '\n'
3. sort to group identical words together so they can be counted: sort -n -r
4. uniq to combine same words while keeping count of how many of each there are with option c, disregarding capitalization with option i: uniq -ci
5. sort using the option n to order by count and r to reverse the order (so that the highest frequency words are at the top of the list): sort -nr
6. cat to write the result to a file: cat > destinationPath.txt
We can easily nest all of these steps, with each utility and its associated options, by using the pipe command |
The pipe command is as though you told the interpreter to “Make or Do This THEN Make or Do something with what you created or did”
Where this would be written as Do This to make That | Do something with That
Our final tokenization script will look like this:
tr 'A-Z' 'a-z' < filePath | tr -sc 'A-Za-z' '\n' | sort -n -r | uniq -ci | sort -nr | cat > destinationPath
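For example, the same pipeline could be run from R with system() (see the notes on system() at the end of this report); the destination file name here is only an illustrative choice:
inFile  <- "data/final/en_US/en_US.twitter.txt"
outFile <- "./data/Rdata/twitterTokens.txt"   # illustrative destination path
# Build the shell pipeline as a string and hand it to the shell
cmd <- paste0("tr 'A-Z' 'a-z' < ", inFile,
              " | tr -sc 'A-Za-z' '\\n' | sort -n -r | uniq -ci | sort -nr | cat > ", outFile)
system(cmd)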
Number of Tokens (Unique Words)
Now that we are working with much smaller files of 4.18MB, 2.77MB, and 3.33MB for the twitter, news, and blogs unique word counts respectively, we can use the R function read.table(filePath) to create a data frame of the tokens for further inspection.
Counting the rows of each data frame with nrow(), we find:
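A sketch of that step, assuming the uniq -ci output parses into a count column and a word column (the token file path is the illustrative name used in the sketch above, not necessarily the one used in the report):
# uniq -ci output has a leading count followed by the word
twitterTokens <- read.table("./data/Rdata/twitterTokens.txt",
                            col.names = c("count", "word"),
                            stringsAsFactors = FALSE)
nrow(twitterTokens)   # number of unique words (tokens) in the twitter corpus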
Most Common UniGrams
A unigram is a single-word phrase, so in this case we are simply listing the highest frequency words from each corpus. Next we will be creating statistics from n-grams, or multi-word phrases, to help in building our prediction model.
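A minimal sketch of how one of the exploratory bar charts below could be produced, assuming the token data frame from the previous sketch (the column names count and word are an assumption):
# Order by frequency and keep the ten most common unigrams
topTwitter <- head(twitterTokens[order(-twitterTokens$count), ], 10)
barplot(topTwitter$count,
        names.arg = topTwitter$word,
        las = 2,
        main = "Most common unigrams: twitter")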
Quickly looking over these exploratory bar charts we can see a few important things:
system() function
The system() function lets us run a shell command from R with system("Shell.Command"). For example, to use the wc utility to count the lines of a file located at filePath, system("wc -l filePath") will return the number of lines along with the file path.
Sys.setenv(Var = "value")
Sys.setenv(Var = "value") exports Var so that it can be used in a bash code chunk. For example, after Sys.setenv(path = "./data/final/en_US/en_US.twitter.txt"), the path variable is available in a bash code chunk, so running wc -l on it counts the lines of ./data/final/en_US/en_US.twitter.txt from within that chunk.
TODO
Use Sys.setenv() to export variables, like file paths, to bash code chunks
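As a starting point for that TODO item, a minimal sketch: Sys.setenv() sets an environment variable that both system() calls and bash code chunks can read as $path (whether this matches the final report setup is an assumption):
Sys.setenv(path = "./data/final/en_US/en_US.twitter.txt")
# The shell started by system() inherits the exported variable and expands $path
system("wc -l $path")
# In an R Markdown bash code chunk the same variable is available, e.g.:
#   wc -l $path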