Milestone Report

This is the milestone report for the Coursera Data Science Capstone course. We’ll start by loading in all the needed libraries and globals.
Based on a thread in the discussion forum, I decided to employ Google’s bad word list for profanity filtering.

Code can be off-putting to the non-technical reader, so the body of this report shows only results. For technical readers, the code used to generate this report is contained in the appendix at the end.

Demonstrate that you’ve downloaded the data and have successfully loaded it in.

I manually downloaded the file and unzipped it. Summary statistics for the full files (file size, line count, word count) are shown below.
Because I don’t yet have a data product in use, I’m not going to analyze differences in word frequencies by source beyond these word counts: I wouldn’t know what to infer from a finding like “twitter has more verbs.” The value of each source to the corpus is better judged by user satisfaction with the predictive typing application.

File	File.Size	Lines	Words
blogs	200.42 MB	899288	37570839
news	196.28 MB	1010242	34494539
twitter	159.36 MB	2360148	30451128

Create a basic report of summary statistics about the data sets.

As described above, I have shown the length, size, and word counts of the original files. From now on, I’ll be working with a 10% random sample from each file.
The code (see the appendix) creates the corpus (i.e., the collection of texts from the three data sources) and counts of one- to three-word phrases (n-grams) from the corpus.
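
As a rough illustration of the 10% sampling step, here is a minimal base R sketch. The helper name `sample_lines` and the toy input are hypothetical and are not part of the report's own code; the sketch assumes a uniform random sample of lines without replacement is what is wanted.

```r
# Hypothetical helper: keep a uniform random subset of lines.
sample_lines <- function(lines, proportion = 0.10) {
  n_keep <- round(length(lines) * proportion)
  # sample.int draws distinct indices, so no line is picked twice;
  # sorting keeps the sampled lines in their original file order
  keep <- sort(sample.int(length(lines), n_keep))
  lines[keep]
}

set.seed(1234)
toy_lines <- paste("line", 1:1000)   # stand-in for readLines() output
reduced <- sample_lines(toy_lines)
length(reduced)                      # 100 lines, i.e. 10% of the toy input
```

Sorting the indices is optional; it only makes the reduced file easier to compare against the original.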

Given the task, we don’t really need any special characters that aren’t part of a word. To keep things simple, I don’t want to save the following tokens/characters:

end/start of word punctuation symbols
numbers and special characters that aren’t part of normal language
extra whitespace
emoji
profanity

I also want to keep stop words as they are commonly typed and are valid suggestions.
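
To make the cleaning rules above concrete, here is a small base R sketch. It is not the tm pipeline from the appendix: the function name `clean_text`, the tiny banned-word list, and the choice to keep apostrophes inside words are all my assumptions for illustration.

```r
# Hypothetical cleaner: lowercases, drops non-letters, strips edge
# apostrophes, removes banned words, and collapses whitespace.
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)
  # replace everything that is not a letter, apostrophe, or space
  # (digits, symbols, emoji, tabs) with a space
  x <- gsub("[^a-z' ]", " ", x)
  # drop apostrophes at the start or end of a word, keep ones inside words
  x <- gsub("(?<![a-z])'|'(?![a-z])", " ", x, perl = TRUE)
  tokens <- strsplit(x, " +")
  vapply(tokens, function(tok) {
    # discard empty tokens and banned words, then rejoin with single spaces
    paste(tok[nzchar(tok) & !(tok %in% banned)], collapse = " ")
  }, character(1))
}

clean_text("  Hello, WORLD!! It's 2015 :) 'quoted' darn  ", banned = "darn")
# "hello world it's quoted"
```

Stop words are left alone on purpose, so "the cat AND the hat" comes back as "the cat and the hat".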

First, we create the corpus.

Next, I create the term-frequency tables, a portion of which I show as visualizations and tables.
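
For readers who want to see what a term-frequency table for n-grams looks like without tm or RWeka, here is a base R sketch. The function `count_ngrams` and the three toy sentences are hypothetical; the sketch assumes text that is already cleaned and that n-grams should not cross line boundaries.

```r
# Hypothetical helper: count n-grams of order n in a character vector of lines.
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, " +"), function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))   # line too short for this n
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(phrase = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_texts <- c("thanks for the follow", "thanks for the help", "one of the best")
bigrams  <- count_ngrams(toy_texts, 2)
trigrams <- count_ngrams(toy_texts, 3)
head(bigrams, 2)   # "for the" and "thanks for" each appear twice
```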

One way to summarize terms in a meaningful way is to look at the top-n terms for each n-gram. Below is a barplot for the most frequently used terms (top ten), by n-gram.

It is somewhat difficult to see the counts for the 3-gram terms so I also include a table below.

phrase	count	ngram
the	477860	1
and	238663	1
for	113042	1
that	104479	1
you	88208	1
with	73475	1
was	66980	1
have	54983	1
this	53876	1
but	49962	1
in the	43608	2
of the	41140	2
for the	22197	2
to the	21846	2
on the	18982	2
to be	15623	2
at the	15290	2
and the	13911	2
in a	12679	2
is a	10643	2
a lot of	3602	3
thanks for the	3160	3
one of the	2865	3
going to be	2016	3
to be a	1967	3
looking forward to	1588	3
i love you	1531	3
as well as	1486	3
the fact that	1431	3
for the follow	1355	3

Report any interesting findings that you amassed so far.

One finding that surprised me relates to a task question that asked:
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Answering the question directly (below), we see that single words reach the 50% mark quickly, while reaching the 90% mark takes far more words.

## [1] "Unigram 50% mark reached at 313 words."

## [1] "Unigram 90% mark reached at 7674 words."
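
The coverage question can be sketched on toy data as follows. The function `words_for_coverage` and the counts are made up for illustration. This version returns the position of the first word at which the cumulative share reaches the target, which can differ by one from the `findInterval` approach used in the appendix.

```r
# Hypothetical helper: how many of the most frequent words are needed
# to cover at least `target` (a share between 0 and 1) of all word instances?
words_for_coverage <- function(counts, target) {
  counts <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(counts) / sum(counts)
  unname(which(cum_share >= target)[1])
}

# Made-up counts; cumulative shares are 0.50, 0.70, 0.85, 0.95, 1.00
toy_counts <- c(the = 50, and = 20, with = 15, that = 10, you = 5)
words_for_coverage(toy_counts, 0.5)   # 1
words_for_coverage(toy_counts, 0.9)   # 4
```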

I expected this distribution, shown in the cumulative distribution below, to flatten out for two- and three-word phrases, but I was surprised at how much more quickly single words reach each percentile.

Get feedback on your plans for creating a prediction algorithm and Shiny app.

Please note: this section is not written for the non-data-scientist reader I described at the beginning of the report, because I wouldn’t be looking to them for technical advice.

I do understand that there will be performance issues and there are tradeoffs in that:

larger dictionaries and longer n-grams tend to imply slower performance.
more robust algorithms tend to imply slower performance.
dynamic shiny apps that respond while you are typing tend to imply slower performance.

For the shiny app, I’m contemplating three options:

Autocomplete (i.e., dynamic updating as you type).
Two words suggested after a user types a partial phrase in an entry box and presses a button.
User has two button options (one word, two words) for suggestion.

If dynamic updating causes too much delay, I’m going to opt for one of the bottom two options.

With regard to NLP, I plan to use the largest n-gram order I can without taking a large performance hit. For the algorithm, I plan to use a model based on conditional probability, such as Stupid Backoff, but I can’t really comment until I see it in action. After reading a review of this capstone, I am concerned about performance. The review has also inspired me to look for some MOOCs on NLP to increase my knowledge.
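
Since I haven't built the model yet, the following is only a sketch of how Stupid Backoff scoring could look in base R. The function name, the toy counts, and the storage format (one named vector of n-gram counts) are assumptions; the backoff factor of 0.4 is the commonly cited default, not something I have tuned.

```r
# Toy n-gram counts (made up for illustration); names are space-separated phrases.
toy_counts <- c("the" = 3, "follow" = 1, "best" = 1,
                "for the" = 2, "the follow" = 1, "the best" = 1,
                "for the follow" = 1)
toy_total_words <- 12  # assumed number of word tokens in the toy corpus

# Hypothetical scorer: use the relative frequency of the longest observed
# n-gram, multiplying by alpha each time the context is shortened.
stupid_backoff <- function(context, candidate, counts, total_words, alpha = 0.4) {
  weight <- 1
  repeat {
    if (length(context) == 0) {
      # base case: unigram relative frequency (0 for unseen words)
      uni <- unname(counts[candidate])
      if (is.na(uni)) uni <- 0
      return(weight * uni / total_words)
    }
    ctx <- paste(context, collapse = " ")
    full_count <- unname(counts[paste(ctx, candidate)])
    ctx_count <- unname(counts[ctx])
    if (!is.na(full_count) && !is.na(ctx_count)) {
      return(weight * full_count / ctx_count)
    }
    context <- context[-1]       # drop the earliest context word
    weight <- weight * alpha     # penalize the shorter context
  }
}

stupid_backoff(c("for", "the"), "follow", toy_counts, toy_total_words)  # 0.5
stupid_backoff(c("for", "the"), "best", toy_counts, toy_total_words)    # 0.4 * 1/3
```

The scores are not true probabilities (they don't sum to one), which is part of what keeps the method cheap. A lookup table keyed by context would likely be faster than a named vector at full corpus size.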

Appendix

set.seed(1234)
library(dplyr)
library(tm)
library(RWeka)
library(ggplot2)
library(Matrix)
library(gridExtra)
library(pander)
library(stringi)
options(mc.cores=1) # NGramTokenizer generates an error in OS X without this option
profanity <- read.csv("google_twunter_lol.txt", header=FALSE, stringsAsFactors=FALSE)
profanity <- unlist(strsplit(profanity[,1], split=":1"))

# globals for file and path parts
ddir <- './final/en_US/'
prefix <- 'en_US.'
extension <- '.txt'
suffix <- '_reduced'
## create samples, report summary statistics for full files
count_things <- function(input) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  full_file <- readLines(input_file)
  word_stats <- stri_stats_latex(full_file)    # named vector; includes "Words"
  line_stats <- stri_stats_general(full_file)  # named vector; includes "Lines"
  file_size <- round(file.info(input_file)$size / 1024^2, 2)  # bytes to MB
  # return a one-row data frame of summary statistics for this file
  data.frame("File" = input, "File Size" = paste(file_size, "MB"),
             "Lines" = line_stats["Lines"], "Words" = word_stats["Words"])
}

create_sampled_file <- function(input, proportion = 0.10) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  output_file <- paste(ddir, prefix, input, suffix, extension, sep="")
  full_file <- readLines(input_file)
  file_length <- length(full_file)
  sample_size <- round(proportion * file_length)
  # draw a uniform random sample of line indices, without replacement
  full_file <- full_file[sample.int(file_length, sample_size)]
  write(full_file, file = output_file)
}

# I'll provide additional summaries (word counts, etc.) from the reduced data.
create_sampled_file('blogs')
create_sampled_file('news')
create_sampled_file('twitter')
counts <- (rbind(count_things('blogs'), count_things('news'), count_things('twitter')))
row.names(counts) <- NULL
pander(counts)
corpus <- Corpus(DirSource(ddir, pattern=suffix, encoding="UTF-8"),
                 readerControl = list(reader=readPlain,
                                      language="en",
                                      load=TRUE))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)
# strip whitespace last, since the removals above leave extra spaces behind
corpus <- tm_map(corpus, stripWhitespace)
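# build a frequency-sorted table of n-grams with orders from lower to upper
ngrammer <- function(lower, upper) {
  ngfun <- function(x) NGramTokenizer(x, Weka_control(min=lower, max=upper))
  # tm's default wordLengths = c(3, Inf) also applies here, so terms
  # shorter than three characters are dropped
  ng <- TermDocumentMatrix(corpus, control = list(tokenize = ngfun))
  # sum each term's counts across the three documents
  ng_df <- data.frame(phrase = ng$dimnames$Terms,
                      count = rowSums(sparseMatrix(i = ng$i, j=ng$j, x= ng$v)))
  ng_df <- arrange(ng_df, desc(count))
  # running total and cumulative share, used for the coverage plot
  ng_df <- mutate(ng_df, running=cumsum(count), cperc=running/max(running),
                  ngram=lower, record=1:nrow(ng_df))
  ng_df
}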
gram1freq <- ngrammer(1,1)
gram2freq <- ngrammer(2,2)
gram3freq <- ngrammer(3,3)
all_grams <- rbind(gram1freq, gram2freq, gram3freq)
all_tops <- all_grams %>% group_by(ngram) %>% top_n(10, count)
all_tops$phrase <- factor(all_tops$phrase, levels=all_tops[order(all_tops$count, decreasing=TRUE),]$phrase)
ggplot(all_tops, aes(x=phrase, y=count/1000, order=phrase, fill=factor(ngram))) + geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle=90)) + xlab("term") + 
  ylab("count (in thousands)") + ggtitle("Top ten terms for 1, 2, and 3-grams") +
  facet_wrap(~ ngram, scales = "free_x") + scale_fill_discrete(name="n-gram")
pander(all_tops[,c(1:2,5)])
paste("Unigram 50% mark reached at", findInterval( max(gram1freq$running) * .5, gram1freq$running), "words.")
paste("Unigram 90% mark reached at", findInterval( max(gram1freq$running) * .9, gram1freq$running), "words.")
ggplot(all_grams, aes(x=record, y=cperc, color=factor(ngram))) + geom_line(size=2) + 
  xlab("ordered term number") + ylab("percent of total terms (non-unique)") +
  ggtitle("Frequency sorted dictionary distribution") +
  scale_color_discrete(name="n-gram")