This is the milestone report for the Coursera Data Science Capstone course. We’ll start by loading in all the needed libraries and globals.
Based on a thread in the discussion forum, I decided to employ Google’s bad word list for profanity filtering.

Code might be frightening to the non-technical reader, so the body of this report shows results only. For technical readers, the code used to generate this report is contained in the appendix at the end.

Demonstrate that you’ve downloaded the data and have successfully loaded it in.

I manually downloaded the file and unzipped it. Summary statistics for full files (filesize, line count, word count) are shown below.
Beyond word counts, and given that I don’t have an actual in-use data product, I’m not going to analyze differences in word frequencies by source, because I wouldn’t know what to infer from a finding like “twitter has more verbs.” The value of each source to the corpus could be better determined by looking at user satisfaction with the predictive typing application.

File      File Size    Lines      Words
blogs     200.42 MB    899288     37570839
news      196.28 MB    1010242    34494539
twitter   159.36 MB    2360148    30451128

Create a basic report of summary statistics about the data sets.

As described above, I have shown the length, size, and word counts of the original files. From now on, I’ll be working with a 10% random sample from each file.
The code below creates the corpus (i.e., the collection of texts from the three data sources), and one to three word phrase counts (ngrams) from the corpus.
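
To make the 10% sampling step concrete, here is a minimal sketch in base R. The helper name sample_lines and the toy input are assumptions for illustration only; the idea is simply to pick a fixed proportion of line indices uniformly at random, without replacement.

```r
# Hypothetical helper (illustration only): keep a uniform random sample of
# lines, drawn without replacement.
sample_lines <- function(lines, proportion = 0.10) {
  n <- length(lines)
  k <- round(proportion * n)       # number of lines to keep
  lines[sort(sample.int(n, k))]    # sort() preserves the original line order
}

set.seed(1234)
toy <- paste("line", 1:50)         # stand-in for the lines of one source file
picked <- sample_lines(toy)
length(picked)                     # 5 lines, i.e. 10% of 50
```

Sampling indices without replacement guarantees that no line appears twice in the reduced file.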

Given the task, we don’t really need any special characters that aren’t part of a word. To keep things simple, I don’t want to keep the following tokens/characters: punctuation, numbers, extra whitespace, and profanity (see the tm_map calls in the appendix).

I also want to keep stop words, as they are commonly typed and are valid suggestions.
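
As a rough illustration of this kind of cleaning, the sketch below uses only base R regular expressions. The function name clean_text is an assumption, and this is not the tm pipeline from the appendix; it only shows the effect on one made-up line. Stop words are deliberately left in place.

```r
# Hypothetical helper (illustration only): lowercase the text, then drop
# punctuation and digits and collapse the whitespace that is left behind.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("I can't wait for the 2 of you!!  Thanks for the follow.")
cleaned
# "i cant wait for the of you thanks for the follow"
```

Collapsing whitespace last matters here, because removing a number or a punctuation run can leave two spaces side by side.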

First, we create the corpus.

Next, I create the term-frequency tables, a portion of which I show as visualizations and tables.
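
For readers who want to see what a term-frequency table is without the tm and RWeka machinery, here is a small base R sketch. The helper count_ngrams and the three toy lines are assumptions made up for illustration; it expects text that has already been cleaned and is not empty.

```r
# Hypothetical helper (illustration only): count n-grams of size n in
# already-cleaned lines of text.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]                  # drop empty tokens
    if (length(words) < n) return(character(0))    # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)  # most frequent first
  data.frame(phrase = names(counts), count = as.integer(counts),
             ngram = rep(n, length(counts)), stringsAsFactors = FALSE)
}

toy <- c("thanks for the follow", "thanks for the coffee", "one of the best")
bigrams <- count_ngrams(toy, 2)
trigrams <- count_ngrams(toy, 3)
head(bigrams)
```

N-grams are built per line, so a phrase never spans the end of one line and the start of the next.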

One way to summarize terms in a meaningful way is to look at the top-n terms for each n-gram. Below is a barplot for the most frequently used terms (top ten), by n-gram.

It is somewhat difficult to see the counts for the 3-gram terms, so I also include a table below.

phrase                count     ngram
the                   477860    1
and                   238663    1
for                   113042    1
that                  104479    1
you                   88208     1
with                  73475     1
was                   66980     1
have                  54983     1
this                  53876     1
but                   49962     1
in the                43608     2
of the                41140     2
for the               22197     2
to the                21846     2
on the                18982     2
to be                 15623     2
at the                15290     2
and the               13911     2
in a                  12679     2
is a                  10643     2
a lot of              3602      3
thanks for the        3160      3
one of the            2865      3
going to be           2016      3
to be a               1967      3
looking forward to    1588      3
i love you            1531      3
as well as            1486      3
the fact that         1431      3
for the follow        1355      3

Report any interesting findings that you amassed so far.

One finding that surprised me relates to a task question that asked:
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Answering the question simply (below), we see that for single words we reach the 50% mark fairly quickly, while reaching the 90% mark takes about 25 times as many words.

## [1] "Unigram 50% mark reached at 313 words."
## [1] "Unigram 90% mark reached at 7674 words."
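
The coverage question can be sketched on toy data as follows. The helper words_for_coverage and the counts are made up for illustration. It returns the smallest number of top-ranked terms whose cumulative share reaches the target. The appendix uses findInterval() on the running totals instead, which counts the terms whose running total is still at or below the target, so the two approaches can differ by one.

```r
# Hypothetical helper (illustration only): smallest number of top-ranked
# terms needed to cover at least proportion p of all term instances.
words_for_coverage <- function(counts, p) {
  counts <- sort(counts, decreasing = TRUE)
  cumulative_share <- cumsum(counts) / sum(counts)
  unname(which(cumulative_share >= p)[1])
}

# Made-up counts, not taken from the corpus.
toy_counts <- c(the = 40, and = 25, of = 15, that = 8, you = 7, with = 5)
half <- words_for_coverage(toy_counts, 0.5)    # 2 terms cover 65%
ninety <- words_for_coverage(toy_counts, 0.9)  # 5 terms cover 95%
```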

I expected the cumulative distribution, shown below, to flatten out as we moved to two- and three-word phrases, but I was surprised at how much more quickly single words reach each percentile.

Get feedback on your plans for creating a prediction algorithm and Shiny app.

Please note: this section is not written for the non-technical reader I described at the beginning of the report, because I wouldn’t be looking to them for technical advice.

I do understand that there will be performance issues and there are tradeoffs in that:

For the Shiny app, I’m contemplating three options:

  1. Autocomplete (i.e., dynamic updating as you type).
  2. Two words suggested after a user types a partial phrase in an entry box and presses a button.
  3. User has two button options (one word, two words) for suggestion.

If dynamic updating causes too much delay, I’m going to opt for one of the latter two options.

With regard to NLP, my plan is to use the largest n-gram I can get away with without taking a huge performance hit. For the algorithm, I plan on using an NLP model based on conditional probability, such as Stupid Backoff, but I can’t really comment until I see it in action. After reading a review of this capstone, I am obviously concerned with performance. It has also inspired me to look for some MOOCs on NLP to increase my knowledge.
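
To show what a Stupid Backoff style score looks like, here is a minimal sketch, not a planned implementation. The function name, the toy counts, and the layout of the tables list are all assumptions for illustration. The idea: score a word by count(context + word) / count(context); if the longer n-gram was never seen, drop the oldest context word and multiply the score by a fixed factor (0.4 is the commonly cited default). The results are scores, not true probabilities.

```r
# Made-up n-gram counts (not from the corpus); tables[[n]] holds n-gram counts.
unigrams <- c(thanks = 3, "for" = 4, the = 6, follow = 2, coffee = 1)
bigrams  <- c("thanks for" = 3, "for the" = 4, "the follow" = 2, "the coffee" = 1)
trigrams <- c("thanks for the" = 3, "for the follow" = 2, "for the coffee" = 1)
tables <- list(unigrams, bigrams, trigrams)

# Hypothetical scoring function (illustration only).
stupid_backoff <- function(context, word, tables, alpha = 0.4) {
  context <- tail(context, length(tables) - 1)   # use at most n - 1 context words
  penalty <- 1
  repeat {
    k <- length(context)
    if (k == 0) {
      # Base case: relative frequency of the word among all unigram instances.
      count <- tables[[1]][word]
      if (is.na(count)) count <- 0
      return(unname(penalty * count / sum(tables[[1]])))
    }
    num <- tables[[k + 1]][paste(c(context, word), collapse = " ")]
    den <- tables[[k]][paste(context, collapse = " ")]
    if (!is.na(num) && !is.na(den)) return(unname(penalty * num / den))
    context <- context[-1]        # back off: drop the oldest context word
    penalty <- penalty * alpha    # and discount the score
  }
}

# Score every known word after "for the" and suggest the best one.
candidates <- names(unigrams)
scores <- vapply(candidates,
                 function(w) stupid_backoff(c("for", "the"), w, tables),
                 numeric(1))
best <- names(which.max(scores))   # "follow"
```

Because every lookup is a single named-vector access, the cost per suggestion grows with the number of candidate words rather than with corpus size, which is relevant to the performance concern above.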

Appendix

set.seed(1234)
library(dplyr)
library(tm)
library(RWeka)
library(ggplot2)
library(Matrix)
library(gridExtra)
library(pander)
library(stringi)
options(mc.cores=1) # NGramTokenizer generates an error in OS X without this option
profanity <- read.csv("google_twunter_lol.txt", header=FALSE, stringsAsFactors=FALSE)
profanity <- unlist(strsplit(profanity[,1], split=":1"))

# globals for file and path parts
ddir <- './final/en_US/'
prefix <- 'en_US.'
extension <- '.txt'
suffix <- '_reduced'
## create samples, report summary statistics for full files
count_things <- function(input) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  full_file <- readLines(input_file)
  word_stats <- stri_stats_latex(full_file)
  line_stats <- stri_stats_general(full_file)
  file_size <- round(file.info(input_file)$size / 1024^2, 2)
  # Return the one-row summary explicitly instead of relying on the
  # invisible value of an assignment.
  data.frame("File" = input, "File Size" = paste(file_size, "MB"),
             "Lines" = line_stats["Lines"], "Words" = word_stats["Words"])
}

create_sampled_file <- function(input, proportion = 0.10) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  output_file <- paste(ddir, prefix, input, suffix, extension, sep="")
  full_file <- readLines(input_file)
  file_length <- length(full_file)
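  # Draw line indices uniformly at random, without replacement.
  # (rbinom() would return values clustered around file_length / 2,
  # with repeats, rather than a uniform sample of lines.)
  sample_size <- round(proportion * file_length)
  full_file <- full_file[sample.int(file_length, sample_size)]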
  write(full_file, file = output_file)
}

# I'll provide additional summaries (word counts, etc.) from the reduced data.
create_sampled_file('blogs')
create_sampled_file('news')
create_sampled_file('twitter')
counts <- rbind(count_things('blogs'), count_things('news'), count_things('twitter'))
row.names(counts) <- NULL
pander(counts)
corpus <- Corpus(DirSource(ddir, pattern=suffix, encoding="UTF-8"),
                 readerControl = list(reader=readPlain,
                                      language="en",
                                      load=TRUE))
# Clean the corpus: lowercase, then strip whitespace, punctuation, numbers, and profanity.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)
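# Build a frequency table for n-grams of length lower..upper.
ngrammer <- function(lower, upper) {
  ngfun <- function(x) NGramTokenizer(x, Weka_control(min=lower, max=upper))
  ng <- TermDocumentMatrix(corpus, control = list(tokenize = ngfun))
  # Sum each term's counts across the three documents.
  ng_df <- data.frame(phrase = ng$dimnames$Terms,
                      count = rowSums(sparseMatrix(i = ng$i, j=ng$j, x= ng$v)))
  ng_df <- arrange(ng_df, desc(count))
  # running = cumulative count; cperc = cumulative share of all instances;
  # record = rank of the term in the frequency-sorted table.
  ng_df <- mutate(ng_df, running=cumsum(count), cperc=running/max(running),
                  ngram=lower, record=1:nrow(ng_df))
  ng_df
}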
gram1freq <- ngrammer(1,1)
gram2freq <- ngrammer(2,2)
gram3freq <- ngrammer(3,3)
all_grams <- rbind(gram1freq, gram2freq, gram3freq)
all_tops <- all_grams %>% group_by(ngram) %>% top_n(10, count)
all_tops$phrase <- factor(all_tops$phrase, levels=all_tops[order(all_tops$count, decreasing=TRUE),]$phrase)
ggplot(all_tops, aes(x=phrase, y=count/1000, order=phrase, fill=factor(ngram))) + geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle=90)) + xlab("term") + 
  ylab("count (in thousands)") + ggtitle("Top ten terms for 1, 2, and 3-grams") +
  facet_wrap(~ ngram, scales = "free_x") + scale_fill_discrete(name="n-gram")
pander(all_tops[,c(1:2,5)])
paste("Unigram 50% mark reached at", findInterval( max(gram1freq$running) * .5, gram1freq$running), "words.")
paste("Unigram 90% mark reached at", findInterval( max(gram1freq$running) * .9, gram1freq$running), "words.")
ggplot(all_grams, aes(x=record, y=cperc, color=factor(ngram))) + geom_line(size=2) + 
  xlab("ordered term number") + ylab("percent of total terms (non-unique)") +
  ggtitle("Frequency sorted dictionary distribution") +
  scale_color_discrete(name="n-gram")