This is the milestone report for the Coursera Data Science Capstone course. We’ll start by loading all the needed libraries and globals.
Based on a thread in the discussion forum, I decided to employ Google’s bad word list for profanity filtering.
This report contains code that might be frightening to the non-technical reader. For technical readers, the code used to generate this report is contained in the appendix at the end.
I manually downloaded the file and unzipped it. Summary statistics for full files (filesize, line count, word count) are shown below.
Beyond word counts and given that I don’t have an actual in-use data product, I’m not going to analyze the difference in word frequencies by sources because I really wouldn’t know what to infer from a finding like “twitter has more verbs.” The value of sources for the corpus could better be determined by looking at user satisfaction with the predictive typing application.
| File | File Size | Lines | Words |
|---|---|---|---|
| blogs | 200.42 MB | 899288 | 37570839 |
| news | 196.28 MB | 1010242 | 34494539 |
| twitter | 159.36 MB | 2360148 | 30451128 |
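As a rough illustration of how statistics like these can be gathered, here is a minimal base R sketch. The helper name `file_summary` is hypothetical, and splitting words on whitespace is an assumption that may not match the word definition used by `stringi` in the appendix, so counts could differ slightly.

```r
# Sketch: size, line count, and word count for one text file (base R only).
# `file_summary` is a hypothetical helper; words are split on whitespace,
# which is an assumption and may differ from stringi's word definition.
file_summary <- function(path) {
  lines <- readLines(path, warn = FALSE)
  words <- sum(lengths(strsplit(trimws(lines), "\\s+")))
  data.frame(File = basename(path),
             Size.MB = round(file.info(path)$size / 1024^2, 2),
             Lines = length(lines),
             Words = words)
}

# Toy usage on a temporary two-line file
tmp <- tempfile(fileext = ".txt")
writeLines(c("thanks for the follow", "i love you"), tmp)
toy_summary <- file_summary(tmp)
toy_summary[, c("Lines", "Words")]
```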
As described above, I have shown the length, size, and word counts of the original files. From now on, I’ll be working with a 10% random sample from each file.
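One way to draw such a sample is to pick line indices uniformly at random without replacement. The sketch below assumes the lines are already in memory; `sample_lines` is a hypothetical helper name and the toy input is made up.

```r
# Minimal sketch: uniform random sample of lines, without replacement.
# `sample_lines` is a hypothetical helper, not part of the report's code.
sample_lines <- function(lines, proportion = 0.10) {
  n_keep <- floor(proportion * length(lines))
  # sample.int() returns n_keep distinct indices, each equally likely
  keep <- sort(sample.int(length(lines), size = n_keep))
  lines[keep]
}

set.seed(1234)
toy_lines <- paste("line", 1:50)
sampled <- sample_lines(toy_lines)
length(sampled)  # 5 lines, i.e. 10% of 50
```

Sorting the indices keeps the sampled lines in their original file order, which is a design choice rather than a requirement.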
The code below creates the corpus (i.e., the collection of texts from the three data sources), and one to three word phrase counts (ngrams) from the corpus.
Given the task, we don’t really need any special characters that aren’t part of a word. To keep things simple, I don’t want to keep the following tokens/characters: extra whitespace, punctuation, numbers, and profanity.
I also want to keep stop words, as they are commonly typed and are valid suggestions.
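To make the cleaning rules concrete, here is a small base R sketch applied to a single string. It is not the `tm` pipeline from the appendix: `clean_text` and `bad_words` are hypothetical names, and the one-word profanity list is a stand-in for the real list.

```r
# Sketch of the cleaning rules on a character vector, using base R only.
# `clean_text` and `bad_words` are hypothetical names for illustration.
clean_text <- function(x, bad_words = character(0)) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  if (length(bad_words) > 0) {
    # remove listed words only where they appear as whole words
    pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("I can't WAIT for the 3 darn games!!", bad_words = "darn")
cleaned  # "i cant wait for the games"
```

Stop words such as "for" and "the" pass through untouched, since they are valid suggestions for predictive typing.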
First, we create the corpus.
Next, I create the term-frequency tables, a portion of which I show as visualizations and tables.
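For readers unfamiliar with n-gram counting, the idea can be sketched in base R without `RWeka` or `tm`. The helper `count_ngrams` and the toy sentences are assumptions for illustration; the appendix uses `NGramTokenizer` and a term-document matrix instead.

```r
# Sketch: count n-grams in base R. `count_ngrams` is a hypothetical helper.
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # slide a window of n words across the sentence
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(phrase = names(counts), count = as.integer(counts),
             ngram = n, stringsAsFactors = FALSE)
}

# Toy corpus (made up for this example)
toy <- c("thanks for the follow", "thanks for the help", "one of the best")
bigrams <- count_ngrams(toy, 2)
head(bigrams, 3)
```

Here n-grams never span two input strings, which is a simplification that treats each string as its own sentence.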
One way to summarize terms in a meaningful way is to look at the top-n terms for each n-gram. Below is a barplot for the most frequently used terms (top ten), by n-gram.
It is somewhat difficult to see the counts for the 3-gram terms so I also include a table below.
| phrase | count | ngram |
|---|---|---|
| the | 477860 | 1 |
| and | 238663 | 1 |
| for | 113042 | 1 |
| that | 104479 | 1 |
| you | 88208 | 1 |
| with | 73475 | 1 |
| was | 66980 | 1 |
| have | 54983 | 1 |
| this | 53876 | 1 |
| but | 49962 | 1 |
| in the | 43608 | 2 |
| of the | 41140 | 2 |
| for the | 22197 | 2 |
| to the | 21846 | 2 |
| on the | 18982 | 2 |
| to be | 15623 | 2 |
| at the | 15290 | 2 |
| and the | 13911 | 2 |
| in a | 12679 | 2 |
| is a | 10643 | 2 |
| a lot of | 3602 | 3 |
| thanks for the | 3160 | 3 |
| one of the | 2865 | 3 |
| going to be | 2016 | 3 |
| to be a | 1967 | 3 |
| looking forward to | 1588 | 3 |
| i love you | 1531 | 3 |
| as well as | 1486 | 3 |
| the fact that | 1431 | 3 |
| for the follow | 1355 | 3 |
One finding that surprised me relates to a task question that asked:
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
Answering the question directly (below), we see that for single words we reach the 50% mark fairly quickly, while it takes longer to reach the 90% mark.
## [1] "Unigram 50% mark reached at 313 words."
## [1] "Unigram 90% mark reached at 7674 words."
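The calculation behind these figures can be sketched as a cumulative share over frequency-sorted counts. The helper `words_needed` and the toy counts below are hypothetical and only show the mechanics.

```r
# Sketch: smallest number of top-ranked words covering a share of all tokens.
# `words_needed` is a hypothetical helper; `counts` are term frequencies.
words_needed <- function(counts, coverage) {
  sorted <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  unname(which(cum_share >= coverage)[1])
}

# Toy frequencies (made up): 100 tokens in total
toy_counts <- c(the = 50, and = 20, you = 15, that = 10, zebra = 5)
words_needed(toy_counts, 0.5)  # 1: "the" alone covers 50 of 100 tokens
words_needed(toy_counts, 0.9)  # 4: 50 + 20 + 15 + 10 = 95 of 100
```

This version returns the first rank at which the target share is reached, so it can differ by one from a `findInterval()`-based count, which returns the number of cumulative totals at or below the target.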
I expected this distribution, shown in the cumulative distribution plot below, to flatten out as we move to two- and three-word phrases, but I was surprised at how much more quickly single words reach each percentile.
Please note: this section is not written for the non-data-scientist I described at the beginning of the report, because I wouldn’t be looking to them for technical advice.
I do understand that there will be performance issues and there are tradeoffs in that:
For the shiny app, I’m contemplating three options:
If dynamic updating causes too much delay, I’m going to opt for one of the bottom two options.
With regard to NLP, I plan to use the largest n-gram I can get away with without taking a huge performance hit. For the algorithm, I plan to use a model based on relative n-gram frequencies, such as Stupid Backoff, but I can’t really comment until I see it in action. After reading a review of this capstone, I am obviously concerned with performance. It has also inspired me to look for some MOOCs on NLP to increase my knowledge.
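As a rough sketch of how Stupid Backoff scoring works: the score of a word given its context is the relative frequency of the full n-gram if it was observed, otherwise a fixed penalty (commonly 0.4) times the score under a shorter context. The function below is an illustration under those assumptions, not the implementation planned for the app; `stupid_backoff`, the toy counts, and `total` are all made up, and the results are scores rather than normalized probabilities.

```r
# Sketch of Stupid Backoff scoring over toy counts (all data here is made up).
# `counts` is a named numeric vector mapping n-gram strings to frequencies.
stupid_backoff <- function(word, context, counts, total_unigrams, alpha = 0.4) {
  get_count <- function(key) if (key %in% names(counts)) counts[[key]] else 0
  penalty <- 1
  repeat {
    if (length(context) == 0) {
      # Base case: relative unigram frequency
      return(penalty * get_count(word) / total_unigrams)
    }
    full <- get_count(paste(c(context, word), collapse = " "))
    if (full > 0) {
      # Observed n-gram: relative frequency given the context
      return(penalty * full / get_count(paste(context, collapse = " ")))
    }
    context <- context[-1]      # back off: drop the earliest context word
    penalty <- penalty * alpha  # apply the fixed backoff penalty
  }
}

toy_counts <- c("the" = 6, "follow" = 2, "help" = 2,
                "for the" = 4, "the follow" = 2, "the help" = 1,
                "for the follow" = 2)
total <- 10  # assumed total number of unigram tokens in the toy corpus

stupid_backoff("follow", c("for", "the"), toy_counts, total)  # 2 / 4 = 0.5
stupid_backoff("help",   c("for", "the"), toy_counts, total)  # 0.4 * 1 / 6
```

Because no normalization is needed, scoring is a handful of table lookups, which is relevant to the performance concern above.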
set.seed(1234)
library(dplyr)
library(tm)
library(RWeka)
library(ggplot2)
library(Matrix)
library(gridExtra)
library(pander)
library(stringi)
options(mc.cores=1) # NGramTokenizer generates an error in OS X without this option
profanity <- read.csv("google_twunter_lol.txt", header=FALSE, stringsAsFactors=FALSE)
profanity <- unlist(strsplit(profanity[,1], split=":1"))
# globals for file and path parts
ddir <- './final/en_US/'
prefix <- 'en_US.'
extension <- '.txt'
suffix <- '_reduced'
## create samples, report summary statistics for full files
count_things <- function(input) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  full_file <- readLines(input_file)
  word_stats <- stri_stats_latex(full_file)    # word counts
  line_stats <- stri_stats_general(full_file)  # line counts
  file_size <- round(file.info(input_file)$size / 1024^2, 2)  # size in MB
  stats <- data.frame("File" = input, "File Size" = paste(file_size, "MB"),
                      "Lines" = line_stats["Lines"], "Words" = word_stats["Words"])
  stats  # return the one-row summary explicitly
}
create_sampled_file <- function(input, proportion = 0.10) {
  input_file <- paste(ddir, prefix, input, extension, sep="")
  output_file <- paste(ddir, prefix, input, suffix, extension, sep="")
  full_file <- readLines(input_file)
  file_length <- length(full_file)
  sample_size <- floor(proportion * file_length)
  # draw distinct line indices uniformly at random (without replacement);
  # rbinom() would cluster the indices around the middle of the file
  full_file <- full_file[sample.int(file_length, size = sample_size)]
  write(full_file, file = output_file)
}
# I'll provide additional summaries (word counts, etc.) from the reduced data.
create_sampled_file('blogs')
create_sampled_file('news')
create_sampled_file('twitter')
counts <- (rbind(count_things('blogs'), count_things('news'), count_things('twitter')))
row.names(counts) <- NULL
pander(counts)
corpus <- Corpus(DirSource(ddir, pattern=suffix, encoding="UTF-8"),
                 readerControl = list(reader=readPlain,
                                      language="en",
                                      load=TRUE))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus,removeWords,profanity)
ngrammer <- function(lower, upper) {
  ngfun <- function(x) NGramTokenizer(x, Weka_control(min=lower, max=upper))
  ng <- TermDocumentMatrix(corpus, control = list(tokenize = ngfun))
  ng_df <- data.frame(phrase = ng$dimnames$Terms,
                      count = rowSums(sparseMatrix(i = ng$i, j=ng$j, x= ng$v)))
  ng_df <- arrange(ng_df, desc(count))
  ng_df <- mutate(ng_df, running=cumsum(count), cperc=running/max(running),
                  ngram=lower, record=1:nrow(ng_df))
  ng_df
}
gram1freq <- ngrammer(1,1)
gram2freq <- ngrammer(2,2)
gram3freq <- ngrammer(3,3)
all_grams <- rbind(gram1freq, gram2freq, gram3freq)
all_tops <- all_grams %>% group_by(ngram) %>% top_n(10, count)
all_tops$phrase <- factor(all_tops$phrase, levels=all_tops[order(all_tops$count, decreasing=TRUE),]$phrase)
ggplot(all_tops, aes(x=phrase, y=count/1000, order=phrase, fill=factor(ngram))) + geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=90)) + xlab("term") +
  ylab("count (in thousands)") + ggtitle("Top ten terms for 1, 2, and 3-grams") +
  facet_wrap(~ ngram, scales = "free_x") + scale_fill_discrete(name="n-gram")
pander(all_tops[,c(1:2,5)])
paste("Unigram 50% mark reached at", findInterval( max(gram1freq$running) * .5, gram1freq$running), "words.")
paste("Unigram 90% mark reached at", findInterval( max(gram1freq$running) * .9, gram1freq$running), "words.")
ggplot(all_grams, aes(x=record, y=cperc, color=factor(ngram))) + geom_line(size=2) +
  xlab("ordered term number") + ylab("percent of total terms (non-unique)") +
  ggtitle("Frequency sorted dictionary distribution") +
  scale_color_discrete(name="n-gram")