This analysis was originally prepared on March 25, 2015 for my Milestone Report in the Data Science Specialization Capstone on Coursera.
The capstone project will demonstrate this data scientist’s ability to process and analyze large volumes of unstructured text. As a final deliverable, the data scientist will develop an algorithm that predicts the next word in a provided text, similar to the predictive text functions found on today’s smartphones.
This report demonstrates the data scientist’s ability to successfully import the text data into R, provide basic summary statistics, and explain the planned steps for producing an algorithm for text prediction.
Three text files have been provided for machine learning:
- a collection of tweets
- a collection of blog entries
- a collection of news items
Each is loaded into R. Details for the R session and sourced functions are listed in the Appendix.
# files paths for each file are hidden
blog <- readLines(file.blogs, skipNul = TRUE)
twitter <- readLines(file.twitter, skipNul = TRUE)
news <- readLines(file.news, skipNul = TRUE)
A summary of the full files is provided prior to random sampling.
The text sources are put into a list and traversed to calculate length and word count.
# helper function to count number of words in a list element
f.word.count <- function(my.list) { sum(stringr::str_count(my.list, "\\S+")) }
# data frame to store counts
df <- data.frame(text.source = c("blog", "twitter", "news"), line.count = NA, word.count = NA)
# put the corpora (not yet of class Corpus) into a list
my.list <- list(blog = blog, twitter = twitter, news = news)
# get line count and word count for each corpus
df$line.count <- sapply(my.list, length)
df$word.count <- sapply(my.list, f.word.count)
# plot prep
g.line.count <- ggplot(df, aes(x = factor(text.source), y = line.count/1e+06))
g.line.count <- g.line.count + geom_bar(stat = "identity") +
labs(y = "# of lines/million", x = "text source", title = "Count of lines per Corpus")
# g.line.count
g.word.count <- ggplot(df, aes(x = factor(text.source), y = word.count/1e+06))
g.word.count <- g.word.count + geom_bar(stat = "identity") +
labs(y = "# of words/million", x = "text source", title = "Count of words per Corpus")
These plots show the number of entries (lines) and number of words per corpus (text source). Each corpus has at least 800,000 lines of text (entries, tweets, items) and at least 30 million words.
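The plot objects above are built but not printed in the code shown; here is a minimal sketch of one way to display them side by side, assuming the gridExtra package listed in the Appendix session information:
# arrange the two count plots side by side (sketch; assumes gridExtra is available)
gridExtra::grid.arrange(g.line.count, g.word.count, ncol = 2)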
This section shows the steps taken to return the most frequent words found in each corpus. The blog, news, and twitter corpora are prepared and explored individually.
Analyzing each corpus in its entirety is not necessary when valid results can be obtained through random sampling. Thus, prior to exploring word frequencies, a random sample is taken from each corpus.
# create a data frame for samples
sample.df <- data.frame(text.source = c("blog", "twitter", "news"),
line.count = NA, word.count = NA)
# for each corpus, draw a 0/1 indicator per line; roughly 5% of lines are flagged
set.seed(324)
percent <- 0.05
randoms <- lapply(my.list, function(x) rbinom(length(x), 1, percent))
# create a new, empty list to store random selections
sample.list <- list(blog = NA, twitter = NA, news = NA)
# traverse each element of the original list, keeping the ~5% of lines flagged by rbinom
for (i in 1:length(my.list)) {
sample.list[[i]] <- my.list[[i]][randoms[[i]] == 1]
}
# get counts of sample.list
sample.df$line.count <- sapply(sample.list, length)
sample.df$word.count <- sapply(sample.list, f.word.count)
Here are the counts for the sample set. Each sample contains about 5% of the lines in its original corpus.
## text.source line.count word.count
## 1 blog 45238 1881800
## 2 twitter 117859 1516498
## 3 news 50515 1726275
At this stage in the preliminary analysis, each text collection is converted to a Corpus object (from the tm package) and transformations are performed.
- For tweets only
  - hashtags (the # sign and the accompanying word) and Twitter handles (the @ sign and the accompanying word) are removed from the tweet corpus
- For all corpora
  - text is converted to lower case
  - URLs are removed
  - curse words are removed
  - numbers are removed
  - high-frequency words, such as “the”, “is”, “at”, etc. (collectively known as stop words), are removed
  - remaining punctuation is removed
These data cleansing steps are appropriate at this stage of preliminary analysis, but not all these steps will be used in the final preparation for use in natural language prediction. For example, stop words will be retained in the prediction algorithm, as the goal of the final deliverable is to mimic natural language as closely as possible.
### helper functions
removeURL <- function(x) gsub("http:[[:alnum:]]*", "", x)
removeHashTags <- function(x) gsub("#\\S+", "", x)
removeTwitterHandles <- function(x) gsub("@\\S+", "", x)
### create corpus class
text.corpus <- tm::Corpus(VectorSource(sample.list))
rm(sample.list)
# remove twitter handles and hashtags
text.corpus["twitter"] <- tm::tm_map(text.corpus["twitter"],
content_transformer(removeHashTags))
text.corpus["twitter"] <- tm::tm_map(text.corpus["twitter"],
content_transformer(removeTwitterHandles))
# other transformations
text.corpus <- tm::tm_map(text.corpus, content_transformer(tolower))
text.corpus <- tm::tm_map(text.corpus, removeNumbers)
# cursewords file loaded locally
text.corpus <- tm::tm_map(text.corpus, removeWords, cursewords)
text.corpus <- tm::tm_map(text.corpus, content_transformer(removeURL))
text.corpus <- tm::tm_map(text.corpus, removePunctuation)
text.corpus <- tm::tm_map(text.corpus, removeWords, stopwords("english"))
Next, each corpus is placed into its own term-document matrix, ready for further analysis. Words shorter than three characters are omitted.
## single tokenizers
twitterTdm <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(wordLengths = c(3,Inf)))
blogTdm <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(wordLengths = c(3,Inf)))
newsTdm <- tm::TermDocumentMatrix(text.corpus["news"], control = list(wordLengths = c(3,Inf)))
The corpora are now ready to be explored for distinct word counts and most frequent words.
## Distinct Words per Corpus
# put word count from term-document matrices into data frames
freq.news <- data.frame(word = newsTdm$dimnames$Terms, frequency = newsTdm$v)
freq.blog <- data.frame(word = blogTdm$dimnames$Terms, frequency = blogTdm$v)
freq.twitter <- data.frame(word = twitterTdm$dimnames$Terms, frequency = twitterTdm$v)
# reorder by decreasing frequency
freq.news <- plyr::arrange(freq.news, -frequency)
freq.blog <- plyr::arrange(freq.blog, -frequency)
freq.twitter <- plyr::arrange(freq.twitter, -frequency)
In the blog random sample (about 5%), there are 71,041 distinct words and 34,185 distinct words occurring two or more times.
In the news random sample (about 5%), there are 70,871 distinct words and 36,317 distinct words occurring two or more times.
In the twitter random sample (about 5%), there are 68,740 distinct words and 26,428 distinct words occurring two or more times.
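These figures can be read from the frequency data frames built above; here is a minimal sketch of how they might be derived (the exact statements used to produce the reported numbers are not shown here):
# distinct words = rows in the frequency table; repeated words = rows with frequency >= 2
nrow(freq.blog)                 # distinct words in the blog sample
sum(freq.blog$frequency >= 2)   # distinct words occurring two or more times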
n <- 25L # variable to set top n words
# isolate top n words by decreasing frequency
blog.top <- freq.blog[1:n, ]
news.top <- freq.news[1:n, ]
twitter.top <- freq.twitter[1:n, ]
# reorder levels so charts plot in order of frequency
blog.top$word <- reorder(blog.top$word, blog.top$frequency)
news.top$word <- reorder(news.top$word, news.top$frequency)
twitter.top$word <- reorder(twitter.top$word, twitter.top$frequency)
# plots
g.blog.top <- ggplot(blog.top, aes(x = word, y = frequency))
g.blog.top <- g.blog.top + geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent: Blog")
g.news.top <- ggplot(news.top, aes(x = word, y = frequency))
g.news.top <- g.news.top + geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent: News")
g.twitter.top <- ggplot(twitter.top, aes(x = word, y = frequency))
g.twitter.top <- g.twitter.top + geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent: Twitter")
These plots display the 25 most frequent terms in each corpus.
df.intersect <- data.frame(word = Reduce(intersect, list(blog.top$word, news.top$word, twitter.top$word)))
df.intersect <- plyr::arrange(df.intersect, word)
These 11 words, listed alphabetically, appear in all three top-25 lists.
## word
## 1 back
## 2 can
## 3 get
## 4 just
## 5 like
## 6 new
## 7 now
## 8 one
## 9 people
## 10 time
## 11 will
Moving forward, the project goal is to develop a natural language prediction algorithm and app. For example, if a user were to type, “I want to go to the …”, the app would suggest the three most likely words that would replace “…”.
While the word analysis performed in this document is helpful for initial exploration, the data analyst will need to construct a dictionary of bigrams, trigrams, and four-grams, collectively called n-grams. Bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases. Here is an example of trigrams from the randomly sampled twitter corpus. Recall that stop words have been removed, so the phrases may look choppy. In the final dictionary, stop words and phrases of any length will be retained.
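The trigram tokenizer used in the next chunk is not defined in the code shown; a minimal sketch of one way to define it, assuming the ngram_tokenizer helper listed in the Appendix (an RWeka-based alternative also appears there, commented out):
# define the trigram tokenizer used below (assumes ngram_tokenizer from the Appendix)
TrigramTokenizer <- ngram_tokenizer(3)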
# tokenize into tri-grams
trigram.twitterTdm <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(tokenize = TrigramTokenizer))
# put into data frame
freq.trigram.twitter <- data.frame(word = trigram.twitterTdm$dimnames$Terms, frequency = trigram.twitterTdm$v)
# reorder by descending frequency
freq.trigram.twitter <- plyr::arrange(freq.trigram.twitter, -frequency)
## word frequency
## 1 happy mothers day 183
## 2 cant wait see 145
## 3 let us know 94
## 4 happy new year 91
## 5 ha ha ha 55
## 6 cinco de mayo 53
## 7 im pretty sure 53
## 8 dont even know 47
## 9 cant wait till 45
## 10 love love love 39
Each n-gram will be split, separating the last word from the previous words in the n-gram.
- bigrams will become unigram/unigram pairs
- trigrams will become bigram/unigram pairs
- four-grams will become trigram/unigram pairs
For each pair, the three most frequent occurrences will be stored in the dictionary. Here are the three most frequent trigrams beginning with the bigram “cant wait” in the randomly sampled twitter corpus. These three trigrams would be split into bigram/unigram pairs and stored in the twitter dictionary. Dictionaries will be built for tweets, blogs, and news items.
## word frequency
## 2 cant wait see 145
## 9 cant wait till 45
## 11 cant wait hear 36
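As an illustration of the planned split, here is a minimal sketch that separates the final word of each trigram from its leading bigram and keeps the three most frequent continuations per bigram; the column names prefix and next.word and the object twitter.dict are hypothetical:
# split each trigram into a leading bigram (prefix) and a final word (next.word) -- sketch only
freq.trigram.twitter$prefix <- sub("\\s+\\S+$", "", freq.trigram.twitter$word)
freq.trigram.twitter$next.word <- sub("^.*\\s", "", freq.trigram.twitter$word)
# keep the three most frequent continuations for each bigram prefix
twitter.dict <- plyr::ddply(freq.trigram.twitter, "prefix",
                            function(d) head(plyr::arrange(d, -frequency), 3))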
After the dictionaries have been established, an app will be developed that allows the user to enter text and to declare whether the text is meant for a tweet, a blog, or a news item. The app will suggest the three most likely words to come next for that text type, based on these rules:
1. the last three words of the entered text are matched against the trigram side of the trigram/unigram pairs
2. failing that, the last two words are matched against the bigram side of the bigram/unigram pairs
3. failing that, the last word is matched against the first unigram of the unigram/unigram pairs
The app will then suggest the three most frequent unigrams from the n-gram/unigram pairs matched in 1, 2, or 3 above.
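A minimal sketch of such a lookup, assuming a hypothetical dictionary data frame dict with columns prefix, next.word, and frequency (as in the splitting sketch above):
# suggest up to three next words for the trailing words of the user's text -- sketch only
suggest.next <- function(user.text, dict, n.words = 2) {
    tokens <- unlist(strsplit(tolower(user.text), "\\s+"))
    prefix <- paste(tail(tokens, n.words), collapse = " ")
    matches <- dict[dict$prefix == prefix, ]
    head(as.character(plyr::arrange(matches, -frequency)$next.word), 3)
}
# hypothetical usage: suggest.next("I cant wait", twitter.dict)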
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RWeka_0.4-24 plyr_1.8.1 gridExtra_0.9.1 ggplot2_1.0.1
## [5] stringr_0.6.2 tm_0.6-1 NLP_0.1-7
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5
## [4] formatR_1.0 gtable_0.1.2 htmltools_0.2.6
## [7] knitr_1.9 labeling_0.3 MASS_7.3-33
## [10] munsell_0.4.2 parallel_3.1.1 proto_0.3-10
## [13] Rcpp_0.11.5 reshape2_1.4.1 rJava_0.9-6
## [16] rmarkdown_0.7 RWekajars_3.7.12-1 scales_0.2.4
## [19] slam_0.1-32 tools_3.1.1
Some of these functions may not have been used.
#### [ngramTokenizer] functions
# BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# FourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
#' Ngrams tokenizer
#' @param n integer
#' @return n-gram tokenizer function
ngram_tokenizer <- function(n = 1L, skip_word_none = TRUE) {
    stopifnot(is.numeric(n), is.finite(n), n > 0)
    options <- stringi::stri_opts_brkiter(type = "word", skip_word_none = skip_word_none)
    function(x) {
        stopifnot(is.character(x))
        # split into word tokens
        tokens <- unlist(stringi::stri_split_boundaries(x, opts_brkiter = options))
        len <- length(tokens)
        if (all(is.na(tokens)) || len < n) {
            # no words detected, or fewer tokens than n: return an empty vector
            character(0)
        } else {
            # join each run of n consecutive tokens into a single n-gram string
            sapply(
                1:max(1, len - n + 1),
                function(i) stringi::stri_join(tokens[i:min(len, i + n - 1)], collapse = " ")
            )
        }
    }
}
#### ngram_tokenizer example
# illustrative only: sample.list is removed with rm() earlier in this report
x <- ngram_tokenizer(4)(sample.list$blog)