Below, I will create a corpus from a source of Twitter updates (i.e. “tweets”). I will then clean and tokenize this corpus, and perform some very basic exploratory n-gram analysis on it. Ultimately, within about a month’s time, I will use these insights to help develop a predictive model for text suggestion/selection from user input in a Shiny application.
Please note: I may later exercise my option to integrate blog posts and news posts into this corpus, for more versatile (and potentially more accurate) predictive applications. However, at this milestone, I will only use the source text originally requested of us.
The source dataset is HUGE. Therefore, I’m just going to sample 10,000 lines from it. Please note: I’m purposefully suppressing the warnings from a rarely occurring file encoding error, because they’re effectively useless. readLines() still returns the lines it reads, and sample() then draws the requested 10,000 from them.
## Load necessary libraries. Note: I'm suppressing the startup messages, because some of them
## are large and distracting!
library(tm)
library(NLP)
library(openNLP)
library(qdap)
library(RWeka)
library(dplyr)
library(ggplot2)
library(gridExtra)
## Open connection to file. Only using the Twitter text for now, per the assignment
## instructions. I may expand my corpus later.
tw_con <- file("data/SwiftKey/en_US/en_US.twitter.txt", "r")
## Sample 10,000 records from the original and then close the connection.
## I had to suppress warnings, because the warnings about the occasional encoding
## error were, in effect, useless. readLines() still returns the lines it reads,
## and sample() draws 10,000 of them.
options(warn=-1)
unclean_tweets <- sample(readLines(tw_con), 10000)
options(warn=0)
close(tw_con)
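The sampling step above is not reproducible as written, since every run draws a different 10,000 tweets. Here is a minimal sketch of one way to pin it down, assuming a fixed seed is acceptable. The helper name `sample_lines`, the seed value, and the demo file are hypothetical and not part of the analysis in this report; `skipNul` and `encoding` are standard `readLines()` arguments.

```r
## Hypothetical helper (not used for the results in this report):
## draw a reproducible random sample of lines from a text file.
sample_lines <- function(path, n, seed = 1234) {
  con <- file(path, "r")
  on.exit(close(con))                 # close the connection even on error
  ## suppressWarnings() silences warnings from this one call only,
  ## instead of switching them off globally with options(warn = -1).
  all_lines <- suppressWarnings(
    readLines(con, encoding = "UTF-8", skipNul = TRUE)
  )
  set.seed(seed)                      # same seed -> same sample
  sample(all_lines, min(n, length(all_lines)))
}

## Usage on a small throwaway file, so the sketch runs on its own.
demo_file <- tempfile(fileext = ".txt")
writeLines(paste("tweet", 1:100), demo_file)
demo_sample <- sample_lines(demo_file, 10)
length(demo_sample)
```

Calling `sample_lines()` twice with the same seed returns the same lines, which makes later frequency tables comparable between runs.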
Here, we just double-check the sample size and preview its contents. We also save it, just in case we need it later. Not sure we will, but hey, this ain’t the final product!
## Preview and save unclean tweets, because we might be able to use it later.
print(paste(length(unclean_tweets), "tweets"))
## [1] "10000 tweets"
print(head(unclean_tweets))
## [1] "Really enjoying Washington Ballet's Rock & Roll. At Harman Hall through this weekend."
## [2] "\"You can set yourself up to be sick, or you can choose to stay well.\" - Wayne Dyer - Have a great day from your friends at Hieber's Pharmacy"
## [3] "What's up? This is your biggest fan, Antoine. You are one of my favorite pornstars. So sexy. I hope we meet soon..Muah ;-)"
## [4] "Poor squirrel never made it across the street alive #"
## [5] "See you on the other side Uncle Zay - if I'm fortunate enough. I miss you already."
## [6] "Sleepy, headache, backache, perhaps from an alcohol- and fun-filled time with , proving that \"grown-ups\" can, in fact, party"
write.table(unclean_tweets, file = "data/unclean_tweets.txt", row.names = FALSE)
Thus far, our tweets are “dirty.” They are full of inconsistent capitalization, stray whitespace, numbers, and punctuation, none of which helps us. Let’s address these issues via the very nifty “tm” (text mining) package.
## Detect sentences before converting to a corpus using TM's VCorpus. Then, clean sample:
## lowercase all, remove whitespaces, remove numbers, remove special characters,
## remove profanity (list used: https://gist.github.com/jamiew/1112488)
unclean_tweets <- sent_detect(unclean_tweets, language = "en", model = NULL)
corpus_tw <- VCorpus(VectorSource(unclean_tweets))
corpus_tw <- tm_map(corpus_tw, tolower)
corpus_tw <- tm_map(corpus_tw, stripWhitespace)
corpus_tw <- tm_map(corpus_tw, removeNumbers)
corpus_tw <- tm_map(corpus_tw, removePunctuation)
corpus_tw <- tm_map(corpus_tw, removeWords, as.vector(readLines("profanity.txt")))
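To make it clearer what each cleaning step does to a single tweet, here is a rough base-R stand-in for the pipeline above. It is a sketch for illustration only, not what tm does internally: `clean_text` and its `banned` argument are hypothetical names, and the banned list is assumed to hold plain words without regex special characters. Unlike the tm calls above, this sketch collapses whitespace last, so gaps left behind by removed words and symbols do not survive.

```r
## Hypothetical base-R stand-in for the tm cleaning steps (illustration only).
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)                          # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)         # drop numbers
  x <- gsub("[[:punct:]]+", "", x)         # drop punctuation
  if (length(banned) > 0) {
    ## Remove whole banned words only (\\b marks a word boundary).
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)                                # trim leading/trailing spaces
}

clean_text("Really  enjoying Washington Ballet's Rock & Roll 2!", banned = "rock")
## [1] "really enjoying washington ballets roll"
```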
As before, let’s preview and save this. Unfortunately, this time, since a corpus is a set of ‘documents’, we’ll first need to convert this into a data frame.
## Convert to DF, preview, and save for potential later use.
clean_corp <- data.frame(text = unlist(corpus_tw), stringsAsFactors = FALSE)
row.names(clean_corp) <- NULL
print(head(clean_corp))
## text
## 1 really enjoying washington ballets rock roll
## 2 at harman hall through this weekend
## 3 you can set yourself up to be sick or you can choose to stay well
## 4 wayne dyer have a great day from your friends at hiebers pharmacy whats up
## 5 this is your biggest fan antoine
## 6 you are one of my favorite pornstars
write.table(clean_corp, file = "data/corpus_tweets.txt", row.names = FALSE)
Here, we’re going to use the RWeka package to neatly extract our n-grams (single words, bi-grams, and tri-grams). Then, we’ll use table() and dplyr to count and arrange their frequencies, and print summary statistics for each. And finally, we’ll plot the most frequent entries for each n-gram size.
## Tokenize using RWeka.
singles <- NGramTokenizer(clean_corp, Weka_control(min = 1, max = 1))
bigrams <- NGramTokenizer(clean_corp, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigrams <- NGramTokenizer(clean_corp, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
## Get frequency counts and prep for plotting.
s_freq <- data.frame(table(singles))
s_freq <- arrange(s_freq, desc(Freq))
names(s_freq)[1] <- "text"
bi_freq <- data.frame(table(bigrams))
bi_freq <- arrange(bi_freq, desc(Freq))
names(bi_freq)[1] <- "text"
tri_freq <- data.frame(table(trigrams))
tri_freq <- arrange(tri_freq, desc(Freq))
names(tri_freq)[1] <- "text"
## Print top freqs and summaries.
print(head(s_freq))
## text Freq
## 1 the 811
## 2 to 677
## 3 i 667
## 4 you 490
## 5 a 484
## 6 and 395
print(head(bi_freq))
## text Freq
## 1 in the 85
## 2 for the 69
## 3 on the 46
## 4 to be 43
## 5 thanks for 37
## 6 will be 36
print(head(tri_freq))
## text Freq
## 1 thanks for the 22
## 2 i need to 12
## 3 i want to 11
## 4 is going to 11
## 5 looking forward to 11
## 6 for the follow 9
print(summary(s_freq))
## text Freq
## a : 1 Min. : 1.000
## ? : 1 1st Qu.: 1.000
## ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>: 1 Median : 1.000
## ?<U+0093>nobody : 1 Mean : 4.589
## ?<U+0094> : 1 3rd Qu.: 2.000
## ?<U+0080><U+0093> : 1 Max. :811.000
## (Other) :5703
print(summary(bi_freq))
## text Freq
## ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> and: 1 Min. : 1.0
## ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> the: 1 1st Qu.: 1.0
## a all : 1 Median : 1.0
## a and : 1 Mean : 1.3
## a animal : 1 3rd Qu.: 1.0
## a anymore : 1 Max. :85.0
## (Other) :20138
print(summary(tri_freq))
## text Freq
## ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> and you : 1 Min. : 1.000
## ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> the government: 1 1st Qu.: 1.000
## a all the : 1 Median : 1.000
## a and the : 1 Mean : 1.032
## a animal u : 1 3rd Qu.: 1.000
## a anymore kobe : 1 Max. :22.000
## (Other) :25374
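The tokenize-then-count step above can also be sketched in plain base R, which helps show what an n-gram table actually contains. This is a minimal sketch and not RWeka's implementation: `make_ngrams` and `count_ngrams` are hypothetical helpers, the toy sentences are made up, and the sketch only splits on whitespace, so it assumes text that has already been cleaned.

```r
## Hypothetical helper: build all n-grams from one cleaned sentence.
make_ngrams <- function(sentence, n) {
  tokens <- strsplit(sentence, "[[:space:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

## Hypothetical helper: count n-grams over many sentences,
## most frequent first, in the same text/Freq layout used above.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, make_ngrams, n = n))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(text = names(freq), Freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

## Toy input (made up for this sketch).
toy <- c("thanks for the follow", "thanks for the love", "i need to sleep")
make_ngrams("i need to sleep", 3)
## [1] "i need to"     "need to sleep"
head(count_ngrams(toy, 2), 3)
```

Because each sentence is tokenized separately, this sketch never builds an n-gram that straddles two sentences.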
Below, we finally plot the top 10 single words, bi-grams, and tri-grams in our corpus. We can already see some interesting patterns (e.g. many words are reused moving up the n-grams).
## Let's plot these bad boys, starting with single words.
s_freq <- head(s_freq, 10)
s_freq$text <- factor(s_freq$text, levels = s_freq$text)
plot_s <- ggplot(s_freq, aes(x = text, y = Freq)) # refer to columns by name inside aes()
plot_s <- plot_s + geom_bar(stat = "identity", fill="#cc0000") + coord_flip() +
labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))
## Plot bigrams.
bi_freq <- head(bi_freq, 10)
bi_freq$text <- factor(bi_freq$text, levels = bi_freq$text)
plot_bi <- ggplot(bi_freq, aes(x = text, y = Freq)) # refer to columns by name inside aes()
plot_bi <- plot_bi + geom_bar(stat = "identity", fill="dodgerblue") + coord_flip() +
labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))
## Plot trigrams.
tri_freq <- head(tri_freq, 10)
tri_freq$text <- factor(tri_freq$text, levels = tri_freq$text)
plot_tri <- ggplot(tri_freq, aes(x = text, y = Freq)) # refer to columns by name inside aes()
plot_tri <- plot_tri + geom_bar(stat = "identity", fill="forestgreen") + coord_flip() +
labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))
grid.arrange(plot_s, plot_bi, plot_tri, ncol = 3,
top = "N-Gram Exploration:\nTop Text for Single, Bi-Grams, and Tri-Grams")
Given how dirty and unnatural tweets appear to be, I’m now sure I’ll need to incorporate our other text sources (e.g. blog posts) if I hope to improve my model’s accuracy. Nonetheless, given the above (almost nested) patterns, it does look like some form of n-gram + smoothing model would work well. I just don’t know much about “backoff models” yet.
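As a rough illustration of the backoff idea, the sketch below looks for the longest matching context first and falls back to shorter ones when nothing matches. It is a minimal sketch under stated assumptions, not the model this project will end up using: `predict_next`, `freq_tables`, and `toy_tables` are hypothetical names, the toy tables only reuse a few counts printed earlier, and no smoothing or discounting is applied (proper schemes such as Katz backoff also adjust the probabilities, which this does not attempt).

```r
## Hypothetical sketch of a simple backoff lookup (no smoothing, no scores).
## `freq_tables` is an assumed structure: element n is a data frame of
## n-grams with columns `text` and `Freq`, like the tables built above.
predict_next <- function(input, freq_tables) {
  words <- strsplit(tolower(trimws(input)), "[[:space:]]+")[[1]]
  words <- words[words != ""]
  max_n <- length(freq_tables)
  ## Try the longest usable context first, then back off to shorter ones.
  for (n in seq(min(max_n, length(words) + 1), 1)) {
    tab <- freq_tables[[n]]
    texts <- as.character(tab$text)      # text may be stored as a factor
    if (n == 1) {
      ## No context left: fall back to the most frequent single word.
      return(texts[which.max(tab$Freq)])
    }
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    is_hit <- startsWith(texts, prefix)
    if (any(is_hit)) {
      best <- texts[is_hit][which.max(tab$Freq[is_hit])]
      ## The prediction is the last word of the best-matching n-gram.
      return(substring(best, nchar(prefix) + 1))
    }
  }
}

## Toy tables (a handful of rows only, for illustration).
toy_tables <- list(
  data.frame(text = c("the", "to", "i"), Freq = c(811, 677, 667)),
  data.frame(text = c("for the", "to be"), Freq = c(69, 43)),
  data.frame(text = c("thanks for the", "for the follow"), Freq = c(22, 9))
)
predict_next("Thanks for", toy_tables)   # found in the tri-gram table
## [1] "the"
predict_next("going to", toy_tables)     # backs off to the bi-gram table
## [1] "be"
predict_next("hello", toy_tables)        # backs off to the top single word
## [1] "the"
```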
As for the Shiny application, I believe I’ll either take in some simple text and try to predict the remainder of the user’s sentence in real time (i.e. reactively), or ask the user to first select what kind of application they’ll be entering the text into (e.g. Twitter), and then draw from a specialized corpus (e.g. tweets for a tweet, blog posts for a blog sentence, etc).
Finally, and sadly, given how long the above analysis took to compute, I may need to turn to another package for data/n-gram mining. The tm package seems a bit too slow for practical applications, especially on the free edition of Shiny.
Thank you for your time and consideration!