Milestone Report: Capstone

Below, I create a corpus from a source of Twitter updates (i.e. “tweets”). I then clean and tokenize it, and perform some very basic n-gram exploratory analysis on it. Within about a month’s time, I will use these insights to help develop a predictive model for text suggestion from user input in a Shiny application.

Please note: I may later integrate blog posts and news posts into this corpus, for more versatile (and potentially more accurate) predictive applications. At this milestone, however, I only use the source text originally requested of us.

Sample the Source File

The source dataset is HUGE, so I’m just going to sample 10,000 lines from it. Please note: I’m purposefully suppressing the encoding warnings that readLines() occasionally raises on this file, because they’re effectively useless here. readLines() still reads the whole file, and sample() then draws 10,000 lines from it at random (without replacement).

## Load necessary libraries. Note: I'm suppressing the startup messages, because some of them
## are large and distracting!
library(tm)
library(NLP)
library(openNLP)
library(qdap)
library(RWeka)
library(dplyr)
library(ggplot2)
library(gridExtra)

## Open connection to file. Only using the Twitter text for now, per the assignment
## instructions. I may expand my corpus later.
tw_con <- file("data/SwiftKey/en_US/en_US.twitter.txt", "r")

## Sample 10,000 records from the original and then close the connection.
## I had to suppress warnings, because the warnings about the occasional encoding
## problem were, in effect, useless. readLines() still reads every line, and
## sample() then draws 10,000 of them at random.
unclean_tweets <- sample(suppressWarnings(readLines(tw_con)), 10000)
close(tw_con)
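
As a side note, here is a minimal sketch of how the sampling step could be made reproducible. It is not the code that produced the sample in this report. The temp file and the names `demo_path`, `all_lines`, and `demo_sample` are hypothetical stand-ins so the sketch runs on its own, and the seed value is arbitrary. It also assumes the warnings come from embedded nul characters, which `skipNul = TRUE` drops quietly; if they have another cause, that argument won't help.

```r
## Hypothetical stand-in for the Twitter file, so the sketch runs on its own.
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("tweet number", 1:500), demo_path)

## Fix the random seed so the same sample comes back on every run.
## The seed value (1234) is arbitrary.
set.seed(1234)

## Assumption: the warnings are about embedded nul characters.
## skipNul = TRUE skips them instead of warning about each one.
all_lines <- readLines(demo_path, encoding = "UTF-8", skipNul = TRUE)

## Draw 100 lines at random, without replacement.
demo_sample <- sample(all_lines, 100)
length(demo_sample)
```

With a fixed seed, rerunning the report gives the same tweets, which makes the frequency tables further down comparable between runs.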

Preview the Sample and Save

Here, we just double-check the sample size and preview its contents. We also save it, just in case we need it later. Not sure we will, but hey, this ain’t the final product!

## Preview and save unclean tweets, because we might be able to use it later.
print(paste(length(unclean_tweets), "tweets"))

## [1] "10000 tweets"

print(head(unclean_tweets))

## [1] "Really enjoying Washington Ballet's Rock & Roll. At Harman Hall through this weekend."                                                         
## [2] "\"You can set yourself up to be sick, or you can choose to stay well.\" - Wayne Dyer - Have a great day from your friends at Hieber's Pharmacy"
## [3] "What's up? This is your biggest fan, Antoine. You are one of my favorite pornstars. So sexy. I hope we meet soon..Muah ;-)"                    
## [4] "Poor squirrel never made it across the street alive #"                                                                                         
## [5] "See you on the other side Uncle Zay - if I'm fortunate enough. I miss you already."                                                            
## [6] "Sleepy, headache, backache, perhaps from an alcohol- and fun-filled time with , proving that \"grown-ups\" can, in fact, party"

write.table(unclean_tweets, file = "data/unclean_tweets.txt", row.names = FALSE)

Creating a Fairly “Clean” Corpus

Thus far, our tweets are “dirty.” They are full of inconsistent capitalization, extra whitespace, numbers, and punctuation. Let’s address these issues via the very nifty “tm” (text mining) package.

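## Detect sentences before converting to a corpus using tm's VCorpus. Then, clean the sample:
## lowercase all, strip extra whitespace, remove numbers, remove punctuation,
## remove profanity (list used: https://gist.github.com/jamiew/1112488).
## Note: with tm 0.6 or later, base functions such as tolower need to be wrapped,
## i.e. tm_map(corpus_tw, content_transformer(tolower)).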
unclean_tweets <- sent_detect(unclean_tweets, language = "en", model = NULL)
corpus_tw <- VCorpus(VectorSource(unclean_tweets))
corpus_tw <- tm_map(corpus_tw, tolower)
corpus_tw <- tm_map(corpus_tw, stripWhitespace)
corpus_tw <- tm_map(corpus_tw, removeNumbers)
corpus_tw <- tm_map(corpus_tw, removePunctuation)
corpus_tw <- tm_map(corpus_tw, removeWords, readLines("profanity.txt"))
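
To make it clearer what each cleaning step does, here is a rough base R sketch that works on a plain character vector. It is only an illustration, not what tm does internally. The names `clean_text` and `banned` are hypothetical, and the banned words are assumed to be plain words with no regex special characters. One deliberate difference: whitespace is collapsed last, so removing punctuation can't leave double spaces behind (like the one in “rock  roll” in the preview in the next section).

```r
## Base R sketch of the cleaning steps. `clean_text` and `banned` are
## hypothetical names, not part of tm.
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)                       # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)      # remove numbers
  x <- gsub("[[:punct:]]+", "", x)      # remove punctuation
  if (length(banned) > 0) {
    # Remove whole banned words only, not pieces of longer words.
    # Assumes the banned words contain no regex special characters.
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("[[:space:]]+", " ", x)     # collapse runs of whitespace
  trimws(x)                             # drop leading/trailing spaces
}

clean_text("Really enjoying Washington Ballet's Rock & Roll in 2015.")
clean_text("Well, darn it! 2 darned bad.", banned = "darn")
```

The second call shows why the word boundaries matter: “darn” is removed, but “darned” is left alone.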

Preview the Corpus and Save

As before, let’s preview and save this. Unfortunately, this time, since a corpus is a set of ‘documents’, we’ll first need to convert this into a data frame.

## Convert to DF, preview, and save for potential later use.
clean_corp <- data.frame(text = unlist(corpus_tw), stringsAsFactors = FALSE)
row.names(clean_corp) <- NULL
print(head(clean_corp))

##                                                                           text
## 1                                really enjoying washington ballets rock  roll
## 2                                          at harman hall through this weekend
## 3            you can set yourself up to be sick or you can choose to stay well
## 4  wayne dyer  have a great day from your friends at hiebers pharmacy whats up
## 5                                             this is your biggest fan antoine
## 6                                         you are one of my favorite pornstars

write.table(clean_corp, file = "data/corpus_tweets.txt", row.names = FALSE)

N-Gram Analysis

Here, we’re going to use the RWeka package to neatly extract our n-grams (single words, bi-grams, and tri-grams). Then, we’ll use table() and dplyr’s arrange() to count and sort frequencies, and print their summary statistics. Finally, we’ll plot the top n-grams by frequency.

## Tokenize using RWeka. 
singles <- NGramTokenizer(clean_corp, Weka_control(min = 1, max = 1))
bigrams <- NGramTokenizer(clean_corp, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigrams <- NGramTokenizer(clean_corp, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))

## Get frequency counts and prep for plotting.
s_freq <- data.frame(table(singles))
s_freq <- arrange(s_freq, desc(Freq))
names(s_freq)[1] <- "text"

bi_freq <- data.frame(table(bigrams))
bi_freq <- arrange(bi_freq, desc(Freq))
names(bi_freq)[1] <- "text"

tri_freq <- data.frame(table(trigrams))
tri_freq <- arrange(tri_freq, desc(Freq))
names(tri_freq)[1] <- "text"

## Print top freqs and summaries.
print(head(s_freq))

##   text Freq
## 1  the  811
## 2   to  677
## 3    i  667
## 4  you  490
## 5    a  484
## 6  and  395

print(head(bi_freq))

##         text Freq
## 1     in the   85
## 2    for the   69
## 3     on the   46
## 4      to be   43
## 5 thanks for   37
## 6    will be   36

print(head(tri_freq))

##                 text Freq
## 1     thanks for the   22
## 2          i need to   12
## 3          i want to   11
## 4        is going to   11
## 5 looking forward to   11
## 6     for the follow    9

print(summary(s_freq))

##            text           Freq        
##  a           :   1   Min.   :  1.000  
##  ?           :   1   1st Qu.:  1.000  
##  ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>:   1   Median :  1.000  
##  ?<U+0093>nobody    :   1   Mean   :  4.589  
##  ?<U+0094>          :   1   3rd Qu.:  2.000  
##  ?<U+0080><U+0093>         :   1   Max.   :811.000  
##  (Other)     :5703

print(summary(bi_freq))

##                text            Freq     
##  ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> and:    1   Min.   : 1.0  
##  ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> the:    1   1st Qu.: 1.0  
##  a all           :    1   Median : 1.0  
##  a and           :    1   Mean   : 1.3  
##  a animal        :    1   3rd Qu.: 1.0  
##  a anymore       :    1   Max.   :85.0  
##  (Other)         :20138

print(summary(tri_freq))

##                           text            Freq       
##  ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> and you       :    1   Min.   : 1.000  
##  ?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088>?<U+0096><U+0088> the government:    1   1st Qu.: 1.000  
##  a all the                  :    1   Median : 1.000  
##  a and the                  :    1   Mean   : 1.032  
##  a animal u                 :    1   3rd Qu.: 1.000  
##  a anymore kobe             :    1   Max.   :22.000  
##  (Other)                    :25374
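
For anyone curious what the tokenize-and-count steps in this section boil down to, here is a small base R sketch. It is a stand-in for illustration, not RWeka's implementation. `make_ngrams`, `ngram_freq`, and the `demo` sentences are hypothetical. It also differs from the run above in one respect: it builds n-grams one sentence at a time, so no n-gram spans two sentences.

```r
## `make_ngrams` and `ngram_freq` are hypothetical helper names.

## Split one sentence on whitespace and paste together every run of n words.
make_ngrams <- function(sentence, n) {
  tokens <- strsplit(sentence, "[[:space:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

## Count n-grams over all sentences, most frequent first, in the same
## text/Freq layout used in this report.
ngram_freq <- function(sentences, n) {
  grams <- unlist(lapply(sentences, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(text = names(counts), Freq = as.vector(counts),
             stringsAsFactors = FALSE)
}

## Made-up example sentences, already cleaned.
demo <- c("thanks for the follow", "thanks for the help", "i want to go")
head(ngram_freq(demo, 2), 3)
make_ngrams("i want to go", 3)
```

The same two functions cover single words (n = 1), bi-grams (n = 2), and tri-grams (n = 3).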

Exploratory Plots

Below, we finally plot the top 10 single words, bi-grams, and tri-grams in our corpus. We can already see some interesting patterns: many sequences are reused moving up the n-grams (e.g. the bi-gram “thanks for” reappears inside the top tri-gram “thanks for the”).

## Let's plot these bad boys, starting with single words.
s_freq <- head(s_freq, 10)
s_freq$text <- factor(s_freq$text, levels = s_freq$text)
plot_s <- ggplot(s_freq, aes(x = text, y = Freq))
plot_s <- plot_s + geom_bar(stat = "identity", fill="#cc0000") + coord_flip() + 
    labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))

## Plot bigrams.
bi_freq <- head(bi_freq, 10)
bi_freq$text <- factor(bi_freq$text, levels = bi_freq$text)
plot_bi <- ggplot(bi_freq, aes(x = text, y = Freq))
plot_bi <- plot_bi + geom_bar(stat = "identity", fill="dodgerblue") + coord_flip() + 
    labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))

## Plot trigrams.
tri_freq <- head(tri_freq, 10)
tri_freq$text <- factor(tri_freq$text, levels = tri_freq$text)
plot_tri <- ggplot(tri_freq, aes(x = text, y = Freq))
plot_tri <- plot_tri + geom_bar(stat = "identity", fill="forestgreen") + coord_flip() + 
    labs(y = "Frequency", x = "Text") # + theme(text = element_text(size=7))

grid.arrange(plot_s, plot_bi, plot_tri, ncol = 3,
             top = "N-Gram Exploration:\nTop Text for Single, Bi-Grams, and Tri-Grams")

[Figure: three side-by-side horizontal bar charts of the top 10 single words, bi-grams, and tri-grams by frequency]

What’s Next? Thoughts on Prediction and Shiny.

Given how dirty and unnatural tweets appear to be, I’m now sure I’ll need to incorporate our other text sources (e.g. blog posts) if I hope to improve my model’s accuracy. Nonetheless, given the almost nested patterns above, it does look like some form of n-gram model with smoothing would work well. I just don’t know much about “backoff models” yet.
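
As a first rough idea of what backing off could look like, here is a minimal sketch: try the last two words against the tri-grams, fall back to the last word against the bi-grams, and finally fall back to the most common single word. This is not the model I will end up with, and it does no smoothing or discounting at all. The function names `best_match` and `predict_next` are hypothetical, and the toy tables hold just a handful of rows copied from the frequency tables printed earlier.

```r
## Toy frequency tables: a few rows copied from the tables printed earlier.
tri <- data.frame(text = c("thanks for the", "looking forward to"),
                  Freq = c(22, 11), stringsAsFactors = FALSE)
bi  <- data.frame(text = c("in the", "for the", "to be"),
                  Freq = c(85, 69, 43), stringsAsFactors = FALSE)
uni <- data.frame(text = c("the", "to", "i"),
                  Freq = c(811, 677, 667), stringsAsFactors = FALSE)

## Hypothetical helper: last word of the most frequent n-gram in `tbl`
## that starts with `context`, or NA if nothing matches.
best_match <- function(context, tbl) {
  hits <- tbl[startsWith(tbl$text, paste0(context, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$text[which.max(hits$Freq)]
  sub(".* ", "", top)  # keep only the final word
}

## Hypothetical predictor: longest context first, then back off.
predict_next <- function(input, tri, bi, uni) {
  words <- strsplit(tolower(trimws(input)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  guess <- NA_character_
  if (n >= 2) guess <- best_match(paste(words[(n - 1):n], collapse = " "), tri)
  if (is.na(guess) && n >= 1) guess <- best_match(words[n], bi)
  if (is.na(guess)) guess <- uni$text[which.max(uni$Freq)]
  guess
}

predict_next("Thanks for", tri, bi, uni)   # matched by a tri-gram
predict_next("want to", tri, bi, uni)      # backs off to the bi-grams
predict_next("hello world", tri, bi, uni)  # backs off to single words
```

A real model would weight the shorter contexts instead of simply taking the first hit, but the lookup order would stay the same.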

As for the Shiny application, I believe I’ll either take in some simple text and try to predict the remainder of the user’s sentence in real time (i.e. reactively), or ask the user to first select what kind of application they’ll be entering the text into (e.g. Twitter), after which I’ll draw from a specialized corpus (e.g. tweets for a tweet, blog posts for a blog sentence, etc.).

Finally, and sadly, given how long the above analysis took to compute, I may need to turn to another package for data/n-gram mining. The tm package seems a bit too slow for practical applications, especially for the free edition of Shiny.

Thank you for your time and consideration!