Introduction

This report contains the initial analysis of Twitter, Blog and News feeds for the Data Science Specialization Capstone project. The goal is to examine these three user feeds to understand their word usage and patterns.

This analysis will inform the upcoming creation of a prediction model for 2-3 word phrases, termed N-grams. The ultimate intent of this analysis is to speed up user message composition.

Data

This analysis uses three text sources: Tweets, Blogs and News feeds. Specifically, this study uses data from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. These sources contain various non-word issues, including embedded nulls and non-word characters (e.g. “#” and emoticons). To use these data sources, embedded nulls are skipped when reading the dataset. Furthermore, the quanteda text analysis R package is used to remove punctuation and Twitter-specific characters (e.g. “#” and emoticons).
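As a minimal sketch of these cleaning steps (assuming the quanteda package and the file paths used in the Appendix), nulls are skipped at read time and punctuation and symbols are removed at tokenization:

library(quanteda)
# Read one feed, skipping embedded nulls that would otherwise interrupt readLines()
twitter.raw <- readLines("./Data/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8", warn = FALSE)
# Tokenize, dropping punctuation (e.g. "#") and symbols such as emoticons
twitter.tokens <- tokens(corpus(twitter.raw), remove_punct = TRUE, remove_symbols = TRUE)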

Basic Document Metrics

The Twitter, Blog and News feed documents are functionally composed of words and lines. The following graphs show the counts of words and lines for each feed. It is proposed that these data sets are large enough to provide reasonable statistics for building a predictive model in the future.
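For reference, here is a minimal sketch of how these counts are obtained (the full code, including the plots, is in the Appendix; object names follow the Appendix):

# Line count: read the file and count its lines, skipping embedded nulls
twitter.lines <- length(readLines("./Data/en_US.twitter.txt", skipNul = TRUE))
# Word count: total tokens in the corpus built in the Appendix
twitter.words <- sum(ntoken(twitter.Corpus))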

Document Word Frequency

When building prediction models, it is important to address the customer’s needs: an acceptable hit rate, responses faster than the user can type, and a footprint small enough to fit within a device’s memory. There is an obvious trade-off between these variables. For instance, a 100% hit rate may be too slow and take up too much memory. Therefore, analysis of word frequencies can be used to reduce the resource footprint while achieving an acceptable hit rate. Here are the word frequencies associated with the Twitter, News and Blog documents.
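As an illustrative sketch of this trade-off (assuming the Twitter document-feature matrix built in the Appendix; the 90% coverage target is only an example, not a project requirement), the number of unique words needed to cover a given share of all word occurrences can be estimated as follows:

# Sort word counts from most to least frequent
word.counts <- sort(colSums(twitter.dfm), decreasing = TRUE)
# Cumulative share of all word occurrences covered as words are added
coverage <- cumsum(word.counts) / sum(word.counts)
# Number of unique words needed to cover roughly 90% of occurrences
which(coverage >= 0.90)[1]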

Conclusion

This analysis provided insight into basic word usage. For instance, each feed seems to contain a different set of most frequently used words. The analysis also showed the choppy nature of Twitter versus News (many more, shorter lines). Moreover, though the word-frequency graphs are small, digging into the details shows a rapid cut-off in word frequency, indicating that the table of prediction words can be much smaller than the overall word count.

The next step toward the model will be an N-gram analysis. After that analysis, the prediction model will be tuned to reduce the number of stored words while still covering a significant share of N-gram predictions and maintaining the needed speed and size.
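A hedged sketch of that next step, using quanteda’s tokens_ngrams() on the Twitter corpus from the Appendix (the 2- and 3-word phrase lengths follow the Introduction; object names and the preview size are illustrative):

# Tokenize, then form 2-grams and 3-grams
twitter.tokens <- tokens(twitter.Corpus, remove_punct = TRUE, remove_symbols = TRUE)
twitter.bigrams <- tokens_ngrams(twitter.tokens, n = 2)
twitter.trigrams <- tokens_ngrams(twitter.tokens, n = 3)
# Most frequent two-word phrases as a first look at prediction candidates
topfeatures(dfm(twitter.bigrams), n = 10)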

Appendix - Code

require(Matrix)
require(quanteda)
require(readtext)
require(ggplot2)
require(cowplot)

# Count the number of lines in a file, skipping embedded nulls
FindLineCountWithWord <- function(aFilePath){
    f <- file(aFilePath, open="rb")
    l <- length(readLines(f,n=-1, skipNul = TRUE))
    close(f)
    return(l)
}
feedFiles <- c("./Data/en_US.twitter.txt", "./Data/en_US.blogs.txt","./Data/en_US.news.txt")
# Build Corpus
twitter <- readtext('./Data/en_US.twitter.txt')
twitter.Corpus <- corpus(twitter)
blog <- readtext('./Data/en_US.blogs.txt')
blog.Corpus <- corpus(blog)
news <- readtext('./Data/en_US.news.txt')
news.Corpus <- corpus(news)
feedFiles <- c("./Data/en_US.twitter.txt", "./Data/en_US.blogs.txt","./Data/en_US.news.txt")
corpus.Stats <- data.frame(feed=c("Twitter", "Blog", "News"), Lines=NA, Words=NA)
corpus.Stats$Lines<- sapply(feedFiles, FindLineCountWithWord)
corpus.Stats$Words<- c(ntoken(twitter.Corpus), ntoken(blog.Corpus), ntoken(news.Corpus))

g.Lines <- ggplot(corpus.Stats, aes(x = feed, y = Lines))
g.Lines <- g.Lines + geom_bar(stat = "identity") +
  labs(y = "Line Count", x = "Source", title = "Document Line Count") 

g.Words <- ggplot(corpus.Stats, aes(x = feed, y = Words))
g.Words <- g.Words + geom_bar(stat = "identity") +
  labs(y = "Word Count", x = "Source", title = "Document Word Count") 

plot_grid(g.Lines, g.Words, align='h')
twitter.dfm <- dfm(twitter.Corpus, remove = c("will", stopwords("english")), remove_punct = TRUE)
blog.dfm <- dfm(blog.Corpus, remove = c("will", stopwords("english")), remove_punct = TRUE)
news.dfm <- dfm(news.Corpus, remove = c("will", stopwords("english")), remove_punct = TRUE)

twitter.topwords <- topfeatures(twitter.dfm, n=20)
twitter.freq <- data.frame(Count=twitter.topwords)
blog.topwords <- topfeatures(blog.dfm, n=20)
blog.freq <- data.frame(Count=blog.topwords)
news.topwords <- topfeatures(news.dfm, n=20)
news.freq <- data.frame(Count=news.topwords)

g.freq.twitter <- ggplot(twitter.freq, aes(x =reorder(rownames(twitter.freq),twitter.freq$Count), y =twitter.freq$Count ))
g.freq.twitter <- g.freq.twitter + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Twitter: Most Used Words", y = "Count", x = "Word") + 
    theme(axis.text.x = element_text(colour="grey20",size=8,face="plain"))

g.freq.blog <- ggplot(blog.freq, aes(x =reorder(rownames(blog.freq),blog.freq$Count), y =blog.freq$Count ))
g.freq.blog <- g.freq.blog + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Blog: Most Used Words", y = "Count", x = "Word") + 
    theme(axis.text.x = element_text(colour="grey20",size=8,face="plain"))

g.freq.news <- ggplot(news.freq, aes(x = reorder(rownames(news.freq), news.freq$Count), y = news.freq$Count))
g.freq.news <- g.freq.news + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "News: Most Used Words", y = "Count", x = "Word") + 
    theme(axis.text.x = element_text(colour="grey20",size=8,face="plain"))

plot_grid(g.freq.twitter, g.freq.blog, g.freq.news, align='h')