We begin by loading a few libraries and the datasets.

library(readr); library(tm); library(dplyr)

# tibble() replaces the deprecated data_frame(); the single text column is named X1
news <- tibble(X1 = read_lines("en_US/en_US.news.txt"))

blogs <- tibble(X1 = read_lines("en_US/en_US.blogs.txt"))

twitter <- tibble(X1 = readLines("en_US/en_US.twitter.txt"))

Let’s examine the size of each dataset. The News dataset has just over one million rows and approximately 34.4 million words. The Blogs dataset contains almost 900,000 rows and 37.3 million words. The Twitter dataset contains almost 2.4 million rows and 30.4 million words.

# News
cat("Rows: ", dim(news)[1]); cat("Words: ", sum(sapply(gregexpr("\\S+", news$X1), length)))
## Rows:  1010242
## Words:  34372530
# Blogs
cat("Rows: ", dim(blogs)[1]); cat("Words: ", sum(sapply(gregexpr("\\S+", blogs$X1), length)))
## Rows:  899288
## Words:  37334131
# Twitter
cat("Rows: ", dim(twitter)[1]); cat("Words: ", sum(sapply(gregexpr("\\S+", twitter$X1), length)))
## Rows:  2360148
## Words:  30373792
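
One caveat about the counting expression: gregexpr() returns -1 for a line with no match, so taking length() of each result counts an empty line as one word. The effect on the totals above is probably small, but a stricter count can be sketched as follows. The helper name count_words and the sample lines are made up for illustration.

```r
# Hypothetical helper: count whitespace-delimited tokens per line.
count_words <- function(lines) {
  matches <- gregexpr("\\S+", lines)
  # gregexpr() yields -1 when nothing matches, so count only real match positions
  vapply(matches, function(m) sum(m > 0), integer(1))
}

sample_lines <- c("The quick brown fox", "", "two words")
per_line <- count_words(sample_lines)   # the empty line contributes 0
total_words <- sum(per_line)
```

On the toy input the empty line adds nothing to the total, whereas the length()-based version would add one.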


The analysis below uses only the News dataset; the same steps can be replicated for the other datasets.


Let’s convert the dataset into a corpus, a collection of documents containing text, using the tm package. Given the size of the dataset, we first drew a random sample of roughly 10 percent of the observations. We then applied basic pre-processing: removing punctuation and numbers and collapsing extra white space. Please note that we didn’t remove any stopwords.

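# Draw a random 10 percent sample; round() keeps the sample size an integer
news <- sample_n(news, size = round(nrow(news) * 0.1))
# Treat each sampled line as one document
myCorpus <- Corpus(VectorSource(news$X1))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# Collapse the gaps left behind by the two removals above
myCorpus <- tm_map(myCorpus, stripWhitespace)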
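
To make the three cleaning steps concrete, here is a rough base R sketch on two made-up strings. The regular expressions only approximate what the tm transformations do (an assumption, not the package's exact implementation).

```r
# Toy input (made up); the patterns approximate the tm transformations
raw <- c("Hello,   world! It's 2015.", "Prices rose 3.5%  today")

no_punct  <- gsub("[[:punct:]]+", "", raw)         # similar to removePunctuation
no_digits <- gsub("[[:digit:]]+", "", no_punct)    # similar to removeNumbers
cleaned   <- gsub("[[:space:]]+", " ", no_digits)  # similar to stripWhitespace
```

Collapsing white space last is a deliberate ordering: deleting punctuation and digits can leave runs of spaces, which the final step reduces to single spaces.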

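The next step was to convert the corpus into a document-term matrix, a structure with one row per document and one column per term, where each entry records how often the term occurs in that document. We also removed sparse terms: with a threshold of 0.99, only terms that appear in more than about 1 percent of the documents are kept.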

dtm <- DocumentTermMatrix(myCorpus)
dtm <- removeSparseTerms(dtm, 0.99)
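
As an illustration of the structure, the sketch below builds a tiny document-term matrix by hand in base R and filters it by sparsity. The documents and the 0.5 threshold are made up, and the filter only mimics the idea behind removeSparseTerms.

```r
docs <- c("the cat sat", "the dog sat", "the bird flew away")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, entries are term counts
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab

# Sparsity of a term = share of documents in which it does not occur
sparsity <- colMeans(toy_dtm == 0)
kept <- toy_dtm[, sparsity < 0.5, drop = FALSE]
```

Only "sat" and "the" survive here, because every other term is missing from two of the three documents.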

Now we can examine the most common words, i.e. those with the highest frequency. The chart below shows the 10 most common words, and the table lists the 25 most common terms.

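# Total count of each term across all documents
frequency <- colSums(as.matrix(dtm))
# Indices that sort terms by ascending frequency (named ord to avoid masking order())
ord <- order(frequency)

# Top 10 words with highest frequency
temp <- frequency[tail(ord, 10)]
barplot(temp, main = "Frequency of 10 Most Common Words", col = "orange", border = "white", las = 1, ylim = c(0, 200000))
grid()

# Top 25 terms as an interactive table
DT::datatable(data.frame(frequency[tail(ord, 25)]), colnames = "Frequency")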

The most common word is "the". We can examine it further by finding the terms associated with it. The results are shown below: each number is the correlation coefficient between that term and "the", and only terms with a correlation of at least 0.2 are listed. The analysis can be extended to any other term.

findAssocs(dtm, "the",0.2)
## $the
##  and  for that with 
## 0.36 0.24 0.23 0.21
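
The numbers above can be read as correlations between columns of the document-term matrix. The sketch below reproduces that idea in base R on a made-up count matrix, assuming the association is a Pearson correlation across documents that is rounded and filtered by a lower limit.

```r
# Toy counts (made up): rows are documents, columns are terms
counts <- matrix(c(2, 1, 0,
                   1, 1, 0,
                   0, 0, 3,
                   3, 2, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("the", "and", "zebra")))

# Correlation of every other term with "the" across documents
others <- counts[, colnames(counts) != "the", drop = FALSE]
assoc <- cor(counts[, "the"], others)[1, ]

# Keep associations at or above the limit, rounded to two decimals
strong <- round(assoc[assoc >= 0.2], 2)
```

In this toy matrix "and" tends to rise and fall with "the", so it passes the limit, while "zebra" is negatively correlated and is dropped.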

Thus far we have only examined individual words, but we can also examine n-grams. Let’s look at bigrams. We defined a custom tokenizer function (see note 1) and passed it in when converting the corpus to a document-term matrix. Below are the bigrams that appear at least 2000 times.

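# Tokenizer that returns all adjacent word pairs of a document;
# ngrams() and words() come from the NLP package, which tm attaches
bigram <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Note: newer tm releases may ignore a custom tokenizer for a corpus built with
# Corpus(); building it with VCorpus() is the usual workaround
two_gram <- DocumentTermMatrix(myCorpus, control = list(tokenize = bigram))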

findFreqTerms(two_gram, lowfreq = 2000)
##  [1] "and a"     "and the"   "as a"      "as the"    "at the"   
##  [6] "by the"    "for a"     "for the"   "from the"  "has been" 
## [11] "he said"   "he was"    "in a"      "in the"    "is a"     
## [16] "it was"    "more than" "of a"      "of the"    "on a"     
## [21] "on the"    "one of"    "said the"  "that the"  "the first"
## [26] "to a"      "to be"     "to the"    "will be"   "with a"   
## [31] "with the"
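
For readers without the tm and NLP packages at hand, the bigram idea can be sketched in base R. The helper make_bigrams, the toy documents and the threshold of 2 are all made up for illustration; this is not the tokenizer used above.

```r
# Hypothetical helper: adjacent word pairs from one cleaned sentence
make_bigrams <- function(sentence) {
  w <- strsplit(trimws(sentence), "[[:space:]]+")[[1]]
  if (length(w) < 2) return(character(0))
  # Pair each word with its successor
  paste(head(w, -1), tail(w, -1))
}

toy_docs <- c("one of the best", "one of the first", "the first")
bigram_counts <- table(unlist(lapply(toy_docs, make_bigrams)))

# Bigrams that occur at least twice in the toy corpus
frequent <- names(bigram_counts)[bigram_counts >= 2]
```

The final filter plays the same role as the lowfreq argument of findFreqTerms, only with a much smaller threshold.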

Directions

Creating the Shiny app will require some thought about how to store the dataset and perform the analysis (processing, creating n-grams, prediction, etc.) while still providing useful information to end users. Much of this work will have to be pre-processed on the server rather than during an active session.
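
One possible shape for that split, sketched under heavy assumptions: the bigram counts are computed ahead of time and stored as a small lookup table, and the live session only performs a cheap lookup. The counts, the table layout and the function predict_next are hypothetical, not a design the analysis above commits to.

```r
# Offline step (hypothetical counts): bigram frequencies computed ahead of time
bigram_freq <- c("of the" = 120, "of a" = 45, "in the" = 98, "in a" = 60)

parts <- strsplit(names(bigram_freq), " ", fixed = TRUE)
lookup <- data.frame(
  first  = vapply(parts, `[`, character(1), 1),
  second = vapply(parts, `[`, character(1), 2),
  n      = unname(bigram_freq),
  stringsAsFactors = FALSE
)
# The table could be stored with saveRDS() and loaded once with readRDS()

# Session step: return the most frequent continuation of a word, or NA
predict_next <- function(word, tbl = lookup) {
  hits <- tbl[tbl$first == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$n)]
}
```

With this table, predict_next("of") returns "the", and an unseen word yields NA, which the app would need to handle with some fallback.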


  1. Borrowed from: http://tm.r-forge.r-project.org/faq.html.