Introduction

This milestone report covers the second week of the Coursera Data Science Capstone project. The report opens with a short exploratory analysis of the data provided for the capstone. After that, I suggest several plans for creating the prediction algorithm and Shiny app, with a brief outline of each.

Exploratory data analysis

Loading the data

The first step consists of loading the available text data.
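The later steps rely on the stringi (word counts), tm (corpus handling), RWeka (n-gram tokenization), and ggplot2 (plotting) packages, which are loaded up front (assuming they are already installed):

library(stringi)   # stri_count_words() for word counts
library(tm)        # corpus creation and cleaning
library(RWeka)     # NGramTokenizer() for n-gram tokenization
library(ggplot2)   # frequency plots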

text.blogs   <- readLines(file.path("output", "en_US", "en_US.blogs.txt"))
text.news    <- readLines(file.path("output", "en_US", "en_US.news.txt"), skipNul = TRUE)
text.twitter <- readLines(file.path("output", "en_US", "en_US.twitter.txt"), skipNul = TRUE)

Corpus Descriptive Data

In this step we check the attributes of the provided files and then display these attributes.

Source <- c("blogs", "news", "twitter")
FileSize.MB <- c(file.size(file.path("output", "en_US", "en_US.blogs.txt")) / 1024^2, 
                       file.size(file.path("output", "en_US", "en_US.news.txt")) / 1024^2, 
                       file.size(file.path("output", "en_US", "en_US.twitter.txt")) / 1024^2)
Total.lines <- c(length(text.blogs),
                  length(text.news),
                  length(text.twitter))
Total.words <- c(sum(stri_count_words(text.blogs)),
                  sum(stri_count_words(text.news)),
                  sum(stri_count_words(text.twitter)))
data.frame(Source, FileSize.MB, Total.lines, Total.words)
##    Source FileSize.MB Total.lines Total.words
## 1   blogs    200.4242      899288    38154238
## 2    news    196.2775       77259     2693898
## 3 twitter    159.3641     2360148    30218166

Frequent n-grams

The next section of this report displays the n-grams (n = 1-3 words) with the highest frequencies in a sample of the corpus.

# Sampling to limit computation time (roughly 0.1% of each source)
set.seed(1200)
text.sample <- c(sample(text.blogs,   round(length(text.blogs) * 0.001)),
                 sample(text.news,    round(length(text.news) * 0.001)),
                 sample(text.twitter, round(length(text.twitter) * 0.001)))
# Creating corpus and cleaning data
corpus <- VCorpus(VectorSource(text.sample))
corpus <- tm_map(corpus, content_transformer(tolower))  # keeps a valid tm corpus
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

options(mc.cores = 1)  # single-threaded tokenization avoids RWeka/parallel issues
# Tokenizers for 2- and 3-word n-grams
bi_gram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Collapse a term-document matrix into a frequency table, sorted in decreasing order
getFreq <- function(x) {
  freq <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
unifreq <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.99999))
bifreq  <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bi_gram)), 0.99999))
trifreq <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = tri_gram)), 0.99999))

plot_results <- function(data, label) {
  ggplot(data[1:50, ], aes(reorder(word, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 90, size = 11, hjust = 1))
}
plot_results(unifreq, "Frequent Unigrams")

plot_results(bifreq, "Frequent Bigrams")

plot_results(trifreq, "Frequent Trigrams")

Plans for future phases of the Capstone project

Given the constraints of my current computer, my needs for the current phase of the project were best served by the RWeka package. Using a more powerful computer for future parts of the project may allow me to make better use of other packages.
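For example, one alternative I may explore is the quanteda package, which can produce the same kind of n-gram counts. The sketch below is untested on the full corpus and the exact arguments may need adjustment; it reuses the text.sample object created above:

# Sketch: bigram counts with quanteda instead of tm/RWeka (untested alternative)
library(quanteda)
toks <- tokens(text.sample, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
topfeatures(dfm(bigrams), 20)  # 20 most frequent bigrams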

Requirements that I have identified for future steps of the project include:

-These fall into two main categories: working with a manageable corpus of text, and being able to develop a user-friendly application with tolerable response time.

-My first priority regarding data preparation for the application is optimizing the final corpus to achieve better coverage and improve prediction accuracy.

-I will have to come up with more time- and resource-efficient methods to pre-process the corpus.

-Search for an efficient way to create data frames of 2-5 word n-grams.

-Come up with methods to efficiently process a much larger corpus of text, using techniques such as file hashing and parallel processing.

-Selection of a prediction model that allows efficient processing of a large corpus, both tokenized and in full form. If no matching trigram can be found, the algorithm would back off to the bigram model, and then to the unigram model if needed (a rough sketch follows this list).

-Finding a way to develop a Shiny app with a convenient user interface and tolerable response time.
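As a rough illustration of the back-off idea, the sketch below reuses the n-gram frequency tables computed earlier (unifreq, bifreq, trifreq, each already sorted by decreasing frequency). The function name and matching logic are illustrative only, not the final model:

# Sketch of a simple back-off lookup over the frequency tables built above
# (illustrative only; the final model will need smoothing, pruning and input escaping)
predict_next <- function(phrase, trifreq, bifreq, unifreq) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # 1. Try trigrams whose first two words match the end of the input
  if (length(words) == 2) {
    hits <- trifreq[grepl(paste0("^", words[1], " ", words[2], " "), trifreq$word), ]
    if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  # 2. Back off to bigrams starting with the last input word
  hits <- bifreq[grepl(paste0("^", tail(words, 1), " "), bifreq$word), ]
  if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  # 3. Fall back to the most frequent unigram overall
  as.character(unifreq$word[1])
}

predict_next("thanks for the", trifreq, bifreq, unifreq)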