This milestone report covers the second week of the Coursera Data Science Capstone Project. The report opens with a short exploratory analysis of the data provided for the capstone project. After that, I suggest several plans for creating the prediction algorithm and Shiny App, with a brief outline for each of them.
The first step consists of loading the available text data:
# Packages used throughout this report
library(stringi); library(tm); library(RWeka); library(ggplot2)

text.blogs   <- readLines(file.path("output", "en_US", "en_US.blogs.txt"))
text.news    <- readLines(file.path("output", "en_US", "en_US.news.txt"), skipNul = TRUE)
text.twitter <- readLines(file.path("output", "en_US", "en_US.twitter.txt"), skipNul = TRUE)
In this step we check the attributes of the provided files and then display these attributes:
Source <- c("blogs", "news", "twitter")
FileSize.MB <- c(file.size(file.path("output", "en_US", "en_US.blogs.txt")) / 1024^2,
                 file.size(file.path("output", "en_US", "en_US.news.txt")) / 1024^2,
                 file.size(file.path("output", "en_US", "en_US.twitter.txt")) / 1024^2)
Total.lines <- c(length(text.blogs),
                 length(text.news),
                 length(text.twitter))
Total.words <- c(sum(stri_count_words(text.blogs)),
                 sum(stri_count_words(text.news)),
                 sum(stri_count_words(text.twitter)))
data.frame(Source, FileSize.MB, Total.lines, Total.words)
## Source FileSize.MB Total.lines Total.words
## 1 blogs 200.4242 899288 38154238
## 2 news 196.2775 77259 2693898
## 3 twitter 159.3641 2360148 30218166
The next section of this report displays the n-grams (n = 1-3 words) with the highest frequencies in a sample of the data.
# Sampling to limit computation time
set.seed(1200)
text.sample <- c(sample(text.blogs, length(text.blogs) * 0.001),
                 sample(text.news, length(text.news) * 0.001),
                 sample(text.twitter, length(text.twitter) * 0.001))
# Creating corpus and cleaning data
corpus <- VCorpus(VectorSource(text.sample))
# content_transformer() keeps each document a PlainTextDocument,
# so no re-wrapping with PlainTextDocument is needed afterwards
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Restrict tm to a single core so the RWeka tokenizers run reliably
options(mc.cores = 1)

# Tokenizers for bigrams and trigrams
bi_gram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
getFreq <- function(x) {
  freq <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
unifreq <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.99999))
bifreq <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bi_gram)), 0.99999))
trifreq <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = tri_gram)), 0.99999))
plot_results <- function(data, label) {
  ggplot(data[1:50, ], aes(reorder(word, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 90, size = 11, hjust = 1))
}
plot_results(unifreq, "Frequent Unigrams")
plot_results(bifreq, "Frequent Bigrams")
plot_results(trifreq, "Frequent Trigrams")
Given the constraints of my current computer, my needs for this phase of the project were best served by the RWeka package. Using a more powerful computer for future parts of the project may allow me to make better use of other packages.
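As one possible direction (the specific package is only an illustration on my part, not a decision), the quanteda package could produce the same kind of n-gram frequency counts without the Java dependency that RWeka requires:

# Illustrative alternative to the RWeka tokenizers using quanteda
# (package choice is an assumption; any comparable tokenizer would do)
library(quanteda)

toks <- tokens(text.sample, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Count bigrams and show the ten most frequent ones
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
head(sort(colSums(bigram_dfm), decreasing = TRUE), 10)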
Requirements that I have recognized for future steps of the project include:
-The requirements I have identified fall into two main categories: working with a manageable corpus of text, and developing a user-friendly application with tolerable response time.
-My first priority regarding data preparation for the application is optimizing the final corpus to achieve better coverage and improve prediction accuracy.
-I will have to come up with more time- and resource-efficient methods to pre-process the corpus.
-Search for an efficient way to create data frames of 2-5 word n-grams.
-I need to come up with methods to efficiently process a much larger corpus of text, using methods such as file hashing and parallel processing.
-Selection of a prediction model that allows efficient processing of a large corpus, both while tokenized and in full form. If no matching trigram can be found, the algorithm would back off to the bigram model, and then to the unigram model if needed (see the sketch after this list).
-Finding a way to develop a Shiny app with a convenient user interface and tolerable response time.
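To make the back-off idea above concrete, here is a minimal sketch of how such a prediction function could look. It assumes the bigram and trigram frequency tables have been reshaped into lookup tables with prefix, next_word, and freq columns; those column names and the helper itself are illustrative rather than part of the code above.

# Minimal back-off sketch (illustrative): look up the last two words in a
# trigram table, fall back to the last word in a bigram table, then to the
# most frequent unigram. Assumes lookup tables with columns `prefix`,
# `next_word`, and `freq`, derived from the frequency tables built above.
predict_next_word <- function(phrase, tri_table, bi_table, uni_table) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)

  # 1. Trigram model: match on the last two words
  if (length(words) == 2) {
    hits <- tri_table[tri_table$prefix == paste(words, collapse = " "), ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
  }

  # 2. Bigram model: match on the last word only
  hits <- bi_table[bi_table$prefix == tail(words, 1), ]
  if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])

  # 3. Unigram model: fall back to the single most frequent word
  uni_table$word[which.max(uni_table$freq)]
}

A full implementation would also need smoothing or a weighted ("stupid") back-off rather than this hard fall-through, but the trigram to bigram to unigram flow would be the same.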