In this project, our objective is to create a text prediction app written in R and running on Shiny. The task is divided into a set of subtasks that make up the general flow of a Natural Language Processing (and specifically text prediction) piece of software: acquiring the data, cleaning and sampling it, exploring it, building the prediction model, and developing the final Shiny app and presentation.
In this report we cover the first half of the above steps, demonstrating that we have acquired, cleaned, and explored the data, and are on track to create our prediction model and the final product/presentation.
We begin by reading in our data using the `readLines` command, since it is faster than `scan` or other variations such as `read.csv`, and our data set is extremely large. In addition, we load ggplot2 for plotting along with three other libraries that will be useful in our upcoming tasks (descriptions are taken from the official documentation):
RWeka: an R interface to Weka, a collection of machine learning algorithms for data mining tasks, written in Java
tm: A framework for text mining applications within R.
NLP: A package containing basic classes and methods for Natural Language Processing.
library(NLP)
library(tm)
library(RWeka)
library(ggplot2)
setwd("~/Desktop/Coursera/Capstone Design/final/en_US")
con1 <- file("en_US.twitter.txt", "r")
twitter <- readLines(con1)
close(con1)
con2 <- file("en_US.blogs.txt", "r")
blogs <- readLines(con2)
close(con2)
con3 <- file("en_US.news.txt", "r")
news <- readLines(con3)
close(con3)
Now that we have read in our data, we start with some basic analysis of each file, in this case line counts and word counts. We use the `length` command in both cases: first directly on each file to get the line count, and then on the output of the `MC_tokenizer` command from the tm package (which splits the text into words) to get the word count.
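The chunk that produced the counts below is not echoed in the report; a minimal sketch of how they can be obtained with `length` and `MC_tokenizer` (a reconstruction, not necessarily the exact original code) is:
# Line counts: readLines() gives one vector element per line
print("Number of lines of each document:")
print("Twitter doc:")
print(length(twitter))
print("Blogs doc:")
print(length(blogs))
print("News doc:")
print(length(news))
# Word counts: MC_tokenizer() from tm splits the text into word tokens
print("Number of words for each document:")
print("Twitter doc:")
print(length(MC_tokenizer(twitter)))
print("Blogs doc:")
print(length(MC_tokenizer(blogs)))
print("News doc:")
print(length(MC_tokenizer(news)))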
## [1] "Number of lines of each document:"
## [1] "Twitter doc:"
## [1] 2360148
## [1] "Blogs doc:"
## [1] 899288
## [1] "News doc:"
## [1] 1010242
## [1] "Number of words for each document:"
## [1] "Twitter doc:"
## [1] 37792532
## [1] "Blogs doc:"
## [1] 44158006
## [1] "News doc:"
## [1] 42743790
The next step is to examine the frequency of words and word chains (known as "N-grams"). We do this by taking advantage of the RWeka package and its `NGramTokenizer` command. However, given the extremely large size of the corpus, we run into memory and time problems when attempting to process all the data at once, so we choose to work with a sample of the data instead. To do this, we write a function that samples a portion of a text file (in this case 1%), apply it to each of the three files, and combine the results into one final sample that can be studied as representative of all our data.
sampler <- function(text, portion){
  # Keep each line with probability `portion` (one Bernoulli draw per line)
  lines <- as.logical(rbinom(length(text), 1, portion))
  textSample <- text[lines]
  return(textSample)
}
twitterSample <- sampler(twitter, 0.01)
blogsSample <- sampler(blogs, 0.01)
newsSample <- sampler(news, 0.01)
finalSample <- c(twitterSample, blogsSample, newsSample)
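Note that `rbinom` makes the sampling random, so the exact lines drawn change on every run; if a reproducible sample is wanted (an optional refinement, not part of the code above), the random seed can be fixed before calling the sampler:
set.seed(1234)   # fix the RNG state so the same 1% sample is drawn on every run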
Now that our data is sampled and reduced to a size that will not cause memory or time problems, we can clean it and prepare it for further analysis. We do this by removing non-word components such as numbers and punctuation, stripping extra whitespace, converting the text to lower case, and removing common English stop words.
cleanedSample <- VCorpus(VectorSource(finalSample))                         # build a tm corpus from the sample
cleanedSample <- tm_map(cleanedSample, removeNumbers)                       # drop digits
cleanedSample <- tm_map(cleanedSample, removePunctuation)                   # drop punctuation
cleanedSample <- tm_map(cleanedSample, stripWhitespace)                     # collapse repeated whitespace
cleanedSample <- tm_map(cleanedSample, content_transformer(tolower))        # convert to lower case
cleanedSample <- tm_map(cleanedSample, removeWords, stopwords("english"))   # remove English stop words
cleanedSample <- tm_map(cleanedSample, PlainTextDocument)                   # store as plain-text documents
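As a quick sanity check (not part of the processing pipeline itself), one sampled line can be compared with its cleaned counterpart to confirm the transformations behaved as expected:
finalSample[1]                     # raw sampled line
as.character(cleanedSample[[1]])   # the same line after cleaning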
The next step is to find the different N-grams in the sample and present their relative frequencies in order. For this stage of the project we only go up to N = 3, meaning unigrams, bigrams, and trigrams.
options(mc.cores=1)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = UnigramTokenizer))
unidtm <- removeSparseTerms(unidtm, 0.99995)
unifreq <- colSums(as.matrix(unidtm))   # total count of each unigram across the sample
unitable <- data.frame(unigram = names(unifreq), count = unname(unifreq))
unitable <- unitable[order(unitable$count, decreasing = TRUE), ]
qplot(x = unitable[1:10, 1], y = unitable[1:10, 2], xlab = "Unigrams", ylab = "Count", geom = "col", fill = I("red"))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = BigramTokenizer))
bidtm <- removeSparseTerms(bidtm, 0.99995)
bifreq <- colSums(as.matrix(bidtm))     # total count of each bigram across the sample
bitable <- data.frame(bigram = names(bifreq), count = unname(bifreq))
bitable <- bitable[order(bitable$count, decreasing = TRUE), ]
qplot(x = bitable[1:10, 1], y = bitable[1:10, 2], xlab = "Bigrams", ylab = "Count", geom = "col", fill = I("green"))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = TrigramTokenizer))
tridtm <- removeSparseTerms(tridtm, 0.99995)
trifreq <- colSums(as.matrix(tridtm))   # total count of each trigram across the sample
tritable <- data.frame(trigram = names(trifreq), count = unname(trifreq))
tritable <- tritable[order(tritable$count, decreasing = TRUE), ]
qplot(x = tritable[1:10, 1], y = tritable[1:10, 2], xlab = "Trigrams", ylab = "Count", geom = "col", fill = I("blue"))
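As an aside, the three chunks above differ only in the N-gram order and the plot colour, so the shared steps could be wrapped in a single helper. The sketch below is our refactoring suggestion (the name ngramTable is ours), not the code used to generate the plots above:
# Build a frequency table, sorted by count, for N-grams of a given order
ngramTable <- function(corpus, n, sparsity = 0.99995) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  dtm <- removeSparseTerms(DocumentTermMatrix(corpus, control = list(tokenize = tok)), sparsity)
  freq <- colSums(as.matrix(dtm))
  data.frame(ngram = names(freq), count = unname(freq))[order(freq, decreasing = TRUE), ]
}
# ngramTable(cleanedSample, 2) gives the same frequencies as bitable above (column named "ngram")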
The graphs above show the distribution of the different N-grams within our sample of the data. More importantly, they show that our process for identifying these N-grams is functional, so we are on the right path to creating our prediction model.
The next steps are to take this data and use it to build our prediction model, which we will then use to develop the Shiny app that is the main goal of this project.
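As a rough preview of how these frequency tables could feed the prediction step (a simplified illustration of one possible approach, not the final model), the idea is to look up the last one or two words the user typed in the trigram and bigram tables and return the most frequent continuation:
# Naive frequency-based backoff using the bigram and trigram tables built above
predictNext <- function(phrase, bitable, tritable) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # Try trigrams first: match the last two typed words against the start of each trigram
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- tritable[startsWith(as.character(tritable$trigram), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$trigram[1])))
  }
  # Back off to bigrams: match only the last typed word
  hits <- bitable[startsWith(as.character(bitable$bigram), paste0(words[n], " ")), ]
  if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$bigram[1])))
  NA_character_   # no match found in the sampled tables
}
predictNext("happy mothers", bitable, tritable)   # returns the most frequent continuation, if any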