Summary

In this project, our objective is to create a text prediction app using R and Shiny. The task is divided into a set of subtasks that make up the general workflow of a Natural Language Processing (specifically, text prediction) application, as follows:

  1. Acquiring the data
  2. Cleaning and tokenizing
  3. Exploratory data analysis
  4. Predictive modeling
  5. Final product
  6. Presentation and slide deck

In this report we cover the first half of the above steps, demonstrating that we have acquired, cleaned, and explored the data, and are on track to create our prediction model and the final product/presentation.

Data Handling

Acquisition

We begin by reading in our data using the readLines function, since it is faster than scan or alternatives such as read.csv, and our data set is extremely large. In addition, we load several libraries that will be useful in our upcoming tasks (the descriptions below are taken from the official documentation), along with ggplot2 for plotting:

  1. RWeka: an R interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java.

  2. tm: A framework for text mining applications within R.

  3. NLP: A package containing basic classes and methods for Natural Language Processing.

library(NLP)
library(tm)
library(RWeka)
library(ggplot2)

setwd("~/Desktop/Coursera/Capstone Design/final/en_US")

con1 <- file("en_US.twitter.txt", "r")
twitter <- readLines(con1)
close(con1)

con2 <- file("en_US.blogs.txt", "r")
blogs <- readLines(con2)
close(con2)

con3 <- file("en_US.news.txt", "r")
news <- readLines(con3)
close(con3)

Basic Analysis

Now that we have read in our data, we start with some basic analysis of each file, in this case line counts and word counts. We use the length function in both cases: first to get the line count directly, and then, after splitting each document into words with the MC_tokenizer function from the tm package, to get the word count.
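
A minimal sketch of this counting step is shown below, using the twitter, blogs, and news objects read in above (the print calls simply label the output; note that tokenizing the full corpora this way is slow):

print("Number of lines of each document:")
print("Twitter doc:")
print(length(twitter))
print("Blogs doc:")
print(length(blogs))
print("News doc:")
print(length(news))

print("Number of words for each document:")
print("Twitter doc:")
print(length(MC_tokenizer(twitter)))
print("Blogs doc:")
print(length(MC_tokenizer(blogs)))
print("News doc:")
print(length(MC_tokenizer(news)))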

## [1] "Number of lines of each document:"
## [1] "Twitter doc:"
## [1] 2360148
## [1] "Blogs doc:"
## [1] 899288
## [1] "News doc:"
## [1] 1010242
## [1] "Number of words for each document:"
## [1] "Twitter doc:"
## [1] 37792532
## [1] "Blogs doc:"
## [1] 44158006
## [1] "News doc:"
## [1] 42743790

Exploratory Analysis and Plotting

The next step is to examine the frequency of words and word chains (known as “N-grams”). We do this by taking advantage of the RWeka package and its NGramTokenizer function. However, given the extremely large size of the corpus, we run into memory and time problems when attempting to process all the data at once, so we use a sample of the data instead. To do this, we write a function that samples a portion of each text file (in this case 1%), and combine the samples into one final sample that can be studied as representative of all our data.

sampler <- function(text, portion){
  # flag each line with probability 'portion' and keep only the flagged lines
  lines <- as.logical(rbinom(length(text), 1, portion))
  textSample <- text[lines]
  return(textSample)
}

twitterSample <- sampler(twitter, 0.01)
blogsSample <- sampler(blogs, 0.01)
newsSample <- sampler(news, 0.01)

finalSample <- c(twitterSample, blogsSample, newsSample)

Now that our data is sampled and reduced to a size that will not cause memory or time constraints, we can clean it and prepare it for further analysis. We do this by removing non-word components such as numbers, punctuation, and extra whitespace, converting everything to lower case, and removing common English stop words.

cleanedSample <- VCorpus(VectorSource(finalSample))
cleanedSample <- tm_map(cleanedSample, removeNumbers)
cleanedSample <- tm_map(cleanedSample, removePunctuation)
cleanedSample <- tm_map(cleanedSample, stripWhitespace)
cleanedSample <- tm_map(cleanedSample, content_transformer(tolower))
cleanedSample <- tm_map(cleanedSample, removeWords, stopwords("english"))
cleanedSample <- tm_map(cleanedSample, PlainTextDocument)

The next step is to find the different N-grams in the document and display their relative frequencies in order. For this stage of our project, we only go up to N = 3, meaning unigrams, bigrams, and trigrams.

# restrict tm to a single core; RWeka's Java-based tokenizers do not work reliably with parallel processing
options(mc.cores=1)

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = UnigramTokenizer))
unidtm <- removeSparseTerms(unidtm, 0.99995)
unifreq <- colSums(as.matrix(unidtm))
unitable <- data.frame(unigram = names(unifreq), count = unname(unifreq))
unitable <- unitable[order(unitable$count, decreasing = TRUE), ]
qplot(x = unitable[1:10, 1], y = unitable[1:10, 2], xlab = "Unigrams", ylab = "Count", geom = "col", fill = I("red"))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = BigramTokenizer))
bidtm <- removeSparseTerms(bidtm, 0.99995)
bifreq <- colSums(as.matrix(bidtm))
bitable <- data.frame(bigram = names(bifreq), count = unname(bifreq))
bitable <- bitable[order(bitable$count, decreasing = TRUE), ]
qplot(x = bitable[1:10, 1], y = bitable[1:10, 2], xlab = "Bigrams", ylab = "Count", geom = "col", fill = I("green"))

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(cleanedSample, control = list(tokenize = TrigramTokenizer))
tridtm <- removeSparseTerms(tridtm, 0.99995)
trifreq <- colSums(as.matrix(tridtm))
tritable <- data.frame(trigram = names(trifreq), count = unname(trifreq))
tritable <- tritable[order(tritable$count, decreasing = TRUE), ]
qplot(x = tritable[1:10, 1], y = tritable[1:10, 2], xlab = "Trigrams", ylab = "Count", geom = "col", fill = I("blue"))

Conclusion

Our graphs demonstrate the distribution of different N-grams in our sample of the data. More importantly, they show that our process and algorithm for identifying said N-grams are functional, and so we are on the right path to creating our prediction model.

The next step is to take these N-gram frequencies and use them to build our prediction model, which will in turn power the Shiny app that is the main goal of this project.
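
To illustrate the direction of the modeling step, the sketch below shows how the N-gram tables built above could drive a naive next-word lookup. The predictNextWord function and its simple back-off from trigrams to bigrams are placeholders for illustration only, not the final model.

# Rough sketch only: given two context words, return the most frequent
# completion in the trigram table, falling back to the bigram table when
# no trigram matches. Assumes the sorted tritable and bitable built above.
predictNextWord <- function(word1, word2) {
  prefix3 <- paste(word1, word2, "")   # e.g. "happy new "
  hits3 <- tritable[startsWith(as.character(tritable$trigram), prefix3), ]
  if (nrow(hits3) > 0) {
    # tables are already sorted by count, so the first match is the most frequent
    return(strsplit(as.character(hits3$trigram[1]), " ")[[1]][3])
  }
  prefix2 <- paste(word2, "")
  hits2 <- bitable[startsWith(as.character(bitable$bigram), prefix2), ]
  if (nrow(hits2) > 0) {
    return(strsplit(as.character(hits2$bigram[1]), " ")[[1]][2])
  }
  NA_character_
}

predictNextWord("happy", "new")

A real model will need to handle unseen contexts and much larger N-gram tables, which is the focus of the next stage of the project.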