Introduction

The ultimate goal of this project is to build a predictive text application using n-grams. SwiftKey has provided blog, Twitter, and news data in English, German, Finnish, and Russian. This document covers the initial analysis of the data, the production and analysis of n-grams, and the next steps.

# Load required packages
library(knitr); library(NLP); library(tm); library(class); library(SnowballC)
library(RWeka); library(stringr); library(stringi); library(ggplot2); library(slam)

Downloading and loading the data

The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The data can be downloaded from the Capstone Dataset link. Below we read in the English versions of the Twitter, news, and blog data.
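
If the data set is not already on disk, a minimal download step might look like the following sketch; dataURL is a placeholder for the Capstone Dataset link, not a real address.

# Sketch: fetch and unpack the data set if the "final" folder is not present
# (dataURL is a placeholder for the Capstone Dataset link)
dataURL <- "<Capstone Dataset URL>"
if (!dir.exists("final")) {
  download.file(dataURL, destfile = "dataset.zip", mode = "wb")
  unzip("dataset.zip")   # unpacks into the final/ folder used below
}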

# Read the English files; skipNul guards against embedded nul characters in the raw text
twit <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blog <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory Data Analysis (EDA)

Standard EDA techniques such as head(), tail(), and table() show that the data is large and unstructured. Using table(str_length()) we can see that the Twitter lines range from 2 to 140 characters, as expected given Twitter's 140-character limit. The blog and news lines, however, run up to 40,833 and 11,384 characters respectively, which gives an initial idea of the scale of the task.
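
As an illustration, the line-length check boils down to something like this (using the objects read in above); the ranges noted in the comments are the values reported in this paragraph.

# Line-length ranges per source; stringr::str_length counts characters per line
range(str_length(twit))   # 2 to 140, matching Twitter's character limit
range(str_length(news))   # up to 11,384 characters per line
range(str_length(blog))   # up to 40,833 characters per line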

To understand the data sets better, we next look at the characteristics of the files provided. For brevity, this report focuses on the English files only. The Twitter data has 2,360,148 lines, 65,264,808 words, and 162,096,031 characters.

NROW(twit); sum(stri_count_boundaries(twit, type = "word")); sum(str_length(twit))
## [1] 2360148
## [1] 65264808
## [1] 162096031

The news data has 1,010,242 lines, 74,316,341 words, and 203,223,159 characters.

NROW(news); sum(stri_count_boundaries(news, type = "word")); sum(str_length(news))
## [1] 1010242
## [1] 74316341
## [1] 203223159

The blog data has 899,288 lines, 79,779,796 words, and 206,824,505 characters.

NROW(blog);sum(stri_count_boundaries(blog, type = "word")); sum(str_length(blog))
## [1] 899288
## [1] 79779796
## [1] 206824505
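
For reference, the same per-file counts can be gathered into a single summary table; this sketch simply restates the numbers above using the objects already in memory.

# Collect line, word, and character counts for the three files in one data frame
sizes <- data.frame(
  source = c("twitter", "news", "blogs"),
  lines  = c(NROW(twit), NROW(news), NROW(blog)),
  words  = c(sum(stri_count_boundaries(twit, type = "word")),
             sum(stri_count_boundaries(news, type = "word")),
             sum(stri_count_boundaries(blog, type = "word"))),
  chars  = c(sum(str_length(twit)), sum(str_length(news)), sum(str_length(blog)))
)
sizes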

Sampling

The data is quite large, so I take a 1% random sample of each source. Sampling each file proportionally keeps the sample fairly representative, rather than simply taking a fixed number of lines from each file. Accuracy could be improved later by increasing the sample percentage.

# Fix the random seed so the 1% sample is reproducible
set.seed(1234)
twit <- sample(twit, floor(NROW(twit) / 100), replace = FALSE)
news <- sample(news, floor(NROW(news) / 100), replace = FALSE)
blog <- sample(blog, floor(NROW(blog) / 100), replace = FALSE)

Creating the Corpus

The next step is to create a Corpus, the main structure for managing documents in the tm package. A Corpus is a collection of text documents. After creating the Corpus I remove the objects that are no longer needed to help with memory management, and I do the same at later steps.

twit <- Corpus(VectorSource(twit), readerControl = list(language = "en"))
news <- Corpus(VectorSource(news), readerControl = list(language = "en"))
blog <- Corpus(VectorSource(blog), readerControl = list(language = "en"))
#Create Corpus
myCorpus <- c(twit, news, blog)
#Remove unnecessary data from memory
rm(twit); rm(news); rm(blog)

Cleaning the Corpus

The data is unusable in its current state, so we reduce it to just the relevant text. First we transform all letters to lower case, then we keep only alphabetic characters, dropping numbers, symbols, and punctuation, and finally we collapse extra white space. I did not stem or remove stop words, since we are ultimately building a predictive text model and need natural word sequences. I have not removed profanity yet. Lists of profane words are readily available and I will likely filter them later, but I first want to make sure doing so does not damage the sentence structure; a sketch of one possible approach follows the cleaning code below. The n-gram results later in this analysis also show that no profane words appear among the most frequent terms.

# Convert all text to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# Keep only lower-case letters and spaces (drops numbers, punctuation, and symbols);
# wrapping in content_transformer() keeps the documents valid tm text documents
keepLetters <- content_transformer(function(x) gsub("[^a-z ]", "", x))
myCorpus <- tm_map(myCorpus, keepLetters)
# Collapse the extra white space left behind by the removals
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
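
For the later profanity step, a minimal sketch could use tm's removeWords transformation. It assumes a hypothetical plain-text word list, profanity.txt, with one profane word per line; this is not part of the current pipeline.

# Hypothetical sketch: filter profanity with tm::removeWords
# ("profanity.txt" is an assumed word list, one profane word per line)
badWords <- readLines("profanity.txt")
myCorpus <- tm_map(myCorpus, removeWords, badWords)
# Clean up the extra spaces left where words were removed
myCorpus <- tm_map(myCorpus, stripWhitespace)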

Unigram Tokenizer

Our eventual goal is to build a model for predicting text. To do that we need the frequency counts of single words (Unigrams), two-word phrases (Bigrams), and three-word phrases (Trigrams) in the sampled data. First, we start with Unigrams.

# Avoid parallel-processing issues between tm and the Java-based RWeka tokenizer
options(mc.cores = 1)
# Tokenize the corpus into single words and build a term-document matrix
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm1 <- TermDocumentMatrix(myCorpus, control = list(tokenize = UnigramTokenizer))
# Sum the counts across documents (slam::rollup), then keep the 30 most frequent terms
tdm1 <- as.matrix(rollup(tdm1, 2, na.rm = TRUE, FUN = sum))
tdm1 <- data.frame(ngram = rownames(tdm1), count = tdm1[, 1])
tdm1 <- tdm1[order(tdm1$count, decreasing = TRUE), ]
tdm1 <- head(tdm1, 30)
# Plot the top 30 words in decreasing order of frequency
ggplot(tdm1, aes(x = reorder(ngram, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 30 Words")

(Figure: Top 30 Words, bar chart of unigram counts)

rm(tdm1)

The most common Unigrams are “the” and “and”, which is fairly expected.

Bigram Tokenizer

options(mc.cores = 1)
# Same pipeline as above, now tokenizing two-word phrases
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
tdm2 <- as.matrix(rollup(tdm2, 2, na.rm = TRUE, FUN = sum))
tdm2 <- data.frame(ngram = rownames(tdm2), count = tdm2[, 1])
tdm2 <- tdm2[order(tdm2$count, decreasing = TRUE), ]
tdm2 <- head(tdm2, 30)
ggplot(tdm2, aes(x = reorder(ngram, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 30 Bigrams")

(Figure: Top 30 Bigrams, bar chart of bigram counts)

rm(tdm2)

The most common Bigrams are “in the” and “of the”.

Trigram Tokenizer

options(mc.cores = 1)
# Same pipeline again, now tokenizing three-word phrases
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(myCorpus, control = list(tokenize = TrigramTokenizer))
tdm3 <- as.matrix(rollup(tdm3, 2, na.rm = TRUE, FUN = sum))
tdm3 <- data.frame(ngram = rownames(tdm3), count = tdm3[, 1])
tdm3 <- tdm3[order(tdm3$count, decreasing = TRUE), ]
tdm3 <- head(tdm3, 30)
ggplot(tdm3, aes(x = reorder(ngram, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 30 Trigrams")

(Figure: Top 30 Trigrams, bar chart of trigram counts)

rm(tdm3)

The most common Trigrams are “one of the”, “a lot of”, and “thanks for the”.

Plans and next steps

The next step is to use these n-gram frequency tables to build the prediction model and, ultimately, the predictive text application described in the introduction. The Capstone forums have been very useful; however, any additional guidance would be highly appreciated.
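
To make the planned prediction step concrete, the rough sketch below shows one way the frequency tables could drive a simple back-off lookup. It assumes the full bigram and trigram tables (the tdm2 and tdm3 data frames before the head(..., 30) cut) are kept as bigrams and trigrams with ngram and count columns; predictWord and lastTwo are illustrative names, not part of the current code.

# Rough sketch: predict the next word from the last two words typed, backing
# off from trigram counts to bigram counts when no trigram matches.
# "trigrams" and "bigrams" are assumed to be the full frequency tables built above.
predictWord <- function(lastTwo, trigrams, bigrams) {
  # Trigrams whose first two words match the typed context
  hits <- trigrams[grepl(paste0("^", lastTwo, " "), trigrams$ngram), ]
  if (NROW(hits) == 0) {
    # Back off: bigrams starting with just the last typed word
    lastOne <- word(lastTwo, -1)   # stringr::word, last word of the context
    hits <- bigrams[grepl(paste0("^", lastOne, " "), bigrams$ngram), ]
  }
  if (NROW(hits) == 0) return(NA_character_)
  # Suggest the final word of the most frequent matching n-gram
  word(as.character(hits$ngram[which.max(hits$count)]), -1)
}
predictWord("one of", trigrams, bigrams)   # would likely suggest "the", given the counts above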