Data Processing Strategy

The first thing I want to do with the data is to split each document into separate sentences before any processing or analysis. This avoids problems when turning our data into n-grams later in the process. For example, if we didn’t split our documents into separate sentences, a document containing “I like cats. Cats are cute.” would create the bigram “cats cats”, which spans a sentence boundary and is clearly not a valid piece of data.

We’ll start by reading in the data, then checking how many documents there are. For the purpose of this document, we will only examine the US news text; the other text files showed very similar properties, and applying the same techniques to them produced similar results, so I have omitted them for brevity.

con <- file('en_US.news.txt', 'r')
wholeDataVec <- readLines(con)
close(con)
length(wholeDataVec)
## [1] 1010242

Now we apply the regexCleaner function, which splits each document into individual sentences, and see how many documents we have. The number of documents is now approximately double the original amount.

wholeDataVec <- regexCleaner(wholeDataVec)
length(wholeDataVec)
## [1] 2206918
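
The regexCleaner function itself is defined outside this section. A minimal sketch of the sentence splitting it performs might look like the following (the exact regular expressions here are an assumption, not the real implementation):

# Rough sketch of a sentence splitter: break each document on sentence-ending
# punctuation followed by whitespace, strip non-letter characters, and drop empties
regexCleaner <- function(docs) {
  sentences <- unlist(strsplit(docs, '(?<=[.!?])\\s+', perl=TRUE))
  sentences <- gsub("[^[:alpha:][:space:]']", '', sentences)  # keep letters, spaces, apostrophes
  sentences <- trimws(sentences)
  sentences[nchar(sentences) > 0]
}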

Now that we have a single sentence for each document, we can begin processing our data using functions from the quanteda and SnowballC packages. We will apply stemming to the documents so that words like ‘running’ and ‘run’ are combined, and then convert all letters to lower case.

# Stem each word with wordStem() from the SnowballC package, then rejoin the sentence
wholeDataVec <- sapply(strsplit(wholeDataVec, ' '), function(x) paste(wordStem(x), collapse=' '))
# Convert everything to lower case
wholeDataVec <- toLower(wholeDataVec)

Our data is currently huge, so it is probably a good idea to work with only a subset of it. We’ll take a random sample of 1% of the sentences.

smallDataVec <- sample(wholeDataVec, length(wholeDataVec)*.01)
rm(wholeDataVec)
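
For reproducibility, it would also be sensible to set a random seed immediately before the sample() call above, for example:

set.seed(1234)  # arbitrary seed; makes the 1% sample reproducible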

Now we’ll create 1-grams (just words) and remove profanity from our data. Later we will use the tokenize function to create bi- and trigrams.

smallData <- tokenize(smallDataVec, ngrams=1)
smallData <- removeFeatures(smallData, features=profanity)
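
The profanity object is assumed to be a character vector of words to exclude; it is not created in this section. One way to build it (the file name below is only a placeholder) would be:

# Hypothetical: read a list of banned words, one per line, into a character vector
con <- file('profanity_list.txt', 'r')
profanity <- readLines(con)
close(con)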

Next we want to create document-frequency matrices so we can examine the distribution of terms in our documents. We see that our data set has around 22,000 features (unique words).
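
The call that builds freqMatSmall is not shown above; with the quanteda functions used here it would be something along these lines (a sketch, not necessarily the exact call):

# Build a document-feature matrix from the unigram tokens
freqMatSmall <- dfm(smallData)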

freqMatSmall
## Document-feature matrix of: 22,069 documents, 22,335 features.

We’ll look at just the 100 most frequent words and check how they are distributed.

top100small <- topfeatures(freqMatSmall,100)
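
The distribution can be visualised with a simple bar plot of these counts (only a sketch; the original plot may have been drawn differently):

# Bar plot of the 100 most frequent single words
barplot(top100small, las=2, cex.names=0.4, main='Top 100 unigrams', ylab='Frequency')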

We see that a few words like ‘the’, ‘a’, ‘to’, and ‘and’ have a much higher count than others. This is in line with what we might expect. Next we will examine the bigrams and trigrams in the same way.
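
The bigram and trigram matrices below are built from the same 1% sample; the construction code is not shown in this section, but with the functions already used here it would look roughly like this:

# Tokenize the sample into bigrams and trigrams and build the matrices;
# profanity filtering would be applied here just as it was for the 1-grams
smallData2 <- tokenize(smallDataVec, ngrams=2)
freqMatSmall2 <- dfm(smallData2)
smallData3 <- tokenize(smallDataVec, ngrams=3)
freqMatSmall3 <- dfm(smallData3)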

freqMatSmall2
## Document-feature matrix of: 22,069 documents, 173,021 features.
freqMatSmall3
## Document-feature matrix of: 22,069 documents, 274,712 features.

We have 173,000 bigrams and 274,000 trigrams in our data. These are what we will use to build our predictive model.
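
The frequency plots referred to below can be drawn in the same way as for the single words, for example:

# Bar plots of the 100 most frequent bigrams and trigrams, side by side
par(mfrow=c(1, 2))
barplot(topfeatures(freqMatSmall2, 100), las=2, cex.names=0.4, main='Top 100 bigrams', ylab='Frequency')
barplot(topfeatures(freqMatSmall3, 100), las=2, cex.names=0.4, main='Top 100 trigrams', ylab='Frequency')
par(mfrow=c(1, 1))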

From the graphs we can see that there are still certain bigrams and trigrams with much higher frequency, but that the distribution is much less extreme than with the single words.

Our general modelling strategy will be to use the trigrams first, then the bigrams, to predict the next word. We will group the bigrams by their first term (i.e. hair_piece and hair_ball would be grouped together) and the trigrams by their first two terms (i.e. my_favorite_food and my_favorite_color would be grouped together). Then, for each input, we can suggest the most frequently occurring of these choices from our dataset.
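
As a sketch of this lookup (the helper names and data shapes below are ours, not part of the analysis above), the grouping-and-ranking step for bigrams might look like this; the trigram case works the same way, grouping on the first two terms instead of the first one:

# Build a bigram lookup table and predict the most likely next word for a
# single input word (the input should be stemmed and lower-cased the same
# way as the training data).
bigramCounts <- Matrix::colSums(freqMatSmall2)       # total count of each bigram
parts <- strsplit(names(bigramCounts), '_', fixed=TRUE)
lookup <- data.frame(first  = sapply(parts, `[`, 1),
                     second = sapply(parts, `[`, 2),
                     count  = as.numeric(bigramCounts),
                     stringsAsFactors = FALSE)

predictNext <- function(word) {
  candidates <- lookup[lookup$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)   # no bigram starts with this word
  candidates$second[which.max(candidates$count)]     # most frequent continuation
}

predictNext('hair')  # returns the word most often seen after 'hair' in our sample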