Executive Summary

In this project, we use R text mining tools to build a statistical model for word sequences. The model will predict the next word as a user types - similar to the Swiftkey text messaging app. For example, when the user types: "I went to the " : the predictor should present options for what the next word might be. For example, the three words might be gym, store, restaurant.

We train the model using text scraped from blogs, twitter and news sites. Then we predict the next word by calculating the maximum likelihood estimate (MLE) using word statistics stored in a suffix tree.

In this document, we discuss progress made toward the completion of this project.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% sample of the data.
  3. Preprocess the data to remove stopwords, convert to lowercase and other tasks
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Build a n-gram language model using a suffix tree
  6. Predict the next word by calculating the maximum likelihood estimate (MLE) for all suffixes for a search string.

Complete Dataset

The Data Science Capstone Course provided text data from 3 data sources: blogs, twitter and news. The table below shows the number of lines from data source.

filename line_count
en_US.blogs.txt 899288
en_US.news.txt 1010242
en_US.twitter.txt 2360148
Total 4269678

Processing Unstructured Text for Analysis

We use a framework for text mining applications within R to transform the unstructured text to a structured document-by-term matrix format required for analysis. The first step is to “clean” the text with a series of text processing functions. We perform the following preprocessing steps.

Second, we tokenize the text into ngrams. For our analysis, a gram equals a whitespace delimited sequence of characters corresponding to an English word. The N in ngram corresponds to the number of words considered to be part of the same unit. For example: a 1-gram is new, 2-gram is new york, 3-gram is new york city and a 4-gram is new york city police.

Finally, we build a document-by-term matrix by transforming the ngrams to a bag-of-words model. The columns in a docterm matrix represent unique ngram. The rows represent a document and the frequency that each ngram appears in the document.

Sampling the Dataset

The 4,269,678 lines from the complete dataset can be memory intensive for the text mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and then work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.

source num_lines num_unique_words median_word_freq mean_word_freq
twitter 23602 8040 9 20
blogs 8993 12414 6 15
news 10103 12850 7 15

We see that the mean word frequency is nearly twice the median frequency for all data sources. The word frequency distribution has a long tail - as seen in the plot below. The mean word frequency is heavily weighted by a few words that occur very frequently. 937 of 22899 (4.1%) words cover 50% of all word instances in the dataset.

NGram Frequencies

Now we will look at the top bigrams and trigrams in the dataset.

Top Bigrams

The top bigrams include places like: new york and common phrases like: happy birthday and last night.

Top Trigrams

The top trigrams include places like: new york city and common phrases like: happy mothers day and rock n roll.

Language Model for Word Prediction

A suffix tree is a data structure where an interior node represents a “root” character sequence and a child node represents a suffix of the root. For our implementation, each node represents a word and an arc (e.g. search path) represents a bigram, trigram or 4-gram. Each node has an associated count that represents the ngram frequency from the training data. We use these counts to predict the next word by calculating the maximimum likelihood estimate (MLE).

Word Prediction for: ‘data’

For example: in the figure below we show a suffix tree for ngrams that begin with the word data. The first level (w1) of the tree corresponds to the 1-gram “data”. The second level (w2) represents all bigrams whose root is “data” : such as “data entry” or “data streams”. The same applies to the third (w3) and fourth (w4) levels for trigrams and 4-grams respectively.

Each node contains the ngram frequency count from the training data.The frequency count (C) for “data” is 198. The frequency count for “data entry” is 12 and “data streams” is 10. We calculate the MLE as:

\[ P_{mle}(w2|w1) = \frac{C(w1 \dots w2)}{C(w1)} \]

The probability of “data entry” is:

\[ P_{mle}(entry|data) = \frac{12}{198} = 0.06 = 6\% \]

The probability of “data streams” is: \[ P_{mle}(streams|data) = \frac{10}{198} = 0.05 = 5\% \]

Therefore, the language model would predict that “entry” is the most likely next word.

Word Prediction for: ‘data entry’

Now let’s predict the next word after: “data entry”. The frequency count (C) for “data entry” is 12. The frequency count for “data entry just” is 6 and “data entry respond” is 6.

We calculate the MLE as:

\[ P_{mle}(w3|w1 \dots w2) = \frac{C(w1 \dots w3)}{C(w1 \dots w2)} \]

The probability of “data entry just” is:

\[ P_{mle}(just|data \space entry) = \frac{6}{12} = 0.50 = 50\% \]

The probablility of “data entry respond” is:

\[ P_{mle}(respond|data \space entry) = \frac{6}{12} = 0.50 = 50\% \]

The language model would predict that “just” and “respond”" are equally likely to be the next word.

Conclusion

The preliminary results show that we can build a language model using ngrams from training data from blogs, news and twitter. We can build a word predictor using a suffix tree built from ngrams extracted from the training data.

Next Steps

Here are the list of next steps for the coming weeks.

  • Use a weighting technique like TF/IDF to adjust the ngram frequencies.
  • Apply text preprocessing to search text.
  • Implement backoff search to handle instances when a phrase is not found in the tree.
  • Build a model using the more than a 1% sample.
  • Deploy the ngram tree to the server-side of an Shiny Application.

Appendix

Appendix 1: Sampling Code

This code collects a 1% sample using a “coin flip” to decide which lines to choose.

# sample the datasci dir
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
  print(sprintf("Reading %s", fn))
  lines <- readLines(fn)
  set.seed(123)
  print(sprintf("Read %s Length %s", fn, length(lines)))
  lines_sample <- lines[rbinom(length(lines)*sample_len, length(lines), 0.5)]
  print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
  write.csv(lines_sample, file=outfn, row.names=FALSE, col.names=FALSE)
}

sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
                     "./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
                     "./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
                     "./data/final/en_US/sample/en_US.news.txt")