This report outlines the approach to building a new word prediction tool as a Coursera Capstone project. The tool will accept text input and output the estimated probabilities of the five most likely next words, based on a set of training text.
The body of training text is drawn from Twitter, blogs, and news articles. N-gram frequency data and a corresponding dictionary will be generated by processing the raw text. A back-off model will be used to make predictions from input text, which may contain words we do not have in our dictionary.
A Shiny application will be developed so that a user can interact with the model; an overview of the interaction is provided below. In a production environment, this model could be deployed on mobile or other devices to assist with typing, or applied to chatbot technology for natural language interaction.
The application should be responsive to the user (under 1 second response time) and have limited storage needs (under 200 MB).
At a high level, the back-off algorithm looks at a weighted combination of n-gram frequency data to make a prediction. In a bigram or 3-gram prediction model, the conditional probabilities are estimated from the frequency data as follows
\[ P\left( w_2 \vert w_1 \right) = \frac{C(w_1 w_2)}{C(w_1)} \;\text{ and }\; P\left( w_3 \vert w_1 w_2 \right) = \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)} \]
where \(C(w_1)\) is the count of \(w_1\) in the training corpus, and \(C(w_1 w_2)\) is the corresponding count in the 2-gram table. This can be used for prediction where the input text consists of KNOWN dictionary words. Where the input text may contain UNKNOWN or non-dictionary words, the calculation is a weighted average using partial matches on the n-gram frequency data.
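As a concrete illustration, here is a minimal sketch of the back-off lookup, assuming hypothetical frequency tables tri_df (word1, word2, word3, n) and bi_df (word1, word2, n) and a simple fixed back-off weight lambda; the final model may weight partial matches differently.
# sketch only: tri_df and bi_df are hypothetical n-gram frequency tables
library(dplyr)
predict_next <- function(w1, w2, tri_df, bi_df, lambda = 0.4) {
  # exact trigram match: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
  hits <- tri_df %>%
    filter(word1 == w1, word2 == w2) %>%
    mutate(p = n / sum(n))
  if (nrow(hits) > 0) return(head(arrange(hits, desc(p)), 5))
  # back off to bigrams keyed on the last word, discounted by lambda
  bi_df %>%
    filter(word1 == w2) %>%
    mutate(p = lambda * n / sum(n)) %>%
    arrange(desc(p)) %>%
    head(5)
}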
The dictionary is the set of words on which we base our n-gram frequency data. The selection of this set of words drives the size of the storage needed, as well as the usefulness of the application in a real-world context; the only words we can predict are dictionary words. The next section of this report provides an analysis of the provided text to help define the dictionary.
The design approach does NOT include supporting context-specific prediction; that is, the application will not be aware whether the prediction is for Twitter text, blog text, or otherwise. The data may be processed based on source, however the frequency tables will not store the source. This could be a future enhancement.
Because the requirement is to predict the most likely next word, the design does NOT remove stop words when building the n-gram frequency data. Although it is common to exclude these in NLP, in this case we expect them to be the most likely words to drive the prediction. We may consider building the frequency tables without stop words to save space, or somehow limit low-usage stop words.
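As a sketch of that option (assuming the tokenized word_df built later in this report), the tidytext package ships a stop_words data frame that can be removed with an anti join before counting:
# optional: drop stop words before counting, to shrink the frequency tables
library(tidytext)
library(dplyr)
data("stop_words", package = "tidytext")
word_df_nostop <- anti_join(word_df, stop_words, by = "word")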
The user will have a text field to provide the input. When they are ready to invoke the algorithm, they will press a submit button. An alternative is to use a key event or to auto-sense new text entry to invoke the model.
The system will display the five most likely predictions along with their corresponding probabilities. There will be an option to select one of the words and append it to the input, which will re-trigger the prediction.
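A minimal sketch of this interaction in Shiny, assuming a hypothetical predict_top5() function that returns a data frame of (word, probability):
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Enter text:"),
  actionButton("go", "Predict"),
  tableOutput("predictions")
)
server <- function(input, output) {
  # recompute only when the submit button is pressed
  preds <- eventReactive(input$go, predict_top5(input$phrase))
  output$predictions <- renderTable(preds())
}
shinyApp(ui, server)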
Selecting the appropriate dictionary is a key driver of the size of the frequency data required.
There are nearly 600 MB of data provided in the raw data set.
| Source | Line Count | Word Count | Character Count |
|---|---|---|---|
| Blogs | 889,288 | 37,334,114 | 210,160,014 |
| News | 1,010,242 | 34,365,936 | 205,811,889 |
| Twitter | 2,360,148 | 30,359,852 | 167,105,338 |
| Total | 4,269,678 | 102,059,902 | 583,077,241 |
[1] "File List: final/en_US/en_US.twitter.txt"
[2] "File List: final/en_US/en_US.news.txt"
[3] "File List: final/en_US/en_US.blogs.txt"
For this milestone report we do a more in-depth analysis of a 10% sample of the raw data. These analyses will be re-run at higher sampling rates as model development progresses.
# create sample files for this report
z <- sapply(fileList, createSample, p=0.10)
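createSample is a small helper whose implementation is not shown; one possible version keeps each line of a source file with probability p and writes the sample alongside the originals:
# sketch of the sampling helper: keep each line with probability p
createSample <- function(path, p = 0.10, out.dir = root.sample.dir) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep  <- rbinom(length(lines), size = 1, prob = p) == 1
  writeLines(lines[keep], file.path(out.dir, basename(path)))
  sum(keep)  # number of lines kept
}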
We utilize the tm package to help process the raw data: remove numbers, punctuation, and non-keyboard (US) characters, and convert the text to lower case.
# here we clean up the raw data
cleanCorpus <- function(var.df, removeStopwords = FALSE) {
  t.corpus <- Corpus(DataframeSource(var.df))
  # remove unicode / non-keyboard (US) characters
  t.corpus <- tm_map(t.corpus, content_transformer(function(x) gsub("[^\x20-\x7E]", "", x)))
  t.corpus <- tm_map(t.corpus, removePunctuation)
  t.corpus <- tm_map(t.corpus, removeNumbers)
  t.corpus <- tm_map(t.corpus, content_transformer(tolower))
  # optionally drop stop words (not done for the prediction tables; see discussion above)
  if (removeStopwords) {
    t.corpus <- tm_map(t.corpus, removeWords, stopwords("en"))
  }
  t.text <- sapply(t.corpus, as.character)
  data.frame(doc_id = var.df$doc_id,
             text = t.text,
             stringsAsFactors = FALSE)
}
# load method creates df with 'doc_id', 'text' columns; doc_id is one of (blogs, news, twitter)
raw_df <- load_dir_to_df(root.sample.dir)
text_df <- cleanCorpus(raw_df)
The tidytext package has a function unnest_tokens that builds a table of (doc_id, word) pairs, or, when called with the ngrams token option, (doc_id, n-gram) pairs.
#tokenize
word_df <- unnest_tokens(text_df, word, text) #(doc_id,word)
# at this point we could inject a dictionary filter, e.g.
# word_df <- inner_join(word_df, dict_df, by = "word")
freq_df <- count(word_df, word, sort=TRUE) #(word,n)
freq_df %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  top_n(10) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Top 10 Unique Words with Counts",
       x = NULL, y = "count") +
  coord_flip() +
  scale_y_continuous(labels = comma)
## Selecting by n
This table summarizes the subset that is being sampled.
| Source | Line Count | Word Count | Unique Words |
|---|---|---|---|
| Blogs | 89,979 | 3,685,996 | 105,064 |
| News | 7,768 | 258,945 | 26,721 |
| Twitter | 236,249 | 2,946,428 | 106,405 |
| Combined | 333,996 | 6,891,369 | 174,322 |
This graph shows, as a percentage of the sample set, the points at which 50% and 90% of the unique words in the sample are covered: the left vertical line sits at about 32% and the right at about 84%. We expect similar ratios to hold for covering all the unique words in the raw 600 MB of text.
[1] "Estimated size of corpus to cover 50% of words: 32"
[1] "Estimated size of corpus to cover 90% of words: 84"
Most words and n-grams occur only once. The following table summarizes how much of our data occurs only 1, 2, or 3 times, or more than 3 times. This will be helpful as we apply dictionaries to remove non-English words (as well as misspelled words and other nonsense).
| N-gram | Unique Count | Frequency 1 (%) | Frequency 2 (%) | Frequency 3 (%) | Frequency > 3 (%) |
|---|---|---|---|---|---|
| Word | 174,322 | 59.6 | 11.2 | 5.2 | 24.0 |
| 2-Gram | 2,086,420 | 76.6 | 10.0 | 3.9 | 9.5 |
| 3-Gram | 4,876,839 | 89.4 | 5.7 | 1.8 | 3.1 |
| 4-Gram | 6,342,596 | 96.3 | 2.4 | 0.6 | 0.7 |
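The n-gram counts above come from the same unnest_tokens call mentioned earlier, using the ngrams tokenizer; a sketch for the 2-gram case (3-grams and 4-grams are analogous):
library(tidytext)
library(dplyr)
# 2-gram frequency table; token = "ngrams" with n = 3 or 4 gives the larger tables
bigram_df <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
# share of 2-grams that occur only once (76.6% in the table above)
mean(bigram_df$n == 1)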
We have not filtered out non-English/foreign words, misspellings, profanity, or common contractions, nor applied any other dictionaries. The qdapDictionaries package provides the GradyAugmented dictionary of \(122,806\) English words.
Comparing against the \(174,322\) distinct words in our sample, fewer than one third are English dictionary words.
data("GradyAugmented", package="qdapDictionaries")
word_df.dict <- inner_join(word_df,
                           data.frame(word = GradyAugmented, stringsAsFactors = FALSE))
## Joining, by = "word"
dim(distinct(word_df.dict,word))
## [1] 49620 1
The average word length is calculated as a weighted average, since we expect to use it to estimate the storage needed for the n-gram tables, and the most common words will be the most prevalent in those tables.
## [1] 4.367026
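A sketch of that weighted average, weighting each word's length by its count in freq_df:
# average word length, weighted by how often each word occurs
with(freq_df, weighted.mean(nchar(as.character(word)), w = n))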
The size of the data storage needed is approximated by \[ \sum_{n=1}^{4} \text{NG}_n \cdot n \cdot \text{WL} \] where \(\text{NG}_n\) is the number of n-grams stored (the 1-gram count is the number of unique words), and \(\text{WL}\) is the average word length.
Using the tables above, we predict that supporting a \(174,322\)-word dictionary would require around 193 MB.
## [1] 193669140
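That figure follows directly from the formula and the unique counts in the table above; a worked sketch of the arithmetic:
# NG_n * n * WL summed over n = 1..4, in bytes (roughly one byte per character)
ngram_counts <- c(174322, 2086420, 4876839, 6342596)
avg_word_len <- 4.367026
sum(ngram_counts * (1:4) * avg_word_len)  # ~193.7 million, i.e. about 193 MB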
One curious observation is that we never seem to sample the same proportion of news data. This leads me to suspect that the raw data contains a few exceptionally large outlier articles that we are not picking up in our samples.
Run this report on 50% - 100% of the raw data.
Review and incorporate any feedback from this report in the prototype and model design.
Analyze dictionary and spelling options to reduce the size of the dictionary. Review contraction and stemming options to further reduce the size.
Build an initial UI prototype.
Please send suggestions or comments to matt@denvercliffs.com, or provide feedback through Coursera.