This is a progress update on building a next-word predictive model.

Input Data Exploration and Assessment

The input training data was downloaded from the Coursera Swiftkey zip and extracted.

pwd
ls -lh final/en_US/
echo "Count lines"
wc -l final/en_US/en_US.blogs.txt
wc -l final/en_US/en_US.news.txt
wc -l final/en_US/en_US.twitter.txt 
echo "Count words"
wc -w final/en_US/en_US.blogs.txt
wc -w final/en_US/en_US.news.txt
wc -w final/en_US/en_US.twitter.txt 
## /Users/grudy/dev/datasciencecoursera/capstone
## total 1184280
## -rw-r--r--  1 grudy  staff    22M Dec 19 20:03 blogs-1k.txt
## -rw-r--r--  1 grudy  staff   200M Jul 22  2014 en_US.blogs.txt
## -rw-r--r--  1 grudy  staff   196M Jul 22  2014 en_US.news.txt
## -rw-r--r--  1 grudy  staff   159M Jul 22  2014 en_US.twitter.txt
## -rw-r--r--  1 grudy  staff   3.5K Dec 26 15:00 filtered_words.txt
## Count lines
##   899288 final/en_US/en_US.blogs.txt
##  1010242 final/en_US/en_US.news.txt
##  2360148 final/en_US/en_US.twitter.txt
## Count words
##  37334690 final/en_US/en_US.blogs.txt
##  34372720 final/en_US/en_US.news.txt
##  30374206 final/en_US/en_US.twitter.txt

The three training files range from 159MB to 200MB in size, from roughly 0.9M to 2.4M lines of text, and from 30M to 37M words.

Let's look for the longest line in each of these files to get a sense of their structure.

library(quanteda)
setwd('~/dev/datasciencecoursera/capstone/final/en_US')
for (file in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) 
{
    longestLine = 0
    con <- file(file, "r")
    # Scan the file line by line, tracking the longest line seen so far
    while (length(l <- readLines(con, n = 1, warn = FALSE)) > 0) {
        longestLine = max(nchar(l), longestLine)
    }
    print(c("File"=file, 'Longest Line Size'=longestLine))
    close(con)
}
##              File Longest Line Size 
## "en_US.blogs.txt"           "40833" 
##              File Longest Line Size 
##  "en_US.news.txt"           "11384" 
##                File   Longest Line Size 
## "en_US.twitter.txt"               "140"

Using the quanteda library, we tokenize each file with its built-in options to remove numbers and drop punctuation. We lowercase each word, filter out profanity using a pre-built list of filtered words, and count word frequencies with a hash.

library(quanteda)
library(hash)
setwd('~/dev/datasciencecoursera/capstone/final/en_US')
con <- file("filtered_words.txt", "r")
filtered_lines <- readLines(con, warn=FALSE)
filtered = hash(filtered_lines, 1:length(filtered_lines))
close(con)

# Word and profanity counts accumulate across all three files
filteredWords = 0
wordHash = hash()
for (file in c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")) 
{
    con <- file(file, "r")
    # Read each file in 1000-line blocks and tokenize, dropping punctuation and numbers
    while (length(l <- readLines(con, n = 1000, warn = FALSE)) > 0) {
        words <- tokenize(l, what="fasterword", removeTwitter = TRUE, removePunct = TRUE, removeNumbers = TRUE)
        for(w in words[[1]]) { 
            w <- toLower(w)
            if(!is.null(filtered[[w]])) {
                filteredWords = filteredWords + 1
            } else {
                f <- wordHash[[w]]
                if(is.null(f))
                    wordHash[[w]] = 1
                else
                    wordHash[[w]] = f + 1
            }
        }
    }
    print(c("File"=file, '# unique words'=length(wordHash), '# filtered words'=filteredWords))
    # Look at two words in particular as a sentinel (and to answer a quiz question)
    print(c("#Loves"=wordHash[["love"]], '#Hates'= wordHash[["hate"]], 'Love/Hate'=wordHash[["love"]]/wordHash[["hate"]]))
    close(con)
}
##                File      # unique words    # filtered words 
## "en_US.twitter.txt"              "6301"               "123" 
##    #Loves    #Hates Love/Hate 
##     108.0      20.0       5.4 
##              File    # unique words  # filtered words 
## "en_US.blogs.txt"           "10930"             "158" 
##     #Loves     #Hates  Love/Hate 
## 157.000000  22.000000   7.136364 
##             File   # unique words # filtered words 
## "en_US.news.txt"          "15039"            "166" 
##     #Loves     #Hates  Love/Hate 
## 166.000000  22.000000   7.545455

Let's plot the frequency of these words to get a sense of the coverage of the top N words.

library(dplyr)
library(ggplot2)
# Convert the word-count hash into a data frame ranked by descending frequency
l <- as.list(wordHash)
df <- data.frame(words=names(l), frequencies=unlist(l)) %>% arrange(desc(frequencies))
qplot(1:length(frequencies), log10(frequencies), data=df,  main = "Ranked frequencies of observed words")

It's clear there is a very stark drop-off: the majority of words are singletons (they occur only once), and only 1075 words occur more than 10 times.

head(df, n=25)
##    words frequencies
## 1    the        4765
## 2     to        2765
## 3      a        2429
## 4    and        2333
## 5     of        1976
## 6     in        1679
## 7      i        1624
## 8    for        1055
## 9     is        1052
## 10  that        1046
## 11   you         923
## 12    it         905
## 13    on         790
## 14  with         716
## 15    my         642
## 16   was         607
## 17    be         600
## 18    at         578
## 19  this         571
## 20    as         569
## 21   are         509
## 22  have         498
## 23   but         484
## 24    he         429
## 25    we         401
# How many words have frequency over 1000, 100, 10 and 1
nrow(subset(df, frequencies > 1000))
## [1] 10
nrow(subset(df, frequencies > 100))
## [1] 121
nrow(subset(df, frequencies > 10))
## [1] 1075
nrow(subset(df, frequencies > 1))
## [1] 6286
nrow(df)
## [1] 15039
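
To make the coverage claim concrete, a short calculation like the following (not run here) would report how many of the ranked words in df are needed to account for 50% and 90% of all observed word instances:

# Cumulative share of all word instances covered by the top-N ranked words
coverage <- cumsum(df$frequencies) / sum(df$frequencies)
# Smallest N covering 50% and 90% of the observed word instances
min(which(coverage >= 0.5))
min(which(coverage >= 0.9))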

Strategy for Model Building

To build a predictive model for choosing the next word for a given input phrase, the following model building strategy will be employed:

  1. All 2-grams are extracted from the corpus.
  2. The frequency of each 2-gram is counted.
  3. For each first word in the 2-gram list, the most common second word is determined from the frequency data. Only that first word and its predicted following word are saved into an index model.

These steps are then repeated with 3-grams, 4-grams, etc., as resources allow. A 3-gram training pass saves which third word is predicted for each observed 2-gram prefix, and so forth; a sketch of the 2-gram case follows.
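
As an illustration of steps 1-3, the sketch below builds a 2-gram index with base R string splitting and dplyr; build_bigram_index is a hypothetical helper name, and simple whitespace tokenization stands in for the quanteda tokenization used earlier.

library(dplyr)

# Hypothetical helper: map each observed first word to its most frequent second word
build_bigram_index <- function(lines) {
    toks <- strsplit(tolower(lines), "\\s+")
    # Step 1: form adjacent word pairs (2-grams) within each line
    pairs <- lapply(toks, function(w) {
        if (length(w) < 2) return(NULL)
        data.frame(first = w[-length(w)], second = w[-1], stringsAsFactors = FALSE)
    })
    bind_rows(pairs) %>%
        count(first, second) %>%    # Step 2: count each distinct 2-gram
        group_by(first) %>%         # Step 3: keep only the most common
        arrange(desc(n)) %>%        #         second word per first word
        slice(1) %>%
        ungroup()
}

# Toy example
build_bigram_index(c("the cat sat on the mat", "the cat ran"))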

The algorithm for applying the model will be:

  1. Use as long a suffix of the input as is available and as large as our pre-built model supports, say a three-word suffix that predicts the 4th word.

  2. Look up the 3-word input suffix in the index. If it is present, use the computed 4th word as the prediction.

  3. If the 3-word input suffix is not in the index, back off and look for the 2-word input suffix, then a 1-word input suffix.

  4. If even the last input word is not in the model or there is no user input, predict the word “the”, as it is the most common word.
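
A minimal sketch of this backoff lookup, assuming the n-gram indexes are stored as hash objects keyed by the space-joined prefix (predict_next and the models structure below are illustrative, not the index built above):

library(hash)

# Illustrative backoff lookup: models[[k]] is assumed to map a k-word
# prefix (words joined by single spaces) to its predicted next word.
predict_next <- function(input, models, default = "the") {
    words <- tolower(unlist(strsplit(input, "\\s+")))
    # Try the longest suffix the model covers, then back off to shorter ones
    for (k in rev(seq_len(min(length(words), length(models))))) {
        key <- paste(tail(words, k), collapse = " ")
        hit <- models[[k]][[key]]
        if (!is.null(hit)) return(hit)
    }
    default    # no suffix matched (or no input): predict the most common word
}

# Toy example with a 1-gram and a 2-gram index
models <- list(hash("happy" = "birthday", "the" = "end"),
               hash("at the" = "end"))
predict_next("we met at the", models)      # matched by the 2-word suffix
predict_next("something unseen", models)   # backs off to the default "the"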

The performance of this model can be evaluated by extracting 4-word windows from known text and predicting the 4th word from the preceding 3-word suffix, or from as large a suffix as we have built an index for.
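
A rough sketch of that evaluation, reusing the hypothetical predict_next() above (evaluate_model is likewise an illustrative name):

# Slide a 4-word window over held-out lines, predict the 4th word from the
# preceding 3 words, and report the fraction predicted correctly
evaluate_model <- function(lines, models, n = 4) {
    correct <- 0
    total <- 0
    for (l in lines) {
        words <- tolower(unlist(strsplit(l, "\\s+")))
        if (length(words) < n) next
        for (i in seq_len(length(words) - n + 1)) {
            prefix <- paste(words[i:(i + n - 2)], collapse = " ")
            target <- words[i + n - 1]
            total <- total + 1
            if (identical(predict_next(prefix, models), target))
                correct <- correct + 1
        }
    }
    correct / total
}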