Summary

This document summarizes the progress I have made so far towards the goal of creating a model that will predict the next word a user intends to type, knowing only what they have typed previously. Towards this end, SwiftKey has provided data sets from various sources. The goal is to use this data to train a model that makes the necessary predictions.

Training Data Description

The data provided by SwiftKey contains directories with data for four different languages (de_DE, en_US, fi_FI, and ru_RU). Since I’m most familiar with English, my goal is to create a model that is good at predicting the next English word, so I’ll be focusing my efforts on the data contained in the en_US folder.

The en_US folder consists of the files listed in the following table:

File Name            Size (MB)   No. Lines
en_US.blogs.txt          210.2      899288
en_US.news.txt           205.8     1010242
en_US.twitter.txt        167.1     2360148

Preprocessing the Data

Preprocessing the data consists of several steps. First, a random subset of the data is sampled and resaved into new files to keep the processing and memory requirements manageable. Second, the sampled data is loaded and subsequently cleaned.

Subsetting the Data

I chose to subset the data by going through each data file and picking a fixed number of lines at random to analyze. The resulting data is then resaved in a new directory (samples), preserving the file names of the files the data originated from.

The following R code performs this step and creates a sample file containing 10000 randomly sampled lines for each input file:

SAMPLE_DIR <- "samples/"
INPUT_DIR <- "final/en_US/"
TWITTER_FILE_INPUT <- "en_US.twitter.txt"
BLOGS_FILE_INPUT <- "en_US.blogs.txt"
NEWS_FILE_INPUT <- "en_US.news.txt"

createSampleData <- function(numLines)
{    
    #Split data sets into sample files to analyze
    createSampleFile <- function(input, numSamples)
    {    
        # Read the raw file, normalize the encoding, and drop lines that fail conversion
        lines <- readLines(paste0(INPUT_DIR, input), encoding = "UTF-8")
        lines <- iconv(lines, to = "UTF-8")
        lines <- lines[!is.na(lines)]
        samples <- sample(1:length(lines), size = numSamples)
        writeLines(lines[samples], con = paste0(SAMPLE_DIR, input))
    }
    
    if (file.exists(SAMPLE_DIR))
    {
        unlink(SAMPLE_DIR, recursive = TRUE)
    }

    dir.create(SAMPLE_DIR)
    
    set.seed(3141)
    createSampleFile(TWITTER_FILE_INPUT, numLines)
    createSampleFile(BLOGS_FILE_INPUT, numLines)
    createSampleFile(NEWS_FILE_INPUT, numLines)
}

if (!file.exists(SAMPLE_DIR))
{
    createSampleData(10000)
}

In the final capstone project I plan to split the data into training, testing, and validation subsets.
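
A minimal sketch of how that split might look, assuming the sampled lines have already been read into a character vector called sampleLines (the name and the 60/20/20 proportions are placeholders, not final decisions):

# Randomly assign each sampled line to one of three subsets (60/20/20 split)
set.seed(3141)
splitIndex <- sample(1:3, size = length(sampleLines), replace = TRUE,
                     prob = c(0.6, 0.2, 0.2))

trainingSet   <- sampleLines[splitIndex == 1]
testingSet    <- sampleLines[splitIndex == 2]
validationSet <- sampleLines[splitIndex == 3]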

Loading and Cleaning the Data

Cleaning the data involves normalizing the text: converting everything to lower case and removing unwanted artifacts such as punctuation, numbers, and extra whitespace. The purpose of the exercise is to make the data as uniform as possible. The following R code reads the contents of the “samples” directory into a text corpus and applies these transformations:

library(tm)
library(RWeka)  # provides NGramTokenizer, used for the n-gram models below

ovid <- Corpus(DirSource(SAMPLE_DIR), readerControl = list(reader = readPlain, 
                                                           language = "en", 
                                                           load = TRUE))

ovid <- tm_map(ovid, PlainTextDocument)
ovid <- tm_map(ovid, tolower)
ovid <- tm_map(ovid, removePunctuation)
ovid <- tm_map(ovid, stripWhitespace)
ovid <- tm_map(ovid, removeNumbers)

summary(ovid)
## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID

Please note that I didn’t stem the text or throw away uncommon words or stopwords in the previous step. We are trying to predict the exact word the user is going to type next, so this information is important. Removing words from the corpus would disrupt the sentence structure and prevent us from predicting as accurately as we otherwise might.
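
To illustrate the point, here is roughly what stopword removal would do to an ordinary sentence (a toy example using tm’s built-in English stopword list, not data from the corpus):

library(tm)
# Stripping stopwords removes exactly the short, frequent words that a
# next-word predictor needs to see in context
removeWords("one of the best days at the beach", stopwords("english"))
# the connecting words ("of", "the", "at") disappear, breaking up the word sequence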

Exploratory Analysis

I’ll start by using my document corpus to create n-grams of length 1, 2, and 3 and then take a closer look at what I get. An n-gram is a short fragment of a sentence with a particular predefined length. So an n-gram of length 2 (a bigram) consists of all the consecutive word pairs in a sentence, and so on.

This can be done in R by creating TermDocumentMatrix objects with tokenizers that are in charge of creating the n-grams.
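
As a quick illustration of what such a tokenizer produces, this is how RWeka’s NGramTokenizer splits a made-up sentence into bigrams (the sentence is just an example, not taken from the corpus):

library(RWeka)
# Every pair of consecutive words becomes one term:
# "thanks for", "for the", "the follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))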

tdm <- TermDocumentMatrix(ovid)
tdm <- removeSparseTerms(tdm, 0.5)

tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_ngram2 <- TermDocumentMatrix(ovid, control = list(tokenize = tokenizer))
tdm_ngram2 <- removeSparseTerms(tdm_ngram2, 0.6)

tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_ngram3 <- TermDocumentMatrix(ovid, control = list(tokenize = tokenizer))
tdm_ngram3 <- removeSparseTerms(tdm_ngram3, 0.7)

The function “removeSparseTerms” drops terms that appear in too few documents, which eliminates n-grams built from very uncommon words. The data structure that results from performing the above operations looks like this:

inspect(tdm_ngram3[1:3,])
## A term-document matrix (3 terms, 3 documents)
## 
## Non-/sparse entries: 3/6
## Sparsity           : 67%
## Maximal term length: 13 
## Weighting          : term frequency (tf)
## 
##                     Docs
## Terms                [,1] [,2] [,3]
##   \u0096 a showcase     0    1    0
##   \u0097 an official    0    1    0
##   \u0097 five men       0    1    0

Basically, the TermDocumentMatrix consists of a list of terms (the n-grams) and the number of times each term was encountered in each of the aforementioned documents.

Relative Term Frequencies

Let’s start by plotting the frequencies of the terms that occur most often. For this plot we’ll count up the number of times each term is found and then plot those counts so we can see how the terms vary with respect to each other.

termSums <- t(apply(tdm, 1, sum))
termSums <- sort(termSums[1,], decreasing=TRUE)[1:10]
termSums
##  the  and  for that  you with  was this  but have 
## 4348 2250  963  899  661  633  593  506  447  446
plot(termSums, main="Most common word Frequencies")

termSums <- t(apply(tdm_ngram2, 1, sum))
termSums <- sort(termSums[1,], decreasing=TRUE)[1:10]
termSums
##   in the   of the   to the  for the   on the    to be  and the   at the 
##      400      388      181      175      155      152      129      113 
##     in a with the 
##      100       99
plot(termSums, main="Most common bigram frequencies")

termSums <- t(apply(tdm_ngram3, 1, sum))
termSums <- sort(termSums[1,], decreasing=TRUE)[1:10]
termSums
##  one of the    a lot of  as well as  be able to going to be i dont know 
##          37          29          20          19          14          14 
##   this is a     to be a a couple of  it was the 
##          14          14          13          13
plot(termSums, main="Most common trigram Frequencies")

Generally, the larger the number of words in an n-gram, the lower the probability of finding that exact sequence of words, so the counts of the top ten terms drop every time we add another word to the n-gram model. It’s interesting to note that the frequencies of the top ten terms seem to drop by roughly a factor of 10 for each additional word added to the n-gram. The implication of the above data is that the more words in your n-grams, the more accurately you should be able to predict the next word.

It’s also interesting to note that the most common n-grams tend to be made up of the 10 most common words (the, for, …).

Overall pretty obvious results…

Final Capstone Project

For the final capstone project I plan to use the n-gram models to compute the probability of the next word given the previous n-1 words. If I construct a lookup table (aggressively dropping uncommon terms) I should be able to predict the next word fairly quickly. At least this is how I plan to start…
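
A rough sketch of how that lookup might work, reusing the trigram matrix from above (the predictNextWord helper and the data frame layout are placeholders for illustration, not the final design):

# Collapse the trigram counts across documents and split each trigram into a
# two-word history and the word that followed it
trigramCounts <- sort(apply(tdm_ngram3, 1, sum), decreasing = TRUE)
parts <- strsplit(names(trigramCounts), " ")
lookup <- data.frame(history    = sapply(parts, function(w) paste(w[1], w[2])),
                     prediction = sapply(parts, function(w) w[3]),
                     count      = as.numeric(trigramCounts),
                     stringsAsFactors = FALSE)

# Hypothetical helper: return the word that most often followed the last
# two words typed, or NA if that history never appeared in the sample
predictNextWord <- function(lastTwoWords) {
    matches <- lookup[lookup$history == lastTwoWords, ]
    if (nrow(matches) == 0) return(NA)
    matches$prediction[which.max(matches$count)]
}

predictNextWord("one of")  # should suggest "the" given the counts above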

I’m wondering if I could increase the accuracy of such a model by keeping some of the punctuation. Commas especially seem like they would provide an important clue as to what word might come next. Also, counting the beginning of a sentence as a “word” might help predictions as well, since words like “The”, “I”, etc. seem to be the most common start words.
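
A cheap way to experiment with that last idea (just a sketch; the token name is arbitrary and sampleLines stands for the vector of sampled lines) would be to prepend an artificial marker to every line before building the corpus:

# Prepend a made-up start-of-sentence token so it shows up in the n-gram counts
sampleLines <- paste("sentencestarttoken", sampleLines)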

For the Shiny app I plan to build a small app that lets you type in some text. The app will then try to guess what the next word will be based on the sentence fragment you started.
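
A bare-bones sketch of what that Shiny app could look like (guessNextWord here is just a placeholder standing in for whatever model the final project ends up using):

library(shiny)

# Placeholder prediction function; the real app would call the n-gram model here
guessNextWord <- function(text) {
    if (nchar(trimws(text)) == 0) return("")
    "the"  # dummy guess
}

ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("userText", "Type some text:"),
    textOutput("nextWord")
)

server <- function(input, output) {
    output$nextWord <- renderText({
        paste("Predicted next word:", guessNextWord(input$userText))
    })
}

shinyApp(ui = ui, server = server)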