Objective

The objective is to build a predictive text model for suggesting and completing words as a user types on a mobile device. This document summarizes the steps taken so far and outlines some ideas for proceeding from here.

Getting the Data

First, the training dataset was downloaded from the Coursera website. Of the languages provided, the en_US files were inspected.

library(tm)     # text mining: Corpus, tm_map, DocumentTermMatrix
library(caret)  # createDataPartition, used below for sampling

# Download and unpack the Coursera SwiftKey dataset, then inspect the en_US
# files: line counts, word counts, and file sizes.
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
              '~/DSS/Capstone/dataset.zip', method = 'curl')
unzip('~/DSS/Capstone/dataset.zip', exdir = '~/DSS/Capstone/dataset')
system('wc -l ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] "  899288 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"  
## [2] " 1010242 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"   
## [3] " 2360148 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"
## [4] " 4269678 total"
system('wc -w ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] " 37334690 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"  
## [2] " 34372720 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"   
## [3] " 30374206 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"
## [4] " 102081616 total"
system('ls -lh ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] "-rw-r--r--  1 charlesfloyd  staff   200M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"  
## [2] "-rw-r--r--  1 charlesfloyd  staff   196M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"   
## [3] "-rw-r--r--  1 charlesfloyd  staff   159M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"

Comparing the line and word counts of the included files turns up a few interesting points. The blogs file has the fewest lines but the most words, which is reasonable since blog posts are open ended. The twitter file, at the other extreme, has the most lines and the fewest words, also reasonable given the rigid 140-character limit per tweet.
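
As a rough sanity check on this (using only the wc totals reported above), the implied average words per line work out to roughly 42 for blogs, 34 for news, and 13 for twitter:

# Average words per line, computed from the wc totals shown above.
words.per.line <- c(blogs   = 37334690 /  899288,
                    news    = 34372720 / 1010242,
                    twitter = 30374206 / 2360148)
round(words.per.line, 1)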

Exploratory Analysis

Because of the large size of the dataset, samples were taken for the exploratory analysis.

# Count the lines in a file via wc -l.
get.file.length <- function(f) {
    file.length <- system(intern = T, sprintf('cat %s | wc -l', f))
    return(as.numeric(file.length))
}

# Split a file into in-sample and out-of-sample pieces, written to in.sample/
# and outof.sample/ subdirectories. p is the fraction of lines kept in sample;
# max.lines optionally caps how many lines are read.
write.sample.files <- function(f, p, max.lines = NULL) {
    f.dir <- dirname(f)
    f.base <- basename(f)
    in.sample.dir <- sprintf('%s/in.sample', f.dir)
    outof.sample.dir <- sprintf('%s/outof.sample', f.dir)
    system(sprintf('mkdir -p %s', in.sample.dir))
    system(sprintf('mkdir -p %s', outof.sample.dir))
    file.length <- get.file.length(f)
    if (!is.null(max.lines)) iter.num <- min(file.length, max.lines)
    else iter.num <- file.length
    file.lines <- readLines(f, n = iter.num)
    in.sample <- file(sprintf('%s/%s', in.sample.dir, f.base))
    outof.sample <- file(sprintf('%s/%s', outof.sample.dir, f.base))
    # caret's createDataPartition chooses which line indices go in sample.
    inSample <- createDataPartition(y = 1:iter.num, p = p, list = F)
    inSample <- as.vector(inSample)
    write(file.lines[inSample], in.sample)
    write(file.lines[-inSample], outof.sample)
    close(outof.sample)
    close(in.sample)
}

# Write a 1% in-sample split of each en_US text file.
dir.en.us <- '~/DSS/Capstone/dataset/final/en_US/'
files.en.us <- grep(value = T, 'txt$', dir(dir.en.us, full.names = T))
for (f in files.en.us) write.sample.files(f, p = 0.01)
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul

The sampled data was then read in and processed with the tm package. Before comparing the word characteristics of the blogs, news, and twitter documents, each was converted to lowercase and had punctuation, extra whitespace, and extremely common stopwords (a, an, the, be, do, etc.) removed. Profanity was also filtered out using an online list available at http://www.cs.cmu.edu/~biglou/resources/.

download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt',
              '~/DSS/Capstone/cs.cmu.bad-words.txt', method = 'curl')

# Lowercase and strip punctuation so the filter list matches corpus tokens.
normalize.text <- function (t) tolower(removePunctuation(t))

# Combine the CMU profanity list with tm's English stopwords.
profane.and.stopwords <- function () {
    profanewords <- readLines(con = '~/DSS/Capstone/cs.cmu.bad-words.txt')
    profane.and.stopwords <- c(profanewords, stopwords(kind = 'en'))
    profane.and.stopwords <-
        unique(as.vector(sapply(profane.and.stopwords, normalize.text)))
    profane.and.stopwords <-
        profane.and.stopwords[profane.and.stopwords != '']
    return(profane.and.stopwords)
}

# Build a corpus from a directory of text files and apply the cleaning steps:
# strip punctuation, lowercase, drop profanity and stopwords, collapse whitespace.
get.corpus <- function(dirsource) {
    corpus <-
        Corpus(DirSource(dirsource),
               readerControl =
                   list(language = 'en', reader = readPlain))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, profane.and.stopwords())
    corpus <- tm_map(corpus, stripWhitespace)
    return(corpus)
}
corpus <- get.corpus('~/DSS/Capstone/dataset/final/en_US/in.sample/')

Some questions of interest: what are the most common words in each document, and what are their frequencies? It would make sense for the blogs and twitter documents to be more similar to each other than to the news document, since blogs and twitter are both social outlets for a wide range of users, while news is narrower in both scope and authorship. Is that the case? A document-term matrix is used to investigate.

dtm <- DocumentTermMatrix(corpus, control = list(stopwords = F))

# Bar chart of the terms appearing at least min.freq times in document dimx.
plot.frequent.terms <- function(dtm, dimx, min.freq) {
    freq.terms <- findFreqTerms(dtm[dimx, ], min.freq)
    freq.mat <- inspect(dtm[dimx, freq.terms])
    names(freq.mat) <- colnames(freq.mat)
    print(barchart(sort(freq.mat, decreasing = T),
                   main = dimx, xlab = 'term freq'))
}
plot.frequent.terms(dtm, 'en_US.blogs.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 17)>>
## Non-/sparse entries: 17/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##                  Terms
## Docs              also back can day first get good just know like new now
##   en_US.blogs.txt  545  544 977 531   502 659  511  951  613  933 561 596
##                  Terms
## Docs               one people see time will
##   en_US.blogs.txt 1249    624 523  900 1053

[Bar chart: frequent terms in en_US.blogs.txt, sorted by term frequency]

plot.frequent.terms(dtm, 'en_US.twitter.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 24)>>
## Non-/sparse entries: 24/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##                    Terms
## Docs                back can day  get going good got great just know like
##   en_US.twitter.txt  569 918 845 1131   551 1022 597   785 1528  776 1202
##                    Terms
## Docs                lol love need new now one people see thanks think time
##   en_US.twitter.txt 670 1010  504 637 806 837    545 635    913   534  746
##                    Terms
## Docs                today will
##   en_US.twitter.txt   664  974

[Bar chart: frequent terms in en_US.twitter.txt, sorted by term frequency]

plot.frequent.terms(dtm, 'en_US.news.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 14)>>
## Non-/sparse entries: 14/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##                 Terms
## Docs             also can first just last new one people said time two
##   en_US.news.txt  552 566   531  549  524 712 852    508 2538  515 565
##                 Terms
## Docs             will year years
##   en_US.news.txt 1103  526   535

[Bar chart: frequent terms in en_US.news.txt, sorted by term frequency]

A feature that stands out in the twitter and blog word counts is the appearance of subjective words, “like” and “good” (as well as “love” and “great” for twitter), which are absent from the frequent news words. The most frequent word in the news data is “said,” consistent with the reporting in news stories, which often includes direct quotes and attributed sources.
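
This observation can be checked directly by comparing the frequent-term sets of the three documents. The following is a small sketch reusing the dtm and the 500-occurrence threshold from above:

# Frequent terms (at least 500 occurrences in the sample) for each document.
freq.blogs   <- findFreqTerms(dtm['en_US.blogs.txt', ],   500)
freq.news    <- findFreqTerms(dtm['en_US.news.txt', ],    500)
freq.twitter <- findFreqTerms(dtm['en_US.twitter.txt', ], 500)

# Frequent in both blogs and twitter but not in news ('like', 'good', ...).
setdiff(intersect(freq.blogs, freq.twitter), freq.news)
# Frequent in news only ('said', among others).
setdiff(freq.news, union(freq.blogs, freq.twitter))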

Next Steps

From here, the next step is to explore the frequencies of word pairs and triples in the data, and to begin building and testing models that predict the next word from the one or two words preceding it. Additional ideas include looking for groups of related words that often appear near, but not next to, each other, and caching a user’s recently and frequently used words, especially long words and proper nouns that might otherwise be difficult for a model trained on more general data to predict.
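
As a starting point, here is a minimal sketch of counting adjacent word pairs in one of the sampled files, using base R on the raw sampled text (the path assumes the in.sample directory created above; a triples version follows the same pattern with one more offset):

# Count word pairs (bigrams) in a sampled file and show the most common ones.
# Pairs are counted across the whole token stream, ignoring line boundaries,
# which is good enough for a first look at the data.
count.word.pairs <- function(f, top = 20) {
    text  <- tolower(readLines(f))
    words <- unlist(strsplit(gsub('[[:punct:]]', '', text), '[[:space:]]+'))
    words <- words[words != '']
    pairs <- paste(head(words, -1), tail(words, -1))
    head(sort(table(pairs), decreasing = T), top)
}
count.word.pairs('~/DSS/Capstone/dataset/final/en_US/in.sample/en_US.twitter.txt')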