The objective is to build a predictive text model for use in proposing and completing words on mobile devices. This document summarizes the steps taken so far and outlines some ideas for proceeding from here.
First, the training dataset was downloaded from the Coursera website. Of the languages provided, the en_US files were inspected.
library(tm)
library(caret)
# Download and unpack the SwiftKey dataset, then inspect line counts,
# word counts, and file sizes of the English files
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
              '~/DSS/Capstone/dataset.zip', method = 'curl')
unzip('~/DSS/Capstone/dataset.zip', exdir = '~/DSS/Capstone/dataset')
system('wc -l ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] " 899288 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"
## [2] " 1010242 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"
## [3] " 2360148 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"
## [4] " 4269678 total"
system('wc -w ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] " 37334690 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"
## [2] " 34372720 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"
## [3] " 30374206 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"
## [4] " 102081616 total"
system('ls -lh ~/DSS/Capstone/dataset/final/en_US/*', intern = T)
## [1] "-rw-r--r-- 1 charlesfloyd staff 200M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.blogs.txt"
## [2] "-rw-r--r-- 1 charlesfloyd staff 196M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.news.txt"
## [3] "-rw-r--r-- 1 charlesfloyd staff 159M Mar 29 05:10 /Users/charlesfloyd/DSS/Capstone/dataset/final/en_US/en_US.twitter.txt"
A few interesting things emerge from the line and word counts of the included files. The blogs file contains the fewest lines but the most words, which is reasonable since blog posts are open-ended. The twitter file, at the other extreme, has the most lines and the fewest words, also reasonable given the rigid 140-character limit per tweet.
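The words-per-line ratios make the contrast concrete; a quick back-of-envelope computed from the wc counts above:
# Average words per line, taken from the wc output above
words <- c(blogs = 37334690, news = 34372720, twitter = 30374206)
lines <- c(blogs =   899288, news =  1010242, twitter =  2360148)
round(words / lines, 1)
##   blogs    news twitter
##    41.5    34.0    12.9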
Because of the large size of the dataset, samples were taken for the exploratory analysis.
# Count the lines in a file with wc
get.file.length <- function(f) {
  file.length <- system(sprintf('cat %s | wc -l', f), intern = T)
  return(as.numeric(file.length))
}
# Partition a file's lines into in-sample and out-of-sample files,
# using caret's createDataPartition to draw a proportion p of the lines
write.sample.files <- function(f, p, max.lines = NULL) {
  f.dir <- dirname(f)
  f.base <- basename(f)
  in.sample.dir <- sprintf('%s/in.sample', f.dir)
  outof.sample.dir <- sprintf('%s/outof.sample', f.dir)
  system(sprintf('mkdir -p %s', in.sample.dir))
  system(sprintf('mkdir -p %s', outof.sample.dir))
  file.length <- get.file.length(f)
  if (!is.null(max.lines)) iter.num <- min(file.length, max.lines)
  else iter.num <- file.length
  file.lines <- readLines(f, n = iter.num)
  in.sample <- file(sprintf('%s/%s', in.sample.dir, f.base))
  outof.sample <- file(sprintf('%s/%s', outof.sample.dir, f.base))
  in.sample.idx <- as.vector(createDataPartition(y = 1:iter.num, p = p, list = F))
  write(file.lines[in.sample.idx], in.sample)
  write(file.lines[-in.sample.idx], outof.sample)
  close(outof.sample)
  close(in.sample)
}
dir.en.us <- '~/DSS/Capstone/dataset/final/en_US/'
files.en.us <- grep(value = T, 'txt$', dir(dir.en.us, full.names = T))
for (f in files.en.us) write.sample.files(f, p = 0.01)
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
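The embedded-nul warnings come from stray null bytes in the raw text. They are harmless here, but if desired, readLines can drop those bytes silently via its skipNul argument, e.g.:
# Optional: drop embedded nul bytes instead of warning about them
file.lines <- readLines(f, n = iter.num, skipNul = TRUE)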
The sampled data was then read in and manipulated using the tm package. Before looking into the differing word characteristics of the blogs, news, and twitter documents, each had its punctuation and extra whitespace removed, was converted to lowercase, and had extremely common stopwords (a, an, the, be, do, etc.) removed. Profanity was also filtered out using a list available online at http://www.cs.cmu.edu/~biglou/resources/
download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt',
              '~/DSS/Capstone/cs.cmu.bad-words.txt', method = 'curl')
# Lowercase and strip punctuation so entries match the normalized corpus
normalize.text <- function(t) tolower(removePunctuation(t))
# Combine the profanity list with tm's English stopwords,
# normalized, deduplicated, and with empty strings dropped
profane.and.stopwords <- function() {
  profanewords <- readLines(con = '~/DSS/Capstone/cs.cmu.bad-words.txt')
  words <- c(profanewords, stopwords(kind = 'en'))
  words <- unique(as.vector(sapply(words, normalize.text)))
  return(words[words != ''])
}
# Build a corpus from a directory of text files and normalize it:
# strip punctuation, lowercase, remove profanity/stopwords, collapse whitespace
get.corpus <- function(dirsource) {
  corpus <- Corpus(DirSource(dirsource),
                   readerControl = list(language = 'en', reader = readPlain))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, profane.and.stopwords())
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
corpus <- get.corpus('~/DSS/Capstone/dataset/final/en_US/in.sample/')
Some questions of interest: what are the most common words in each document, and how frequent are they? It would make sense for the blogs and twitter documents to be more similar to each other than to the news document, since blogs and twitter are both social outlets for a wide range of users, while news is narrower in both scope and authorship. Is that the case? A document-term matrix is used to investigate.
dtm <- DocumentTermMatrix(corpus, control = list(stopwords = F))
# Bar chart of the terms in document dimx occurring at least min.freq times
plot.frequent.terms <- function(dtm, dimx, min.freq) {
  freq.terms <- findFreqTerms(dtm[dimx, ], min.freq)
  freq.mat <- inspect(dtm[dimx, freq.terms])
  names(freq.mat) <- colnames(freq.mat)
  print(barchart(sort(freq.mat, decreasing = T),
                 main = dimx, xlab = 'term freq'))
}
plot.frequent.terms(dtm, 'en_US.blogs.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 17)>>
## Non-/sparse entries: 17/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Terms
## Docs also back can day first get good just know like new now
## en_US.blogs.txt 545 544 977 531 502 659 511 951 613 933 561 596
## Terms
## Docs one people see time will
## en_US.blogs.txt 1249 624 523 900 1053
plot.frequent.terms(dtm, 'en_US.twitter.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 24)>>
## Non-/sparse entries: 24/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Terms
## Docs back can day get going good got great just know like
## en_US.twitter.txt 569 918 845 1131 551 1022 597 785 1528 776 1202
## Terms
## Docs lol love need new now one people see thanks think time
## en_US.twitter.txt 670 1010 504 637 806 837 545 635 913 534 746
## Terms
## Docs today will
## en_US.twitter.txt 664 974
plot.frequent.terms(dtm, 'en_US.news.txt', 500)
## <<DocumentTermMatrix (documents: 1, terms: 14)>>
## Non-/sparse entries: 14/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Terms
## Docs also can first just last new one people said time two
## en_US.news.txt 552 566 531 549 524 712 852 508 2538 515 565
## Terms
## Docs will year years
## en_US.news.txt 1103 526 535
A feature that stands out in the twitter and blog word counts is the appearance of subjective words, “like” and “good” (as well as “love” and “great” for twitter), which are absent from the frequent news words. And the most frequent word in the news data is “said,” in line with the reporting in news stories, which often includes direct quotes and attributed sources.
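To put a number on the blogs/twitter/news comparison, one option is pairwise cosine similarity between the rows of the document-term matrix. A minimal sketch (not yet part of the analysis above; cosine.similarity is an illustrative helper):
# Pairwise cosine similarity between document vectors of the DTM
cosine.similarity <- function(dtm) {
  m <- as.matrix(dtm)
  m <- m / sqrt(rowSums(m ^ 2))  # scale each document vector to unit length
  m %*% t(m)                     # dot products of unit vectors
}
round(cosine.similarity(dtm), 2)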
From here, the next step is to explore the frequencies of word pairs and triples in the data, and to begin building and testing models that predict the next word from the one or two words preceding it (see the sketch below). Additional ideas include looking for groups of related words that often appear near, but not next to, each other, and caching a user’s recently and frequently used words, especially long words and proper nouns that a model trained on more universal data might otherwise struggle to predict.
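As a preview of that next step, bigram counts can be pulled straight from a sampled file with base R. A rough sketch (tokenization here is a crude split on non-letters; the real model would reuse the tm normalization above):
# Rough bigram-frequency sketch over one sampled file
# (line boundaries are ignored for simplicity)
lines <- readLines('~/DSS/Capstone/dataset/final/en_US/in.sample/en_US.twitter.txt')
words <- unlist(strsplit(tolower(lines), '[^a-z]+'))
words <- words[words != '']
bigrams <- paste(head(words, -1), tail(words, -1))
head(sort(table(bigrams), decreasing = T), 10)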