Milestone Report

Introduction

The goal of this project is to construct a predictive text model that can supply text predictions based on previous words typed. To determine word frequencies and understand the associations between words, I performed an exploratory analysis of a database of English text from blogs, news articles, and Twitter.

Exploratory Analysis

Line Counts

The dataset contained about 900,000 lines of text from blogs, over a million lines of text from news articles, and over two million lines of text from Twitter (Table 1).

        Line Count
blogs       899288
news       1010242
twitter    2360148

Table 1. Line counts for files in the text dataset.
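
As a rough illustration of how such line counts can be obtained without extra packages, the sketch below counts lines by reading a file in chunks with base R. The function name count_lines, the chunk size, and the temporary file are assumptions made for this example; the counts in Table 1 were produced with countLines from the R.utils package (see the appendix).

```r
# Count the lines in a text file by reading it in chunks (base R only).
# count_lines and chunk_size are illustrative names, not part of the analysis code.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
count_lines(tmp)  # 3
```

Reading in chunks keeps memory use bounded, which matters for files with millions of lines.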

Sampling

I randomly selected 1% of lines from each file to provide a sample dataset on which to perform further analyses. These lines were written to a separate set of files for further processing.

        Line Count
blogs         8992
news         10102
twitter      23601

Table 2. Line counts for files in the sample dataset.
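
The sampling step can be sketched in base R as follows. This is a simplified stand-in for the sample_lines call in the appendix, not the code that produced Table 2: it assumes the lines are already in memory, and sample_proportion and toy_lines are hypothetical names. Rounding the sample size down is consistent with the counts in Table 2 (for example, 1% of 899,288 lines gives 8,992).

```r
# Draw a simple random sample containing a fixed proportion of lines.
# sample_proportion is a hypothetical helper; toy_lines is made-up data.
sample_proportion <- function(lines, prop) {
  n <- floor(prop * length(lines))
  # sort the sampled indices so the sample keeps the original line order
  lines[sort(sample.int(length(lines), n))]
}

set.seed(2)
toy_lines <- paste("line", 1:1000)
toy_sample <- sample_proportion(toy_lines, 0.01)
length(toy_sample)  # 10
```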

Tokenization

I used the sample files to create a corpus, which I cleaned by removing numbers and most punctuation (excluding certain apostrophes). I also converted all words to lowercase in order to combine counts for the same words with different capitalizations.
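
To make the cleaning rules concrete, the sketch below applies the same regular expression used in the appendix to a single string, using base R only. The wrapper clean_text and the final trimws step are assumptions added for this example; the analysis itself applied these steps through tm_map.

```r
# Clean one string the way the corpus was cleaned: replace numbers and most
# punctuation with spaces (keeping apostrophes inside words), lowercase the
# text, and collapse repeated whitespace.
# clean_text is an illustrative wrapper; trimws is an extra step for tidy output.
clean_text <- function(x) {
  x <- gsub("[^[:alpha:][:space:]']|^'|'$|\\W'", " ", x)
  x <- tolower(x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

clean_text("Don't stop 'til 2am!")  # "don't stop til am"
```

The apostrophe in "Don't" survives because it sits between letters, while the leading apostrophe in "'til" is removed because it follows a non-word character.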

I created term-document matrices for words/unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams) using the NGramTokenizer function from the RWeka package. I then converted each TermDocumentMatrix to a regular matrix for further processing.
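
For readers unfamiliar with n-grams, the following base R sketch shows what an n-gram tokenizer produces for a short sentence. It is only an illustration under simple assumptions (tokens separated by single spaces); make_ngrams is a hypothetical helper and not the RWeka tokenizer used in the analysis.

```r
# Build all n-grams of size n from a vector of tokens.
# make_ngrams is an illustrative helper, not part of RWeka or tm.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("the quick brown fox", " ", fixed = TRUE)[[1]]
make_ngrams(tokens, 2)  # "the quick" "quick brown" "brown fox"
make_ngrams(tokens, 3)  # "the quick brown" "quick brown fox"
```

Counting how often each n-gram occurs (for example with table) gives the frequencies stored in a term-document matrix.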

Word Counts

I calculated the total number of word instances in each of the sample files (Table 3) and the average words per line (Table 4). The blog file contained the most words per line and the Twitter file the fewest.

  en_US.blogs.sample.txt    en_US.news.sample.txt en_US.twitter.sample.txt 
                  289381                   275837                   229947

Table 3. Word counts for files in the sample dataset.

        Words per Line
blogs        32.182051
news         27.305187
twitter       9.743104

Table 4. Words per line for files in the sample dataset.
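
The values in Table 4 follow directly from Tables 2 and 3: each is the word count of a sample file divided by its line count. A minimal sketch of that arithmetic, with the counts typed in by hand and variable names chosen for this example:

```r
# Word counts (Table 3) and line counts (Table 2) for the sample files
word_counts <- c(blogs = 289381, news = 275837, twitter = 229947)
line_counts <- c(blogs = 8992, news = 10102, twitter = 23601)

# Average number of words per line in each file
words_per_line <- word_counts / line_counts
round(words_per_line, 2)  # blogs 32.18, news 27.31, twitter 9.74
```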

Filtering Words

To filter out words that were misspelled or in a foreign (non-English) language, I used R's aspell function to spellcheck words in the corpus. (Note: This function required installing the Aspell program [http://aspell.net] on my computer to provide the spellchecker and its dictionary.)

The spellchecker identified many tokens that could be safely removed from a predictive model.

[1] "a'f"     "a's"     "a'stan"  "aaa"     "aaaaand" "aaaahhh"

However, about 44% of the unique words in the corpus were flagged by the spellchecker.

[1] 0.4422739

Further examination revealed that the most common flagged words were contractions of the pronoun “I,” acronyms, or other words that require capitalization to pass a spellcheck, such as proper nouns (Table 5). These “misspellings” were likely generated when the text was converted to lowercase. Because I did not wish to exclude such common words, I decided to use the unfiltered (non-spellchecked) sample when analyzing frequency distributions of n-grams.

     i'm      lol     i've     i'll   friday american 
    1782      670      447      376      349      276

Table 5. Frequencies of the most common “misspelled” words in the sample dataset. Many were only flagged by the spellchecker due to their conversion to lowercase.
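
In the appendix code, the flagged proportion is computed over unique terms (rows of the term-document matrix), not over word instances. The two measures can differ considerably, as the toy sketch below illustrates. All counts and the set of flagged words here are hypothetical values chosen for the example, not results from the dataset.

```r
# Hypothetical term frequencies and a hypothetical set of flagged terms
term_freqs <- c(the = 50, house = 34, "i'm" = 10, lol = 5, aaaaand = 1)
flagged <- c("i'm", "lol", "aaaaand")

is_flagged <- names(term_freqs) %in% flagged

# Share of unique terms that were flagged
type_share <- mean(is_flagged)
# Share of word instances that belong to flagged terms
token_share <- sum(term_freqs[is_flagged]) / sum(term_freqs)

type_share   # 0.6
token_share  # 0.16
```

A high share of flagged unique terms therefore does not by itself say how much of the running text would be lost by filtering.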

Distributions of Frequencies

I examined the distributions of word (unigram) frequencies, bigram (2-gram) frequencies, and trigram (3-gram) frequencies. All distributions were highly skewed, with most n-grams appearing infrequently and relatively few n-grams appearing very frequently. I therefore used the log of frequencies to compare these n-gram distributions. Frequency distributions became more skewed as the size of n-grams increased from unigrams (Figure 1) to bigrams (Figure 2) to trigrams (Figure 3).

Figure 1. Unigram (word) log frequency distribution.

Figure 2. Bigram (2-gram) log frequency distribution.

Figure 3. Trigram (3-gram) log frequency distribution.
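
The plotting code for these figures is not included in the appendix. As a hedged sketch of the general approach, the example below takes the log of a small set of made-up n-gram counts and bins them with hist; the counts, names, and number of breaks are all assumptions for illustration.

```r
# Toy n-gram counts (hypothetical values, for illustration only)
freqs <- c(the = 120, of = 45, cat = 10, sat = 3, mat = 2,
           zebra = 1, quokka = 1, yurt = 1, fjord = 1, gnu = 1)

# Log-transform the counts to make the skewed distribution easier to view
log_freqs <- log(freqs)

# Share of n-grams that occur exactly once
singleton_share <- mean(freqs == 1)
singleton_share  # 0.5

# Bin the log frequencies; plot = FALSE returns the histogram object
# instead of drawing it (use plot = TRUE to draw the figure)
h <- hist(log_freqs, breaks = 5, plot = FALSE)
h$counts
```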

Most Common Words

To explore the dataset further, I examined the most common words in the dataset (Figure 4). Not surprisingly, many of these were “stopwords.”

Figure 4. Most common words in the sample dataset.

Most Common Associations

To examine word associations, I looked at the most common bigrams (Figure 5) and trigrams (Figure 6).

Figure 5. Most common bigrams in the sample dataset.

Figure 6. Most common trigrams in the sample dataset.

Coverage of N-gram Instances

To determine how many unique n-grams should be included in the model to ensure a given amount of coverage, I calculated the number of n-grams needed for 50% and 90% coverage for unigrams, bigrams, and trigrams (Table 6).

         50% coverage 90% coverage
Unigrams          319         9134
Bigrams         32525       344636
Trigrams       313710       683205

Table 6. Smallest number of unique n-grams yielding a given degree of coverage for the sample corpus.
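
The coverage calculation can also be written without an explicit loop. The sketch below is an alternative to the getCoverage function in the appendix, using a cumulative sum over frequencies sorted from most to least common; coverage_count and the toy frequencies are assumptions for this example.

```r
# Smallest number of most-frequent n-grams whose combined count reaches
# the requested share of all n-gram instances.
# coverage_count is an illustrative alternative to getCoverage.
coverage_count <- function(prop, freqs) {
  sorted <- sort(freqs, decreasing = TRUE)
  cumulative_share <- cumsum(sorted) / sum(sorted)
  unname(which(cumulative_share >= prop)[1])
}

# Toy frequencies (hypothetical): cumulative shares are 0.40, 0.70, 0.85, 0.95, 1.00
toy_freqs <- c(the = 40, of = 30, cat = 15, sat = 10, mat = 5)
coverage_count(0.5, toy_freqs)  # 2
coverage_count(0.9, toy_freqs)  # 4
```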

Modeling Strategy

I plan to develop a model using Kneser-Ney smoothing so that n-grams not seen in the training data still receive a nonzero probability. I’ll estimate its predictive power by calculating the “perplexity” of a held-out test set based on the model’s probability estimates. Perplexity is the inverse probability of the test set, normalized by the number of word instances in the test set (that is, the N-th root of the inverse probability for a test set of N words). A good model will minimize perplexity [1].
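
As a small worked example of the perplexity definition, the sketch below computes perplexity from the probabilities a model assigns to each word of a test set. The function name perplexity and the probability values are assumptions for illustration; no model has been fit at this stage. Working with logs avoids numerical underflow when many small probabilities are multiplied.

```r
# Perplexity of a test set, given the probability the model assigned to each
# of its N words: (p_1 * ... * p_N)^(-1/N), computed in log space.
perplexity <- function(word_probs) {
  exp(-mean(log(word_probs)))
}

# If every word is assigned probability 1/4, perplexity is 4
perplexity(rep(0.25, 8))    # 4

# Perplexity depends on the geometric mean of the probabilities
perplexity(c(0.5, 0.125))   # 4
```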

Shiny App

The Shiny app I plan to create will provide a text input box for the user to enter text, next to which will appear a list of the top three word predictions. The predictions will be generated after the completion of each word in the text input box, which the app will detect when the user presses the spacebar.
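
The prediction logic behind the app might look something like the sketch below. It is a simplified illustration, not the planned implementation: it uses unsmoothed bigram counts only, and predict_next and bigram_counts are hypothetical names with made-up data. In a Shiny app, a function like this could be called on the current contents of the text input.

```r
# Return up to k candidate next words once the user has finished a word
# (that is, when the input ends with a space).
# predict_next and bigram_counts are hypothetical, for illustration only.
predict_next <- function(text, bigram_counts, k = 3) {
  if (!grepl(" $", text)) return(character(0))
  words <- strsplit(trimws(tolower(text)), "[[:space:]]+")[[1]]
  if (length(words) == 0) return(character(0))
  last_word <- words[length(words)]
  # keep bigrams whose first word matches the last word typed
  matches <- startsWith(names(bigram_counts), paste0(last_word, " "))
  candidates <- bigram_counts[matches]
  if (length(candidates) == 0) return(character(0))
  top <- head(sort(candidates, decreasing = TRUE), k)
  # drop the first word of each bigram, leaving the predicted word
  sub("^\\S+ ", "", names(top))
}

bigram_counts <- c("i am" = 9, "i have" = 7, "i will" = 5,
                   "i think" = 3, "you are" = 8)
predict_next("I ", bigram_counts)  # "am" "have" "will"
predict_next("I", bigram_counts)   # character(0): the word is not finished yet
```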

References

[1] Jurafsky, Dan, and James H. Martin. Speech and language processing. Pearson, 2014.

Appendix: R Code

# DATASET
setwd("~/Dropbox/data-science-capstone")
filename <- "Coursera-SwiftKey.zip"
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, filename)
unzip(filename)
file.remove(filename)

# LINE COUNTS
library(R.utils)

en.blogs.linecount <- countLines("final/en_US/en_US.blogs.txt")[1]
en.news.linecount <- countLines("final/en_US/en_US.news.txt")[1]
en.twitter.linecount <- countLines("final/en_US/en_US.twitter.txt")[1]
linecounts <- data.frame(
        c(en.blogs.linecount, 
          en.news.linecount, 
          en.twitter.linecount), 
        row.names=c("blogs", "news", "twitter"))
colnames(linecounts) <- c("Line Count")
linecounts

# SAMPLING
library(LaF)

set.seed(2)
prob <- 0.01

en.blogs.sample <- sample_lines("final/en_US/en_US.blogs.txt", 
                                prob*en.blogs.linecount, 
                                nlines = en.blogs.linecount)
en.news.sample <- sample_lines("final/en_US/en_US.news.txt", 
                               prob*en.news.linecount, 
                               nlines = en.news.linecount)
en.twitter.sample <- sample_lines("final/en_US/en_US.twitter.txt", 
                                  prob*en.twitter.linecount, 
                                  nlines = en.twitter.linecount)

dir.create(file.path(getwd(), "sample"), showWarnings = FALSE)
write(en.blogs.sample, "sample/en_US.blogs.sample.txt", sep="\n")
write(en.news.sample, "sample/en_US.news.sample.txt", sep="\n")
write(en.twitter.sample, "sample/en_US.twitter.sample.txt", sep="\n")

sample.blogs.linecount <- countLines("sample/en_US.blogs.sample.txt")[1]
sample.news.linecount <- countLines("sample/en_US.news.sample.txt")[1]
sample.twitter.linecount <- countLines("sample/en_US.twitter.sample.txt")[1]
sample.linecounts <- data.frame(
        c(sample.blogs.linecount, 
          sample.news.linecount, 
          sample.twitter.linecount), 
        row.names=c("blogs", "news", "twitter"))
colnames(sample.linecounts) <- c("Line Count")
sample.linecounts

# TOKENIZATION
library(tm)

corpus <- Corpus(DirSource("sample"))

# replace numbers and punctuation with whitespaces, except certain apostrophes
removeNumAndPunct <- function(x) gsub("[^[:alpha:][:space:]']|^'|'$|\\W'", " ", x)
corpus <- tm_map(corpus, content_transformer(removeNumAndPunct))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

library(RWeka)

token_delim <- " \\t\\r\\n.!?,;\"()"
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = token_delim))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = token_delim))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = token_delim))

options(mc.cores=1)
tdm.unigram <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm.trigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

unigram.matrix <- as.matrix(tdm.unigram)
bigram.matrix <- as.matrix(tdm.bigram)
trigram.matrix <- as.matrix(tdm.trigram)

# WORD COUNTS
wordcounts <- colSums(unigram.matrix)
wordcounts

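# average words per line for each sample file
words.per.line <- wordcounts / sample.linecounts
colnames(words.per.line) <- c("Words per Line")
words.per.line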

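# FILTERING
# aspell() comes from the utils package and runs an external spell-checking
# program (Aspell here), which must be installed separately (http://aspell.net)
library(utils)

unigrams <- rownames(unigram.matrix)
# write one unigram per line so that $Line in the result maps back to matrix rows
fn <- tempfile()
writeLines(unigrams, fn)
unigrams.spellcheck.result <- aspell(fn)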

head(unigrams.spellcheck.result$Original)

unigram.misspelled <- unigram.matrix[unigrams.spellcheck.result$Line, ]
unigram.prop.misspelled <- nrow(unigram.misspelled) / nrow(unigram.matrix)
unigram.prop.misspelled

unigram.misspelled.frequencies <- rowSums(unigram.misspelled)
unigram.misspelled.frequencies <- sort(unigram.misspelled.frequencies, decreasing=TRUE)
head(unigram.misspelled.frequencies)

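# COVERAGE
# NOTE: the *.frequencies vectors are not defined elsewhere in this appendix.
# The definitions below are an assumed reconstruction: total counts per n-gram,
# sorted from most to least frequent, which is what getCoverage expects.
unigram.frequencies <- sort(rowSums(unigram.matrix), decreasing=TRUE)
bigram.frequencies <- sort(rowSums(bigram.matrix), decreasing=TRUE)
trigram.frequencies <- sort(rowSums(trigram.matrix), decreasing=TRUE)
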
getCoverage <- function(prop, frequencylist) {
        running.total <- 0
        total.instances <- sum(frequencylist)
        index <- 0
        while (running.total < prop*total.instances) {
                index <- index + 1
                running.total <- running.total + frequencylist[index]
        }
        index
}

coverage.50 <- c(
        getCoverage(0.5, unigram.frequencies),
        getCoverage(0.5, bigram.frequencies),
        getCoverage(0.5, trigram.frequencies)
)

coverage.90 <- c(
        getCoverage(0.9, unigram.frequencies),
        getCoverage(0.9, bigram.frequencies),
        getCoverage(0.9, trigram.frequencies)
)

coverage <- cbind(coverage.50, coverage.90)
colnames(coverage) <- c('50% coverage', '90% coverage')
rownames(coverage) <- c('Unigrams', 'Bigrams', 'Trigrams')
coverage