The goal of this project is to construct a predictive text model that can supply text predictions based on previous words typed. To determine word frequencies and understand the associations between words, I performed an exploratory analysis of a database of English text from blogs, news articles, and Twitter.
The dataset contained about 900,000 lines of text from blogs, over a million lines of text from news articles, and over two million lines of text from Twitter (Table 1).
Line Count
blogs 899288
news 1010242
twitter 2360148
Table 1. Line counts for files in the text dataset.
I randomly selected 1% of lines from each file to provide a sample dataset on which to perform further analyses. These lines were written to a separate set of files for further processing.
Line Count
blogs 8992
news 10102
twitter 23601
Table 2. Line counts for files in the sample dataset.
I used the sample files to create a corpus, which I cleaned by removing numbers and most punctuation (apostrophes within words, as in contractions, were kept). I also converted all words to lowercase in order to combine counts for the same words with different capitalizations.
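To make the effect of these cleaning steps concrete, the sketch below applies them to a single made-up string using base R only. It reuses the regular expression from the TOKENIZATION code; the helper name `clean_line` and the example string are hypothetical, and whitespace collapsing is approximated with `gsub` rather than tm's `stripWhitespace`.

```r
# Same cleaning steps as the TOKENIZATION code, applied to one string with
# base R only. `clean_line` is a hypothetical helper name.
clean_line <- function(x) {
  # replace numbers and punctuation (other than kept apostrophes) with spaces
  x <- gsub("[^[:alpha:][:space:]']|^'|'$|\\W'", " ", x)
  # lowercase so that "Cat", "cat", and "CAT" are counted as one word
  x <- tolower(x)
  # collapse runs of whitespace into a single space
  gsub("[[:space:]]+", " ", x)
}

# Made-up example string (not taken from the dataset)
clean_line("It's 9am, y'all: The Cat and the CAT!")
```

Digits and punctuation are replaced by spaces, while the apostrophes in "it's" and "y'all" survive because they sit inside a word.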
I created term-document matrices for words/unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams) using NGramTokenizers. I then converted each TermDocumentMatrix to a regular matrix for further processing.
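As a rough illustration of what an n-gram tokenizer produces, the base-R sketch below builds n-grams from one cleaned line by sliding a window of n words across it. The helper name `make_ngrams` and the example line are hypothetical and are not part of RWeka or tm; the analysis itself used NGramTokenizer.

```r
# Hypothetical helper (not from RWeka or tm): build n-grams from one line of
# cleaned text by sliding a window of n words across it.
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  # a line with fewer than n words yields no n-grams
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up example line (not taken from the dataset)
example_line <- "thanks for the follow"
make_ngrams(example_line, 1)  # unigrams
make_ngrams(example_line, 2)  # bigrams
make_ngrams(example_line, 3)  # trigrams
```

A line of k words yields k - n + 1 n-grams, which is one reason the number of distinct n-grams grows quickly as n increases.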
I calculated the total number of word instances in each of the sample files (Table 3) and the average words per line (Table 4). The blog file contained the most words per line and the Twitter file the fewest.
Word Count
blogs 289381
news 275837
twitter 229947
Table 3. Word counts for files in the sample dataset.
Words per Line
blogs 32.182051
news 27.305187
twitter 9.743104
Table 4. Words per line for files in the sample dataset.
To filter out words that were misspelled or in a foreign (non-English) language, I used the aspell function in R to spellcheck words in the corpus. (Note: This function relies on the Aspell program [http://aspell.net], which I had to install on my computer to provide the spellchecker and its dictionary.)
The spellchecker identified many tokens that could be safely removed from a predictive model.
[1] "a'f" "a's" "a'stan" "aaa" "aaaaand" "aaaahhh"
However, 44% of the unique words in the sample corpus were flagged by the spellchecker.
[1] 0.4422739
Further examination revealed that the most common words flagged were personal pronouns, acronyms, or other words that require capitalization to pass a spellcheck (Table 5). These “misspellings” were likely generated when the text was converted to lowercase. Because I did not wish to exclude such common words, I decided to use the unfiltered (non-spellchecked) sample when analyzing frequency distributions of n-grams.
i'm lol i've i'll friday american
1782 670 447 376 349 276
Table 5. Frequencies of the most common “misspelled” words in the sample dataset. Many were only flagged by the spellchecker due to their conversion to lowercase.
I examined the distributions of word (unigram) frequencies, bigram (2-gram) frequencies, and trigram (3-gram) frequencies. All distributions were highly skewed, with most n-grams appearing infrequently and relatively few n-grams appearing very frequently. I therefore used the log of frequencies to compare these n-gram distributions. Frequency distributions became more skewed as the size of n-grams increased from unigrams (Figure 1) to bigrams (Figure 2) to trigrams (Figure 3).
Figure 1. Unigram (word) log frequency distribution.
Figure 2. Bigram (2-gram) log frequency distribution.
Figure 3. Trigram (3-gram) log frequency distribution.
To explore the dataset further, I examined the most common words in the dataset (Figure 4). Not surprisingly, many of these were “stopwords.”
Figure 4. Most common words in the sample dataset.
To examine word associations, I looked at the most common bigrams (Figure 5) and trigrams (Figure 6).
Figure 5. Most common bigrams in the sample dataset.
Figure 6. Most common trigrams in the sample dataset.
To determine how many unique n-grams should be included in the model to ensure a given amount of coverage, I calculated the number of n-grams needed for 50% and 90% coverage for unigrams, bigrams, and trigrams (Table 6).
50% coverage 90% coverage
Unigrams 319 9134
Bigrams 32525 344636
Trigrams 313710 683205
Table 6. Smallest number of unique n-grams yielding a given degree of coverage for the sample corpus.
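The coverage calculation can also be written in vectorized form with a cumulative sum. The sketch below is a hypothetical alternative to the getCoverage loop in the code listing, not the code used to produce Table 6; the function name `coverage_size` and the toy frequencies are made up for illustration.

```r
# Hypothetical vectorized helper: smallest number of most-frequent n-grams
# whose combined count reaches a given proportion of all n-gram instances.
coverage_size <- function(prop, freqs) {
  freqs <- sort(freqs, decreasing = TRUE)
  cumulative <- cumsum(freqs) / sum(freqs)
  # index of the first n-gram at which cumulative coverage reaches prop
  unname(which(cumulative >= prop)[1])
}

# Toy frequencies (made up for illustration, not taken from the corpus)
toy <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)
coverage_size(0.5, toy)
coverage_size(0.9, toy)
```

Because the frequencies are sorted in decreasing order first, the result is the smallest vocabulary that achieves the requested coverage.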
I plan to develop a model using Kneser-Ney smoothing to assign nonzero probability to words and word combinations that do not appear in the training data. I’ll estimate its predictive power by calculating the “perplexity” of a test set based on the model’s probability matrix. Perplexity is the inverse probability of the test set, normalized by the number of word instances in the test set: for a test set of N words, it is the probability of the test set raised to the power -1/N. A lower perplexity indicates a better model [1].
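The perplexity calculation can be sketched as follows, assuming the model has already assigned a probability to each word in the test set. The function name `perplexity` and the probabilities are hypothetical; this illustrates only the evaluation formula, not the planned Kneser-Ney model.

```r
# Perplexity of a test set given the probability the model assigned to each
# of its N words: PP = (p_1 * p_2 * ... * p_N)^(-1/N), computed in log space
# to avoid numerical underflow on long test sets.
perplexity <- function(word_probs) {
  # a zero probability would make perplexity infinite, which is why a
  # smoothing method is needed for unseen n-grams
  stopifnot(all(word_probs > 0))
  exp(-mean(log(word_probs)))
}

# Made-up probabilities for a four-word test sequence (illustration only)
probs <- c(0.1, 0.25, 0.05, 0.2)
perplexity(probs)

# A model that assigns the same probability 1/V to every word has perplexity V
perplexity(rep(1 / 1000, 10))
```

The uniform case gives a useful reading of the metric: a perplexity of V means the model is, on average, as uncertain as if it were choosing among V equally likely words.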
The Shiny app I plan to create will provide a text input box, next to which a list of the top three word predictions will appear. Predictions will be regenerated each time the user completes a word in the input box, as indicated by a press of the spacebar.
[1] Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.
# DATASET
setwd("~/Dropbox/data-science-capstone")
filename <- "Coursera-SwiftKey.zip"
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, filename)
unzip(filename)
file.remove(filename)
# LINE COUNTS
library(R.utils)
en.blogs.linecount <- countLines("final/en_US/en_US.blogs.txt")[1]
en.news.linecount <- countLines("final/en_US/en_US.news.txt")[1]
en.twitter.linecount <- countLines("final/en_US/en_US.twitter.txt")[1]
linecounts <- data.frame(
    c(en.blogs.linecount,
      en.news.linecount,
      en.twitter.linecount),
    row.names=c("blogs", "news", "twitter"))
colnames(linecounts) <- c("Line Count")
linecounts
# SAMPLING
library(LaF)
set.seed(2)
prob <- 0.01
en.blogs.sample <- sample_lines("final/en_US/en_US.blogs.txt",
                                prob*en.blogs.linecount,
                                nlines = en.blogs.linecount)
en.news.sample <- sample_lines("final/en_US/en_US.news.txt",
                               prob*en.news.linecount,
                               nlines = en.news.linecount)
en.twitter.sample <- sample_lines("final/en_US/en_US.twitter.txt",
                                  prob*en.twitter.linecount,
                                  nlines = en.twitter.linecount)
dir.create(file.path(getwd(), "sample"), showWarnings = FALSE)
write(en.blogs.sample, "sample/en_US.blogs.sample.txt", sep="\n")
write(en.news.sample, "sample/en_US.news.sample.txt", sep="\n")
write(en.twitter.sample, "sample/en_US.twitter.sample.txt", sep="\n")
sample.blogs.linecount <- countLines("sample/en_US.blogs.sample.txt")[1]
sample.news.linecount <- countLines("sample/en_US.news.sample.txt")[1]
sample.twitter.linecount <- countLines("sample/en_US.twitter.sample.txt")[1]
sample.linecounts <- data.frame(
    c(sample.blogs.linecount,
      sample.news.linecount,
      sample.twitter.linecount),
    row.names=c("blogs", "news", "twitter"))
colnames(sample.linecounts) <- c("Line Count")
sample.linecounts
# TOKENIZATION
library(tm)
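# VCorpus is used so that the custom n-gram tokenizers below are honored;
# in recent versions of tm, Corpus() may build a SimpleCorpus, for which
# TermDocumentMatrix() ignores a user-supplied tokenizer
corpus <- VCorpus(DirSource("sample"))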
# replace numbers and punctuation with whitespaces, except certain apostrophes
removeNumAndPunct <- function(x) gsub("[^[:alpha:][:space:]']|^'|'$|\\W'", " ", x)
corpus <- tm_map(corpus, content_transformer(removeNumAndPunct))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
library(RWeka)
token_delim <- " \\t\\r\\n.!?,;\"()"
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = token_delim))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = token_delim))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = token_delim))
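# run single-threaded to avoid known problems when RWeka tokenizers are
# called from parallel worker processes
options(mc.cores=1)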
tdm.unigram <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm.trigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
unigram.matrix <- as.matrix(tdm.unigram)
bigram.matrix <- as.matrix(tdm.bigram)
trigram.matrix <- as.matrix(tdm.trigram)
# WORD COUNTS
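# total word instances per sample file (each matrix column is one document)
wordcounts <- colSums(unigram.matrix)
wordcounts
# average words per line; the line counts are extracted as a plain vector so
# that the result is not a data frame mislabeled "Line Count"
words.per.line <- data.frame(
    wordcounts / sample.linecounts[["Line Count"]],
    row.names=c("blogs", "news", "twitter"))
colnames(words.per.line) <- c("Words per Line")
words.per.line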
# FILTERING
library(utils)
unigrams <- rownames(unigram.matrix)
fn <- tempfile()
writeLines(unigrams, fn)
unigrams.spellcheck.result <- aspell(fn)
head(unigrams.spellcheck.result$Original)
unigram.misspelled <- unigram.matrix[unigrams.spellcheck.result$Line, ]
unigram.prop.misspelled <- nrow(unigram.misspelled) / nrow(unigram.matrix)
unigram.prop.misspelled
unigram.misspelled.frequencies <- rowSums(unigram.misspelled)
unigram.misspelled.frequencies <- sort(unigram.misspelled.frequencies, decreasing=TRUE)
head(unigram.misspelled.frequencies)
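# COVERAGE
# Assumption: the *.frequencies vectors are not defined in the code shown
# here; they are taken to be total n-gram counts across all sample files,
# sorted in decreasing order, as getCoverage() requires.
unigram.frequencies <- sort(rowSums(unigram.matrix), decreasing=TRUE)
bigram.frequencies <- sort(rowSums(bigram.matrix), decreasing=TRUE)
trigram.frequencies <- sort(rowSums(trigram.matrix), decreasing=TRUE)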
# Returns the smallest number of top-ranked n-grams whose counts sum to at
# least `prop` of all n-gram instances. `frequencylist` must be sorted in
# decreasing order of frequency.
getCoverage <- function(prop, frequencylist) {
    running.total <- 0
    total.instances <- sum(frequencylist)
    index <- 0
    while (running.total < prop*total.instances) {
        index <- index + 1
        running.total <- running.total + frequencylist[index]
    }
    index
}
coverage.50 <- c(
    getCoverage(0.5, unigram.frequencies),
    getCoverage(0.5, bigram.frequencies),
    getCoverage(0.5, trigram.frequencies)
)
coverage.90 <- c(
    getCoverage(0.9, unigram.frequencies),
    getCoverage(0.9, bigram.frequencies),
    getCoverage(0.9, trigram.frequencies)
)
coverage <- cbind(coverage.50, coverage.90)
colnames(coverage) <- c('50% coverage', '90% coverage')
rownames(coverage) <- c('Unigrams', 'Bigrams', 'Trigrams')
coverage