Overview

This report outlines the exploratory analysis that will form the basis of a text prediction algorithm.

My goal is to create an end product that will resolve several issues I personally encounter while using iPhone text prediction - namely, that the interface is distracting, and that it fails to predict or learn words that I commonly use, possibly because it filters them as expletives.

Note: I’m including critical or interesting code inline, but all other code will be listed in the Appendix.

Setup & preprocessing

First, I load the necessary libraries. For the purposes of this document, I assume all files have been downloaded from the correct source. A list of expletives to filter has also been downloaded from http://www.bannedwordlist.com/lists/swearWords.txt (more on this choice later).

Next, using readLines, I get an idea of what’s in each database. I’m choosing to use all three English language databases - even though Twitter may be more relevant to a texting algorithm, it does have a 140-character limit which may affect grammatical correctness, and I’d like to keep this algorithm relevant to general use.

con <- file("en_US.twitter.txt")
open(con)
twitter <- c(0, 0)  # running totals: [1] line count, [2] character count
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        twitter[1] <- twitter[1] + 1
        twitter[2] <- twitter[2] + nchar(line)
}
close(con)
# See Appendix/Preprocessing for similar code used on the remaining datasets

As this summary shows, each dataset has a fairly comparable character count, even though the number of lines varies greatly.

Dataset characteristics

              twitter        news        blogs
Lines         2360149       77260       899289
Characters  162384825    15683765    208361438
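
For reference, the summary above is just the whatsthere data frame assembled in Appendix/Preprocessing, presumably printed with knitr (a minimal one-liner, assuming kable’s default formatting):

kable(whatsthere)   # whatsthere is built in Appendix/Preprocessing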

Given this, I take a random sample of each database so that roughly the same number of lines comes from each. My computer has memory constraints, so I take a comparatively small subset of each database.

set.seed(216)
# Keep roughly 0.5% of the Twitter lines (about 12,000 of 2.36 million);
# whatsthere[1,1] is the Twitter line count (see Appendix/Preprocessing)
coinflip <- rbinom(whatsthere[1,1], 1, .005)
con <- file("en_US.twitter.txt")
open(con)
dat <- c()
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat <- c(dat, line)
        }
        current.line <- current.line + 1
}
close(con)
# See Appendix/Sampling for similar code used on the remaining datasets

The final preprocessing step is to remove all lines containing non-ASCII characters. This drops many foreign-language lines as well as letters/words corrupted by formatting issues.

# Flag lines where iconv had to substitute a non-ASCII character, then drop them
nonascii <- grep("non_ascii", iconv(dat, "latin1", "ASCII", 
                                    sub = "non_ascii"))
dat <- dat[-nonascii]

Transformations

Before I figure out which words and phrases are the most common, I must remove unnecessary characters and filter for profanity.

(A note on the profanity filter: I was unable to locate a fully satisfactory list of profanities online (upon review, the available lists were either gravely lacking or wildly excessive), so I started from the downloaded list and edited it into my own. Since this is obviously very subjective, this initial list is meant to be a work in progress.)

One extra filter needs to be applied to the corpus I use for unigrams: the removal of stopwords (e.g. “or”, “about”, “as”, “but”). If they were not removed, they would be by far the most common words, but I don’t want the algorithm to constantly predict “but” whenever the user types “b-” unless there are context cues that “but” is coming next.

However, I leave stopwords in the corpus used for bigrams/trigrams: by definition, stopwords “go” with other words, so they will be important to the prediction algorithm.

# See Appendix/Transform for complete code on these transformations

The final step is to make sure the corpora are in the right format for counting frequently used words.

For the unigram corpus:

# Convert to Document Term Matrix
uniDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpusU$content)))

# Drop very sparse terms (words that appear in fewer than ~0.1% of documents)
uniDtm <- as.matrix(removeSparseTerms(uniDtm, 0.999))

# Summarize word frequency
uniDtm.freq <- sort(colSums(uniDtm), decreasing = TRUE)

For the bigram/trigram corpus:

# Create tokenizer functions that turn a document into bigrams and trigrams
# (ngrams() and words() come from the NLP package, which is attached with tm)
bigramTokenizer <- function(x){
        unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
               use.names = FALSE)
}
trigramTokenizer <- function(x){
        unlist(lapply(ngrams(words(x), 3), paste, collapse = " "),
               use.names = FALSE)
}

# Convert to document term matrix; remove infrequent words; summarize frequency
biDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpus$content)), control = 
                                    list(tokenize = bigramTokenizer))
biDtm <- as.matrix(removeSparseTerms(biDtm, 0.999))
biDtm.freq <- sort(colSums(biDtm), decreasing = TRUE)

triDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpus$content)), control = 
                                     list(tokenize = trigramTokenizer))
triDtm <- as.matrix(removeSparseTerms(triDtm, 0.999))
triDtm.freq <- sort(colSums(triDtm), decreasing = TRUE)

Visualizations

Using wordclouds and bar charts, we can explore the most frequent unigrams, bigrams, and trigrams. As they illustrate, certain phrases (such as “one of the”) are extremely popular, which means the algorithm should have a fairly easy time predicting common phrases.

# See Appendix/Graphs for the wordcloud and plotting code

Comparing different subsets

To check that the random sampling is representative of the whole dataset, I run the same transformations on a second, independent random sample.

The top 10 n-grams in each category are nearly identical, even in terms of rank. Still, I combine the two frequency tables (adding the frequencies together) to improve accuracy.

# See Appendix/Comps for the comparison and combination code

Top 10 unigrams

Sample 1: said, will, one, just, like, can, time, get, new, now
Sample 2: said, will, one, just, can, like, time, get, new, now

Top 10 bigrams

Sample 1: of the, in the, to the, on the, for the, at the, to be, and the, in a, with the
Sample 2: of the, in the, to the, for the, on the, at the, to be, and the, in a, with the

Top 10 trigrams

Sample 1: one of the, a lot of, to be a, some of the, thanks for the, going to be, out of the, it was a, the end of, i want to
Sample 2: one of the, a lot of, thanks for the, as well as, some of the, it was a, to be a, going to be, the end of, be able to

What’s next

The next step is to build a text prediction model. As the user types, the last word or two should be looked up in the corresponding n-gram table; so if the user types “one of”, the model suggests “the”. If there is no matching bi- or trigram, the model should instead try to predict which unigram the user is in the middle of typing.
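
As a first pass, here is a minimal sketch of that lookup (not the final model). predictNext() is a hypothetical helper name, and it assumes the combined frequency tables uniFreq, biFreq, and triFreq built in Appendix/Comps: match on the last two words, back off to the last word, and finally fall back to the most frequent unigrams.

# Hypothetical sketch of the backoff lookup (assumes uniFreq/biFreq/triFreq
# from Appendix/Comps; plain-word input only, no regex escaping)
predictNext <- function(text, n = 3) {
        w <- tail(strsplit(tolower(text), "\\s+")[[1]], 2)
        if (length(w) == 2) {
                # Trigrams whose first two words match the last two words typed
                hits <- triFreq[grepl(paste0("^", w[1], " ", w[2], " "), triFreq$trigram), ]
                if (nrow(hits) > 0)
                        return(head(sapply(strsplit(hits$trigram, " "), `[`, 3), n))
        }
        # Back off to bigrams starting with the last word typed
        hits <- biFreq[grepl(paste0("^", tail(w, 1), " "), biFreq$bigram), ]
        if (nrow(hits) > 0)
                return(head(sapply(strsplit(hits$bigram, " "), `[`, 2), n))
        # Last resort: the most frequent unigrams overall
        head(uniFreq$unigram, n)
}

predictNext("one of")   # should return "the" as the top suggestion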

My hope for the final application is something more user-friendly and less distracting than iPhone text prediction, so I also look forward to exploring what features Shiny has to offer.
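
To give a rough sense of that front end (a hypothetical sketch only, not a design decision), a bare-bones Shiny app could wire a text box straight into the predictNext() sketch above:

library(shiny)

# Hypothetical minimal UI: one text box, one line of suggestions
ui <- fluidPage(
        textInput("typed", "Type your message:"),
        textOutput("suggestions")
)
server <- function(input, output) {
        output$suggestions <- renderText({
                paste(predictNext(input$typed), collapse = " | ")
        })
}
shinyApp(ui, server)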

I found these online resources useful:

Introduction to text mining using tm: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Creating term data matrices of bigrams/trigrams: http://tm.r-forge.r-project.org/faq.html

Appendix

Setup

library(knitr, verbose = FALSE)      # report tables (kable)
library(tm, verbose = FALSE)         # text mining framework (also attaches NLP)
library(SnowballC, verbose = FALSE)  # stemming support for tm
library(wordcloud, verbose = FALSE)  # wordcloud plots
library(ggplot2, verbose = FALSE)    # bar charts

Preprocessing

con <- file("en_US.news.txt")
open(con)
news <- c(0, 0)  # [1] line count, [2] character count
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        news[1] <- news[1] + 1
        news[2] <- news[2] + nchar(line)
}
close(con)

con <- file("en_US.blogs.txt")
open(con)
blogs <- c(0, 0)  # [1] line count, [2] character count
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        blogs[1] <- blogs[1] + 1
        blogs[2] <- blogs[2] + nchar(line)
}
close(con)

# Summary table of line and character counts for each dataset
whatsthere <- data.frame(twitter, news, blogs)
rownames(whatsthere) <- c("Lines", "Characters")

Sampling

set.seed(967)
coinflip <- rbinom(whatsthere[1,2], 1, .16)    # ~16% of news lines (~12,000)
con <- file("en_US.news.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat <- c(dat, line)
        }
        current.line <- current.line + 1
}
close(con)

set.seed(34)
coinflip <- rbinom(whatsthere[1,3], 1, .013)   # ~1.3% of blog lines (~12,000)
con <- file("en_US.blogs.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat <- c(dat, line)
        }
        current.line <- current.line + 1
}
close(con)

Transform

# Turn the combined subsets into a corpus so it can be read by package tm
corpus <- VCorpus(VectorSource(dat))

# Make all text lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)

# Finalize profanity list & filter from corpus
bad <- read.csv("profane.txt", header = F)
notbad <- c("bloody","butt","jerk","lmao","lmfao","omg", "penis","queer",
            "sex","vagina","wtf")
morebad <- c("abbo","asses","assfuck","assfucker","asshat","asshole",
             "assholes","assclown","asskiss","asskisser","asslick",
             "assmunch","ballsack","beaner","bong",
             "boobies","boobs","booby","booty","bullcrap",
             "bulldike","bulldyke","bunghole","cameljockey","cameltoe",
             "chinaman","chinamen","chink","chode","cum","faggot",
             "gringo","gyp","gyppo","handjob","hitler","honkey",
             "honky","hooker","jackass","jap","jiggabo","jiz",
             "junglebunny","kike","kkk","koon","kunt","nard","nazi",
             "phuck", "picaninny","piccaninny","pimp","poon","queef",
             "retard","rimjob","sodom","sodomy","sodomize","spic","spick",
             "tard","tits","tittie","titties","titty","towelhead",
             "trannie","tranny","wanker","weenie","wetback",
             "whop","wog","wop","zipperhead")
extrabad <- c("bitch","damn","fuck","nigger","phuck","shit")

bad <- as.character(bad$V1)          # downloaded list
bad <- bad[!bad %in% notbad]         # drop words I don't consider profane
bad <- sort(c(bad, morebad))         # add my own additions

corpus <- tm_map(corpus, removeWords, bad)

# For the "extrabad" stems, remove the entire word whenever a stem appears
# inside it (e.g. "bitchy"). This only works for some stems; for instance, if I
# put "ass" into "extrabad", it would mistakenly filter the word "assess".
removeExtrabad <- content_transformer(function(x)
        gsub(paste0("\\w*(", paste(extrabad, collapse = "|"), ")\\w*"), "", x))
corpus <- tm_map(corpus, removeExtrabad)
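
To make the substring filter’s behavior concrete (illustration only, with made-up text), the same regex can be applied directly with gsub:

gsub("\\w*(damn)\\w*", "", "that was a damn shame")    # "that was a  shame"
gsub("\\w*(ass)\\w*", "", "please assess the damage")  # "please  the damage" (false positive)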

Unigram

corpusU <- tm_map(corpus, removeWords, stopwords("english"))

Graphs

wordcloud(names(uniDtm.freq), uniDtm.freq, min.freq = 600, max.words = 100)
wordcloud(names(biDtm.freq), biDtm.freq, min.freq = 600, max.words = 30)
wordcloud(names(triDtm.freq), triDtm.freq, min.freq = 70, max.words = 20)

unigrams <- data.frame(uniDtm.freq)
unigrams["Unigram"] <- rownames(unigrams)
colnames(unigrams) <- c("Frequency", "Unigram")
ggplot(unigrams[1:10,], aes(x = reorder(Unigram, -Frequency), 
                            y = Frequency,
                            fill = Frequency)) +
        geom_bar(stat = "identity") +
        scale_fill_gradient(high = "red", low = "navy") +
        labs(title = "Top 10 unigrams", x = "Unigram")

bigrams <- data.frame(biDtm.freq)
bigrams["Bigram"] <- rownames(bigrams)
colnames(bigrams) <- c("Frequency", "Bigram")
ggplot(bigrams[1:10,], aes(x = reorder(Bigram, -Frequency), 
                            y = Frequency,
                            fill = Frequency)) +
        geom_bar(stat = "identity") +
        scale_fill_gradient(high = "red", low = "navy") +
        labs(title = "Top 10 bigrams", x = "Bigram")

trigrams <- data.frame(triDtm.freq)
trigrams["Trigram"] <- rownames(trigrams)
colnames(trigrams) <- c("Frequency", "Trigram")
ggplot(trigrams[1:10,], aes(x = reorder(Trigram, -Frequency), 
                           y = Frequency,
                           fill = Frequency)) +
        geom_bar(stat = "identity") +
        scale_fill_gradient(high = "red", low = "navy") +
        labs(title = "Top 10 Trigrams", x = "Trigram") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

Comps

# Draw a second random sample from each dataset (new seeds, same sampling rates)
set.seed(55)
coinflip <- rbinom(whatsthere[1,1], 1, .005)
con <- file("en_US.twitter.txt")
open(con)
dat2 <- c()
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat2 <- c(dat2, line)
        }
        current.line <- current.line + 1
}
close(con)
set.seed(2)
coinflip <- rbinom(whatsthere[1,2], 1, .16)
con <- file("en_US.news.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat2 <- c(dat2, line)
        }
        current.line <- current.line + 1
}
close(con)
set.seed(913)
coinflip <- rbinom(whatsthere[1,3], 1, .013)
con <- file("en_US.blogs.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
        if (coinflip[current.line] == 1){
                dat2 <- c(dat2, line)
        }
        current.line <- current.line + 1
}
close(con)

# Remove non-ASCII characters
nonascii <- grep("non_ascii", iconv(dat2, "latin1", "ASCII", 
                                    sub = "non_ascii"))
dat2 <- dat2[-nonascii]

# Transform
corpus2 <- VCorpus(VectorSource(dat2))
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, stripWhitespace)
corpus2 <- tm_map(corpus2, removeWords, bad)
corpus2 <- tm_map(corpus2, removeExtrabad)

# Summarize unigram corpus
corpusU2 <- tm_map(corpus2, removeWords, stopwords("english"))
uniDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpusU2$content)))
uniDtm2 <- as.matrix(removeSparseTerms(uniDtm2, 0.999))
uniDtm2.freq <- sort(colSums(uniDtm2), decreasing = TRUE)

# Summarize bigram and trigram corpus
biDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpus2$content)), control = 
                                    list(tokenize = bigramTokenizer))
triDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpus2$content)), control = 
                                     list(tokenize = trigramTokenizer))

biDtm2 <- as.matrix(removeSparseTerms(biDtm2, 0.999))
triDtm2 <- as.matrix(removeSparseTerms(triDtm2, 0.999))

biDtm2.freq <- sort(colSums(biDtm2), decreasing = TRUE)
triDtm2.freq <- sort(colSums(triDtm2), decreasing = TRUE)

# Create lists of the top n-grams
topunigrams <- data.frame(rbind(names(uniDtm.freq[1:10]), names(uniDtm2.freq[1:10])))
rownames(topunigrams) <- c("Sample 1", "Sample 2")
names(topunigrams) <- 1:10
topbigrams <- data.frame(rbind(names(biDtm.freq[1:10]), names(biDtm2.freq[1:10])))
rownames(topbigrams) <- c("Sample 1", "Sample 2")
names(topbigrams) <- 1:10
toptrigrams <- data.frame(rbind(names(triDtm.freq[1:10]), names(triDtm2.freq[1:10])))
rownames(toptrigrams) <- c("Sample 1", "Sample 2")
names(toptrigrams) <- 1:10

# Combine the frequency tables (merge keeps only n-grams present in both samples)
uniFreq <- data.frame(uniDtm.freq)
uniFreq$unigram <- rownames(uniFreq)
uniFreq2 <- data.frame(uniDtm2.freq)
uniFreq2$unigram <- rownames(uniFreq2)
uniFreq <- merge(uniFreq, uniFreq2, by = "unigram")
uniFreq$total <- uniFreq$uniDtm.freq + uniFreq$uniDtm2.freq
uniFreq <- uniFreq[order(-uniFreq$total),]

biFreq <- data.frame(biDtm.freq)
biFreq$bigram <- rownames(biFreq)
biFreq2 <- data.frame(biDtm2.freq)
biFreq2$bigram <- rownames(biFreq2)
biFreq <- merge(biFreq, biFreq2, by = "bigram")
biFreq$total <- biFreq$biDtm.freq + biFreq$biDtm2.freq
biFreq <- biFreq[order(-biFreq$total),]

triFreq <- data.frame(triDtm.freq)
triFreq$trigram <- rownames(triFreq)
triFreq2 <- data.frame(triDtm2.freq)
triFreq2$trigram <- rownames(triFreq2)
triFreq <- merge(triFreq, triFreq2, by = "trigram")
triFreq$total <- triFreq$triDtm.freq + triFreq$triDtm2.freq
triFreq <- triFreq[order(-triFreq$total),]