This report outlines the exploratory analysis I’m doing to form the basis for a text prediction algorithm.
My goal is to create an end product that will resolve several issues I personally encounter while using iPhone text prediction - namely, that the interface is distracting, and that it fails to predict or learn words that I commonly use, possibly because it filters them as expletives.
Note: I’m including critical or interesting code inline, but all other code will be listed in the Appendix.
First, I load the necessary libraries (see Appendix/Setup). For the purposes of this document, I assume all data files have already been downloaded from their original source. A list of expletives to filter has also been downloaded from http://www.bannedwordlist.com/lists/swearWords.txt (more on this choice later).
Next, using readLines, I get an idea of what's in each database. I'm choosing to use all three English-language databases - even though Twitter may be the most relevant to a texting algorithm, its 140-character limit may affect grammatical correctness, and I'd like to keep this algorithm relevant to general use.
con <- file("en_US.twitter.txt")
open(con)
twitter <- c(0, 0)   # running totals: lines, characters
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
twitter[1] <- twitter[1] + 1
twitter[2] <- twitter[2] + nchar(line)
}
close(con)
# See Appendix/Preprocessing for similar code used on the remaining datasets
As we can see from this summary, the Twitter and blog datasets contain a comparable number of characters even though their line counts differ greatly, while the news dataset (as read here) is much smaller on both measures.
|  | twitter | news | blogs |
|---|---|---|---|
| Lines | 2360148 | 77259 | 899288 |
| Characters | 162384825 | 15683765 | 208361438 |
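For reference, the summary above comes from the whatsthere data frame assembled in Appendix/Preprocessing; with knitr loaded (Appendix/Setup), it can be rendered as a table directly:
# Render the line/character summary built in Appendix/Preprocessing
kable(whatsthere)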
Given this, I take a random sample of each database, choosing the sampling rates so that roughly the same number of lines is taken from each. My computer has limited memory, so I'm taking a comparatively small subset of each database.
set.seed(216)
coinflip <- rbinom(whatsthere[1,1], 1, .005)   # flag ~0.5% of Twitter lines
con <- file("en_US.twitter.txt")
open(con)
dat <- c()
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat <- c(dat, line)
}
current.line <- current.line + 1
}
close(con)
# See Appendix/Sampling for similar code used on the remaining datasets
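As a quick check, the sampling rates used here and in Appendix/Sampling (0.5% of Twitter, 16% of news, and 1.3% of blog lines) give roughly equal expected sample sizes:
# Expected number of sampled lines per source (roughly 12,000 each)
unlist(whatsthere["Lines", ]) * c(0.005, 0.16, 0.013)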
The final preprocessing step is to remove all lines containing non-ASCII characters from the datasets. This removes many foreign words in addition to letters/words corrupted by formatting issues.
nonascii <- grep("non_ascii", iconv(dat, "latin1", "ASCII",
sub = "non_ascii"))
dat <- dat[-nonascii]
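To illustrate how the flagging works, iconv() substitutes the marker string for anything it cannot convert to ASCII, and grep() then picks out those lines (the string below is a made-up example):
# The accented character cannot be converted, so the marker appears and the line is flagged
iconv("caf\xe9 au lait", "latin1", "ASCII", sub = "non_ascii")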
Before I figure out which words and phrases are the most common, I must remove unnecessary characters and filter out profanity.
(A note on the profanity filter: I was unable to locate a fully satisfactory list of profanities online (upon review, I found them either gravely lacking or stupidly excessive), so I built my own, starting from the downloaded list and then removing and adding entries (see Appendix/Transform). Since this is obviously very subjective, this initial list is meant to be a work in progress.)
One extra filter needs to be applied to the corpus I use for unigrams: the removal of stopwords (e.g. “or”, “about”, “as”, “but”). If they were not removed, they would by far be the most common words, but I don’t want my algorithm to constantly predict “but” if the user begins to type “b-” unless there are context cues that “but” is coming next.
However, I leave stopwords in the corpus to be used for bigrams/trigrams - by definition, stopwords “go” with other words so they’ll be important in the prediction algorithm.
# See Appendix/Transform for complete code on these transformations
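As a quick illustration, the stopword list ships with tm, and the removal call below mirrors the one in Appendix/Unigram:
# A peek at tm's built-in English stopword list
head(stopwords("english"))
# Remove stopwords only from the corpus used for unigram counts
corpusU <- tm_map(corpus, removeWords, stopwords("english"))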
The final step is to make sure the corpora are in the right format for finding frequently used words and phrases.
For the unigram corpus:
# Convert to Document Term Matrix
uniDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpusU$content)))
# Remove infrequently used words
uniDtm <- as.matrix(removeSparseTerms(uniDtm, 0.999))
# Summarize word frequency
uniDtm.freq <- sort(colSums(uniDtm), decreasing = TRUE)
For the bigram/trigram corpus:
# Create functions to turn the corpus into bigrams and trigrams
bigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
}
trigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "),
use.names = FALSE)
}
# Convert to document term matrix; remove infrequent words; summarize frequency
biDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpus$content)), control =
list(tokenize = bigramTokenizer))
biDtm <- as.matrix(removeSparseTerms(biDtm, 0.999))
biDtm.freq <- sort(colSums(biDtm), decreasing = TRUE)
triDtm <- DocumentTermMatrix(VCorpus(VectorSource(corpus$content)), control =
list(tokenize = trigramTokenizer))
triDtm <- as.matrix(removeSparseTerms(triDtm, 0.999))
triDtm.freq <- sort(colSums(triDtm), decreasing = TRUE)
Using word clouds and bar graphs (see Appendix/Graphs), we can explore the most frequent unigrams, bigrams, and trigrams. As they illustrate, certain phrases (such as "one of the") are extremely popular, which means the algorithm should have a fairly easy time predicting common phrases.
To ensure that the random sampling is representative of the whole dataset, I run the same transformations on a second, independent random sample (see Appendix/Comps).
The top 10 n-grams in each category are nearly identical, even in terms of rank. Still, I combine the two frequency tables (adding their frequencies together) to improve accuracy.
Top 10 unigrams:
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample 1 | said | will | one | just | like | can | time | get | new | now |
| Sample 2 | said | will | one | just | can | like | time | get | new | now |
Top 10 bigrams:
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample 1 | of the | in the | to the | on the | for the | at the | to be | and the | in a | with the |
| Sample 2 | of the | in the | to the | for the | on the | at the | to be | and the | in a | with the |
Top 10 trigrams:
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample 1 | one of the | a lot of | to be a | some of the | thanks for the | going to be | out of the | it was a | the end of | i want to |
| Sample 2 | one of the | a lot of | thanks for the | as well as | some of the | it was a | to be a | going to be | the end of | be able to |
The next step is to build a text prediction model. When the user types a word or phrase, it should be looked up in the corresponding n-gram table - so, if the user types "one of", the model suggests "the". If there is no matching bi- or trigram, the model should fall back to predicting what unigram the user is trying to type.
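As a rough sketch of that lookup - the function name predictNext and the simple trigram-to-bigram-to-unigram backoff below are placeholder choices of mine, applied to the combined frequency tables built in Appendix/Comps:
predictNext <- function(phrase, uniFreq, biFreq, triFreq) {
    # Placeholder sketch, not the final model: try trigrams, then bigrams,
    # then fall back to the single most frequent unigram
    tokens <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
    # Trigrams whose first two words match the last two words typed
    if (length(tokens) == 2) {
        hits <- triFreq[grep(paste0("^", paste(tokens, collapse = " "), " "),
                             triFreq$trigram), ]
        if (nrow(hits) > 0)
            return(sub(".* ", "", hits$trigram[which.max(hits$total)]))
    }
    # Back off to bigrams whose first word matches the last word typed
    hits <- biFreq[grep(paste0("^", tail(tokens, 1), " "), biFreq$bigram), ]
    if (nrow(hits) > 0)
        return(sub(".* ", "", hits$bigram[which.max(hits$total)]))
    # Otherwise suggest the most frequent unigram overall
    uniFreq$unigram[which.max(uniFreq$total)]
}
predictNext("one of", uniFreq, biFreq, triFreq)   # expected: "the"
The real model will need smarter scoring and handling of partially typed words (as in the "b-" example above), but this captures the basic table lookup.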
My hope for the final application is something more user-friendly and less distracting than iPhone text prediction, so I also look forward to exploring what features Shiny has to offer.
References
Introduction to text mining using tm: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Creating document-term matrices of bigrams/trigrams: http://tm.r-forge.r-project.org/faq.html
Appendix
Setup
library(knitr, verbose = FALSE)
library(tm, verbose = FALSE)
library(SnowballC, verbose = FALSE)
library(wordcloud, verbose = FALSE)
library(ggplot2, verbose = FALSE)
Preprocessing
con <- file("en_US.news.txt")
open(con)
news <- c(0, 0)   # running totals: lines, characters
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
news[1] <- news[1] + 1
news[2] <- news[2] + nchar(line)
}
close(con)
con <- file("en_US.blogs.txt")
open(con)
blogs <- c(0, 0)   # running totals: lines, characters
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
blogs[1] <- blogs[1] + 1
blogs[2] <- blogs[2] + nchar(line)
}
close(con)
whatsthere <- data.frame(twitter, news, blogs)
rownames(whatsthere) <- c("Lines", "Characters")
Sampling
set.seed(967)
coinflip <- rbinom(whatsthere[1,2], 1, .16)   # flag ~16% of news lines
con <- file("en_US.news.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat <- c(dat, line)
}
current.line <- current.line + 1
}
close(con)
set.seed(34)
coinflip <- rbinom(whatsthere[1,3], 1, .013)   # flag ~1.3% of blog lines
con <- file("en_US.blogs.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat <- c(dat, line)
}
current.line <- current.line + 1
}
close(con)
Transform
# Turn the combined subsets into a corpus so it can be read by package tm
corpus <- VCorpus(VectorSource(dat))
# Make all text lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Finalize profanity list & filter from corpus
bad <- read.csv("profane.txt", header = F)
notbad <- c("bloody","butt","jerk","lmao","lmfao","omg", "penis","queer",
"sex","vagina","wtf")
morebad <- c("abbo","asses","assfuck","assfucker","asshat","asshole",
"assholes","assclown","asskiss","asskisser","asslick",
"assmunch","ballsack","beaner","bong",
"boobies","boobs","booby","booty","bullcrap",
"bulldike","bulldyke","bunghole","cameljockey","cameltoe",
"chinaman","chinamen","chink","chode","cum","faggot",
"gringo","gyp","gyppo","handjob","hitler","honkey",
"honky","hooker","jackass","jap","jiggabo","jiz",
"junglebunny","kike","kkk","koon","kunt","nard","nazi",
"phuck", "picaninny","piccaninny","pimp","poon","queef",
"retard","rimjob","sodom","sodomy","sodomize","spic","spick",
"tard","tits","tittie","titties","titty","towelhead",
"trannie","tranny","wanker","weenie","wetback",
"whop","wog","wop","zipperhead")
# Strings to filter even when they appear inside a longer word (e.g. "bitchy")
extrabad <- c("bitch", "damn", "fuck", "nigger", "phuck", "shit")
bad <- as.character(bad$V1)
bad <- as.character(bad[!bad %in% notbad])
bad <- sort(c(bad, morebad))
corpus <- tm_map(corpus, removeWords, bad)
# removeWords() only matches whole words, so use a gsub-based transformer to drop
# any word that contains one of the "extrabad" strings (e.g. "bitchy"). This only
# works with some strings; if I put "ass" in, it would mistakenly filter "assess".
removeEmbedded <- content_transformer(function(x)
    gsub(paste0("\\w*(", paste(extrabad, collapse = "|"), ")\\w*"), "", x))
corpus <- tm_map(corpus, removeEmbedded)
Unigram
corpusU <- tm_map(corpus, removeWords, stopwords("english"))
Graphs
wordcloud(names(uniDtm.freq), uniDtm.freq, min.freq = 600, max.words = 100)
wordcloud(names(biDtm.freq), biDtm.freq, min.freq = 600, max.words = 30)
wordcloud(names(triDtm.freq), triDtm.freq, min.freq = 70, max.words = 20)
unigrams <- data.frame(uniDtm.freq)
unigrams["Unigram"] <- rownames(unigrams)
colnames(unigrams) <- c("Frequency", "Unigram")
ggplot(unigrams[1:10,], aes(x = reorder(Unigram, -Frequency),
y = Frequency,
fill = Frequency)) +
geom_bar(stat = "identity") +
scale_fill_gradient(high = "red", low = "navy") +
labs(title = "Top 10 unigrams", x = "Unigram")
bigrams <- data.frame(biDtm.freq)
bigrams["Bigram"] <- rownames(bigrams)
colnames(bigrams) <- c("Frequency", "Bigram")
ggplot(bigrams[1:10,], aes(x = reorder(Bigram, -Frequency),
y = Frequency,
fill = Frequency)) +
geom_bar(stat = "identity") +
scale_fill_gradient(high = "red", low = "navy") +
labs(title = "Top 10 bigrams", x = "Bigram")
trigrams <- data.frame(triDtm.freq)
trigrams["Trigram"] <- rownames(trigrams)
colnames(trigrams) <- c("Frequency", "Trigram")
ggplot(trigrams[1:10,], aes(x = reorder(Trigram, -Frequency),
y = Frequency,
fill = Frequency)) +
geom_bar(stat = "identity") +
scale_fill_gradient(high = "red", low = "navy") +
labs(title = "Top 10 Trigrams", x = "Trigram") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Comps
# Draw a second, independent sample for comparison with the first
set.seed(55)
coinflip <- rbinom(whatsthere[1,1], 1, .005)
con <- file("en_US.twitter.txt")
open(con)
dat2 <- c()
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat2 <- c(dat2, line)
}
current.line <- current.line + 1
}
close(con)
set.seed(2)
coinflip <- rbinom(whatsthere[1,2], 1, .16)
con <- file("en_US.news.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat2 <- c(dat2, line)
}
current.line <- current.line + 1
}
close(con)
set.seed(913)
coinflip <- rbinom(whatsthere[1,3], 1, .013)
con <- file("en_US.blogs.txt")
open(con)
current.line <- 1
while(length(line <- readLines(con, n = 1, warn = FALSE)) > 0){
if (coinflip[current.line] == 1){
dat2 <- c(dat2, line)
}
current.line <- current.line + 1
}
close(con)
# Remove non-ASCII characters
nonascii <- grep("non_ascii", iconv(dat2, "latin1", "ASCII",
sub = "non_ascii"))
dat2 <- dat2[-nonascii]
# Transform
corpus2 <- VCorpus(VectorSource(dat2))
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, stripWhitespace)
corpus2 <- tm_map(corpus2, removeWords, bad)
corpus2 <- tm_map(corpus2, removeEmbedded)
# Summarize unigram corpus
corpusU2 <- tm_map(corpus2, removeWords, stopwords("english"))
uniDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpusU2$content)))
uniDtm2 <- as.matrix(removeSparseTerms(uniDtm2, 0.999))
uniDtm2.freq <- sort(colSums(uniDtm2), decreasing = TRUE)
# Summarize bigram and trigram corpus
biDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpus2$content)), control =
list(tokenize = bigramTokenizer))
triDtm2 <- DocumentTermMatrix(VCorpus(VectorSource(corpus2$content)), control =
list(tokenize = trigramTokenizer))
biDtm2 <- as.matrix(removeSparseTerms(biDtm2, 0.999))
triDtm2 <- as.matrix(removeSparseTerms(triDtm2, 0.999))
biDtm2.freq <- sort(colSums(biDtm2), decreasing = TRUE)
triDtm2.freq <- sort(colSums(triDtm2), decreasing = TRUE)
# Create lists of the top n-grams
topunigrams <- data.frame(rbind(names(uniDtm.freq[1:10]), names(uniDtm2.freq[1:10])))
rownames(topunigrams) <- c("Sample 1", "Sample 2")
names(topunigrams) <- 1:10
topbigrams <- data.frame(rbind(names(biDtm.freq[1:10]), names(biDtm2.freq[1:10])))
rownames(topbigrams) <- c("Sample 1", "Sample 2")
names(topbigrams) <- 1:10
toptrigrams <- data.frame(rbind(names(triDtm.freq[1:10]), names(triDtm2.freq[1:10])))
rownames(toptrigrams) <- c("Sample 1", "Sample 2")
names(toptrigrams) <- 1:10
# Combine the frequency tables (the merge keeps only n-grams that appear in both samples)
uniFreq <- data.frame(uniDtm.freq)
uniFreq$unigram <- rownames(uniFreq)
uniFreq2 <- data.frame(uniDtm2.freq)
uniFreq2$unigram <- rownames(uniFreq2)
uniFreq <- merge(uniFreq, uniFreq2, by = "unigram")
uniFreq$total <- uniFreq$uniDtm.freq + uniFreq$uniDtm2.freq
uniFreq <- uniFreq[order(-uniFreq$total),]
biFreq <- data.frame(biDtm.freq)
biFreq$bigram <- rownames(biFreq)
biFreq2 <- data.frame(biDtm2.freq)
biFreq2$bigram <- rownames(biFreq2)
biFreq <- merge(biFreq, biFreq2, by = "bigram")
biFreq$total <- biFreq$biDtm.freq + biFreq$biDtm2.freq
biFreq <- biFreq[order(-biFreq$total),]
triFreq <- data.frame(triDtm.freq)
triFreq$trigram <- rownames(triFreq)
triFreq2 <- data.frame(triDtm2.freq)
triFreq2$trigram <- rownames(triFreq2)
triFreq <- merge(triFreq, triFreq2, by = "trigram")
triFreq$total <- triFreq$triDtm.freq + triFreq$triDtm2.freq
triFreq <- triFreq[order(-triFreq$total),]