# Set working directory
setwd("~/Documents/Courses/datasciencecoursera/Capstone/")
# load packages
library(stringr); library(tm); library(scales); library(SnowballC)
library(ggplot2); library(pryr); library(stylo); library(data.table)
library(doMC); library(doParallel); library(stringi); library(knitr)
library(gtable); library(gridExtra)
# register two cores for faster processing
registerDoMC(2)
The objective of this report is to provide an overview of the data that will be used to train the predictive text algorithm.
For this project, I will build a predictive text algorithm using three different bodies of text (i.e. “corpora”). These include excerpts from (1) blogs, (2) Twitter, and (3) news articles. The files are provided by SwiftKey and are presumed to be representative of the English language.
if(!("corpora.RData" %in% list.files())){ # skip if cached
# read data into R (original files zipped for github)
corpora = rbind(blogs = data.table(raw = readLines("Data/en_US/en_US.blogs.txt.gz"),
corpus = "blogs"),
twitter = data.table(raw = readLines("Data/en_US/en_US.twitter.txt.gz"),
corpus = "twitter"),
news = data.table(raw = readLines("Data/en_US/en_US.news.txt.gz"),
corpus = "news"))
setkey(corpora, corpus)
} else load("corpora.RData")
For illustration, data entries resemble excerpts like these:
head(corpora[,raw], 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan \342\200\234gods\342\200\235."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
# create variables describing features of the corpora
if(!("corpora.RData" %in% list.files())){ # skip if cached
corpora[, raw := str_trim(raw)]
corpora[, clean := gsub("[[:alnum:] ][.!?]", " STARTSENTENCE ", raw)]
punct <- '[]\\?!\"#$%&(){}+*/:;,._`|~\\[<=>@\\^]' # all punctuation except apostrophe and hyphen
corpora[, clean := tolower(gsub(punct, " ", clean)) ]
corpora[, clean := gsub("\342\200\231", "\'", clean)]
corpora[, clean := gsub("startsentence", "<start>", clean) ]
corpora[, nchar := str_length(raw)]
corpora[, words := str_count(raw, "\\S+")]
corpora[, sentences := str_count(raw, "[[:alnum:] ][.!?]")]
save(corpora, file = "corpora.RData")
}
# create a boolean variable flagging entries with profane words
# blacklist taken from the Google WDYL API
if(!("corpora.RData" %in% list.files())){
blacklist = str_trim(readLines("Data/google_twunter_lol.txt"))
x = paste0("\\b", paste(blacklist, collapse = "\\b|\\b"), "\\b")
corpora[, profane := grepl(x, clean) ]
setkeyv(corpora, c("corpus", "profane"))
save(corpora, file = "corpora.RData")
}
Looking at the entire body of training data, we can pull some summary statistics and charts for each corpus.
stats = corpora[, .(.N, mean(nchar), mean(words), mean(sentences),
percent(sum(profane)/.N)), by = corpus]
kable(stats, digits = 1, col.names = c("Corpus", "# of Entries",
"Average # of Characters","Average # of Words",
"Average # of Sentences", "Profanity"))
| Corpus | # of Entries | Average # of Characters | Average # of Words | Average # of Sentences | Profanity |
|---|---|---|---|---|---|
| blogs | 899288 | 231.7 | 41.5 | 2.5 | 2.81% |
| news | 1010242 | 201.7 | 34.0 | 2.2 | 0.818% |
| twitter | 2360148 | 68.8 | 12.9 | 1.3 | 4.59% |
p1 = ggplot(corpora, aes(x = nchar, fill = corpus)) +
geom_density(alpha=.3) + xlim(c(0,750)) +
ggtitle ("Characters") +
xlab(NULL) + ylab(NULL) +
theme(legend.position="bottom")
p2 = ggplot(corpora, aes(x = words, fill = corpus)) +
geom_density(alpha=.3, adjust = 2) + xlim(c(0,100)) +
ggtitle ("Words") +
xlab(NULL) + ylab(NULL) +
theme(legend.position="bottom")
p3 = ggplot(corpora, aes(x = sentences, fill = corpus)) +
geom_density(alpha=.3, adjust = 15) + xlim(c(0,15)) +
ggtitle ("Sentences") +
xlab(NULL) + ylab(NULL) +
theme(legend.position="bottom")
#Combine plots
g_legend <- function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)}
leg <- g_legend(p1)
nl = theme(legend.position="none")
grid.arrange(arrangeGrob(arrangeGrob(p1 + nl, p2 + nl, p3 + nl, ncol=3),
leg, nrow = 2, heights = c(6,1)))
Each corpus can be characterised quite differently from the others. From the table above, we see that the Twitter corpus has the most entries, at 2.4 million. These tweets are limited in length to no more than 140 characters, and therefore have shorter average sentence lengths and word counts. The Twitter corpus also has the highest proportion of entries flagged for profanity.
The spike near the 140-character maximum for tweets suggests that many Twitter users hit the limit and then trim their tweets until they fit within it. This may be important in the training stage, as tweets near the 140-character maximum are more likely to contain truncated or abbreviated words.
Other takeaways from the graphs are that blog entries tend to be either very short (~75 characters) or in the 100-500 word range. Entries in the news article corpus are more consistent, typically not exceeding 400 characters.
# sample documents
set.seed(9173)
S1 = sample(corpora[,.I[profane == FALSE]], 50000)
docs <- Corpus(VectorSource(corpora[S1, .(clean)]))
# unigrams
tdm <- TermDocumentMatrix(docs, control = list(stopwords = TRUE,
removeNumbers = TRUE, wordLengths = c(1, Inf)))
t1 = as.data.frame(as.matrix(tdm))
t1 = cbind(setDT(tstrsplit(as.character(row.names(t1)), " ", fixed=T))[],t1[[1]])
names(t1) <- c("N1", "count")
setkeyv(t1, c( "N1", "count"))
t1[, logp := -1 * log(count/t1[,sum(count)]) ]
# bigrams
# drop back to a single core: the Java-based RWeka tokenizers do not work reliably with forked parallel workers
registerDoMC(1); library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(docs, control = list(stopwords = FALSE,
tokenize = BigramTokenizer,
removeNumbers = TRUE, wordLengths = c(1, Inf)))
t2 = as.data.frame(as.matrix(tdm2))
t2 = cbind(setDT(tstrsplit(as.character(row.names(t2)), " ", fixed=T))[],t2[[1]])
names(t2) <- c("N2", "N1", "count")
setkeyv(t2, c("N2", "N1", "count"))
t2[, logp := -1 * log(count/t2[,sum(count)]) ]
t2 <- t2[N2 != "" & N1 != "<start>" ]
# trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(docs, control = list(stopwords = FALSE,
tokenize = TrigramTokenizer,
removeNumbers = TRUE, wordLengths = c(1, Inf)))
t3 = as.data.frame(as.matrix(tdm3))
t3 = cbind(setDT(tstrsplit(as.character(row.names(t3)), " ", fixed=T))[],t3[[1]])
names(t3) <- c("N3", "N2", "N1", "count")
setkeyv(t3, c("N3", "N2", "N1", "count"))
t3[, logp := -1 * log(count/t3[,sum(count)]) ]
t3 <- t3[ N3 != "" & N2 != "" ]
Using these corpora, my next steps will be to build probability tables for word combinations (i.e. “n-grams”). These probabilities will serve as the basis for prediction in my app. This logic is known as a “Markov chain” approach to language modeling.
To illustrate this idea, the most frequent two- and three-word combinations are printed below (“bigrams” and “trigrams”). The tables include the counts of these n-grams and their negative log probabilities; the logarithm is taken so that very small probabilities do not have to be stored and compared directly. Each word of the n-gram is split into its own variable (N1, N2, etc.), with the highest-numbered variable holding the first word and N1 the last:
kable(t2[count > 2500])
| N2 | N1 | count | logp |
|---|---|---|---|
| in | the | 4787 | 5.480677 |
| it | s | 2604 | 6.089532 |
| of | the | 5065 | 5.424227 |
| to | the | 2561 | 6.106183 |
kable(t3[count > 400])
| N3 | N2 | N1 | count | logp |
|---|---|---|---|---|
| i | don | t | 615 | 7.488420 |
| one | of | the | 428 | 7.850919 |
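As a quick illustration of how these tables can be queried (a minimal sketch that assumes the column layout above, where N2 holds the context word and N1 the word that follows it), the most likely word to follow “of” can be read off the bigram table by filtering on the context and ranking by the negative log probability:
# illustrative lookup: candidate words following "of", ranked by probability
# (smallest logp = most probable; relies on the t2 table built above)
head(t2[N2 == "of"][order(logp)], 3)  # "of the" should rank first, given the counts above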
The eventual app will take a user-defined string as input, check whether its final words appear in the highest-order n-gram table, and, if not, successively check the lower-order tables. This technique is called “backoff.”
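As a rough sketch of that logic (illustrative only; it behaves like an unweighted “stupid backoff” and ignores the discounting a proper backoff model would apply), a hypothetical helper could chain the three tables built above:
# illustrative backoff lookup: try the trigram table, then bigrams, then unigrams
# w2 and w1 are the last two words typed by the user
predict_next <- function(w2, w1) {
  hit <- t3[N3 == w2 & N2 == w1][order(logp)]
  if (nrow(hit) > 0) return(hit[1, N1])
  hit <- t2[N2 == w1][order(logp)]
  if (nrow(hit) > 0) return(hit[1, N1])
  t1[order(logp)][1, N1]          # fall back to the most frequent unigram
}
predict_next("one", "of")         # returns "the", given the counts above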
Time allowing, I will also try to address:
1. Rare & New Words: To account for rare word combinations that do not appear in the corpora simply by chance, I plan to use so-called ‘Kneser-Ney smoothing.’ This should improve prediction accuracy for new vocabulary by assigning unseen words the same probability as n-grams that appear only once in the corpora (a rough sketch of this computation follows this list). I also plan to experiment with having the Shiny app re-calculate inclusion probabilities for new words introduced by the end user at runtime.
2. Apostrophes and More Cleaning: The current tokenization strips out apostrophes and treats strings such as “don’t” as two words. I will try to re-tokenize them as one word and test whether predictive performance improves. I will also do another spot check of the n-grams to see if the data can be better cleaned before re-training. This may mean removing the Twitter corpus from training, as tweets can be highly idiosyncratic compared to other sources of English.
3. Machine Learning: Though I believe a working algorithm can be achieved using the method described above, I will experiment with machine learning, using a number of tokens and features describing the text as predictors. If these models perform well, I may consider changing my approach.
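To make item 1 above concrete, here is a rough sketch of how interpolated Kneser-Ney probabilities could be computed for the bigram table. This is an illustration under assumptions, not the final implementation: the discount d (conventionally around 0.75) and the exact normalisation would still need to be tuned and validated.
# illustrative interpolated Kneser-Ney smoothing for bigrams, built from t2 as it exists above
d <- 0.75                                                  # absolute discount (assumed value)
cont <- t2[, .(ncont = .N), by = N1]                       # number of distinct left-contexts per word
cont[, pcont := ncont / nrow(t2)]                          # continuation probability
ctx <- t2[, .(c1 = sum(count), ntypes = .N), by = N2]      # totals and follower types per context word
t2kn <- merge(merge(t2, cont[, .(N1, pcont)], by = "N1"), ctx, by = "N2")
t2kn[, pkn := pmax(count - d, 0) / c1 + (d * ntypes / c1) * pcont]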