# dependencies
library(knitr)
library(tm)
## Loading required package: NLP
library(NLP)
library(readr)
# also install: filehash
This is an intermediate report on the capstone project, which aims to build a system that accelerates user input, e.g. on mobile devices, by suggesting likely next words while the user is typing. I expect the project to conclude with a ground-up prototype of a realistic predictive engine.
The language the solution works in is up to the student; I expect to choose the (presumably fairly typical) US English.
The search for a solution cannot aim to be exhaustive. For instance, relying on someone else’s service is not really an option when the goal is to demonstrate skills, although it would certainly be a possibility in practice.
The two most promising predictive strategies are covered here, namely single-word frequencies and two-word sequences (bigrams).
As it is useful to verify hypotheses on a smaller subset before expanding the scope of the discussion, the examination is restricted to these two word structures; longer ones may be considered at a later stage.
For this project, a set of 3 files was provided. Together they form a large corpus of US English text, amounting to hundreds of megabytes of data. The sources are summarized below based on subsamples; subsampling keeps the processing fast despite the sheer size of the data (and R’s reputation for scaling poorly to large data sets).
For the exploration, numbers and punctuation have been consistently removed. This may need to be reconsidered before the final prototype.
data.path = "../../input/en_US/"
file.names = dir(data.path)
file.paths = file.path(data.path, file.names)
sizes = file.size(file.paths)
# prepare for reproducible random subsampling
set.seed(0)
# read up the files
lines = rep(NA, 3)
line.counts = rep(NA, 3)
sample.lines = c(3000, 3000 * 1.05, 9000)
for(i in 1:3) {
text = readLines(file.paths[i], -1)
line.counts[i] = length(text)
# use a small random sample of the lines, as
# text manipulation gets slow otherwise
selected = sample(x = length(text), size = sample.lines[i])
text = text[selected]
# collapse the sampled lines into a single character string
text = paste0(text, collapse = " ")
lines[i] = text
}
## Warning in readLines(file.paths[i], -1): line 167155 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 268547 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1759032 appears to contain an
## embedded nul
# prepare for text mining
corp = Corpus(VectorSource(lines))
inspect(corp)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 678432
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 644799
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 627748
cl.corp = tm_map(corp, stripWhitespace)
cl.corp = tm_map(cl.corp, content_transformer(tolower))
cl.corp = tm_map(cl.corp, removePunctuation)
cl.corp = tm_map(cl.corp, removeNumbers)
cl.corp = tm_map(cl.corp, stemDocument)
# currently stopwords are not removed
# cl.corp = tm_map(cl.corp, removeWords, stopwords("english"))
dtm = DocumentTermMatrix(cl.corp)
rm(corp)
rm(lines)
full.to.sample = line.counts / sample.lines
kable(
caption = "Data set summary table (words are stemmed)",
align = c("l", "r", "r", "r", "r", "r", "r"),
col.names = c("File", "File Size (MB)", "Lines (k)",
"Word Count (est. total, k)", "Sampling Ratio",
"Word Count (sample)",
"Unique Words (sample)"),
data.frame(file.names,
round(sizes / 1024 / 1024, digits = 2),
round(line.counts / 1000),
round(apply(as.matrix(dtm), MARGIN = 1, FUN = sum)
* full.to.sample / 1000),
sprintf("%.2f %%", 1 / full.to.sample * 100),
apply(as.matrix(dtm), MARGIN = 1, FUN = sum),
apply(as.matrix(dtm), MARGIN = 1, FUN = function(x) sum(x != 0)))
)
| File | File Size (MB) | Lines (k) | Word Count (est. total, k) | Sampling Ratio | Word Count (sample) | Unique Words (sample) |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899 | 28013 | 0.33 % | 93449 | 10858 |
| en_US.news.txt | 196.28 | 1010 | 27588 | 0.31 % | 86020 | 11439 |
| en_US.twitter.txt | 159.36 | 2360 | 22482 | 0.38 % | 85732 | 10997 |
Based on this table, the diversity of the language seems to be highest in the news corpus, while Twitter messages and blog posts repeat phrases somewhat more. The differences are not large, however.
Zipf’s law states that word frequencies follow a negative power of the word’s rank \(r\): \(P(w_r) = c \, r^{p}\) for some \(c > 0\) and \(p < 0\) (classically \(p \approx -1\)), so the distribution should appear approximately linear on a log-log scale.
zipf.freqs = as.matrix(dtm)
dtm.col.total =
zipf.freqs[1, ] +
zipf.freqs[2, ] +
zipf.freqs[3, ] # likely to be faster than apply()
for(i in 1:nrow(zipf.freqs)) {
zipf.freqs[i, ] = sort(zipf.freqs[i, ], decreasing = TRUE)
}
dtm.col.total = sort(dtm.col.total, decreasing = TRUE)
zipf.freqs = log10(zipf.freqs)
dtm.col.total = log10(dtm.col.total)
plot(main = "Distribution of word frequency",
ylab = "Log of occurrences",
xlab = "Log of word rank",
x = log10(1:length(zipf.freqs[1, ])),
y = zipf.freqs[1, ], col = "red", type = "l")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = dtm.col.total, col = "black")
plot(main = "Frequency distribution of top ranked words",
ylab = "Log of occurrences",
xlab = "Log of word rank",
xlim = c(0, log10(length(zipf.freqs[1, ])) / 3),
x = log10(1:length(zipf.freqs[1, ])),
zipf.freqs[1, ], col = "red", type = "l")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
y = dtm.col.total, col = "black")
The charts appear to agree with this formula; furthermore, words ranked beyond roughly 10,000 occur very rarely, typically only once in the samples. This potentially enables focusing on a relevant subset of the whole corpus later on, which is useful if only a smaller dictionary or data set needs to be shipped to a potentially less capable mobile device.
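As a rough supplementary check (a sketch, not part of the original analysis; the names top.n and zipf.fit are made up for illustration), the exponent \(p\) can be estimated by a linear fit on the log-log scale, and the terms occurring only once can be counted directly:
# estimate the power-law exponent from the top-ranked terms of the combined sample
top.n = 1000
zipf.fit = lm(dtm.col.total[1:top.n] ~ log10(1:top.n))
coef(zipf.fit)[2] # slope estimate; expected to be negative, roughly -1
# how many stemmed terms occur exactly once across the three samples?
sum(colSums(as.matrix(dtm)) == 1)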
The feasibility of this can be confirmed by looking at what share of all word occurrences the top-ranked words account for.
plot.occurrence.chart = function(freqs, main, pos.2 = 4) {
freqs = sort(freqs, decreasing = TRUE)
total.words = sum(freqs)
cumsum.freqs = cumsum(freqs)
plot(main = main,
xlab = "Last rank included",
ylab = "Occurrences (%)",
xlim = c(1, length(cumsum.freqs)),
cumsum.freqs / total.words * 100,
type = "l",
col = "darkgrey")
# horizontal and vertical guides for the 90 % coverage point
abline(h = 90, col = "grey", lty = 2)
rank.90.perc = min(which(cumsum.freqs > total.words * 0.9))
abline(v = rank.90.perc, col = "grey", lty = 2)
text(x = rank.90.perc, y = 90, pos = pos.2,
sprintf("90%% of the sampled text is covered\nby the top %d words", rank.90.perc))
# horizontal and vertical guides for the 50 % coverage point
rank.50.perc = min(which(cumsum.freqs > total.words * 0.5))
abline(h = 50, col = "grey", lty = 3)
abline(v = rank.50.perc, col = "grey", lty = 3)
text(x = rank.50.perc, y = 50, pos = 4,
sprintf("50%% of the sampled text is covered\nby the top %d words", rank.50.perc))
return(list(rank.50.perc = rank.50.perc, rank.90.perc = rank.90.perc))
}
freqs = as.matrix(dtm)
freqs = freqs[1, ] + freqs[2, ] + freqs[3, ]
freq.quantiles =
plot.occurrence.chart(freqs, "Occurrences covered by top n (stemmed) words")
This figure confirms that a relatively small proportion of the unique stemmed words (20.57 % in the sample) suffices to cover 9 out of 10 word occurrences the user may type. If covering only 50 % of the occurrences is acceptable, a handful of words (1.05 %) suffices.
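For transparency, a short sketch of how these percentages relate to the plotted ranks, assuming they are the 90 % and 50 % coverage ranks divided by the number of unique stemmed terms in the sample:
# coverage ranks expressed as a share of the unique stemmed terms
# (assumes this is how the figures quoted above were derived)
c(top.90.perc = freq.quantiles$rank.90.perc / length(freqs) * 100,
top.50.perc = freq.quantiles$rank.50.perc / length(freqs) * 100)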
The top words look promising, although, as expected, they are mostly stopwords:
library(stringi)
df.dtm = t(as.matrix(dtm))
df.dtm = cbind(df.dtm[, 1] + df.dtm[, 2] + df.dtm[, 3], df.dtm)
df.dtm =
data.frame(
# failed to make kable display backslashes in HTML
# raw = gsub("[\\]", "\\\\\\", stri_escape_unicode(rownames(df.dtm))),
raw = gsub("[\\]", " ", stri_escape_unicode(rownames(df.dtm))),
df.dtm
)
colnames(df.dtm) = c("Raw", "Total", "occ1", "occ2", "occ3")
kable(x = head(df.dtm[order(df.dtm$Total, decreasing = TRUE), ], 20),
caption = "Most frequent words",
col.names = c("Raw", "Total", file.names))
| Word | Raw | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|---|---|---|---|---|---|
| the | the | 16039 | 6176 | 6254 | 3609 |
| and | and | 8033 | 3540 | 2859 | 1634 |
| for | for | 3829 | 1202 | 1133 | 1494 |
| that | that | 3821 | 1593 | 1137 | 1091 |
| you | you | 3366 | 1019 | 304 | 2043 |
| with | with | 2398 | 906 | 799 | 693 |
| was | was | 2020 | 863 | 725 | 432 |
| have | have | 1944 | 783 | 464 | 697 |
| this | this | 1892 | 868 | 407 | 617 |
| are | are | 1678 | 635 | 440 | 603 |
| but | but | 1668 | 664 | 510 | 494 |
| not | not | 1439 | 575 | 334 | 530 |
| from | from | 1292 | 545 | 461 | 286 |
| your | your | 1287 | 378 | 153 | 756 |
| will | will | 1164 | 383 | 344 | 437 |
| just | just | 1127 | 351 | 177 | 599 |
| all | all | 1105 | 464 | 219 | 422 |
| get | get | 1095 | 308 | 196 | 591 |
| one | one | 1076 | 453 | 299 | 324 |
| they | they | 1073 | 453 | 353 | 267 |
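As a quick cross-check (a sketch using tm’s built-in English stopword list; top.20 is made up for illustration), one can count how many of these top terms are indeed stopwords:
# how many of the 20 most frequent stemmed terms are English stopwords?
top.20 = rownames(df.dtm)[order(df.dtm$Total, decreasing = TRUE)][1:20]
sum(top.20 %in% stopwords("english"))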
The vocabulary analyzed currently contains entries that may be unnecessary or unwanted among the predictions, such as vulgar or foreign expressions and emoticons, alongside the rarer words.
One way to improve on this is to restrict the examination to the aforementioned most frequent minority of the text elements.
As an example, the first entries of the term list are emoticon sequences:
kable(
col.names = c("Raw", "Total", "Freq.1", "Freq.2", "Freq.3"),
x = head(df.dtm, 10)
)
| Term | Raw | Total | Freq.1 | Freq.2 | Freq.3 |
|---|---|---|---|---|---|
| ✨🌟✨ | u2728 U0001f31f u2728 | 1 | 0 | 0 | 1 |
| 😏😏👍 | U0001f60f U0001f60f U0001f44d | 1 | 0 | 0 | 1 |
| 💜🏃🍴 | U0001f49c U0001f3c3 U0001f374 | 1 | 0 | 0 | 1 |
| 😂😂😂 | U0001f602 U0001f602 U0001f602 | 2 | 0 | 0 | 2 |
| 💰💰💰 | U0001f4b0 U0001f4b0 U0001f4b0 | 1 | 0 | 0 | 1 |
| 💜💛🏀 | U0001f49c U0001f49b U0001f3c0 | 1 | 0 | 0 | 1 |
| 👠👗👙 | U0001f460 U0001f457 U0001f459 | 1 | 0 | 0 | 1 |
| 😊😍🍓 | U0001f60a U0001f60d U0001f353 | 1 | 0 | 0 | 1 |
| 😁😁😁 | U0001f601 U0001f601 U0001f601 | 1 | 0 | 0 | 1 |
| 😞😞😞 | U0001f61e U0001f61e U0001f61e | 1 | 0 | 0 | 1 |
Unless we are building a context-specific system (which could later become a goal), a simpler initial approach is to restrict the data set to those linguistic units that appear in all 3 corpora. This can be useful since some phrases seem to appear rarely outside of certain contexts.
Such a filter would already have removed the problematic entries shown above. Swearwords also appear to be very infrequent in blog entries. However, absence from one of the documents may simply be due to the sparse sampling.
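A minimal sketch of this shared-vocabulary filter, reusing the document-term matrix dtm built above (the name shared.terms is made up for illustration):
# keep only the stemmed terms that occur in all three corpora
m = as.matrix(dtm)
shared.terms = colnames(m)[colSums(m > 0) == nrow(m)]
length(shared.terms) # size of the restricted vocabulary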
Also, although it was not heavily emphasized, the words were considered in stemmed form. This makes a smaller sample denser in distributional terms, which helps the analysis, but it also allows an initial narrowing of the choices in a two-phase prediction method: offering the user a smaller set of incomplete predictions can also speed entry up. For example, after typing “ac” and selecting the offered stem academi, the user can then be presented with likely endings of that stem, e.g. academia, academic, etc.
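A toy sketch of this two-phase idea (the tiny vocabulary and the stem.to.words table below are hypothetical, not derived from the corpus):
# phase 1: offer stems that match the typed prefix
stems = c("academi", "accept", "account")
suggest.stems = function(prefix) stems[startsWith(stems, prefix)]
# phase 2: offer full-word endings for the chosen stem
stem.to.words = list(academi = c("academia", "academic", "academics"))
suggest.stems("ac")
stem.to.words[["academi"]]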
Some word stems, however, may have an unequivocal continuation (at least judging from the available corpora), in which case offering the stem itself is likely unnecessary. One example identified is albuquerqu.
Languages change continuously, and good prediction algorithms should reflect this; a static data set cannot.
One way to address this is to adapt the data set continuously based on observed user input; this also allows adapting to the user’s own vocabulary.
Another way is to distribute data set updates e.g. via the web.
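A minimal sketch of the first of these approaches, a hypothetical in-memory count store that is not part of the current pipeline (user.counts and observe.word are made up for illustration):
# increment a word's count every time the user confirms it
user.counts = new.env()
observe.word = function(w) {
old = if (exists(w, envir = user.counts, inherits = FALSE)) get(w, envir = user.counts) else 0
assign(w, old + 1, envir = user.counts)
}
observe.word("academic")
observe.word("academic")
get("academic", envir = user.counts) # 2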
On the other hand, when looking for a core set of expressions to wrap up in a conveniently small package, being time-tested could form a basis for preference. Stemming can also help here, or coincide with this filter: new words are from time to time constructed from existing stems, and stems are the more persistent, constant elements of a language.
As the leading words of sentences are capitalized, they should be treated case-insensitively (i.e. as if their casing were unknown) when harvesting data, or ignored altogether.
For prediction, it is useful to know whether bigrams, i.e. two-word sequences in a text, have a markedly uneven distribution. Intuitively this is very likely.
The words most likely to follow a given word w are the ones to display at the top of a list of suggestions once w has been entered. In other words, these are the words with the highest probability conditional on being preceded by w; a sketch of such a lookup follows the bigram table below.
# based on http://tm.r-forge.r-project.org/faq.html
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm.2 <- TermDocumentMatrix(cl.corp, control = list(tokenize = BigramTokenizer))
df.dtm.2 = as.matrix(dtm.2)
df.dtm.2 = cbind(df.dtm.2[, 1] + df.dtm.2[, 2] + df.dtm.2[, 3], df.dtm.2)
colnames(df.dtm.2) <- c("total", "occ1", "occ2", "occ3")
df.dtm.2 = as.data.frame(df.dtm.2)
df.dtm.2 = df.dtm.2[order(df.dtm.2$total, decreasing = TRUE), ]
plot(x = log10(1:nrow(df.dtm.2)),
y = log10(df.dtm.2$total),
type = "l",
col = "darkgrey",
xlim = c(0, log10(nrow(df.dtm.2))),
ylim = c(-0.5, max(log10(df.dtm.2$total))),
ylab = "Log of occurrences", xlab = "Log of rank",
main = "Distribution of bigrams in the samples by rank")
The distribution is very similar to that of the words.
quantiles =
plot.occurrence.chart(df.dtm.2$total,
"Occurrences covered by top n (stemmed) bigrams",
pos.2 = 2)
Since stopword removal is not part of the data processing, the top bigrams are expected to contain these:
kable(x = head(df.dtm.2, 20), caption = "Most frequent bigrams",
col.names = c("Total", file.names))
| Bigram | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|---|---|---|---|---|
| of the | 1485 | 658 | 609 | 218 |
| in the | 1340 | 497 | 556 | 287 |
| for the | 742 | 231 | 215 | 296 |
| to the | 696 | 272 | 266 | 158 |
| on the | 652 | 253 | 227 | 172 |
| to be | 539 | 216 | 170 | 153 |
| at the | 505 | 164 | 208 | 133 |
| and the | 428 | 183 | 183 | 62 |
| go to | 379 | 109 | 86 | 184 |
| with the | 377 | 163 | 140 | 74 |
| in a | 369 | 126 | 162 | 81 |
| is a | 341 | 155 | 79 | 107 |
| have a | 311 | 98 | 52 | 161 |
| it was | 310 | 151 | 88 | 71 |
| from the | 309 | 153 | 103 | 53 |
| is the | 306 | 121 | 74 | 111 |
| for a | 305 | 109 | 84 | 112 |
| want to | 302 | 117 | 66 | 119 |
| i have | 298 | 166 | 28 | 104 |
| it is | 290 | 169 | 49 | 72 |
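Finally, the lookup sketch referred to earlier: ranking candidate next words by their bigram counts, which, for a fixed preceding word w, orders them by the conditional probability mentioned above (the helper suggest.next is made up for illustration):
# suggest the n most frequent continuations of the word w in the sampled bigrams
suggest.next = function(w, n = 3) {
counts = setNames(df.dtm.2$total, rownames(df.dtm.2))
following = counts[startsWith(names(counts), paste0(w, " "))]
head(sort(following, decreasing = TRUE), n)
}
suggest.next("want") # e.g. "want to" should rank highly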
The set of questions to consider was:
Questions 1-3 have been addressed above.
Some things cannot be detected: certain words coincide in spelling yet mean different things in different languages. Other cases are as easy as filtering out non-ASCII characters. I expect there are strategies, but no 100 % accurate solution for this.
In short, at least one easy step towards identifying such words is to look out for non-ASCII characters, as today’s English words do not contain these.
Another good option is to obtain an English dictionary or thesaurus and verify words against it; there are probably some freely available ones out there.
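A minimal sketch of the non-ASCII filter, reusing the stringi package loaded earlier (the name ascii.terms is made up for illustration):
# drop terms containing non-ASCII characters (e.g. the emoticon sequences above)
all.terms = rownames(df.dtm)
ascii.terms = all.terms[stri_enc_isascii(all.terms)]
length(all.terms) - length(ascii.terms) # number of terms removed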
Categorizing word endings, i.e. the suffixes that follow the stems, may group words in such a way that likely continuations can be inferred from group membership: a word could then be continued with the ending of another word from the same group. This need not happen in the two phases mentioned above; a small set of choices removes the need for it.
Knowing whether a word is a verb, adjective, or noun can also be helpful. Certain verbs are often followed by a small subset of stopwords (e.g. “want to”, “applies to”). Such relationships are easier to evaluate on a non-stemmed data set.
The above probably outlines enough of the project’s difficulties to start working on it effectively. Extracting words and bigrams (perhaps also trigrams) from larger corpora, addressing casing issues, handling potential performance problems (for not-necessarily-pure-R solutions, processing some hundreds of megabytes of data should not be an issue), filtering words, researching dictionaries, and restricting suggestions are plenty of tasks to proceed with.