Coursera Data Science Specialization Capstone

# dependencies

library(knitr)
library(tm)
## Loading required package: NLP
library(NLP)
library(readr)

# also install: filehash

1. Summary

This is an intermediate report on the capstone project, which aims to create a system that accelerates user input, e.g. on mobile devices, by suggesting the next word while the user is typing. I expect the project to finish with a ground-up prototype of a realistic predictive engine.

The language the solution works in is up to the student; I expect to choose (I assume fairly typical) US English.

This search for a solution cannot aim to be exhaustive. For instance, using someone else’s service is not really an option when the point is to demonstrate skills, although it is certainly a possibility in practice.

The two most promising predictive strategies are covered here, namely suggestions based on overall word frequencies and suggestions based on bigram (two-word) frequencies.

As it is also useful to verify and construct hypotheses on a smaller subset before expanding the scope of the discussion, the examination is restricted to two-word structures; longer ones may be considered at a later stage.

2. Introducing the data set

For this project, a set of 3 files was provided. These encompass a large corpus of (US) English text, amounting to hundreds of megabytes of data. The sources are summarized below based on subsamples. Subsampling allows the processing to take place quickly despite the sheer size of the data (and R’s notorious reputation for scaling poorly to large data sets).

For the sake of exploration, numbers and punctuation have been consistently removed. This may need to be reconsidered before the final prototype.

data.path  = "../../input/en_US/"
file.names = dir(data.path)
file.paths = file.path(data.path, file.names)
sizes = file.size(file.paths)

# prepare for reproducible random subsampling
set.seed(0)
# read up the files
lines = rep(NA_character_, 3)
line.counts = rep(NA_integer_, 3)
# number of lines to keep from each file
sample.lines = c(3000, 3000 * 1.05, 9000)
for(i in 1:3) {
  text = readLines(file.paths[i], n = -1)
  line.counts[i] = length(text)
  # use a small, random sample of the data as
  # text manipulation gets slow otherwise
  selected = sample(x = length(text), size = sample.lines[i])
  text = text[selected]
  # the text will be more useful as a single character string
  text = paste0(text, collapse = " ")
  
  lines[i] = text
}
## Warning in readLines(file.paths[i], -1): line 167155 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 268547 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1759032 appears to contain an
## embedded nul
# prepare for text mining
corp = Corpus(VectorSource(lines))
inspect(corp)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 678432
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 644799
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 627748
cl.corp = tm_map(corp, stripWhitespace)
cl.corp = tm_map(cl.corp, content_transformer(tolower))
cl.corp = tm_map(cl.corp, removePunctuation)
cl.corp = tm_map(cl.corp, removeNumbers)
cl.corp = tm_map(cl.corp, stemDocument)
# currently stopwords are not removed
# cl.corp = tm_map(cl.corp, removeWords, stopwords("english"))
dtm = DocumentTermMatrix(cl.corp)

rm(corp)
rm(lines)
full.to.sample = line.counts / sample.lines

kable(
  caption = "Data set summary table (words are stemmed)",
  align = c("l", "r", "r", "r", "r", "r", "r"),
  col.names = c("File", "File Size (MB)", "Lines (k)", 
                "Word Count (est. total, k)", "Sampling Ratio", 
                "Word Count (sample)",
                "Unique Words (sample)"),
  data.frame(file.names, 
             round(sizes / 1024 / 1024, digits = 2),
             round(line.counts / 1000),
             round(apply(as.matrix(dtm), MARGIN = 1, FUN = sum)
                   * full.to.sample / 1000),
             sprintf("%.2f %%", 1 / full.to.sample * 100),
             apply(as.matrix(dtm), MARGIN = 1, FUN = sum),
             apply(as.matrix(dtm), MARGIN = 1, FUN = function(x) sum(x != 0)))
)
Data set summary table (words are stemmed)

| File              | File Size (MB) | Lines (k) | Word Count (est. total, k) | Sampling Ratio | Word Count (sample) | Unique Words (sample) |
|:------------------|---------------:|----------:|---------------------------:|---------------:|--------------------:|----------------------:|
| en_US.blogs.txt   |         200.42 |       899 |                      28013 |         0.33 % |               93449 |                 10858 |
| en_US.news.txt    |         196.28 |      1010 |                      27588 |         0.31 % |               86020 |                 11439 |
| en_US.twitter.txt |         159.36 |      2360 |                      22482 |         0.38 % |               85732 |                 10997 |

3. Exploratory analysis of words

3.1. Word distribution

Based on this table, the diversity of the language seems to be highest in the news corpus, while Twitter messages and blog posts repeat phrases somewhat more. Nevertheless, the differences are not large.

Zipf’s law states that word frequencies are likely to follow a negative power of the word’s rank \(\text{(}P(w_i) = c\, i^{p}\text{ for some }c > 0,\ p < 0\text{, typically }p \approx -1\text{)}\), where \(i\) is the rank of word \(w_i\); the distribution should therefore appear approximately linear on a log-log scale.

zipf.freqs = as.matrix(dtm)
dtm.col.total = 
  zipf.freqs[1, ] + 
  zipf.freqs[2, ] + 
  zipf.freqs[3, ] # likely to be faster than apply()

for(i in 1:nrow(zipf.freqs)) {
  zipf.freqs[i, ] = sort(zipf.freqs[i, ], decreasing = TRUE)
}
dtm.col.total = sort(dtm.col.total, decreasing = TRUE)

# zero counts become -Inf after log10; such points are simply not drawn in the plots
zipf.freqs = log10(zipf.freqs)
dtm.col.total = log10(dtm.col.total)

plot(main = "Distribution of word frequency",
     ylab = "Log of occurrences",
     xlab = "Log of word rank",
     x = log10(1:length(zipf.freqs[1, ])),
     y = zipf.freqs[1, ], col = "red", type = "l")

lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = dtm.col.total, col = "black")

plot(main = "Frequency distribution of top ranked words",
     ylab = "Log of occurrences",
     xlab = "Log of word rank",
     xlim = c(0, log10(length(zipf.freqs[1, ])) / 3),
     x = log10(1:length(zipf.freqs[1, ])),
     y = zipf.freqs[1, ], col = "red", type = "l")

lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = dtm.col.total, col = "black")

The charts seem to agree with this formula; furthermore, words ranked beyond about 10000 apparently occur very rarely, typically only once in the samples. This potentially enables focusing on a relevant subset of the whole corpora later on, which is useful if only a smaller dictionary or data set needs to be distributed to a potentially less capable mobile device.

The feasibility of this can be confirmed by looking at how many occurrences some top words account for in total.

plot.occurrence.chart = function(freqs, main, pos.2 = 4) {
  freqs = sort(freqs, decreasing = TRUE)
  total.words = sum(freqs)
  cumsum.freqs = cumsum(freqs)
  
  plot(main = main,
       xlab = "Last rank included",
       ylab = "Occurrences (%)",
       xlim = c(1, length(cumsum.freqs)),
       cumsum.freqs / total.words * 100, 
       type = "l",
       col = "darkgrey")
  abline(h = 90, col = "grey", lty = 2)
  rank.90.perc = min(which(cumsum.freqs > total.words * 0.9))
  abline(v = rank.90.perc, col = "grey", lty = 2)
  text(x = rank.90.perc, y = 90, pos = pos.2,
       sprintf("90%% of the text sample is covered\nby the top %d", rank.90.perc))
  
  rank.50.perc = min(which(cumsum.freqs > total.words * 0.5))
  abline(h = 50, col = "grey", lty = 3)
  abline(v = rank.50.perc, col = "grey", lty = 3)
  text(x = rank.50.perc, y = 50, pos = 4,
       sprintf("50%% of the text sample is covered\nby the top %d", rank.50.perc))
  
  return(list(rank.50.perc = rank.50.perc, rank.90.perc = rank.90.perc))
}

freqs = as.matrix(dtm)
freqs = freqs[1, ] + freqs[2, ] + freqs[3, ]
freq.quantiles = 
  plot.occurrence.chart(freqs, "Occurrences covered by top n (stemmed) words")

This figure confirms that a relatively small proportion of the unique words (20.57 % in the sample) suffices to cover 9 out of 10 word occurrences the user may type. If covering only 50% of the occurrences is acceptable, a handful of words (1.05 % of the vocabulary) will suffice.
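
As a sketch of how this could be exploited, the vocabulary could be trimmed to the top-ranked stems up to the 90% mark computed above (using the freqs vector and freq.quantiles list from the preceding chunk; the trimmed dictionary is not used elsewhere in this report):

# keep only the (stemmed) words needed to cover ~90% of occurrences in the sample
core.vocab = names(sort(freqs, decreasing = TRUE))[1:freq.quantiles$rank.90.perc]
length(core.vocab)                              # size of the trimmed dictionary
print(object.size(core.vocab), units = "auto")  # rough in-memory footprint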

The top words look promising, although they are expected to be stopwords:

library(stringi)

df.dtm = t(as.matrix(dtm))
df.dtm = cbind(df.dtm[, 1] + df.dtm[, 2] + df.dtm[, 3], df.dtm)
df.dtm = 
  data.frame(
#    failed to make kable display backslashes in HTML
#    raw = gsub("[\\]", "\\\\\\", stri_escape_unicode(rownames(df.dtm))),

    raw = gsub("[\\]", " ", stri_escape_unicode(rownames(df.dtm))),
    df.dtm
  )
colnames(df.dtm) = c("Raw", "Total", "occ1", "occ2", "occ3") 

kable(x = head(df.dtm[order(df.dtm$Total, decreasing = TRUE), ], 20), 
      caption = "Most frequent words", 
      col.names = c("Raw", "Total", file.names))
Most frequent words

| Stem | Raw  | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|:-----|:-----|------:|----------------:|---------------:|------------------:|
| the  | the  | 16039 |            6176 |           6254 |              3609 |
| and  | and  |  8033 |            3540 |           2859 |              1634 |
| for  | for  |  3829 |            1202 |           1133 |              1494 |
| that | that |  3821 |            1593 |           1137 |              1091 |
| you  | you  |  3366 |            1019 |            304 |              2043 |
| with | with |  2398 |             906 |            799 |               693 |
| was  | was  |  2020 |             863 |            725 |               432 |
| have | have |  1944 |             783 |            464 |               697 |
| this | this |  1892 |             868 |            407 |               617 |
| are  | are  |  1678 |             635 |            440 |               603 |
| but  | but  |  1668 |             664 |            510 |               494 |
| not  | not  |  1439 |             575 |            334 |               530 |
| from | from |  1292 |             545 |            461 |               286 |
| your | your |  1287 |             378 |            153 |               756 |
| will | will |  1164 |             383 |            344 |               437 |
| just | just |  1127 |             351 |            177 |               599 |
| all  | all  |  1105 |             464 |            219 |               422 |
| get  | get  |  1095 |             308 |            196 |               591 |
| one  | one  |  1076 |             453 |            299 |               324 |
| they | they |  1073 |             453 |            353 |               267 |

3.2. Further cleaning of the data

The vocabulary analyzed currently contains items that may be unnecessary or unwanted among the predictions. These include vulgar or foreign expressions, emoticons, etc., alongside the rarer words.

One way to improve on this is to restrict the examination to the aforementioned most frequent minority of the text elements.

As an example, the first entries of the term list contain emoticons:

kable(
  col.names = c("Raw", "Total", "Freq.1", "Freq.2", "Freq.3"),
  x = head(df.dtm, 10)
)
| Term   | Raw                            | Total | Freq.1 | Freq.2 | Freq.3 |
|:-------|:-------------------------------|------:|-------:|-------:|-------:|
| ✨🌟✨ | u2728 U0001f31f u2728          |     1 |      0 |      0 |      1 |
| 😏😏👍 | U0001f60f U0001f60f U0001f44d  |     1 |      0 |      0 |      1 |
| 💜🏃🍴 | U0001f49c U0001f3c3 U0001f374  |     1 |      0 |      0 |      1 |
| 😂😂😂 | U0001f602 U0001f602 U0001f602  |     2 |      0 |      0 |      2 |
| 💰💰💰 | U0001f4b0 U0001f4b0 U0001f4b0  |     1 |      0 |      0 |      1 |
| 💜💛🏀 | U0001f49c U0001f49b U0001f3c0  |     1 |      0 |      0 |      1 |
| 👠👗👙 | U0001f460 U0001f457 U0001f459  |     1 |      0 |      0 |      1 |
| 😊😍🍓 | U0001f60a U0001f60d U0001f353  |     1 |      0 |      0 |      1 |
| 😁😁😁 | U0001f601 U0001f601 U0001f601  |     1 |      0 |      0 |      1 |
| 😞😞😞 | U0001f61e U0001f61e U0001f61e  |     1 |      0 |      0 |      1 |

Unless a context-specific system is the goal (which could be pursued later), a simpler initial approach is to also restrict the data set to those linguistic units which appear in all 3 of the corpora. This can be useful since some phrases seem to rarely appear outside (or only appear within) certain contexts.

This would already have filtered out the problematic first entries shown above. Swearwords, for instance, also seem very infrequent in blog entries. However, absence from one of the documents can also be due to the sparsity of the sampling.
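
A minimal sketch of this filter, based on the document-term matrix dtm built above (the filtered matrix is not used further in this report):

m = as.matrix(dtm)
# a term qualifies if it occurs in every document (row) of the matrix
in.all.three = colSums(m > 0) == nrow(m)
dtm.shared = dtm[, which(in.all.three)]
sum(in.all.three)   # size of the shared (stemmed) vocabulary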

3.3. Leveraging stemming - 2 phases?

Also, while it was not heavily emphasized, the words were considered in stemmed form. This made the smaller sample denser on the distribution side, helping the analysis, but it can also allow for an initial narrowing down of the choices in a two-phase prediction method: offering the user a smaller set of incomplete predictions can also speed entry up. For example, after typing “ac” and selecting the offered stem academi, the user can be presented with likely endings of that stem, e.g. academia, academic, etc.
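
To illustrate the second phase, below is a minimal sketch that expands a selected stem into full-word suggestions. It assumes a character vector vocab of unstemmed words harvested from the corpus before stemming (not built in this report); SnowballC provides the same stemmer tm uses internally.

library(SnowballC)

# hypothetical helper: all observed words that reduce to the given stem
completions.for.stem = function(stem, vocab) {
  sort(unique(vocab[wordStem(vocab, language = "english") == stem]))
}
# completions.for.stem("academi", vocab) would hopefully return
# words such as "academia" and "academic"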

3.3.1. “albuquerqu”

Some of the word stems, however, may have an unequivocal continuation (at least judging from the available corpora). In such cases the stem itself is likely to be an unnecessary intermediate offer. One example identified is albuquerqu (stemmed from Albuquerque).
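
The same hypothetical vocab vector (and SnowballC) from the sketch above could be used to spot such single-ended stems up front:

# count how many distinct full words each stem expands to
completions.per.stem = tapply(vocab, wordStem(vocab, language = "english"),
                              function(x) length(unique(x)))
single.ended = names(completions.per.stem)[completions.per.stem == 1]
# for these stems the full word can be offered directly, skipping the stem step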

3.4. About changes over time

Languages change continuously, and a good prediction algorithm should reflect this nature; a static data set cannot.

3.4.1. Adaptive data sets

One way to address this is to adapt the data set continuously based on observed user input. This also allows the system to adapt to the user’s specific vocabulary.
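
As a rough sketch of this option, the counts backing the predictions could simply be incremented whenever the user confirms a word; user.counts and observe.word below are illustrative names, not part of the processing above:

observe.word = function(user.counts, word) {
  # bump the count for an observed word, adding it to the vocabulary if new
  current = if (word %in% names(user.counts)) user.counts[[word]] else 0
  user.counts[word] = current + 1
  user.counts
}
# e.g. user.counts = observe.word(user.counts, "selfie")  # new words enter over time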

Another way is to distribute data set updates e.g. via the web.

3.4.2. Choice of core vocabulary

On the other hand, when looking for a core set of expressions to wrap up in a conveniently small package, being time-tested could form a basis for preference. Stemming can also help here, or coincide with this filter: new words are constructed from existing stems from time to time, and stems are the more persistent and constant elements of a language.

3.5. Casing

As the leading words of sentences are capitalized, they should either be treated case-insensitively (i.e. as if their casing were unknown) when harvesting data, or be ignored completely.

4. Exploring more complex structures

4.1. Bigrams

For prediction, it can be useful to know whether bigrams, i.e. two-word sequences in a text, have a similarly uneven distribution. Intuitively, this is very likely to be the case.

The words most likely to follow a given one, w, may be the ones to display at the top of a list of suggestions once w has been entered. These are, in other words, the ones with the highest probability conditional on being preceded by w.
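
A minimal sketch of this ranking, assuming a named vector of bigram counts such as the total column computed in the chunk below (with names like “want to”). Since the denominator, the count of w, is shared by all candidates, ranking by the raw bigram count is equivalent to ranking by the conditional probability:

# hypothetical helper: top n candidates observed to follow the word w
predict.next = function(w, bigram.counts, n = 3) {
  first.word = sub(" .*$", "", names(bigram.counts))
  head(sort(bigram.counts[first.word == w], decreasing = TRUE), n)
}
# e.g. predict.next("want", setNames(df.dtm.2$total, rownames(df.dtm.2)))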

# based on http://tm.r-forge.r-project.org/faq.html
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm.2 <- TermDocumentMatrix(cl.corp, control = list(tokenize = BigramTokenizer))

df.dtm.2 = as.matrix(dtm.2)
df.dtm.2 = cbind(df.dtm.2[, 1] + df.dtm.2[, 2] + df.dtm.2[, 3], df.dtm.2)
colnames(df.dtm.2) <- c("total", "occ1", "occ2", "occ3")
df.dtm.2 = as.data.frame(df.dtm.2)
df.dtm.2 = df.dtm.2[order(df.dtm.2$total, decreasing = TRUE), ]
plot(x = log10(1:nrow(df.dtm.2)), 
     y = log10(df.dtm.2$total), 
     type = "l", 
     col = "darkgrey",
     xlim = c(0, log10(nrow(df.dtm.2))),
     ylim = c(-0.5, max(log10(df.dtm.2$total))),
     ylab = "Log of occurrences", xlab = "Log of rank",
     main = "Distribution of bigrams in the samples by rank")

The distribution is very similar to that of the words.

quantiles = 
  plot.occurrence.chart(df.dtm.2$total, 
                        "Occurrences covered by top n (stemmed) bigrams",
                        pos.2 = 2)

Since stopword removal is not part of the data processing, the top bigrams are expected to contain them:

kable(x = head(df.dtm.2, 20), caption = "Most frequent bigrams", 
      col.names = c("Total", file.names))
Most frequent bigrams

| Bigram   | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|:---------|------:|----------------:|---------------:|------------------:|
| of the   |  1485 |             658 |            609 |               218 |
| in the   |  1340 |             497 |            556 |               287 |
| for the  |   742 |             231 |            215 |               296 |
| to the   |   696 |             272 |            266 |               158 |
| on the   |   652 |             253 |            227 |               172 |
| to be    |   539 |             216 |            170 |               153 |
| at the   |   505 |             164 |            208 |               133 |
| and the  |   428 |             183 |            183 |                62 |
| go to    |   379 |             109 |             86 |               184 |
| with the |   377 |             163 |            140 |                74 |
| in a     |   369 |             126 |            162 |                81 |
| is a     |   341 |             155 |             79 |               107 |
| have a   |   311 |              98 |             52 |               161 |
| it was   |   310 |             151 |             88 |                71 |
| from the |   309 |             153 |            103 |                53 |
| is the   |   306 |             121 |             74 |               111 |
| for a    |   305 |             109 |             84 |               112 |
| want to  |   302 |             117 |             66 |               119 |
| i have   |   298 |             166 |             28 |               104 |
| it is    |   290 |             169 |             49 |                72 |

5. Summary

5.1. “Questions to consider”

The set of questions to consider was:

  • “1. Some words are more frequent than others - what are the distributions of word frequencies?”
  • “2. What are the frequencies of 2-grams and 3-grams in the dataset?”
  • “3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?”
  • “4. How do you evaluate how many of the words come from foreign languages?”
  • “5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?”

Questions 1-3 have been addressed above.

5.1.1. Question 4.

“How do you evaluate how many of the words come from foreign languages?”

Some cases cannot be detected: certain words coincide in spelling yet mean different things in different languages. Other cases are as easy as filtering out non-ASCII characters. I expect there are strategies, but no 100% accurate solution for this.

In short, at least one easy step to identify such words is to look out for non-ASCII characters; standard English words rarely contain these.

Another good option is to obtain an English dictionary or thesaurus and verify words against it; there are probably some freely available ones out there.
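
A minimal sketch of the non-ASCII check, applied to the terms already extracted (stringi, loaded earlier, provides the encoding test):

terms = Terms(dtm)
has.non.ascii = !stri_enc_isascii(terms)
sum(has.non.ascii)          # how many such (stemmed) terms the sample contains
head(terms[has.non.ascii])  # e.g. the emoticon entries shown earlier get flagged
# a dictionary lookup could then cover foreign words spelled with ASCII only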

5.1.2. Question 5.

“Can you think of a way to increase the coverage…”

Categorizing the word endings, i.e. the suffixes attached to stems, may put words into groups from which the likely continuations can be inferred: a word could then be continued with the ending of some other word belonging to the same group. This does not necessarily require the two phases mentioned above; a small set of choices implies no need for it.

Knowing whether a word is a verb, adjective or noun can also be helpful. Certain verbs are often followed by a likely subset of stopwords (e.g. “want to”, “applies to”, etc.). Such relationships are easier to evaluate using a non-stemmed dataset.

5.2. Closing words

Probably enough of the project’s difficulties have been outlined above to start working on it effectively. Extracting words and bigrams (perhaps trigrams) from larger corpora, addressing casing issues, handling potential performance problems (for solutions not necessarily written in pure R, processing some hundreds of megabytes of data should not be a problem), filtering words, researching dictionaries, and restricting suggestions are plenty of tasks to proceed with.