Coursera Data Science Specialization Capstone

# dependencies

library(knitr)
library(tm)
## Loading required package: NLP
library(NLP)
library(readr)

# also install: filehash

1. Summary

This is an intermediate report on the capstone project, which aims to create a system that accelerates user input, e.g. on mobile devices, by suggesting the next word while the user is typing. I expect the project to finish with a ground-up prototype of a realistic predictive engine.

The language the solution works in is up to the student; I expect to choose (I assume fairly typical) US English.

This search for a solution cannot aim to be exhaustive. For instance, using someone else’s service is not really an option when the point is to demonstrate skills, although it is certainly a possibility in practice.

The two most promising predictive strategies are covered here, namely suggestions based on overall word frequencies and suggestions based on bigram (two-word) frequencies.

As it is also useful to verify and construct hypotheses on a smaller subset before expanding the scope of the discussion, the examination is restricted to two-word structures; longer ones may be considered at a later stage.

2. Introducing the data set

For this project, a set of 3 files was provided. These encompass a large corpus of (US) English text, amounting to hundreds of megabytes of data. The sources are summarized below based on subsamples. Subsampling allows the processing to take place quickly despite the sheer size of the data (and R’s notorious reputation for scaling poorly to large data sets).

For the sake of exploration, numbers and punctuation have been consistently removed. This may need to be reconsidered before the final prototype.

data.path  = "../../input/en_US/"
file.names = dir(data.path)
file.paths = file.path(data.path, file.names)
sizes = file.size(file.paths)

# prepare for reproducible random subsampling
set.seed(0)
# read up the files
lines = rep(NA_character_, 3)
line.counts = rep(NA_integer_, 3)
# number of lines to keep from each file
sample.lines = c(3000, 3000 * 1.05, 9000)
for(i in 1:3) {
  text = readLines(file.paths[i], n = -1)
  line.counts[i] = length(text)
  # use a small, random sample of the data as
  # text manipulation gets slow otherwise
  selected = sample(x = length(text), size = sample.lines[i])
  text = text[selected]
  # the text will be more useful as a single character string
  text = paste0(text, collapse = " ")
  
  lines[i] = text
}
## Warning in readLines(file.paths[i], -1): line 167155 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 268547 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(file.paths[i], -1): line 1759032 appears to contain an
## embedded nul
# prepare for text mining
corp = Corpus(VectorSource(lines))
inspect(corp)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 678432
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 644799
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 627748
cl.corp = tm_map(corp, stripWhitespace)
cl.corp = tm_map(cl.corp, content_transformer(tolower))
cl.corp = tm_map(cl.corp, removePunctuation)
cl.corp = tm_map(cl.corp, removeNumbers)
cl.corp = tm_map(cl.corp, stemDocument)
# currently stopwords are not removed
# cl.corp = tm_map(cl.corp, removeWords, stopwords("english"))
dtm = DocumentTermMatrix(cl.corp)

rm(corp)
rm(lines)
full.to.sample = line.counts / sample.lines

kable(
  caption = "Data set summary table (words are stemmed)",
  align = c("l", "r", "r", "r", "r", "r", "r"),
  col.names = c("File", "File Size (MB)", "Lines (k)", 
                "Word Count (est. total, k)", "Sampling Ratio", 
                "Word Count (sample)",
                "Unique Words (sample)"),
  data.frame(file.names, 
             round(sizes / 1024 / 1024, digits = 2),
             round(line.counts / 1000),
             round(apply(as.matrix(dtm), MARGIN = 1, FUN = sum)
                   * full.to.sample / 1000),
             sprintf("%.2f %%", 1 / full.to.sample * 100),
             apply(as.matrix(dtm), MARGIN = 1, FUN = sum),
             apply(as.matrix(dtm), MARGIN = 1, FUN = function(x) sum(x != 0)))
)
Data set summary table (words are stemmed)

| File              | File Size (MB) | Lines (k) | Word Count (est. total, k) | Sampling Ratio | Word Count (sample) | Unique Words (sample) |
|:------------------|---------------:|----------:|---------------------------:|---------------:|--------------------:|----------------------:|
| en_US.blogs.txt   |         200.42 |       899 |                      28013 |         0.33 % |               93449 |                 10858 |
| en_US.news.txt    |         196.28 |      1010 |                      27588 |         0.31 % |               86020 |                 11439 |
| en_US.twitter.txt |         159.36 |      2360 |                      22482 |         0.38 % |               85732 |                 10997 |

3. Exploratory analysis of words

3.1. Word distribution

Based on this table, the diversity of the language seems to be highest in the news corpus, while Twitter messages and blog posts repeat phrases somewhat more. Nevertheless, the differences are not large.

Zipf’s law states that word frequencies are likely to follow a negative power of the word’s rank \(\text{(}P(w_i) = c\, i^{p}\text{ for some }c > 0,\ p < 0\text{, typically }p \approx -1\text{)}\), where \(i\) is the rank of word \(w_i\); the distribution should therefore appear approximately linear on a log-log scale.

zipf.freqs = as.matrix(dtm)
dtm.col.total = 
  zipf.freqs[1, ] + 
  zipf.freqs[2, ] + 
  zipf.freqs[3, ] # likely to be faster than apply()

for(i in 1:nrow(zipf.freqs)) {
  zipf.freqs[i, ] = sort(zipf.freqs[i, ], decreasing = TRUE)
}
dtm.col.total = sort(dtm.col.total, decreasing = TRUE)

# zero counts become -Inf after log10; such points are simply not drawn in the plots
zipf.freqs = log10(zipf.freqs)
dtm.col.total = log10(dtm.col.total)

plot(main = "Distribution of word frequency",
     ylab = "Log of occurrences",
     xlab = "Log of word rank",
     x = log10(1:length(zipf.freqs[1, ])),
     y = zipf.freqs[1, ], col = "red", type = "l")

lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = dtm.col.total, col = "black")

plot(main = "Frequency distribution of top ranked words",
     ylab = "Log of occurrences",
     xlab = "Log of word rank",
     xlim = c(0, log10(length(zipf.freqs[1, ])) / 3),
     x = log10(1:length(zipf.freqs[1, ])),
     y = zipf.freqs[1, ], col = "red", type = "l")

lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[2, ], col = "green")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = zipf.freqs[3, ], col = "blue")
lines(x = log10(1:length(zipf.freqs[1, ])),
      y = dtm.col.total, col = "black")

The charts seem to agree with this formula; furthermore, words ranked beyond about 10000 apparently occur very rarely, typically only once in the samples. This potentially enables focusing on a relevant subset of the whole corpora later on, which is useful if only a smaller dictionary or data set needs to be distributed to a potentially less capable mobile device.

The feasibility of this can be confirmed by looking at how many occurrences some top words account for in total.

plot.occurrence.chart = function(freqs, main, pos.2 = 4) {
  freqs = sort(freqs, decreasing = TRUE)
  total.words = sum(freqs)
  cumsum.freqs = cumsum(freqs)
  
  plot(main = main,
       xlab = "Last rank included",
       ylab = "Occurrences (%)",
       xlim = c(1, length(cumsum.freqs)),
       cumsum.freqs / total.words * 100, 
       type = "l",
       col = "darkgrey")
  abline(h = 90, col = "grey", lty = 2)
  rank.90.perc = min(which(cumsum.freqs > total.words * 0.9))
  abline(v = rank.90.perc, col = "grey", lty = 2)
  text(x = rank.90.perc, y = 90, pos = pos.2,
       sprintf("90%% of the text sample is covered\nby the top %d", rank.90.perc))
  
  rank.50.perc = min(which(cumsum.freqs > total.words * 0.5))
  abline(h = 50, col = "grey", lty = 3)
  abline(v = rank.50.perc, col = "grey", lty = 3)
  text(x = rank.50.perc, y = 50, pos = 4,
       sprintf("50%% of the text sample is covered\nby the top %d", rank.50.perc))
  
  return(list(rank.50.perc = rank.50.perc, rank.90.perc = rank.90.perc))
}

freqs = as.matrix(dtm)
freqs = freqs[1, ] + freqs[2, ] + freqs[3, ]
freq.quantiles = 
  plot.occurrence.chart(freqs, "Occurrences covered by top n (stemmed) words")

This figure confirms that a relatively small proportion of the unique words (20.57 % in the sample) suffices to cover 9 out of 10 word occurrences the user may type. If covering only 50% of the occurrences is acceptable, a handful of words (1.05 % of the vocabulary) will suffice.
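
As a sketch of how this could be exploited, the vocabulary could be trimmed to the top-ranked stems up to the 90% mark computed above (using the freqs vector and freq.quantiles list from the preceding chunk; the trimmed dictionary is not used elsewhere in this report):

# keep only the (stemmed) words needed to cover ~90% of occurrences in the sample
core.vocab = names(sort(freqs, decreasing = TRUE))[1:freq.quantiles$rank.90.perc]
length(core.vocab)                              # size of the trimmed dictionary
print(object.size(core.vocab), units = "auto")  # rough in-memory footprint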

The top words look promising, although they are expected to be stopwords:

library(stringi)

df.dtm = t(as.matrix(dtm))
df.dtm = cbind(df.dtm[, 1] + df.dtm[, 2] + df.dtm[, 3], df.dtm)
df.dtm = 
  data.frame(
#    failed to make kable display backslashes in HTML
#    raw = gsub("[\\]", "\\\\\\", stri_escape_unicode(rownames(df.dtm))),

    raw = gsub("[\\]", " ", stri_escape_unicode(rownames(df.dtm))),
    df.dtm
  )
colnames(df.dtm) = c("Raw", "Total", "occ1", "occ2", "occ3") 

kable(x = head(df.dtm[order(df.dtm$Total, decreasing = TRUE), ], 20), 
      caption = "Most frequent words", 
      col.names = c("Raw", "Total", file.names))
Most frequent words

| Stem | Raw  | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|:-----|:-----|------:|----------------:|---------------:|------------------:|
| the  | the  | 16039 |            6176 |           6254 |              3609 |
| and  | and  |  8033 |            3540 |           2859 |              1634 |
| for  | for  |  3829 |            1202 |           1133 |              1494 |
| that | that |  3821 |            1593 |           1137 |              1091 |
| you  | you  |  3366 |            1019 |            304 |              2043 |
| with | with |  2398 |             906 |            799 |               693 |
| was  | was  |  2020 |             863 |            725 |               432 |
| have | have |  1944 |             783 |            464 |               697 |
| this | this |  1892 |             868 |            407 |               617 |
| are  | are  |  1678 |             635 |            440 |               603 |
| but  | but  |  1668 |             664 |            510 |               494 |
| not  | not  |  1439 |             575 |            334 |               530 |
| from | from |  1292 |             545 |            461 |               286 |
| your | your |  1287 |             378 |            153 |               756 |
| will | will |  1164 |             383 |            344 |               437 |
| just | just |  1127 |             351 |            177 |               599 |
| all  | all  |  1105 |             464 |            219 |               422 |
| get  | get  |  1095 |             308 |            196 |               591 |
| one  | one  |  1076 |             453 |            299 |               324 |
| they | they |  1073 |             453 |            353 |               267 |

3.2. Further cleaning of the data

The vocabulary analyzed currently contains items that may be unnecessary or unwanted among the predictions. These include vulgar or foreign expressions, emoticons, etc., alongside the rarer words.

One way to improve on this is to restrict the examination to the aforementioned most frequent minority of the text elements.

As an example, the first entries of the term list contain emoticons:

kable(
  col.names = c("Raw", "Total", "Freq.1", "Freq.2", "Freq.3"),
  x = head(df.dtm, 10)
)
| Term   | Raw                            | Total | Freq.1 | Freq.2 | Freq.3 |
|:-------|:-------------------------------|------:|-------:|-------:|-------:|
| ✨🌟✨ | u2728 U0001f31f u2728          |     1 |      0 |      0 |      1 |
| 😏😏👍 | U0001f60f U0001f60f U0001f44d  |     1 |      0 |      0 |      1 |
| 💜🏃🍴 | U0001f49c U0001f3c3 U0001f374  |     1 |      0 |      0 |      1 |
| 😂😂😂 | U0001f602 U0001f602 U0001f602  |     2 |      0 |      0 |      2 |
| 💰💰💰 | U0001f4b0 U0001f4b0 U0001f4b0  |     1 |      0 |      0 |      1 |
| 💜💛🏀 | U0001f49c U0001f49b U0001f3c0  |     1 |      0 |      0 |      1 |
| 👠👗👙 | U0001f460 U0001f457 U0001f459  |     1 |      0 |      0 |      1 |
| 😊😍🍓 | U0001f60a U0001f60d U0001f353  |     1 |      0 |      0 |      1 |
| 😁😁😁 | U0001f601 U0001f601 U0001f601  |     1 |      0 |      0 |      1 |
| 😞😞😞 | U0001f61e U0001f61e U0001f61e  |     1 |      0 |      0 |      1 |

Unless a context-specific system is the goal (which could be pursued later), a simpler initial approach is to also restrict the data set to those linguistic units which appear in all 3 of the corpora. This can be useful since some phrases seem to rarely appear outside (or only appear within) certain contexts.

This would already have filtered out the problematic first entries shown above. Swearwords, for instance, also seem very infrequent in blog entries. However, absence from one of the documents can also be due to the sparsity of the sampling.
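
A minimal sketch of this filter, based on the document-term matrix dtm built above (the filtered matrix is not used further in this report):

m = as.matrix(dtm)
# a term qualifies if it occurs in every document (row) of the matrix
in.all.three = colSums(m > 0) == nrow(m)
dtm.shared = dtm[, which(in.all.three)]
sum(in.all.three)   # size of the shared (stemmed) vocabulary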

3.3. Leveraging stemming - 2 phases?

Also, while it was not heavily emphasized, the words were considered in stemmed form. This made the smaller sample denser on the distribution side, helping the analysis, but it can also allow for an initial narrowing down of the choices in a two-phase prediction method: offering the user a smaller set of incomplete predictions can also speed entry up. For example, after typing “ac” and selecting the offered stem academi, the user can be presented with likely endings of that stem, e.g. academia, academic, etc.
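
To illustrate the second phase, below is a minimal sketch that expands a selected stem into full-word suggestions. It assumes a character vector vocab of unstemmed words harvested from the corpus before stemming (not built in this report); SnowballC provides the same stemmer tm uses internally.

library(SnowballC)

# hypothetical helper: all observed words that reduce to the given stem
completions.for.stem = function(stem, vocab) {
  sort(unique(vocab[wordStem(vocab, language = "english") == stem]))
}
# completions.for.stem("academi", vocab) would hopefully return
# words such as "academia" and "academic"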

3.3.1. “albuquerqu”

Some of the word stems, however, may have an unequivocal continuation (at least judging from the available corpora). In such cases the stem itself is likely to be an unnecessary intermediate offer. One example identified is albuquerqu (stemmed from Albuquerque).
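
The same hypothetical vocab vector (and SnowballC) from the sketch above could be used to spot such single-ended stems up front:

# count how many distinct full words each stem expands to
completions.per.stem = tapply(vocab, wordStem(vocab, language = "english"),
                              function(x) length(unique(x)))
single.ended = names(completions.per.stem)[completions.per.stem == 1]
# for these stems the full word can be offered directly, skipping the stem step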

3.4. About changes over time

Languages change continuously, and a good prediction algorithm should reflect this nature; a static data set cannot.

3.4.1. Adaptive data sets

One way to address this is to adapt the data set continuously based on observed user input. This also allows the system to adapt to the user’s specific vocabulary.
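
As a rough sketch of this option, the counts backing the predictions could simply be incremented whenever the user confirms a word; user.counts and observe.word below are illustrative names, not part of the processing above:

observe.word = function(user.counts, word) {
  # bump the count for an observed word, adding it to the vocabulary if new
  current = if (word %in% names(user.counts)) user.counts[[word]] else 0
  user.counts[word] = current + 1
  user.counts
}
# e.g. user.counts = observe.word(user.counts, "selfie")  # new words enter over time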

Another way is to distribute data set updates e.g. via the web.

3.4.2. Choice of core vocabulary

On the other hand, when looking for a core set of expressions to wrap up in a conveniently small package, being time-tested could form a basis for preference. Stemming can also help here, or coincide with this filter: new words are constructed from existing stems from time to time, and stems are the more persistent and constant elements of a language.

3.5. Casing

As the leading words of sentences are capitalized, they should either be treated case-insensitively (i.e. as if their casing were unknown) when harvesting data, or be ignored completely.

4. Exploring more complex structures

4.1. Bigrams

For prediction, it can be useful to know whether bigrams, i.e. two-word sequences in a text, have a similarly uneven distribution. Intuitively, this is very likely to be the case.

The words most likely to follow a given one, w, may be the ones to display at the top of a list of suggestions once w has been entered. These are, in other words, the ones with the highest probability conditional on being preceded by w.
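
A minimal sketch of this ranking, assuming a named vector of bigram counts such as the total column computed in the chunk below (with names like “want to”). Since the denominator, the count of w, is shared by all candidates, ranking by the raw bigram count is equivalent to ranking by the conditional probability:

# hypothetical helper: top n candidates observed to follow the word w
predict.next = function(w, bigram.counts, n = 3) {
  first.word = sub(" .*$", "", names(bigram.counts))
  head(sort(bigram.counts[first.word == w], decreasing = TRUE), n)
}
# e.g. predict.next("want", setNames(df.dtm.2$total, rownames(df.dtm.2)))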

# based on http://tm.r-forge.r-project.org/faq.html
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm.2 <- TermDocumentMatrix(cl.corp, control = list(tokenize = BigramTokenizer))

df.dtm.2 = as.matrix(dtm.2)
df.dtm.2 = cbind(df.dtm.2[, 1] + df.dtm.2[, 2] + df.dtm.2[, 3], df.dtm.2)
colnames(df.dtm.2) <- c("total", "occ1", "occ2", "occ3")
df.dtm.2 = as.data.frame(df.dtm.2)
df.dtm.2 = df.dtm.2[order(df.dtm.2$total, decreasing = TRUE), ]
plot(x = log10(1:nrow(df.dtm.2)), 
     y = log10(df.dtm.2$total), 
     type = "l", 
     col = "darkgrey",
     xlim = c(0, log10(nrow(df.dtm.2))),
     ylim = c(-0.5, max(log10(df.dtm.2$total))),
     ylab = "Log of occurrences", xlab = "Log of rank",
     main = "Distribution of bigrams in the samples by rank")

The distribution is very similar to that of the words.

quantiles = 
  plot.occurrence.chart(df.dtm.2$total, 
                        "Occurrences covered by top n (stemmed) bigrams",
                        pos.2 = 2)

Since stopword removal is not part of the data processing, the top bigrams are expected to contain them:

kable(x = head(df.dtm.2, 20), caption = "Most frequent bigrams", 
      col.names = c("Total", file.names))
Most frequent bigrams

| Bigram   | Total | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|:---------|------:|----------------:|---------------:|------------------:|
| of the   |  1485 |             658 |            609 |               218 |
| in the   |  1340 |             497 |            556 |               287 |
| for the  |   742 |             231 |            215 |               296 |
| to the   |   696 |             272 |            266 |               158 |
| on the   |   652 |             253 |            227 |               172 |
| to be    |   539 |             216 |            170 |               153 |
| at the   |   505 |             164 |            208 |               133 |
| and the  |   428 |             183 |            183 |                62 |
| go to    |   379 |             109 |             86 |               184 |
| with the |   377 |             163 |            140 |                74 |
| in a     |   369 |             126 |            162 |                81 |
| is a     |   341 |             155 |             79 |               107 |
| have a   |   311 |              98 |             52 |               161 |
| it was   |   310 |             151 |             88 |                71 |
| from the |   309 |             153 |            103 |                53 |
| is the   |   306 |             121 |             74 |               111 |
| for a    |   305 |             109 |             84 |               112 |
| want to  |   302 |             117 |             66 |               119 |
| i have   |   298 |             166 |             28 |               104 |
| it is    |   290 |             169 |             49 |                72 |

5. Summary

5.1. “Questions to consider”

The set of questions to consider was:

  • “1. Some words are more frequent than others - what are the distributions of word frequencies?”
  • “2. What are the frequencies of 2-grams and 3-grams in the dataset?”
  • “3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?”
  • “4. How do you evaluate how many of the words come from foreign languages?”
  • “5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?”

Questions 1-3 have been addressed above.

5.1.1. Question 4.

“How do you evaluate how many of the words come from foreign languages?”

Some cases cannot be detected: certain words coincide in spelling yet mean different things in different languages. Other cases are as easy as filtering out non-ASCII characters. I expect there are strategies, but no 100% accurate solution for this.

In short, at least one easy step to identify such words is to look out for non-ASCII characters; standard English words rarely contain these.

Another good option is to obtain an English dictionary or thesaurus and verify words against it; there are probably some freely available ones out there.
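
A minimal sketch of the non-ASCII check, applied to the terms already extracted (stringi, loaded earlier, provides the encoding test):

terms = Terms(dtm)
has.non.ascii = !stri_enc_isascii(terms)
sum(has.non.ascii)          # how many such (stemmed) terms the sample contains
head(terms[has.non.ascii])  # e.g. the emoticon entries shown earlier get flagged
# a dictionary lookup could then cover foreign words spelled with ASCII only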

5.1.2. Question 5.

“Can you think of a way to increase the coverage…”

Categorizing the word endings, i.e. the suffixes attached to stems, may put words into groups from which the likely continuations can be inferred: a word could then be continued with the ending of some other word belonging to the same group. This does not necessarily require the two phases mentioned above; a small set of choices implies no need for it.

Knowing whether a word is a verb, adjective or noun can also be helpful. Certain verbs are often followed by a likely subset of stopwords (e.g. “want to”, “applies to”, etc.). Such relationships are easier to evaluate using a non-stemmed dataset.

5.2. Closing words

Probably enough of the project’s difficulties have been outlined above to start working on it effectively. Extracting words and bigrams (perhaps trigrams) from larger corpora, addressing casing issues, handling potential performance problems (for solutions not necessarily written in pure R, processing some hundreds of megabytes of data should not be a problem), filtering words, researching dictionaries, and restricting suggestions are plenty of tasks to proceed with.