This report is published at the halfway point of a project to predict the next word in a sentence. It summarises the distribution of words and word sequences in a sample of English text from the internet, presents my findings so far and outlines a plan to complete the project.
The dataset is a curated corpus known as HC Corpora. It contains text from Twitter, blogs and news articles.
if (!file.exists("Coursera-SwiftKey.zip")) {
  training.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(training.url, "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
con.twitter <- file("final\\en_US\\en_US.twitter.txt", open="rb")
con.blogs <- file("final\\en_US\\en_US.blogs.txt", open="rb")
con.news <- file("final\\en_US\\en_US.news.txt", open="rb")
twitter <- readLines(con.twitter, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con.blogs, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con.news, encoding = "UTF-8", skipNul = TRUE)
close(con.twitter)
close(con.news)
close(con.blogs)
A new TokeniseText function simplifies the text into tokens that are easier to model. It splits each piece of text into lower-case words (tokens) containing no numbers or whitespace, with punctuation limited to hyphens and apostrophes within words.
source("next-word-functions.R")
twitter.tokens <- sapply(twitter, TokeniseText)
blogs.tokens <- sapply(blogs, TokeniseText)
news.tokens <- sapply(news, TokeniseText)
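The actual implementation lives in next-word-functions.R and is not shown in this report; the sketch below is only an assumption about what a function like TokeniseText could look like, using base R regular expressions (TokeniseTextSketch is a hypothetical name).

TokeniseTextSketch <- function(text) {
  # Lower case, strip numbers and any punctuation other than apostrophes and hyphens
  text <- tolower(text)
  text <- gsub("[0-9]+", " ", text)
  text <- gsub("[^a-z' -]", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  # Keep apostrophes and hyphens only when they appear within a word
  tokens <- gsub("^[-']+|[-']+$", "", tokens)
  tokens[tokens != ""]
}
TokeniseTextSketch("It's a well-known fact: 3 words disappear!")
# "it's" "a" "well-known" "fact" "words" "disappear"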
A new CountWords function produces a table with the count of each unique word (the word frequency table). This information is used to produce the summary table below; the code that produces it is omitted for brevity.
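CountWords itself is also defined in next-word-functions.R and is not shown; a minimal sketch, assuming it simply tabulates the token vectors (the summary statistics below are then derived from these counts), might look like this (CountWordsSketch is a hypothetical name):

CountWordsSketch <- function(token.list) {
  # Tally every token across all lines and return one row per unique word
  counts <- table(unlist(token.list, use.names = FALSE))
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}
# e.g. twitter.words <- CountWordsSketch(twitter.tokens)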
| name | line.count | word.count | unique.words | words.per.line | average.frequency |
|---|---|---|---|---|---|
| Twitter | 2,360,148 | 29,647,990 | 363,503 | 13 | 82 |
| Blogs | 899,288 | 37,319,338 | 341,745 | 41 | 109 |
| News | 1,010,242 | 33,696,664 | 307,317 | 33 | 110 |
| Total | 4,269,678 | 100,663,992 | 738,308 | 24 | 136 |
The word cloud below shows the 100 most frequent words. These account for 46% of word instances. The most frequent word is “the”, which occurs 4,768,355 times and accounts for 5% of word instances.
set.seed(1234)
library(wordcloud, quietly = TRUE)
pal <- brewer.pal(8, "Dark2")
wordcloud(total.words$word, total.words$count, max.words = 100,
scale = c(4, 1), colors = pal, random.color = TRUE)
A large number of words occur only once or twice and are unlikely to be predicted. Using a fixed dictionary that includes only frequent words would improve the algorithm's performance. Marking the remaining words as unknown (<UNK>) would also give the algorithm a way to handle words that do not appear in the original dataset.
min.instances <- c(1, 2, 3, 5, 10, 20, 50, 80, 100)
dictionary.size <- sapply(min.instances,
function(x) sum(total.words$count >= x))
words.covered <- sapply(min.instances,
function(x) sum(total.words$count[total.words$count >= x]))
coverage <- paste0(round((words.covered / words.covered[1]) * 100, 2), "%")
word.coverage <- data.frame(min.instances, dictionary.size, coverage)
library(knitr)
kable(word.coverage, align = 'l', format.args = list(big.mark = ","))
| min.instances | dictionary.size | coverage |
|---|---|---|
| 1 | 738,308 | 100% |
| 2 | 317,244 | 99.58% |
| 3 | 226,658 | 99.4% |
| 5 | 158,186 | 99.17% |
| 10 | 103,331 | 98.82% |
| 20 | 70,346 | 98.37% |
| 50 | 43,284 | 97.54% |
| 80 | 33,657 | 96.94% |
| 100 | 29,731 | 96.59% |
Based on the data above, let's start with a minimum dictionary size of 30,000 words, which corresponds to a min.instances value of 80; this can be tuned at a later date. The word cloud below shows a random sample of 50 words that would be omitted from the dictionary and treated as unknown.
min.instance.selected <- max(min.instances[dictionary.size >= 30000])
words.omitted <- total.words$word[total.words$count < min.instance.selected]
set.seed(1234)
subset.omitted <- as.character(words.omitted[sample(length(words.omitted), 50)])
wordcloud(subset.omitted, rep(1, 50), scale = c(1.2, 1.2), rot.per = 0,
fixed.asp = FALSE, colors = pal, random.color = TRUE)
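The report does not show how the dictionary would be applied to the tokens; a minimal sketch, assuming out-of-dictionary tokens are simply replaced with the <UNK> marker (ReplaceUnknownWords is a hypothetical helper), could be:

# Keep only words that occur at least min.instance.selected times
dictionary <- total.words$word[total.words$count >= min.instance.selected]
ReplaceUnknownWords <- function(tokens, dictionary) {
  # Any token outside the fixed dictionary is mapped to the <UNK> marker
  tokens[!(tokens %in% dictionary)] <- "<UNK>"
  tokens
}
# e.g. twitter.tokens <- lapply(twitter.tokens, ReplaceUnknownWords, dictionary = dictionary)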
The histogram below shows the distribution of word instances. It uses a square-root scale on the y axis because of the large number of low-frequency words, and the x axis is truncated because the tail of the distribution is so long. It shows that omitting low-frequency words should have minimal impact on accuracy.
library(ggplot2)
ggplot(total.words, aes(x = count)) +
geom_histogram(binwidth = 2, fill = "saddlebrown") +
xlim(0, 1000) +
scale_y_sqrt() +
labs(x = "Word instances", y = "Unique words")
I have created a simple trigram (3-word) model using 60% of the data. It was quite inaccurate: for Quiz 2 the model predicted the answers without using the shortlist of answers as context, and only 3 out of 10 answers were correct; for 6 of the questions none of the shortlisted answers appeared in the top 5 predictions. The algorithm will need to use additional context from the text, and repeatable measurement of accuracy and perplexity for each model is also important.
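The trigram model code is not included in this report; a minimal sketch of the lookup step, assuming a data frame of counted trigrams with columns word1, word2, word3 and count (trigram.counts and PredictNextWord are hypothetical names), might be:

PredictNextWord <- function(trigrams, w1, w2, n = 5) {
  # Find trigrams whose first two words match the context, most frequent first
  candidates <- trigrams[trigrams$word1 == w1 & trigrams$word2 == w2, ]
  candidates <- candidates[order(candidates$count, decreasing = TRUE), ]
  head(candidates$word3, n)
}
# e.g. PredictNextWord(trigram.counts, "i", "would", n = 5)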
Using 60% of the training data exceeded the 8 GB memory limit on my PC. The table below shows durations and memory usage for a selection of models. Memory usage peaked during the grouping and summarising of the trigrams, sketched below.
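The grouping step is sketched here on the assumption that dplyr is used over a three-column data frame of trigram tokens (trigram.tokens is a placeholder name); the actual implementation may differ.

library(dplyr, quietly = TRUE)
trigram.counts <- trigram.tokens %>%
  group_by(word1, word2, word3) %>%   # the full set of trigrams is held in memory here
  summarise(count = n()) %>%
  ungroup() %>%
  filter(count > 1)                   # filter out unique trigrams to reduce the model size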
The model was run on a virtual machine (VM) in Microsoft Azure. This was slower than my laptop but has 56 GB of RAM available. It should be possible to use Revolution R to take advantage of the 8-core processor on the Azure VM and significantly reduce processing time.
Once the measurement of accuracy and perplexity is implemented, reductions in memory usage should be a priority. The models should be initially tuned on a subset of the data.
load("data\\metrics.RData")
kable(metrics[c(1, 3:6, 11, 17:19), ], align = 'l')
| | start.time | records | runtime | mem.before | mem.after | mem.model | comment |
|---|---|---|---|---|---|---|---|
| 1 | 2015-12-26 10:54:06 | 1000 | 2.53 secs | 97.1 MB | 102 MB | 4.11 MB | First model |
| 3 | 2015-12-26 11:18:19 | 1000 | 1.37 secs | 125 MB | 129 MB | 3.44 MB | Trigrams split into 3 word vectors |
| 4 | 2015-12-26 11:21:21 | 10000 | 8.9 secs | 129 MB | 157 MB | 24.7 MB | Trigrams split into 3 word vectors |
| 5 | 2015-12-26 12:07:56 | 100000 | 1.28 mins | 176 MB | 383 MB | 171 MB | Larger dataset |
| 6 | 2015-12-26 12:26:33 | 100000 | 1.37 mins | 190 MB | 204 MB | 13.4 MB | Filter out unique trigrams |
| 11 | 2015-12-26 18:20:50 | 10000 | 7.22 secs | 108 MB | 109 MB | 1.02 MB | Remove unique words from index |
| 17 | 2015-12-27 21:50:38 | 100000 | 1.3 mins | 138 MB | 165 MB | 17.1 MB | Using StringsAsFactors, but no index |
| 18 | 2015-12-28 04:19:50 | 100000 | 5.88 mins | 119 MB | 144 MB | 13.3 MB | Rerun on Azure VM |
| 19 | 2015-12-28 04:30:44 | 2561806 | 1.83 hours | 865 MB | 1.06 GB | 204 MB | Rerun on Azure VM |
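The metrics in the table were loaded from metrics.RData; the sketch below is only an assumption about how such measurements could be captured, using gc() for memory in use and object.size() for the finished model (RecordMetricsSketch and build.model are hypothetical names).

RecordMetricsSketch <- function(build.model, records, comment = "") {
  mem.before <- sum(gc()[, 2])          # MB in use before the run
  start.time <- Sys.time()
  model <- build.model()
  runtime <- Sys.time() - start.time
  mem.after <- sum(gc()[, 2])           # MB in use after the run
  data.frame(start.time = start.time, records = records,
             runtime = format(runtime),
             mem.before = mem.before, mem.after = mem.after,
             mem.model = format(object.size(model), units = "auto"),
             comment = comment, stringsAsFactors = FALSE)
}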
Here is a high-level plan to complete the project.