This report is published at the halfway point of a project to predict the next word in a sentence. It summarises the distribution of words and word sequences in a sample of English text from the internet, presents my findings so far, and outlines a plan to complete the project.

Basic analysis

Data load and preparation

The dataset is a curated corpus known as the HC Corpora. It contains text from Twitter, blogs and news articles.

# Download and unzip the training data if it is not already present
if(!file.exists("Coursera-SwiftKey.zip")) {
  training.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(training.url, "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Open the files in binary mode and read them as UTF-8, skipping embedded nulls
con.twitter <- file("final/en_US/en_US.twitter.txt", open = "rb")
con.blogs <- file("final/en_US/en_US.blogs.txt", open = "rb")
con.news <- file("final/en_US/en_US.news.txt", open = "rb")
twitter <- readLines(con.twitter, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con.blogs, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con.news, encoding = "UTF-8", skipNul = TRUE)
close(con.twitter)
close(con.blogs)
close(con.news)

A new TokeniseText function simplifies the text into tokens that are easier to model, separating each piece of text into words (tokens). The tokens are lower case and contain no numbers or whitespace; punctuation is limited to hyphens and apostrophes within words.

source("next-word-functions.R")
twitter.tokens <- sapply(twitter, TokeniseText)
blogs.tokens <- sapply(blogs, TokeniseText)
news.tokens <- sapply(news, TokeniseText)
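
The exact implementation lives in next-word-functions.R. As an illustration only, a tokeniser along these lines could be sketched with base R string functions (TokeniseTextSketch below is a hypothetical stand-in, not the function used in the project):

# Sketch only: lower-case, keep letters plus in-word hyphens and apostrophes,
# then split on whitespace. The real TokeniseText may differ.
TokeniseTextSketch <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z'-]+", " ", text)          # numbers and other punctuation become spaces
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- gsub("^['-]+|['-]+$", "", tokens)   # strip hyphens/apostrophes at word edges
  tokens[tokens != ""]
}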

Word frequency

A new CountWords function produces a summary table with the count of each unique word. This information is used to produce the word frequency table below. The code that produces the table is omitted for brevity.

name      line.count   word.count    unique.words   words.per.line   average.frequency
Twitter   2,360,148    29,647,990    363,503        13               82
Blogs     899,288      37,319,338    341,745        41               109
News      1,010,242    33,696,664    307,317        33               110
Total     4,269,678    100,663,992   738,308        24               136
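
For illustration only, the counting step could look something like the sketch below (CountWordsSketch is a hypothetical stand-in for the project's CountWords):

# Sketch only: count the instances of each unique word across a list of token vectors.
CountWordsSketch <- function(token.list) {
  all.tokens <- unlist(token.list, use.names = FALSE)
  counts <- sort(table(all.tokens), decreasing = TRUE)
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}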

Frequent words

The word cloud below shows the 100 most frequent words. These account for 46% of word instances. The most frequent word is “the”, which occurs 4,768,355 times and accounts for 5% of word instances.

set.seed(1234)
library(wordcloud, quietly = TRUE)
pal <- brewer.pal(8, "Dark2")
wordcloud(total.words$word, total.words$count, max.words = 100, 
          scale = c(4, 1), colors = pal, random.color = TRUE)

Dictionary size and coverage

A large number of words occur only once or twice and are unlikely to be predicted. Using a fixed dictionary that includes only frequent words would improve algorithm performance, and marking the omitted words as unknown (<UNK>) would provide a way for the algorithm to handle words that are not in the original dataset.

library(knitr)
# Dictionary size and word coverage for a range of minimum-instance thresholds
min.instances <- c(1, 2, 3, 5, 10, 20, 50, 80, 100)
dictionary.size <- sapply(min.instances, 
                          function(x) sum(total.words$count >= x))
words.covered <- sapply(min.instances, 
                        function(x) sum(total.words$count[total.words$count >= x]))
coverage <- paste0(round((words.covered / words.covered[1]) * 100, 2), "%")
word.coverage <- data.frame(min.instances, dictionary.size, coverage)
kable(word.coverage, align = 'l', format.args = list(big.mark = ","))
min.instances   dictionary.size   coverage
1               738,308           100%
2               317,244           99.58%
3               226,658           99.4%
5               158,186           99.17%
10              103,331           98.82%
20              70,346            98.37%
50              43,284            97.54%
80              33,657            96.94%
100             29,731            96.59%

Omitted words

Based on the data above, let's start with a minimum dictionary size of 30,000 words; this corresponds to a min.instances threshold of 80 and can be tuned later. The word cloud below shows a random sample of 50 words that would be omitted from the dictionary and treated as unknown.

# Words below the minimum-instance threshold are omitted from the dictionary
min.instance.selected <- max(min.instances[dictionary.size >= 30000])
words.omitted <- total.words$word[total.words$count < min.instance.selected]
set.seed(1234)
subset.omitted <- as.character(words.omitted[sample(length(words.omitted), 50)])

wordcloud(subset.omitted, rep(1, 50), scale = c(1.2, 1.2), rot.per = 0,
          fixed.asp = FALSE, colors = pal, random.color = TRUE)
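
For the modelling stage, the omitted words can then be mapped to the unknown token. A minimal sketch, reusing min.instance.selected and the token lists created earlier (not code run for this report):

# Sketch only: replace any word outside the chosen dictionary with <UNK>.
dictionary <- total.words$word[total.words$count >= min.instance.selected]
ReplaceUnknown <- function(tokens) ifelse(tokens %in% dictionary, tokens, "<UNK>")
twitter.tokens.unk <- lapply(twitter.tokens, ReplaceUnknown)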

Word distribution

This histogram shows the distribution of word instances. It uses a square-root scale on the y-axis because of the large number of low-frequency words, and the x-axis is truncated because the tail of the distribution is very long. This suggests that omitting low-frequency words should have minimal impact on accuracy.

library(ggplot2)
ggplot(total.words, aes(x = count)) + 
   geom_histogram(binwidth = 2, fill = "saddlebrown") + 
   xlim(0, 1000) + 
   scale_y_sqrt() +
   labs(x = "Word instances", y = "Unique words")

Initial findings

Accuracy

I have created a simple three-word (trigram) model using 60% of the data. It was quite inaccurate: for Quiz 2 the model predicted the answers without using the shortlist of answers as context, and only 3 out of 10 answers were correct; for 6 of the questions, none of the shortlisted answers appeared in the top 5 predictions. The algorithm will need to use additional context from the text. Repeatable measurement of accuracy and perplexity for each model is also important.
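
As a starting point for the evaluation code, perplexity over a held-out set can be computed from the per-word log probabilities. A minimal sketch, assuming a hypothetical predict.prob(history, word) function that returns the model's estimate of P(word | history):

# Sketch only: perplexity = exp(-mean(log P(word | history))) over held-out tokens.
PerplexitySketch <- function(tokens, predict.prob, order = 3) {
  log.probs <- sapply(seq(order, length(tokens)), function(i) {
    history <- tokens[(i - order + 1):(i - 1)]
    log(predict.prob(history, tokens[i]))
  })
  exp(-mean(log.probs))   # any zero probability gives infinite perplexity
}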

Memory

Using 60% of the training data exceeded the 8 GB memory limit on my PC. The table below shows durations and memory usage for a selection of models. Memory usage peaked during the grouping and summarising of the trigrams.

The model was run on a virtual machine (VM) in Microsoft Azure. The VM was slower than my laptop but has 56 GB of RAM available. It should be possible to use Revolution R to take advantage of the 8-core processor on the Azure VM and significantly reduce processing time.

Once the measurement of accuracy and perplexity is implemented, reducing memory usage should be a priority. The models should initially be tuned on a subset of the data.
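
One option for the grouping step is data.table, which aggregates by reference and avoids some intermediate copies. A minimal sketch of trigram counting along those lines (not the code used for the runs below):

# Sketch only: count trigrams with data.table.
library(data.table)
CountTrigramsSketch <- function(tokens) {
  n <- length(tokens)
  dt <- data.table(w1 = tokens[1:(n - 2)],
                   w2 = tokens[2:(n - 1)],
                   w3 = tokens[3:n])
  dt[, .(count = .N), by = .(w1, w2, w3)][order(-count)]
}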

load("data\\metrics.RData")
kable(metrics[c(1, 3:6, 11, 17:19), ], align = 'l')
   start.time            records   runtime      mem.before   mem.after   mem.model   comment
1  2015-12-26 10:54:06   1000      2.53 secs    97.1 MB      102 MB      4.11 MB     First model
3  2015-12-26 11:18:19   1000      1.37 secs    125 MB       129 MB      3.44 MB     Trigrams split into 3 word vectors
4  2015-12-26 11:21:21   10000     8.9 secs     129 MB       157 MB      24.7 MB     Trigrams split into 3 word vectors
5  2015-12-26 12:07:56   100000    1.28 mins    176 MB       383 MB      171 MB      Larger dataset
6  2015-12-26 12:26:33   100000    1.37 mins    190 MB       204 MB      13.4 MB     Filter out unique trigrams
11 2015-12-26 18:20:50   10000     7.22 secs    108 MB       109 MB      1.02 MB     Remove unique words from index
17 2015-12-27 21:50:38   100000    1.3 mins     138 MB       165 MB      17.1 MB     Using StringsAsFactors, but no index
18 2015-12-28 04:19:50   100000    5.88 mins    119 MB       144 MB      13.3 MB     Rerun on Azure VM
19 2015-12-28 04:30:44   2561806   1.83 hours   865 MB       1.06 GB     204 MB      Rerun on Azure VM

Plan

Here is a high-level plan to complete the project.

  1. Model evaluation
    • Measure accuracy
    • Measure perplexity
    • Log results for each model
    • Optimize evaluation (reduce validation set?)
  2. Optimize memory usage
    • Use a data.table to improve performance
    • Prune less frequent trigrams
    • Remove less frequent words and replace with <UNK>
    • Try implementing a word index?
  3. Improve model
    • Interpolation of unigrams, bigrams and trigrams (see the sketch after this list)
    • Sentence segmentation with <S>
    • Fix incorrectly encoded apostrophes
    • Try quadgrams
    • Try Part of speech tagging using tagPOS()
  4. Shiny app
    • Start this early
    • Include probabilities for each prediction
  5. Presentation
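
For the interpolation step in item 3, a minimal sketch of simple linear interpolation, assuming p1, p2 and p3 are the unigram, bigram and trigram estimates for a candidate word and the weights are tuned on held-out data:

# Sketch only: linear interpolation of n-gram probability estimates.
InterpolatedProb <- function(p1, p2, p3, lambdas = c(0.1, 0.3, 0.6)) {
  stopifnot(isTRUE(all.equal(sum(lambdas), 1)))   # weights must sum to 1
  lambdas[1] * p1 + lambdas[2] * p2 + lambdas[3] * p3
}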