This report is published at the halfway point of a project to predict the next word in a sentence. It summarises the distribution of words and word sequences in a sample of English text from the internet, presents my findings so far, and outlines a plan to complete the project.

Basic analysis

Data load and preparation

The dataset is a curated corpus known as the HC Corpora. It contains text from Twitter, blogs and news articles.

# Download and unzip the training data if it is not already present
if(!file.exists("Coursera-SwiftKey.zip")) {
  training.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(training.url, "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Open the files in binary mode and read them as UTF-8, skipping embedded nulls
con.twitter <- file("final/en_US/en_US.twitter.txt", open = "rb")
con.blogs <- file("final/en_US/en_US.blogs.txt", open = "rb")
con.news <- file("final/en_US/en_US.news.txt", open = "rb")
twitter <- readLines(con.twitter, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(con.blogs, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con.news, encoding = "UTF-8", skipNul = TRUE)
close(con.twitter)
close(con.blogs)
close(con.news)

A new TokeniseText function simplifies the text into tokens that are easier to model, separating each piece of text into words (tokens). The tokens are lower case and contain no numbers or whitespace; punctuation is limited to hyphens and apostrophes within words.

source("next-word-functions.R")
twitter.tokens <- sapply(twitter, TokeniseText)
blogs.tokens <- sapply(blogs, TokeniseText)
news.tokens <- sapply(news, TokeniseText)
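
The exact implementation lives in next-word-functions.R. As an illustration only, a tokeniser along these lines could be sketched with base R string functions (TokeniseTextSketch below is a hypothetical stand-in, not the function used in the project):

# Sketch only: lower-case, keep letters plus in-word hyphens and apostrophes,
# then split on whitespace. The real TokeniseText may differ.
TokeniseTextSketch <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z'-]+", " ", text)          # numbers and other punctuation become spaces
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- gsub("^['-]+|['-]+$", "", tokens)   # strip hyphens/apostrophes at word edges
  tokens[tokens != ""]
}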

Word frequency

A new CountWords function produces a summary table with the count of each unique word. This information is used to produce the word frequency table below. The code that produces the table is omitted for brevity.

name      line.count   word.count    unique.words   words.per.line   average.frequency
Twitter   2,360,148    29,647,990    363,503        13               82
Blogs     899,288      37,319,338    341,745        41               109
News      1,010,242    33,696,664    307,317        33               110
Total     4,269,678    100,663,992   738,308        24               136
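
For illustration only, the counting step could look something like the sketch below (CountWordsSketch is a hypothetical stand-in for the project's CountWords):

# Sketch only: count the instances of each unique word across a list of token vectors.
CountWordsSketch <- function(token.list) {
  all.tokens <- unlist(token.list, use.names = FALSE)
  counts <- sort(table(all.tokens), decreasing = TRUE)
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}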

Frequent words

The word cloud below shows the 100 most frequent words. These account for 46% of word instances. The most frequent word is “the”, which occurs 4,768,355 times and accounts for 5% of word instances.

set.seed(1234)
library(wordcloud, quietly = TRUE)
pal <- brewer.pal(8, "Dark2")
wordcloud(total.words$word, total.words$count, max.words = 100, 
          scale = c(4, 1), colors = pal, random.color = TRUE)

Dictionary size and coverage

A large number of words occur only once or twice and are unlikely to be predicted. Using a fixed dictionary that includes only frequent words would improve algorithm performance, and marking the omitted words as unknown (<UNK>) would provide a way for the algorithm to handle words that are not in the original dataset.

library(knitr)
# Dictionary size and word coverage for a range of minimum-instance thresholds
min.instances <- c(1, 2, 3, 5, 10, 20, 50, 80, 100)
dictionary.size <- sapply(min.instances, 
                          function(x) sum(total.words$count >= x))
words.covered <- sapply(min.instances, 
                        function(x) sum(total.words$count[total.words$count >= x]))
coverage <- paste0(round((words.covered / words.covered[1]) * 100, 2), "%")
word.coverage <- data.frame(min.instances, dictionary.size, coverage)
kable(word.coverage, align = 'l', format.args = list(big.mark = ","))
min.instances   dictionary.size   coverage
1               738,308           100%
2               317,244           99.58%
3               226,658           99.4%
5               158,186           99.17%
10              103,331           98.82%
20              70,346            98.37%
50              43,284            97.54%
80              33,657            96.94%
100             29,731            96.59%

Omitted words

Based on the data above, let's start with a minimum dictionary size of 30,000 words; this corresponds to a min.instances threshold of 80 and can be tuned later. The word cloud below shows a random sample of 50 words that would be omitted from the dictionary and treated as unknown.

# Words below the minimum-instance threshold are omitted from the dictionary
min.instance.selected <- max(min.instances[dictionary.size >= 30000])
words.omitted <- total.words$word[total.words$count < min.instance.selected]
set.seed(1234)
subset.omitted <- as.character(words.omitted[sample(length(words.omitted), 50)])

wordcloud(subset.omitted, rep(1, 50), scale = c(1.2, 1.2), rot.per = 0,
          fixed.asp = FALSE, colors = pal, random.color = TRUE)
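
For the modelling stage, the omitted words can then be mapped to the unknown token. A minimal sketch, reusing min.instance.selected and the token lists created earlier (not code run for this report):

# Sketch only: replace any word outside the chosen dictionary with <UNK>.
dictionary <- total.words$word[total.words$count >= min.instance.selected]
ReplaceUnknown <- function(tokens) ifelse(tokens %in% dictionary, tokens, "<UNK>")
twitter.tokens.unk <- lapply(twitter.tokens, ReplaceUnknown)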

Word distribution

This histogram shows the distribution of word instances. It uses a square-root scale on the y-axis because of the large number of low-frequency words, and the x-axis is truncated because the tail of the distribution is very long. This suggests that omitting low-frequency words should have minimal impact on accuracy.

library(ggplot2)
ggplot(total.words, aes(x = count)) + 
   geom_histogram(binwidth = 2, fill = "saddlebrown") + 
   xlim(0, 1000) + 
   scale_y_sqrt() +
   labs(x = "Word instances", y = "Unique words")

Initial findings

Accuracy

I have created a simple three-word (trigram) model using 60% of the data. It was quite inaccurate: for Quiz 2 the model predicted the answers without using the shortlist of answers as context, and only 3 out of 10 answers were correct; for 6 of the questions, none of the shortlisted answers appeared in the top 5 predictions. The algorithm will need to use additional context from the text. Repeatable measurement of accuracy and perplexity for each model is also important.
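
As a starting point for the evaluation code, perplexity over a held-out set can be computed from the per-word log probabilities. A minimal sketch, assuming a hypothetical predict.prob(history, word) function that returns the model's estimate of P(word | history):

# Sketch only: perplexity = exp(-mean(log P(word | history))) over held-out tokens.
PerplexitySketch <- function(tokens, predict.prob, order = 3) {
  log.probs <- sapply(seq(order, length(tokens)), function(i) {
    history <- tokens[(i - order + 1):(i - 1)]
    log(predict.prob(history, tokens[i]))
  })
  exp(-mean(log.probs))   # any zero probability gives infinite perplexity
}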

Memory

Using 60% of the training data exceeded the 8 GB memory limit on my PC. The table below shows durations and memory usage for a selection of models. Memory usage peaked during the grouping and summarising of the trigrams.

The model was run on a virtual machine (VM) in Microsoft Azure. The VM was slower than my laptop but has 56 GB of RAM available. It should be possible to use Revolution R to take advantage of the 8-core processor on the Azure VM and significantly reduce processing time.

Once the measurement of accuracy and perplexity is implemented, reducing memory usage should be a priority. The models should initially be tuned on a subset of the data.
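
One option for the grouping step is data.table, which aggregates by reference and avoids some intermediate copies. A minimal sketch of trigram counting along those lines (not the code used for the runs below):

# Sketch only: count trigrams with data.table.
library(data.table)
CountTrigramsSketch <- function(tokens) {
  n <- length(tokens)
  dt <- data.table(w1 = tokens[1:(n - 2)],
                   w2 = tokens[2:(n - 1)],
                   w3 = tokens[3:n])
  dt[, .(count = .N), by = .(w1, w2, w3)][order(-count)]
}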

load("data\\metrics.RData")
kable(metrics[c(1, 3:6, 11, 17:19), ], align = 'l')
   start.time            records   runtime      mem.before   mem.after   mem.model   comment
1  2015-12-26 10:54:06   1000      2.53 secs    97.1 MB      102 MB      4.11 MB     First model
3  2015-12-26 11:18:19   1000      1.37 secs    125 MB       129 MB      3.44 MB     Trigrams split into 3 word vectors
4  2015-12-26 11:21:21   10000     8.9 secs     129 MB       157 MB      24.7 MB     Trigrams split into 3 word vectors
5  2015-12-26 12:07:56   100000    1.28 mins    176 MB       383 MB      171 MB      Larger dataset
6  2015-12-26 12:26:33   100000    1.37 mins    190 MB       204 MB      13.4 MB     Filter out unique trigrams
11 2015-12-26 18:20:50   10000     7.22 secs    108 MB       109 MB      1.02 MB     Remove unique words from index
17 2015-12-27 21:50:38   100000    1.3 mins     138 MB       165 MB      17.1 MB     Using StringsAsFactors, but no index
18 2015-12-28 04:19:50   100000    5.88 mins    119 MB       144 MB      13.3 MB     Rerun on Azure VM
19 2015-12-28 04:30:44   2561806   1.83 hours   865 MB       1.06 GB     204 MB      Rerun on Azure VM

Plan

Here is a high-level plan to complete the project.

  1. Model evaluation
    • Measure accuracy
    • Measure perplexity
    • Log results for each model
    • Optimize evaluation (reduce validation set?)
  2. Optimize memory usage
    • Use a data.table to improve performance
    • Prune less frequent trigrams
    • Remove less frequent words and replace with <UNK>
    • Try implementing a word index?
  3. Improve model
    • Interpolation of unigrams, bigrams and trigrams (see the sketch after this list)
    • Sentence segmentation with <S>
    • Fix incorrectly encoded apostrophes
    • Try quadgrams
    • Try Part of speech tagging using tagPOS()
  4. Shiny app
    • Start this early
    • Include probabilities for each prediction
  5. Presentation
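
For the interpolation step in item 3, a minimal sketch of simple linear interpolation, assuming p1, p2 and p3 are the unigram, bigram and trigram estimates for a candidate word and the weights are tuned on held-out data:

# Sketch only: linear interpolation of n-gram probability estimates.
InterpolatedProb <- function(p1, p2, p3, lambdas = c(0.1, 0.3, 0.6)) {
  stopifnot(isTRUE(all.equal(sum(lambdas), 1)))   # weights must sum to 1
  lambdas[1] * p1 + lambdas[2] * p2 + lambdas[3] * p3
}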