Introduction

The following report explores the Swift Keyboard data set to gather an understanding of how to build a predictive n-gram model from it. Word, bigram, and trigram frequencies are the primary focus. Bigrams and trigrams are sequences of two and three consecutive words, respectively, extracted from a document.
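
As a small illustration (not drawn from the data set itself), the bigrams and trigrams of a short sentence can be generated with quanteda, the package used throughout this report:

# illustrative only: bigrams and trigrams of one example sentence
library(quanteda)
example_toks <- tokens("the quick brown fox jumps")
tokens_ngrams(example_toks, n = 2)  # "the_quick" "quick_brown" "brown_fox" "fox_jumps"
tokens_ngrams(example_toks, n = 3)  # "the_quick_brown" "quick_brown_fox" "brown_fox_jumps"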

The Data

The combined data set contains 4,269,678 documents. The Twitter data set has 2,360,148 documents (55% of the total), the News data set has 1,010,242 documents (24%), and the Blogs data set has 899,288 documents (21%).

The data sets have been combined into a single corpus for analysis and sampled so the analysis fits in memory.

A sample of each of the data sets is below.

library(quanteda)
library(knitr)
set.seed(123)

path_to_data <- file.path(".", "data", "final", "en_US")

twitter <- file.path(path_to_data, "en_US.twitter.txt")
news <- file.path(path_to_data, "en_US.news.txt")
blogs <- file.path(path_to_data, "en_US.blogs.txt")

# read a text file into a character vector, one document per line
get_data <- function(path_to_file) {
    file_con <- file(path_to_file)
    txt <- readLines(file_con)
    close(file_con)
    txt
} 

twitter_txt <- get_data(twitter)
news_txt <- get_data(news)
blogs_txt <- get_data(blogs)

dat <- corpus(c(
    twitter = twitter_txt,
    news = news_txt,
    blogs = blogs_txt
))

sample_size <- 0.10  # sample 10% of the full data set
dat_sample <- corpus_sample(dat, size = floor(ndoc(dat) * sample_size))
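
The per-source document counts and shares quoted above can be checked from the loaded text vectors; a minimal sketch, using the objects created in the chunk above, is:

# per-source document counts and their share of the combined corpus
doc_counts <- c(
    twitter = length(twitter_txt),
    news = length(news_txt),
    blogs = length(blogs_txt)
)
doc_counts
round(doc_counts / sum(doc_counts), 2)  # should roughly match the 55% / 24% / 21% split above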

Twitter Corpus

twitter_corpus <- corpus(twitter_txt)
kable(summary(corpus_sample(twitter_corpus, size = 10)))
Text Types Tokens Sentences
text1948344 16 16 2
text1186132 8 8 1
text1841323 6 6 1
text2161139 21 26 4
text609601 25 26 3
text1700242 23 25 2
text1000545 11 11 1
text2160865 11 12 2
text297306 5 5 1
text1200045 19 23 1

News Corpus

news_corpus <- corpus(news_txt)
kable(summary(corpus_sample(news_corpus, size = 10)))
Text Types Tokens Sentences
text564458 32 41 2
text629060 44 52 3
text380840 33 41 2
text209130 48 62 2
text486872 2 2 1
text610596 40 48 2
text648827 35 39 2
text846894 41 55 3
text530555 40 50 2
text330853 21 29 2

Blogs Corpus

blogs_corpus <- corpus(blogs_txt)
kable(summary(corpus_sample(blogs_corpus, size = 10)))
Text Types Tokens Sentences
text532821 38 47 2
text500278 5 6 1
text402100 7 7 1
text437603 55 81 1
text627095 20 22 1
text458534 5 5 1
text548368 7 7 1
text548300 5 5 1
text779589 42 49 2
text113494 108 163 7

Unigram Word Frequencies

Single word frequencies will be explored to see how they are distributed in the corpus.

library(quanteda.textstats)
dat_toks <- tokens(
    dat_sample, 
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE
)
dat_dfm <- dfm(dat_toks)
uni_freq <- textstat_frequency(dat_dfm)

Highest Frequency Words

kable(head(uni_freq, 10))
feature frequency rank docfreq group
the 477576 1 203048 all
to 275641 2 163220 all
and 241965 3 139579 all
a 238336 4 145047 all
of 200787 5 120526 all
i 166051 6 96683 all
in 164574 7 112156 all
for 110014 8 87797 all
is 108272 9 81365 all
that 104924 10 75621 all

Lowest Frequency Words

kable(tail(uni_freq, 10))
feature frequency rank docfreq group
223015 4-2-1 1 95671 1 all
223016 hysteria-driving 1 95671 1 all
223017 hero2 1 95671 1 all
223018 dressmaker’s 1 95671 1 all
223019 sticktime 1 95671 1 all
223020 acas 1 95671 1 all
223021 multiculturalists 1 95671 1 all
223022 #whitneyhayes 1 95671 1 all
223023 #registeritbeyootch 1 95671 1 all
223024 joy-killing 1 95671 1 all

library(ggplot2)

freq_dist <- data.frame(
    cum_sum = cumsum(uni_freq$frequency), 
    idx = 1:nrow(uni_freq),
    rank = uni_freq$rank
)

Distribution of Frequency of Words Sorted by Rank

ggplot(data = freq_dist, aes(x = rank, y = cum_sum)) + 
    geom_line() + 
    xlab("Feature Frequency Rank") + 
    ylab("Cumulative Sum of Frequency in Sample") + 
    ggtitle("Frequency of Words")

Frequency of Words in Log-Log Scale

ggplot(data = uni_freq, aes(x = log(rank), y = log(frequency))) + 
    geom_line() + 
    xlab("Log(Rank)") + 
    ylab("Log(Frequency)") + 
    ggtitle("Frequency of Words in Log-Log Scale")

The graphic above shows that relatively few words account for the majority of word occurrences in the corpus.

Words that are used only once represent 57% of the unique words in the corpus. These words are generally slang, misspellings, names, and other uncommon terms.
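
The single-use share quoted above can be computed directly from the unigram frequency table; a minimal check, using the uni_freq object built earlier, is:

# share of unique words that appear only once in the sample
single_use <- sum(uni_freq$frequency == 1)
single_use / nrow(uni_freq)  # roughly 0.57 for this sample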

According to Webster’s Third New International Dictionary, the English language contains approximately 476,000 words. If the single-use words are removed, the corpus contains 95,868 unique words, which covers only 20% of the words in the English language. Covering 50% of the English vocabulary would require 238,000 unique words, and covering 90% would require 428,400.
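
These coverage figures are simple arithmetic against the assumed 476,000-word vocabulary; a sketch of the calculation is:

# coverage arithmetic against an assumed ~476,000-word English vocabulary
english_words <- 476000
non_single <- sum(uni_freq$frequency > 1)  # unique words after dropping single-use words
non_single / english_words                 # roughly 0.20
0.5 * english_words                        # 238,000 words for 50% coverage
0.9 * english_words                        # 428,400 words for 90% coverage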

A few steps that could increase language coverage would be to run a spell checker over the corpus, followed by stemming or lemmatizing the words to their roots. Stemming or lemmatizing reduces the number of distinct word forms in the language and therefore increases coverage. The trade-off is the possible loss of some context or meaning in the sequence of words.
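
As a sketch of the stemming step mentioned above (spell checking is not shown, since it would require a package not used in this report), quanteda's tokens_wordstem() collapses inflected forms to a common stem:

# illustrative only: stemming collapses inflected forms to a shared root
tokens_wordstem(tokens("run runs running runner"))
# expected to yield "run" "run" "run" "runner" with the default English stemmer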

Bigram Frequencies

Bigram instances will be explored to determine how they are distributed in the dataset.

toks_bigram <- tokens_ngrams(dat_toks, n = 2)
bigram <- dfm(toks_bigram)
bigram_freq <- textstat_frequency(bigram)

High Frequency Bigrams

kable(head(bigram_freq, 10))
feature frequency rank docfreq group
of_the 42896 1 35235 all
in_the 41214 2 35596 all
to_the 21474 3 19672 all
for_the 20110 4 18953 all
on_the 19703 5 18182 all
to_be 16372 6 14985 all
at_the 14347 7 13468 all
and_the 12552 8 11706 all
in_a 11915 9 11273 all
with_the 10605 10 10021 all

Low Frequency Bigrams

kable(tail(bigram_freq, 10))
feature frequency rank docfreq group
2687052 on_fairy 1 664874 1 all
2687053 fairy_stories 1 664874 1 all
2687054 grow_numb 1 664874 1 all
2687055 possess_unless 1 664874 1 all
2687056 intentionally_fight 1 664874 1 all
2687057 mentally_lock 1 664874 1 all
2687058 we_daily 1 664874 1 all
2687059 daily_encounter 1 664874 1 all
2687060 encounter_therefore 1 664874 1 all
2687061 longer_delight 1 664874 1 all

There are 2,681,545 unique bigram instances, which represents only 0.0012% of all possible bigrams in the English language. However, as the graphics below show, the most frequent bigrams account for a large share of total usage, so the sample still captures the most commonly used word pairs in the language. There are 2,022,188 bigrams that occur only once, representing 75% of all unique bigram instances.
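
The single-use bigram figures can be checked directly against the bigram_freq table built earlier; a minimal sketch:

# count and share of bigrams that occur only once in the sample
single_bigrams <- sum(bigram_freq$frequency == 1)
single_bigrams                      # roughly 2.0 million in this sample
single_bigrams / nrow(bigram_freq)  # roughly 0.75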

bigramfreq_dist <- data.frame(
    cum_sum = cumsum(bigram_freq$frequency), 
    idx = 1:nrow(bigram_freq)
)

Frequency of Bigram Instances Sorted by Rank

ggplot(data = bigramfreq_dist, aes(x = idx, y = cum_sum)) + 
    geom_line() + 
    xlab("Index of Feature Sorted by Frequency Rank") + 
    ylab("Cumulative Sum of Frequency in Sample") + 
    ggtitle("Frequency of Bigram Instances")

Frequency of Bigram Instances in Log-Log Scale

ggplot(data = bigram_freq, aes(x = log(rank), y = log(frequency))) + 
    geom_line() + 
    xlab("Log(Rank)") + 
    ylab("Log(Frequency)") + 
    ggtitle("Frequency of Bigrame Instances in Log-Log Scale")

Trigram Frequencies

Trigram instances will be explored to see how they are distributed in the corpus.

toks_trigram <- tokens_ngrams(dat_toks, n = 3)
trigram <- dfm(toks_trigram)
trigram_freq <- textstat_frequency(trigram)

High Frequency Trigrams

kable(head(trigram_freq, 10))
feature frequency rank docfreq group
one_of_the 3436 1 3337 all
a_lot_of 2928 2 2788 all
thanks_for_the 2329 3 2323 all
going_to_be 1823 4 1731 all
to_be_a 1823 4 1785 all
the_end_of 1574 6 1536 all
out_of_the 1532 7 1506 all
i_want_to 1506 8 1425 all
it_was_a 1411 9 1376 all
some_of_the 1396 10 1368 all

Low Frequency Trigrams

kable(tail(trigram_freq, 10))
feature frequency rank docfreq group
6251115 up_the_wonders 1 727214 1 all
6251116 wonders_that_we 1 727214 1 all
6251117 that_we_daily 1 727214 1 all
6251118 we_daily_encounter 1 727214 1 all
6251119 daily_encounter_therefore 1 727214 1 all
6251120 encounter_therefore_we 1 727214 1 all
6251121 therefore_we_will 1 727214 1 all
6251122 no_longer_delight 1 727214 1 all
6251123 longer_delight_in 1 727214 1 all
6251124 in_their_beauty 1 727214 1 all

There are 6,251,124 unique trigram instances, which represents 6.42e-09% of the possible trigram instances in the language. However, consistent with the bigrams, the most frequent trigrams account for a large share of total usage. There are 5,523,911 trigrams that occur only once, representing 88.37% of all unique trigram instances.
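
If single-use n-grams were eventually dropped before modeling, quanteda's dfm_trim() could do so; the sketch below is not applied in this report:

# sketch: drop trigrams that occur only once before building the model
trigram_trimmed <- dfm_trim(trigram, min_termfreq = 2)
nfeat(trigram)          # unique trigrams before trimming
nfeat(trigram_trimmed)  # far fewer unique trigrams after trimming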

trigramfreq_dist <- data.frame(
    cum_sum = cumsum(trigram_freq$frequency), 
    idx = 1:nrow(trigram_freq)
)

Frequency of Trigrams Sorted by Rank

ggplot(data = trigramfreq_dist, aes(x = idx, y = cum_sum)) + 
    geom_line() + 
    xlab("Index of Feature Sorted by Frequency Rank") + 
    ylab("Cumulative Sum of Frequency in Sample") + 
    ggtitle("Frequency of Trigrams")

Frequency of Trigrams in Log-Log Scale

ggplot(data = trigram_freq, aes(x = log(rank), y = log(frequency))) + 
    geom_line() + 
    xlab("Log(Rank)") + 
    ylab("Log(Frequency)") + 
    ggtitle("Frequency of Words in Log-Log Scale")

Next Steps

Further cleaning of the corpus may produce higher-frequency bigram and trigram instances and increase language coverage. Cleaning would also reduce the number of single-use unigram, bigram, and trigram instances. Increasing the sample size may raise the frequency of some words and word pairs; however, according to Zipf’s Law, the distribution will likely keep the same shape, with low-frequency words and pairs increasing at roughly the same rate as high-frequency ones.

Further testing with n-gram models will be done to determine an appropriate sample size for model computation performance and to assess whether additional cleaning improves prediction accuracy. Different modeling techniques will also be explored and compared on out-of-sample performance.
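
As a first, rough cut at the predictive model, the frequency tables above could feed a highest-frequency backoff lookup. The sketch below is only an assumption about the eventual design (the helper name predict_next_word is hypothetical, and the final model may use smoothing or other techniques instead); it relies only on the bigram_freq and trigram_freq tables built earlier.

# sketch of a highest-frequency backoff lookup over the n-gram tables
# (hypothetical helper; not the final model)
predict_next_word <- function(text, bigram_freq, trigram_freq) {
    words <- as.character(tokens(tolower(text), remove_punct = TRUE))
    n <- length(words)
    if (n == 0) return(NA_character_)
    if (n >= 2) {
        # try trigrams that start with the last two words of the input
        prefix <- paste(words[n - 1], words[n], sep = "_")
        hits <- trigram_freq[startsWith(trigram_freq$feature, paste0(prefix, "_")), ]
        if (nrow(hits) > 0) {
            return(sub(".*_", "", hits$feature[which.max(hits$frequency)]))
        }
    }
    # back off to bigrams that start with the last word of the input
    hits <- bigram_freq[startsWith(bigram_freq$feature, paste0(words[n], "_")), ]
    if (nrow(hits) > 0) {
        return(sub(".*_", "", hits$feature[which.max(hits$frequency)]))
    }
    NA_character_
}

predict_next_word("Thanks for", bigram_freq, trigram_freq)  # likely "the" in this sample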