The following report explores the Swift Keyboard data set to gather an understanding of how to build a predictive n-gram model from it. Word, bigram, and trigram frequencies are the primary focus. Bigrams and trigrams are sequences of two and three consecutive words, respectively, extracted from a document.
The full data set contains 4,269,678 documents. The Twitter data set has 2,360,148 documents (55% of the total), the News data set has 1,010,242 documents (24%), and the Blogs data set has 899,288 documents (21%).
The data sets have been combined into a single corpus for analysis and then sampled so that the analysis fits in memory.
A sample of each of the data sets is below.
library(quanteda)
library(knitr)
set.seed(123)
path_to_data <- file.path(".", "data", "final", "en_US")
twitter <- file.path(path_to_data, "en_US.twitter.txt")
news <- file.path(path_to_data, "en_US.news.txt")
blogs <- file.path(path_to_data, "en_US.blogs.txt")
get_data <- function(path_to_file) {
  # Read a text file into a character vector with one document per element;
  # skipNul guards against embedded nulls, which these raw files may contain
  file_con <- file(path_to_file)
  txt <- readLines(file_con, skipNul = TRUE)
  close(file_con)
  txt
}
twitter_txt <- get_data(twitter)
news_txt <- get_data(news)
blogs_txt <- get_data(blogs)
dat <- corpus(c(
twitter = twitter_txt,
news = news_txt,
blogs = blogs_txt
))
sample_size <- 0.10 # sample 10% of the full data set
dat_sample <- corpus_sample(dat, size = round(ndoc(dat) * sample_size)) # round to a whole number of documents
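As a quick sanity check (a sketch; output not shown here), the document counts reported above can be reproduced from the loaded vectors and the combined corpus:
length(twitter_txt) # 2,360,148 Twitter documents
length(news_txt) # 1,010,242 News documents
length(blogs_txt) # 899,288 Blogs documents
ndoc(dat) # 4,269,678 documents in the combined corpus
ndoc(dat_sample) # roughly 10% of the above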
twitter_corpus <- corpus(twitter_txt)
kable(summary(corpus_sample(twitter_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text1948344 | 16 | 16 | 2 |
| text1186132 | 8 | 8 | 1 |
| text1841323 | 6 | 6 | 1 |
| text2161139 | 21 | 26 | 4 |
| text609601 | 25 | 26 | 3 |
| text1700242 | 23 | 25 | 2 |
| text1000545 | 11 | 11 | 1 |
| text2160865 | 11 | 12 | 2 |
| text297306 | 5 | 5 | 1 |
| text1200045 | 19 | 23 | 1 |
news_corpus <- corpus(news_txt)
kable(summary(corpus_sample(news_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text564458 | 32 | 41 | 2 |
| text629060 | 44 | 52 | 3 |
| text380840 | 33 | 41 | 2 |
| text209130 | 48 | 62 | 2 |
| text486872 | 2 | 2 | 1 |
| text610596 | 40 | 48 | 2 |
| text648827 | 35 | 39 | 2 |
| text846894 | 41 | 55 | 3 |
| text530555 | 40 | 50 | 2 |
| text330853 | 21 | 29 | 2 |
blogs_corpus <- corpus(blogs_txt)
kable(summary(corpus_sample(blogs_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text532821 | 38 | 47 | 2 |
| text500278 | 5 | 6 | 1 |
| text402100 | 7 | 7 | 1 |
| text437603 | 55 | 81 | 1 |
| text627095 | 20 | 22 | 1 |
| text458534 | 5 | 5 | 1 |
| text548368 | 7 | 7 | 1 |
| text548300 | 5 | 5 | 1 |
| text779589 | 42 | 49 | 2 |
| text113494 | 108 | 163 | 7 |
Single word frequencies will be explored to see how they are distributed in the corpus.
library(quanteda.textstats)
dat_toks <- tokens(
dat_sample,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE
)
dat_dfm <- dfm(dat_toks)
uni_freq <- textstat_frequency(dat_dfm)
kable(head(uni_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| the | 477576 | 1 | 203048 | all |
| to | 275641 | 2 | 163220 | all |
| and | 241965 | 3 | 139579 | all |
| a | 238336 | 4 | 145047 | all |
| of | 200787 | 5 | 120526 | all |
| i | 166051 | 6 | 96683 | all |
| in | 164574 | 7 | 112156 | all |
| for | 110014 | 8 | 87797 | all |
| is | 108272 | 9 | 81365 | all |
| that | 104924 | 10 | 75621 | all |
kable(tail(uni_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 223015 | 4-2-1 | 1 | 95671 | 1 | all |
| 223016 | hysteria-driving | 1 | 95671 | 1 | all |
| 223017 | hero2 | 1 | 95671 | 1 | all |
| 223018 | dressmaker’s | 1 | 95671 | 1 | all |
| 223019 | sticktime | 1 | 95671 | 1 | all |
| 223020 | acas | 1 | 95671 | 1 | all |
| 223021 | multiculturalists | 1 | 95671 | 1 | all |
| 223022 | #whitneyhayes | 1 | 95671 | 1 | all |
| 223023 | #registeritbeyootch | 1 | 95671 | 1 | all |
| 223024 | joy-killing | 1 | 95671 | 1 | all |
library(ggplot2)
freq_dist <- data.frame(
cum_sum = cumsum(uni_freq$frequency),
idx = 1:nrow(uni_freq),
rank = uni_freq$rank
)
ggplot(data = freq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Words")
## Frequency of Words in Log-Log Scale
ggplot(data = uni_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Words in Log-Log Scale")
The graphics above show that relatively few words account for the majority of word occurrences in the corpus.
Words that are used only once represent 57% of the unique words in the corpus. These are generally slang, misspellings, names, and other words that are not commonly used.
In the English language there are approximately 476,000 words, according to Webster’s Third New International Dictionary. If the single-use words are removed, the corpus contains 95,868 unique words, which represents only about 20% of the words in the English language. To cover 50% of the English language, roughly 238,000 unique words would be required, and about 428,400 words would be required to cover 90%.
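As an illustration of how the coverage figures above could be derived from uni_freq (a sketch only; english_words is the dictionary estimate cited above, and the exact counts depend on the sample):
english_words <- 476000 # approximate count from Webster's Third, as cited above
hapax <- sum(uni_freq$frequency == 1) # unique words that appear exactly once
(nrow(uni_freq) - hapax) / english_words # roughly 20% coverage of the language
0.5 * english_words # about 238,000 words for 50% coverage
0.9 * english_words # about 428,400 words for 90% coverage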
A few steps that could be taken to increase language coverage would be to run a spell checker on all the words, followed by stemming or lemmatizing the words to their roots. Stemming or lemmatizing would reduce the number of unique word forms and increase coverage. The trade-off would be possibly losing some context or meaning in a sequence of words.
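A minimal sketch of the stemming idea, using quanteda's tokens_wordstem (not applied in the analysis above; the actual reduction in unique features would need to be verified on this corpus):
dat_toks_stemmed <- tokens_wordstem(dat_toks) # collapse inflected forms to their stems
stem_freq <- textstat_frequency(dfm(dat_toks_stemmed))
nrow(stem_freq) # expected to be smaller than nrow(uni_freq)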
Bigram instances will be explored to determine how they are distributed in the dataset.
toks_bigram <- tokens_ngrams(dat_toks, n = 2)
bigram <- dfm(toks_bigram)
bigram_freq <- textstat_frequency(bigram)
kable(head(bigram_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| of_the | 42896 | 1 | 35235 | all |
| in_the | 41214 | 2 | 35596 | all |
| to_the | 21474 | 3 | 19672 | all |
| for_the | 20110 | 4 | 18953 | all |
| on_the | 19703 | 5 | 18182 | all |
| to_be | 16372 | 6 | 14985 | all |
| at_the | 14347 | 7 | 13468 | all |
| and_the | 12552 | 8 | 11706 | all |
| in_a | 11915 | 9 | 11273 | all |
| with_the | 10605 | 10 | 10021 | all |
kable(tail(bigram_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 2687052 | on_fairy | 1 | 664874 | 1 | all |
| 2687053 | fairy_stories | 1 | 664874 | 1 | all |
| 2687054 | grow_numb | 1 | 664874 | 1 | all |
| 2687055 | possess_unless | 1 | 664874 | 1 | all |
| 2687056 | intentionally_fight | 1 | 664874 | 1 | all |
| 2687057 | mentally_lock | 1 | 664874 | 1 | all |
| 2687058 | we_daily | 1 | 664874 | 1 | all |
| 2687059 | daily_encounter | 1 | 664874 | 1 | all |
| 2687060 | encounter_therefore | 1 | 664874 | 1 | all |
| 2687061 | longer_delight | 1 | 664874 | 1 | all |
There are 2,681,545 unique bigram instances, which represents only 0.0012% of all possible bigrams in the English language. However, as shown in the graphics below, a relatively small number of frequent bigrams accounts for a large share of all bigram occurrences, so the sample still captures the most frequently used word pairs in the language. There are 2,022,188 bigrams that occur only once, which represents 75% of all unique bigram instances.
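A sketch of how the bigram figures above could be checked (the possible-pair count assumes the ~476,000-word vocabulary estimate cited earlier):
sum(bigram_freq$frequency == 1) # bigrams that occur only once
nrow(bigram_freq) / 476000^2 # share of all possible word pairs, roughly 0.0012%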
bigramfreq_dist <- data.frame(
cum_sum = cumsum(bigram_freq$frequency),
idx = 1:nrow(bigram_freq)
)
ggplot(data = bigramfreq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Bigram Instances")
## Frequency of Bigram Instances in Log-Log Scale
ggplot(data = bigram_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Bigrame Instances in Log-Log Scale")
Trigram instances will be explored to see how they are distributed in the corpus.
toks_trigram <- tokens_ngrams(dat_toks, n = 3)
trigram <- dfm(toks_trigram)
trigram_freq <- textstat_frequency(trigram)
kable(head(trigram_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| one_of_the | 3436 | 1 | 3337 | all |
| a_lot_of | 2928 | 2 | 2788 | all |
| thanks_for_the | 2329 | 3 | 2323 | all |
| going_to_be | 1823 | 4 | 1731 | all |
| to_be_a | 1823 | 4 | 1785 | all |
| the_end_of | 1574 | 6 | 1536 | all |
| out_of_the | 1532 | 7 | 1506 | all |
| i_want_to | 1506 | 8 | 1425 | all |
| it_was_a | 1411 | 9 | 1376 | all |
| some_of_the | 1396 | 10 | 1368 | all |
kable(tail(trigram_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 6251115 | up_the_wonders | 1 | 727214 | 1 | all |
| 6251116 | wonders_that_we | 1 | 727214 | 1 | all |
| 6251117 | that_we_daily | 1 | 727214 | 1 | all |
| 6251118 | we_daily_encounter | 1 | 727214 | 1 | all |
| 6251119 | daily_encounter_therefore | 1 | 727214 | 1 | all |
| 6251120 | encounter_therefore_we | 1 | 727214 | 1 | all |
| 6251121 | therefore_we_will | 1 | 727214 | 1 | all |
| 6251122 | no_longer_delight | 1 | 727214 | 1 | all |
| 6251123 | longer_delight_in | 1 | 727214 | 1 | all |
| 6251124 | in_their_beauty | 1 | 727214 | 1 | all |
There are 6,251,124 unique trigram instances, which represents 6.42e-09% of the possible trigram instances in the language. However, consistent with the bigram instances, a relatively small number of frequent trigrams accounts for a large share of trigram occurrences. There are 5,523,911 trigrams that occur only once, which represents 88.37% of all unique trigram instances.
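For comparison across all three n-gram tables, the single-use shares quoted above could be computed with a small helper (a sketch; hapax_share is a hypothetical function introduced here for illustration):
hapax_share <- function(ft) sum(ft$frequency == 1) / nrow(ft) # share of features seen only once
data.frame(
ngram = c("unigram", "bigram", "trigram"),
single_use_share = c(hapax_share(uni_freq), hapax_share(bigram_freq), hapax_share(trigram_freq))
)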
trigramfreq_dist <- data.frame(
cum_sum = cumsum(trigram_freq$frequency),
idx = 1:nrow(trigram_freq)
)
ggplot(data = trigramfreq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Trigrams")
## Frequency of Trigrams in Log-Log Scale
ggplot(data = trigram_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Words in Log-Log Scale")
Further cleaning of the corpus may produce higher-frequency bigram and trigram instances and increase language coverage. Cleaning would also reduce the number of single-use unigram, bigram, and trigram instances. Increasing the sample size may improve the frequency of some words and word pairs; however, according to Zipf’s Law, the distribution is likely to keep the same shape as the sample grows, with low-frequency words and pairs increasing at roughly the same rate as high-frequency ones.
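One concrete cleaning step to evaluate (a sketch; the threshold is illustrative) is trimming single-occurrence features before building the model, using quanteda's dfm_trim:
dat_dfm_trimmed <- dfm_trim(dat_dfm, min_termfreq = 2) # drop words seen only once
bigram_trimmed <- dfm_trim(bigram, min_termfreq = 2)
trigram_trimmed <- dfm_trim(trigram, min_termfreq = 2)
nfeat(dat_dfm_trimmed) # compare against nfeat(dat_dfm) to measure the reduction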
Further testing with n-gram models will be done to determine an appropriate sample size for model computation performance and whether additional cleaning improves prediction accuracy. Different modeling techniques will also be explored to assess out-of-sample performance.
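As a preview of that modeling step, the bigram frequency table can already back a naive next-word lookup. This is a minimal sketch, not the final model; predict_next is a hypothetical helper introduced here for illustration:
# Return the most frequent word that follows 'word' in the bigram table
predict_next <- function(word, freq_table = bigram_freq) {
  candidates <- freq_table[startsWith(freq_table$feature, paste0(word, "_")), ]
  if (nrow(candidates) == 0) return(NA_character_)
  sub("^[^_]+_", "", candidates$feature[which.max(candidates$frequency)])
}
predict_next("of") # likely returns "the", given the frequencies above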