The following report explores the Swift Keyboard data set to gather an understanding of how to build a predictive n-gram model from it. Word, bigram, and trigram frequencies are the primary focus. Bigrams and trigrams are sequences of two and three consecutive words, respectively, extracted from a document.
The full data set contains 4,269,678 documents. The Twitter data set has 2,360,148 documents (55% of the total), the News data set has 1,010,242 documents (24%), and the Blogs data set has 899,288 documents (21%).
The data sets have been combined into a single corpus for analysis and then sampled so that the analysis fits in memory.
A sample of each of the data sets is below.
library(quanteda)
library(knitr)
set.seed(123)
path_to_data <- file.path(".", "data", "final", "en_US")
twitter <- file.path(path_to_data, "en_US.twitter.txt")
news <- file.path(path_to_data, "en_US.news.txt")
blogs <- file.path(path_to_data, "en_US.blogs.txt")
get_data <- function(path_to_file) {
  # Read a text file into a character vector with one document per element;
  # skipNul guards against embedded nulls, which these raw files may contain
  file_con <- file(path_to_file)
  txt <- readLines(file_con, skipNul = TRUE)
  close(file_con)
  txt
}
twitter_txt <- get_data(twitter)
news_txt <- get_data(news)
blogs_txt <- get_data(blogs)
dat <- corpus(c(
twitter = twitter_txt,
news = news_txt,
blogs = blogs_txt
))
sample_size <- 0.10 # sample 10% of the full data set
dat_sample <- corpus_sample(dat, size = round(ndoc(dat) * sample_size)) # round to a whole number of documents
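As a quick sanity check (a sketch; output not shown here), the document counts reported above can be reproduced from the loaded vectors and the combined corpus:
length(twitter_txt) # 2,360,148 Twitter documents
length(news_txt) # 1,010,242 News documents
length(blogs_txt) # 899,288 Blogs documents
ndoc(dat) # 4,269,678 documents in the combined corpus
ndoc(dat_sample) # roughly 10% of the above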
twitter_corpus <- corpus(twitter_txt)
kable(summary(corpus_sample(twitter_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text1948344 | 16 | 16 | 2 |
| text1186132 | 8 | 8 | 1 |
| text1841323 | 6 | 6 | 1 |
| text2161139 | 21 | 26 | 4 |
| text609601 | 25 | 26 | 3 |
| text1700242 | 23 | 25 | 2 |
| text1000545 | 11 | 11 | 1 |
| text2160865 | 11 | 12 | 2 |
| text297306 | 5 | 5 | 1 |
| text1200045 | 19 | 23 | 1 |
news_corpus <- corpus(news_txt)
kable(summary(corpus_sample(news_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text564458 | 32 | 41 | 2 |
| text629060 | 44 | 52 | 3 |
| text380840 | 33 | 41 | 2 |
| text209130 | 48 | 62 | 2 |
| text486872 | 2 | 2 | 1 |
| text610596 | 40 | 48 | 2 |
| text648827 | 35 | 39 | 2 |
| text846894 | 41 | 55 | 3 |
| text530555 | 40 | 50 | 2 |
| text330853 | 21 | 29 | 2 |
blogs_corpus <- corpus(blogs_txt)
kable(summary(corpus_sample(blogs_corpus, size = 10)))
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| text532821 | 38 | 47 | 2 |
| text500278 | 5 | 6 | 1 |
| text402100 | 7 | 7 | 1 |
| text437603 | 55 | 81 | 1 |
| text627095 | 20 | 22 | 1 |
| text458534 | 5 | 5 | 1 |
| text548368 | 7 | 7 | 1 |
| text548300 | 5 | 5 | 1 |
| text779589 | 42 | 49 | 2 |
| text113494 | 108 | 163 | 7 |
Single word frequencies will be explored to see how they are distributed in the corpus.
library(quanteda.textstats)
dat_toks <- tokens(
dat_sample,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE
)
dat_dfm <- dfm(dat_toks)
uni_freq <- textstat_frequency(dat_dfm)
kable(head(uni_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| the | 477576 | 1 | 203048 | all |
| to | 275641 | 2 | 163220 | all |
| and | 241965 | 3 | 139579 | all |
| a | 238336 | 4 | 145047 | all |
| of | 200787 | 5 | 120526 | all |
| i | 166051 | 6 | 96683 | all |
| in | 164574 | 7 | 112156 | all |
| for | 110014 | 8 | 87797 | all |
| is | 108272 | 9 | 81365 | all |
| that | 104924 | 10 | 75621 | all |
kable(tail(uni_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 223015 | 4-2-1 | 1 | 95671 | 1 | all |
| 223016 | hysteria-driving | 1 | 95671 | 1 | all |
| 223017 | hero2 | 1 | 95671 | 1 | all |
| 223018 | dressmaker’s | 1 | 95671 | 1 | all |
| 223019 | sticktime | 1 | 95671 | 1 | all |
| 223020 | acas | 1 | 95671 | 1 | all |
| 223021 | multiculturalists | 1 | 95671 | 1 | all |
| 223022 | #whitneyhayes | 1 | 95671 | 1 | all |
| 223023 | #registeritbeyootch | 1 | 95671 | 1 | all |
| 223024 | joy-killing | 1 | 95671 | 1 | all |
library(ggplot2)
freq_dist <- data.frame(
cum_sum = cumsum(uni_freq$frequency),
idx = 1:nrow(uni_freq),
rank = uni_freq$rank
)
ggplot(data = freq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Words")
## Frequency of Words in Log-Log Scale
ggplot(data = uni_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Words in Log-Log Scale")
The graphics above show that relatively few words account for the majority of word occurrences in the corpus.
Words that are used only once represent 57% of the unique words in the corpus. These are generally slang, misspellings, names, and other words that are not commonly used.
In the English language there are approximately 476,000 words, according to Webster’s Third New International Dictionary. If the single-use words are removed, the corpus contains 95,868 unique words, which represents only about 20% of the words in the English language. To cover 50% of the English language, roughly 238,000 unique words would be required, and about 428,400 words would be required to cover 90%.
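As an illustration of how the coverage figures above could be derived from uni_freq (a sketch only; english_words is the dictionary estimate cited above, and the exact counts depend on the sample):
english_words <- 476000 # approximate count from Webster's Third, as cited above
hapax <- sum(uni_freq$frequency == 1) # unique words that appear exactly once
(nrow(uni_freq) - hapax) / english_words # roughly 20% coverage of the language
0.5 * english_words # about 238,000 words for 50% coverage
0.9 * english_words # about 428,400 words for 90% coverage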
A few steps that could be taken to increase language coverage would be to run a spell checker on all the words, followed by stemming or lemmatizing the words to their roots. Stemming or lemmatizing would reduce the number of unique word forms and increase coverage. The trade-off would be possibly losing some context or meaning in a sequence of words.
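A minimal sketch of the stemming idea, using quanteda's tokens_wordstem (not applied in the analysis above; the actual reduction in unique features would need to be verified on this corpus):
dat_toks_stemmed <- tokens_wordstem(dat_toks) # collapse inflected forms to their stems
stem_freq <- textstat_frequency(dfm(dat_toks_stemmed))
nrow(stem_freq) # expected to be smaller than nrow(uni_freq)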
Bigram instances will be explored to determine how they are distributed in the dataset.
toks_bigram <- tokens_ngrams(dat_toks, n = 2)
bigram <- dfm(toks_bigram)
bigram_freq <- textstat_frequency(bigram)
kable(head(bigram_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| of_the | 42896 | 1 | 35235 | all |
| in_the | 41214 | 2 | 35596 | all |
| to_the | 21474 | 3 | 19672 | all |
| for_the | 20110 | 4 | 18953 | all |
| on_the | 19703 | 5 | 18182 | all |
| to_be | 16372 | 6 | 14985 | all |
| at_the | 14347 | 7 | 13468 | all |
| and_the | 12552 | 8 | 11706 | all |
| in_a | 11915 | 9 | 11273 | all |
| with_the | 10605 | 10 | 10021 | all |
kable(tail(bigram_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 2687052 | on_fairy | 1 | 664874 | 1 | all |
| 2687053 | fairy_stories | 1 | 664874 | 1 | all |
| 2687054 | grow_numb | 1 | 664874 | 1 | all |
| 2687055 | possess_unless | 1 | 664874 | 1 | all |
| 2687056 | intentionally_fight | 1 | 664874 | 1 | all |
| 2687057 | mentally_lock | 1 | 664874 | 1 | all |
| 2687058 | we_daily | 1 | 664874 | 1 | all |
| 2687059 | daily_encounter | 1 | 664874 | 1 | all |
| 2687060 | encounter_therefore | 1 | 664874 | 1 | all |
| 2687061 | longer_delight | 1 | 664874 | 1 | all |
There are 2,681,545 unique bigram instances, which represents only 0.0012% of all possible bigrams in the English language. However, as shown in the graphics below, a relatively small number of frequent bigrams accounts for a large share of all bigram occurrences, so the sample still captures the most frequently used word pairs in the language. There are 2,022,188 bigrams that occur only once, which represents 75% of all unique bigram instances.
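A sketch of how the bigram figures above could be checked (the possible-pair count assumes the ~476,000-word vocabulary estimate cited earlier):
sum(bigram_freq$frequency == 1) # bigrams that occur only once
nrow(bigram_freq) / 476000^2 # share of all possible word pairs, roughly 0.0012%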
bigramfreq_dist <- data.frame(
cum_sum = cumsum(bigram_freq$frequency),
idx = 1:nrow(bigram_freq)
)
ggplot(data = bigramfreq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Bigram Instances")
## Frequency of Bigram Instances in Log-Log Scale
ggplot(data = bigram_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Bigrame Instances in Log-Log Scale")
Trigram instances will be explored to see how they are distributed in the corpus.
toks_trigram <- tokens_ngrams(dat_toks, n = 3)
trigram <- dfm(toks_trigram)
trigram_freq <- textstat_frequency(trigram)
kable(head(trigram_freq, 10))
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| one_of_the | 3436 | 1 | 3337 | all |
| a_lot_of | 2928 | 2 | 2788 | all |
| thanks_for_the | 2329 | 3 | 2323 | all |
| going_to_be | 1823 | 4 | 1731 | all |
| to_be_a | 1823 | 4 | 1785 | all |
| the_end_of | 1574 | 6 | 1536 | all |
| out_of_the | 1532 | 7 | 1506 | all |
| i_want_to | 1506 | 8 | 1425 | all |
| it_was_a | 1411 | 9 | 1376 | all |
| some_of_the | 1396 | 10 | 1368 | all |
kable(tail(trigram_freq, 10))
|   | feature | frequency | rank | docfreq | group |
|---|---|---|---|---|---|
| 6251115 | up_the_wonders | 1 | 727214 | 1 | all |
| 6251116 | wonders_that_we | 1 | 727214 | 1 | all |
| 6251117 | that_we_daily | 1 | 727214 | 1 | all |
| 6251118 | we_daily_encounter | 1 | 727214 | 1 | all |
| 6251119 | daily_encounter_therefore | 1 | 727214 | 1 | all |
| 6251120 | encounter_therefore_we | 1 | 727214 | 1 | all |
| 6251121 | therefore_we_will | 1 | 727214 | 1 | all |
| 6251122 | no_longer_delight | 1 | 727214 | 1 | all |
| 6251123 | longer_delight_in | 1 | 727214 | 1 | all |
| 6251124 | in_their_beauty | 1 | 727214 | 1 | all |
There are 6,251,124 unique trigram instances, which represents 6.42e-09% of the possible trigram instances in the language. However, consistent with the bigram instances, a relatively small number of frequent trigrams accounts for a large share of trigram occurrences. There are 5,523,911 trigrams that occur only once, which represents 88.37% of all unique trigram instances.
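For comparison across all three n-gram tables, the single-use shares quoted above could be computed with a small helper (a sketch; hapax_share is a hypothetical function introduced here for illustration):
hapax_share <- function(ft) sum(ft$frequency == 1) / nrow(ft) # share of features seen only once
data.frame(
ngram = c("unigram", "bigram", "trigram"),
single_use_share = c(hapax_share(uni_freq), hapax_share(bigram_freq), hapax_share(trigram_freq))
)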
trigramfreq_dist <- data.frame(
cum_sum = cumsum(trigram_freq$frequency),
idx = 1:nrow(trigram_freq)
)
ggplot(data = trigramfreq_dist, aes(x = idx, y = cum_sum)) +
geom_line() +
xlab("Index of Feature Sorted by Frequency Rank") +
ylab("Cumulative Sum of Frequency in Sample") +
ggtitle("Frequency of Trigrams")
## Frequency of Trigrams in Log-Log Scale
ggplot(data = trigram_freq, aes(x = log(rank), y = log(frequency))) +
geom_line() +
xlab("Log(Rank)") +
ylab("Log(Frequency)") +
ggtitle("Frequency of Words in Log-Log Scale")
Further cleaning of the corpus may produce higher-frequency bigram and trigram instances and increase language coverage. Cleaning would also reduce the number of single-use unigram, bigram, and trigram instances. Increasing the sample size may improve the frequency of some words and word pairs; however, according to Zipf’s Law, the distribution is likely to keep the same shape as the sample grows, with low-frequency words and pairs increasing at roughly the same rate as high-frequency ones.
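One concrete cleaning step to evaluate (a sketch; the threshold is illustrative) is trimming single-occurrence features before building the model, using quanteda's dfm_trim:
dat_dfm_trimmed <- dfm_trim(dat_dfm, min_termfreq = 2) # drop words seen only once
bigram_trimmed <- dfm_trim(bigram, min_termfreq = 2)
trigram_trimmed <- dfm_trim(trigram, min_termfreq = 2)
nfeat(dat_dfm_trimmed) # compare against nfeat(dat_dfm) to measure the reduction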
Further testing with n-gram models will be done to determine an appropriate sample size for model computation performance and whether additional cleaning improves prediction accuracy. Different modeling techniques will also be explored to assess out-of-sample performance.
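As a preview of that modeling step, the bigram frequency table can already back a naive next-word lookup. This is a minimal sketch, not the final model; predict_next is a hypothetical helper introduced here for illustration:
# Return the most frequent word that follows 'word' in the bigram table
predict_next <- function(word, freq_table = bigram_freq) {
  candidates <- freq_table[startsWith(freq_table$feature, paste0(word, "_")), ]
  if (nrow(candidates) == 0) return(NA_character_)
  sub("^[^_]+_", "", candidates$feature[which.max(candidates$frequency)])
}
predict_next("of") # likely returns "the", given the frequencies above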