1 Overview

This is the milestone report for the Data Science Capstone project on building a predictive text model. The report describes the results of the exploratory analysis of the training data set and the design decisions that will carry over into the predictive model.

2 Data description and preprocessing

2.1 Corpus overview

The corpus consists of three large English text sources originating from blogs, news, and Twitter.

Corpus composition by source
source n_docs n_tokens vocab_size
blogs 897,580 36,333,110 243,725
news 1,008,998 33,077,470 202,592
twitter 2,359,621 28,654,472 288,499

In total:

  • Documents: 4,266,199
  • Tokens: 98,065,052
  • Unique word types (after filtering): 507,824

Twitter has the largest vocabulary despite having fewer tokens than blogs or news, which is expected given its noisier, more “creative” language.

2.2 Preprocessing and cleanup

The final goal is a predictive model that suggests 3 candidate words for a provided n-gram. Based on this goal and on the results of the exploratory analysis, I made a few data preprocessing decisions that will be applied throughout the project (a minimal sketch of the cleaning pipeline follows the list):

  • Lowercasing: all tokens are converted to lowercase
  • Alphabetic-only filter: retain only tokens matching ^[a-z]+$ regex. This removes numbers, punctuation, emojis, and other garbage.
  • No stemming / lemmatization: the final model should predict an exact form, not lemmas, so I keep the full word forms to preserve realistic n-grams.
  • Stopwords
    • for language modeling and prediction, stopwords will be kept. Function words (the, of, to, and, …) dominate n-gram patterns and must be part of the predictions.
    • for semantic Exploratory Data Analysis (EDA), I removed stopwords to highlight content.
  • Sampling
    • Global statistics are computed on the full corpus
    • heavier EDA (such as n-gram analysis) is done on a stratified sample of up to 100k documents per source
  • Profanity: at this stage, profanity is not explicitly removed
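
The sketch below assumes the raw data has been loaded into a data frame corpus_df with columns source and text (all names are illustrative, not the exact project code):

    # Tokenize, lowercase and keep alphabetic-only tokens
    library(dplyr)
    library(tidytext)

    tokens <- corpus_df %>%
      unnest_tokens(word, text) %>%          # lowercases by default
      filter(grepl("^[a-z]+$", word))        # alphabetic-only filter

    # Stopwords are dropped only for the semantic EDA view, not for the LM counts
    tokens_eda <- tokens %>%
      anti_join(get_stopwords(), by = "word")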

3 Full-corpus unigram distribution

3.1 Head of the distribution

The top 10 most frequent words in the full corpus are:

Top 10 most frequent words (full corpus, with stopwords)
word n rank
the 4,771,927 1
to 2,764,230 2
and 2,422,450 3
a 2,389,755 4
of 2,010,936 5
in 1,657,973 6
i 1,657,335 7
for 1,103,087 8
is 1,075,727 9
that 1,042,522 10

These are all function words, with the alone occurring 4,771,927 times.

3.2 Coverage plot

3.3 Zipfian coverage of tokens

The cumulative coverage of tokens by vocabulary size looks like a classic Zipfian pattern. Key points:

  • 50% of all tokens are covered by the top 132 word types (last word: being).
  • 90% of all tokens are covered by the top 6,912 word types (last word: snake).

In other words:

  • ~0.03% of the vocabulary covers half the tokens
  • ~1.4% of the vocabulary covers 90% of the tokens

This extreme concentration of probability mass in the head means we can prune the vocabulary and treat the tail as a single unknown token. See the “Vocabulary pruning and <UNK> token” section.
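
The coverage numbers above come from a simple cumulative sum over the frequency-sorted unigram table; a sketch, assuming a data frame unigrams with columns word and n:

    library(dplyr)

    coverage <- unigrams %>%
      arrange(desc(n)) %>%
      mutate(rank     = row_number(),
             cum_frac = cumsum(n) / sum(n))

    # Smallest head of the vocabulary that covers 50% / 90% of all tokens
    coverage %>% filter(cum_frac >= 0.5) %>% slice(1)
    coverage %>% filter(cum_frac >= 0.9) %>% slice(1)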

4 Semantic exploration

For semantic exploration, I use the cleaned sample where stopwords are removed.

4.1 Top content words

The top content words include terms such as time, people, day, love, life, home, week, school, and world, reflecting typical daily-life topics.

4.2 Coverage plot

4.3 Cleaned corpus unigram distribution

For the cleaned sample:

  • 50% of content tokens are covered by the top 1,658 words (last word: legs).
  • 90% of content tokens are covered by the top 18,659 words (last word: ambiguity).

5 Foreign words

5.1 Dictionary-based “non-English” words

I ran hunspell_check on the most frequent words and marked those not recognized by the English dictionary (a sketch of the check follows the table):

Example of frequent words not recognized by hunspell (is_english = FALSE)
word is_english
lol FALSE
friday FALSE
im FALSE
haha FALSE
american FALSE
saturday FALSE
monday FALSE
sunday FALSE
april FALSE
ok FALSE
york FALSE
thursday FALSE
tuesday FALSE
san FALSE
mr FALSE
wednesday FALSE
tv FALSE
obama FALSE
christmas FALSE
facebook FALSE

Many flagged tokens are:

  • informal Internet language (lol, haha, im),
  • day and month names (friday, monday, sunday, april),
  • abbreviations (ok, mr, tv),
  • proper names (obama, york, san, facebook),
which illustrates that dictionary coverage is not a reliable proxy for “foreignness”.
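
The check itself, assuming a frequency-sorted data frame top_words with a word column (illustrative name):

    library(dplyr)
    library(hunspell)

    top_words <- top_words %>%
      mutate(is_english = hunspell_check(word, dict = dictionary("en_US")))

    top_words %>% filter(!is_english)   # frequent words the dictionary does not recognize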

5.2 Approximate language distribution by source

Even though the identification of foreign words clearly contains many false positives, out of curiosity let’s explore the distribution by language using the textcat package:
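
A sketch of how this rough distribution can be obtained, assuming the stratified sample lives in sample_df with columns source and text (textcat is a crude classifier, so the counts are indicative only):

    library(dplyr)
    library(textcat)

    lang_by_source <- sample_df %>%
      mutate(lang = textcat(text)) %>%
      count(source, lang, sort = TRUE)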

5.2.1 Applicability for next steps

Rather than trying to explicitly remove or detect “foreign words”, I’m going to adopt the following:

  • Keep all tokens during EDA, including foreign words and creative spellings.
  • Handle rare and foreign words implicitly via vocabulary pruning and <UNK>, described next.

6 Vocabulary pruning and <UNK> token

6.1 Pruning strategy

I prune the unigram vocabulary by frequency using a cut-off threshold:

Vocabulary pruning statistics for the future language model
min_count_unigram vocab_size_lm total_types_full total_tokens_full
2 247,760 507,824 98,065,052

  • Cut-off threshold: min_count_unigram = 2
  • Full-corpus distinct types: 507,824
  • LM vocabulary after pruning: 247,760

All word types with frequency < min_count_unigram are mapped to <UNK>.
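
A minimal sketch of the pruning and <UNK> mapping, assuming the unigrams and tokens data frames from the earlier sketches:

    library(dplyr)

    min_count_unigram <- 2
    lm_vocab <- unigrams %>%
      filter(n >= min_count_unigram) %>%
      pull(word)

    # Everything outside the pruned vocabulary collapses into <UNK>
    tokens_lm <- tokens %>%
      mutate(word = if_else(word %in% lm_vocab, word, "<UNK>"))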

6.2 Role of <UNK>

<UNK> is the unknown word token and represents:

  • very rare words (frequency below threshold)
  • foreign words
  • heavy typos and unusual spellings
  • rare proper names

Collapsing the long tail into <UNK> dramatically reduces sparsity, and it is supported by the Zipfian shape of the distribution (see the coverage section above).


Going forward:

  • During training, all tokens with count < min_count_unigram are mapped to <UNK> before counting n-grams.
  • At prediction time, any word not in the LM vocabulary is also mapped to <UNK>.

7 N-gram exploratory analysis

To understand semantic collocations, I inspect n-grams on the cleaned sample (stopwords removed).
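
A sketch of the n-gram extraction for this EDA view, using the standard tidytext idiom and assuming sample_df holds the stratified sample with a text column (names are illustrative):

    library(dplyr)
    library(tidytext)
    library(tidyr)

    bigrams_eda <- sample_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      separate(bigram, c("w1", "w2"), sep = " ") %>%
      filter(!w1 %in% stop_words$word,         # drop bigrams containing stopwords
             !w2 %in% stop_words$word) %>%
      count(w1, w2, sort = TRUE)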

7.1 Top bigrams

7.2 Top trigrams

8 N-gram LMs

For the actual language model, I work with bigram and trigram tables in which function words (stopwords) are kept:

8.1 Bigram LM probabilities

Top 20 bigrams (LM view, with function words)
w1 w2 n prob
of the 41445 0.2281598
in the 38197 0.2709718
to the 19768 0.0837340
on the 17671 0.2644132
for the 16587 0.1898804
to be 14188 0.0600980
at the 12504 0.2952538
and the 12380 0.0565253
in a 11079 0.0785951
with the 9934 0.1608901
is a 9022 0.1008349
it was 8905 0.1252285
from the 8371 0.2568737
for a 8076 0.0924503
i was 7720 0.0631627
with a 7719 0.1250162
of a 7679 0.0422738
and i 7675 0.0350429
it is 7599 0.1068626
i have 7374 0.0603318

The model captures strong function-word patterns such as of the, in the, at the, to be. For example:

  • P(the | of) ≈ 0.23
  • P(the | in) ≈ 0.27
  • P(be | to) ≈ 0.06

These align well with my prior expectations about English word sequences.
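
The prob column above is the maximum-likelihood estimate P(w2 | w1) = n(w1, w2) / n(w1); a sketch, assuming a bigram count table bigrams with columns w1, w2, n:

    library(dplyr)

    bigram_probs <- bigrams %>%
      group_by(w1) %>%
      mutate(prob = n / sum(n)) %>%   # conditional probability given the first word
      ungroup() %>%
      arrange(desc(n))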

8.2 Trigram LM probabilities

Top 20 trigrams (LM view, with function words)
w1 w2 w3 n prob
one of the 3333 0.5104135
a lot of 2641 0.6587678
to be a 1578 0.1134354
as well as 1411 0.6102941
the end of 1386 0.6777506
going to be 1371 0.2209865
out of the 1358 0.2777096
it was a 1316 0.1515431
some of the 1300 0.4978935
be able to 1233 0.9911576
i want to 1223 0.6136478
part of the 1204 0.4091064
this is a 982 0.1841365
thanks for the 980 0.4885344
the rest of 978 0.8211587
a couple of 964 0.6634549
the first time 962 0.1793772
i have a 919 0.1266538
i have to 864 0.1190739
the fact that 864 0.8396501

Examples of very strong trigram patterns:

  • one of the → P(the | one of) ≈ 0.51
  • a lot of → P(of | a lot) ≈ 0.66
  • be able to → P(to | be able) ≈ 0.99
  • the fact that → P(that | the fact) ≈ 0.84
  • the end of, the rest of, i want to similarly have high conditional probabilities.

These observations confirm that backoff to bigrams/unigrams will be needed primarily for rare contexts, not common ones.

9 Model design decisions

9.1 Model order and backoff

Based on the analysis:

  • I will use a trigram language model P(w_n | w_{n-2}, w_{n-1}) as the main model.
  • I will adopt a backoff hierarchy:
    1. Trigram P(w_n | w_{n-2}, w_{n-1}) when context exists and has mass.
    2. Backoff to bigram P(w_n | w_{n-1}) when trigram counts are insufficient.
    3. Backoff to unigram P(w_n) in unseen or very sparse contexts.
    4. If w_n is out-of-vocabulary, the model predicts <UNK>

This ensures the model always returns a probability distribution over the vocabulary, even for previously unseen contexts; a sketch of the lookup follows.
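
The sketch below is a rough version of the planned predict_next interface, assuming trigram_probs, bigram_probs and unigram_probs tables with w1/w2/w3/word and prob columns; pure maximum likelihood with backoff, no smoothing yet:

    library(dplyr)

    predict_next <- function(w1, w2, k = 3) {
      # 1. try the trigram context (w1, w2)
      cand <- trigram_probs %>%
        filter(.data$w1 == .env$w1, .data$w2 == .env$w2) %>%
        transmute(word = .data$w3, prob)
      # 2. back off to the bigram context (w2)
      if (nrow(cand) == 0) {
        cand <- bigram_probs %>%
          filter(.data$w1 == .env$w2) %>%
          transmute(word = .data$w2, prob)
      }
      # 3. back off to unigrams
      if (nrow(cand) == 0) {
        cand <- unigram_probs %>% select(word, prob)
      }
      cand %>% arrange(desc(prob)) %>% slice_head(n = k)
    }

For example, predict_next("one", "of") should put the at the top of the candidate list, matching the trigram table above.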

9.2 Smoothing

To make the model more robust and help it generalize to new, unseen data, a smoothing technique will be applied. Given the heavy-tailed distribution and the strong sparsity of higher-order n-grams, and following the Jurafsky & Manning slide decks, I will apply:

  • Absolute discounting with backoff or interpolation (a bigram sketch follows), and/or
  • Interpolated Kneser–Ney smoothing for bigrams/trigrams.
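
As a sketch of the first option, interpolated absolute discounting for bigrams under the same assumed tables, with discount d = 0.75 (a common default); Kneser–Ney would replace the plain unigram probability with a continuation probability:

    library(dplyr)

    d <- 0.75
    bigram_ad <- bigrams %>%
      group_by(w1) %>%
      mutate(lambda = d * n_distinct(w2) / sum(n),   # mass reserved for the lower-order model
             prob   = pmax(n - d, 0) / sum(n)) %>%   # discounted ML estimate
      ungroup() %>%
      left_join(unigram_probs, by = c("w2" = "word"), suffix = c("", "_uni")) %>%
      mutate(prob = prob + lambda * prob_uni)        # interpolate with the unigram model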

9.3 Out of scope

  • Sentence boundary tokens <s> and </s> are not yet used, which means n-grams can span sentence boundaries
  • No explicit removal of foreign words; they are handled via <UNK> and frequency pruning.

10 Next steps

Based on the results of the EDA, the next steps are:

  1. Finalize and save the cleaned n-gram frequency tables for use in the app
  2. Implement a prediction API on top of the n-gram tables, e.g.:
    • predict_next(w1, w2, k = 3) — return top-k candidates and their probabilities.
  3. Add smoothing
    • Implement absolute-discount backoff / interpolated Kneser–Ney for bigrams and trigrams.
  4. Evaluate
  5. Create a Shiny application as a user interface for the model