1 Overview

This is the milestone report for the Data Science Capstone project on building a predictive text model. The report describes the results of the exploratory analysis of the training data set and the design decisions that will carry over into the predictive model.

2 Data description and preprocessing

2.1 Corpus overview

The corpus consists of three large English text sources originating from blogs, news, and Twitter.

Corpus composition by source
source n_docs n_tokens vocab_size
blogs 897,580 36,333,110 243,725
news 1,008,998 33,077,470 202,592
twitter 2,359,621 28,654,472 288,499

In total:

  • Documents: 4,266,199
  • Tokens: 98,065,052
  • Unique word types (after filtering): 507,824

Twitter has the largest vocabulary despite having fewer tokens than blogs or news, which is expected given its noisier, more “creative” language.

2.2 Preprocessing and cleanup

The final goal is a predictive model that suggests 3 candidate words for a provided n-gram. Based on this goal and on the results of the exploratory analysis, I made a few data preprocessing decisions that will be applied throughout the project (a minimal sketch of the cleaning pipeline follows the list):

  • Lowercasing: all tokens are converted to lowercase
  • Alphabetic-only filter: retain only tokens matching ^[a-z]+$ regex. This removes numbers, punctuation, emojis, and other garbage.
  • No stemming / lemmatization: the final model should predict an exact form, not lemmas, so I keep the full word forms to preserve realistic n-grams.
  • Stopwords
    • for language modeling and prediction, stopwords will be kept. Function words (the, of, to, and, …) dominate n-gram patterns and must be part of the predictions.
    • for semantic Exploratory Data Analysis (EDA), I removed stopwords to highlight content.
  • Sampling
    • Global statistics are computed on the full corpus
    • heavier EDA (such as n-gram analysis) is done on a stratified sample of up to 100k documents per source
  • Profanity: at this stage, profanity is not explicitly removed
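
The sketch below assumes the raw data has been loaded into a data frame corpus_df with columns source and text (all names are illustrative, not the exact project code):

    # Tokenize, lowercase and keep alphabetic-only tokens
    library(dplyr)
    library(tidytext)

    tokens <- corpus_df %>%
      unnest_tokens(word, text) %>%          # lowercases by default
      filter(grepl("^[a-z]+$", word))        # alphabetic-only filter

    # Stopwords are dropped only for the semantic EDA view, not for the LM counts
    tokens_eda <- tokens %>%
      anti_join(get_stopwords(), by = "word")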

3 Full-corpus unigram distribution

3.1 Head of the distribution

The top 10 most frequent words in the full corpus are:

Top 10 most frequent words (full corpus, with stopwords)
word n rank
the 4,771,927 1
to 2,764,230 2
and 2,422,450 3
a 2,389,755 4
of 2,010,936 5
in 1,657,973 6
i 1,657,335 7
for 1,103,087 8
is 1,075,727 9
that 1,042,522 10

These are all function words, with the alone occurring 4,771,927 times.

3.2 Coverage plot

3.3 Zipfian coverage of tokens

The cumulative coverage of tokens by vocabulary size looks like a classic Zipfian pattern. Key points:

  • 50% of all tokens are covered by the top 132 word types (last word: being).
  • 90% of all tokens are covered by the top 6,912 word types (last word: snake).

In other words:

  • ~0.03% of the vocabulary covers half the tokens
  • ~1.4% of the vocabulary covers 90% of the tokens

This extreme concentration of probability mass in the head means we can prune the vocabulary and treat the tail as a single unknown token. See the “Vocabulary pruning and <UNK> token” section.
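
The coverage numbers above come from a simple cumulative sum over the frequency-sorted unigram table; a sketch, assuming a data frame unigrams with columns word and n:

    library(dplyr)

    coverage <- unigrams %>%
      arrange(desc(n)) %>%
      mutate(rank     = row_number(),
             cum_frac = cumsum(n) / sum(n))

    # Smallest head of the vocabulary that covers 50% / 90% of all tokens
    coverage %>% filter(cum_frac >= 0.5) %>% slice(1)
    coverage %>% filter(cum_frac >= 0.9) %>% slice(1)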

4 Semantic exploration

For semantic exploration, I use the cleaned sample where stopwords are removed.

4.1 Top content words

The top content words include terms such as time, people, day, love, life, home, week, school, and world, reflecting typical daily-life topics.

4.2 Coverage plot

4.3 Cleaned corpus unigram distribution

For the cleaned sample:

  • 50% of content tokens are covered by the top 1,658 words (last word: legs).
  • 90% of content tokens are covered by the top 18,659 words (last word: ambiguity).

5 Foreign words

5.1 Dictionary-based “non-English” words

I ran hunspell_check on the most frequent words and marked those not recognized by the English dictionary (a sketch of the check follows the table):

Example of frequent words not recognized by hunspell (is_english = FALSE)
word is_english
lol FALSE
friday FALSE
im FALSE
haha FALSE
american FALSE
saturday FALSE
monday FALSE
sunday FALSE
april FALSE
ok FALSE
york FALSE
thursday FALSE
tuesday FALSE
san FALSE
mr FALSE
wednesday FALSE
tv FALSE
obama FALSE
christmas FALSE
facebook FALSE

Many flagged tokens are:

  • informal Internet language (lol, haha, im),
  • day and month names (friday, monday, sunday, april),
  • abbreviations (ok, mr, tv),
  • proper names (obama, york, san, facebook),
which illustrates that dictionary coverage is not a reliable proxy for “foreignness”.
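
The check itself, assuming a frequency-sorted data frame top_words with a word column (illustrative name):

    library(dplyr)
    library(hunspell)

    top_words <- top_words %>%
      mutate(is_english = hunspell_check(word, dict = dictionary("en_US")))

    top_words %>% filter(!is_english)   # frequent words the dictionary does not recognize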

5.2 Approximate language distribution by source

Even though the identification of foreign words clearly contains many false positives, out of curiosity let’s explore the distribution by language using the textcat package:
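
A sketch of how this rough distribution can be obtained, assuming the stratified sample lives in sample_df with columns source and text (textcat is a crude classifier, so the counts are indicative only):

    library(dplyr)
    library(textcat)

    lang_by_source <- sample_df %>%
      mutate(lang = textcat(text)) %>%
      count(source, lang, sort = TRUE)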

5.2.1 Applicability for next steps

Rather than trying to explicitly remove or detect “foreign words”, I’m going to adopt the following:

  • Keep all tokens during EDA, including foreign words and creative spellings.
  • Handle rare and foreign words implicitly via vocabulary pruning and <UNK>, described next.

6 Vocabulary pruning and <UNK> token

6.1 Pruning strategy

I prune the unigram vocabulary by frequency using a cut-off threshold:

Vocabulary pruning statistics for the future language model
min_count_unigram vocab_size_lm total_types_full total_tokens_full
2 247,760 507,824 98,065,052

  • Cut-off threshold: min_count_unigram = 2
  • Full-corpus distinct types: 507,824
  • LM vocabulary after pruning: 247,760

All word types with frequency < min_count_unigram are mapped to <UNK>.
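
A minimal sketch of the pruning and <UNK> mapping, assuming the unigrams and tokens data frames from the earlier sketches:

    library(dplyr)

    min_count_unigram <- 2
    lm_vocab <- unigrams %>%
      filter(n >= min_count_unigram) %>%
      pull(word)

    # Everything outside the pruned vocabulary collapses into <UNK>
    tokens_lm <- tokens %>%
      mutate(word = if_else(word %in% lm_vocab, word, "<UNK>"))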

6.2 Role of <UNK>

<UNK> is the unknown word token and represents:

  • very rare words (frequency below threshold)
  • foreign words
  • heavy typos and unusual spellings
  • rare proper names

Collapsing the long tail into <UNK> dramatically reduces sparsity, and it is supported by the Zipfian shape of the distribution (see the coverage section above).


Going forward:

  • During training, all tokens with count < min_count_unigram are mapped to <UNK> before counting n-grams.
  • At prediction time, any word not in the LM vocabulary is also mapped to <UNK>.

7 N-gram exploratory analysis

To understand semantic collocations, I inspect n-grams on the cleaned sample (stopwords removed).
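
A sketch of the n-gram extraction for this EDA view, using the standard tidytext idiom and assuming sample_df holds the stratified sample with a text column (names are illustrative):

    library(dplyr)
    library(tidytext)
    library(tidyr)

    bigrams_eda <- sample_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      separate(bigram, c("w1", "w2"), sep = " ") %>%
      filter(!w1 %in% stop_words$word,         # drop bigrams containing stopwords
             !w2 %in% stop_words$word) %>%
      count(w1, w2, sort = TRUE)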

7.1 Top bigrams

7.2 Top trigrams

8 N-gram LMs

For the actual language model, I work with bigram and trigram tables in which function words (stopwords) are kept:

8.1 Bigram LM probabilities

Top 20 bigrams (LM view, with function words)
w1 w2 n prob
of the 41445 0.2281598
in the 38197 0.2709718
to the 19768 0.0837340
on the 17671 0.2644132
for the 16587 0.1898804
to be 14188 0.0600980
at the 12504 0.2952538
and the 12380 0.0565253
in a 11079 0.0785951
with the 9934 0.1608901
is a 9022 0.1008349
it was 8905 0.1252285
from the 8371 0.2568737
for a 8076 0.0924503
i was 7720 0.0631627
with a 7719 0.1250162
of a 7679 0.0422738
and i 7675 0.0350429
it is 7599 0.1068626
i have 7374 0.0603318

The model captures strong function-word patterns such as of the, in the, at the, to be. For example:

  • P(the | of) ≈ 0.23
  • P(the | in) ≈ 0.27
  • P(be | to) ≈ 0.06

These align well with my prior expectations about English word sequences.
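
The prob column above is the maximum-likelihood estimate P(w2 | w1) = n(w1, w2) / n(w1); a sketch, assuming a bigram count table bigrams with columns w1, w2, n:

    library(dplyr)

    bigram_probs <- bigrams %>%
      group_by(w1) %>%
      mutate(prob = n / sum(n)) %>%   # conditional probability given the first word
      ungroup() %>%
      arrange(desc(n))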

8.2 Trigram LM probabilities

Top 20 trigrams (LM view, with function words)
w1 w2 w3 n prob
one of the 3333 0.5104135
a lot of 2641 0.6587678
to be a 1578 0.1134354
as well as 1411 0.6102941
the end of 1386 0.6777506
going to be 1371 0.2209865
out of the 1358 0.2777096
it was a 1316 0.1515431
some of the 1300 0.4978935
be able to 1233 0.9911576
i want to 1223 0.6136478
part of the 1204 0.4091064
this is a 982 0.1841365
thanks for the 980 0.4885344
the rest of 978 0.8211587
a couple of 964 0.6634549
the first time 962 0.1793772
i have a 919 0.1266538
i have to 864 0.1190739
the fact that 864 0.8396501

Examples of very strong trigram patterns:

  • one of the → P(the | one of) ≈ 0.51
  • a lot of → P(of | a lot) ≈ 0.66
  • be able to → P(to | be able) ≈ 0.99
  • the fact that → P(that | the fact) ≈ 0.84
  • the end of, the rest of, i want to similarly have high conditional probabilities.

These observations confirm that backoff to bigrams/unigrams will be needed primarily for rare contexts, not common ones.

9 Model design decisions

9.1 Model order and backoff

Based on the analysis:

  • I will use a trigram language model P(w_n | w_{n-2}, w_{n-1}) as the main model.
  • I will adopt a backoff hierarchy:
    1. Trigram P(w_n | w_{n-2}, w_{n-1}) when context exists and has mass.
    2. Backoff to bigram P(w_n | w_{n-1}) when trigram counts are insufficient.
    3. Backoff to unigram P(w_n) in unseen or very sparse contexts.
    4. If w_n is out-of-vocabulary, the model predicts <UNK>

This ensures the model always returns a probability distribution over the vocabulary, even for previously unseen contexts; a sketch of the lookup follows.
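
The sketch below is a rough version of the planned predict_next interface, assuming trigram_probs, bigram_probs and unigram_probs tables with w1/w2/w3/word and prob columns; pure maximum likelihood with backoff, no smoothing yet:

    library(dplyr)

    predict_next <- function(w1, w2, k = 3) {
      # 1. try the trigram context (w1, w2)
      cand <- trigram_probs %>%
        filter(.data$w1 == .env$w1, .data$w2 == .env$w2) %>%
        transmute(word = .data$w3, prob)
      # 2. back off to the bigram context (w2)
      if (nrow(cand) == 0) {
        cand <- bigram_probs %>%
          filter(.data$w1 == .env$w2) %>%
          transmute(word = .data$w2, prob)
      }
      # 3. back off to unigrams
      if (nrow(cand) == 0) {
        cand <- unigram_probs %>% select(word, prob)
      }
      cand %>% arrange(desc(prob)) %>% slice_head(n = k)
    }

For example, predict_next("one", "of") should put the at the top of the candidate list, matching the trigram table above.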

9.2 Smoothing

To make the model more robust and help it generalize to new, unseen data, a smoothing technique will be applied. Given the heavy-tailed distribution and the strong sparsity of higher-order n-grams, and following the Jurafsky & Manning slide decks, I will apply:

  • Absolute discounting with backoff or interpolation (a bigram sketch follows), and/or
  • Interpolated Kneser–Ney smoothing for bigrams/trigrams.
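
As a sketch of the first option, interpolated absolute discounting for bigrams under the same assumed tables, with discount d = 0.75 (a common default); Kneser–Ney would replace the plain unigram probability with a continuation probability:

    library(dplyr)

    d <- 0.75
    bigram_ad <- bigrams %>%
      group_by(w1) %>%
      mutate(lambda = d * n_distinct(w2) / sum(n),   # mass reserved for the lower-order model
             prob   = pmax(n - d, 0) / sum(n)) %>%   # discounted ML estimate
      ungroup() %>%
      left_join(unigram_probs, by = c("w2" = "word"), suffix = c("", "_uni")) %>%
      mutate(prob = prob + lambda * prob_uni)        # interpolate with the unigram model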

9.3 Out of scope

  • Sentence boundary tokens <s> and </s> are not yet used, which means n-grams can span sentence boundaries
  • No explicit removal of foreign words; they are handled via <UNK> and frequency pruning.

10 Next steps

Based on the results of the EDA, the next steps are:

  1. Finalize and save the cleaned n-gram frequency tables for use in the app
  2. Implement a prediction API on top of the n-gram tables, e.g.:
    • predict_next(w1, w2, k = 3) — return top-k candidates and their probabilities.
  3. Add smoothing
    • Implement absolute-discount backoff / interpolated Kneser–Ney for bigrams and trigrams.
  4. Evaluate
  5. Create a Shiny application as a user interface for the model