This is the milestone report for the Data Science Capstone project on introducing a predictive text model. The report describes the results of exploratory analysis of the training data set and the decisions to be applied to the predictive model.
The corpus consists of three large English text sources, originating from blogs, news, and Twitter:
| source | n_docs | n_tokens | vocab_size |
|---|---|---|---|
| blogs | 897,580 | 36,333,110 | 243,725 |
| news | 1,008,998 | 33,077,470 | 202,592 |
| twitter | 2,359,621 | 28,654,472 | 288,499 |
In total, the corpus contains about 4.27 million documents, roughly 98 million tokens, and 507,824 distinct word types.
Twitter has the largest vocabulary despite fewer tokens than blogs/news, which can be expected given its noisier, more “creative” language.
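For reference, here is a minimal sketch of how such a summary can be computed with the tidytext / dplyr toolchain; the file names and the `data/` path are illustrative, not the exact paths used in the project:

```r
library(dplyr)
library(tidytext)

# Illustrative file names; adjust to the actual location of the corpus.
files <- c(blogs   = "data/en_US.blogs.txt",
           news    = "data/en_US.news.txt",
           twitter = "data/en_US.twitter.txt")

corpus <- bind_rows(lapply(names(files), function(src) {
  tibble(source = src,
         text   = readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE))
})) %>%
  mutate(doc_id = row_number())

# One row per token (lower-cased by default).
tokens <- corpus %>%
  unnest_tokens(word, text)

# Per-source summary: documents, tokens and vocabulary size.
tokens %>%
  group_by(source) %>%
  summarise(n_docs     = n_distinct(doc_id),
            n_tokens   = n(),
            vocab_size = n_distinct(word))
```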
The final goal is a predictive model that suggests 3 candidate words for a provided n-gram. Based on this goal and on the results of the exploratory analysis, I made a few data-preprocessing decisions that will be applied throughout the rest of the project:
- Only tokens matching the `^[a-z]+$` regex are kept (see the sketch below). This removes numbers, punctuation, emojis, and other garbage.
- Stopwords are kept for the language model: frequent function words (*the*, *of*, *to*, *and*, …) dominate n-gram patterns and must be part of the predictions.
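A minimal sketch of this filtering and of the word-frequency counting behind the tables below, reusing the `tokens` data frame from the sketch above:

```r
library(dplyr)
library(stringr)

# Keep only purely alphabetic, lower-case tokens; numbers, punctuation,
# emojis and other garbage are dropped.
tokens_clean <- tokens %>%
  filter(str_detect(word, "^[a-z]+$"))

# Full-corpus word frequencies (stopwords kept).
word_freq <- tokens_clean %>%
  count(word, sort = TRUE)

head(word_freq, 10)
```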
The top 10 most frequent words in the full corpus are:
| word | n | rank |
|---|---|---|
| the | 4,771,927 | 1 |
| to | 2,764,230 | 2 |
| and | 2,422,450 | 3 |
| a | 2,389,755 | 4 |
| of | 2,010,936 | 5 |
| in | 1,657,973 | 6 |
| i | 1,657,335 | 7 |
| for | 1,103,087 | 8 |
| is | 1,075,727 | 9 |
| that | 1,042,522 | 10 |
These are all function words, with *the* alone occurring 4,771,927 times.
The cumulative coverage of tokens by vocabulary size follows a classic Zipfian pattern: a small head of very frequent word types accounts for the bulk of all tokens, while the long tail of rare types contributes very little. This extreme concentration of probability mass in the head means we can prune the vocabulary and treat the tail the same way as an unknown token; see the "Vocabulary pruning and `<UNK>` token" section.
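A sketch of the coverage computation from the `word_freq` table in the sketch above:

```r
library(dplyr)

# word_freq is already sorted by descending frequency.
coverage <- word_freq %>%
  mutate(rank         = row_number(),
         cum_tokens   = cumsum(n),
         cum_coverage = cum_tokens / sum(n))

# Smallest vocabulary size that covers e.g. 90% of all tokens:
min(coverage$rank[coverage$cum_coverage >= 0.9])
```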
For semantic exploration, I use the cleaned sample where stopwords are removed.
The top content words include terms such as *time*, *people*, *day*, *love*, *life*, *home*, *week*, *school*, and *world*, reflecting typical daily-life topics.
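The stopword removal itself can be done with a simple anti-join against tidytext's `stop_words` list; a sketch reusing `tokens_clean` from above:

```r
library(dplyr)
library(tidytext)

# Content-word frequencies: same cleaned tokens, stopwords removed.
content_freq <- tokens_clean %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(content_freq, 20)
```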
For the cleaned sample, I used `hunspell_check()` on the top frequent words and marked those not recognized by an English dictionary:
| word | is_english |
|---|---|
| lol | FALSE |
| friday | FALSE |
| im | FALSE |
| haha | FALSE |
| american | FALSE |
| saturday | FALSE |
| monday | FALSE |
| sunday | FALSE |
| april | FALSE |
| ok | FALSE |
| york | FALSE |
| thursday | FALSE |
| tuesday | FALSE |
| san | FALSE |
| mr | FALSE |
| wednesday | FALSE |
| tv | FALSE |
| obama | FALSE |
| christmas | FALSE |
Many flagged tokens are:

- internet slang and abbreviations (*lol*, *haha*, *im*),
- lowercased day and month names (*friday*, *monday*, *sunday*, *april*),
- common informal words (*ok*),
- proper names (*obama*),

which illustrates that dictionary coverage is not a reliable proxy for "foreignness".
Even though this identification of foreign words obviously contains a lot of false positives, out of curiosity I also explored the distribution of detected languages using the textcat package.
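A sketch of both checks on the most frequent content words, reusing `content_freq` from the sketch above; the 200-word cut-off is arbitrary:

```r
library(dplyr)
library(hunspell)
library(textcat)

top_words <- content_freq %>%
  slice_head(n = 200) %>%
  mutate(is_english = hunspell_check(word))   # TRUE if in the en_US dictionary

# Rough language guess for the flagged words (very noisy on single words).
top_words %>%
  filter(!is_english) %>%
  mutate(lang = textcat(word)) %>%
  count(lang, sort = TRUE)
```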
Rather than trying to explicitly remove or detect "foreign words", I adopt a simpler approach: rare tokens, which include most misspellings and genuinely foreign words, fall below the frequency cut-off and are mapped to `<UNK>`, as described next.

### Vocabulary pruning and `<UNK>` token

I prune the unigram vocabulary by frequency using a cut-off threshold:
| min_count_unigram | vocab_size_lm | total_types_full | total_tokens_full |
|---|---|---|---|
| 2 | 247,760 | 507,824 | 98,065,052 |
I use `min_count_unigram = 2`: all word types with frequency < `min_count_unigram` are mapped to `<UNK>`.
`<UNK>` is the unknown word token. It represents the rare word types pruned from the vocabulary as well as any out-of-vocabulary word encountered at prediction time.
Collapsing the long tail into `<UNK>` dramatically reduces sparsity and is supported by the Zipfian distribution; refer to the Coverage section.
See also the definition of UNK.
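A sketch of this pruning and `<UNK>` mapping, reusing `word_freq` and `tokens_clean` from the earlier sketches:

```r
library(dplyr)

min_count_unigram <- 2

# Vocabulary kept for the language model.
vocab_lm <- word_freq %>%
  filter(n >= min_count_unigram) %>%
  pull(word)

# Map all remaining rare word types to the <UNK> token.
tokens_lm <- tokens_clean %>%
  mutate(word = if_else(word %in% vocab_lm, word, "<UNK>"))
```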
Going forward:

- All word types with frequency < `min_count_unigram` are mapped to `<UNK>` before counting n-grams.

To understand semantic collocations, I inspect n-grams on the cleaned sample (stopwords removed).
For the actual language model, I work with the full token stream (stopwords kept), the vocabulary pruned with `min_count_unigram`, and rare words mapped to `<UNK>`.

The most frequent bigrams and their conditional probabilities are:

| w1 | w2 | n | prob |
|---|---|---|---|
| of | the | 41445 | 0.2281598 |
| in | the | 38197 | 0.2709718 |
| to | the | 19768 | 0.0837340 |
| on | the | 17671 | 0.2644132 |
| for | the | 16587 | 0.1898804 |
| to | be | 14188 | 0.0600980 |
| at | the | 12504 | 0.2952538 |
| and | the | 12380 | 0.0565253 |
| in | a | 11079 | 0.0785951 |
| with | the | 9934 | 0.1608901 |
| is | a | 9022 | 0.1008349 |
| it | was | 8905 | 0.1252285 |
| from | the | 8371 | 0.2568737 |
| for | a | 8076 | 0.0924503 |
| i | was | 7720 | 0.0631627 |
| with | a | 7719 | 0.1250162 |
| of | a | 7679 | 0.0422738 |
| and | i | 7675 | 0.0350429 |
| it | is | 7599 | 0.1068626 |
| i | have | 7374 | 0.0603318 |
The model captures strong function-word patterns such as *of the*, *in the*, *at the*, and *to be*. For example:

- P(the | of) ≈ 0.23
- P(the | in) ≈ 0.27
- P(be | to) ≈ 0.06

These align well with my prior expectations about English word sequences.
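The `prob` column is the maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1). A sketch of this computation on the pruned token stream `tokens_lm` from the earlier sketch, assuming tokens are in document order:

```r
library(dplyr)

bigram_probs <- tokens_lm %>%
  group_by(doc_id) %>%                      # do not pair tokens across documents
  transmute(w1 = word, w2 = lead(word)) %>%
  ungroup() %>%
  filter(!is.na(w2)) %>%
  count(w1, w2, name = "n") %>%
  group_by(w1) %>%
  mutate(prob = n / sum(n)) %>%             # maximum-likelihood P(w2 | w1)
  ungroup() %>%
  arrange(desc(n))
```

The trigram table below is built the same way, with a second `lead()` column and P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).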
| w1 | w2 | w3 | n | prob |
|---|---|---|---|---|
| one | of | the | 3333 | 0.5104135 |
| a | lot | of | 2641 | 0.6587678 |
| to | be | a | 1578 | 0.1134354 |
| as | well | as | 1411 | 0.6102941 |
| the | end | of | 1386 | 0.6777506 |
| going | to | be | 1371 | 0.2209865 |
| out | of | the | 1358 | 0.2777096 |
| it | was | a | 1316 | 0.1515431 |
| some | of | the | 1300 | 0.4978935 |
| be | able | to | 1233 | 0.9911576 |
| i | want | to | 1223 | 0.6136478 |
| part | of | the | 1204 | 0.4091064 |
| this | is | a | 982 | 0.1841365 |
| thanks | for | the | 980 | 0.4885344 |
| the | rest | of | 978 | 0.8211587 |
| a | couple | of | 964 | 0.6634549 |
| the | first | time | 962 | 0.1793772 |
| i | have | a | 919 | 0.1266538 |
| i | have | to | 864 | 0.1190739 |
| the | fact | that | 864 | 0.8396501 |
Examples of very strong trigram patterns:

- *one of the* → P(the | one of) ≈ 0.51
- *a lot of* → P(of | a lot) ≈ 0.66
- *be able to* → P(to | be able) ≈ 0.99
- *the fact that* → P(that | the fact) ≈ 0.84

*the end of*, *the rest of*, and *i want to* similarly have high conditional probabilities. These observations confirm that backoff to bigrams/unigrams will be needed primarily for rare contexts, not common ones.
Based on the analysis, the predictive model will be an n-gram model with backoff:

- The trigram probability P(w_n | w_{n-2}, w_{n-1}) is the main model and is used when the trigram context exists and has mass.
- The model backs off to the bigram probability P(w_n | w_{n-1}) when trigram counts are insufficient.
- It backs off further to the unigram probability P(w_n) in unseen or very sparse contexts.
- If w_n is out-of-vocabulary, the model predicts `<UNK>`.

This ensures the model always returns a probability distribution over the vocabulary, even for previously unseen contexts.
To make the model more robust and help it generalize to new, unseen data, a smoothing technique will be applied. Given the heavy-tailed distribution and the strong sparsity of higher-order n-grams, I will follow the smoothing approaches discussed in the Jurafsky & Manning slide decks.
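As one illustration only (not necessarily the method that will be chosen), simple add-k smoothing of the bigram estimates would look like this; `k` is a hypothetical tuning constant:

```r
library(dplyr)

k <- 0.05                      # illustrative constant, to be tuned
V <- length(vocab_lm) + 1      # vocabulary size including <UNK>

bigram_addk <- bigram_probs %>%
  group_by(w1) %>%
  mutate(prob_addk = (n + k) / (sum(n) + k * V)) %>%
  ungroup()
# Unseen bigrams would implicitly receive k / (count(w1) + k * V) at lookup time.
```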
Known limitations at this stage:

- Sentence boundary tokens `<s>` and `</s>` are not yet used, meaning n-grams can be composed of tokens from different sentences.
- Foreign words and misspellings are not detected explicitly; they are handled only through `<UNK>` and frequency pruning.

Based on the results of the EDA, the next steps will be:
- Implement `predict_next(w1, w2, k = 3)` — return the top-k candidates and their probabilities (a rough sketch follows below).
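A rough sketch of how `predict_next()` could implement the backoff chain, assuming probability tables `trigram_probs` (columns w1, w2, w3, prob), `bigram_probs` (as in the earlier sketch) and `unigram_probs` (columns word, prob); all names except `predict_next()` itself are illustrative:

```r
library(dplyr)

predict_next <- function(w1, w2, k = 3) {
  ctx1 <- w1
  ctx2 <- w2   # copies avoid clashing with the w1/w2 column names below

  # 1. Trigram candidates for the context (w1, w2).
  cand <- trigram_probs %>%
    filter(w1 == ctx1, w2 == ctx2) %>%
    transmute(word = w3, prob)

  # 2. Back off to bigrams conditioned on the last context word only.
  if (nrow(cand) == 0) {
    cand <- bigram_probs %>%
      filter(w1 == ctx2) %>%
      transmute(word = w2, prob)
  }

  # 3. Back off to unigrams for completely unseen contexts.
  if (nrow(cand) == 0) {
    cand <- unigram_probs %>%
      select(word, prob)
  }

  cand %>%
    arrange(desc(prob)) %>%
    slice_head(n = k)
}

predict_next("one", "of", k = 3)
```

On the trigram table above, the context ("one", "of") should return *the* as the top candidate with probability ≈ 0.51.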