The objective of this capstone is to build a next-word prediction
model similar to the SwiftKey keyboard.
This report summarizes the work completed for Tasks 1–3 and is written
to be understandable by a non-technical manager.
The completed tasks include:

- Downloading the data and drawing a reproducible sample for analysis
- Exploratory analysis of word frequencies and an estimate of foreign-language content
- Generating n-grams and building a basic n-gram language model

The dataset used is the English HC Corpora dataset provided by Coursera. It contains three large text files:

- `en_US.blogs.txt` — text from blog posts
- `en_US.news.txt` — text from news articles
- `en_US.twitter.txt` — text from Twitter messages
Due to the large size of the dataset, a reproducible 0.5% random sample was used to make analysis computationally feasible.
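As an illustration of how such a sample can be drawn reproducibly, the sketch below keeps roughly 0.5% of the lines from each file using a fixed random seed. The seed value, output file names, and error handling are assumptions made for this example, not the exact settings used in the analysis.

```python
import random

# The three HC Corpora files (paths assumed relative to the working directory).
FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

SAMPLE_RATE = 0.005  # keep roughly 0.5% of lines
random.seed(42)      # fixed seed so the same sample is drawn on every run (assumed value)

for path in FILES:
    sample_path = path.replace(".txt", ".sample.txt")
    with open(path, encoding="utf-8", errors="ignore") as src, \
         open(sample_path, "w", encoding="utf-8") as dst:
        for line in src:
            # each line is kept independently with probability SAMPLE_RATE
            if random.random() < SAMPLE_RATE:
                dst.write(line)
```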
The five most frequent tokens in each sampled file are shown below.

**Blogs**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 9278 |
| 2 | to | 5347 |
| 3 | and | 5263 |
| 4 | a | 4371 |
| 5 | of | 4326 |
**News**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 9778 |
| 2 | to | 4600 |
| 3 | a | 4490 |
| 4 | and | 4390 |
| 5 | of | 3940 |
**Twitter**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 4546 |
| 2 | to | 3902 |
| 3 | i | 3619 |
| 4 | a | 2990 |
| 5 | you | 2594 |
These results show a Zipf-like distribution where a small number of words account for a large portion of usage.
To estimate the presence of foreign-language text, the proportion of tokens containing non-ASCII characters was calculated.
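A minimal sketch of this check, assuming simple whitespace tokenization, is shown below; the helper name `foreign_word_ratio` is illustrative, but the rule matches the one described above: a token counts as "foreign" if it contains at least one non-ASCII character.

```python
def foreign_word_ratio(tokens):
    """Fraction of tokens containing at least one non-ASCII character."""
    if not tokens:
        return 0.0
    non_ascii = sum(1 for tok in tokens if any(ord(ch) > 127 for ch in tok))
    return non_ascii / len(tokens)

# Tiny usage example with a hand-made token list.
sample_tokens = ["the", "café", "to", "naïve", "and"]
print(f"{foreign_word_ratio(sample_tokens):.2%}")  # -> 40.00%
```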
| Dataset | Estimated Foreign Word Ratio |
|---|---|
| Blogs | 1.10% |
| News | 1.77% |
| Twitter | 3.18% |
The Twitter sample contains noisier and more mixed-language text than the blog and news samples.
To capture relationships between adjacent words, n-grams (contiguous sequences of words, such as word pairs and triples) were generated from the sample.
These recurring word patterns are what the model uses to predict a likely next word; a small sketch of this step follows.
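The sketch below shows one common way to generate and count n-grams. It is a minimal Python example under assumed whitespace tokenization, not the exact code used for the analysis.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))   # word pairs
trigram_counts = Counter(ngrams(tokens, 3))  # word triples
print(bigram_counts.most_common(2))
# [(('the', 'cat'), 1), (('cat', 'sat'), 1)]
```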
A compact n-gram language model was constructed from the cleaned and tokenized data. The model stores each observed word context together with the words that follow it and their frequencies, so the most frequent continuation can be returned as the predicted next word.
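As a rough illustration of how such a model can be stored and queried, the sketch below maps each one-word context to a frequency count of the words that follow it. The function names and the bigram setting are assumptions for the example, not the final model's design.

```python
from collections import Counter, defaultdict

def build_model(sentences, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            model[context][next_word] += 1
    return model

def predict(model, context, k=3):
    """Return up to k of the most frequent words seen after the given context."""
    return [word for word, _ in model[tuple(context)].most_common(k)]

model = build_model(["the cat sat on the mat", "the cat ran"], n=2)
print(predict(model, ["the"]))  # ['cat', 'mat']
```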
This milestone confirms readiness to proceed toward building the final predictive text application.