The objective of this project is to build a predictive text model that suggests the next word as a user types, similar to the behavior of smartphone keyboards. This milestone report demonstrates that the dataset has been downloaded, cleaned, and analyzed, and that a plan for model development is underway.
The dataset was downloaded from the Coursera-provided HC Corpora source. The English-language corpus includes three text files:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

All data was read using readLines() in R with UTF-8 encoding.
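A minimal sketch of the loading step, assuming the files sit in a local `final/en_US/` directory (the directory name is an assumption; the function and encoding are as stated above):

```r
# Read the three corpus files with UTF-8 encoding.
# skipNul = TRUE avoids warnings from embedded NUL characters.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```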
Basic statistics computed from the three files:
| Dataset | Lines (approx) | Words (approx) |
|---|---|---|
| Blogs | 899,288 | 37 million |
| News | 1,010,242 | 34 million |
| Twitter | 2,360,148 | 30 million |
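A minimal sketch of how such counts can be computed from the objects read above (the word count splits on whitespace, which is only an approximation):

```r
# Approximate line and word counts per file.
count_words <- function(x) sum(lengths(strsplit(x, "\\s+")))

stats <- data.frame(
  dataset = c("Blogs", "News", "Twitter"),
  lines   = c(length(blogs), length(news), length(twitter)),
  words   = c(count_words(blogs), count_words(news), count_words(twitter))
)
stats
```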
Key observations:

- The Twitter file contains the most lines but the fewest words, reflecting its short-message format.
- The blogs file contains the fewest lines but the most words, so its entries are the longest on average.
Cleaning steps included:
This process ensures a clean and normalized text representation.
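Since the individual steps are not itemized here, the sketch below shows one typical normalization pipeline (lowercasing, removing URLs, numbers, punctuation, and extra whitespace); the exact steps used in the project may differ:

```r
# Illustrative text normalization (assumed typical steps).
clean_text <- function(x) {
  x <- tolower(x)                            # lowercase everything
  x <- gsub("http\\S+|www\\.\\S+", " ", x)   # remove URLs
  x <- gsub("[0-9]+", " ", x)                # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)          # remove punctuation
  x <- gsub("\\s+", " ", x)                  # collapse whitespace
  trimws(x)
}

blogs_clean <- clean_text(blogs)
```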
The most frequent words across the datasets are:
“the”, “to”, “and”, “a”, “of”, “in”, “i”, “that”, “is”, “for”
These results were visualized using a barplot of the top 20 tokens.
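A minimal sketch of how the top-20 barplot can be produced in base R, using the cleaned text from the sketch above (the project may instead rely on a dedicated text-mining package):

```r
# Tokenize on whitespace, tabulate word frequencies, plot the top 20.
tokens <- unlist(strsplit(blogs_clean, "\\s+"))
tokens <- tokens[tokens != ""]
freq   <- sort(table(tokens), decreasing = TRUE)

barplot(head(freq, 20), las = 2, cex.names = 0.8,
        main = "Top 20 tokens", ylab = "Frequency")
```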
We calculated n-gram frequencies from the cleaned text, including bigram and trigram counts.
Example frequent bigrams:
Example frequent trigrams:
These results give insight into English phrase structure and form the foundation for the n-gram predictive model.
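The following sketch shows one straightforward way to derive bigram and trigram counts from the token stream above; it simply pastes consecutive tokens together and is intended only as an illustration (it ignores sentence boundaries and is slow on the full corpus):

```r
# Build n-grams by joining n consecutive tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  idx <- seq_len(length(tokens) - n + 1)
  sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

bigram_freq  <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)

head(bigram_freq, 10)   # most frequent bigrams
head(trigram_freq, 10)  # most frequent trigrams
```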
A small share of unique words accounts for a large majority of all word occurrences in the corpora. This suggests that a compressed prediction model can remain highly effective without storing the entire vocabulary.
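One way to quantify this, sketched here with the `freq` table built earlier, is to compute how many of the most frequent words are needed to cover a given share of all word occurrences:

```r
# Cumulative coverage: how many top-ranked words cover 50% / 90% of tokens?
coverage     <- cumsum(freq) / sum(freq)
words_for_50 <- which(coverage >= 0.5)[1]
words_for_90 <- which(coverage >= 0.9)[1]
c(cover_50 = words_for_50, cover_90 = words_for_90)
```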
The predictive model will use n-gram frequency tables (unigrams, bigrams, and trigrams) built from the cleaned corpora.
For unseen n-grams, a backoff strategy will be used: when a higher-order n-gram has not been observed, the model falls back to the next lower-order n-gram.
To avoid assigning zero probability to unseen word sequences, we will apply probability smoothing.
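As an illustrative sketch only (the exact backoff variant and smoothing method are still to be decided), a simple backoff lookup over the n-gram tables built earlier might look like this; the function name and fallback behaviour are assumptions, not the project's final design:

```r
# Predict the next word: try the trigram table first, then back off
# to the bigram table, then to the most frequent unigrams.
predict_next <- function(last_two_words, top_n = 3) {
  tri_prefix <- paste0("^", paste(last_two_words, collapse = " "), " ")
  hits <- grep(tri_prefix, names(trigram_freq), value = TRUE)
  if (length(hits) == 0) {
    bi_prefix <- paste0("^", last_two_words[2], " ")
    hits <- grep(bi_prefix, names(bigram_freq), value = TRUE)
  }
  if (length(hits) == 0) {
    return(names(head(freq, top_n)))      # unigram fallback
  }
  sub(".* ", "", head(hits, top_n))       # last word of the best matches
}

predict_next(c("one", "of"))
```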
To ensure fast execution in a Shiny app, we will:
- object.size() to measure the memory footprint of the n-gram tables
- Rprof() to profile run time
- gc() to monitor and trigger garbage collection

The goal is for the model to run comfortably even on resource-constrained devices.
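A minimal sketch of how these tools can be applied to the objects built earlier:

```r
# Memory footprint of the n-gram tables.
print(object.size(bigram_freq),  units = "MB")
print(object.size(trigram_freq), units = "MB")

# Profile an expensive step, e.g. rebuilding the trigram table.
Rprof("ngram_profile.out")
trigram_freq <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)
Rprof(NULL)
summaryRprof("ngram_profile.out")$by.self

# Reclaim memory after removing large intermediate objects.
gc()
```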
The final application will be a Shiny app that suggests the next word as the user types.
This milestone confirms that the data has been downloaded, cleaned, and explored, and that a clear plan for the predictive model and Shiny application is in place.