1. Introduction

The objective of this project is to build a predictive text model that suggests the next word as a user types, similar to the behavior of smartphone keyboards. This milestone demonstrates that the dataset has been successfully downloaded, cleaned, and analyzed, and that a model development plan is underway.

2. Data Acquisition

The dataset was downloaded from the Coursera-provided HC Corpora source. The English-language corpus includes three text files: blogs, news, and Twitter.

All data was read using readLines() in R with UTF-8 encoding.
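As a minimal sketch of this step, the snippet below reads a text file line by line as UTF-8. The helper name read_corpus and the temporary demo file are hypothetical; opening the connection in binary mode and passing skipNul = TRUE are assumptions that are often useful with raw text dumps, not something this report confirms was done.

```r
# Hypothetical helper: read a text file into a character vector of UTF-8 lines.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # assumption: binary mode, so stray control characters do not end the read early
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demo on a small temporary file standing in for one corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line of text", "Second line"), tmp)
sample_lines <- read_corpus(tmp)
length(sample_lines)
```

The same helper could be called once per corpus file, with the real file path substituted for the temporary one.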

3. Data Summary

Basic statistics computed from the three files:

Dataset    Lines (approx.)    Words (approx.)
Blogs      899,288            37 million
News       1,010,242          34 million
Twitter    2,360,148          30 million
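Counts like these can be computed with base R alone. The sketch below is one possible approach, assuming words are separated by whitespace; the helper name summarize_text and the demo vector are hypothetical, and other tokenization rules would give somewhat different word counts.

```r
# Hypothetical helper: line count and approximate word count for a character vector.
summarize_text <- function(lines) {
  # Split each line on runs of whitespace, then drop any empty strings
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  data.frame(lines = length(lines), words = length(tokens))
}

# Toy input standing in for one of the corpus files
demo <- c("the quick brown fox", "jumps over", "  the lazy dog ")
res <- summarize_text(demo)
res
```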

Key observations:

4. Data Cleaning

Cleaning steps included:

This process ensures a clean and normalized text representation.

5. Exploratory Analysis

5.1 Word Frequency Distribution

The most frequent words across the datasets are:

“the”, “to”, “and”, “a”, “of”, “in”, “i”, “that”, “is”, “for”

These results were visualized using a barplot of the top 20 tokens.
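A frequency table and bar plot of this kind can be sketched in base R as follows. The helper name top_tokens, the tokenization rule (lowercase, split on anything that is not a letter or apostrophe), and the demo sentences are assumptions for illustration only.

```r
# Hypothetical helper: the n most frequent tokens in a character vector.
top_tokens <- function(lines, n = 20) {
  # Assumed tokenization: lowercase, split on non-letter, non-apostrophe runs
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  freq <- sort(table(tokens), decreasing = TRUE)
  head(freq, n)
}

# Toy input; a real run would pass the cleaned corpus lines
demo <- c("The cat sat on the mat", "the dog and the cat")
freq <- top_tokens(demo, n = 20)

# Bar plot of the most frequent tokens, with labels rotated for readability
barplot(freq, las = 2, main = "Top tokens", ylab = "Count")
```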

5.2 Bigram & Trigram Analysis

We calculated:

  • most common 2-word sequences (bigrams)
  • most common 3-word sequences (trigrams)

Example frequent bigrams:

  • “in the”, “of the”, “to the”, “on the”

Example frequent trigrams:

  • “one of the”, “a lot of”, “as well as”

These results give insight into English phrase structure and form the foundation for the n-gram predictive model.
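The n-gram counts can be illustrated with a short base R sketch. The helper name count_ngrams and the demo sentences are hypothetical, and the choice to count n-grams within each line only (never across line boundaries) is an assumption rather than a documented detail of this analysis.

```r
# Hypothetical helper: frequency table of n-grams, counted within each line.
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"), function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Toy input standing in for cleaned corpus lines
demo <- c("one of the best", "one of the few", "in the end")
bigrams  <- count_ngrams(demo, n = 2)
trigrams <- count_ngrams(demo, n = 3)
head(bigrams, 3)
head(trigrams, 3)
```

On the full corpus this approach may be slow and memory-hungry, so working from a random sample of lines is a common compromise.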

6. Findings & Insights

This suggests that a compressed prediction model can remain highly effective without storing the entire vocabulary.
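One way to quantify this is vocabulary coverage: the number of distinct words needed to account for a given share of all word occurrences. The sketch below is illustrative only; the helper name words_for_coverage and the toy frequencies are hypothetical and are not results from the corpus.

```r
# Hypothetical helper: smallest number of distinct words whose combined
# frequency reaches the requested share of all word occurrences.
words_for_coverage <- function(freq, share = 0.5) {
  freq <- sort(as.numeric(freq), decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  which(cum_share >= share)[1]
}

# Toy frequencies (illustrative values, not measured from the data)
demo_freq <- c(the = 50, to = 20, and = 15, cat = 10, zebra = 5)
cover_50 <- words_for_coverage(demo_freq, 0.5)
cover_90 <- words_for_coverage(demo_freq, 0.9)
c(cover_50, cover_90)
```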

7. Modeling Approach

The predictive model will use:

For unseen n-grams, a backoff model will be used:

To avoid zero probabilities, we will apply probability smoothing such as:
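The backoff idea can be sketched as a lookup that tries the longest available context first and falls back to shorter ones. Everything in the snippet below is an assumption for illustration: the count tables hold toy values, the helper names best_next and predict_next are hypothetical, and no smoothing or discounting is applied (the sketch simply returns the highest-count continuation at the first level that has a match).

```r
# Toy n-gram counts (illustrative values only, not taken from the corpus)
trigram_counts <- c("a lot of" = 4, "one of the" = 5)
bigram_counts  <- c("to be" = 7, "to the" = 3, "of the" = 9)
unigram_counts <- c("the" = 100, "to" = 60, "and" = 50)

# Hypothetical helper: most frequent word following `context` in a count table
best_next <- function(counts, context) {
  prefix <- paste0(context, " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  substring(best, nchar(prefix) + 1)   # keep only the predicted word
}

# Back off from trigrams to bigrams to unigrams until a match is found
predict_next <- function(words) {
  k <- length(words)
  if (k >= 2) {
    p <- best_next(trigram_counts, paste(words[(k - 1):k], collapse = " "))
    if (!is.na(p)) return(p)
  }
  if (k >= 1) {
    p <- best_next(bigram_counts, words[k])
    if (!is.na(p)) return(p)
  }
  names(unigram_counts)[which.max(unigram_counts)]
}

predict_next(c("a", "lot"))    # matched at the trigram level
predict_next(c("went", "to"))  # no trigram match, falls back to bigrams
predict_next("hello")          # no bigram match, falls back to unigrams
```

A production version would likely store the tables in a keyed structure rather than scanning names, and would combine the levels using whichever smoothing scheme is finally chosen.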

8. Memory & Performance Considerations

To ensure fast execution in a Shiny app, we will:

The goal is for the model to run comfortably even on resource-constrained devices.

9. Shiny App Plan

The final application will:

10. Conclusion

This milestone confirms: