The objective of this project is to build a predictive text model that suggests the next word as a user types, similar to the behavior of smartphone keyboards. This milestone report demonstrates that the dataset has been downloaded, cleaned, and analyzed, and that a plan for model development is underway.
The dataset was downloaded from the Coursera-provided HC Corpora source. The English-language corpus includes three text files:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

All data was read using readLines() in R with UTF-8 encoding.
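A minimal sketch of the loading step, assuming the files sit in a local `final/en_US/` directory (the directory name is an assumption; the function and encoding are as stated above):

```r
# Read the three corpus files with UTF-8 encoding.
# skipNul = TRUE avoids warnings from embedded NUL characters.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```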
Basic statistics computed from the three files:
| Dataset | Lines (approx) | Words (approx) |
|---|---|---|
| Blogs | 899,288 | 37 million |
| News | 1,010,242 | 34 million |
| Twitter | 2,360,148 | 30 million |
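A minimal sketch of how such counts can be computed from the objects read above (the word count splits on whitespace, which is only an approximation):

```r
# Approximate line and word counts per file.
count_words <- function(x) sum(lengths(strsplit(x, "\\s+")))

stats <- data.frame(
  dataset = c("Blogs", "News", "Twitter"),
  lines   = c(length(blogs), length(news), length(twitter)),
  words   = c(count_words(blogs), count_words(news), count_words(twitter))
)
stats
```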
Key observations:

- The Twitter file contains the most lines but the fewest words, reflecting its short-message format.
- The blogs file contains the fewest lines but the most words, so its entries are the longest on average.
Cleaning steps included:
This process ensures a clean and normalized text representation.
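Since the individual steps are not itemized here, the sketch below shows one typical normalization pipeline (lowercasing, removing URLs, numbers, punctuation, and extra whitespace); the exact steps used in the project may differ:

```r
# Illustrative text normalization (assumed typical steps).
clean_text <- function(x) {
  x <- tolower(x)                            # lowercase everything
  x <- gsub("http\\S+|www\\.\\S+", " ", x)   # remove URLs
  x <- gsub("[0-9]+", " ", x)                # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)          # remove punctuation
  x <- gsub("\\s+", " ", x)                  # collapse whitespace
  trimws(x)
}

blogs_clean <- clean_text(blogs)
```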
The most frequent words across the datasets are:
“the”, “to”, “and”, “a”, “of”, “in”, “i”, “that”, “is”, “for”
These results were visualized using a barplot of the top 20 tokens.
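A minimal sketch of how the top-20 barplot can be produced in base R, using the cleaned text from the sketch above (the project may instead rely on a dedicated text-mining package):

```r
# Tokenize on whitespace, tabulate word frequencies, plot the top 20.
tokens <- unlist(strsplit(blogs_clean, "\\s+"))
tokens <- tokens[tokens != ""]
freq   <- sort(table(tokens), decreasing = TRUE)

barplot(head(freq, 20), las = 2, cex.names = 0.8,
        main = "Top 20 tokens", ylab = "Frequency")
```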
We calculated n-gram frequencies from the cleaned text, including bigram and trigram counts.
Example frequent bigrams:
Example frequent trigrams:
These results give insight into English phrase structure and form the foundation for the n-gram predictive model.
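The following sketch shows one straightforward way to derive bigram and trigram counts from the token stream above; it simply pastes consecutive tokens together and is intended only as an illustration (it ignores sentence boundaries and is slow on the full corpus):

```r
# Build n-grams by joining n consecutive tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  idx <- seq_len(length(tokens) - n + 1)
  sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

bigram_freq  <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)

head(bigram_freq, 10)   # most frequent bigrams
head(trigram_freq, 10)  # most frequent trigrams
```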
A small share of unique words accounts for a large majority of all word occurrences in the corpora. This suggests that a compressed prediction model can remain highly effective without storing the entire vocabulary.
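One way to quantify this, sketched here with the `freq` table built earlier, is to compute how many of the most frequent words are needed to cover a given share of all word occurrences:

```r
# Cumulative coverage: how many top-ranked words cover 50% / 90% of tokens?
coverage     <- cumsum(freq) / sum(freq)
words_for_50 <- which(coverage >= 0.5)[1]
words_for_90 <- which(coverage >= 0.9)[1]
c(cover_50 = words_for_50, cover_90 = words_for_90)
```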
The predictive model will use n-gram frequency tables (unigrams, bigrams, and trigrams) built from the cleaned corpora.
For unseen n-grams, a backoff strategy will be used: when a higher-order n-gram has not been observed, the model falls back to the next lower-order n-gram.
To avoid assigning zero probability to unseen word sequences, we will apply probability smoothing.
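As an illustrative sketch only (the exact backoff variant and smoothing method are still to be decided), a simple backoff lookup over the n-gram tables built earlier might look like this; the function name and fallback behaviour are assumptions, not the project's final design:

```r
# Predict the next word: try the trigram table first, then back off
# to the bigram table, then to the most frequent unigrams.
predict_next <- function(last_two_words, top_n = 3) {
  tri_prefix <- paste0("^", paste(last_two_words, collapse = " "), " ")
  hits <- grep(tri_prefix, names(trigram_freq), value = TRUE)
  if (length(hits) == 0) {
    bi_prefix <- paste0("^", last_two_words[2], " ")
    hits <- grep(bi_prefix, names(bigram_freq), value = TRUE)
  }
  if (length(hits) == 0) {
    return(names(head(freq, top_n)))      # unigram fallback
  }
  sub(".* ", "", head(hits, top_n))       # last word of the best matches
}

predict_next(c("one", "of"))
```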
To ensure fast execution in a Shiny app, we will:
- object.size() to measure the memory footprint of the n-gram tables
- Rprof() to profile run time
- gc() to monitor and trigger garbage collection

The goal is for the model to run comfortably even on resource-constrained devices.
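A minimal sketch of how these tools can be applied to the objects built earlier:

```r
# Memory footprint of the n-gram tables.
print(object.size(bigram_freq),  units = "MB")
print(object.size(trigram_freq), units = "MB")

# Profile an expensive step, e.g. rebuilding the trigram table.
Rprof("ngram_profile.out")
trigram_freq <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)
Rprof(NULL)
summaryRprof("ngram_profile.out")$by.self

# Reclaim memory after removing large intermediate objects.
gc()
```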
The final application will be a Shiny app that suggests the next word as the user types.
This milestone confirms that the data has been downloaded, cleaned, and explored, and that a clear plan for the predictive model and Shiny application is in place.