Executive Summary

This report explores the HC Corpora dataset (Blogs, News, and Twitter) to prepare for building a predictive text application. We have analyzed the structure of the data, performed cleaning, and identified common word patterns. To maintain performance and reproducibility, we used a 1% random sample of the total data.

1. Data Statistics

Before cleaning, we analyzed the raw files to understand their scale. The dataset is large, containing more than 3.3 million lines combined. This volume requires an efficient sampling strategy for model development.

Summary of Raw Data Files
File      Size (MB)      Lines        Words
Blogs        200.42     899,288   37,546,806
News         196.28      77,259    2,674,561
Twitter      159.36   2,360,148   30,096,690
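
As a rough sketch, the figures above can be reproduced in R along the following lines; the en_US.* file paths are assumptions based on the standard corpus layout rather than paths taken from this report.

    library(stringi)   # stri_count_words() for fast word counting

    files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
               News    = "final/en_US/en_US.news.txt",
               Twitter = "final/en_US/en_US.twitter.txt")

    stats <- t(sapply(files, function(f) {
      lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      c(Size_MB = round(file.size(f) / 1024^2, 2),
        Lines   = length(lines),
        Words   = sum(stri_count_words(lines)))
    }))
    stats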

2. Data Cleaning & Sampling

Because the combined raw data is over 550 MB, we took a 1% random sample so that exploration and model building remain fast and responsive. We cleaned the sample by converting the text to lowercase and removing punctuation, numbers, special characters, and excess white space.
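
A minimal sketch of these steps (the helper name and seed are illustrative, not from the report):

    set.seed(1234)  # a fixed seed keeps the 1% sample reproducible

    sample_and_clean <- function(lines, pct = 0.01) {
      s <- sample(lines, ceiling(length(lines) * pct))  # 1% random sample
      s <- tolower(s)                                   # lowercase
      s <- gsub("[^a-z' ]", " ", s)                     # drop punctuation, numbers, special characters
      s <- gsub("\\s+", " ", s)                         # collapse excess white space
      trimws(s)
    }

    blogs_sample <- sample_and_clean(
      readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE))

Note that apostrophes are kept in this sketch so that contractions such as “don’t” survive cleaning.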

3. Exploratory Analysis (Word Frequencies)

We analyzed N-grams, which are contiguous sequences of N words (unigrams, bigrams, trigrams). Their frequencies identify which words or phrases are most likely to appear after a given context.

Top Unigrams (Single Words)

The most common words are “stop words” (the, to, and). Unlike in many text-mining tasks, we keep them because they are essential to the grammatical structure of our predictions.
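
A base-R sketch of the unigram counts, using the cleaned sample from the previous section (blogs_sample is the illustrative object defined there):

    words    <- unlist(strsplit(blogs_sample, " ", fixed = TRUE))
    words    <- words[words != ""]
    unigrams <- sort(table(words), decreasing = TRUE)
    head(unigrams, 10)   # dominated by stop words such as "the", "to", "and"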

Top Bigrams (Two-Word Pairs)

Bigrams are the foundation of our next-word prediction. For example, if a user types “of,” the model identifies that “the” is a highly probable next word.
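
The same idea as a sketch: count bigrams, then return the most frequent follower of a given word (names such as next_word are illustrative):

    tokens  <- strsplit(blogs_sample, " ", fixed = TRUE)
    bigrams <- unlist(lapply(tokens, function(w) {
      w <- w[w != ""]
      if (length(w) >= 2) paste(head(w, -1), tail(w, -1)) else character(0)
    }))
    bigram_freq <- sort(table(bigrams), decreasing = TRUE)

    next_word <- function(w1) {
      hits <- bigram_freq[startsWith(names(bigram_freq), paste0(w1, " "))]
      sub("^\\S+ ", "", names(hits)[1])   # second word of the top-ranked bigram
    }
    next_word("of")   # expected to return "the" on this corpus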

4. Interesting Findings

While exploring the data, several key observations were made:

  • Frequency Distribution: A small number of words accounts for the vast majority of the language used (Zipf’s Law). This suggests the dictionary can be pruned for efficiency without a significant loss of accuracy; a short coverage check follows this list.
  • Contextual Patterns: Bigrams and Trigrams are remarkably consistent across different media (Blogs vs. Twitter), though Twitter contains significantly more informal contractions.
  • Foreign Characters: The dataset contains occasional non-English characters, which were filtered out to focus on the English-US prediction model.
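
The coverage check referenced above, assuming the unigrams table from the exploratory sketch:

    freqs    <- as.numeric(unigrams)        # counts, already sorted in decreasing order
    coverage <- cumsum(freqs) / sum(freqs)
    c(words_for_50pct = which(coverage >= 0.50)[1],   # unique words covering 50% of usage
      words_for_90pct = which(coverage >= 0.90)[1])   # unique words covering 90% of usage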

5. Prediction Strategy & Conclusion

The exploratory analysis confirms that the data is sufficient for building a predictive model. The project will now proceed to the development of the Shiny application.

The Algorithm

  1. N-gram Lookup: We will implement a Katz Back-off model. The algorithm first checks for a 3-word (trigram) match; if none is found, it “backs off” to a 2-word match, and finally to a 1-word match (a simplified lookup is sketched after this list).
  2. Optimization: To meet the memory constraints of a mobile-style app, we will remove rare N-grams (those appearing only once) to shrink the lookup tables.
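
The sketch below shows the back-off lookup in simplified form; it omits the count discounting that full Katz Back-off applies, and it assumes trigram/bigram/unigram frequency tables whose names are space-separated N-grams, built the same way as bigram_freq above.

    predict_next <- function(context, tri, bi, uni, n = 3) {
      w <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)   # last two words typed

      followers <- function(tbl, prefix) {
        hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
        sub(".* ", "", names(sort(hits, decreasing = TRUE)))  # last word of each matching N-gram
      }

      cand <- character(0)
      if (length(w) == 2)    cand <- followers(tri, paste(w, collapse = " "))  # 3-word match
      if (length(cand) == 0) cand <- followers(bi, tail(w, 1))                 # back off to 2-word match
      if (length(cand) == 0) cand <- names(uni)                                # back off to 1-word match
      head(unique(cand), n)
    }

Dropping singleton N-grams before building these tables (for example, keeping only bigram_freq[bigram_freq > 1]) is how the optimization step above would shrink the lookup data.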

The Shiny App

The final app will feature a clean user interface where a user can type a sentence, and the app will instantly display the top three most likely next words as interactive buttons.
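
A minimal Shiny sketch of that interface (identifiers are placeholders, and predict_next plus the tri/bi/uni tables come from the back-off sketch above):

    library(shiny)

    ui <- fluidPage(
      textInput("sentence", "Type a sentence:"),
      uiOutput("suggestions")        # the top three predictions appear here as buttons
    )

    server <- function(input, output) {
      output$suggestions <- renderUI({
        req(nzchar(input$sentence))
        preds <- predict_next(input$sentence, tri, bi, uni)
        do.call(tagList,
                lapply(preds, function(w) actionButton(paste0("btn_", w), label = w)))
      })
    }

    shinyApp(ui, server)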