This report explores the HC Corpora dataset (Blogs, News, and Twitter) to prepare for building a predictive text application. We have analyzed the structure of the data, performed cleaning, and identified common word patterns. To maintain performance and reproducibility, we used a 1% random sample of the total data.
Before cleaning, we analyzed the raw files to understand their scale. The dataset is massive, containing over 4 million lines combined. This volume requires an efficient sampling strategy for model development.
| File | Size (MB) | Line Count | Word Count |
|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,806 |
| News | 196.28 | 77,259 | 2,674,561 |
| Twitter | 159.36 | 2,360,148 | 30,096,690 |
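For reproducibility, a table like the one above can be generated with a few lines of base R. This is a minimal sketch: the file names (`en_US.blogs.txt`, etc.) and working directory are assumptions, and the word counts come from a simple whitespace split.

```r
# Sketch: summarize the raw corpus files (file names are assumed).
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(Size_MB    = round(file.size(f) / 1024^2, 2),
    Line_Count = length(lines),
    Word_Count = sum(lengths(strsplit(lines, "\\s+"))))
}))
stats
```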
Because the raw data totals nearly 600 MB, we took a 1% random sample to keep model development fast and responsive. We cleaned the sample by converting text to lowercase and removing punctuation, numbers, special characters, and excess white space.
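The sampling and cleaning steps can be sketched in base R as below. The `lines` vector is assumed to hold the combined raw text of all three files, and the seed value is arbitrary but makes the 1% sample reproducible.

```r
# Sketch: draw a reproducible 1% sample, then clean it
# (assumes 'lines' holds the combined raw text of Blogs, News, and Twitter).
set.seed(1234)
sample_lines <- sample(lines, round(0.01 * length(lines)))

clean <- tolower(sample_lines)                        # lowercase
clean <- gsub("[[:punct:]]|[[:digit:]]", " ", clean)  # punctuation and numbers
clean <- gsub("[^a-z ]", " ", clean)                  # remaining special characters
clean <- gsub("\\s+", " ", trimws(clean))             # collapse excess whitespace
```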
We analyzed N-grams: contiguous sequences of N words. Counting their frequencies identifies which words or phrases are most likely to follow a given context.
Top Unigrams (Single Words)
The most common words are “stop words” such as “the,” “to,” and “and.” Although stop words are often removed in other text-mining tasks, we keep them here because they are essential to the grammatical structure of our predictions.
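Counting unigrams is straightforward in base R; the sketch below assumes the `clean` vector from the sampling step.

```r
# Sketch: unigram frequencies from the cleaned sample ('clean' from above).
words    <- unlist(strsplit(clean, " "))
words    <- words[words != ""]
unigrams <- sort(table(words), decreasing = TRUE)
head(unigrams, 10)   # expect stop words such as "the", "to", "and" on top
```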
Top Bigrams (Two-Word Pairs)
Bigrams are the foundation of our next-word prediction. For example, if a user types “of,” the model identifies that “the” is a highly probable next word.
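Bigram counts can be built by pairing each word with its successor within a line. The sketch below reuses the cleaned sample and deliberately avoids pairing words across line boundaries.

```r
# Sketch: bigram frequencies, built line by line so that pairs never
# span a line boundary ('clean' is the cleaned sample from above).
bigrams <- unlist(lapply(strsplit(clean, " "), function(w) {
  w <- w[w != ""]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))   # pair word i with word i + 1
}))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)               # expect pairs such as "of the"
```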
While exploring the data, several key observations were made:

- The Twitter file has by far the most lines, while the Blogs file contributes the most words, reflecting very different entry lengths across sources.
- The most frequent unigrams are dominated by stop words, which we retain because they carry the grammatical structure the predictor needs.
- Bigram frequencies already encode strong next-word signals (e.g., “of” followed by “the”), making them a natural foundation for prediction.
The exploratory analysis confirms that the data is sufficient for building a predictive model. The project will now proceed to the development of the Shiny application.
The Algorithm
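The prediction algorithm will rank candidate next words by N-gram frequency. The sketch below shows the core lookup against the bigram table built earlier; a production version would back off from longer N-grams to shorter ones when no match is found. The function name `predict_next` is our own illustration, not part of any package.

```r
# Sketch: rank next-word candidates from the bigram table ('bigram_freq'
# from the exploratory step); backoff to shorter N-grams is omitted here.
predict_next <- function(word, freq = bigram_freq, n = 3) {
  hits   <- freq[startsWith(names(freq), paste0(word, " "))]
  ranked <- names(sort(hits, decreasing = TRUE))
  head(sub("^\\S+ ", "", ranked), n)   # strip the context word, keep top n
}

predict_next("of")   # "the" should rank near the top
```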
The Shiny App
The final app will feature a clean user interface where a user can type a sentence, and the app will instantly display the top three most likely next words as interactive buttons.
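A skeletal Shiny app illustrating that interaction might look like the sketch below. It assumes the `predict_next()` function from the algorithm sketch and only wires up the display of suggestions, not the insertion of a clicked word.

```r
# Sketch: minimal Shiny UI showing the top predictions as buttons
# (assumes predict_next() from the algorithm sketch is defined).
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a sentence:"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    tokens <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    if (length(tokens) == 0) return(NULL)
    preds <- predict_next(tail(tokens, 1))
    lapply(seq_along(preds), function(i) {
      actionButton(paste0("pred", i), preds[i])   # one button per prediction
    })
  })
}

shinyApp(ui, server)
```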