This milestone report outlines an exploratory data analysis of the Capstone English-language dataset (en_US). The main goal is to analyze the underlying structure of three distinct text sources: Blogs, News articles, and Twitter feeds.
By inspecting basic summary metrics and word distributions, I establish a clean foundational baseline to construct a predictive text algorithm (Next-Word Prediction Engine) and deploy an interactive user interface via a Shiny Application.
Before any text manipulation, the raw text files were assessed for file size, line count, and word count.
| File_Source | File_Size_MB | Total_Lines | Total_Words |
|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899288 | 37546806 |
| en_US.news.txt | 196.28 | 1010206 | 34761151 |
| en_US.twitter.txt | 159.36 | 2360148 | 30096690 |
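Metrics like those in the table can be gathered by streaming each file once. The following is a minimal sketch, not the code used for this report: the function name `summarize_text_file` is hypothetical, and words are assumed to be whitespace-delimited, so the counts it yields may differ slightly from the figures above if a different tokenizer was used.

```python
import os


def summarize_text_file(path):
    """Return size in MB, line count, and whitespace-delimited word count."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    total_lines = 0
    total_words = 0
    # Stream line by line so a ~200 MB file is never held in memory at once.
    # errors="replace" is an assumption: it keeps the scan going if a file
    # contains bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total_lines += 1
            total_words += len(line.split())
    return {
        "File_Source": os.path.basename(path),
        "File_Size_MB": round(size_mb, 2),
        "Total_Lines": total_lines,
        "Total_Words": total_words,
    }
```

Calling `summarize_text_file("en_US.blogs.txt")` on each of the three files would produce one table row per call.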
Storage Footprint: The three files total about 556 MB of raw, unstructured text, so downsampling is required for stable text mining.
Length Constraints: Blog entries have the longest lines, averaging about 42 words per line, compared with about 34 for news and about 13 for Twitter. Twitter's strict character limit produces many lines, each containing few words.
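One simple way to downsample is to keep each line independently with a fixed probability. The sketch below assumes that approach. The function name `sample_lines`, the seed, and any particular sampling fraction are illustrative choices, not values taken from this report.

```python
import random


def sample_lines(lines, fraction, seed=2024):
    """Keep each line independently with probability `fraction`."""
    # A dedicated, seeded generator makes the sample reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]
```

Because `lines` can be any iterable, an open file handle can be passed directly, which keeps only the sampled lines in memory. A fraction of a few percent per file is a common starting point, but the right value depends on available memory and should be treated as a tuning parameter.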
Word frequencies are heavily dominated by common stop words (such as “the”, “and”, and “to”). Standard text-mining pipelines usually filter these out, but they must be retained for the predictive typing engine because users type these words and their combinations frequently.
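The effect of keeping stop words can be illustrated with a small frequency count. This is a sketch under stated assumptions: the tokenizer (lowercase letters and apostrophes only) is a simplification, the names `tokenize` and `word_frequencies` are hypothetical, and the two sample sentences are toy data rather than lines from the corpus.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def word_frequencies(lines):
    """Count every token, deliberately keeping stop words."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Toy input, not taken from the dataset.
sample = [
    "The cat and the dog went to the park.",
    "I want to go to the store and the bank.",
]
top_three = word_frequencies(sample).most_common(3)
# top_three == [('the', 5), ('to', 3), ('and', 2)]
```

Even in two short sentences the top of the ranking consists entirely of stop words, which is why removing them would discard exactly the words a typing engine is asked to predict most often.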
Moving toward production deployment, the work is structured in two phases:
Phase 1: Predictive Engine Design:
N-Gram Back-Off Modeling: Build frequency-sorted lookup tables for quadgrams (4 words), trigrams (3 words), and bigrams (2 words).
Execution Path: When a user enters text, the algorithm looks up the final 3 words in the quadgram table. If no match exists, it “backs off” to the last 2 words in the trigram table, and then to the last word in the bigram table.
Optimization: N-grams with low occurrence counts will be pruned to reduce the model file size, with a target application response latency below 100 milliseconds.
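The Phase 1 steps above can be sketched as follows. This is a plain count-based back-off with no probability discounting, which is one possible reading of the design rather than the final engine. The function names, the `min_count` threshold, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_tables(token_lines, max_order=4, min_count=1):
    """For each order n, map an (n-1)-word context to its sorted next words."""
    raw = {n: defaultdict(Counter) for n in range(2, max_order + 1)}
    for tokens in token_lines:
        for n in range(2, max_order + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                raw[n][context][tokens[i + n - 1]] += 1
    tables = {}
    for n, table in raw.items():
        tables[n] = {}
        for context, nexts in table.items():
            # Sort by frequency and prune continuations seen fewer
            # than min_count times to shrink the model.
            kept = [(w, c) for w, c in nexts.most_common() if c >= min_count]
            if kept:
                tables[n][context] = kept
    return tables


def predict_next(tables, tokens, k=3, max_order=4):
    """Try the longest context first, backing off to shorter ones."""
    for n in range(max_order, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(tokens[-(n - 1):])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return [word for word, _ in candidates[:k]]
    return []  # no table matched; a default word list could be used here


# Toy corpus, not taken from the dataset.
corpus = [
    "thanks for the follow".split(),
    "thanks for the help".split(),
    "thanks for the follow".split(),
    "for the win".split(),
]
tables = build_ngram_tables(corpus)
quad_hit = predict_next(tables, ["thanks", "for", "the"])
backed_off = predict_next(tables, ["waiting", "for", "the"])
# quad_hit == ['follow', 'help']
# backed_off == ['follow', 'help', 'win']
```

In the second call the quadgram context ("waiting", "for", "the") is unseen, so the lookup falls back to the trigram context ("for", "the"). Storing each context's continuations already sorted means a prediction costs only a few dictionary lookups, which helps with the latency target. Raising `min_count` trades coverage for a smaller model.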
Phase 2: User Interface (Shiny App Product):
Input Interface: A simple text box where a non-technical user, such as a manager, can type phrases naturally.
Reactive Output: The app backend reacts to each keystroke and displays the top three predicted next words as selectable buttons.