This report provides an exploratory analysis of the digital text data provided for our predictive text project. The goal of this project is to build a smart keyboard algorithm that predicts the next word a user wants to type, similar to technologies found on modern smartphones.
We successfully loaded and analyzed three large datasets containing text from Blogs, News articles, and Twitter. This report describes the basic structure of the data, highlights key patterns in word usage, and outlines our strategy for building the final predictive application.
We imported the three English text files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt). Because each file is large (roughly 160 to 200 MB), we first captured their core metrics (file size, line count, and total word count) to understand the scale of the data we are working with.
| File_Source | File_Size_MB | Line_Count | Word_Count |
|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,806 |
| News | 196.28 | 1,010,206 | 34,761,151 |
| Twitter | 159.36 | 2,360,148 | 30,096,690 |
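Metrics like those in the table can be gathered in a single streaming pass, so a file never has to be held in memory all at once. The following is a minimal sketch in Python, written for illustration only; it is not the code used to produce this report. The function name `file_metrics` is hypothetical, words are assumed to be whitespace-delimited, and a megabyte is assumed to be 1024 × 1024 bytes, so the numbers it produces may differ slightly from those above.

```python
import os


def file_metrics(path, encoding="utf-8"):
    """Return file size (MB), line count, and word count for a text file.

    Assumptions: 1 MB = 1024 * 1024 bytes, and a "word" is any run of
    non-whitespace characters. Undecodable bytes are replaced, not fatal.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    line_count = 0
    word_count = 0
    # Iterate line by line so memory use stays flat even for large files.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())
    return {"size_mb": round(size_mb, 2), "lines": line_count, "words": word_count}
```

Calling `file_metrics("en_US.blogs.txt")` would return a dictionary with the three values for that file, assuming the file is in the working directory.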
To build a predictor, we need to know which words and word combinations appear most frequently. Because processing millions of lines requires a large amount of memory, we took a random 1% sample of the data to uncover the major linguistic patterns.
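One simple way to draw such a sample is to keep each line independently with probability 0.01. The sketch below is in Python for illustration and is not the code used for this report; the function name `sample_lines` and the seed value are hypothetical. Fixing the seed makes the sample reproducible. Note that this approach yields approximately, not exactly, 1% of the lines.

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Keep each line independently with probability `fraction`.

    `lines` can be any iterable of strings, including an open file handle,
    so the full dataset never needs to be loaded at once.
    """
    rng = random.Random(seed)  # private generator, so the result is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Passing an open file handle, for example `sample_lines(open("en_US.twitter.txt", encoding="utf-8"))`, streams the file and returns only the retained lines.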
We cleaned the text by removing punctuation and numbers and converting everything to lowercase. We then analyzed Unigrams (single words) and Bigrams (two-word combinations).
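The cleaning and tokenization steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code used for this report: the helper names `clean_text` and `ngrams` are hypothetical, punctuation is deleted rather than replaced by a space (so "it's" becomes "its"), and tokens are split on whitespace.

```python
import re


def clean_text(line):
    """Lowercase a line, drop punctuation and digits, and split into tokens."""
    lowered = line.lower()
    # [^\w\s] matches punctuation and symbols; [\d_] matches digits and underscores.
    stripped = re.sub(r"[^\w\s]|[\d_]", "", lowered)
    return stripped.split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (n=1 unigrams, n=2 bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For a line with fewer than `n` tokens, `ngrams` returns an empty list, so very short lines simply contribute no bigrams.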
As expected, common function words such as “the”, “to”, and “and” dominate English text.
Analyzing phrases gives us a clearer picture of how words link together naturally (e.g., “of the”, “in the”, “to the”).
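Rankings of the most frequent unigrams and bigrams come down to counting n-grams across all cleaned lines and sorting by count. The sketch below shows one way to do this in Python; it is illustrative only and not the code behind this report. The function name `top_ngrams` is hypothetical, and the input is assumed to be a list of already-tokenized lines.

```python
from collections import Counter


def top_ngrams(token_lists, n, k=10):
    """Count n-grams across tokenized lines and return the k most frequent.

    N-grams are formed within a line only, never across line boundaries.
    """
    counts = Counter()
    for tokens in token_lists:
        # Zipping the token list with shifted copies of itself yields n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)
```

With `n=1` the result ranks single words, and with `n=2` it ranks two-word combinations such as “of the”.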
Based on our exploratory findings, we have a clear path forward for creating the final data product:
We will build a simple, clean interactive web interface using R Shiny:

* Input text box: A space where the user can type any sentence.
* Real-time prediction buttons: The app will instantly display the top 3 predicted next words below the text box, exactly like a smartphone keyboard.
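The lookup behind the prediction buttons could work roughly as follows. This report does not specify the final model, so the sketch assumes the simplest option: a bigram table that maps each word to the words most often seen after it. It is written in Python for illustration rather than as Shiny code, and the names `build_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(token_lists):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in token_lists:
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, text, k=3):
    """Return up to k candidate next words based on the last word typed."""
    words = text.lower().split()
    if not words:
        return []
    followers = model.get(words[-1])  # .get avoids creating empty entries
    if not followers:
        return []  # unseen word: a fuller model would fall back to common unigrams
    return [word for word, _ in followers.most_common(k)]
```

An unseen last word returns an empty list here; a production version would likely need a fallback, such as suggesting the most common words overall.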