The goal of this capstone project is to build a next-word prediction model similar to those used in mobile keyboards. The training data come from the HC Corpora collection and include text from blogs, news websites, and Twitter in multiple languages.
This milestone report:

- Shows that the English data have been downloaded and successfully loaded into R.
- Provides basic summaries of the three text files (line counts, word counts, etc.).
- Outlines an initial plan for the prediction algorithm and the Shiny application in simple, non-technical language.

Together, these points are meant to demonstrate that we are comfortable working with the data and on track to build the final prediction model.
## Data loading

| Dataset | Lines |
|---|---|
| Blogs | 899288 |
| News | 1010242 |
| Twitter | 2360148 |

| Dataset | WordCount |
|---|---|
| Blogs | 37546250 |
| News | 34762395 |
| Twitter | 30093413 |

| Dataset | Lines | WordCount | AvgWordsPerLine |
|---|---|---|---|
| Blogs | 899288 | 37546250 | 41.75 |
| News | 1010242 | 34762395 | 34.41 |
| Twitter | 2360148 | 30093413 | 12.75 |

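
As a rough illustration of how summaries like the ones above could be produced, here is a minimal base R sketch. The helper name `summarise_lines` and the file name in the commented-out lines are assumptions, not taken from this report, and words are approximated as whitespace-separated tokens, so the counts may differ slightly from those in the tables if a different tokenizer was used.

```r
# Summarise a character vector of text lines: number of lines, total words,
# and average words per line. A "word" here is any whitespace-separated token.
summarise_lines <- function(lines, name) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Dataset = name,
    Lines = length(lines),
    WordCount = sum(words_per_line),
    AvgWordsPerLine = round(mean(words_per_line), 2),
    stringsAsFactors = FALSE
  )
}

# Tiny in-memory example standing in for a real file
demo <- c("the quick brown fox", "hello world", "one")
demo_summary <- summarise_lines(demo, "Demo")
print(demo_summary)

# With a real file (the file name below is an assumption):
# blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# summarise_lines(blogs, "Blogs")
```

Calling the helper once per file and combining the results with `rbind()` would give a table with the same columns as the combined summary.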
To reduce processing time, we sample 10,000 lines from each file.
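
The sampling step might look something like the sketch below. The function name `sample_lines` and the seed value are assumptions chosen for illustration; the only detail taken from the text is the sample size of 10,000 lines per file.

```r
# Draw up to n lines at random, without replacement, from a character vector.
sample_lines <- function(lines, n = 10000, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible (value is arbitrary)
  lines[sample.int(length(lines), size = min(n, length(lines)))]
}

# Small in-memory example standing in for a full file
demo <- paste("line", 1:50)
subset_a <- sample_lines(demo, n = 10)
subset_b <- sample_lines(demo, n = 10)  # same seed, so the same lines
```

Capping the size at `length(lines)` keeps the function safe for inputs shorter than the requested sample.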
## Early Findings

- Twitter lines are much shorter than Blogs or News lines because of message length limits.
- Blogs have the highest average words per line (41.75), pointing to longer, richer sentence structure.
- News text is formal, and its average line length (34.41 words) sits between the other two sources; its vocabulary may influence which words the model predicts.
- The dataset is very large (over 4 million lines in total), so sampling strategies will be necessary when building the prediction model.

The next phase will focus on:

- Removing punctuation, numbers, and profanity
- Keeping common words (e.g., “the”, “to”, “and”) since they help prediction
- Reducing memory by trimming rare word combinations
- Ensuring fast prediction speeds for deployment

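
The cleaning and trimming steps listed above could be sketched in base R as follows. This is only an illustration under several assumptions: the function names `clean_text` and `count_bigrams`, the `min_count` threshold, and the placeholder profanity list are all hypothetical, and the regular expression keeps only ASCII letters and apostrophes, which is a simplification for English text.

```r
# Lower-case the text and replace everything except letters, apostrophes, and
# spaces with a space (this drops punctuation and numbers). Common words such
# as "the" are deliberately kept. Tokens found in `profanity` are removed.
clean_text <- function(lines, profanity = character(0)) {
  x <- gsub("[^a-z' ]", " ", tolower(lines))
  tokens <- strsplit(trimws(x), "\\s+")
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% profanity)])
}

# Count adjacent word pairs (bigrams) and drop those seen fewer than
# `min_count` times, which trims rare combinations to save memory.
count_bigrams <- function(token_list, min_count = 2) {
  pairs <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < 2) return(character(0))
    paste(head(tok, -1), tail(tok, -1))
  }))
  counts <- table(pairs)
  sort(counts[counts >= min_count], decreasing = TRUE)
}

# Small in-memory example
demo <- c("The cat sat on the mat!", "The cat ate 2 fish.", "A dog sat on the rug")
tokens <- clean_text(demo, profanity = c("darn"))  # placeholder word list
bigrams <- count_bigrams(tokens, min_count = 2)
print(bigrams)
```

One side effect worth noting: removing a token from the middle of a line makes its neighbours adjacent, so a stricter version might split the line at that point instead.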
The Shiny application will follow a simple flow:

1. The user types text.
2. The model suggests the next word in real time.

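
To make the type-then-suggest flow concrete, here is a minimal base R sketch of the kind of lookup the application could call each time the input changes. It assumes a simple bigram model with a fixed fallback word; the function name `predict_next`, the fallback, and the toy counts are all made up for illustration, and the final model may well use longer word sequences.

```r
# Toy bigram counts: names are "previous next", values are frequencies.
# These numbers are invented for illustration only.
bigram_counts <- c("thank you" = 5, "thank goodness" = 2,
                   "you are" = 4, "you can" = 3)

# Suggest the word that most often follows the last word the user typed.
# If the last word was never seen, return a fallback word instead.
predict_next <- function(text, counts, fallback = "the") {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return(fallback)
  last_word <- words[length(words)]
  prev <- sub(" .*$", "", names(counts))  # first word of each bigram
  nxt <- sub("^.* ", "", names(counts))   # second word of each bigram
  matches <- prev == last_word
  if (!any(matches)) return(fallback)
  nxt[matches][which.max(counts[matches])]
}

predict_next("I want to thank", bigram_counts)  # "you"
predict_next("Thank you", bigram_counts)        # "are"
predict_next("xyz", bigram_counts)              # "the" (fallback)
```

Because the lookup is a vector comparison over precomputed counts, it should stay fast enough for interactive use as long as rare bigrams have been trimmed.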
This milestone shows that we are comfortable with the data and ready to begin model development.