For this report, I downloaded and unzipped the Coursera-SwiftKey.zip file, then loaded the Twitter, blogs, and news documents from the `en_US` folder. For each source, I computed the line count and the number of words per line, plotted histograms of the per-line word counts to visualize the distribution of document lengths, and computed the total word count.

Line counts by source:

| Source | Lines |
|---|---|
| Twitter | 2360148 |
| Blogs | 899288 |
| News | 1010206 |

Words per line by source:

| Source | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| Twitter | 1 | 7 | 12 | 12.86936 | 18 | 47 |
| Blogs | 1 | 9 | 28 | 41.51521 | 59 | 6630 |
| News | 1 | 19 | 31 | 34.02378 | 45 | 1792 |

Total word counts by source:

| Source | Total words |
|---|---|
| Twitter | 30373583 |
| Blogs | 37334131 |
| News | 34371031 |

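
As an illustration of how summary statistics like those above can be computed, here is a minimal Python sketch. It is not the code used for this report. It assumes that a "word" is any whitespace-delimited token and that quartiles use the inclusive interpolation method (the one that matches R's default `quantile` type); the function names `summarize_lines` and `summarize_file` are hypothetical.

```python
import statistics


def summarize_lines(lines):
    """Return line count, total words, and a six-number summary of words per line."""
    # Assumption: a word is any whitespace-delimited token.
    counts = sorted(len(line.split()) for line in lines)
    # Inclusive quartiles; needs at least two lines of input.
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "lines": len(counts),
        "total_words": sum(counts),
        "min": counts[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(counts),
        "q3": q3,
        "max": counts[-1],
    }


def summarize_file(path):
    """Summarize a text file in which each line is one document."""
    # errors="replace" keeps the scan going if a line has invalid bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_lines(handle)
```

Calling `summarize_file` once per source file would give one row for each of the three tables. Because the counts are held in memory as plain integers, this should remain practical even for the largest file, though a different tokenizer would change the word totals.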
Next Steps: Prediction Algorithm and Shiny App
Based on the exploratory analysis above, I plan to build a next-word prediction algorithm using an n-gram model with backoff. I will first generate n-grams (unigrams through 4-grams) from the cleaned text data, then apply smoothing techniques to handle unseen word combinations. The final Shiny app will take a user's partial sentence as input and display the top three predicted next words in real time. The interface will be kept simple, with a text box and a clear output area, so that non-technical users can interact with it easily.
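
To make the backoff idea concrete, here is a minimal Python sketch of one way the lookup could work. It is an illustration under simplifying assumptions, not the planned implementation: tokens are lowercased and split on whitespace, candidates are ranked by raw counts, and no smoothing is applied. The function names `build_ngram_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each context of up to max_n - 1 words to a Counter of next words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple for unigrams
                tables[context][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, text, k=3, max_n=4):
    """Return up to k candidate next words, longest matching context first."""
    tokens = text.lower().split()
    predictions = []
    # Start with the longest usable context, then back off to shorter ones,
    # ending with the empty context (overall most frequent words).
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        for word, _count in tables.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions


corpus = [
    "i love you",
    "i love you",
    "i love pizza",
    "you love pizza",
    "we like pizza",
]
tables = build_ngram_tables(corpus)
top_three = predict_next(tables, "i love")
```

Here the two words seen after "i love" come first, and the third slot is filled by backing off to a shorter context. A fuller version would replace the raw counts with smoothed or discounted scores so that candidates from different n-gram orders can be compared directly.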