Overview

For this report, I downloaded the Coursera-SwiftKey.zip file, unzipped it, and loaded the Twitter, blogs, and news documents from the ‘en_US’ folder. For each data source, I then computed the line count and the number of words per line, and created histograms to visualize the distribution of document lengths. Finally, I computed the total word count for each source.

Table 1: Line counts for each data source
Source     Lines
Twitter  2360148
Blogs     899288
News     1010206

Table 2: Word count per line, by source (tweet, blog entry, news item)
Source   Min  Q1  Median  Mean      Q3  Max
Twitter  1    7   12      12.86936  18  47
Blogs    1    9   28      41.51521  59  6630
News     1    19  31      34.02378  45  1792

Table 3: Total words per file, by source
Source   Total_Words
Twitter  30373583
Blogs    37334131
News     34371031
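
As a rough illustration of how per-line statistics like these can be computed, here is a minimal sketch in Python. It is not the code used to produce the tables above: the function name `summarize_lines`, the whitespace-based word splitting, and the sample lines are all assumptions made for the example, and a different tokenizer would give slightly different counts.

```python
import statistics

def summarize_lines(lines):
    """Summarize the number of whitespace-separated words per line.

    Assumption: a "word" is any run of non-whitespace characters.
    Needs at least two lines so that quartiles can be computed.
    """
    counts = [len(line.split()) for line in lines]
    # method="inclusive" interpolates between data points (the sample is
    # treated as including its own min and max).
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "lines": len(counts),
        "total_words": sum(counts),
        "min": min(counts),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(counts),
        "q3": q3,
        "max": max(counts),
    }

# Tiny in-memory sample standing in for one of the text files.
sample = ["a b c", "d e", "f g h i j", "k"]
summary = summarize_lines(sample)
print(summary)
```

For a real file, the same function could be applied to the lines read with `open(path, encoding="utf-8")`, where `path` points at one of the unzipped text files.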

Next Steps: Prediction Algorithm and Shiny App

Based on the exploratory analysis above, I plan to build a next-word prediction algorithm using an n-gram model with backoff. I will first generate n-grams (1-gram to 4-gram) from the cleaned text data, then apply smoothing techniques to handle unseen word combinations. The final Shiny app will take a user’s partial sentence as input and display the top three predicted next words in real time. The app interface will be kept simple, with a text box and a clear output area, making it easy for non-technical users to interact with.
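
To make the backoff idea concrete, here is a minimal sketch in Python of the lookup step: try the longest available context first, then fall back to shorter contexts until three candidates are found. It is only an illustration under stated assumptions, not the planned implementation. The function names `build_ngram_tables` and `predict_next`, the lowercase whitespace tokenization, and the toy corpus are hypothetical, and the sketch ranks candidates by raw counts without any smoothing.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 preceding words)
    to a Counter of the words observed right after it."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k = number of context words
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])
                tables[context][word] += 1
    return tables

def predict_next(tables, text, top_k=3, max_n=4):
    """Return up to top_k next-word candidates, backing off from the
    longest matching context down to plain unigram frequencies."""
    tokens = text.lower().split()
    predictions = []
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        for word, _count in tables.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == top_k:
                return predictions
    return predictions

# Toy corpus, invented for this example only.
corpus = [
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "we love old books",
]
tables = build_ngram_tables(corpus)
print(predict_next(tables, "i love new"))
```

In this sketch, candidates found under a longer context always rank ahead of those found after backing off, which keeps the lookup simple. A smoothed model would instead combine evidence across n-gram orders.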