Milestone Report

Overview

For this report, I downloaded the Coursera-SwiftKey.zip file, unzipped it, and loaded the Twitter, blogs, and news files from the ‘en_US’ folder. I then computed the line count of each file and the word count of each line, and plotted histograms of the per-line word counts to visualize the distribution of document lengths for each data source. Finally, I computed the total word count of each file.
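The counting step can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the tables in this report. It assumes that a "word" is any whitespace-delimited token, and the helper names `read_lines` and `words_per_line` and the sample lines are hypothetical.

```python
def read_lines(path):
    """Read a text file into a list of lines (the path is a placeholder)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]


def words_per_line(lines):
    """Count whitespace-delimited tokens in each line (assumed word definition)."""
    return [len(line.split()) for line in lines]


# Small in-memory sample standing in for one of the en_US files.
sample_lines = [
    "thanks for the follow",
    "see you at the game tonight",
    "ok",
]

word_counts = words_per_line(sample_lines)  # one count per line
line_count = len(sample_lines)              # analogue of Table 1
total_words = sum(word_counts)              # analogue of the per-file totals
print(line_count, word_counts, total_words)
```

A different tokenizer (for example, one that splits on punctuation) would give somewhat different word counts, so the figures below depend on the word definition that was actually used.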

Table 1: Line counts for each data source
Source	Lines
Twitter	2360148
Blogs	899288
News	1010206

Table 2: Word count per line (one tweet, blog post, or news item per line) by source
Source	Min	Q1	Median	Mean	Q3	Max
Twitter	1	7	12	12.86936	18	47
Blogs	1	9	28	41.51521	59	6630
News	1	19	31	34.02378	45	1792
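A six-number summary like the one in Table 2 can be computed from a list of per-line word counts. The sketch below uses Python's standard `statistics` module. The `inclusive` quantile method is an assumption, chosen because it matches the default quantile definition in R (type 7), and the input list is made up for illustration.

```python
import statistics


def summarize(word_counts):
    """Return min, quartiles, mean, and max of a list of per-line word counts."""
    q1, median, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")
    return {
        "min": min(word_counts),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(word_counts),
        "q3": q3,
        "max": max(word_counts),
    }


# Hypothetical per-line word counts, not taken from the data set.
summary = summarize([3, 5, 8, 12, 20])
print(summary)
```

In all three sources in Table 2 the mean exceeds the median, which indicates a right-skewed distribution. The skew is strongest for blogs, where a few very long posts (up to 6630 words) pull the mean well above the median.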

Table 3: Total word count per file by source
Source	Total_Words
Twitter	30373583
Blogs	37334131
News	34371031
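As a sanity check, the total word count of each file should equal its line count multiplied by its mean words per line, up to the rounding of the reported mean. The sketch below runs that check on the figures from the tables above. The tolerance is an assumption: half a unit in the fifth decimal place of the mean, multiplied by the number of lines.

```python
# (lines, mean words per line, reported total words), copied from the tables.
reported = {
    "Twitter": (2360148, 12.86936, 30373583),
    "Blogs": (899288, 41.51521, 37334131),
    "News": (1010206, 34.02378, 34371031),
}

consistent = {}
for source, (lines, mean_words, total_words) in reported.items():
    implied_total = lines * mean_words   # total implied by the rounded mean
    tolerance = 0.000005 * lines         # worst-case effect of rounding the mean
    gap = total_words - implied_total
    consistent[source] = abs(gap) <= tolerance
    print(f"{source}: implied {implied_total:.0f}, reported {total_words}, gap {gap:.1f}")

print(consistent)
```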

Next Steps: Prediction Algorithm and Shiny App

Based on the exploratory analysis above, I plan to build a next-word prediction algorithm using an n-gram model with backoff. I will first generate n-grams (unigrams through 4-grams) from the cleaned text data, then apply smoothing techniques to handle unseen word combinations. The final Shiny app will take a user’s partial sentence as input and display the top three predicted next words in real time. The interface will be kept simple, with a text box and a clear output area, so that non-technical users can interact with it easily.
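The backoff idea can be sketched as follows. This is a minimal illustration in Python, not the planned implementation. It ranks candidates by raw counts, starting from the longest matching context (up to three preceding words) and backing off to shorter contexts, ending with plain unigram frequency. No smoothing is applied. The function names `build_ngram_tables` and `predict_next` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n - 1 words) to next-word counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):      # k = number of context words
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])
                tables[context][word] += 1
    return tables


def predict_next(tables, text, top_k=3, max_n=4):
    """Return up to top_k next words, backing off from long to short contexts."""
    tokens = text.lower().split()
    predictions = []
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        for word, _count in tables.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == top_k:
                return predictions
    return predictions


# Toy corpus standing in for the cleaned training text.
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "see you at the game",
]
tables = build_ngram_tables(corpus)
print(predict_next(tables, "Thanks for the"))
```

With this toy corpus, the 4-gram context supplies the first two candidates and the shorter context "the" supplies the third. A smoothed or discounted model would instead combine evidence across n-gram orders rather than simply taking the longest match first.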


Haiyan Chen

2026-05-12
