Project Summary

The objective of this capstone is to build a next-word prediction model similar to the SwiftKey keyboard.
This report summarizes the work completed for Tasks 1–3 and is written to be understandable by a non-technical manager.

The completed tasks include:

  • Task 1 — Getting and cleaning the data
  • Task 2 — Exploratory data analysis (token statistics, foreign-language estimation, and n-grams)
  • Task 3 — Building a compact n-gram language model


Data Overview

The dataset used is the English HC Corpora dataset provided by Coursera. It contains three large text files:

  • en_US.blogs.txt (blog posts)
  • en_US.news.txt (news articles)
  • en_US.twitter.txt (tweets)

Due to the large size of the dataset, a reproducible 0.5% random sample was used to make analysis computationally feasible.
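A minimal sketch of the sampling step in R (the 0.5% rate and file names follow the report; the seed value and the use of rbinom are assumptions about how reproducibility was achieved):

    set.seed(1234)                                   # fixed seed for reproducibility (value assumed)
    sample_lines <- function(path, rate = 0.005) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), size = 1, prob = rate) == 1]   # keep each line with probability 0.5%
    }
    blogs_sample   <- sample_lines("en_US.blogs.txt")
    news_sample    <- sample_lines("en_US.news.txt")
    twitter_sample <- sample_lines("en_US.twitter.txt")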


Task 1 — Getting & Cleaning the Data

What was done

  • Loaded raw files using streaming to avoid memory issues
  • Created reproducible random samples from each file
  • Applied text cleaning: lowercase conversion and removal of URLs, numbers, and non-ASCII characters
  • Removed profane words
  • Tokenized text into individual words (a sketch of these steps follows the list)
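
A condensed sketch of the cleaning pipeline in R, building on the samples above (the regular expressions and the profanity-list file name are illustrative assumptions, not the exact ones used):

    clean_tokens <- function(x, profanity) {
      x <- tolower(x)                              # lowercase conversion
      x <- gsub("http\\S+|www\\.\\S+", " ", x)     # remove URLs
      x <- gsub("[0-9]+", " ", x)                  # remove numbers
      x <- iconv(x, "UTF-8", "ASCII", sub = " ")   # drop non-ASCII characters
      tokens <- unlist(strsplit(x, "[^a-z']+"))    # tokenize into individual words
      tokens <- tokens[nzchar(tokens)]             # drop empty strings
      tokens[!tokens %in% profanity]               # remove profane words
    }
    profanity <- readLines("profanity.txt")        # hypothetical word-list file
    tokens    <- clean_tokens(blogs_sample, profanity)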

Basic Token Statistics (Top Words)

Blogs — Top Tokens

Rank  Token  Count
   1  the     9278
   2  to      5347
   3  and     5263
   4  a       4371
   5  of      4326

News — Top Tokens

Rank  Token  Count
   1  the     9778
   2  to      4600
   3  a       4490
   4  and     4390
   5  of      3940

Twitter — Top Tokens

Rank  Token  Count
   1  the     4546
   2  to      3902
   3  i       3619
   4  a       2990
   5  you     2594

These results show a Zipf-like distribution where a small number of words account for a large portion of usage.
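Counts like those in the tables above come from a simple frequency table over the cleaned tokens; a short sketch:

    freq <- sort(table(tokens), decreasing = TRUE)   # token frequencies, most common first
    head(freq, 5)                                    # top five tokens and their counts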


Foreign Language Detection

To estimate the presence of foreign-language text, the proportion of tokens containing non-ASCII characters was calculated.

Dataset  Estimated foreign-word ratio
Blogs    1.10%
News     1.77%
Twitter  3.18%

The Twitter sample contains noticeably more noisy and mixed-language text than the blog and news samples.
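
A sketch of the heuristic, applied to the raw (pre-cleaning) sample lines; the [:ascii:] character class requires perl = TRUE:

    raw_tokens    <- unlist(strsplit(blogs_sample, "\\s+"))                  # whitespace tokenization
    foreign_ratio <- mean(grepl("[^[:ascii:]]", raw_tokens, perl = TRUE))    # share of tokens with non-ASCII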


N-gram Highlights

To capture relationships between adjacent words, bigrams (two-word sequences) and trigrams (three-word sequences) were generated.

Common Bigrams

  • of the
  • in the
  • to the

Common Trigrams

  • one of the
  • as well as
  • going to be

Frequent n-grams such as these are the basis of next-word prediction: given the last one or two words typed, the model suggests the most common continuation.
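
A minimal way to build the bigram and trigram frequency tables from the token vector (a base-R sketch, not the exact implementation used):

    ngrams <- function(tokens, n) {
      if (length(tokens) < n) return(character(0))
      starts <- seq_len(length(tokens) - n + 1)
      vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
    }
    bigram_freq  <- sort(table(ngrams(tokens, 2)), decreasing = TRUE)
    trigram_freq <- sort(table(ngrams(tokens, 3)), decreasing = TRUE)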


Task 3 — Modeling Summary

A compact n-gram language model was constructed from the cleaned and tokenized data, combining the unigram, bigram, and trigram frequency tables described above.
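
The report does not specify a smoothing or backoff scheme, so the sketch below assumes the simplest approach: look up the most frequent trigram matching the last two words typed, and back off to bigrams (then to the most frequent unigram) when no match exists.

    predict_next <- function(context, bigram_freq, trigram_freq) {
      # context: character vector of preceding words, e.g. c("one", "of")
      if (length(context) >= 2) {
        prefix <- paste(tail(context, 2), collapse = " ")
        hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
        if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))   # last word of top trigram
      }
      prefix <- tail(context, 1)                                       # back off to bigrams
      hits <- bigram_freq[startsWith(names(bigram_freq), paste0(prefix, " "))]
      if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
      "the"                                                            # unigram fallback: most frequent word
    }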


Key Findings & Recommendations

  • Word usage follows a Zipf-like distribution, so a compact model built from the most frequent words and n-grams can cover most of what users type.
  • Frequent bigrams and trigrams (e.g. "of the", "one of the") provide useful context for prediction.
  • Twitter text is noisier and more mixed-language than blogs or news, so cleaning matters most for that source.
  • The 0.5% sample was sufficient for exploration, but the final model should be trained on a larger sample.

Next Steps

  1. Train the final model on a larger sample
  2. Implement the prediction logic in a Shiny application (a minimal skeleton is sketched after this list)
  3. Build a simple user interface for next-word prediction
  4. Prepare a 5-slide pitch deck explaining the model and app
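
For orientation, a minimal Shiny skeleton that wires the predictor from Task 3 to a text box (a sketch only; the layout and input names are assumptions):

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Type a phrase:"),
      textOutput("prediction")
    )
    server <- function(input, output) {
      output$prediction <- renderText({
        words <- unlist(strsplit(tolower(input$phrase), "\\s+"))
        if (length(words) == 0 || !nzchar(words[1])) return("")
        predict_next(words, bigram_freq, trigram_freq)   # predictor sketched in Task 3
      })
    }
    shinyApp(ui, server)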

This milestone confirms readiness to proceed toward building the final predictive text application.