Introduction

This milestone report explores the SwiftKey dataset provided for the Data Science Capstone project. The goal is to analyze the structure of the text data and outline the plan for building the word prediction algorithm and Shiny application.

Summary Statistics

The dataset consists of text from three different sources: Blogs, News, and Twitter. Below are the summary statistics for the English datasets.

File Source	Line Count	Word Count
Blogs	899,288	37,334,131
News	1,010,242	34,372,530
Twitter	2,360,148	30,373,543
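
Counts like those in the table can be produced by streaming each file line by line. The following is a minimal Python sketch, not the code used for this report; the function name summarize and the in-memory sample are illustrative assumptions, and words are taken to be whitespace-delimited tokens.

```python
import io

def summarize(stream):
    """Return (line_count, word_count) for a text stream, reading line by line."""
    lines = 0
    words = 0
    for line in stream:
        lines += 1
        words += len(line.split())  # whitespace-delimited tokens (an assumption)
    return lines, words

# Tiny in-memory sample standing in for one of the corpus files
sample = io.StringIO("Hello world\nThis is a test line\n")
line_count, word_count = summarize(sample)
```

For a real file, the same function can be passed a handle opened with an explicit encoding such as UTF-8. Word counts depend on the tokenization rule, so a different rule may give totals that differ slightly from the table.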

Exploratory Data Analysis

Initial exploration shows that the datasets need cleaning and preprocessing before modeling: converting text to lowercase, removing punctuation and special characters, and filtering profanity. Word frequencies approximately follow Zipf’s Law: a word's frequency is roughly inversely proportional to its frequency rank, so a small number of words account for a large share of all tokens.
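
The cleaning steps listed above can be sketched as a single function. This is a hedged Python illustration under stated assumptions: the name clean_line is hypothetical, BLOCKED_WORDS is a two-word placeholder rather than a real profanity list, and the character filter keeps only ASCII letters and straight apostrophes.

```python
import re

# Placeholder only; a real project would load a maintained profanity word list
BLOCKED_WORDS = {"darn", "heck"}

def clean_line(text, blocked=BLOCKED_WORDS):
    """Lowercase, strip punctuation/digits/symbols, and drop blocked words."""
    text = text.lower()
    # Replace everything except a-z, apostrophes, and whitespace with a space.
    # Note: this also removes accented letters and curly apostrophes.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Trim stray apostrophes left at token edges, then drop empty tokens
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in blocked]

tokens = clean_line("Hello, World! It's 5pm... darn #rstats")
```

Keeping apostrophes inside words preserves contractions such as "it's", which matter for next-word prediction; whether to keep them is a design choice rather than a requirement.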

Key Findings

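Twitter contains the most lines (2,360,148) but the fewest words (30,373,543), about 13 words per line on average.
Blogs contain the highest total word count (37,334,131) and the longest lines, about 42 words per line compared with about 34 for News.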
Common English words dominate the datasets.
Significant preprocessing will be necessary before prediction modeling.
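
One way to quantify how strongly common words dominate is to count how many distinct words are needed to cover a given share of all tokens. The sketch below is illustrative only; the function name words_for_coverage and the toy sentence are assumptions, and no result from the actual corpus is implied.

```python
from collections import Counter

def words_for_coverage(tokens, target=0.5):
    """Return how many of the most frequent words cover `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)  # reached only for an empty input

toy = "the the the the cat sat on the mat the".split()
half = words_for_coverage(toy, 0.5)
full = words_for_coverage(toy, 1.0)
```

On a corpus that follows Zipf's Law, the number of words needed for 50% coverage is expected to be very small relative to the vocabulary size.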

Prediction Strategy and Plan

The final prediction model will use an N-gram language model with Stupid Backoff scoring.

N-gram Modeling

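The algorithm will count word sequences (bigrams, trigrams, and four-grams) in the training text and predict the next word from the words that precede it. A four-gram model, for example, uses the previous three words as context and ranks candidate next words by how often they followed that context in the corpus.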
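
The counting step can be sketched as a lookup table from each context to the words that followed it. This is a minimal Python illustration, assuming pre-tokenized sentences; build_ngram_table and the three-sentence corpus are hypothetical names and data, not part of the project.

```python
from collections import Counter, defaultdict

def build_ngram_table(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[context][next_word] += 1
    return table

corpus = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["i", "love", "new", "york"],
]
trigrams = build_ngram_table(corpus, 3)
best = trigrams[("love", "new")].most_common(1)
```

Here the context ("love", "new") was followed by "york" twice and "shoes" once, so "york" is the top trigram prediction.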

Stupid Backoff

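If the three-word context needed for a four-gram prediction was never seen in training, the model will back off to the trigram, then the bigram, and finally the unigram model until a prediction can be generated. Stupid Backoff multiplies the score by a fixed penalty at each backoff step (0.4 in the original formulation) rather than computing normalized probabilities, which keeps it fast and simple.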
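
The backoff procedure can be sketched as follows. This is a simplified Python illustration, not the project's implementation: it returns only the top word from the longest matching context, the names train and predict and the toy corpus are assumptions, and ALPHA = 0.4 is the penalty from the original Stupid Backoff formulation.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # fixed backoff penalty (value from the original formulation)

def train(sentences, max_n=4):
    """tables[k] maps a k-word context tuple to a Counter of next words."""
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for tokens in sentences:
        for k in range(max_n):
            for i in range(len(tokens) - k):
                tables[k][tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return tables

def predict(tables, history, max_n=4):
    """Return (word, score), trying the longest context first."""
    penalty = 1.0
    for k in range(min(max_n - 1, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])  # last k words; () when k == 0
        followers = tables[k].get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            return word, penalty * count / sum(followers.values())
        penalty *= ALPHA  # each backoff step discounts the score
    return None, 0.0

corpus = [
    ["i", "love", "new", "york"],
    ["we", "love", "new", "york"],
    ["i", "love", "old", "books"],
]
tables = train(corpus)
seen = predict(tables, ["i", "love", "new"])
backed_off = predict(tables, ["they", "adore", "new"])
```

In the second call neither the three-word nor the two-word context exists, so the model backs off twice and scores the bigram prediction at 0.4 × 0.4 of its relative frequency. The scores are not probabilities and are only meaningful for ranking candidates.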

Shiny Application

The Shiny app will provide a simple user interface where users can enter text and receive a predicted next word instantly.

Performance Optimization

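To improve speed and reduce memory use, N-grams that occur fewer times than a minimum frequency threshold will be pruned from the model. This shrinks the lookup tables at the cost of losing predictions for rare words and phrases.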
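
Pruning by frequency can be sketched in a few lines. In this hedged Python example the function name prune and the threshold min_count=2 are illustrative assumptions; the report does not specify a cutoff.

```python
from collections import Counter

def prune(counts, min_count=2):
    """Keep only n-grams seen at least `min_count` times (threshold is hypothetical)."""
    return Counter({ngram: c for ngram, c in counts.items() if c >= min_count})

bigram_counts = Counter({
    ("of", "the"): 5,
    ("in", "the"): 3,
    ("the", "zyzzyva"): 1,  # seen once, so it is dropped
})
pruned = prune(bigram_counts)
```

Because word frequencies are heavily skewed, a large share of distinct N-grams typically occur only once, so even a low threshold can reduce the table size substantially.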

Conclusion

This report covered loading and initial exploration of the SwiftKey datasets. The analysis showed that the Blogs, News, and Twitter datasets differ in line count, total word count, and typical line length. The next stage of the project will focus on cleaning the data, building N-gram models, and deploying a Shiny application for next-word prediction.

Milestone Report

Jatin Bhardwaj

2026-05-08