Data Science Capstone: Milestone Report

Executive Summary

This report provides a brief exploratory analysis of the SwiftKey training data. We examine the basic features of the three English data files (Blogs, News, and Twitter) in preparation for building a word prediction algorithm.

Data Summary Statistics

The following table summarizes the line and word counts for each of the English datasets.

Summary of Training Data
File_Source	Line_Count	Word_Count
Blogs	899,288	37,334,131
News	1,010,242	34,372,533
Twitter	2,360,148	30,373,543
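
Counts like those in the table can be obtained by streaming each file once. The sketch below is a minimal illustration in Python, not the method used to produce this report: the function name `summarize_text_file` is hypothetical, and it assumes UTF-8 text with words defined as whitespace-delimited tokens. A different tokenizer or encoding choice would give somewhat different totals.

```python
def summarize_text_file(path):
    """Return (line_count, word_count) for a text file.

    Assumptions (not taken from the report): the file is UTF-8 encoded,
    and a "word" is any run of non-whitespace characters.
    """
    line_count = 0
    word_count = 0
    # Read line by line so a large corpus never has to fit in memory at once.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            line_count += 1
            # str.split() with no arguments splits on any whitespace run
            # and returns an empty list for a blank line.
            word_count += len(line.split())
    return line_count, word_count
```

Calling `summarize_text_file` on each of the three files (with whatever paths the data is stored under) returns one `(line_count, word_count)` pair per row of the table. Dividing the word count by the line count then gives the average words per line for each source.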

Mohd Azeem Ansari

2026-01-27
