Executive Summary

This report provides a brief exploratory analysis of the SwiftKey training data. We examine the basic features of the three English data files (Blogs, News, and Twitter) in preparation for building a word prediction algorithm.

Data Summary Statistics

The following table summarizes the line counts and word counts for the three English datasets.

Summary of Training Data

File source    Line count    Word count
Blogs             899,288    37,334,131
News            1,010,242    34,372,533
Twitter         2,360,148    30,373,543
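
As an illustration of how counts like these could be obtained, here is a minimal Python sketch. It is not the code used to produce the table above. It assumes the three files are named en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt and sit in the working directory, and it defines a word as any whitespace-separated token. Both are assumptions, and a different tokenizer may give somewhat different word counts than those reported here.

```python
from pathlib import Path


def count_lines_and_words(lines):
    """Return (line_count, word_count) for an iterable of text lines.

    A "word" here is any whitespace-separated token (an assumption;
    the report's own tokenizer may differ).
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count


def summarize_file(path):
    """Stream a text file line by line so large files never sit fully in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_lines_and_words(handle)


if __name__ == "__main__":
    # Hypothetical file names and location; adjust to the local copy of the data.
    sources = {
        "Blogs": Path("en_US.blogs.txt"),
        "News": Path("en_US.news.txt"),
        "Twitter": Path("en_US.twitter.txt"),
    }
    for name, path in sources.items():
        if not path.exists():
            continue  # skip files that are not present locally
        n_lines, n_words = summarize_file(path)
        print(f"{name:<10}{n_lines:>12,}{n_words:>14,}")
```

Reading the files as a stream keeps memory use low, which matters because each file holds several hundred thousand to a few million lines.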