SwiftKey Data Science Capstone - Exploratory Data Analysis

Introduction

This report presents an exploratory analysis of the English-language text datasets provided for the SwiftKey Data Science Capstone project. The end goal of this capstone is to build a predictive text algorithm and a companion Shiny web app, similar to the predictive keyboard on a smartphone, that suggests the next word as a user types. This report demonstrates that the data has been downloaded and loaded successfully, summarizes its basic characteristics, highlights early findings, and outlines the plan for building the prediction model and app.

Loading the Data

The dataset consists of three English-language text sources: blog posts, news articles, and Twitter posts.

Basic Summary Statistics

The table below summarizes the file size, line count, and word count for each data source.

Corpus Summary Statistics
Source	File Size (MB)	Lines	Words
Blogs	200.4	899288	37546806
News	196.3	77259	2674561
Twitter	159.4	2360148	30096690

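As shown above, all three files are large (roughly 160 to 200 MB each). One figure deserves a second look: Blogs and Twitter average between 5 and 6 bytes per word, while News averages over 70, which suggests the News file may have been read only partially and should be re-checked before the full model is built. To keep the exploratory analysis fast and manageable, a random sample was drawn from each source rather than processing the full corpus.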
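The report's own code is not reproduced here, so the following is only a minimal Python sketch of how figures like those in the table could be computed. The function name `summarize_file` is hypothetical, and it assumes a "word" is any whitespace-delimited token, which may differ from the counting rule behind the table.

```python
import os


def summarize_file(path):
    """Return (size in MB, line count, word count) for one text file.

    Assumption: a word is any run of non-whitespace characters.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    lines = 0
    words = 0
    # Read raw bytes so that no decoding step can skip or reject a line.
    with open(path, "rb") as handle:
        for raw_line in handle:
            lines += 1
            words += len(raw_line.split())
    return round(size_mb, 1), lines, words
```

Comparing the line count returned this way against the one in the table is a quick check on whether a file was read in full.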

Sampling the Data

A 1% random sample was taken from each of the three sources and combined, giving 33365 lines of text to work with for exploration.
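A sampling step of this kind might look like the sketch below, written in Python for illustration. The function name `sample_lines`, the `percent` parameter, and the seed value are assumptions, not taken from the report. Fixing the seed makes the sample reproducible.

```python
import random


def sample_lines(lines, percent=1, seed=1234):
    """Draw a reproducible random sample of `percent`% of the lines.

    The sample size is rounded down to a whole number of lines.
    """
    rng = random.Random(seed)            # local generator, fixed seed
    k = len(lines) * percent // 100      # integer arithmetic, rounds down
    return rng.sample(lines, k)
```

Rounding down per source is consistent with the total reported above: 1% of 899288, 77259, and 2360148 lines gives 8992 + 772 + 23601 = 33365.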

Text Cleaning and Tokenization

The sampled text was assembled into a corpus and tokenized into words, with all text lowercased and punctuation, numbers, and symbols removed.
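As a rough illustration of these cleaning rules, here is a small Python sketch. The function name `tokenize` is hypothetical, and two choices are assumptions rather than details from the report: apostrophes inside words are kept (so "it's" stays one token), and only ASCII letters are retained.

```python
import re

# Anything that is not a lowercase ASCII letter, an apostrophe, or
# whitespace is treated as punctuation, a number, or a symbol.
_UNWANTED = re.compile(r"[^a-z'\s]")


def tokenize(line):
    """Lowercase a line, drop punctuation/numbers/symbols, split into words."""
    text = _UNWANTED.sub(" ", line.lower())
    # Strip apostrophes left dangling at word edges, then drop empty tokens.
    tokens = (tok.strip("'") for tok in text.split())
    return [tok for tok in tokens if tok]


print(tokenize("Hello, World! It's 2024 :)"))  # ['hello', 'world', "it's"]
```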

Word Frequency Analysis

Most Common Single Words (Unigrams)

Most Common Word Pairs (Bigrams)

Most Common Three-Word Sequences (Trigrams)
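The unigram, bigram, and trigram rankings in this section all come from the same counting step with a different window size. A minimal Python sketch follows; `ngram_counts` is a hypothetical name, and the input is assumed to be a list of already-tokenized lines. N-grams are counted within a line only, never across line boundaries.

```python
from collections import Counter


def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens within each tokenized line."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(lines, 2).most_common(20)` would then give the twenty most frequent word pairs and their counts.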

Word Coverage

An important question for building an efficient prediction model is: how many unique words are needed to cover most of the language actually used? The plot below shows cumulative word coverage.

This shows that a relatively small number of unique words accounts for a large fraction of all word usage in the corpus. This is a common property of natural language: a few words are extremely frequent, followed by a "long tail" of words that each occur only rarely. This is useful for the prediction algorithm, since it means the model does not need to store every rare word to be effective.
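The coverage question can be answered directly from a table of word counts: sort words from most to least frequent and accumulate their counts until the target share of all word occurrences is reached. The Python sketch below illustrates this; `words_needed` is a hypothetical name and the input is assumed to be a `Counter` of word frequencies.

```python
from collections import Counter


def words_needed(counts, target):
    """Return how many of the most frequent words cover `target`
    (a fraction between 0 and 1) of all word occurrences."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)
```

For example, `words_needed(unigram_counts, 0.5)` and `words_needed(unigram_counts, 0.9)` would give the vocabulary sizes needed for 50% and 90% coverage.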

Key Findings

All three text sources (blogs, news, Twitter) are large, requiring sampling for efficient exploration.
A relatively small set of common words accounts for the majority of word usage across the corpus.
Common word pairs and three-word sequences show clear patterns (e.g. common phrases), which will form the basis of the prediction model.
Twitter text is much shorter (about 13 words per line on average, versus about 42 for blogs and about 35 for news, based on the summary table) and tends to be more informal.

Plans for the Prediction Algorithm and Shiny App

The next phase of this project will use the word, word-pair, and three-word patterns identified above to build a next-word prediction model:

N-gram model: Build frequency tables of single words, word pairs, and three-word sequences from the full training data.
Backoff strategy: When predicting the next word, the algorithm will first try to match the most recent two words (trigram model). If no match is found, it will “back off” to using just the last word (bigram), and finally to overall word frequency (unigram) if needed.
Efficiency: Very rare word combinations will be pruned to keep the model small and fast enough to run in a web app.
Shiny App: A simple web interface will be built where a user types a phrase into a text box, and the app displays the predicted next word (or a few likely candidates) in real time.
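The n-gram tables, backoff order, and pruning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the planned implementation: the names `build_model` and `predict` and the `min_count` threshold are hypothetical, and the backoff is a plain lookup order (trigram, then bigram, then unigram) with no smoothing or discounting.

```python
from collections import Counter, defaultdict


def build_model(token_lists, max_n=3, min_count=1):
    """Map each context of 0..max_n-1 preceding words to next-word counts.

    Contexts of length 2 act as the trigram table, length 1 as the
    bigram table, and the empty context as overall word frequency.
    """
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k >= 0:
                    tables[k][tuple(tokens[i - k:i])][word] += 1
    # Prune rare continuations (assumed rule: drop counts below min_count).
    for k in range(1, max_n):
        for context in list(tables[k]):
            kept = Counter({w: c for w, c in tables[k][context].items()
                            if c >= min_count})
            if kept:
                tables[k][context] = kept
            else:
                del tables[k][context]
    return tables


def predict(tables, phrase_tokens, top=3):
    """Try the longest available context first, then back off to shorter ones."""
    for k in sorted(tables, reverse=True):
        if k > len(phrase_tokens):
            continue
        context = tuple(phrase_tokens[len(phrase_tokens) - k:])
        candidates = tables[k].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top)]
    return []
```

With tables built from tokenized training lines, `predict(tables, ["i", "love"])` first looks up the two-word context and only falls back to `("love",)` or to overall word frequency if nothing is found. Raising `min_count` shrinks the tables at the cost of losing rare continuations.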

The goal is a lightweight, responsive app that demonstrates practical next-word prediction, similar in spirit to the predictive text feature found on smartphone keyboards.

Conclusion

The data has been successfully downloaded, loaded, and explored. Initial analysis confirms that word usage follows expected natural language patterns, which supports the planned n-gram based approach. The next steps are building the full prediction model and packaging it into an interactive Shiny application.