This report presents an exploratory analysis of the English-language text datasets provided for the SwiftKey Data Science Capstone project. The end goal of this capstone is to build a predictive text algorithm and a companion Shiny web app, similar to the predictive keyboard on a smartphone, that suggests the next word as a user types. This report demonstrates that the data has been downloaded and loaded successfully, summarizes its basic characteristics, highlights early findings, and outlines the plan for building the prediction model and app.
The dataset consists of three English-language text sources: blog posts, news articles, and Twitter posts.
The table below summarizes the size, number of lines, and word counts for each data source.
| Source | File Size (MB) | Lines | Words |
|---|---|---|---|
| Blogs | 200.4 | 899288 | 37546806 |
| News | 196.3 | 77259 | 2674561 |
| Twitter | 159.4 | 2360148 | 30096690 |
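The figures in a table like this can be computed with a short helper. The following is a minimal sketch in Python, used here purely for illustration; it is not the code that produced the table. The function name `summarize_file` and the file path in the usage section are hypothetical, and words are assumed to be whitespace-delimited, so counts may differ slightly from those produced by other tools.

```python
import os


def summarize_file(path):
    """Return (size in MB, line count, word count) for a text file."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    n_lines = 0
    n_words = 0
    # errors="ignore" skips bytes that are not valid UTF-8 (an assumption
    # about how malformed input should be handled).
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            n_lines += 1
            n_words += len(line.split())  # whitespace-delimited tokens
    return round(size_mb, 1), n_lines, n_words


if __name__ == "__main__":
    # Hypothetical path; replace with the location of the downloaded data.
    example_path = "data/blogs.txt"
    if os.path.exists(example_path):
        print(summarize_file(example_path))
```

Reading the file line by line keeps memory use low, which matters for files of this size.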
As shown above, all three files are large. To keep the exploratory analysis fast and manageable, a random sample was drawn from each source rather than processing the full corpus.
A 1% random sample was taken from each of the three sources and combined, giving 33365 lines of text to work with for exploration.
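The sampling step can be sketched as follows, in Python for illustration only. The function name `sample_lines` and the seed value are hypothetical. The sketch assumes that each source's sample size is rounded down to a whole number of lines; the report does not state the rounding rule, although this assumption does reproduce the reported total from the line counts in the table.

```python
import random


def sample_lines(lines, percent=1, seed=1234):
    """Draw a random sample of `percent`% of lines, without replacement.

    The sample size is rounded down (an assumption), and the seed is fixed
    so that the sample is reproducible.
    """
    rng = random.Random(seed)
    k = len(lines) * percent // 100  # integer arithmetic avoids float error
    return rng.sample(lines, k)


# Line counts taken from the summary table.
line_counts = {"blogs": 899288, "news": 77259, "twitter": 2360148}
sample_sizes = {name: n * 1 // 100 for name, n in line_counts.items()}
total_sampled = sum(sample_sizes.values())
print(sample_sizes, total_sampled)
```

Fixing the seed means the same lines are drawn on every run, so later results can be compared across sessions.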
The sampled text was converted into a corpus, converted to lowercase, stripped of punctuation, numbers, and symbols, and then split into word tokens.
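A minimal sketch of this cleaning step, in Python for illustration, is shown below. The function name `tokenize` is hypothetical, and two choices are assumptions rather than details from the analysis: apostrophes are deleted so that contractions stay as one token, and all other punctuation and symbols are replaced with a space.

```python
import re

_APOSTROPHES = re.compile(r"['\u2019]")
# Anything that is not a word character or whitespace (punctuation, symbols),
# plus digits and underscores, which \w would otherwise keep.
_NON_WORD = re.compile(r"[^\w\s]|[\d_]")


def tokenize(text):
    """Lowercase text, strip punctuation, numbers, and symbols, then split."""
    text = text.lower()
    text = _APOSTROPHES.sub("", text)  # "it's" -> "its" (assumed behavior)
    text = _NON_WORD.sub(" ", text)
    return text.split()


print(tokenize("Hello, World! It's 2024 :)"))
```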
An important question for building an efficient prediction model is: how many unique words are needed to cover most of the language actually used? The plot below shows cumulative word coverage.
The plot shows that a relatively small number of unique words accounts for a large fraction of all word usage in the corpus. This is a common property of natural language, often described as a "long tail" distribution: a few words are used very frequently, while most words are rare. This is useful for the prediction algorithm, since the model does not need to store every rare word to be effective.
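The coverage question can be made concrete with a small calculation: sort the unique words by frequency and count how many are needed before their combined frequency reaches a target share of all tokens. The sketch below is in Python for illustration; the function name `words_needed_for_coverage` and the toy input are hypothetical and are not the report's data.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Number of most-frequent unique words whose combined frequency
    reaches `target` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy example only: 8 tokens, 4 unique words.
toy_tokens = "the the the the cat cat dog bird".split()
for share in (0.5, 0.75, 1.0):
    print(share, words_needed_for_coverage(toy_tokens, share))
```

On the toy input, one word covers half of the tokens while all four are needed for full coverage, which mirrors the long-tail shape on a small scale.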
The next phase of this project will use the word, word-pair, and three-word patterns identified above to build a next-word prediction model.
The goal is a lightweight, responsive app that demonstrates practical next-word prediction, similar in spirit to the predictive text feature found on smartphone keyboards.
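One way an n-gram based predictor could work is sketched below, in Python for illustration. This is not the project's final model. The class name `NGramPredictor` is hypothetical, and the simple back-off rule (try the two preceding words, then one, then fall back to the most frequent words overall) is an assumption about how shorter contexts might be used when a longer one has not been seen.

```python
from collections import Counter, defaultdict


class NGramPredictor:
    """Next-word predictor built from unigram, bigram, and trigram counts."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # Maps a context tuple (0 to max_order - 1 words) to next-word counts.
        self.following = defaultdict(Counter)

    def train(self, tokens):
        for n in range(1, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                self.following[tuple(context)][word] += 1

    def predict(self, tokens, k=3):
        """Return up to k suggestions, backing off to shorter contexts."""
        for size in range(self.max_order - 1, -1, -1):
            if size > len(tokens):
                continue
            context = tuple(tokens[len(tokens) - size:])
            candidates = self.following.get(context)
            if candidates:
                return [word for word, _ in candidates.most_common(k)]
        return []


model = NGramPredictor()
model.train("i love you i love cats i love you so much".split())
print(model.predict(["i", "love"]))    # uses the two-word context
print(model.predict(["zzz", "love"]))  # backs off to the one-word context
```

Storing only counts keeps the sketch small; a production model would likely prune rare n-grams, which the coverage analysis suggests can be done with limited loss.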
The data has been successfully downloaded, loaded, and explored. Initial analysis confirms that word usage follows expected natural language patterns, which supports the planned n-gram based approach. The next steps are building the full prediction model and packaging it into an interactive Shiny application.