This report outlines the exploratory data analysis (EDA) phase of the SwiftKey Data Science Capstone project. The goal is to build a predictive text model that suggests the next word as a user types.
In this phase, we have:

- Loaded the three English corpora and summarized their sizes, line counts, and word counts.
- Sampled and cleaned the text to create a manageable working dataset.
- Built unigram, bigram, and trigram frequency tables and visualized the most common patterns.
- Outlined the prediction algorithm and application design.
This analysis confirms the data is suitable for modeling and sets the stage for building a Markov-chain-based prediction application in Shiny.
We begin by loading the three English text corpora: Blogs, News, and Twitter. Before diving into deep analysis, we assess the sheer volume of the data to understand computational requirements.
The table below summarizes the file sizes, line counts, and total word counts.
| Source | Size (MB) | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,806 |
| News | 196.28 | 1,010,206 | 34,761,151 |
| Twitter | 159.36 | 2,360,148 | 30,096,690 |
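The summary above can be produced with a short R helper. This is a minimal sketch: it assumes the three files sit in the working directory under their standard SwiftKey names (`en_US.blogs.txt`, `en_US.news.txt`, `en_US.twitter.txt`) and uses `stringi` for word counting.

```r
library(stringi)

# Summarize one corpus file: size on disk, line count, and word count.
summarize_corpus <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Source  = basename(path),
    Size_MB = round(file.size(path) / 1024^2, 2),
    Lines   = length(lines),
    Words   = sum(stri_count_words(lines))
  )
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
do.call(rbind, lapply(files, summarize_corpus))
```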
To keep computation tractable while still capturing a representative sample of each corpus, we randomly select 1% of each dataset. We then clean the text by removing numbers, punctuation, and extra whitespace.

Note: For this exploratory phase, we remove stopwords (common words such as 'the' and 'and') so that distinctive content words stand out in the visualizations. For the final prediction model, however, stopwords will be retained, as they are critical to sentence structure.
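A sketch of this sampling and cleaning pipeline using the `tm` package is shown below. It assumes `blogs`, `news`, and `twitter` are already loaded as character vectors (one line per element); the seed value and the added lowercasing step are choices of this sketch, not requirements.

```r
library(tm)

set.seed(1234)  # arbitrary seed for reproducible sampling

# Draw a 1% random sample from each corpus.
sampled <- c(sample(blogs,   round(length(blogs)   * 0.01)),
             sample(news,    round(length(news)    * 0.01)),
             sample(twitter, round(length(twitter) * 0.01)))

# Clean the sample: lowercase, then drop numbers, punctuation,
# stopwords (exploratory phase only), and extra whitespace.
corpus <- VCorpus(VectorSource(sampled))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)

# Back to a plain character vector for n-gram counting.
cleaned <- sapply(corpus, as.character)
```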
An N-gram is a contiguous sequence of n items from a given sample of text. We analyze Unigrams (single words), Bigrams (two-word pairs), and Trigrams (three-word sequences) to find the most frequent patterns.
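One way to compute these frequency tables is with `tidytext`'s n-gram tokenizer. This is a minimal sketch, assuming `cleaned` is the character vector produced by the cleaning step above; the column name `ngram` and the helper `count_ngrams` are illustrative.

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = cleaned)

# Count n-gram frequencies; set n = 1, 2, or 3 for uni-, bi-, and trigrams.
count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%
    count(ngram, sort = TRUE)
}

head(count_ngrams(text_df, 3), 10)  # the ten most frequent trigrams
```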
[Visualization results: bar charts of the top unigrams (single words), top bigrams (two-word pairs), and top trigrams (three-word sequences) by frequency.]
The core of the application will be an N-gram Backoff Model:
Input Processing: The user’s input will be cleaned (to match our training data format).
Search Strategy: Look up the last two words of the input in the trigram table; if there is no match, back off to the bigram table using only the last word; if that also fails, fall back to the most frequent unigrams (see the sketch after this list).
Efficiency: To ensure the app responds quickly on the web, we will store the N-grams as compact frequency lookup tables (data frames or data.tables) rather than processing raw text in real time.
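To make the backoff strategy concrete, here is a minimal sketch of the lookup logic using `data.table`. The table layout (columns `w1`, `w2`, `prediction`, `freq`) and the function name `predict_next` are illustrative assumptions, not the final implementation.

```r
library(data.table)

# Assumed precomputed tables:
#   trigrams: w1, w2, prediction, freq
#   bigrams:  w1, prediction, freq
#   unigrams: word, freq
predict_next <- function(input, trigrams, bigrams, unigrams) {
  words <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 2)

  # 1. Try the trigram table with the last two words of the input.
  if (length(words) == 2) {
    hit <- trigrams[w1 == words[1] & w2 == words[2]]
    if (nrow(hit) > 0) return(hit[which.max(freq), prediction])
  }

  # 2. Back off to the bigram table with the last word only.
  hit <- bigrams[w1 == tail(words, 1)]
  if (nrow(hit) > 0) return(hit[which.max(freq), prediction])

  # 3. Final fallback: the most frequent unigram overall.
  unigrams[which.max(freq), word]
}
```

Keying the tables on the preceding words keeps each prediction to a few indexed lookups, which is what allows the Shiny app to respond as the user types.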