1. Executive Summary

This report outlines the exploratory data analysis (EDA) phase of the SwiftKey Data Science Capstone project. The goal is to build a predictive text model that suggests the next word as a user types.

In this phase, we have:

  • Loaded the three English corpora (Blogs, News, and Twitter) and summarized their sizes, line counts, and word counts.
  • Sampled and cleaned the data to keep the analysis computationally tractable.
  • Examined the most frequent unigrams, bigrams, and trigrams.

This analysis confirms the data is suitable for modeling and sets the stage for building a Markov-chain based prediction application in Shiny.

2. Data Loading and Basic Summaries

We begin by loading the three English text corpora: Blogs, News, and Twitter. Before any deeper analysis, we assess the volume of the data to understand its computational requirements.

The table below summarizes the file sizes, line counts, and total word counts.

Table 1: Summary of Raw Data

  Source     Size (MB)      Lines        Words
  Blogs         200.42    899,288   37,546,806
  News          196.28  1,010,206   34,761,151
  Twitter       159.36  2,360,148   30,096,690
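
As a rough sketch, these summaries can be reproduced in R along the following lines (the file paths are assumptions; adjust them to the local corpus location):

```r
library(stringi)

# Hypothetical local paths to the raw corpora
files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt",
           Twitter = "final/en_US/en_US.twitter.txt")

summaries <- lapply(names(files), function(src) {
  lines <- readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE)
  data.frame(Source  = src,
             Size_MB = round(file.size(files[[src]]) / 1024^2, 2),
             Lines   = length(lines),
             Words   = sum(stri_count_words(lines)))
})
do.call(rbind, summaries)
```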

3. Sampling and Cleaning

To keep processing tractable while still capturing a representative sample of the language, we sample 1% of each dataset. We then clean the text by removing numbers, punctuation, and extra whitespace.

Note: For this exploratory phase, we remove stopwords (common words like ‘the’ and ‘and’) so the visualizations highlight distinctive content. For the final prediction model, however, stopwords will be retained, as they are critical to sentence structure.
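
A minimal sketch of the sampling and cleaning steps, assuming the corpus is loaded with readLines and that the tm package supplies the stopword list:

```r
set.seed(1234)  # reproducible sampling

# Hypothetical path; 'blogs' holds one raw line per element
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Keep a 1% random sample of lines
blogs_sample <- sample(blogs, round(length(blogs) * 0.01))

# Remove numbers, punctuation, and extra whitespace
# (lowercasing is an added assumption here, for consistent token matching)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}
blogs_clean <- clean_text(blogs_sample)

# Exploratory step only: strip stopwords (they are kept for the final model)
blogs_nostop <- tm::removeWords(blogs_clean, tm::stopwords("en"))
```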

4. N-Gram Analysis

An N-gram is a contiguous sequence of n items from a given sample of text. We analyze Unigrams (single words), Bigrams (two-word pairs), and Trigrams (three-word sequences) to find the most frequent patterns.
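
For illustration, one way to tabulate N-gram frequencies is with the tidytext package; the corpus_df input below is a hypothetical stand-in for the cleaned sample:

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one cleaned line of text per row
corpus_df <- data.frame(text = c("thanks for the follow",
                                 "cant wait to see you"))

# token = "ngrams" with n = 1, 2, or 3 yields uni-, bi-, or trigrams
trigrams <- corpus_df |>
  unnest_tokens(ngram, text, token = "ngrams", n = 3) |>
  count(ngram, sort = TRUE)

head(trigrams, 10)
```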

Visualization Results

(Figures: frequency bar charts of the top unigrams (single words), bigrams (two-word pairs), and trigrams (three-word sequences) from the cleaned sample.)

Interesting Findings:

  • Context Matters: Even with stopwords removed, bigrams like “right_now” and “last_year” show that time-based context is very common.
  • Twitter Influence: Trigrams like “happy_mothers_day” indicate distinct seasonal or event-based trends in the sampled data.

5. Plan for Prediction Algorithm and Shiny App

The Prediction Algorithm

The core of the application will be an N-gram Backoff Model:

  • Input Processing: The user’s input will be cleaned (to match our training data format).

  • Search Strategy (a minimal lookup sketch follows this list):

    • First, use the last 3 words typed to match a 4-gram (quadgram) and predict the fourth word.
    • If no match is found, “back off” to the last 2 words and check Trigrams.
    • If still no match, back off to Bigrams.
    • As a last resort, use the most common Unigrams.
  • Efficiency: To keep the deployed app responsive, we will store the N-grams in compact frequency lookup tables (data frames or data.tables) rather than processing the raw text in real time.
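
A minimal sketch of the backoff lookup, assuming pre-built data.table objects keyed by an N-gram prefix (the table contents and column names here are illustrative, not final):

```r
library(data.table)

# Hypothetical lookup tables: 'prefix' holds the preceding words,
# 'prediction' the most likely next word for that prefix.
quadgrams <- data.table(prefix = "cant wait to", prediction = "see", key = "prefix")
trigrams  <- data.table(prefix = "wait to",      prediction = "see", key = "prefix")
bigrams   <- data.table(prefix = "to",           prediction = "the", key = "prefix")
top_unigram <- "the"  # most frequent word overall

predict_next <- function(input) {
  # Clean the input the same way as the training data
  words <- strsplit(trimws(gsub("[^a-z ]+", " ", tolower(input))), "\\s+")[[1]]
  n <- length(words)

  # Try the longest context first, then back off to shorter N-grams
  for (k in 3:1) {
    if (n >= k) {
      tbl <- list(bigrams, trigrams, quadgrams)[[k]]
      key <- paste(tail(words, k), collapse = " ")
      hit <- tbl[.(key), prediction, nomatch = NULL]
      if (length(hit)) return(hit[1])
    }
  }
  top_unigram  # last resort: the most common unigram
}

predict_next("I cant wait to")   # returns "see"
```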