Data Science Capstone: Milestone Report

Executive Summary

The goal of this project is to develop a predictive text algorithm and a Shiny application that mimics “Smart Keyboard” functionality.
This report explores the dataset provided by SwiftKey, summarizes its major features, and outlines the roadmap for the final prediction model.

1. Data Summary

The analysis is based on a large corpus of text from three sources: Blogs, News, and Twitter.
Below is a summary of the raw data files.

Table 1: Summary Statistics of English US Corpora
Source	Lines	Words
Blogs	899288	37546250
News	1010242	34762395
Twitter	2360148	30093413
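
The counts in Table 1 can be reproduced along the following lines. This is a minimal base R sketch, not the code used for the report: the helper name `summarize_file` is hypothetical, a "word" is assumed to be any run of non-whitespace characters, and the demo writes a small temporary file in place of the real corpus files.

```r
# Count lines and whitespace-separated words in one text file.
# Assumption: a "word" is any run of non-whitespace characters.
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # trimws() avoids an empty first token on lines with leading whitespace
  words <- sum(lengths(strsplit(trimws(lines), "\\s+")))
  data.frame(source = basename(path), lines = length(lines), words = words)
}

# Demo on a temporary file standing in for one of the corpus files
demo_path <- tempfile(fileext = ".txt")
writeLines(c("Hello world", "", "one two three"), demo_path)
demo_summary <- summarize_file(demo_path)
print(demo_summary[, c("lines", "words")])
```

Applying the helper to each of the three source files and stacking the results with `rbind()` would give a table of the same shape as Table 1. Other definitions of a word (for example, counting after cleaning) would give different totals.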

2. Sampling and Cleaning

Because the files are very large, a 1% sample was taken from each source to perform exploratory analysis.
During the cleaning process:

Text was converted to lowercase.
Punctuation and numbers were removed using regular expressions.
Profanity was filtered using the Carnegie Mellon University “Bad Words” list.
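
The sampling and cleaning steps above could be sketched as follows. This is an illustration under stated assumptions rather than the report's actual code: the function names, the seed, and the `banned` argument (standing in for the profanity list) are hypothetical, and punctuation and digits are simply deleted, so hyphenated words and contractions are fused into single tokens.

```r
set.seed(2024)  # hypothetical seed, only so the sample is reproducible

# Keep each line independently with probability `rate` (about 1% by default).
sample_lines <- function(lines, rate = 0.01) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

# Lowercase, delete punctuation and digits, split on whitespace,
# then drop any token that appears in `banned`.
clean_tokens <- function(lines, banned = character(0)) {
  x <- tolower(lines)
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  tokens <- unlist(strsplit(x, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% banned]
}

demo_tokens <- clean_tokens(
  c("Hello, World! 123", "That darn cat's 2 hats."),
  banned = "darn"  # placeholder for the real word list
)
print(demo_tokens)
```

Sampling line by line with `rbinom()` avoids having to know the exact number of lines to draw in advance; drawing a fixed 1% with `sample()` would be an equally reasonable choice.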

Table 2: Sample of Cleaned Text
last	worked	that	he
week	up	were	down
he	about	not	in
was	some	actually	his
so	things	real	bed

3. Exploratory Data Analysis

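    library(dplyr)   # %>%, count(), mutate()
    library(tibble)  # tibble()
    library(cld2)    # detect_language()

    # Frequency table: one row per unique word, most frequent first,
    # with each word's share of all tokens and the running total of that share
    word_freq <- tibble(word = cleaned_tokens) %>%
        count(word, sort = TRUE) %>%
        mutate(freq = n / sum(n), cumulative = cumsum(freq))

    # Coverage analysis: number of top-ranked words needed to cover
    # 50% and 90% of all word instances, and the word at each threshold
    count_50 <- which(word_freq$cumulative >= 0.5)[1]
    cover_50 <- word_freq$word[count_50]

    count_90 <- which(word_freq$cumulative >= 0.9)[1]
    cover_90 <- word_freq$word[count_90]

    # Foreign language analysis: detect a language for each unique word;
    # words with no detected language (NA) are not counted as foreign
    foreign_check <- word_freq %>%
        mutate(detected_lang = detect_language(word),
               is_foreign = !is.na(detected_lang) & detected_lang != "en")
    percent_foreign <- paste0(round(mean(foreign_check$is_foreign) * 100, 2), "%")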

Word Frequency & Coverage:

A relatively small number of unique words accounts for most of the language used. In the sample, the 135 most frequent words are enough to cover 50% of all word instances.
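
To make the coverage idea concrete, here is a small base R sketch on made-up counts. The helper `words_for_coverage` and the toy vocabulary are hypothetical and are not taken from the corpus.

```r
# Smallest number of top-ranked words whose counts add up to at least
# `target` (a proportion between 0 and 1) of all word instances.
words_for_coverage <- function(counts, target) {
  counts <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(counts) / sum(counts)
  unname(which(cumulative >= target)[1])
}

# Toy counts: 100 word instances spread over 5 unique words
toy_counts <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)

needed_50 <- words_for_coverage(toy_counts, 0.5)  # "the" alone reaches 50%
needed_90 <- words_for_coverage(toy_counts, 0.9)  # four words reach 95%
print(c(needed_50, needed_90))
```

In real text the frequency distribution is heavily skewed in the same way, which is why each extra point of coverage requires many more words than the previous one.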

Foreign Language Detection:

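Using the cld2 language detection library, approximately 4.53% of the unique words in the sample were flagged as non-English. Words for which cld2 returned no language were not counted as foreign, and because the detector was applied to single words rather than full sentences, this figure should be read as a rough estimate.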

N-Gram Distributions:

The plots below show the top 20 Unigrams (single words), Bigrams (two-word phrases), and Trigrams (three-word phrases).
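
The n-gram counts behind such plots could be built along these lines. This is a base R sketch with hypothetical helper names (`make_ngrams`, `top_ngrams`); it works on a single token vector, so in practice it would be applied line by line to avoid n-grams that span two unrelated lines.

```r
# Build all n-grams (as space-joined strings) from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Frequency table of the k most common n-grams.
top_ngrams <- function(tokens, n, k = 20) {
  head(sort(table(make_ngrams(tokens, n)), decreasing = TRUE), k)
}

toy_tokens <- c("the", "cat", "sat", "on", "the", "cat")
toy_bigrams <- make_ngrams(toy_tokens, 2)
top_bigram <- top_ngrams(toy_tokens, 2, k = 1)
print(top_bigram)
```

For the full sample, a dedicated tokenization package would likely be faster than this loop-based version; the sketch is only meant to show what is being counted.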

4. Plan for Prediction Algorithm and Shiny App

Moving forward, the prediction strategy will rely on a Stupid Backoff model:

Trigram Match: The algorithm first looks up the last two words typed in the Trigram database and suggests the most frequent continuation.
Bigram Backoff: If no match is found, it “backs off” and looks up the last single word in the Bigram database.
Default: If both lookups fail, it suggests the most frequent word in the English language (“the”).
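
The three steps above can be sketched in base R as follows. The table layout (columns `prefix`, `word`, `n`), the function names, and the toy counts are all assumptions made for illustration. The sketch returns the most frequent continuation at the first level that has a match and omits the score discounting used in the full Stupid Backoff formulation.

```r
# Toy n-gram tables (hypothetical counts, for illustration only)
trigrams <- data.frame(
  prefix = c("i love", "i love", "love you"),
  word   = c("you", "it", "too"),
  n      = c(5, 3, 2),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  prefix = c("love", "you", "you"),
  word   = c("you", "are", "too"),
  n      = c(8, 6, 4),
  stringsAsFactors = FALSE
)

# Most frequent continuation of `prefix`, or NA if the prefix is unseen.
best_match <- function(ngram_table, prefix) {
  hits <- ngram_table[ngram_table$prefix == prefix, , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word[which.max(hits$n)]
}

# Try the trigram table, then the bigram table, then the default word.
predict_next <- function(text, trigrams, bigrams, default = "the") {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  k <- length(tokens)
  if (k >= 2) {
    guess <- best_match(trigrams, paste(tokens[k - 1], tokens[k]))
    if (!is.na(guess)) return(guess)
  }
  if (k >= 1) {
    guess <- best_match(bigrams, tokens[k])
    if (!is.na(guess)) return(guess)
  }
  default
}

print(predict_next("I love", trigrams, bigrams))       # trigram match
print(predict_next("thank you", trigrams, bigrams))    # bigram backoff
print(predict_next("hello world", trigrams, bigrams))  # default
```

In the real application the input text would need to go through the same cleaning steps as the training sample before lookup, so that typed words match the keys in the tables.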

The final Shiny App will feature a reactive interface where the user can type text, and the top predicted word will appear instantly.
To ensure the app is fast and lightweight for mobile simulation, the N-gram tables will be optimized and pruned of very low-frequency entries.
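
One way the pruning could look is sketched below. The thresholds `min_count` and `top_k`, the function name, and the table layout (columns `prefix`, `word`, `n`) are assumptions for illustration; the report does not fix specific values, and keeping only a few continuations per prefix is an optional extra step beyond removing low-frequency entries.

```r
# Drop n-grams seen fewer than `min_count` times, then keep at most
# `top_k` of the most frequent continuations for each prefix.
prune_ngrams <- function(ngram_table, min_count = 2, top_k = 3) {
  kept <- ngram_table[ngram_table$n >= min_count, , drop = FALSE]
  if (nrow(kept) == 0) return(kept)
  kept <- kept[order(kept$prefix, -kept$n), , drop = FALSE]
  # Rank rows within each prefix (1 = most frequent continuation)
  rank_in_prefix <- ave(kept$n, kept$prefix, FUN = seq_along)
  kept[rank_in_prefix <= top_k, , drop = FALSE]
}

# Toy table (hypothetical counts)
toy_table <- data.frame(
  prefix = c("i love", "i love", "i love", "i love", "love you"),
  word   = c("you", "it", "him", "cake", "too"),
  n      = c(5, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

pruned <- prune_ngrams(toy_table, min_count = 2, top_k = 2)
print(pruned)
```

Because the app only needs to show the top prediction, a small `top_k` keeps the lookup tables compact at little cost to the suggestions; the right thresholds would have to be chosen by measuring table size and accuracy on held-out text.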