Executive Summary

The goal of this project is to develop a predictive text algorithm and a Shiny application that mimics “Smart Keyboard” functionality.
This report explores the dataset provided by SwiftKey, summarizes its major features, and outlines the roadmap for the final prediction model.

1. Data Summary

The analysis is based on a large corpus of text from three sources: Blogs, News, and Twitter.
Below is a summary of the raw data files.

Table 1: Summary Statistics of English US Corpora
Source     Lines      Words
Blogs      899288     37546250
News       1010242    34762395
Twitter    2360148    30093413

2. Sampling and Cleaning

Because the files are very large, a 1% sample was taken from each source to perform exploratory analysis.
During the cleaning process:

Table 2: Sample of Cleaned Text
last worked that he
week up were down
he about not in
was some actually his
so things real bed
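
The sampling and tokenizing steps can be sketched in base R as follows. This is a minimal illustration, not the exact pipeline used for this report: the helper names `sample_lines` and `tokenize`, the seed, and the cleaning rules (lowercasing, keeping only letters and apostrophes) are assumptions, and a toy vector stands in for the real corpus files.

```r
# Hypothetical helper: keep each line independently with probability `rate`.
sample_lines <- function(lines, rate = 0.01, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  keep <- rbinom(length(lines), size = 1, prob = rate) == 1
  lines[keep]
}

# Hypothetical minimal cleaner (assumed rules, not the report's own):
# lowercase, replace anything that is not a letter, apostrophe, or space,
# then split on runs of spaces and drop empty strings.
tokenize <- function(lines) {
  txt <- tolower(lines)
  txt <- gsub("[^a-z' ]+", " ", txt)
  tokens <- unlist(strsplit(txt, " +"))
  tokens[nzchar(tokens)]
}

# Toy stand-in for the lines read from one corpus file
toy_lines <- sprintf("Line number %d, with Some TEXT!", 1:10000)
sampled <- sample_lines(toy_lines)
cleaned_tokens <- tokenize(sampled)
head(cleaned_tokens)
```

Drawing one Bernoulli trial per line gives a sample of roughly 1% rather than exactly 1%, and it can be applied chunk by chunk if a file is too large to hold in memory at once.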

3. Exploratory Data Analysis

    library(dplyr)   # %>%, count(), mutate()
    library(tibble)  # tibble()
    library(cld2)    # detect_language()

    # Word frequencies; cleaned_tokens is the character vector of sampled, cleaned tokens
    word_freq <- tibble(word = cleaned_tokens) %>%
        count(word, sort = TRUE) %>%
        mutate(freq = n / sum(n), cumulative = cumsum(freq))

    # Coverage analysis: rank of the first word at which cumulative frequency
    # reaches 50% / 90%, i.e. the number of unique words needed for that coverage
    count_50 <- which(word_freq$cumulative >= 0.5)[1]
    count_90 <- which(word_freq$cumulative >= 0.9)[1]
    cover_50 <- word_freq$word[count_50]
    cover_90 <- word_freq$word[count_90]

    # Foreign language analysis: flag unique words that cld2 labels as non-English
    foreign_check <- word_freq %>%
        mutate(detected_lang = detect_language(word),
               is_foreign = !is.na(detected_lang) & detected_lang != "en")
    percent_foreign <- paste0(round(mean(foreign_check$is_foreign) * 100, 2), "%")

Word Frequency & Coverage:

A relatively small number of unique words accounts for most of the language used. Only 135 unique words are needed to cover 50% of all word instances in the sample.

Foreign Language Detection:

Based on the cld2 language detection library, approximately 4.53% of the unique words in the sample appear to be from foreign languages.

N-Gram Distributions:

The plots below show the top 20 Unigrams (single words), Bigrams (two-word phrases), and Trigrams (three-word phrases).
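
As an illustration of how n-gram counts of this kind could be produced, here is a minimal base R sketch. The helpers `make_ngrams` and `top_ngrams` and the toy token vector are hypothetical and are not taken from the report's own code.

```r
# Hypothetical helper: build all n-grams from a token vector by sliding a
# window of width n. Note that it does not respect sentence or line
# boundaries, so n-grams can span two adjacent lines of the sample.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: the k most frequent n-grams with their counts.
top_ngrams <- function(tokens, n, k = 20) {
  counts <- sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
  head(counts, k)
}

# Toy tokens (made up for illustration)
toy_tokens <- c("i", "love", "r", "and", "i", "love", "data")
top_ngrams(toy_tokens, n = 2, k = 3)
```

Calling `top_ngrams()` with `n = 1`, `n = 2`, and `n = 3` yields the unigram, bigram, and trigram tables that a bar chart of the top 20 entries would be drawn from.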

4. Plan for Prediction Algorithm and Shiny App

Moving forward, the prediction strategy will rely on a Stupid Backoff model.
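
Stupid Backoff scores a candidate word by its relative frequency after the longest matching context, and multiplies the score by a fixed factor each time it falls back to a shorter context (0.4 is the value commonly used). The base R sketch below shows the idea on toy counts. The function name `stupid_backoff`, the table layout (named count vectors, one per n-gram order), and the counts themselves are assumptions for illustration, not the final implementation.

```r
# Toy count tables (made-up values); names are space-joined tokens.
unigrams <- c("the" = 5, "cat" = 2, "sat" = 2, "dog" = 1)
bigrams  <- c("the cat" = 2, "the dog" = 1, "cat sat" = 2)
trigrams <- c("the cat sat" = 2)
ngram_counts <- list(unigrams, bigrams, trigrams)  # index k holds the k-grams

# Hypothetical scorer: returns candidate next words sorted by score.
# Assumes every context found in a k-gram table also exists in the
# (k-1)-gram table.
stupid_backoff <- function(context, ngram_counts, alpha = 0.4) {
  max_order <- length(ngram_counts)
  context <- tail(context, max_order - 1)  # keep only the usable history
  scores <- numeric(0)
  penalty <- 1
  repeat {
    k <- length(context) + 1
    counts <- ngram_counts[[k]]
    if (k == 1) {
      cand <- counts / sum(counts)  # unigram relative frequency
    } else {
      prefix <- paste(context, collapse = " ")
      hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
      cand <- numeric(0)
      if (length(hits) > 0) {
        cand <- hits / ngram_counts[[k - 1]][[prefix]]
        names(cand) <- sub(".* ", "", names(hits))  # keep the last word only
      }
    }
    # A word keeps the score from the longest context in which it was seen
    new <- cand[!names(cand) %in% names(scores)]
    scores <- c(scores, penalty * new)
    if (k == 1) break
    context <- context[-1]       # back off: drop the oldest word
    penalty <- penalty * alpha   # discount the shorter context
  }
  sort(scores, decreasing = TRUE)
}

names(stupid_backoff(c("the", "cat"), ngram_counts))[1]
```

The scores are not normalized probabilities, which is what makes the method cheap: only raw counts need to be stored and no smoothing parameters have to be estimated.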

The final Shiny App will feature a reactive interface where the user can type text, and the top predicted word will appear instantly.
To ensure the app is fast and lightweight for mobile simulation, the N-gram tables will be optimized and pruned of very low-frequency entries.
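
Pruning can be as simple as dropping every n-gram whose count falls below a threshold. The sketch below assumes counts are stored as a named vector. The helper `prune_counts`, the threshold of 2, and the toy counts are illustrative assumptions, and the right cutoff would need to be tuned against accuracy and table size.

```r
# Hypothetical helper: drop n-grams seen fewer than `min_count` times.
prune_counts <- function(counts, min_count = 2) {
  counts[counts >= min_count]
}

# Toy bigram counts (made-up values for illustration only)
bigram_counts <- c("of the" = 40, "in the" = 35,
                   "purple zebra" = 1, "zebra of" = 1)
pruned <- prune_counts(bigram_counts)
names(pruned)
```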