Executive Summary

The goal of this project is to build a predictive text model, similar to the one used in SwiftKey smart keyboards. When a user types a word, the algorithm will predict the most likely next word.

This milestone report outlines the initial exploratory data analysis (EDA) of the English text corpora provided for training: blogs, news, and Twitter feeds. The objective is to understand the distribution of words, the frequency of phrases (N-grams), and to lay out a clear, non-technical roadmap for the final predictive application.


1. Data Acquisition and Processing

The raw data consists of three large text files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt). Because each file runs to hundreds of megabytes, processing them in full is slow and unnecessary for initial exploration.

Instead, a systematic sampling approach is used: we keep every 100th line, which yields a 1% sample of each file. The sampled text is then lowercased and cleaned by removing punctuation, symbols, numbers, URLs, and standard English stop words, so that the frequency counts reflect content words rather than filler.

# Load libraries and import data
pacman::p_load(data.table, quanteda, ggplot2)

if(!file.exists("Coursera-SwiftKey.zip")) {
  options(timeout = 600)
  download.file(
    url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip", mode = "wb") # mode = "wb" (write binary) 
  unzip("Coursera-SwiftKey.zip")
}


# Define paths
en_path <- "./final/en_US"
files <- list.files(en_path, full.names = TRUE)
names(files) <- basename(files)

# Systematic sampling function: keep every `step`-th line (1% by default)
sample_systematic <- function(file_path, sample_rate = 0.01, chunk_size = 50000) {
  step <- floor(1 / sample_rate)
  con <- file(file_path, "rb")
  on.exit(close(con))

  sampled <- list()   # one character vector per chunk, combined once at the end
  total_read <- 0     # number of lines read so far across all chunks

  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, encoding = "UTF-8")
    if (length(chunk) == 0) break
    # Global line numbers of this chunk; keep those divisible by `step`
    keep <- (total_read + seq_along(chunk)) %% step == 0
    sampled[[length(sampled) + 1]] <- chunk[keep]
    total_read <- total_read + length(chunk)
  }
  # as.character() turns the NULL from an empty file into character(0)
  list(lines = as.character(unlist(sampled, use.names = FALSE)),
       total_lines = total_read)
}

# Extract and process data
results <- list()
for (f in files) {
  file_name <- basename(f)
  sample_data <- sample_systematic(f, sample_rate = 0.01)
  
  # Tokenization using quanteda
  corp <- corpus(sample_data$lines)
  toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
                 remove_numbers = TRUE, remove_url = TRUE) |>
          tokens_tolower() |>
          tokens_select(pattern = stopwords("en"), selection = "remove")
  
  words <- as.character(toks)
  freq_table <- sort(table(words), decreasing = TRUE)
  
  results[[file_name]] <- list(
    total_lines_original = sample_data$total_lines,
    sample_lines = length(sample_data$lines),
    total_words_sample = length(words),
    frequencies = freq_table
  )
}

2. Summary Statistics

Before building a model, it is crucial to understand the size and scope of our data. The table below shows, for each source, the total number of lines in the original file, the number of lines in our 1% sample, the number of words left in that sample after cleaning (stop words removed), and the most frequent of those words.

Source File         Total Lines (Raw)   Lines (1% Sample)   Word Count (Sample)   Top Word
en_US.blogs.txt     899288              8992                189707                one
en_US.news.txt      1010242             10102               195776                said
en_US.twitter.txt   2360148             23601               167635                just
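
A table like the one above can be assembled directly from the `results` list built in Section 1. The sketch below is one possible way to do it in base R. To stay self-contained it uses a small toy `results` list whose file names and numbers are made up for illustration; only the list structure mirrors the real one.

```r
# Toy stand-in for the `results` list from Section 1. File names and values
# are invented for illustration; they are NOT taken from the corpus.
toy_words_a <- c("cat", "cat", "cat", "dog", "dog", "bird")
toy_words_b <- c("tree", "tree", "leaf")
results <- list(
  "a.txt" = list(
    total_lines_original = 1000L,
    sample_lines         = 10L,
    total_words_sample   = length(toy_words_a),
    frequencies          = sort(table(toy_words_a), decreasing = TRUE)
  ),
  "b.txt" = list(
    total_lines_original = 500L,
    sample_lines         = 5L,
    total_words_sample   = length(toy_words_b),
    frequencies          = sort(table(toy_words_b), decreasing = TRUE)
  )
)

# One row per source file; the top word is the first name of the sorted table.
summary_table <- do.call(rbind, lapply(names(results), function(nm) {
  r <- results[[nm]]
  data.frame(
    source_file       = nm,
    total_lines_raw   = r$total_lines_original,
    lines_sample      = r$sample_lines,
    word_count_sample = r$total_words_sample,
    top_word          = names(r$frequencies)[1],
    stringsAsFactors  = FALSE
  )
}))

print(summary_table, row.names = FALSE)
```

Replacing the toy list with the real `results` object should produce the same columns for the three corpus files.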

3. Exploratory Data Analysis

Most Frequent Words

Counting the words reveals differences in vocabulary across the three sources. For instance, the most frequent word in the news sample is "said", which reflects reported speech, while the Twitter sample is led by "just", a more conversational word.

Understanding Phrasing (N-Grams)

Single-word frequencies alone are not enough to predict the next word. We must also look at word pairs (bigrams) and triplets (trigrams). The algorithm will use the frequency of these word combinations to guess what the user will type next.

(Note: In the final model, stop words will be retained, as they are essential for natural sentence formation.)
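
As a minimal sketch of how bigram and trigram frequencies can be counted, the base R code below builds n-grams line by line and tabulates them. The helper `make_ngrams` and the three toy sentences are hypothetical, introduced only for this illustration; on the real sample, quanteda's `tokens_ngrams()` would be the more natural tool. Stop words are kept here, in line with the note above.

```r
# Build all n-grams of order `n` from one tokenized line.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy sentences, invented for illustration; not drawn from the corpus.
lines <- c("I want to go home", "I want to sleep", "we want to go out")

# Minimal cleaning: lowercase and split on whitespace, keeping stop words.
token_list <- strsplit(tolower(lines), "\\s+")

# Build n-grams per line so that no n-gram spans two different lines.
bigrams  <- unlist(lapply(token_list, make_ngrams, n = 2))
trigrams <- unlist(lapply(token_list, make_ngrams, n = 3))

bigram_freq  <- sort(table(bigrams),  decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)

print(head(bigram_freq, 3))
print(head(trigram_freq, 3))
```

In this toy input, "want to" is the most frequent bigram because it occurs in all three sentences, which is the kind of pattern the prediction model relies on.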


4. Goals for the Prediction Algorithm and App

Moving forward, the strategy to build the predictive text product involves the following steps:

  1. N-Gram Modeling: We will construct an N-gram language model. It rests on the Markov assumption that the next word depends only on the few words immediately before it, so the model estimates the probability of a word given the 1, 2, or 3 words that preceded it.
  2. Memory Optimization: Mobile applications have limited memory. We will prune our frequency tables, keeping only the most frequent word combinations that together account for 95% of the occurrences observed in the sample, and discarding extremely rare combinations and typos.
  3. Handling Unseen Phrases (Backoff Strategy): If a user types a phrase the model has never seen, it will “back off” to a shorter phrase. For example, if the model doesn’t recognize the 3-word phrase preceding the blank, it will check the last 2 words, then the last word alone, and finally fall back to the most frequent words overall.
  4. Shiny Application Deployment: The final product will be a user-friendly web interface. It will feature a simple text box. As the user types, the application will instantly display the top 3 most likely next words, demonstrating the speed and accuracy of the algorithm.
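
The modeling, pruning, and backoff steps above can be sketched together in a few lines of base R. This is an illustration under stated assumptions, not the final implementation: the function names (`count_ngrams`, `prune_by_coverage`, `predict_next`) and the toy training sentences are hypothetical, the model stops at trigrams (two words of context), and the backoff is the plain "try a shorter context" variant with no probability discounting.

```r
# Toy training text, invented for illustration only.
lines <- c("i want to go home", "i want to go out", "i want to sleep",
           "we want to go home", "you have to go now")
token_list <- strsplit(lines, " ", fixed = TRUE)

# Count n-grams of order n and split each into (context, next word).
# Rows come back sorted by descending count.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tk) {
    if (length(tk) < n) return(character(0))
    vapply(seq_len(length(tk) - n + 1),
           function(i) paste(tk[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq  <- sort(table(grams), decreasing = TRUE)
  parts <- strsplit(names(freq), " ", fixed = TRUE)
  data.frame(
    context = vapply(parts, function(p) paste(p[-n], collapse = " "), character(1)),
    nxt     = vapply(parts, function(p) p[n], character(1)),
    count   = as.integer(freq),
    stringsAsFactors = FALSE
  )
}

# Keep the most frequent rows that together account for `coverage` of all
# n-gram occurrences; rarer rows are dropped.
prune_by_coverage <- function(tbl, coverage = 0.95) {
  cum_share <- cumsum(tbl$count) / sum(tbl$count)
  tbl[seq_len(which(cum_share >= coverage)[1]), , drop = FALSE]
}

# models[[k]] holds the pruned k-gram table (k = 1, 2, 3).
models <- lapply(1:3, function(n) prune_by_coverage(count_ngrams(token_list, n)))

# Try the longest available context first, then back off to shorter ones;
# a context of length 0 falls back to the most frequent single words.
predict_next <- function(text, models, top_n = 3) {
  words   <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  max_ctx <- length(models) - 1
  for (k in seq(min(max_ctx, length(words)), 0)) {
    context <- paste(tail(words, k), collapse = " ")
    tbl     <- models[[k + 1]]
    hits    <- tbl[tbl$context == context, ]
    if (nrow(hits) > 0) return(head(hits$nxt, top_n))
  }
  character(0)
}

print(predict_next("i want to", models))          # trigram context "want to"
print(predict_next("we really need to", models))  # backs off to bigram context "to"
print(predict_next("zzz", models))                # backs off to unigrams
```

Storing each n-gram as a (context, next word, count) row keeps lookup simple and makes pruning a matter of dropping rows. A production version would likely use keyed `data.table` lookups and a scored backoff scheme, which this sketch does not attempt.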