Executive Summary

This milestone report presents an exploratory analysis of text data for developing a predictive text application. The analysis examines three large text corpora (blogs, news, and Twitter) to understand their characteristics and inform the development of a word prediction algorithm. Key findings include significant differences in text length and vocabulary across sources, with Twitter showing the most constrained format and blogs the most varied. The next phase will focus on building an n-gram based prediction model with a Shiny web interface.

Introduction

The goal of this project is to build a text prediction application similar to those used in smartphone keyboards. This report demonstrates:

  1. Successful data loading and preprocessing
  2. Basic statistical summaries of the text corpora
  3. Exploratory visualizations highlighting key features
  4. Plans for the prediction algorithm and Shiny app

Data Acquisition and Loading

Download Dataset

The data comes from the HC Corpora collection and includes text gathered from blogs, news articles, and Twitter in several languages. This analysis uses the English (en_US) corpus.

# Install required packages if needed
required_packages <- c("tm", "ggplot2", "dplyr", "knitr", "quanteda", "gridExtra", "slam")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages, repos = "http://cran.us.r-project.org")

# Load libraries
library(tm)
library(ggplot2)
library(dplyr)
library(knitr)
library(quanteda)
library(gridExtra)
library(slam)

# Create a data folder if it does not already exist
if (!file.exists("data")) {
    dir.create("data")
}

# Download and unzip data if not already present
if (!file.exists("data/final/en_US")) {
    fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    # mode = "wb" keeps the zip file intact on Windows
    download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
    unzip("data/Coursera-SwiftKey.zip", exdir = "data")
}

Load Text Files

# Define file paths
blogs_file <- "data/final/en_US/en_US.blogs.txt"
news_file <- "data/final/en_US/en_US.news.txt"
twitter_file <- "data/final/en_US/en_US.twitter.txt"

# Helper to read a file in binary mode (handles special characters and embedded nulls)
read_corpus_file <- function(path) {
    con <- file(path, "rb")
    lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    close(con)
    lines
}

blogs <- read_corpus_file(blogs_file)
news <- read_corpus_file(news_file)
twitter <- read_corpus_file(twitter_file)

Basic Data Summaries

File Statistics

# Function to count whitespace-delimited words (lines with no words count as zero)
word_count <- function(text) {
    sum(vapply(gregexpr("\\S+", text), function(m) sum(m > 0), integer(1)))
}

# Calculate statistics (each word count is computed once per source)
blogs_words <- word_count(blogs)
news_words <- word_count(news)
twitter_words <- word_count(twitter)

file_stats <- data.frame(
    Source = c("Blogs", "News", "Twitter"),
    File_Size_MB = c(
        file.info(blogs_file)$size / 1024^2,
        file.info(news_file)$size / 1024^2,
        file.info(twitter_file)$size / 1024^2
    ),
    Line_Count = c(length(blogs), length(news), length(twitter)),
    Word_Count = c(blogs_words, news_words, twitter_words),
    Avg_Words_Per_Line = c(
        blogs_words / length(blogs),
        news_words / length(news),
        twitter_words / length(twitter)
    )
)

# Display table
kable(file_stats,
    digits = 2,
    format.args = list(big.mark = ","),
    caption = "Table 1: Summary Statistics of Text Corpora"
)
Table 1: Summary Statistics of Text Corpora

Source     File_Size_MB   Line_Count   Word_Count   Avg_Words_Per_Line
Blogs            200.42      899,288   37,334,131                41.52
News             196.28    1,010,242   34,372,530                34.02
Twitter          159.36    2,360,148   30,373,583                12.87

Key Observations:

  • The datasets are substantial, with millions of lines and words
  • Twitter has the shortest average line length (constrained by character limits)
  • Blogs have the longest average line length, reflecting more detailed content
  • News articles fall between blogs and Twitter in terms of length

Data Sampling

Due to the large size of the datasets, we’ll work with a sample for exploratory analysis and model development.

set.seed(12345)
sample_size <- 0.01 # 1% sample for faster processing

# Create samples (floor() makes the truncation of the non-integer sample size explicit)
blogs_sample <- sample(blogs, floor(length(blogs) * sample_size))
news_sample <- sample(news, floor(length(news) * sample_size))
twitter_sample <- sample(twitter, floor(length(twitter) * sample_size))

# Combine samples
combined_sample <- c(blogs_sample, news_sample, twitter_sample)

cat("Sample sizes:\n")
## Sample sizes:
cat("Blogs:", length(blogs_sample), "\n")
## Blogs: 8992
cat("News:", length(news_sample), "\n")
## News: 10102
cat("Twitter:", length(twitter_sample), "\n")
## Twitter: 23601
cat("Combined:", length(combined_sample), "\n")
## Combined: 42695

Exploratory Data Analysis

Line Length Distribution

# Calculate character counts per line
line_lengths <- data.frame(
    Source = c(
        rep("Blogs", length(blogs_sample)),
        rep("News", length(news_sample)),
        rep("Twitter", length(twitter_sample))
    ),
    Length = c(nchar(blogs_sample), nchar(news_sample), nchar(twitter_sample))
)

# Create histogram
ggplot(line_lengths, aes(x = Length, fill = Source)) +
    geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
    facet_wrap(~Source, scales = "free_y", ncol = 1) +
    labs(
        title = "Figure 1: Distribution of Line Lengths by Source",
        x = "Characters per Line",
        y = "Frequency"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

Insights:

  • Twitter lines show a sharp cutoff at the platform’s character limit (140 characters when this corpus was collected)
  • Blogs and news have more varied distributions
  • News articles tend to have more consistent line lengths

Text Preprocessing

# Create corpus
corpus <- VCorpus(VectorSource(combined_sample))

# Clean the corpus
corpus_clean <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace)

# Create term document matrix
tdm <- TermDocumentMatrix(corpus_clean)

Word Frequency Analysis

# Get word frequencies (using slam to avoid memory issues with sparse matrices)
library(slam)
word_freq <- sort(row_sums(tdm), decreasing = TRUE)
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)

# Top 20 words
top_words <- head(word_freq_df, 20)

# Plot
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(
        title = "Figure 2: Top 20 Most Frequent Words",
        x = "Word",
        y = "Frequency"
    ) +
    theme_minimal()

Observations:

  • Common English words (articles, prepositions, conjunctions) dominate
  • These “stop words” will need special handling in the prediction model (a quick check with them removed is sketched below)
  • Content words appear less frequently but carry more meaning
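
As a quick check on the stop-word point above, the frequency count can be repeated with a standard English stop word list removed. This is a minimal sketch reusing the corpus_clean object and tm’s built-in stop word list; the final model may handle stop words differently.

# Sketch: top words after removing tm's built-in English stop words
corpus_nostop <- tm_map(corpus_clean, removeWords, tm::stopwords("english"))
tdm_nostop <- TermDocumentMatrix(corpus_nostop)
content_freq <- sort(row_sums(tdm_nostop), decreasing = TRUE)
head(content_freq, 20)  # content-bearing words now rise to the top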

N-gram Analysis

Understanding word combinations is crucial for prediction.

# Create tokens from combined sample
tokens_sample <- tokens(combined_sample,
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE
)
tokens_sample <- tokens_tolower(tokens_sample)

# Create bigrams
bigrams <- tokens_ngrams(tokens_sample, n = 2)

# Get bigram frequencies
bigram_list <- unlist(bigrams)
bigram_freq <- sort(table(bigram_list), decreasing = TRUE)
bigram_freq_df <- data.frame(
    bigram = names(bigram_freq),
    freq = as.numeric(bigram_freq),
    stringsAsFactors = FALSE
)

# Top 20 bigrams
top_bigrams <- head(bigram_freq_df, 20)

# Plot
ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "darkgreen") +
    coord_flip() +
    labs(
        title = "Figure 3: Top 20 Most Frequent Bigrams",
        x = "Bigram",
        y = "Frequency"
    ) +
    theme_minimal()

# Create trigrams
trigrams <- tokens_ngrams(tokens_sample, n = 3)

# Get trigram frequencies
trigram_list <- unlist(trigrams)
trigram_freq <- sort(table(trigram_list), decreasing = TRUE)
trigram_freq_df <- data.frame(
    trigram = names(trigram_freq),
    freq = as.numeric(trigram_freq),
    stringsAsFactors = FALSE
)

# Top 20 trigrams
top_trigrams <- head(trigram_freq_df, 20)

# Plot
ggplot(top_trigrams, aes(x = freq, y = reorder(trigram, freq))) +
    geom_bar(stat = "identity", fill = "coral") +
    labs(
        title = "Figure 4: Top 20 Most Frequent Trigrams",
        x = "Frequency",
        y = "Trigram"
    ) +
    theme_minimal()

Vocabulary Coverage

# Calculate cumulative coverage
word_freq_df$cumsum <- cumsum(word_freq_df$freq)
word_freq_df$coverage <- word_freq_df$cumsum / sum(word_freq_df$freq) * 100

# Find words needed for coverage thresholds
coverage_50 <- which(word_freq_df$coverage >= 50)[1]
coverage_90 <- which(word_freq_df$coverage >= 90)[1]

# Plot
ggplot(word_freq_df[1:1000, ], aes(x = 1:1000, y = coverage)) +
    geom_line(color = "blue", size = 1) +
    geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
    geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
    annotate("text",
        x = 500, y = 55,
        label = paste0("50% coverage: ", coverage_50, " words"),
        color = "red"
    ) +
    annotate("text",
        x = 500, y = 85,
        label = paste0("90% coverage: ", coverage_90, " words"),
        color = "orange"
    ) +
    labs(
        title = "Figure 5: Vocabulary Coverage Analysis",
        x = "Number of Unique Words",
        y = "Cumulative Coverage (%)"
    ) +
    theme_minimal()

Key Finding:

  • A relatively small number of words covers a large percentage of text
  • This suggests we can build an efficient model without storing the entire vocabulary (a pruning sketch follows this list)
  • Rare words can be handled through backoff strategies
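
As a rough illustration of the pruning idea, the coverage threshold already computed can be used directly to trim the vocabulary. This is a minimal sketch based on the sampled word frequencies; the cut-off for the final model is still to be decided.

# Sketch: keep only the word types needed to reach 90% coverage
# (coverage_90 was computed from the sampled frequencies above)
vocab_pruned <- word_freq_df$word[seq_len(coverage_90)]
length(vocab_pruned)                       # retained word types
length(vocab_pruned) / nrow(word_freq_df)  # fraction of the vocabulary kept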

Interesting Findings

  1. Source Diversity: The three text sources show distinct characteristics in length, vocabulary, and style, which will require careful handling in the prediction model.

  2. Zipf’s Law: Word frequencies follow a power-law distribution, with a few words appearing very frequently and most words appearing rarely (see the rank-frequency sketch after this list).

  3. N-gram Patterns: Common phrases and collocations emerge clearly in bigram and trigram analysis, validating the n-gram approach for prediction.

  4. Efficiency Opportunity: 50% of word instances can be covered by just 326 unique words, enabling memory-efficient model design.
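
The Zipf pattern in point 2 can be checked visually with a rank-frequency plot on log-log axes; an approximately straight line is the expected signature. A minimal sketch using the word frequencies computed earlier:

# Sketch: rank-frequency plot on log-log axes
zipf_df <- data.frame(rank = seq_along(word_freq), freq = as.numeric(word_freq))
ggplot(zipf_df, aes(x = rank, y = freq)) +
    geom_line(color = "purple") +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Word Rank (log scale)", y = "Frequency (log scale)") +
    theme_minimal()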

Plans for Prediction Algorithm and Shiny App

Prediction Algorithm

The prediction algorithm will use an n-gram model with backoff:

  1. N-gram Construction: Build 2-gram, 3-gram, and 4-gram frequency tables from the full dataset
  2. Smoothing: Apply Katz backoff or Kneser-Ney smoothing to handle unseen n-grams
  3. Prediction Logic (sketched after this list):
    • Given input text, find the longest matching n-gram
    • Return top 3-5 most likely next words
    • Fall back to shorter n-grams if no match found
  4. Optimization: Prune low-frequency n-grams to reduce model size
  5. Profanity Filter: Remove offensive words from predictions
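
To make the prediction logic in step 3 concrete, here is a minimal sketch of a backoff lookup over pre-built n-gram frequency tables. The table layout (one data frame per n-gram order with prefix, next_word, and freq columns) and the function name predict_next_words() are illustrative assumptions rather than the final implementation, and smoothing is omitted.

# Sketch of a backoff lookup. Assumes ngram_tables is a list of data frames
# (ngram_tables[["4"]], ngram_tables[["3"]], ngram_tables[["2"]]), each with
# columns prefix (the first n-1 words), next_word, and freq. Hypothetical layout.
predict_next_words <- function(input_text, ngram_tables, n_suggestions = 3) {
    words <- unlist(strsplit(tolower(trimws(input_text)), "\\s+"))
    # Try the longest n-gram first, then back off to shorter ones
    for (n in c(4, 3, 2)) {
        if (length(words) < n - 1) next
        prefix <- paste(tail(words, n - 1), collapse = " ")
        tbl <- ngram_tables[[as.character(n)]]
        matches <- tbl[tbl$prefix == prefix, ]
        if (nrow(matches) > 0) {
            matches <- matches[order(-matches$freq), ]
            return(head(matches$next_word, n_suggestions))
        }
    }
    # Fall back to the overall most frequent unigrams if nothing matched
    head(word_freq_df$word, n_suggestions)
}

A production version would replace the raw frequencies with Katz or Kneser-Ney discounted probabilities, but the control flow stays the same.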

Shiny App Features

The interactive application will include the following features (a minimal interface sketch follows the list):

  • Text Input: User types text in a text box
  • Real-time Predictions: Display top 3 word suggestions as user types
  • Click to Insert: Users can click suggestions to insert them
  • Statistics Dashboard: Show model performance metrics
  • Source Toggle: Allow users to weight different text sources (blogs/news/Twitter)
  • Responsive Design: Mobile-friendly interface
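
A minimal sketch of how the core of this interface might be wired up in Shiny, assuming the hypothetical predict_next_words() function and ngram_tables object from the sketch above; click-to-insert and the other listed features would be layered on top.

library(shiny)

# Prototype UI: a text box plus a live suggestion line.
# predict_next_words() and ngram_tables are the hypothetical objects
# sketched in the previous section.
ui <- fluidPage(
    titlePanel("Next-Word Prediction (prototype)"),
    textInput("user_text", "Type your text:", width = "100%"),
    h4("Suggestions"),
    textOutput("suggestions")
)

server <- function(input, output, session) {
    output$suggestions <- renderText({
        req(nzchar(input$user_text))
        paste(predict_next_words(input$user_text, ngram_tables), collapse = " | ")
    })
}

# Launch with: shinyApp(ui, server)

Click-to-insert can be added with actionButton() suggestions and updateTextInput(), and a source toggle would simply pass user-chosen weights into the prediction function.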

Next Steps

  1. Data Processing: Process full dataset (not just sample) to build comprehensive n-gram tables
  2. Model Development: Implement and test different smoothing algorithms
  3. Performance Tuning: Optimize for speed and memory usage
  4. App Development: Build Shiny interface with user testing
  5. Evaluation: Measure prediction accuracy and response time (a simple accuracy check is sketched below)
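
One simple way to measure accuracy, sketched here under the same assumptions as above: hold out lines not used for training, hide the word that follows a random prefix, and count how often it appears among the top three suggestions.

# Sketch: top-3 next-word accuracy on held-out lines, using the
# hypothetical predict_next_words() and ngram_tables from above
evaluate_top3 <- function(heldout_lines, ngram_tables, n_tests = 1000) {
    hits <- 0
    for (i in seq_len(n_tests)) {
        words <- unlist(strsplit(tolower(trimws(sample(heldout_lines, 1))), "\\s+"))
        if (length(words) < 2) next                   # need a prefix and a target
        cut <- sample(seq_len(length(words) - 1), 1)  # random split point
        prefix <- paste(words[1:cut], collapse = " ")
        preds <- predict_next_words(prefix, ngram_tables)
        if (words[cut + 1] %in% preds) hits <- hits + 1
    }
    hits / n_tests
}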

Conclusion

This exploratory analysis has successfully demonstrated data loading, basic summarization, and visualization of the text corpora. The findings support an n-gram based approach for text prediction, with clear opportunities for optimization through vocabulary pruning and efficient data structures. The next phase will focus on building and refining the prediction model for deployment in a user-friendly Shiny application.


Note: This report uses a 1% sample of the data for computational efficiency. The final model will be trained on the complete dataset for better prediction accuracy.