Executive Summary

This milestone report presents an exploratory analysis of text data for developing a predictive text application. The analysis examines three large text corpora (blogs, news, and Twitter) to understand their characteristics and inform the development of a word prediction algorithm. Key findings include significant differences in text length and vocabulary across sources, with Twitter showing the most constrained format and blogs the most varied. The next phase will focus on building an n-gram based prediction model with a Shiny web interface.

Introduction

The goal of this project is to build a text prediction application similar to those used in smartphone keyboards. This report demonstrates:

  1. Successful data loading and preprocessing
  2. Basic statistical summaries of the text corpora
  3. Exploratory visualizations highlighting key features
  4. Plans for the prediction algorithm and Shiny app

Data Acquisition and Loading

Download Dataset

The data comes from the HC Corpora collection and includes text gathered from blogs, news articles, and Twitter in several languages. This analysis uses the English (en_US) corpus.

# Install required packages if needed
required_packages <- c("tm", "ggplot2", "dplyr", "knitr", "quanteda", "gridExtra", "slam")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages, repos = "http://cran.us.r-project.org")

# Load libraries
library(tm)
library(ggplot2)
library(dplyr)
library(knitr)
library(quanteda)
library(gridExtra)
library(slam)

# Create a data folder if it does not already exist
if (!file.exists("data")) {
    dir.create("data")
}

# Download and unzip data if not already present
if (!file.exists("data/final/en_US")) {
    fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    # mode = "wb" keeps the zip file intact on Windows
    download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
    unzip("data/Coursera-SwiftKey.zip", exdir = "data")
}

Load Text Files

# Define file paths
blogs_file <- "data/final/en_US/en_US.blogs.txt"
news_file <- "data/final/en_US/en_US.news.txt"
twitter_file <- "data/final/en_US/en_US.twitter.txt"

# Helper to read a file in binary mode (handles special characters and embedded nulls)
read_corpus_file <- function(path) {
    con <- file(path, "rb")
    lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    close(con)
    lines
}

blogs <- read_corpus_file(blogs_file)
news <- read_corpus_file(news_file)
twitter <- read_corpus_file(twitter_file)

Basic Data Summaries

File Statistics

# Function to count whitespace-delimited words (lines with no words count as zero)
word_count <- function(text) {
    sum(vapply(gregexpr("\\S+", text), function(m) sum(m > 0), integer(1)))
}

# Calculate statistics (each word count is computed once per source)
blogs_words <- word_count(blogs)
news_words <- word_count(news)
twitter_words <- word_count(twitter)

file_stats <- data.frame(
    Source = c("Blogs", "News", "Twitter"),
    File_Size_MB = c(
        file.info(blogs_file)$size / 1024^2,
        file.info(news_file)$size / 1024^2,
        file.info(twitter_file)$size / 1024^2
    ),
    Line_Count = c(length(blogs), length(news), length(twitter)),
    Word_Count = c(blogs_words, news_words, twitter_words),
    Avg_Words_Per_Line = c(
        blogs_words / length(blogs),
        news_words / length(news),
        twitter_words / length(twitter)
    )
)

# Display table
kable(file_stats,
    digits = 2,
    format.args = list(big.mark = ","),
    caption = "Table 1: Summary Statistics of Text Corpora"
)
Table 1: Summary Statistics of Text Corpora

Source     File_Size_MB   Line_Count   Word_Count   Avg_Words_Per_Line
Blogs            200.42      899,288   37,334,131                41.52
News             196.28    1,010,242   34,372,530                34.02
Twitter          159.36    2,360,148   30,373,583                12.87

Key Observations:

  • The datasets are substantial, with millions of lines and words
  • Twitter has the shortest average line length (constrained by character limits)
  • Blogs have the longest average line length, reflecting more detailed content
  • News articles fall between blogs and Twitter in terms of length

Data Sampling

Due to the large size of the datasets, we’ll work with a sample for exploratory analysis and model development.

set.seed(12345)
sample_size <- 0.01 # 1% sample for faster processing

# Create samples (floor() makes the truncation of the non-integer sample size explicit)
blogs_sample <- sample(blogs, floor(length(blogs) * sample_size))
news_sample <- sample(news, floor(length(news) * sample_size))
twitter_sample <- sample(twitter, floor(length(twitter) * sample_size))

# Combine samples
combined_sample <- c(blogs_sample, news_sample, twitter_sample)

cat("Sample sizes:\n")
## Sample sizes:
cat("Blogs:", length(blogs_sample), "\n")
## Blogs: 8992
cat("News:", length(news_sample), "\n")
## News: 10102
cat("Twitter:", length(twitter_sample), "\n")
## Twitter: 23601
cat("Combined:", length(combined_sample), "\n")
## Combined: 42695

Exploratory Data Analysis

Line Length Distribution

# Calculate character counts per line
line_lengths <- data.frame(
    Source = c(
        rep("Blogs", length(blogs_sample)),
        rep("News", length(news_sample)),
        rep("Twitter", length(twitter_sample))
    ),
    Length = c(nchar(blogs_sample), nchar(news_sample), nchar(twitter_sample))
)

# Create histogram
ggplot(line_lengths, aes(x = Length, fill = Source)) +
    geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
    facet_wrap(~Source, scales = "free_y", ncol = 1) +
    labs(
        title = "Figure 1: Distribution of Line Lengths by Source",
        x = "Characters per Line",
        y = "Frequency"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

Insights:

  • Twitter lines show a sharp cutoff at the platform’s character limit (140 characters when this corpus was collected)
  • Blogs and news have more varied distributions
  • News articles tend to have more consistent line lengths

Text Preprocessing

# Create corpus
corpus <- VCorpus(VectorSource(combined_sample))

# Clean the corpus
corpus_clean <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace)

# Create term document matrix
tdm <- TermDocumentMatrix(corpus_clean)

Word Frequency Analysis

# Get word frequencies (using slam to avoid memory issues with sparse matrices)
library(slam)
word_freq <- sort(row_sums(tdm), decreasing = TRUE)
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)

# Top 20 words
top_words <- head(word_freq_df, 20)

# Plot
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(
        title = "Figure 2: Top 20 Most Frequent Words",
        x = "Word",
        y = "Frequency"
    ) +
    theme_minimal()

Observations:

  • Common English words (articles, prepositions, conjunctions) dominate
  • These “stop words” will need special handling in the prediction model (a quick check with them removed is sketched below)
  • Content words appear less frequently but carry more meaning
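
As a quick check on the stop-word point above, the frequency count can be repeated with a standard English stop word list removed. This is a minimal sketch reusing the corpus_clean object and tm’s built-in stop word list; the final model may handle stop words differently.

# Sketch: top words after removing tm's built-in English stop words
corpus_nostop <- tm_map(corpus_clean, removeWords, tm::stopwords("english"))
tdm_nostop <- TermDocumentMatrix(corpus_nostop)
content_freq <- sort(row_sums(tdm_nostop), decreasing = TRUE)
head(content_freq, 20)  # content-bearing words now rise to the top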

N-gram Analysis

Understanding word combinations is crucial for prediction.

# Create tokens from combined sample
tokens_sample <- tokens(combined_sample,
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE
)
tokens_sample <- tokens_tolower(tokens_sample)

# Create bigrams
bigrams <- tokens_ngrams(tokens_sample, n = 2)

# Get bigram frequencies
bigram_list <- unlist(bigrams)
bigram_freq <- sort(table(bigram_list), decreasing = TRUE)
bigram_freq_df <- data.frame(
    bigram = names(bigram_freq),
    freq = as.numeric(bigram_freq),
    stringsAsFactors = FALSE
)

# Top 20 bigrams
top_bigrams <- head(bigram_freq_df, 20)

# Plot
ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "darkgreen") +
    coord_flip() +
    labs(
        title = "Figure 3: Top 20 Most Frequent Bigrams",
        x = "Bigram",
        y = "Frequency"
    ) +
    theme_minimal()

# Create trigrams
trigrams <- tokens_ngrams(tokens_sample, n = 3)

# Get trigram frequencies
trigram_list <- unlist(trigrams)
trigram_freq <- sort(table(trigram_list), decreasing = TRUE)
trigram_freq_df <- data.frame(
    trigram = names(trigram_freq),
    freq = as.numeric(trigram_freq),
    stringsAsFactors = FALSE
)

# Top 20 trigrams
top_trigrams <- head(trigram_freq_df, 20)

# Plot
ggplot(top_trigrams, aes(x = freq, y = reorder(trigram, freq))) +
    geom_bar(stat = "identity", fill = "coral") +
    labs(
        title = "Figure 4: Top 20 Most Frequent Trigrams",
        x = "Frequency",
        y = "Trigram"
    ) +
    theme_minimal()

Vocabulary Coverage

# Calculate cumulative coverage
word_freq_df$cumsum <- cumsum(word_freq_df$freq)
word_freq_df$coverage <- word_freq_df$cumsum / sum(word_freq_df$freq) * 100

# Find words needed for coverage thresholds
coverage_50 <- which(word_freq_df$coverage >= 50)[1]
coverage_90 <- which(word_freq_df$coverage >= 90)[1]

# Plot
ggplot(word_freq_df[1:1000, ], aes(x = 1:1000, y = coverage)) +
    geom_line(color = "blue", size = 1) +
    geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
    geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
    annotate("text",
        x = 500, y = 55,
        label = paste0("50% coverage: ", coverage_50, " words"),
        color = "red"
    ) +
    annotate("text",
        x = 500, y = 85,
        label = paste0("90% coverage: ", coverage_90, " words"),
        color = "orange"
    ) +
    labs(
        title = "Figure 5: Vocabulary Coverage Analysis",
        x = "Number of Unique Words",
        y = "Cumulative Coverage (%)"
    ) +
    theme_minimal()

Key Finding:

  • A relatively small number of words covers a large percentage of text
  • This suggests we can build an efficient model without storing the entire vocabulary (a pruning sketch follows this list)
  • Rare words can be handled through backoff strategies
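
As a rough illustration of the pruning idea, the coverage threshold already computed can be used directly to trim the vocabulary. This is a minimal sketch based on the sampled word frequencies; the cut-off for the final model is still to be decided.

# Sketch: keep only the word types needed to reach 90% coverage
# (coverage_90 was computed from the sampled frequencies above)
vocab_pruned <- word_freq_df$word[seq_len(coverage_90)]
length(vocab_pruned)                       # retained word types
length(vocab_pruned) / nrow(word_freq_df)  # fraction of the vocabulary kept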

Interesting Findings

  1. Source Diversity: The three text sources show distinct characteristics in length, vocabulary, and style, which will require careful handling in the prediction model.

  2. Zipf’s Law: Word frequencies follow a power-law distribution, with a few words appearing very frequently and most words appearing rarely (see the rank-frequency sketch after this list).

  3. N-gram Patterns: Common phrases and collocations emerge clearly in bigram and trigram analysis, validating the n-gram approach for prediction.

  4. Efficiency Opportunity: 50% of word instances can be covered by just 326 unique words, enabling memory-efficient model design.
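
The Zipf pattern in point 2 can be checked visually with a rank-frequency plot on log-log axes; an approximately straight line is the expected signature. A minimal sketch using the word frequencies computed earlier:

# Sketch: rank-frequency plot on log-log axes
zipf_df <- data.frame(rank = seq_along(word_freq), freq = as.numeric(word_freq))
ggplot(zipf_df, aes(x = rank, y = freq)) +
    geom_line(color = "purple") +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Word Rank (log scale)", y = "Frequency (log scale)") +
    theme_minimal()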

Plans for Prediction Algorithm and Shiny App

Prediction Algorithm

The prediction algorithm will use an n-gram model with backoff:

  1. N-gram Construction: Build 2-gram, 3-gram, and 4-gram frequency tables from the full dataset
  2. Smoothing: Apply Katz backoff or Kneser-Ney smoothing to handle unseen n-grams
  3. Prediction Logic (sketched after this list):
    • Given input text, find the longest matching n-gram
    • Return top 3-5 most likely next words
    • Fall back to shorter n-grams if no match found
  4. Optimization: Prune low-frequency n-grams to reduce model size
  5. Profanity Filter: Remove offensive words from predictions
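
To make the prediction logic in step 3 concrete, here is a minimal sketch of a backoff lookup over pre-built n-gram frequency tables. The table layout (one data frame per n-gram order with prefix, next_word, and freq columns) and the function name predict_next_words() are illustrative assumptions rather than the final implementation, and smoothing is omitted.

# Sketch of a backoff lookup. Assumes ngram_tables is a list of data frames
# (ngram_tables[["4"]], ngram_tables[["3"]], ngram_tables[["2"]]), each with
# columns prefix (the first n-1 words), next_word, and freq. Hypothetical layout.
predict_next_words <- function(input_text, ngram_tables, n_suggestions = 3) {
    words <- unlist(strsplit(tolower(trimws(input_text)), "\\s+"))
    # Try the longest n-gram first, then back off to shorter ones
    for (n in c(4, 3, 2)) {
        if (length(words) < n - 1) next
        prefix <- paste(tail(words, n - 1), collapse = " ")
        tbl <- ngram_tables[[as.character(n)]]
        matches <- tbl[tbl$prefix == prefix, ]
        if (nrow(matches) > 0) {
            matches <- matches[order(-matches$freq), ]
            return(head(matches$next_word, n_suggestions))
        }
    }
    # Fall back to the overall most frequent unigrams if nothing matched
    head(word_freq_df$word, n_suggestions)
}

A production version would replace the raw frequencies with Katz or Kneser-Ney discounted probabilities, but the control flow stays the same.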

Shiny App Features

The interactive application will include the following features (a minimal interface sketch follows the list):

  • Text Input: User types text in a text box
  • Real-time Predictions: Display top 3 word suggestions as user types
  • Click to Insert: Users can click suggestions to insert them
  • Statistics Dashboard: Show model performance metrics
  • Source Toggle: Allow users to weight different text sources (blogs/news/Twitter)
  • Responsive Design: Mobile-friendly interface
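
A minimal sketch of how the core of this interface might be wired up in Shiny, assuming the hypothetical predict_next_words() function and ngram_tables object from the sketch above; click-to-insert and the other listed features would be layered on top.

library(shiny)

# Prototype UI: a text box plus a live suggestion line.
# predict_next_words() and ngram_tables are the hypothetical objects
# sketched in the previous section.
ui <- fluidPage(
    titlePanel("Next-Word Prediction (prototype)"),
    textInput("user_text", "Type your text:", width = "100%"),
    h4("Suggestions"),
    textOutput("suggestions")
)

server <- function(input, output, session) {
    output$suggestions <- renderText({
        req(nzchar(input$user_text))
        paste(predict_next_words(input$user_text, ngram_tables), collapse = " | ")
    })
}

# Launch with: shinyApp(ui, server)

Click-to-insert can be added with actionButton() suggestions and updateTextInput(), and a source toggle would simply pass user-chosen weights into the prediction function.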

Next Steps

  1. Data Processing: Process full dataset (not just sample) to build comprehensive n-gram tables
  2. Model Development: Implement and test different smoothing algorithms
  3. Performance Tuning: Optimize for speed and memory usage
  4. App Development: Build Shiny interface with user testing
  5. Evaluation: Measure prediction accuracy and response time (a simple accuracy check is sketched below)
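
One simple way to measure accuracy, sketched here under the same assumptions as above: hold out lines not used for training, hide the word that follows a random prefix, and count how often it appears among the top three suggestions.

# Sketch: top-3 next-word accuracy on held-out lines, using the
# hypothetical predict_next_words() and ngram_tables from above
evaluate_top3 <- function(heldout_lines, ngram_tables, n_tests = 1000) {
    hits <- 0
    for (i in seq_len(n_tests)) {
        words <- unlist(strsplit(tolower(trimws(sample(heldout_lines, 1))), "\\s+"))
        if (length(words) < 2) next                   # need a prefix and a target
        cut <- sample(seq_len(length(words) - 1), 1)  # random split point
        prefix <- paste(words[1:cut], collapse = " ")
        preds <- predict_next_words(prefix, ngram_tables)
        if (words[cut + 1] %in% preds) hits <- hits + 1
    }
    hits / n_tests
}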

Conclusion

This exploratory analysis has successfully demonstrated data loading, basic summarization, and visualization of the text corpora. The findings support an n-gram based approach for text prediction, with clear opportunities for optimization through vocabulary pruning and efficient data structures. The next phase will focus on building and refining the prediction model for deployment in a user-friendly Shiny application.


Note: This report uses a 1% sample of the data for computational efficiency. The final model will be trained on the complete dataset for better prediction accuracy.