This milestone report presents an exploratory analysis of text data for developing a predictive text application. The analysis examines three large text corpora (blogs, news, and Twitter) to understand their characteristics and inform the development of a word prediction algorithm. Key findings include significant differences in text length and vocabulary across sources, with Twitter showing the most constrained format and blogs the most varied. The next phase will focus on building an n-gram based prediction model with a Shiny web interface.
The goal of this project is to build a text prediction application similar to those used in smartphone keyboards. This report demonstrates that the data has been downloaded and loaded successfully, summarizes the basic statistics of each corpus, explores word and n-gram frequencies, and outlines the plan for the prediction algorithm and Shiny application.
The data comes from HC Corpora and includes text from blogs, news articles, and Twitter in multiple languages. We focus on the English corpus.
# Install required packages if needed
required_packages <- c("tm", "ggplot2", "dplyr", "knitr", "quanteda", "gridExtra", "slam")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages, repos = "http://cran.us.r-project.org")
# Load libraries
library(tm)
library(ggplot2)
library(dplyr)
library(knitr)
library(quanteda)
library(gridExtra)
library(slam)
# Create a data folder for the downloaded corpus
if (!file.exists("data")) {
dir.create("data")
}
# Download and unzip data if not already present
if (!file.exists("data/final/en_US")) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip")
unzip("data/Coursera-SwiftKey.zip", exdir = "data")
}

# Define file paths
blogs_file <- "data/final/en_US/en_US.blogs.txt"
news_file <- "data/final/en_US/en_US.news.txt"
twitter_file <- "data/final/en_US/en_US.twitter.txt"
# Read files (using binary mode to handle special characters)
con <- file(blogs_file, "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file(news_file, "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file(twitter_file, "rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Function to get word count
word_count <- function(text) {
  matches <- gregexpr("\\S+", text)
  # gregexpr returns -1 for lines with no match, so count those as zero words
  sum(sapply(matches, function(m) if (m[1] == -1) 0L else length(m)))
}
# Calculate statistics
file_stats <- data.frame(
Source = c("Blogs", "News", "Twitter"),
File_Size_MB = c(
file.info(blogs_file)$size / 1024^2,
file.info(news_file)$size / 1024^2,
file.info(twitter_file)$size / 1024^2
),
Line_Count = c(length(blogs), length(news), length(twitter)),
Word_Count = c(
word_count(blogs),
word_count(news),
word_count(twitter)
),
Avg_Words_Per_Line = c(
word_count(blogs) / length(blogs),
word_count(news) / length(news),
word_count(twitter) / length(twitter)
)
)
# Display table
kable(file_stats,
digits = 2,
format.args = list(big.mark = ","),
caption = "Table 1: Summary Statistics of Text Corpora"
)

| Source  | File_Size_MB | Line_Count | Word_Count | Avg_Words_Per_Line |
|---------|--------------|------------|------------|--------------------|
| Blogs   | 200.42       | 899,288    | 37,334,131 | 41.52              |
| News    | 196.28       | 1,010,242  | 34,372,530 | 34.02              |
| Twitter | 159.36       | 2,360,148  | 30,373,583 | 12.87              |
Key Observations: Twitter contributes the most lines (roughly 2.4 million) but the fewest words per line (12.87), consistent with its character limit, while blogs average the longest lines (41.52 words) and news falls in between. Each file is roughly 150-200 MB.
Due to the large size of the datasets, we’ll work with a sample for exploratory analysis and model development.
set.seed(12345)
sample_size <- 0.01 # 1% sample for faster processing
# Create samples
blogs_sample <- sample(blogs, length(blogs) * sample_size)
news_sample <- sample(news, length(news) * sample_size)
twitter_sample <- sample(twitter, length(twitter) * sample_size)
# Combine samples
combined_sample <- c(blogs_sample, news_sample, twitter_sample)
cat("Sample sizes:\n")## Sample sizes:
## Blogs: 8992
## News: 10102
## Twitter: 23601
## Combined: 42695
# Calculate character counts per line
line_lengths <- data.frame(
Source = c(
rep("Blogs", length(blogs_sample)),
rep("News", length(news_sample)),
rep("Twitter", length(twitter_sample))
),
Length = c(nchar(blogs_sample), nchar(news_sample), nchar(twitter_sample))
)
# Create histogram
ggplot(line_lengths, aes(x = Length, fill = Source)) +
geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
facet_wrap(~Source, scales = "free_y", ncol = 1) +
labs(
title = "Figure 1: Distribution of Line Lengths by Source",
x = "Characters per Line",
y = "Frequency"
) +
theme_minimal() +
theme(legend.position = "none")

Insights: Twitter lines are sharply capped by the platform's character limit, while blog and news lines are longer and right-skewed, with blogs showing the widest variation in length.
# Create corpus
corpus <- VCorpus(VectorSource(combined_sample))
# Clean the corpus
corpus_clean <- corpus %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace)
# Create term document matrix
tdm <- TermDocumentMatrix(corpus_clean)

# Get word frequencies (using slam to avoid memory issues with sparse matrices)
library(slam)
word_freq <- sort(row_sums(tdm), decreasing = TRUE)
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)
# Top 20 words
top_words <- head(word_freq_df, 20)
# Plot
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Figure 2: Top 20 Most Frequent Words",
x = "Word",
y = "Frequency"
) +
theme_minimal()

Observations: because stop words were not removed during cleaning, the most frequent terms are common function words such as "the" and "and". They carry little meaning on their own, but they are essential for predicting natural-sounding phrases, so they are kept in the model.
Understanding word combinations is crucial for prediction.
# Create tokens from combined sample
tokens_sample <- tokens(combined_sample,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE
)
tokens_sample <- tokens_tolower(tokens_sample)
# Create bigrams
bigrams <- tokens_ngrams(tokens_sample, n = 2)
# Get bigram frequencies
bigram_list <- unlist(bigrams)
bigram_freq <- sort(table(bigram_list), decreasing = TRUE)
bigram_freq_df <- data.frame(
bigram = names(bigram_freq),
freq = as.numeric(bigram_freq),
stringsAsFactors = FALSE
)
# Top 20 bigrams
top_bigrams <- head(bigram_freq_df, 20)
# Plot
ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "darkgreen") +
coord_flip() +
labs(
title = "Figure 3: Top 20 Most Frequent Bigrams",
x = "Bigram",
y = "Frequency"
) +
theme_minimal()

# Create trigrams
trigrams <- tokens_ngrams(tokens_sample, n = 3)
# Get trigram frequencies
trigram_list <- unlist(trigrams)
trigram_freq <- sort(table(trigram_list), decreasing = TRUE)
trigram_freq_df <- data.frame(
trigram = names(trigram_freq),
freq = as.numeric(trigram_freq),
stringsAsFactors = FALSE
)
# Top 20 trigrams
top_trigrams <- head(trigram_freq_df, 20)
# Plot
ggplot(top_trigrams, aes(x = freq, y = reorder(trigram, freq))) +
geom_bar(stat = "identity", fill = "coral") +
labs(
title = "Figure 4: Top 20 Most Frequent Trigrams",
x = "Frequency",
y = "Trigram"
) +
theme_minimal()

# Calculate cumulative coverage
word_freq_df$cumsum <- cumsum(word_freq_df$freq)
word_freq_df$coverage <- word_freq_df$cumsum / sum(word_freq_df$freq) * 100
# Find words needed for coverage thresholds
coverage_50 <- which(word_freq_df$coverage >= 50)[1]
coverage_90 <- which(word_freq_df$coverage >= 90)[1]
# Plot
ggplot(word_freq_df[1:1000, ], aes(x = 1:1000, y = coverage)) +
geom_line(color = "blue", size = 1) +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
annotate("text",
x = 500, y = 55,
label = paste0("50% coverage: ", coverage_50, " words"),
color = "red"
) +
annotate("text",
x = 500, y = 85,
label = paste0("90% coverage: ", coverage_90, " words"),
color = "orange"
) +
labs(
title = "Figure 5: Vocabulary Coverage Analysis",
x = "Number of Unique Words",
y = "Cumulative Coverage (%)"
) +
theme_minimal()

Key Finding: the vocabulary is highly concentrated. In this sample, only a few hundred unique words (326) cover 50% of all word instances, and a still modest fraction of the full vocabulary reaches 90% coverage.
Source Diversity: The three text sources show distinct characteristics in length, vocabulary, and style, which will require careful handling in the prediction model.
Zipf’s Law: Word frequencies follow a power-law distribution, with a few words appearing very frequently and most words appearing rarely.
N-gram Patterns: Common phrases and collocations emerge clearly in bigram and trigram analysis, validating the n-gram approach for prediction.
Efficiency Opportunity: 50% of word instances can be covered by just 326 unique words, enabling memory-efficient model design.
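As a rough illustration of that opportunity, the sketch below keeps only the words needed for roughly 90% coverage, reusing word_freq_df and coverage_90 from the coverage analysis above, and maps everything else to an "<UNK>" placeholder. The prune_vocab helper and the "<UNK>" token are assumptions for illustration, not part of the pipeline above.

# Sketch only: restrict the vocabulary to the ~90%-coverage word list
# (coverage_90 was computed in the coverage analysis) and replace
# out-of-vocabulary tokens with "<UNK>" to shrink the n-gram tables.
keep_words <- head(as.character(word_freq_df$word), coverage_90)

prune_vocab <- function(words) {
  ifelse(words %in% keep_words, words, "<UNK>")
}

# Example: prune a tokenized line before counting n-grams
prune_vocab(c("the", "weather", "is", "lovely"))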
The prediction algorithm will use an n-gram model with backoff: given the most recent words typed, it will first look for matching trigrams, back off to bigrams when no trigram match is found, and finally fall back to the most frequent unigrams, with vocabulary pruning keeping the lookup tables compact.
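A minimal sketch of that lookup is shown below, reusing the bigram_freq_df and trigram_freq_df tables built earlier (tokens are joined with "_", quanteda's default concatenator). The helper name predict_next_word and its details are assumptions for illustration; the final model will use more efficient data structures.

# Backoff sketch (not the final implementation): try trigrams on the last
# two words, then bigrams on the last word, then the most frequent unigrams.
predict_next_word <- function(last_words, n = 3) {
  last_words <- tolower(last_words)

  # 1. Trigram match: "w1_w2_*"
  prefix3 <- paste0(paste(tail(last_words, 2), collapse = "_"), "_")
  hits <- trigram_freq_df[startsWith(trigram_freq_df$trigram, prefix3), ]
  if (nrow(hits) > 0) return(head(sub(".*_", "", hits$trigram), n))

  # 2. Back off to bigrams: "w2_*"
  prefix2 <- paste0(tail(last_words, 1), "_")
  hits <- bigram_freq_df[startsWith(bigram_freq_df$bigram, prefix2), ]
  if (nrow(hits) > 0) return(head(sub(".*_", "", hits$bigram), n))

  # 3. Fall back to the overall most frequent unigrams
  head(as.character(word_freq_df$word), n)
}

# Example: suggest completions for "thanks for ..."
predict_next_word(c("thanks", "for"))

The frequency tables are already sorted by decreasing frequency, so the first matches returned are the most likely candidates.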
The interactive application, built with Shiny, will include a text input box where the user types a phrase and a display of the top predicted next words, updated as the input changes.
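For orientation, a bare-bones sketch of such an interface follows; the layout, widget names, and the call to the predict_next_word sketch above are assumptions rather than the final design.

library(shiny)

# Minimal Shiny sketch (assumed layout): a text box for the user's phrase
# and a display of the top predicted next words.
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Predicted next words:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- unlist(strsplit(trimws(tolower(input$phrase)), "\\s+"))
    if (length(words) < 2) return("(type at least two words)")
    # predict_next_word() is the backoff sketch shown above
    paste(predict_next_word(tail(words, 2)), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)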
This exploratory analysis has successfully demonstrated data loading, basic summarization, and visualization of the text corpora. The findings support an n-gram based approach for text prediction, with clear opportunities for optimization through vocabulary pruning and efficient data structures. The next phase will focus on building and refining the prediction model for deployment in a user-friendly Shiny application.
Note: This report uses a 1% sample of the data for computational efficiency. The final model will be trained on the complete dataset for better prediction accuracy.