This report presents an exploratory analysis of a large text corpus to be used for building a predictive text application. The dataset consists of blog posts, news articles, and Twitter messages in English. Our analysis reveals key characteristics of the data and outlines our strategy for developing a word prediction algorithm.
Key Findings:

- A small, high-frequency vocabulary covers most of the text: roughly 155 words account for 50% of all word occurrences in our sample, and about 7,500 words account for 90%.
- Word frequencies follow the long-tailed distribution typical of natural language; over half of the sampled vocabulary appears only once.
- Twitter lines are much shorter on average (about 13 words) than blog lines (about 42) or news lines (about 34).

Our goal is to develop a data-driven text prediction application (similar to smartphone keyboards) that suggests the next word as users type.

The Coursera-SwiftKey dataset contains English text from three sources: blog posts, news articles, and Twitter messages.
# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringi)
library(knitr)
library(gridExtra)

# Define file paths
blogs_file <- "final/en_US/en_US.blogs.txt"
news_file <- "final/en_US/en_US.news.txt"
twitter_file <- "final/en_US/en_US.twitter.txt"

# Function to get file statistics
get_file_stats <- function(filepath) {
# File size
size_mb <- file.info(filepath)$size / (1024^2)
# Read file for line and word counts
con <- file(filepath, "r")
lines <- readLines(con, warn = FALSE)
close(con)
# Line count
line_count <- length(lines)
# Word count
word_count <- sum(stri_count_words(lines))
# Character count
char_count <- sum(nchar(lines))
# Average words per line
avg_words <- word_count / line_count
return(data.frame(
Size_MB = round(size_mb, 2),
Lines = line_count,
Words = word_count,
Characters = char_count,
Avg_Words_Per_Line = round(avg_words, 2)
))
}
# Get statistics for all files (if files exist)
file_stats <- data.frame()
if(file.exists(blogs_file)) {
blogs_stats <- get_file_stats(blogs_file)
blogs_stats$Source <- "Blogs"
file_stats <- rbind(file_stats, blogs_stats)
}
if(file.exists(news_file)) {
news_stats <- get_file_stats(news_file)
news_stats$Source <- "News"
file_stats <- rbind(file_stats, news_stats)
}
if(file.exists(twitter_file)) {
twitter_stats <- get_file_stats(twitter_file)
twitter_stats$Source <- "Twitter"
file_stats <- rbind(file_stats, twitter_stats)
}
# Reorder columns
file_stats <- file_stats[, c("Source", "Size_MB", "Lines", "Words",
"Characters", "Avg_Words_Per_Line")]
# Display table
kable(file_stats,
format.args = list(big.mark = ","),
caption = "Table 1: Summary Statistics of Text Datasets")| Source | Size_MB | Lines | Words | Characters | Avg_Words_Per_Line |
|---|---|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,621 | 206,824,505 | 41.75 |
| News | 196.28 | 1,010,242 | 34,760,761 | 203,223,159 | 34.41 |
| Twitter | 159.36 | 2,360,148 | 30,109,954 | 162,096,031 | 12.76 |
Observations:

- The Twitter file has by far the most lines (about 2.4 million) but the shortest lines, averaging only about 13 words each, consistent with the platform's character limit.
- Blogs have the longest lines (about 42 words on average) and contribute the most words overall.
- The three files are comparable in raw size (roughly 160-200 MB each); combined, they contain over 4 million lines and more than 100 million words.
# Reshape data for plotting
file_stats_long <- file_stats %>%
select(Source, Lines, Words) %>%
pivot_longer(cols = c(Lines, Words),
names_to = "Metric",
values_to = "Count")
# Create comparison plot
ggplot(file_stats_long, aes(x = Source, y = Count, fill = Source)) +
geom_bar(stat = "identity") +
facet_wrap(~Metric, scales = "free_y") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
labs(title = "Figure 1: Comparison of Dataset Sizes",
subtitle = "Lines vs Words across three text sources",
y = "Count",
x = "") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))

Given the large size of the datasets, we'll use a 1% random sample for detailed exploration. This approach keeps memory use and processing time manageable while preserving the overall statistical properties of the full corpus.
set.seed(12345) # For reproducibility
# Function to sample lines from file
sample_file <- function(filepath, sample_rate = 0.01) {
con <- file(filepath, "r")
lines <- readLines(con, warn = FALSE)
close(con)
# Random sampling
sample_size <- floor(length(lines) * sample_rate)
sampled_lines <- sample(lines, sample_size)
return(sampled_lines)
}
# Create samples
if(file.exists(blogs_file)) {
blogs_sample <- sample_file(blogs_file)
}
if(file.exists(news_file)) {
news_sample <- sample_file(news_file)
}
if(file.exists(twitter_file)) {
twitter_sample <- sample_file(twitter_file)
}
# Combine all samples
all_samples <- c(blogs_sample, news_sample, twitter_sample)
cat("Sample sizes:\n")## Sample sizes:
## Blogs: 8992 lines
## News: 10102 lines
## Twitter: 23601 lines
## Total: 42695 lines
# Function to clean and tokenize text
clean_and_tokenize <- function(text) {
# Convert to lowercase
text <- tolower(text)
# Remove URLs
text <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
"", text)
# Remove email addresses
text <- gsub("\\S+@\\S+", "", text)
# Remove Twitter handles
text <- gsub("@\\w+", "", text)
# Extract words (including contractions)
words <- unlist(stri_extract_all_words(text))
return(words)
}
# Tokenize all samples
all_words <- clean_and_tokenize(all_samples)
cat("Total words in sample:", length(all_words), "\n")## Total words in sample: 1023771
## Unique words in sample: 55061
# Calculate word frequencies
word_freq <- as.data.frame(table(all_words))
colnames(word_freq) <- c("Word", "Frequency")
word_freq <- word_freq %>%
arrange(desc(Frequency)) %>%
mutate(Proportion = Frequency / sum(Frequency) * 100)
# Top 20 words
top20 <- head(word_freq, 20)
kable(top20,
row.names = FALSE,
caption = "Table 2: Top 20 Most Frequent Words",
digits = 2)

| Word | Frequency | Proportion |
|---|---|---|
| the | 48048 | 4.69 |
| to | 27627 | 2.70 |
| a | 23871 | 2.33 |
| and | 23862 | 2.33 |
| of | 20199 | 1.97 |
| in | 16653 | 1.63 |
| i | 16523 | 1.61 |
| for | 10905 | 1.07 |
| is | 10814 | 1.06 |
| that | 10398 | 1.02 |
| you | 9780 | 0.96 |
| it | 9157 | 0.89 |
| on | 8210 | 0.80 |
| with | 7058 | 0.69 |
| was | 6177 | 0.60 |
| my | 6030 | 0.59 |
| at | 5723 | 0.56 |
| this | 5514 | 0.54 |
| be | 5447 | 0.53 |
| have | 5369 | 0.52 |
# Visualize top 30 words
top30 <- head(word_freq, 30)
ggplot(top30, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 2: Top 30 Most Frequent Words",
subtitle = "Based on 1% sample of combined datasets",
x = "Word",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

Observations:

- The most frequent words are almost all function words ("the", "to", "a", "and", "of") rather than content words.
- "the" alone accounts for nearly 5% of all word occurrences in the sample.
- Unlike many other NLP tasks, we keep these stop words: they are exactly the words a next-word predictor must suggest most often.
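To see which words rise to the top once the dominant function words are set aside, here is a small illustrative sketch; the stop-word list below is simply the top 20 words from Table 2, hand-picked for this example rather than drawn from any particular package.

```r
# Illustrative only: treat the 20 most frequent words from Table 2 as stop words
stop_words <- as.character(head(word_freq$Word, 20))

content_freq <- word_freq %>%
  filter(!Word %in% stop_words) %>%   # drop the dominant function words
  arrange(desc(Frequency))

head(content_freq, 10)                # most frequent remaining words
```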
# Calculate cumulative coverage
word_freq <- word_freq %>%
mutate(Cumulative_Prop = cumsum(Proportion))
# How many words needed for 50% and 90% coverage?
words_50 <- which(word_freq$Cumulative_Prop >= 50)[1]
words_90 <- which(word_freq$Cumulative_Prop >= 90)[1]
cat("Words needed to cover:\n")## Words needed to cover:
## 50% of text: 155 words
## 90% of text: 7517 words
# Plot coverage curve
coverage_data <- word_freq[1:1000, ]
ggplot(coverage_data, aes(x = 1:nrow(coverage_data), y = Cumulative_Prop)) +
geom_line(color = "darkblue", size = 1) +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
annotate("text", x = 700, y = 52, label = "50% coverage", color = "red") +
annotate("text", x = 700, y = 92, label = "90% coverage", color = "orange") +
theme_minimal() +
labs(title = "Figure 3: Cumulative Word Coverage",
subtitle = "How many unique words are needed to cover percentage of text",
x = "Number of Unique Words (ranked by frequency)",
y = "Cumulative Coverage (%)") +
theme(plot.title = element_text(face = "bold"))

Key Finding: A relatively small vocabulary covers a large portion of the text, which will let us keep the prediction model small without giving up much coverage.
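As a minimal sketch of how this finding could be exploited (the 90% cutoff and the "<unk>" placeholder mentioned in the comment are assumptions, not final design choices), the coverage computation above already yields a pruned vocabulary:

```r
# The top `words_90` words (computed above) already cover ~90% of the sample
pruned_vocab <- as.character(head(word_freq$Word, words_90))
cat("Pruned vocabulary size:", length(pruned_vocab), "words\n")

# Words outside this vocabulary could later be mapped to a shared placeholder
# such as "<unk>" before building n-grams (an assumption about the final design).
```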
# Calculate word lengths
word_lengths <- nchar(all_words)
# Create histogram
ggplot(data.frame(Length = word_lengths), aes(x = Length)) +
geom_histogram(binwidth = 1, fill = "forestgreen", color = "black") +
scale_x_continuous(breaks = seq(0, 20, 2)) +
theme_minimal() +
labs(title = "Figure 4: Distribution of Word Lengths",
subtitle = "Character count per word",
x = "Word Length (characters)",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

N-grams are sequences of N consecutive words. They're essential for predicting the next word based on context.
# Function to create n-grams
create_ngrams <- function(words, n = 2) {
if(length(words) < n) return(character(0))
ngrams <- character(length(words) - n + 1)
for(i in 1:(length(words) - n + 1)) {
ngrams[i] <- paste(words[i:(i + n - 1)], collapse = " ")
}
return(ngrams)
}
# Create bigrams
bigrams <- create_ngrams(all_words, 2)
# Calculate bigram frequencies
bigram_freq <- as.data.frame(table(bigrams))
colnames(bigram_freq) <- c("Bigram", "Frequency")
bigram_freq <- bigram_freq %>%
arrange(desc(Frequency))
# Top 20 bigrams
top20_bigrams <- head(bigram_freq, 20)
kable(top20_bigrams,
row.names = FALSE,
caption = "Table 3: Top 20 Most Frequent Bigrams")| Bigram | Frequency |
|---|---|
| of the | 4317 |
| in the | 4162 |
| to the | 2098 |
| for the | 2014 |
| on the | 1923 |
| to be | 1612 |
| at the | 1434 |
| and the | 1270 |
| in a | 1201 |
| with the | 1057 |
| is a | 1008 |
| it was | 977 |
| i have | 903 |
| for a | 887 |
| from the | 834 |
| i was | 833 |
| going to | 818 |
| of a | 815 |
| and i | 813 |
| it is | 796 |
# Visualize top 20 bigrams
ggplot(top20_bigrams, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "coral") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 5: Top 20 Most Frequent Bigrams",
subtitle = "Two-word sequences",
x = "Bigram",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

# Create trigrams
trigrams <- create_ngrams(all_words, 3)
# Calculate trigram frequencies
trigram_freq <- as.data.frame(table(trigrams))
colnames(trigram_freq) <- c("Trigram", "Frequency")
trigram_freq <- trigram_freq %>%
arrange(desc(Frequency))
# Top 20 trigrams
top20_trigrams <- head(trigram_freq, 20)
kable(top20_trigrams,
row.names = FALSE,
caption = "Table 4: Top 20 Most Frequent Trigrams")| Trigram | Frequency |
|---|---|
| one of the | 348 |
| a lot of | 302 |
| thanks for the | 233 |
| to be a | 186 |
| going to be | 174 |
| out of the | 160 |
| i want to | 156 |
| as well as | 150 |
| it was a | 141 |
| part of the | 138 |
| the end of | 138 |
| be able to | 137 |
| some of the | 137 |
| the rest of | 123 |
| i have a | 121 |
| looking forward to | 118 |
| thank you for | 115 |
| there is a | 111 |
| i need to | 110 |
| is going to | 109 |
# Visualize top 20 trigrams
ggplot(top20_trigrams, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "mediumpurple") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 6: Top 20 Most Frequent Trigrams",
subtitle = "Three-word sequences",
x = "Trigram",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

# Words that appear only once (hapax legomena)
singleton_words <- word_freq %>% filter(Frequency == 1)
# Frequency of frequencies
freq_of_freq <- as.data.frame(table(word_freq$Frequency))
colnames(freq_of_freq) <- c("Occurrence", "Number_of_Words")
freq_of_freq$Occurrence <- as.numeric(as.character(freq_of_freq$Occurrence))
# Plot first 50 frequency levels
ggplot(head(freq_of_freq, 50),
aes(x = Occurrence, y = Number_of_Words)) +
geom_bar(stat = "identity", fill = "darkorange") +
scale_y_log10(labels = scales::comma) +
theme_minimal() +
labs(title = "Figure 7: Frequency of Frequencies",
subtitle = "How many words appear exactly N times (log scale)",
x = "Number of Occurrences",
y = "Number of Words (log scale)") +
theme(plot.title = element_text(face = "bold"))

cat("Words appearing only once:", nrow(singleton_words), "\n")
cat("Percentage of vocabulary:", round(nrow(singleton_words) / nrow(word_freq) * 100, 2), "%\n")

## Words appearing only once: 28328
## Percentage of vocabulary: 51.45 %
Observation: Over half of the vocabulary appears only once. This "long tail" is typical of natural language and will require special handling (for example, smoothing or an unknown-word token) in our model; a minimal sketch follows.
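One common way to handle the long tail, sketched below (the "<unk>" token and the appears-only-once threshold are assumptions, not settled choices), is to collapse hapax legomena into a single unknown-word token before counting n-grams:

```r
# Collapse words seen only once into a shared "<unk>" token (illustrative choice)
rare_words <- as.character(singleton_words$Word)
all_words_unk <- ifelse(all_words %in% rare_words, "<unk>", all_words)

cat("Vocabulary before:", length(unique(all_words)), "\n")
cat("Vocabulary after: ", length(unique(all_words_unk)), "\n")
```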
# Tokenize each source separately
blogs_words <- clean_and_tokenize(blogs_sample)
news_words <- clean_and_tokenize(news_sample)
twitter_words <- clean_and_tokenize(twitter_sample)
# Calculate unique word counts
unique_counts <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Total_Words = c(length(blogs_words), length(news_words), length(twitter_words)),
Unique_Words = c(length(unique(blogs_words)),
length(unique(news_words)),
length(unique(twitter_words)))
)
unique_counts <- unique_counts %>%
mutate(Vocabulary_Richness = round(Unique_Words / Total_Words * 100, 2))
kable(unique_counts,
caption = "Table 5: Vocabulary Richness by Source",
format.args = list(big.mark = ","))

| Source | Total_Words | Unique_Words | Vocabulary_Richness |
|---|---|---|---|
| Blogs | 373,397 | 28,723 | 7.69 |
| News | 348,594 | 30,983 | 8.89 |
| Twitter | 301,780 | 25,638 | 8.50 |
ggplot(unique_counts, aes(x = Source, y = Vocabulary_Richness, fill = Source)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Figure 8: Vocabulary Richness by Source",
subtitle = "Percentage of unique words",
y = "Vocabulary Richness (%)",
x = "") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))

Our text prediction algorithm will use an N-gram model with backoff: look up the longest matching context first, then fall back to progressively shorter n-grams when no match is found. The development plan is summarized below, and a small illustrative sketch of the backoff step follows the table.
| Step | Task | Description |
|---|---|---|
| 1 | Data Sampling & Cleaning | Sample data, remove profanity, clean text |
| 2 | N-gram Generation (2-5 grams) | Create all n-grams from training corpus |
| 3 | Frequency Calculation & Pruning | Keep only n-grams above frequency threshold |
| 4 | Implement Backoff Algorithm | Implement fallback to shorter n-grams |
| 5 | Add Smoothing & Profanity Filter | Handle unseen combinations, filter bad words |
| 6 | Performance Testing & Optimization | Test accuracy and speed, optimize size |
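As a minimal sketch of the backoff lookup in step 4, assuming the bigram and trigram frequency tables computed earlier in this report (and omitting smoothing, pruning, and the profanity filter), prediction could look like this:

```r
# Minimal backoff sketch: try trigrams first, then bigrams, then the most
# frequent unigrams. Assumes word_freq, bigram_freq and trigram_freq exist
# as computed earlier in this report.
predict_next_word <- function(phrase, n = 3) {
  words <- clean_and_tokenize(phrase)

  # Trigrams: match on the last two words of the phrase
  if (length(words) >= 2) {
    context <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_freq[startsWith(as.character(trigram_freq$Trigram),
                                    paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sub(".* ", "", as.character(hits$Trigram)), n))
    }
  }

  # Back off to bigrams: match on the last word only
  if (length(words) >= 1) {
    context <- tail(words, 1)
    hits <- bigram_freq[startsWith(as.character(bigram_freq$Bigram),
                                   paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sub(".* ", "", as.character(hits$Bigram)), n))
    }
  }

  # Final fallback: the most frequent unigrams overall
  head(as.character(word_freq$Word), n)
}

predict_next_word("thanks for")   # likely "the" (cf. "thanks for the" in Table 4)
```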
The final application will provide a simple interface where the user types a phrase and the model returns the most likely next word(s), updating the suggestions as the user continues to type.
Challenge 1: Model Size - Solution: Prune rare n-grams, use efficient data structures (hash tables)
Challenge 2: Speed - Solution: Pre-compute lookup tables, implement caching
Challenge 3: Accuracy - Solution: Use larger n-grams (4-5 words) where possible, validate against test set
Challenge 4: Unknown Words - Solution: Implement backoff to shorter n-grams, use word similarity for suggestions
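To illustrate the pre-computed lookup idea from Challenge 2, here is a sketch that stores the top next-word candidates for each two-word context in an R environment (environments are hash-backed); the "top 3" cutoff and the object names are our assumptions, not settled design choices.

```r
# Pre-compute the top next-word candidates for every two-word context and keep
# them in a hash-backed environment for fast retrieval at prediction time.
context_of   <- sub(" [^ ]+$", "", as.character(trigram_freq$Trigram))  # first two words
next_word_of <- sub(".* ", "", as.character(trigram_freq$Trigram))      # last word

# trigram_freq is sorted by frequency, so each group is already in best-first order
by_context <- split(next_word_of, context_of)
lookup <- list2env(lapply(by_context, head, 3), envir = new.env(hash = TRUE))

# Constant-time retrieval at prediction time
get0("thanks for", envir = lookup, ifnotfound = character(0))
```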
Our exploratory analysis has revealed:
✓ Successful data loading of 4+ million lines across three text sources
✓ Clear patterns in word frequency following natural language distributions
✓ Efficient sampling strategy enables rapid analysis of massive datasets
✓ N-gram patterns show promise for building accurate predictions
✓ Feasible approach identified using backoff models and smoothing
The data is well understood, a cleaning pipeline is in place, and the corpus is ready for model building. Our planned n-gram approach with backoff is well suited to this prediction task and should deliver a responsive, accurate text prediction application.
Report generated on 2025-11-25 for Coursera Data Science Capstone Project