Data Science Capstone Milestone Report
1 Introduction
The goal of this capstone project is to build a predictive text application using natural language processing techniques. Before developing the final prediction algorithm and Shiny app, this milestone report explores the training data, summarizes its major features, and describes the planned modeling approach.
The final application will predict the next word based on a phrase entered by the user. This report focuses on understanding the data, cleaning the text, analyzing word and phrase frequencies, and preparing for an n-gram based prediction model.
1.1 Executive Summary
This report presents an exploratory analysis of a large English text corpus consisting of blogs, news articles, and Twitter data. A 1% sample of the dataset, comprising over 1 million words, was used to analyze language patterns efficiently while preserving statistical reliability.
The analysis shows that word usage follows a highly skewed distribution, where a small number of words account for a large proportion of all text. Additionally, common word sequences (n-grams) reveal strong and repetitive linguistic structures that can be leveraged for next-word prediction.
These findings support the development of a predictive text model based on n-gram probabilities with a backoff strategy. The final application will be implemented as a Shiny app capable of generating fast and accurate next-word predictions while maintaining efficient memory usage.
2 Data Source
The dataset consists of English text files from three sources:
- Blogs
- News
- Twitter
The files used in this report are:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
These files contain real-world text and therefore include noise such as slang, informal spelling, punctuation variation, profanity, and possible foreign-language fragments.
3 Setup
library(stringi)
library(quanteda)
library(dplyr)
library(ggplot2)
library(knitr)
4 Load the Data
blogs_data <- "en_US.blogs.txt"
news_data <- "en_US.news.txt"
twitter_data <- "en_US.twitter.txt"
blogs <- readLines(blogs_data, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_data, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_data, encoding = "UTF-8", skipNul = TRUE)
5 Basic Summary Statistics
The first step is to confirm that the data has been loaded successfully and to calculate basic summaries for each source.
summary_table <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Lines = c(length(blogs), length(news), length(twitter)),
Words = c(
sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))
),
Characters = c(
sum(nchar(blogs)),
sum(nchar(news)),
sum(nchar(twitter))
),
File_Size_MB = round(c(
file.info(blogs_data)$size,
file.info(news_data)$size,
file.info(twitter_data)$size
) / 1024^2, 2)
)
kable(summary_table, caption = "Basic summary statistics for the English corpus")
| Source | Lines | Words | Characters | File_Size_MB |
|---|---|---|---|---|
| Blogs | 899288 | 37546806 | 206824505 | 200.42 |
| News | 1010206 | 34761151 | 203214543 | 196.28 |
| Twitter | 2360148 | 30096690 | 162096241 | 159.36 |
6 Line Length Analysis
Line length gives a useful view of the structure of each data source. Twitter lines are expected to be shorter, while blogs and news may contain longer passages.
line_lengths <- data.frame(
Source = rep(c("Blogs", "News", "Twitter"),
times = c(length(blogs), length(news), length(twitter))),
Characters = c(nchar(blogs), nchar(news), nchar(twitter)),
Words = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter))
)
kable(
line_lengths %>%
group_by(Source) %>%
summarise(
Mean_Words = round(mean(Words), 2),
Median_Words = median(Words),
Mean_Characters = round(mean(Characters), 2),
Median_Characters = median(Characters),
.groups = "drop"
),
caption = "Line length summary by source"
)
| Source | Mean_Words | Median_Words | Mean_Characters | Median_Characters |
|---|---|---|---|---|
| Blogs | 41.75 | 28 | 229.99 | 156 |
| News | 34.41 | 32 | 201.16 | 185 |
| Twitter | 12.75 | 12 | 68.68 | 64 |
ggplot(line_lengths, aes(x = Words)) +
geom_histogram(bins = 50) +
facet_wrap(~ Source, scales = "free_y") +
coord_cartesian(xlim = c(0, 100)) +
labs(
title = "Distribution of Line Lengths",
subtitle = "Word counts per line, limited to 0-100 words for readability",
x = "Words per Line",
y = "Number of Lines"
)
7 Sampling Strategy
The full dataset is large, so this report uses a random sample for exploratory analysis. Sampling reduces memory usage and processing time while still allowing the major language patterns to be observed.
The sampling rate can be adjusted; here a value of 0.01 is used, meaning 1% of each data source is sampled.
set.seed(123)
sample_rate <- 0.01
blogs_sample <- sample(blogs, size = max(1, floor(length(blogs) * sample_rate)))
news_sample <- sample(news, size = max(1, floor(length(news) * sample_rate)))
twitter_sample <- sample(twitter, size = max(1, floor(length(twitter) * sample_rate)))
sample_text <- c(blogs_sample, news_sample, twitter_sample)
sample_summary <- data.frame(
Source = c("Blogs", "News", "Twitter", "Combined Sample"),
Lines = c(length(blogs_sample), length(news_sample), length(twitter_sample), length(sample_text)),
Words = c(
sum(stri_count_words(blogs_sample)),
sum(stri_count_words(news_sample)),
sum(stri_count_words(twitter_sample)),
sum(stri_count_words(sample_text))
)
)
kable(sample_summary, caption = "Summary of sampled data used for exploratory analysis")
| Source | Lines | Words |
|---|---|---|
| Blogs | 8992 | 374837 |
| News | 10102 | 347501 |
| Twitter | 23601 | 301078 |
| Combined Sample | 42695 | 1023416 |
8 Text Cleaning and Tokenization
The sampled text is cleaned and tokenized using the following steps:
- Convert all text to lowercase
- Remove punctuation
- Remove numbers
- Remove symbols
- Split text into word tokens
tokens_clean <- tokens(
sample_text,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE
)
tokens_clean <- tokens_tolower(tokens_clean)
9 Unigram Analysis
Unigrams are individual words. The most frequent words give a simple view of the vocabulary and common language patterns in the corpus.
dfm_uni <- dfm(tokens_clean)
top_uni <- topfeatures(dfm_uni, 25)
unigram_table <- data.frame(
Word = names(top_uni),
Frequency = as.numeric(top_uni)
)
kable(unigram_table, caption = "Top 25 most frequent unigrams")
| Word | Frequency |
|---|---|
| the | 47771 |
| to | 27621 |
| and | 24225 |
| a | 23698 |
| of | 19825 |
| i | 16755 |
| in | 16576 |
| for | 11125 |
| is | 10556 |
| that | 10459 |
| you | 9347 |
| it | 9106 |
| on | 8194 |
| with | 7236 |
| was | 6284 |
| my | 6004 |
| at | 5719 |
| be | 5591 |
| this | 5366 |
| have | 5277 |
| but | 4913 |
| are | 4759 |
| as | 4726 |
| he | 4279 |
| we | 4169 |
ggplot(unigram_table, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Words",
x = "Word",
y = "Frequency"
)
10 Bigram Analysis
Bigrams are two-word phrases. They help reveal common phrase structures and are useful for next-word prediction.
tokens_bi <- tokens_ngrams(tokens_clean, n = 2)
dfm_bi <- dfm(tokens_bi)
top_bi <- topfeatures(dfm_bi, 25)
bigram_table <- data.frame(
Bigram = names(top_bi),
Frequency = as.numeric(top_bi)
)
kable(bigram_table, caption = "Top 25 most frequent bigrams")
| Bigram | Frequency |
|---|---|
| in_the | 4218 |
| of_the | 4203 |
| to_the | 2146 |
| for_the | 2078 |
| on_the | 1962 |
| to_be | 1593 |
| at_the | 1421 |
| and_the | 1243 |
| in_a | 1171 |
| with_the | 1031 |
| is_a | 1006 |
| it_was | 974 |
| for_a | 928 |
| from_the | 899 |
| i_was | 879 |
| i_have | 856 |
| and_i | 836 |
| i_am | 832 |
| with_a | 810 |
| will_be | 809 |
| of_a | 802 |
| going_to | 789 |
| if_you | 764 |
| it_is | 757 |
| have_a | 754 |
ggplot(bigram_table, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Bigrams",
x = "Bigram",
y = "Frequency"
)
11 Trigram Analysis
Trigrams are three-word phrases. These are especially useful for predicting the next word from recent context.
tokens_tri <- tokens_ngrams(tokens_clean, n = 3)
dfm_tri <- dfm(tokens_tri)
top_tri <- topfeatures(dfm_tri, 25)
trigram_table <- data.frame(
Trigram = names(top_tri),
Frequency = as.numeric(top_tri)
)
kable(trigram_table, caption = "Top 25 most frequent trigrams")
| Trigram | Frequency |
|---|---|
| one_of_the | 335 |
| a_lot_of | 268 |
| thanks_for_the | 234 |
| going_to_be | 180 |
| out_of_the | 169 |
| to_be_a | 163 |
| the_end_of | 157 |
| it_was_a | 154 |
| be_able_to | 143 |
| i_want_to | 141 |
| looking_forward_to | 136 |
| i_have_to | 124 |
| i_have_a | 124 |
| this_is_a | 124 |
| as_well_as | 123 |
| some_of_the | 119 |
| part_of_the | 118 |
| the_rest_of | 117 |
| thank_you_for | 111 |
| i_love_you | 111 |
| in_the_first | 108 |
| a_couple_of | 107 |
| at_the_end | 104 |
| end_of_the | 102 |
| is_going_to | 101 |
ggplot(trigram_table, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Trigrams",
x = "Trigram",
y = "Frequency"
)
12 Word Frequency Coverage
A useful question for predictive modeling is how many unique words are needed to cover a large proportion of total word usage. If a relatively small dictionary covers most word instances, the final model can be made smaller and faster.
word_freq <- sort(colSums(dfm_uni), decreasing = TRUE)
cumulative_coverage <- cumsum(word_freq) / sum(word_freq)
coverage_50 <- which(cumulative_coverage >= 0.50)[1]
coverage_90 <- which(cumulative_coverage >= 0.90)[1]
coverage_95 <- which(cumulative_coverage >= 0.95)[1]
coverage_table <- data.frame(
  Coverage = c("50%", "90%", "95%"),
  Unique_Words_Required = unname(c(coverage_50, coverage_90, coverage_95))
)
kable(coverage_table, caption = "Number of unique words required to reach selected coverage levels")
| Coverage | Unique_Words_Required |
|---|---|
| 50% | 144 |
| 90% | 7987 |
| 95% | 18443 |
coverage_df <- data.frame(
Rank = seq_along(cumulative_coverage),
Coverage = cumulative_coverage
)
ggplot(coverage_df, aes(x = Rank, y = Coverage)) +
geom_line() +
geom_hline(yintercept = 0.50, linetype = "dashed") +
geom_hline(yintercept = 0.90, linetype = "dashed") +
geom_hline(yintercept = 0.95, linetype = "dashed") +
labs(
title = "Cumulative Word Frequency Coverage",
x = "Word Rank by Frequency",
y = "Cumulative Coverage"
)
13 Key Findings
The exploratory analysis of the English corpus reveals several important insights:
The dataset is large and representative of real-world language
The corpus contains over 4 million lines and roughly 100 million words across blogs, news, and Twitter. A 1% random sample was used for analysis, resulting in over 1 million words, which is sufficiently large to capture the main linguistic patterns while reducing computational cost.
Distinct writing styles across data sources
The three sources exhibit clear structural differences. Blogs contain the longest entries (average ~42 words per line), followed by news (~34 words), while Twitter entries are much shorter (~13 words per line). This highlights the contrast between formal, descriptive writing (blogs/news) and informal, conversational text (Twitter).
Word usage follows a highly skewed distribution
A small number of common words dominate the corpus. Words such as the, to, and, a, and of appear far more frequently than others, confirming a Zipfian distribution typical of natural language. This has important implications for model efficiency.
Frequent phrase patterns are strongly present
Bigram and trigram analysis reveals highly recurring structures such as:
- in the, of the, to the
- one of the, a lot of, going to be
These patterns demonstrate that language is highly structured and predictable, supporting the use of n-gram models for next-word prediction.
Efficient coverage with a relatively small vocabulary
The coverage analysis shows that:
- Approximately 144 unique words account for 50% of all word occurrences
- Around 8,000 words cover 90%
- Around 18,000 words cover 95%
This indicates that a compact, frequency-based dictionary can provide strong predictive coverage while keeping the model size manageable.
Presence of real-world noise in the data
The dataset includes informal language, abbreviations, punctuation inconsistencies, and occasional foreign-language fragments. Twitter data in particular contributes slang and conversational expressions. This reinforces the importance of thorough text cleaning and preprocessing.
1% sampling is sufficient for reliable analysis
The use of a 1% sample preserves key statistical properties of the full dataset, including word frequency distributions and common phrase structures. This confirms that meaningful insights can be obtained without processing the entire dataset, improving efficiency during development.
14 Planned Prediction Algorithm
The prediction model will be based on an n-gram language modeling approach, which estimates the probability of a word given the previous sequence of words.
The core strategy will use a backoff model, structured as follows (see the sketch after this list):
- The user input will be cleaned and tokenized using the same preprocessing steps applied to the training data.
- The model will attempt to match the last two words of the input against a trigram dataset to predict the next word.
- If no trigram match is found, the model will back off to a bigram model using the last two words.
- If no bigram match is found, the model will fall back to unigram frequencies.
- The prediction returned will be the word with the highest conditional probability.
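To make the strategy concrete, the sketch below implements a minimal backoff lookup in R. The tiny uni_freq, bi_freq, and tri_freq data frames are toy stand-ins for the precomputed n-gram frequency tables, and predict_next_word is an illustrative function name, not the final implementation; smoothing is omitted here.

library(dplyr)
library(stringi)

# Toy frequency tables standing in for the precomputed n-gram tables.
uni_freq <- data.frame(word = c("the", "to", "and"), count = c(47771, 27621, 24225))
bi_freq  <- data.frame(prefix = c("of", "in"), word = c("the", "the"), count = c(4203, 4218))
tri_freq <- data.frame(prefix = c("one of", "a lot"), word = c("the", "of"), count = c(335, 268))

# Normalize user input with the same cleaning applied to the training sample.
clean_input <- function(text) {
  text <- stri_trans_tolower(text)
  text <- stri_replace_all_regex(text, "[^a-z' ]", " ")
  stri_split_regex(text, "\\s+", omit_empty = TRUE)[[1]]
}

# Backoff lookup: try the trigram table, then the bigram table, then unigrams.
predict_next_word <- function(text, n = 3) {
  words <- clean_input(text)

  if (length(words) >= 2) {
    key <- paste(tail(words, 2), collapse = " ")
    hits <- tri_freq %>% filter(prefix == key) %>% arrange(desc(count))
    if (nrow(hits) > 0) return(head(hits$word, n))
  }
  if (length(words) >= 1) {
    key <- tail(words, 1)
    hits <- bi_freq %>% filter(prefix == key) %>% arrange(desc(count))
    if (nrow(hits) > 0) return(head(hits$word, n))
  }
  # Fall back to the most frequent unigrams.
  head(uni_freq$word[order(uni_freq$count, decreasing = TRUE)], n)
}

predict_next_word("One of")  # returns "the" from the trigram table in this toy example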
To improve prediction quality, the model may incorporate smoothing techniques such as:
- Assigning non-zero probabilities to unseen n-grams
- Using frequency thresholds to reduce noise from rare combinations
Additionally, the model will be implemented using precomputed n-gram frequency tables, enabling fast lookup during prediction. This approach ensures that predictions can be generated in real time within the Shiny application.
This design balances accuracy, computational efficiency, and scalability, making it suitable for deployment in a resource-constrained environment.
15 Model Efficiency Considerations
Since the final model will be deployed in a Shiny application, both memory usage and runtime performance are critical.
To optimize efficiency, the model will:
- Use precomputed n-gram frequency tables stored in compact formats
- Apply frequency thresholds to remove low-frequency n-grams
- Limit the size of the prediction dictionary based on coverage analysis
- Use efficient data structures for fast lookup (e.g., hash tables or indexed data frames)
- Implement backoff logic to avoid unnecessary computation
These strategies will ensure that the model remains responsive while maintaining acceptable prediction accuracy.
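As one illustration of the pruning and fast-lookup ideas, the sketch below filters a small bigram frequency vector by a minimum count and stores it as a keyed data.table for indexed prefix lookup. The toy input text, the min_count threshold, and the choice of data.table are assumptions for illustration only; the final implementation may use different structures.

library(quanteda)
library(data.table)

# Toy bigram counts standing in for dfm_bi above; in practice these come from the sampled corpus.
toks <- tokens(c("one of the best days", "one of the few"), remove_punct = TRUE)
bi_counts <- colSums(dfm(tokens_ngrams(toks, n = 2)))

# Drop bigrams seen fewer than `min_count` times (threshold chosen for illustration).
min_count <- 2
bi_counts <- bi_counts[bi_counts >= min_count]

# Split "word1_word2" features into a prefix/word table and key it for fast lookup.
bi_dt <- data.table(
  prefix = sub("_[^_]+$", "", names(bi_counts)),
  word   = sub("^.*_", "", names(bi_counts)),
  count  = as.integer(bi_counts)
)
setkey(bi_dt, prefix)

bi_dt["one"]  # keyed lookup: all retained continuations of "one"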
16 Planned Shiny Application
The planned Shiny app will include:
- A text input box where the user enters a phrase
- A prediction button or automatic prediction behavior
- A displayed next-word prediction
- Possibly a small list of top alternative predictions
The app will use precomputed n-gram tables so that predictions can be generated quickly.
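A minimal sketch of that interface is shown below. It assumes a predict_next_word() function like the one sketched in Section 14 and n-gram tables loaded at startup; widget names and layout are placeholders rather than the final design.

library(shiny)

# Minimal UI: a text input and an output showing the predicted next word(s).
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  h4("Predicted next word:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(nzchar(input$phrase))
    # predict_next_word() is the backoff lookup sketched earlier (assumed available).
    paste(predict_next_word(input$phrase), collapse = ", ")
  })
}

shinyApp(ui, server)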
17 Conclusion
This milestone confirms that the English corpus has been loaded, summarized, sampled, cleaned, and explored. The analysis provides basic summaries, visualizations, unigram frequencies, bigram frequencies, trigram frequencies, and word coverage estimates.
The next step is to build and optimize the n-gram prediction algorithm and then deploy it in a Shiny application.