Data Science Capstone Milestone Report
1 Introduction
The goal of this capstone project is to build a predictive text application using natural language processing techniques. Before developing the final prediction algorithm and Shiny app, this milestone report explores the training data, summarizes its major features, and describes the planned modeling approach.
The final application will predict the next word based on a phrase entered by the user. This report focuses on understanding the data, cleaning the text, analyzing word and phrase frequencies, and preparing for an n-gram based prediction model.
1.1 Executive Summary
This report presents an exploratory analysis of a large English text corpus consisting of blogs, news articles, and Twitter data. A 1% sample of the dataset, comprising over 1 million words, was used to analyze language patterns efficiently while preserving statistical reliability.
The analysis shows that word usage follows a highly skewed distribution, where a small number of words account for a large proportion of all text. Additionally, common word sequences (n-grams) reveal strong and repetitive linguistic structures that can be leveraged for next-word prediction.
These findings support the development of a predictive text model based on n-gram probabilities with a backoff strategy. The final application will be implemented as a Shiny app capable of generating fast and accurate next-word predictions while maintaining efficient memory usage.
2 Data Source
The dataset consists of English text files from three sources:
- Blogs
- News
- Twitter
The files used in this report are:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
These files contain real-world text and therefore include noise such as slang, informal spelling, punctuation variation, profanity, and possible foreign-language fragments.
3 Setup
library(stringi)
library(quanteda)
library(dplyr)
library(ggplot2)
library(knitr)
4 Load the Data
blogs_data <- "en_US.blogs.txt"
news_data <- "en_US.news.txt"
twitter_data <- "en_US.twitter.txt"
blogs <- readLines(blogs_data, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_data, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_data, encoding = "UTF-8", skipNul = TRUE)
5 Basic Summary Statistics
The first step is to confirm that the data has been loaded successfully and to calculate basic summaries for each source.
summary_table <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Lines = c(length(blogs), length(news), length(twitter)),
Words = c(
sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))
),
Characters = c(
sum(nchar(blogs)),
sum(nchar(news)),
sum(nchar(twitter))
),
File_Size_MB = round(c(
file.info(blogs_data)$size,
file.info(news_data)$size,
file.info(twitter_data)$size
) / 1024^2, 2)
)
kable(summary_table, caption = "Basic summary statistics for the English corpus")
| Source | Lines | Words | Characters | File_Size_MB |
|---|---|---|---|---|
| Blogs | 899288 | 37546806 | 206824505 | 200.42 |
| News | 1010206 | 34761151 | 203214543 | 196.28 |
| Twitter | 2360148 | 30096690 | 162096241 | 159.36 |
6 Line Length Analysis
Line length gives a useful view of the structure of each data source. Twitter lines are expected to be shorter, while blogs and news may contain longer passages.
line_lengths <- data.frame(
Source = rep(c("Blogs", "News", "Twitter"),
times = c(length(blogs), length(news), length(twitter))),
Characters = c(nchar(blogs), nchar(news), nchar(twitter)),
Words = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter))
)
kable(
line_lengths %>%
group_by(Source) %>%
summarise(
Mean_Words = round(mean(Words), 2),
Median_Words = median(Words),
Mean_Characters = round(mean(Characters), 2),
Median_Characters = median(Characters),
.groups = "drop"
),
caption = "Line length summary by source"
)
| Source | Mean_Words | Median_Words | Mean_Characters | Median_Characters |
|---|---|---|---|---|
| Blogs | 41.75 | 28 | 229.99 | 156 |
| News | 34.41 | 32 | 201.16 | 185 |
| Twitter | 12.75 | 12 | 68.68 | 64 |
ggplot(line_lengths, aes(x = Words)) +
geom_histogram(bins = 50) +
facet_wrap(~ Source, scales = "free_y") +
coord_cartesian(xlim = c(0, 100)) +
labs(
title = "Distribution of Line Lengths",
subtitle = "Word counts per line, limited to 0-100 words for readability",
x = "Words per Line",
y = "Number of Lines"
)
7 Sampling Strategy
The full dataset is large, so this report uses a random sample for exploratory analysis. Sampling reduces memory usage and processing time while still allowing the major language patterns to be observed.
The sampling rate can be adjusted; here a value of 0.01 is used, meaning 1% of each data source is sampled.
set.seed(123)
sample_rate <- 0.01
blogs_sample <- sample(blogs, size = max(1, floor(length(blogs) * sample_rate)))
news_sample <- sample(news, size = max(1, floor(length(news) * sample_rate)))
twitter_sample <- sample(twitter, size = max(1, floor(length(twitter) * sample_rate)))
sample_text <- c(blogs_sample, news_sample, twitter_sample)
sample_summary <- data.frame(
Source = c("Blogs", "News", "Twitter", "Combined Sample"),
Lines = c(length(blogs_sample), length(news_sample), length(twitter_sample), length(sample_text)),
Words = c(
sum(stri_count_words(blogs_sample)),
sum(stri_count_words(news_sample)),
sum(stri_count_words(twitter_sample)),
sum(stri_count_words(sample_text))
)
)
kable(sample_summary, caption = "Summary of sampled data used for exploratory analysis")
| Source | Lines | Words |
|---|---|---|
| Blogs | 8992 | 374837 |
| News | 10102 | 347501 |
| Twitter | 23601 | 301078 |
| Combined Sample | 42695 | 1023416 |
8 Text Cleaning and Tokenization
The sampled text is cleaned and tokenized using the following steps:
- Convert all text to lowercase
- Remove punctuation
- Remove numbers
- Remove symbols
- Split text into word tokens
tokens_clean <- tokens(
sample_text,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE
)
tokens_clean <- tokens_tolower(tokens_clean)
9 Unigram Analysis
Unigrams are individual words. The most frequent words give a simple view of the vocabulary and common language patterns in the corpus.
dfm_uni <- dfm(tokens_clean)
top_uni <- topfeatures(dfm_uni, 25)
unigram_table <- data.frame(
Word = names(top_uni),
Frequency = as.numeric(top_uni)
)
kable(unigram_table, caption = "Top 25 most frequent unigrams")
| Word | Frequency |
|---|---|
| the | 47771 |
| to | 27621 |
| and | 24225 |
| a | 23698 |
| of | 19825 |
| i | 16755 |
| in | 16576 |
| for | 11125 |
| is | 10556 |
| that | 10459 |
| you | 9347 |
| it | 9106 |
| on | 8194 |
| with | 7236 |
| was | 6284 |
| my | 6004 |
| at | 5719 |
| be | 5591 |
| this | 5366 |
| have | 5277 |
| but | 4913 |
| are | 4759 |
| as | 4726 |
| he | 4279 |
| we | 4169 |
ggplot(unigram_table, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Words",
x = "Word",
y = "Frequency"
)
10 Bigram Analysis
Bigrams are two-word phrases. They help reveal common phrase structures and are useful for next-word prediction.
tokens_bi <- tokens_ngrams(tokens_clean, n = 2)
dfm_bi <- dfm(tokens_bi)
top_bi <- topfeatures(dfm_bi, 25)
bigram_table <- data.frame(
Bigram = names(top_bi),
Frequency = as.numeric(top_bi)
)
kable(bigram_table, caption = "Top 25 most frequent bigrams")
| Bigram | Frequency |
|---|---|
| in_the | 4218 |
| of_the | 4203 |
| to_the | 2146 |
| for_the | 2078 |
| on_the | 1962 |
| to_be | 1593 |
| at_the | 1421 |
| and_the | 1243 |
| in_a | 1171 |
| with_the | 1031 |
| is_a | 1006 |
| it_was | 974 |
| for_a | 928 |
| from_the | 899 |
| i_was | 879 |
| i_have | 856 |
| and_i | 836 |
| i_am | 832 |
| with_a | 810 |
| will_be | 809 |
| of_a | 802 |
| going_to | 789 |
| if_you | 764 |
| it_is | 757 |
| have_a | 754 |
ggplot(bigram_table, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Bigrams",
x = "Bigram",
y = "Frequency"
)
11 Trigram Analysis
Trigrams are three-word phrases. These are especially useful for predicting the next word from recent context.
tokens_tri <- tokens_ngrams(tokens_clean, n = 3)
dfm_tri <- dfm(tokens_tri)
top_tri <- topfeatures(dfm_tri, 25)
trigram_table <- data.frame(
Trigram = names(top_tri),
Frequency = as.numeric(top_tri)
)
kable(trigram_table, caption = "Top 25 most frequent trigrams")
| Trigram | Frequency |
|---|---|
| one_of_the | 335 |
| a_lot_of | 268 |
| thanks_for_the | 234 |
| going_to_be | 180 |
| out_of_the | 169 |
| to_be_a | 163 |
| the_end_of | 157 |
| it_was_a | 154 |
| be_able_to | 143 |
| i_want_to | 141 |
| looking_forward_to | 136 |
| i_have_to | 124 |
| i_have_a | 124 |
| this_is_a | 124 |
| as_well_as | 123 |
| some_of_the | 119 |
| part_of_the | 118 |
| the_rest_of | 117 |
| thank_you_for | 111 |
| i_love_you | 111 |
| in_the_first | 108 |
| a_couple_of | 107 |
| at_the_end | 104 |
| end_of_the | 102 |
| is_going_to | 101 |
ggplot(trigram_table, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
geom_col() +
coord_flip() +
labs(
title = "Top 25 Most Frequent Trigrams",
x = "Trigram",
y = "Frequency"
)
12 Word Frequency Coverage
A useful question for predictive modeling is how many unique words are needed to cover a large proportion of total word usage. If a relatively small dictionary covers most word instances, the final model can be made smaller and faster.
word_freq <- sort(colSums(dfm_uni), decreasing = TRUE)
cumulative_coverage <- cumsum(word_freq) / sum(word_freq)
coverage_50 <- which(cumulative_coverage >= 0.50)[1]
coverage_90 <- which(cumulative_coverage >= 0.90)[1]
coverage_95 <- which(cumulative_coverage >= 0.95)[1]
coverage_table <- data.frame(
  Coverage = c("50%", "90%", "95%"),
  Unique_Words_Required = unname(c(coverage_50, coverage_90, coverage_95))
)
kable(coverage_table, caption = "Number of unique words required to reach selected coverage levels")
| Coverage | Unique_Words_Required |
|---|---|
| 50% | 144 |
| 90% | 7987 |
| 95% | 18443 |
coverage_df <- data.frame(
Rank = seq_along(cumulative_coverage),
Coverage = cumulative_coverage
)
ggplot(coverage_df, aes(x = Rank, y = Coverage)) +
geom_line() +
geom_hline(yintercept = 0.50, linetype = "dashed") +
geom_hline(yintercept = 0.90, linetype = "dashed") +
geom_hline(yintercept = 0.95, linetype = "dashed") +
labs(
title = "Cumulative Word Frequency Coverage",
x = "Word Rank by Frequency",
y = "Cumulative Coverage"
)
13 Key Findings
The exploratory analysis of the English corpus reveals several important insights:
The dataset is large and representative of real-world language
The corpus contains over 4 million lines and roughly 100 million words across blogs, news, and Twitter. A 1% random sample was used for analysis, resulting in over 1 million words, which is sufficiently large to capture the main linguistic patterns while reducing computational cost.
Distinct writing styles across data sources
The three sources exhibit clear structural differences. Blogs contain the longest entries (average ~42 words per line), followed by news (~34 words), while Twitter entries are much shorter (~13 words per line). This highlights the contrast between formal, descriptive writing (blogs/news) and informal, conversational text (Twitter).
Word usage follows a highly skewed distribution
A small number of common words dominate the corpus. Words such as the, to, and, a, and of appear far more frequently than others, confirming a Zipfian distribution typical of natural language. This has important implications for model efficiency.
Frequent phrase patterns are strongly present
Bigram and trigram analysis reveals highly recurring structures such as:
- in the, of the, to the
- one of the, a lot of, going to be
These patterns demonstrate that language is highly structured and predictable, supporting the use of n-gram models for next-word prediction.
Efficient coverage with a relatively small vocabulary
The coverage analysis shows that:
- Approximately 144 unique words account for 50% of all word occurrences
- Around 8,000 words cover 90%
- Around 18,000 words cover 95%
This indicates that a compact, frequency-based dictionary can provide strong predictive coverage while keeping the model size manageable.
Presence of real-world noise in the data
The dataset includes informal language, abbreviations, punctuation inconsistencies, and occasional foreign-language fragments. Twitter data in particular contributes slang and conversational expressions. This reinforces the importance of thorough text cleaning and preprocessing.
1% sampling is sufficient for reliable analysis
The use of a 1% sample preserves key statistical properties of the full dataset, including word frequency distributions and common phrase structures. This confirms that meaningful insights can be obtained without processing the entire dataset, improving efficiency during development.
14 Planned Prediction Algorithm
The prediction model will be based on an n-gram language modeling approach, which estimates the probability of a word given the previous sequence of words.
The core strategy will use a backoff model, structured as follows (see the sketch after this list):
- The user input will be cleaned and tokenized using the same preprocessing steps applied to the training data.
- The model will attempt to match the last two words of the input against a trigram dataset to predict the next word.
- If no trigram match is found, the model will back off to a bigram model using the last two words.
- If no bigram match is found, the model will fall back to unigram frequencies.
- The prediction returned will be the word with the highest conditional probability.
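To make the strategy concrete, the sketch below implements a minimal backoff lookup in R. The tiny uni_freq, bi_freq, and tri_freq data frames are toy stand-ins for the precomputed n-gram frequency tables, and predict_next_word is an illustrative function name, not the final implementation; smoothing is omitted here.

library(dplyr)
library(stringi)

# Toy frequency tables standing in for the precomputed n-gram tables.
uni_freq <- data.frame(word = c("the", "to", "and"), count = c(47771, 27621, 24225))
bi_freq  <- data.frame(prefix = c("of", "in"), word = c("the", "the"), count = c(4203, 4218))
tri_freq <- data.frame(prefix = c("one of", "a lot"), word = c("the", "of"), count = c(335, 268))

# Normalize user input with the same cleaning applied to the training sample.
clean_input <- function(text) {
  text <- stri_trans_tolower(text)
  text <- stri_replace_all_regex(text, "[^a-z' ]", " ")
  stri_split_regex(text, "\\s+", omit_empty = TRUE)[[1]]
}

# Backoff lookup: try the trigram table, then the bigram table, then unigrams.
predict_next_word <- function(text, n = 3) {
  words <- clean_input(text)

  if (length(words) >= 2) {
    key <- paste(tail(words, 2), collapse = " ")
    hits <- tri_freq %>% filter(prefix == key) %>% arrange(desc(count))
    if (nrow(hits) > 0) return(head(hits$word, n))
  }
  if (length(words) >= 1) {
    key <- tail(words, 1)
    hits <- bi_freq %>% filter(prefix == key) %>% arrange(desc(count))
    if (nrow(hits) > 0) return(head(hits$word, n))
  }
  # Fall back to the most frequent unigrams.
  head(uni_freq$word[order(uni_freq$count, decreasing = TRUE)], n)
}

predict_next_word("One of")  # returns "the" from the trigram table in this toy example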
To improve prediction quality, the model may incorporate smoothing techniques such as:
- Assigning non-zero probabilities to unseen n-grams
- Using frequency thresholds to reduce noise from rare combinations
Additionally, the model will be implemented using precomputed n-gram frequency tables, enabling fast lookup during prediction. This approach ensures that predictions can be generated in real time within the Shiny application.
This design balances accuracy, computational efficiency, and scalability, making it suitable for deployment in a resource-constrained environment.
15 Model Efficiency Considerations
Since the final model will be deployed in a Shiny application, both memory usage and runtime performance are critical.
To optimize efficiency, the model will:
- Use precomputed n-gram frequency tables stored in compact formats
- Apply frequency thresholds to remove low-frequency n-grams
- Limit the size of the prediction dictionary based on coverage analysis
- Use efficient data structures for fast lookup (e.g., hash tables or indexed data frames)
- Implement backoff logic to avoid unnecessary computation
These strategies will ensure that the model remains responsive while maintaining acceptable prediction accuracy.
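As one illustration of the pruning and fast-lookup ideas, the sketch below filters a small bigram frequency vector by a minimum count and stores it as a keyed data.table for indexed prefix lookup. The toy input text, the min_count threshold, and the choice of data.table are assumptions for illustration only; the final implementation may use different structures.

library(quanteda)
library(data.table)

# Toy bigram counts standing in for dfm_bi above; in practice these come from the sampled corpus.
toks <- tokens(c("one of the best days", "one of the few"), remove_punct = TRUE)
bi_counts <- colSums(dfm(tokens_ngrams(toks, n = 2)))

# Drop bigrams seen fewer than `min_count` times (threshold chosen for illustration).
min_count <- 2
bi_counts <- bi_counts[bi_counts >= min_count]

# Split "word1_word2" features into a prefix/word table and key it for fast lookup.
bi_dt <- data.table(
  prefix = sub("_[^_]+$", "", names(bi_counts)),
  word   = sub("^.*_", "", names(bi_counts)),
  count  = as.integer(bi_counts)
)
setkey(bi_dt, prefix)

bi_dt["one"]  # keyed lookup: all retained continuations of "one"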
16 Planned Shiny Application
The planned Shiny app will include:
- A text input box where the user enters a phrase
- A prediction button or automatic prediction behavior
- A displayed next-word prediction
- Possibly a small list of top alternative predictions
The app will use precomputed n-gram tables so that predictions can be generated quickly.
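A minimal sketch of that interface is shown below. It assumes a predict_next_word() function like the one sketched in Section 14 and n-gram tables loaded at startup; widget names and layout are placeholders rather than the final design.

library(shiny)

# Minimal UI: a text input and an output showing the predicted next word(s).
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  h4("Predicted next word:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(nzchar(input$phrase))
    # predict_next_word() is the backoff lookup sketched earlier (assumed available).
    paste(predict_next_word(input$phrase), collapse = ", ")
  })
}

shinyApp(ui, server)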
17 Conclusion
This milestone confirms that the English corpus has been loaded, summarized, sampled, cleaned, and explored. The analysis provides basic summaries, visualizations, unigram frequencies, bigram frequencies, trigram frequencies, and word coverage estimates.
The next step is to build and optimize the n-gram prediction algorithm and then deploy it in a Shiny application.