Data Science Capstone Milestone Report

Author

Samuel Tandoh

Published

April 27, 2026

1 Introduction

The goal of this capstone project is to build a predictive text application using natural language processing techniques. Before developing the final prediction algorithm and Shiny app, this milestone report explores the training data, summarizes its major features, and describes the planned modeling approach.

The final application will predict the next word based on a phrase entered by the user. This report focuses on understanding the data, cleaning the text, analyzing word and phrase frequencies, and preparing for an n-gram based prediction model.

1.1 Executive Summary

This report presents an exploratory analysis of a large English text corpus consisting of blogs, news articles, and Twitter data. A 1% sample of the dataset, comprising over 1 million words, was used to analyze language patterns efficiently while preserving statistical reliability.

The analysis shows that word usage follows a highly skewed distribution, where a small number of words account for a large proportion of all text. Additionally, common word sequences (n-grams) reveal strong and repetitive linguistic structures that can be leveraged for next-word prediction.

These findings support the development of a predictive text model based on n-gram probabilities with a backoff strategy. The final application will be implemented as a Shiny app capable of generating fast and accurate next-word predictions while maintaining efficient memory usage.

2 Data Source

The dataset consists of English text files from three sources:

  • Blogs
  • News
  • Twitter

The files used in this report are:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

These files contain real-world text and therefore include noise such as slang, informal spelling, punctuation variation, profanity, and possible foreign-language fragments.

3 Setup

library(stringi)
library(quanteda)
library(dplyr)
library(ggplot2)
library(knitr)

4 Load the Data

blogs_data <- "en_US.blogs.txt"
news_data <- "en_US.news.txt"
twitter_data <- "en_US.twitter.txt"

blogs <- readLines(blogs_data, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_data, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_data, encoding = "UTF-8", skipNul = TRUE)

5 Basic Summary Statistics

The first step is to confirm that the data has been loaded successfully and to calculate basic summaries for each source.

summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  ),
  File_Size_MB = round(c(
    file.info(blogs_data)$size,
    file.info(news_data)$size,
    file.info(twitter_data)$size
  ) / 1024^2, 2)
)

kable(summary_table, caption = "Basic summary statistics for the English corpus")
Basic summary statistics for the English corpus
Source Lines Words Characters File_Size_MB
Blogs 899288 37546806 206824505 200.42
News 1010206 34761151 203214543 196.28
Twitter 2360148 30096690 162096241 159.36

6 Line Length Analysis

Line length gives a useful view of the structure of each data source. Twitter lines are expected to be shorter, while blogs and news may contain longer passages.

line_lengths <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs), length(news), length(twitter))),
  Characters = c(nchar(blogs), nchar(news), nchar(twitter)),
  Words = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter))
)

kable(
  line_lengths %>%
    group_by(Source) %>%
    summarise(
      Mean_Words = round(mean(Words), 2),
      Median_Words = median(Words),
      Mean_Characters = round(mean(Characters), 2),
      Median_Characters = median(Characters),
      .groups = "drop"
    ),
  caption = "Line length summary by source"
)
Line length summary by source
Source Mean_Words Median_Words Mean_Characters Median_Characters
Blogs 41.75 28 229.99 156
News 34.41 32 201.16 185
Twitter 12.75 12 68.68 64
ggplot(line_lengths, aes(x = Words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ Source, scales = "free_y") +
  coord_cartesian(xlim = c(0, 100)) +
  labs(
    title = "Distribution of Line Lengths",
    subtitle = "Word counts per line, limited to 0-100 words for readability",
    x = "Words per Line",
    y = "Number of Lines"
  )

7 Sampling Strategy

The full dataset is large, so this report uses a random sample for exploratory analysis. Sampling reduces memory usage and processing time while still allowing the major language patterns to be observed.

The sampling rate can be adjusted; a value of 0.01 means 1% of each data source is sampled.

set.seed(123)

sample_rate <- 0.01

blogs_sample <- sample(blogs, size = max(1, floor(length(blogs) * sample_rate)))
news_sample <- sample(news, size = max(1, floor(length(news) * sample_rate)))
twitter_sample <- sample(twitter, size = max(1, floor(length(twitter) * sample_rate)))

sample_text <- c(blogs_sample, news_sample, twitter_sample)

sample_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter", "Combined Sample"),
  Lines = c(length(blogs_sample), length(news_sample), length(twitter_sample), length(sample_text)),
  Words = c(
    sum(stri_count_words(blogs_sample)),
    sum(stri_count_words(news_sample)),
    sum(stri_count_words(twitter_sample)),
    sum(stri_count_words(sample_text))
  )
)

kable(sample_summary, caption = "Summary of sampled data used for exploratory analysis")
Summary of sampled data used for exploratory analysis
Source Lines Words
Blogs 8992 374837
News 10102 347501
Twitter 23601 301078
Combined Sample 42695 1023416

8 Text Cleaning and Tokenization

The sampled text is cleaned and tokenized using the following steps:

  • Convert all text to lowercase
  • Remove punctuation
  • Remove numbers
  • Remove symbols
  • Split text into word tokens
tokens_clean <- tokens(
  sample_text,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_separators = TRUE
)

tokens_clean <- tokens_tolower(tokens_clean)

9 Unigram Analysis

Unigrams are individual words. The most frequent words give a simple view of the vocabulary and common language patterns in the corpus.

dfm_uni <- dfm(tokens_clean)

top_uni <- topfeatures(dfm_uni, 25)

unigram_table <- data.frame(
  Word = names(top_uni),
  Frequency = as.numeric(top_uni)
)

kable(unigram_table, caption = "Top 25 most frequent unigrams")
Top 25 most frequent unigrams
Word Frequency
the 47771
to 27621
and 24225
a 23698
of 19825
i 16755
in 16576
for 11125
is 10556
that 10459
you 9347
it 9106
on 8194
with 7236
was 6284
my 6004
at 5719
be 5591
this 5366
have 5277
but 4913
are 4759
as 4726
he 4279
we 4169
ggplot(unigram_table, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 25 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

10 Bigram Analysis

Bigrams are two-word phrases. They help reveal common phrase structures and are useful for next-word prediction.

tokens_bi <- tokens_ngrams(tokens_clean, n = 2)
dfm_bi <- dfm(tokens_bi)

top_bi <- topfeatures(dfm_bi, 25)

bigram_table <- data.frame(
  Bigram = names(top_bi),
  Frequency = as.numeric(top_bi)
)

kable(bigram_table, caption = "Top 25 most frequent bigrams")
Top 25 most frequent bigrams
Bigram Frequency
in_the 4218
of_the 4203
to_the 2146
for_the 2078
on_the 1962
to_be 1593
at_the 1421
and_the 1243
in_a 1171
with_the 1031
is_a 1006
it_was 974
for_a 928
from_the 899
i_was 879
i_have 856
and_i 836
i_am 832
with_a 810
will_be 809
of_a 802
going_to 789
if_you 764
it_is 757
have_a 754
ggplot(bigram_table, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 25 Most Frequent Bigrams",
    x = "Bigram",
    y = "Frequency"
  )

11 Trigram Analysis

Trigrams are three-word phrases. These are especially useful for predicting the next word from recent context.

tokens_tri <- tokens_ngrams(tokens_clean, n = 3)
dfm_tri <- dfm(tokens_tri)

top_tri <- topfeatures(dfm_tri, 25)

trigram_table <- data.frame(
  Trigram = names(top_tri),
  Frequency = as.numeric(top_tri)
)

kable(trigram_table, caption = "Top 25 most frequent trigrams")
Top 25 most frequent trigrams
Trigram Frequency
one_of_the 335
a_lot_of 268
thanks_for_the 234
going_to_be 180
out_of_the 169
to_be_a 163
the_end_of 157
it_was_a 154
be_able_to 143
i_want_to 141
looking_forward_to 136
i_have_to 124
i_have_a 124
this_is_a 124
as_well_as 123
some_of_the 119
part_of_the 118
the_rest_of 117
thank_you_for 111
i_love_you 111
in_the_first 108
a_couple_of 107
at_the_end 104
end_of_the 102
is_going_to 101
ggplot(trigram_table, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 25 Most Frequent Trigrams",
    x = "Trigram",
    y = "Frequency"
  )

12 Word Frequency Coverage

A useful question for predictive modeling is how many unique words are needed to cover a large proportion of total word usage. If a relatively small dictionary covers most word instances, the final model can be made smaller and faster.

word_freq <- sort(colSums(dfm_uni), decreasing = TRUE)
cumulative_coverage <- cumsum(word_freq) / sum(word_freq)

# unname() keeps the word names from appearing as row labels in the table
coverage_50 <- unname(which(cumulative_coverage >= 0.50)[1])
coverage_90 <- unname(which(cumulative_coverage >= 0.90)[1])
coverage_95 <- unname(which(cumulative_coverage >= 0.95)[1])

coverage_table <- data.frame(
  Coverage = c("50%", "90%", "95%"),
  Unique_Words_Required = c(coverage_50, coverage_90, coverage_95)
)

kable(coverage_table, caption = "Number of unique words required to reach selected coverage levels")
Number of unique words required to reach selected coverage levels
Coverage Unique_Words_Required
50% 144
90% 7987
95% 18443
coverage_df <- data.frame(
  Rank = seq_along(cumulative_coverage),
  Coverage = cumulative_coverage
)

ggplot(coverage_df, aes(x = Rank, y = Coverage)) +
  geom_line() +
  geom_hline(yintercept = 0.50, linetype = "dashed") +
  geom_hline(yintercept = 0.90, linetype = "dashed") +
  geom_hline(yintercept = 0.95, linetype = "dashed") +
  labs(
    title = "Cumulative Word Frequency Coverage",
    x = "Word Rank by Frequency",
    y = "Cumulative Coverage"
  )

13 Key Findings

The exploratory analysis of the English corpus reveals several important insights:

  1. The dataset is large and representative of real-world language
    The corpus contains millions of lines and tens of millions of words across blogs, news, and Twitter. A 1% random sample was used for analysis, resulting in over 1 million words, which is sufficiently large to capture the main linguistic patterns while reducing computational cost.

  2. Distinct writing styles across data sources
    The three sources exhibit clear structural differences. Blogs contain the longest entries (average ~42 words per line), followed by news (~34 words), while Twitter entries are much shorter (~13 words per line). This highlights the contrast between formal, descriptive writing (blogs/news) and informal, conversational text (Twitter).

  3. Word usage follows a highly skewed distribution
    A small number of common words dominate the corpus. Words such as the, to, and, a, and of appear far more frequently than others, confirming a Zipfian distribution typical of natural language; a quick log-log rank-frequency check of this pattern is sketched after this list. This has important implications for model efficiency.

  4. Frequent phrase patterns are strongly present
    Bigram and trigram analysis reveals highly recurring structures such as:

    • in the, of the, to the
    • one of the, a lot of, going to be

    These patterns demonstrate that language is highly structured and predictable, supporting the use of n-gram models for next-word prediction.

  5. Efficient coverage with a relatively small vocabulary
    The coverage analysis shows that:

    • Approximately 144 unique words account for 50% of all word occurrences
    • Around 8,000 words cover 90%
    • Around 18,000 words cover 95%

    This indicates that a compact, frequency-based dictionary can provide strong predictive coverage while keeping the model size manageable.

  6. Presence of real-world noise in the data
    The dataset includes informal language, abbreviations, punctuation inconsistencies, and occasional foreign-language fragments. Twitter data in particular contributes slang and conversational expressions. This reinforces the importance of thorough text cleaning and preprocessing.

  7. 1% sampling is sufficient for reliable analysis
    The 1% sample preserves key statistical properties of the full dataset, including word frequency distributions and common phrase structures. This confirms that meaningful insights can be obtained without processing the entire dataset, improving efficiency during development.
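The Zipfian pattern noted in finding 3 can be checked informally with a rank-frequency plot on log-log axes, where the points should fall roughly along a straight line. The sketch below is illustrative only and assumes the word_freq vector computed in the coverage section is still available.

# Informal Zipf check (sketch): plot word frequency against rank on log-log axes
zipf_df <- data.frame(
  Rank = seq_along(word_freq),
  Frequency = as.numeric(word_freq)
)

ggplot(zipf_df, aes(x = Rank, y = Frequency)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Rank-Frequency Plot (Log-Log Scale)",
    x = "Word Rank",
    y = "Word Frequency"
  )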

14 Planned Prediction Algorithm

The prediction model will be based on an n-gram language modeling approach, which estimates the probability of a word given the previous sequence of words.

The core strategy will use a backoff model, structured as follows (a short code sketch appears after the steps):

  1. The user input will be cleaned and tokenized using the same preprocessing steps applied to the training data.
  2. The model will first attempt to match the last two words of the input against a trigram table and predict the most likely third word.
  3. If no trigram match is found, the model will back off to a bigram model using only the last word of the input.
  4. If no bigram match is found, the model will fall back to unigram frequencies.
  5. The prediction returned will be the word with the highest conditional probability.
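
A minimal sketch of this backoff lookup is shown below. It is illustrative only and assumes the n-gram counts have already been reshaped into three hypothetical lookup tables: tri_tab (columns w1, w2, prediction, count), bi_tab (columns w1, prediction, count), and uni_tab (columns prediction, count).

# Illustrative backoff sketch; tri_tab, bi_tab, and uni_tab are assumed
# precomputed lookup tables, not objects created earlier in this report.
predict_next_word <- function(phrase, tri_tab, bi_tab, uni_tab) {
  # Clean the input with the same steps applied to the training data
  toks <- tokens_tolower(tokens(phrase, remove_punct = TRUE,
                                remove_numbers = TRUE, remove_symbols = TRUE))
  words <- unlist(as.list(toks), use.names = FALSE)
  n <- length(words)

  # 1. Try the trigram table using the last two words of the input
  if (n >= 2) {
    w1_in <- words[n - 1]; w2_in <- words[n]
    hit <- tri_tab %>% filter(w1 == w1_in, w2 == w2_in) %>% arrange(desc(count))
    if (nrow(hit) > 0) return(hit$prediction[1])
  }

  # 2. Back off to the bigram table using only the last word
  if (n >= 1) {
    w1_in <- words[n]
    hit <- bi_tab %>% filter(w1 == w1_in) %>% arrange(desc(count))
    if (nrow(hit) > 0) return(hit$prediction[1])
  }

  # 3. Final fallback: the most frequent unigram overall
  uni_tab$prediction[which.max(uni_tab$count)]
}

# Example call: predict_next_word("thanks for the", tri_tab, bi_tab, uni_tab)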

To improve prediction quality, the model may incorporate smoothing techniques such as:

  • Assigning non-zero probabilities to unseen n-grams
  • Using frequency thresholds to reduce noise from rare combinations

Additionally, the model will be implemented using precomputed n-gram frequency tables, enabling fast lookup during prediction. This approach ensures that predictions can be generated in real time within the Shiny application.

This design balances accuracy, computational efficiency, and scalability, making it suitable for deployment in a resource-constrained environment.

15 Model Efficiency Considerations

Since the final model will be deployed in a Shiny application, both memory usage and runtime performance are critical.

To optimize efficiency, the model will:

  • Use precomputed n-gram frequency tables stored in compact formats
  • Apply frequency thresholds to remove low-frequency n-grams
  • Limit the size of the prediction dictionary based on coverage analysis
  • Use efficient data structures for fast lookup (e.g., hash tables or indexed data frames)
  • Implement backoff logic to avoid unnecessary computation

These strategies will ensure that the model remains responsive while maintaining acceptable prediction accuracy.
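
As an illustration of the pruning and precomputation steps listed above, the sketch below builds a compact trigram lookup table from the counts computed earlier, drops trigrams seen fewer than three times (an assumed threshold), and saves the result as an .rds file for fast loading in the app. The object and file names are hypothetical.

# Illustrative sketch: prune rare trigrams and store a compact lookup table
tri_counts <- colSums(dfm_tri)
tri_counts <- tri_counts[tri_counts >= 3]        # assumed frequency threshold

# Split "w1_w2_w3" features into the previous words and the predicted word
parts <- strsplit(names(tri_counts), "_", fixed = TRUE)

tri_tab <- data.frame(
  w1         = vapply(parts, `[`, character(1), 1),
  w2         = vapply(parts, `[`, character(1), 2),
  prediction = vapply(parts, `[`, character(1), 3),
  count      = as.numeric(tri_counts)
) %>%
  arrange(desc(count))

saveRDS(tri_tab, "trigram_table.rds")            # compact file shipped with the app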

16 Planned Shiny Application

The planned Shiny app will include:

  • A text input box where the user enters a phrase
  • A prediction button or automatic prediction behavior
  • A displayed next-word prediction
  • Possibly a small list of top alternative predictions

The app will use precomputed n-gram tables so that predictions can be generated quickly.
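
A minimal skeleton of such an app is sketched below. It is illustrative only: shiny is not loaded in this report, and predict_next_word() together with the lookup tables tri_tab, bi_tab, and uni_tab refer to the hypothetical objects sketched in the previous sections.

# Minimal Shiny skeleton (sketch); assumes the prediction function and
# precomputed n-gram tables from the earlier sketches are available.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  actionButton("go", "Predict"),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    input$go                              # re-run when the button is pressed
    phrase <- isolate(input$phrase)
    if (nchar(trimws(phrase)) == 0) return("")
    predict_next_word(phrase, tri_tab, bi_tab, uni_tab)
  })
}

shinyApp(ui = ui, server = server)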

17 Conclusion

This milestone confirms that the English corpus has been loaded, summarized, sampled, cleaned, and explored. The analysis provides basic summaries, visualizations, unigram frequencies, bigram frequencies, trigram frequencies, and word coverage estimates.

The next step is to build and optimize the n-gram prediction algorithm and then deploy it in a Shiny application.