Our ultimate goal is to create a text prediction model similar to those we now encounter every day in phone keyboards and email composition tools. First, we need to understand the raw data set, which in this case consists of English text from blogs, news stories, and Twitter (now X) posts.
This report summarizes our initial exploratory analysis. It covers some basic statistics, as well as key findings from cleaning and exploring the text data. Finally, we discuss our proposed plan for building a prediction algorithm. Feel free to use the navigation on the left to skip between sections.
The source data is large, totaling over 550MB. To keep the analysis fast and manageable (the final model must ultimately run quickly on a smartphone), we first create a smaller, representative 2% random sample of the data. The tables below compare the full data set with the sample used for this report. We may adjust the sample size later as we balance the speed and accuracy of our model(s).
(Tables: size of the full data set vs. the 2% sample, by source.)
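To illustrate how such a sample can be drawn reproducibly, here is a minimal Python sketch that keeps roughly 2% of the lines of each source file. The file names, the sampling rate constant, and the fixed seed are assumptions for illustration, not the exact code behind the tables above.

```python
import random

SAMPLE_RATE = 0.02  # keep ~2% of lines, as described above (assumption)

def sample_file(in_path, out_path, rate=SAMPLE_RATE, seed=42):
    """Write a random ~2% sample of lines from in_path to out_path."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if rng.random() < rate:
                dst.write(line)

# Hypothetical file names for the three sources
for name in ("blogs", "news", "twitter"):
    sample_file(f"en_US.{name}.txt", f"sample.{name}.txt")
```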
To build a model, we first need to understand the vocabulary of the text - the words it contains. Our first key finding is that a small number of words make up the majority of the text, which is an expected result.
The plot below shows this relationship visually. We need only around 1,500 unique words to cover 80% of all word instances, and about 7,300 words to cover 90%. This is a crucial insight: it confirms we can build an effective, accurate model with a relatively small dictionary, which also keeps it fast and efficient.
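As a rough illustration of how these coverage numbers can be computed, the sketch below sorts words by frequency and counts how many are needed to cover a given share of all word instances. The simple tokenizer and function name are assumptions, not the exact code used for the plot.

```python
from collections import Counter
import re

def words_needed_for_coverage(text, target=0.80):
    """Return how many of the most frequent words cover `target` of all word instances."""
    tokens = re.findall(r"[a-z']+", text.lower())  # simple tokenizer (assumption)
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for i, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return i
    return len(counts)

# e.g. words_needed_for_coverage(sample_text, 0.80) -> roughly 1,500 on our sample
```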
To find the most important “content” words, we first removed common “stop words” (like ‘the’, ‘of’, ‘it’, ‘a’, and ‘is’). This led to our second important insight: after this step, the most common unigrams, bigrams, and trigrams were meaningless tokens like “1 2” and “2 2” (visuals and tables omitted from this summary). We therefore adjusted our cleaning approach to also filter out these meaningless “words”. The word cloud below shows the most frequent meaningful words that remain. This gives us confidence that our sample is representative of the source text and a good base for building prediction models.
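The cleaning and counting steps described above can be sketched roughly as follows. The abbreviated stop-word list, regular expressions, and function names are illustrative assumptions, not the exact filters we used.

```python
from collections import Counter
import re

STOP_WORDS = {"the", "of", "it", "a", "is", "and", "to", "in"}  # abbreviated list (assumption)

def clean_tokens(text):
    """Lowercase, tokenize, and drop stop words and purely numeric 'words'."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS                   # remove stop words
            and not re.fullmatch(r"[0-9]+", t)]      # remove tokens like "1" or "2"

def ngram_counts(tokens, n=2):
    """Count n-grams (e.g. bigrams for n=2) as space-joined strings."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```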
Our goal is to predict the next word a user might type. For this, we will use an n-gram backoff model.
The tables below show the top bigrams found in our sample data. The first table includes stop words and reveals the text’s structure, while the second table (with stop words removed) highlights meaningful content phrases. These frequencies will be at the core of our prediction model.
(Tables: top bigrams, with and without stop words.)
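A minimal sketch of the planned backoff lookup, assuming we already have frequency tables like those above (here plain Python dictionaries with hypothetical names `trigram_counts`, `bigram_counts`, and `unigram_counts`): try the longest matching context first and fall back to shorter ones. This is an unsmoothed illustration of the idea, not the final algorithm.

```python
def predict_next_word(context, trigram_counts, bigram_counts, unigram_counts):
    """Predict the next word for a list of preceding words using simple backoff."""
    # 1) Try the last two words as a trigram context.
    if len(context) >= 2:
        key = " ".join(context[-2:])
        candidates = {ng.split()[-1]: c for ng, c in trigram_counts.items()
                      if ng.rsplit(" ", 1)[0] == key}
        if candidates:
            return max(candidates, key=candidates.get)
    # 2) Back off to the last word as a bigram context.
    if context:
        key = context[-1]
        candidates = {ng.split()[-1]: c for ng, c in bigram_counts.items()
                      if ng.split()[0] == key}
        if candidates:
            return max(candidates, key=candidates.get)
    # 3) Back off to the single most frequent word overall.
    return max(unigram_counts, key=unigram_counts.get)
```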
Finally, we found that the three data sources have different characteristics. The Twitter data is ‘noisier’: it likely contains more slang and, as shown below, more non-English words than the blogs or news data.
This is another valuable insight, as it tells us our model must be trained on a diverse sample that includes this informal text to be effective for our users. The chart below visualizes this by showing the top detected non-English languages in each source.
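One way to produce a chart like this is to run a language detector over the lines of each source and tally everything not detected as English. The sketch below uses the third-party langdetect package as an example; that choice of tooling is an assumption for illustration, not necessarily how the chart was built.

```python
from collections import Counter
from langdetect import detect  # third-party package (assumption)

def non_english_language_counts(lines):
    """Count detected languages, excluding English, for a list of text lines."""
    counts = Counter()
    for line in lines:
        try:
            lang = detect(line)
        except Exception:
            continue  # langdetect raises on very short or ambiguous text; skip those lines
        if lang != "en":
            counts[lang] += 1
    return counts
```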
With this exploratory analysis now complete, our next steps will be to: