Executive Summary

This report presents the exploratory analysis performed on the HC Corpora dataset provided for the Data Science Capstone project. The objective is to understand the characteristics of the text data and prepare for the development of a predictive text application.

The dataset contains text from blogs, news articles, and Twitter posts. Exploratory analysis was conducted to examine word frequencies, common phrases, and vocabulary coverage. These findings will guide the development of a next-word prediction model and Shiny application.

Data Summary

The dataset consists of three English-language text sources.

File	Lines	Words	Size (MB)
Blogs	899,288	37,546,806	200.42
News	1,010,206	34,761,151	196.28
Twitter	2,360,148	30,096,690	159.36

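The Twitter dataset contains the largest number of lines but the fewest words, while the Blogs dataset contains the most words. Dividing words by lines gives roughly 42 words per line for Blogs, 34 for News, and 13 for Twitter, so Twitter entries are much shorter on average.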
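The line, word, and size figures in the table can be reproduced with a single pass over each file. The following is a minimal sketch, not the code used for this report: the function name `summarize_text_file` and the temporary demo file are hypothetical, words are assumed to be whitespace-delimited tokens, and Python is used only for illustration.

```python
import os
import tempfile

def summarize_text_file(path):
    """Return (line_count, word_count, size_mb) for a UTF-8 text file."""
    lines = 0
    words = 0
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # assumption: words are whitespace-delimited
    size_mb = os.path.getsize(path) / (1024 ** 2)
    return lines, words, size_mb

# Hypothetical two-line file standing in for one of the corpus files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_lines, demo_words, demo_size_mb = summarize_text_file(demo_path)
os.remove(demo_path)
print(demo_lines, demo_words)
```

Reading line by line keeps memory use flat, which matters for files of around 200 MB. A different word definition (for example, one that strips punctuation) would give somewhat different word counts.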

Most Frequent Words

The most common words identified in the sample were:

Word	Frequency
the	1856
to	1138
and	1066
a	947
of	905

These results are consistent with the skewed distribution described by Zipf’s law, in which a small number of words occur very frequently while most words occur rarely.
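Word counting and a rough Zipf check can be sketched as follows. This is an illustrative example under stated assumptions: the `tokenize` helper and the toy sentence are hypothetical, and only the five frequencies in `reported_top5` come from the table above. Under an idealized Zipf distribution with exponent 1, rank times frequency would stay roughly constant.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical toy text; the report's sample is not reproduced here.
toy_text = "the cat sat on the mat and the dog sat on the rug"
word_counts = Counter(tokenize(toy_text))
top_words = word_counts.most_common(3)

# Frequencies copied from the table above, in rank order.
reported_top5 = [("the", 1856), ("to", 1138), ("and", 1066), ("a", 947), ("of", 905)]
rank_freq_products = [rank * freq
                      for rank, (_, freq) in enumerate(reported_top5, start=1)]

print(top_words)
print(rank_freq_products)
```

For the reported top five, the product rises from 1856 at rank 1 to 4525 at rank 5, so the head of the distribution decays more slowly than an ideal 1/rank curve. Five words are too few to judge the fit; a log-log plot over the full vocabulary would be a more reliable check.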

Most Frequent Bigrams

Bigram	Frequency
of the	193
in the	176
for the	83
on the	77
to the	76

All five of these pairs combine a preposition with “the”, a very common English structure. Such frequent pairs provide useful context for next-word prediction.
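Bigram counts like those above can be produced by sliding a window over the tokens of each line. The following is a minimal sketch, assuming a simple regex tokenizer and counting within lines so that no n-gram spans two entries. The names `tokenize`, `ngram_counts`, and `sample_lines` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count n-grams within each line so that none spans two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical example lines, not taken from the corpus.
sample_lines = ["One of the best days of the year", "It was one of the best"]
bigram_counts = ngram_counts(sample_lines, 2)
trigram_counts = ngram_counts(sample_lines, 3)

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(2))
```

Calling the same function with `n=3` yields trigram counts, so one routine can feed every n-gram table in the analysis.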

Most Frequent Trigrams

Trigram	Frequency
I don’t	17
a lot of	14
one of the	12
I can’t	11
rest of the	9

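Frequent trigrams capture recurring multi-word patterns and give the model more context than bigrams. Entries such as “I don’t” and “I can’t” contain only two space-separated words; they presumably count as trigrams because the tokenizer split each contraction at the apostrophe.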

Vocabulary Coverage

Coverage analysis was performed to determine how many of the most frequent unique words are needed to account for a given share of all word occurrences in the sample.

50% coverage: 191 unique words
90% coverage: 6,078 unique words
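Coverage figures of this kind come from sorting words by frequency and accumulating counts until the target share of all tokens is reached. The following is a minimal sketch with hypothetical names (`words_needed_for_coverage`, `toy_counts`) and made-up counts; it is not the code that produced the figures above.

```python
from collections import Counter

def words_needed_for_coverage(word_counts, target):
    """Smallest number of top-ranked words covering `target` share of tokens."""
    total = sum(word_counts.values())
    if total == 0:
        return 0
    running = 0
    for needed, (_, freq) in enumerate(word_counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return needed
    return len(word_counts)

# Made-up counts: 100 tokens spread over 5 unique words.
toy_counts = Counter({"the": 50, "to": 20, "and": 15, "a": 10, "zebra": 5})
half = words_needed_for_coverage(toy_counts, 0.5)
ninety = words_needed_for_coverage(toy_counts, 0.9)
print(half, ninety)
```

In the toy data a single word covers half of all tokens while four are needed for 90%. The report's figures show the same pattern at scale: 191 words for 50% coverage and 6,078 for 90%.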

Interesting Findings

Several observations emerged from the exploratory analysis:

Word frequencies are highly skewed.
Common English words dominate the corpus.
Frequently occurring bigrams and trigrams capture common language structures.
A relatively small vocabulary accounts for a large proportion of all text.
The dataset appears large and varied enough to support a predictive text model.

Prediction Algorithm Plan

The planned prediction model will use an N-gram language modeling approach.

The model will generate predictions using:

Unigrams
Bigrams
Trigrams

A backoff strategy will be implemented to handle unseen phrases. If a trigram match is unavailable, the model will search the bigram model. If no bigram match exists, the most frequent unigram will be returned.

This approach balances prediction accuracy, memory usage, and computational efficiency.
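The backoff strategy described above can be sketched as follows. This is a simple highest-order-match backoff without smoothing or discounting, which is what the plan describes; it is not Katz backoff or any other weighted scheme. The names `build_model`, `predict_next`, and `training_lines` are hypothetical, and the training lines are made up.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def build_model(lines):
    """Build unigram counts and next-word tables keyed by one or two words."""
    unigrams = Counter()
    after_one = defaultdict(Counter)  # w1 -> counts of the following word
    after_two = defaultdict(Counter)  # (w1, w2) -> counts of the following word
    for line in lines:
        tokens = tokenize(line)
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            after_one[tokens[i]][tokens[i + 1]] += 1
        for i in range(len(tokens) - 2):
            after_two[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return unigrams, after_one, after_two

def predict_next(text, model):
    """Try the trigram table, then the bigram table, then the top unigram."""
    unigrams, after_one, after_two = model
    tokens = tokenize(text)
    if len(tokens) >= 2 and tuple(tokens[-2:]) in after_two:
        return after_two[tuple(tokens[-2:])].most_common(1)[0][0]
    if tokens and tokens[-1] in after_one:
        return after_one[tokens[-1]].most_common(1)[0][0]
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None

# Made-up training lines, not taken from the corpus.
training_lines = ["one of the best", "one of the most",
                  "one of the best days", "a lot of fun"]
model = build_model(training_lines)

print(predict_next("one of the", model))   # trigram table matches ("of", "the")
print(predict_next("because of", model))   # falls back to the bigram table
print(predict_next("hello zebra", model))  # falls back to the top unigram
```

Storing only the following-word counts for each context keeps lookups to dictionary accesses. For deployment, the tables could be pruned to the top few continuations per context to reduce memory, at some cost in coverage.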

Shiny Application Plan

A Shiny application will be developed to demonstrate the predictive text model.

The application will:

Accept text input from the user.
Predict the most likely next word.
Display predictions in real time.
Use a lightweight N-gram model suitable for deployment on shinyapps.io.

The final application will focus on providing fast predictions while maintaining low memory requirements.

Conclusion

The exploratory analysis identified key characteristics of the dataset and suggests that building a predictive text model from it is feasible.

The next phase of the project will focus on optimizing the N-gram model, implementing backoff techniques, and deploying the final prediction system as an interactive Shiny application.

Data Science Capstone Milestone Report

Zayeem

2026-05-30