Executive Summary

This report presents the exploratory analysis performed on the HC Corpora dataset provided for the Data Science Capstone project. The objective is to understand the characteristics of the text data and prepare for the development of a predictive text application.

The dataset contains text from blogs, news articles, and Twitter posts. Exploratory analysis was conducted to examine word frequencies, common phrases, and vocabulary coverage. These findings will guide the development of a next-word prediction model and Shiny application.

Data Summary

The dataset consists of three English-language text sources.

File       Lines        Words         Size (MB)
Blogs      899,288      37,546,806    200.42
News       1,010,206    34,761,151    196.28
Twitter    2,360,148    30,096,690    159.36

The Twitter dataset contains the largest number of lines, while the Blogs dataset contains the highest number of words.
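As a rough illustration of how per-file figures like those in the table could be computed, the sketch below counts lines and whitespace-delimited words and reports the size on disk. The function name `file_summary` is hypothetical, and the word-count rule and the megabyte convention (1024 * 1024 bytes) are assumptions, so the output may not reproduce the reported numbers exactly.

```python
import os


def file_summary(path, encoding="utf-8"):
    """Return (line_count, word_count, size_mb) for one text file.

    Assumption: a "word" is any whitespace-delimited token, and one MB
    is 1024 * 1024 bytes. The report does not state either convention.
    """
    lines = 0
    words = 0
    # Stream the file line by line so large corpora are never fully in memory.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return lines, words, size_mb
```

Calling `file_summary` once per source file would yield one table row per call.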

Most Frequent Words

The most common words identified in the sample were:

Word    Frequency
the     1856
to      1138
and     1066
a       947
of      905

These results are consistent with Zipf’s law, under which a word’s frequency is roughly inversely proportional to its frequency rank: a small number of words occur very frequently, while most words occur rarely.
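A word-frequency table of this kind could be produced along the following lines. This is a minimal sketch, not the analysis code behind the report: the tokenizer (lowercasing, keeping letters and word-internal apostrophes) and the names `tokenize` and `top_words` are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization rule: runs of ASCII letters, optionally joined by a
# straight or typographic apostrophe so contractions stay in one token.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def top_words(lines, k=5):
    """Return the k most frequent words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)
```

Sorting the full `Counter` by frequency and plotting log-frequency against log-rank is a common way to check visually how closely a sample follows Zipf’s law.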

Most Frequent Bigrams

Bigram     Frequency
of the     193
in the     176
for the    83
on the     77
to the     76

All five of the most frequent pairs are a preposition followed by “the”, a very common English construction. Pairs like these give the model one word of context for next-word prediction.

Most Frequent Trigrams

Trigram        Frequency
I don’t        17
a lot of       14
one of the     12
I can’t        11
rest of the    9

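Frequent trigrams capture longer language patterns and give the model two words of context, which can improve prediction accuracy. Note that “I don’t” and “I can’t” each contain only two words; they were likely counted as trigrams because the tokenizer split contractions at the apostrophe, which should be checked in the tokenization step.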
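The bigram and trigram tables above can be built with one generic counting routine. The sketch below assumes the text has already been tokenized into one list of words per line; the names `ngrams` and `count_ngrams` are hypothetical, and n-grams are counted within a line only so they never span two unrelated lines.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens."""
    # zip over n shifted copies of the list; stops at the shortest one,
    # so lists with fewer than n tokens yield nothing.
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences, n):
    """Count n-grams over an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts
```

For example, `count_ngrams(sentences, 2).most_common(5)` would give the five most frequent bigrams, and passing `n=3` gives the trigrams.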

Vocabulary Coverage

Coverage analysis was performed to determine how many unique words are required to represent the majority of the corpus.
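One way to carry out such a coverage calculation is sketched below, assuming word counts are already available as a `Counter`. The function name `words_needed_for_coverage` and the definition of coverage (the share of all word occurrences accounted for by the most frequent words) are assumptions; the report does not state the exact definition it used.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent words cover `target` of all tokens.

    `counts` maps word -> frequency; `target` is a fraction such as 0.5 or 0.9.
    """
    total = sum(counts.values())
    if total == 0:
        return 0
    running = 0
    # Walk the vocabulary from most to least frequent, accumulating counts.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target * total:
            return rank
    return len(counts)
```

Evaluating this at targets such as 0.5 and 0.9 indicates how small a vocabulary the prediction model could keep while still covering most of the text.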

Interesting Findings

Several observations emerged from the exploratory analysis:

Prediction Algorithm Plan

The planned prediction model will use an N-gram language modeling approach.

The model will generate predictions using trigram, bigram, and unigram frequency tables.

A backoff strategy will be implemented to handle unseen phrases. If a trigram match is unavailable, the model will search the bigram model. If no bigram match exists, the most frequent unigram will be returned.

This approach balances prediction accuracy, memory usage, and computational efficiency.
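The backoff strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the names `build_model` and `predict_next` are hypothetical, the input is assumed to be pre-tokenized, and no smoothing or discounting is applied.

```python
from collections import Counter, defaultdict


def build_model(sentences):
    """Build unigram counts plus next-word counts keyed by 1 and 2 words of context."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # previous word -> Counter of next words
    trigrams = defaultdict(Counter)  # (word1, word2) -> Counter of next words
    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            bigrams[tokens[i]][tokens[i + 1]] += 1
        for i in range(len(tokens) - 2):
            trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return unigrams, bigrams, trigrams


def predict_next(context, model):
    """Predict the next word, backing off from trigram to bigram to unigram."""
    unigrams, bigrams, trigrams = model
    # 1. Try the last two words of the context against the trigram table.
    if len(context) >= 2:
        key = (context[-2], context[-1])
        if key in trigrams:
            return trigrams[key].most_common(1)[0][0]
    # 2. Back off to the last word and the bigram table.
    if len(context) >= 1 and context[-1] in bigrams:
        return bigrams[context[-1]].most_common(1)[0][0]
    # 3. Fall back to the single most frequent word overall.
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None
```

Storing only the top few continuations per context, rather than full counters, would be one way to reduce memory use in a deployed application.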

Shiny Application Plan

A Shiny application will be developed to demonstrate the predictive text model.

The application will:

The final application will focus on providing fast predictions while maintaining low memory requirements.

Conclusion

The exploratory analysis characterized the size of each source and the most frequent words, bigrams, and trigrams, and indicates that building an N-gram predictive text model is feasible.

The next phase of the project will focus on optimizing the N-gram model, implementing backoff techniques, and deploying the final prediction system as an interactive Shiny application.