This report presents the exploratory analysis performed on the HC Corpora dataset provided for the Data Science Capstone project. The objective is to understand the characteristics of the text data and prepare for the development of a predictive text application.
The dataset contains text from blogs, news articles, and Twitter posts. Exploratory analysis was conducted to examine word frequencies, common phrases, and vocabulary coverage. These findings will guide the development of a next-word prediction model and Shiny application.
The dataset consists of three English-language text sources.
| File | Lines | Words | Size (MB) |
|---|---|---|---|
| Blogs | 899,288 | 37,546,806 | 200.42 |
| News | 1,010,206 | 34,761,151 | 196.28 |
| Twitter | 2,360,148 | 30,096,690 | 159.36 |
The Twitter dataset contains the largest number of lines, while the Blogs dataset contains the highest number of words.
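Summary statistics like those in the table can be gathered in one streaming pass over each file. The following is a minimal sketch, not the code used for this report: the function name `summarize_text_file` is hypothetical, words are assumed to be whitespace-delimited tokens, and a megabyte is assumed to be 1024² bytes, so the counts it produces may differ from the table if the original analysis used other definitions.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return (line_count, word_count, size_mb) for a single text file.

    Assumptions (not taken from the report): words are whitespace-delimited
    tokens and 1 MB = 1024 * 1024 bytes.
    """
    line_count = 0
    word_count = 0
    # Stream the file line by line so large corpora are never fully in memory.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())
    size_mb = os.path.getsize(path) / (1024 ** 2)
    return line_count, word_count, size_mb
```

Calling the function once per source file (blogs, news, Twitter) would yield one table row each.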
The most common words identified in the sample were:
| Word | Frequency |
|---|---|
| the | 1856 |
| to | 1138 |
| and | 1066 |
| a | 947 |
| of | 905 |
These counts are consistent with Zipf’s law, under which a word’s frequency is roughly inversely proportional to its frequency rank: a small number of words occur very frequently, while most words occur rarely.
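A word-frequency table of this kind can be produced with a tokenizer and a counter. The sketch below is illustrative only: `tokenize` and `top_words` are hypothetical names, and the choices to lowercase the text, keep contractions as single tokens, and treat curly and straight apostrophes alike are assumptions rather than the preprocessing used for this report.

```python
import re
from collections import Counter

# Assumed token definition: runs of letters, optionally joined by apostrophes
# so that contractions such as "don't" stay in one piece.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text, normalize curly apostrophes, and extract tokens."""
    normalized = text.lower().replace("\u2019", "'")
    return TOKEN_PATTERN.findall(normalized)


def top_words(lines, k=5):
    """Return the k most frequent tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)
```

Passing the sampled lines to `top_words` gives a ranked list that can be tabulated directly.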
| Bigram | Frequency |
|---|---|
| of the | 193 |
| in the | 176 |
| for the | 83 |
| on the | 77 |
| to the | 76 |
All five of the most frequent bigrams pair a preposition with "the". Such pairs reflect common English structures and give the model one word of context for next-word prediction.
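Bigram and trigram counts can come from the same sliding-window routine by varying the window size. This is a minimal sketch that assumes each line has already been split into a list of tokens; the names `ngrams` and `count_ngrams` are hypothetical, and n-grams are assumed not to span line boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # When the line has fewer than n tokens the range is empty.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokenized_lines, n):
    """Count n-grams line by line, so no n-gram crosses a line boundary."""
    counts = Counter()
    for tokens in tokenized_lines:
        counts.update(ngrams(tokens, n))
    return counts
```

Calling `count_ngrams(lines, 2).most_common(5)` would give a top-five bigram list, and `n=3` does the same for trigrams.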
| Trigram | Frequency |
|---|---|
| I don’t | 17 |
| a lot of | 14 |
| one of the | 12 |
| I can’t | 11 |
| rest of the | 9 |
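Frequent trigrams such as "a lot of" and "one of the" capture two words of context, which bigrams alone cannot, and should therefore help prediction. The entries "I don’t" and "I can’t" contain only two words; they appear in this table most likely because the tokenizer split each contraction at the apostrophe, so contraction handling should be reviewed before the model is built.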
Coverage analysis was performed to determine how many unique words are required to represent the majority of the corpus.
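One way to run such a coverage calculation is to sort words by frequency and accumulate counts until a target share of all word occurrences is reached. The sketch below assumes a `Counter` of word frequencies is already available; the function name `words_needed_for_coverage` and the `target` parameter are hypothetical, and no coverage figures from the actual corpus are implied.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent words cover `target` of all tokens.

    `counts` maps each word to its frequency; `target` is a fraction such as
    0.5 for 50% of word occurrences.
    """
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating counts.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)
```

Because a few words account for a large share of occurrences, the returned value typically grows slowly for moderate targets and sharply as the target approaches 1.0.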
Several observations emerged from the exploratory analysis:
The planned prediction model will use an N-gram language modeling approach.
The model will generate predictions from trigram, bigram, and unigram frequencies, consulted in that order.
A backoff strategy will be implemented to handle unseen phrases. If a trigram match is unavailable, the model will search the bigram model. If no bigram match exists, the most frequent unigram will be returned.
This approach balances prediction accuracy, memory usage, and computational efficiency.
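The backoff strategy described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the planned implementation: the class name `BackoffPredictor` and its methods are hypothetical, input lines are assumed to be pre-tokenized lists of words, and the simple "most frequent continuation" rule stands in for whatever scoring the final model uses.

```python
from collections import Counter, defaultdict


class BackoffPredictor:
    """Trigram -> bigram -> unigram backoff using raw frequency counts."""

    def __init__(self):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)   # previous word -> next-word counts
        self.trigrams = defaultdict(Counter)  # (word1, word2) -> next-word counts

    def train(self, tokenized_lines):
        for tokens in tokenized_lines:
            self.unigrams.update(tokens)
            for i in range(len(tokens) - 1):
                self.bigrams[tokens[i]][tokens[i + 1]] += 1
            for i in range(len(tokens) - 2):
                self.trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1

    def predict(self, context):
        """Return the most likely next word for a list of preceding words."""
        # 1. Try the trigram table, keyed on the last two words.
        if len(context) >= 2:
            followers = self.trigrams.get(tuple(context[-2:]))
            if followers:
                return followers.most_common(1)[0][0]
        # 2. Back off to the bigram table, keyed on the last word.
        if len(context) >= 1:
            followers = self.bigrams.get(context[-1])
            if followers:
                return followers.most_common(1)[0][0]
        # 3. Fall back to the single most frequent word, if any data exists.
        if self.unigrams:
            return self.unigrams.most_common(1)[0][0]
        return None
```

Storing the tables as dictionaries keyed by context keeps each lookup close to constant time, which fits the goal of fast predictions; pruning rare n-grams would be one possible way to reduce memory further.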
A Shiny application will be developed to demonstrate the predictive text model.
The application will:
The final application will focus on providing fast predictions while maintaining low memory requirements.
The exploratory analysis successfully identified key characteristics of the dataset and demonstrated the feasibility of building a predictive text model.
The next phase of the project will focus on optimizing the N-gram model, implementing backoff techniques, and deploying the final prediction system as an interactive Shiny application.