The dataset used for this analysis is derived from the HC Corpora, a collection of text data from various sources. This corpus was provided as part of the Data Scientist Capstone course on Coursera, and can be accessed through the following link: Capstone Dataset
For this project, I focused on the English-language data within the corpus, which comprises blogs, news articles, and Twitter posts. I drew a random 0.5% sample of the documents from each source to keep the analysis computationally manageable. Because the three sources differ in length, tone, and vocabulary, the combined sample provides a varied foundation for both the exploratory data analysis (EDA) and the word prediction application.
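A minimal sketch of this kind of subsampling is shown below. It assumes each source has already been read into a list of lines (one document per line). The function name `sample_lines` and the fixed seed are illustrative choices, not the code used to produce this report.

```python
import random


def sample_lines(lines, fraction=0.005, seed=42):
    """Return a random subset holding roughly `fraction` of `lines`.

    Assumption: `lines` is a list with one document per element.
    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    # Sampling without replacement, so no document is picked twice.
    return rng.sample(lines, k)


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(10_000)]
    subset = sample_lines(corpus)
    print(len(subset))  # 50 documents, i.e. 0.5% of 10,000
```

Sampling without replacement at a fixed rate keeps the relative sizes of the three sources intact, which is why the sampled Twitter set remains much larger than the sampled news set.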
In the following sections, I will detail the EDA process and outline plans for developing the word prediction application based on insights from this dataset.
Following the data introduction, I conducted an initial exploration of the subsampled corpora to assess the presence of profanity and stopwords, which could impact downstream analysis.
The analysis revealed varying amounts of profanity across the corpora: 145 occurrences in the blogs sample, 7 in news, and 542 in Twitter. Stopwords accounted for 89,175, 5,410, and 63,649 tokens in the blogs, news, and Twitter samples, respectively.
To optimize the dataset for further analysis, profanity and stopwords were removed. Basic summaries of the cleaned corpora are presented in Table 1.
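The filtering step can be sketched as follows. This is an illustration under stated assumptions: the tiny `STOPWORDS` and `PROFANITY` sets are placeholders (a real run would load full word lists), and the regular-expression tokenizer is a simplification rather than the tokenizer used for this report.

```python
import re

# Placeholder word lists for illustration only; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
PROFANITY = {"darn", "heck"}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def clean_tokens(text, stopwords=STOPWORDS, profanity=PROFANITY):
    """Return (kept_tokens, n_stopwords, n_profanity) for one document."""
    tokens = tokenize(text)
    n_stop = sum(t in stopwords for t in tokens)
    n_prof = sum(t in profanity for t in tokens)
    kept = [t for t in tokens if t not in stopwords and t not in profanity]
    return kept, n_stop, n_prof


if __name__ == "__main__":
    print(clean_tokens("The heck of a day it is"))
```

Counting the removed tokens while filtering gives the per-corpus profanity and stopword totals in the same pass.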
| source | n_docs | n_words_total | n_unique_words | n_sentences_total | avg_words_per_doc | avg_unique_words_per_doc | avg_sentences_per_doc | vocabulary_richness |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs_sampled | 4497 | 94826 | 22833 | 11722 | 21.086502 | 5.077385 | 2.606627 | 0.2407884 |
| en_US.news_sampled | 387 | 7122 | 4176 | 765 | 18.403101 | 10.790698 | 1.976744 | 0.5863521 |
| en_US.twitter_sampled | 11801 | 83859 | 21297 | 18908 | 7.106093 | 1.804678 | 1.602237 | 0.2539620 |
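Key findings from the summaries include clear differences in document length (blogs are longest, Twitter posts shortest) and the highest vocabulary richness in news articles. The news figure should be read with some caution: vocabulary richness here is the ratio of unique words to total words, which tends to be higher for smaller samples, and the news sample is by far the smallest of the three. Even so, each corpus shows distinct characteristics, which suggests that tailored models may be needed to predict language patterns specific to each source: blogs, news articles, and Twitter posts.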
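Several columns of Table 1 are simple ratios, and the sketch below shows one way to compute them. The ratios agree with the table (for example, 22833 / 94826 ≈ 0.2408 for blog vocabulary richness, and 94826 / 4497 ≈ 21.09 for average words per blog document). The function name `summarize` and the simple tokenizer are assumptions made for illustration, and sentence counts are omitted.

```python
import re


def summarize(docs):
    """Compute basic corpus statistics for a list of document strings."""
    words = [w for d in docs for w in re.findall(r"[a-z']+", d.lower())]
    n_docs = len(docs)
    n_words = len(words)
    n_unique = len(set(words))
    return {
        "n_docs": n_docs,
        "n_words_total": n_words,
        "n_unique_words": n_unique,
        # Guard against empty input to avoid division by zero.
        "avg_words_per_doc": n_words / n_docs if n_docs else 0.0,
        # Unique words divided by total words (type-token ratio).
        "vocabulary_richness": n_unique / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(summarize(["a b a", "c"]))
```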
The next phase will focus on n-gram analysis and frequency distributions across corpora, laying the groundwork for developing accurate, context-aware word prediction models.
This figure uses word clouds to compare word frequencies in the blogs, news, and Twitter samples. While common terms appear in all three sources, distinct patterns emerge: blogs emphasize personal experience (“time,” “people”), news focuses on reporting (“said”), and Twitter emphasizes real-time interaction (“thanks,” “RT,” “today”). The visualization highlights how language and content priorities differ across these media.
The cumulative coverage graphs for blogs, news, and Twitter reveal Zipfian word frequency distributions with platform-specific patterns. Twitter shows more concentrated common word usage, while news exhibits broader vocabulary. A small set of words accounts for 50% coverage, but 90% coverage requires significantly more unique words. This suggests that focusing on frequent words will provide high initial coverage for word prediction models, with diminishing returns as vocabulary size increases.
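The coverage idea can be made concrete with a short sketch: sort words by frequency and count how many of the most frequent ones are needed before their combined occurrences reach a target share of all tokens. The function name `words_needed_for_coverage` is hypothetical, and the toy token list is invented purely to show the mechanics.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent words cover `target` of all tokens.

    `target` is a fraction between 0 and 1 (for example 0.5 or 0.9).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


if __name__ == "__main__":
    toy = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
    print(words_needed_for_coverage(toy, 0.5))   # 1
    print(words_needed_for_coverage(toy, 0.85))  # 3
```

In the toy example a single word already covers half the tokens, while reaching 85% requires three of the four distinct words, which mirrors the diminishing returns described above.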
N-gram cumulative coverage graphs give further insight into language complexity across blogs, news, and Twitter. For all platforms, 2-grams reach a given coverage level with the fewest distinct entries, followed by 3-grams and then 4-grams, because longer word sequences repeat less often. Twitter reaches full coverage with fewer n-grams, which may reflect more constrained, predictable language under its character limit. Blogs use the most diverse language, requiring more n-grams for full coverage, which indicates a richer vocabulary and more complex structures. News falls in between. These results suggest that word prediction models should be platform-specific, with Twitter benefiting from a more compact model and blogs and news needing broader models to capture their language diversity.
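As a sketch of how n-gram counts and their coverage might be computed, the code below builds n-grams within each document (so sequences never span document boundaries) and reports the share of all n-gram occurrences covered by the `k` most frequent ones. The helper names `ngrams`, `ngram_counts`, and `coverage_of_top_k` are illustrative assumptions, not the functions used for the graphs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(docs_tokens, n):
    """Count n-grams over many tokenized documents, one document at a time."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(ngrams(tokens, n))
    return counts


def coverage_of_top_k(counts, k):
    """Fraction of all n-gram occurrences covered by the k most frequent."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


if __name__ == "__main__":
    docs = [["a", "b", "a", "b"], ["a", "b"]]
    bigrams = ngram_counts(docs, 2)
    print(bigrams)
    print(coverage_of_top_k(bigrams, 1))  # 0.75
```

Plotting `coverage_of_top_k` against `k` for n = 2, 3, and 4 yields cumulative coverage curves of the kind discussed above.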
Based on the EDA findings, I propose the following approach:
This approach, informed by the initial EDA, should result in a robust, platform-specific word prediction model capable of handling diverse language patterns across blogs, news, and Twitter.