Dataset Profile

Source | File Size | Lines | Words | Unique Words
---|---|---|---|---
Blogs | 200 MB | 899,288 | 37,334,131 | 1,103,503
News | 196 MB | 1,010,242 | 34,372,530 | 876,770
Twitter | 159 MB | 2,360,148 | 30,373,543 | 1,290,171
Introduction
The goal of this project is to build a predictive text model similar to SwiftKey’s keyboard, which suggests the next word as users type. This report presents an exploratory data analysis (EDA) of the dataset to understand word distributions, phrase patterns, and key insights before building the prediction model.
Data Profiling for Predictive Text Dataset
The dataset consists of three large English text files from blogs, news, and Twitter. Data profiling is crucial to understand the dataset’s structure, quality, and distribution before building the predictive model. Here are the key data profiling tasks for this project:
Basic Structure & Metadata Profiling
The dataset consists of three text sources (Blogs, News, and Twitter) with notable differences in size, structure, and vocabulary richness:
Twitter has the highest number of lines (2.36M) but the smallest file size (159 MB) and the fewest words per line; it also has the largest unique vocabulary (1.29M words). This reflects short, fragmented messages with high variability in language (e.g., slang, abbreviations).
Blogs have the highest total word count (37.3M) and a rich vocabulary (1.1M unique words) despite having fewer lines than News and Twitter. This suggests longer, more diverse, and personal narratives.
News articles have a relatively structured format, with medium-length texts (1.01M lines, 34.4M words) and the lowest unique-word count (876K) of the three sources. This suggests consistent language use and a formal tone in news reporting.
Key Observations:
- Different text structures require different tokenization & modeling approaches (e.g., handling abbreviations for Twitter, formal grammar for News).
- Twitter’s short messages require efficient context modeling, while Blogs need long-term dependency handling.
- Preprocessing strategies should account for dataset-specific traits (e.g., removing redundant words in News, handling slang in Twitter).
- The dataset is very large, so a random sample is drawn for efficient analysis (as sketched below).
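A minimal sampling sketch, assuming the standard corpus file name `en_US.blogs.txt` (the same pattern applies to the News and Twitter files) and an illustrative 5% sampling rate:

```r
# Read one source and keep a random ~5% of lines for exploration.
set.seed(123)                                               # reproducibility
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sample_lines <- sample(blogs, floor(0.05 * length(blogs)))  # ~5% of lines
```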
Text Length Distribution
The analysis of character length per line across Blogs, News, and Twitter reveals key differences in text structure (a short sketch of the computation follows the list):
- Blogs have the longest entries, with an average length of 230 characters and a maximum of 40,833 characters, indicating long-form, personal, and detailed writing.
- News articles have a slightly shorter average length (201 characters) but a higher median (185) than blogs (156), suggesting more uniform and structured writing.
- Twitter has the shortest entries, with an average of 68.7 characters and a hard limit of 140 characters, reflecting the concise, informal, and real-time nature of tweets.
- Standard deviation (Std) shows variability: Blogs (259) have the widest variation, while Twitter (37.2) is highly constrained due to character limits.
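These summary statistics come from a simple per-line character count; a sketch using the sampled `blogs` lines from above (the same applies to the other sources):

```r
# Distribution of characters per line for one source.
len <- nchar(blogs)
summary(len)   # min, median, mean, max characters per line
sd(len)        # standard deviation of line length
```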
Key Implications
- Modeling Complexity: Blogs require handling longer sentences, while Twitter demands short-context predictions.
- Feature Engineering: Sentence length can be a useful predictor of text type and intent.
- Efficiency Considerations: Different pre-processing strategies (truncation, summarization) may be needed for long vs. short text formats.
Word Frequency Analysis
Using tidytext, we tokenize words and remove stopwords.
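A minimal sketch of this step, assuming the corpus has been gathered into a data frame `corpus_df` with `source` and `text` columns (hypothetical names, built here from the sampled blog lines):

```r
library(dplyr)
library(tidytext)

corpus_df <- tibble(source = "blogs", text = blogs)

word_counts <- corpus_df %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop "the", "and", "of", ...
  count(source, word, sort = TRUE)         # frequencies, most common first
```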
✅ Findings:
- Common words such as *love*, *like*, and *number* appear frequently on Twitter.
- Removing stopwords helps focus on more meaningful words.
Structural Profiling: N-Gram Analysis
We analyze word sequences (bigrams and trigrams) to identify common phrases and recurring word-order patterns, as sketched below.
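A sketch of the n-gram extraction, reusing the assumed `corpus_df` from the word-frequency step:

```r
library(dplyr)
library(tidytext)

bigrams <- corpus_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%    # lines too short to yield an n-gram give NA
  count(bigram, sort = TRUE)

trigrams <- corpus_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)
```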
Insights:
- The most frequent phrases are built largely from the top 20 high-frequency words seen in the word cloud.
- This helps in predictive text modeling by leveraging phrase-based learning.
Linguistic & Semantic Profiling
We analyze profanity, foreign words, and named entities.
Profanity Filtering
The presence of profanity words in the dataset is an important factor in text prediction models, as it impacts both user experience and content appropriateness.
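A minimal filtering sketch, assuming a plain-text banned-word list (the file name `profanity_list.txt` is hypothetical) and the `word_counts` table from the word-frequency step:

```r
library(dplyr)

profanity <- readLines("profanity_list.txt")   # one banned word per line
word_counts <- word_counts %>%
  filter(!word %in% profanity)                 # drop profane entries
```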
Foreign Language Detection
To evaluate how many words come from foreign languages, we used the cld2 package in R, a powerful tool for language detection that identifies the primary language of a text from character patterns and linguistic models.
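A short sketch of the detection call, run over the sampled lines from above:

```r
library(cld2)

langs <- detect_language(sample_lines)       # ISO codes such as "en", or NA
prop.table(table(langs, useNA = "ifany"))    # share of each detected language
```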
- The majority of words were correctly classified as English, but a small percentage of foreign words appeared, likely due to borrowed words, multilingual content, or user-generated text (especially in blogs).
Vocabulary Coverage
We calculate how many unique words cover 50% and 90% of the dataset. Understanding vocabulary coverage is crucial for optimizing predictive text models.
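A sketch of the cumulative-coverage computation, reusing `word_counts` (note that for true coverage figures the tokenization should be rerun without stopword removal):

```r
library(dplyr)

coverage <- word_counts %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%             # collapse counts across sources
  arrange(desc(n)) %>%
  mutate(cum_frac = cumsum(n) / sum(n)) # cumulative share of all tokens

min(which(coverage$cum_frac >= 0.5))    # unique words covering 50% of tokens
min(which(coverage$cum_frac >= 0.9))    # unique words covering 90% of tokens
```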
Findings:
- A small set of words dominates everyday language usage.
- This aligns with Zipf’s Law, which states that a few words occur very frequently, while most words appear rarely.
- This can help compress dictionary size for the prediction model.
**Next Steps: Building the Predictive Model**
Based on our findings, the predictive model will:
1. Use N-grams: Predict next word based on previous 1–3 words.
2. Handle unseen n-grams: Use a backoff strategy when encountering new phrases (see the sketch after this list).
3. Optimize performance: Reduce model size & computation time for real-time predictions.
4. Deploy in a Shiny App: Provide an interactive interface for testing predictions.
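As referenced in step 2, a minimal backoff lookup could look like the sketch below; the frequency tables `unigrams`, `bigrams`, and `trigrams` with split context columns (`word1`, `word2`, predicted `word`, and count `n`) are assumed to be precomputed from the n-gram counts above.

```r
library(dplyr)

# Try the trigram table first, back off to bigrams, then to the single
# most frequent unigram ("stupid backoff"-style, without score weighting).
predict_next <- function(w1, w2) {
  hit <- trigrams %>%
    filter(word1 == w1, word2 == w2) %>%
    slice_max(n, n = 1, with_ties = FALSE)
  if (nrow(hit) > 0) return(hit$word)

  hit <- bigrams %>%
    filter(word1 == w2) %>%
    slice_max(n, n = 1, with_ties = FALSE)
  if (nrow(hit) > 0) return(hit$word)

  unigrams$word[1]   # most common word overall
}

predict_next("happy", "new")   # plausibly returns "year"
```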
Conclusion
This report provides an initial analysis of the dataset. The next step is to build a machine learning model and deploy it in a Shiny App. 🚀