Text Prediction Project
Exploratory Data Analysis
Report Date: 2025-11-17
Data Summary
The dataset consists of text from three sources:
- Blogs: 899,288 lines, 37 million words
- News: 77,259 lines, 34 million words
- Twitter: 2,360,148 lines, 30 million words
Total: Over 100 million words for training the prediction
algorithm.
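Line and word totals like those above can be gathered in a single pass over each file. The following is a minimal Python sketch, not the project's actual code: it assumes words are whitespace-delimited, and the function name `count_lines_and_words` and the in-memory sample are hypothetical.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line count, whitespace-delimited word count) for an iterable of lines."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        # Assumption: a "word" is any run of non-whitespace characters.
        n_words += len(line.split())
    return n_lines, n_words


# Tiny in-memory sample standing in for one corpus file (hypothetical data).
sample = ["the quick brown fox", "jumps over the lazy dog", ""]
print(count_lines_and_words(sample))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large file does not need to be loaded into memory at once.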
Word Frequency Analysis
[Figure: plot of the most common words]
Common function words such as “the”, “and”, and “to” are the most
frequent.
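A word frequency ranking of this kind can be sketched as follows. This is an illustrative Python example under stated assumptions: the tokenizer (lowercase letters and apostrophes only), the helper names `tokenize` and `top_words`, and the sample sentences are all hypothetical, not taken from the project.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: tokens are runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line: str) -> List[str]:
    """Lowercase a line and split it into word tokens."""
    return TOKEN_RE.findall(line.lower())


def top_words(lines: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent words with their counts."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)


# Hypothetical sample text, not drawn from the real corpus.
sample = [
    "The cat and the dog went to the park",
    "The dog wanted to play and to run",
]
print(top_words(sample, 2))
```

Even on this tiny sample, short function words rise to the top of the ranking, which matches the pattern reported for the full corpus.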
Prediction Algorithm Plan
Approach:
- Clean data - remove profanity, normalize text
- Build n-gram models - analyze word sequences
- Create backoff method - predict from the longest matching
context, falling back to shorter n-grams when it is unseen
- Build Shiny app - user-friendly interface
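
The cleaning, n-gram, and backoff steps in the approach above can be sketched in a few lines. This is a simplified Python illustration, not the project's implementation: it uses a plain longest-context backoff without probability discounting (so it is not Katz backoff), and the names `clean`, `build_ngram_tables`, `predict_next`, the placeholder banned-word set, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Assumption: tokens are runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def clean(line: str, banned: Set[str]) -> List[str]:
    """Lowercase, tokenize, and drop words found in a banned-word set."""
    return [t for t in TOKEN_RE.findall(line.lower()) if t not in banned]


def build_ngram_tables(
    lines: Iterable[str], banned: Set[str], max_n: int = 3
) -> Dict[Tuple[str, ...], Counter]:
    """Map each context of 0..max_n-1 preceding words to counts of the next word."""
    tables: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for line in lines:
        tokens = clean(line, banned)
        for i, word in enumerate(tokens):
            for n in range(max_n):
                if i - n < 0:
                    break  # not enough preceding words for this context length
                tables[tuple(tokens[i - n:i])][word] += 1
    return tables


def predict_next(
    text: str,
    tables: Dict[Tuple[str, ...], Counter],
    banned: Set[str],
    max_n: int = 3,
) -> Optional[str]:
    """Try the longest context first, then back off to shorter ones."""
    tokens = clean(text, banned)
    for n in range(max_n - 1, -1, -1):
        if n > len(tokens):
            continue  # input is shorter than this context length
        context = tuple(tokens[len(tokens) - n:])
        followers = tables.get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Hypothetical training lines and a placeholder banned-word set.
corpus = ["I want to go home", "I want to eat", "We want to go out", "You have to go"]
banned_words = {"darn"}
model = build_ngram_tables(corpus, banned_words)
print(predict_next("I want to", model, banned_words))     # matched on a two-word context
print(predict_next("they need to", model, banned_words))  # backs off to a one-word context
```

With `max_n=3` the tables cover trigram, bigram, and unigram contexts; the empty context acts as a last resort that returns the most frequent word overall. A production version would likely score candidates across n-gram orders rather than stop at the first match.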
Features:
- Real-time word prediction
- Mobile-friendly design
- Fast response times
Next Steps
- Complete data processing
- Build and test prediction algorithm
- Develop Shiny application
- Deploy to production
This report demonstrates initial exploratory analysis for the
text prediction project.