SwiftKey Next Word Prediction Using N-Gram Models
PROJECT OVERVIEW
Objective:
- Build a predictive text model using the SwiftKey dataset
- Predict the next word based on user input text
- Deploy the model using a Shiny web application
Key Idea:
- Use Natural Language Processing (NLP)
- Apply n-gram language modeling for prediction
DATASET
Data Sources:
- Blogs dataset, News dataset, and Twitter dataset
Dataset Summary:
- Blogs: 899,288 lines
- News: 1,010,242 lines
- Twitter: 2,360,148 lines
Key Insights:
- Common English words dominate the corpus
- Frequently occurring bigrams and trigrams reveal language patterns
- A relatively small vocabulary covers a large portion of total word occurrences
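The coverage observation above can be checked with a simple frequency count. The following is a minimal Python sketch, not the project's own code: the function name coverage_vocab_size and the toy sentence are hypothetical and stand in for the tokenized SwiftKey corpus.

```python
from collections import Counter

def coverage_vocab_size(tokens, target=0.5):
    """Number of most-frequent distinct words needed to cover
    a `target` fraction of all word occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent word
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Toy corpus for illustration only (not the SwiftKey data)
tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(coverage_vocab_size(tokens, 0.5))
```

Running the same function on the full corpus with targets such as 0.5 and 0.9 would show how many distinct words account for half, or most, of all word occurrences.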
MODEL APPROACH
Modeling Technique:
- Trigram model (primary prediction)
- Bigram model (backoff strategy)
Workflow:
- Input text is tokenized
- Trigram model is checked first, using the last two words of the input
- If no match, bigram model is used with the last word only
- If still no match, a default word is returned
This ensures the model always returns a prediction, even for unseen inputs.
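The workflow above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the deployed Shiny implementation: the function names (tokenize, build_models, predict_next), the regex tokenizer, the toy corpus, and the default word "the" are all hypothetical choices.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed tokenizer: lowercase, keep letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def build_models(sentences):
    # Map each context to a Counter of the words that follow it
    bigrams = defaultdict(Counter)
    trigrams = defaultdict(Counter)
    for sentence in sentences:
        toks = tokenize(sentence)
        for i in range(len(toks) - 1):
            bigrams[toks[i]][toks[i + 1]] += 1
        for i in range(len(toks) - 2):
            trigrams[(toks[i], toks[i + 1])][toks[i + 2]] += 1
    return bigrams, trigrams

def predict_next(text, bigrams, trigrams, default="the"):
    toks = tokenize(text)
    # Step 1: trigram lookup on the last two words
    if len(toks) >= 2:
        context = (toks[-2], toks[-1])
        if context in trigrams:
            return trigrams[context].most_common(1)[0][0]
    # Step 2: back off to a bigram lookup on the last word
    if toks and toks[-1] in bigrams:
        return bigrams[toks[-1]].most_common(1)[0][0]
    # Step 3: nothing matched, return the assumed default word
    return default

# Toy corpus for illustration only (not the SwiftKey data)
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "see you at the party",
]
bigrams, trigrams = build_models(corpus)
print(predict_next("Thanks for the", bigrams, trigrams))  # trigram match
print(predict_next("enjoy the", bigrams, trigrams))       # bigram backoff
print(predict_next("zebra", bigrams, trigrams))           # default word
```

This sketch picks the most frequent continuation at each level and does not weight or smooth the counts; a production model might prune rare n-grams to keep lookup tables small.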