Executive Summary
This report summarizes exploratory analyses of the SwiftKey English
text datasets: blogs, news, and Twitter. We focus on line counts, word
counts, and line lengths, highlighting key features of the data. These
insights will guide the development of a text prediction algorithm and
Shiny app.
Basic Summaries
Number of lines per dataset
Dataset | Lines
Blogs | 1000
News | 1000
Twitter | 1000
Maximum characters per line in each dataset
Dataset | Maximum characters
Blogs | 1912
News | 982
Twitter | 140
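Summaries like the ones above can be computed in a few lines. The following is a minimal sketch in Python, not the code used to produce this report: the function name `summarize_lines` and the three-line sample are hypothetical, and words are assumed to be whitespace-separated tokens.

```python
def summarize_lines(lines):
    """Return the line count, total word count, and longest line length."""
    # Drop trailing newlines so they do not count toward line length.
    stripped = [line.rstrip("\n") for line in lines]
    return {
        "lines": len(stripped),
        # Assumption: a "word" is any whitespace-separated token.
        "words": sum(len(line.split()) for line in stripped),
        "max_chars": max((len(line) for line in stripped), default=0),
    }


# Hypothetical sample; real input would be the lines read from each dataset file.
sample = ["the quick brown fox\n", "hello world\n", "a\n"]
summary = summarize_lines(sample)
print(summary)
```

Running the same function once per dataset would give one row for each of the tables above.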
Key Observations
- Blogs tend to have longer lines and more words per line than
Twitter; the longest blog line has 1912 characters.
- Twitter lines are short but frequent; the 140-character maximum
reflects the tweet length limit.
- News lines are medium-length (at most 982 characters) and relatively
uniform.
Plans for Prediction Algorithm and Shiny App
- Goal: Predict the next word a user is likely to
type based on previous context.
- Approach: Use n-gram models and frequency tables
derived from these datasets.
- Shiny App: Provide a simple interface for typing
text and showing top predicted words.
- Future Steps: Explore more lines than the 1000 per dataset
summarized here, filter profanity, and optimize for performance on
large datasets.
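The n-gram approach in the plan above can be illustrated with its simplest case, a bigram frequency table. This is a hedged sketch in Python rather than the planned implementation: the names `build_bigram_table` and `predict_next` and the tiny corpus are hypothetical, tokenization is assumed to be lowercasing plus whitespace splitting, and there is no smoothing, backoff to longer or shorter n-grams, or profanity filtering.

```python
from collections import Counter, defaultdict


def build_bigram_table(lines):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        # Assumption: lowercase, whitespace-separated tokens.
        tokens = line.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table


def predict_next(table, text, k=3):
    """Return up to k of the most frequent followers of the last word typed."""
    tokens = text.lower().split()
    if not tokens:
        return []
    followers = table.get(tokens[-1])
    if followers is None:
        # Unseen word: a fuller model would back off to overall word frequencies.
        return []
    return [word for word, _ in followers.most_common(k)]


# Hypothetical corpus standing in for lines sampled from the datasets.
corpus = ["I love data science", "I love text mining", "I like data"]
table = build_bigram_table(corpus)
print(predict_next(table, "I"))  # ['love', 'like']
```

An app interface would call something like `predict_next` on the text typed so far and display the returned words. Precomputing the table once keeps each lookup cheap, which matters for the performance goal on large datasets.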