This report explores text data from Twitter, blogs, and news to prepare for a word prediction app. We analyzed the data’s structure and patterns to ensure we’re on track to build a user-friendly tool.
We successfully loaded three text files: Twitter, blogs, and news. To manage the large dataset, we used a 10% sample for analysis. The table below summarizes the full datasets’ size:
## Warning in readLines("en_US/en_US.news.txt",
## encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'en_US/en_US.news.txt'
| File | Lines | Words |
|---|---|---|
| Twitter | 2360148 | 9 |
| Blogs | 0 | 0 |
| News | 630799 | 0 |
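The loading, counting, and sampling steps described above can be sketched in base R as follows. This is a minimal illustration, not the code used to produce the table: the helper names (`read_corpus`, `count_words`, `summarize_corpus`, `sample_lines`), the fixed seed, and the demo file are all assumptions. Opening the connection in binary mode is one common way to avoid truncated reads on some platforms.

```r
# Read a text file through a binary connection (hypothetical helper).
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Count words as runs of non-whitespace characters.
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
}

# One summary row per corpus: number of lines and number of words.
summarize_corpus <- function(lines) {
  data.frame(Lines = length(lines), Words = count_words(lines))
}

# Draw a reproducible random sample of the lines (10% by default).
sample_lines <- function(lines, frac = 0.10, seed = 42) {
  set.seed(seed)
  n <- round(length(lines) * frac)
  lines[sample.int(length(lines), n)]
}

# Demo on a small temporary file rather than the real corpus.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("I love to eat pizza", "so tired today", "",
             "great game last night"), demo_path)
demo_lines <- read_corpus(demo_path)
print(summarize_corpus(demo_lines))
```

A word count of zero or near zero for a file with many lines would suggest that the counting step, not the data, needs checking.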
The datasets vary in style:
The bar chart below shows the top 10 words in a Twitter sample (after removing common words like “the” and punctuation):
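A word-frequency count of this kind can be sketched in base R as shown here. The stop-word list is a short illustrative assumption, not the list used for the report, and `top_words` is a hypothetical helper name.

```r
# Small illustrative stop-word list (assumption; not the report's list).
stop_words <- c("the", "a", "an", "and", "to", "of", "in", "is", "it",
                "i", "for", "on")

# Return the n most frequent words after lowercasing, stripping
# punctuation and digits, and dropping stop words.
top_words <- function(lines, n = 10, stop_words = character(0)) {
  text <- tolower(lines)
  text <- gsub("[^a-z' ]", " ", text)  # keep letters, apostrophes, spaces
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  counts <- sort(table(tokens), decreasing = TRUE)
  head(counts, n)
}

demo <- c("I love to eat pizza!",
          "Love the game, love the team.",
          "Pizza night :)")
top <- top_words(demo, n = 2, stop_words = stop_words)
print(top)
# A bar chart could then be drawn with: barplot(top, las = 2)
```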
The histogram below shows the distribution of words per line in the Twitter sample, highlighting that most lines are short (under 20 words):
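The words-per-line figures behind such a histogram could be computed along these lines. This is a sketch under the assumption that a word is any run of non-whitespace characters; `words_per_line` is a hypothetical helper name.

```r
# Count words on each line; empty lines yield 0.
words_per_line <- function(lines) {
  lengths(regmatches(lines, gregexpr("[^[:space:]]+", lines)))
}

demo <- c("short tweet", "", "a slightly longer tweet here")
wpl <- words_per_line(demo)
print(wpl)

# Share of lines with fewer than 20 words.
print(mean(wpl < 20))
# A histogram could then be drawn with: hist(wpl, breaks = 30)
```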
We will build a tool that predicts the next word a user types, similar to a phone’s auto-complete. For example, if someone types “I love to,” the tool might suggest “eat” or “run” based on common patterns. To keep it fast, we’ll use a smaller dataset and focus on frequent word combinations.
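One way to turn frequent word combinations into predictions is an n-gram lookup with back-off: try the longest context first, then fall back to a shorter one. The sketch below assumes bigram and trigram counts only. The toy corpus and the helper names (`tokenize`, `build_ngrams`, `predict_next`) are illustrative, and the report does not commit to this exact algorithm.

```r
# Lowercase, strip punctuation and digits, split into words.
tokenize <- function(line) {
  text <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[nzchar(tokens)]
}

# Count every n-word sequence; returns a named integer vector.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  setNames(as.vector(counts), names(counts))
}

# Suggest up to k next words, backing off from trigrams to bigrams.
predict_next <- function(input, models, k = 3) {
  tok <- tokenize(input)
  for (n in sort(as.integer(names(models)), decreasing = TRUE)) {
    if (length(tok) < n - 1) next
    context <- paste(tail(tok, n - 1), collapse = " ")
    counts <- models[[as.character(n)]]
    hits <- counts[startsWith(names(counts), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))
      best <- best[seq_len(min(k, length(best)))]
      return(sub(".* ", "", best))  # keep only the final word
    }
  }
  character(0)  # no match at any order
}

# Toy corpus (assumption) standing in for the sampled data.
corpus <- c("I love to eat pizza", "I love to eat pasta",
            "I love to run", "we like to run fast", "time to eat")
models <- list("2" = build_ngrams(corpus, 2),
               "3" = build_ngrams(corpus, 3))

print(predict_next("I love to", models))
print(predict_next("want to", models))
```

Keeping only n-grams above a minimum count would shrink the lookup tables, which is one way to keep the tool fast.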
The Shiny app will be simple and user-friendly:
This will make typing faster and more intuitive, like a virtual keyboard assistant.
This analysis confirms we’ve successfully loaded and explored the data, identified key patterns, and planned a practical word prediction app. We’re ready to develop the algorithm and app, with feedback welcome to improve our approach.