The SwiftKey dataset contains text from three sources: blogs, news articles, and Twitter posts.
Character counts per entry, by source:
Blogs: Min: 1, 1st Quartile: 47, Median: 156, Mean: 230, 3rd Quartile: 329, Max: 40833
News: Min: 1, 1st Quartile: 110, Median: 185, Mean: 201.2, 3rd Quartile: 268, Max: 11384
Twitter: Min: 2, 1st Quartile: 37, Median: 64, Mean: 68.68, 3rd Quartile: 100, Max: 140
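The Twitter dataset contains the most records, followed by news and then blogs. Blog entries have the highest mean length (230 characters) and by far the largest maximum (40,833 characters), although their median (156) is lower than that of news (185), which points to a small number of very long blog entries. Twitter posts are the shortest because of the platform's 140-character limit, which matches the observed maximum. By mean length, news articles fall between blogs and tweets.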
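The length statistics discussed here could be reproduced along the following lines. This is a minimal sketch, not the analysis actually used for this report: the function names `length_summary` and `summarize_file` are hypothetical, the file is assumed to hold one entry per line in UTF-8, and the quartiles use linear interpolation between data points, so values may differ slightly from those produced by other tools.

```python
import statistics


def length_summary(lines):
    """Summarize character counts for an iterable of text entries.

    Hypothetical helper: returns the record count, min, quartiles,
    mean, and max of the per-entry character counts.
    """
    # Strip trailing line terminators so they are not counted as characters.
    lengths = [len(line.rstrip("\r\n")) for line in lines]
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "records": len(lengths),
        "min": min(lengths),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(lengths),
        "q3": q3,
        "max": max(lengths),
    }


def summarize_file(path):
    """Apply length_summary to a text file with one entry per line (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return length_summary(handle)
```

Calling `summarize_file` once per source file (blogs, news, Twitter) would give one summary per source. Comparing both the mean and the median in each summary helps reveal skew, such as the long tail in the blog entries.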
The dataset provides a large and diverse collection of English text that can be used for natural language processing and predictive text modeling. Because the three sources differ markedly in text length, training on all of them should help a next-word prediction model handle both short messages and longer passages.