Milestone Report

Dataset Summary

The SwiftKey dataset contains text from three sources: blogs, news articles, and Twitter posts.

Number of Lines

  • Blogs: 899288
  • News: 1010206
  • Twitter: 2360148
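
The line counts above could be reproduced with a short script along the following lines. This is a minimal sketch, not the method used to produce this report: the function names and the file names in the usage comment are assumptions, and counts may differ slightly depending on how a tool treats stray control characters or a missing final newline.

```python
def count_lines(path, encoding="utf-8"):
    """Count lines in a text file by streaming it, without loading it all into memory."""
    # newline="\n" makes only "\n" end a line, so a stray "\r" inside a line
    # does not split it in two. errors="replace" keeps undecodable bytes from aborting.
    with open(path, encoding=encoding, errors="replace", newline="\n") as handle:
        return sum(1 for _ in handle)


def count_lines_by_source(paths):
    """Map each source label to the line count of its file.

    `paths` is a dict such as {"Blogs": "en_US.blogs.txt"}; the file names
    are placeholders (an assumption), not paths taken from this report.
    """
    return {label: count_lines(path) for label, path in paths.items()}
```

For example, calling count_lines_by_source with a dict that maps "Blogs", "News", and "Twitter" to the local paths of the three files would return one count per source.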

Character Count Summary

Characters per line, by source:

  • Blogs: Min 1, 1st Quartile 47, Median 156, Mean 230, 3rd Quartile 329, Max 40833
  • News: Min 1, 1st Quartile 110, Median 185, Mean 201.2, 3rd Quartile 268, Max 11384
  • Twitter: Min 2, 1st Quartile 37, Median 64, Mean 68.68, 3rd Quartile 100, Max 140
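
A six-number summary of this kind can be computed from the per-line character counts. The sketch below is one way to do it, under stated assumptions: the function name is hypothetical, line length is measured in Unicode code points after removing the trailing newline, and quartiles use linear interpolation between data points (the "inclusive" method in Python's statistics module). Other tools may use different quartile conventions, so results can differ slightly from the figures reported here.

```python
import statistics


def char_length_summary(lines):
    """Return min, quartiles, mean, and max of the character count per line.

    `lines` is any iterable of strings with at least two items.
    """
    # Drop the trailing line terminator so it is not counted as a character.
    lengths = sorted(len(line.rstrip("\r\n")) for line in lines)
    # n=4 yields the three cut points: 1st quartile, median, 3rd quartile.
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(lengths),
        "q3": q3,
        "max": lengths[-1],
    }
```

Passing an open file handle (which iterates over lines) for each of the three sources would give one summary dictionary per source.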

Observations

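The Twitter file has the most lines (2360148), followed by news (1010206) and blogs (899288). Blog entries have the highest mean length (230 characters) and by far the largest maximum (40833), but their median (156) is lower than that of news (185). The blog distribution is therefore strongly right-skewed: most entries are short, and a few are very long. Tweets are the shortest, and their maximum of 140 characters is consistent with Twitter's 140-character limit. By mean length, news lines fall between blogs and tweets.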

Conclusion

The dataset provides a large and diverse collection of English text that can be used for natural language processing and predictive text modeling. Because the three sources differ in line length and writing style, training on all of them should help a next-word prediction model generalize across both short, informal input and longer, more formal text.