Executive Summary

This report presents an exploratory analysis of the SwiftKey training dataset. The data consists of text collected from blogs, news articles, and Twitter posts. The objective is to understand the structure of the datasets and prepare for building a predictive text model.

Data Summary

The datasets analyzed are:

Blogs
News
Twitter

Basic Statistics

Dataset	Lines	Words
Blogs	1000	41890
News	1000	33489
Twitter	1000	12782
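
Counts like those in the table can be produced with a few lines of code. The sketch below is illustrative only: the report does not show its own code, the function name `count_lines_and_words` is hypothetical, and a "word" is assumed to be a whitespace-separated token, which may differ from the definition behind the figures above.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line count, word count) for an iterable of text lines.

    Assumption: a word is any run of non-whitespace characters.
    """
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    return n_lines, n_words


if __name__ == "__main__":
    # Tiny made-up sample, not taken from the SwiftKey data.
    sample = ["the quick brown fox", "jumps over", ""]
    print(count_lines_and_words(sample))
```

Passing an open file handle instead of a list would count a whole file without loading it into memory at once.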

Findings

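Blog data contains the most words: 41,890 across 1,000 lines, about 41.9 words per line.
Twitter data has the shortest entries: 12,782 words across 1,000 lines, about 12.8 words per line.
News articles fall in between: 33,489 words across 1,000 lines, about 33.5 words per line.
The datasets contain sufficient textual information for language modeling: 88,161 words across the 3,000 lines analyzed.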
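
The comparison between datasets rests on average words per line, which follows directly from the Basic Statistics table. A minimal sketch of that arithmetic is shown here; the numbers are copied from the table, while the names `stats` and `words_per_line` are hypothetical.

```python
# Line and word counts copied from the Basic Statistics table.
stats = {
    "Blogs": (1000, 41890),
    "News": (1000, 33489),
    "Twitter": (1000, 12782),
}


def words_per_line(table):
    """Map each dataset name to its average number of words per line."""
    return {name: words / lines for name, (lines, words) in table.items()}


averages = words_per_line(stats)
for name, avg in averages.items():
    print(f"{name}: {avg:.1f} words per line")

# Total word count across the three datasets.
total_words = sum(words for _, words in stats.values())
print(f"Total words: {total_words}")
```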

Visualization

The histogram below illustrates the distribution of blog line lengths.
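
The binning behind such a histogram can be sketched as follows. This is not the code that produced the report's figure: the function names are hypothetical, line length is assumed to be measured in characters, and the bin width of 50 is an arbitrary choice for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(lines: Iterable[str], bin_width: int = 50) -> Dict[int, int]:
    """Count lines per length bin; keys are the lower edge of each bin.

    Assumption: length is measured in characters.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter((len(line) // bin_width) * bin_width for line in lines)
    return dict(sorted(counts.items()))


def render(hist: Dict[int, int], bin_width: int = 50) -> str:
    """Render the histogram as text, one row of '#' marks per bin."""
    rows = []
    for start, count in hist.items():
        label = f"{start}-{start + bin_width - 1}"
        rows.append(f"{label:>9} {'#' * count}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Made-up lines for illustration only.
    sample = ["a" * 10, "b" * 40, "c" * 75, "d" * 120]
    print(render(length_histogram(sample)))
```

Measuring length in words instead of characters would only require replacing `len(line)` with `len(line.split())`.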

Future Plans

The next phase of the project will focus on:

Text cleaning.
Tokenization.
N-gram generation.
Predictive text model creation.
Development of a Shiny application for next-word prediction.
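
The first four steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the planned implementation: all function names are hypothetical, cleaning is assumed to mean lowercasing and dropping everything except ASCII letters and apostrophes, and prediction uses a plain bigram frequency lookup with no smoothing or backoff. The Shiny application step is not covered.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Optional, Tuple


def clean(text: str) -> str:
    """Lowercase and keep only ASCII letters, apostrophes, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split cleaned text into whitespace-separated tokens."""
    return clean(text).split()


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token sequences in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_bigram_model(lines: Iterable[str]) -> DefaultDict[str, Counter]:
    """Count, for each word, how often each following word occurs."""
    model: DefaultDict[str, Counter] = defaultdict(Counter)
    for line in lines:
        for first, second in ngrams(tokenize(line), 2):
            model[first][second] += 1
    return model


def predict_next(model: DefaultDict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up training lines for illustration only.
    corpus = ["I love text mining", "I love R", "We love text"]
    model = build_bigram_model(corpus)
    print(predict_next(model, "love"))
```

A model meant for real use would likely combine several n-gram orders and fall back to shorter contexts when a longer one has not been seen.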

Conclusion

The exploratory analysis confirms that the SwiftKey datasets provide a strong foundation for building a predictive text application.

SwiftKey Capstone Milestone Report

Suchir

2026-06-07
