Executive Summary

This report presents an exploratory analysis of the SwiftKey training dataset. The data consists of text collected from blogs, news articles, and Twitter posts. The objective is to understand the structure of the datasets and prepare for building a predictive text model.

Data Summary

The datasets analyzed are:

Basic Statistics

Dataset Lines Words
Blogs 1000 41890
News 1000 33489
Twitter 1000 12782

Findings

Visualization

The histogram below illustrates the distribution of blog line lengths.

Future Plans

The next phase of the project will focus on:

  1. Text cleaning.
  2. Tokenization.
  3. N-gram generation.
  4. Predictive text model creation.
  5. Development of a Shiny application for next-word prediction.

Conclusion

The exploratory analysis confirms that the SwiftKey datasets provide a strong foundation for building a predictive text application.

Future Plans

The next phase of the project will focus on:

  1. Text cleaning.
  2. Tokenization.
  3. N-gram generation.
  4. Predictive text model creation.
  5. Development of a Shiny application for next-word prediction.

Conclusion

The exploratory analysis confirms that the SwiftKey datasets provide a strong foundation for building a predictive text application.