Overview

This report presents an exploratory analysis of the SwiftKey dataset. The dataset contains text from blogs, news articles, and Twitter posts. The purpose of this analysis is to understand the size and characteristics of the data before building a predictive text model and Shiny application.

Data Summary

File        Lines      Words
Blogs      899288   37546806
News      1010206   34761151
Twitter   2360148   30096690

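The Twitter dataset contains the most lines (about 2.36 million), while the Blogs dataset contains the most words (about 37.5 million). Twitter lines are therefore much shorter on average: roughly 13 words per line, compared with about 42 for Blogs and about 34 for News.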
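Counts like these can be computed along the following lines. This is a minimal sketch, not necessarily how the table was produced: the helper name summarize_text is hypothetical, the file path in the comment is a placeholder, and a "word" is assumed to be any whitespace-delimited token, so totals may differ slightly from other counting tools.

```r
# Hypothetical helper: count lines and whitespace-delimited words
# in a character vector holding one line of text per element.
summarize_text <- function(lines) {
  # Trim each line, then split on runs of whitespace
  tokens <- strsplit(trimws(lines), "\\s+")
  # Token count per line; blank lines contribute 0
  words_per_line <- lengths(tokens)
  data.frame(
    Lines = length(lines),
    Words = sum(words_per_line),
    WordsPerLine = mean(words_per_line)
  )
}

# Small in-memory example
sample_lines <- c("the quick brown fox", "jumps over", "")
print(summarize_text(sample_lines))

# With a real file (placeholder path, an assumption):
# blogs <- readLines("path/to/blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# summarize_text(blogs)
```

Returning a one-row data frame makes it easy to stack the results for several files with rbind().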

Exploratory Findings

Visualization

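# Load the plotting library
library(ggplot2)

# Word counts per source, taken from the Data Summary table
stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Words = c(37546806, 34761151, 30096690)
)

# Bar chart of total words per source
ggplot(stats, aes(x = File, y = Words)) +
  geom_col() +
  labs(title = "Word Counts by Dataset", x = "Dataset", y = "Words")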

Plans for Prediction Algorithm

The prediction algorithm will use n-gram language models to predict the next word based on previously entered words. The data will be cleaned, tokenized, and analyzed to identify common word sequences.
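The plan above can be illustrated with a small bigram example in base R. This is only a sketch under stated assumptions: the function names (tokenize, build_bigrams, predict_next), the cleaning rule (lowercase, keep letters and apostrophes), and the toy corpus are all hypothetical, and the final model may use longer n-grams and different cleaning steps.

```r
# Clean one string and split it into lowercase word tokens (assumed rule)
tokenize <- function(text) {
  text <- tolower(text)
  # Replace anything that is not a letter, apostrophe, or space
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Count adjacent word pairs within each line.
# Assumes the corpus contains at least one line with two or more words.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  # Sum a count of 1 per occurrence for each (first, second) pair
  aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
}

# Suggest up to k words that most often followed the last word typed
predict_next <- function(counts, text, k = 3) {
  tok <- tokenize(text)
  if (length(tok) == 0) return(character(0))
  last_word <- tok[length(tok)]
  cand <- counts[counts$first == last_word, ]
  # Most frequent first; ties broken alphabetically
  cand <- cand[order(-cand$n, cand$second), ]
  head(as.character(cand$second), k)
}

# Toy corpus for illustration only
corpus <- c("I love data science", "I love R", "I love data",
            "data science is fun")
counts <- build_bigrams(corpus)
print(predict_next(counts, "I love"))
```

Bigrams are counted per line so that word pairs never span two unrelated lines. A word that never appears as the first element of a pair returns an empty result, which is the case a fuller model would need to handle.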

Plans for Shiny Application

The Shiny application will provide an interface where users can enter text and receive predicted next-word suggestions generated by the language model.
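A minimal sketch of such an interface is shown here, assuming the shiny package is installed. The input and output IDs, the layout, and the suggest_words function are hypothetical; suggest_words is a fixed placeholder that stands in for the language model.

```r
library(shiny)

# Placeholder predictor (hypothetical): returns fixed suggestions
# for any non-empty input. The real model would replace this.
suggest_words <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "and", "to")
}

# Page with a text box and an area for the suggestions
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("suggestions")
)

# Recompute the suggestions whenever the input text changes
server <- function(input, output, session) {
  output$suggestions <- renderText({
    words <- suggest_words(input$phrase)
    if (length(words) == 0) {
      "No suggestions yet"
    } else {
      paste(words, collapse = ", ")
    }
  })
}

# Build the app object without launching it
app <- shinyApp(ui = ui, server = server)

# To launch locally:
# shiny::runApp(app)
```

Keeping the predictor in its own function means the interface does not need to change when the placeholder is swapped for the trained model.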

Conclusion

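The exploratory analysis shows the scale of the SwiftKey dataset (about 4.3 million lines and 102 million words across the three sources) and how the sources differ in typical line length. These findings will guide the development of a predictive text model and interactive Shiny application.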