Overview

This report presents an exploratory analysis of the SwiftKey dataset. The dataset contains text from blogs, news articles, and Twitter posts. The purpose of this analysis is to understand the size and characteristics of the data before building a predictive text model and Shiny application.

Data Summary

File	Lines	Words
Blogs	899288	37546806
News	1010206	34761151
Twitter	2360148	30096690

The Twitter dataset contains the largest number of lines (about 2.36 million), while the Blogs dataset contains the highest number of words (about 37.5 million).
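
The report does not show how the line and word counts were produced. A minimal sketch of one way to obtain such counts in base R follows, assuming each file has been read into a character vector (for example with `readLines()`) and that a word is any run of non-whitespace characters. The helper name `count_lines_words` and the sample text are hypothetical, and a different tokenization rule would give somewhat different word totals.

```r
# Hypothetical helper: count lines and whitespace-delimited words
# in a character vector that holds one element per line of a file.
count_lines_words <- function(lines) {
  # Split each trimmed line on runs of whitespace; empty lines yield 0 words.
  tokens <- strsplit(trimws(lines), "\\s+")
  data.frame(Lines = length(lines), Words = sum(lengths(tokens)))
}

# Toy input standing in for the result of readLines() on one of the files.
sample_lines <- c("the quick brown fox", "jumps over", "")
counts <- count_lines_words(sample_lines)
print(counts)
```

Applying the same helper to each of the three files and binding the rows together would give a table with the same layout as the one above.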

Exploratory Findings

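Blogs contain the longest entries, averaging about 42 words per line.
Twitter contains the shortest messages, averaging about 13 words per line, but the largest number of individual entries.
News articles fall in between, at about 34 words per line, and use more formal language and sentence structure.
Together, the three sources provide a diverse sample of written English, from short informal messages to formal news writing.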
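
The comparison of entry lengths can be checked directly from the Data Summary table. The sketch below divides words by lines for each source; the data frame name `corpus_stats` and the column `WordsPerLine` are introduced here for illustration only.

```r
# Line and word counts copied from the Data Summary table.
corpus_stats <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096690)
)

# Average words per line for each source, rounded to one decimal place.
corpus_stats$WordsPerLine <- round(corpus_stats$Words / corpus_stats$Lines, 1)
print(corpus_stats)
# WordsPerLine: Blogs 41.8, News 34.4, Twitter 12.8
```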

Visualization

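library(ggplot2)

# Word counts per source, taken from the Data Summary table
stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Words = c(37546806, 34761151, 30096690)
)

# Bar chart of total words per source; geom_col() plots the values as given
ggplot(stats, aes(x = File, y = Words)) +
  geom_col() +
  labs(title = "Word Counts by Dataset", x = "Dataset", y = "Words")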

Plans for Prediction Algorithm

The prediction algorithm will use n-gram language models to predict the next word from the words entered so far. The data will be cleaned and tokenized, and n-gram frequencies will be counted to identify common word sequences.
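
To make the plan concrete, here is a minimal sketch of a bigram (2-gram) predictor in base R. It is an illustration under stated assumptions, not the final algorithm: the function names `tokenize`, `build_bigrams`, and `predict_next`, the cleaning rule (lowercase, keep only letters and apostrophes), and the toy corpus are all hypothetical, and a real model would likely use longer n-grams with some form of backoff or smoothing.

```r
# Lowercase the text, replace everything except letters, apostrophes and
# spaces with a space, then split into words.
tokenize <- function(text) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  words[nzchar(words)]
}

# Count bigrams: one row per (previous word, next word) pair.
# Assumes the corpus contains at least one line with two or more words.
build_bigrams <- function(lines) {
  tokens <- lapply(lines, tokenize)
  pairs <- data.frame(
    prev = unlist(lapply(tokens, head, -1)),  # every word except the last
    nxt  = unlist(lapply(tokens, tail, -1)),  # every word except the first
    n    = 1L,
    stringsAsFactors = FALSE
  )
  counts <- aggregate(n ~ prev + nxt, data = pairs, FUN = sum)
  # Most frequent continuation first within each previous word.
  counts[order(counts$prev, -counts$n, counts$nxt), ]
}

# Return up to k of the most frequent continuations of the last word typed.
predict_next <- function(bigrams, text, k = 3) {
  w <- tokenize(text)
  if (length(w) == 0) return(character(0))
  hits <- bigrams[bigrams$prev == w[length(w)], ]
  head(as.character(hits$nxt), k)
}

toy_corpus <- c("I love data science", "I love text mining", "I like data")
bigrams <- build_bigrams(toy_corpus)
print(predict_next(bigrams, "I", k = 2))
# "love" "like"
```

In the toy corpus "i love" occurs twice and "i like" once, so "love" is ranked first. A word that never appears as a bigram prefix returns an empty result, which is the case a backoff strategy would need to handle.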

Plans for Shiny Application

The Shiny application will provide an interface where users can enter text and receive predicted next-word suggestions generated by the language model.
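
A minimal sketch of such an interface is shown below, assuming the `shiny` package is installed. The function `placeholder_predict` is a hypothetical stand-in for the language model: it ignores the content of the input and returns fixed suggestions so that the example runs on its own. The input and output IDs (`phrase`, `suggestions`) are likewise chosen only for illustration.

```r
library(shiny)

# Placeholder predictor (assumption): stands in for the n-gram model and
# returns fixed suggestions for any non-empty input.
placeholder_predict <- function(text, k = 3) {
  if (!nzchar(trimws(text))) return(character(0))
  head(c("the", "and", "to"), k)
}

# One text box for the user's phrase and one area for the suggestions.
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("suggestions")
)

# Recompute the suggestions whenever the text in the box changes.
server <- function(input, output, session) {
  output$suggestions <- renderText({
    words <- placeholder_predict(input$phrase)
    if (length(words) == 0) {
      "Type a phrase to see suggestions."
    } else {
      paste(words, collapse = ", ")
    }
  })
}

# Build the app object; launch it in an interactive session with runApp(app).
app <- shinyApp(ui, server)
```

Replacing `placeholder_predict` with a function backed by the trained n-gram tables would connect the interface to the model without changing the UI code.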

Conclusion

The exploratory analysis demonstrates the scale and diversity of the SwiftKey dataset. These findings will guide the development of a predictive text model and interactive Shiny application.

SwiftKey Exploratory Data Analysis