Exploratory Analysis of SwiftKey Data

Executive Summary

This report summarizes exploratory analyses of the SwiftKey English text datasets: blogs, news, and Twitter. We focus on line counts, word counts, and line lengths, highlighting key features of the data. These insights will guide the development of a text prediction algorithm and Shiny app.

Basic Summaries

Number of lines per dataset
Dataset	Lines
Blogs	1000
News	1000
Twitter	1000

Maximum characters per line in each dataset
Dataset	MaxChars
Blogs	1912
News	982
Twitter	140
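
The two tables above could be produced along the following lines. This is a minimal Python sketch, not the code used for this report: the function names, the file path argument, and the 1000-line sample size are assumptions (the identical counts of 1000 suggest a fixed-size sample, but the report does not say so). It also computes mean words per line, which the summary mentions but the tables do not show.

```python
from itertools import islice
from statistics import mean


def read_sample(path, n=1000):
    """Read at most the first n lines of a UTF-8 text file.

    Both `path` and the default n=1000 are assumptions for illustration.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(islice(handle, n))


def summarize_lines(lines):
    """Return line count, longest line length, and mean words per line."""
    # Strip the trailing newline so it is not counted as a character.
    stripped = [line.rstrip("\n") for line in lines]
    return {
        "lines": len(stripped),
        "max_chars": max((len(s) for s in stripped), default=0),
        "mean_words": mean(len(s.split()) for s in stripped) if stripped else 0.0,
    }
```

Calling `summarize_lines(read_sample(path))` once per dataset would give one row for each table. Words are split on whitespace only, which is a simplification.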

Histograms
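
As a rough illustration of how line-length histograms for the three datasets could be binned, the sketch below counts lines per character-length bin and renders the result as text bars. The function names and the bin width of 50 characters are assumptions chosen for illustration, not settings taken from this report.

```python
from collections import Counter


def length_histogram(lines, bin_width=50):
    """Count lines per character-length bin; keys are each bin's lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter(
        (len(line.rstrip("\n")) // bin_width) * bin_width for line in lines
    )
    # Sort by lower edge so the bins read from short to long lines.
    return dict(sorted(counts.items()))


def render_histogram(hist, bin_width=50, bar_char="#"):
    """Render the bins as text bars, one row per non-empty bin."""
    rows = []
    for lower, count in hist.items():
        upper = lower + bin_width - 1
        rows.append(f"{lower:>5}-{upper:<5} {bar_char * count}")
    return "\n".join(rows)
```

With a bin width of 50, every Twitter line would fall into the first three bins, since no tweet in the data exceeds 140 characters.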

Key Observations

Blogs tend to have longer lines and more words per line than Twitter.
Twitter lines are short, never exceeding 140 characters, which reflects the tweet length limit.
News lines are medium-length and relatively uniform.

Plans for Prediction Algorithm and Shiny App

Goal: Predict the next word a user is likely to type based on previous context.
Approach: Use n-gram models and frequency tables derived from these datasets.
Shiny App: Provide a simple interface for typing text and showing top predicted words.
Future Steps: Explore more lines, filter profanity, and optimize for performance on large datasets.
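
The plan above can be sketched in a few lines of Python. This is one possible shape for the n-gram approach, not a finished design: the function names, the tokenizer, the placeholder blocklist, the choice to drop whole lines containing blocked words, and the simple back-off from longer to shorter contexts (with no smoothing or discounting) are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; a real profanity list would be loaded from a file.
BLOCKLIST = {"darn"}


def tokenize(line):
    """Lowercase a line and split it into tokens of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def build_ngram_table(lines, n=2, blocklist=BLOCKLIST):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        if any(tok in blocklist for tok in tokens):
            continue  # assumption: drop the whole line, not just the word
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(tables, text, k=3):
    """Return up to k candidate next words.

    `tables` maps n to the table built for that n. The longest context
    that has been seen wins; otherwise fall back to a shorter one.
    """
    tokens = tokenize(text)
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # not enough typed words for this order
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []
```

A usage example: build `tables = {n: build_ngram_table(lines, n) for n in (1, 2, 3)}` once, then call `predict_next(tables, typed_text)` each time the input changes. The Shiny app would need an equivalent lookup on its server side. Keeping the tables as plain frequency counts makes them easy to prune (for example, dropping n-grams seen only once) when optimizing for large datasets.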