Milestone Report: Exploratory Analysis of SwiftKey Text Data

Executive Summary

This report presents my exploratory analysis of the SwiftKey text data (blogs, news, and Twitter) for the Coursera Data Science Capstone.
The ultimate goal is to build a predictive text model, like the one in the SwiftKey keyboard, that suggests the next word as you type, and to deliver it as a simple Shiny app.

So far I have loaded the data, computed basic summaries, created plots, and identified key features.

1. Basic Summary of the Three Files

## [1] "./Data/final/en_US/en_US.blogs.txt"  
## [2] "./Data/final/en_US/en_US.news.txt"   
## [3] "./Data/final/en_US/en_US.twitter.txt"

## Files loaded successfully!

## Blogs lines:  899288

## News lines:  1010242

## Twitter lines:  2360148

Summary Statistics of the Three Data Files
Source	File_Size_MB	Number_of_Lines	Longest_Line_Chars
Blogs	NA	899288	40833
News	NA	1010242	11384
Twitter	NA	2360148	140

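Main observations:
- Blogs is the largest file (~200 MB) and contains the longest line (40,833 characters).
- Twitter has the most lines (about 2.36 million), none longer than 140 characters.
- News falls in between, with about 1.01 million lines and a longest line of 11,384 characters.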
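The File_Size_MB column in the table above is NA, so the size figures are not backed by the output shown. As a rough illustration of how all three columns could be computed in one pass, here is a minimal Python sketch. It is not the code used for this report; the function name `summarize_file` and the choice to ignore undecodable bytes are assumptions made for the example.

```python
import os


def summarize_file(path):
    """Return file size in MB, number of lines, and longest line length in characters."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    n_lines = 0
    longest = 0
    # Assumption: the files are UTF-8; undecodable bytes are skipped, not fixed.
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            n_lines += 1
            # Exclude the trailing newline from the character count.
            longest = max(longest, len(line.rstrip("\r\n")))
    return {"size_mb": size_mb, "lines": n_lines, "longest_line_chars": longest}
```

Calling `summarize_file("./Data/final/en_US/en_US.blogs.txt")` would return one row of the table. Reading line by line keeps memory use low even for the largest file.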

2. Interesting Findings

Love vs Hate in Twitter

## Love appears 77639 times | Hate appears 15561 times | Ratio ≈ 5
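
The reported ratio follows from the two counts: 77639 / 15561 is about 4.99. The output does not say whether the counts are lines or individual word occurrences, so the Python sketch below assumes a line-based, case-insensitive, whole-word count. The helper name `count_lines_with_word` and the sample lines are hypothetical.

```python
import re


def count_lines_with_word(lines, word):
    """Count lines that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for line in lines if pattern.search(line))


# Toy sample, not taken from the Twitter file.
sample = ["I love this", "Love it or hate it", "hateful is not matched", "nothing here"]
love = count_lines_with_word(sample, "love")   # 2 lines
hate = count_lines_with_word(sample, "hate")   # 1 line ("hateful" is not a whole-word match)
sample_ratio = love / hate                     # 2.0

# Ratio from the counts reported above.
reported_ratio = 77639 / 15561                 # about 4.99
```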

Most Frequent Words
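
The plot for this section is not reproduced in this text. As a hedged illustration of how word frequencies could be tabulated, the Python sketch below lowercases each line, splits it into runs of letters and apostrophes, and counts the tokens. The tokenization rule and the name `top_words` are assumptions, not the method used for the report.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent lowercase word tokens with their counts."""
    counts = Counter()
    for line in lines:
        # Assumption: a word is a run of letters or apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)
```

A real analysis would probably also decide whether to keep or drop stop words such as "the" and "and", since they dominate raw counts.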

Distribution of Line Lengths
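
The plot for this section is not reproduced in this text either. One simple way to summarize line lengths is to group them into fixed-width bins, sketched below in Python. The bin width of 50 characters and the name `line_length_histogram` are assumptions chosen for the example.

```python
from collections import Counter


def line_length_histogram(lines, bin_width=50):
    """Map each bin's lower edge to the number of lines whose length falls in that bin."""
    bins = Counter((len(line) // bin_width) * bin_width for line in lines)
    return dict(sorted(bins.items()))
```

With a width of 50, every Twitter line would fall into the bins starting at 0, 50, or 100, while blog lines would spread over a much longer tail.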

3. Plans

Prediction Model:
- Clean the text
- Build n-gram frequency tables
- Use backoff: when a longer n-gram context has not been seen, fall back to a shorter one
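
The steps above can be sketched in Python as follows. This is a minimal illustration of plain backoff (use the longest context that has been seen, with no discounting or smoothing), not the final model. All names (`tokenize`, `build_ngram_tables`, `predict_next`), the tokenization rule, and the maximum order of 3 are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Cleaning step (assumed): lowercase and keep runs of letters or apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_ngram_tables(lines, max_n=3):
    """tables[k] maps a context tuple of k words to a Counter of following words."""
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for line in lines:
        tokens = tokenize(line)
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k >= 0:
                    context = tuple(tokens[i - k:i])  # k preceding words
                    tables[k][context][word] += 1
    return tables


def predict_next(tables, text, top_k=3):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = tokenize(text)
    longest = min(max(tables), len(tokens))
    for k in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        candidates = tables[k].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top_k)]
    return []


# Toy corpus, not taken from the data set.
corpus = ["i love you", "i love cake", "i love you so much", "you are kind"]
tables = build_ngram_tables(corpus)
seen_context = predict_next(tables, "I love", top_k=1)   # ["you"]
backed_off = predict_next(tables, "zebra love")          # ["you", "cake"]
```

In the second call the two-word context ("zebra", "love") was never seen, so the lookup backs off to the one-word context ("love",). With an empty context the function falls through to overall word frequencies.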

Shiny App:
- Accept input text from the user
- Show the top 3 predicted next words

End of report.