Executive Summary

This report presents my exploratory analysis of the SwiftKey text data (blogs, news, and Twitter) for the Coursera Data Science Capstone.
The ultimate goal is to build a predictive text model, like the one in the SwiftKey keyboard, that suggests the next word as you type, and to deliver it as a simple Shiny app.

I have loaded the data, computed basic summaries, created plots, and identified key features.

1. Basic Summary of the Three Files

## [1] "./Data/final/en_US/en_US.blogs.txt"  
## [2] "./Data/final/en_US/en_US.news.txt"   
## [3] "./Data/final/en_US/en_US.twitter.txt"
## Files loaded successfully!
## Blogs lines:  899288
## News lines:  1010242
## Twitter lines:  2360148

Summary Statistics of the Three Data Files

Source   File_Size_MB  Number_of_Lines  Longest_Line_Chars
Blogs    NA            899288           40833
News     NA            1010242          11384
Twitter  NA            2360148          140
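
For reference, a summary of this kind can be computed in base R roughly as follows. This is a minimal sketch, not the code behind the output above: `summarise_file` is a hypothetical helper name, and the demonstration runs on a small temporary file rather than the real corpus. Note that `file.info()` returns `NA` for the size when a path does not resolve, which is one possible (unconfirmed) explanation for an `NA` in the `File_Size_MB` column.

```r
# Summarise one text file: size on disk, line count, longest line.
# `summarise_file` is a hypothetical helper, not the report's own code.
summarise_file <- function(path, label) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(
    Source             = label,
    # file.info()$size is in bytes; it is NA if the path does not exist.
    File_Size_MB       = round(file.info(path)$size / 1024^2, 1),
    Number_of_Lines    = length(lines),
    Longest_Line_Chars = max(nchar(lines, allowNA = TRUE), na.rm = TRUE),
    stringsAsFactors   = FALSE
  )
}

# Demonstration on a small temporary file so the sketch runs anywhere.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("a short line", "a somewhat longer line of text", "hi"), demo_path)
demo_summary <- summarise_file(demo_path, "Demo")
print(demo_summary)
```

For the real data, the same helper would be called once per file with the paths listed above and the results combined with `rbind()`.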

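Main observations:
- Blogs is the largest file (~200 MB) and contains very long lines (up to 40,833 characters).
- Twitter has the most lines (about 2.36 million), all short (at most 140 characters).
- News falls in between, with about 1.01 million lines of up to 11,384 characters.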

2. Interesting Findings

Love vs Hate in Twitter

## Love appears 77639 times | Hate appears 15561 times | Ratio ≈ 5
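
The reported ratio follows from the two counts: 77639 / 15561 ≈ 4.99. A count like this can be sketched in base R as shown below. The source does not say whether the figures are per line or per occurrence, so this sketch assumes per-line, case-insensitive matching; `count_lines_with` and the sample tweets are hypothetical.

```r
# Count lines that contain a given pattern, ignoring case.
# Assumption: counts are per line; the report does not state this.
count_lines_with <- function(lines, word) {
  sum(grepl(word, lines, ignore.case = TRUE))
}

# Tiny made-up sample standing in for the Twitter file.
tweets <- c("I love this", "Love it or hate it", "so much hate",
            "LOVE LOVE", "nothing here", "lovely day")

love <- count_lines_with(tweets, "love")
hate <- count_lines_with(tweets, "hate")
cat("Love:", love, "| Hate:", hate, "| Ratio:", round(love / hate, 2), "\n")
```

Plain substring matching also counts words such as "lovely"; passing `"\\blove\\b"` as the pattern would restrict the match to the whole word.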

Most Frequent Words
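
One simple way to obtain word frequencies in base R is sketched below. It is an illustration on made-up sample lines, not the code that produced this report's plot, and `top_words` is a hypothetical helper. The tokeniser keeps only ASCII letters and apostrophes, which is an assumption that discards accented characters.

```r
# Tokenise text into lowercase words and tabulate the most frequent ones.
# Assumption: only a-z and apostrophes count as word characters.
top_words <- function(lines, n = 10) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]          # drop empty strings from splitting
  head(sort(table(tokens), decreasing = TRUE), n)
}

sample_lines <- c("The cat sat on the mat", "The dog sat", "A cat and a dog")
freq <- top_words(sample_lines, n = 3)
print(freq)
```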

Distribution of Line Lengths
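
Line lengths can be computed with `nchar()`. Because the longest blog line (40,833 characters) is far longer than any tweet, a log scale is one reasonable choice for the histogram; that choice is an assumption of this sketch, which uses made-up sample lines.

```r
# Character count per line; log10 keeps rare, very long lines from
# stretching the axis.
sample_lines <- c("short", "a medium length line", strrep("x", 1000))
line_lengths <- nchar(sample_lines)
print(summary(line_lengths))

# plot = FALSE returns the bin counts without opening a graphics device;
# drop it to draw the histogram.
h <- hist(log10(line_lengths), breaks = 5, plot = FALSE)
print(h$counts)
```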

3. Plans

Prediction Model:
- Clean the text
- Build n-gram frequency tables
- Use backoff: when the longest context has not been seen, fall back to a shorter one
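
The three planned steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final model: it uses plain backoff (try the longest context first, then shorter ones) with raw counts and no smoothing or discounting, and `tokenize`, `build_ngrams`, and `predict_next` are hypothetical names. The row-by-row loop is far too slow for the full corpus and is only meant to show the logic.

```r
# Step 1: clean. Lowercase, keep letters, apostrophes and spaces only.
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]+", " ", tolower(line))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens[nzchar(tokens)]
}

# Step 2: build n-grams of order 1..max_n as (context, word, count) rows.
# Unigrams have the empty string as context.
build_ngrams <- function(lines, max_n = 3) {
  rows <- list()
  for (line in lines) {
    tokens <- tokenize(line)
    for (n in seq_len(max_n)) {
      if (length(tokens) < n) next
      for (i in seq_len(length(tokens) - n + 1)) {
        gram <- tokens[i:(i + n - 1)]
        rows[[length(rows) + 1]] <- c(
          context = paste(gram[-n], collapse = " "),
          word    = gram[n]
        )
      }
    }
  }
  grams <- do.call(rbind, rows)
  # "|" cannot occur in a token, so it is a safe key separator.
  counts <- table(paste(grams[, "context"], grams[, "word"], sep = "|"))
  parts <- strsplit(names(counts), "|", fixed = TRUE)
  data.frame(
    context = sapply(parts, `[`, 1),
    word    = sapply(parts, `[`, 2),
    count   = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Step 3: backoff. Try the longest available context, then shorter ones,
# ending with the most frequent unigrams.
predict_next <- function(model, text, k = 3, max_n = 3) {
  tokens <- tokenize(text)
  for (n in seq(min(length(tokens), max_n - 1), 0)) {
    context <- paste(tail(tokens, n), collapse = " ")
    hits <- model[model$context == context, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count, hits$word), ]
      return(head(hits$word, k))
    }
  }
  character(0)
}

corpus <- c("I love you", "I love coffee", "I love you so much", "you are great")
model <- build_ngrams(corpus)
print(predict_next(model, "I love"))   # trigram context "i love" is found
print(predict_next(model, "we love"))  # backs off to bigram context "love"
print(predict_next(model, "zebra"))    # backs off to unigram counts
```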

Shiny App:
- Input text
- Show top 3 predictions
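
A minimal sketch of such an app is shown below, assuming the `shiny` package is installed. The `predict_next` function here is a hypothetical placeholder that returns fixed words; in the real app it would call the prediction model. The input and output IDs (`text`, `predictions`) and the `build_app` wrapper are likewise illustrative choices.

```r
# Hypothetical stand-in for the real model: returns up to k fixed words.
predict_next <- function(text, k = 3) {
  if (!nzchar(trimws(text))) return(character(0))
  head(c("the", "to", "and"), k)
}

# Build the app inside a function so that shiny is only needed when the
# app is actually constructed.
build_app <- function() {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next-word prediction"),
    shiny::textInput("text", "Type a phrase:"),
    shiny::verbatimTextOutput("predictions")
  )
  server <- function(input, output, session) {
    # Re-runs whenever the text box changes; shows the top 3 candidates.
    output$predictions <- shiny::renderText({
      paste(predict_next(input$text, k = 3), collapse = "  |  ")
    })
  }
  shiny::shinyApp(ui = ui, server = server)
}

# In an interactive session with shiny installed:
# shiny::runApp(build_app())
```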

End of report.