This report shows my exploratory analysis of the SwiftKey text data
(blogs, news, and Twitter) for the Coursera Data Science Capstone.
The ultimate goal is to build a predictive text model (like the SwiftKey
keyboard) that suggests the next word as you type, and to deliver it as a
simple Shiny app.
So far I have loaded the data, computed basic summaries, created plots, and identified key features.
## [1] "./Data/final/en_US/en_US.blogs.txt"
## [2] "./Data/final/en_US/en_US.news.txt"
## [3] "./Data/final/en_US/en_US.twitter.txt"
## Files loaded successfully!
## Blogs lines: 899288
## News lines: 1010242
## Twitter lines: 2360148
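The line counts above could be produced along the following lines. This is a minimal sketch, not the code used for this report: `read_corpus` is a hypothetical helper name, and the demo reads a small temporary file instead of the real `en_US` files so that it runs anywhere.

```r
# Hypothetical helper (not from the original report): read a text file as lines.
read_corpus <- function(path) {
  con <- file(path, open = "rb")  # binary mode: bytes are read as-is on every platform
  on.exit(close(con))             # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small demo on a temporary file so the sketch runs without the real data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)

demo_lines <- read_corpus(demo_path)
cat("Demo lines:", length(demo_lines), "\n")
```

For the real data, the same helper would be called once per path listed above, for example `read_corpus("./Data/final/en_US/en_US.blogs.txt")`.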
| Source | File_Size_MB | Number_of_Lines | Longest_Line_Chars |
|---|---|---|---|
| Blogs | NA | 899288 | 40833 |
| News | NA | 1010242 | 11384 |
| Twitter | NA | 2360148 | 140 |
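A table with these columns could be computed roughly as follows. This is a sketch under assumptions: `summarise_file` is a hypothetical helper, file size is taken from `file.size()` and converted to MB, and the demo uses two tiny temporary files rather than the real corpus.

```r
# Hypothetical helper (not from the original report): one summary row per file.
summarise_file <- function(source, path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(
    Source = source,
    File_Size_MB = round(file.size(path) / 1024^2, 2),
    Number_of_Lines = length(lines),
    # allowNA = TRUE returns NA instead of failing on invalid multibyte text
    Longest_Line_Chars = max(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# Demo files standing in for the real corpus.
demo_dir <- tempfile("corpus_")
dir.create(demo_dir)
demo_files <- c(Blogs = "blogs.txt", Twitter = "twitter.txt")
writeLines(c("a short post", "a much longer blog post here"),
           file.path(demo_dir, "blogs.txt"))
writeLines(c("hi", "hello"), file.path(demo_dir, "twitter.txt"))

# Build one row per source and stack the rows into a single data frame.
summary_table <- do.call(
  rbind,
  Map(summarise_file, names(demo_files), file.path(demo_dir, demo_files))
)
print(summary_table)
```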
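Main observations:

- Blogs is the largest file (~200 MB) and contains very long lines (the longest has 40,833 characters).
- Twitter has the most lines (about 2.36 million), all of them short (at most 140 characters).
- News sits in between, with about 1.01 million lines and a longest line of 11,384 characters.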
## Love appears 77639 times | Hate appears 15561 times | Ratio ≈ 5
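A count like the one above could be obtained with a simple pattern match. The report does not say which source was searched or whether the numbers are matching lines or total occurrences, so this sketch makes an assumption: it counts lines that contain the whole word, ignoring case. `count_lines_with` is a hypothetical helper and the input is toy data.

```r
# Hypothetical helper: number of lines containing `word` as a whole word.
count_lines_with <- function(lines, word) {
  pattern <- paste0("\\b", word, "\\b")  # word boundaries exclude e.g. "gloves"
  sum(grepl(pattern, lines, ignore.case = TRUE, perl = TRUE))
}

# Toy input standing in for the real text lines.
demo_tweets <- c("I love R", "love love love", "I hate bugs", "gloves are warm")

love_n <- count_lines_with(demo_tweets, "love")
hate_n <- count_lines_with(demo_tweets, "hate")
cat("Love:", love_n, "| Hate:", hate_n, "| Ratio:", round(love_n / hate_n, 1), "\n")
```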
Prediction Model:

- Clean text
- Build n-grams
- Use backoff
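The three steps of the prediction model can be sketched in base R as below. This is an illustration of the idea, not the planned implementation: all function names are hypothetical, the cleaning rules are an assumption, and the backoff is the plainest possible variant (try the longest n-gram first, fall back to shorter ones, with no discounting or smoothing).

```r
# Step 1: clean text (assumed rules: lowercase, keep letters and apostrophes).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

# Step 2: build n-gram counts, most frequent first.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    idx <- seq_len(length(tok) - n + 1)
    vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}
build_model <- function(texts, max_n = 3) {
  toks <- tokenize(texts)
  lapply(seq_len(max_n), function(n) count_ngrams(toks, n))
}

# Step 3: plain backoff from the highest order down to unigrams.
predict_next <- function(model, phrase, k = 3) {
  words <- tokenize(phrase)[[1]]
  for (n in rev(seq_along(model))) {
    if (n == 1) return(head(names(model[[1]]), k))  # unigram fallback
    if (length(words) < n - 1) next                 # not enough context for this order
    prefix <- paste(tail(words, n - 1), collapse = " ")
    grams <- model[[n]]
    hits <- grams[startsWith(names(grams), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(head(sub(".* ", "", names(hits)), k))  # keep only the final word
    }
  }
}

# Toy corpus standing in for the sampled blogs, news, and Twitter text.
texts <- c("I love data science", "I love data analysis",
           "I love data science so much", "we love R")
model <- build_model(texts)

preds_trigram <- predict_next(model, "They love data")  # matched at the trigram level
preds_bigram  <- predict_next(model, "love")            # backs off to bigrams
preds_unigram <- predict_next(model, "zzz qqq")         # backs off to unigrams
print(preds_trigram)
```

On the full corpus, counting n-grams this way would be slow and memory-hungry; sampling the data or switching to a dedicated text-mining package would be a natural next step.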
Shiny App:

- Input text
- Show top 3 predictions
End of report.