This milestone report summarizes the initial phases of developing a predictive text application as part of Coursera’s Data Science Capstone. It covers data acquisition, preprocessing, and exploratory analysis, and outlines the next steps in building the predictive model and application.
We used three text datasets provided by SwiftKey:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

These datasets were downloaded and stored locally for analysis.
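For reproducibility, a download-and-extract sketch is shown below. The URL is the link commonly distributed with the capstone course (verify it against the course page), and the local file names are assumptions.

``` r
# Download and extract the SwiftKey corpus (URL and paths are assumptions;
# adjust to match the course materials)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, zip_file, mode = "wb")
  unzip(zip_file)  # extracts final/en_US/en_US.{blogs,news,twitter}.txt among others
}
```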
Initial file characteristics:
``` r
# Packages used throughout the analysis
library(readr)               # fast text file reading
library(stringi)             # string and word-count utilities
library(quanteda)            # corpus handling and tokenization
library(quanteda.textstats)  # term-frequency statistics
library(ggplot2)             # plotting
library(knitr)               # table rendering
```
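The figures in the table below can be reproduced along these lines; the `final/en_US/` paths are an assumption about the local directory layout.

``` r
# Compute size, line count, and word count per source file
# (paths are assumptions; adjust to your layout)
files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt",
           Twitter = "final/en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- read_lines(f)  # readr copes well with embedded special characters
  data.frame(Size_MB     = round(file.size(f) / 1024^2, 2),
             Total_Lines = length(lines),
             Total_Words = sum(stri_count_words(lines)))
})
kable(do.call(rbind, stats))
```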
| Source  | Size (MB) | Lines     | Words      |
|---|---|---|---|
| Blogs   | 200.42    | 899,288   | 37,546,250 |
| News    | 196.28    | 1,010,242 | 34,762,395 |
| Twitter | 159.36    | 2,360,148 | 30,093,372 |
Because of the large file sizes, 10,000 random lines were sampled from each dataset for efficient analysis (a sampling sketch follows the table):
| Source           | Size (MB) | Lines  | Words   |
|---|---|---|---|
| Combined Samples | 6.8       | 30,000 | 889,456 |
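A sampling sketch, reusing the `files` vector from the earlier snippet; the seed value and output file name are arbitrary choices for illustration.

``` r
# Draw 10,000 random lines from each source and combine them
set.seed(1234)  # arbitrary seed, for reproducibility
sampled <- unlist(lapply(files, function(f) sample(read_lines(f), 10000)))
write_lines(sampled, "sample_combined.txt")  # illustrative file name
```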
The sampled data were cleaned and tokenized before analysis; the main steps are sketched below.
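The report does not enumerate the cleaning steps, so the options below are assumptions showing one typical quanteda pipeline (lowercasing; removing punctuation, numbers, symbols, and URLs).

``` r
# Build a corpus from the combined sample and tokenize with common
# cleaning options (these specific options are assumptions)
corp <- corpus(sampled)
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)
```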
From the cleaned sample, the 15 most frequent words (excluding common stop words) and the 15 most frequent two-word phrases were tabulated; one way to compute these counts is sketched below.
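A possible implementation with quanteda.textstats, assuming the `toks` object from the previous step; the plot for the single-word counts is illustrative.

``` r
# Top 15 single words, with English stop words removed
uni <- toks |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  textstat_frequency(n = 15)

# Top 15 two-word phrases (the report excludes stop words only for
# single words, so they are kept here)
big <- toks |>
  tokens_ngrams(n = 2, concatenator = " ") |>
  dfm() |>
  textstat_frequency(n = 15)

ggplot(uni, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count", title = "Top 15 words (stop words removed)")
```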
Our predictive model will build on the n-gram frequencies explored above to suggest the most likely next word for a user's input.

The resulting predictive model will be integrated into a Shiny application designed to take typed text and display next-word suggestions interactively. A minimal sketch of one possible prediction approach appears below.
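As an illustration only, here is a plain bigram lookup built from the sampled counts; the final model may instead use higher-order n-grams with a backoff strategy, and the `predict_next()` helper is hypothetical.

``` r
# Full bigram frequency table, sorted by decreasing frequency
bigram_freq <- toks |>
  tokens_ngrams(n = 2, concatenator = " ") |>
  dfm() |>
  textstat_frequency()

# Pre-split each stored bigram into its first and second word
parts  <- strsplit(bigram_freq$feature, " ", fixed = TRUE)
first  <- vapply(parts, function(p) p[1], character(1))
second <- vapply(parts, function(p) p[2], character(1))

# Hypothetical helper: the n most frequent words seen after `word`
predict_next <- function(word, n = 3) {
  head(second[first == tolower(word)], n)
}

predict_next("thank")  # output depends on the sample drawn
```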
Feedback and suggestions are welcome to guide future improvements.