---
title: "Data Science Capstone Milestone Report"
author: "Swagath A S"
output: html_document
---
This report summarizes the exploratory analysis of the SwiftKey dataset. The dataset contains text from blogs, news articles, and Twitter posts. The objective is to understand the data before building a predictive text model.
| File | Lines | Words | Characters |
|---|---|---|---|
| Blogs | 175838 | 7266520 | 40418992 |
| News | 205345 | 7002472 | 41620762 |
| Twitter | 607242 | 7817233 | 42309478 |
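The Twitter file has the most lines, roughly three times as many as either Blogs or News, yet its word count is similar, so individual tweets are much shorter than blog or news entries. Each file contains 7 to 8 million words and 40 to 42 million characters, which is enough text to build a language prediction model.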
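Counts like those in the table could be produced along the following lines. This is a minimal sketch rather than the code used for this report: `summarize_text` is a hypothetical helper, and it assumes that a word is any whitespace-separated token, which the report does not state.

```r
# Summarize a character vector of text lines: line, word, and character counts.
# Assumption: words are approximated as whitespace-separated tokens.
summarize_text <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")
  data.frame(
    Lines = length(text),
    Words = sum(lengths(tokens)),
    Characters = sum(nchar(text))
  )
}

# Toy input standing in for one of the corpus files
toy <- c("the quick brown fox", "jumps over", "the lazy dog")
toy_summary <- summarize_text(toy)
print(toy_summary)
```

For a real file, the input would come from something like `readLines(path, encoding = "UTF-8", skipNul = TRUE)`, where `path` is a placeholder for the location of the blogs, news, or Twitter file.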
The most common words identified in the sample were:
the, to, and, a, of, in, i, that, for, is
These are function words, such as articles, prepositions, and pronouns, which occur frequently in any natural English text.
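A frequency ranking of this kind can be sketched in base R as follows. The helper name `top_words` and the tokenization rule (lowercase, then split on anything that is not an ASCII letter or apostrophe) are assumptions for illustration and may differ from what was used to produce the list above.

```r
# Count word frequencies in a character vector and return the n most common.
# Assumption: tokens are runs of lowercase ASCII letters and apostrophes.
top_words <- function(text, n = 10) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[tokens != ""]          # drop empty strings left by the split
  freq <- sort(table(tokens), decreasing = TRUE)
  head(freq, n)
}

# Toy input standing in for a sample of the corpus
toy <- c("The cat sat on the mat", "the dog sat", "A cat and a dog")
top <- top_words(toy, n = 3)
print(top)
```

On the toy input, "the" ranks first with three occurrences; the order of the tied words after it is not meaningful.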
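```r
# Line counts per source file, taken from the summary table
files <- c("Blogs", "News", "Twitter")
lines <- c(175838, 205345, 607242)

# Bar chart comparing the number of lines in each file
barplot(lines,
        names.arg = files,
        main = "Line Counts by File",
        ylab = "Number of Lines")
```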
Twitter contains the largest number of text entries. Blogs and News contain fewer lines but longer text, averaging about 41 and 34 words per line against about 13 for Twitter. Common English function words dominate the corpus. The dataset is large enough to support n-gram analysis and predictive text modeling.
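The observation that Blogs and News have fewer but longer entries can be checked directly from the summary table by dividing words by lines. The data frame below only restates the table's figures; the column name `WordsPerLine` is introduced here for illustration.

```r
# Counts copied from the summary table in this report
stats <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(175838, 205345, 607242),
  Words = c(7266520, 7002472, 7817233)
)

# Average number of words per line, rounded to one decimal place
stats$WordsPerLine <- round(stats$Words / stats$Lines, 1)
print(stats)
```

This gives roughly 41.3 words per line for Blogs, 34.1 for News, and 12.9 for Twitter.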
The next step is to clean and tokenize the text data. After preprocessing, unigram, bigram, and trigram models will be created. These models will be used to predict the next word from the words a user has already typed. The final deliverable will be a Shiny application capable of next-word prediction.
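To make the planned approach concrete, here is a minimal bigram predictor in base R. It is a sketch under simplifying assumptions, not the planned final model: tokenization is lowercase whitespace splitting, there is no smoothing or backoff to unigrams, and the function names `build_bigrams` and `predict_next` are hypothetical.

```r
# Build a table of bigram counts from a character vector of text lines.
# Assumption: lowercase, whitespace-separated tokens; no cleaning or smoothing.
build_bigrams <- function(text) {
  pairs <- lapply(strsplit(tolower(trimws(text)), "\\s+"), function(tok) {
    if (length(tok) < 2) return(NULL)     # a single-word line has no bigrams
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  bigrams <- do.call(rbind, pairs)
  # Count how often each (first, second) pair occurs
  aggregate(count ~ first + second, data = cbind(bigrams, count = 1), FUN = sum)
}

# Return the word most often seen after `word`, or NA if `word` was never seen.
predict_next <- function(model, word) {
  candidates <- model[model$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$second[which.max(candidates$count)])
}

# Toy corpus standing in for the cleaned text
toy <- c("i love data science", "i love r", "i like data")
model <- build_bigrams(toy)
print(predict_next(model, "i"))      # "love" follows "i" twice, "like" once
print(predict_next(model, "data"))   # only "science" follows "data"
```

A trigram model would follow the same pattern with two preceding words as the key, falling back to the bigram table when the pair has not been seen.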