---
title: "Data Science Capstone Milestone Report"
author: "Swagath A S"
output: html_document
---

Overview

This report summarizes the exploratory analysis of the SwiftKey dataset. The dataset contains text from blogs, news articles, and Twitter posts. The objective is to understand the data before building a predictive text model.

Basic Summary of the Data

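| File    |   Lines |     Words | Characters |
|:--------|--------:|----------:|-----------:|
| Blogs   | 175,838 | 7,266,520 | 40,418,992 |
| News    | 205,345 | 7,002,472 | 41,620,762 |
| Twitter | 607,242 | 7,817,233 | 42,309,478 |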

The Twitter file has by far the most lines (607,242), roughly three times as many as Blogs or News. Each file contains between 7.0 and 7.8 million words and more than 40 million characters, which is ample material for building a language prediction model.
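The counts in the summary table could be produced along the following lines. This is a minimal sketch in base R, not necessarily the method used for this report: the helper name `summarize_text` is hypothetical, and words are assumed to be whitespace-separated tokens, so other tokenization rules would give slightly different word counts.

```r
# Hypothetical helper: line, word, and character counts for a character
# vector in which each element is one line of text.
summarize_text <- function(text) {
  # Split each line on runs of whitespace; trimws() avoids an empty
  # leading token when a line starts with spaces.
  tokens <- strsplit(trimws(text), "\\s+")
  data.frame(
    lines = length(text),
    words = sum(lengths(tokens)),
    characters = sum(nchar(text))
  )
}

# Small in-memory example
summarize_text(c("hello world", "one two three"))
```

For a real file, the input would typically come from `readLines()` on the file path, with `skipNul = TRUE` if the file contains embedded NUL bytes.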

Most Frequent Words

The most common words identified in the sample were:

the, to, and, a, of, in, i, that, for, is

All ten are high-frequency English function words (articles, prepositions, conjunctions, and pronouns), which dominate almost any natural-language text regardless of topic.
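A frequency ranking like the one above can be sketched in base R as follows. The function name `top_words` and the cleaning rule (lowercase, keep only letters and apostrophes) are assumptions made for illustration; the report does not state how the sample was tokenized.

```r
# Hypothetical helper: return the n most frequent words in a character vector.
top_words <- function(text, n = 10) {
  # Lowercase, then replace anything that is not a letter, apostrophe,
  # or space with a space.
  cleaned <- gsub("[^a-z' ]+", " ", tolower(text))
  tokens <- unlist(strsplit(trimws(cleaned), "\\s+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty tokens from blank lines
  freq <- sort(table(tokens), decreasing = TRUE)
  head(freq, n)
}

# Small in-memory example
top_words(c("The cat and the dog", "The end."), n = 3)
```

On the full corpus, a dedicated text-mining package would likely be faster, but the logic is the same: normalize, tokenize, count, and sort.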

Plot of File Line Counts

```{r}
files <- c("Blogs", "News", "Twitter")
lines <- c(175838, 205345, 607242)

barplot(lines,
        names.arg = files,
        main = "Line Counts by File",
        ylab = "Number of Lines")
```

Findings

Twitter contains the largest number of text entries. Blogs and News contain fewer lines but much longer ones: about 41 and 34 words per line on average, compared with about 13 for Twitter. Common English function words dominate the corpus. The dataset is large enough to support n-gram analysis and predictive text modeling.
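The difference in line length can be checked directly from the counts in the summary table. The snippet below only divides the reported word counts by the reported line counts; the variable names are chosen for illustration.

```r
# Counts taken from the summary table in this report
files <- c("Blogs", "News", "Twitter")
lines <- c(175838, 205345, 607242)
words <- c(7266520, 7002472, 7817233)

# Average number of words per line for each file
words_per_line <- round(words / lines, 1)
names(words_per_line) <- files
words_per_line
#   Blogs    News Twitter
#    41.3    34.1    12.9
```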

Future Plans

The next step is to clean and tokenize the text data. After preprocessing, unigram, bigram, and trigram models will be built from the token counts. These models will be used to predict the word a user is most likely to type next. The final deliverable will be a Shiny application capable of next-word prediction.
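To make the planned approach concrete, here is a minimal bigram sketch in base R. It is an illustration under simplifying assumptions, not the final model: `build_bigrams` and `predict_next` are hypothetical names, tokenization is plain whitespace splitting, and there is no smoothing or backoff to unigrams or trigrams.

```r
# Hypothetical helper: count adjacent word pairs within each line.
build_bigrams <- function(text) {
  tokens_by_line <- strsplit(trimws(tolower(text)), "\\s+")
  # For each line, pair every token with the token that follows it.
  first <- unlist(lapply(tokens_by_line, function(tok) head(tok, -1)))
  second <- unlist(lapply(tokens_by_line, function(tok) tail(tok, -1)))
  counts <- table(paste(first, second))
  data.frame(
    first = sub(" .*$", "", names(counts)),
    second = sub("^.* ", "", names(counts)),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

# Small in-memory example
bigrams <- build_bigrams(c("i love you", "i love cake", "i like you"))
predict_next(bigrams, "i")
```

A trigram model follows the same pattern with a two-word prefix as the lookup key, falling back to the bigram table when the prefix has not been seen.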