---
title: "Data Science Capstone Milestone Report"
author: "Swagath A S"
output: html_document
---

Overview

This report summarizes the exploratory analysis of the SwiftKey dataset. The dataset contains text from blogs, news articles, and Twitter posts. The objective is to understand the data before building a predictive text model.

Basic Summary of the Data

File	Lines	Words	Characters
Blogs	175838	7266520	40418992
News	205345	7002472	41620762
Twitter	607242	7817233	42309478

The Twitter file has by far the most lines (607,242), roughly three times as many as Blogs or News. Each file contains about 7 to 8 million words and 40 to 42 million characters, which is ample material for building a language prediction model.
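
Counts like those in the table can be obtained with base R. The following is a minimal sketch under stated assumptions, not the code used for this report: the helper name `summarize_text` is hypothetical, words are approximated as whitespace-separated tokens, and a small in-memory vector stands in for a file that would normally be read with `readLines()`.

```r
# Hypothetical helper: line, word, and character counts for a character vector.
# Words are approximated as runs of non-whitespace characters.
summarize_text <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  # Count only non-empty tokens, so blank lines contribute zero words
  n_words <- sum(vapply(tokens, function(t) sum(nzchar(t)), integer(1)))
  data.frame(
    Lines      = length(lines),
    Words      = n_words,
    Characters = sum(nchar(lines, type = "chars"))
  )
}

# Toy input standing in for the lines of one of the three files
toy <- c("I love data science", "Text prediction is fun", "")
toy_summary <- summarize_text(toy)
print(toy_summary)
```

A different definition of a word (for example, splitting on punctuation as well) would give somewhat different word counts, so figures produced this way are only comparable when the same rule is applied to every file.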

Most Frequent Words

The most common words identified in the sample were:

the, to, and, a, of, in, i, that, for, is

These are all high-frequency function words (stop words), which dominate almost any English text and say little about a document's topic.
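
A ranking of this kind can be produced with a simple frequency table. The sketch below is illustrative only and assumes a simplified cleaning rule (lowercase, keep ASCII letters and apostrophes); the helper name `top_words` and the toy sentences are hypothetical and are not taken from the report's sample.

```r
# Hypothetical helper: the n most frequent lowercase word tokens.
top_words <- function(lines, n = 10) {
  text <- tolower(lines)
  # Replace anything that is not a lowercase letter, apostrophe, or space
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- unlist(strsplit(trimws(text), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  # Tabulate tokens and sort from most to least frequent
  counts <- sort(table(tokens), decreasing = TRUE)
  head(counts, n)
}

# Toy input; a real run would use a sample of the corpus
toy <- c("The cat sat on the mat.", "The dog and the cat.")
toy_top <- top_words(toy, n = 3)
print(toy_top)
```

Even in this tiny example the article "the" comes out on top, which mirrors the pattern seen in the full sample.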

Plot of File Line Counts

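# File names and their line counts, taken from the summary table
files <- c("Blogs", "News", "Twitter")
lines <- c(175838, 205345, 607242)

# Bar chart comparing the number of lines in each file
barplot(lines,
        names.arg = files,
        main = "Line Counts by File",
        ylab = "Number of Lines")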

Findings

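Twitter contains the largest number of text entries. Blogs and News contain fewer lines but longer ones: dividing words by lines in the summary table gives roughly 41 and 34 words per line, against about 13 for Twitter. Common English function words dominate the corpus. The dataset is large enough to support n-gram analysis and predictive text modeling.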
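
The observation that Blogs and News lines are longer than Twitter lines can be checked directly from the summary table by dividing words by lines. The snippet below is a small worked calculation; the column name `WordsPerLine` is introduced here for illustration.

```r
# Line and word counts copied from the summary table in this report
counts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(175838, 205345, 607242),
  Words = c(7266520, 7002472, 7817233)
)

# Average number of words per line, rounded to one decimal place
counts$WordsPerLine <- round(counts$Words / counts$Lines, 1)
print(counts)
```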

Future Plans

The next step is to clean and tokenize the text data. After preprocessing, unigram, bigram, and trigram models will be created. These models will be used to predict the next word a user is about to type. The final deliverable will be a Shiny application capable of next-word prediction.
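
To make the planned approach concrete, here is a minimal sketch of the bigram case in base R. It is an assumption about how such a model might look, not the final design: the function names `tokenize`, `build_bigrams`, and `predict_next` are hypothetical, the cleaning rule is deliberately simple, and there is no smoothing or backoff to unigrams or trigrams.

```r
# Hypothetical tokenizer: lowercase, keep ASCII letters and apostrophes
tokenize <- function(line) {
  clean <- gsub("[^a-z' ]+", " ", tolower(line))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  words[nzchar(words)]
}

# Count how often each (first, second) word pair occurs within a line
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
}

# Predict the most frequent follower of `word`, or NA if it was never seen
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$second[which.max(candidates$n)])
}

# Toy corpus standing in for the cleaned text
toy <- c("I love data science", "I love text mining", "I like data")
bigrams <- build_bigrams(toy)
next_word <- predict_next(bigrams, "i")
print(next_word)
```

A data frame built with `aggregate()` is easy to read but would likely be too slow for the full corpus; a production version would probably need a more compact lookup structure and a fallback for unseen words.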