Introduction

The objective of this project is to explore the SwiftKey dataset and understand the structure of natural language data. The final goal of the capstone project is to build a predictive text model that can suggest the next word based on previously entered words.

The dataset consists of three English-language text files:

Blogs
News
Twitter

These files were provided as part of the SwiftKey Capstone Project.

Data Summary

# Line counts of the three source files
data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148)
)

##      File   Lines
## 1   Blogs  899288
## 2    News 1010242
## 3 Twitter 2360148
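The line counts above could be obtained along the following lines. This is a minimal sketch, not the report's actual code: `count_lines` is a hypothetical helper, and the temporary file merely stands in for one of the corpus files, whose paths are not given in this report.

```r
# Hypothetical helper: count the lines in a text file.
# The connection is opened in binary mode so that stray control
# characters in the raw text do not cut the read short.
count_lines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  length(readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE))
}

# Demo on a small temporary file standing in for a corpus file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
n_lines <- count_lines(demo_path)
print(n_lines)
```

Calling the helper once per file and collecting the results in a data frame would reproduce a table of the same shape as the summary.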

Data Sampling

A random sample of lines was drawn from each file to reduce memory and runtime requirements while keeping the sample representative of the full dataset.
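One way to draw such a sample is sketched below. The report does not state the sampling rate or the seed, so the 5% rate, the seed value, and the helper name `sample_lines` are all assumptions made for illustration.

```r
set.seed(1234)  # assumed seed, used only to make the sketch reproducible

# Hypothetical helper: keep each line independently with probability `rate`.
sample_lines <- function(lines, rate) {
  keep <- rbinom(length(lines), size = 1, prob = rate) == 1
  lines[keep]
}

# Stand-in data; in practice `lines` would come from readLines() on a corpus file.
demo_lines  <- paste("line", 1:1000)
demo_sample <- sample_lines(demo_lines, rate = 0.05)  # 5% is an assumed rate
length(demo_sample)
```

Sampling line by line keeps the original order of the retained lines and applies the same rate to every file, so the relative sizes of the three sources are roughly preserved.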

Data Cleaning

The following preprocessing steps were performed:

Converted text to lowercase
Removed punctuation
Removed numbers
Removed extra white spaces
Removed unwanted symbols
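The steps listed above can be sketched in base R as follows. This is an illustrative version under stated assumptions, not necessarily the code used for the report: `clean_text` is a hypothetical helper, and "unwanted symbols" is interpreted here as any character other than the letters a to z and the space.

```r
# Hypothetical helper mirroring the preprocessing steps listed above.
clean_text <- function(x) {
  x <- tolower(x)                      # convert to lowercase
  x <- gsub("[[:punct:]]+", " ", x)    # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)    # remove numbers
  x <- gsub("[^a-z ]+", " ", x)        # remove remaining symbols (assumed definition)
  x <- gsub("\\s+", " ", x)            # collapse repeated white space
  trimws(x)                            # drop leading and trailing spaces
}

cleaned <- clean_text("Hello, World!! It's 2024 :)   ok")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it keeps words separated when no space follows a comma, but it also splits contractions: "It's" becomes the two tokens "it" and "s".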

Word Frequency Analysis

The five most common words in the cleaned sample, in decreasing order of frequency, were:

the
and
that
for
you
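A ranking like the one above can be produced by splitting the cleaned text into words and tabulating them. The sketch below uses base R only; `word_freq` and the two demo sentences are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: count word occurrences in already-cleaned lines.
word_freq <- function(clean_lines) {
  tokens <- unlist(strsplit(clean_lines, " ", fixed = TRUE))
  tokens <- tokens[nzchar(tokens)]        # drop empty strings
  sort(table(tokens), decreasing = TRUE)  # most frequent first
}

demo <- c("the cat and the dog", "the bird and you")
freq <- word_freq(demo)
head(freq, 2)
```

In the demo, "the" occurs three times and "and" twice, so they lead the sorted table.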

Most Frequent Words

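# Counts of the five most frequent words in the sample
barplot(
  c(50965, 30056, 13297, 10063, 8982),
  names.arg = c("the", "and", "that", "for", "you"),
  las = 2,  # draw axis labels perpendicular to the axes
  main = "Top Words"
)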

Bigram Analysis

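Some common bigrams found are listed below. Contractions such as "it’s" appear to have been split at the apostrophe during tokenization, so they count as two tokens.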

of the
in the
to the
it’s
on the

Trigram Analysis

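Some common trigrams found are listed below. Entries such as "i don’t" contain only two words; the contraction appears to have been split at the apostrophe during tokenization, giving three tokens.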

i don’t
one of the
a lot of
i’m not
it’s a
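Bigram and trigram counts of this kind can be built by sliding a window of n words over each tokenized line. The following is a minimal base R sketch; `make_ngrams` is a hypothetical helper and the demo sentence is invented for illustration.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens   <- strsplit("one of the best of the best", " ", fixed = TRUE)[[1]]
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Tabulate and sort, most frequent first.
bigram_freq  <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
```

Applying the helper line by line, rather than to the whole corpus as one token stream, avoids creating n-grams that span two unrelated lines.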

Coverage Analysis

# Number of unique words needed to cover 50% and 90% of all word occurrences
cover50 <- 239
cover90 <- 7873

cover50

## [1] 239

cover90

## [1] 7873
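The values above are hard-coded. One way such coverage figures could be derived from a word frequency table is sketched below; `words_for_coverage` is a hypothetical helper and the demo counts are invented, so the results do not correspond to the figures in this report.

```r
# Hypothetical helper: smallest number of most-frequent words whose
# combined count reaches `target` (a fraction) of all word occurrences.
words_for_coverage <- function(freq, target) {
  sorted    <- sort(as.numeric(freq), decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  which(cum_share >= target)[1]
}

# Invented counts for five words; cumulative shares are 0.40, 0.70, 0.85, 0.95, 1.00.
demo_freq <- c(a = 40, b = 30, c = 15, d = 10, e = 5)
demo50 <- words_for_coverage(demo_freq, 0.5)
demo90 <- words_for_coverage(demo_freq, 0.9)
```

With these counts, two words are enough to reach 50% of occurrences and four are needed to reach 90%.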

Findings

Some important observations from the analysis are:

Common English words dominate the corpus.
A relatively small number of words accounts for a large share of all word occurrences: 239 unique words cover 50% and 7,873 cover 90%.
Frequently occurring bigrams and trigrams represent common conversational patterns.
The dataset is suitable for building an n-gram based predictive text model.

Future Work

A basic n-gram prediction model has already been developed using unigram, bigram, and trigram frequencies. The next stage of the project is to improve prediction accuracy by implementing efficient backoff strategies and to optimize the model for memory use and runtime. Finally, the model will be integrated into a Shiny application that provides real-time next-word predictions for user input.
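To illustrate what a backoff strategy might look like, the sketch below tries the trigram table first, falls back to bigrams, and finally to the most frequent unigram. It is a plain frequency-based backoff without discounting or smoothing, not the project's actual model; the function names and the toy counts are hypothetical.

```r
# Toy n-gram counts (illustrative values, not results from this report).
unigrams <- c("the" = 5, "of" = 3, "lot" = 1)
bigrams  <- c("of the" = 3, "a lot" = 2, "lot of" = 2)
trigrams <- c("a lot of" = 2, "one of the" = 2)

# Most frequent word following `context` in `counts`, or NA if none is found.
best_continuation <- function(counts, context) {
  prefix <- paste0(context, " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  substring(best, nchar(prefix) + 1)
}

# Back off from trigrams to bigrams to the top unigram.
predict_next <- function(words) {
  n <- length(words)
  if (n >= 2) {
    guess <- best_continuation(trigrams, paste(words[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- best_continuation(bigrams, words[n])
    if (!is.na(guess)) return(guess)
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next(c("a", "lot"))     # found in the trigram table
predict_next(c("piece", "of"))  # falls back to the bigram table
predict_next(c("xyz"))          # falls back to the top unigram
```

Scanning name prefixes is easy to read but slow on large tables; a production version would more likely index the tables by context, for example with a keyed data table or a hashed environment.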

SwiftKey Capstone Milestone Report

Saurabh Bhatt

2026-06-14