Introduction

This report explores the SwiftKey dataset to understand the structure of natural language data. The final goal of the capstone project is to build a predictive text model that suggests the next word based on the words already entered.

The dataset consists of three English-language text files: blogs, news, and Twitter.

These files were provided as part of the SwiftKey Capstone Project.

Data Summary

# Number of lines in each of the three source files
data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148)
)
##      File   Lines
## 1   Blogs  899288
## 2    News 1010242
## 3 Twitter 2360148

Data Sampling

A random sample was taken from each file to reduce computational requirements while preserving the characteristics of the dataset.
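
As a minimal sketch of this kind of sampling, the base R snippet below keeps a random fraction of the lines of a character vector. The helper name `sample_lines`, the 5% fraction, and the seed are assumptions for illustration only; the sample size actually used for this report is not stated here.

```r
# Hypothetical helper: keep a random fraction of the lines in `lines`.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  n_keep <- max(1, floor(length(lines) * fraction))
  lines[sample.int(length(lines), n_keep)]  # sampling without replacement
}

# Toy stand-in for one text file (in practice the lines would come from
# readLines() on the file).
toy_lines <- paste("line", 1:1000)
toy_sample <- sample_lines(toy_lines, fraction = 0.05)
length(toy_sample)
```

Fixing the seed keeps the downstream frequency tables reproducible between runs.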

Data Cleaning

The following preprocessing steps were performed:

Word Frequency Analysis

The most common words found in the dataset were:

Most Frequent Words

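# Counts of the five most frequent words
barplot(
  c(50965, 30056, 13297, 10063, 8982),
  names.arg = c("the", "and", "that", "for", "you"),
  las = 2,              # draw axis labels perpendicular to the axes
  ylab = "Frequency",
  main = "Top Words"
)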
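
Counts like those plotted above can be obtained by splitting the text into words and tabulating them. The following is a minimal base R sketch on a made-up toy corpus (`toy_text`); the tokenization rule shown is an assumption and not necessarily the one used for this report.

```r
# Toy corpus standing in for the sampled, cleaned text
toy_text <- c("the cat and the dog", "the dog and you", "that is for you")

# Split into lowercase words on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(toy_text), "[^a-z']+"))
words <- words[nzchar(words)]  # drop empty strings left by the split

# Count each word and sort from most to least frequent
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 5)
```

The sorted table can be passed directly to `barplot()` to produce a chart of the top words.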

Bigram Analysis

Some common bigrams found were:

Trigram Analysis

Some common trigrams found were:
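
For reference, bigram and trigram counts can be built in base R by sliding a window over the tokens of each line. This is only a sketch: `make_ngrams` is a hypothetical helper and `toy_text` is made-up data, so the phrases below are not findings from the dataset.

```r
# Hypothetical helper: build all n-grams (space-joined strings) from one
# vector of tokens; returns character(0) if there are fewer than n tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy corpus; each element is treated as one line of text
toy_text <- c("thanks for the follow", "thanks for the help", "one of the best")
token_list <- strsplit(tolower(toy_text), "[^a-z']+")

# Build n-grams line by line so that no n-gram spans two lines
bigrams  <- unlist(lapply(token_list, make_ngrams, n = 2))
trigrams <- unlist(lapply(token_list, make_ngrams, n = 3))

# Frequency tables, most frequent first
bigram_freq  <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
head(bigram_freq, 3)
head(trigram_freq, 3)
```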

Coverage Analysis

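# Number of unique words (most frequent first) needed to cover
# 50% and 90% of all word occurrences
cover50 <- 239
cover90 <- 7873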

cover50
## [1] 239
cover90
## [1] 7873
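
Coverage figures of this kind can be derived from a word frequency table by accumulating the sorted counts until the target share is reached. The sketch below assumes a hypothetical helper `words_for_coverage` and a toy frequency vector; it illustrates the calculation and is not the code that produced the numbers above.

```r
# Hypothetical helper: number of most-frequent words whose combined count
# reaches the proportion `target` of all word occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(as.numeric(freq), decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)  # cumulative share of occurrences
  which(cum_share >= target)[1]          # first rank that reaches the target
}

# Toy frequencies (made up); real input would be the full word count table
toy_freq <- c(the = 50, and = 30, that = 10, "for" = 6, you = 4)
words_for_coverage(toy_freq, 0.5)
words_for_coverage(toy_freq, 0.9)
```

In the toy data one word already accounts for half of all occurrences, while three words are needed to reach 90%.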

Findings

Some important observations from the analysis are:

Future Work

A basic n-gram prediction model has already been developed using unigram, bigram, and trigram frequencies. The next stage of the project is to improve prediction accuracy by implementing efficient backoff strategies and to optimize the model for memory use and runtime. Finally, the model will be integrated into a Shiny application that provides real-time next-word predictions for user input.
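
To illustrate the backoff idea, the sketch below looks up the last two words in a trigram table, falls back to the last word in a bigram table, and finally to the most frequent unigram. All tables are toy data, and `best_continuation` and `predict_next` are hypothetical names. This is a plain lookup cascade without discounting or scoring, not necessarily the strategy the final model will use.

```r
# Toy n-gram count tables (named numeric vectors); a real model would build
# these from the corpus sample.
unigram_counts <- c(the = 5, you = 3, day = 2)
bigram_counts  <- c("for the" = 4, "the day" = 3, "the best" = 1)
trigram_counts <- c("thanks for the" = 2, "for the follow" = 2, "for the day" = 1)

# Return the last word of the most frequent n-gram that starts with `prefix`,
# or NA if no n-gram in `counts` starts with it.
best_continuation <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)  # keep only the final word of the n-gram
}

# Hypothetical predictor: try trigrams, back off to bigrams, then unigrams.
predict_next <- function(input) {
  tokens <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  n <- length(tokens)
  if (n >= 2) {
    prefix <- paste(tokens[(n - 1):n], collapse = " ")
    guess <- best_continuation(trigram_counts, prefix)
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- best_continuation(bigram_counts, tokens[n])
    if (!is.na(guess)) return(guess)
  }
  # Last resort: the most frequent single word
  names(unigram_counts)[which.max(unigram_counts)]
}

predict_next("thanks for the")  # matched in the trigram table
predict_next("enjoy the")       # backs off to the bigram table
predict_next("hello")           # backs off to the unigram table
```

Keeping the tables as named vectors is convenient for a sketch; for memory and runtime, a keyed lookup structure would likely be a better fit.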