The objective of this project is to explore the SwiftKey dataset and understand the structure of natural language data. The final goal of the capstone project is to build a predictive text model that can suggest the next word based on previously entered words.
The dataset consists of three English-language text files, drawn from blogs, news, and Twitter.
These files were provided as part of the SwiftKey Capstone Project.
# Line counts for each source file
data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148)
)
## File Lines
## 1 Blogs 899288
## 2 News 1010242
## 3 Twitter 2360148
A random sample was taken from each file to reduce computational requirements while preserving the characteristics of the dataset.
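As an illustration of the sampling step, the following is a minimal sketch in base R. The helper name `sample_lines`, the sampling fraction, and the seed are all assumptions made for this example; the report does not state which values were actually used. The toy vector stands in for the lines of one input file.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
# `fraction` and `seed` are illustrative values, not the ones used in the report.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  if (length(lines) == 0) return(lines)
  set.seed(seed)                                  # fix the seed so the sample is repeatable
  n <- max(1, floor(length(lines) * fraction))    # keep at least one line
  lines[sort(sample.int(length(lines), n))]       # sort indices to preserve original order
}

# Toy input standing in for the lines of one text file
toy_lines <- paste("line", 1:1000)
sampled <- sample_lines(toy_lines, fraction = 0.05)
length(sampled)
```

Fixing the seed inside the helper keeps the exploratory results repeatable across runs, which matters when the reported counts are later compared against the prediction model.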
The following preprocessing steps were performed:
The most common words found in the dataset were:
# Frequencies of the five most common words in the sample
barplot(
  c(50965, 30056, 13297, 10063, 8982),
  names.arg = c("the", "and", "that", "for", "you"),
  las = 2,
  main = "Top Words"
)
Some common bigrams found were:
Some common trigrams found were:
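Bigram and trigram counts of this kind can be obtained by sliding a fixed-size window over the tokenized text. The sketch below uses base R only; the helper `count_ngrams` and its cleaning rule (lowercasing and keeping letters, apostrophes, and spaces) are assumptions for illustration and may differ from the preprocessing actually applied in this report.

```r
# Hypothetical helper: count n-grams in a character vector of lines.
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(lines, function(line) {
    # Assumed cleaning rule: lowercase, keep letters, apostrophes and spaces
    clean <- gsub("[^a-z' ]", " ", tolower(line))
    words <- strsplit(trimws(clean), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of length n over the words of this line
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Toy corpus, not taken from the SwiftKey data
toy_corpus <- c("I want to go home", "I want to eat", "you want to go")
toy_bigrams  <- count_ngrams(toy_corpus, n = 2)
toy_trigrams <- count_ngrams(toy_corpus, n = 3)
head(toy_bigrams, 3)
```

Counting within each line, rather than across the whole file, avoids creating n-grams that span two unrelated tweets or sentences.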
cover50 <- 239
cover90 <- 7873
cover50
## [1] 239
cover90
## [1] 7873
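Assuming `cover50` and `cover90` denote the number of distinct words needed to account for 50% and 90% of all word occurrences in the sample (the report does not define them explicitly), such figures could be derived from a word-frequency vector as sketched below. The helper `words_for_coverage` and the toy frequencies are hypothetical.

```r
# Hypothetical helper: smallest number of distinct words whose cumulative
# share of all word occurrences reaches `target` (a proportion in (0, 1]).
words_for_coverage <- function(freqs, target) {
  cum_share <- cumsum(sort(freqs, decreasing = TRUE)) / sum(freqs)
  unname(which(cum_share >= target)[1])
}

# Toy frequencies, not taken from the SwiftKey data
toy_freqs <- setNames(c(50, 30, 10, 6, 4),
                      c("the", "and", "that", "for", "you"))
words_for_coverage(toy_freqs, 0.5)
words_for_coverage(toy_freqs, 0.9)
```

Under this reading, the large gap between the two values reflects the long tail of rare words that is typical of natural language.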
Some important observations from the analysis are:
A basic n-gram prediction model has already been developed using unigram, bigram, and trigram frequencies. The next stage of the project is to further improve the prediction accuracy by implementing efficient backoff strategies and optimizing the model for memory and runtime performance. Finally, the model will be integrated into a Shiny application that provides real-time next-word predictions for user input.
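To make the backoff idea concrete, here is a minimal sketch of a longest-context-first lookup: try the trigram table, fall back to the bigram table, and finally to the most frequent unigram. It is not the model built for this project, and it applies no discounting or smoothing. The function `predict_next` and the layout of the lookup tables (named count vectors keyed by the space-separated n-gram) are assumptions for illustration.

```r
# Hypothetical backoff lookup over named count vectors.
predict_next <- function(history, trigrams, bigrams, unigrams) {
  words <- strsplit(tolower(trimws(history)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Try the longest available context first, then back off to a shorter one
  for (k in c(2, 1)) {
    if (length(words) < k) next
    table_k <- if (k == 2) trigrams else bigrams
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- table_k[startsWith(names(table_k), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      # Strip the context to leave only the predicted word
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
  # No matching context: fall back to the most frequent single word
  names(unigrams)[which.max(unigrams)]
}

# Toy tables, not taken from the SwiftKey data
tri <- c("i want to" = 2, "want to go" = 2, "want to eat" = 1)
bi  <- c("want to" = 3, "i want" = 2, "to go" = 2, "to eat" = 1)
uni <- c(to = 3, want = 3, i = 2, go = 2)

predict_next("I want to", tri, bi, uni)   # uses the trigram table
predict_next("please to", tri, bi, uni)   # backs off to the bigram table
predict_next("hello", tri, bi, uni)       # backs off to the unigram table
```

Scanning names with `startsWith` is linear in the table size; for the memory and runtime goals mentioned above, an indexed structure keyed by context would likely be a better fit.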