Introduction

The goal of this capstone project is to build a next‑word prediction model similar to the SwiftKey keyboard, using a large corpus of English text from blogs, news, and Twitter.
This milestone report shows that the training data have been successfully loaded, summarizes key characteristics of the corpus, presents basic exploratory analysis, and outlines the plan for the prediction algorithm and Shiny application.

Data and summary statistics

The training data were downloaded from the official Coursera–SwiftKey link, which provides text files for several languages; this analysis focuses on the three US English files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
After loading the files into R, basic statistics were computed, including number of lines, characters, words, and file sizes, to understand the overall scale of the data set.
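The statistics described above could be computed along the following lines. This is a minimal base-R sketch, not necessarily the code used for the report: the helper name `summarise_corpus_file`, the whitespace-based word count, and the `data_dir` variable in the closing comment are all assumptions made for illustration.

```r
# Hypothetical helper: summarise one text file by size on disk,
# number of lines, number of characters, and number of words.
summarise_corpus_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  trimmed <- trimws(lines)
  # Approximate a word as a run of non-whitespace characters.
  words_per_line <- lengths(strsplit(trimmed, "\\s+"))
  words_per_line[!nzchar(trimmed)] <- 0L  # blank lines contribute no words
  data.frame(
    file    = basename(path),
    size_mb = file.size(path) / 1024^2,
    lines   = length(lines),
    chars   = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    words   = sum(words_per_line),
    stringsAsFactors = FALSE
  )
}

# Small self-contained demonstration on a temporary file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "", "jumps over"), demo_path)
demo_stats <- summarise_corpus_file(demo_path)
print(demo_stats)

# For the real corpus, something like the following would stack one row per file
# (data_dir is a placeholder for wherever the files were unzipped):
# paths <- file.path(data_dir, c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
# corpus_stats <- do.call(rbind, lapply(paths, summarise_corpus_file))
```

Counting words by splitting on whitespace is only an approximation; a dedicated tokenizer would treat punctuation and contractions more carefully and may give somewhat different totals.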

library(ggplot2)

# Number of top-ranked n-grams to show in each chart
top_n <- 20

# Most frequent unigrams (df_uni is expected to be sorted by decreasing freq)
ggplot(head(df_uni, top_n), aes(x = reorder(term, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 20 Unigrams", x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Most frequent bigrams (df_bi is expected to be sorted by decreasing freq)
ggplot(head(df_bi, top_n), aes(x = reorder(term, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Top 20 Bigrams", x = "Bigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
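The plotting code above reads from two data frames, df_uni and df_bi, each with a term column and a freq column ordered from most to least frequent. The report does not show how these tables were built, so the following is only a sketch of one possible approach in base R; the function name `ngram_freq`, the cleaning rule (lowercase, keep letters and apostrophes), and the sample sentences are assumptions for illustration.

```r
# Hypothetical helper: build an n-gram frequency table (columns: term, freq)
# from a character vector of text lines.
ngram_freq <- function(lines, n = 1L) {
  # Lowercase, replace anything other than letters, apostrophes, or spaces
  # with a space, then split each line on whitespace.
  clean <- gsub("[^a-z' ]+", " ", tolower(lines))
  tokens_by_line <- strsplit(trimws(clean), "\\s+")
  # Form n-grams within each line only, so they never span two lines.
  grams <- unlist(lapply(tokens_by_line, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1L)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1L)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(term = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Tiny demonstration corpus standing in for a sample of the real data.
sample_lines <- c("the cat sat on the mat", "the cat ran")
df_uni <- ngram_freq(sample_lines, n = 1L)
df_bi  <- ngram_freq(sample_lines, n = 2L)
print(head(df_uni, 3))
print(head(df_bi, 3))
```

On the full corpus this per-line loop would be slow, so in practice the same tables would more likely come from a sampled subset of lines or from a dedicated text-mining package.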