The goal of this capstone project is to build a
next‑word prediction model similar to the SwiftKey
keyboard, using a large corpus of English text from blogs, news, and
Twitter.
This milestone report shows that the training data have been
successfully loaded, summarizes key characteristics of the corpus,
presents basic exploratory analysis, and outlines the plan for the
prediction algorithm and Shiny application.
The training data were downloaded from the official Coursera–SwiftKey
link, which provides text files for several languages; this analysis
focuses on the three US English files: en_US.blogs.txt,
en_US.news.txt, and
en_US.twitter.txt.
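As an illustration of the loading step, the sketch below reads each file line by line in base R. The helper name `read_corpus` and the directory `final/en_US` are assumptions made for this example, not details taken from the report. The connection is opened in binary mode as a precaution, since text-mode reads can stop early on stray control characters on some platforms.

```r
# Hypothetical helper: read one corpus file into a character vector, one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "rb")      # binary mode: do not stop at embedded control characters
  on.exit(close(con), add = TRUE)     # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed location of the unzipped data; adjust to wherever the archive was extracted.
data_dir <- file.path("final", "en_US")
files <- file.path(data_dir,
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))

# Read only the files that are actually present, so the sketch also runs without the data.
present <- files[file.exists(files)]
corpora <- lapply(present, read_corpus)
names(corpora) <- basename(present)
```

Keeping each file as a plain character vector makes it easy to sample lines before building heavier structures such as n-gram tables.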
After loading the files into R, basic statistics were computed,
including number of lines, characters, words, and file sizes, to
understand the overall scale of the data set.
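The statistics described above could be computed along the following lines. This is a minimal base R sketch, not necessarily the code used for the report: `corpus_stats` is a hypothetical helper, and words are counted by splitting on whitespace, which only approximates a proper tokenizer.

```r
# Hypothetical helper: basic size statistics for a corpus held as a character vector.
corpus_stats <- function(lines) {
  # Split each line on runs of whitespace; empty lines contribute zero words.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines = length(lines),
    chars = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    words = sum(words_per_line)
  )
}

# Small in-memory example in place of the real files.
demo  <- c("The quick brown fox", "", "jumps over the lazy dog")
stats <- corpus_stats(demo)
print(stats)

# File size on disk can be added separately, for example:
# file.size("en_US.blogs.txt") / 1024^2   # size in MB
```

Applying the helper to each of the three files and binding the rows with `rbind` gives a single summary table.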
library(ggplot2)

# df_uni and df_bi are assumed to be sorted by decreasing freq,
# so head() keeps the most frequent terms.
top_n <- 20

# Bar chart of the most frequent single words
ggplot(head(df_uni, top_n), aes(x = reorder(term, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 20 Unigrams", x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar chart of the most frequent two-word sequences
ggplot(head(df_bi, top_n), aes(x = reorder(term, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Top 20 Bigrams", x = "Bigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
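The plotting code expects `df_uni` and `df_bi` to be data frames with a `term` column and a `freq` column, ordered from most to least frequent. The report does not show how they were built, so the following is only a sketch of one way to do it in base R. The function `ngram_freq` and its cleaning rule (lowercase, keep letters and apostrophes) are assumptions for illustration; a dedicated text-mining package would be a reasonable alternative for the full corpus.

```r
# Hypothetical helper: count n-grams in a character vector of lines.
ngram_freq <- function(lines, n = 1) {
  # Lowercase and replace anything other than letters, apostrophes, and spaces.
  clean  <- gsub("[^a-z' ]+", " ", tolower(lines))
  tokens <- strsplit(trimws(clean), "\\s+")

  # Build n-grams within each line so that they never span a line break.
  grams <- unlist(lapply(tokens, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))

  # Tabulate and sort by decreasing frequency.
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(term = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Tiny example in place of a sample of the real corpus.
demo   <- c("the cat sat", "the cat ran")
df_uni <- ngram_freq(demo, n = 1)
df_bi  <- ngram_freq(demo, n = 2)
print(df_bi)
```

On the full data set this approach is likely to be slow and memory-hungry, so running it on a random sample of lines first is a sensible precaution.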