---
title: "Exploratory Analysis and Plan for Next-Word Prediction App"
output: html_document
date: "2026-06-14"
---
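This report explores a sample of English blog, news, and Twitter text as a first step toward a next-word prediction model. The goal is to understand the data before building the prediction algorithm and a Shiny app around it. To keep the analysis fast, only the first 2,000 lines of each file are read.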
library(stringi)  # stri_count_words()
library(tm)       # corpus construction and text transformations
blogs <- readLines("en_US.blogs.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(news)),
                 sum(stri_count_words(twitter)))
)
data_summary
##      File Line_Count Word_Count
## 1   Blogs       2000      81987
## 2    News       2000      69609
## 3 Twitter       2000      25389
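The counts above suggest that the three sources differ mainly in how long their lines are. A small sketch makes that explicit by rebuilding the table from the reported counts instead of re-reading the files. The names `line_stats` and `Words_Per_Line` are additions for illustration and are not part of the original analysis.

```r
# Counts copied from the summary table above (first 2,000 lines per file).
line_stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Line_Count = c(2000, 2000, 2000),
  Word_Count = c(81987, 69609, 25389)
)

# Average number of words per line, rounded to one decimal place.
line_stats$Words_Per_Line <- round(line_stats$Word_Count / line_stats$Line_Count, 1)
line_stats
```

On these counts, blog lines average about 41 words, news lines about 35, and tweets about 13.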
# Combine the three samples into one volatile corpus (VCorpus).
sample_data <- c(blogs, news, twitter)
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words before punctuation so contractions such as "don't" still match the stop-word list.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
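# Count how often each remaining term occurs across all documents.
dtm <- DocumentTermMatrix(corpus)
freq <- slam::col_sums(dtm)
freq <- sort(freq, decreasing = TRUE)
# Plot the ten most frequent terms.
barplot(freq[1:10], main = "Top 10 Frequent Words", las = 2)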
This exploratory analysis of the first 2,000 lines from each source provides a foundation for building the next-word prediction model and the Shiny app.
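As a rough illustration of where the prediction step could go next, the sketch below builds a bigram lookup in base R and returns the most frequent follower of a given word. It is a minimal sketch under stated assumptions, not the model this report commits to: `toy_text` is a made-up three-line corpus, and `predict_next()` is a hypothetical helper name.

```r
# Hypothetical toy corpus; a real model would use the blog, news, and Twitter text.
toy_text <- c("i love new york", "i love new music", "new york is big")

# Split each line into lowercase word tokens.
tokens <- strsplit(tolower(toy_text), "\\s+")

# Collect adjacent word pairs (bigrams) within each line.
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1), stringsAsFactors = FALSE)
}))

# predict_next() is a hypothetical helper: it returns the most frequent
# follower of `word`, or NA if the word never appears as a first element.
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == tolower(word)]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

predict_next("new", bigrams)   # "york" (follows "new" twice, "music" once)
predict_next("love", bigrams)  # "new"
```

This sketch keeps stop words, because they are often the very words a user types next. That is a design choice made here for illustration, not a decision taken by the analysis above.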