---
title: "Exploratory Analysis and Plan for Next-Word Prediction App"
output: html_document
date: "2026-06-14"
---

Introduction

This report explores samples of English blog, news, and Twitter text as groundwork for a next-word prediction model. The goal is to understand the data before building the prediction algorithm and a Shiny app.

Data Loading (Sample Only for Performance)

# Read only the first 2000 lines of each file to keep the analysis fast
blogs <- readLines("en_US.blogs.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", n = 2000, encoding = "UTF-8", skipNul = TRUE)
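
Reading with `n = 2000` takes the first 2000 lines of each file, which may not be representative if the files are ordered in some way. A minimal sketch of drawing a reproducible random sample instead, assuming the full file fits in memory; `sample_lines` and its arguments are hypothetical names, not part of the original analysis:

```r
# Hypothetical helper: draw a reproducible random sample of lines.
# `lines` is a character vector, e.g. the result of readLines() on a whole file.
sample_lines <- function(lines, n = 2000, seed = 42) {
  set.seed(seed)                              # fix the RNG so the sample is repeatable
  n <- min(n, length(lines))                  # never request more lines than exist
  lines[sort(sample.int(length(lines), n))]   # keep the original line order
}

# Toy usage with an in-memory vector standing in for a file's contents
toy <- paste("line", 1:10)
picked <- sample_lines(toy, n = 3)
```

Note that `set.seed()` inside the helper resets the global random number generator, which is acceptable for a one-off report but may be undesirable in a larger pipeline.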

Basic Summary Statistics

library(stringi)  # provides stri_count_words()

# Line and word counts for each sampled file
data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(news)),
                 sum(stri_count_words(twitter)))
)

data_summary
##      File Line_Count Word_Count
## 1   Blogs       2000      81987
## 2    News       2000      69609
## 3 Twitter       2000      25389

Combine and Clean Data

sample_data <- c(blogs, news, twitter)

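library(tm)  # provides Corpus(), tm_map(), and stopwords()

# Build a corpus, then lowercase it and strip punctuation, numbers, and English stop words
corpus <- Corpus(VectorSource(sample_data))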
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

Word Frequency Analysis

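library(slam)  # provides col_sums() for sparse matrices

# Total count of each term across all documents
dtm <- DocumentTermMatrix(corpus)
freq <- col_sums(dtm)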
freq <- sort(freq, decreasing = TRUE)

barplot(freq[1:10], main = "Top 10 Frequent Words", las = 2)
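
Single-word counts describe the vocabulary, but next-word prediction typically relies on word sequences. A minimal sketch of counting adjacent word pairs (bigrams) in base R, shown on toy sentences; `count_bigrams` is a hypothetical helper, and its simple tokenization is an assumption rather than the cleaning pipeline used above:

```r
# Hypothetical helper: count bigrams across a character vector of sentences.
count_bigrams <- function(texts) {
  # Lowercase, then replace everything except letters, apostrophes, and spaces
  cleaned <- gsub("[^a-z' ]", " ", tolower(texts))
  tokens <- strsplit(trimws(cleaned), "\\s+")
  pairs <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))   # no bigram in a one-word line
    paste(head(w, -1), tail(w, -1))           # each word joined with its successor
  }))
  sort(table(pairs), decreasing = TRUE)
}

# Toy usage
toy <- c("I love data science", "I love R", "Data science is fun")
bigrams <- count_bigrams(toy)
```

Bigrams are counted within each line only, so no pair spans two separate documents.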

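Findings

In this 2,000-line sample, blog lines are the longest on average (about 41 words per line), followed by news (about 35) and Twitter (about 13).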

Plan for Prediction Algorithm

Plan for Shiny App

Conclusion

This analysis provides a foundation for building a next-word prediction model and Shiny app.