1. Introduction

This project explores text data from blogs, news, and Twitter.
The goal is to build a next-word prediction model.

2. Load Data (first 10,000 lines per file)

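library(stringi)   # stri_count_words()
library(ggplot2)   # histograms in section 4

set.seed(123)  # fixed seed; only matters if random sampling is added later

# Read only the first n lines of a file so the full corpus never has to
# fit in memory. Note that this is a head-of-file subset, not a random sample.
read_sample <- function(path, n = 10000) {
  con <- file(path, "r")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}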

blogs <- read_sample("en_US.blogs.txt", 10000)
news <- read_sample("en_US.news.txt", 10000)
twitter <- read_sample("en_US.twitter.txt", 10000)
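
The loader above keeps the first n lines of each file, which is fast but is not a random sample. If a random subset is preferred, one option is to read the file once and draw line indices with sample(). The sketch below is an illustrative alternative, not the method used for the numbers in this report; sample_lines() and the temporary demo file are hypothetical names introduced here.

```r
# Hypothetical helper (not part of the original analysis): draw a random
# subset of lines instead of taking the first n.
sample_lines <- function(path, n = 10000, seed = 123) {
  con <- file(path, "rb")  # binary mode; avoids truncated reads on Windows
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  set.seed(seed)           # same seed gives the same subset
  lines[sample(length(lines), min(n, length(lines)))]
}

# Demo on a small temporary file so the sketch runs without the corpus.
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), demo_path)
demo_sample <- sample_lines(demo_path, n = 10)
print(demo_sample)
```

Unlike read_sample(), this version loads the whole file before sampling, so it trades memory for a subset that is representative of the entire file.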

3. Basic Summary

summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  AvgWords = c(
    mean(stri_count_words(blogs)),
    mean(stri_count_words(news)),
    mean(stri_count_words(twitter))
  )
)

knitr::kable(summary_table)

Dataset   Lines   Words    AvgWords
Blogs     10000   413215   41.3215
News      10000   349062   34.9062
Twitter   10000   126736   12.6736

4. Word Distribution Plots

Blogs

blogs_wc <- stri_count_words(blogs)

ggplot(data.frame(x = blogs_wc), aes(x)) +
  geom_histogram(binwidth = 10, fill = "steelblue") +
  labs(title = "Blogs Word Distribution", x = "Words", y = "Frequency")

News

news_wc <- stri_count_words(news)

ggplot(data.frame(x = news_wc), aes(x)) +
  geom_histogram(binwidth = 10, fill = "darkgreen") +
  labs(title = "News Word Distribution", x = "Words", y = "Frequency")

Twitter

twitter_wc <- stri_count_words(twitter)

ggplot(data.frame(x = twitter_wc), aes(x)) +
  geom_histogram(binwidth = 5, fill = "tomato") +
  labs(title = "Twitter Word Distribution", x = "Words", y = "Frequency")

5. Key Observations

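Blog entries are the longest, averaging about 41 words per line
News entries are formal and structured, averaging about 35 words per line
Twitter entries are short and informal, averaging about 13 words per line
The full files are large, but reading only 10,000 lines from each keeps the analysis manageable
The three sources cover a range of lengths and styles, which makes them suitable for building predictive text models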

6. Next Word Prediction Plan

We will build an N-gram model:

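Clean the text (convert to lowercase, remove punctuation)
Tokenize the text into words
Build unigram, bigram, and trigram frequency tables
Predict with a backoff method: try the trigram table first, then the bigram table, then the unigram table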
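
The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final model: every function name (clean_text, tokenize, ngram_counts, predict_next) is hypothetical, the four-line corpus is made up for the demo, and the backoff is a plain "longest matching context wins" rule with no discounting or smoothing.

```r
# Step 1: lowercase and strip everything except letters, apostrophes, spaces.
clean_text <- function(x) {
  x <- gsub("[^a-z' ]+", " ", tolower(x))
  trimws(gsub(" +", " ", x))
}

# Step 2: split each cleaned line into a vector of words.
tokenize <- function(x) {
  lapply(strsplit(clean_text(x), " ", fixed = TRUE),
         function(w) w[nzchar(w)])
}

# Step 3: count n-grams within each line, sorted from most to least frequent.
ngram_counts <- function(token_list, n) {
  grams <- as.character(unlist(lapply(token_list, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  sort(table(grams), decreasing = TRUE)
}

# Step 4: back off from trigram to bigram to unigram.
predict_next <- function(input, uni, bi, tri) {
  words <- unlist(tokenize(input))
  k <- length(words)
  # Most frequent n-gram that starts with `prefix`, or NULL if there is none.
  best_match <- function(counts, prefix) {
    hits <- names(counts)[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) == 0) return(NULL)
    sub(".* ", "", hits[1])  # tables are sorted, so the first hit is the top one
  }
  if (k >= 2) {
    out <- best_match(tri, paste(words[k - 1], words[k]))
    if (!is.null(out)) return(out)
  }
  if (k >= 1) {
    out <- best_match(bi, words[k])
    if (!is.null(out)) return(out)
  }
  names(uni)[1]              # last resort: the most frequent single word
}

# Toy corpus, invented for this demo only.
corpus <- c("I love R.", "I love data!", "I love R programming", "We love R")
tokens <- tokenize(corpus)
uni <- ngram_counts(tokens, 1)
bi  <- ngram_counts(tokens, 2)
tri <- ngram_counts(tokens, 3)

p_tri <- predict_next("I love", uni, bi, tri)       # answered by the trigram table
p_bi  <- predict_next("big R", uni, bi, tri)        # backs off to the bigram table
p_uni <- predict_next("hello world", uni, bi, tri)  # backs off to the unigram table
print(c(p_tri, p_bi, p_uni))
```

Scanning table names with startsWith() is fine for a toy corpus but slow on real data; storing each context and its best next word in a keyed lookup table would be a more practical layout.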

7. Shiny App Plan

The Shiny app will:

Take text input from the user
Predict the next word with minimal delay
Look up predictions in precomputed n-gram tables
Provide a simple interface
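
A minimal sketch of such an app is shown below, assuming the shiny package is installed. The tiny bigram_table, the predict_word() helper, the "the" fallback, and the input and output IDs are all hypothetical placeholders; a real app would load tables built from the corpus, for example from a saved file at startup.

```r
library(shiny)

# Hypothetical precomputed table: one best next word per one-word context.
bigram_table <- data.frame(
  context  = c("i", "love", "thank"),
  nextword = c("love", "r", "you"),
  stringsAsFactors = FALSE
)

# Look up the last typed word; return a fixed fallback when nothing matches.
predict_word <- function(text, ngrams, fallback = "the") {
  words <- strsplit(tolower(trimws(text)), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return("")
  hit <- ngrams$nextword[ngrams$context == words[length(words)]]
  if (length(hit) > 0) hit[1] else fallback
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  strong("Suggested next word:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # Re-evaluated whenever the text box changes, so the suggestion
  # updates as the user types.
  output$prediction <- renderText(predict_word(input$phrase, bigram_table))
}

# Build the app object; launch it with shiny::runApp(app).
app <- shinyApp(ui, server)
```

Because the table is built once outside the server function, each prediction is a simple lookup rather than a recomputation, which is what keeps the response fast.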

8. Conclusion

This analysis prepares the dataset for building a predictive text application.

Exploratory Data Analysis for Next Word Prediction

Vansh Ojha

2026-06-26