1. Introduction

This project explores text data from blogs, news, and Twitter.

The goal is to build a next-word prediction model using the Coursera Data Science Capstone dataset.

2. Load Data (First 10,000 Lines per File)

library(stringi)
library(ggplot2)
library(knitr)

set.seed(123)

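# Read the first n lines of a text file (the head of the file, not a random draw).
read_sample <- function(path, n = 10000) {
  # Binary mode avoids text-mode handling of stray control characters,
  # which on some platforms can end the read early.
  con <- file(path, "rb")
  on.exit(close(con))  # close the connection even if readLines() fails

  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}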

blogs   <- read_sample("en_US.blogs.txt",10000)
news    <- read_sample("en_US.news.txt",10000)
twitter <- read_sample("en_US.twitter.txt",10000)
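
The function above reads the head of each file, so the seed set earlier has no effect on which lines are kept. If a random sample is preferred, one option is sketched below. It assumes the whole file fits in memory, and the helper name `read_random_sample` is hypothetical rather than part of the original analysis.

```r
# Hypothetical helper: draw a uniform random sample of lines from a text file.
# Assumption: the full file fits in memory.
read_random_sample <- function(path, n = 10000, seed = 123) {
  con <- file(path, "rb")
  on.exit(close(con))
  all_lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)

  set.seed(seed)  # make the draw reproducible
  keep <- sample.int(length(all_lines), size = min(n, length(all_lines)))
  all_lines[sort(keep)]  # sort indices to preserve the original line order
}

# Self-contained demonstration on a temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), demo_path)
demo_sample <- read_random_sample(demo_path, n = 10)
length(demo_sample)
```

A random draw costs one full pass over each file, which is the trade-off against the cheaper head-of-file read used in this report.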

3. Basic Summary

# Words per line, computed once per source and reused for every column
word_counts <- list(
  Blogs   = stri_count_words(blogs),
  News    = stri_count_words(news),
  Twitter = stri_count_words(twitter)
)

summary_table <- data.frame(
  Dataset  = names(word_counts),
  Lines    = unname(lengths(word_counts)),          # number of lines read
  Words    = unname(sapply(word_counts, sum)),      # total words
  AvgWords = unname(sapply(word_counts, mean))      # mean words per line
)

kable(summary_table, digits = 2)

Dataset	Lines	Words	AvgWords
Blogs	10000	413215	41.32
News	10000	349062	34.91
Twitter	10000	126736	12.67

4. Word Distribution Plots

Blogs

blogs_wc <- stri_count_words(blogs)

ggplot(data.frame(words=blogs_wc),
       aes(x=words))+

  geom_histogram(
    bins=40,
    fill="steelblue",
    color="white"
  )+

  labs(
    title="Blogs: Words per Line",
    x="Words per line",
    y="Number of lines"
  )+

  theme_minimal()

News

news_wc <- stri_count_words(news)

ggplot(data.frame(words=news_wc),
       aes(x=words))+

  geom_histogram(
    bins=40,
    fill="darkgreen",
    color="white"
  )+

  labs(
    title="News: Words per Line",
    x="Words per line",
    y="Number of lines"
  )+

  theme_minimal()

Twitter

twitter_wc <- stri_count_words(twitter)

ggplot(data.frame(words=twitter_wc),
       aes(x=words))+

  geom_histogram(
    bins=30,
    fill="tomato",
    color="white"
  )+

  labs(
    title="Twitter: Words per Line",
    x="Words per line",
    y="Number of lines"
  )+

  theme_minimal()
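
The three histograms share the same structure, so they could also be drawn with a single faceted plot. The sketch below is one way to do that. `plot_word_counts` is a hypothetical helper, and the toy input is a stand-in so the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: one faceted histogram for a named list of
# words-per-line vectors, instead of one ggplot() call per source.
plot_word_counts <- function(counts, bins = 40) {
  df <- data.frame(
    source = rep(names(counts), lengths(counts)),
    words  = unlist(counts, use.names = FALSE)
  )
  ggplot(df, aes(x = words)) +
    geom_histogram(bins = bins, fill = "steelblue", color = "white") +
    facet_wrap(~ source, scales = "free") +  # each source gets its own axes
    labs(x = "Words per line", y = "Number of lines") +
    theme_minimal()
}

# Toy input so the sketch runs on its own; in this report the input would be
# list(Blogs = blogs_wc, News = news_wc, Twitter = twitter_wc)
toy_counts <- list(A = c(3, 5, 8, 13, 40), B = c(1, 2, 2, 4, 9))
p <- plot_word_counts(toy_counts)
```

Free scales keep the much shorter Twitter lines readable next to the longer blog and news lines, at the cost of axes that are not directly comparable across panels.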

5. Key Observations

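Blogs have the longest entries, averaging 41.32 words per line, compared with 34.91 for news and 12.67 for Twitter.
News entries are somewhat shorter than blog entries and tend to be formal and structured.
Twitter contains short, informal text, averaging less than a third of the blog line length.
The first 10,000 lines of each file are enough for exploratory analysis, although they are not a random sample of the full corpus.
The words-per-line distributions are right-skewed: most lines are short, with a long tail of much longer ones.
Together the three sources cover both formal and informal writing, which makes the dataset appropriate for building a predictive text model.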
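
Right skew can be checked numerically as well as visually: the mean sits above the median and the moment coefficient of skewness is positive. The sketch below shows one way to compute these. `skew_summary` is a hypothetical helper and the toy vector is illustrative, not taken from the data.

```r
# Hypothetical helper: summarise the shape of a words-per-line vector.
skew_summary <- function(wc) {
  m <- mean(wc)
  s <- sqrt(mean((wc - m)^2))           # population standard deviation
  c(
    mean     = m,
    median   = median(wc),
    p95      = unname(quantile(wc, 0.95)),
    skewness = mean((wc - m)^3) / s^3   # moment coefficient of skewness
  )
}

# Toy vector with a long right tail; in this report the input would be
# blogs_wc, news_wc, or twitter_wc
toy_wc <- c(2, 3, 3, 4, 4, 5, 5, 6, 8, 60)
skew_summary(toy_wc)
```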

6. Next Word Prediction Plan

The next stage of the project will include:

Cleaning the text
- Convert to lowercase
- Remove punctuation
- Remove numbers
- Remove extra whitespace
Tokenization
Build:
- Unigram model
- Bigram model
- Trigram model
Apply a backoff strategy for prediction.
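
The steps above can be sketched end to end in base R. This is a minimal illustration under stated assumptions, not the final model: all function names are hypothetical, the three-sentence corpus is a toy, and the backoff is the simplest kind (try the trigram context, then the bigram context, then the most frequent unigrams) with no discounting or smoothing.

```r
# Cleaning steps from the plan: lowercase, strip punctuation and digits,
# collapse whitespace. Apostrophes are dropped first so "don't" stays one token.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("'", "", x, fixed = TRUE)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:digit:]]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Tokenization: one character vector of words per input line
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

# Count n-grams of order n across a list of token vectors,
# returned as a frequency table sorted from most to least frequent.
count_ngrams <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(tk) {
    if (length(tk) < n) return(character(0))
    starts <- seq_len(length(tk) - n + 1)
    vapply(starts, function(i) paste(tk[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Simple backoff: longest context first, then shorter, then unigrams.
# `models` is a list where models[[n]] holds the n-gram table.
predict_next <- function(text, models, k = 3) {
  tk <- tokenize(text)[[1]]
  for (n in 3:2) {
    tab <- models[[n]]
    if (length(tk) >= n - 1 && length(tab) > 0) {
      prefix <- paste0(paste(tail(tk, n - 1), collapse = " "), " ")
      hits <- tab[startsWith(names(tab), prefix)]
      if (length(hits) > 0) {
        return(head(sub(prefix, "", names(hits), fixed = TRUE), k))
      }
    }
  }
  head(names(models[[1]]), k)  # unigram fallback: most frequent words
}

# Toy corpus (illustrative only)
corpus <- c("I love data science.",
            "I love data analysis!",
            "I love data science too")
tokens <- tokenize(corpus)
models <- lapply(1:3, function(n) count_ngrams(tokens, n))

predict_next("we love data", models)  # "science" "analysis"
```

Here the trigram context "love data" matches directly. A phrase such as "big data" has no matching trigram, so the same call backs off to the bigram context "data" and returns the same candidates.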

7. Shiny App Plan

The Shiny application will:

Accept user input text.
Predict the next word.
Use precomputed n-gram frequency tables.
Display top predictions.
Provide a simple and interactive interface.
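
A minimal sketch of such an app is shown below, assuming the `shiny` package is installed. The lookup list is a tiny hypothetical stand-in for the precomputed n-gram frequency tables, and `predict_next_word` is an illustrative name, not a finished predictor.

```r
library(shiny)

# Stand-in for the precomputed n-gram tables: a tiny hypothetical lookup
# from a context word to candidate next words, most frequent first.
ngram_lookup <- list(
  data = c("science", "analysis"),
  love = c("data")
)

# Hypothetical predictor: look up the last word typed; return nothing if unseen.
predict_next_word <- function(text, k = 3) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  hits <- ngram_lookup[[tail(words, 1)]]
  if (is.null(hits)) character(0) else head(hits, k)
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("predictions")
)

server <- function(input, output, session) {
  # Recomputed whenever the text input changes
  output$predictions <- renderTable({
    preds <- predict_next_word(input$phrase)
    data.frame(Rank = seq_along(preds), Prediction = preds)
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Keeping the prediction function separate from the UI code means it can be tested without starting the app.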

8. Conclusion

This exploratory analysis summarizes the size and words-per-line characteristics of the three text sources. The findings will guide preprocessing, language modeling, and the development of a next-word prediction application based on n-gram models.

Exploratory Data Analysis for Next Word Prediction

Samyak Nahta

2026-06-29