1. Introduction

This project explores text data from blogs, news, and Twitter.

The goal is to build a next-word prediction model using the Coursera Data Science Capstone dataset.


2. Load Data (first 10,000 lines per file)

library(stringi)
library(ggplot2)
library(knitr)

set.seed(123)

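# Read the first n lines of a text file (not a random sample).
read_sample <- function(path, n = 10000) {
  # Binary mode keeps stray control characters from ending the read early
  con <- file(path, "rb")
  on.exit(close(con))  # close the connection even if readLines() fails

  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}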

blogs   <- read_sample("en_US.blogs.txt", 10000)
news    <- read_sample("en_US.news.txt", 10000)
twitter <- read_sample("en_US.twitter.txt", 10000)
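
read_sample() returns the first n lines of each file, so set.seed(123) has no effect on it and the sample reflects only the start of each file. If a random sample is preferred, one option is to read the whole file and draw lines with sample.int(). The sketch below is an assumption, not part of the original analysis: read_random_sample is a hypothetical helper, and it loads the full file into memory, which may be too heavy on a small machine.

```r
# Hypothetical helper (not in the original report): draw a random sample
# of n lines instead of taking the first n.
read_random_sample <- function(path, n = 10000, seed = 123) {
  con <- file(path, "rb")  # binary mode, as a precaution against control characters
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  set.seed(seed)           # make the draw reproducible
  # Sort the sampled indices so lines keep their original file order
  lines[sort(sample.int(length(lines), min(n, length(lines))))]
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), tmp)
demo_sample <- read_random_sample(tmp, n = 10)
length(demo_sample)  # 10
```

Fixing the seed inside the helper means repeated calls return the same lines, which keeps the summary table reproducible.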

3. Basic Summary

summary_table <- data.frame(

  Dataset = c("Blogs","News","Twitter"),

  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),

  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),

  AvgWords = c(
    mean(stri_count_words(blogs)),
    mean(stri_count_words(news)),
    mean(stri_count_words(twitter))
  )
)

kable(summary_table,digits=2)

|Dataset | Lines|  Words| AvgWords|
|:-------|-----:|------:|--------:|
|Blogs   | 10000| 413215|    41.32|
|News    | 10000| 349062|    34.91|
|Twitter | 10000| 126736|    12.67|

4. Word Distribution Plots

Blogs

blogs_wc <- stri_count_words(blogs)

ggplot(data.frame(words=blogs_wc),
       aes(x=words))+

  geom_histogram(
    bins=40,
    fill="steelblue",
    color="white"
  )+

  labs(
    title="Blogs Word Distribution",
    x="Words per line",
    y="Frequency"
  )+

  theme_minimal()

News

news_wc <- stri_count_words(news)

ggplot(data.frame(words=news_wc),
       aes(x=words))+

  geom_histogram(
    bins=40,
    fill="darkgreen",
    color="white"
  )+

  labs(
    title="News Word Distribution",
    x="Words per line",
    y="Frequency"
  )+

  theme_minimal()

Twitter

twitter_wc <- stri_count_words(twitter)

ggplot(data.frame(words=twitter_wc),
       aes(x=words))+

  geom_histogram(
    bins=30,
    fill="tomato",
    color="white"
  )+

  labs(
    title="Twitter Word Distribution",
    x="Words per line",
    y="Frequency"
  )+

  theme_minimal()
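
The histograms are easier to compare alongside a few quantiles of the per-line word counts. A minimal sketch follows, assuming vectors like blogs_wc, news_wc, and twitter_wc; here they are replaced by small made-up vectors so the block runs on its own, and wc_quantiles is a hypothetical helper name.

```r
# Hypothetical helper: quartiles and maximum of per-line word counts
wc_quantiles <- function(wc) {
  quantile(wc, probs = c(0.25, 0.5, 0.75, 1), names = FALSE)
}

# Made-up stand-ins for blogs_wc, news_wc, twitter_wc (illustration only)
toy_counts <- list(
  Blogs   = c(5, 12, 40, 41, 90),
  News    = c(8, 20, 33, 35, 60),
  Twitter = c(2, 7, 12, 15, 28)
)

# One row per dataset, one column per quantile
quantile_table <- data.frame(
  Dataset = names(toy_counts),
  t(sapply(toy_counts, wc_quantiles))
)
names(quantile_table)[-1] <- c("Q1", "Median", "Q3", "Max")
quantile_table
```

With the real vectors, the list would hold blogs_wc, news_wc, and twitter_wc instead of the toy values.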


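5. Key Observations

In the 10,000-line samples, blog lines are the longest on average (41.32 words), followed by news (34.91) and Twitter (12.67). Blogs contribute the most words overall (413,215) and Twitter the fewest (126,736).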

6. Next Word Prediction Plan

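The next stage of the project will include preprocessing the text, building n-gram language models, and developing the next-word prediction application.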
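
One piece of that stage, an n-gram model, can be sketched as follows. The block builds a bigram frequency table from a toy corpus and predicts the most frequent follower of a word. It is a minimal base-R sketch under stated assumptions, not the project's model: tokenize, build_bigrams, and predict_next are hypothetical names, tokenization is a simple lowercase split on non-letters, and there is no smoothing or backoff.

```r
# Hypothetical helpers for illustration only

# Lowercase a line and split it on anything that is not a letter or apostrophe
tokenize <- function(line) {
  tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
  tokens[nzchar(tokens)]
}

# Count adjacent word pairs across all lines
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(counts$first, -counts$n), ]
}

# Return the most frequent follower of `word`, or NA if the word is unseen
predict_next <- function(bigrams, word) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$n)]
}

toy_corpus <- c("I love data science", "I love R", "I like data")
bigrams <- build_bigrams(toy_corpus)
predict_next(bigrams, "I")  # "love"
```

A fuller model would likely add trigrams and fall back to shorter n-grams when a context is unseen; that choice is left open here.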


7. Shiny App Plan

The Shiny application will let a user type a phrase and see the predicted next word.


8. Conclusion

This exploratory analysis summarizes line counts, word counts, and per-line word distributions for 10,000-line samples of the three text sources. The findings will guide the preprocessing, language modeling, and development of a next-word prediction application using n-gram models.