1. Introduction

This report presents an initial exploration of the text data that will later be used to build a prediction algorithm and a Shiny application.

At this stage, the goal is not to build a model, but to understand the size, structure, and basic characteristics of the data.

This report is written for a non-technical audience and highlights only the most important findings.


2. Data Loading

library(stringi)  # provides stri_count_words(), used in the sections below

# Example assuming the files are in the "final/en_US/" folder
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news  <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

cat("Data loaded successfully!\n")
## Data loaded successfully!
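
On some platforms, a text-mode read can stop early if a file contains stray control characters, so it is worth comparing the loaded line counts against the summary table. One possible safeguard, shown here as a minimal sketch, is to read through a connection opened in binary mode. The helper name read_corpus is hypothetical and is not part of the original report; the demonstration uses a temporary file rather than the real data.

```r
# Hypothetical helper (an assumption, not the report's own code): read a text
# file through a connection opened in binary mode, then close the connection.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file instead of the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
unlink(tmp)
```

With the real data, the call would look like read_corpus("final/en_US/en_US.news.txt").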

3. Data Summary

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

knitr::kable(data_summary, format.args = list(big.mark = ","),
             caption = "Summary of Text Data Sources")
Summary of Text Data Sources

Source       Lines        Words
Blogs      899,288   37,546,250
News     1,010,242   34,762,395
Twitter  2,360,148   30,093,413

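The table above shows that Twitter contributes the most lines (about 2.4 million) but the fewest words (about 30 million), while Blogs contribute the fewest lines (about 0.9 million) but the most words (about 37.5 million). News sits between the two on both measures.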
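
Dividing each source's word total by its line count gives a first impression of how long a typical line is. The sketch below does this using the totals copied from the summary table; the data frame name reported and the two derived columns are introduced here for illustration only.

```r
# Totals copied from the summary table above.
reported <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 1010242, 2360148),
  Words  = c(37546250, 34762395, 30093413)
)

# Average number of words per line for each source.
reported$WordsPerLine <- round(reported$Words / reported$Lines, 2)

# Each source's share of all words, in percent.
reported$ShareOfWords <- round(100 * reported$Words / sum(reported$Words), 1)

print(reported)
```

The WordsPerLine column should agree with the per-line means reported in the next section, which is a useful consistency check between the two tables.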


4. Word Count Distribution

Let’s examine how many words appear in each line across the three sources.

blog_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)

# Summary statistics
summary_stats <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Mean = c(mean(blog_words, na.rm = TRUE),
           mean(news_words, na.rm = TRUE),
           mean(twitter_words, na.rm = TRUE)),
  Median = c(median(blog_words, na.rm = TRUE),
             median(news_words, na.rm = TRUE),
             median(twitter_words, na.rm = TRUE)),
  Max = c(max(blog_words, na.rm = TRUE),
          max(news_words, na.rm = TRUE),
          max(twitter_words, na.rm = TRUE))
)

knitr::kable(summary_stats, digits = 2,
             caption = "Word Count Statistics per Line")
Word Count Statistics per Line

Source    Mean   Median   Max
Blogs    41.75       28  6726
News     34.41       32  1796
Twitter  12.75       12    47
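
When the mean sits well above the median, as it does for Blogs (41.75 versus 28), a small number of very long lines is usually pulling the mean up. Quantiles make this visible. The sketch below uses a small made-up vector, not values from the corpus, to show the effect.

```r
# Toy word counts (illustrative values, not taken from the corpus):
# mostly short lines plus one very long one.
toy_counts <- c(5, 8, 10, 12, 12, 15, 20, 25, 30, 400)

toy_mean   <- mean(toy_counts)    # pulled upward by the single long line
toy_median <- median(toy_counts)  # barely affected by it

toy_mean
toy_median

# Upper quantiles show where the long tail begins.
quantile(toy_counts, probs = c(0.50, 0.90, 0.99))
```

The same quantile() call could be applied to blog_words, news_words, and twitter_words to see how far each tail extends.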

4.1 Distribution Plots

par(mfrow = c(1, 3))

hist(blog_words[blog_words < 200], 
     breaks = 50,
     main = "Blogs",
     xlab = "Words per Line",
     col = "lightblue",
     border = "white")

hist(news_words[news_words < 200], 
     breaks = 50,
     main = "News",
     xlab = "Words per Line",
     col = "lightgreen",
     border = "white")

hist(twitter_words[twitter_words < 200], 
     breaks = 50,
     main = "Twitter",
     xlab = "Words per Line",
     col = "lightcoral",
     border = "white")

par(mfrow = c(1, 1))

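Note: The histograms show only lines with fewer than 200 words so that the bulk of each distribution remains visible. This cutoff has no effect on Twitter, whose longest line has 47 words.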
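
Because the plots drop long lines, it can help to know how much data the cutoff hides. A minimal sketch follows; the helper name pct_excluded and the toy values are assumptions made for illustration.

```r
# Hypothetical helper: percentage of lines at or above a word-count cutoff,
# i.e. the lines that the histogram filter x[x < cutoff] leaves out.
pct_excluded <- function(word_counts, cutoff = 200) {
  100 * mean(word_counts >= cutoff)
}

# Toy word counts (illustrative values only).
toy_counts <- c(10, 25, 40, 150, 199, 200, 350, 6726)
pct_excluded(toy_counts)
```

With the real vectors, pct_excluded(blog_words) and pct_excluded(news_words) would report the share of lines missing from each histogram.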


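5. Key Findings

  1. Together, the three sources contain about 4.3 million lines and about 102 million words.
  2. Twitter has the most lines but the shortest ones (mean 12.75 words, maximum 47).
  3. Blog lines vary the most: the mean (41.75) is well above the median (28), and the longest line has 6,726 words.
  4. News lines are more uniform in length (mean 34.41, median 32).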


6. Next Steps

Future analysis will include:

  1. Sampling the data for computational efficiency
  2. Text cleaning and preprocessing
  3. Building n-gram models (unigrams, bigrams, trigrams)
  4. Analyzing word frequencies
  5. Creating a prediction algorithm
  6. Developing a Shiny application for text prediction
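
The first few steps above can be sketched on a toy corpus using base R only. Everything here is an illustrative assumption rather than the planned implementation: the toy sentences, the 75% sampling fraction, the seed, the cleaning rule, and the helper name make_ngrams.

```r
set.seed(2024)  # arbitrary seed so the sample is reproducible

# Toy corpus standing in for the real blogs/news/twitter vectors.
toy_lines <- c(
  "The cat sat on the mat.",
  "The dog sat on the log!",
  "A cat and a dog",
  "The cat saw the dog"
)

# Step 1: random sample of lines (the real fraction is a design choice).
sampled <- sample(toy_lines, size = ceiling(0.75 * length(toy_lines)))

# Step 2: minimal cleaning: lower-case, keep only letters and spaces.
clean <- gsub("[^a-z ]", "", tolower(sampled))

# Split each cleaned line into word tokens.
tokens <- strsplit(trimws(clean), " +")

# Hypothetical helper: build the n-grams found within one line's tokens.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Steps 3 and 4: collect bigrams and count them, most frequent first.
bigrams <- unlist(lapply(tokens, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)
```

Building n-grams line by line avoids creating pairs that span two unrelated lines. Changing n to 1 or 3 in the make_ngrams call gives unigrams or trigrams in the same way.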