Exploratory Analysis of Text Data

1. Introduction

The purpose of this report is to present an initial exploration of the text data that will later be used to build a prediction algorithm and Shiny application.

At this stage, the goal is not to build a model, but to understand the size, structure, and basic characteristics of the data.

This report is written for a non-technical audience and highlights only the most important findings.

2. Data Loading

# stringi provides stri_count_words(), which is used in the sections below
library(stringi)

# Read each English-language source; assumes the files are in the "final/en_US/" folder
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news  <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

cat("Data loaded successfully!\n")

## Data loaded successfully!
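
If a file contains unusual control characters, a text-mode read can stop early on some platforms. The sketch below shows one possible safeguard, assuming that opening the connection in binary mode is acceptable for these files. The helper name read_text_lines is hypothetical and is not part of the report's code; the temporary file stands in for the real data.

```r
# Hypothetical helper: read a text file through a binary-mode connection.
# readLines() still splits on LF, CRLF or CR in this mode.
read_text_lines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demonstration on a temporary file (not the real data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_lines(tmp)
length(demo_lines)
```

Comparing length() of the result with the line counts reported in the next section is a simple way to confirm that a file was read in full.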

3. Data Summary

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

knitr::kable(data_summary, format.args = list(big.mark = ","),
             caption = "Summary of Text Data Sources")

Summary of Text Data Sources
Source	Lines	Words
Blogs	899,288	37,546,250
News	1,010,242	34,762,395
Twitter	2,360,148	30,093,413

The table above shows:

Blogs contain 899,288 lines and about 37.5 million words
News contains 1,010,242 lines and about 34.8 million words
Twitter contains 2,360,148 lines and about 30.1 million words

Twitter has the most lines but the fewest words, while blogs have the fewest lines but the most words.
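
As a quick consistency check, dividing each source's total words by its number of lines gives the average words per line. The sketch below uses only the totals reported in the table; the data frame name totals and the column WordsPerLine are illustrative.

```r
# Totals copied from the summary table above
totals <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 1010242, 2360148),
  Words  = c(37546250, 34762395, 30093413)
)

# Average words per line = total words / total lines
totals$WordsPerLine <- round(totals$Words / totals$Lines, 2)
totals
```

The resulting averages (41.75, 34.41 and 12.75) agree with the means reported in Section 4.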

4. Word Count Distribution

Let’s examine how many words appear in each line across the three sources.

blog_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)

# Summary statistics
summary_stats <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Mean = c(mean(blog_words, na.rm = TRUE),
           mean(news_words, na.rm = TRUE),
           mean(twitter_words, na.rm = TRUE)),
  Median = c(median(blog_words, na.rm = TRUE),
             median(news_words, na.rm = TRUE),
             median(twitter_words, na.rm = TRUE)),
  Max = c(max(blog_words, na.rm = TRUE),
          max(news_words, na.rm = TRUE),
          max(twitter_words, na.rm = TRUE))
)

knitr::kable(summary_stats, digits = 2,
             caption = "Word Count Statistics per Line")

Word Count Statistics per Line
Source	Mean	Median	Max
Blogs	41.75	28	6726
News	34.41	32	1796
Twitter	12.75	12	47
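
The gap between mean and median for blogs suggests a long right tail. One way to look at the tail more closely is with upper quantiles, sketched below. The function name word_count_quantiles and the chosen probabilities are assumptions for illustration, and the toy vector stands in for blog_words, news_words or twitter_words.

```r
# Hypothetical helper: median plus upper quantiles of words per line
word_count_quantiles <- function(counts, probs = c(0.5, 0.9, 0.99)) {
  quantile(counts, probs = probs, na.rm = TRUE)
}

# Toy data in place of the real per-line word counts
toy_counts <- 0:100
word_count_quantiles(toy_counts)
```

Applied to the real vectors, a 99% quantile far below the maximum would indicate that only a handful of lines are extremely long.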

4.1 Distribution Plots

par(mfrow = c(1, 3))

hist(blog_words[blog_words < 200], 
     breaks = 50,
     main = "Blogs",
     xlab = "Words per Line",
     col = "lightblue",
     border = "white")

hist(news_words[news_words < 200], 
     breaks = 50,
     main = "News",
     xlab = "Words per Line",
     col = "lightgreen",
     border = "white")

hist(twitter_words[twitter_words < 200], 
     breaks = 50,
     main = "Twitter",
     xlab = "Words per Line",
     col = "lightcoral",
     border = "white")

par(mfrow = c(1, 1))

Note: The histograms include only lines with fewer than 200 words, so that a small number of very long lines does not compress the plots.
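
To judge how much data the 200-word cutoff hides, the share of lines at or above the cutoff can be computed directly. This is a minimal sketch; the function name share_at_or_above is hypothetical, and the toy vector stands in for the real word counts.

```r
# Hypothetical helper: fraction of lines with at least `cutoff` words
share_at_or_above <- function(counts, cutoff = 200) {
  mean(counts >= cutoff, na.rm = TRUE)
}

# Toy data: two of the four lines reach the cutoff
toy_counts <- c(10, 250, 199, 200)
share_at_or_above(toy_counts)
```

For Twitter the share is necessarily zero, since the longest line has only 47 words.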

5. Key Findings

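Twitter has the shortest lines, with a mean of 12.8 words and a maximum of 47 words per line
Blogs have the longest lines on average (mean 41.8 words) and are strongly right-skewed: the median is only 28 words, while the longest line has 6,726 words
News articles have a mean of 34.4 words per line, close to their median of 32 words
The sources differ in both typical line length and spread, reflecting their different communication styles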

6. Next Steps

Future analysis will include:

Sampling the data for computational efficiency
Text cleaning and preprocessing
Building n-gram models (unigrams, bigrams, trigrams)
Analyzing word frequencies
Creating a prediction algorithm
Developing a Shiny application for text prediction
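
The first steps in this list could look roughly like the sketch below, which draws a random sample of lines and counts bigrams using base R only. It is an illustration under stated assumptions, not the planned implementation: the function names sample_lines and count_bigrams, the seed, the sampling fraction and the simple cleaning rule (lowercase, keep letters and apostrophes) are all hypothetical choices.

```r
set.seed(1234)  # hypothetical seed, for reproducible sampling

# Hypothetical helper: keep a random fraction of the lines
sample_lines <- function(lines, fraction = 0.01) {
  n <- max(1L, floor(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Hypothetical helper: lowercase, strip non-letters, count adjacent word pairs
count_bigrams <- function(lines) {
  cleaned <- tolower(gsub("[^[:alpha:][:space:]']", " ", lines))
  tokens  <- strsplit(trimws(cleaned), "\\s+")
  bigrams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
  sort(table(bigrams), decreasing = TRUE)
}

# Toy data in place of the real corpus
toy <- c("The cat sat.", "The cat ran!", "Hi")
bigram_counts <- count_bigrams(toy)
bigram_counts
```

On the toy data, "the cat" is the most frequent bigram. On the real corpus, sample_lines(blogs, 0.01) would reduce the input before counting.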