This report presents a basic exploratory data analysis of three English text datasets: blogs, news, and tweets. The goal is to understand their general structure before building a predictive model.
We use a small sample (the first 10,000 lines of each file) to keep processing manageable.
library(stringi)  # stri_count_words()
library(ggplot2)  # plotting

sample_size <- 10000

# Adjust the path as needed
blogs   <- readLines("dados/final/en_US/en_US.blogs.txt", n = sample_size, warn = FALSE)
news    <- readLines("dados/final/en_US/en_US.news.txt", n = sample_size, warn = FALSE)
twitter <- readLines("dados/final/en_US/en_US.twitter.txt", n = sample_size, warn = FALSE)
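Note that readLines(n = sample_size) keeps the first 10,000 lines of each file, not a random subset. A minimal sketch of random sampling instead, assuming the full files fit in memory (the sample_lines() helper and the seed are illustrative, not part of the original analysis):

set.seed(123)
sample_lines <- function(path, n) {
  all_lines <- readLines(path, warn = FALSE)  # read the whole file
  sample(all_lines, min(n, length(all_lines)))
}
# e.g. blogs <- sample_lines("dados/final/en_US/en_US.blogs.txt", sample_size)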
We compute basic statistics for each sample: line count, word count, in-memory size in megabytes, and average words per line.
generate_summary <- function(text_data, name) {
  word_counts <- stri_count_words(text_data)
  # object.size() returns an "object_size" object; convert to numeric
  # before scaling, otherwise the result keeps a misleading "bytes" label
  size_mb <- as.numeric(object.size(text_data)) / (1024^2)
  data.frame(
    Source = name,
    Lines = length(text_data),
    Words = sum(word_counts),
    Size_MB = round(size_mb, 2),
    Avg_Words_Per_Line = round(mean(word_counts), 2)
  )
}
# Avoid naming the result "summary", which would mask base::summary()
corpus_summary <- rbind(
  generate_summary(blogs, "Blogs"),
  generate_summary(news, "News"),
  generate_summary(twitter, "Twitter")
)
corpus_summary
##    Source Lines  Words Size_MB Avg_Words_Per_Line
## 1   Blogs 10000 412805     2.8              41.28
## 2    News 10000 348070     2.6              34.81
## 3 Twitter 10000 126511     1.4              12.65
twitter_word_counts <- stri_count_words(twitter)

# qplot() was deprecated in ggplot2 3.4.0, so we call ggplot() directly
ggplot(data.frame(words = twitter_word_counts), aes(x = words)) +
  geom_histogram(bins = 30) +
  labs(title = "Words per Line Distribution (Twitter)",
       x = "Words per Line", y = "Frequency")
The next step is to build a word prediction model using
n-grams (sequences of 1, 2, or 3 words).
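To make that concrete, here is a minimal n-gram counting sketch in base R and stringi; the tokenize() helper, its cleaning rules, and the use of plain table() objects are assumptions for illustration, not the final model.

tokenize <- function(lines) {
  # Lowercase, keep letters and apostrophes, split on whitespace
  cleaned <- stri_replace_all_regex(stri_trans_tolower(lines), "[^a-z' ]", " ")
  unlist(stri_split_regex(cleaned, "\\s+", omit_empty = TRUE))
}

count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)  # most frequent first
}

# Note: n-grams here cross line boundaries, which a real model would avoid
tokens   <- tokenize(twitter)
unigrams <- count_ngrams(tokens, 1)
bigrams  <- count_ngrams(tokens, 2)
trigrams <- count_ngrams(tokens, 3)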
We will apply a backoff strategy to predict the next
word when a direct match is not found.
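One simple form of that backoff, sketched below: look up the last two words in the trigram table, fall back to the bigram table on the last word, and finally return the most frequent unigram. It reuses tokenize() and the tables from the sketch above; the exact strategy (e.g. a weighted stupid backoff versus this first-hit version) is still an open design choice.

predict_next <- function(input, trigrams, bigrams, unigrams) {
  words <- tokenize(input)
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    if (length(hits) > 0)  # tables are sorted, so the first hit is the most frequent
      return(stri_extract_last_regex(names(hits)[1], "\\S+"))
  }
  if (n >= 1) {
    hits <- bigrams[startsWith(names(bigrams), paste0(words[n], " "))]
    if (length(hits) > 0)
      return(stri_extract_last_regex(names(hits)[1], "\\S+"))
  }
  names(unigrams)[1]  # no context matched: most frequent word overall
}

predict_next("thanks for the", trigrams, bigrams, unigrams)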
This model will be deployed in a simple Shiny app where
users can enter text and receive word suggestions.
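A minimal sketch of such an app follows; the layout, input IDs, and the call to the predict_next() sketch above are placeholders rather than the final design.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$user_text)
    predict_next(input$user_text, trigrams, bigrams, unigrams)
  })
}

shinyApp(ui, server)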