1. Introduction

This report presents a basic exploratory data analysis of three English text datasets: blogs, news, and tweets. The goal is to understand their general structure before building a predictive model.

2. Loading the Data

We work with a small sample (the first 10,000 lines of each file) to keep processing manageable.

library(stringi)  # word counting with stri_count_words()
library(ggplot2)  # plotting

sample_size <- 10000

# Adjust the path as needed
blogs <- readLines("dados/final/en_US/en_US.blogs.txt", n = sample_size, warn = FALSE)
news <- readLines("dados/final/en_US/en_US.news.txt", n = sample_size, warn = FALSE)
twitter <- readLines("dados/final/en_US/en_US.twitter.txt", n = sample_size, warn = FALSE)
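Note that readLines(..., n = sample_size) keeps only the first lines of each file, which may not be representative. A minimal sketch of random sampling instead (this assumes the full files fit in memory; sample_lines is a hypothetical helper, not part of the pipeline above):

set.seed(123)  # make the random sample reproducible

sample_lines <- function(path, size) {
  all_lines <- readLines(path, warn = FALSE)        # read the whole file
  sample(all_lines, min(size, length(all_lines)))   # draw a random subset
}

# e.g. blogs <- sample_lines("dados/final/en_US/en_US.blogs.txt", sample_size)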

3. Summary Statistics

We compute basic statistics: number of lines, words, memory size, and average words per line.

generate_summary <- function(text_data, name) {
  word_counts <- stri_count_words(text_data)  # words in each line

  data.frame(
    Source = name,
    Lines = length(text_data),
    Words = sum(word_counts),
    # as.numeric() strips the object_size class so the value prints as a
    # plain number of megabytes rather than "x bytes"
    Size_MB = round(as.numeric(object.size(text_data)) / (1024^2), 2),
    Avg_Words_Per_Line = round(mean(word_counts), 2)
  )
}

data_summary <- rbind(
  generate_summary(blogs, "Blogs"),
  generate_summary(news, "News"),
  generate_summary(twitter, "Twitter")
)

data_summary
##    Source Lines  Words Size_MB Avg_Words_Per_Line
## 1   Blogs 10000 412805     2.8              41.28
## 2    News 10000 348070     2.6              34.81
## 3 Twitter 10000 126511     1.4              12.65

4. Histogram: Words per Line in Twitter

twitter_word_counts <- stri_count_words(twitter)

ggplot(data.frame(words = twitter_word_counts), aes(x = words)) +
  geom_histogram(bins = 30) +
  labs(title = "Words per Line Distribution (Twitter)",
       x = "Words per Line", y = "Frequency")

5. Observations

The three samples contain the same number of lines but differ markedly in length: blog lines average about 41 words, news about 35, and tweets only about 13, consistent with Twitter's character limit. The blog sample is also the largest in memory (2.8 MB versus 1.4 MB for the tweets). The histogram shows that the Twitter word-count distribution is concentrated at low values.

6. Plan for Prediction Model

The next step is to build a word prediction model using n-grams (sequences of 1, 2, or 3 words).
We will apply a backoff strategy to predict the next word when a direct match is not found.
This model will be deployed in a simple Shiny app where users can enter text and receive word suggestions.
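To make the backoff idea concrete, below is a minimal sketch of a trigram-to-unigram backoff predictor built with stringi (illustrative only: build_ngrams and predict_next are hypothetical helper names, and the final model will need proper text cleaning, pruning, and probability smoothing):

# Build an n-gram frequency table from a character vector of lines.
build_ngrams <- function(text_data, n) {
  words <- unlist(stri_extract_all_words(stri_trans_tolower(text_data)))
  words <- words[!is.na(words)]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}

unigrams <- build_ngrams(twitter, 1)
bigrams  <- build_ngrams(twitter, 2)
trigrams <- build_ngrams(twitter, 3)

# Predict the next word: try trigrams first, back off to bigrams,
# and finally fall back to the most frequent unigram.
predict_next <- function(phrase) {
  words <- unlist(stri_extract_all_words(stri_trans_tolower(phrase)))
  for (n in c(3, 2)) {
    tbl <- if (n == 3) trigrams else bigrams
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
      if (length(hits) > 0) {
        best <- names(hits)[which.max(hits)]
        return(tail(strsplit(best, " ")[[1]], 1))
      }
    }
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next("thanks for the")

A production version would store the n-gram tables in an indexed structure keyed by prefix (for example, a data.table) rather than scanning table names with startsWith, which is linear in the table size.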