Project Overview

This project preprocesses and cleans the dataset and computes descriptive statistics ahead of modeling. It also includes a basic exploratory analysis with plots.

Import data

# Import data ####
base_path <- "./data/final/en_US"
lst_name <- c("en_US.blogs", "en_US.news", "en_US.twitter")
lst_full_name <- file.path(base_path, paste0(lst_name, ".txt"))


# Read each file into a character vector and name the list entries by dataset
data_list <- setNames(
  lapply(lst_full_name, function(file) {
    content <- readLines(file, encoding = "UTF-8", warn = FALSE)
    cat(sprintf("File %s loaded successfully.\n", file))
    content
  }),
  lst_name
)

## File ./data/final/en_US/en_US.blogs.txt loaded successfully.
## File ./data/final/en_US/en_US.news.txt loaded successfully.
## File ./data/final/en_US/en_US.twitter.txt loaded successfully.
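
The line count reported for the news file in the next section is far smaller than the counts for the other two files. One possible cause, which is an assumption here and not something this report confirms, is that a text-mode connection on some platforms (notably Windows) stops reading at an embedded control character. A minimal sketch of reading through a binary-mode connection instead, using a hypothetical helper named `read_lines_binary` and a small temporary file:

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary-mode connection so an embedded control character
# does not end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
}

# Self-contained demonstration: a temporary file whose second line
# contains the control character 0x1A
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line\n"), tmp)
demo_lines <- read_lines_binary(tmp)
print(length(demo_lines))
unlink(tmp)
```

If the counts differ between this helper and the plain `readLines(file, ...)` call on the same file, the text-mode read was probably truncated.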

Basic data summaries

We will extract key statistics from each file, including:

- Total number of lines in the dataset.
- Total number of words across all lines.
- Distribution of line lengths, which will help us understand the typical size of text entries in each dataset.

This information will give a high-level overview of the text content and its complexity.

# Basic summaries for each file in lst_name
for (name in lst_name) {
  cat("\nProcessing file:", name, "\n")
  
  # Get the file's content from data_list
  lines <- data_list[[name]]
  
  # Count number of lines
  line_count <- length(lines)
  
  # Count total words: split each line on whitespace and sum the token counts
  word_count <- sum(lengths(strsplit(lines, "\\s+")))
  
  # Print the counts
  cat("Line count:", line_count, "\n")
  cat("Word count:", word_count, "\n")
  
  # Create a basic frequency table of line lengths (number of characters per line)
  line_lengths <- nchar(lines)
  line_length_table <- table(line_lengths)
  cat("Frequency table of line lengths (first 5 values):\n")
  print(head(line_length_table, 5))
}

## 
## Processing file: en_US.blogs 
## Line count: 899288 
## Word count: 37334131 
## Frequency table of line lengths (first 5 values):
## line_lengths
##    1    2    3    4    5 
##   23  210  515 1064 1645 
## 
## Processing file: en_US.news 
## Line count: 77259 
## Word count: 2643969 
## Frequency table of line lengths (first 5 values):
## line_lengths
##  2  3  4  5  6 
##  6  8  8 30 55 
## 
## Processing file: en_US.twitter 
## Line count: 2360148 
## Word count: 30373543 
## Frequency table of line lengths (first 5 values):
## line_lengths
##    2    3    4    5    6 
##    2   98  189  847 1795
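
Because `head()` on the frequency table shows only the five shortest line lengths, a quantile summary may describe the distribution more directly. The sketch below is illustrative only: `summarize_line_lengths` is a hypothetical helper, and the made-up vector `sample_lines` stands in for `data_list[[name]]`.

```r
# Hypothetical helper: summarize characters per line with a few quantiles
summarize_line_lengths <- function(lines) {
  line_lengths <- nchar(lines)
  c(
    min    = min(line_lengths),
    median = median(line_lengths),
    mean   = mean(line_lengths),
    q95    = unname(quantile(line_lengths, 0.95)),
    max    = max(line_lengths)
  )
}

# Made-up example data standing in for data_list[[name]]
sample_lines <- c("hi", "a short line", "a somewhat longer line of text")
line_summary <- summarize_line_lengths(sample_lines)
print(line_summary)
```

Calling the helper inside the loop above, as `summarize_line_lengths(lines)`, would give one summary per file.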

Visualizing data characteristics

To see how line length varies within each dataset, we plot a histogram of the number of words per line for each file. Comparing the histograms across blogs, news, and Twitter posts highlights differences in writing style and content structure.

These comparisons will inform later decisions about how to process and analyze the text.

# Plotting histograms for each file in lst_name
for (name in lst_name) {
  cat("\nGenerating histogram for file:", name, "\n")
  
  # Get the file's content
  lines <- data_list[[name]]
  
  # Calculate the number of words per line (tokens after splitting on whitespace)
  words_per_line <- lengths(strsplit(lines, "\\s+"))
  
  # Create a histogram
  hist(words_per_line,
       main = paste("Histogram of Words per Line in", name),
       xlab = "Words per Line",
       col = "skyblue",
       border = "white")
  
  # Pause to allow the plot window to update (optional)
  # readline(prompt="Press [enter] to continue")
}

## 
## Generating histogram for file: en_US.blogs

## 
## Generating histogram for file: en_US.news

## 
## Generating histogram for file: en_US.twitter
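
Words-per-line counts in text data are often right-skewed, so a log-scaled histogram with one panel per source may make the three datasets easier to compare. A minimal sketch, assuming the made-up list `toy_data` in place of `data_list` and a null graphics device so that it runs without a plot window:

```r
# Made-up data standing in for data_list (assumption for illustration only)
toy_data <- list(
  blogs   = c("one two three four five six", "a b c", "just some words here"),
  news    = c("short item", "a slightly longer news line here"),
  twitter = c("hi", "ok then", "three word tweet")
)

# Words per line for each source, using the same whitespace split as the report
words_per_line <- lapply(toy_data, function(lines) lengths(strsplit(lines, "\\s+")))

pdf(NULL)  # null device; remove this line to draw on the active device
old_par <- par(mfrow = c(1, length(toy_data)))  # one panel per source
for (name in names(words_per_line)) {
  hist(log10(words_per_line[[name]]),
       main = name,
       xlab = "log10(words per line)",
       col = "skyblue",
       border = "white")
}
par(old_par)          # restore the previous layout
invisible(dev.off())  # close the null device
```

The log transform is a presentation choice, not part of the original analysis; the untransformed histograms above remain the reference.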

Text data preprocessing report

Osiris - E. Rosas

2025-03-17
