This report presents an exploratory analysis of text data from three sources (blogs, news articles, and Twitter) to support the development of a text prediction application. The goal is to build a Shiny app that predicts the next word a user is likely to type, similar to smartphone keyboard suggestions. This analysis examines the basic characteristics of our training data and outlines the approach for building the prediction algorithm.
The training data consists of three English-language text files, one per source:

- Blogs: en_US.blogs.txt
- News: en_US.news.txt
- Twitter: en_US.twitter.txt

These files provide a diverse corpus of natural language that will be used to train our prediction model.
# Load required libraries
library(tidyverse)
library(stringr)
library(knitr)
library(kableExtra)
# Define data file paths (adjust path as needed)
data_dir <- "final/en_US"
if (!dir.exists(data_dir)) {
  # Try alternative common locations
  data_dir <- "../final/en_US"
  if (!dir.exists(data_dir)) {
    data_dir <- "."
  }
}
# File paths
blog_file <- file.path(data_dir, "en_US.blogs.txt")
news_file <- file.path(data_dir, "en_US.news.txt")
twitter_file <- file.path(data_dir, "en_US.twitter.txt")
# Function to read file and return basic info
read_file_info <- function(file_path, source_name) {
  # Return a placeholder row when the file is missing
  if (!file.exists(file_path)) {
    return(data.frame(
      Source = source_name,
      File_Exists = "No",
      Lines = NA,
      Characters = NA,
      Words = NA,
      Avg_Chars_Per_Line = NA,
      Avg_Words_Per_Line = NA
    ))
  }
  # Read file
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)
  # Calculate statistics (a word is any run of non-whitespace characters)
  num_lines <- length(lines)
  num_chars <- sum(nchar(lines))
  num_words <- sum(str_count(lines, "\\S+"))
  avg_chars_per_line <- round(num_chars / num_lines, 2)
  avg_words_per_line <- round(num_words / num_lines, 2)
  return(data.frame(
    Source = source_name,
    File_Exists = "Yes",
    Lines = num_lines,
    Characters = num_chars,
    Words = num_words,
    Avg_Chars_Per_Line = avg_chars_per_line,
    Avg_Words_Per_Line = avg_words_per_line
  ))
}
# Read all files
blog_info <- read_file_info(blog_file, "Blogs")
news_info <- read_file_info(news_file, "News")
twitter_info <- read_file_info(twitter_file, "Twitter")
# Combine into summary table
file_summary <- rbind(blog_info, news_info, twitter_info)

The following table provides a summary of the three data files:

| Source | File Exists | Lines | Characters | Words | Avg Chars/Line | Avg Words/Line |
|---|---|---|---|---|---|---|
| Blogs | Yes | 899288 | 206824505 | 37334131 | 229.99 | 41.52 |
| News | Yes | 1010242 | 203223159 | 34372530 | 201.16 | 34.02 |
| Twitter | Yes | 2360148 | 162096031 | 30373543 | 68.68 | 12.87 |

Key Observations:

- All three files loaded successfully.
- The files contain substantial amounts of text data suitable for training.
- Blogs have the longest lines on average (about 230 characters and 42 words per line).
- News lines fall in between (about 201 characters and 34 words per line).
- Twitter lines are the shortest (about 69 characters and 13 words per line), as expected given the character limit.

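
The summary table above could be rendered from a data frame like file_summary with knitr::kable. The sketch below is self-contained, so it rebuilds a stand-in data frame from the values reported in the table rather than reading the files; the column labels and the thousands separator are presentation choices assumed here, not taken from the original report.

```r
library(knitr)

# Stand-in for file_summary, using the counts reported in the table above
file_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Characters = c(206824505, 203223159, 162096031),
  Words = c(37334131, 34372530, 30373543)
)

# Per-line averages, rounded the same way as in read_file_info()
file_summary$Avg_Chars_Per_Line <- round(file_summary$Characters / file_summary$Lines, 2)
file_summary$Avg_Words_Per_Line <- round(file_summary$Words / file_summary$Lines, 2)

# Render as a table; big.mark adds thousands separators (an assumed styling choice)
summary_table <- kable(
  file_summary,
  col.names = c("Source", "Lines", "Characters", "Words",
                "Avg Chars/Line", "Avg Words/Line"),
  format.args = list(big.mark = ",")
)
summary_table
```
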
To understand the structure of our data, we examine the distribution of line lengths (in characters) for each source:
# Function to analyze line lengths
analyze_line_lengths <- function(file_path, source_name, sample_size = 10000) {
  # Return an empty data frame when the file is missing
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), Length = numeric()))
  }
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  line_lengths <- nchar(lines)
  return(data.frame(
    Source = rep(source_name, length(line_lengths)),
    Length = line_lengths
  ))
}
# Analyze all files (using samples for large files)
# The helper already returns an empty data frame when a file is missing,
# so no extra file.exists() check is needed here
blog_lengths <- analyze_line_lengths(blog_file, "Blogs", 10000)
news_lengths <- analyze_line_lengths(news_file, "News", 10000)
twitter_lengths <- analyze_line_lengths(twitter_file, "Twitter", 10000)
# Combine data
all_lengths <- rbind(blog_lengths, news_lengths, twitter_lengths)

Observations:

- Blogs: Show a wide range of line lengths, with many longer lines (typical of blog posts)
- News: Moderate line lengths, relatively consistent distribution
- Twitter: Concentrated at shorter lengths due to character limits, with a long tail

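
A minimal sketch of how such a line-length distribution could be plotted with ggplot2, assuming a data frame shaped like all_lengths (columns Source and Length). The toy_lengths values are simulated placeholders so the block runs on its own; they are not corpus data, and the bin count and faceting layout are illustrative choices.

```r
library(ggplot2)

# Simulated stand-in for all_lengths; the numbers are made up for illustration
set.seed(42)
toy_lengths <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 200),
  Length = c(rpois(200, 230), rpois(200, 200), pmin(rpois(200, 70), 140))
)

# One histogram panel per source, each with its own y-axis scale
length_plot <- ggplot(toy_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(bins = 50, show.legend = FALSE) +
  facet_wrap(~ Source, ncol = 1, scales = "free_y") +
  labs(
    title = "Distribution of line lengths by source",
    x = "Characters per line",
    y = "Number of lines"
  )
length_plot
```

With the real data, toy_lengths would be replaced by all_lengths; a log-scaled x-axis may be worth trying if a few very long blog lines dominate the plot.
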
# Function to analyze word counts per line
analyze_word_counts <- function(file_path, source_name, sample_size = 10000) {
  # Return an empty data frame when the file is missing
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), WordCount = numeric()))
  }
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  # A word is any run of non-whitespace characters
  word_counts <- str_count(lines, "\\S+")
  return(data.frame(
    Source = rep(source_name, length(word_counts)),
    WordCount = word_counts
  ))
}
# Analyze all files
# The helper already returns an empty data frame when a file is missing,
# so no extra file.exists() check is needed here
blog_words <- analyze_word_counts(blog_file, "Blogs", 10000)
news_words <- analyze_word_counts(news_file, "News", 10000)
twitter_words <- analyze_word_counts(twitter_file, "Twitter", 10000)
# Combine data
all_words <- rbind(blog_words, news_words, twitter_words)

Summary statistics from the 10,000-line samples of each source:

| Source | Mean Chars | Median Chars | Max Chars | Min Chars | Mean Words | Median Words | Max Words |
|---|---|---|---|---|---|---|---|
| Blogs | 227.66 | 152 | 2690 | 1 | 41.61 | 28 | 602 |
| News | 201.61 | 184 | 1505 | 1 | 34.14 | 31 | 289 |
| Twitter | 68.86 | 65 | 140 | 5 | 12.82 | 12 | 36 |
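
Per-source statistics like those in the table above could be derived with dplyr from data frames shaped like all_lengths and all_words. The sketch below uses two tiny hand-made data frames (toy_lengths and toy_words, both hypothetical) so it runs without the corpus; the output column names are also assumptions.

```r
library(dplyr)

# Hypothetical stand-ins for all_lengths and all_words
toy_lengths <- data.frame(
  Source = c("Blogs", "Blogs", "Blogs", "Twitter", "Twitter", "Twitter"),
  Length = c(120, 300, 45, 60, 140, 25)
)
toy_words <- data.frame(
  Source = c("Blogs", "Blogs", "Blogs", "Twitter", "Twitter", "Twitter"),
  WordCount = c(20, 55, 9, 11, 24, 4)
)

# Character statistics per source
length_stats <- toy_lengths %>%
  group_by(Source) %>%
  summarise(
    Mean_Chars = round(mean(Length), 2),
    Median_Chars = median(Length),
    Max_Chars = max(Length),
    Min_Chars = min(Length),
    .groups = "drop"
  )

# Word statistics per source
word_stats <- toy_words %>%
  group_by(Source) %>%
  summarise(
    Mean_Words = round(mean(WordCount), 2),
    Median_Words = median(WordCount),
    Max_Words = max(WordCount),
    .groups = "drop"
  )

# One row per source, character and word statistics side by side
sample_stats <- inner_join(length_stats, word_stats, by = "Source")
sample_stats
```

Because the report draws the character sample and the word sample separately, the two halves of such a table describe different random subsets of lines unless a seed is set and a single sample is reused.
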
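Data Volume: The corpus contains roughly 102 million words across the three sources, providing a rich foundation for training a prediction model.
Source Diversity: Each source has distinct characteristics: blogs have the longest and most variable lines, news lines are moderate and fairly consistent in length, and tweets are short because of the character limit.
Line Length Variation: Sampled lines range from a single character to 2,690 characters, so the prediction algorithm will need to handle both very short and very long stretches of text.
Data Quality: The files loaded without errors and the summary statistics show no obvious anomalies, which should simplify preprocessing. The text itself has not yet been inspected for noise, so this assessment is provisional.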
The text prediction algorithm will use an n-gram model approach:
N-gram Extraction: Break down the text into sequences of 1-4 words (unigrams, bigrams, trigrams, and 4-grams)
Frequency Analysis: Count how often each n-gram appears in the training data
Prediction Strategy:
Smoothing: Use techniques like backoff or interpolation to handle cases where specific n-grams haven’t been seen before
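
The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the final algorithm: the tokenizer is deliberately simplistic, the helper names (tokenize, make_ngrams, count_ngrams, predict_next) are hypothetical, the four-line corpus is made up, and the backoff is the plainest variant (fall back to a shorter context when the longer one is unseen, with no discounting or interpolation weights).

```r
# Split text into lowercase word tokens (simplistic: letters and apostrophes only)
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Build all n-grams of size n from a token vector, as space-separated strings
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character())
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams of orders 1..max_n; returns a list of named count vectors,
# each sorted from most to least frequent
count_ngrams <- function(lines, max_n = 4) {
  lapply(seq_len(max_n), function(n) {
    grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n)))
    tab <- table(grams)
    counts <- setNames(as.integer(tab), names(tab))
    sort(counts, decreasing = TRUE)
  })
}

# Predict the next word: try the longest usable context first, then back off
predict_next <- function(text, counts, top_k = 3) {
  tokens <- tokenize(text)
  max_order <- min(length(counts), length(tokens) + 1)
  for (n in rev(seq_len(max_order))) {
    if (n == 1) break
    prefix <- paste0(paste(tail(tokens, n - 1), collapse = " "), " ")
    grams <- counts[[n]]
    hits <- grams[startsWith(names(grams), prefix)]
    if (length(hits) > 0) {
      # Keep only the final word of each matching n-gram
      return(sub(".* ", "", names(hits))[seq_len(min(top_k, length(hits)))])
    }
  }
  # No context matched: fall back to the most frequent unigrams
  unigrams <- counts[[1]]
  names(unigrams)[seq_len(min(top_k, length(unigrams)))]
}

# Made-up toy corpus for illustration
toy_corpus <- c(
  "the cat sat on the mat",
  "the cat sat on the sofa",
  "the cat ate the fish",
  "a dog sat on the mat"
)
counts <- count_ngrams(toy_corpus)

predict_next("sat on the", counts)   # uses the 4-gram counts
predict_next("purple cat", counts)   # trigram context unseen, backs off to bigrams
```

Scanning every n-gram with startsWith is fine for a toy corpus but would be too slow for the full data; a lookup table keyed by context would be the natural replacement.
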
The Shiny application will provide:
Text Input Box: Users can type text and see predictions in real-time
Prediction Display: Show the top 3-5 most likely next words as buttons or suggestions
Visualization: Optional charts showing prediction confidence or n-gram frequencies
User-Friendly Interface: Clean, intuitive design that works on desktop and mobile devices
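
A minimal Shiny skeleton for the features listed above might look like the following. It is a sketch only: predict_words is a hypothetical placeholder that returns fixed words where the n-gram model would eventually be called, and the input and output IDs (user_text, suggestions, suggest_1 to suggest_3) are names chosen for this example.

```r
library(shiny)

# Hypothetical placeholder predictor; a real app would query the n-gram model here
predict_words <- function(text) {
  if (!nzchar(trimws(text))) return(character())
  c("the", "and", "to")
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Type some text:", value = ""),
  uiOutput("suggestions")
)

server <- function(input, output, session) {
  # Recompute suggestions whenever the input text changes
  predictions <- reactive(predict_words(input$user_text))

  # Render each suggestion as a button
  output$suggestions <- renderUI({
    words <- predictions()
    if (length(words) == 0) {
      return(helpText("Start typing to see suggestions."))
    }
    tagList(lapply(seq_along(words), function(i) {
      actionButton(paste0("suggest_", i), words[i])
    }))
  })

  # Clicking a suggestion appends that word to the input text
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("suggest_", i)]], {
      words <- predictions()
      if (i <= length(words)) {
        updateTextInput(
          session, "user_text",
          value = paste(trimws(input$user_text), words[i], "")
        )
      }
    })
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```
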
This exploratory analysis demonstrates that we have successfully loaded and examined the training data. The three text sources provide a diverse corpus suitable for building a text prediction model. The data shows expected patterns (longer blog posts, shorter tweets) and contains sufficient volume for training.
The next phase will focus on implementing the n-gram prediction algorithm and building the Shiny application to provide an interactive text prediction experience.
Report generated on 2025-11-29